Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a data checking logic description system and a data checking computing system.
The scheme of the invention is as follows:
In a first aspect, a data checking logic description system is provided, the system comprising:
The data source analysis module is used for providing online real-time data source definition, validity checking and related analysis functions based on the conceptual division of the three elements of data checking, and for processing and extracting metadata into data source knowledge for use by subsequent modules;
and the compiling module is used for providing online real-time editing, checking and analysis of the checking logic description. During compilation, the compiling module receives the data source knowledge provided by the data source analysis module for joint analysis, performs logic optimization, generates a final logic representation, and then submits the final logic representation to a specific data checking computing system to complete the computation and output the checking results, thereby realizing loose coupling of the three elements.
Preferably, the data source analysis module comprises an SQL parser; ANSI SQL edited and submitted online is passed to the SQL parser for parsing, and a legal data source definition is generated once parsing succeeds;
the SQL parser is responsible for checking the submitted SQL and fulfils the following four classes of functions:
security checking, namely rejecting any non-select statement at the syntax level, so that no SQL statement can modify or delete existing data of the system, ensuring data security;
usability checking, namely checking whether the submitted statement conforms to the grammar rules of the select clause and can serve as a compliant description of a source definition;
SQL normalization, namely rewriting the submitted statement into normalized SQL that has the same semantics and is easier for subsequent checking rules to reference;
and data source knowledge extraction, namely extracting data source knowledge, including reference names, reference types and namespace information, from the normalized statement, providing metadata information support for the subsequent compiling module.
Preferably, the SQL parser further comprises a metadata extraction unit which interacts online with the specific database product at the back end to acquire all metadata related to the SQL statement, thereby supporting semantic analysis and making the security and usability checks more complete; the SQL parser also processes and extracts this metadata into data source knowledge for use by subsequent modules and by the specific data checking computing system.
Preferably, on the basis of performing its basic functions, the metadata extraction unit shields the technical differences among back-end database products, unifies the representation of relational metadata, and provides a consistent external interface.
Preferably, the compiling module includes:
the lexical analysis sub-module starts from the lexical level of the checking logic description, performs word segmentation on the checking rule definition, and converts the character stream into a token stream for analysis by the syntax analysis sub-module;
the syntax analysis sub-module, after receiving the token stream provided by the lexical analysis sub-module, constructs a concrete syntax tree based on an LL(x) algorithm according to the grammar rules of the checking logic description, then generates an abstract syntax tree from it, and finally submits the abstract syntax tree to the semantic analysis sub-module;
the semantic analysis sub-module is used for receiving the data source knowledge provided by the data source analysis module, performing joint analysis, and carrying out type and logic checks on the name references mentioned in the checking rules;
the optimization and logic generation sub-module performs logic optimization after the semantic analysis sub-module has passed; in each optimization round the optimizer receives an intermediate representation and computes a logically equivalent but better optimized intermediate representation;
after the optimizer completes multiple optimization rounds, an optimal intermediate representation is produced, intermediate process results are removed, the data structure is trimmed, and a final logic representation is generated, which serves as a general interface to be matched with a specific data checking computing system.
In a second aspect, there is provided a data checking computing system, the system comprising:
The data input module is used for extracting checking data according to the data source related information and submitting the checking data to the computing core module;
wherein the compiler and matching tool chain in the data checking logic description system ensure the security of the data source definition, and the SQL parser performs a comprehensive check on the data source definition;
the code generation module is used for converting the checking rules related to the checking data into executable code, using the final logic representation and the data source knowledge output by the data checking logic description system, and submitting the executable code to the computing core module;
the computing core module is used for receiving the checking data and the executable code compiled from the checking rules, executing the computation and obtaining the results;
and the result output module is used for receiving the checking results given by the computing core module, adapting to various data stores, and completing the persistence of the results.
Preferably, the data input module adopts an independent data structure to describe semi-structured data source information, so as to realize extraction of semi-structured checking data.
Preferably, a stream data extraction unit is added to the data input module and a stream data write-back unit is added to the result output module, so as to realize stream data source adaptation.
Preferably, the code generation module generates executable code related to a specific platform and framework using the final logic representation and the data source knowledge output by the data checking logic description system;
and the code generation module is placed in the Spark driver, and the generated code is then distributed to the computing core modules of the executors.
Preferably, the core component of the computing core module is an interpreter equipped with an independent thread pool; according to the algebraic structure of the operator combination, concurrent parts are decomposed and dispatched to different threads for execution;
and an execution information collection unit is built into the interpreter, responsible for collecting information during the computation and feeding it back to other units for runtime optimization.
Compared with the prior art, the invention has the following beneficial effects:
1. The data checking logic description system adopts online compilation, realizing online real-time editing and feedback for both data source definitions and checking rule definitions; in the semantic analysis stage the data source analysis module connects online to the back-end database product to obtain the related metadata, improving computation safety;
2. Chinese programming is appropriately introduced into the checking logic description of the data checking logic description system, improving usability and broadening the user base;
3. the final logic representation generated after compiling the checking logic description is a data structure independent of any specific computing platform or framework; it stores the complete, typed checking logic description information, can be adapted to various computing platforms and frameworks, and provides support for generating checking execution code for a specific platform and framework;
4. based on the characteristic that a large number of checking rules depend on the same data source, the interpreter design of the computing core module of the data checking computing system realizes a unique aggregation algorithm, so that any number of checking rules of the same data source participate in the computation together, greatly relieving the IO pressure when a large number of checking rules execute with high concurrency;
5. the data checking computing system provides an independent and complete checking computation environment, realizes a thorough separation of computation and storage, and allows the checking computing capacity to be adjusted independently according to the workload;
6. the data checking computing system shields the differences among the various back-end database products, making it possible for multiple back ends to reuse and share the same checking system, avoiding repeated construction and saving technology resources; all homologous checking rules can be computed with only one IO pass, greatly promoting balanced utilization of system resources and improving computing efficiency.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present invention.
The invention provides a data checking logic description system, which comprises:
The data source analysis module is used for providing online real-time data source definition, validity checking and related analysis functions based on the conceptual division of the three elements of data checking, and for processing and extracting metadata into data source knowledge for use by subsequent modules.
The compiling module provides online real-time editing, checking and analysis of the checking logic description. During compilation it receives the data source knowledge provided by the data source analysis module, performs joint analysis and logic optimization, generates the final logic representation, and then delivers the final logic representation to a specific data checking computing system to complete the checking computation, realizing loose coupling of the three elements.
The data source analysis module comprises an SQL parser; ANSI SQL edited and submitted online is passed to the SQL parser for parsing, and a legal data source definition is generated once parsing succeeds.
The SQL parser is responsible for checking the submitted SQL and fulfils the following four classes of functions:
Security checking, namely rejecting the compilation of any non-select statement at the syntax level, ensuring that no SQL statement can modify or delete existing data of the system and guaranteeing data security.
Usability checking, namely checking whether the submitted statement conforms to the select clause grammar rules and is a compliant description of the source definition.
SQL normalization, namely rewriting the submitted statement into normalized SQL that has the same semantics and is easier for subsequent checking rules to reference.
Data source knowledge extraction, namely extracting data source knowledge, including reference names, reference types and namespace information, from the normalized statement and providing metadata information support for the subsequent modules.
The SQL parser further comprises a metadata extraction unit which interacts online with the specific database product at the back end to acquire all metadata related to the SQL statement, thereby supporting semantic analysis and making the security and usability checks more complete; the SQL parser also processes and extracts this metadata into data source knowledge for use by subsequent modules.
On the basis of performing its basic functions, the metadata extraction unit shields the technical differences among back-end database products, unifies the representation of relational metadata, and provides a consistent external interface.
The checking logic is defined by the checking logic description rather than by SQL; the compiling module comprises:
the lexical analysis sub-module starts from the lexical level of the checking logic description, performs word segmentation on the checking rule definition, and converts the character stream into a token stream for analysis by the syntax analysis sub-module;
the syntax analysis sub-module, after receiving the token stream provided by the lexical analysis sub-module, constructs a concrete syntax tree based on an LL(x) algorithm according to the grammar rules of the checking logic description, then generates an abstract syntax tree from it, and finally submits the abstract syntax tree to the semantic analysis sub-module;
the semantic analysis sub-module is used for receiving the data source knowledge provided by the data source analysis module, performing joint analysis, and carrying out type and logic checks on the name references mentioned in the checking rules.
The optimization and logic generation sub-module performs logic optimization after the semantic analysis sub-module has passed; in each optimization round the optimizer receives an intermediate representation and computes a logically equivalent but better optimized intermediate representation.
After the optimizer completes multiple optimization rounds, an optimal intermediate representation is produced, intermediate process results are removed, the data structure is trimmed, and a final logic representation is generated, which serves as a general interface to be matched with a specific data checking computing system.
The invention also provides a data checking computing system, which comprises:
The data input module is used for extracting the checking data according to the data source related information and submitting the checking data to the computing core module.
The compiler and matching tool chain in the data checking logic description system ensure the security of the data source definition, and the SQL parser performs a comprehensive check on the data source definition.
The data source knowledge generated in the data checking logic description system provides data extraction support for this module.
The code generation module is used for converting the checking rules related to the checking data into executable code, using the final logic representation and the data source knowledge output by the data checking logic description system, and submitting the executable code to the computing core module.
The computing core module is used for receiving the checking data and the executable code compiled from the checking rules, executing the computation and obtaining the results.
The result output module is used for receiving the checking results given by the computing core module, adapting to various data stores, and completing the persistence of the results.
The data input module adopts an independent data structure to describe semi-structured data source information, realizing extraction of semi-structured checking data.
A stream data extraction unit is added to the data input module and a stream data write-back unit is added to the result output module, realizing stream data source adaptation.
The code generation module generates executable code using the final logic representation and the data source knowledge output by the data checking logic description system.
The code generation module is placed in the Spark driver, and the generated code is then distributed to the computing core modules of the executors.
The core component of the computing core module is an interpreter equipped with an independent thread pool; concurrent parts are decomposed according to the algebraic structure of the operator combination and dispatched to different threads for execution.
An execution information collection unit is built into the interpreter, responsible for collecting information during the computation and feeding it back to other units for runtime optimization.
Next, the present invention will be described in more detail.
The embodiment of the invention comprises a checking logic description system and a data checking computing system, which respectively address the two most important problems in the data checking field, namely logic description and computation, as shown in fig. 1. The two systems are described in detail below.
The first part: the invention provides a data checking logic description system.
The description of data checking logic generally contains the following three types of information:
(1) The data source, i.e. the source of the checked data and its structure information; a data source definition may be given, for example, by SQL.
(2) The checking logic: on the basis of a given data source definition, multiple pieces of checking logic, also called checking rules, can be defined. Each rule can be regarded as the minimum unit of checking computation; typically a single data source may define tens to hundreds of checking rules to achieve comprehensive and fine-grained data quality checking. Taking a simple piece of checking logic as an example, based on the data source defined in (1) one may obtain: select * from A where DATE = '20211220' and col is not null. It can be seen that the core logic of the checking rule is mixed with the data source elements; the rule itself is in fact only part of the where clause, namely "col is not null".
(3) The checking output, which describes the structure and destination for saving the checking computation result. Taking the SQL statement in (2) as an example, adding a checking output definition yields: select count(*) from A where DATE = '20211220' and col is not null, i.e. counting the data that satisfy the condition.
The data source, the checking logic and the checking output are called the three elements of data checking. As the above example shows, traditional SQL mixes the three elements together and tightly couples them. The invention takes these three elements as the breakthrough point: through the design and implementation of a novel logic description, the three elements are completely decoupled, establishing advantages in multiple aspects and achieving notable effects.
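As a minimal illustration of this decoupling, the three elements can be held in separate structures so that many checking rules and outputs attach to one data source definition without rewriting the SQL. The sketch below is in Java; the class and field names are hypothetical and do not form part of the invention.

    import java.util.List;

    // Hypothetical sketch: the three checking elements kept as separate, loosely coupled records.
    public class ThreeElementsSketch {
        // Element 1: the data source, defined once by a select-only SQL statement.
        record DataSource(String name, String selectSql) {}
        // Element 2: a checking rule holds only the core logic, e.g. "col is not null".
        record CheckRule(String ruleName, String logic) {}
        // Element 3: the checking output describes how results are aggregated and persisted.
        record CheckOutput(String aggregation, String targetTable) {}

        public static void main(String[] args) {
            DataSource src = new DataSource("A_20211220",
                    "select * from A where DATE = '20211220'");
            // Tens or hundreds of rules may reference the same source without embedding it.
            List<CheckRule> rules = List.of(
                    new CheckRule("col_not_null", "col is not null"),
                    new CheckRule("amt_positive", "amt > 0"));
            CheckOutput out = new CheckOutput("count", "check_result");
            System.out.println(src + " | " + rules + " | " + out);
        }
    }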
In terms of the design of the logic description, the work mainly consists of constructing and selecting the grammar, semantics and features, so that a user can define data sources and checking logic safely, accurately, flexibly and conveniently.
In terms of the implementation of the logic description, a matching compiler and tool chain are realized from an engineering perspective on the basis of the design work, supporting common usage scenarios such as online real-time compilation, error feedback, syntax highlighting, logic checking and representation generation; the implementation mainly comprises the data source analysis module and the compiling module.
The design and the implementation are detailed below:
The design part: the logic description is dedicated to data checking. On the basis of providing the basic logic construction capability, the learning and usage threshold is reduced as much as possible, so basic Chinese support and some syntactic sugar in Chinese form are added at the grammar level. As shown in the rule definition part of fig. 2, branch keywords can be written either in Chinese form or in the form "if ... then ... otherwise"; comparison operators can be written in Chinese (such as "equals") or as mathematical symbols (such as "=="); in addition, some string computations can use Chinese forms of expression, and the grammar conforms to Chinese usage habits.
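At the lexical level, such syntactic sugar can be handled by mapping both spellings of a keyword to the same token. The Java sketch below illustrates the idea only; the concrete Chinese keywords (如果 / 那么 / 否则, 等于) and the token names are assumptions, not the exact keyword set of the invention.

    import java.util.Map;

    // Hypothetical sketch: Chinese and English keyword spellings map to the same token kinds.
    public class KeywordSugarSketch {
        enum TokenKind { IF, THEN, OTHERWISE, EQ, IDENT }

        // Assumed keyword table; the real grammar may use different spellings.
        static final Map<String, TokenKind> KEYWORDS = Map.of(
                "如果", TokenKind.IF, "if", TokenKind.IF,
                "那么", TokenKind.THEN, "then", TokenKind.THEN,
                "否则", TokenKind.OTHERWISE, "otherwise", TokenKind.OTHERWISE,
                "等于", TokenKind.EQ, "==", TokenKind.EQ);

        static TokenKind classify(String word) {
            return KEYWORDS.getOrDefault(word, TokenKind.IDENT);
        }

        public static void main(String[] args) {
            // Both spellings of a branch read as the same token stream.
            for (String w : new String[] {"如果", "if", "等于", "==", "col"}) {
                System.out.println(w + " -> " + classify(w));
            }
        }
    }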
The logic includes basic constructs such as branch judgment, function calls, logical operations and numerical operations. Data is strictly typed, with Boolean, integer (byte, short, int, long), floating point (float, double), string, date, timestamp, regular expression and other types. For example, the rule definition in fig. 2 constructs a date-typed literal '2020-01-01' with the 'date' keyword, and 'true' and 'false' are Boolean literals.
To improve computational determinism, and inspired by functional programming, all grammatical structures are expressions; there are no statements, and evaluation involves no side effects. Taking Java as a contrast, "if ... else ..." is a statement in which side effects may occur; in the logic description design of the invention, "if ... then ... otherwise ..." is an expression without side effects whose evaluation result is the result of one of its branches. This requires the results of all branches to have the same type; when the types differ, an error is reported. For the check rule definition in fig. 2, the final evaluation result equals the evaluation result of the branch expression.
To increase the safety of the computation, the logic description is subjected to strict strong type checking; when the type requirements of a computation are not met, an error is fed back quickly during compilation.
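The following Java sketch illustrates how an expression-only, strongly typed design can reject, at compile time, a branch whose arms have different types; the expression types and names are illustrative assumptions and are not the internal representation of the invention.

    // Hypothetical sketch of an expression-only, strongly typed check: every construct,
    // including the branch, is an expression that must have a single well-defined type.
    public class TypedExpressionSketch {
        enum Type { BOOLEAN, LONG, STRING }

        sealed interface Expr permits Lit, If {}
        record Lit(Type type, Object value) implements Expr {}
        record If(Expr cond, Expr thenArm, Expr elseArm) implements Expr {}

        static Type typeOf(Expr e) {
            if (e instanceof Lit lit) return lit.type();
            If branch = (If) e;
            if (typeOf(branch.cond()) != Type.BOOLEAN)
                throw new IllegalStateException("condition must be Boolean");
            Type t1 = typeOf(branch.thenArm());
            Type t2 = typeOf(branch.elseArm());
            if (t1 != t2)  // both arms must evaluate to the same type
                throw new IllegalStateException("branch arms differ: " + t1 + " vs " + t2);
            return t1;
        }

        public static void main(String[] args) {
            Expr ok = new If(new Lit(Type.BOOLEAN, true),
                             new Lit(Type.LONG, 1L), new Lit(Type.LONG, 2L));
            System.out.println(typeOf(ok));          // LONG
            Expr bad = new If(new Lit(Type.BOOLEAN, true),
                              new Lit(Type.LONG, 5L), new Lit(Type.STRING, "5"));
            try {
                typeOf(bad);
            } catch (IllegalStateException e) {
                System.out.println("rejected: " + e.getMessage());  // mirrors the error in fig. 12
            }
        }
    }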
Finally, all functions used in the logic description come from a standard library that can be continuously extended, which facilitates efficient writing of checking logic and avoids lengthy or repeated logic definitions, as shown in fig. 3.
Overall, the logic description design follows the principles of safety, flexible expression, simple writing and easy understanding.
The implementation part, as shown in fig. 4, is detailed as follows:
The data source analysis module is responsible for completing data source definition based on the division of the three checking elements, and provides online data source definition, real-time validity checking and related analysis functions. To preserve generality, flexibility and security, the grammar of the data source definition is based on a subset of the ANSI SQL language and contains only the select query clause. The module contains an SQL parser that supports the core functions; the workflow is shown in fig. 5: SQL is edited and submitted online, parsed by the SQL parser, and a legal data source definition is generated after it passes. The SQL parser implements a complete SQL compiler front end, is responsible for checking the submitted SQL, and mainly fulfils the following four functions:
(1) Security checking, as shown in fig. 6 and fig. 7: compilation of any non-select statement is rejected at the syntax level, ensuring that SQL statements cannot modify or delete existing data of the system and guaranteeing data security;
(2) Usability checking, as shown in fig. 8: checking whether the submitted statement conforms to the select clause grammar rules and can serve as a legal description of a data source definition;
(3) SQL normalization: rewriting the submitted statement into normalized SQL with the same semantics that is easier for subsequent checking rules to reference; for example, as shown in fig. 9, the asterisk in the select statement is fully expanded, which facilitates name references and related checks when the compiling module checks the rule definitions;
(4) Data source knowledge extraction: after step (3) is completed, data source knowledge including reference names, reference types, namespaces and the like is extracted from the statement. For example, for the data source definition "select a, b, c from tables", the SQL parser extracts data source knowledge comprising the references a, b, c and tables together with their types. This information not only provides metadata support for the subsequent compiling module, but also provides runtime information support for the checking computation process. Notably, the data source knowledge retains all metadata following a reference-preserving principle; a reference-based representation has many benefits, enabling the module to support metadata for structured and semi-structured data as well as for streaming data. As a general data structure, the data source knowledge provides a unified data interface.
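A greatly simplified Java sketch of the security check and knowledge extraction is given below. The string-level guard stands in for the full grammar-level check performed by the real SQL front end, and the record fields and table name are illustrative assumptions only.

    import java.util.List;
    import java.util.Locale;

    // Hypothetical sketch: reject non-select statements and carry extracted references
    // in a small data-source-knowledge structure. A real implementation works on the
    // parsed syntax tree, not on raw strings.
    public class SqlGuardSketch {
        record Reference(String name, String kind) {}                 // e.g. column / table
        record DataSourceKnowledge(String namespace, List<Reference> references) {}

        static void securityCheck(String sql) {
            String head = sql.trim().toLowerCase(Locale.ROOT);
            if (!head.startsWith("select"))
                throw new IllegalArgumentException("only select statements are allowed");
        }

        public static void main(String[] args) {
            String sql = "select a, b, c from tables";
            securityCheck(sql);                                        // passes
            DataSourceKnowledge knowledge = new DataSourceKnowledge("default",
                    List.of(new Reference("a", "column"), new Reference("b", "column"),
                            new Reference("c", "column"), new Reference("tables", "table")));
            System.out.println(knowledge);
            try {
                securityCheck("delete from tables");                   // non-select: rejected
            } catch (IllegalArgumentException e) {
                System.out.println("rejected: " + e.getMessage());
            }
        }
    }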
The internal structure and workflow of the SQL parser are shown in fig. 10. After the SQL statement has undergone basic lexical and syntax analysis, the metadata extraction unit interacts online with the specific back-end database product during the semantic analysis stage to obtain all metadata related to the SQL statement, thereby supporting stricter semantic analysis and further improving the security and usability checks; at the same time, the SQL parser processes and extracts this metadata into data source knowledge for the subsequent compiling module.
It is worth noting that the SQL parser has no SQL compiler back end and cannot actually execute SQL. In addition, as the connection point to the various database products, the metadata extraction unit shields the technical differences among back-end database products on the basis of performing its basic functions, unifies the representation of relational metadata, and provides a consistent external interface.
This module completes the definition and analysis work related to the data source and provides metadata support for the subsequent compiling module, so that the checking rules obtain clear computation semantics in a specific context. The data source knowledge generated by the SQL parser is the bridge connecting the data source and the checking logic, and also plays the role of separating them.
The compiling module focuses on compiling, analyzing, optimizing and generating the final representation of the checking logic description. The compilation flow is shown in fig. 11: the checking rules are analyzed and computed in turn by the lexical, syntax, semantic, optimization and logic generation sub-modules, finally producing a logic intermediate representation that is independent of any specific computing platform and database product. An error at any stage interrupts compilation, and detailed compilation error information is fed back promptly so that the user can quickly correct the logic. For example, as shown in fig. 12, the right-hand expression is mistakenly written as the string '5', the types on the two sides of the inequality operator are inconsistent, and compilation fails with an error.
Each sub-module is described below:
The lexical analysis sub-module starts from the lexical level of the checking logic, performs word segmentation on the checking rule, and converts the character stream into a token stream for analysis by the syntax analysis sub-module.
The syntax analysis sub-module, after receiving the token stream provided by the lexical analysis sub-module, constructs a concrete syntax tree (CST) based on an LL(x) algorithm according to the grammar rules of the checking logic, then generates an abstract syntax tree (AST), and finally submits the abstract syntax tree to the semantic analysis sub-module.
The semantic analysis sub-module receives the data source knowledge provided by the data source analysis module and checks and analyzes the semantics of the checking logic. In order to expose more logic errors as early as possible, strong type checking guarantees type-based computation safety (see figs. 12, 13 and 14); the check also ensures that the evaluation result type of the checking logic is always Boolean so as to embody a clear judgment on data quality. In addition, the analysis allows some rule-based type conversions, improving language usability on the premise of safety.
As shown in fig. 15, the checks and analysis also cover the name references mentioned in the checking rules, logical consistency constraints and so on, to fully ensure that the checking rule description is complete and computable. After all analysis is completed, an intermediate representation with type information (Typed IR), hereafter simply the intermediate representation, is generated and submitted to the subsequent optimization and logic generation sub-module.
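The two semantic constraints described above, namely that every name reference must resolve against the data source knowledge and that the rule as a whole must evaluate to a Boolean, can be pictured with the following Java sketch; the column names, types and method names are illustrative assumptions, not the internal Typed IR of the invention.

    import java.util.Map;
    import java.util.Set;

    // Hypothetical sketch of two semantic checks performed before the Typed IR is emitted.
    public class SemanticCheckSketch {
        // Column types supplied by the data source knowledge (illustrative content).
        static final Map<String, String> SOURCE_COLUMNS =
                Map.of("col", "string", "amt", "long");

        static void checkReferences(Set<String> referencedNames) {
            for (String name : referencedNames)
                if (!SOURCE_COLUMNS.containsKey(name))
                    throw new IllegalStateException("unknown reference: " + name);
        }

        static void checkResultType(String inferredType) {
            // The evaluation result of a checking rule must always be Boolean.
            if (!"boolean".equals(inferredType))
                throw new IllegalStateException("rule must evaluate to Boolean, got " + inferredType);
        }

        public static void main(String[] args) {
            checkReferences(Set.of("col", "amt"));   // both resolve against the knowledge
            checkResultType("boolean");              // ok
            try {
                checkReferences(Set.of("colx"));     // fails: name not in the data source
            } catch (IllegalStateException e) {
                System.out.println("rejected: " + e.getMessage());
            }
        }
    }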
The optimization and logic generation sub-module: once a rule passes the semantic analysis sub-module, it has already passed the various checks. The main purpose of this module is logic optimization; as shown in fig. 11, in each optimization round the optimizer receives an intermediate representation and computes a logically equivalent but better optimized intermediate representation. For example, an important optimization is to evaluate in advance all computations that involve only literals: 1+1 can be folded directly into 2, so that it need not be repeated inefficiently a large number of times during the actual execution of the checking rule.
After the optimizer completes multiple optimization rounds, an optimal intermediate representation is produced; some process results are removed, the data structure is trimmed, and a final logic representation is generated. This logic representation is a data structure carrying data types and auxiliary computation information, contains all the information required for computation, is independent of any specific database product, and can serve as a general interface to be matched with a specific computing engine, framework or platform, as detailed in the second part, the data checking computing system.
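A minimal constant-folding pass of the kind described above might look as follows. This is a sketch over an assumed arithmetic intermediate representation, not the optimizer of the invention.

    // Hypothetical sketch: one optimization round that folds literal-only arithmetic,
    // e.g. Add(1, 1) becomes Lit(2), so it is not re-evaluated for every checked row.
    public class ConstantFoldingSketch {
        sealed interface Node permits Lit, Add {}
        record Lit(long value) implements Node {}
        record Add(Node left, Node right) implements Node {}

        static Node fold(Node n) {
            if (n instanceof Add add) {
                Node l = fold(add.left());
                Node r = fold(add.right());
                if (l instanceof Lit a && r instanceof Lit b)
                    return new Lit(a.value() + b.value());   // literal-only: fold in advance
                return new Add(l, r);
            }
            return n;                                         // literals are already optimal
        }

        public static void main(String[] args) {
            Node before = new Add(new Lit(1), new Lit(1));
            System.out.println(fold(before));                 // Lit[value=2]
        }
    }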
At this point the data source definition and the checking rule definition are completely decoupled. The checking output is tightly connected to the data persistence layer and can therefore be completed by the specific computing engine, framework or platform; it is detailed in the result output module of the second part, the data checking computing system.
With appropriate modifications, the system can be extended as follows:
(1) Adding analysis support for semi-structured data: the module adds JSON for describing the structure and data types of semi-structured data and generates data source knowledge through a JSON parser. Because the metadata is kept following the reference-preserving principle, the interface layer remains stable and the compiling module can be connected seamlessly, so the whole system can analyze checking rules based on semi-structured data sources, providing information support for the corresponding extension of the data checking computing system.
(2) Adding analysis support for streaming data: similar to the idea in (1), on the basis of keeping the reference principle, a DSL (for example YAML) describing the streaming data structure and types is added and a parser is written to generate standard data source knowledge, which can then be connected to the compiling module, so that the system can analyze checking rules based on streaming data sources.
(3) Continuously optimizing the final logic representation: in the optimization and logic generation sub-module the optimizer optimizes the intermediate representation progressively; as the system functions expand, more layers of analysis and optimization can be added to keep improving the final logic representation, increase the computational expressive power, and help the checking computation stage generate higher-performance execution code.
Overall, the modules of the system cooperate closely to achieve complete decoupling of the three elements of data source, checking logic and checking output. With online real-time compilation and feedback, users can quickly complete data source and checking logic definitions, notably simplifying development, testing and deployment. Chinese is introduced appropriately so that Chinese and conventional notations complement each other, improving the readability and maintainability of the checking logic, lowering the usage threshold, and reducing cost while improving efficiency. A large number of strict compile-time checks on grammar and semantics, together with strong type checking, significantly improve the safety of data and computation. Because data sources are defined with an ANSI SQL subset and the checking logic description depends only on the standard data source knowledge interface provided by the data source analysis module, the problem of adapting the checking logic to various database back ends is thoroughly solved, achieving the effect of writing once and running many times.
The second part: the invention also provides a data checking computing system.
The data checking computing system aims to complete large-scale checking computation based on the work of the data checking logic description system of the first part. The core of the system is a checking engine, which provides a complete runtime environment for checking computation. The basic modules and structure are shown in fig. 16: the computing core module is responsible for the actual computation, the required target checking data and checking computation logic are provided by the data input module and the code generation module respectively, and after the computation is completed, the result output module completes the persistence of the checking results.
The modules are described in detail below:
The data input module extracts the target checking data according to the data source related information and submits it to the computing core module. The data source related information comes from the data source knowledge and the normalized SQL provided by the data checking logic description system (figs. 9 and 10); with this information the module has full awareness of the structure and origin of the data to be checked, and can provide data to the computing core module stably, securely and continuously.
The code generation module is responsible for converting the checking rules related to the target checking data into executable code in batches and submitting it to the computing core module for execution. The internal structure and workflow of the module are shown in fig. 17: the final logic representation and the data source knowledge are produced by the data checking logic description system; after receiving them, the translator traverses the final logic representation, combines it with the data source knowledge and generates efficient executable code. Inspired by the lambda calculus, the executable code is composed of a number of orthogonal basic operators; once the checking logic is compiled, it is expressed as an algebraic combination of these basic operators, and this operator set can be regarded as the "assembly language" of the whole checking engine.
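The idea that compiled checking logic becomes an algebraic combination of a small set of orthogonal operators can be sketched in Java as follows; the operator set, names and row representation are assumptions for illustration, not the engine's actual operators.

    import java.util.Map;
    import java.util.function.Function;

    // Hypothetical sketch: a checking rule compiled into a combination of basic operators.
    // Each operator maps a row (here a simple map) to a value; composition yields the rule.
    public class OperatorCombinationSketch {
        interface Op extends Function<Map<String, Object>, Object> {}

        static Op field(String name)      { return row -> row.get(name); }            // reference
        static Op notNull(Op input)       { return row -> input.apply(row) != null; } // predicate
        static Op and(Op left, Op right)  {                                            // conjunction
            return row -> (Boolean) left.apply(row) && (Boolean) right.apply(row);
        }

        public static void main(String[] args) {
            // "name is not null and col is not null" as an algebraic operator combination.
            Op rule = and(notNull(field("name")), notNull(field("col")));
            Map<String, Object> row = Map.of("name", "x", "col", 1);
            System.out.println(rule.apply(row));   // true
        }
    }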
The computing core module receives the checking data and the executable code compiled from the checking rules and executes the computation to obtain the results. The core component of the module is an interpreter for these basic operators. To improve concurrency and strengthen runtime control, the interpreter is equipped with an independent thread pool; according to the algebraic structure of the operator combination, concurrent parts are decomposed and dispatched to different threads for execution. In addition, an execution information collection unit (Metrics Unit) is built into the interpreter, responsible for collecting information during the computation and feeding it back to other units for runtime optimization. For example, after execution information of the branch logic is collected, the hit rate of each branch is counted periodically; when the hit rate of a branch reaches a threshold, an optimization can be triggered so that the branch and the predicate logic execute concurrently, further increasing computation concurrency.
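A small Java sketch of this runtime-feedback idea follows; the counters, threshold and class name are hypothetical and stand in for the actual Metrics Unit. It simply counts branch hits and reports when one branch dominates enough to justify a more concurrent, speculative execution plan.

    import java.util.concurrent.atomic.LongAdder;

    // Hypothetical sketch of an execution-information unit: per-branch hit counters
    // whose ratio can trigger a runtime optimization once a threshold is reached.
    public class BranchMetricsSketch {
        static final LongAdder thenHits = new LongAdder();
        static final LongAdder otherwiseHits = new LongAdder();
        static final double THRESHOLD = 0.9;          // assumed trigger ratio

        static void record(boolean tookThenBranch) {
            (tookThenBranch ? thenHits : otherwiseHits).increment();
        }

        static boolean shouldSpeculate() {
            long t = thenHits.sum(), o = otherwiseHits.sum();
            long total = t + o;
            // When one branch is hit often enough, evaluate it concurrently with the predicate.
            return total > 1000 && (double) Math.max(t, o) / total >= THRESHOLD;
        }

        public static void main(String[] args) {
            for (int i = 0; i < 2000; i++) record(i % 20 != 0);   // roughly 95% then-branch
            System.out.println(shouldSpeculate());                 // true
        }
    }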
In large-scale data checking scenarios, the checking of one data source typically involves the computation of tens or even hundreds of checking rules. Clearly, making full use of the multi-core nature of modern CPUs and combining the computation of all checking rules that depend on the same data source saves IO resources and improves computing efficiency. Based on this, the module implements an aggregation algorithm that completes the computation of any number of homologous checking rules with a single IO pass over the data source, while keeping the computation result of each rule independent so that the rules do not interfere with one another.
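The aggregation idea, scanning the shared data source once and feeding every homologous rule from that single pass while keeping each rule's result independent, can be sketched in Java as follows; the rows, rule predicates and counters are illustrative assumptions, not the engine's internal algorithm.

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.Predicate;

    // Hypothetical sketch: one pass over the data source evaluates any number of
    // checking rules that depend on it; each rule keeps an independent counter.
    public class SingleScanAggregationSketch {
        record Rule(String name, Predicate<Map<String, Object>> predicate) {}

        public static void main(String[] args) {
            List<Map<String, Object>> rows = List.of(                 // the shared data source
                    Map.of("col", "x", "amt", 10L),
                    Map.of("col", "",  "amt", -1L));
            List<Rule> rules = List.of(
                    new Rule("col_not_empty", r -> !"".equals(r.get("col"))),
                    new Rule("amt_positive",  r -> (Long) r.get("amt") > 0));

            Map<String, Long> violations = new LinkedHashMap<>();
            for (Map<String, Object> row : rows) {                    // single IO pass
                for (Rule rule : rules)                               // all homologous rules share it
                    if (!rule.predicate().test(row))
                        violations.merge(rule.name(), 1L, Long::sum);
            }
            System.out.println(violations);   // {col_not_empty=1, amt_positive=1}
        }
    }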
Notably, although the checking computation is interpreted at the level of the checking engine, at the JVM level all constructs, including the interpreter, are eventually compiled into JVM bytecode for execution, no differently from ordinary programs.
The result output module is the concrete realization of the checking output among the three elements of data checking; it is responsible for receiving the checking results given by the computing core module, adapting to various data stores and completing the persistence of the checking results.
The checking engine can run alone or be connected to a distributed computing framework, easily obtaining computing power and horizontal scalability matched to the data scale. As shown in fig. 18, in accordance with the characteristics of Spark distributed computing, the system embeds the checking engine sub-modules into the Spark runtime: the work of the code generation module is placed in the Spark driver (Spark Driver), and the generated executable code is then distributed to the computing core modules of the executors (Spark Executor). With only modest adaptation of the data input and result output modules, a distributed checking computation environment can quickly be formed that is functionally identical to the single-node checking engine but with greatly increased checking computing power.
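A hedged Java sketch of this embedding is given below. The Spark calls used (SparkSession, broadcast, mapPartitions) are part of the public Spark Java API, but the rule representation, its "evaluation" inside the partition and the source table are placeholders assumed for illustration, not the actual generated code of the invention.

    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.broadcast.Broadcast;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    import java.util.Iterator;
    import java.util.List;

    // Hypothetical sketch: rule representations are produced in the driver, broadcast,
    // and interpreted against each partition in the executors.
    public class SparkEmbeddingSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("check-engine").getOrCreate();
            JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

            // Assumes the checked table is registered in the Spark catalog.
            Dataset<Row> source = spark.sql("select * from A where DATE = '20211220'");
            // Assumed: the code generation module emits serializable rule programs in the driver.
            List<String> generatedRules = List.of("col is not null", "amt > 0");
            Broadcast<List<String>> rules = jsc.broadcast(generatedRules);

            long violations = source.javaRDD()
                    .mapPartitions((Iterator<Row> rows) -> {
                        long bad = 0;                                  // executor-side core module
                        while (rows.hasNext()) {
                            Row row = rows.next();
                            for (String rule : rules.value()) {
                                // Placeholder for the interpreter evaluating the rule on the row.
                                if (row.anyNull() && rule.contains("not null")) bad++;
                            }
                        }
                        return List.of(bad).iterator();
                    })
                    .reduce(Long::sum);
            System.out.println("violations: " + violations);
            spark.stop();
        }
    }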
In addition, since the modules of the engine have standard interfaces and are loosely coupled, and with the metadata support provided by the data checking logic description system, the following extensions are easily realized:
(1) Adding semi-structured data checking support: an independent semi-structured data extractor is added to the data input module, giving the computing engine the ability to perform checking computation on semi-structured data.
(2) Adding streaming data checking support: based on the support of the data checking logic description system for streaming data, a streaming data extraction unit is added to the data input module and a streaming data write-back unit is added to the result output module, realizing the data checking computation function for streaming data sources (such as Kafka).
Both kinds of extension benefit from the general computation characteristics of the computing core module. The interpreter inside the module interprets and executes the algebraic combination structure of the basic operators, and during execution it obtains the specific field data to be checked through references. For example, in the checking rule "name is not null", name is a reference, and the interpreter finally obtains the external field data to be checked from the data input module through that reference. This reference-based data acquisition method applies to structured and semi-structured data as well as to streaming data, and the interfaces remain stable and general, precisely because the principle of obtaining external data to be checked through references is followed throughout. Following this principle, even if data involved in the actual computation does not exist (as with semi-structured data) or a runtime error occurs, the interpreter handles the exception properly and treats it as a kind of checking result; therefore, the computing core module does not need to be changed.
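The reference-based access principle, including the treatment of missing fields or runtime errors as a kind of check result, can be illustrated with the following Java sketch; the accessor interface, verdict names and row contents are assumptions, not the interpreter's actual interfaces.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch: the interpreter never reads data directly; it resolves a
    // reference through an accessor, so structured, semi-structured and streaming
    // sources differ only in how the accessor is backed. Missing data becomes a result.
    public class ReferenceAccessSketch {
        enum Verdict { PASS, FAIL, ABNORMAL }

        interface FieldAccessor { Object resolve(String reference); }  // may return null or throw

        static Verdict check(FieldAccessor accessor, String reference) {
            try {
                Object value = accessor.resolve(reference);
                return value != null ? Verdict.PASS : Verdict.FAIL;    // "name is not null"
            } catch (RuntimeException e) {
                return Verdict.ABNORMAL;   // missing field or runtime error is itself a check result
            }
        }

        public static void main(String[] args) {
            Map<String, Object> row = new HashMap<>();
            row.put("name", "alice");
            row.put("col", null);
            FieldAccessor mapBacked = ref -> {
                if (!row.containsKey(ref)) throw new IllegalStateException("no field: " + ref);
                return row.get(ref);
            };
            System.out.println(check(mapBacked, "name"));     // PASS
            System.out.println(check(mapBacked, "col"));      // FAIL
            System.out.println(check(mapBacked, "missing"));  // ABNORMAL
        }
    }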
Overall, the data checking computing system provides an independent and complete checking computation environment, realizes a thorough separation of computation and storage so that the checking computing power can be adjusted independently according to the workload, and shields the differences among the various back-end database products so that multiple back ends can reuse and share the same checking system, avoiding repeated construction and saving technology resources; finally, by means of the aggregation algorithm, all homologous checking rules are computed with only one IO pass, greatly promoting balanced utilization of system resources and notably improving computing efficiency.
In recent years, financial supervision has imposed increasingly strict requirements on the quality of standardized supervisory data (EAST for short) submitted by large banks. EAST data is characterized by comprehensive business coverage, large volumes of detail data and tight reporting deadlines; to continuously improve EAST data quality, a safe and efficient mass data quality checking system needs to be built.
The EAST big data checking and analysis system of Xingjingbang is built on the data checking logic description system and the data checking computing system. The system holds more than 4,600 checking rules. In order to respond quickly to supervisory changes, the system provides real-time online editing, checking and submission of checking rules; rules take effect immediately without a release process, and recomputation with the new rule logic can start at once. The average daily volume of checked data is 1.3 TB with a checking time of 1 to 2 hours; at peak times 5 to 8 TB of data are checked in a single day with a computation time of 5 to 9 hours, and the checking time is basically linear in the workload.
According to preliminary tests in the same cluster environment, with the same data set and the same work task, and with the checking rules written in SQL as the baseline, the memory resources required by the checking process (measured in MB, i.e. megabyte, seconds) are only 8 to 14 percent of those required by SQL, and the computing resources (measured in vcore, i.e. virtual core, seconds) are 52 to 74 percent of those required by SQL. In addition, in the same environment, when the number of checking rules on the same data source is increased from 25 to 250, the computation time grows to only 1.9 times the original; under the same conditions, the time of SQL-based checking does not stay close to linear because it triggers an IO bottleneck.
Benefiting from the flexible result output module, the checking computing system can output statistical, sampling and other kinds of results, which makes it convenient for business staff both to quickly grasp the overall data quality and to analyze specific problems in detail. In addition, since the checking system is independent of any specific back-end database, each branch can quickly migrate and deploy it and connect it to its own data back end, rapidly building a branch-local personalized data checking system.
Those skilled in the art will appreciate that, in addition to being implemented as pure computer readable program code, the system and its devices, modules and units provided by the invention can be implemented entirely by logically programming the method steps, taking the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system and its devices, modules and units can be regarded as a hardware component; the devices, modules and units realizing the various functions can be regarded as structures within the hardware component, and they can also be regarded both as software modules implementing the method and as structures within the hardware component.
The foregoing describes specific embodiments of the present application. It is to be understood that the application is not limited to the particular embodiments described above, and that various changes or modifications may be made by those skilled in the art within the scope of the appended claims without affecting the spirit of the application. The embodiments of the application and the features of the embodiments may be combined with each other arbitrarily without conflict.