Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a data checking logic description system and a data checking computing system.
The scheme of the invention is as follows:
In a first aspect, a data checking logic description system is provided, the system comprising:
The data source analysis module is used for providing online real-time data source definition, validity checking and related analysis functions based on the conceptual division of the three elements of data checking, and for processing and extracting metadata into data source knowledge for use by subsequent modules;
and the compiling module is used for providing online real-time editing, checking and analysis of the checking logic description. During compilation, the compiling module receives the data source knowledge provided by the data source analysis module for joint analysis, performs logic optimization, generates a final logic representation, and then submits the final logic representation to a specific data checking computing system to complete the computation and output the checking results, thereby realizing loose coupling of the three elements.
Preferably, the data source analysis module comprises an SQL parser; ANSI SQL edited and submitted online is passed to the SQL parser for parsing, and a legal data source definition is generated once parsing succeeds;
the SQL parser is responsible for checking the submitted SQL and fulfils the following four classes of functions:
security checking, namely rejecting any non-select statement at the syntax level, so that no SQL statement can modify or delete existing data of the system, ensuring data security;
usability checking, namely checking whether the submitted statement conforms to the grammar rules of the select clause and can serve as a compliant description of a source definition;
SQL normalization, namely rewriting the submitted statement into normalized SQL that has the same semantics and is easier for subsequent checking rules to reference;
and data source knowledge extraction, namely extracting data source knowledge, including reference names, reference types and namespace information, from the normalized statement, providing metadata information support for the subsequent compiling module.
Preferably, the SQL parser further comprises a metadata extraction unit which interacts online with the specific database product at the back end to acquire all metadata related to the SQL statement, thereby supporting semantic analysis and making the security and usability checks more complete; the SQL parser also processes and extracts this metadata into data source knowledge for use by subsequent modules and by the specific data checking computing system.
Preferably, on the basis of performing its basic functions, the metadata extraction unit shields the technical differences among back-end database products, unifies the representation of relational metadata, and provides a consistent external interface.
Preferably, the compiling module includes:
the lexical analysis sub-module starts from the lexical level of the checking logic description, performs word segmentation on the checking rule definition, and converts the character stream into a token stream for analysis by the syntax analysis sub-module;
the syntax analysis sub-module, after receiving the token stream provided by the lexical analysis sub-module, constructs a concrete syntax tree based on an LL(x) algorithm according to the grammar rules of the checking logic description, then generates an abstract syntax tree from it, and finally submits the abstract syntax tree to the semantic analysis sub-module;
the semantic analysis sub-module is used for receiving the data source knowledge provided by the data source analysis module, performing joint analysis, and carrying out type and logic checks on the name references mentioned in the checking rules;
the optimization and logic generation sub-module performs logic optimization after the semantic analysis sub-module has passed; in each optimization round the optimizer receives an intermediate representation and computes a logically equivalent but better optimized intermediate representation;
after the optimizer completes multiple optimization rounds, an optimal intermediate representation is produced, intermediate process results are removed, the data structure is trimmed, and a final logic representation is generated, which serves as a general interface to be matched with a specific data checking computing system.
In a second aspect, there is provided a data checking computing system, the system comprising:
The data input module is used for extracting checking data according to the data source related information and submitting the checking data to the computing core module;
wherein the compiler and matching tool chain in the data checking logic description system ensure the security of the data source definition, and the SQL parser performs a comprehensive check on the data source definition;
the code generation module is used for converting the checking rules related to the checking data into executable code, using the final logic representation and the data source knowledge output by the data checking logic description system, and submitting the executable code to the computing core module;
the computing core module is used for receiving the checking data and the executable code compiled from the checking rules, executing the computation and obtaining the results;
and the result output module is used for receiving the checking results given by the computing core module, adapting to various data stores, and completing the persistence of the results.
Preferably, the data input module adopts an independent data structure to describe semi-structured data source information, so as to realize extraction of semi-structured checking data.
Preferably, a stream data extraction unit is added to the data input module and a stream data write-back unit is added to the result output module, so as to realize stream data source adaptation.
Preferably, the code generation module generates executable code related to a specific platform and framework using the final logic representation and the data source knowledge output by the data checking logic description system;
and the code generation module is placed in the Spark driver, and the generated code is then distributed to the computing core modules of the executors.
Preferably, the core component of the computing core module is an interpreter equipped with an independent thread pool; according to the algebraic structure of the operator combination, concurrent parts are decomposed and dispatched to different threads for execution;
and an execution information collection unit is built into the interpreter, responsible for collecting information during the computation and feeding it back to other units for runtime optimization.
Compared with the prior art, the invention has the following beneficial effects:
1. The data checking logic description system adopts online compilation, realizing online real-time editing and feedback for both data source definitions and checking rule definitions; in the semantic analysis stage the data source analysis module connects online to the back-end database product to obtain the related metadata, improving computation safety;
2. Chinese programming is appropriately introduced into the checking logic description of the data checking logic description system, improving usability and broadening the user base;
3. the final logic representation generated after compiling the checking logic description is a data structure independent of any specific computing platform or framework; it stores the complete, typed checking logic description information, can be adapted to various computing platforms and frameworks, and provides support for generating checking execution code for a specific platform and framework;
4. based on the characteristic that a large number of checking rules depend on the same data source, the interpreter design of the computing core module of the data checking computing system realizes a unique aggregation algorithm, so that any number of checking rules of the same data source participate in the computation together, greatly relieving the IO pressure when a large number of checking rules execute with high concurrency;
5. the data checking computing system provides an independent and complete checking computation environment, realizes a thorough separation of computation and storage, and allows the checking computing capacity to be adjusted independently according to the workload;
6. the data checking computing system shields the differences among the various back-end database products, making it possible for multiple back ends to reuse and share the same checking system, avoiding repeated construction and saving technology resources; all homologous checking rules can be computed with only one IO pass, greatly promoting balanced utilization of system resources and improving computing efficiency.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present invention.
The invention provides a data checking logic description system, which comprises:
The data source analysis module is used for providing online real-time data source definition, validity checking and related analysis functions based on the conceptual division of the three elements of data checking, and for processing and extracting metadata into data source knowledge for use by subsequent modules.
The compiling module provides online real-time editing, checking and analysis of the checking logic description. During compilation it receives the data source knowledge provided by the data source analysis module, performs joint analysis and logic optimization, generates the final logic representation, and then delivers the final logic representation to a specific data checking computing system to complete the checking computation, realizing loose coupling of the three elements.
The data source analysis module comprises an SQL parser; ANSI SQL edited and submitted online is passed to the SQL parser for parsing, and a legal data source definition is generated once parsing succeeds.
The SQL parser is responsible for checking the submitted SQL and fulfils the following four classes of functions:
Security checking, namely rejecting the compilation of any non-select statement at the syntax level, ensuring that no SQL statement can modify or delete existing data of the system and guaranteeing data security.
Usability checking, namely checking whether the submitted statement conforms to the select clause grammar rules and is a compliant description of the source definition.
SQL normalization, namely rewriting the submitted statement into normalized SQL that has the same semantics and is easier for subsequent checking rules to reference.
Data source knowledge extraction, namely extracting data source knowledge, including reference names, reference types and namespace information, from the normalized statement and providing metadata information support for the subsequent modules.
The SQL parser further comprises a metadata extraction unit which interacts online with the specific database product at the back end to acquire all metadata related to the SQL statement, thereby supporting semantic analysis and making the security and usability checks more complete; the SQL parser also processes and extracts this metadata into data source knowledge for use by subsequent modules.
On the basis of performing its basic functions, the metadata extraction unit shields the technical differences among back-end database products, unifies the representation of relational metadata, and provides a consistent external interface.
The checking logic is defined by the checking logic description rather than by SQL; the compiling module comprises:
the lexical analysis sub-module starts from the lexical level of the checking logic description, performs word segmentation on the checking rule definition, and converts the character stream into a token stream for analysis by the syntax analysis sub-module;
the syntax analysis sub-module, after receiving the token stream provided by the lexical analysis sub-module, constructs a concrete syntax tree based on an LL(x) algorithm according to the grammar rules of the checking logic description, then generates an abstract syntax tree from it, and finally submits the abstract syntax tree to the semantic analysis sub-module;
the semantic analysis sub-module is used for receiving the data source knowledge provided by the data source analysis module, performing joint analysis, and carrying out type and logic checks on the name references mentioned in the checking rules.
The optimization and logic generation sub-module performs logic optimization after the semantic analysis sub-module has passed; in each optimization round the optimizer receives an intermediate representation and computes a logically equivalent but better optimized intermediate representation.
After the optimizer completes multiple optimization rounds, an optimal intermediate representation is produced, intermediate process results are removed, the data structure is trimmed, and a final logic representation is generated, which serves as a general interface to be matched with a specific data checking computing system.
The invention also provides a data checking computing system, which comprises:
The data input module is used for extracting the checking data according to the data source related information and submitting the checking data to the computing core module.
The compiler and matching tool chain in the data checking logic description system ensure the security of the data source definition, and the SQL parser performs a comprehensive check on the data source definition.
The data source knowledge generated in the data checking logic description system provides data extraction support for this module.
The code generation module is used for converting the checking rules related to the checking data into executable code, using the final logic representation and the data source knowledge output by the data checking logic description system, and submitting the executable code to the computing core module.
The computing core module is used for receiving the checking data and the executable code compiled from the checking rules, executing the computation and obtaining the results.
The result output module is used for receiving the checking results given by the computing core module, adapting to various data stores, and completing the persistence of the results.
The data input module adopts an independent data structure to describe semi-structured data source information, realizing extraction of semi-structured checking data.
A stream data extraction unit is added to the data input module and a stream data write-back unit is added to the result output module, realizing stream data source adaptation.
The code generation module generates executable code using the final logic representation and the data source knowledge output by the data checking logic description system.
The code generation module is placed in the Spark driver, and the generated code is then distributed to the computing core modules of the executors.
The core component of the computing core module is an interpreter equipped with an independent thread pool; concurrent parts are decomposed according to the algebraic structure of the operator combination and dispatched to different threads for execution.
An execution information collection unit is built into the interpreter, responsible for collecting information during the computation and feeding it back to other units for runtime optimization.
Next, the present invention will be described in more detail.
The embodiment of the invention comprises a checking logic description system and a data checking computing system, which respectively address the two most important problems in the data checking field, namely logic description and computation, as shown in fig. 1. The two systems are described in detail below.
The first part: the invention provides a data checking logic description system.
The description of data checking logic generally contains the following three types of information:
(1) The data source, i.e. the source of the checked data and its structure information; a data source definition may be given, for example, by SQL.
(2) The checking logic: on the basis of a given data source definition, multiple pieces of checking logic, also called checking rules, can be defined. Each rule can be regarded as the minimum unit of checking computation; typically a single data source may define tens to hundreds of checking rules to achieve comprehensive and fine-grained data quality checking. Taking a simple piece of checking logic as an example, based on the data source defined in (1) one may obtain: select * from A where DATE = '20211220' and col is not null. It can be seen that the core logic of the checking rule is mixed with the data source elements; the rule itself is in fact only part of the where clause, namely "col is not null".
(3) The checking output, which describes the structure and destination for saving the checking computation result. Taking the SQL statement in (2) as an example, adding a checking output definition yields: select count(*) from A where DATE = '20211220' and col is not null, i.e. counting the data that satisfy the condition.
The data source, the checking logic and the checking output are called the three elements of data checking. As the above example shows, traditional SQL mixes the three elements together and tightly couples them. The invention takes these three elements as the breakthrough point: through the design and implementation of a novel logic description, the three elements are completely decoupled, establishing advantages in multiple aspects and achieving notable effects.
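As a minimal illustration of this decoupling, the three elements can be held in separate structures so that many checking rules and outputs attach to one data source definition without rewriting the SQL. The sketch below is in Java; the class and field names are hypothetical and do not form part of the invention.

    import java.util.List;

    // Hypothetical sketch: the three checking elements kept as separate, loosely coupled records.
    public class ThreeElementsSketch {
        // Element 1: the data source, defined once by a select-only SQL statement.
        record DataSource(String name, String selectSql) {}
        // Element 2: a checking rule holds only the core logic, e.g. "col is not null".
        record CheckRule(String ruleName, String logic) {}
        // Element 3: the checking output describes how results are aggregated and persisted.
        record CheckOutput(String aggregation, String targetTable) {}

        public static void main(String[] args) {
            DataSource src = new DataSource("A_20211220",
                    "select * from A where DATE = '20211220'");
            // Tens or hundreds of rules may reference the same source without embedding it.
            List<CheckRule> rules = List.of(
                    new CheckRule("col_not_null", "col is not null"),
                    new CheckRule("amt_positive", "amt > 0"));
            CheckOutput out = new CheckOutput("count", "check_result");
            System.out.println(src + " | " + rules + " | " + out);
        }
    }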
In terms of the design of the logic description, the work mainly consists of constructing and selecting the grammar, semantics and features, so that a user can define data sources and checking logic safely, accurately, flexibly and conveniently.
In terms of the implementation of the logic description, a matching compiler and tool chain are realized from an engineering perspective on the basis of the design work, supporting common usage scenarios such as online real-time compilation, error feedback, syntax highlighting, logic checking and representation generation; the implementation mainly comprises the data source analysis module and the compiling module.
The design and the implementation are detailed below:
The design part: the logic description is dedicated to data checking. On the basis of providing the basic logic construction capability, the learning and usage threshold is reduced as much as possible, so basic Chinese support and some syntactic sugar in Chinese form are added at the grammar level. As shown in the rule definition part of fig. 2, branch keywords can be written either in Chinese form or in the form "if ... then ... otherwise"; comparison operators can be written in Chinese (such as "equals") or as mathematical symbols (such as "=="); in addition, some string computations can use Chinese forms of expression, and the grammar conforms to Chinese usage habits.
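At the lexical level, such syntactic sugar can be handled by mapping both spellings of a keyword to the same token. The Java sketch below illustrates the idea only; the concrete Chinese keywords (如果 / 那么 / 否则, 等于) and the token names are assumptions, not the exact keyword set of the invention.

    import java.util.Map;

    // Hypothetical sketch: Chinese and English keyword spellings map to the same token kinds.
    public class KeywordSugarSketch {
        enum TokenKind { IF, THEN, OTHERWISE, EQ, IDENT }

        // Assumed keyword table; the real grammar may use different spellings.
        static final Map<String, TokenKind> KEYWORDS = Map.of(
                "如果", TokenKind.IF, "if", TokenKind.IF,
                "那么", TokenKind.THEN, "then", TokenKind.THEN,
                "否则", TokenKind.OTHERWISE, "otherwise", TokenKind.OTHERWISE,
                "等于", TokenKind.EQ, "==", TokenKind.EQ);

        static TokenKind classify(String word) {
            return KEYWORDS.getOrDefault(word, TokenKind.IDENT);
        }

        public static void main(String[] args) {
            // Both spellings of a branch read as the same token stream.
            for (String w : new String[] {"如果", "if", "等于", "==", "col"}) {
                System.out.println(w + " -> " + classify(w));
            }
        }
    }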
The logic includes basic constructs such as branch judgment, function calls, logical operations and numerical operations. Data is strictly typed, with Boolean, integer (byte, short, int, long), floating point (float, double), string, date, timestamp, regular expression and other types. For example, the rule definition in fig. 2 constructs a date-typed literal '2020-01-01' with the 'date' keyword, and 'true' and 'false' are Boolean literals.
To improve computational determinism, and inspired by functional programming, all grammatical structures are expressions; there are no statements, and evaluation involves no side effects. Taking Java as a contrast, "if ... else ..." is a statement in which side effects may occur; in the logic description design of the invention, "if ... then ... otherwise ..." is an expression without side effects whose evaluation result is the result of one of its branches. This requires the results of all branches to have the same type; when the types differ, an error is reported. For the check rule definition in fig. 2, the final evaluation result equals the evaluation result of the branch expression.
To increase the safety of the computation, the logic description is subjected to strict strong type checking; when the type requirements of a computation are not met, an error is fed back quickly during compilation.
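The following Java sketch illustrates how an expression-only, strongly typed design can reject, at compile time, a branch whose arms have different types; the expression types and names are illustrative assumptions and are not the internal representation of the invention.

    // Hypothetical sketch of an expression-only, strongly typed check: every construct,
    // including the branch, is an expression that must have a single well-defined type.
    public class TypedExpressionSketch {
        enum Type { BOOLEAN, LONG, STRING }

        sealed interface Expr permits Lit, If {}
        record Lit(Type type, Object value) implements Expr {}
        record If(Expr cond, Expr thenArm, Expr elseArm) implements Expr {}

        static Type typeOf(Expr e) {
            if (e instanceof Lit lit) return lit.type();
            If branch = (If) e;
            if (typeOf(branch.cond()) != Type.BOOLEAN)
                throw new IllegalStateException("condition must be Boolean");
            Type t1 = typeOf(branch.thenArm());
            Type t2 = typeOf(branch.elseArm());
            if (t1 != t2)  // both arms must evaluate to the same type
                throw new IllegalStateException("branch arms differ: " + t1 + " vs " + t2);
            return t1;
        }

        public static void main(String[] args) {
            Expr ok = new If(new Lit(Type.BOOLEAN, true),
                             new Lit(Type.LONG, 1L), new Lit(Type.LONG, 2L));
            System.out.println(typeOf(ok));          // LONG
            Expr bad = new If(new Lit(Type.BOOLEAN, true),
                              new Lit(Type.LONG, 5L), new Lit(Type.STRING, "5"));
            try {
                typeOf(bad);
            } catch (IllegalStateException e) {
                System.out.println("rejected: " + e.getMessage());  // mirrors the error in fig. 12
            }
        }
    }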
Finally, all functions used in the logic description come from a standard library that can be continuously extended, which facilitates efficient writing of checking logic and avoids lengthy or repeated logic definitions, as shown in fig. 3.
Overall, the logic description design follows the principles of safety, flexible expression, simple writing and easy understanding.
The implementation part, as shown in fig. 4, is detailed as follows:
The data source analysis module is responsible for completing data source definition based on the division of the three checking elements, and provides online data source definition, real-time validity checking and related analysis functions. To preserve generality, flexibility and security, the grammar of the data source definition is based on a subset of the ANSI SQL language and contains only the select query clause. The module contains an SQL parser that supports the core functions; the workflow is shown in fig. 5: SQL is edited and submitted online, parsed by the SQL parser, and a legal data source definition is generated after it passes. The SQL parser implements a complete SQL compiler front end, is responsible for checking the submitted SQL, and mainly fulfils the following four functions:
(1) Security checking, as shown in fig. 6 and fig. 7: compilation of any non-select statement is rejected at the syntax level, ensuring that SQL statements cannot modify or delete existing data of the system and guaranteeing data security;
(2) Usability checking, as shown in fig. 8: checking whether the submitted statement conforms to the select clause grammar rules and can serve as a legal description of a data source definition;
(3) SQL normalization: rewriting the submitted statement into normalized SQL with the same semantics that is easier for subsequent checking rules to reference; for example, as shown in fig. 9, the asterisk in the select statement is fully expanded, which facilitates name references and related checks when the compiling module checks the rule definitions;
(4) Data source knowledge extraction: after step (3) is completed, data source knowledge including reference names, reference types, namespaces and the like is extracted from the statement. For example, for the data source definition "select a, b, c from tables", the SQL parser extracts data source knowledge comprising the references a, b, c and tables together with their types. This information not only provides metadata support for the subsequent compiling module, but also provides runtime information support for the checking computation process. Notably, the data source knowledge retains all metadata following a reference-preserving principle; a reference-based representation has many benefits, enabling the module to support metadata for structured and semi-structured data as well as for streaming data. As a general data structure, the data source knowledge provides a unified data interface.
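A greatly simplified Java sketch of the security check and knowledge extraction is given below. The string-level guard stands in for the full grammar-level check performed by the real SQL front end, and the record fields and table name are illustrative assumptions only.

    import java.util.List;
    import java.util.Locale;

    // Hypothetical sketch: reject non-select statements and carry extracted references
    // in a small data-source-knowledge structure. A real implementation works on the
    // parsed syntax tree, not on raw strings.
    public class SqlGuardSketch {
        record Reference(String name, String kind) {}                 // e.g. column / table
        record DataSourceKnowledge(String namespace, List<Reference> references) {}

        static void securityCheck(String sql) {
            String head = sql.trim().toLowerCase(Locale.ROOT);
            if (!head.startsWith("select"))
                throw new IllegalArgumentException("only select statements are allowed");
        }

        public static void main(String[] args) {
            String sql = "select a, b, c from tables";
            securityCheck(sql);                                        // passes
            DataSourceKnowledge knowledge = new DataSourceKnowledge("default",
                    List.of(new Reference("a", "column"), new Reference("b", "column"),
                            new Reference("c", "column"), new Reference("tables", "table")));
            System.out.println(knowledge);
            try {
                securityCheck("delete from tables");                   // non-select: rejected
            } catch (IllegalArgumentException e) {
                System.out.println("rejected: " + e.getMessage());
            }
        }
    }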
The internal structure and workflow of the SQL parser are shown in fig. 10. After the SQL statement has undergone basic lexical and syntax analysis, the metadata extraction unit interacts online with the specific back-end database product during the semantic analysis stage to obtain all metadata related to the SQL statement, thereby supporting stricter semantic analysis and further improving the security and usability checks; at the same time, the SQL parser processes and extracts this metadata into data source knowledge for the subsequent compiling module.
It is worth noting that the SQL parser has no SQL compiler back end and cannot actually execute SQL. In addition, as the connection point to the various database products, the metadata extraction unit shields the technical differences among back-end database products on the basis of performing its basic functions, unifies the representation of relational metadata, and provides a consistent external interface.
This module completes the definition and analysis work related to the data source and provides metadata support for the subsequent compiling module, so that the checking rules obtain clear computation semantics in a specific context. The data source knowledge generated by the SQL parser is the bridge connecting the data source and the checking logic, and also plays the role of separating them.
The compiling module focuses on compiling, analyzing, optimizing and generating the final representation of the checking logic description. The compilation flow is shown in fig. 11: the checking rules are analyzed and computed in turn by the lexical, syntax, semantic, optimization and logic generation sub-modules, finally producing a logic intermediate representation that is independent of any specific computing platform and database product. An error at any stage interrupts compilation, and detailed compilation error information is fed back promptly so that the user can quickly correct the logic. For example, as shown in fig. 12, the right-hand expression is mistakenly written as the string '5', the types on the two sides of the inequality operator are inconsistent, and compilation fails with an error.
Each sub-module is described below:
The lexical analysis sub-module starts from the lexical level of the checking logic, performs word segmentation on the checking rule, and converts the character stream into a token stream for analysis by the syntax analysis sub-module.
The syntax analysis sub-module, after receiving the token stream provided by the lexical analysis sub-module, constructs a concrete syntax tree (CST) based on an LL(x) algorithm according to the grammar rules of the checking logic, then generates an abstract syntax tree (AST), and finally submits the abstract syntax tree to the semantic analysis sub-module.
The semantic analysis sub-module receives the data source knowledge provided by the data source analysis module and checks and analyzes the semantics of the checking logic. In order to expose more logic errors as early as possible, strong type checking guarantees type-based computation safety (see figs. 12, 13 and 14); the check also ensures that the evaluation result type of the checking logic is always Boolean so as to embody a clear judgment on data quality. In addition, the analysis allows some rule-based type conversions, improving language usability on the premise of safety.
As shown in fig. 15, the checks and analysis also cover the name references mentioned in the checking rules, logical consistency constraints and so on, to fully ensure that the checking rule description is complete and computable. After all analysis is completed, an intermediate representation with type information (Typed IR), hereafter simply the intermediate representation, is generated and submitted to the subsequent optimization and logic generation sub-module.
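The two semantic constraints described above, namely that every name reference must resolve against the data source knowledge and that the rule as a whole must evaluate to a Boolean, can be pictured with the following Java sketch; the column names, types and method names are illustrative assumptions, not the internal Typed IR of the invention.

    import java.util.Map;
    import java.util.Set;

    // Hypothetical sketch of two semantic checks performed before the Typed IR is emitted.
    public class SemanticCheckSketch {
        // Column types supplied by the data source knowledge (illustrative content).
        static final Map<String, String> SOURCE_COLUMNS =
                Map.of("col", "string", "amt", "long");

        static void checkReferences(Set<String> referencedNames) {
            for (String name : referencedNames)
                if (!SOURCE_COLUMNS.containsKey(name))
                    throw new IllegalStateException("unknown reference: " + name);
        }

        static void checkResultType(String inferredType) {
            // The evaluation result of a checking rule must always be Boolean.
            if (!"boolean".equals(inferredType))
                throw new IllegalStateException("rule must evaluate to Boolean, got " + inferredType);
        }

        public static void main(String[] args) {
            checkReferences(Set.of("col", "amt"));   // both resolve against the knowledge
            checkResultType("boolean");              // ok
            try {
                checkReferences(Set.of("colx"));     // fails: name not in the data source
            } catch (IllegalStateException e) {
                System.out.println("rejected: " + e.getMessage());
            }
        }
    }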
The optimization and logic generation sub-module: once a rule passes the semantic analysis sub-module, it has already passed the various checks. The main purpose of this module is logic optimization; as shown in fig. 11, in each optimization round the optimizer receives an intermediate representation and computes a logically equivalent but better optimized intermediate representation. For example, an important optimization is to evaluate in advance all computations that involve only literals: 1+1 can be folded directly into 2, so that it need not be repeated inefficiently a large number of times during the actual execution of the checking rule.
After the optimizer completes multiple optimization rounds, an optimal intermediate representation is produced; some process results are removed, the data structure is trimmed, and a final logic representation is generated. This logic representation is a data structure carrying data types and auxiliary computation information, contains all the information required for computation, is independent of any specific database product, and can serve as a general interface to be matched with a specific computing engine, framework or platform, as detailed in the second part, the data checking computing system.
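A minimal constant-folding pass of the kind described above might look as follows. This is a sketch over an assumed arithmetic intermediate representation, not the optimizer of the invention.

    // Hypothetical sketch: one optimization round that folds literal-only arithmetic,
    // e.g. Add(1, 1) becomes Lit(2), so it is not re-evaluated for every checked row.
    public class ConstantFoldingSketch {
        sealed interface Node permits Lit, Add {}
        record Lit(long value) implements Node {}
        record Add(Node left, Node right) implements Node {}

        static Node fold(Node n) {
            if (n instanceof Add add) {
                Node l = fold(add.left());
                Node r = fold(add.right());
                if (l instanceof Lit a && r instanceof Lit b)
                    return new Lit(a.value() + b.value());   // literal-only: fold in advance
                return new Add(l, r);
            }
            return n;                                         // literals are already optimal
        }

        public static void main(String[] args) {
            Node before = new Add(new Lit(1), new Lit(1));
            System.out.println(fold(before));                 // Lit[value=2]
        }
    }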
At this point the data source definition and the checking rule definition are completely decoupled. The checking output is tightly connected to the data persistence layer and can therefore be completed by the specific computing engine, framework or platform; it is detailed in the result output module of the second part, the data checking computing system.
With appropriate modifications, the system can be extended as follows:
(1) Adding analysis support for semi-structured data: the module adds JSON for describing the structure and data types of semi-structured data and generates data source knowledge through a JSON parser. Because the metadata is kept following the reference-preserving principle, the interface layer remains stable and the compiling module can be connected seamlessly, so the whole system can analyze checking rules based on semi-structured data sources, providing information support for the corresponding extension of the data checking computing system.
(2) Adding analysis support for streaming data: similar to the idea in (1), on the basis of keeping the reference principle, a DSL (for example YAML) describing the streaming data structure and types is added and a parser is written to generate standard data source knowledge, which can then be connected to the compiling module, so that the system can analyze checking rules based on streaming data sources.
(3) Continuously optimizing the final logic representation: in the optimization and logic generation sub-module the optimizer optimizes the intermediate representation progressively; as the system functions expand, more layers of analysis and optimization can be added to keep improving the final logic representation, increase the computational expressive power, and help the checking computation stage generate higher-performance execution code.
Overall, the modules of the system cooperate closely to achieve complete decoupling of the three elements of data source, checking logic and checking output. With online real-time compilation and feedback, users can quickly complete data source and checking logic definitions, notably simplifying development, testing and deployment. Chinese is introduced appropriately so that Chinese and conventional notations complement each other, improving the readability and maintainability of the checking logic, lowering the usage threshold, and reducing cost while improving efficiency. A large number of strict compile-time checks on grammar and semantics, together with strong type checking, significantly improve the safety of data and computation. Because data sources are defined with an ANSI SQL subset and the checking logic description depends only on the standard data source knowledge interface provided by the data source analysis module, the problem of adapting the checking logic to various database back ends is thoroughly solved, achieving the effect of writing once and running many times.
The second part: the invention also provides a data checking computing system.
The data checking computing system aims to complete large-scale checking computation based on the work of the data checking logic description system of the first part. The core of the system is a checking engine, which provides a complete runtime environment for checking computation. The basic modules and structure are shown in fig. 16: the computing core module is responsible for the actual computation, the required target checking data and checking computation logic are provided by the data input module and the code generation module respectively, and after the computation is completed, the result output module completes the persistence of the checking results.
The modules are described in detail below:
The data input module extracts the target checking data according to the data source related information and submits it to the computing core module. The data source related information comes from the data source knowledge and the normalized SQL provided by the data checking logic description system (figs. 9 and 10); with this information the module has full awareness of the structure and origin of the data to be checked, and can provide data to the computing core module stably, securely and continuously.
The code generation module is responsible for converting the checking rules related to the target checking data into executable code in batches and submitting it to the computing core module for execution. The internal structure and workflow of the module are shown in fig. 17: the final logic representation and the data source knowledge are produced by the data checking logic description system; after receiving them, the translator traverses the final logic representation, combines it with the data source knowledge and generates efficient executable code. Inspired by the lambda calculus, the executable code is composed of a number of orthogonal basic operators; once the checking logic is compiled, it is expressed as an algebraic combination of these basic operators, and this operator set can be regarded as the "assembly language" of the whole checking engine.
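The idea that compiled checking logic becomes an algebraic combination of a small set of orthogonal operators can be sketched in Java as follows; the operator set, names and row representation are assumptions for illustration, not the engine's actual operators.

    import java.util.Map;
    import java.util.function.Function;

    // Hypothetical sketch: a checking rule compiled into a combination of basic operators.
    // Each operator maps a row (here a simple map) to a value; composition yields the rule.
    public class OperatorCombinationSketch {
        interface Op extends Function<Map<String, Object>, Object> {}

        static Op field(String name)      { return row -> row.get(name); }            // reference
        static Op notNull(Op input)       { return row -> input.apply(row) != null; } // predicate
        static Op and(Op left, Op right)  {                                            // conjunction
            return row -> (Boolean) left.apply(row) && (Boolean) right.apply(row);
        }

        public static void main(String[] args) {
            // "name is not null and col is not null" as an algebraic operator combination.
            Op rule = and(notNull(field("name")), notNull(field("col")));
            Map<String, Object> row = Map.of("name", "x", "col", 1);
            System.out.println(rule.apply(row));   // true
        }
    }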
The computing core module receives the checking data and the executable code compiled from the checking rules and executes the computation to obtain the results. The core component of the module is an interpreter for these basic operators. To improve concurrency and strengthen runtime control, the interpreter is equipped with an independent thread pool; according to the algebraic structure of the operator combination, concurrent parts are decomposed and dispatched to different threads for execution. In addition, an execution information collection unit (Metrics Unit) is built into the interpreter, responsible for collecting information during the computation and feeding it back to other units for runtime optimization. For example, after execution information of the branch logic is collected, the hit rate of each branch is counted periodically; when the hit rate of a branch reaches a threshold, an optimization can be triggered so that the branch and the predicate logic execute concurrently, further increasing computation concurrency.
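A small Java sketch of this runtime-feedback idea follows; the counters, threshold and class name are hypothetical and stand in for the actual Metrics Unit. It simply counts branch hits and reports when one branch dominates enough to justify a more concurrent, speculative execution plan.

    import java.util.concurrent.atomic.LongAdder;

    // Hypothetical sketch of an execution-information unit: per-branch hit counters
    // whose ratio can trigger a runtime optimization once a threshold is reached.
    public class BranchMetricsSketch {
        static final LongAdder thenHits = new LongAdder();
        static final LongAdder otherwiseHits = new LongAdder();
        static final double THRESHOLD = 0.9;          // assumed trigger ratio

        static void record(boolean tookThenBranch) {
            (tookThenBranch ? thenHits : otherwiseHits).increment();
        }

        static boolean shouldSpeculate() {
            long t = thenHits.sum(), o = otherwiseHits.sum();
            long total = t + o;
            // When one branch is hit often enough, evaluate it concurrently with the predicate.
            return total > 1000 && (double) Math.max(t, o) / total >= THRESHOLD;
        }

        public static void main(String[] args) {
            for (int i = 0; i < 2000; i++) record(i % 20 != 0);   // roughly 95% then-branch
            System.out.println(shouldSpeculate());                 // true
        }
    }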
In large-scale data checking scenarios, the checking of one data source typically involves the computation of tens or even hundreds of checking rules. Clearly, making full use of the multi-core nature of modern CPUs and combining the computation of all checking rules that depend on the same data source saves IO resources and improves computing efficiency. Based on this, the module implements an aggregation algorithm that completes the computation of any number of homologous checking rules with a single IO pass over the data source, while keeping the computation result of each rule independent so that the rules do not interfere with one another.
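The aggregation idea, scanning the shared data source once and feeding every homologous rule from that single pass while keeping each rule's result independent, can be sketched in Java as follows; the rows, rule predicates and counters are illustrative assumptions, not the engine's internal algorithm.

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.Predicate;

    // Hypothetical sketch: one pass over the data source evaluates any number of
    // checking rules that depend on it; each rule keeps an independent counter.
    public class SingleScanAggregationSketch {
        record Rule(String name, Predicate<Map<String, Object>> predicate) {}

        public static void main(String[] args) {
            List<Map<String, Object>> rows = List.of(                 // the shared data source
                    Map.of("col", "x", "amt", 10L),
                    Map.of("col", "",  "amt", -1L));
            List<Rule> rules = List.of(
                    new Rule("col_not_empty", r -> !"".equals(r.get("col"))),
                    new Rule("amt_positive",  r -> (Long) r.get("amt") > 0));

            Map<String, Long> violations = new LinkedHashMap<>();
            for (Map<String, Object> row : rows) {                    // single IO pass
                for (Rule rule : rules)                               // all homologous rules share it
                    if (!rule.predicate().test(row))
                        violations.merge(rule.name(), 1L, Long::sum);
            }
            System.out.println(violations);   // {col_not_empty=1, amt_positive=1}
        }
    }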
Notably, although the checking computation is interpreted at the level of the checking engine, at the JVM level all constructs, including the interpreter, are eventually compiled into JVM bytecode for execution, no differently from ordinary programs.
The result output module is the concrete realization of the checking output among the three elements of data checking; it is responsible for receiving the checking results given by the computing core module, adapting to various data stores and completing the persistence of the checking results.
The checking engine can run alone or be connected to a distributed computing framework, easily obtaining computing power and horizontal scalability matched to the data scale. As shown in fig. 18, in accordance with the characteristics of Spark distributed computing, the system embeds the checking engine sub-modules into the Spark runtime: the work of the code generation module is placed in the Spark driver (Spark Driver), and the generated executable code is then distributed to the computing core modules of the executors (Spark Executor). With only modest adaptation of the data input and result output modules, a distributed checking computation environment can quickly be formed that is functionally identical to the single-node checking engine but with greatly increased checking computing power.
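A hedged Java sketch of this embedding is given below. The Spark calls used (SparkSession, broadcast, mapPartitions) are part of the public Spark Java API, but the rule representation, its "evaluation" inside the partition and the source table are placeholders assumed for illustration, not the actual generated code of the invention.

    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.broadcast.Broadcast;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    import java.util.Iterator;
    import java.util.List;

    // Hypothetical sketch: rule representations are produced in the driver, broadcast,
    // and interpreted against each partition in the executors.
    public class SparkEmbeddingSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("check-engine").getOrCreate();
            JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

            // Assumes the checked table is registered in the Spark catalog.
            Dataset<Row> source = spark.sql("select * from A where DATE = '20211220'");
            // Assumed: the code generation module emits serializable rule programs in the driver.
            List<String> generatedRules = List.of("col is not null", "amt > 0");
            Broadcast<List<String>> rules = jsc.broadcast(generatedRules);

            long violations = source.javaRDD()
                    .mapPartitions((Iterator<Row> rows) -> {
                        long bad = 0;                                  // executor-side core module
                        while (rows.hasNext()) {
                            Row row = rows.next();
                            for (String rule : rules.value()) {
                                // Placeholder for the interpreter evaluating the rule on the row.
                                if (row.anyNull() && rule.contains("not null")) bad++;
                            }
                        }
                        return List.of(bad).iterator();
                    })
                    .reduce(Long::sum);
            System.out.println("violations: " + violations);
            spark.stop();
        }
    }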
In addition, since the modules of the engine have standard interfaces and are loosely coupled, and with the metadata support provided by the data checking logic description system, the following extensions are easily realized:
(1) Adding semi-structured data checking support: an independent semi-structured data extractor is added to the data input module, giving the computing engine the ability to perform checking computation on semi-structured data.
(2) Adding streaming data checking support: based on the support of the data checking logic description system for streaming data, a streaming data extraction unit is added to the data input module and a streaming data write-back unit is added to the result output module, realizing the data checking computation function for streaming data sources (such as Kafka).
Both kinds of extension benefit from the general computation characteristics of the computing core module. The interpreter inside the module interprets and executes the algebraic combination structure of the basic operators, and during execution it obtains the specific field data to be checked through references. For example, in the checking rule "name is not null", name is a reference, and the interpreter finally obtains the external field data to be checked from the data input module through that reference. This reference-based data acquisition method applies to structured and semi-structured data as well as to streaming data, and the interfaces remain stable and general, precisely because the principle of obtaining external data to be checked through references is followed throughout. Following this principle, even if data involved in the actual computation does not exist (as with semi-structured data) or a runtime error occurs, the interpreter handles the exception properly and treats it as a kind of checking result; therefore, the computing core module does not need to be changed.
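The reference-based access principle, including the treatment of missing fields or runtime errors as a kind of check result, can be illustrated with the following Java sketch; the accessor interface, verdict names and row contents are assumptions, not the interpreter's actual interfaces.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch: the interpreter never reads data directly; it resolves a
    // reference through an accessor, so structured, semi-structured and streaming
    // sources differ only in how the accessor is backed. Missing data becomes a result.
    public class ReferenceAccessSketch {
        enum Verdict { PASS, FAIL, ABNORMAL }

        interface FieldAccessor { Object resolve(String reference); }  // may return null or throw

        static Verdict check(FieldAccessor accessor, String reference) {
            try {
                Object value = accessor.resolve(reference);
                return value != null ? Verdict.PASS : Verdict.FAIL;    // "name is not null"
            } catch (RuntimeException e) {
                return Verdict.ABNORMAL;   // missing field or runtime error is itself a check result
            }
        }

        public static void main(String[] args) {
            Map<String, Object> row = new HashMap<>();
            row.put("name", "alice");
            row.put("col", null);
            FieldAccessor mapBacked = ref -> {
                if (!row.containsKey(ref)) throw new IllegalStateException("no field: " + ref);
                return row.get(ref);
            };
            System.out.println(check(mapBacked, "name"));     // PASS
            System.out.println(check(mapBacked, "col"));      // FAIL
            System.out.println(check(mapBacked, "missing"));  // ABNORMAL
        }
    }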
Overall, the data checking computing system provides an independent and complete checking computation environment, realizes a thorough separation of computation and storage so that the checking computing power can be adjusted independently according to the workload, and shields the differences among the various back-end database products so that multiple back ends can reuse and share the same checking system, avoiding repeated construction and saving technology resources; finally, by means of the aggregation algorithm, all homologous checking rules are computed with only one IO pass, greatly promoting balanced utilization of system resources and notably improving computing efficiency.
In recent years, financial supervision has imposed increasingly strict requirements on the quality of standardized supervisory data (EAST for short) submitted by large banks. EAST data is characterized by comprehensive business coverage, large volumes of detail data and tight reporting deadlines; to continuously improve EAST data quality, a safe and efficient mass data quality checking system needs to be built.
The EAST big data checking and analysis system of Xingjingbang is built on the data checking logic description system and the data checking computing system. The system holds more than 4,600 checking rules. In order to respond quickly to supervisory changes, the system provides real-time online editing, checking and submission of checking rules; rules take effect immediately without a release process, and recomputation with the new rule logic can start at once. The average daily volume of checked data is 1.3 TB with a checking time of 1 to 2 hours; at peak times 5 to 8 TB of data are checked in a single day with a computation time of 5 to 9 hours, and the checking time is basically linear in the workload.
According to preliminary tests in the same cluster environment, with the same data set and the same work task, and with the checking rules written in SQL as the baseline, the memory resources required by the checking process (measured in MB, i.e. megabyte, seconds) are only 8 to 14 percent of those required by SQL, and the computing resources (measured in vcore, i.e. virtual core, seconds) are 52 to 74 percent of those required by SQL. In addition, in the same environment, when the number of checking rules on the same data source is increased from 25 to 250, the computation time grows to only 1.9 times the original; under the same conditions, the time of SQL-based checking does not stay close to linear because it triggers an IO bottleneck.
Benefiting from the flexible result output module, the checking computing system can output statistical, sampling and other kinds of results, which makes it convenient for business staff both to quickly grasp the overall data quality and to analyze specific problems in detail. In addition, since the checking system is independent of any specific back-end database, each branch can quickly migrate and deploy it and connect it to its own data back end, rapidly building a branch-local personalized data checking system.
Those skilled in the art will appreciate that, in addition to being implemented as pure computer readable program code, the system and its devices, modules and units provided by the invention can be implemented entirely by logically programming the method steps, taking the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system and its devices, modules and units can be regarded as a hardware component; the devices, modules and units realizing the various functions can be regarded as structures within the hardware component, and they can also be regarded both as software modules implementing the method and as structures within the hardware component.
The foregoing describes specific embodiments of the present application. It is to be understood that the application is not limited to the particular embodiments described above, and that various changes or modifications may be made by those skilled in the art within the scope of the appended claims without affecting the spirit of the application. The embodiments of the application and the features of the embodiments may be combined with each other arbitrarily without conflict.