CN118113689A

CN118113689A - A data quality analysis method and system

Info

Publication number: CN118113689A
Application number: CN202311805076.7A
Authority: CN
Inventors: 陈磊; 马灵; 王保蛟
Original assignee: Beijing Yusys Technologies Group Co ltd
Current assignee: Beijing Yusys Technologies Group Co ltd
Priority date: 2023-12-26
Filing date: 2023-12-26
Publication date: 2024-05-31
Anticipated expiration: 2043-12-26
Also published as: CN118113689B

Abstract

The embodiment of the invention provides a data quality analysis method and a system, comprising the following steps: constructing a data full-link map; determining data to be checked, and screening out problem data which does not meet the preset data quality check rule from the data to be checked; searching associated problem data associated with the problem data according to the full link map; carrying out data correction on the problem data; and synchronously updating the associated problem data according to the corrected problem data. By the technical scheme, after the problem data is found, the system can automatically update or supplement all associated data according to the full-link map without repeated searching, so that the working efficiency is greatly improved, and the consistency of data modification is effectively ensured.

Description

A data quality analysis method and system

技术领域Technical Field

本发明涉及数据质量管理技术领域，具体涉及一种数据质量分析方法及系统。The present invention relates to the technical field of data quality management, and in particular to a data quality analysis method and system.

背景技术Background technique

数据质量管理涉及多种技术，包括数据分析和挖掘、确定数据模型及数据、数据质量度量和指标、数据规范和验证，以及管理和监控。这些技术的组合可以支持数据质量分析、修复和管理的全面解决方案。现有技术中，存在一些成熟的数据质量管理引擎产品，可以将上述功能集成起来。Data quality management involves a variety of technologies, including data analysis and mining, determining data models and data, data quality metrics and indicators, data specification and verification, and management and monitoring. The combination of these technologies can support a comprehensive solution for data quality analysis, repair, and management. In the existing technology, there are some mature data quality management engine products that can integrate the above functions.

在实现本发明过程中，发明人发现现有技术中至少存在如下问题：In the process of implementing the present invention, the inventors found that there are at least the following problems in the prior art:

现有的数据质量管理引擎中，虽然可以根据预设规则查找出存在问题的数据，但通常该数据在当前位置之外的其他数据表或数据库中也有出现，因此需要人工方式进行逐一查找并更正，工作效率非常低。因此，如何加快查找和更正问题数据速度，是需要解决的问题。In existing data quality management engines, although problematic data can be found according to preset rules, the data usually also appears in other data tables or databases outside the current location, so it is necessary to manually find and correct them one by one, which is very inefficient. Therefore, how to speed up the speed of finding and correcting problematic data is a problem that needs to be solved.

发明内容Summary of the invention

本发明实施例提供一种数据质量分析方法及系统，用以解决现有数据质量分析工作中查找问题的工作效率较低的问题。The embodiments of the present invention provide a data quality analysis method and system to solve the problem of low efficiency in finding problems in existing data quality analysis work.

为达上述目的，一方面，本发明实施例提供一种数据质量分析方法，包括：构建数据全链路地图；确定待检核数据，并从所述待检核数据中筛选出不满足预设数据质量检核规则的问题数据；根据所述数据全链路地图，搜寻与所述问题数据相关联的关联问题数据；对所述问题数据进行更正；根据更正后的所述问题数据，对所述关联问题数据进行同步更新。To achieve the above-mentioned purpose, on the one hand, an embodiment of the present invention provides a data quality analysis method, including: constructing a full-link data map; determining the data to be checked, and filtering out problem data that does not meet preset data quality check rules from the data to be checked; searching for related problem data associated with the problem data according to the full-link data map; correcting the problem data; and synchronously updating the related problem data according to the corrected problem data.

另一方面，本发明实施例提供一种数据质量分析系统，包括：数据全链路地图构建模块，用于构建数据全链路地图；数据质量检核模块，用于确定待检核数据，并从所述待检核数据中筛选出不满足预设数据质量检核规则的问题数据；问题更正模块，用于对所述问题数据进行更正；同步更新模块，用于根据更正后的所述问题数据，对所述关联问题数据进行同步更新。On the other hand, an embodiment of the present invention provides a data quality analysis system, including: a data full-link map construction module, used to construct a data full-link map; a data quality verification module, used to determine the data to be verified, and filter out problem data that does not meet preset data quality verification rules from the data to be verified; a problem correction module, used to correct the problem data; and a synchronous update module, used to synchronously update the associated problem data according to the corrected problem data.

同时，本发明实施例还提供一种计算机设备，其包括：一个或多个处理器；存储装置，用于存储一个或多个程序；当所述一个或多个程序被所述一个或多个处理器执行时，使得所述一个或多个处理器实现如前所述的数据质量分析方法。At the same time, an embodiment of the present invention also provides a computer device, which includes: one or more processors; a storage device for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the data quality analysis method as described above.

上述技术方案具有如下有益效果：The above technical solution has the following beneficial effects:

本技术方案中，当通过预设的数据质量检核规则查找到问题数据时，系统可以按已经建立的数据全链路地图去查找与该问题数据相关联的上下游的关联数据，从而将所有问题数据全部自动找出；在更正完问题数据后，再根据数据全链路地图完成数据流转加工，实现所有关联问题数据的同步更新，而无需像现有技术一样逐一进行查找和更正，从而大幅提高了工作效率，并确保了数据修改的一致性。In this technical solution, when problem data is found through preset data quality check rules, the system can search for upstream and downstream related data associated with the problem data according to the established data full-link map, thereby automatically finding all problem data; after correcting the problem data, the data flow processing is completed according to the data full-link map to achieve synchronous updating of all related problem data, without the need to search and correct one by one as in the prior art, thereby greatly improving work efficiency and ensuring the consistency of data modification.

此外，本发明还具有以下特点：In addition, the present invention also has the following characteristics:

本技术方案还支持数据质量检核规则的自定义和场景定义，如用户只需要在预设的输入界面内，输入针对检核目标的中文描述内容，系统会根据用户输入内容分析出对应的检核规则，并转化为数据库语言来更新已有的数据质量检核规则，从而满足了不同的用户需求，有效提升了用户体验。This technical solution also supports the customization and scenario definition of data quality check rules. For example, users only need to enter the Chinese description of the check target in the preset input interface. The system will analyze the corresponding check rules based on the user input content and convert it into database language to update the existing data quality check rules, thereby meeting different user needs and effectively improving the user experience.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

为了更清楚地说明本发明实施例或现有技术中的技术方案，下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本发明的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required for use in the embodiments or the description of the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For ordinary technicians in this field, other drawings can be obtained based on these drawings without paying creative work.

图1是本发明实施例一种数据质量分析方法的流程图；FIG1 is a flow chart of a data quality analysis method according to an embodiment of the present invention;

图2是本发明实施例一种数据质量分析系统的框架图；FIG2 is a framework diagram of a data quality analysis system according to an embodiment of the present invention;

图3是本发明实施例数据质量管理引擎的一个具体实施例的架构图；FIG3 is an architecture diagram of a specific embodiment of a data quality management engine according to an embodiment of the present invention;

图4是本发明具体实施例中查找关联问题数据的示意图；FIG4 is a schematic diagram of searching for associated problem data in a specific embodiment of the present invention;

图5是本发明具体实施例中录入界面示意图；5 is a schematic diagram of an input interface in a specific embodiment of the present invention;

图6是本发明具体实施例中表结构示意图。FIG. 6 is a schematic diagram of a table structure in a specific embodiment of the present invention.

具体实施方式Detailed ways

下面将结合本发明实施例中的附图，对本发明实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例仅仅是本发明一部分实施例，而不是全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都属于本发明保护的范围。The following will be combined with the drawings in the embodiments of the present invention to clearly and completely describe the technical solutions in the embodiments of the present invention. Obviously, the described embodiments are only part of the embodiments of the present invention, not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by ordinary technicians in this field without creative work are within the scope of protection of the present invention.

如图1所示，本发明实施例提供一种数据质量分析方法，包括：As shown in FIG1 , an embodiment of the present invention provides a data quality analysis method, including:

S101、构建数据全链路地图；S101, build a full-link data map;

S102、确定待检核数据，并从待检核数据中筛选出不满足预设数据质量检核规则的问题数据；S102, determining the data to be checked, and screening out problematic data that does not meet preset data quality check rules from the data to be checked;

S103、根据数据全链路地图，搜寻与问题数据相关联的关联问题数据；S103, searching for related problem data associated with the problem data according to the data full-link map;

S104、对问题数据进行更正；S104, correct the problematic data;

S105、根据更正后的问题数据，对关联问题数据进行同步更新。S105. Synchronously update the associated question data according to the corrected question data.

为解决前述问题，本技术方案中，不再像现有技术一样通过设定不同的数据质量检核规则去逐一发现全部存在问题的数据，而是在数据库构造时同时创建数据全链路地图，在后续发现一个问题数据后，通过数据全链路地图进行追溯，查找与问题数据直接关联的上下游数据，这样可以一次性将所有存在关联的问题数据全部找出，极大地减轻了数据质量管理人员的工作量。由于数据全链路地图中，各个数据与关联数据之间存在血缘关系，即相关联的数据都是从某一数据开始进行数据流转加工而得到的，这样一来，只要修改了问题数据，就可以根据数据全链路地图同时修改所有其他的关联问题数据，实现同步更新。To solve the above problems, this technical solution no longer sets different data quality inspection rules to find all problematic data one by one as in the prior art. Instead, a full-link data map is created at the same time when the database is constructed. When a problematic data is subsequently discovered, it is traced back through the full-link data map to find the upstream and downstream data directly related to the problematic data. In this way, all the related problematic data can be found at one time, greatly reducing the workload of data quality management personnel. Since there is a blood relationship between each data and the related data in the full-link data map, that is, the related data are all obtained by data flow processing starting from a certain data, in this way, as long as the problematic data is modified, all other related problematic data can be modified at the same time according to the full-link data map to achieve synchronous update.

进一步的，步骤S103具体包括：Furthermore, step S103 specifically includes:

S1031、根据数据全链路地图，分别在数据库级血缘关系、表级血缘关系和字段级血缘关系的范围内追溯与问题数据相关联的全部关联数据；S1031. According to the data full-link map, all associated data associated with the problematic data are traced within the scope of database-level lineage, table-level lineage, and field-level lineage;

S1032、从全部关联数据中筛选出不满足预设数据质量检核规则的关联数据，作为关联问题数据。S1032. Filter out the associated data that do not meet the preset data quality check rules from all the associated data as associated problem data.

通常发现的问题数据都是某个工作表内的某个字段，在搜寻关联问题数据时，除了要在表内搜寻外，还要在具有血缘关系的不同数据表以及具有血缘关系的不同数据库中进行搜寻。The problem data usually found are all fields in a certain worksheet. When searching for related problem data, in addition to searching in the table, you also need to search in different data tables with blood relationships and different databases with blood relationships.

进一步的，步骤S101具体包括：Furthermore, step S101 specifically includes:

S1011、获取用于构建数据全链路地图的血缘关系数据；S1011. Obtaining blood relationship data for building a full-link data map;

S1012、解析血缘关系数据的层级关系；S1012, analyzing the hierarchical relationship of blood relationship data;

S1013、根据层级关系构建数据全链路地图。S1013. Construct a full-link data map based on the hierarchical relationship.

本技术方案中，数据全链路地图的血缘关系数据是一种关系，构建数据全链路地图的基本思路是：首先通过采集的方式获取ETL(Extract-Transform-Load，用来描述将数据从来源端经过抽取(extract)、转换(transform)、加载(load)至目的端的过程)对应的SQL脚本，并通过血缘解析工具解析里面的SQL，通过解析ETL脚本，形成数据库、表、字段对应的数据加工链路，也就是全链路，并通过钻取的方式可以看到表和字段的血缘关系，其中字段是最小颗粒度。In this technical solution, the blood relationship data of the data full-link map is a kind of relationship. The basic idea of constructing the data full-link map is: first, obtain the SQL script corresponding to ETL (Extract-Transform-Load, which is used to describe the process of extracting, transforming, and loading data from the source to the destination) through collection, and parse the SQL in it through the blood relationship parsing tool. By parsing the ETL script, a data processing link corresponding to the database, table, and field is formed, that is, the full link, and the blood relationship between the table and the field can be seen by drilling, where the field is the smallest granularity.

进一步的，在步骤S101之前，还包括：Furthermore, before step S101, the method further includes:

S011、根据数据表的表结构制定自定义输入逻辑规则；S011. Develop custom input logic rules based on the table structure of the data table;

S012、根据自定义输入逻辑规则确定录入界面；S012. Determine the input interface according to the custom input logic rule;

S013、通过录入界面获取用户自定义输入内容；S013, obtaining user-defined input content through the input interface;

S014、解析用户自定义输入内容，得到关于用户自定义检核规则的逻辑表达式；S014, parsing the user-defined input content to obtain a logical expression for the user-defined verification rule;

S015、将逻辑表达式转换为数据库语言形式的输入指令；S015, converting the logical expression into an input instruction in a database language form;

S016、通过输入指令完成数据质量检核规则的用户自定义。S016. Complete user customization of data quality check rules by inputting instructions.

常见的数据质量管理引擎中，只有通过质量规则发现问题数据这个单一逻辑，若需进行质量规则调整时，需要技术人员用数据库语言重新调整程序，以更新质量规则；而通常数据质量管理人员并不熟悉编程，难以独自完成调整工作，因此，为了便于数据管理人员的操作，本技术方案中，还设计了一种支持规则自定义和场景定义的方法，即用户只需要输入满足预设输入逻辑规则的针对检核目标的内容描述(优选采用在预设的录入界面进行输入的方式，或通过在录入界面的表格中下拉菜单选取的方式，在该种方式下，由于录入界面是根据自定义输入逻辑规则来设计的，对用户输入具有一定的约束作用，可以避免用户输入的内容过于随意而不能被准确识别)，系统会根据输入内容自动生成关于用户自定义检核规则的逻辑表达式(即，用于表述用户的意图)，并自动根据逻辑表达式生成SQL语言的代码，从而实现数据质量检核规则的用户自定义，提升用户使用场景。In common data quality management engines, there is only a single logic of discovering problematic data through quality rules. If the quality rules need to be adjusted, technical personnel need to readjust the program in the database language to update the quality rules. However, data quality management personnel are usually not familiar with programming and it is difficult for them to complete the adjustment work alone. Therefore, in order to facilitate the operation of data management personnel, this technical solution also designs a method that supports rule customization and scenario definition, that is, the user only needs to enter the content description for the verification target that meets the preset input logic rules (preferably by entering in a preset entry interface, or by selecting from a drop-down menu in the table of the entry interface. In this way, since the entry interface is designed based on the custom input logic rules, it has a certain constraint on user input, which can prevent the user's input from being too arbitrary and cannot be accurately identified). The system will automatically generate a logical expression for the user-defined verification rule based on the input content (that is, used to express the user's intention), and automatically generate SQL language code based on the logical expression, thereby realizing user customization of data quality verification rules and improving user usage scenarios.

进一步的，数据质量检核规则具体包括：Furthermore, the data quality check rules specifically include:

用于实现数据完整性检查的数据质量检核规则、用于实现数据准确性验证的数据质量检核规则、用于实现数据一致性检查的数据质量检核规则、以及用于实现异常数据识别的数据质量检核规则，以此分别查找不满足这几种检核规则的问题数据。Data quality check rules for data integrity check, data quality check rules for data accuracy verification, data quality check rules for data consistency check, and data quality check rules for abnormal data identification are used to find problematic data that does not meet these check rules.

如图2所示，本发明实施例还提供一种数据质量分析系统，包括：As shown in FIG2 , an embodiment of the present invention further provides a data quality analysis system, including:

数据全链路地图构建模块21，用于构建数据全链路地图；A data full-link map construction module 21 is used to construct a data full-link map;

数据质量检核模块22，用于确定待检核数据，并从待检核数据中筛选出不满足预设数据质量检核规则的问题数据；The data quality checking module 22 is used to determine the data to be checked and filter out problematic data that does not meet the preset data quality checking rules from the data to be checked;

关联问题数据查找模块23，用于根据数据全链路地图，搜寻与问题数据相关联的关联问题数据；The associated problem data search module 23 is used to search for associated problem data associated with the problem data according to the data full-link map;

问题更正模块24，用于对问题数据进行更正；A problem correction module 24, used to correct the problem data;

同步更新模块25，用于根据更正后的所述问题数据，对关联问题数据进行同步更新。The synchronous updating module 25 is used to synchronously update the associated problem data according to the corrected problem data.

进一步的，数据质量检核模块22具体用于：根据数据全链路地图，分别在数据库级血缘关系、表级血缘关系和字段级血缘关系的范围内追溯与问题数据相关联的全部关联数据；从全部关联数据中筛选出不满足预设数据质量检核规则的关联数据，作为关联问题数据。Furthermore, the data quality check module 22 is specifically used to: trace back all related data associated with the problem data within the scope of database-level lineage, table-level lineage and field-level lineage according to the data full-link map; and filter out the related data that does not meet the preset data quality check rules from all the related data as the related problem data.

进一步的，数据全链路地图构建模块21具体用于：获取用于构建数据全链路地图的血缘关系数据；解析血缘关系数据的层级关系；根据层级关系构建数据全链路地图。Furthermore, the data full-link map construction module 21 is specifically used to: obtain blood relationship data used to construct the data full-link map; parse the hierarchical relationship of the blood relationship data; and construct the data full-link map according to the hierarchical relationship.

进一步的，数据质量分析系统还包括数据质量检核规则自定义模块01，用于：根据数据表的表结构制定自定义输入逻辑规则；根据自定义输入逻辑规则确定录入界面；通过录入界面获取用户自定义输入内容；解析用户自定义输入内容，得到关于用户自定义检核规则的逻辑表达式；将逻辑表达式转换为数据库语言形式的输入指令；通过输入指令完成数据质量检核规则的用户自定义。Furthermore, the data quality analysis system also includes a data quality check rule customization module 01, which is used to: formulate customized input logic rules according to the table structure of the data table; determine the input interface according to the customized input logic rules; obtain user-defined input content through the input interface; parse the user-defined input content to obtain a logical expression for the user-defined check rule; convert the logical expression into an input instruction in the form of a database language; and complete user customization of data quality check rules through input instructions.

以下结合一个在银行领域应用的具体应用实例对本发明实施例上述技术方案进行详细说明，该具体实施例如图3所示，其为一个采用了上述数据质量分析方法的数据质量管理引擎。该数据质量管理引擎接收业务部门、科技部门和分支行提供的业务口径的数据质量检核规则，自动将业务口径的数据质量检核规则转换为技术口径的数据质量检核脚本规则；使用转换后的技术口径数据质量检核脚本规则对模拟生产环境数据进行自动化的数据质量检查；自动出具问题数据的质量问题明细和检核报告；获取管理人员对质量问题明细和检核报告的审核确认结果；根据质量问题明细和检核报告自动分析问题并生成提升方案；根据问题数据的明细自动在源头调整数据问题。在确认检核结果的步骤中，若发现问题数据的明细和检核报告有问题，反馈到提供业务口径的数据质量检核规则的步骤或转换业务口径的数据质量检核规则的步骤进行重新确认。The following is a detailed description of the technical solution of the embodiment of the present invention in conjunction with a specific application example in the banking field. The specific embodiment is shown in FIG3 , which is a data quality management engine that uses the above data quality analysis method. The data quality management engine receives the business-based data quality verification rules provided by the business department, the technology department and the branch bank, and automatically converts the business-based data quality verification rules into the technical-based data quality verification script rules; uses the converted technical-based data quality verification script rules to perform automated data quality checks on the simulated production environment data; automatically issues quality problem details and verification reports for the problem data; obtains the review and confirmation results of the management personnel on the quality problem details and verification reports; automatically analyzes the problems and generates improvement plans based on the quality problem details and verification reports; and automatically adjusts the data problems at the source based on the details of the problem data. In the step of confirming the verification results, if it is found that there are problems with the details and verification reports of the problem data, feedback is given to the step of providing the business-based data quality verification rules or the step of converting the business-based data quality verification rules for reconfirmation.

该图3示例的数据质量管理引擎的实际工作流程如下：The actual workflow of the data quality management engine in Figure 3 is as follows:

1)、业务部门、科技部门、分支行提供业务口径的数据质量检核规则；1) Business departments, technology departments, and branches provide data quality verification rules for business purposes;

2)、数据质量管理岗人员根据业务口径的数据质量检核规则制定技术口径的数据质量检核规则，并录入数据质量系统；具体地，将这些业务口径的数据质量检核规则转换为技术口径的检核脚本规则，涉及将业务逻辑和数据标准转化为技术语言，如SQL查询、编程脚本等，以便能够在数据质量系统中实施自动化的数据质量检查。2) Data quality management personnel formulate technical-caliber data quality verification rules based on business-caliber data quality verification rules and enter them into the data quality system; specifically, these business-caliber data quality verification rules are converted into technical-caliber verification script rules, which involves converting business logic and data standards into technical languages, such as SQL queries, programming scripts, etc., so that automated data quality checks can be implemented in the data quality system.

3)、数据质量管理岗人员对从生产环境迁移的模拟生产环境数据进行数据质量检核；生产环境指的是实际运行应用程序或系统的环境，包括硬件、软件、网络和其他相关组件。在这个生产环境中，用户可以访问和使用应用程序或系统，以执行实际的任务和操作。生产环境是经过严格测试和验证的，以确保其稳定性、安全性和性能。模拟生产环境数据指的是为了测试、验证或训练目的而从生产环境中复制或生成的数据。这些数据与生产环境中的真实数据相似，但可能进行了一些处理或修改，以适应测试或验证的需要。3) Data quality management personnel conduct data quality checks on simulated production environment data migrated from the production environment; the production environment refers to the environment where the application or system is actually running, including hardware, software, network and other related components. In this production environment, users can access and use the application or system to perform actual tasks and operations. The production environment is rigorously tested and verified to ensure its stability, security and performance. Simulated production environment data refers to data copied or generated from the production environment for testing, verification or training purposes. This data is similar to the real data in the production environment, but may have been processed or modified to meet the needs of testing or verification.

直接在生产环境中进行频繁的数据验证操作是不建议的，因为这可能会干扰到实际用户的使用，甚至可能导致数据损坏或系统崩溃。因此，将生产环境的数据迁移到一个独立的测试环境中，进行详尽的验证和测试。以下是详细解释：It is not recommended to perform frequent data verification operations directly in the production environment, as this may interfere with actual user usage and may even cause data corruption or system crashes. Therefore, migrate the data in the production environment to an independent test environment for detailed verification and testing. The following is a detailed explanation:

数据迁移：首先，从生产环境中导出需要验证的数据，并将其导入到测试环境中。这样可以确保测试环境中的数据与生产环境中的数据保持一致，从而使测试结果更具代表性。Data migration: First, export the data that needs to be verified from the production environment and import it into the test environment. This ensures that the data in the test environment is consistent with the data in the production environment, making the test results more representative.

在测试环境中进行验证：在测试环境中，可以对迁移过来的数据进行各种验证操作，例如数据完整性检查、性能测试、安全测试等。由于测试环境与生产环境是隔离的，这些操作不会对实际用户产生影响。Verification in a test environment: In a test environment, you can perform various verification operations on the migrated data, such as data integrity check, performance test, security test, etc. Since the test environment is isolated from the production environment, these operations will not affect actual users.

启动质量检核：在测试环境中完成数据验证后，可以启动质量检核流程。这包括一系列自动化的测试脚本和工具，用于检查数据的准确性和系统的稳定性。如果在这个过程中发现任何问题，可以在不影响用户的情况下进行修复。Start QA: After completing data validation in the test environment, the QA process can be started. This includes a series of automated test scripts and tools to check the accuracy of the data and the stability of the system. If any problems are found during this process, they can be fixed without affecting users.

再次验证：修复在质量检核中发现的问题后，需要再次在测试环境中进行验证，以确保所有问题都得到了妥善解决。Re-verification: After fixing the issues found in the quality check, you need to verify it again in the test environment to ensure that all issues have been properly resolved.

生产环境的质量检核：当在测试环境中完成所有验证并确认没有问题后，可以将更新的代码和数据部署到生产环境中。在生产环境中，还需要进行一次最终的质量检核，以确保在实际使用中的稳定性和性能。Quality check in the production environment: After all verifications are completed in the test environment and confirmed to be problem-free, the updated code and data can be deployed to the production environment. In the production environment, a final quality check is also required to ensure stability and performance in actual use.

通过这种先迁移、再测试、最后部署的流程，可以最大程度地减少在生产环境中出现问题的风险，从而确保系统的稳定性和数据的安全性。Through this process of migration first, testing second, and deployment last, the risk of problems in the production environment can be minimized, thereby ensuring the stability of the system and the security of the data.

4)、基于数据质量检核工具出具问题数据的质量问题明细和检核报告；4) Issue quality problem details and verification report of problematic data based on data quality verification tools;

这个数据质量检核工具在工作时首先基于业务口径进行数据的初步筛选和识别。业务口径指的是与具体业务逻辑、业务规则和业务需求相关的数据标准或规范。This data quality check tool first performs preliminary screening and identification of data based on business caliber when working. Business caliber refers to data standards or specifications related to specific business logic, business rules and business requirements.

具体来说，数据质量检核工具首先使用业务口径来识别哪些数据是与特定业务相关的，或者哪些数据可能存在问题。例如，根据业务的定义和规则，某些数据可能被认为是异常或不合规的。Specifically, data quality checking tools first use business caliber to identify which data is relevant to a specific business or which data may have problems. For example, according to the definition and rules of the business, some data may be considered abnormal or non-compliant.

然后，这个数据质量检核工具会将这些基于业务口径识别出的问题数据“翻译为”技术口径。技术口径是从技术角度定义的数据标准和规范，它与底层的数据结构、数据格式、数据类型等技术细节紧密相关。Then, this data quality check tool will "translate" the problematic data identified based on the business caliber into the technical caliber. The technical caliber is the data standard and specification defined from a technical perspective, which is closely related to the underlying data structure, data format, data type and other technical details.

通过翻译为技术口径，数据质量检核工具可以更深入地从技术层面去检核和识别问题数据。这包括检查数据的完整性、准确性、一致性等方面，以及检查是否存在数据冗余、数据缺失、数据格式错误等技术层面的问题。By translating into technical terms, data quality verification tools can more deeply verify and identify problematic data from a technical perspective. This includes checking the integrity, accuracy, consistency, and other aspects of the data, as well as checking whether there are technical issues such as data redundancy, data missing, and data format errors.

总的来说，这个数据质量检核工具结合了业务口径和技术口径的优势，首先从业务角度对数据进行初步筛选和识别，然后再从技术角度进行深入的问题数据检核，从而确保数据的全面和准确。In general, this data quality verification tool combines the advantages of business and technical perspectives. It first screens and identifies data from a business perspective, and then conducts in-depth problem data verification from a technical perspective to ensure that the data is comprehensive and accurate.

5)、业务部门人员、数据质量管理岗人员基于问题数据的明细和检核报告确认检核结果是否正确；若正确，则业务部门人员看之前的业务规则是否与工具给的问题数据一致，如果一致，那么，这些检核规则是合规的，可以进入生产环境验证；若不正确，那么需要调整业务口径描述或者优化技术口径，再次检核发现质量问题数据。5) Business department personnel and data quality management personnel confirm whether the verification results are correct based on the details of the problem data and the verification report; if correct, the business department personnel check whether the previous business rules are consistent with the problem data given by the tool. If they are consistent, then these verification rules are compliant and can be entered into the production environment for verification; if not, it is necessary to adjust the business scope description or optimize the technical scope and re-verify the data with quality problems.

6)、数据质量管理岗人员分析问题整理提升方案或建议；6) Data quality management personnel analyze problems and organize improvement plans or suggestions;

首先，明确在数据质量检核过程中发现的具体问题。这可能包括数据准确性、完整性、一致性、时效性等方面的问题。通过自动识别问题数据的明细和检核报告，对问题进行分类和归纳。First, identify the specific issues found during the data quality verification process. This may include issues with data accuracy, completeness, consistency, timeliness, etc. The issues are classified and summarized by automatically identifying the details of the problematic data and the verification report.

然后，对于每个识别出的问题，自动进行分析以找出其根本原因。这可能包括技术原因(如系统故障、数据处理错误等)和业务原因(如业务流程不合理、数据录入不规范等)。Then, for each identified problem, an analysis is automatically performed to find out its root cause. This may include technical reasons (such as system failure, data processing errors, etc.) and business reasons (such as unreasonable business processes, irregular data entry, etc.).

然后，根据问题的性质和根本原因，自动制定相应的解决方案。这包括技术解决方案(如修复系统错误、优化数据处理流程等)和业务解决方案(如改进业务流程、提高数据录入质量等)。Then, according to the nature and root cause of the problem, corresponding solutions are automatically formulated. This includes technical solutions (such as fixing system errors, optimizing data processing processes, etc.) and business solutions (such as improving business processes, improving data entry quality, etc.).

最后，对于每个解决方案，制定详细的实施计划，包括具体的步骤、时间表、所需资源以及预期的成果。Finally, for each solution, develop a detailed implementation plan, including specific steps, timelines, required resources, and expected outcomes.

7)、业务部门人员和数据质量管理岗人员若发现问题数据的明细和检核报告有问题，则反馈到环节1、环节2，重新确认业务口径和技术口径；上述问题包括：数据格式错误、数据不一致、不符合取值范围、数据重复、数据不正确、外键引用错误、不符合业务规则中的任意一个或多个。7) If business department personnel and data quality management personnel find problems with the details and verification reports of the problem data, they will feedback to link 1 and link 2 to reconfirm the business and technical calibers; the above problems include: incorrect data format, inconsistent data, non-compliance with the value range, duplicate data, incorrect data, foreign key reference errors, and any one or more of non-compliance with business rules.

8)、业务部门人员根据问题数据的明细在源头(源头是指指根据数据全链路地图找到的最早出现问题的数据，一般会指向业务部门)调整数据问题。该具体实施例的数据质量管理引擎可实现的功能包括：8) Business department personnel adjust data problems at the source (the source refers to the earliest data with problems found according to the data full-link map, which generally points to the business department) according to the details of the problem data. The data quality management engine of this specific embodiment can realize the following functions:

数据分析和挖掘：使用数据分析和挖掘技术来识别数据质量问题。包括数据统计分析、异常检测、模式识别和数据挖掘等技术，以发现数据中的错误、缺失、重复或不一致性。Data analysis and mining: Use data analysis and mining techniques to identify data quality issues. This includes techniques such as statistical analysis of data, anomaly detection, pattern recognition, and data mining to find errors, omissions, duplications, or inconsistencies in the data.

确定数据模型及数据：建设数据模型并将带检核的数据通过数据去重、合并、格式转换等方式初始化源头数据记录或者直接使用源头数据记录。Determine the data model and data: Build a data model and initialize the source data records through data deduplication, merging, format conversion, etc., or directly use the source data records.

数据质量度量和指标：使用数据质量度量和指标来评估数据的质量水平。包括定义和计算数据质量指标，如准确性、完整性、一致性、唯一性和时效性等，以量化数据质量并进行比较和监控。Data quality metrics and indicators: Use data quality metrics and indicators to assess the quality level of data. This includes defining and calculating data quality indicators such as accuracy, completeness, consistency, uniqueness, and timeliness to quantify data quality and compare and monitor it.

数据规范和验证：使用数据规范和验证技术来确保数据符合预定义的规范和标准。包括定义数据规范、制定数据验证规则和约束，并使用验证算法和逻辑检查来验证数据的格式、范围、唯一性和一致性等。以下是一些验证算法和逻辑的例子：Data specification and validation: Use data specification and validation techniques to ensure that data meets predefined specifications and standards. This includes defining data specifications, developing data validation rules and constraints, and using validation algorithms and logic checks to verify the format, scope, uniqueness, and consistency of data. Here are some examples of validation algorithms and logic:

格式验证：这种验证方法用于确保数据符合特定的格式要求。例如，对于电话号码，可以编写一个正则表达式来验证输入是否遵循特定的电话号码格式。Format validation: This validation method is used to ensure that the data conforms to a specific format requirement. For example, for a phone number, you can write a regular expression to validate that the input follows a specific phone number format.

范围验证：这种验证用于确保数据落在一个预定的范围内。例如，一个字段可能要求输入的值在1到100之间。Range validation: This type of validation is used to ensure that data falls within a predetermined range. For example, a field may require that the value entered be between 1 and 100.

唯一性验证：这种验证用于确保数据集中的值是唯一的，没有重复。例如，在数据库中，一个用户ID必须是唯一的，不能有两个相同的用户ID。Uniqueness validation: This validation is used to ensure that the values in the data set are unique and there are no duplicates. For example, in a database, a user ID must be unique and there cannot be two identical user IDs.

一致性验证：这种验证用于确保数据在多个字段或表中保持一致。例如，在一个订单处理系统中，订单总额必须与订单项的总金额一致。Consistency validation: This validation is used to ensure that data is consistent across multiple fields or tables. For example, in an order processing system, the order total must be consistent with the total amount of the order items.

存在性验证：这种验证用于确保引用的数据存在。例如，在一个订单处理系统中，当创建一个新的订单项时，需要验证所引用的产品ID是否存在于产品表中。Existence validation: This validation is used to ensure that the referenced data exists. For example, in an order processing system, when creating a new order item, it is necessary to verify whether the referenced product ID exists in the product table.

依赖验证：这种验证用于确保数据满足特定的业务规则或依赖关系。例如，在一个员工管理系统中，一个员工的开始日期不能晚于结束日期。Dependency validation: This validation is used to ensure that data meets specific business rules or dependencies. For example, in an employee management system, an employee's start date cannot be later than the end date.

自定义验证：当预定义的验证规则不能满足特定需求时，可以编写自定义的验证逻辑。这可以包括复杂的业务规则或特定领域的规则。Custom Validation: When the predefined validation rules do not meet your specific needs, you can write custom validation logic. This can include complex business rules or domain-specific rules.

管理和监控：支持数据质量的管理、跟踪和监控。可以提供数据质量报告、警报、仪表盘和可视化，帮助用户实时了解数据质量状况和变化。Management and monitoring: Supports the management, tracking and monitoring of data quality. It can provide data quality reports, alerts, dashboards and visualizations to help users understand data quality status and changes in real time.

该数据质量管理引擎的突出特点是：当发现某一个表中字段对应的数据有问题之后，结合预先构建的数据全链路地图，能快速找到与当前问题数据存在直接关联的上下游数据字段。这种方式能让用户不用再重复的梳理检核规则和关联上下游的数据，也就是说，通过一个点的问题，寻找数据来源和数据去向的问题。The outstanding feature of this data quality management engine is that when a problem is found in the data corresponding to a field in a table, combined with the pre-built data full-link map, it can quickly find the upstream and downstream data fields that are directly related to the current problem data. This method allows users to no longer repeatedly sort out the verification rules and associate upstream and downstream data. In other words, through a point problem, find the problem of data source and data destination.

该数据质量管理引擎发现问题的点主要包括数据完整性检查、数据准确性验证、数据一致性检查、数据变更监测、异常数据识别，具体为：The data quality management engine finds problems mainly in data integrity check, data accuracy verification, data consistency check, data change monitoring, and abnormal data identification, specifically:

1)、数据完整性检查：通过追踪数据在数据全链路地图中的血缘关系，可以检查数据是否经过了预期的加工过程，是否存在数据丢失或遗漏的情况。如果数据在数据全链路地图上出现中断或缺失，就可以认为存在数据完整性问题。1) Data integrity check: By tracing the relationship of data in the data full-link map, it is possible to check whether the data has gone through the expected processing process and whether there is any data loss or omission. If the data is interrupted or missing on the data full-link map, it can be considered that there is a data integrity problem.

2)、数据准确性验证：沿着数据全链路地图倒推，可以分析数据在加工过程中是否经历了正确的转换和计算步骤。如果发现数据在数据全链路地图中存在错误、异常或不一致，就可以识别出数据准确性问题。2) Data accuracy verification: By working backwards along the data full-link map, we can analyze whether the data has undergone the correct conversion and calculation steps during the processing. If errors, anomalies, or inconsistencies are found in the data full-link map, data accuracy issues can be identified.

3)、数据一致性检查：通过数据全链路地图，可以比较不同环节中的数据，并验证它们是否保持一致。如果发现数据在数据全链路地图上出现不一致或矛盾，就可以确定存在数据一致性问题。3) Data consistency check: Through the data full link map, you can compare the data in different links and verify whether they are consistent. If you find that the data is inconsistent or contradictory on the data full link map, you can determine that there is a data consistency problem.

4)、数据变更监测：通过数据全链路地图，可以跟踪数据的变更过程，识别数据的修改、更新或删除操作。可以检查数据是否按照预期进行了变更，并验证变更操作的准确性和完整性。如果数据的全链路地图显示不正常的变更或未经授权的修改，就可以发现数据变更问题。4) Data change monitoring: Through the data full-link map, you can track the data change process and identify data modification, update or deletion operations. You can check whether the data has been changed as expected and verify the accuracy and completeness of the change operation. If the data full-link map shows abnormal changes or unauthorized modifications, data change problems can be discovered.

5)、异常数据识别：通过沿数据全链路地图的追溯，可以监测数据的异常值、缺失值、重复值等，并定位到引起异常的数据加工环节。这有助于及早发现和处理异常数据，并提高数据质量。5) Abnormal data identification: By tracing along the data full-link map, abnormal values, missing values, duplicate values, etc. of the data can be monitored, and the data processing link that causes the abnormality can be located. This helps to detect and handle abnormal data early and improve data quality.

具体的，对于在该具体实施例中如何实现前述的“根据数据全链路地图，搜寻与问题数据相关联的关联问题数据”、以及“对问题数据进行更正，并根据已更正的问题数据，对关联问题数据进行同步更新”，其具体流程可参见图4：图中矩形框指的是数据存储在数据库中的表名和字段名，实现指通过解析ETL形成的数据全链路地图，虚线箭头指根据某一点(例如一个问题数据的字段)基于数据全链路地图找到跟这个字段有关系的所有字段，例如现在在账户表发现一个身份证号对应的数据不满足数据质量检核规则而存在问题(例如身份证号不满足18位，或者包含非法字符，或者最后一位校验位不正确，或者生日信息错误)，那么，可以通过数据全链路地图找到与账户表的身份证号相关的表和字段对应的问题数据，例如身份证号在账户表、用户表、法人表、汇总明细表都有，那么把他们都找到(实际过程中，可能字段名称已经改变，但链路关系在，也能找到，如姓名在用户表叫姓名，在法人表可以叫法定代表人名称等)，这样，用户就不用再去检核用户表、法人表、汇总明细表的数据。同时，如果还能根据数据最开始的点，例如用户表的身份证号，当用户在用户表对相应的问题数据(身份证号)修改之后，系统还可以根据ETL脚本，启动调度，完成数据流转加工，实现从用户表到账户表、法人表、汇总明细表的数据同步更新，也就是说，在源头调整数据之后，数据可以同步更新，从而大幅提高工作效率。Specifically, as to how to implement the aforementioned "searching for associated problem data associated with the problem data according to the data full-link map" and "correcting the problem data, and synchronously updating the associated problem data according to the corrected problem data" in this specific embodiment, the specific process can be seen in Figure 4: The rectangular box in the figure refers to the table name and field name where the data is stored in the database, the implementation refers to the data full-link map formed by parsing ETL, and the dotted arrow refers to finding all fields related to this field based on a certain point (for example, a field of a problem data) based on the data full-link map. For example, now it is found in the account table that the data corresponding to an ID card number does not meet If there are problems with the data quality check rules (for example, the ID number does not meet 18 digits, or contains illegal characters, or the last check digit is incorrect, or the birthday information is wrong), then the problem data corresponding to the table and field related to the ID number in the account table can be found through the data full link map. For example, the ID number is in the account table, user table, legal person table, and summary detail table, so find them all (in the actual process, the field name may have changed, but the link relationship is still there, and it can be found, such as the name in the user table is called the name, and in the legal person table it can be called the legal representative name, etc.), so the user does not need to check the data in the user table, legal person table, and summary detail table. At the same time, if you can also use the data at the very beginning, such as the ID number in the user table, when the user modifies the corresponding problem data (ID number) in the user table, the system can also start scheduling according to the ETL script, complete the data flow processing, and realize the synchronous update of data from the user table to the account table, legal person table, and summary detail table. In other words, after adjusting the data at the source, the data can be updated synchronously, thereby greatly improving work efficiency.

具体的，对于在该具体实施例中如何实现前述的“数据质量检核规则的用户自定义”，本技术方案中的思路是，根据用户填写的文本(数据库中表和字段的注释)及文本间的计算进行内容解析，并将解析到的内容映射到数据库可以识别的SQL脚本语音，简单说，就是基于预定的质量检核语法(或称为自定义输入逻辑规则)，将用户可以理解的文本内容解析为计算机可以识别的SQL脚本语言。预定的质量检核语法可以支持如下能力：Specifically, as to how to implement the aforementioned "user customization of data quality check rules" in this specific embodiment, the idea in this technical solution is to parse the content according to the text filled in by the user (comments on tables and fields in the database) and the calculations between the texts, and map the parsed content to the SQL script language that the database can recognize. In short, based on the predetermined quality check grammar (or custom input logic rules), the text content that the user can understand is parsed into the SQL script language that the computer can recognize. The predetermined quality check grammar can support the following capabilities:

1)、表达式解析：能够解析输入的数学表达式，并将其转换为可执行的计算语言；1) Expression parsing: It can parse the input mathematical expression and convert it into an executable computing language;

2)、运算符支持：支持各种数学运算符，如加法、减法、乘法、除法、取余等。2) Operator support: Supports various mathematical operators, such as addition, subtraction, multiplication, division, remainder, etc.

3)、变量和函数支持：允许定义和使用变量和函数，以便在表达式中进行计算；3) Variable and function support: allows the definition and use of variables and functions for calculation in expressions;

4)、数据类型支持：支持整数、浮点数、布尔值等常见的数据类型。4) Data type support: supports common data types such as integers, floating-point numbers, Boolean values, etc.

数据质量检核规则的用户自定义的具体流程可参见图5、图6所示的一个例子：用户可以通过图5所示的表格(属于录入界面中的一种)进行自定义输入(或执行通过预设的下拉菜单选取等操作)，且银行信息表(该表在数据库中存储的表名为bank_info)在数据库中存储的注释与字段名称如图6所示，现以图5表格中的第二行数据为例，可以理解为“银行信息表”的“银行地址”不能为空，且不为空的前提条件是“邮编！＝'450000'”；那么，根据预设的质量检核语法，结合图5和图6的内容，可以将【“银行信息表”的“银行地址”不能为空，且不为空的前提条件是“邮编！＝'450000'”】这段文本转义为计算机可以执行的数据库脚本SQL，其SQL为【select id,bank_name,bank_addr,post_no from bank_info wherepost_no！＝’450000’andbank_name is null】，最后，在应用时，通过执行解析的SQL可以找到邮编不是450000且银行地址为空的问题数据。The specific process of user-defined data quality check rules can be seen in an example shown in Figures 5 and 6: the user can make customized input (or perform operations such as selecting through a preset drop-down menu) through the table shown in Figure 5 (one of the input interfaces), and the comments and field names stored in the database of the bank information table (the table name stored in the database is bank_info) are shown in Figure 6. Now taking the second row of data in the table of Figure 5 as an example, it can be understood that the "bank address" of the "bank information table" cannot be empty, and the prerequisite for not being empty is "zip code! = '450000'"; then, according to the preset quality check syntax, combined with the contents of Figures 5 and 6, the text ["bank address" of the "bank information table" cannot be empty, and the prerequisite for not being empty is "zip code! = '450000'"] can be translated into a database script SQL that can be executed by a computer, and the SQL is [select id, bank_name, bank_addr, post_no from bank_info where post_no! ='450000'andbank_name is null], finally, when applied, the problem data whose zip code is not 450000 and bank address is empty can be found by executing the parsed SQL.

为使本领域内的任何技术人员能够实现或者使用本发明，上面对所公开实施例进行了描述。对于本领域技术人员来说；这些实施例的各种修改方式都是显而易见的，并且本文定义的一般原理也可以在不脱离本公开的精神和保护范围的基础上适用于其它实施例。因此，本公开并不限于本文给出的实施例，而是与本申请公开的原理和新颖性特征的最广范围相一致。The disclosed embodiments are described above to enable any person skilled in the art to implement or use the present invention. Various modifications of these embodiments are obvious to those skilled in the art, and the general principles defined herein may also be applied to other embodiments without departing from the spirit and scope of the present disclosure. Therefore, the present disclosure is not limited to the embodiments given herein, but is consistent with the broadest scope of the principles and novel features disclosed in this application.

以上所述的具体实施方式，对本发明的目的、技术方案和有益效果进行了进一步详细说明，所应理解的是，以上所述仅为本发明的具体实施方式而已，并不用于限定本发明的保护范围，凡在本发明的精神和原则之内，所做的任何修改、等同替换、改进等，均应包含在本发明的保护范围之内。The specific implementation methods described above further illustrate the objectives, technical solutions and beneficial effects of the present invention in detail. It should be understood that the above description is only a specific implementation method of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc. made within the spirit and principles of the present invention should be included in the scope of protection of the present invention.

Claims

1. A data quality analysis method, comprising:

Constructing a data full-link map;

determining data to be checked, and screening out problem data which does not meet the preset data quality check rule from the data to be checked;

Searching associated problem data associated with the problem data according to the data full-link map;

correcting the problem data;

And synchronously updating the associated problem data according to the corrected problem data.

2. The data quality analysis method of claim 1, wherein searching for associated problem data associated with the problem data based on the data full link map comprises:

according to the data all-link map, all associated data associated with the problem data are traced back within the ranges of database-level blood-edge relations, table-level blood-edge relations and field-level blood-edge relations respectively;

and screening out the associated data which does not meet the preset data quality check rule from all the associated data, and taking the associated data as associated problem data.

3. The data quality analysis method of claim 1, wherein the constructing a data full link map comprises:

Acquiring blood relationship data for constructing a data full-link map;

Analyzing the hierarchical relationship of the blood relationship data;

and constructing the data full-link map according to the hierarchical relationship.

4. The data quality analysis method of claim 1, further comprising, prior to said constructing a data full link map:

formulating a custom input logic rule according to a table structure of the data table;

Determining an input interface according to the custom input logic rule;

acquiring user-defined input content through the input interface;

Analyzing the user-defined input content to obtain a logic expression related to a user-defined check rule;

Converting the logic expression into an input instruction in a database language form;

and finishing user definition of the data quality check rule through the input instruction.

5. The data quality analysis method of claim 1, wherein the data quality check rule specifically comprises:

the data quality checking rule is used for realizing data integrity checking, the data quality checking rule is used for realizing data accuracy verification, the data quality checking rule is used for realizing data consistency checking, and the data quality checking rule is used for realizing abnormal data identification.

6. A data quality analysis system, comprising:

The data all-link map construction module is used for constructing a data all-link map;

the data quality checking module is used for determining data to be checked and screening out problem data which does not meet the preset data quality checking rule from the data to be checked;

The associated problem data searching module is used for searching associated problem data associated with the problem data according to the data all-link map;

the problem correction module is used for correcting the problem data;

and the synchronous updating module is used for synchronously updating the associated problem data according to the corrected problem data.

7. The data quality analysis system of claim 6, wherein the data quality check module is specifically configured to: according to the data all-link map, all associated data associated with the problem data are traced back within the ranges of database-level blood-edge relations, table-level blood-edge relations and field-level blood-edge relations respectively; and screening out the associated data which does not meet the preset data quality check rule from all the associated data, and taking the associated data as associated problem data.

8. The data quality analysis system of claim 6, wherein the data full link map construction module is specifically configured to: acquiring blood relationship data for constructing a data full-link map; analyzing the hierarchical relationship of the blood relationship data; and constructing the data full-link map according to the hierarchical relationship.

9. The data quality analysis system of claim 6, further comprising a data quality check rule customization module for: formulating a custom input logic rule according to a table structure of the data table; determining an input interface according to the custom input logic rule; acquiring user-defined input content through the input interface; analyzing the user-defined input content to obtain a logic expression related to a user-defined check rule; converting the logic expression into an input instruction in a database language form; and finishing user definition of the data quality check rule through the input instruction.

10. A computer device, comprising:

One or more processors;

a storage means for storing one or more programs;

The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the data quality analysis method of any of claims 1-5.