
CN119127859A - A data quality detection method, device and medium based on business scenarios - Google Patents

A data quality detection method, device and medium based on business scenarios

Info

Publication number
CN119127859A
CN119127859A (application number CN202411272225.2A)
Authority
CN
China
Prior art keywords: data, quality detection, data quality, task, information
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411272225.2A
Other languages
Chinese (zh)
Inventor
郑伟波
冯子恺
李腾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur General Software Co Ltd
Original Assignee
Inspur General Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Inspur General Software Co Ltd filed Critical Inspur General Software Co Ltd
Priority to CN202411272225.2A priority Critical patent/CN119127859A/en
Publication of CN119127859A publication Critical patent/CN119127859A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Debugging And Monitoring (AREA)

Abstract


The embodiments of the present specification disclose a data quality detection method, device and medium based on business scenarios, and relate to the technical field of data quality detection. The method comprises: determining current business scenario information corresponding to current data quality detection requirements, and determining a data quality detection strategy based on the business scenario information; configuring quality detection tasks through the data quality detection strategy to determine a corresponding data quality detection task set, including multiple data quality detection subtasks; submitting the data quality detection task set to the Griffin task scheduling system, and distributing tasks for multiple data quality detection subtasks through the Griffin task scheduling system to determine a working node corresponding to each data quality detection subtask; executing the data quality detection subtask through each working node, generating a quality detection result corresponding to each working node, and integrating the quality detection results according to the current business scenario information to determine a data quality report.

Description

Data quality detection method, device, and medium based on business scenario
Technical Field
The present disclosure relates to the field of data quality detection technologies, and in particular, to a method, an apparatus, and a medium for detecting data quality based on a service scenario.
Background
Data quality refers to the degree to which data meets business requirements and supports decision-making in terms of applicability, accuracy, integrity, consistency, timeliness, reliability, and so on. It covers the whole process from data acquisition, processing, and storage to application, ensures that the data reflects real-world conditions, and provides effective information support in various scenarios. With the rapid growth in the number, scale, and usage frequency of digital services in fields such as social networks and electronic commerce, enterprises need to collect and process large amounts of data to provide personalized services, optimize product experience, and enhance market competitiveness. When the processed data volume reaches a certain level (on the order of one billion records or 100 GB) and the complexity of the system keeps increasing, traditional large-volume data quality detection usually executes MapReduce jobs through Hive SQL. MapReduce jobs are slow to start and run, so data quality problems are discovered and handled inefficiently; in addition, the limited flexibility of SQL queries makes complex data quality rules difficult to express and implement effectively.
The traditional Griffin framework cannot cover the complex data quality detection requirements of every industry or specific business scenario. When facing special or highly customized data quality detection tasks, users must spend extra time and effort designing and writing the corresponding Scala code to customize quality rules themselves, which raises the technical threshold of quality detection. The current quality detection approach is therefore largely decoupled from the business scenario and requires users to customize quality rules, which reduces detection efficiency.
Disclosure of Invention
One or more embodiments of the present disclosure provide a method, an apparatus, and a medium for detecting data quality based on a business scenario, which are used to solve the technical problem that the current quality detection approach is largely decoupled from the business scenario and requires users to customize quality rules, thereby reducing detection efficiency.
One or more embodiments of the present disclosure adopt the following technical solutions:
One or more embodiments of the present disclosure provide a data quality detection method based on a business scenario. The method comprises: determining current business scenario information corresponding to a current data quality detection requirement, and determining a data quality detection strategy based on the business scenario information, where the data quality detection strategy includes data source connection request information of at least one data source corresponding to the data to be detected and at least one data quality rule; performing quality detection task configuration through the data quality detection strategy to determine a corresponding data quality detection task set, where the data quality detection task set includes a plurality of data quality detection subtasks; submitting the data quality detection task set to a Griffin task scheduling system, and distributing the plurality of data quality detection subtasks through the Griffin task scheduling system to determine the working node corresponding to each data quality detection subtask; and executing the data quality detection subtasks through the working nodes, generating the quality detection result corresponding to each working node, and integrating the quality detection results according to the current business scenario information to determine a data quality report.
Further, determining a data quality detection strategy based on the business scenario information specifically comprises: obtaining the business data usage information, business data source information, and at least one preset key data item of each type of business data in the business scenario information; determining, through the business data source information, the data source type of the data source corresponding to each type of business data, so as to determine the data source connection request information of each data source based on the data source type; determining, according to the business data usage information corresponding to each type of business data, at least one data quality detection rule for that business data, where the data quality detection rules include any one or more of an integrity measure, a consistency measure, an accuracy measure, and a validity measure; setting quality detection priorities through the at least one key data item to determine detection priority information; and determining the data quality detection strategy based on the data source connection request information, the data quality detection rules, and the detection priority information.
Further, performing quality detection task configuration through the data quality detection strategy to determine a corresponding data quality detection task set specifically comprises: configuring task parameters according to the data quality detection strategy to create the corresponding data quality detection task set; and parsing the data quality detection task set through the Griffin task scheduling system to decompose it into a plurality of data quality detection subtasks, where the plurality of data quality detection subtasks are independently executable detection tasks and each data quality detection subtask includes data source connection request information, a data quality rule ID, and subtask priority information.
Further, distributing the plurality of data quality detection subtasks through the Griffin task scheduling system to determine the working node corresponding to each data quality detection subtask specifically comprises: obtaining current resource usage information of the cluster, where the current resource usage information includes current computing resource data, current storage resource data, and current network bandwidth data; performing task feature analysis on the data quality detection subtasks to determine the resource requirement information corresponding to each subtask and the task dependency relationships among the subtasks; generating a resource allocation sequence for the subtasks according to the task dependency relationships and the subtask priority information; and sequentially distributing the subtasks to matching working nodes according to the resource allocation sequence, the resource requirement information of each subtask, and the current resource usage information.
Further, executing the data quality detection subtasks through the working nodes and generating the quality detection result corresponding to each working node specifically comprises: receiving, at each working node, the data source connection request information, data quality rule ID, and subtask priority information of its assigned subtask; connecting to the data source according to the data source connection request information through a preset data reading middle layer to obtain the data to be detected, where the data sources include relational data sources and big-data storage data sources; and matching, at each working node, the corresponding quality detection algorithm according to the data quality rule ID to perform quality detection on the data to be detected and generate the quality detection result corresponding to that working node.
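Rule-ID-to-algorithm matching at a working node can be sketched as a simple dispatch table. This is an illustrative Python stand-in, not the patented Scala/Spark implementation; the rule IDs and registry structure are assumptions for demonstration.

```python
# Hypothetical rule registry mapping a data quality rule ID to a detection
# algorithm. A worker looks up the rule for its subtask, runs it on the data
# fetched through the data reading middle layer, and returns a result record.
RULE_REGISTRY = {
    # Integrity: the configured field must be present and non-null in every row.
    "R_NONNULL": lambda rows, p: all(r.get(p["field"]) is not None for r in rows),
    # Accuracy: the configured numeric field must fall inside [lo, hi].
    "R_RANGE": lambda rows, p: all(p["lo"] <= r[p["field"]] <= p["hi"] for r in rows),
}

def execute_subtask(rows, rule_id, params):
    """Match a quality detection algorithm by rule ID and run it on the data."""
    check = RULE_REGISTRY.get(rule_id)
    if check is None:
        raise KeyError(f"no detection algorithm registered for rule {rule_id}")
    return {"rule_id": rule_id, "passed": check(rows, params)}
```

In practice each worker would receive `rows` from its assigned data source rather than in memory, but the lookup-then-execute shape is the same.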
Further, after the data quality detection subtasks are executed through the working nodes and the quality detection result corresponding to each working node is generated, the method further comprises: monitoring the subtask execution status of each working node through a preset real-time monitoring interface, and collecting real-time task execution data of each working node, where the real-time task execution data includes real-time node resource usage information and node task execution information; determining the real-time node task execution state corresponding to each working node according to its real-time node resource usage information and node task execution information; and adjusting task distribution based on the real-time node task execution state corresponding to each working node.
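Monitoring-driven task distribution adjustment can be illustrated with a minimal rebalancing policy: queued subtasks are moved off nodes whose resource usage crosses a threshold. This is a simplified sketch under assumed data shapes, not Griffin's actual monitoring logic.

```python
def rebalance(node_states, threshold=0.9):
    """Move queued subtasks off overloaded nodes onto the least-loaded node.

    node_states maps node name -> {"usage": float in [0,1], "queued": [task ids]}.
    Returns a list of (task, from_node, to_node) moves. For simplicity the
    receiving node's usage is not updated per move; a real scheduler would
    account for the added load.
    """
    idle = [n for n, s in node_states.items() if s["usage"] < threshold]
    moves = []
    for name, state in node_states.items():
        if state["usage"] >= threshold and idle:
            for task in state["queued"]:
                target = min(idle, key=lambda n: node_states[n]["usage"])
                moves.append((task, name, target))
            state["queued"] = []  # tasks handed off to the target node
    return moves
```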
Further, integrating the quality detection results according to the current business scenario information to determine a data quality report specifically comprises: collecting the quality detection results of all working nodes and summarizing data quality based on the plurality of quality detection results to determine summarized result data; determining the anomaly type of each abnormal data item in the summarized result data; determining the corresponding quality result display type through the current business scenario information and the anomaly types; and rendering the summarized result data according to the quality result display type to construct a data quality problem view and thereby determine the data quality report.
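The summarization step can be sketched as grouping failed checks by anomaly type under a scenario tag, from which a display view would be rendered. Field names here are illustrative assumptions, not the patent's data model.

```python
from collections import defaultdict

def build_quality_report(results, scenario):
    """Aggregate per-node detection results into a scenario-tagged report.

    results: list of {"passed": bool, "item": str, "anomaly_type": str (optional)}.
    Failed checks are grouped by anomaly type so each type can be rendered
    with the display style chosen for the business scenario.
    """
    by_type = defaultdict(list)
    for r in results:
        if not r["passed"]:
            by_type[r.get("anomaly_type", "unknown")].append(r["item"])
    return {
        "scenario": scenario,
        "total": len(results),
        "failed": sum(1 for r in results if not r["passed"]),
        "anomalies": dict(by_type),
    }
```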
Further, determining the current business scenario information corresponding to the current data quality detection requirement specifically comprises: sending, triggered by the user's quality detection operation, a business scenario form corresponding to the quality detection requirement to the user side, where the business scenario form includes a business data type item, a business data usage item, a business data source item, and a user-defined key data item; and determining, according to the business scenario form, the current business scenario information corresponding to the current data quality detection requirement, where the current business scenario information includes the business data usage information, business data source information, and at least one preset key data item of each type of business data.
One or more embodiments of the present specification provide a data quality detection apparatus based on a business scenario, including:
At least one processor, and
A memory communicatively coupled to the at least one processor, wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.
One or more embodiments of the present specification provide a non-volatile computer storage medium storing computer-executable instructions configured to perform the above-described method.
The technical solutions adopted by the embodiments of the present specification have the following advantages. By defining the current business scenario information, the data range and data quality rules to be detected can be determined more accurately, avoiding the resource waste caused by indiscriminate full detection. Data quality rules customized for a specific business scenario can effectively identify and resolve the common data quality problems in that scenario, improving the accuracy and reliability of the data. Detection tasks are distributed and executed efficiently through the Griffin task scheduling system, so quality detection results can be generated quickly, and detection resources can be allocated more reasonably based on the scenario-customized data quality detection strategy, avoiding unnecessary waste. For data quality tasks that require real-time detection, Griffin can quickly respond and distribute tasks to the corresponding working nodes, achieving near-real-time data quality monitoring. Data quality rules defined in Griffin can accurately measure the accuracy, completeness, timeliness, uniqueness, validity, consistency, and other dimensions of the data, ensuring comprehensive monitoring and improving data quality monitoring efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present description or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some of the embodiments described in the present description, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. In the drawings:
Fig. 1 is a schematic flow chart of a data quality detection method based on a service scenario according to an embodiment of the present disclosure;
Fig. 2 is a schematic structural diagram of a data quality detection device based on a service scenario according to an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions in the present specification better understood by those skilled in the art, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, shall fall within the scope of the present disclosure.
The embodiment of the present disclosure provides a method for detecting data quality based on a service scenario, and it should be noted that an execution body in the embodiment of the present disclosure may be a server, or any device having a data processing capability. Fig. 1 is a flow chart of a data quality detection method based on a service scenario according to an embodiment of the present disclosure, as shown in fig. 1, mainly including the following steps:
step S101, determining current service scenario information corresponding to the current data quality detection requirement, so as to determine a data quality detection policy based on the service scenario information.
The data quality detection strategy comprises data source connection request information of at least one data source corresponding to the data to be detected and at least one data quality rule.
Determining the current business scenario information corresponding to the current data quality detection requirement specifically comprises: sending, triggered by the user's quality detection operation, a business scenario form corresponding to the quality detection requirement to the user side, where the business scenario form includes a business data type item, a business data usage item, a business data source item, and a user-defined key data item; and determining, according to the business scenario form, the current business scenario information corresponding to the current data quality detection requirement, where the current business scenario information includes the business data usage information, business data source information, and at least one preset key data item of each type of business data.
In one embodiment of the present description, when a user requests quality detection, the system first sends a service scenario form to the user, the form content typically including, but not limited to, a service data type item, a service data usage item, a service data source item, and a custom key data item. The service data type item refers to a data type, such as structured data (e.g., database table), unstructured data (e.g., text file, image), etc., that a user selects or inputs to be quality-checked in the current service scenario. The business data usage items are used to identify usage of the business data, such as financial transaction monitoring, user behavior analysis usage, and the like. The service data source item facilitates user input of a data source of the service data to be detected, such as an internal system, external provider, or otherwise acquired. The custom key data items are used to specify which data items are important items for the business scenario to the user. The user fills in the form according to the actual business condition and submits the form to the system. Prior to submission, the system may provide some verification mechanism to ensure that the information entered by the user is accurate and complete. After receiving a form submitted by a user, the system analyzes the content of the form to determine current service scene information corresponding to the current data quality detection requirement, wherein the current service scene information comprises service data use information, service data source information and at least one preset key data item of various service data.
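The form-to-scenario-information step above can be sketched as a small validation and parsing routine. The field names (`data_type`, `data_usage`, etc.) are illustrative assumptions mirroring the form items, not an actual schema from the specification.

```python
from dataclasses import dataclass, field

@dataclass
class ScenarioForm:
    """A submitted business scenario form; all field names are illustrative."""
    data_type: str    # e.g. "structured" (database table) or "unstructured"
    data_usage: str   # e.g. "financial transaction monitoring"
    data_source: str  # e.g. "internal system" or "external provider"
    key_items: list = field(default_factory=list)  # user-defined key data items

def parse_scenario_form(form: ScenarioForm) -> dict:
    """Verify a form is complete, then derive current business scenario info."""
    if not form.data_type or not form.data_source:
        raise ValueError("form is incomplete: data type and data source are required")
    return {
        "usage_info": form.data_usage,
        "source_info": form.data_source,
        "key_items": list(form.key_items),
    }
```

The `ValueError` path corresponds to the verification mechanism the system applies before submission.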
Determining a data quality detection strategy based on the business scenario information specifically comprises: obtaining the business data usage information, business data source information, and at least one preset key data item of each type of business data in the business scenario information; determining, through the business data source information, the data source type of the data source corresponding to each type of business data, so as to determine the data source connection request information of each data source based on the data source type; determining, according to the business data usage information corresponding to each type of business data, at least one data quality detection rule for that business data, where the data quality detection rules include any one or more of an integrity measure, a consistency measure, an accuracy measure, and a validity measure; setting quality detection priorities through the at least one key data item to determine detection priority information; and determining the data quality detection strategy based on the data source connection request information, the data quality detection rules, and the detection priority information.
In one embodiment of the present disclosure, the data source type corresponding to each type of business data (e.g., Hive table, relational database, NoSQL database, file system) is identified based on the business data source information. For each data source type, the corresponding data source connection request information is acquired, including authentication information, connection addresses, port numbers, and so on, so that the data source can be connected successfully during data quality detection. Various types of data sources may be selected, including but not limited to Hive tables, relational databases, file systems, streaming message queues such as Kafka, and other big-data storage formats that Spark can read. Specifically, a Hive table requires the database name, table name, and possibly partition information; a relational database requires the connection URL, user name, password, and table name; and Kafka requires the topic name, consumer group, cluster addresses, and so on.
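The per-type connection requirements above can be sketched as a lookup that validates the fields supplied for each data source type. The field names follow the description but are assumptions for illustration, not Griffin's actual connector configuration schema.

```python
def connection_request_info(source_type: str, **params) -> dict:
    """Build connection request info for a data source, validated by type.

    Required fields per type follow the specification's examples: a Hive
    table needs a database and table name (plus optional partition info), a
    relational database needs URL/user/password/table, and Kafka needs a
    topic, consumer group, and broker addresses.
    """
    required = {
        "hive": ["database", "table"],
        "jdbc": ["url", "user", "password", "table"],
        "kafka": ["topic", "group_id", "brokers"],
    }
    if source_type not in required:
        raise ValueError(f"unsupported data source type: {source_type}")
    missing = [k for k in required[source_type] if k not in params]
    if missing:
        raise ValueError(f"missing connection fields for {source_type}: {missing}")
    return {"type": source_type, **params}
```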
The characteristics and quality requirements of each type of business data are analyzed according to the business data usage information, and the applicable data quality detection rules are determined for each type of business data. The selectable data quality rules mainly include the following. An integrity measure checks whether records have missing or null values; integrity verification can be performed by requiring specific fields to be non-null, and is typically applied to the primary keys and non-null business fields of a table. A consistency measure compares data between different data sources, for example by matching the key fields of two tables through a JOIN operation; it is generally applied to the foreign keys that associate two tables. An accuracy measure verifies whether the data accurately reflects real conditions by setting business logic rules, such as age range limits and unique identifier checks, and is generally applied to numeric fields. A timeliness measure can be configured with requirements such as the update frequency of time series data to ensure the data is valid in time, and is generally applied to date or time fields. For example, in a health care system, the integrity measure ensures that fields such as name, age, and disease diagnosis in a patient record are complete and non-empty; the consistency measure verifies that patient records are consistent across systems, for example that basic patient information matches between the electronic medical record system and the drug management system, by comparing the patient ID and diagnostic information fields through a JOIN operation; the accuracy measure ensures data accuracy through business rules, for example that the age field falls within a reasonable range and the disease diagnosis is consistent with laboratory results; and for patient health monitoring data, the timeliness measure requires that data such as real-time heart rate monitoring be updated within a specified time range so that medical measures can be taken in time.
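The integrity, consistency, and accuracy measures described above can be sketched as record-level checks that return a pass ratio. This is an illustrative Python sketch over in-memory records; the patent's implementation would run these as Spark jobs over the actual data sources.

```python
def integrity_check(records, field):
    """Integrity measure: fraction of records where `field` is present, non-null."""
    ok = sum(1 for r in records if r.get(field) is not None)
    return ok / len(records) if records else 1.0

def accuracy_check(records, field, lo, hi):
    """Accuracy measure via a business rule: the value must lie in [lo, hi]."""
    ok = sum(1 for r in records
             if r.get(field) is not None and lo <= r[field] <= hi)
    return ok / len(records) if records else 1.0

def consistency_check(left, right, key):
    """Consistency measure: fraction of left records whose key joins to right,
    mimicking a JOIN on a shared key field between two tables/systems."""
    keys = {r.get(key) for r in right}
    ok = sum(1 for r in left if r.get(key) in keys)
    return ok / len(left) if left else 1.0
```

For the health-care example, `integrity_check(patients, "name")` flags missing names, and `consistency_check(emr, pharmacy, "patient_id")` measures how many EMR records match the drug management system.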
The preset key data items identify which data items have the greatest influence on business decisions, and higher quality detection priorities are set for these key data items; other, non-critical data items are assigned priorities according to business requirements. The data source connection request information, data quality detection rules, and detection priority information are then integrated to formulate a complete data quality detection strategy, ensuring that the strategy covers all important business data and that quality detection proceeds in priority order.
Through this technical solution, data quality can be checked systematically by formulating detailed data quality detection rules covering integrity, consistency, accuracy, and validity measures, ensuring the data meets business requirements and quality standards and strengthening the link between data quality detection rules and business scenarios. By presetting key data items and setting quality detection priorities, the data items with the greatest influence on business decisions are processed first, which facilitates resource allocation, ensures timely and effective quality detection of key data, and reduces unnecessary investment in non-critical data.
Step S102, quality detection task configuration is carried out through a data quality detection strategy to determine a corresponding data quality detection task set.
Wherein the set of data quality detection tasks includes a plurality of data quality detection subtasks.
Performing quality detection task configuration through the data quality detection strategy to determine a corresponding data quality detection task set specifically comprises: configuring task parameters according to the data quality detection strategy to create the corresponding data quality detection task set; and parsing the data quality detection task set through the Griffin task scheduling system to decompose it into a plurality of data quality detection subtasks, where the plurality of data quality detection subtasks are independently executable detection tasks and each data quality detection subtask includes data source connection request information, a data quality rule ID, and subtask priority information.
In one embodiment of the present disclosure, task configuration is performed in the data center. Creating a data quality detection task set requires configuring a task set name, a data source list, a data quality rule list, and priority settings: the task set name identifies the whole task set; the data source list contains the connection information of all data sources to be detected, such as the database URL, user name, and password; the data quality rule list specifies the list of data quality rule IDs to apply to each data source; and a priority may be set for the whole task set or for an individual data source or rule.
Using the API or interface provided by Griffin, a set of data quality detection tasks is created from the parameters described above. At the time of creation, the system should verify the validity of the data source connection information and check if a rule ID exists in the system. A parser should be implemented in the Griffin task scheduling system to parse the task set into a plurality of independently executable data quality detection subtasks. Each subtask comprises data source connection request information, data quality rule ID and subtask priority information, the data source connection request information is extracted from a task set and used for being connected to a corresponding data source, the data quality rule ID is used for designating the data quality rule to be applied by the subtask, and the priority can be set according to the priority of the task set or independently set for the subtask.
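The parser's decomposition of a task set into independently executable subtasks can be sketched as one subtask per (data source, rule ID) pair, each carrying connection info, the rule ID, and a priority. The dictionary schema is an assumption for illustration, not Griffin's actual JSON task definition.

```python
def decompose_task_set(task_set):
    """Split a data quality detection task set into independent subtasks.

    task_set: {"priority": int (optional), "sources": [
        {"connection": {...}, "rule_ids": [...], "priority": int (optional)}]}
    A source-level priority overrides the task-set-level one, matching the
    description that priority can be set for the set or per source/rule.
    """
    subtasks = []
    for source in task_set["sources"]:
        for rule_id in source["rule_ids"]:
            subtasks.append({
                "connection": source["connection"],
                "rule_id": rule_id,
                "priority": source.get("priority", task_set.get("priority", 0)),
            })
    return subtasks
```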
Step S103, submitting the data quality detection task set to a Griffin task scheduling system, and performing task distribution on a plurality of data quality detection subtasks through the Griffin task scheduling system to determine a working node corresponding to each data quality detection subtask.
In one embodiment of the present description, tasks configured in the data center are submitted to the Griffin task scheduling system, which encapsulates them into a format that the Griffin task scheduler can recognize and process. When a task is submitted, resource requirements can be set, i.e., the computing resources (such as CPU and memory) the task needs are specified so that it obtains sufficient resources under high load. An alarm mechanism can also be configured for task failure or timeout, ensuring timely response and problem handling. In addition, the execution frequency (e.g., a cron expression for a scheduled task), dependency relationships and the like can be set. Through these steps, submission of the data quality detection task is accomplished.
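To make the submission options above concrete, a hypothetical payload might look as follows. Every field name and value here is an illustrative assumption, not the real Griffin submission schema.

```python
# Hypothetical submission payload for a data quality task set; all
# field names are illustrative assumptions, not Griffin's schema.
submission = {
    "task_set": "order_quality_checks",
    # computing resources the task needs, so it gets enough under high load
    "resources": {"cpu_cores": 4, "memory_gb": 8},
    # alarm mechanism for failed or timed-out executions
    "alerting": {"on_failure": "notify:dq-team", "timeout_minutes": 30},
    # execution frequency as a cron expression: daily at 02:00
    "schedule": "0 2 * * *",
    # upstream dependency: run only after the daily ETL load finishes
    "depends_on": ["etl_daily_load"],
}
print(submission["schedule"])
```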
The Griffin task scheduler first parses the submitted task to identify its basic attributes (e.g., task ID, priority, execution mode) and the specific quality detection rules. The scheduler then decides how to distribute the tasks to different nodes in the Spark cluster based on an analysis of current Spark cluster resource usage, including but not limited to computing resources (CPU, memory), storage resources and network bandwidth. For example, during a shopping festival promotion in an e-commerce system, the processing capacity can be improved by the automatic scaling mechanism of the cloud platform (automatically adding Spark cluster nodes), and task priority metadata can be set when submitting data quality tasks so that the system allocates resources to high-priority tasks first. The scheduler packages the tasks into task descriptions suitable for Spark execution and submits them to the Spark resource management module, and the Spark resource manager then schedules the tasks to each working node (Worker Node) for execution.
The step of performing task distribution on the plurality of data quality detection subtasks through the Griffin task scheduling system to determine the working node corresponding to each data quality detection subtask specifically comprises: dynamically capturing, through the Griffin task scheduling system, the current resource usage information corresponding to each working node in the Spark big data computing engine, wherein the current resource usage information comprises current computing resource data, current storage resource data and current network bandwidth data; performing task feature analysis on each data quality detection subtask to determine the resource requirement information corresponding to each data quality detection subtask and the task dependency relationships among the plurality of data quality detection subtasks; generating a resource allocation order for the plurality of data quality detection subtasks according to the task dependency relationships among the subtasks and the subtask priority information of each data quality detection subtask; and, following the resource allocation order, allocating a node to each data quality detection subtask in turn according to its resource requirement information and the current resource usage of each working node, thereby determining the working node corresponding to each data quality detection subtask.
In one embodiment of the present disclosure, it is first ensured that the Griffin system can be integrated with the monitoring system of the Spark cluster, collecting and displaying the resource usage of each working node in the Spark cluster in real time. The Griffin system may obtain current computing resources (e.g., CPU usage, memory usage), storage resources (e.g., disk I/O, HDFS usage) and network bandwidth data by periodically calling the API interface of the monitoring tool. Each data quality detection subtask is analyzed to determine its computational complexity, memory requirements, disk I/O requirements and network bandwidth requirements, which can be estimated from historical data, task configuration or user input. The analysis results are converted into specific resource requirement information, including the minimum number of CPU cores, the minimum/maximum amount of memory, the disk I/O rate and the network bandwidth requirement. By analyzing the data flow and processing order between the data quality detection subtasks, the dependencies between them are determined; a directed graph may be used to represent these dependencies. The priority information of each subtask is combined with the task dependency relationships to generate a resource allocation order that considers both the priorities of the tasks and the dependencies between them. Following the resource allocation order, the Griffin system traverses each subtask to be allocated and, for each subtask, selects the most suitable node through a matching algorithm based on the subtask's resource requirement information and the current resource usage of the working nodes. The matching algorithm may be a best-fit or first-fit algorithm. Each subtask is then scheduled for execution on its assigned working node by submitting it through YARN.
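The ordering and matching logic described above can be sketched as follows. The dependency graph, priority tie-breaking and best-fit choice follow the text; all data structures, and the use of free memory as the single matching dimension, are simplifying assumptions.

```python
# Sketch of the allocation described above: a topological order over the
# dependency graph that breaks ties by priority, then best-fit node
# matching on free memory. Structures are illustrative assumptions.
import heapq

def allocation_order(deps, priority):
    """deps maps task -> set of prerequisite tasks; among tasks whose
    prerequisites are satisfied, higher priority runs earlier."""
    indeg = {t: len(d) for t, d in deps.items()}
    ready = [(-priority[t], t) for t in deps if not deps[t]]
    heapq.heapify(ready)
    order = []
    while ready:
        _, t = heapq.heappop(ready)
        order.append(t)
        for u in deps:
            if t in deps[u]:
                indeg[u] -= 1
                if indeg[u] == 0:
                    heapq.heappush(ready, (-priority[u], u))
    return order

def best_fit(mem_need, free_mem):
    """Pick the node whose free memory fits the need most tightly."""
    fitting = [(free, n) for n, free in free_mem.items() if free >= mem_need]
    return min(fitting)[1] if fitting else None

deps = {"a": set(), "b": {"a"}, "c": set()}
prio = {"a": 5, "b": 9, "c": 8}
print(allocation_order(deps, prio))      # ['c', 'a', 'b']
print(best_fit(6, {"w1": 16, "w2": 8}))  # 'w2' (tightest sufficient fit)
```

Note that task "b" has the highest priority but still runs after "a", since the dependency constraint dominates the priority ordering, as the text requires.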
The Griffin task scheduling system can process multiple data quality detection subtasks in parallel, making full use of computing resources and significantly shortening the overall detection time. It can intelligently allocate tasks according to the load of the working nodes, ensuring balanced resource utilization across nodes and avoiding the situation where some nodes are overloaded while others sit idle. For data quality tasks that require real-time detection, it can quickly respond and distribute tasks to the corresponding working nodes, achieving near-real-time data quality monitoring. Through the data quality rules defined in Griffin, multiple dimensions of the data, such as accuracy, completeness, timeliness, uniqueness, validity and consistency, can be measured precisely, ensuring comprehensive monitoring of data quality. In addition, all data quality detection tasks can be centrally managed in the Griffin task scheduling system to facilitate unified monitoring and scheduling.
Step S104, executing a data quality detection subtask through each working node, generating a quality detection result corresponding to each working node, integrating the quality detection results according to the current service scene information, and determining a data quality report.
The step of executing the data quality detection subtasks through each working node to generate a quality detection result corresponding to each working node specifically comprises: determining the data source connection request information, data quality rule ID and subtask priority information corresponding to each data quality detection subtask; interfacing with the data sources according to the data source connection request information through a preset data reading middle layer to obtain the data to be detected, wherein the data sources comprise relational data sources and big data storage data sources; and matching, at each working node, a corresponding quality detection algorithm according to the data quality rule ID to perform quality detection on the data to be detected and generate the quality detection result corresponding to each working node.
In one embodiment of the present description, in the task scheduling system, each data quality detection subtask first needs to specify its corresponding data source connection request information. Such information typically includes the type of data source (e.g., MySQL, Oracle, Hadoop HDFS, Hive), the connection address (IP address or domain name), the port number, the database name, authentication information, etc. Connections to relational data sources and big data storage systems are adapted through the data reading middle layer. The middle layer service is responsible for identifying the data source type from the request parameters, selecting a suitable adapter for the read or write operation, and providing an API or service endpoint that allows callers to access different types of data sources in a uniform manner. Each subtask also needs to specify its corresponding data quality rule ID, which references a predefined quality detection rule (e.g., data integrity check, consistency verification, range check) during the quality detection process. At the same time, the priority information of the subtasks must be explicit so that more critical or urgent tasks can be prioritized when resources are limited. The data reading middle layer selects a suitable adapter to interface with each data source according to the data source connection request information, hiding the complexity of the underlying data sources and providing a unified, standardized data access interface (API or service endpoint) to the upper-layer application, so that the Griffin task scheduling system can read and write various types of data without concerning itself with the specific data source type or connection mode.
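A minimal sketch of such an adapter-based middle layer follows, assuming a simple registry keyed by data source type. The adapter functions and their return values are placeholders standing in for real JDBC/HDFS readers.

```python
# Minimal sketch of a data reading middle layer: the source type in the
# connection request selects an adapter; callers use one uniform entry
# point. Adapter names and return values are illustrative placeholders.
def read_mysql(conn):
    return f"rows from MySQL {conn['database']} at {conn['host']}"

def read_hive(conn):
    return f"rows from Hive table {conn['table']}"

# registry: data source type -> adapter
ADAPTERS = {"mysql": read_mysql, "hive": read_hive}

def read_source(conn_request):
    """Uniform access: hide source-specific details from the caller."""
    adapter = ADAPTERS.get(conn_request["type"])
    if adapter is None:
        raise ValueError(f"unsupported source type: {conn_request['type']}")
    return adapter(conn_request)

print(read_source({"type": "mysql", "host": "db:3306",
                   "database": "orders"}))
```

Supporting a new data source then only requires registering a new adapter, which is the extensibility property the middle layer is meant to provide.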
The data reading middle layer acquires the data to be detected and matches the corresponding quality detection algorithm according to the data quality rule ID specified on each working node, where the quality detection algorithm comprises various built-in or custom check logics used to evaluate the accuracy, integrity, consistency, etc. of the data. After the quality detection algorithm has executed, a quality detection result corresponding to each working node is generated and used for data analysis, report generation or triggering corresponding processing flows (such as data cleaning and correction).
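The rule-ID-to-algorithm matching can be sketched as a simple dispatch table. The rule IDs, the completeness check and the result shape are illustrative assumptions.

```python
# Sketch of matching a quality detection algorithm by rule ID on a
# working node; rule IDs and check logic are illustrative assumptions.
def check_completeness(rows, column):
    """Count rows where the given column is missing or empty."""
    missing = sum(1 for r in rows if r.get(column) in (None, ""))
    return {"checked": len(rows), "failed": missing, "passed": missing == 0}

# dispatch table: data quality rule ID -> check function
RULES = {
    "completeness_order_id": lambda rows: check_completeness(rows, "order_id"),
}

def run_subtask(rule_id, rows):
    """Execute the check matched by rule ID; the result feeds reporting
    or downstream processing such as data cleaning."""
    result = RULES[rule_id](rows)
    result["rule_id"] = rule_id
    return result

rows = [{"order_id": "A1"}, {"order_id": None}]
print(run_subtask("completeness_order_id", rows))
```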
Through the above technical scheme, efficient interfacing with and flexible management of different types of data sources can be achieved, while the accuracy and reliability of the data quality detection process are guaranteed. The data reading middle layer greatly simplifies data access and improves the scalability and maintainability of the whole system, and the task distribution and execution mechanism based on working nodes and data quality rule IDs ensures efficient execution of quality detection tasks and reasonable use of resources.
After the data quality detection subtasks are executed through each working node and the quality detection result corresponding to each working node is generated, the method further comprises: monitoring the subtask execution of each working node through a preset real-time monitoring interface and collecting real-time task execution data of each working node, wherein the real-time task execution data comprises real-time node resource usage information and node task execution information; determining the real-time node task execution state corresponding to each working node according to the real-time node resource usage information and node task execution information of each working node; and performing task distribution adjustment based on the real-time node task execution state corresponding to each working node.
In one embodiment of the present disclosure, through the real-time monitoring interface provided by Spark, the system may dynamically capture and display the usage of each key resource while a data quality task is running, including but not limited to server-level CPU utilization, memory consumption and the resource allocation status on the YARN cluster, as well as disk I/O performance indicators and network traffic statistics, so that administrators can comprehensively understand and track resource bottlenecks and efficiency problems during task execution. On this basis, the API interface of Griffin is further extended to allow fine-grained resource monitoring at the user level and the project level by specifying task IDs and project IDs.
Real-time node resource usage information and node task execution information are acquired for each working node. The node resource usage information covers key resources such as CPU (Central Processing Unit) utilization, memory occupation, disk I/O (input/output) and network bandwidth. The node task execution information refers to the list of tasks currently running on the node, the execution progress of each task, the amount of resources each task requires, the priorities of the tasks, and so on.
According to the collected resource usage information, the resource load of each node is evaluated to judge whether the node is in a high-load, low-load or moderate-load state. The state may be determined by the ratio of current load to total capacity, e.g., 0-30% is low load, 30%-60% is medium load, and above 60% is high load. From the task execution information, the execution efficiency of each task is analyzed, along with whether resource bottlenecks or dependency problems have been encountered. Combining resource usage and task execution, a real-time task execution state is determined for each working node; the states may include "busy" (high load and executing tasks), "idle" (low load and a short task queue), "overloaded" (resources approaching or exceeding a threshold), etc. For overloaded nodes, part of the tasks may be migrated to idle or moderately loaded nodes to balance the load and avoid performance bottlenecks. According to the urgency and resource requirements of tasks, the execution priority of tasks on a node is adjusted: high-priority tasks are executed first, and low-priority tasks are executed when resources are sufficient. By dynamically determining the task execution state of each node from its real-time resource usage information and task execution information, and making effective task distribution adjustments, the overall performance and resource utilization of the system are improved.
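The load thresholds above can be expressed directly. The following sketch classifies node load and suggests migrations from overloaded to idle nodes; the thresholds follow the text, while the one-task-per-overloaded-node migration policy is a simplifying assumption.

```python
# Sketch of the load classification described above (0-30% low,
# 30-60% medium, >60% high) and a simple migration suggestion.
# Thresholds follow the text; the migration policy is an assumption.
def load_state(load_pct):
    if load_pct <= 30:
        return "low"
    if load_pct <= 60:
        return "medium"
    return "high"

def rebalance(nodes):
    """Suggest moving one task from each high-load node to the first
    low-load node; nodes maps node name -> load percentage."""
    idle = [n for n, p in nodes.items() if load_state(p) == "low"]
    moves = []
    for n, p in nodes.items():
        if load_state(p) == "high" and idle:
            moves.append((n, idle[0]))
    return moves

nodes = {"w1": 85, "w2": 20, "w3": 45}
print(load_state(85), rebalance(nodes))  # high [('w1', 'w2')]
```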
The step of integrating the quality detection results according to the current business scenario information to determine a data quality report specifically comprises: collecting the quality detection results of each working node and summarizing data quality based on the plurality of quality detection results to determine summary result data; determining the anomaly type of each abnormal data item in the summary result data; determining a corresponding quality result display type through the current business scenario information and the anomaly type; and rendering the summary result data according to the quality result display type to construct a data quality problem view and determine the data quality report.
In one embodiment of the present disclosure, after the tasks on each node complete their respective quality detection, the API interface provided by Griffin is called to periodically obtain the results of the completed data quality detection tasks, where Griffin integrates the results of all sub-tasks to form a complete data quality report. And collecting quality detection results of all the working nodes, and summarizing the quality detection results of all the working nodes. If multiple nodes detect the same data item, duplicate results need to be removed, the data is classified according to different detection results (such as pass, fail, warning, etc.), and the number of various detection results is counted, such as total detection item number, pass item number, fail item number, etc. In the summary result, data items that do not pass the quality detection, i.e., abnormal data items, are determined. For each exception data item, its exception type needs to be determined, which is typically determined based on the rule ID or exception description of the violation at the time of detection. Exception types may include, but are not limited to, data loss, data type errors, data format errors, data inconsistencies, out of range data, and the like.
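The summarization step described above (deduplicating repeated detections of the same data item, classifying outcomes, counting totals) might look like the following sketch; the record shape `{'item', 'status'}` is an assumption for illustration.

```python
# Sketch of the result summarization: deduplicate per-node results for
# the same data item, classify by outcome, count totals. The record
# shape is an illustrative assumption.
from collections import Counter

def summarize(results):
    """results: list of {'item': ..., 'status': 'pass'|'fail'|'warn'}."""
    seen = {}
    for r in results:            # later duplicates overwrite earlier ones
        seen[r["item"]] = r["status"]
    counts = Counter(seen.values())
    return {"total": len(seen),
            "pass": counts.get("pass", 0),
            "fail": counts.get("fail", 0),
            "warn": counts.get("warn", 0)}

results = [
    {"item": "orders.order_id", "status": "pass"},
    {"item": "orders.order_id", "status": "pass"},   # duplicate from node 2
    {"item": "orders.amount", "status": "fail"},
]
print(summarize(results))  # {'total': 2, 'pass': 1, 'fail': 1, 'warn': 0}
```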
After determining the abnormal data items and their anomaly types, the system determines a suitable quality result display type according to the current business scenario information and the anomaly type. Different business scenarios and anomaly types may call for different display modes; the quality result display types include a problem network graph, a time trend graph, a quality scoring radar graph, and the like. For example, a time trend graph may be used to monitor changes in data quality, such as tracking long-term changes in data accuracy, integrity or consistency, identifying trending problems, or evaluating improvement effects, i.e., observing the effects after data quality improvement measures are implemented and analyzing whether the improvement is sustained. A radar graph may be used to comprehensively evaluate the data quality of different business modules and find areas needing improvement. A network graph is used to display how data flows and is transformed in the system when tracing the sources of data inconsistency or integrity problems. The structured data obtained from the Griffin API is parsed and converted into a data structure that the G6 graph visualization engine can recognize. The processed result data is rendered with the G6 graph visualization engine to construct an intuitive, easily understood data quality problem view, which is displayed in the data center dashboard as the data quality report. The data center judges whether the detection result exceeds a preset range and, if so, automatically sends a notification to the relevant responsible persons, facilitating a timely response to the data quality problem.
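The scenario-to-display-type selection can be sketched as a small mapping. The goal keys follow the examples in the text, but the function, its keys and the fallback value are illustrative assumptions.

```python
# Illustrative selection of a quality result display type from the
# monitoring goal, following the examples in the text; the keys and
# the fallback are assumptions.
def display_type(goal):
    mapping = {
        "track_quality_over_time": "time_trend_graph",
        "compare_business_modules": "quality_scoring_radar_graph",
        "trace_inconsistency_source": "problem_network_graph",
    }
    return mapping.get(goal, "summary_table")

print(display_type("trace_inconsistency_source"))  # problem_network_graph
```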
For the found problems, the correction process can be tracked until the problems are solved, so that a complete closed-loop management mechanism is formed.
According to the above technical scheme, by defining the current business scenario information, the data range and the data quality rules to be detected can be determined more accurately, avoiding the resource waste caused by indiscriminate blanket detection. With data quality rules customized to the business scenario, common data quality problems in that scenario can be effectively identified and resolved, improving the accuracy and reliability of the data. Through the efficient distribution and execution of the Griffin task scheduling system, quality detection results can be generated quickly, and the scenario-customized data quality detection strategy allows detection resources to be allocated more reasonably, avoiding unnecessary waste. For data quality tasks requiring real-time detection, Griffin can quickly respond and distribute tasks to the corresponding working nodes, and through the data quality rules defined in Griffin, dimensions of the data such as accuracy, completeness, timeliness, uniqueness, validity and consistency can be measured precisely, ensuring comprehensive monitoring of data quality and improving data quality detection efficiency.
The embodiment of the specification also provides a service scene-based data quality detection device, as shown in fig. 2, which comprises at least one processor and a memory in communication connection with the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the method.
The present description also provides a non-transitory computer storage medium storing computer-executable instructions configured to perform the above-described method.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for apparatus, devices, non-volatile computer storage medium embodiments, the description is relatively simple, as it is substantially similar to method embodiments, with reference to the section of the method embodiments being relevant.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The devices and media provided in the embodiments of the present disclosure are in one-to-one correspondence with the methods, so that the devices and media also have similar beneficial technical effects as the corresponding methods, and since the beneficial technical effects of the methods have been described in detail above, the beneficial technical effects of the devices and media are not repeated here.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or nonvolatile memory, such as read only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely one or more embodiments of the present description and is not intended to limit the present description. Various modifications and alterations to one or more embodiments of this description will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, or the like, which is within the spirit and principles of one or more embodiments of the present description, is intended to be included within the scope of the claims of the present description.

Claims (10)

1. A data quality detection method based on business scenarios, characterized in that the method comprises:
determining current business scenario information corresponding to a current data quality detection requirement, so as to determine a data quality detection strategy based on the business scenario information, wherein the data quality detection strategy comprises data source connection request information of at least one data source corresponding to data to be detected and at least one data quality rule;
performing quality detection task configuration through the data quality detection strategy to determine a corresponding data quality detection task set, wherein the data quality detection task set comprises a plurality of data quality detection subtasks;
submitting the data quality detection task set to a Griffin task scheduling system, and performing task distribution on the plurality of data quality detection subtasks through the Griffin task scheduling system to determine a working node corresponding to each data quality detection subtask;
executing the data quality detection subtasks through each working node to generate a quality detection result corresponding to each working node, and integrating the quality detection results according to the current business scenario information to determine a data quality report.

2. The data quality detection method based on business scenarios according to claim 1, characterized in that determining a data quality detection strategy based on the business scenario information specifically comprises:
acquiring business data usage information, business data source information and at least one preset key data item for multiple types of business data in the business scenario information;
determining, through the business data source information, the data source type of the data source corresponding to each type of business data, so as to determine the data source connection request information of each data source based on the data source type;
determining, according to the business data usage information corresponding to each type of business data, at least one data quality detection rule corresponding to each type of business data, wherein the data quality detection rule comprises any one or more of an integrity metric, a consistency metric, an accuracy metric and a validity metric;
setting quality detection priorities through the at least one key data item to determine detection priority information, so as to determine the data quality detection strategy based on the data source connection request information, the data quality detection rules and the detection priority information.

3. The data quality detection method based on business scenarios according to claim 1, characterized in that performing quality detection task configuration through the data quality detection strategy to determine a corresponding data quality detection task set specifically comprises:
configuring task parameters according to the data quality detection strategy to create the corresponding data quality detection task set;
parsing the data quality detection task set through the Griffin task scheduling system to decompose it into a plurality of data quality detection subtasks, wherein the plurality of data quality detection subtasks are independently executable detection tasks, and each data quality detection subtask comprises data source connection request information, a data quality rule ID and subtask priority information.

4. The data quality detection method based on business scenarios according to claim 1, characterized in that performing task distribution on the plurality of data quality detection subtasks through the Griffin task scheduling system to determine the working node corresponding to each data quality detection subtask specifically comprises:
dynamically capturing, through the Griffin task scheduling system, the current resource usage information corresponding to each working node in the Spark big data computing engine, wherein the current resource usage information comprises current computing resource data, current storage resource data and current network bandwidth data;
performing task feature analysis on each data quality detection subtask to determine the resource requirement information corresponding to each data quality detection subtask and the task dependency relationships among the plurality of data quality detection subtasks;
generating a resource allocation order corresponding to the plurality of data quality detection subtasks according to the task dependency relationships among the plurality of data quality detection subtasks and the subtask priority information of each data quality detection subtask;
following the resource allocation order, performing node allocation for each data quality detection subtask in turn according to the resource requirement information corresponding to each data quality detection subtask and the current resource usage of each working node, to determine the working node corresponding to each data quality detection subtask.

5. The data quality detection method based on business scenarios according to claim 1, characterized in that executing the data quality detection subtasks through each working node to generate a quality detection result corresponding to each working node specifically comprises:
determining the data source connection request information, data quality rule ID and subtask priority information corresponding to each data quality detection subtask;
interfacing with the data sources according to the data source connection request information through a preset data reading middle layer to obtain the data to be detected, wherein the data sources comprise relational data sources and big data storage data sources;
matching, at each working node, a corresponding quality detection algorithm according to the data quality rule ID to perform quality detection on the data to be detected and generate the quality detection result corresponding to each working node.

6. The data quality detection method based on business scenarios according to claim 1, characterized in that, after the data quality detection subtasks are executed through each working node and the quality detection result corresponding to each working node is generated, the method further comprises:
monitoring the subtask execution of each working node through a preset real-time monitoring interface and collecting real-time task execution data of each working node, wherein the real-time task execution data comprises real-time node resource usage information and node task execution information;
determining the real-time node task execution state corresponding to each working node according to the real-time node resource usage information and the node task execution information of each working node;
performing task distribution adjustment based on the real-time node task execution state corresponding to each working node.

7. The data quality detection method based on business scenarios according to claim 1, characterized in that integrating the quality detection results according to the current business scenario information to determine a data quality report specifically comprises:
A data quality detection method based on business scenarios according to claim 1, characterized in that the quality detection results are integrated according to the current business scenario information to determine the data quality report, specifically including: 采集每个所述工作节点的所述质量检测结果,以基于多个所述质量检测结果,进行数据质量汇总,确定汇总结果数据,并确定所述汇总结果数据中的异常数据项的异常类型;Collect the quality detection result of each of the working nodes, perform data quality aggregation based on the multiple quality detection results, determine the aggregated result data, and determine the abnormal type of the abnormal data item in the aggregated result data; 通过所述当前业务场景信息和所述异常类型,确定对应的质量结果展示类型;Determine the corresponding quality result display type according to the current business scenario information and the exception type; 按照所述质量结果展示类型,对所述汇总结果数据进行渲染,构建数据质量问题视图,以确定所述数据质量报告。The summary result data is rendered according to the quality result display type, and a data quality problem view is constructed to determine the data quality report. 8.根据权利要求1所述的一种基于业务场景的数据质量检测方法,其特征在于,确定当前数据质量检测需求对应的当前业务场景信息,具体包括:8. 
A data quality detection method based on business scenarios according to claim 1, characterized in that determining the current business scenario information corresponding to the current data quality detection requirement specifically includes: 在用户的质量检测操作的触发下,向用户端发送质量检测需求对应的业务场景表单,其中,所述业务场景表单包括业务数据类型项、业务数据用途项、业务数据来源项和自定义关键数据项;When triggered by the user's quality inspection operation, a business scenario form corresponding to the quality inspection requirement is sent to the user terminal, wherein the business scenario form includes a business data type item, a business data usage item, a business data source item, and a custom key data item; 根据所述业务场景表单,确定当前数据质量检测需求对应的当前业务场景信息,其中,所述当前业务场景信息包括多种业务数据的业务数据用途信息、业务数据来源信息和预设的至少一个关键数据项。According to the business scenario form, current business scenario information corresponding to the current data quality detection requirements is determined, wherein the current business scenario information includes business data usage information of multiple business data, business data source information and at least one preset key data item. 9.一种基于业务场景的数据质量检测设备,其特征在于,所述设备包括:9. A data quality detection device based on a business scenario, characterized in that the device comprises: 至少一个处理器;以及,at least one processor; and, 与所述至少一个处理器通信连接的存储器;其中,a memory communicatively connected to the at least one processor; wherein, 所述存储器存储有可被所述至少一个处理器执行的指令,所述指令被所述至少一个处理器执行,以使所述至少一个处理器能够执行如权利要求1-8任一所述的方法。The memory stores instructions that can be executed by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the method according to any one of claims 1 to 8. 10.一种非易失性计算机存储介质,存储有计算机可执行指令,其特征在于,所述计算机可执行指令设置为:执行如权利要求1-8任一所述的方法。10. A non-volatile computer storage medium storing computer executable instructions, wherein the computer executable instructions are configured to: execute the method according to any one of claims 1 to 8.
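Claim 4 describes ordering subtasks by their dependency relationships and priority information, then greedily matching each one to a working node with sufficient free resources. The sketch below is a minimal illustration of that allocation scheme only; the names (`Subtask`, `WorkerNode`), the single-resource (CPU) model, and the "lower value = higher priority" convention are assumptions for illustration and are not drawn from the Griffin or Spark implementations.

```python
from dataclasses import dataclass, field
import heapq

@dataclass
class Subtask:
    task_id: str
    rule_id: str                 # data quality rule ID (claim 3)
    priority: int                # assumption: lower value = higher priority
    cpu_need: int                # simplified single-resource requirement
    deps: list = field(default_factory=list)  # subtask IDs that must run first

@dataclass
class WorkerNode:
    node_id: str
    cpu_free: int                # current free capacity (claim 4's "current resource usage")

def allocation_order(subtasks):
    """Topological order over dependencies, breaking ties by priority (claim 4)."""
    by_id = {t.task_id: t for t in subtasks}
    indegree = {t.task_id: len(t.deps) for t in subtasks}
    dependents = {t.task_id: [] for t in subtasks}
    for t in subtasks:
        for d in t.deps:
            dependents[d].append(t.task_id)
    # Seed the heap with all dependency-free subtasks, keyed by priority.
    ready = [(t.priority, t.task_id) for t in subtasks if indegree[t.task_id] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, tid = heapq.heappop(ready)
        order.append(by_id[tid])
        for nxt in dependents[tid]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, (by_id[nxt].priority, nxt))
    return order

def assign_nodes(subtasks, nodes):
    """Walk the allocation order and place each subtask on the node with the
    most free capacity that can satisfy its requirement."""
    placement = {}
    for task in allocation_order(subtasks):
        node = max((n for n in nodes if n.cpu_free >= task.cpu_need),
                   key=lambda n: n.cpu_free, default=None)
        if node is None:
            continue  # no capacity now; a real scheduler would queue and retry
        node.cpu_free -= task.cpu_need
        placement[task.task_id] = node.node_id
    return placement
```

A real scheduler would also track storage and network bandwidth (as the claim enumerates) and re-balance using the real-time monitoring data of claim 6; this sketch shows only the ordering and matching logic.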
CN202411272225.2A 2024-09-11 2024-09-11 A data quality detection method, device and medium based on business scenarios Pending CN119127859A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411272225.2A CN119127859A (en) 2024-09-11 2024-09-11 A data quality detection method, device and medium based on business scenarios

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411272225.2A CN119127859A (en) 2024-09-11 2024-09-11 A data quality detection method, device and medium based on business scenarios

Publications (1)

Publication Number Publication Date
CN119127859A true CN119127859A (en) 2024-12-13

Family

ID=93763726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411272225.2A Pending CN119127859A (en) 2024-09-11 2024-09-11 A data quality detection method, device and medium based on business scenarios

Country Status (1)

Country Link
CN (1) CN119127859A (en)

Similar Documents

Publication Publication Date Title
US11627053B2 (en) Continuous data sensing of functional states of networked computing devices to determine efficiency metrics for servicing electronic messages asynchronously
US11334543B1 (en) Scalable bucket merging for a data intake and query system
CN106682097B (en) Method and device for processing log data
US9990385B2 (en) Method and system for collecting and analyzing time-series data
US9712410B1 (en) Local metrics in a service provider environment
US7979439B1 (en) Method and system for collecting and analyzing time-series data
US11816511B1 (en) Virtual partitioning of a shared message bus
CN113760677A (en) Abnormal link analysis method, device, equipment and storage medium
CN106681808A (en) Task scheduling method and device
US11528207B1 (en) Computing system monitor auditing
CN106708965A (en) Data processing method and apparatus
CN106682099A (en) Data storage method and device
CN118227447B (en) Index monitoring method, device, computer equipment and storage medium
WO2025118710A1 (en) Partition recognition method, storage system, electronic device, and storage medium
CN118585297A (en) A task execution method and related equipment
CN112596974A (en) Full link monitoring method, device, equipment and storage medium
CN118377615A (en) Process indicator calculation method based on flow computing and related equipment
CN119127859A (en) A data quality detection method, device and medium based on business scenarios
CN114443745B (en) Method, system, device and storage medium for processing data
Sarathchandra et al. Resource aware scheduler for distributed stream processing in cloud native environments
CN119514983B (en) Scheduling platform parallel operation method, device and computer equipment
US20240346423A1 (en) Continuous data sensing of functional states of networked computing devices to determine efficiency metrics for servicing electronic messages asynchronously
CN120336310A (en) Method, device and computing device cluster for determining data asset generation time
Kanis An experimental evaluation of auto-scaling techniques for distributed stream processing systems
CN117435333A (en) Component resource optimization method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination