CN112818026B

CN112818026B - Data integration method and device

Info

Publication number: CN112818026B
Application number: CN201911121056.1A
Authority: CN
Inventors: 李中林
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2019-11-15
Filing date: 2019-11-15
Publication date: 2025-07-15
Anticipated expiration: 2039-11-15
Also published as: CN112818026A

Abstract

The present invention discloses a data integration method and device, and relates to the field of computer technology. The method includes: setting labels for metadata in multiple data sets in parallel, and storing the obtained metadata label information in a distributed search engine; wherein the metadata label information includes the dataset identifier of the metadata; parsing the pre-set dataset integration rules to generate corresponding aggregate query statements; performing aggregate query on the metadata label information based on the aggregate query statement to obtain aggregate query results. Through the above steps, the processing efficiency of data integration can be improved, the consumption of single-machine server memory and CPU and other resources during the integration process can be reduced, and the stability and availability of the system can be improved.

Description

Data integration method and device

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a data integration method and apparatus.

Background

In the prior art, multi-source data sets are mainly integrated either manually or by a specific system. In the manual mode, operators integrate the multi-source data sets with document tools such as Excel. For a small amount of data, this mode is cheap and fast. For a large data volume, however, manual processing is essentially infeasible.

For the case of large data volume, the integration processing is mainly performed by a specific system. The processing flow includes loading a plurality of data sets into the memory of a single server, and then performing the integration processing on the data sets sequentially according to a preset integration rule. For example, assuming that there are four data sets data_set1, data_set2, data_set3 and data_set4, and the preset integration rule is data_set1∩data_set2∪data_set3-data_set4, then after the four data sets are loaded into memory, the intersection of data_set1 and data_set2 is taken first, the union with data_set3 is taken next, and the difference with data_set4 is taken last.
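The sequential flow described above can be sketched in a few lines. This is only an illustration: the set contents and the function name `integrate_sequentially` are hypothetical, and ordinary in-memory Python sets stand in for the loaded data sets.

```python
# Hypothetical in-memory data sets; the values are made up for illustration.
data_set1 = {"u1", "u2", "u3"}
data_set2 = {"u2", "u3", "u4"}
data_set3 = {"u5"}
data_set4 = {"u3"}


def integrate_sequentially(s1, s2, s3, s4):
    """Apply data_set1 ∩ data_set2 ∪ data_set3 - data_set4 strictly left to right."""
    step1 = s1 & s2      # intersection of the first two data sets
    step2 = step1 | s3   # union with the third data set
    return step2 - s4    # difference with the fourth data set


result = integrate_sequentially(data_set1, data_set2, data_set3, data_set4)
print(sorted(result))
```

Each step depends on the previous one and every set must already be in memory, which is the limitation the following paragraph discusses.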

In the process of realizing the invention, the inventor found that the existing data integration scheme based on a specific system has at least the following problems. First, the data integration process can only be executed in the order given by the integration rule and cannot be executed in parallel, so the integration efficiency is low and the processing time cannot be estimated. Second, all data sets must be loaded into the memory of a single server, so the memory consumption of that server is high. Third, the integration process takes a long time, so the CPU consumption of the single server stays high for a long period, which affects the stability and usability of the whole system.

Disclosure of Invention

In view of this, the present invention provides a data integration method and apparatus, which can improve the processing efficiency of data integration, reduce the consumption of resources such as memory and CPU of a stand-alone server during the integration processing, and improve the stability and usability of the system.

To achieve the above object, according to one aspect of the present invention, there is provided a data integration method.

The data integration method comprises: setting tags on metadata in a plurality of data sets in parallel, and storing the obtained tag information of the metadata in a distributed search engine, wherein the tag information of the metadata includes the identifier of the data set in which the metadata is located; parsing a preset data set integration rule to generate a corresponding aggregate query statement; and performing an aggregate query on the tag information of the metadata based on the aggregate query statement to obtain an aggregate query result.

Setting tags on the metadata in the plurality of data sets in parallel comprises splitting each of the plurality of data sets into a plurality of data blocks and setting tags on the metadata in the plurality of data blocks in parallel. Setting tags on the metadata in the plurality of data blocks in parallel comprises: for the metadata in each data block, judging whether the data set identifier of the metadata exists in the tag set corresponding to the metadata; if it does not, adding the data set identifier of the metadata to the tag set corresponding to the metadata, and then performing an insert-or-update operation on the distributed search engine according to the tag set corresponding to the metadata.

Optionally, parsing the preset data set integration rule to generate the corresponding aggregate query statement comprises: converting the preset data set integration rule into a digitized expression, wherein the digitized expression consists of digitized operands and operators, and the operators include at least one of an intersection operator, a union operator or a difference operator; and generating the corresponding aggregate query statement according to the digitized expression.

Optionally, the distributed search engine comprises an Elasticsearch engine.

To achieve the above object, according to another aspect of the present invention, there is provided a data integrating apparatus.

The data integration device comprises a marking module, a generation module and a query module. The marking module is used for setting tags on metadata in a plurality of data sets in parallel and storing the obtained tag information of the metadata in a distributed search engine, wherein the tag information of the metadata includes the identifier of the data set in which the metadata is located. The generation module is used for parsing a preset data set integration rule to generate a corresponding aggregate query statement. The query module is used for performing an aggregate query on the tag information of the metadata based on the aggregate query statement to obtain an aggregate query result.

Optionally, the marking module setting tags on metadata in a plurality of data sets in parallel and storing the obtained tag information of the metadata in a distributed search engine comprises: the marking module splits each of the plurality of data sets into a plurality of data blocks and sets tags on the metadata in the plurality of data blocks in parallel. The marking module setting tags on the metadata in the plurality of data blocks in parallel comprises: for the metadata in each data block, the marking module judges whether the data set identifier of the metadata exists in the tag set corresponding to the metadata; if it does not, the marking module adds the data set identifier of the metadata to the tag set corresponding to the metadata, and then performs an insert-or-update operation on the distributed search engine according to the tag set corresponding to the metadata.

Optionally, the generation module parsing a preset data set integration rule to generate a corresponding aggregate query statement comprises: the generation module converts the preset data set integration rule into a digitized expression, wherein the digitized expression consists of digitized operands and operators, and the operators include at least one of an intersection operator, a union operator or a difference operator; and the generation module generates the corresponding aggregate query statement according to the digitized expression.

Optionally, the distributed search engine comprises an Elasticsearch engine.

To achieve the above object, according to still another aspect of the present invention, there is provided an electronic apparatus.

The electronic device comprises one or more processors and a storage device, wherein the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors are enabled to realize the data integration method.

To achieve the above object, according to still another aspect of the present invention, a computer-readable medium is provided.

The computer readable medium of the present invention has stored thereon a computer program which, when executed by a processor, implements the data integration method of the present invention.

The embodiment of the invention has the advantages that the method has the steps of parallelly setting the tags on the metadata in a plurality of data sets, storing the tag information of the obtained metadata into a distributed search engine, analyzing the preset data set integration rule to generate an aggregation query statement corresponding to the preset data set integration rule, and carrying out aggregation query on the tag information of the metadata based on the aggregation query statement to obtain an aggregation query result, so that the processing efficiency of data integration can be improved, the consumption of resources such as a stand-alone server memory, a CPU (Central processing unit) and the like in the integration processing process can be reduced, and the stability and the usability of the system can be improved.

Further effects of the above-described non-conventional alternatives are described below in connection with the embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a schematic flow chart of a data integration method according to a first embodiment of the invention;

FIG. 2 is a schematic flow chart of a data integration method according to a second embodiment of the invention;

FIG. 3 is an exemplary flow diagram of metadata parallel marking in accordance with a second embodiment of the present invention;

FIG. 4 is a schematic diagram of a storage structure of metadata tag information in a distributed search engine according to a second embodiment of the present invention;

FIG. 5 is a schematic diagram of the main modules of a data integration device according to a third embodiment of the invention;

FIG. 6 is a schematic diagram of the main modules of a data integration system according to a fourth embodiment of the invention;

FIG. 7 is an exemplary system architecture diagram in which embodiments of the present invention may be applied;

FIG. 8 is a schematic diagram of a computer system suitable for use in implementing an embodiment of the invention.

Detailed Description

Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

It is noted that embodiments of the invention and features of the embodiments may be combined with each other without conflict.

Example 1

Fig. 1 is a schematic flow chart of a data integration method according to a first embodiment of the invention. As shown in fig. 1, the data integration method according to the embodiment of the present invention includes:

Step S101, setting tags in parallel on the metadata in a plurality of data sets, and storing the obtained tag information of the metadata in a distributed search engine.

The metadata is understood to be an attribute value that can distinguish between or uniquely represent each element in the data set, and may be one attribute value of the data in the data set or a combination of several attribute values of the data in the data set. Taking the example of a user crowd data set obtained from multiple data sources (such as databases, text files, MQ message queues, etc.) in an e-commerce platform, the metadata may be a user account value (which may be simply referred to as a "user PIN").

In this step, the metadata of the plurality of data sets to be integrated may be marked in parallel (marking can be understood as setting tags on the metadata) based on a timed worker tool (a task running framework), loop threads, or the like, so as to improve marking efficiency. For example, assuming there are three data sets data_set1, data_set2 and data_set3, a timed worker may be started for each data set so that the metadata of the three data sets is marked in parallel. To further improve marking efficiency, each of the three data sets may be split into a plurality of data blocks, and the metadata in the data blocks may be marked in parallel.

The tag information of the metadata includes the identifier of the data set in which the metadata is located. For example, when marking metadata in data set data_set1, the identifier of the data set (such as "No_set1" or "1") may be added to a data structure that stores the tag information of the metadata. This data structure may be a tuple, a linked list, or the like.
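As a rough sketch of the parallel marking described above, the following uses one thread per data set. It rests on several assumptions that are not in the original: a plain dictionary (`tag_store`) stands in for the distributed search engine, a thread pool stands in for the timed worker tool, and the data set contents are invented.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input: data set identifier -> metadata values (e.g. user PINs).
data_sets = {
    1: ["1213", "1500"],
    2: ["1213", "1777"],
    3: ["1900"],
}

tag_store = {}                 # in-memory stand-in for the distributed search engine
store_lock = threading.Lock()  # protects tag_store across worker threads


def mark_data_set(set_id, metadata_values):
    """Add set_id to the tag list of every metadata value in one data set."""
    for value in metadata_values:
        with store_lock:
            tags = tag_store.setdefault(value, [])
            if set_id not in tags:
                tags.append(set_id)


# One worker per data set, so the data sets are marked in parallel.
with ThreadPoolExecutor(max_workers=len(data_sets)) as pool:
    futures = [pool.submit(mark_data_set, sid, vals) for sid, vals in data_sets.items()]
    for future in futures:
        future.result()  # re-raise any exception from a worker

print({key: sorted(tags) for key, tags in tag_store.items()})
```

With a real search engine the lock would be replaced by the engine's own concurrency control; the lock here only keeps the stand-in dictionary consistent.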

The distributed search engine may comprise an Elasticsearch engine. Elasticsearch, ES for short, is a Lucene-based search engine.

Step S102, analyzing a preset data set integration rule to generate an aggregation query statement corresponding to the data set integration rule.

In one example, the data set integration rule may be represented by an infix expression. An infix expression is the common notation for arithmetic or logical formulas, in which each operator is written between its operands. For example, a data set integration rule expressed as an infix expression is data_set1∩data_set2∪data_set3-data_set4.

In this step, the aggregate query statement corresponding to the data set integration rule is obtained by parsing the preset data set integration rule. Illustratively, assuming that the distributed search engine is Elasticsearch and the data set integration rule is data_set1∩data_set2∪data_set3-data_set4, the corresponding aggregate query statement obtained by parsing may be expressed as:
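The statement that followed this sentence in the original is not present in this text. Purely as an illustrative sketch, and not the statement from the original, one way to express the rule as an Elasticsearch Boolean query body is shown below. It assumes that each metadata document carries a field named `tag` (a hypothetical name) holding the list of data set identifiers, and it evaluates the rule left to right.

```python
import json

# Assumption: each metadata document has a "tag" field listing data set identifiers.

# data_set1 ∩ data_set2: both identifiers must be present.
intersection_1_2 = {
    "bool": {"must": [{"term": {"tag": 1}}, {"term": {"tag": 2}}]}
}

# (...) ∪ data_set3: either the intersection matches or identifier 3 is present.
union_with_3 = {
    "bool": {
        "should": [intersection_1_2, {"term": {"tag": 3}}],
        "minimum_should_match": 1,
    }
}

# (...) - data_set4: keep the union, exclude documents tagged with identifier 4.
query_body = {
    "query": {
        "bool": {
            "filter": [union_with_3],
            "must_not": [{"term": {"tag": 4}}],
        }
    }
}

print(json.dumps(query_body, indent=2))
```

The body only uses the standard `bool` clauses (`must`, `should`, `filter`, `must_not`); how the original nests them may differ.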

Step S103, performing an aggregate query on the tag information of the metadata based on the aggregate query statement to obtain an aggregate query result.

In the embodiment of the invention, the metadata in a plurality of data sets are parallelly provided with the tags, the obtained tag information of the metadata is stored in the distributed search engine, the preset data set integration rule is analyzed to generate the corresponding aggregation query statement, and the aggregation query is carried out on the tag information of the metadata based on the aggregation query statement to obtain the aggregation query result, so that the processing efficiency of data integration can be improved, the consumption of resources such as a stand-alone server memory, a CPU (Central processing unit) and the like in the integration processing process can be reduced, and the stability and the usability of the system are improved.

Example 2

Fig. 2 is a schematic flow chart of a data integration method according to a second embodiment of the invention. As shown in fig. 2, the data integration method according to the embodiment of the present invention includes:

Step S201, extracting a plurality of data sets from different data sources based on a pre-configured key data extraction rule.

In particular, different forms of key data extraction rules (or metadata extraction rules) may be set for different data sources to obtain multiple data sets from different data sources. For example, for a data source such as a text file, the key data extraction rules may be represented by regular expressions, and for a data source such as a database, the key data extraction rules may be represented by one or more screening fields in the database.
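A minimal sketch of the two kinds of extraction rule mentioned above follows. The log format, the regular expression, the field name `user_pin` and both function names are hypothetical; a list of dictionaries stands in for database rows.

```python
import re

# Hypothetical rule for a text-file source: capture the value after "pin=".
TEXT_RULE = re.compile(r"pin=(\w+)")


def extract_from_text(lines):
    """Collect every value matched by the regular-expression rule."""
    found = set()
    for line in lines:
        found.update(TEXT_RULE.findall(line))
    return found


def extract_from_rows(rows, field):
    """Collect the values of one screening field from database-like rows."""
    return {row[field] for row in rows if row.get(field) is not None}


text_lines = ["2019-11-15 pin=1213 action=view", "2019-11-15 pin=1500 action=buy"]
db_rows = [{"user_pin": "1213", "city": "Beijing"}, {"user_pin": "1777", "city": None}]

data_set1 = extract_from_text(text_lines)
data_set2 = extract_from_rows(db_rows, "user_pin")
print(sorted(data_set1), sorted(data_set2))
```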

Step S202, the metadata in the data sets are parallelly provided with tags, and the tag information of the obtained metadata is stored in the distributed search engine.

In this step, the metadata of the plurality of data sets to be integrated may be marked in parallel (marking can be understood as setting tags on the metadata) based on a timed worker tool (a task running framework), loop threads, or the like, so as to improve marking efficiency. For example, assuming there are three data sets data_set1, data_set2 and data_set3, a timed worker may be started for each data set so that the metadata of the three data sets is marked in parallel.

In order to further improve the marking efficiency, each of the plurality of data sets may be further split into a plurality of data blocks, and metadata in the plurality of data blocks may be marked in parallel.
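Splitting a data set into data blocks can be as simple as fixed-size slicing. The sketch below assumes a block size chosen by the caller; the function name `split_into_blocks` and the sample values are hypothetical.

```python
def split_into_blocks(metadata_values, block_size):
    """Split a list of metadata values into consecutive blocks of at most block_size."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        metadata_values[start:start + block_size]
        for start in range(0, len(metadata_values), block_size)
    ]


blocks = split_into_blocks(["1213", "1500", "1777", "1900", "2001"], block_size=2)
print(blocks)
```

Each block can then be handed to its own worker, so a large data set is not marked by a single task.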

The tag information of the metadata includes the identifier of the data set in which the metadata is located. For example, when metadata in data set data_set1 is marked, the identifier of the data set (e.g., "No_set1" or "1") may be added to the tag set corresponding to the metadata. When metadata in data set data_set2 is marked, the identifier of the data set (e.g., "No_set2" or "2") may be added to the tag set corresponding to the metadata. The tag set corresponding to the metadata is a data structure for storing the tag information of the metadata, and may be an array, a linked list, or the like.

In this step, the metadata in the plurality of data sets is marked in parallel, and the tag information obtained by marking is stored in the distributed search engine, so that the summarized metadata tag information (as shown in FIG. 4) is finally obtained. For example, assuming that data set data_set1 and data set data_set2 both contain the same metadata (e.g., user account value "1213"), then after marking is complete the distributed search engine holds the tag information [1, 2] for that metadata, which indicates that the metadata is present in both data_set1 and data_set2.

Step S203, converting the preset data set integration rule into a digitized expression.

Wherein the digitized expression is comprised of digitized operands and operators in digital form, the operators including at least one of an intersection operator, a union operator, or a difference operator.

For example, assuming that the data set integration rule is "data_set1∩data_set2∪data_set3-data_set4", it can be converted to "1∩2∪3-4" by this step.
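The conversion can be sketched as a name-to-identifier substitution. The mapping `SET_IDS` and the function name `to_digitized_expression` are assumptions made for this example; the original does not say how identifiers are assigned.

```python
import re

# Hypothetical mapping from data set names to their numeric identifiers.
SET_IDS = {"data_set1": 1, "data_set2": 2, "data_set3": 3, "data_set4": 4}


def to_digitized_expression(rule, set_ids=SET_IDS):
    """Replace every data set name in the rule with its numeric identifier."""

    def replace(match):
        name = match.group(0)
        if name not in set_ids:
            raise KeyError(f"unknown data set: {name}")
        return str(set_ids[name])

    # Names start with a letter or underscore; the operators ∩, ∪ and - are left as-is.
    return re.sub(r"[A-Za-z_]\w*", replace, rule.replace(" ", ""))


print(to_digitized_expression("data_set1∩data_set2∪data_set3-data_set4"))
```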

Step S204, generating an aggregate query statement corresponding to the digitized expression according to the digitized expression.

In this step, the digitized expression is scanned. Each time an operator is encountered during the scan, a query clause is constructed from the operator and its operands. All the constructed query clauses are then nested and packaged to obtain the aggregate query statement corresponding to the digitized expression.

For example, assuming that the distributed search engine is an Elasticsearch engine and the digitized expression obtained in step S203 is "1∩2∪3-4": when the scan encounters the intersection operator "∩", the query clause "filter": {"term": {"tag": [1, 2]}} is constructed; when the scan encounters the union operator "∪", the query clause "should": {"term": {"tag": 3}} is constructed; and when the scan encounters the difference operator "-", the query clause "must_not": {"term": {"tag": 4}} is constructed. The constructed query clauses are then nested and packaged to obtain the aggregate query statement corresponding to the digitized expression.
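The scan-and-nest idea can be sketched as a small generator that walks the digitized expression from left to right and wraps the query built so far in a new Boolean clause for each operator. This is an assumption-laden illustration rather than the statement generator of the original: the field name `tag`, the helper names, and the exact clause shapes are choices made here.

```python
import re


def term(tag_id):
    """Query clause matching documents whose tag list contains tag_id."""
    return {"term": {"tag": tag_id}}


def build_query(expression):
    """Build a nested bool query from a digitized expression such as '1∩2∪3-4'."""
    compact = expression.replace(" ", "")
    if re.fullmatch(r"\d+(?:[∩∪-]\d+)*", compact) is None:
        raise ValueError(f"malformed digitized expression: {expression!r}")
    tokens = re.findall(r"\d+|[∩∪-]", compact)

    current = term(int(tokens[0]))
    # Operators sit at odd positions, their right-hand operands at even positions.
    for operator, operand in zip(tokens[1::2], tokens[2::2]):
        right = term(int(operand))
        if operator == "∩":    # intersection: both sides must match
            current = {"bool": {"filter": [current, right]}}
        elif operator == "∪":  # union: at least one side must match
            current = {"bool": {"should": [current, right], "minimum_should_match": 1}}
        else:                  # difference: left side matches, right side does not
            current = {"bool": {"filter": [current], "must_not": [right]}}
    return {"query": current}


query = build_query("1∩2∪3-4")
print(query)
```

Evaluating strictly left to right mirrors the order used in the Background example; operator precedence or parentheses would need a fuller parser.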

Step S205, performing an aggregate query on the tag information of the metadata based on the aggregate query statement to obtain an aggregate query result.
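To make the effect of the query concrete, the sketch below evaluates a digitized expression directly against in-memory tag information instead of sending a query to a search engine. The dictionary `tag_info`, its contents, and the function name `matches` are hypothetical; the point is only that each metadata item can be tested independently from its own tag set.

```python
import re

# Hypothetical summarized tag information: metadata value -> data set identifiers.
tag_info = {
    "1213": [1, 2],
    "1500": [1],
    "1777": [3],
    "1900": [1, 2, 4],
}


def matches(tags, expression):
    """Evaluate a digitized expression such as '1∩2∪3-4' against one tag set, left to right."""
    tokens = re.findall(r"\d+|[∩∪-]", expression)
    result = int(tokens[0]) in tags
    for operator, operand in zip(tokens[1::2], tokens[2::2]):
        present = int(operand) in tags
        if operator == "∩":
            result = result and present
        elif operator == "∪":
            result = result or present
        else:  # difference
            result = result and not present
    return result


selected = sorted(key for key, tags in tag_info.items() if matches(tags, "1∩2∪3-4"))
print(selected)
```

Because the test for one metadata item never looks at another item, a distributed engine can evaluate it across shards without loading whole data sets on a single server.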

In the embodiment of the invention, the marking efficiency is improved by marking the metadata in a plurality of data sets in parallel, the tag information of the metadata obtained by marking is stored in the distributed search engine, the aggregation query statement corresponding to the preset data set integration rule is generated by analyzing the preset data set integration rule, and the tag information of the metadata is aggregated and queried based on the aggregation query statement, so that the processing efficiency of data integration can be improved, the consumption of resources such as a stand-alone server memory, a CPU (Central processing unit) and the like in the integration processing process can be reduced, and the stability and the usability of the system can be improved.

Fig. 3 is an exemplary flow diagram of metadata parallel marking in accordance with a second embodiment of the present invention. After splitting the data set into a plurality of data blocks, metadata in the plurality of data blocks may be marked in parallel based on the flow shown in fig. 3. As shown in fig. 3, an exemplary flow of parallel marking of metadata in a data block includes:

Step S301, obtaining a tag set corresponding to metadata in a data block in a distributed search engine.

Step S302, judging whether a tag set corresponding to the metadata exists or not. If there is a tag set corresponding to the metadata, step S303 may be performed, otherwise, step S304 may be performed.

In this step, if the tag set corresponding to the metadata is acquired from the distributed search engine, it is determined that the tag set corresponding to the metadata exists, and if the tag set corresponding to the metadata is not acquired from the distributed search engine, it is determined that the tag set corresponding to the metadata does not exist.

Step S303, judging whether the data set identifier of the metadata exists in the tag set. If the data set identifier of the metadata does not exist in the tag set, step S305 may be performed, otherwise, step S307 may be performed.

Step S304, creating a label set corresponding to the metadata.

Step S305, adding the data set identifier of the metadata to the tag set. After step S305, step S306 may be performed.

For example, when marking metadata in data block Block1, if the data block belongs to data set data_set1, the identifier of that data set (such as "No_set1" or "1") may be added to the tag set corresponding to the metadata; if instead the data block belongs to data set data_set2, the identifier of that data set (such as "No_set2" or "2") may be added to the tag set corresponding to the metadata.

Step S306, performing an insert-or-update operation on the distributed search engine according to the tag set corresponding to the metadata.

For example, when the distributed search engine is Elasticsearch, the tag information held in the tag set corresponding to the metadata may be stored in the Elasticsearch engine through an upsert operation.

Step S307, acquiring the next metadata in the data block. After step S307, step S301 may be executed again for the next metadata acquired.
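Steps S301 to S307 can be sketched as a loop over one data block. The class `InMemoryEngine` is a hypothetical stand-in for the distributed search engine, and the sketch assumes that a newly created tag set (step S304) proceeds to step S305, which the flow implies but does not state.

```python
class InMemoryEngine:
    """Hypothetical stand-in for the distributed search engine."""

    def __init__(self):
        self.docs = {}

    def get_tags(self, metadata):
        tags = self.docs.get(metadata)
        return None if tags is None else list(tags)

    def upsert(self, metadata, tags):
        self.docs[metadata] = list(tags)


def mark_block(engine, block, set_id):
    """Mark every metadata value in one data block with its data set identifier."""
    for metadata in block:                # S307: move on to the next metadata
        tags = engine.get_tags(metadata)  # S301: fetch the tag set from the engine
        if tags is None:                  # S302: does a tag set exist?
            tags = []                     # S304: create one (assumed to continue at S305)
        elif set_id in tags:              # S303: identifier already present?
            continue                      # nothing to do for this metadata
        tags.append(set_id)               # S305: add the data set identifier
        engine.upsert(metadata, tags)     # S306: insert-or-update in the engine


engine = InMemoryEngine()
mark_block(engine, ["1213", "1500"], set_id=1)
mark_block(engine, ["1213"], set_id=2)
mark_block(engine, ["1213"], set_id=2)  # repeated marking leaves the tag set unchanged
print(engine.docs)
```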

In the embodiment of the invention, metadata of a plurality of data sets can be marked in parallel through the steps, so that the marking efficiency is improved, and furthermore, the running pressure of a single server can be effectively reduced through the parallel marking in a distributed environment.
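For the upsert in step S306, a request against Elasticsearch's document update endpoint might look like the sketch below. The index name `metadata_tags`, the field name `tag`, the use of the metadata value as the document ID, and the function name are all assumptions; the sketch only builds the request path and body and does not contact a cluster. The `doc_as_upsert` flag asks Elasticsearch to insert the partial document when no document with that ID exists yet.

```python
def build_upsert_request(index, metadata, tags):
    """Return the update path and body that upsert the tag set of one metadata value."""
    path = f"/{index}/_update/{metadata}"  # update endpoint in recent Elasticsearch versions
    body = {
        "doc": {"tag": sorted(tags)},  # partial document carrying the full tag set
        "doc_as_upsert": True,         # insert the document if it does not exist yet
    }
    return path, body


path, body = build_upsert_request("metadata_tags", "1213", {1, 2})
print(path, body)
```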

Fig. 5 is a schematic diagram of main modules of a data integration apparatus according to a third embodiment of the invention. As shown in fig. 5, the data integration apparatus 500 according to the embodiment of the present invention includes a marking module 501, a generating module 502, and a querying module 503.

The marking module 501 is configured to set labels for metadata in multiple data sets in parallel, and store label information of the obtained metadata in the distributed search engine.

Illustratively, the marking module 501 may mark the metadata of the plurality of data sets to be integrated in parallel (marking can be understood as setting tags on the metadata) based on a timed worker tool (a task running framework), loop threads, or the like, so as to improve marking efficiency. For example, assuming there are three data sets data_set1, data_set2 and data_set3, a timed worker may be started for each data set so that the metadata of the three data sets is marked in parallel. To further improve marking efficiency, each of the three data sets may be split into a plurality of data blocks, and the metadata in the data blocks may be marked in parallel.

The tag information of the metadata includes the identifier of the data set in which the metadata is located. For example, when metadata in data set data_set1 is marked, the identifier of the data set (e.g., "No_set1" or "1") may be added to the tag set corresponding to the metadata. The tag set is a data structure for storing the tag information of the metadata, and may be a tuple, a linked list, or the like.

The generating module 502 is configured to parse a preset data set integration rule to generate an aggregate query statement corresponding to the data set integration rule.

Illustratively, the generating module 502 parsing the preset data set integration rule to generate the corresponding aggregate query statement comprises: the generating module 502 converts the preset data set integration rule into a digitized expression, which consists of digitized operands and operators, and then generates the aggregate query statement corresponding to the digitized expression. The operators include at least one of an intersection operator, a union operator, or a difference operator.

In one example, the data set integration rule may be represented by an infix expression. An infix expression is the common notation for arithmetic or logical formulas, in which each operator is written between its operands. For example, a data set integration rule expressed as an infix expression is data_set1∩data_set2∪data_set3-data_set4. Further, assuming that the distributed search engine is an Elasticsearch engine, the aggregate query statement corresponding to the data set integration rule "data_set1∩data_set2∪data_set3-data_set4" generated by the generating module 502 may be expressed as:

The query module 503 is configured to perform an aggregate query on the tag information of the metadata based on the aggregate query statement, so as to obtain an aggregate query result.

In the embodiment of the invention, the metadata in a plurality of data sets are marked in parallel through the marking module, the tag information of the marked metadata is stored in the distributed search engine, the preset data set integration rule is analyzed through the generating module to generate the corresponding aggregation query statement, and the tag information of the metadata is subjected to aggregation query based on the aggregation query statement through the query module to obtain the aggregation query result, so that the processing efficiency of data integration can be improved, the consumption of resources such as a stand-alone server memory and a CPU (Central processing Unit) in the integration processing process can be reduced, and the stability and the usability of the system are improved.

Fig. 6 is a schematic diagram of main modules of a data integration system according to a fourth embodiment of the invention. As shown in fig. 6, the data integration system 600 of the embodiment of the present invention includes a multi-source data set collecting device 601, a data integrating device 602, an integration rule configuration terminal 603, a data distribution engine 604, a data output terminal 605, and a distributed search engine 606.

The multi-source data set collecting device 601 is configured to extract a plurality of data sets from different data sources based on pre-configured key data extraction rules. The data source may be, for example, a text file, a database, or an MQ (Message Queue).

In particular, for different data sources, different forms of key data extraction rules (or metadata extraction rules) may be set in the multi-source data set collection device 601 to obtain multiple data sets from different data sources. For example, for a data source such as a text file, the key data extraction rules may be represented by regular expressions, and for a data source such as a database, the key data extraction rules may be represented by one or more screening fields in the database.

The integration rule configuration terminal 603 is configured to configure the data set integration rule. In the integration rule configuration terminal 603, a set of functional operation controls may be provided on the interactive interface. For example, controls for adding, editing, deleting, and querying integration rules may be provided so that system users can configure and manage the integration rules flexibly. The integration rule may comprise at least one of an intersection operator (expressed by the mathematical symbol ∩), a union operator (expressed by the mathematical symbol ∪), and a difference operator (expressed by the mathematical symbol -). Illustratively, if the data set integration rule is configured as data_set1∩data_set2, the data common to data_set1 and data_set2 is taken; if the rule is configured as data_set1∪data_set2, the union of data_set1 and data_set2 is taken; and if the rule is configured as data_set1-data_set2, the difference of data_set1 and data_set2 is taken, yielding the data that is present in data_set1 but not in data_set2.
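A configuration terminal would typically check a rule before saving it. The following sketch shows one possible check, assuming rules are written with the three operators above and refer to known data set names; the set `KNOWN_SETS` and the function name `validate_rule` are hypothetical and not part of the original.

```python
import re

# Hypothetical names of the data sets that may appear in a rule.
KNOWN_SETS = {"data_set1", "data_set2", "data_set3", "data_set4"}

# An operand, followed by any number of (operator, operand) pairs.
RULE_PATTERN = re.compile(r"\w+(?:[∩∪-]\w+)*")


def validate_rule(rule, known_sets=KNOWN_SETS):
    """Return True if the rule uses only ∩, ∪, - and known data set names."""
    compact = rule.replace(" ", "")
    if RULE_PATTERN.fullmatch(compact) is None:
        return False
    operands = re.split(r"[∩∪-]", compact)
    return all(name in known_sets for name in operands)


print(validate_rule("data_set1∩data_set2"))
print(validate_rule("data_set1∪"))
```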

In particular, after the data set integration rule is configured by the integration rule configuration terminal 603, the configured data set integration rule may be persistently stored. For example, the configured data set integration rules may be stored in a database or a centralized cache.

The data integration device 602 is configured to set labels for metadata in multiple data sets in parallel, store label information of the obtained metadata in the distributed search engine 606, parse preset data set integration rules to generate an aggregate query statement corresponding to the preset data set integration rules, and perform an aggregate query on the label information of the metadata based on the aggregate query statement to obtain an aggregate query result.

The distributed search engine 606 may be an Elasticsearch engine. Elasticsearch, ES for short, is a Lucene-based search engine. The ES engine supports real-time queries over large data volumes, which provides the basis for integrating multiple data sets in the embodiment of the invention.

The data distribution engine 604 is configured to distribute the aggregate query result, i.e., the data obtained after the integration processing. For example, the integrated data may be distributed to a text file, a database, an MQ message queue, or a network cache (e.g., a Redis cluster).

The data output terminal 605 is configured to output and display the integrated data.

The embodiment of the invention builds a complete data integration system that supports autonomous configuration of both the multi-source data set acquisition rules and the data set integration rules, giving it wide applicability and ease of use. It avoids the defects of data integration in the prior art, improves the efficiency of integrating multi-source data sets, reduces the memory and CPU consumed on a single server during integration, and improves the stability and availability of the whole system.

Fig. 7 illustrates an exemplary system architecture 700 to which the data integration method or apparatus of embodiments of the present invention may be applied.

As shown in fig. 7, a system architecture 700 may include terminal devices 701, 702, 703, a network 704, and a server 705. The network 704 is the medium used to provide communication links between the terminal devices 701, 702, 703 and the server 705. The network 704 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

A user may use the terminal devices 701, 702, 703 to interact with the server 705 via the network 704, for example to receive or send messages. Various communication client applications, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, and social platform software, may be installed on the terminal devices 701, 702, 703.

The terminal devices 701, 702, 703 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.

The server 705 may be a server providing various services, such as a background management server providing support for data integration requests initiated by users with the terminal devices 701, 702, 703. The background management server may analyze and process the received data, such as the data integration request, and feed back the processing result (for example, the data integration result) to the terminal device.

It should be noted that, the data integration method provided in the embodiment of the present invention is generally executed by the server 705, and accordingly, the data integration device is generally disposed in the server 705.

It should be understood that the number of terminal devices, networks and servers in fig. 7 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Referring now to FIG. 8, there is illustrated a schematic diagram of a computer system 800 suitable for use in implementing an electronic device of an embodiment of the present invention. The computer system shown in fig. 8 is merely an example, and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.

As shown in fig. 8, the computer system 800 includes a Central Processing Unit (CPU) 801 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the system 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

Connected to the I/O interface 805 are: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD), a speaker, and the like; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, and the like. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as needed, so that a computer program read from it can be installed into the storage section 808 as needed.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 809, and/or installed from the removable media 811. The above-described functions defined in the system of the present invention are performed when the computer program is executed by a Central Processing Unit (CPU) 801.

The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules involved in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor, which may be described as, for example, a processor including a marking module, a generating module, and a querying module. In some cases the names of these modules do not limit the modules themselves; for example, the marking module may also be described as a "module that marks metadata in multiple data sets in parallel".

As a further aspect, the invention also provides a computer readable medium, which may be included in the device described in the above embodiments or may exist alone without being assembled into the device. The computer readable medium carries one or more programs which, when executed by the device, cause the device to perform the following: setting tags on metadata in a plurality of data sets in parallel and storing the obtained tag information of the metadata in a distributed search engine, wherein the tag information of the metadata comprises the identifier of the data set in which the metadata is located; parsing a preset data set integration rule to generate an aggregation query statement corresponding to the data set integration rule; and performing an aggregation query on the tag information of the metadata based on the aggregation query statement to obtain an aggregation query result.

The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims

1. A data integration method, characterized in that the method comprises:

Setting tags for metadata in multiple data sets in parallel, and storing the obtained tag information of the metadata in a distributed search engine; wherein the tag information of the metadata includes the identifier of the data set where the metadata is located;

Parsing the preset data set integration rules to generate corresponding aggregation query statements;

Performing an aggregate query on the tag information of the metadata based on the aggregate query statement to obtain an aggregate query result;

The step of setting tags for metadata in multiple data sets in parallel and storing the obtained metadata tag information in a distributed search engine includes:

Each of the multiple data sets is split into multiple data blocks, and tags are set in parallel for the metadata in the multiple data blocks; wherein setting tags in parallel for the metadata in the multiple data blocks includes: for the metadata in each data block, determining whether the identifier of the data set where the metadata is located exists in the tag set corresponding to the metadata; if the determination result is no, adding the identifier of the data set where the metadata is located to the tag set corresponding to the metadata, and then performing an insert update operation on the distributed search engine according to the tag set corresponding to the metadata.

2. The method according to claim 1, characterized in that the step of parsing the preset data set integration rule to generate the corresponding aggregation query statement comprises:

Converting a preset data set integration rule into a digital expression; wherein the digital expression is composed of digital operands and operators; generating a corresponding aggregation query statement according to the digital expression; wherein the operator includes at least one of the following: an intersection operator, a union operator, or a difference operator.

3. The method according to claim 1, characterized in that the distributed search engine includes an Elasticsearch engine.

4. A data integration device, characterized in that the device comprises:

A labeling module, used to set labels for metadata in multiple data sets in parallel, and store the obtained label information of the metadata into a distributed search engine; wherein the label information of the metadata includes the identifier of the data set where the metadata is located;

A generation module, used to parse the preset data set integration rules to generate corresponding aggregation query statements;

A query module, used to perform an aggregate query on the label information of the metadata based on the aggregate query statement to obtain an aggregate query result;

The labeling module setting labels for metadata in multiple data sets in parallel, and storing the obtained label information of the metadata into the distributed search engine, includes:

The labeling module splits each of the multiple data sets into multiple data blocks, and sets labels for the metadata in the multiple data blocks in parallel; wherein the labeling module setting labels for the metadata in the multiple data blocks in parallel includes: for the metadata in each data block, the labeling module determines whether the identifier of the data set where the metadata is located exists in the label set corresponding to the metadata; if the determination result is no, the labeling module adds the identifier of the data set where the metadata is located to the label set corresponding to the metadata, and then performs an insert update operation on the distributed search engine according to the label set corresponding to the metadata.

5. The device according to claim 4, characterized in that the generation module parsing the preset data set integration rule to generate the corresponding aggregation query statement comprises:

The generation module converts the preset data set integration rules into a digital expression; wherein the digital expression is composed of digital operands and operators; the generation module generates an aggregation query statement corresponding to the digital expression according to the digital expression; wherein the operator includes at least one of the following: an intersection operator, a union operator or a difference operator.

6. The device according to claim 4, characterized in that the distributed search engine includes an Elasticsearch engine.

7. An electronic device, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1 to 3.

8. A computer-readable medium having a computer program stored thereon, wherein when the program is executed by a processor, the method according to any one of claims 1 to 3 is implemented.