[go: up one dir, main page]

CN112818026B - Data integration method and device - Google Patents

Data integration method and device Download PDF

Info

Publication number
CN112818026B
CN112818026B CN201911121056.1A CN201911121056A CN112818026B CN 112818026 B CN112818026 B CN 112818026B CN 201911121056 A CN201911121056 A CN 201911121056A CN 112818026 B CN112818026 B CN 112818026B
Authority
CN
China
Prior art keywords
metadata
data
integration
data set
search engine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911121056.1A
Other languages
Chinese (zh)
Other versions
CN112818026A (en
Inventor
李中林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201911121056.1A priority Critical patent/CN112818026B/en
Publication of CN112818026A publication Critical patent/CN112818026A/en
Application granted granted Critical
Publication of CN112818026B publication Critical patent/CN112818026B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471Distributed queries
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本发明公开了一种数据整合方法和装置,涉及计算机技术领域。其中,该方法包括:对多个数据集中的元数据并行设置标签,并将得到的元数据的标签信息存入分布式检索引擎;其中,所述元数据的标签信息包括所述元数据所在的数据集标识;对预先设置的数据集整合规则进行解析,以生成与之对应的聚合查询语句;基于所述聚合查询语句对所述元数据的标签信息进行聚合查询,以得到聚合查询结果。通过以上步骤,能够提高数据整合的处理效率,降低整合处理过程对单机服务器内存和CPU等资源的消耗,提高系统的稳定性和可用性。

The present invention discloses a data integration method and device, and relates to the field of computer technology. The method includes: setting labels for metadata in multiple data sets in parallel, and storing the obtained metadata label information in a distributed search engine; wherein the metadata label information includes the dataset identifier of the metadata; parsing the pre-set dataset integration rules to generate corresponding aggregate query statements; performing aggregate query on the metadata label information based on the aggregate query statement to obtain aggregate query results. Through the above steps, the processing efficiency of data integration can be improved, the consumption of single-machine server memory and CPU and other resources during the integration process can be reduced, and the stability and availability of the system can be improved.

Description

Data integration method and device
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data integration method and apparatus.
Background
In the prior art, manual processing and system-specific processing are mainly used when integrating multisource data sets. In the manual processing mode, operators can integrate and process the multi-source data set by adopting document tools such as Excel. For a small amount of data, the processing mode is low in cost and fast in processing. However, in the case of a large data volume, the manual processing method cannot be basically completed.
For the case of large data volume, the integration processing is mainly performed by a specific system, and the processing flow includes loading a plurality of data sets into a memory of a certain server, and then sequentially performing the integration processing on the plurality of data sets according to a preset integration rule. For example, assuming that there are four data sets data set1、dataset2、dataset3、dataset4 and the predetermined integration rule is data set1∩dataste2∪dataset3-dataset4, after loading the four data sets into the memory, the intersection is first taken for data sets data set1 and data set2, then the union is taken with data set data set3, and then the difference is taken with data set data set4.
In the process of realizing the invention, the inventor finds that the existing data integration scheme based on the specific system has at least the following problems that the first data integration process can only be executed according to the sequence of the integration rule, the processing process can not be executed in parallel, so that the data integration processing efficiency is low, the processing time can not be evaluated, the second data integration process needs to load all data sets into the memory of a certain server, so that the memory consumption of a single server is larger, and the third data integration process needs longer, so that the CPU consumption of the single server is higher in a long time, thereby influencing the stability and usability of the whole system.
Disclosure of Invention
In view of this, the present invention provides a data integration method and apparatus, which can improve the processing efficiency of data integration, reduce the consumption of resources such as memory and CPU of a stand-alone server during the integration processing, and improve the stability and usability of the system.
To achieve the above object, according to one aspect of the present invention, there is provided a data integration method.
The data integration method comprises the steps of setting tags on metadata in a plurality of data sets in parallel, storing tag information of the obtained metadata into a distributed search engine, analyzing preset data set integration rules to generate an aggregation query statement corresponding to the metadata, and conducting aggregation query on the tag information of the metadata based on the aggregation query statement to obtain an aggregation query result.
The method comprises the steps of dividing each of a plurality of data sets into a plurality of data blocks and setting tags for metadata in the plurality of data blocks in parallel, wherein the step of setting tags for the metadata in the plurality of data blocks in parallel comprises the steps of judging whether a data set identifier of the metadata in each data block exists in a tag set corresponding to the metadata or not, and adding the data set identifier of the metadata in the tag set corresponding to the metadata in the case that the judgment result is negative, and then executing an insertion updating operation for the distributed search engine according to the tag set corresponding to the metadata.
Optionally, the step of analyzing the preset data set integration rule to generate the corresponding aggregation query statement comprises the steps of converting the preset data set integration rule into a digital expression, wherein the digital expression is composed of a digital operand and an operator, generating the corresponding aggregation query statement according to the digital expression, and the operator comprises at least one of an intersection operator, a union operator or a difference operator.
Optionally, the distributed search engine comprises a ELASTIC SEARCH engine.
To achieve the above object, according to another aspect of the present invention, there is provided a data integrating apparatus.
The data integration device comprises a marking module and a query module, wherein the marking module is used for parallelly setting labels for metadata in a plurality of data sets and storing label information of the obtained metadata into a distributed search engine, the label information of the metadata comprises a data set identifier of the metadata, the generation module is used for analyzing a preset data set integration rule to generate an aggregation query statement corresponding to the preset data set integration rule, and the query module is used for conducting aggregation query on the label information of the metadata based on the aggregation query statement to obtain an aggregation query result.
Optionally, the marking module sets tags on metadata in a plurality of data sets in parallel, and stores the obtained tag information of the metadata into a distributed search engine, wherein the marking module splits each of the plurality of data sets into a plurality of data blocks, and sets tags on the metadata in the plurality of data blocks in parallel; the marking module is used for parallelly setting labels for metadata in the plurality of data blocks, wherein the marking module is used for judging whether a data set identifier of the metadata exists in a label set corresponding to the metadata or not for the metadata in each data block, and if not, the marking module is used for adding the data set identifier of the metadata into the label set corresponding to the metadata and then executing insertion updating operation for a distributed search engine according to the label set corresponding to the metadata.
Optionally, the generating module analyzes a preset data set integration rule to generate an aggregate query statement corresponding to the preset data set integration rule, wherein the generating module converts the preset data set integration rule into a digital expression, the digital expression is composed of a digital operand and an operator, the generating module generates the aggregate query statement corresponding to the digital expression according to the digital expression, and the operator comprises at least one of an intersection operator, a union operator or a difference operator.
Optionally, the distributed search engine comprises a ELASTIC SEARCH engine.
To achieve the above object, according to still another aspect of the present invention, there is provided an electronic apparatus.
The electronic device comprises one or more processors and a storage device, wherein the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors are enabled to realize the data integration method.
To achieve the above object, according to still another aspect of the present invention, a computer-readable medium is provided.
The computer readable medium of the present invention has stored thereon a computer program which, when executed by a processor, implements the data integration method of the present invention.
The embodiment of the invention has the advantages that the method has the steps of parallelly setting the tags on the metadata in a plurality of data sets, storing the tag information of the obtained metadata into a distributed search engine, analyzing the preset data set integration rule to generate an aggregation query statement corresponding to the preset data set integration rule, and carrying out aggregation query on the tag information of the metadata based on the aggregation query statement to obtain an aggregation query result, so that the processing efficiency of data integration can be improved, the consumption of resources such as a stand-alone server memory, a CPU (Central processing unit) and the like in the integration processing process can be reduced, and the stability and the usability of the system can be improved.
Further effects of the above-described non-conventional alternatives are described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic flow chart of a data integration method according to a first embodiment of the invention;
FIG. 2 is a schematic flow chart of a data integration method according to a second embodiment of the invention;
FIG. 3 is an exemplary flow diagram of metadata parallel marking in accordance with a second embodiment of the present invention;
FIG. 4 is a schematic diagram of a storage structure of metadata tag information in a distributed search engine according to a second embodiment of the present invention;
FIG. 5 is a schematic diagram of the main modules of a data integration device according to a third embodiment of the invention;
FIG. 6 is a schematic diagram of the main modules of a data integration system according to a fourth embodiment of the invention;
FIG. 7 is an exemplary system architecture diagram in which embodiments of the present invention may be applied;
fig. 8 is a schematic diagram of a computer system suitable for use in implementing an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It is noted that embodiments of the invention and features of the embodiments may be combined with each other without conflict.
Example 1
Fig. 1 is a schematic flow chart of a data integration method according to a first embodiment of the invention. As shown in fig. 1, the data integration method according to the embodiment of the present invention includes:
step S101, the tag information is parallelly set for the metadata in a plurality of data sets, and the obtained tag information of the metadata is stored in a distributed search engine.
The metadata is understood to be an attribute value that can distinguish between or uniquely represent each element in the data set, and may be one attribute value of the data in the data set or a combination of several attribute values of the data in the data set. Taking the example of a user crowd data set obtained from multiple data sources (such as databases, text files, MQ message queues, etc.) in an e-commerce platform, the metadata may be a user account value (which may be simply referred to as a "user PIN").
In this step, metadata of a plurality of data sets to be integrated may be marked in parallel based on a timed worker (a task running framework) tool or a loop thread, etc. (the marking may be understood as metadata setting labels) so as to improve marking efficiency. For example, assuming a total of three data sets data set1、dataset2 and data set3, a timing woker tool may be started for each data set to mark the metadata of the three data sets in parallel. In order to further improve the marking efficiency, each of the three data sets can be further split into a plurality of data blocks, and metadata in the plurality of data blocks can be marked in parallel.
The tag information of the metadata comprises a data set identifier where the metadata is located. For example, when marking metadata in a dataset data set1, an identification of the dataset (such as "No set1" or "1" or the like) may be added to a data structure for storing tag information of the metadata. Wherein, the data structure for storing the tag information of the metadata may be a tuple, a linked list, or the like.
Wherein the distributed search engine may comprise a ELASTIC SEARCH engine. ELASTIC SEARCH engine, ES for short, is a Lucene-based search engine.
Step S102, analyzing a preset data set integration rule to generate an aggregation query statement corresponding to the data set integration rule.
In one example, the dataset integration rule may be represented by a suffix expression. A suffix expression is a general expression of an arithmetic or logical formula in which an operator is in the middle of an operand in the form of a suffix. For example, a certain dataset integration rule expressed in a suffix expression is data set1∩dataste2∪dataset3-dataset4.
In the step, the aggregation query statement corresponding to the data set integration rule can be obtained by analyzing the preset data set integration rule. Illustratively, assuming that the distributed search engine is ELASTIC SEARCH and the data set integration rule is data set1∩dataste2∪dataset3-dataset4, the parsed aggregate query statement corresponding thereto may be expressed as:
and step S103, performing aggregation query on the tag information of the metadata based on the aggregation query statement to obtain an aggregation query result.
In the embodiment of the invention, the metadata in a plurality of data sets are parallelly provided with the tags, the obtained tag information of the metadata is stored in the distributed search engine, the preset data set integration rule is analyzed to generate the corresponding aggregation query statement, and the aggregation query is carried out on the tag information of the metadata based on the aggregation query statement to obtain the aggregation query result, so that the processing efficiency of data integration can be improved, the consumption of resources such as a stand-alone server memory, a CPU (Central processing unit) and the like in the integration processing process can be reduced, and the stability and the usability of the system are improved.
Example two
Fig. 2 is a schematic flow chart of a data integration method according to a second embodiment of the invention. As shown in fig. 2, the data integration method according to the embodiment of the present invention includes:
Step S201, extracting a plurality of data sets from different data sources based on a pre-configured key data extraction rule.
In particular, different forms of key data extraction rules (or metadata extraction rules) may be set for different data sources to obtain multiple data sets from different data sources. For example, for a data source such as a text file, the key data extraction rules may be represented by regular expressions, and for a data source such as a database, the key data extraction rules may be represented by one or more screening fields in the database.
Step S202, the metadata in the data sets are parallelly provided with tags, and the tag information of the obtained metadata is stored in the distributed search engine.
The metadata is understood to be an attribute value that can distinguish between or uniquely represent each element in the data set, and may be one attribute value of the data in the data set or a combination of several attribute values of the data in the data set. Taking the example of a user crowd data set obtained from multiple data sources (such as databases, text files, MQ message queues, etc.) in an e-commerce platform, the metadata may be a user account value (which may be simply referred to as a "user PIN").
In this step, metadata of a plurality of data sets to be integrated may be marked in parallel based on a timed worker (a task running framework) tool or a loop thread, etc. (the marking may be understood as metadata setting labels) so as to improve marking efficiency. For example, assuming a total of three data sets data set1、dataset2 and data set3, a timing woker tool may be started for each data set to mark the metadata of the three data sets in parallel.
In order to further improve the marking efficiency, each of the plurality of data sets may be further split into a plurality of data blocks, and metadata in the plurality of data blocks may be marked in parallel.
The tag information of the metadata comprises a data set identifier where the metadata is located. For example, when metadata in the dataset data set1 is marked, an identification of the dataset (e.g., "No set1" or "1" etc.) may be added to the set of tags corresponding to the metadata. When the metadata in the dataset data set2 is marked, the identification (such as "No set2" or "2") of the dataset may be added to the tag set corresponding to the metadata. The tag set corresponding to the metadata refers to a data structure for storing tag information of the metadata, and the data structure may be an array, a linked list, or the like.
In this step, metadata in a plurality of data sets are marked in parallel, and tag information of the marked metadata is stored in a distributed search engine, so that the summarized metadata tag information (as shown in fig. 4) can be finally obtained. For example, assuming that both data set data set1 and data set data set2 have the same metadata (e.g., user account value "1213"), then after the completion of the tagging, the distributed search engine has metadata tag information, [1,2], which indicates that the metadata is in both data set data set1 and data set data set2.
Step 203, converting the preset data set integration rule into a digital expression.
In one example, the dataset integration rule may be represented by a suffix expression. A suffix expression is a general expression of an arithmetic or logical formula in which an operator is in the middle of an operand in the form of a suffix. For example, a certain dataset integration rule expressed in a suffix expression is data set1∩dataste2∪dataset3-dataset4.
Wherein the digitized expression is comprised of digitized operands and operators in digital form, the operators including at least one of an intersection operator, a union operator, or a difference operator.
For example, assuming that the dataset integration rule is "data set1∩dataste2∪dataset3-dataset4", it can be converted to "1 n 2 3-4" by this step.
And step S204, generating an aggregation query statement corresponding to the digital expression according to the digital expression.
In the step, the digitized expression can be scanned, when the operator is encountered in scanning, a query clause can be constructed based on the operator and the operand, and then all constructed query clauses are nested and packaged to obtain an aggregated query statement corresponding to the query clause.
For example, assuming that the distributed search engine is ELASTIC SEARCH engine and the digitized expression obtained in step S203 is "1 n 2U 3-4", when the scan encounters the intersection operator "n", a query clause of "filter" { "item" { "tag": [1,2] }, when the scan encounters the intersection operator "u", a query clause of "should" { "item" { "tag":3}, and when the scan encounters the intersection operator "-, a query clause of" multiple_not "{" item "{ item" } 4 }, and then, the query clause is packaged and packaged to correspond to the query clause.
Step 205, performing an aggregate query on the tag information of the metadata based on the aggregate query statement to obtain an aggregate query result.
In the embodiment of the invention, the marking efficiency is improved by marking the metadata in a plurality of data sets in parallel, the tag information of the metadata obtained by marking is stored in the distributed search engine, the aggregation query statement corresponding to the preset data set integration rule is generated by analyzing the preset data set integration rule, and the tag information of the metadata is aggregated and queried based on the aggregation query statement, so that the processing efficiency of data integration can be improved, the consumption of resources such as a stand-alone server memory, a CPU (Central processing unit) and the like in the integration processing process can be reduced, and the stability and the usability of the system can be improved.
Fig. 3 is an exemplary flow diagram of metadata parallel marking in accordance with a second embodiment of the present invention. After splitting the data set into a plurality of data blocks, metadata in the plurality of data blocks may be marked in parallel based on the flow shown in fig. 3. As shown in fig. 3, an exemplary flow of parallel marking of metadata in a data block includes:
Step S301, obtaining a tag set corresponding to metadata in a data block in a distributed search engine.
Step S302, judging whether a tag set corresponding to the metadata exists or not. If there is a tag set corresponding to the metadata, step S303 may be performed, otherwise, step S304 may be performed.
In this step, if the tag set corresponding to the metadata is acquired from the distributed search engine, it is determined that the tag set corresponding to the metadata exists, and if the tag set corresponding to the metadata is not acquired from the distributed search engine, it is determined that the tag set corresponding to the metadata does not exist.
Step S303, judging whether the data set identifier of the metadata exists in the tag set. If the data set identifier of the metadata does not exist in the tag set, step S305 may be performed, otherwise, step S307 may be performed.
Step S304, creating a label set corresponding to the metadata.
Step S305, adding the data set identifier of the metadata to the tag set. After step S305, step S306 may be performed.
For example, when marking metadata in the data Block1, the identification of the data set (such as "No set1" or "1" etc.) may be added to the tag set corresponding to the metadata, assuming that the data Block belongs to the data set data set1, and when marking metadata in the data Block1, the identification of the data set (such as "No set2" or "2" etc.) may be added to the tag set corresponding to the metadata, assuming that the data Block belongs to the data set data set2.
And step S306, performing insertion updating operation on the distributed search engine according to the tag set corresponding to the metadata.
For example, when the distributed search engine is ELASTIC SEARCH, the tag information stored in the tag set corresponding to the metadata may be stored in the ELASTIC SEARCH engine through upsert operation.
Step S307, acquire the next metadata in the data block. After step S307, the next metadata acquired may be executed to execute step S301.
In the embodiment of the invention, metadata of a plurality of data sets can be marked in parallel through the steps, so that the marking efficiency is improved, and furthermore, the running pressure of a single server can be effectively reduced through the parallel marking in a distributed environment.
Fig. 5 is a schematic diagram of main modules of a data integration apparatus according to a third embodiment of the invention. As shown in fig. 5, the data integration apparatus 500 according to the embodiment of the present invention includes a marking module 501, a generating module 502, and a querying module 503.
The marking module 501 is configured to set labels for metadata in multiple data sets in parallel, and store label information of the obtained metadata in the distributed search engine.
The metadata is understood to be an attribute value that can distinguish between or uniquely represent each element in the data set, and may be one attribute value of the data in the data set or a combination of several attribute values of the data in the data set. Taking the example of a user crowd data set obtained from multiple data sources (such as databases, text files, MQ message queues, etc.) in an e-commerce platform, the metadata may be a user account value (which may be simply referred to as a "user PIN").
Illustratively, the marking module 501 may perform parallel marking (where the marking may be understood as that metadata sets a tag) on metadata of multiple data sets to be integrated based on a timed worker (a task running framework) tool or a loop thread, etc. to improve marking efficiency. For example, assuming a total of three data sets data set1、dataset2 and data set3, a timing woker tool may be started for each data set to mark the metadata of the three data sets in parallel. In order to further improve the marking efficiency, each of the three data sets can be further split into a plurality of data blocks, and metadata in the plurality of data blocks can be marked in parallel.
The tag information of the metadata comprises a data set identifier where the metadata is located. For example, when metadata in the dataset data set1 is marked, an identification of the dataset (e.g., "No set1" or "1" etc.) may be added to the set of tags corresponding to the metadata. Wherein, the tag set refers to a data structure for storing tag information of the metadata, which may be a tuple, a linked list, or the like.
Wherein the distributed search engine may comprise a ELASTIC SEARCH engine. ELASTIC SEARCH engine, ES for short, is a Lucene-based search engine.
The generating module 502 is configured to parse a preset data set integration rule to generate an aggregate query statement corresponding to the data set integration rule.
Illustratively, the generating module 502 analyzes the preset data set integration rule to generate an aggregate query statement corresponding to the preset data set integration rule, wherein the generating module 502 converts the preset data set integration rule into a digital expression, the digital expression consists of a digital operand and an operator, and the generating module 502 generates the aggregate query statement corresponding to the digital expression according to the digital expression. Wherein the operators include at least one of a fetch intersection operator, a fetch union operator, or a fetch difference operator.
In one example, the dataset integration rule may be represented by a suffix expression. A suffix expression is a general expression of an arithmetic or logical formula in which an operator is in the middle of an operand in the form of a suffix. For example, a certain dataset integration rule expressed in a suffix expression is data set1∩dataste2∪dataset3-dataset4. Further, assuming that the distributed search engine is ELASTIC SEARCH engine, the aggregate query statement corresponding to the dataset integration rule "data set1∩dataste2∪dataset3-dataset4" generated by the generation module 502 may be expressed as:
And a query module 503, configured to perform an aggregate query on the tag information of the metadata based on the aggregate query statement, so as to obtain an aggregate query result.
In the embodiment of the invention, the metadata in a plurality of data sets are marked in parallel through the marking module, the tag information of the marked metadata is stored in the distributed search engine, the preset data set integration rule is analyzed through the generating module to generate the corresponding aggregation query statement, and the tag information of the metadata is subjected to aggregation query based on the aggregation query statement through the query module to obtain the aggregation query result, so that the processing efficiency of data integration can be improved, the consumption of resources such as a stand-alone server memory and a CPU (Central processing Unit) in the integration processing process can be reduced, and the stability and the usability of the system are improved.
Fig. 6 is a schematic diagram of main modules of a data integration system according to a fourth embodiment of the invention. As shown in fig. 6, the data integration system 600 of the embodiment of the present invention includes a multi-source data set collecting device 601, a data integrating device 602, an integration rule configuration terminal 603, a data distribution engine 604, a data output terminal 605, and a distributed search engine 606.
A multi-source data set collection means 601 for extracting a plurality of data sets from different data sources based on pre-configured key data extraction rules. The data source may be, for example, a text file, a database, or a MQ (Message Queue) message queue.
In particular, for different data sources, different forms of key data extraction rules (or metadata extraction rules) may be set in the multi-source data set collection device 601 to obtain multiple data sets from different data sources. For example, for a data source such as a text file, the key data extraction rules may be represented by regular expressions, and for a data source such as a database, the key data extraction rules may be represented by one or more screening fields in the database.
An integration rule configuration terminal 603 is configured to configure the data set integration rule. In the integrated rule configuration terminal 603, a set of functional operation controls may be designed on the interactive interface. For example, functional operation controls for adding, editing, deleting, and querying the integration rules may be designed to facilitate flexible configuration and management of the integration rules by system users. Wherein, the integration rule can comprise at least one operator of taking intersection operators (expressed by mathematical symbol #), taking union operators (expressed by mathematical symbol #), taking difference operators (expressed by mathematical symbol-). Illustratively, if the dataset integration rule is configured as data set1∩dataset2, it means that the same data in dataset data set1 and dataset data set2 is fetched, if the dataset integration rule is configured as data set1∪dataset2, it means that the union is fetched for dataset data set1 and dataset data set2, and if the dataset integration rule is configured as data set1-dataset2, it means that the difference is fetched for dataset data set1 and dataset data set2, resulting in a collection of data that is present in dataset data set1 but not in dataset data set2.
In particular, after the data set integration rule is configured by the integration rule configuration terminal 603, the configured data set integration rule may be persistently stored. For example, the configured data set integration rules may be stored in a database or a centralized cache.
The data integration device 602 is configured to set labels for metadata in multiple data sets in parallel, store label information of the obtained metadata in the distributed search engine 606, parse preset data set integration rules to generate an aggregate query statement corresponding to the preset data set integration rules, and perform an aggregate query on the label information of the metadata based on the aggregate query statement to obtain an aggregate query result.
The distributed search engine 606 may employ ELASTIC SEARCH engine. ELASTIC SEARCH engine, ES for short, is a Lucene-based search engine. The ES engine can realize real-time query of large data volume, which provides basis for integration of multiple data sets in the embodiment of the invention.
The data distribution engine 604 is configured to distribute the aggregated query result, i.e. the data after the integration processing. For example, the integrated data may be distributed to a text file, a database, an MQ message queue, a network cache (e.g., redis cluster).
And the data output terminal 605 is used for outputting and displaying the integrated data.
The embodiment of the invention builds a complete data integration system, can realize autonomous configuration of multi-source data set acquisition rules and data set integration rules, has wide applicability and usability, can effectively avoid the defects existing in the data integration in the prior art, improves the integration efficiency of the multi-source data set, reduces the consumption of the data integration to the memory and the CPU of a single server, and improves the stability and the usability of the whole system.
Fig. 7 illustrates an exemplary system architecture 700 to which the data integration method or apparatus of embodiments of the present invention may be applied.
As shown in fig. 7, a system architecture 700 may include terminal devices 701, 702, 703, a network 704, and a server 705. The network 704 is the medium used to provide communication links between the terminal devices 701, 702, 703 and the server 705. The network 704 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 705 via the network 704 using the terminal devices 701, 702, 703 to receive or send messages or the like. Various communication client applications, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc., may be installed on the terminal devices 701, 702, 703.
The terminal devices 701, 702, 703 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 705 may be a server providing various services, such as a background management server providing support for data integration requests initiated by users with the terminal devices 701, 702, 703. The background management server may analyze and process the received data, such as the data integration request, and feed back the processing result (for example, the data integration result) to the terminal device.
It should be noted that, the data integration method provided in the embodiment of the present invention is generally executed by the server 705, and accordingly, the data integration device is generally disposed in the server 705.
It should be understood that the number of terminal devices, networks and servers in fig. 7 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 8, there is illustrated a schematic diagram of a computer system 800 suitable for use in implementing an electronic device of an embodiment of the present invention. The computer system shown in fig. 8 is merely an example, and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in fig. 8, the computer system 800 includes a Central Processing Unit (CPU) 801 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the system 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Connected to the I/O interface 805 are an input section 806 including a keyboard, a mouse, and the like, an output section 807 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker, and the like, a storage section 808 including a hard disk, and the like, and a communication section 809 including a network interface card such as a LAN card, a modem, and the like. The communication section 809 performs communication processing via a network such as the internet. The drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as needed so that a computer program read out therefrom is mounted into the storage section 808 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 809, and/or installed from the removable media 811. The above-described functions defined in the system of the present invention are performed when the computer program is executed by a Central Processing Unit (CPU) 801.
The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor, which may be described as, for example, a processor including a marking module, a generating module, and a querying module. The names of these modules do not in some cases limit the module itself, and for example, the marking module may also be described as a "module that marks metadata in multiple data sets in parallel".
As a further aspect, the invention also provides a computer readable medium which may be comprised in the device described in the above embodiments or may be present alone without being fitted into the device. The computer readable medium carries one or more programs, when the one or more programs are executed by the device, the device executes the following processes of setting tags on metadata in a plurality of data sets in parallel and storing tag information of the obtained metadata into a distributed search engine, wherein the tag information of the metadata comprises a data set identifier of the metadata, analyzing a preset data set integration rule to generate an aggregation query statement corresponding to the data set integration rule, and performing aggregation query on the tag information of the metadata based on the aggregation query statement to obtain an aggregation query result.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (8)

1.一种数据整合方法,其特征在于,所述方法包括:1. A data integration method, characterized in that the method comprises: 对多个数据集中的元数据并行设置标签,并将得到的元数据的标签信息存入分布式检索引擎;其中,所述元数据的标签信息包括所述元数据所在的数据集标识;Setting tags for metadata in multiple data sets in parallel, and storing the obtained tag information of the metadata in a distributed search engine; wherein the tag information of the metadata includes the identifier of the data set where the metadata is located; 对预先设置的数据集整合规则进行解析,以生成与之对应的聚合查询语句;Parse the pre-set data set integration rules to generate corresponding aggregation query statements; 基于所述聚合查询语句对所述元数据的标签信息进行聚合查询,以得到聚合查询结果;Performing an aggregate query on the tag information of the metadata based on the aggregate query statement to obtain an aggregate query result; 所述对多个数据集中的元数据并行设置标签,并将得到的元数据的标签信息存入分布式检索引擎的步骤包括:The step of setting tags for metadata in multiple data sets in parallel and storing the obtained metadata tag information in a distributed search engine includes: 将所述多个数据集中的每一个拆分成多个数据块,并对所述多个数据块中的元数据并行设置标签;其中,所述对所述多个数据块中的元数据并行设置标签包括:对于每个数据块中的元数据,判断所述元数据所在的数据集标识是否存在于与所述元数据对应的标签集合中;在判断结果为否的情况下,将所述元数据所在的数据集标识添加至与所述元数据对应的标签集合中,然后根据所述元数据对应的标签集合对分布式检索引擎执行插入更新操作。Each of the multiple data sets is split into multiple data blocks, and tags are set in parallel for the metadata in the multiple data blocks; wherein, the tag setting in parallel for the metadata in the multiple data blocks includes: for the metadata in each data block, determining whether the data set identifier where the metadata is located exists in the tag set corresponding to the metadata; if the determination result is no, adding the data set identifier where the metadata is located to the tag set corresponding to the metadata, and then performing an insert update operation on the distributed retrieval engine according to the tag set corresponding to the metadata. 2.根据权利要求1所述的方法,其特征在于,所述对预先设置的数据集整合规则进行解析,以生成与之对应的聚合查询语句的步骤包括:2. The method according to claim 1, characterized in that the step of parsing the preset data set integration rule to generate the corresponding aggregation query statement comprises: 将预先设置的数据集整合规则转换为数字化表达式;其中,所述数字化表达式由数字化的操作数和操作符构成;根据所述数字化表达式生成与之对应的聚合查询语句;其中,所述操作符包括以下至少一项:取交集操作符、取并集操作符或者取差集操作符。Converting a preset data set integration rule into a digital expression; wherein the digital expression is composed of digital operands and operators; generating a corresponding aggregation query statement according to the digital expression; wherein the operator includes at least one of the following: an intersection operator, a union operator, or a difference operator. 3.根据权利要求1所述的方法,其特征在于,所述分布式检索引擎包括:ElasticSearch引擎。3. The method according to claim 1 is characterized in that the distributed search engine includes: an ElasticSearch engine. 4.一种数据整合装置,其特征在于,所述装置包括:4. A data integration device, characterized in that the device comprises: 打标模块,用于对多个数据集中的元数据并行设置标签,并将得到的元数据的标签信息存入分布式检索引擎;其中,所述元数据的标签信息包括所述元数据所在的数据集标识;A labeling module, used to set labels for metadata in multiple data sets in parallel, and store the obtained label information of the metadata into a distributed search engine; wherein the label information of the metadata includes the identifier of the data set where the metadata is located; 生成模块,用于对预先设置的数据集整合规则进行解析,以生成与之对应的聚合查询语句;A generation module is used to parse the preset data set integration rules to generate corresponding aggregation query statements; 查询模块,用于基于所述聚合查询语句对所述元数据的标签信息进行聚合查询,以得到聚合查询结果;A query module, used to perform an aggregate query on the tag information of the metadata based on the aggregate query statement to obtain an aggregate query result; 所述打标模块对多个数据集中的元数据并行设置标签,并将得到的元数据的标签信息存入分布式检索引擎包括:The tagging module sets tags for metadata in multiple data sets in parallel, and stores the obtained metadata tag information into the distributed search engine, including: 所述打标模块将所述多个数据集中的每一个拆分成多个数据块,并对所述多个数据块中的元数据并行设置标签;其中,所述打标模块对所述多个数据块中的元数据并行设置标签包括:对于每个数据块中的元数据,所述打标模块判断所述元数据所在的数据集标识是否存在于与所述元数据对应的标签集合中;在判断结果为否的情况下,所述打标模块将所述元数据所在的数据集标识添加至与所述元数据对应的标签集合中,然后根据所述元数据对应的标签集合对分布式检索引擎执行插入更新操作。The labeling module splits each of the multiple data sets into multiple data blocks, and sets labels for the metadata in the multiple data blocks in parallel; wherein, the labeling module sets labels for the metadata in the multiple data blocks in parallel including: for the metadata in each data block, the labeling module determines whether the data set identifier where the metadata is located exists in the label set corresponding to the metadata; if the judgment result is no, the labeling module adds the data set identifier where the metadata is located to the label set corresponding to the metadata, and then performs an insert update operation on the distributed retrieval engine according to the label set corresponding to the metadata. 5.根据权利要求4所述的装置,其特征在于,所述生成模块对预先设置的数据集整合规则进行解析,以生成与之对应的聚合查询语句包括:5. The device according to claim 4, characterized in that the generation module parses the preset data set integration rule to generate the corresponding aggregation query statement comprising: 所述生成模块将预先设置的数据集整合规则转换为数字化表达式;其中,所述数字化表达式由数字化的操作数和操作符构成;所述生成模块根据所述数字化表达式生成与之对应的聚合查询语句;其中,所述操作符包括以下至少一项:取交集操作符、取并集操作符或者取差集操作符。The generation module converts the preset data set integration rules into a digital expression; wherein the digital expression is composed of digital operands and operators; the generation module generates an aggregation query statement corresponding to the digital expression according to the digital expression; wherein the operator includes at least one of the following: an intersection operator, a union operator or a difference operator. 6.根据权利要求4所述的装置,其特征在于,所述分布式检索引擎包括:ElasticSearch引擎。6. The device according to claim 4 is characterized in that the distributed search engine includes: an ElasticSearch engine. 7.一种电子设备,其特征在于,包括:7. An electronic device, comprising: 一个或多个处理器;one or more processors; 存储装置,用于存储一个或多个程序,a storage device for storing one or more programs, 当所述一个或多个程序被所述一个或多个处理器执行,使得所述一个或多个处理器实现如权利要求1至3中任一所述的方法。When the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1 to 3. 8.一种计算机可读介质,其上存储有计算机程序,其特征在于,所述程序被处理器执行时实现如权利要求1至3中任一所述的方法。8. A computer-readable medium having a computer program stored thereon, wherein when the program is executed by a processor, the method according to any one of claims 1 to 3 is implemented.
CN201911121056.1A 2019-11-15 2019-11-15 Data integration method and device Active CN112818026B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911121056.1A CN112818026B (en) 2019-11-15 2019-11-15 Data integration method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911121056.1A CN112818026B (en) 2019-11-15 2019-11-15 Data integration method and device

Publications (2)

Publication Number Publication Date
CN112818026A CN112818026A (en) 2021-05-18
CN112818026B true CN112818026B (en) 2025-07-15

Family

ID=75851765

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911121056.1A Active CN112818026B (en) 2019-11-15 2019-11-15 Data integration method and device

Country Status (1)

Country Link
CN (1) CN112818026B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115271820A (en) * 2022-08-05 2022-11-01 中信百信银行股份有限公司 Elasticissearch-based method and device for selecting big data crowd and electronic equipment
CN115186023B (en) * 2022-09-07 2022-12-06 杭州安恒信息技术股份有限公司 Data set generation method, device, equipment and medium
CN116881310B (en) * 2023-09-07 2023-11-14 卓望数码技术(深圳)有限公司 Method and device for calculating set of big data
CN117687970B (en) * 2024-02-02 2024-07-02 济南浪潮数据技术有限公司 Metadata retrieval method and device, electronic equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107145586A (en) * 2017-05-10 2017-09-08 中国电力科学研究院 A kind of label output method and apparatus based on power marketing data

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005339027A (en) * 2004-05-25 2005-12-08 Canon Inc Data processing apparatus, data processing method, and computer program
CN109947788B (en) * 2017-10-30 2021-10-15 北京京东尚科信息技术有限公司 Data query method and device

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107145586A (en) * 2017-05-10 2017-09-08 中国电力科学研究院 A kind of label output method and apparatus based on power marketing data

Also Published As

Publication number Publication date
CN112818026A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
US10521404B2 (en) Data transformations with metadata
CN112818026B (en) Data integration method and device
CN109614402B (en) Multidimensional data query method and device
CN110472207A (en) List generation method and device
CN113419789B (en) Method and device for generating data model script
CN113485763B (en) Data processing method, device, electronic device and computer readable medium
CN111104479A (en) Data labeling method and device
CN113515285B (en) Method and device for generating real-time calculation logic data
CN113760240A (en) Method and device for generating data model
CN112947919B (en) Method and device for building business model and processing business request
WO2023000785A1 (en) Data processing method, device and system, and server and medium
CN112948334A (en) Log processing method and device
CN116303428A (en) Data processing method and device
CN113760969A (en) Data query method and device based on elastic search
CN110188113B (en) Method, device and storage medium for comparing data by using complex expression
CN113742321A (en) Data updating method and device
US20220276985A1 (en) System and methods for massive data management and tagging
CN112784195A (en) Page data publishing method and system
CN113312053B (en) A method and device for data processing
CN112148705A (en) Data migration method and device
CN112988778B (en) Method and device for processing database query script
CN113760949B (en) Data query method and device
CN112825107B (en) Method and device for generating chart
CN112131287B (en) A method and device for reading data
CN110472055B (en) Method and device for marking data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant