Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It is noted that embodiments of the invention and features of the embodiments may be combined with each other without conflict.
Example 1
Fig. 1 is a schematic flow chart of a data integration method according to a first embodiment of the invention. As shown in fig. 1, the data integration method according to the embodiment of the present invention includes:
step S101, the tag information is parallelly set for the metadata in a plurality of data sets, and the obtained tag information of the metadata is stored in a distributed search engine.
The metadata is understood to be an attribute value that can distinguish between or uniquely represent each element in the data set, and may be one attribute value of the data in the data set or a combination of several attribute values of the data in the data set. Taking the example of a user crowd data set obtained from multiple data sources (such as databases, text files, MQ message queues, etc.) in an e-commerce platform, the metadata may be a user account value (which may be simply referred to as a "user PIN").
In this step, metadata of a plurality of data sets to be integrated may be marked in parallel based on a timed worker (a task running framework) tool or a loop thread, etc. (the marking may be understood as metadata setting labels) so as to improve marking efficiency. For example, assuming a total of three data sets data set1、dataset2 and data set3, a timing woker tool may be started for each data set to mark the metadata of the three data sets in parallel. In order to further improve the marking efficiency, each of the three data sets can be further split into a plurality of data blocks, and metadata in the plurality of data blocks can be marked in parallel.
The tag information of the metadata comprises a data set identifier where the metadata is located. For example, when marking metadata in a dataset data set1, an identification of the dataset (such as "No set1" or "1" or the like) may be added to a data structure for storing tag information of the metadata. Wherein, the data structure for storing the tag information of the metadata may be a tuple, a linked list, or the like.
Wherein the distributed search engine may comprise a ELASTIC SEARCH engine. ELASTIC SEARCH engine, ES for short, is a Lucene-based search engine.
Step S102, analyzing a preset data set integration rule to generate an aggregation query statement corresponding to the data set integration rule.
In one example, the dataset integration rule may be represented by a suffix expression. A suffix expression is a general expression of an arithmetic or logical formula in which an operator is in the middle of an operand in the form of a suffix. For example, a certain dataset integration rule expressed in a suffix expression is data set1∩dataste2∪dataset3-dataset4.
In the step, the aggregation query statement corresponding to the data set integration rule can be obtained by analyzing the preset data set integration rule. Illustratively, assuming that the distributed search engine is ELASTIC SEARCH and the data set integration rule is data set1∩dataste2∪dataset3-dataset4, the parsed aggregate query statement corresponding thereto may be expressed as:
and step S103, performing aggregation query on the tag information of the metadata based on the aggregation query statement to obtain an aggregation query result.
In the embodiment of the invention, the metadata in a plurality of data sets are parallelly provided with the tags, the obtained tag information of the metadata is stored in the distributed search engine, the preset data set integration rule is analyzed to generate the corresponding aggregation query statement, and the aggregation query is carried out on the tag information of the metadata based on the aggregation query statement to obtain the aggregation query result, so that the processing efficiency of data integration can be improved, the consumption of resources such as a stand-alone server memory, a CPU (Central processing unit) and the like in the integration processing process can be reduced, and the stability and the usability of the system are improved.
Example two
Fig. 2 is a schematic flow chart of a data integration method according to a second embodiment of the invention. As shown in fig. 2, the data integration method according to the embodiment of the present invention includes:
Step S201, extracting a plurality of data sets from different data sources based on a pre-configured key data extraction rule.
In particular, different forms of key data extraction rules (or metadata extraction rules) may be set for different data sources to obtain multiple data sets from different data sources. For example, for a data source such as a text file, the key data extraction rules may be represented by regular expressions, and for a data source such as a database, the key data extraction rules may be represented by one or more screening fields in the database.
Step S202, the metadata in the data sets are parallelly provided with tags, and the tag information of the obtained metadata is stored in the distributed search engine.
The metadata is understood to be an attribute value that can distinguish between or uniquely represent each element in the data set, and may be one attribute value of the data in the data set or a combination of several attribute values of the data in the data set. Taking the example of a user crowd data set obtained from multiple data sources (such as databases, text files, MQ message queues, etc.) in an e-commerce platform, the metadata may be a user account value (which may be simply referred to as a "user PIN").
In this step, metadata of a plurality of data sets to be integrated may be marked in parallel based on a timed worker (a task running framework) tool or a loop thread, etc. (the marking may be understood as metadata setting labels) so as to improve marking efficiency. For example, assuming a total of three data sets data set1、dataset2 and data set3, a timing woker tool may be started for each data set to mark the metadata of the three data sets in parallel.
In order to further improve the marking efficiency, each of the plurality of data sets may be further split into a plurality of data blocks, and metadata in the plurality of data blocks may be marked in parallel.
The tag information of the metadata comprises a data set identifier where the metadata is located. For example, when metadata in the dataset data set1 is marked, an identification of the dataset (e.g., "No set1" or "1" etc.) may be added to the set of tags corresponding to the metadata. When the metadata in the dataset data set2 is marked, the identification (such as "No set2" or "2") of the dataset may be added to the tag set corresponding to the metadata. The tag set corresponding to the metadata refers to a data structure for storing tag information of the metadata, and the data structure may be an array, a linked list, or the like.
In this step, metadata in a plurality of data sets are marked in parallel, and tag information of the marked metadata is stored in a distributed search engine, so that the summarized metadata tag information (as shown in fig. 4) can be finally obtained. For example, assuming that both data set data set1 and data set data set2 have the same metadata (e.g., user account value "1213"), then after the completion of the tagging, the distributed search engine has metadata tag information, [1,2], which indicates that the metadata is in both data set data set1 and data set data set2.
Step 203, converting the preset data set integration rule into a digital expression.
In one example, the dataset integration rule may be represented by a suffix expression. A suffix expression is a general expression of an arithmetic or logical formula in which an operator is in the middle of an operand in the form of a suffix. For example, a certain dataset integration rule expressed in a suffix expression is data set1∩dataste2∪dataset3-dataset4.
Wherein the digitized expression is comprised of digitized operands and operators in digital form, the operators including at least one of an intersection operator, a union operator, or a difference operator.
For example, assuming that the dataset integration rule is "data set1∩dataste2∪dataset3-dataset4", it can be converted to "1 n 2 3-4" by this step.
And step S204, generating an aggregation query statement corresponding to the digital expression according to the digital expression.
In the step, the digitized expression can be scanned, when the operator is encountered in scanning, a query clause can be constructed based on the operator and the operand, and then all constructed query clauses are nested and packaged to obtain an aggregated query statement corresponding to the query clause.
For example, assuming that the distributed search engine is ELASTIC SEARCH engine and the digitized expression obtained in step S203 is "1 n 2U 3-4", when the scan encounters the intersection operator "n", a query clause of "filter" { "item" { "tag": [1,2] }, when the scan encounters the intersection operator "u", a query clause of "should" { "item" { "tag":3}, and when the scan encounters the intersection operator "-, a query clause of" multiple_not "{" item "{ item" } 4 }, and then, the query clause is packaged and packaged to correspond to the query clause.
Step 205, performing an aggregate query on the tag information of the metadata based on the aggregate query statement to obtain an aggregate query result.
In the embodiment of the invention, the marking efficiency is improved by marking the metadata in a plurality of data sets in parallel, the tag information of the metadata obtained by marking is stored in the distributed search engine, the aggregation query statement corresponding to the preset data set integration rule is generated by analyzing the preset data set integration rule, and the tag information of the metadata is aggregated and queried based on the aggregation query statement, so that the processing efficiency of data integration can be improved, the consumption of resources such as a stand-alone server memory, a CPU (Central processing unit) and the like in the integration processing process can be reduced, and the stability and the usability of the system can be improved.
Fig. 3 is an exemplary flow diagram of metadata parallel marking in accordance with a second embodiment of the present invention. After splitting the data set into a plurality of data blocks, metadata in the plurality of data blocks may be marked in parallel based on the flow shown in fig. 3. As shown in fig. 3, an exemplary flow of parallel marking of metadata in a data block includes:
Step S301, obtaining a tag set corresponding to metadata in a data block in a distributed search engine.
Step S302, judging whether a tag set corresponding to the metadata exists or not. If there is a tag set corresponding to the metadata, step S303 may be performed, otherwise, step S304 may be performed.
In this step, if the tag set corresponding to the metadata is acquired from the distributed search engine, it is determined that the tag set corresponding to the metadata exists, and if the tag set corresponding to the metadata is not acquired from the distributed search engine, it is determined that the tag set corresponding to the metadata does not exist.
Step S303, judging whether the data set identifier of the metadata exists in the tag set. If the data set identifier of the metadata does not exist in the tag set, step S305 may be performed, otherwise, step S307 may be performed.
Step S304, creating a label set corresponding to the metadata.
Step S305, adding the data set identifier of the metadata to the tag set. After step S305, step S306 may be performed.
For example, when marking metadata in the data Block1, the identification of the data set (such as "No set1" or "1" etc.) may be added to the tag set corresponding to the metadata, assuming that the data Block belongs to the data set data set1, and when marking metadata in the data Block1, the identification of the data set (such as "No set2" or "2" etc.) may be added to the tag set corresponding to the metadata, assuming that the data Block belongs to the data set data set2.
And step S306, performing insertion updating operation on the distributed search engine according to the tag set corresponding to the metadata.
For example, when the distributed search engine is ELASTIC SEARCH, the tag information stored in the tag set corresponding to the metadata may be stored in the ELASTIC SEARCH engine through upsert operation.
Step S307, acquire the next metadata in the data block. After step S307, the next metadata acquired may be executed to execute step S301.
In the embodiment of the invention, metadata of a plurality of data sets can be marked in parallel through the steps, so that the marking efficiency is improved, and furthermore, the running pressure of a single server can be effectively reduced through the parallel marking in a distributed environment.
Fig. 5 is a schematic diagram of main modules of a data integration apparatus according to a third embodiment of the invention. As shown in fig. 5, the data integration apparatus 500 according to the embodiment of the present invention includes a marking module 501, a generating module 502, and a querying module 503.
The marking module 501 is configured to set labels for metadata in multiple data sets in parallel, and store label information of the obtained metadata in the distributed search engine.
The metadata is understood to be an attribute value that can distinguish between or uniquely represent each element in the data set, and may be one attribute value of the data in the data set or a combination of several attribute values of the data in the data set. Taking the example of a user crowd data set obtained from multiple data sources (such as databases, text files, MQ message queues, etc.) in an e-commerce platform, the metadata may be a user account value (which may be simply referred to as a "user PIN").
Illustratively, the marking module 501 may perform parallel marking (where the marking may be understood as that metadata sets a tag) on metadata of multiple data sets to be integrated based on a timed worker (a task running framework) tool or a loop thread, etc. to improve marking efficiency. For example, assuming a total of three data sets data set1、dataset2 and data set3, a timing woker tool may be started for each data set to mark the metadata of the three data sets in parallel. In order to further improve the marking efficiency, each of the three data sets can be further split into a plurality of data blocks, and metadata in the plurality of data blocks can be marked in parallel.
The tag information of the metadata comprises a data set identifier where the metadata is located. For example, when metadata in the dataset data set1 is marked, an identification of the dataset (e.g., "No set1" or "1" etc.) may be added to the set of tags corresponding to the metadata. Wherein, the tag set refers to a data structure for storing tag information of the metadata, which may be a tuple, a linked list, or the like.
Wherein the distributed search engine may comprise a ELASTIC SEARCH engine. ELASTIC SEARCH engine, ES for short, is a Lucene-based search engine.
The generating module 502 is configured to parse a preset data set integration rule to generate an aggregate query statement corresponding to the data set integration rule.
Illustratively, the generating module 502 analyzes the preset data set integration rule to generate an aggregate query statement corresponding to the preset data set integration rule, wherein the generating module 502 converts the preset data set integration rule into a digital expression, the digital expression consists of a digital operand and an operator, and the generating module 502 generates the aggregate query statement corresponding to the digital expression according to the digital expression. Wherein the operators include at least one of a fetch intersection operator, a fetch union operator, or a fetch difference operator.
In one example, the dataset integration rule may be represented by a suffix expression. A suffix expression is a general expression of an arithmetic or logical formula in which an operator is in the middle of an operand in the form of a suffix. For example, a certain dataset integration rule expressed in a suffix expression is data set1∩dataste2∪dataset3-dataset4. Further, assuming that the distributed search engine is ELASTIC SEARCH engine, the aggregate query statement corresponding to the dataset integration rule "data set1∩dataste2∪dataset3-dataset4" generated by the generation module 502 may be expressed as:
And a query module 503, configured to perform an aggregate query on the tag information of the metadata based on the aggregate query statement, so as to obtain an aggregate query result.
In the embodiment of the invention, the metadata in a plurality of data sets are marked in parallel through the marking module, the tag information of the marked metadata is stored in the distributed search engine, the preset data set integration rule is analyzed through the generating module to generate the corresponding aggregation query statement, and the tag information of the metadata is subjected to aggregation query based on the aggregation query statement through the query module to obtain the aggregation query result, so that the processing efficiency of data integration can be improved, the consumption of resources such as a stand-alone server memory and a CPU (Central processing Unit) in the integration processing process can be reduced, and the stability and the usability of the system are improved.
Fig. 6 is a schematic diagram of main modules of a data integration system according to a fourth embodiment of the invention. As shown in fig. 6, the data integration system 600 of the embodiment of the present invention includes a multi-source data set collecting device 601, a data integrating device 602, an integration rule configuration terminal 603, a data distribution engine 604, a data output terminal 605, and a distributed search engine 606.
A multi-source data set collection means 601 for extracting a plurality of data sets from different data sources based on pre-configured key data extraction rules. The data source may be, for example, a text file, a database, or a MQ (Message Queue) message queue.
In particular, for different data sources, different forms of key data extraction rules (or metadata extraction rules) may be set in the multi-source data set collection device 601 to obtain multiple data sets from different data sources. For example, for a data source such as a text file, the key data extraction rules may be represented by regular expressions, and for a data source such as a database, the key data extraction rules may be represented by one or more screening fields in the database.
An integration rule configuration terminal 603 is configured to configure the data set integration rule. In the integrated rule configuration terminal 603, a set of functional operation controls may be designed on the interactive interface. For example, functional operation controls for adding, editing, deleting, and querying the integration rules may be designed to facilitate flexible configuration and management of the integration rules by system users. Wherein, the integration rule can comprise at least one operator of taking intersection operators (expressed by mathematical symbol #), taking union operators (expressed by mathematical symbol #), taking difference operators (expressed by mathematical symbol-). Illustratively, if the dataset integration rule is configured as data set1∩dataset2, it means that the same data in dataset data set1 and dataset data set2 is fetched, if the dataset integration rule is configured as data set1∪dataset2, it means that the union is fetched for dataset data set1 and dataset data set2, and if the dataset integration rule is configured as data set1-dataset2, it means that the difference is fetched for dataset data set1 and dataset data set2, resulting in a collection of data that is present in dataset data set1 but not in dataset data set2.
In particular, after the data set integration rule is configured by the integration rule configuration terminal 603, the configured data set integration rule may be persistently stored. For example, the configured data set integration rules may be stored in a database or a centralized cache.
The data integration device 602 is configured to set labels for metadata in multiple data sets in parallel, store label information of the obtained metadata in the distributed search engine 606, parse preset data set integration rules to generate an aggregate query statement corresponding to the preset data set integration rules, and perform an aggregate query on the label information of the metadata based on the aggregate query statement to obtain an aggregate query result.
The distributed search engine 606 may employ ELASTIC SEARCH engine. ELASTIC SEARCH engine, ES for short, is a Lucene-based search engine. The ES engine can realize real-time query of large data volume, which provides basis for integration of multiple data sets in the embodiment of the invention.
The data distribution engine 604 is configured to distribute the aggregated query result, i.e. the data after the integration processing. For example, the integrated data may be distributed to a text file, a database, an MQ message queue, a network cache (e.g., redis cluster).
And the data output terminal 605 is used for outputting and displaying the integrated data.
The embodiment of the invention builds a complete data integration system, can realize autonomous configuration of multi-source data set acquisition rules and data set integration rules, has wide applicability and usability, can effectively avoid the defects existing in the data integration in the prior art, improves the integration efficiency of the multi-source data set, reduces the consumption of the data integration to the memory and the CPU of a single server, and improves the stability and the usability of the whole system.
Fig. 7 illustrates an exemplary system architecture 700 to which the data integration method or apparatus of embodiments of the present invention may be applied.
As shown in fig. 7, a system architecture 700 may include terminal devices 701, 702, 703, a network 704, and a server 705. The network 704 is the medium used to provide communication links between the terminal devices 701, 702, 703 and the server 705. The network 704 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 705 via the network 704 using the terminal devices 701, 702, 703 to receive or send messages or the like. Various communication client applications, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc., may be installed on the terminal devices 701, 702, 703.
The terminal devices 701, 702, 703 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 705 may be a server providing various services, such as a background management server providing support for data integration requests initiated by users with the terminal devices 701, 702, 703. The background management server may analyze and process the received data, such as the data integration request, and feed back the processing result (for example, the data integration result) to the terminal device.
It should be noted that, the data integration method provided in the embodiment of the present invention is generally executed by the server 705, and accordingly, the data integration device is generally disposed in the server 705.
It should be understood that the number of terminal devices, networks and servers in fig. 7 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 8, there is illustrated a schematic diagram of a computer system 800 suitable for use in implementing an electronic device of an embodiment of the present invention. The computer system shown in fig. 8 is merely an example, and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in fig. 8, the computer system 800 includes a Central Processing Unit (CPU) 801 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the system 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Connected to the I/O interface 805 are an input section 806 including a keyboard, a mouse, and the like, an output section 807 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker, and the like, a storage section 808 including a hard disk, and the like, and a communication section 809 including a network interface card such as a LAN card, a modem, and the like. The communication section 809 performs communication processing via a network such as the internet. The drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as needed so that a computer program read out therefrom is mounted into the storage section 808 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 809, and/or installed from the removable media 811. The above-described functions defined in the system of the present invention are performed when the computer program is executed by a Central Processing Unit (CPU) 801.
The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor, which may be described as, for example, a processor including a marking module, a generating module, and a querying module. The names of these modules do not in some cases limit the module itself, and for example, the marking module may also be described as a "module that marks metadata in multiple data sets in parallel".
As a further aspect, the invention also provides a computer readable medium which may be comprised in the device described in the above embodiments or may be present alone without being fitted into the device. The computer readable medium carries one or more programs, when the one or more programs are executed by the device, the device executes the following processes of setting tags on metadata in a plurality of data sets in parallel and storing tag information of the obtained metadata into a distributed search engine, wherein the tag information of the metadata comprises a data set identifier of the metadata, analyzing a preset data set integration rule to generate an aggregation query statement corresponding to the data set integration rule, and performing aggregation query on the tag information of the metadata based on the aggregation query statement to obtain an aggregation query result.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.