[go: up one dir, main page]

WO2018188607A1 - Stream processing method and device - Google Patents

Stream processing method and device Download PDF

Info

Publication number
WO2018188607A1
WO2018188607A1 PCT/CN2018/082641 CN2018082641W WO2018188607A1 WO 2018188607 A1 WO2018188607 A1 WO 2018188607A1 CN 2018082641 W CN2018082641 W CN 2018082641W WO 2018188607 A1 WO2018188607 A1 WO 2018188607A1
Authority
WO
WIPO (PCT)
Prior art keywords
stream processing
block
data storage
management unit
block number
Prior art date
Application number
PCT/CN2018/082641
Other languages
French (fr)
Chinese (zh)
Inventor
曹俊
胡斐然
林铭
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2018188607A1 publication Critical patent/WO2018188607A1/en

Links

Images

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/50Network services
    • H04L67/56Provisioning of proxy services
    • H04L67/561Adding application-functional data or data for application control, e.g. adding metadata
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1097Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/50Network services
    • H04L67/60Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources

Definitions

  • the present invention relates to the field of information technology, and in particular, to a stream processing method and apparatus.
  • Work flow is an abstraction, generalization, and description of the logical rules of how processes are organized before and after each other in the workflow and workflow.
  • the workflow concept originated in the field of production organization and office automation. It is a concept proposed for fixed-program activities in daily work. The purpose is to perform these tasks according to certain rules and processes by breaking down the work into well-defined processes or roles. Process and monitor it to improve work efficiency, better control processes, enhance customer service, and effectively manage business processes.
  • Workflow modeling which means that the workflow is represented in the computer in the appropriate model and implemented. Through workflow modeling, workflows can be managed through a workflow system.
  • the main function of the stream processing system is to define, execute and manage the workflow through the support of computer technology, and coordinate the information interaction between the processes in the workflow execution process and between the members of the group.
  • the stream processing system usually consists of a workflow design tool and a workflow management tool.
  • the workflow design tool allows the user to design their own workflow definition, and the workflow management tool is responsible for managing the execution of the workflow.
  • the workflow instance includes one or more tasks, and each agent needs to perform some work.
  • Apache Storm is a typical stream processing system in the prior art. It consists of a Master-Slave architecture. Nimbus is the main process and Supervisor is the slave process running the service.
  • the stream processing system Storm establishes a network connection with the distributed file system, and the distributed file system stores data that needs to be processed by the stream processing system Storm.
  • the distributed file system includes a Master Server (primary server) and a Data Server (data server), and the Master Server is The metadata management node manages the distribution of data blocks.
  • the Data Server is a data storage node point, and stores data block data.
  • the Storm and the data storage node point are set on different servers.
  • Storm In Storm's stream processing operations, Storm first needs to obtain data from the data server that needs to be streamed. Specifically, the data server provides a data query interface, and the Storm inputs parameters to the data query interface through the network, acquires data from the data server through the network, and then loads the acquired data into the Supervisor.
  • the stream processing system needs to acquire data from the data storage node through the network, the speed of acquiring the data is limited by the network performance, which may result in the performance of the entire stream processing being limited by the network, in the stream processing system and the data storage node.
  • the network transmission speed is low, the speed of stream processing is greatly affected.
  • an embodiment of the present invention provides a stream processing method and apparatus, which can overcome the technical problem that the speed of the stream processing is affected by the low network transmission speed between the stream processing system and the data storage node.
  • an embodiment of the present invention provides a stream processing method, where the method is applied to a stream processing system, where the stream processing system includes a stream processing management unit and a stream processing computing unit, and the method includes:
  • the stream processing management unit receives a stream processing task sent by the client, where the stream processing task includes a stream processing logic and a path of the file to be processed in the distributed file system, and the distributed file system includes a metadata management node and a plurality of data storage nodes, each a data storage node is provided with a stream processing computing unit;
  • the stream processing management unit acquires, from the metadata management node, a block number of each block corresponding to the path of the file to be processed, and a network address of the data storage node where each block is located;
  • the stream processing management unit respectively sends the stream processing logic and the block number of each block to the stream processing unit of the data storage node where each block is located;
  • the stream processing calculation unit acquires the block data corresponding to the received block number from the data storage node where it is located, and executes stream processing logic for the block data corresponding to the received block number.
  • the embodiment of the present invention distributes the stream processing calculation unit to each data storage node, and the stream processing management unit sends the stream processing task to the corresponding data storage node according to the path of the file to be processed, by the corresponding data storage node.
  • the stream processing calculation unit directly reads the block data corresponding to the file to be processed locally, and runs the stream processing logic on the read block data. Since the stream processing calculation unit locally reads the file to be processed, the stream processing system can be overcome. A technical problem with the low speed of network transmission between data storage nodes and the speed of stream processing.
  • the stream processing logic is executed in parallel in different stream processing calculation units, so that the stream processing speed can be further accelerated and the processing efficiency can be improved.
  • the data storage node is provided with a data management unit, and the stream processing calculation unit is configured as a program library, and the data management unit performs a function of the stream processing calculation unit by loading the program library.
  • the stream processing calculation unit is set in the data management unit through the program library, and the data management unit can directly read the block data, after the data management unit can read the block data, the stream processing logic can be executed, which can speed up the stream processing speed. .
  • the method further includes:
  • the stream processing calculation unit transmits the processing result obtained by the execution stream processing logic to the stream processing management unit.
  • the metadata management node records the first correspondence between the path of the file to be processed in the distributed file system and the block number of each block, and the stream processing management unit obtains the data from the metadata management node.
  • the block number of each block corresponding to the path, and the network address of the data storage node where the block number of each block is located specifically includes:
  • the stream processing management unit acquires the block numbers of the respective blocks from the first correspondence relationship according to the path of the file to be processed in the distributed file system.
  • the metadata management node records the second correspondence between the block number of each block and the network address of the data storage node where the block number of each block is located, and the stream processing management unit slave element
  • the data management node obtains the block number of each block corresponding to the path, and the network address of the data storage node where the block number of each block is located specifically includes:
  • the stream processing management unit acquires the network address of the data storage node where each block number is located from the second correspondence relationship according to each block number.
  • an embodiment of the present invention provides a stream processing system, including a stream processing management unit and a stream processing computing unit.
  • a stream processing management unit configured to receive a stream processing task sent by the client, where the stream processing task includes a path of the stream processing logic and the file to be processed in the distributed file system, and the distributed file system includes a metadata management node and multiple data storage a node, each data storage node is provided with a stream processing computing unit;
  • the stream processing management unit is further configured to obtain, from the metadata management node, each block number corresponding to the path, and a network address of the data storage node where each block number is located;
  • the stream processing management unit is further configured to separately send the stream processing logic and the block number corresponding to each network address to the stream processing unit of the corresponding data storage node;
  • the stream processing calculation unit is configured to acquire block data corresponding to the received block number from the data storage node where the data is stored, and execute stream processing logic for the block data corresponding to the received block number.
  • the data storage node is provided with a data management unit, and the stream processing calculation unit is configured as a program library, and the data management unit performs a function of the stream processing calculation unit by loading the program library.
  • the stream processing calculation unit is further configured to send the processing result obtained by the execution stream processing logic to the stream processing management unit.
  • the metadata management node records the first correspondence between the path of the file to be processed in the distributed file system and the block number of each block
  • the stream processing management unit is specifically configured to:
  • the block number of each block is obtained from the first correspondence according to the path of the file to be processed in the distributed file system.
  • the metadata management node records the second correspondence between the block number of each block and the network address of the data storage node where the block number of each block is located, and the stream processing management unit specifically uses to:
  • the network address of the data storage node where each block number is located is obtained from the second correspondence according to the block number of each block.
  • an embodiment of the present invention provides a stream processing management unit that performs the functions of a stream processing management unit in the stream processing system.
  • an embodiment of the present invention provides a host, including a memory, a processor, and a bus.
  • the memory and the processor are connected to the bus.
  • the memory stores program instructions, and the processor executes the program instructions to implement stream processing in the stream processing system. The function of the snap-in.
  • FIG. 1 is a schematic structural diagram of a stream processing system according to an embodiment of the present invention.
  • FIG. 2 is another schematic structural diagram of a stream processing system according to an embodiment of the present invention.
  • FIG. 3 is a data interaction diagram of a stream processing method according to an embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of an apparatus of a stream processing system according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a device of a host according to an embodiment of the present invention.
  • FIG. 1 is a schematic diagram of a connection between a stream processing system and a distributed file system and a client according to an embodiment of the present invention.
  • the stream processing system includes a stream processing management unit 302 and a plurality of stream processing calculations.
  • Units 1011, 1021, ..., and 1031 the distributed file system includes a metadata management node 201 and a plurality of data storage nodes 101, 102, ..., and 103.
  • the client 301 is connected to the stream processing management unit 302, and the stream processing management unit 302 is connected to the metadata management node 201 and the plurality of data storage nodes 101, 102, ..., 103, respectively.
  • the client 301 is configured to receive a stream processing job submitted by the user.
  • the user when the user submits the stream processing job, the user specifies the path of the data to be processed in the distributed file system, and specifies what kind of processing is to be performed on the data to be processed.
  • the path of the stream processing task to be processed in the distributed file system may be, for example, a URL (abbreviation of Universal Resource Locator, a uniform resource locator), and the URL is a storage identifier of the distributed file system, and the URL may be in the metadata management node.
  • 201 finds the block number of each block corresponding to the file to be processed.
  • the client 301 generates a stream processing task according to the stream processing job submitted by the user, where the stream processing task includes a path of the stream processing logic and the data to be processed in the distributed file system, wherein the stream processing logic defines what kind of processing is to be performed on the data, for example.
  • the stream processing logic can specify to search for anomalous events in the data to be processed.
  • the client 301 sends a stream processing task to the stream processing management unit 302.
  • the stream processing management unit 302 performs scheduling according to the stream processing task, and the selected stream processing computing unit acquires the file to be processed from the distributed file system, and processes the file to be processed by the stream processing logic. deal with.
  • the stream processing system can be implemented based on the apache flink architecture
  • the client 301 is the client (client) process of the apache flink
  • the stream processing management unit 302 is the job manager (work manager) process of the apache flink
  • the metadata management node 201 is provided with a metadata management unit 2011 and a database 2012, and the metadata management unit 2011 provides an interface through which the external device can query the database 2012.
  • the database 2012 records the first correspondence between the path of the file to be processed in the distributed file system and the block number of each block, and the block number of each block and the network address of the data storage node where each block is located. The second correspondence.
  • files to be processed are stored in a database in a data storage node in the form of fragments, where the fragments refer to different block data, each piece of data corresponds to a block number, and the metadata management node records all files in the distributed storage system.
  • the correspondence between the path and the block number of each block, and each block number corresponds to the database of which data storage node is stored.
  • the data storage node 101 is provided with a stream processing calculation unit 1011 and a database 1012.
  • the database 1012 records the block data and the correspondence between the block number and the block data.
  • the stream processing calculation unit 1011 can access the database 1012 and obtain the block number from the database 1012. Corresponding block data.
  • the data storage nodes 102 and 103 have a similar structure to the data storage node 101, except that the block data recorded by the own database is different, and details are not described herein.
  • the distributed file system can be implemented by Hadoop
  • the database 2012, the database 1012, the database 1022, ... and the database 1032 can be implemented by Hbase (Hadoop Database, Hadoop database)
  • the metadata management unit 2011 can be the hmaster process of the Hbase database. .
  • the client 301 and the stream processing management unit 302 can be set on the same host, and establish a data connection with the metadata management node 201 and the data storage nodes 101, 102, .
  • the client 301 and the stream processing management unit 302 may also be disposed on different hosts, which is not limited by the embodiment of the present invention.
  • FIG. 2 is another schematic structural diagram of a stream processing system according to an embodiment of the present invention.
  • a client 301 and a stream processing management unit 302 are disposed on a host 10, and the host 10 is provided.
  • the operating system 303 and the hardware 304 are also included.
  • the hardware 304 is used to carry the operation of the operating system 303.
  • the hardware 304 includes a physical network card 3041.
  • the client 301 and the stream processing management unit 302 respectively run on the operating system 303 in the form of a process.
  • the physical network card 3041 accesses the network 50.
  • the metadata management node 201 includes a database 2012, a metadata management unit 2011, an operating system 2013, and a hardware 2014.
  • the database 2012 and the metadata management unit 2011 respectively run on the operating system 2013 in the form of a process, and the hardware 2014 is used for the bearer operation.
  • the operation of the system 2013 includes the physical network card 20141, the physical network card 20141 accessing the network 50, and the metadata management unit 2011 providing an interface through which the external device can access the database 2012.
  • the data storage node 101 includes data 1012, a stream processing computing unit 1011, an operating system 1013, and a hardware 1014.
  • the database 1012 and the stream processing computing unit 1011 respectively run on the operating system 1013 in the form of a process, and the hardware 1014 is used to carry the operating system.
  • the hardware 1014 includes a physical network card 10141, and the physical network card 10141 accesses the network 50.
  • the stream processing computing unit 1011 can access the database 1012.
  • the structure of the data storage nodes 102 and 103 is similar to that of the data storage node 101 and will not be described herein.
  • the stream processing management unit 302 and the client 301, the metadata management unit 2011, and the stream processing computing units 101, 1021, ..., and 1031 can pass RPC (Remote Procedure Call Protocol). Implement communication.
  • RPC Remote Procedure Call Protocol
  • the embodiment of the present invention provides a stream processing method, where the stream processing management unit 302 receives a stream processing task sent by the client 301, where the stream processing task includes a stream processing logic and a path of the file to be processed in the distributed file system.
  • the stream processing management unit 302 acquires the block number of each block corresponding to the path from the metadata management node 201, and the network address of the data storage node where the block number of each block is located; the stream processing management unit 302 respectively processes the stream processing logic and each The block number corresponding to the network address is sent to the stream processing unit of the corresponding data storage node; the stream processing calculation unit acquires the block data corresponding to the received block number from the data storage node where it is located, and performs block data corresponding to the received block number.
  • Stream processing logic acquires the block number of each block corresponding to the path from the metadata management node 201, and the network address of the data storage node where the block number of each block is located; the stream processing management unit 302 respectively processes the stream processing logic and each The block number corresponding to the network address is sent to the stream processing unit of the corresponding data storage node; the stream processing calculation unit acquires the block data corresponding to the received block number from the data storage node where it is located, and performs block data corresponding to
  • the embodiment of the present invention distributes the stream processing calculation unit to each data storage node, and the stream processing management unit sends the stream processing task to the corresponding data storage node according to the path of the file to be processed, by the corresponding data storage node.
  • the stream processing calculation unit directly reads the block data corresponding to the file to be processed locally, and runs the stream processing logic on the read block data. Since the stream processing calculation unit locally reads the file to be processed, the stream processing system can be overcome. A technical problem with the low speed of network transmission between data storage nodes and the speed of stream processing.
  • FIG. 3 is a data interaction diagram of a stream processing method according to an embodiment of the present invention. As shown in FIG. 3, the stream processing method includes the following steps:
  • Step 401 The stream processing management unit 302 receives the stream processing task sent by the client 301, where the stream processing task includes the stream processing logic and the path of the file to be processed in the distributed file system.
  • the client 301 can be a client process in the apache flink system
  • the stream processing management unit 302 can be a job manager process in the apache flink system.
  • Step 402 The stream processing management unit 302 sends a query request to the metadata management node 201, wherein the query request carries a path of the file to be processed in the distributed file system.
  • the query request includes an input parameter and a query instruction
  • the stream processing management unit 302 takes the path of the file to be processed in the distributed file system as an input parameter, and sends the input parameter and the control instruction to the metadata of the metadata management node 201.
  • the interface provided by the management unit 2011 for accessing the database 2012.
  • Step 403 The metadata management node 201 returns the block number of each block corresponding to the path and the network address of the data storage node corresponding to each block to the stream processing management unit 302 according to the query request.
  • the database 2012 of the metadata management node 201 records the first correspondence between the path of the file to be processed in the distributed file system and the block number of each block, and the block number of each block and the network of the data storage node where each block is located.
  • the second correspondence relationship of the addresses therefore, the stream processing management unit 302 of the metadata management node 201 acquires the block numbers of the respective blocks from the first correspondence relationship according to the path of the file to be processed in the distributed file system, and according to the block numbers of the respective blocks Obtaining the network address of the data storage node where each block is located from the second correspondence.
  • the block numbers acquired by the stream processing management unit 302 are block number 1 and block number 2, respectively. It is worth noting that in practical applications, the block number includes a plurality of, and for the sake of brief description, only two block numbers are taken as an example. To be explained, the stream processing management unit 302 queries the network address A of the data storage node 101 based on the block number 1, and queries the network address B of the data storage node 102 based on the block number 2.
  • Step 404 The stream processing management unit 302 transmits the stream processing logic and block number 1 to the stream processing computing unit 1011.
  • the stream processing management unit 302 queries the network address A of the data storage node 101 according to the block number 1, the stream processing task and the block number 1 corresponding to the network address A are sent to the stream processing calculation unit of the data storage node 101. 1011.
  • Step 405 The stream processing management unit 302 transmits the stream processing logic and block number 2 to the stream processing computing unit 1021.
  • the stream processing management unit 302 queries the network address B of the data storage node 102 according to the block number 2, the stream processing task and the block number 2 corresponding to the network address B are sent to the stream processing calculation unit of the data storage node 102. 1021.
  • the stream processing computing unit 1011 can be, for example, a task manager process in the apache flink system
  • the stream processing computing unit 1021 can be, for example, another task manager process in the apache flink system.
  • Step 406 The stream processing calculation unit 1011 acquires the received block data corresponding to the block number 1 from the data storage node 101 where it is located, and executes stream processing logic for the block data corresponding to the received block number 1.
  • the stream processing calculation unit 1011 acquires the block data corresponding to the block number 1 received from the stream processing management unit 302 from the database 1012 of the data storage node 101 in which it is located, and performs stream processing for the block data corresponding to the block number 1. logic.
  • data storage node 101 is further provided with a data management unit for accessing database 1012 to manage block data in database 1012.
  • the distributed file system can be Hadoop
  • the Hadoop database is implemented by the Hbase database
  • the metadata management unit 2011 is the Hmaster process of the Hbase database
  • the stream processing computing unit is set as a program library
  • the data management unit executes the flow by loading the program library. Handle the functions of the computing unit.
  • the data management unit is, for example, a HReigonServer process of the Hbase database, and the HReigonServer process embeds the task manager process into the HReigonServer process, and the task manager process can be set to a library of a jar package or a so file, and provides a startup interface, and the HReigonServer process is After loading the library, you can implement the task manager process by running the startup interface.
  • the HReigonServer process that implements the function of the task manager process can locally read the block data of the database 1012, the process of acquiring the block data can be prevented from being affected by the performance of the external network, and since the HReigonServer process is in the process Direct access to the database 1012, that is, directly read the block data from the memory, so the speed of the block data is faster, which can effectively improve the efficiency of stream processing.
  • the data management unit and the stream processing computing unit 1011 can concurrently run at the operating system 1013, and the stream processing computing unit 1011 accesses the database 1012 through an interface provided by the data management unit, in these examples, although not through the HReigonServer process.
  • the database 1012 is directly accessed within the process, but the stream processing computing unit 1011 can access the database 1012 locally, and can also avoid the impact on external network performance.
  • Step 407 The stream processing calculation unit 1021 acquires the block data corresponding to the received block number 2 from the data storage node 102 where it is located, and executes stream processing logic for the block data corresponding to the received block number 2.
  • data storage node 102 is provided with a data management unit for accessing database 1022 to manage block data.
  • the distributed file system can be Hadoop, the Hadoop database is implemented by the Hbase database, the metadata management unit 2011 is the Hmaster process of the Hbase database, the stream processing computing unit 1011 is set as a program library, and the data management unit executes the stream processing calculation unit by loading the program library. The function.
  • the data management unit is, for example, a HReigonServer process of the Hbase database, and the HReigonServer process embeds the task manager process into the HReigonServer process, and the task manager process can be set to a library of a jar package or a so file, and provides a startup interface, and the HReigonServer process is After loading the library, you can implement the task manager process by running the startup interface.
  • the HReigonServer process that implements the function of the task manager process can read the block data of the database 1022 locally, the process of acquiring the block data can be prevented from being affected by the performance of the external network, and since the HReigonServer process is in the process Direct access to the database 1022, so the speed of obtaining block data is faster, and the efficiency of stream processing can be effectively improved.
  • the data management unit and the stream processing computing unit 1021 can concurrently run at the operating system 1023, and the stream processing computing unit 1021 accesses the database 1022 through an interface provided by the data management unit, in these examples, although not through the HReigonServer process.
  • the database 1012 is directly accessed within the process, but the stream processing computing unit 1021 can access the database 1022 locally, and can also avoid the impact on external network performance.
  • Step 408 The stream processing calculation unit sends the first processing result obtained by the stream processing logic to the block data corresponding to the block number 1 to the stream processing management unit 302.
  • Step 409 The stream processing calculation unit transmits the second processing result obtained by the stream processing logic of the block data corresponding to the block number 2 to the stream processing management unit 302.
  • the embodiment of the present invention distributes the stream processing calculation unit to each data storage node, and the stream processing management unit sends the stream processing task to the corresponding data storage node according to the path of the file to be processed, and the corresponding data.
  • the stream processing calculation unit on the storage node directly reads the block data corresponding to the file to be processed locally, and runs the stream processing logic on the read block data. Since the stream processing calculation unit reads the file to be processed locally, the cause can be overcome.
  • the technical problem of low network transmission speed between the stream processing system and the data storage node affecting the speed of stream processing.
  • the stream processing logic is executed in parallel in different stream processing calculation units, so that the stream processing speed can be further accelerated and the processing efficiency can be improved.
  • stream processing system 90 may also be implemented based on the Storm, Spark, or Samza architecture.
  • FIG. 4 is a schematic structural diagram of a device of a stream processing management unit according to an embodiment of the present invention.
  • the stream processing management unit 302 includes:
  • the receiving module 601 is configured to receive a stream processing task sent by the client, where the stream processing task includes a path of the stream processing logic and the file to be processed in the distributed file system, where the distributed file system includes a metadata management node and multiple data storage nodes. Each data storage node is provided with a stream processing computing unit;
  • the query module 602 is configured to obtain, from the metadata management node, a block number of each block corresponding to the path, and a network address of the data storage node where each block is located;
  • the sending module 603 is configured to separately send the stream processing logic and the block number of each block to the stream processing unit of the data storage node where each block is located.
  • the receiving unit 601 is further configured to receive a processing result obtained by the execution stream processing logic sent by the stream processing calculation unit.
  • the metadata management node records the first correspondence between the path of the file to be processed in the distributed file system and the block number of each block, and the block number of each block and the network address of the data storage node where each block is located.
  • the query module 602 is specifically used to:
  • the network address of the data storage node where each block is located is obtained from the second correspondence according to the block number of each block.
  • FIG. 5 is a schematic structural diagram of a device according to an embodiment of the present invention.
  • the host 50 includes a memory 502, a processor 501, and a bus 503.
  • the memory 502 and the processor 501 are connected to the bus 503.
  • the memory 502 stores program instructions, and the processor 501 executes program instructions to implement the functions of the stream processing management unit 302 in the stream processing system described above.
  • the embodiment of the present invention distributes the stream processing calculation unit to each data storage node, and the stream processing management unit sends the stream processing task to the corresponding data storage node according to the path of the file to be processed, by the corresponding data storage node.
  • the stream processing calculation unit directly reads the block data corresponding to the file to be processed locally, and runs the stream processing logic on the read block data. Since the stream processing calculation unit locally reads the file to be processed, the stream processing system can be overcome. A technical problem with the low speed of network transmission between data storage nodes and the speed of stream processing.
  • the stream processing logic is executed in parallel in different stream processing calculation units, so that the stream processing speed can be further accelerated and the processing efficiency can be improved.
  • any of the device embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separated, and the components displayed as the cells may or may not be Physical units can be located in one place or distributed to multiple network elements.
  • Some or all of the processes may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • the connection relationship between processes indicates that there is a communication connection between them, and specifically may be implemented as one or more communication buses or signal lines.
  • the present invention can be implemented by means of software plus necessary general hardware, and of course, dedicated hardware, dedicated CPU, dedicated memory, dedicated memory, Special components and so on.
  • functions performed by computer programs can be easily implemented with the corresponding hardware, and the specific hardware structure used to implement the same function can be various, such as analog circuits, digital circuits, or dedicated circuits. Circuits, etc.
  • software program implementation is a better implementation in more cases.
  • the technical solution of the present invention which is essential or contributes to the prior art, can be embodied in the form of a software product stored in a readable storage medium, such as a floppy disk of a computer.
  • U disk mobile hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), disk or optical disk, etc., including a number of instructions to make a computer device (may be A personal computer, server, or network device, etc.) performs the methods described in various embodiments of the present invention.
  • a computer device may be A personal computer, server, or network device, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed in the embodiments of the present invention are a stream processing method and device, the method comprising: a stream processing management unit receives a stream processing task sent by a client; the stream processing management unit obtains, from a metadata management node, the block number of each block corresponding to the path of a file to be processed and the network address of the data storage node where each block is located; the stream processing management unit sends stream processing logic and the block number of each block to a stream processing unit of the data storage node where each block is located respectively; and a stream processing calculation unit obtains block data corresponding to the received block number from the data storage node where the same is located, and executes a stream processing logic for the block data corresponding to the received block number. By means of the solution above, the technical problem of the low network transmission speed between a stream processing system and a data storage node affecting the speed of stream processing may be overcome.

Description

流处理方法及装置Flow processing method and device 技术领域Technical field
本发明涉及信息技术领域,特别涉及一种流处理方法及装置。The present invention relates to the field of information technology, and in particular, to a stream processing method and apparatus.
背景技术Background technique
工作流(Work flow)是对工作流程及工作流程中各业务之间如何前后组织在一起的逻辑规则的抽象、概括、描述。工作流概念起源于生产组织和办公自动化领域,是针对日常工作中具有固定程序活动而提出的一个概念,目的是通过将工作分解成定义良好的流程或角色,按照一定的规则和过程来执行这些流程并对其进行监控,达到提高工作效率、更好的控制过程、增强对客户的服务、有效管理业务流程等目的。工作流建模,即将工作流在计算机中以恰当的模型进行表示并对其实施计算。通过工作流建模,工作流可以通过工作流系统来管理。Work flow is an abstraction, generalization, and description of the logical rules of how processes are organized before and after each other in the workflow and workflow. The workflow concept originated in the field of production organization and office automation. It is a concept proposed for fixed-program activities in daily work. The purpose is to perform these tasks according to certain rules and processes by breaking down the work into well-defined processes or roles. Process and monitor it to improve work efficiency, better control processes, enhance customer service, and effectively manage business processes. Workflow modeling, which means that the workflow is represented in the computer in the appropriate model and implemented. Through workflow modeling, workflows can be managed through a workflow system.
流处理系统的主要功能是通过计算机技术的支持去定义、执行和管理工作流,协调工作流执行过程中流程之间以及群体成员之间的信息交互。流处理系统通常由工作流程设计工具、工作流程管理工具组成,工作流程设计工具供用户设计自己的工作流程定义,工作流程管理工具负责管理工作流程的执行。在工作流系统工作过程中,工作流程实例包括一个或多个业务(Task),每个业务代表需要进行的某项工作。The main function of the stream processing system is to define, execute and manage the workflow through the support of computer technology, and coordinate the information interaction between the processes in the workflow execution process and between the members of the group. The stream processing system usually consists of a workflow design tool and a workflow management tool. The workflow design tool allows the user to design their own workflow definition, and the workflow management tool is responsible for managing the execution of the workflow. During the workflow system work process, the workflow instance includes one or more tasks, and each agent needs to perform some work.
Apache Storm是现有技术中典型的流处理系统,由Master-Slave(主-从)架构组成,Nimbus是主进程,Supervisor是运行业务的从进程。流处理系统Storm与分布式文件系统建立网络连接,分布式文件系统存放需要流处理系统Storm进行处理的数据,分布式文件系统包括Master Server(主服务器)和Data Server(数据服务器),Master Server是元数据管理节点,管理数据块的分布情况,Data Server是数据存储节点点,存储数据块数据,Storm与数据存储节点点设置在不同服务器上。Apache Storm is a typical stream processing system in the prior art. It consists of a Master-Slave architecture. Nimbus is the main process and Supervisor is the slave process running the service. The stream processing system Storm establishes a network connection with the distributed file system, and the distributed file system stores data that needs to be processed by the stream processing system Storm. The distributed file system includes a Master Server (primary server) and a Data Server (data server), and the Master Server is The metadata management node manages the distribution of data blocks. The Data Server is a data storage node point, and stores data block data. The Storm and the data storage node point are set on different servers.
在Storm的流处理作业中,Storm首先需要从数据服务器获取需要进行流处理的数据。具体地,数据服务器提供数据查询接口,Storm通过网络输入参数至数据查询接口,通过网络从数据服务器获取数据,然后将获取到的数据加载到Supervisor中。In Storm's stream processing operations, Storm first needs to obtain data from the data server that needs to be streamed. Specifically, the data server provides a data query interface, and the Storm inputs parameters to the data query interface through the network, acquires data from the data server through the network, and then loads the acquired data into the Supervisor.
由于在现有技术中,流处理系统需要通过网络从数据存储节点获取数据,因此获取数据的速度受网络性能限制,会导致整个流处理的性能受限于网络,在流处理系统与数据存储节点之间的网络传输速度较低时,会极大地影响流处理的速度。Since in the prior art, the stream processing system needs to acquire data from the data storage node through the network, the speed of acquiring the data is limited by the network performance, which may result in the performance of the entire stream processing being limited by the network, in the stream processing system and the data storage node. When the network transmission speed is low, the speed of stream processing is greatly affected.
发明内容Summary of the invention
为解决现有技术的问题,本发明实施例提供一种流处理方法及装置,可克服因流处理系统与数据存储节点之间的网络传输速度较低而影响流处理的速度的技术问题。To solve the problem of the prior art, an embodiment of the present invention provides a stream processing method and apparatus, which can overcome the technical problem that the speed of the stream processing is affected by the low network transmission speed between the stream processing system and the data storage node.
第一方面,本发明实施例提供一种流处理方法,该方法应用于流处理系统,流处理系统包括流处理管理单元以及流处理计算单元,该方法包括:In a first aspect, an embodiment of the present invention provides a stream processing method, where the method is applied to a stream processing system, where the stream processing system includes a stream processing management unit and a stream processing computing unit, and the method includes:
流处理管理单元接收客户端发送的流处理任务,其中流处理任务包括流处理逻辑和待处理文件在分布式文件系统的路径,分布式文件系统包括元数据管理节点和多个数据存储节点,每一数据存储节点设置有流处理计算单元;The stream processing management unit receives a stream processing task sent by the client, where the stream processing task includes a stream processing logic and a path of the file to be processed in the distributed file system, and the distributed file system includes a metadata management node and a plurality of data storage nodes, each a data storage node is provided with a stream processing computing unit;
流处理管理单元从元数据管理节点获取待处理文件的路径对应的各个块的块号,以及各个块所在的数据存储节点的网络地址;The stream processing management unit acquires, from the metadata management node, a block number of each block corresponding to the path of the file to be processed, and a network address of the data storage node where each block is located;
流处理管理单元分别将流处理逻辑和各个块的块号发送至各个块所在的数据存储节点的流处理单元;The stream processing management unit respectively sends the stream processing logic and the block number of each block to the stream processing unit of the data storage node where each block is located;
流处理计算单元从所在的数据存储节点获取接收到的块号对应的块数据,针对接收到的块号对应的块数据执行流处理逻辑。The stream processing calculation unit acquires the block data corresponding to the received block number from the data storage node where it is located, and executes stream processing logic for the block data corresponding to the received block number.
由于本发明实施例将流处理计算单元分布设置在各数据存储节点上,并由流处理管理单元根据待处理文件的路径将流处理任务发送到对应的数据存储节点,由对应的数据存储节点上的流处理计算单元直接在本地读取待处理文件对应的块数据,并将读取到的块数据运行流处理逻辑,由于流处理计算单元本地读取待处理文件,因此可克服因流处理系统与数据存储节点之间的网络传输速度较低而影响流处理的速度的技术问题。The embodiment of the present invention distributes the stream processing calculation unit to each data storage node, and the stream processing management unit sends the stream processing task to the corresponding data storage node according to the path of the file to be processed, by the corresponding data storage node. The stream processing calculation unit directly reads the block data corresponding to the file to be processed locally, and runs the stream processing logic on the read block data. Since the stream processing calculation unit locally reads the file to be processed, the stream processing system can be overcome. A technical problem with the low speed of network transmission between data storage nodes and the speed of stream processing.
并且,由于待处理文件被分散为块数据,分别在不同的流处理计算单元并行执行流处理逻辑,因此可以进一步加快流处理速度,提高处理效率。Moreover, since the files to be processed are dispersed into block data, the stream processing logic is executed in parallel in different stream processing calculation units, so that the stream processing speed can be further accelerated and the processing efficiency can be improved.
在本发明实施例的一种实现方式中,数据存储节点设置有数据管理单元,流处理计算单元设置为程序库,数据管理单元通过加载程序库执行流处理计算单元的功能。In an implementation manner of the embodiment of the present invention, the data storage node is provided with a data management unit, and the stream processing calculation unit is configured as a program library, and the data management unit performs a function of the stream processing calculation unit by loading the program library.
由于将流处理计算单元通过程序库设置在数据管理单元中,而数据管理单元可直接读取块数据,在数据管理单元可读取块数据之后,即可执行流处理逻辑,可加快流处理速度。Since the stream processing calculation unit is set in the data management unit through the program library, and the data management unit can directly read the block data, after the data management unit can read the block data, the stream processing logic can be executed, which can speed up the stream processing speed. .
在本发明实施例的另一种实现方式中,该方法还包括:In another implementation manner of the embodiment of the present invention, the method further includes:
流处理计算单元将执行流处理逻辑获取的处理结果发送至流处理管理单元。The stream processing calculation unit transmits the processing result obtained by the execution stream processing logic to the stream processing management unit.
在本发明实施例的另一种实现方式中,元数据管理节点记录有待处理文件在分布式文件系统的路径与各个块的块号的第一对应关系,流处理管理单元从元数据管理节点获取路径对应的各个块的块号,以及各个块的块号所在的数据存储节点的网络地址具体包括:In another implementation manner of the embodiment of the present invention, the metadata management node records the first correspondence between the path of the file to be processed in the distributed file system and the block number of each block, and the stream processing management unit obtains the data from the metadata management node. The block number of each block corresponding to the path, and the network address of the data storage node where the block number of each block is located specifically includes:
流处理管理单元根据待处理文件在分布式文件系统的路径从第一对应关系获取各个块的块号。The stream processing management unit acquires the block numbers of the respective blocks from the first correspondence relationship according to the path of the file to be processed in the distributed file system.
在本发明实施例的另一种实现方式中,元数据管理节点记录有各个块的块号与各个块的块号所在的数据存储节点的网络地址的第二对应关系,流处理管理单元从元数据管理节点获取路径对应的各个块的块号,以及各个块的块号所在的数据存储节点的网络地址具体包括:In another implementation manner of the embodiment of the present invention, the metadata management node records the second correspondence between the block number of each block and the network address of the data storage node where the block number of each block is located, and the stream processing management unit slave element The data management node obtains the block number of each block corresponding to the path, and the network address of the data storage node where the block number of each block is located specifically includes:
流处理管理单元根据各块号从第二对应关系获取各块号所在的数据存储节点的网络地址。The stream processing management unit acquires the network address of the data storage node where each block number is located from the second correspondence relationship according to each block number.
第二方面,本发明实施例提供一种流处理系统,包括流处理管理单元以及流处理计算单元,In a second aspect, an embodiment of the present invention provides a stream processing system, including a stream processing management unit and a stream processing computing unit.
流处理管理单元,用于接收客户端发送的流处理任务,其中流处理任务包括流处理逻辑和待处理文件在分布式文件系统的路径,分布式文件系统包括元数据管理节点和多个数据存储节点,每一数据存储节点设置有流处理计算单元;a stream processing management unit, configured to receive a stream processing task sent by the client, where the stream processing task includes a path of the stream processing logic and the file to be processed in the distributed file system, and the distributed file system includes a metadata management node and multiple data storage a node, each data storage node is provided with a stream processing computing unit;
流处理管理单元,还用于从元数据管理节点获取路径对应的各块号,以及各块号所在的数据存储节点的网络地址;The stream processing management unit is further configured to obtain, from the metadata management node, each block number corresponding to the path, and a network address of the data storage node where each block number is located;
流处理管理单元,还用于分别将流处理逻辑和与各网络地址对应的块号发送至对应 的数据存储节点的流处理单元;The stream processing management unit is further configured to separately send the stream processing logic and the block number corresponding to each network address to the stream processing unit of the corresponding data storage node;
流处理计算单元,用于从所在的数据存储节点获取接收到的块号对应的块数据,针对接收到的块号对应的块数据执行流处理逻辑。The stream processing calculation unit is configured to acquire block data corresponding to the received block number from the data storage node where the data is stored, and execute stream processing logic for the block data corresponding to the received block number.
在本发明实施例的一种实现方式中,数据存储节点设置有数据管理单元,流处理计算单元设置为程序库,数据管理单元通过加载程序库执行流处理计算单元的功能。In an implementation manner of the embodiment of the present invention, the data storage node is provided with a data management unit, and the stream processing calculation unit is configured as a program library, and the data management unit performs a function of the stream processing calculation unit by loading the program library.
在本发明实施例的另一种实现方式中,流处理计算单元,还用于将执行流处理逻辑获取的处理结果发送至流处理管理单元。In another implementation manner of the embodiment of the present invention, the stream processing calculation unit is further configured to send the processing result obtained by the execution stream processing logic to the stream processing management unit.
在本发明实施例的另一种实现方式中,元数据管理节点记录有待处理文件在分布式文件系统的路径与各个块的块号的第一对应关系,流处理管理单元具体用于:In another implementation manner of the embodiment of the present invention, the metadata management node records the first correspondence between the path of the file to be processed in the distributed file system and the block number of each block, and the stream processing management unit is specifically configured to:
根据待处理文件在分布式文件系统的路径从第一对应关系获取各个块的块号。The block number of each block is obtained from the first correspondence according to the path of the file to be processed in the distributed file system.
在本发明实施例的另一种实现方式中,元数据管理节点记录有各个块的块号与各个块的块号所在的数据存储节点的网络地址的第二对应关系,流处理管理单元具体用于:In another implementation manner of the embodiment of the present invention, the metadata management node records the second correspondence between the block number of each block and the network address of the data storage node where the block number of each block is located, and the stream processing management unit specifically uses to:
根据各个块的块号从第二对应关系获取各块号所在的数据存储节点的网络地址。The network address of the data storage node where each block number is located is obtained from the second correspondence according to the block number of each block.
第三方面,本发明实施例提供一种流处理管理单元,执行上述流处理系统中的流处理管理单元的功能。In a third aspect, an embodiment of the present invention provides a stream processing management unit that performs the functions of a stream processing management unit in the stream processing system.
第四方面,本发明实施例提供一种主机,包括存储器、处理器和总线,存储器、处理器与总线连接,存储器存储有程序指令,处理器执行程序指令以实现上述流处理系统中的流处理管理单元的功能。In a fourth aspect, an embodiment of the present invention provides a host, including a memory, a processor, and a bus. The memory and the processor are connected to the bus. The memory stores program instructions, and the processor executes the program instructions to implement stream processing in the stream processing system. The function of the snap-in.
附图说明DRAWINGS
为了更清楚地说明本发明实施例的技术方案,下面将对本发明实施例中所需要使用的附图作简单地介绍,显而易见地,下面所描述的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the present invention, Those skilled in the art can also obtain other drawings based on these drawings without paying any creative work.
图1是根据本发明实施例的流处理系统的结构示意图;1 is a schematic structural diagram of a stream processing system according to an embodiment of the present invention;
图2是根据本发明实施例的流处理系统的另一结构示意图;2 is another schematic structural diagram of a stream processing system according to an embodiment of the present invention;
图3是根据本发明实施例的流处理方法的数据交互图;3 is a data interaction diagram of a stream processing method according to an embodiment of the present invention;
图4是根据本发明实施例的流处理系统的装置结构示意图;4 is a schematic structural diagram of an apparatus of a stream processing system according to an embodiment of the present invention;
图5是根据本发明实施例的主机的装置结构示意图。FIG. 5 is a schematic structural diagram of a device of a host according to an embodiment of the present invention.
具体实施方式detailed description
首先请参见图1,图1是根据本发明实施例的流处理系统与分布式文件系统和客户端的连接示意图,如图1所示,流处理系统包括流处理管理单元302以及多个流处理计算单元1011、1021、……、及1031,分布式文件系统包括元数据管理节点201和多个数据存储节点101、102、……、及103。First, please refer to FIG. 1. FIG. 1 is a schematic diagram of a connection between a stream processing system and a distributed file system and a client according to an embodiment of the present invention. As shown in FIG. 1, the stream processing system includes a stream processing management unit 302 and a plurality of stream processing calculations. Units 1011, 1021, ..., and 1031, the distributed file system includes a metadata management node 201 and a plurality of data storage nodes 101, 102, ..., and 103.
在本发明实施例中,客户端301与流处理管理单元302连接,流处理管理单元302分别与元数据管理节点201和多个数据存储节点101、102、……、103连接。In the embodiment of the present invention, the client 301 is connected to the stream processing management unit 302, and the stream processing management unit 302 is connected to the metadata management node 201 and the plurality of data storage nodes 101, 102, ..., 103, respectively.
客户端301用于接收用户提交的流处理作业,在本发明实施例中,用户提交流处理作业时指明待处理数据在分布式文件系统的路径,并规定要对待处理数据进行何种处理。The client 301 is configured to receive a stream processing job submitted by the user. In the embodiment of the present invention, when the user submits the stream processing job, the user specifies the path of the data to be processed in the distributed file system, and specifies what kind of processing is to be performed on the data to be processed.
流处理任务待处理文件在分布式文件系统的路径举例而言可以是URL(Universal  Resource Locator的缩写,统一资源定位符),URL为分布式文件系统的存储标识,通过URL可在元数据管理节点201查找到待处理文件对应的各个块的块号。The path of the stream processing task to be processed in the distributed file system may be, for example, a URL (abbreviation of Universal Resource Locator, a uniform resource locator), and the URL is a storage identifier of the distributed file system, and the URL may be in the metadata management node. 201 finds the block number of each block corresponding to the file to be processed.
客户端301根据用户提交的流处理作业产生流处理任务,流处理任务包括流处理逻辑和待处理数据在分布式文件系统的路径,其中流处理逻辑定义了对待处理数据进行何种处理,举例而言,流处理逻辑可规定在待处理数据中搜索异常事件。The client 301 generates a stream processing task according to the stream processing job submitted by the user, where the stream processing task includes a path of the stream processing logic and the data to be processed in the distributed file system, wherein the stream processing logic defines what kind of processing is to be performed on the data, for example. In other words, the stream processing logic can specify to search for anomalous events in the data to be processed.
客户端301向流处理管理单元302发送流处理任务,流处理管理单元302根据流处理任务进行调度,选择流处理计算单元从分布式文件系统获取待处理文件,并以流处理逻辑对待处理文件进行处理。The client 301 sends a stream processing task to the stream processing management unit 302. The stream processing management unit 302 performs scheduling according to the stream processing task, and the selected stream processing computing unit acquires the file to be processed from the distributed file system, and processes the file to be processed by the stream processing logic. deal with.
举例而言,流处理系统可基于apache flink架构实现,客户端301为apache flink的client(客户端)进程,流处理管理单元302为apache flink的job manager(工作管理器)进程,流处理计算单元为apache flink的task manager(作业管理器)进程。For example, the stream processing system can be implemented based on the apache flink architecture, the client 301 is the client (client) process of the apache flink, the stream processing management unit 302 is the job manager (work manager) process of the apache flink, and the stream processing unit The task manager process for apache flink.
元数据管理节点201设置有元数据管理单元2011和数据库2012,元数据管理单元2011提供接口,外部设备可通过接口查询数据库2012。数据库2012记录了分布式文件系统中的待处理文件在分布式文件系统的路径与各个块的块号的第一对应关系,以及各个块的块号与各个块所在的数据存储节点的网络地址的第二对应关系。The metadata management node 201 is provided with a metadata management unit 2011 and a database 2012, and the metadata management unit 2011 provides an interface through which the external device can query the database 2012. The database 2012 records the first correspondence between the path of the file to be processed in the distributed file system and the block number of each block, and the block number of each block and the network address of the data storage node where each block is located. The second correspondence.
在分布式存储系统中,待处理文件以碎片形式存储在数据存储节点的数据库中,其中碎片指不同的块数据,每一块数据对应一块号,元数据管理节点记录分布式存储系统中所有文件的路径与各个块的块号之间的对应关系,以及每一块号分别对应存储于哪一数据存储节点的数据库。In a distributed storage system, files to be processed are stored in a database in a data storage node in the form of fragments, where the fragments refer to different block data, each piece of data corresponds to a block number, and the metadata management node records all files in the distributed storage system. The correspondence between the path and the block number of each block, and each block number corresponds to the database of which data storage node is stored.
数据存储节点101设置有流处理计算单元1011和数据库1012,数据库1012记录有块数据以及块号与块数据之间的对应关系,流处理计算单元1011可访问数据库1012,通过块号从数据库1012获取对应的块数据。The data storage node 101 is provided with a stream processing calculation unit 1011 and a database 1012. The database 1012 records the block data and the correspondence between the block number and the block data. The stream processing calculation unit 1011 can access the database 1012 and obtain the block number from the database 1012. Corresponding block data.
在图1中,数据存储节点102和103与数据存储节点101具有类似结构,区别在于自身数据库记录的块数据不相同,于此不作赘述。In FIG. 1, the data storage nodes 102 and 103 have a similar structure to the data storage node 101, except that the block data recorded by the own database is different, and details are not described herein.
举例而言,分布式文件系统可通过Hadoop实现,数据库2012、数据库1012、数据库1022……以及数据库1032可通过Hbase(Hadoop Database,Hadoop数据库)实现,元数据管理单元2011可为Hbase数据库的hmaster进程。For example, the distributed file system can be implemented by Hadoop, the database 2012, the database 1012, the database 1022, ... and the database 1032 can be implemented by Hbase (Hadoop Database, Hadoop database), and the metadata management unit 2011 can be the hmaster process of the Hbase database. .
在本发明实施例中,客户端301和流处理管理单元302可设置在同一主机上,并通过网络与元数据管理节点201和数据存储节点101、102……103分别建立数据连接。In the embodiment of the present invention, the client 301 and the stream processing management unit 302 can be set on the same host, and establish a data connection with the metadata management node 201 and the data storage nodes 101, 102, .
在一些示例中,客户端301和流处理管理单元302也可设置在不同主机上,本发明实施例对此不作限定。In some examples, the client 301 and the stream processing management unit 302 may also be disposed on different hosts, which is not limited by the embodiment of the present invention.
为了便于理解,可参见图2,图2是根据本发明实施例的流处理系统的另一结构示意图,如图2所示,客户端301和流处理管理单元302设置在主机10上,主机10还包括操作系统303和硬件304,硬件304用于承载操作系统303的运行,硬件304包括物理网卡3041,客户端301和流处理管理单元302分别以进程的形式运行在操作系统303上,并通过物理网卡3041访问网络50。For ease of understanding, reference may be made to FIG. 2, which is another schematic structural diagram of a stream processing system according to an embodiment of the present invention. As shown in FIG. 2, a client 301 and a stream processing management unit 302 are disposed on a host 10, and the host 10 is provided. The operating system 303 and the hardware 304 are also included. The hardware 304 is used to carry the operation of the operating system 303. The hardware 304 includes a physical network card 3041. The client 301 and the stream processing management unit 302 respectively run on the operating system 303 in the form of a process. The physical network card 3041 accesses the network 50.
并且,元数据管理节点201包括数据库2012、元数据管理单元2011、操作系统2013和硬件2014,数据库2012和元数据管理单元2011分别以进程的形式运行在操作系统2013上,硬件2014用于承载操作系统2013的运行,硬件2014包括物理网卡20141,物理网卡20141接入网络50,元数据管理单元2011提供接口,外部设备可通过接口来访问数据 库2012。And, the metadata management node 201 includes a database 2012, a metadata management unit 2011, an operating system 2013, and a hardware 2014. The database 2012 and the metadata management unit 2011 respectively run on the operating system 2013 in the form of a process, and the hardware 2014 is used for the bearer operation. The operation of the system 2013 includes the physical network card 20141, the physical network card 20141 accessing the network 50, and the metadata management unit 2011 providing an interface through which the external device can access the database 2012.
并且,数据存储节点101包括数据1012、流处理计算单元1011、操作系统1013以及硬件1014,数据库1012和流处理计算单元1011分别以进程的形式运行在操作系统1013上,硬件1014用于承载操作系统2013的运行,硬件1014包括物理网卡10141,物理网卡10141接入网络50,在本发明实施例中,流处理计算单元1011可访问数据库1012。Moreover, the data storage node 101 includes data 1012, a stream processing computing unit 1011, an operating system 1013, and a hardware 1014. The database 1012 and the stream processing computing unit 1011 respectively run on the operating system 1013 in the form of a process, and the hardware 1014 is used to carry the operating system. In the operation of 2013, the hardware 1014 includes a physical network card 10141, and the physical network card 10141 accesses the network 50. In the embodiment of the present invention, the stream processing computing unit 1011 can access the database 1012.
数据存储节点102和103的结构与数据存储节点101类似,于此不作赘述。The structure of the data storage nodes 102 and 103 is similar to that of the data storage node 101 and will not be described herein.
举例而言,流处理管理单元302与客户端301、元数据管理单元2011和各流处理计算单元101、1021、……、以及1031之间可通过RPC(Remote Procedure Call Protocol,远程过程调用协议)实现通信。For example, the stream processing management unit 302 and the client 301, the metadata management unit 2011, and the stream processing computing units 101, 1021, ..., and 1031 can pass RPC (Remote Procedure Call Protocol). Implement communication.
基于以上架构,在本发明实施例提供一种流处理方法,流处理管理单元302接收客户端301发送的流处理任务,其中流处理任务包括流处理逻辑和待处理文件在分布式文件系统的路径;流处理管理单元302从元数据管理节点201获取路径对应的各个块的块号,以及各个块的块号所在的数据存储节点的网络地址;流处理管理单元302分别将流处理逻辑和与各网络地址对应的块号发送至对应的数据存储节点的流处理单元;流处理计算单元从所在的数据存储节点获取接收到的块号对应的块数据,针对接收到的块号对应的块数据执行流处理逻辑。Based on the above architecture, the embodiment of the present invention provides a stream processing method, where the stream processing management unit 302 receives a stream processing task sent by the client 301, where the stream processing task includes a stream processing logic and a path of the file to be processed in the distributed file system. The stream processing management unit 302 acquires the block number of each block corresponding to the path from the metadata management node 201, and the network address of the data storage node where the block number of each block is located; the stream processing management unit 302 respectively processes the stream processing logic and each The block number corresponding to the network address is sent to the stream processing unit of the corresponding data storage node; the stream processing calculation unit acquires the block data corresponding to the received block number from the data storage node where it is located, and performs block data corresponding to the received block number. Stream processing logic.
由于本发明实施例将流处理计算单元分布设置在各数据存储节点上,并由流处理管理单元根据待处理文件的路径将流处理任务发送到对应的数据存储节点,由对应的数据存储节点上的流处理计算单元直接在本地读取待处理文件对应的块数据,并将读取到的块数据运行流处理逻辑,由于流处理计算单元本地读取待处理文件,因此可克服因流处理系统与数据存储节点之间的网络传输速度较低而影响流处理的速度的技术问题。The embodiment of the present invention distributes the stream processing calculation unit to each data storage node, and the stream processing management unit sends the stream processing task to the corresponding data storage node according to the path of the file to be processed, by the corresponding data storage node. The stream processing calculation unit directly reads the block data corresponding to the file to be processed locally, and runs the stream processing logic on the read block data. Since the stream processing calculation unit locally reads the file to be processed, the stream processing system can be overcome. A technical problem with the low speed of network transmission between data storage nodes and the speed of stream processing.
为了进一步清楚说明,以下请参见图3,图3是根据本发明实施例的流处理方法的数据交互图,如图3所示,流处理方法包括以下步骤:For further clarity, please refer to FIG. 3, which is a data interaction diagram of a stream processing method according to an embodiment of the present invention. As shown in FIG. 3, the stream processing method includes the following steps:
步骤401:流处理管理单元302接收客户端301发送的流处理任务,其中流处理任务包括流处理逻辑和待处理文件在分布式文件系统的路径。Step 401: The stream processing management unit 302 receives the stream processing task sent by the client 301, where the stream processing task includes the stream processing logic and the path of the file to be processed in the distributed file system.
举例而言,客户端301可为apache flink系统中的client进程,流处理管理单元302可为apache flink系统中的job manager进程。For example, the client 301 can be a client process in the apache flink system, and the stream processing management unit 302 can be a job manager process in the apache flink system.
步骤402:流处理管理单元302向元数据管理节点201发送查询请求,其中,查询请求携带有待处理文件在分布式文件系统的路径。Step 402: The stream processing management unit 302 sends a query request to the metadata management node 201, wherein the query request carries a path of the file to be processed in the distributed file system.
举例而言,查询请求包括输入参数和查询指令,流处理管理单元302以待处理文件在分布式文件系统的路径为输入参数,并将输入参数及控制指令发送至元数据管理节点201的元数据管理单元2011提供的用于访问数据库2012的接口中。For example, the query request includes an input parameter and a query instruction, and the stream processing management unit 302 takes the path of the file to be processed in the distributed file system as an input parameter, and sends the input parameter and the control instruction to the metadata of the metadata management node 201. The interface provided by the management unit 2011 for accessing the database 2012.
步骤403:元数据管理节点201根据查询请求返回路径对应的各个块的块号以及各个块对应的数据存储节点的网络地址至流处理管理单元302。Step 403: The metadata management node 201 returns the block number of each block corresponding to the path and the network address of the data storage node corresponding to each block to the stream processing management unit 302 according to the query request.
承上可知,元数据管理节点201的数据库2012记录有待处理文件在分布式文件系统的路径与各个块的块号的第一对应关系和各个块的块号与各个块所在的数据存储节点的网络地址的第二对应关系,因此,元数据管理节点201的流处理管理单元302根据待处理文件在分布式文件系统的路径从第一对应关系获取各个块的块号,并根据各个块的块号从第二对应关系获取各个块所在的数据存储节点的网络地址。As can be seen, the database 2012 of the metadata management node 201 records the first correspondence between the path of the file to be processed in the distributed file system and the block number of each block, and the block number of each block and the network of the data storage node where each block is located. The second correspondence relationship of the addresses, therefore, the stream processing management unit 302 of the metadata management node 201 acquires the block numbers of the respective blocks from the first correspondence relationship according to the path of the file to be processed in the distributed file system, and according to the block numbers of the respective blocks Obtaining the network address of the data storage node where each block is located from the second correspondence.
假设流处理管理单元302获取的块号分别为块号1和块号2,值得注意的是,在实际 应用中,块号包括多个,于此为简要说明,仅以两个块号为例进行说明,流处理管理单元302根据块号1查询到数据存储节点101的网络地址A,根据块号2查询到数据存储节点102的网络地址B。It is assumed that the block numbers acquired by the stream processing management unit 302 are block number 1 and block number 2, respectively. It is worth noting that in practical applications, the block number includes a plurality of, and for the sake of brief description, only two block numbers are taken as an example. To be explained, the stream processing management unit 302 queries the network address A of the data storage node 101 based on the block number 1, and queries the network address B of the data storage node 102 based on the block number 2.
步骤404:流处理管理单元302将流处理逻辑和块号1发送至流处理计算单元1011。Step 404: The stream processing management unit 302 transmits the stream processing logic and block number 1 to the stream processing computing unit 1011.
在本步骤中,流处理管理单元302根据块号1查询到数据存储节点101的网络地址A之后,将流处理任务和网络地址A对应的块号1发送至数据存储节点101的流处理计算单元1011。In this step, after the stream processing management unit 302 queries the network address A of the data storage node 101 according to the block number 1, the stream processing task and the block number 1 corresponding to the network address A are sent to the stream processing calculation unit of the data storage node 101. 1011.
步骤405:流处理管理单元302将流处理逻辑和块号2发送至流处理计算单元1021。Step 405: The stream processing management unit 302 transmits the stream processing logic and block number 2 to the stream processing computing unit 1021.
在本步骤中,流处理管理单元302根据块号2查询到数据存储节点102的网络地址B之后,将流处理任务和网络地址B对应的块号2发送至数据存储节点102的流处理计算单元1021。In this step, after the stream processing management unit 302 queries the network address B of the data storage node 102 according to the block number 2, the stream processing task and the block number 2 corresponding to the network address B are sent to the stream processing calculation unit of the data storage node 102. 1021.
在步骤404和405中,流处理计算单元1011可例如为apache flink系统中的一个task manager进程,流处理计算单元1021可例如为apache flink系统中的另一个task manager进程。In steps 404 and 405, the stream processing computing unit 1011 can be, for example, a task manager process in the apache flink system, and the stream processing computing unit 1021 can be, for example, another task manager process in the apache flink system.
步骤406:流处理计算单元1011从所在的数据存储节点101获取接收到的块号1对应的块数据,针对接收到的块号1对应的块数据执行流处理逻辑。Step 406: The stream processing calculation unit 1011 acquires the received block data corresponding to the block number 1 from the data storage node 101 where it is located, and executes stream processing logic for the block data corresponding to the received block number 1.
在本步骤中,流处理计算单元1011从所在的数据存储节点101的数据库1012获取从流处理管理单元302接收到的块号1对应的块数据,并针对块号1对应的块数据执行流处理逻辑。In this step, the stream processing calculation unit 1011 acquires the block data corresponding to the block number 1 received from the stream processing management unit 302 from the database 1012 of the data storage node 101 in which it is located, and performs stream processing for the block data corresponding to the block number 1. logic.
在一些示例中,数据存储节点101进一步设置有数据管理单元,数据管理单元用于访问数据库1012以管理数据库1012中的块数据。In some examples, data storage node 101 is further provided with a data management unit for accessing database 1012 to manage block data in database 1012.
举例而言,分布式文件系统可为Hadoop,Hadoop的数据库通过Hbase数据库实现,元数据管理单元2011为Hbase数据库的Hmaster进程,流处理计算单元设置为程序库,数据管理单元通过加载程序库执行流处理计算单元的功能。For example, the distributed file system can be Hadoop, the Hadoop database is implemented by the Hbase database, the metadata management unit 2011 is the Hmaster process of the Hbase database, the stream processing computing unit is set as a program library, and the data management unit executes the flow by loading the program library. Handle the functions of the computing unit.
进一步,数据管理单元例如为Hbase数据库的HReigonServer进程,HReigonServer进程将task manager进程嵌入到HReigonServer进程中,task manager进程可设置为格式是jar包或so文件的程序库,并提供启动接口,HReigonServer进程在加载程序库之后通过运行启动接口即可实现task manager进程的功能。Further, the data management unit is, for example, a HReigonServer process of the Hbase database, and the HReigonServer process embeds the task manager process into the HReigonServer process, and the task manager process can be set to a library of a jar package or a so file, and provides a startup interface, and the HReigonServer process is After loading the library, you can implement the task manager process by running the startup interface.
由于在本发明实施例中,实现task manager进程的功能的HReigonServer进程可在本地读取数据库1012的块数据,因此获取块数据的过程可避免受到外部网络性能的影响,且由于HReigonServer进程在进程内直接访问数据库1012,即直接从内存读取块数据,因此获取块数据的速度更快,可有效提高流处理的效率。Since in the embodiment of the present invention, the HReigonServer process that implements the function of the task manager process can locally read the block data of the database 1012, the process of acquiring the block data can be prevented from being affected by the performance of the external network, and since the HReigonServer process is in the process Direct access to the database 1012, that is, directly read the block data from the memory, so the speed of the block data is faster, which can effectively improve the efficiency of stream processing.
在另一些示例中,数据管理单元与流处理计算单元1011可同时运行在操作系统1013,流处理计算单元1011通过数据管理单元提供的接口访问数据库1012,在该些示例中,虽然不是通过HReigonServer进程在进程内直接访问数据库1012,但流处理计算单元1011可在本地访问数据库1012,亦可避免到外部网络性能的影响。In other examples, the data management unit and the stream processing computing unit 1011 can concurrently run at the operating system 1013, and the stream processing computing unit 1011 accesses the database 1012 through an interface provided by the data management unit, in these examples, although not through the HReigonServer process. The database 1012 is directly accessed within the process, but the stream processing computing unit 1011 can access the database 1012 locally, and can also avoid the impact on external network performance.
步骤407:流处理计算单元1021从所在的数据存储节点102获取接收到的块号2对应的块数据,针对接收到的块号2对应的块数据执行流处理逻辑。Step 407: The stream processing calculation unit 1021 acquires the block data corresponding to the received block number 2 from the data storage node 102 where it is located, and executes stream processing logic for the block data corresponding to the received block number 2.
与上一步骤类似,在一些示例中,数据存储节点102设置有数据管理单元,数据管理单元用于访问数据库1022以管理块数据。分布式文件系统可为Hadoop,Hadoop的 数据库通过Hbase数据库实现,元数据管理单元2011为Hbase数据库的Hmaster进程,流处理计算单元1011设置为程序库,数据管理单元通过加载程序库执行流处理计算单元的功能。Similar to the previous step, in some examples, data storage node 102 is provided with a data management unit for accessing database 1022 to manage block data. The distributed file system can be Hadoop, the Hadoop database is implemented by the Hbase database, the metadata management unit 2011 is the Hmaster process of the Hbase database, the stream processing computing unit 1011 is set as a program library, and the data management unit executes the stream processing calculation unit by loading the program library. The function.
进一步,数据管理单元例如为Hbase数据库的HReigonServer进程,HReigonServer进程将task manager进程嵌入到HReigonServer进程中,task manager进程可设置为格式是jar包或so文件的程序库,并提供启动接口,HReigonServer进程在加载程序库之后通过运行启动接口即可实现task manager进程的功能。Further, the data management unit is, for example, a HReigonServer process of the Hbase database, and the HReigonServer process embeds the task manager process into the HReigonServer process, and the task manager process can be set to a library of a jar package or a so file, and provides a startup interface, and the HReigonServer process is After loading the library, you can implement the task manager process by running the startup interface.
由于在本发明实施例中,实现task manager进程的功能的HReigonServer进程可在本地读取数据库1022的块数据,因此获取块数据的过程可避免受到外部网络性能的影响,且由于HReigonServer进程在进程内直接访问数据库1022,因此获取块数据的速度更快,可有效提高流处理的效率。Since in the embodiment of the present invention, the HReigonServer process that implements the function of the task manager process can read the block data of the database 1022 locally, the process of acquiring the block data can be prevented from being affected by the performance of the external network, and since the HReigonServer process is in the process Direct access to the database 1022, so the speed of obtaining block data is faster, and the efficiency of stream processing can be effectively improved.
在另一些示例中,数据管理单元与流处理计算单元1021可同时运行在操作系统1023,流处理计算单元1021通过数据管理单元提供的接口访问数据库1022,在该些示例中,虽然不是通过HReigonServer进程在进程内直接访问数据库1012,但流处理计算单元1021可在本地访问数据库1022,亦可避免到外部网络性能的影响。In other examples, the data management unit and the stream processing computing unit 1021 can concurrently run at the operating system 1023, and the stream processing computing unit 1021 accesses the database 1022 through an interface provided by the data management unit, in these examples, although not through the HReigonServer process. The database 1012 is directly accessed within the process, but the stream processing computing unit 1021 can access the database 1022 locally, and can also avoid the impact on external network performance.
步骤408:流处理计算单元将对块号1对应的块数据执行流处理逻辑获取的第一处理结果发送至流处理管理单元302。Step 408: The stream processing calculation unit sends the first processing result obtained by the stream processing logic to the block data corresponding to the block number 1 to the stream processing management unit 302.
步骤409:流处理计算单元将对块号2对应的块数据执行流处理逻辑获取的第二处理结果发送至流处理管理单元302。Step 409: The stream processing calculation unit transmits the second processing result obtained by the stream processing logic of the block data corresponding to the block number 2 to the stream processing management unit 302.
综上,由于本发明实施例将流处理计算单元分布设置在各数据存储节点上,并由流处理管理单元根据待处理文件的路径将流处理任务发送到对应的数据存储节点,由对应的数据存储节点上的流处理计算单元直接在本地读取待处理文件对应的块数据,并将读取到的块数据运行流处理逻辑,由于流处理计算单元本地读取待处理文件,因此可克服因流处理系统与数据存储节点之间的网络传输速度较低而影响流处理的速度的技术问题。In summary, the embodiment of the present invention distributes the stream processing calculation unit to each data storage node, and the stream processing management unit sends the stream processing task to the corresponding data storage node according to the path of the file to be processed, and the corresponding data. The stream processing calculation unit on the storage node directly reads the block data corresponding to the file to be processed locally, and runs the stream processing logic on the read block data. Since the stream processing calculation unit reads the file to be processed locally, the cause can be overcome. The technical problem of low network transmission speed between the stream processing system and the data storage node affecting the speed of stream processing.
并且,由于待处理文件被分散为块数据,分别在不同的流处理计算单元并行执行流处理逻辑,因此可以进一步加快流处理速度,提高处理效率。Moreover, since the files to be processed are dispersed into block data, the stream processing logic is executed in parallel in different stream processing calculation units, so that the stream processing speed can be further accelerated and the processing efficiency can be improved.
值得注意的是,在本发明的可选实施例中,流处理系统90还可以基于Storm、Spark或Samza架构实现。It should be noted that in an alternative embodiment of the present invention, stream processing system 90 may also be implemented based on the Storm, Spark, or Samza architecture.
以下请参见图4,图4是根据本发明实施例的流处理管理单元的装置结构示意图,如图4所示,流处理管理单元302包括:Referring to FIG. 4, FIG. 4 is a schematic structural diagram of a device of a stream processing management unit according to an embodiment of the present invention. As shown in FIG. 4, the stream processing management unit 302 includes:
接收模块601,用于接收客户端发送的流处理任务,其中流处理任务包括流处理逻辑和待处理文件在分布式文件系统的路径,分布式文件系统包括元数据管理节点和多个数据存储节点,每一数据存储节点设置有流处理计算单元;The receiving module 601 is configured to receive a stream processing task sent by the client, where the stream processing task includes a path of the stream processing logic and the file to be processed in the distributed file system, where the distributed file system includes a metadata management node and multiple data storage nodes. Each data storage node is provided with a stream processing computing unit;
查询模块602,用于从元数据管理节点获取路径对应的各个块的块号,以及各个块所在的数据存储节点的网络地址;The query module 602 is configured to obtain, from the metadata management node, a block number of each block corresponding to the path, and a network address of the data storage node where each block is located;
发送模块603,用于分别将流处理逻辑和各个块的块号发送至各个块所在的数据存储节点的流处理单元。The sending module 603 is configured to separately send the stream processing logic and the block number of each block to the stream processing unit of the data storage node where each block is located.
可选地,接收单元601还用于接收流处理计算单元发送的执行流处理逻辑获取的处理结果。Optionally, the receiving unit 601 is further configured to receive a processing result obtained by the execution stream processing logic sent by the stream processing calculation unit.
可选地,元数据管理节点记录有待处理文件在分布式文件系统的路径与各个块的块号的第一对应关系,各个块的块号与各个块所在的数据存储节点的网络地址的第二对应关系,查询模块602具体用于:Optionally, the metadata management node records the first correspondence between the path of the file to be processed in the distributed file system and the block number of each block, and the block number of each block and the network address of the data storage node where each block is located. Corresponding relationship, the query module 602 is specifically used to:
根据待处理文件在分布式文件系统的路径从第一对应关系获取各个块的块号;Obtaining a block number of each block from the first correspondence according to the path of the file to be processed in the distributed file system;
根据各个块的块号从第二对应关系获取各个块所在的数据存储节点的网络地址。The network address of the data storage node where each block is located is obtained from the second correspondence according to the block number of each block.
以下请参见图5,图5是根据本发明实施例的主机的装置结构示意图,如图5所示,主机50包括存储器502、处理器501和总线503,存储器502、处理器501与总线503连接,存储器502存储有程序指令,处理器501执行程序指令以实现上述流处理系统中的流处理管理单元302的功能。Referring to FIG. 5, FIG. 5 is a schematic structural diagram of a device according to an embodiment of the present invention. As shown in FIG. 5, the host 50 includes a memory 502, a processor 501, and a bus 503. The memory 502 and the processor 501 are connected to the bus 503. The memory 502 stores program instructions, and the processor 501 executes program instructions to implement the functions of the stream processing management unit 302 in the stream processing system described above.
由于本发明实施例将流处理计算单元分布设置在各数据存储节点上,并由流处理管理单元根据待处理文件的路径将流处理任务发送到对应的数据存储节点,由对应的数据存储节点上的流处理计算单元直接在本地读取待处理文件对应的块数据,并将读取到的块数据运行流处理逻辑,由于流处理计算单元本地读取待处理文件,因此可克服因流处理系统与数据存储节点之间的网络传输速度较低而影响流处理的速度的技术问题。The embodiment of the present invention distributes the stream processing calculation unit to each data storage node, and the stream processing management unit sends the stream processing task to the corresponding data storage node according to the path of the file to be processed, by the corresponding data storage node. The stream processing calculation unit directly reads the block data corresponding to the file to be processed locally, and runs the stream processing logic on the read block data. Since the stream processing calculation unit locally reads the file to be processed, the stream processing system can be overcome. A technical problem with the low speed of network transmission between data storage nodes and the speed of stream processing.
并且,由于待处理文件被分散为块数据,分别在不同的流处理计算单元并行执行流处理逻辑,因此可以进一步加快流处理速度,提高处理效率。Moreover, since the files to be processed are dispersed into block data, the stream processing logic is executed in parallel in different stream processing calculation units, so that the stream processing speed can be further accelerated and the processing efficiency can be improved.
需说明的是,以上描述的任意装置实施例都仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部进程来实现本实施例方案的目的。另外,本发明提供的装置实施例附图中,进程之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。本领域普通技术人员在不付出创造性劳动的情况下,即可以理解并实施。It should be noted that any of the device embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separated, and the components displayed as the cells may or may not be Physical units can be located in one place or distributed to multiple network elements. Some or all of the processes may be selected according to actual needs to achieve the purpose of the solution of the embodiment. In addition, in the drawings of the apparatus embodiments provided by the present invention, the connection relationship between processes indicates that there is a communication connection between them, and specifically may be implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement without any creative effort.
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本发明可借助软件加必需的通用硬件的方式来实现,当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下,凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现,而且,用来实现同一功能的具体硬件结构也可以是多种多样的,例如模拟电路、数字电路或专用电路等。但是,对本发明而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解,本发明的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在可读取的存储介质中,如计算机的软盘,U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本发明各个实施例所述的方法。Through the description of the above embodiments, those skilled in the art can clearly understand that the present invention can be implemented by means of software plus necessary general hardware, and of course, dedicated hardware, dedicated CPU, dedicated memory, dedicated memory, Special components and so on. In general, functions performed by computer programs can be easily implemented with the corresponding hardware, and the specific hardware structure used to implement the same function can be various, such as analog circuits, digital circuits, or dedicated circuits. Circuits, etc. However, for the purposes of the present invention, software program implementation is a better implementation in more cases. Based on the understanding, the technical solution of the present invention, which is essential or contributes to the prior art, can be embodied in the form of a software product stored in a readable storage medium, such as a floppy disk of a computer. , U disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), disk or optical disk, etc., including a number of instructions to make a computer device (may be A personal computer, server, or network device, etc.) performs the methods described in various embodiments of the present invention.
所属领域的技术人员可以清楚地了解到,上述描述的系统、装置或单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。A person skilled in the art can clearly understand that the specific working process of the system, the device or the unit described above can refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.
以上所述,仅为本发明的具体实施方式,但本发明的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本发明揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本发明的保护范围之内。因此,本发明的保护范围应以所述权利要求的保护范围为准。The above is only a specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily think of changes or substitutions within the technical scope of the present invention. It should be covered by the scope of the present invention. Therefore, the scope of the invention should be determined by the scope of the appended claims.

Claims (12)

  1. 一种流处理方法,其特征在于,所述方法应用于流处理系统,所述流处理系统包括流处理管理单元以及流处理计算单元,所述方法包括:A stream processing method, wherein the method is applied to a stream processing system, the stream processing system comprising a stream processing management unit and a stream processing computing unit, the method comprising:
    所述流处理管理单元接收客户端发送的流处理任务,其中所述流处理任务包括流处理逻辑和待处理文件在分布式文件系统的路径,所述分布式文件系统包括元数据管理节点和多个数据存储节点,每一所述数据存储节点设置有流处理计算单元;The stream processing management unit receives a stream processing task sent by a client, where the stream processing task includes a path of a stream processing logic and a file to be processed in a distributed file system, where the distributed file system includes a metadata management node and a plurality of Data storage nodes, each of the data storage nodes is provided with a stream processing calculation unit;
    所述流处理管理单元从所述元数据管理节点获取所述待处理文件的路径对应的各个块的块号,以及各个块所在的数据存储节点的网络地址;The stream processing management unit acquires, from the metadata management node, a block number of each block corresponding to the path of the file to be processed, and a network address of a data storage node where each block is located;
    所述流处理管理单元分别将所述流处理逻辑和所述各个块的块号发送至各个块所在的数据存储节点的流处理单元;The stream processing management unit respectively sends the stream processing logic and the block number of each block to a stream processing unit of a data storage node where each block is located;
    所述流处理计算单元从所在的数据存储节点获取接收到的所述块号对应的块数据,针对所述接收到的所述块号对应的块数据执行所述流处理逻辑。The stream processing calculation unit acquires the received block data corresponding to the block number from the data storage node where it is located, and executes the stream processing logic for the block data corresponding to the received block number.
  2. 根据权利要求1所述的方法,其特征在于,所述数据存储节点设置有数据管理单元,所述流处理计算单元设置为程序库,所述数据管理单元通过加载所述程序库执行所述流处理计算单元的功能。The method according to claim 1, wherein the data storage node is provided with a data management unit, the stream processing calculation unit is configured as a library, and the data management unit executes the stream by loading the library Handle the functions of the computing unit.
  3. 根据权利要求1或2所述的方法,其特征在于,所述方法还包括:The method according to claim 1 or 2, wherein the method further comprises:
    所述流处理计算单元将执行所述流处理逻辑获取的处理结果发送至所述流处理管理单元。The stream processing calculation unit transmits a processing result obtained by executing the stream processing logic to the stream processing management unit.
  4. 根据权利要求1至3任一项所述的方法,其特征在于,所述元数据管理节点记录有所述待处理文件在所述分布式文件系统的路径与各个块的块号的第一对应关系,所述流处理管理单元从所述元数据管理节点获取所述路径对应的各个块的块号,以及各个块所在的数据存储节点的网络地址具体包括:The method according to any one of claims 1 to 3, wherein the metadata management node records a first correspondence between a path of the file to be processed in the distributed file system and a block number of each block The relationship, the flow processing management unit acquires, from the metadata management node, the block number of each block corresponding to the path, and the network address of the data storage node where each block is located, specifically:
    所述流处理管理单元根据所述待处理文件在所述分布式文件系统的路径从所述第一对应关系获取各个块的块号。And the stream processing management unit acquires a block number of each block from the first correspondence relationship according to the path of the to-be-processed file in the distributed file system.
  5. 根据权利要求4所述的方法,其特征在于,所述元数据管理节点记录有各个块的块号与各个块所在的数据存储节点的网络地址的第二对应关系,所述流处理管理单元从所述元数据管理节点获取所述路径对应的各个块的块号,以及各所述个块的块号所在的数据存储节点的网络地址具体包括:The method according to claim 4, wherein the metadata management node records a second correspondence between a block number of each block and a network address of a data storage node where each block is located, the stream processing management unit The data management node obtains the block number of each block corresponding to the path, and the network address of the data storage node where the block number of each of the blocks is located specifically includes:
    所述流处理管理单元根据各个块的块号从所述第二对应关系获取各个块所在的数据存储节点的网络地址。The stream processing management unit acquires a network address of a data storage node where each block is located from the second correspondence according to a block number of each block.
  6. 一种流处理系统,其特征在于,包括流处理管理单元以及流处理计算单元,A stream processing system, comprising: a stream processing management unit and a stream processing computing unit,
    所述流处理管理单元,用于接收客户端发送的流处理任务,其中所述流处理任务包括流处理逻辑和待处理文件在分布式文件系统的路径,所述分布式文件系统包括元数据管理节点和多个数据存储节点,每一所述数据存储节点设置有流处理计算单元;The stream processing management unit is configured to receive a stream processing task sent by a client, where the stream processing task includes a path of a stream processing logic and a file to be processed in a distributed file system, where the distributed file system includes metadata management. a node and a plurality of data storage nodes, each of the data storage nodes being provided with a stream processing computing unit;
    所述流处理管理单元,还用于从所述元数据管理节点获取所述路径对应的各个块的块号,以及各个块所在的数据存储节点的网络地址;The stream processing management unit is further configured to acquire, from the metadata management node, a block number of each block corresponding to the path, and a network address of a data storage node where each block is located;
    所述流处理管理单元,还用于分别将所述流处理逻辑和所述各个块的块号发送至各个块所在的数据存储节点的流处理单元;The stream processing management unit is further configured to separately send the stream processing logic and the block number of each block to a stream processing unit of a data storage node where each block is located;
    所述流处理计算单元,用于从所在的数据存储节点获取接收到的所述块号对应的块数据,针对所述接收到的所述块号对应的块数据执行所述流处理逻辑。The stream processing calculation unit is configured to acquire the received block data corresponding to the block number from the data storage node where the data is stored, and execute the stream processing logic for the block data corresponding to the received block number.
  7. 根据权利要求6所述的系统,其特征在于,所述数据存储节点设置有数据管理单元,所述流处理计算单元设置为程序库,所述数据管理单元通过加载所述程序库执行所述流处理计算单元的功能。The system according to claim 6, wherein said data storage node is provided with a data management unit, said stream processing calculation unit being configured as a library, said data management unit executing said stream by loading said library Handle the functions of the computing unit.
  8. 根据权利要求6所述的系统,其特征在于,The system of claim 6 wherein:
    所述流处理计算单元,还用于将执行所述流处理逻辑获取的处理结果发送至所述流处理管理单元。The stream processing calculation unit is further configured to send a processing result that is executed by the stream processing logic to the stream processing management unit.
  9. 根据权利要求6所述的系统,其特征在于,所述元数据管理节点记录有所述待处理文件在所述分布式文件系统的路径与各个块的块号的第一对应关系,所述流处理管理单元具体用于:The system according to claim 6, wherein the metadata management node records a first correspondence between a path of the file to be processed in the distributed file system and a block number of each block, the stream The processing management unit is specifically used to:
    根据所述待处理文件在所述分布式文件系统的路径从所述第一对应关系获取各个块的块号。Obtaining a block number of each block from the first correspondence according to the path of the to-be-processed file in the distributed file system.
  10. 根据权利要求9所述的系统,其特征在于,所述元数据管理节点记录有各所述块号与各所述块号所在的数据存储节点的网络地址的第二对应关系,所述流处理管理单元具体用于:The system according to claim 9, wherein the metadata management node records a second correspondence between each block number and a network address of a data storage node where each block number is located, the stream processing The management unit is specifically used to:
    根据各个块的块号从所述第二对应关系获取各个块所在的数据存储节点的网络地址。The network address of the data storage node where each block is located is obtained from the second correspondence according to the block number of each block.
  11. 一种流处理管理单元,其特征在于,包括:A stream processing management unit, comprising:
    接收模块,用于接收客户端发送的流处理任务,其中所述流处理任务包括流处理逻辑和待处理文件在分布式文件系统的路径,所述分布式文件系统包括元数据管理节点和多个数据存储节点,每一所述数据存储节点设置有流处理计算单元;a receiving module, configured to receive a stream processing task sent by the client, where the stream processing task includes a path of the stream processing logic and the file to be processed in a distributed file system, where the distributed file system includes a metadata management node and multiple a data storage node, each of the data storage nodes is provided with a stream processing calculation unit;
    查询模块,用于从所述元数据管理节点获取所述路径对应的各个块的块号,以及各个块所在的数据存储节点的网络地址;a query module, configured to acquire, from the metadata management node, a block number of each block corresponding to the path, and a network address of a data storage node where each block is located;
    发送模块,用于分别将所述流处理逻辑和所述各个块的块号发送至各个块所在的数据存储节点的流处理单元。And a sending module, configured to separately send the stream processing logic and the block number of the respective block to a stream processing unit of the data storage node where each block is located.
  12. 一种主机,其特征在于,包括存储器、处理器和总线,所述存储器、所述处理器与所述总线连接,所述存储器存储有程序指令,所述处理器执行所述程序指令以使得所述主机执行下述步骤:A host, comprising: a memory, a processor and a bus, the memory, the processor being connected to the bus, the memory storing program instructions, the processor executing the program instructions to cause The host performs the following steps:
    接收客户端发送的流处理任务,其中所述流处理任务包括流处理逻辑和待处理文件在分布式文件系统的路径,所述分布式文件系统包括元数据管理节点和多个数据存储节点,每一所述数据存储节点设置有流处理计算单元;Receiving a stream processing task sent by the client, where the stream processing task includes a stream processing logic and a path of the file to be processed in a distributed file system, where the distributed file system includes a metadata management node and a plurality of data storage nodes, each One of the data storage nodes is provided with a stream processing computing unit;
    从所述元数据管理节点获取所述路径对应的各个块的块号,以及各个块所在的数据存储节点的网络地址;Obtaining, from the metadata management node, a block number of each block corresponding to the path, and a network address of a data storage node where each block is located;
    分别将所述流处理逻辑和所述各个块的块号发送至各个块所在的数据存储节点的流处理单元。The stream processing logic and the block number of the respective block are respectively sent to a stream processing unit of a data storage node where each block is located.
PCT/CN2018/082641 2017-04-11 2018-04-11 Stream processing method and device WO2018188607A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710233425.0 2017-04-11
CN201710233425.0A CN108696559B (en) 2017-04-11 2017-04-11 Stream processing method and device

Publications (1)

Publication Number Publication Date
WO2018188607A1 true WO2018188607A1 (en) 2018-10-18

Family

ID=63792265

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/082641 WO2018188607A1 (en) 2017-04-11 2018-04-11 Stream processing method and device

Country Status (2)

Country Link
CN (1) CN108696559B (en)
WO (1) WO2018188607A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111435938B (en) * 2019-01-14 2022-11-29 阿里巴巴集团控股有限公司 Data request processing method, device and equipment
CN110046131A (en) * 2019-01-23 2019-07-23 阿里巴巴集团控股有限公司 The Stream Processing method, apparatus and distributed file system HDFS of data
CN111290744B (en) * 2020-01-22 2023-07-21 北京百度网讯科技有限公司 Flow computing job processing method, flow computing system and electronic device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6415297B1 (en) * 1998-11-17 2002-07-02 International Business Machines Corporation Parallel database support for workflow management systems
US20090125553A1 (en) * 2007-11-14 2009-05-14 Microsoft Corporation Asynchronous processing and function shipping in ssis
CN102456185A (en) * 2010-10-29 2012-05-16 金蝶软件(中国)有限公司 Distributed workflow processing method and distributed workflow engine system
CN106339415A (en) * 2016-08-12 2017-01-18 北京奇虎科技有限公司 Data checking method, device and system

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4511469B2 (en) * 2003-10-27 2010-07-28 株式会社ターボデータラボラトリー Information processing method and information processing system
US7613848B2 (en) * 2006-06-13 2009-11-03 International Business Machines Corporation Dynamic stabilization for a stream processing system
US8150889B1 (en) * 2008-08-28 2012-04-03 Amazon Technologies, Inc. Parallel processing framework
CN101741885A (en) * 2008-11-19 2010-06-16 珠海市西山居软件有限公司 Distributed system and method for processing task flow thereof
US20110313934A1 (en) * 2010-06-21 2011-12-22 Craig Ronald Van Roy System and Method for Configuring Workflow Templates
CN102467411B (en) * 2010-11-19 2013-11-27 金蝶软件(中国)有限公司 Workflow processing and workflow agent method, device and system
CN102542367B (en) * 2010-12-10 2015-03-11 金蝶软件(中国)有限公司 Cloud computing network workflow processing method, device and system based on domain model
US9361323B2 (en) * 2011-10-04 2016-06-07 International Business Machines Corporation Declarative specification of data integration workflows for execution on parallel processing platforms
CN103309867A (en) * 2012-03-09 2013-09-18 句容智恒安全设备有限公司 Web data mining system on basis of Hadoop platform
US20130253977A1 (en) * 2012-03-23 2013-09-26 Commvault Systems, Inc. Automation of data storage activities
AU2015259417B2 (en) * 2014-05-13 2016-09-22 Datomia Research Labs OṺ Distributed secure data storage and transmission of streaming media content
CN104063486B (en) * 2014-07-03 2017-07-11 四川中亚联邦科技有限公司 A kind of big data distributed storage method and system
CN105608077A (en) * 2014-10-27 2016-05-25 青岛金讯网络工程有限公司 Big data distributed storage method and system
CN104536814B (en) * 2015-01-16 2019-01-22 北京京东尚科信息技术有限公司 A kind of method and system handling workflow
CN104657497A (en) * 2015-03-09 2015-05-27 国家电网公司 Mass electricity information concurrent computation system and method based on distributed computation
CN105468756A (en) * 2015-11-30 2016-04-06 浪潮集团有限公司 Design and implementation method of mass data processing system
CN106155791B (en) * 2016-06-30 2019-05-07 电子科技大学 A Workflow Task Scheduling Method in Distributed Environment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6415297B1 (en) * 1998-11-17 2002-07-02 International Business Machines Corporation Parallel database support for workflow management systems
US20090125553A1 (en) * 2007-11-14 2009-05-14 Microsoft Corporation Asynchronous processing and function shipping in ssis
CN102456185A (en) * 2010-10-29 2012-05-16 金蝶软件(中国)有限公司 Distributed workflow processing method and distributed workflow engine system
CN106339415A (en) * 2016-08-12 2017-01-18 北京奇虎科技有限公司 Data checking method, device and system

Also Published As

Publication number Publication date
CN108696559B (en) 2021-08-20
CN108696559A (en) 2018-10-23

Similar Documents

Publication Publication Date Title
US11711420B2 (en) Automated management of resource attributes across network-based services
US11461330B2 (en) Managed query service
CN109997126B (en) Event driven extraction, transformation, and loading (ETL) processing
US11411921B2 (en) Enabling access across private networks for a managed blockchain service
US20050071209A1 (en) Binding a workflow engine to a data model
Essa et al. Mobile agent based new framework for improving big data analysis
WO2019057055A1 (en) Task processing method and apparatus, electronic device, and storage medium
US10944814B1 (en) Independent resource scheduling for distributed data processing programs
US11601495B2 (en) Mechanism for a work node scan process to facilitate cluster scaling
US11695840B2 (en) Dynamically routing code for executing
WO2016061935A1 (en) Resource scheduling method, device and computer storage medium
CN106547790B (en) A relational database service system
CN109298937A (en) File parsing method and network device
CN112685499A (en) Method, device and equipment for synchronizing process data of work service flow
WO2022062661A1 (en) Operation notification method and apparatus, and storage medium and electronic apparatus
WO2018188607A1 (en) Stream processing method and device
WO2017020716A1 (en) Method and device for data access control
US8627341B2 (en) Managing events generated from business objects
CN112667393B (en) Method and device for building distributed task computing scheduling framework and computer equipment
WO2024045646A9 (en) Method, apparatus and system for managing cluster access permission
US11757959B2 (en) Dynamic data stream processing for Apache Kafka using GraphQL
Marian et al. Analysis of Different SaaS Architectures from a Trust Service Provider Perspective
US20230246916A1 (en) Service map conversion with preserved historical information
CN109587224B (en) Data processing method and device, electronic equipment and computer readable medium
JP2017117355A (en) Information processing system, information processing apparatus, processing control method, and processing control program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18784079

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18784079

Country of ref document: EP

Kind code of ref document: A1