WO2018188607A1

WO2018188607A1 - Stream processing method and device

Info

Publication number: WO2018188607A1
Application number: PCT/CN2018/082641
Authority: WO
Inventors: 曹俊; 胡斐然; 林铭
Original assignee: 华为技术有限公司
Priority date: 2017-04-11
Filing date: 2018-04-11
Publication date: 2018-10-18
Also published as: CN108696559B; CN108696559A

Abstract

Disclosed in the embodiments of the present invention are a stream processing method and device, the method comprising: a stream processing management unit receives a stream processing task sent by a client; the stream processing management unit obtains, from a metadata management node, the block number of each block corresponding to the path of a file to be processed and the network address of the data storage node where each block is located; the stream processing management unit sends stream processing logic and the block number of each block to a stream processing unit of the data storage node where each block is located respectively; and a stream processing calculation unit obtains block data corresponding to the received block number from the data storage node where the same is located, and executes a stream processing logic for the block data corresponding to the received block number. By means of the solution above, the technical problem of the low network transmission speed between a stream processing system and a data storage node affecting the speed of stream processing may be overcome.

Description

Stream processing method and device

Technical field

The present invention relates to the field of information technology, and in particular, to a stream processing method and apparatus.

Background

A workflow is an abstraction, generalization, and description of the processes in a piece of work and of the logical rules that govern how those processes are ordered relative to one another. The workflow concept originated in the fields of production organization and office automation, and was proposed for activities in daily work that follow a fixed procedure. Its purpose is to break work down into well-defined tasks or roles, and to execute and monitor those tasks according to certain rules and processes, so as to improve work efficiency, better control processes, enhance customer service, and effectively manage business processes. Workflow modeling means representing a workflow in a computer with an appropriate model and implementing it. Once modeled, workflows can be managed by a workflow system.

The main function of a stream processing system is to define, execute, and manage workflows with the support of computer technology, and to coordinate the information interaction between the processes in a workflow and between the members of a group while the workflow executes. A stream processing system usually consists of a workflow design tool and a workflow management tool. The workflow design tool allows users to design their own workflow definitions, and the workflow management tool is responsible for managing the execution of the workflow. While the workflow system is running, a workflow instance includes one or more tasks, and each agent needs to perform some of the work.

Apache Storm is a typical stream processing system in the prior art. It uses a master-slave architecture, in which Nimbus is the master process and Supervisor is the slave process that runs the service. Storm establishes a network connection with a distributed file system, which stores the data that Storm needs to process. The distributed file system includes a Master Server (primary server) and Data Servers (data servers). The Master Server is the metadata management node and manages the distribution of data blocks. A Data Server is a data storage node and stores block data. Storm and the data storage nodes are deployed on different servers.

In a Storm stream processing operation, Storm first needs to obtain the data to be processed from the data server. Specifically, the data server provides a data query interface; Storm passes parameters to this interface over the network, acquires the data from the data server over the network, and then loads the acquired data into the Supervisor.

In the prior art, because the stream processing system needs to acquire data from the data storage node over the network, the speed of data acquisition is limited by network performance, so the performance of the entire stream processing operation may be limited by the network. When the network transmission speed between the stream processing system and the data storage node is low, the speed of stream processing is greatly affected.

Summary of the invention

To solve the problem of the prior art, an embodiment of the present invention provides a stream processing method and apparatus, which can overcome the technical problem that the speed of the stream processing is affected by the low network transmission speed between the stream processing system and the data storage node.

In a first aspect, an embodiment of the present invention provides a stream processing method, where the method is applied to a stream processing system, where the stream processing system includes a stream processing management unit and a stream processing computing unit, and the method includes:

The stream processing management unit receives a stream processing task sent by the client, where the stream processing task includes stream processing logic and a path of the file to be processed in the distributed file system, the distributed file system includes a metadata management node and a plurality of data storage nodes, and each data storage node is provided with a stream processing computing unit;

The stream processing management unit acquires, from the metadata management node, a block number of each block corresponding to the path of the file to be processed, and a network address of the data storage node where each block is located;

The stream processing management unit respectively sends the stream processing logic and the block number of each block to the stream processing unit of the data storage node where each block is located;

The stream processing calculation unit acquires the block data corresponding to the received block number from the data storage node where it is located, and executes stream processing logic for the block data corresponding to the received block number.

The embodiment of the present invention distributes the stream processing calculation units to the data storage nodes. The stream processing management unit sends the stream processing task to the corresponding data storage nodes according to the path of the file to be processed, and the stream processing calculation unit on each corresponding data storage node directly reads the block data of the file to be processed locally and runs the stream processing logic on the block data it reads. Because the stream processing calculation unit reads the file to be processed locally, the technical problem that a low network transmission speed between the stream processing system and the data storage nodes affects the speed of stream processing can be overcome.

Moreover, since the files to be processed are dispersed into block data, the stream processing logic is executed in parallel in different stream processing calculation units, so that the stream processing speed can be further accelerated and the processing efficiency can be improved.
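As a rough illustration of the locality and parallelism described above, the following Python sketch models each data storage node as an object that holds its own blocks together with a co-located computing unit. All names here (`DataStorageNode`, `run_logic`, `count_anomalies`, `process_in_parallel`) and the sample records are hypothetical and are not taken from the embodiment; threads merely stand in for separate nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for "stream processing logic": any function of block data.
Logic = Callable[[List[str]], int]


@dataclass
class DataStorageNode:
    """A data storage node with a co-located stream processing computing unit."""
    address: str
    blocks: Dict[int, List[str]] = field(default_factory=dict)  # block number -> block data

    def run_logic(self, logic: Logic, block_number: int) -> int:
        # The computing unit reads the block from its own node (no network hop)
        # and runs the received logic on it.
        block_data = self.blocks[block_number]
        return logic(block_data)


def count_anomalies(records: List[str]) -> int:
    """Example logic (an assumption): count records flagged as anomalous."""
    return sum(1 for r in records if "ERROR" in r)


def process_in_parallel(
    assignments: List[Tuple[DataStorageNode, int]], logic: Logic
) -> Dict[int, int]:
    """Run the logic on every (node, block number) pair concurrently."""
    with ThreadPoolExecutor() as pool:
        futures = {bn: pool.submit(node.run_logic, logic, bn) for node, bn in assignments}
        return {bn: f.result() for bn, f in futures.items()}


node_a = DataStorageNode("A", {1: ["ok", "ERROR disk", "ok"]})
node_b = DataStorageNode("B", {2: ["ERROR net", "ERROR cpu"]})
results = process_in_parallel([(node_a, 1), (node_b, 2)], count_anomalies)
print(results)
```

In this sketch only the logic and the block numbers travel to the nodes, and only the per-block results travel back; the block data itself never leaves the node that stores it.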

In an implementation manner of the embodiment of the present invention, the data storage node is provided with a data management unit, and the stream processing calculation unit is configured as a program library, and the data management unit performs a function of the stream processing calculation unit by loading the program library.

Because the stream processing calculation unit is provided in the data management unit as a program library, and the data management unit can read the block data directly, the data management unit can execute the stream processing logic as soon as it has read the block data, which speeds up stream processing.

In another implementation manner of the embodiment of the present invention, the method further includes:

The stream processing calculation unit sends the processing result obtained by executing the stream processing logic to the stream processing management unit.

In another implementation manner of the embodiment of the present invention, the metadata management node records a first correspondence between the path of the file to be processed in the distributed file system and the block number of each block, and the step in which the stream processing management unit obtains, from the metadata management node, the block number of each block corresponding to the path and the network address of the data storage node where each block is located specifically includes:

The stream processing management unit acquires the block numbers of the respective blocks from the first correspondence relationship according to the path of the file to be processed in the distributed file system.

In another implementation manner of the embodiment of the present invention, the metadata management node records a second correspondence between the block number of each block and the network address of the data storage node where each block is located, and the step in which the stream processing management unit obtains, from the metadata management node, the block number of each block corresponding to the path and the network address of the data storage node where each block is located specifically includes:

The stream processing management unit acquires the network address of the data storage node where each block number is located from the second correspondence relationship according to each block number.

In a second aspect, an embodiment of the present invention provides a stream processing system, including a stream processing management unit and a stream processing computing unit.

A stream processing management unit, configured to receive a stream processing task sent by the client, where the stream processing task includes stream processing logic and a path of the file to be processed in the distributed file system, the distributed file system includes a metadata management node and a plurality of data storage nodes, and each data storage node is provided with a stream processing computing unit;

The stream processing management unit is further configured to obtain, from the metadata management node, each block number corresponding to the path, and a network address of the data storage node where each block number is located;

The stream processing management unit is further configured to separately send the stream processing logic and the block number corresponding to each network address to the stream processing unit of the corresponding data storage node;

The stream processing calculation unit is configured to acquire block data corresponding to the received block number from the data storage node where it is located, and execute the stream processing logic on the block data corresponding to the received block number.

In another implementation manner of the embodiment of the present invention, the stream processing calculation unit is further configured to send the processing result obtained by executing the stream processing logic to the stream processing management unit.

In another implementation manner of the embodiment of the present invention, the metadata management node records the first correspondence between the path of the file to be processed in the distributed file system and the block number of each block, and the stream processing management unit is specifically configured to:

The block number of each block is obtained from the first correspondence according to the path of the file to be processed in the distributed file system.

In another implementation manner of the embodiment of the present invention, the metadata management node records a second correspondence between the block number of each block and the network address of the data storage node where each block is located, and the stream processing management unit is specifically configured to:

The network address of the data storage node where each block number is located is obtained from the second correspondence according to the block number of each block.

In a third aspect, an embodiment of the present invention provides a stream processing management unit that performs the functions of a stream processing management unit in the stream processing system.

In a fourth aspect, an embodiment of the present invention provides a host, including a memory, a processor, and a bus. The memory and the processor are connected to the bus. The memory stores program instructions, and the processor executes the program instructions to implement the functions of the stream processing management unit in the stream processing system.

Brief description of the drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

FIG. 1 is a schematic structural diagram of a stream processing system according to an embodiment of the present invention;

FIG. 2 is another schematic structural diagram of a stream processing system according to an embodiment of the present invention;

FIG. 3 is a data interaction diagram of a stream processing method according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an apparatus of a stream processing system according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a device of a host according to an embodiment of the present invention.

Detailed description

First, refer to FIG. 1, which is a schematic diagram of the connections among a stream processing system, a distributed file system, and a client according to an embodiment of the present invention. As shown in FIG. 1, the stream processing system includes a stream processing management unit 302 and a plurality of stream processing calculation units 1011, 1021, ..., and 1031, and the distributed file system includes a metadata management node 201 and a plurality of data storage nodes 101, 102, ..., and 103.

In the embodiment of the present invention, the client 301 is connected to the stream processing management unit 302, and the stream processing management unit 302 is connected to the metadata management node 201 and to each of the plurality of data storage nodes 101, 102, ..., and 103.

The client 301 is configured to receive a stream processing job submitted by the user. In the embodiment of the present invention, when the user submits the stream processing job, the user specifies the path of the data to be processed in the distributed file system, and specifies what kind of processing is to be performed on the data to be processed.

The path of the file to be processed in the distributed file system may be, for example, a URL (Uniform Resource Locator). The URL is a storage identifier of the distributed file system, and it can be used at the metadata management node 201 to look up the block number of each block corresponding to the file to be processed.

The client 301 generates a stream processing task according to the stream processing job submitted by the user. The stream processing task includes the stream processing logic and the path of the data to be processed in the distributed file system, where the stream processing logic defines what kind of processing is to be performed on the data. For example, the stream processing logic can specify that anomalous events are to be searched for in the data to be processed.

The client 301 sends the stream processing task to the stream processing management unit 302. The stream processing management unit 302 performs scheduling according to the stream processing task, and the selected stream processing computing unit acquires the file to be processed from the distributed file system and processes it according to the stream processing logic.

For example, the stream processing system can be implemented based on the Apache Flink architecture: the client 301 is the client process of Apache Flink, the stream processing management unit 302 is the job manager process of Apache Flink, and the stream processing computing unit is the task manager process of Apache Flink.

The metadata management node 201 is provided with a metadata management unit 2011 and a database 2012, and the metadata management unit 2011 provides an interface through which an external device can query the database 2012. The database 2012 records a first correspondence between the path of the file to be processed in the distributed file system and the block number of each block, and a second correspondence between the block number of each block and the network address of the data storage node where each block is located.

In a distributed storage system, a file to be processed is stored as fragments in the databases of the data storage nodes. The fragments are the individual pieces of block data, and each piece of block data corresponds to a block number. The metadata management node records, for every file in the distributed storage system, the correspondence between the file's path and the block number of each of its blocks, and records in which data storage node's database the block for each block number is stored.
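The two correspondences can be pictured as two lookup tables that are consulted one after the other. The sketch below is only an illustration under assumed values: the path `/logs/events.dat`, the addresses, and the function name `locate_blocks` are hypothetical, and plain dictionaries stand in for the database 2012.

```python
from typing import Dict, List, Tuple

# First correspondence: file path -> block numbers (values are illustrative).
path_to_blocks: Dict[str, List[int]] = {
    "/logs/events.dat": [1, 2],
}

# Second correspondence: block number -> network address of the node storing it.
block_to_address: Dict[int, str] = {
    1: "10.0.0.1",
    2: "10.0.0.2",
}


def locate_blocks(path: str) -> List[Tuple[int, str]]:
    """Resolve a path to (block number, node address) pairs in two lookups."""
    block_numbers = path_to_blocks[path]  # lookup in the first correspondence
    # Lookup in the second correspondence, once per block number.
    return [(bn, block_to_address[bn]) for bn in block_numbers]


locations = locate_blocks("/logs/events.dat")
print(locations)
```

Keeping the two tables separate means a block can be moved to another node by updating only the second table, without touching the path-to-block mapping.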

The data storage node 101 is provided with a stream processing calculation unit 1011 and a database 1012. The database 1012 records the block data and the correspondence between block numbers and block data. The stream processing calculation unit 1011 can access the database 1012 and obtain from it the block data corresponding to a block number.

In FIG. 1, the data storage nodes 102 and 103 have a structure similar to that of the data storage node 101, except that the block data recorded in their own databases is different; details are not described here again.

For example, the distributed file system can be implemented by Hadoop; the database 2012, the database 1012, the database 1022, ..., and the database 1032 can be implemented by HBase (Hadoop Database); and the metadata management unit 2011 can be the HMaster process of the HBase database.

In the embodiment of the present invention, the client 301 and the stream processing management unit 302 can be set on the same host, and establish a data connection with the metadata management node 201 and the data storage nodes 101, 102, ..., and 103.

In some examples, the client 301 and the stream processing management unit 302 may also be disposed on different hosts, which is not limited by the embodiment of the present invention.

For ease of understanding, reference may be made to FIG. 2, which is another schematic structural diagram of a stream processing system according to an embodiment of the present invention. As shown in FIG. 2, the client 301 and the stream processing management unit 302 are disposed on a host 10, and the host 10 further includes an operating system 303 and hardware 304. The hardware 304 carries the operation of the operating system 303 and includes a physical network card 3041. The client 301 and the stream processing management unit 302 each run on the operating system 303 as a process. The physical network card 3041 accesses the network 50.

The metadata management node 201 includes the database 2012, the metadata management unit 2011, an operating system 2013, and hardware 2014. The database 2012 and the metadata management unit 2011 each run on the operating system 2013 as a process. The hardware 2014 carries the operation of the operating system 2013 and includes a physical network card 20141, which accesses the network 50. The metadata management unit 2011 provides an interface through which an external device can access the database 2012.

Moreover, the data storage node 101 includes the database 1012, the stream processing computing unit 1011, an operating system 1013, and hardware 1014. The database 1012 and the stream processing computing unit 1011 each run on the operating system 1013 as a process. The hardware 1014 carries the operation of the operating system 1013 and includes a physical network card 10141, which accesses the network 50. In the embodiment of the present invention, the stream processing computing unit 1011 can access the database 1012.

The structure of the data storage nodes 102 and 103 is similar to that of the data storage node 101 and is not described here again.

For example, the stream processing management unit 302 can communicate with the client 301, the metadata management unit 2011, and the stream processing computing units 1011, 1021, ..., and 1031 through RPC (Remote Procedure Call).

Based on the above architecture, the embodiment of the present invention provides a stream processing method, in which: the stream processing management unit 302 receives a stream processing task sent by the client 301, where the stream processing task includes stream processing logic and a path of the file to be processed in the distributed file system; the stream processing management unit 302 acquires, from the metadata management node 201, the block number of each block corresponding to the path and the network address of the data storage node where each block is located; the stream processing management unit 302 sends the stream processing logic and the block number corresponding to each network address to the stream processing unit of the corresponding data storage node; and the stream processing calculation unit acquires the block data corresponding to the received block number from the data storage node where it is located, and executes the stream processing logic on the block data corresponding to the received block number.

For further clarity, please refer to FIG. 3, which is a data interaction diagram of a stream processing method according to an embodiment of the present invention. As shown in FIG. 3, the stream processing method includes the following steps:

Step 401: The stream processing management unit 302 receives the stream processing task sent by the client 301, where the stream processing task includes the stream processing logic and the path of the file to be processed in the distributed file system.

For example, the client 301 can be a client process in the apache flink system, and the stream processing management unit 302 can be a job manager process in the apache flink system.

Step 402: The stream processing management unit 302 sends a query request to the metadata management node 201, wherein the query request carries a path of the file to be processed in the distributed file system.

For example, the query request includes an input parameter and a query instruction. The stream processing management unit 302 takes the path of the file to be processed in the distributed file system as the input parameter, and sends the input parameter and the query instruction to the interface that the metadata management unit 2011 of the metadata management node 201 provides for accessing the database 2012.

Step 403: The metadata management node 201 returns the block number of each block corresponding to the path and the network address of the data storage node corresponding to each block to the stream processing management unit 302 according to the query request.

As described above, the database 2012 of the metadata management node 201 records the first correspondence between the path of the file to be processed in the distributed file system and the block number of each block, and the second correspondence between the block number of each block and the network address of the data storage node where each block is located. Therefore, the stream processing management unit 302 can, through the metadata management node 201, acquire the block number of each block from the first correspondence according to the path of the file to be processed in the distributed file system, and acquire the network address of the data storage node where each block is located from the second correspondence according to the block number of each block.

It is assumed that the block numbers acquired by the stream processing management unit 302 are block number 1 and block number 2. In practical applications there may be many block numbers; for brevity, only two block numbers are used as an example here. The stream processing management unit 302 finds the network address A of the data storage node 101 based on block number 1, and finds the network address B of the data storage node 102 based on block number 2.

Step 404: The stream processing management unit 302 transmits the stream processing logic and block number 1 to the stream processing computing unit 1011.

In this step, after the stream processing management unit 302 finds the network address A of the data storage node 101 according to block number 1, it sends the stream processing task and block number 1, which corresponds to the network address A, to the stream processing calculation unit 1011 of the data storage node 101.

Step 405: The stream processing management unit 302 transmits the stream processing logic and block number 2 to the stream processing computing unit 1021.

In this step, after the stream processing management unit 302 finds the network address B of the data storage node 102 according to block number 2, it sends the stream processing task and block number 2, which corresponds to the network address B, to the stream processing calculation unit 1021 of the data storage node 102.

In steps 404 and 405, the stream processing computing unit 1011 can be, for example, a task manager process in the apache flink system, and the stream processing computing unit 1021 can be, for example, another task manager process in the apache flink system.
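Steps 404 and 405 can be viewed as building one message per data storage node. The following sketch shows one way this might be arranged; it assumes, beyond what the steps state, that a node may hold several blocks of the same file, in which case its block numbers are grouped into a single message. The function name `build_dispatch_plan`, the message layout, and the logic name `find_anomalies` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_dispatch_plan(
    locations: Iterable[Tuple[int, str]], logic_name: str
) -> Dict[str, dict]:
    """Group block numbers by node address and attach the logic to each message."""
    blocks_by_address: Dict[str, List[int]] = defaultdict(list)
    for block_number, address in locations:
        blocks_by_address[address].append(block_number)
    # One message per node: the same logic plus only that node's block numbers.
    return {
        address: {"logic": logic_name, "block_numbers": numbers}
        for address, numbers in blocks_by_address.items()
    }


# Block numbers 1 and 3 are assumed to live at address A, block number 2 at address B.
plan = build_dispatch_plan([(1, "A"), (2, "B"), (3, "A")], "find_anomalies")
print(plan)
```

Each message carries block numbers only, never block data, so the amount of data sent over the network does not grow with the size of the file to be processed.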

Step 406: The stream processing calculation unit 1011 acquires the block data corresponding to the received block number 1 from the data storage node 101 where it is located, and executes the stream processing logic on the block data corresponding to the received block number 1.

In this step, the stream processing calculation unit 1011 acquires, from the database 1012 of the data storage node 101 where it is located, the block data corresponding to block number 1 received from the stream processing management unit 302, and executes the stream processing logic on the block data corresponding to block number 1.

In some examples, data storage node 101 is further provided with a data management unit for accessing database 1012 to manage block data in database 1012.

For example, the distributed file system can be Hadoop, the database of Hadoop is implemented by the HBase database, the metadata management unit 2011 is the HMaster process of the HBase database, the stream processing computing unit is provided as a program library, and the data management unit performs the functions of the stream processing computing unit by loading the program library.

Further, the data management unit is, for example, the HRegionServer process of the HBase database, and the task manager process is embedded into the HRegionServer process. The task manager process can be packaged as a program library, such as a jar package or a .so file, that provides a startup interface; after loading the library, the HRegionServer process can perform the functions of the task manager process by running the startup interface.
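The load-then-start pattern can be illustrated in Python, although the embodiment itself refers to a jar package or a .so file. In the sketch below, which is an analogy rather than the embodiment's mechanism, a host process loads a library from a file with the standard `importlib` machinery and then calls the library's startup interface, handing it a function for reading local blocks. The library source, the module name `task_manager_lib`, and the `start` and `handle` functions are all hypothetical.

```python
import importlib.util
import tempfile
import textwrap
from pathlib import Path

# Hypothetical library source: exposes a single startup interface, `start`.
LIBRARY_SOURCE = textwrap.dedent(
    '''
    def start(read_block):
        """Startup interface: returns a handler that runs logic on local blocks."""
        def handle(logic, block_number):
            return logic(read_block(block_number))
        return handle
    '''
)


def load_library(path: Path):
    """Load a module from a file path, as a host process might load a plug-in."""
    spec = importlib.util.spec_from_file_location("task_manager_lib", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# The "data management unit" owns the block data and hosts the loaded library.
local_blocks = {1: [3, 5, 8]}

with tempfile.TemporaryDirectory() as tmp:
    lib_path = Path(tmp) / "task_manager_lib.py"
    lib_path.write_text(LIBRARY_SOURCE)
    library = load_library(lib_path)
    handler = library.start(local_blocks.__getitem__)  # run the startup interface

total = handler(sum, 1)  # the logic (here: sum) runs inside the host process
print(total)
```

Because the handler runs inside the host process, it reads block data through an ordinary in-process call rather than through a network or inter-process interface.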

In the embodiment of the present invention, because the HRegionServer process that implements the function of the task manager process can read the block data of the database 1012 locally, the acquisition of block data is not affected by the performance of the external network. Moreover, because the HRegionServer process accesses the database 1012 directly within the process, that is, reads the block data directly from memory, block data is obtained faster, which can effectively improve the efficiency of stream processing.

In other examples, the data management unit and the stream processing computing unit 1011 can run concurrently on the operating system 1013, and the stream processing computing unit 1011 accesses the database 1012 through an interface provided by the data management unit. In these examples, although the database 1012 is not accessed directly within the HRegionServer process, the stream processing computing unit 1011 still accesses the database 1012 locally, so the influence of external network performance is likewise avoided.

Step 407: The stream processing calculation unit 1021 acquires the block data corresponding to the received block number 2 from the data storage node 102 where it is located, and executes stream processing logic for the block data corresponding to the received block number 2.

Similar to the previous step, in some examples, the data storage node 102 is provided with a data management unit for accessing the database 1022 to manage block data. The distributed file system can be Hadoop, the database of Hadoop is implemented by the HBase database, the metadata management unit 2011 is the HMaster process of the HBase database, the stream processing computing unit 1021 is provided as a program library, and the data management unit performs the functions of the stream processing computing unit by loading the program library.

In the embodiment of the present invention, because the HRegionServer process that implements the function of the task manager process can read the block data of the database 1022 locally, the acquisition of block data is not affected by the performance of the external network. Moreover, because the HRegionServer process accesses the database 1022 directly within the process, block data is obtained faster, which can effectively improve the efficiency of stream processing.

In other examples, the data management unit and the stream processing computing unit 1021 can run concurrently on the operating system 1023, and the stream processing computing unit 1021 accesses the database 1022 through an interface provided by the data management unit. In these examples, although the database 1022 is not accessed directly within the HRegionServer process, the stream processing computing unit 1021 still accesses the database 1022 locally, so the influence of external network performance is likewise avoided.

Step 408: The stream processing calculation unit 1011 sends the first processing result, obtained by executing the stream processing logic on the block data corresponding to block number 1, to the stream processing management unit 302.

Step 409: The stream processing calculation unit 1021 sends the second processing result, obtained by executing the stream processing logic on the block data corresponding to block number 2, to the stream processing management unit 302.

In summary, the embodiment of the present invention distributes the stream processing calculation units to the data storage nodes. The stream processing management unit sends the stream processing task to the corresponding data storage nodes according to the path of the file to be processed, and the stream processing calculation unit on each corresponding data storage node directly reads the block data of the file to be processed locally and runs the stream processing logic on the block data it reads. Because the stream processing calculation unit reads the file to be processed locally, the technical problem that a low network transmission speed between the stream processing system and the data storage nodes affects the speed of stream processing can be overcome.

It should be noted that in an alternative embodiment of the present invention, stream processing system 90 may also be implemented based on the Storm, Spark, or Samza architecture.

Referring to FIG. 4, FIG. 4 is a schematic structural diagram of a device of a stream processing management unit according to an embodiment of the present invention. As shown in FIG. 4, the stream processing management unit 302 includes:

The receiving module 601 is configured to receive a stream processing task sent by the client, where the stream processing task includes stream processing logic and a path of a file to be processed in a distributed file system, the distributed file system includes a metadata management node and multiple data storage nodes, and each data storage node is provided with a stream processing computing unit;

The query module 602 is configured to obtain, from the metadata management node, a block number of each block corresponding to the path, and a network address of the data storage node where each block is located;

The sending module 603 is configured to separately send the stream processing logic and the block number of each block to the stream processing unit of the data storage node where each block is located.

Optionally, the receiving module 601 is further configured to receive a processing result that is sent by the stream processing calculation unit and obtained by executing the stream processing logic.

Optionally, the metadata management node records a first correspondence between the path of the file to be processed in the distributed file system and the block number of each block, and a second correspondence between the block number of each block and the network address of the data storage node where each block is located. The query module 602 is specifically configured to:

obtain the block number of each block from the first correspondence according to the path of the file to be processed in the distributed file system; and

obtain the network address of the data storage node where each block is located from the second correspondence according to the block number of each block.
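The two-step lookup performed by the query module can be illustrated with a short Python sketch. The function name `query_blocks`, the use of plain dictionaries for the two correspondences, and the sample path and addresses are assumptions made for illustration only; the embodiment does not prescribe how the metadata management node stores these mappings.

```python
from typing import Dict, List, Tuple


def query_blocks(path: str,
                 first_correspondence: Dict[str, List[int]],
                 second_correspondence: Dict[int, str]) -> List[Tuple[int, str]]:
    """Resolve a file path to (block number, node address) pairs.

    first_correspondence:  file path    -> block numbers of the file
    second_correspondence: block number -> address of the node holding it
    """
    # Step 1: path -> block numbers, via the first correspondence.
    block_numbers = first_correspondence[path]
    # Step 2: block number -> network address, via the second correspondence.
    return [(n, second_correspondence[n]) for n in block_numbers]
```

For example, with a file `/data/input.txt` split into blocks 1 and 2 stored on two different nodes, the function returns one pair per block, which is the information the sending module needs in order to route the stream processing logic.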

Referring to FIG. 5, FIG. 5 is a schematic structural diagram of a device according to an embodiment of the present invention. As shown in FIG. 5, the host 50 includes a memory 502, a processor 501, and a bus 503. The memory 502 and the processor 501 are connected to the bus 503. The memory 502 stores program instructions, and the processor 501 executes program instructions to implement the functions of the stream processing management unit 302 in the stream processing system described above.

It should be noted that any of the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the processes may be selected according to actual needs to achieve the purpose of the solution of the embodiment. In addition, in the drawings of the apparatus embodiments provided by the present invention, a connection relationship between processes indicates that there is a communication connection between them, which may specifically be implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement the embodiments without any creative effort.

Through the description of the above embodiments, those skilled in the art can clearly understand that the present invention can be implemented by means of software plus necessary general-purpose hardware, and of course also by dedicated hardware, including a dedicated CPU, dedicated memory, dedicated components, and so on. In general, any function performed by a computer program can easily be implemented with corresponding hardware, and the specific hardware structure used to implement the same function can take various forms, such as analog circuits, digital circuits, or dedicated circuits. For the present invention, however, a software implementation is the better choice in most cases. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc, and including a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.

A person skilled in the art can clearly understand that the specific working process of the system, the device or the unit described above can refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.

The above is only a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive of within the technical scope disclosed by the present invention shall be covered by the scope of protection of the present invention. Therefore, the scope of protection of the invention shall be determined by the scope of the appended claims.

Claims

A stream processing method, wherein the method is applied to a stream processing system, the stream processing system comprising a stream processing management unit and a stream processing computing unit, the method comprising:

The stream processing management unit receives a stream processing task sent by a client, where the stream processing task includes stream processing logic and a path of a file to be processed in a distributed file system, the distributed file system includes a metadata management node and a plurality of data storage nodes, and each of the data storage nodes is provided with a stream processing calculation unit;

The stream processing management unit acquires, from the metadata management node, a block number of each block corresponding to the path of the file to be processed, and a network address of a data storage node where each block is located;

The stream processing management unit respectively sends the stream processing logic and the block number of each block to a stream processing unit of a data storage node where each block is located;

The stream processing calculation unit acquires, from the data storage node where it is located, the block data corresponding to the received block number, and executes the stream processing logic on the block data corresponding to the received block number.
The method according to claim 1, wherein the data storage node is provided with a data management unit, the stream processing calculation unit is configured as a library, and the data management unit performs the functions of the stream processing calculation unit by loading the library.
The method according to claim 1 or 2, wherein the method further comprises:

The stream processing calculation unit transmits a processing result obtained by executing the stream processing logic to the stream processing management unit.
The method according to any one of claims 1 to 3, wherein the metadata management node records a first correspondence between the path of the file to be processed in the distributed file system and the block number of each block, and the step in which the stream processing management unit acquires, from the metadata management node, the block number of each block corresponding to the path and the network address of the data storage node where each block is located specifically includes:

The stream processing management unit acquires the block number of each block from the first correspondence according to the path of the file to be processed in the distributed file system.
The method according to claim 4, wherein the metadata management node records a second correspondence between the block number of each block and the network address of the data storage node where each block is located, and the step in which the stream processing management unit acquires, from the metadata management node, the block number of each block corresponding to the path and the network address of the data storage node where each block is located specifically includes:

The stream processing management unit acquires a network address of a data storage node where each block is located from the second correspondence according to a block number of each block.
A stream processing system, comprising: a stream processing management unit and a stream processing computing unit,

The stream processing management unit is configured to receive a stream processing task sent by a client, where the stream processing task includes stream processing logic and a path of a file to be processed in a distributed file system, the distributed file system includes a metadata management node and a plurality of data storage nodes, and each of the data storage nodes is provided with a stream processing computing unit;

The stream processing management unit is further configured to acquire, from the metadata management node, a block number of each block corresponding to the path, and a network address of a data storage node where each block is located;

The stream processing management unit is further configured to separately send the stream processing logic and the block number of each block to a stream processing unit of a data storage node where each block is located;

The stream processing calculation unit is configured to acquire, from the data storage node where it is located, the block data corresponding to the received block number, and to execute the stream processing logic on the block data corresponding to the received block number.
The system according to claim 6, wherein said data storage node is provided with a data management unit, said stream processing calculation unit is configured as a library, and said data management unit performs the functions of said stream processing calculation unit by loading said library.
The system of claim 6 wherein:

The stream processing calculation unit is further configured to send a processing result obtained by executing the stream processing logic to the stream processing management unit.
The system according to claim 6, wherein the metadata management node records a first correspondence between the path of the file to be processed in the distributed file system and the block number of each block, and the stream processing management unit is specifically configured to:

obtain the block number of each block from the first correspondence according to the path of the file to be processed in the distributed file system.
The system according to claim 9, wherein the metadata management node records a second correspondence between the block number of each block and the network address of the data storage node where each block is located, and the stream processing management unit is specifically configured to:

obtain the network address of the data storage node where each block is located from the second correspondence according to the block number of each block.
A stream processing management unit, comprising:

a receiving module, configured to receive a stream processing task sent by a client, where the stream processing task includes stream processing logic and a path of a file to be processed in a distributed file system, the distributed file system includes a metadata management node and a plurality of data storage nodes, and each of the data storage nodes is provided with a stream processing calculation unit;

a query module, configured to acquire, from the metadata management node, a block number of each block corresponding to the path, and a network address of a data storage node where each block is located;

and a sending module, configured to separately send the stream processing logic and the block number of each block to a stream processing unit of the data storage node where each block is located.
A host, comprising: a memory, a processor, and a bus, the memory and the processor being connected to the bus, the memory storing program instructions, and the processor executing the program instructions to cause the host to perform the following steps:

Receiving a stream processing task sent by a client, where the stream processing task includes stream processing logic and a path of a file to be processed in a distributed file system, the distributed file system includes a metadata management node and a plurality of data storage nodes, and each of the data storage nodes is provided with a stream processing computing unit;

Obtaining, from the metadata management node, a block number of each block corresponding to the path, and a network address of a data storage node where each block is located;

The stream processing logic and the block number of the respective block are respectively sent to a stream processing unit of a data storage node where each block is located.