
CN115016931B - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
CN115016931B
CN115016931B (application CN202210481248.9A)
Authority
CN
China
Prior art keywords
storage node
execution
data
retry
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210481248.9A
Other languages
Chinese (zh)
Other versions
CN115016931A (en)
Inventor
王竹凡
庄灿伟
孔伟康
邱晗
董元元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202210481248.9A priority Critical patent/CN115016931B/en
Publication of CN115016931A publication Critical patent/CN115016931A/en
Application granted granted Critical
Publication of CN115016931B publication Critical patent/CN115016931B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract


The embodiments of the present specification provide a data processing method and device, wherein the data processing method includes: a scheduling storage node receives a task processing request, and determines the execution status of each execution storage node based on the task processing request; the scheduling storage node determines a target execution storage node based on the execution status of each execution storage node, and forwards the task processing request to the target execution storage node; the target execution storage node determines data location information and a target processing function based on the task processing request, and obtains target business data according to the data location information and the target processing function.

Description

Data processing method and device
Technical Field
Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a data processing method. One or more embodiments of the present specification relate to a data processing apparatus, a computing device, a computer readable storage medium, a data processing system, and a computer program.
Background
With the continuous development of computer technology, many storage systems adopt a computing-storage separation architecture to decouple computing from storage; this architecture enables flexible scaling of computing and storage and flexible allocation of data resources.
However, in current computing-storage separation architectures the computing side also receives data processing requests that only require simple processing. Such operations need no complex computation, yet the computing side still has to fetch the data from the storage side, which consumes the computing side's bandwidth resources.
Disclosure of Invention
In view of this, the present embodiment provides a data processing method. One or more embodiments of the present specification are also directed to a data processing apparatus, a computing device, a data processing system, a computer readable storage medium, and a computer program, which solve the technical drawbacks of the prior art.
According to a first aspect of embodiments of the present disclosure, there is provided a data processing method applied to a data storage cluster, where the data storage cluster includes a scheduling storage node and at least one execution storage node, including:
the scheduling storage node receives a task processing request and determines the execution state of each execution storage node based on the task processing request;
the scheduling storage node determines a target execution storage node based on the execution state of each execution storage node, and forwards the task processing request to the target execution storage node;
And the target execution storage node determines data position information and a target processing function based on the task processing request, and obtains target business data according to the data position information and the target processing function.
According to a second aspect of embodiments of the present specification, there is provided a data processing apparatus comprising:
the receiving module is configured to receive a task processing request and determine the execution state of each execution storage node based on the task processing request;
a forwarding module configured to determine a target execution storage node based on an execution state of each execution storage node, and forward the task processing request to the target execution storage node;
And the determining module is configured to determine data position information and a target processing function based on the task processing request and obtain target business data according to the data position information and the target processing function.
According to a third aspect of embodiments of the present specification, there is provided a data processing system comprising a data storage cluster and a data computing cluster, wherein,
The data storage cluster is configured to receive a task processing request and determine a target processing function based on the task processing request;
the data storage cluster is further configured to process the task processing request when the target processing function is a preset processing function;
The data storage cluster is further configured to forward the task processing request to a data computing cluster if the target processing function is not a preset processing function;
The data computing cluster is configured to receive a task processing request and process the task processing request.
According to a fourth aspect of embodiments of the present specification, there is provided a computing device comprising a memory, a processor and computer instructions stored on the memory and executable on the processor, the processor implementing the steps of the data processing method when executing the computer instructions.
According to a fifth aspect of embodiments of the present description, there is provided a computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the data processing method.
According to a sixth aspect of the embodiments of the present specification, there is provided a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the data processing method described above.
In the data processing method, a scheduling storage node receives a task processing request and determines the execution state of each execution storage node based on the request; the scheduling storage node determines a target execution storage node based on those execution states and forwards the task processing request to it; the target execution storage node then determines data position information and a target processing function from the request and obtains the target business data according to them.
In the embodiments of this specification, the scheduling storage node schedules the execution storage nodes and the target execution storage node processes the task processing request; that is, the scheduling and execution storage nodes cooperate to handle the request, which improves data processing efficiency.
Drawings
FIG. 1 is a flow chart of a data processing method provided in one embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a data processing method according to one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of selecting a currently executing storage node according to one embodiment of the present disclosure;
FIG. 4 is a schematic diagram of selecting a currently scheduled storage node according to one embodiment of the present disclosure;
FIG. 5 is a process flow diagram of a data processing method according to one embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a data processing system according to one embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a data processing apparatus according to one embodiment of the present disclosure;
FIG. 8 is a block diagram of a computing device provided in one embodiment of the present description.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present description. The description may, however, be embodied in many forms other than those set out here, and those skilled in the art may make similar generalizations without departing from its spirit; the disclosure is therefore not limited to the specific implementations disclosed below.
The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present description refers to any or all possible combinations including one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first may also be referred to as a second, and similarly, a second may also be referred to as a first, without departing from the scope of one or more embodiments of the present description. Depending on the context, the term "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
First, terms related to one or more embodiments of the present specification will be explained.
Metadata: data that describes data, mainly information about data attributes, used to support functions such as indicating storage locations, recording historical data, resource searching, and file recording.
Distributed lock: in a distributed system, coordination is often required. When a resource or a group of resources is shared between different systems, or between different hosts of the same system, access to those resources usually needs to be mutually exclusive so that they do not interfere with each other and consistency is preserved; a distributed lock is used for this purpose.
Execution storage node: the role that executes data processing tasks in a distributed storage system, typically deployed on data nodes.
Scheduling storage node: the node in a distributed storage system that integrates resources, collects data processing requests, and dispatches resources.
When a distributed storage system writes data to multiple data nodes at the same time, it needs to ensure that the externally visible state is consistent and that no state rollback occurs.
Metadata storage node: a centralized metadata storage node in a distributed system, generally used to store attribute information of files such as state information, location information, and length information.
Client: the component provided to users of a distributed system for performing metadata operations and reading and writing data; it also refers to the component on the computing side that sends data processing requests.
At present, distributed storage systems widely adopt the computing-storage separation architecture. By decoupling computing from storage, the architecture allows the two sides to use and scale their resources independently without interfering with each other, avoiding the problem of faults on one side affecting and being amplified on the other in abnormal scenarios.
On the other hand, the computing side in the computing-storage separation architecture sometimes only needs to process data in simple ways, for example garbage collection that copies still-useful data to a new file, or rewriting part of the data after a simple operation.
In the solution of this specification, a scheduling storage node and execution storage nodes are introduced on the storage side to assist with data processing, so that simple data processing is pushed down to the storage side and the bandwidth resource consumption that such processing would otherwise cause on the computing side is avoided.
In the present specification, a data processing method is provided, and the present specification relates to a data processing apparatus, a computing device, a computer-readable storage medium, a data processing system, and a computer program, one by one, in the following embodiments.
Fig. 1 shows a flowchart of a data processing method applied to a data storage cluster according to an embodiment of the present disclosure, where the data storage cluster includes a scheduling storage node and at least one execution storage node, and includes steps 102 to 106.
Step 102, a scheduling storage node receives a task processing request and determines the execution state of each execution storage node based on the task processing request.
In order to avoid simple data processing occupying the bandwidth resources of a computing node, the solution of this specification adds an execution process to an existing storage node to obtain an execution storage node and adds a scheduling process to a storage node to obtain a scheduling storage node, or creates new storage nodes and adds the execution or scheduling process to them, so that execution storage nodes can execute data processing tasks that originally could only be executed by computing nodes. In practical applications, because an execution storage node may fail and affect the normal execution of tasks, multiple execution storage nodes can be deployed, distributed across the data storage cluster, so that different execution storage nodes can execute different task processing requests at the same time; these execution storage nodes and at least one scheduling storage node form the data storage cluster.
An execution storage node is a node that can execute data processing tasks. A scheduling storage node assigns data processing tasks to execution storage nodes based on the execution state of each execution storage node. A task processing request is a request to process data sent by a device that needs data processed, for example a request sent by a client; note that the task processing requests that an execution storage node can execute in this specification are relatively simple ones. The execution state of an execution storage node is state information about how the node is handling tasks, such as its current task load and its current execution situation.
For example, the data storage cluster comprises a scheduling storage node A, an execution storage node b and an execution storage node c, wherein the scheduling storage node A receives a task processing request Q and determines the execution state of the execution storage node b and the execution state of the execution storage node c based on the task processing request Q.
Because in this solution the execution storage nodes handle only simple data processing tasks, while more complex data processing tasks still have to be executed by the computing nodes, the scheduling storage node must first determine, after receiving a task processing request, whether an execution storage node can execute the received request.
Specifically, after receiving the task processing request the scheduling storage node parses it and determines whether the target processing function contained in the request is a preset processing function, that is, a processing function that the execution storage nodes are preset to be able to execute. The specific method by which the scheduling storage node receives the task processing request and determines the execution state of each execution storage node based on the request may therefore include:
determining a target processing function based on the task processing request;
And under the condition that the target processing function is determined to be a preset processing function, determining the execution state of each execution storage node in the data storage cluster based on the task processing request.
The target processing function is the processing function obtained by parsing the task processing request, for example a function that calculates the sum of the service data to be processed. It should be noted that the target processing function may be null; if no target processing function exists, the task processing request can be determined to be a request to copy data. The preset processing function is a processing function that an execution storage node is able to execute.
In practical applications, in order to determine whether the task processing request can be executed by an execution storage node, a processing function table may be preset. If the table contains a preset processing function consistent with the target processing function, or the target processing function is empty, it can be determined that an execution storage node is able to execute the task processing request, and the execution state of each execution storage node in the data storage cluster is then further determined according to the request.
For example, computing node G sends a task processing request to the data storage cluster. Scheduling storage node H receives and parses the request, obtaining processing function N. The preset processing function table contains processing functions M, N, and J; since the table contains processing function N, the request is determined to be processable by an execution storage node, and the execution state of each execution storage node in the data storage cluster is then determined based on the request.
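To make the dispatch check above concrete, the following is a minimal Python sketch. The registry name PRESET_FUNCTIONS, the TaskRequest fields, and the treatment of a missing function as a copy request are illustrative assumptions rather than details taken from this specification.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical table of processing functions that execution storage nodes
# are preset to be able to run (the "preset processing functions").
PRESET_FUNCTIONS = {"sum", "add_one_per_element"}

@dataclass
class TaskRequest:
    input_file: str
    start_offset: int
    offset_length: int
    function_name: Optional[str]   # None: no target function, i.e. a copy request
    output_file: str

def can_execute_on_storage(request: TaskRequest) -> bool:
    """True if an execution storage node may handle the request itself."""
    if request.function_name is None:
        return True                # copy requests are always handled in place
    return request.function_name in PRESET_FUNCTIONS

# Example mirroring the text: processing function "sum" is in the table.
req = TaskRequest("file_D", 1, 5, "sum", "file_D_out")
assert can_execute_on_storage(req)
```

If the check fails, the request would be forwarded to the data computing cluster instead, as described for the data processing system below.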
After determining that an execution storage node can execute the task processing request, the cluster can decide which execution storage node should execute it by determining the execution state of each execution storage node.
Specifically, the method for determining the execution state of each execution storage node by the scheduling storage node based on the task processing request may include:
the scheduling storage node queries the current task load capacity and node state information of each execution storage node in the data storage cluster based on the task processing request;
or
And the scheduling storage node receives the current task load and the node state information reported by each execution storage node in the data storage cluster.
The current task load is the number of tasks an execution storage node is currently processing; for example, if an execution storage node is executing 3 data processing tasks, its load is 3. Based on the current task load, the execution storage nodes with a lower load can be identified, which makes it easier to later determine the execution storage node that will execute the task processing request. The node state information describes the state of an execution storage node, for example whether it is in a failed or non-failed state. Based on the node state information, failed execution storage nodes can be identified, so that the task processing request is not assigned to a failed node and a normally functioning execution storage node can be scheduled.
Specifically, the scheduling storage node can query the current load and node state information of each execution storage node in the data storage cluster based on the task processing request, or it can receive the current load and node state information reported by each execution storage node at preset intervals; either of the two methods, or a combination of both, can be used to obtain the current task load and node state information of each execution storage node.
By determining the execution state of each execution storage node in the data storage cluster, that is, its current task load and node state information, the scheduling storage node can subsequently screen out the execution storage nodes in the data storage system that are able to execute the task processing request.
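Both ways of obtaining the execution states described above (the scheduling storage node querying each execution storage node, or the nodes reporting at preset intervals) could be sketched as follows; the ExecutionState fields, the query_state() client method, and the in-memory bookkeeping are assumptions made only for illustration.

```python
from dataclasses import dataclass

@dataclass
class ExecutionState:
    node_id: str
    task_load: int      # current task load: number of tasks being executed
    healthy: bool       # node state information: not failed / failed

class SchedulingNode:
    def __init__(self, node_clients):
        # node_clients: node_id -> client object exposing query_state()
        self.node_clients = node_clients
        self.states = {}

    def poll_states(self) -> None:
        """Scheduler-initiated query of every execution storage node."""
        for node_id, client in self.node_clients.items():
            self.states[node_id] = client.query_state()

    def on_report(self, state: ExecutionState) -> None:
        """Handle a state that an execution storage node reports periodically."""
        self.states[state.node_id] = state
```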
Step 104, the scheduling storage node determines a target execution storage node based on the execution state of each execution storage node, and forwards the task processing request to the target execution storage node.
The target execution storage node is a storage node in the data storage system that is able to execute the task processing request. After the target execution storage node is determined, the scheduling storage node forwards the received task processing request to it, so that the target execution storage node can subsequently execute the request.
Determining the execution state of each execution storage node means determining its current load and node state information; based on these execution states, the target execution storage node that can process the task processing request is determined in the data storage cluster. For example, based on the node state information it is determined that execution storage node N and execution storage node M are in a non-failed state, the current loads of execution storage node N and execution storage node M are then compared, and after the target execution storage node is determined the task processing request is sent to it.
For example, the current load and node state information of each execution storage node in the data storage cluster are determined; according to the node state information, every execution storage node in the cluster is in a non-failed state, so the execution storage node with the lowest current load is determined to be the target execution storage node, and the task processing request received by the scheduling storage node is forwarded to it.
Determining the target execution storage node based on the execution state of each execution storage node ensures that the task processing request is executed by the node in the better execution state, which improves both task processing efficiency and task processing quality.
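A sketch of the selection rule implied by step 104: discard failed nodes, then pick the node with the lowest current task load. The ExecutionState shape and the absence of tie-breaking are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExecutionState:
    node_id: str
    task_load: int
    healthy: bool

def pick_target_node(states: list) -> Optional[str]:
    """Return the non-failed execution storage node with the lowest load."""
    candidates = [s for s in states if s.healthy]
    if not candidates:
        return None          # nothing schedulable; caller retries or reports failure
    return min(candidates, key=lambda s: s.task_load).node_id

# Node "b" is healthy and has the lowest load, so it becomes the target node.
assert pick_target_node([
    ExecutionState("b", task_load=1, healthy=True),
    ExecutionState("c", task_load=3, healthy=True),
    ExecutionState("d", task_load=0, healthy=False),
]) == "b"
```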
Step 106, the target execution storage node determines data position information and a target processing function based on the task processing request, and obtains target business data according to the data position information and the target processing function.
After the target execution storage node receives the task processing request, the data position information and the target processing function can be determined according to the task processing request, and the service data to be processed is processed according to the data position information and the target processing function, so that the target service data is obtained.
The data position information is the location information of the stored service data to be processed; the target processing function is code or a program contained in the task processing request that can be used to calculate the target service data; and the target service data is the data obtained by processing the service data to be processed with the target processing function.
The data storage cluster may include metadata storage nodes, and the data position information of the service data to be processed can be determined based on the metadata. Once the data position information is determined, the service data to be processed can be obtained from the storage nodes that store it. For example, if the data position information is determined to be the data with an initial offset of 1 and an offset length of 5 in file D, the service data to be processed is obtained from file D based on that position information; after the service data to be processed is obtained, the target service data can be calculated based on the target processing function.
In practical applications, the specific method for determining the data location information and the target processing function by the target execution storage node based on the task processing request may include:
The target execution storage node analyzes the task processing request;
And acquiring data position information and a target processing function in the task processing request.
After receiving the task processing request, the target execution storage node parses it and obtains the data position information and the target processing function carried in the request.
For example, after receiving the task processing request Q, the executing storage node H parses the task processing request and determines the data location information J and the target processing function K carried in the task processing request.
By analyzing the task processing request, the data position information of the service data to be processed and the target processing function, namely the mode used when the service data to be processed is processed, can be determined.
After the data location information and the target processing function are acquired, the target service data can be acquired based on the data location information and the target processing function, and the method for acquiring the target service data according to the data location information and the target processing function can include:
Acquiring service data to be processed according to the data position information;
And processing the service data to be processed based on the target processing function to obtain target service data.
For example, based on the data position information, the data with an initial offset of 5 and an offset length of 5 in file G is obtained as "5, 6, 7, 8, 9", and this service data to be processed is processed according to the target processing function "add 1 to each element", giving the target service data "6, 7, 8, 9, 10". In practical applications, the task processing request received by the scheduling storage node may be a tuple <input file, initial offset + offset length, processing function, output file name>, where the input file is the file in which the service data to be processed is located, and the output file name is the name of the output file that will store the target service data.
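A sketch of how a target execution storage node might interpret the <input file, initial offset + offset length, processing function, output file name> tuple above; modeling the file as an in-memory list and treating the offset length as an element count are simplifying assumptions.

```python
def run_task(files: dict, input_file: str, start_offset: int,
             offset_length: int, func) -> list:
    """Read the addressed slice of the service data and apply the target
    processing function, returning the target service data."""
    to_process = files[input_file][start_offset:start_offset + offset_length]
    return [func(x) for x in to_process]

# Mirror of the example: the addressed slice of file G is 5, 6, 7, 8, 9 and the
# target processing function adds 1 to each element.
files = {"G": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
assert run_task(files, "G", 5, 5, lambda x: x + 1) == [6, 7, 8, 9, 10]
```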
Specifically, when the processed target service data is written to a file, it is first written to a temporary file whose attribute is set to read-only. After the data processing is finished, the temporary file is renamed and its attribute is reset to the default value, producing an output file that contains the complete target service data; the file a user sees is therefore always a file containing the complete target service data. After detecting that the task processing is complete, the scheduling storage node marks the task processing request as completed.
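The write path just described (write to a temporary file, mark it read-only, rename it once the data is complete, then restore the default attributes) could be sketched like this; the temporary-file suffix and the 0o644 default mode are assumptions.

```python
import os
import stat

def publish_output(output_path: str, target_data: bytes) -> None:
    """Write target service data to a temporary file and rename it, so readers
    only ever see an output file that contains the complete result."""
    tmp_path = output_path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(target_data)
        f.flush()
        os.fsync(f.fileno())
    os.chmod(tmp_path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)  # read-only
    os.rename(tmp_path, output_path)   # atomic on POSIX filesystems
    os.chmod(output_path, 0o644)       # restore the default file attributes
```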
For example, as shown in Fig. 2, which is a schematic diagram of a data processing method provided in an embodiment of the present disclosure: the scheduling storage node receives a task processing request sent by the client, determines the execution states of execution storage node 1, execution storage node 2, and execution storage node 3 based on the request, selects execution storage node 2 as the target execution storage node based on those states, and forwards the task processing request to execution storage node 2. Execution storage node 2 determines the data position information and the target processing function based on the request, locates the source file and the service data to be processed in it according to the position information, processes that data with the target processing function to obtain the target service data, and writes the result into a temporary file. After all the service data to be processed has been processed and the target service data written, the temporary file is renamed to obtain a target file containing the complete target service data. When execution storage node 2 finishes the task, it feeds an execution-complete result back to the scheduling storage node, which feeds it back to the client.
Since in practical application, a fault condition of the target execution storage node may occur, the scheduling storage node needs to schedule other execution storage nodes to execute task processing requests at the moment.
Specifically, the data processing method further includes:
Determining a current execution storage node in the data storage cluster under the condition that the scheduling storage node detects that the target execution storage node fails;
and sending a task processing request containing a retry tuple to the current execution storage node, where the retry tuple consists of an execution identifier and a retry value.
The current execution storage node is an execution storage node that is able to process the task processing request. The forwarded task processing request contains a retry tuple, a pair consisting of an execution identifier and a retry value. The execution identifier is an identifier set for the task processing request, for example a character string randomly generated by the client that sent the request and able to uniquely identify it. The retry value is determined from the client's retry count and the scheduling storage node's retry count: the scheduling storage node's retry count is the number of times the scheduling storage node re-acquires the execution state of the target execution storage node (for example, re-checking execution storage node H after a preset time once a failure is detected), and the client's retry count is the number of times the client re-acquires the execution state of the scheduling storage node.
The client generates a task processing request containing a retry tuple and sends it to the scheduling storage node. If the client retried while determining the final scheduling storage node, the retry tuple in the request sent to that scheduling storage node carries a retry value increased according to the client's retry count. The scheduling storage node then sends the task processing request containing the retry tuple to the determined target execution storage node; if the scheduling storage node retried while determining the target execution storage node, the retry tuple it forwards carries a retry value increased according to the scheduling storage node's retry count.
Specifically, when the scheduling storage node detects that the target execution storage node has failed, or the target execution storage node reports failure heartbeat data to the scheduling storage node, the scheduling storage node can re-acquire the execution state of the target execution storage node one or more times. If it again determines that the target execution storage node has failed, it re-acquires the execution state of every execution storage node in the data storage cluster, determines the current execution storage node based on those states, and forwards the task processing request containing the retry tuple to the current execution storage node.
In practical applications, the short-term failure of the target execution storage node may occur due to various reasons such as network delay, and at this time, the execution state of the target execution storage node may be redetermined, that is, before the current execution storage node is determined in the data storage cluster, the method may further include:
The scheduling storage node acquires the execution state of the target execution storage node again;
and increasing the retry value in the retry tuple based on the number of times the execution state is re-acquired.
When the target execution storage node is first determined to be in a failed state, its execution state can be re-acquired after a preset period. If the state has recovered to non-failed, the target execution storage node can continue to execute the task processing request; if it is still failed, the execution state is re-acquired again after another preset period, until the upper limit on the number of acquisitions is reached. In addition, to improve the success rate of re-acquisition, the preset period can grow as the number of acquisitions increases, that is, the interval between successive re-acquisitions becomes longer. Each time the execution state of the target execution storage node is re-acquired, the retry value in the retry tuple is increased, that is, the retry value grows with the scheduling storage node's retry count.
If the retry count reaches the preset threshold, re-acquisition of the target execution storage node's execution state can be stopped, and a current execution storage node is determined in the data storage cluster to take over the task processing request.
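The retry tuple described above pairs an execution identifier with a retry value derived from both retry counts; the worked example later in this description uses 1024 as the weight coefficient of the client's retry count. A sketch under those assumptions:

```python
from dataclasses import dataclass

CLIENT_RETRY_WEIGHT = 1024   # weight coefficient of the client's retry count

@dataclass(frozen=True)
class RetryTuple:
    execution_id: str        # unique identifier of the task processing request
    retry_value: int         # client_retries * 1024 + scheduler_retries

def make_retry_tuple(execution_id: str, client_retries: int,
                     scheduler_retries: int) -> RetryTuple:
    return RetryTuple(execution_id,
                      client_retries * CLIENT_RETRY_WEIGHT + scheduler_retries)

# First attempt <k1, 0*1024+0>; after one scheduler-side retry <k1, 0*1024+1>.
assert make_retry_tuple("k1", 0, 0).retry_value == 0
assert make_retry_tuple("k1", 0, 1).retry_value == 1
```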
After the current execution storage node is determined, it receives the task processing request containing the retry tuple and executes it. Specifically, after the task processing request containing the retry tuple is sent to the current execution storage node, the method may further include:
the current execution storage node determines a file to be written and a retry tuple based on the task processing request;
and opening the file to be written based on the retry tuple.
The file to be written is the file into which the target service data is to be written, and the retry tuple here is the tuple contained in the task processing request sent by the scheduling storage node to the current execution storage node.
Specifically, after the current execution storage node determines the retry tuple, the file to be written is opened based on the retry tuple, that is, a data processing task of writing target service data into the file to be written is executed.
Further, the target execution storage node may be unavailable only briefly: after the current execution storage node has been determined, the target execution storage node may recover to a non-failed state and continue executing the task processing request, causing a write conflict on the file to be written. To avoid this, the task processing request in this specification contains a retry tuple. When, based on the retry tuples, the execution identifier of the retry tuple of the node currently writing the data is consistent with the execution identifier of the retry tuple held by the current execution storage node, the retry values are compared, and the execution storage node with the larger retry value is determined to be the one currently allowed to write the target service data.
Specifically, the specific method for opening the file to be written based on the retry tuple includes:
determining a current retry tuple of the file to be written;
Comparing the current retry value of the current retry tuple with the retry value of the retry tuple when the current execution identifier in the current retry tuple is consistent with the execution identifier of the retry tuple;
And opening the file to be written based on the retry tuple under the condition that the retry value is larger than the current retry value.
That is, the current retry tuple of the file to be written, with its current execution identifier and current retry value, is determined first. The execution identifier of the current retry tuple is compared with the execution identifier of the retry tuple held by the current execution storage node. If they are inconsistent, the current execution storage node simply executes its own task processing request. If they are consistent, the current retry value is compared with the retry value; if the retry value is greater than the current retry value, the task execution of the execution storage node corresponding to the current retry tuple is terminated, and the current execution storage node performs the task of writing the target service data into the file to be written.
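The comparison rule above could be sketched as follows; treating tuples with different execution identifiers as unrelated requests that do not block each other is a simplification of the text.

```python
from collections import namedtuple

RetryTuple = namedtuple("RetryTuple", ["execution_id", "retry_value"])

def may_take_over_write(current: RetryTuple, incoming: RetryTuple) -> bool:
    """Decide whether the node holding `incoming` may open the file whose
    current writer registered `current`."""
    if current.execution_id != incoming.execution_id:
        # Different task processing requests: the retry comparison does not
        # apply, so the incoming node simply executes its own request.
        return True
    # Same request: only a strictly larger retry value displaces the writer.
    return incoming.retry_value > current.retry_value

# Node 1 opened the file with <a1, 0*1024+0>; node 3 arrives with
# <a1, 0*1024+1> and wins, and a late write by node 1 is rejected.
assert may_take_over_write(RetryTuple("a1", 0), RetryTuple("a1", 1)) is True
assert may_take_over_write(RetryTuple("a1", 1), RetryTuple("a1", 0)) is False
```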
For example, as shown in Fig. 3, which is a schematic diagram of selecting a current execution storage node according to an embodiment of the present disclosure: the scheduling storage node receives a task processing request containing the retry tuple <a1, 0> sent by the client, determines from the execution states of execution storage node 1, execution storage node 2, and execution storage node 3 that execution storage node 1 is the target execution storage node, and sends it the task processing request containing the retry tuple <a1, 0*1024+0>; execution storage node 1 then opens the target file based on the retry tuple <a1, 0*1024+0>.
When the scheduling storage node detects that execution storage node 1 is in a failed state, it re-acquires that node's execution state and adds 1 to the retry value of the retry tuple, giving <a1, 0*1024+1>. If execution storage node 1 is still failed, the scheduling storage node determines that the current execution storage node is execution storage node 3 and sends it the task processing request containing the retry tuple <a1, 0*1024+1>, whose retry value is 1. Execution storage node 3 opens the file to be written based on <a1, 0*1024+1>. Even if execution storage node 1 recovers and tries to continue executing the task processing request, its retry tuple's retry value is 0, which is smaller than 1, so it can no longer write data into the file to be written.
Further, to avoid the situation where execution storage nodes can no longer be scheduled after the scheduling storage node fails, multiple scheduling storage nodes may be set up in the data storage cluster. The scheduling storage nodes may be organized in a centralized or semi-centralized way, that is, a master-slave relationship exists between them; when the master scheduling storage node fails, the slave scheduling storage nodes can determine a new master by acquiring a distributed lock. That is, the method may include:
Determining a scheduling storage node set in the data storage cluster under the condition that the scheduling storage node is detected to be faulty;
and sending a task processing request containing a retry tuple to each scheduling storage node in the scheduling storage node set.
Wherein, the dispatch storage node set refers to a set of dispatch storage nodes in a data storage system.
Specifically, when the device that sends the task processing request (for example, the client) detects that the scheduling storage node has failed, it may send the task processing request containing the retry tuple to every scheduling storage node in the scheduling storage node set of the data storage cluster. The new master scheduling storage node then determines the execution state of each execution storage node in the data storage cluster based on that request, and determines a target execution storage node based on those execution states.
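The master/slave failover described above hinges on a distributed lock. In the sketch below the real lock service, which this description does not specify, is replaced by an in-memory stand-in, so only the acquire-to-become-master pattern is illustrated.

```python
import threading

class InMemoryLock:
    """Stand-in for a distributed lock service (illustration only)."""
    def __init__(self):
        self._mutex = threading.Lock()
        self.owner = None

    def try_acquire(self, node_id: str) -> bool:
        with self._mutex:
            if self.owner is None:
                self.owner = node_id
                return True
            return False

def elect_master(lock: InMemoryLock, slave_scheduler_ids: list) -> str:
    """Slave scheduling nodes race for the lock; the winner is the new master."""
    for node_id in slave_scheduler_ids:
        if lock.try_acquire(node_id):
            return node_id
    raise RuntimeError("no scheduling storage node could acquire the lock")

lock = InMemoryLock()
assert elect_master(lock, ["scheduler-2", "scheduler-3"]) == "scheduler-2"
```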
In addition, before determining the scheduling storage node set in the data storage cluster, the execution state of the scheduling storage node may be obtained again, so as to avoid the problem that the scheduling storage node is not available for a short period, which may specifically include:
Re-acquiring the scheduling execution state of the scheduling storage node;
and increasing the retry value of the retry tuple based on the number of times the scheduling execution state is re-acquired.
Specifically, when the device that sends the task processing request (such as the client) first detects that the scheduling storage node is in a failed state, it can re-acquire the scheduling storage node's execution state after a preset period. If the state has recovered to non-failed, the scheduling storage node can continue to accept the task processing request; if it is still failed, the execution state is re-acquired again after another preset period, until the upper limit on the number of acquisitions is reached. In addition, to improve the success rate of re-acquisition, the preset period can grow as the number of acquisitions increases, that is, the interval between successive re-acquisitions becomes longer. Each re-acquisition also increases the retry value in the retry tuple; that is, the retry value is derived from both the client's retry count and the scheduling storage node's retry count. Because the client's retries have the larger influence, the client's retry count can be given a larger weight coefficient (for example 1024), while the scheduling storage node's retry count is given a coefficient of 1.
If the retry count reaches the preset threshold, re-acquisition of the scheduling storage node's execution state can be stopped, a new scheduling storage node in the data storage cluster is determined to receive the task processing request, and a target execution storage node is determined among the execution storage nodes based on the request.
For example, as shown in Fig. 4, which is a schematic diagram of selecting a current scheduling storage node according to an embodiment of the present disclosure: the scheduling storage node receives a task processing request containing the retry tuple <a1, 0> sent by the client, determines from the execution states of execution storage node 1, execution storage node 2, and execution storage node 3 that the target execution storage node is execution storage node 1, and execution storage node 1 opens the target file based on the retry tuple <a1, 0*1024+0> in the request.
When the client detects that the scheduling storage node is in a failed state, it re-acquires the scheduling storage node's execution state and adds 1 to the client retry count in the retry tuple, giving <a1, 1>. If the scheduling storage node is still failed, the client sends the task processing request containing this retry tuple to every slave scheduling storage node. Scheduling storage node 2 acquires the distributed lock and becomes the current, i.e. master, scheduling storage node. The master scheduling storage node determines the execution state of each execution storage node, determines from those states that the current target execution storage node is execution storage node 3, and sends it the task processing request; execution storage node 3 then opens the target file based on the retry tuple <a1, 1*1024+0> in the request.
In the data processing method, a scheduling storage node receives a task processing request and determines the execution state of each execution storage node based on the request; the scheduling storage node determines a target execution storage node based on those execution states and forwards the task processing request to it; and the target execution storage node determines data position information and a target processing function from the request and obtains the target business data according to them. In the embodiments of this specification, the scheduling storage node schedules the execution storage nodes and the target execution storage node processes the task processing request; that is, the scheduling and execution storage nodes cooperate to handle the request, which improves data processing efficiency.
The application of the data processing method provided in the present specification to the data storage cluster K is taken as an example, and the data processing method will be further described with reference to fig. 5. Fig. 5 is a flowchart of a processing procedure of a data processing method according to an embodiment of the present disclosure, and specific steps include steps 502 to 518.
Step 502, scheduling storage node e receives a task processing request sent by the client, where the request contains a retry tuple <execution identifier, client retry count × 1024 + scheduling storage node retry count>.
Specifically, the retry tuple is currently < k1,0×1024+0>, where 1024 is the weight coefficient of the number of retries of the client.
Step 504, scheduling storage node e queries the execution state of each executing storage node in the data storage cluster based on the task processing request.
Step 506, the scheduling storage node e selects the execution storage node M as a target execution storage node, and sends the task processing request to the execution storage node M.
Step 508, the execution storage node M analyzes the received task processing request to obtain the data position information and the target processing function.
Step 510, the executing storage node M obtains the service data to be processed based on the data location information, processes the service data to be processed based on the target processing function, obtains the target service data, and opens the target file based on the retry tuple.
Specifically, the target file is opened based on the retry tuple < k1, 0x 1024+0>, and the target service data is written into the target file.
Step 512, when scheduling storage node e detects that execution storage node M is in a fault state, the execution state of execution storage node M is re-acquired, and the scheduling storage node retry count in the retry tuple is increased by 1.
Specifically, the retry tuple at this time is <k1, 0×1024+1>.
Step 514, when the execution state of execution storage node M re-acquired by scheduling storage node e is still a fault state, the execution state of each execution storage node in data storage cluster K is acquired.
Step 516, execution storage node N is selected as the target execution node based on the execution state of each execution storage node, and a task processing request containing the retry tuple is sent to execution storage node N.
Specifically, the task processing request sent to the executing storage node N includes a retry tuple < k1,0×1024+1>.
Step 518, the executing storage node N obtains the target service data based on the task processing request, and opens the target file based on the retry tuple.
Specifically, the target file is opened based on the retry tuple <k1, 0×1024+1>, and the target service data is written into the target file. Even if the execution state of execution storage node M recovers to a non-fault state, its retry tuple is <k1, 0×1024+0> with a retry value of 0, while the retry tuple of execution storage node N is <k1, 0×1024+1> with a retry value of 1, which is greater than that of execution storage node M; therefore it is execution storage node N that writes the target service data into the target file.
In the data processing method provided by this specification, a retry tuple is added to the task processing request to record the client's retry count and the scheduling storage node's retry count. When a write conflict later occurs, the write operation is performed by the execution storage node whose retry tuple has the higher retry value, so data conflicts are avoided.
Fig. 6 shows a schematic diagram of a data processing system provided in an embodiment of the present description, the system comprising a data storage cluster 602 and a data computing cluster 604, wherein,
The data storage cluster 602 is configured to receive a task processing request and determine a target processing function based on the task processing request;
the data storage cluster 602 is further configured to process the task processing request if the target processing function is a preset processing function;
the data storage cluster 602 is further configured to forward the task processing request to a data computing cluster if the target processing function is not a preset processing function;
the data computing cluster 604 is configured to receive a task processing request and process the task processing request.
Specifically, the data storage cluster refers to a cluster formed by scheduling storage nodes and executing storage nodes, and the data computing cluster refers to a cluster formed by computing nodes.
In the data processing system of this specification, the data storage cluster processes the task processing request when the target processing function in the request is determined to be a preset processing function; when it is not a preset processing function, the request is forwarded to the data computing cluster, which processes it. In this way, when the target processing function corresponds to a simple processing operation the data processing task is executed directly by the data storage cluster, avoiding the computing resource consumption that would result from having the data computing cluster perform all data processing.
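A sketch of the system-level routing just described, using hypothetical cluster interfaces; only the preset-function check that splits requests between the two clusters comes from the text.

```python
PRESET_FUNCTIONS = {"sum", "add_one_per_element"}

class DataStorageCluster:
    def handle(self, request: dict) -> str:
        return f"storage cluster processed {request.get('function')!r} in place"

class DataComputingCluster:
    def handle(self, request: dict) -> str:
        return f"computing cluster processed {request.get('function')!r}"

def route(request: dict, storage: DataStorageCluster,
          computing: DataComputingCluster) -> str:
    """Keep simple (preset) processing on the storage side; forward the rest."""
    func = request.get("function")
    if func is None or func in PRESET_FUNCTIONS:
        return storage.handle(request)
    return computing.handle(request)

storage, computing = DataStorageCluster(), DataComputingCluster()
assert "storage cluster" in route({"function": "sum"}, storage, computing)
assert "computing cluster" in route({"function": "train_model"}, storage, computing)
```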
Corresponding to the method embodiment described above, the present disclosure further provides an embodiment of a data processing apparatus, and fig. 7 shows a schematic structural diagram of a data processing apparatus according to an embodiment of the present disclosure. As shown in fig. 7, the apparatus includes:
A receiving module 702 configured to receive a task processing request and determine an execution state of each execution storage node based on the task processing request;
A forwarding module 704 configured to determine a target execution storage node based on an execution state of each execution storage node, and forward the task processing request to the target execution storage node;
a determining module 706 configured to determine data location information and a target processing function based on the task processing request, and obtain target traffic data according to the data location information and the target processing function.
Optionally, the receiving module 702 is further configured to:
the scheduling storage node queries the current task load capacity and node state information of each execution storage node in the data storage cluster based on the task processing request;
or
And the scheduling storage node receives the current task load and the node state information reported by each execution storage node in the data storage cluster.
Optionally, the receiving module 702 is further configured to:
determining a target processing function based on the task processing request;
And under the condition that the target processing function is determined to be a preset processing function, determining the execution state of each execution storage node in the data storage cluster based on the task processing request.
Optionally, the determining module 706 is further configured to:
The target execution storage node analyzes the task processing request;
and acquiring the data location information and the target processing function from the task processing request.
Optionally, the determining module 706 is further configured to:
acquiring service data to be processed according to the data location information;
and processing the service data to be processed based on the target processing function to obtain the target service data, as sketched below.
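For example, assuming the data location information names a file path together with an offset and a length, and the target processing function is an ordinary callable, the determining module's work could be sketched as follows; these field names are assumptions, not part of the embodiment.

    # Hypothetical sketch: read the raw service data, then apply the target processing function.
    def obtain_target_data(data_location, target_function):
        with open(data_location["path"], "rb") as f:   # "path", "offset" and "length" are assumed fields
            f.seek(data_location["offset"])
            raw = f.read(data_location["length"])
        return target_function(raw)                    # e.g. a filter or aggregation over the raw bytes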
Optionally, the apparatus further comprises a first detection module configured to:
Determining a current execution storage node in the data storage cluster under the condition that the scheduling storage node detects that the target execution storage node fails;
and sending a task processing request containing the retry tuple to the current execution storage node.
Optionally, the apparatus further comprises a first acquisition module configured to:
The scheduling storage node acquires the execution state of the target execution storage node again;
and increasing the retry value in the retry tuple based on the number of times the execution state is re-acquired, as sketched below.
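One way to read these two blocks together: the scheduler probes the failed target a bounded number of times, bumping the retry value on each probe, and only then hands the request (with the updated retry tuple) to another execution storage node. The sketch below assumes a probe limit and a tuple layout of (execution identifier, retry value); neither is mandated by the embodiment.

    # Sketch of the execution-node failover path; max_probes and the tuple layout are assumptions.
    def failover(request, target_node, cluster, max_probes=3):
        exec_id, retry_value = request["retry_tuple"]
        for _ in range(max_probes):
            retry_value += 1                              # one increment per re-acquired execution state
            if target_node.execution_state().healthy:
                request["retry_tuple"] = (exec_id, retry_value)
                return target_node.process(request)       # target recovered, no failover needed
        # Target is considered failed: pick a current execution storage node and resend.
        request["retry_tuple"] = (exec_id, retry_value)
        current_node = cluster.pick_execution_node(exclude=target_node)
        return current_node.process(request)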
Optionally, the apparatus further comprises an opening module configured to:
the current execution storage node determines a file to be written and a retry tuple based on the task processing request;
and opening the file to be written based on the retry tuple.
Optionally, the apparatus further comprises an alignment module configured to:
determining a current retry tuple of the file to be written;
in the case that the current execution identifier in the current retry tuple is consistent with the execution identifier of the retry tuple, comparing the current retry value of the current retry tuple with the retry value of the retry tuple;
and opening the file to be written based on the retry tuple in the case that the retry value is greater than the current retry value, as sketched below.
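This comparison behaves much like a fencing check: the file is only reopened for the newest retry of the same execution, so a stale writer cannot clobber a newer one. A hypothetical sketch, in which rejecting a mismatched execution identifier is an assumption the embodiment does not spell out:

    # Sketch of the retry-tuple check performed before opening the file to be written.
    def open_for_write(file_meta, incoming_tuple):
        cur_exec_id, cur_retry = file_meta["retry_tuple"]   # tuple currently recorded for the file
        in_exec_id, in_retry = incoming_tuple                # tuple carried by the request
        if in_exec_id == cur_exec_id and in_retry > cur_retry:
            file_meta["retry_tuple"] = incoming_tuple        # remember the newer tuple
            return open(file_meta["path"], "ab")             # accept the retried write
        # Assumed behaviour for all other cases: refuse the stale or mismatched writer.
        raise PermissionError("stale or mismatched retry tuple; write rejected")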
Optionally, the apparatus further comprises a second detection module configured to:
Determining a scheduling storage node set in the data storage cluster under the condition that the scheduling storage node is detected to be faulty;
and sending a task processing request containing the retry tuple to each scheduling storage node in the scheduling storage node set.
Optionally, the apparatus further comprises a second acquisition module configured to:
Re-acquiring the scheduling execution state of the scheduling storage node;
and increasing the retry value of the retry tuple based on the number of times the scheduling execution state is re-acquired, as sketched below.
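The scheduling-node failure path mirrors the execution-node path, except that after the probes are exhausted the retried request fans out to every node in the scheduling storage node set. Sketched loosely, with the same assumed tuple layout and probe limit as before:

    # Loose sketch of the scheduling-node failover path; names and limits are assumptions.
    def failover_scheduler(request, scheduler, scheduler_set, max_probes=3):
        exec_id, retry_value = request["retry_tuple"]
        for _ in range(max_probes):
            retry_value += 1                               # one increment per re-acquired scheduling state
            if scheduler.scheduling_state().healthy:
                request["retry_tuple"] = (exec_id, retry_value)
                return scheduler.process(request)           # scheduler recovered, no broadcast needed
        request["retry_tuple"] = (exec_id, retry_value)
        for node in scheduler_set:                          # otherwise fan out to the scheduling set
            node.process(request)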
In the data processing apparatus provided in the present specification, the scheduling storage node receives a task processing request and determines the execution state of each execution storage node based on the task processing request; the scheduling storage node then determines a target execution storage node based on the execution state of each execution storage node and forwards the task processing request to the target execution storage node; and the target execution storage node determines data location information and a target processing function based on the task processing request and obtains target service data according to the data location information and the target processing function. Because the task processing request is handled cooperatively by the scheduling storage node and the execution storage nodes, data processing efficiency is improved.
The above is a schematic solution of a data processing apparatus of the present embodiment. It should be noted that, the technical solution of the data processing apparatus and the technical solution of the data processing method belong to the same conception, and details of the technical solution of the data processing apparatus, which are not described in detail, can be referred to the description of the technical solution of the data processing method.
Fig. 8 illustrates a block diagram of a computing device 800 provided in accordance with an embodiment of the present specification. The components of computing device 800 include, but are not limited to, memory 810 and processor 820. Processor 820 is coupled to memory 810 through bus 830, and database 850 is used to store data.
Computing device 800 also includes access device 840, which enables computing device 800 to communicate via one or more networks 860. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. Access device 840 may include one or more of any type of network interface, wired or wireless (e.g., a Network Interface Card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present specification, the above-described components of computing device 800, as well as other components not shown in Fig. 8, may also be connected to each other, for example by a bus. It should be understood that the block diagram of the computing device illustrated in Fig. 8 is for exemplary purposes only and is not intended to limit the scope of the present specification. Those skilled in the art may add or replace other components as desired.
Computing device 800 may be any type of stationary or mobile computing device including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 800 may also be a mobile or stationary server.
The processor 820 implements the steps of the data processing method described above when executing the computer instructions.
The foregoing is a schematic illustration of a computing device of this embodiment. It should be noted that, the technical solution of the computing device and the technical solution of the data processing method belong to the same concept, and details of the technical solution of the computing device, which are not described in detail, can be referred to the description of the technical solution of the data processing method.
An embodiment of the present specification also provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of a data processing method as described above.
The above is an exemplary version of a computer-readable storage medium of the present embodiment. It should be noted that, the technical solution of the storage medium and the technical solution of the data processing method belong to the same concept, and details of the technical solution of the storage medium which are not described in detail can be referred to the description of the technical solution of the data processing method.
An embodiment of the present specification also provides a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the data processing method described above.
The above is an exemplary version of a computer program of the present embodiment. It should be noted that, the technical solution of the computer program and the technical solution of the data processing method belong to the same conception, and details of the technical solution of the computer program, which are not described in detail, can be referred to the description of the technical solution of the data processing method.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer-readable medium may be adjusted as appropriate according to the requirements of legislation and patent practice in each jurisdiction; for example, in some jurisdictions, the computer-readable medium does not include electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the embodiments are not limited by the order of actions described, as some steps may be performed in other order or simultaneously according to the embodiments of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all required for the embodiments described in the specification.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are merely used to help clarify the present specification. Alternative embodiments are not intended to be exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the teaching of the embodiments. The embodiments were chosen and described in order to best explain the principles of the embodiments and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. This specification is to be limited only by the claims and the full scope and equivalents thereof.

Claims (14)

1. A data processing method, applied to a data storage cluster, wherein the data storage cluster comprises a scheduling storage node and at least one execution storage node, wherein:
the scheduling storage node receives a task processing request and determines the execution state of each execution storage node based on the task processing request;
the scheduling storage node determines a target execution storage node based on the execution state of each execution storage node, and forwards the task processing request to the target execution storage node;
the target execution storage node determines data location information and a target processing function based on the task processing request, and obtains target service data according to the data location information and the target processing function;
wherein the scheduling storage node receiving the task processing request and determining the execution state of each execution storage node based on the task processing request comprises: determining a target processing function based on the task processing request; and, in the case that the target processing function is determined to be a preset processing function, determining the execution state of each execution storage node in the data storage cluster based on the task processing request.

2. The method according to claim 1, wherein the scheduling storage node determining the execution state of each execution storage node based on the task processing request comprises:
the scheduling storage node querying the current task load and node state information of each execution storage node in the data storage cluster based on the task processing request; or
the scheduling storage node receiving the current task load and node state information reported by each execution storage node in the data storage cluster.

3. The method according to claim 1, wherein the target execution storage node determining data location information and a target processing function based on the task processing request comprises:
the target execution storage node parsing the task processing request; and
acquiring the data location information and the target processing function from the task processing request.

4. The method according to claim 1, wherein obtaining target service data according to the data location information and the target processing function comprises:
acquiring service data to be processed according to the data location information; and
processing the service data to be processed based on the target processing function to obtain the target service data.

5. The method according to claim 1, wherein the task processing request contains a retry tuple, and the retry tuple consists of an execution identifier and a retry value;
the method further comprises:
in the case that the scheduling storage node detects that the target execution storage node fails, determining a current execution storage node in the data storage cluster; and
sending a task processing request containing the retry tuple to the current execution storage node.

6. The method according to claim 5, further comprising, before determining the current execution storage node in the data storage cluster:
the scheduling storage node re-acquiring the execution state of the target execution storage node; and
increasing the retry value in the retry tuple based on the number of times the execution state is re-acquired.

7. The method according to claim 5, further comprising, after sending the task processing request containing the retry tuple to the current execution storage node:
the current execution storage node determining a file to be written and a retry tuple based on the task processing request; and
opening the file to be written based on the retry tuple.

8. The method according to claim 7, wherein opening the file to be written based on the retry tuple comprises:
determining a current retry tuple of the file to be written;
in the case that the current execution identifier in the current retry tuple is consistent with the execution identifier of the retry tuple, comparing the current retry value of the current retry tuple with the retry value of the retry tuple; and
in the case that the retry value is greater than the current retry value, opening the file to be written based on the retry tuple.

9. The method according to claim 1, wherein the task processing request contains a retry tuple, and the retry tuple consists of an execution identifier and a retry value;
the method further comprises:
in the case that the scheduling storage node is detected to have failed, determining a scheduling storage node set in the data storage cluster; and
sending a task processing request containing the retry tuple to each scheduling storage node in the scheduling storage node set.

10. The method according to claim 9, further comprising, before determining the scheduling storage node set in the data storage cluster:
re-acquiring the scheduling execution state of the scheduling storage node; and
increasing the retry value of the retry tuple based on the number of times the scheduling execution state is re-acquired.

11. A data processing system, comprising a data storage cluster and a data computing cluster, wherein:
the data storage cluster is configured to receive a task processing request and determine a target processing function based on the task processing request;
the data storage cluster is further configured to process the task processing request in the case that the target processing function is a preset processing function;
the data storage cluster is further configured to forward the task processing request to the data computing cluster in the case that the target processing function is not a preset processing function; and
the data computing cluster is configured to receive the task processing request and process the task processing request.

12. A computing device, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 10 or 11 when executing the computer instructions.

13. A computer-readable storage medium storing computer-executable instructions, wherein the computer instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 10 or 11.

14. A computer program product, comprising computer instructions which, when executed by a processor, implement the steps of the method according to any one of claims 1 to 10 or 11.
CN202210481248.9A 2022-05-05 2022-05-05 Data processing method and device Active CN115016931B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210481248.9A CN115016931B (en) 2022-05-05 2022-05-05 Data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210481248.9A CN115016931B (en) 2022-05-05 2022-05-05 Data processing method and device

Publications (2)

Publication Number Publication Date
CN115016931A CN115016931A (en) 2022-09-06
CN115016931B true CN115016931B (en) 2025-03-18

Family

ID=83068418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210481248.9A Active CN115016931B (en) 2022-05-05 2022-05-05 Data processing method and device

Country Status (1)

Country Link
CN (1) CN115016931B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115686840A (en) * 2022-10-24 2023-02-03 阿里巴巴(中国)有限公司 Request processing method and system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102546782A (en) * 2011-12-28 2012-07-04 北京奇虎科技有限公司 A distributed system and its data operation method

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109783331B (en) * 2018-12-12 2022-05-13 深圳智链物联科技有限公司 Database cluster pressure testing method and device
US10942865B2 (en) * 2019-06-13 2021-03-09 Arm Limited Snooping with access permissions
CN110262901B (en) * 2019-06-27 2023-06-20 深圳前海微众银行股份有限公司 Data processing method and data processing system
CN112463437B (en) * 2020-11-05 2022-07-22 苏州浪潮智能科技有限公司 Service recovery method, system and related components for offline node of storage cluster system
CN113312316B (en) * 2021-07-28 2022-01-04 阿里云计算有限公司 Data processing method and device
CN113723061B (en) * 2021-08-30 2024-08-02 上海幻电信息科技有限公司 Data processing method and device based on multi-user collaboration framework
CN113886350A (en) * 2021-09-10 2022-01-04 阿里云计算有限公司 Data processing method and system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102546782A (en) * 2011-12-28 2012-07-04 北京奇虎科技有限公司 A distributed system and its data operation method

Also Published As

Publication number Publication date
CN115016931A (en) 2022-09-06

Similar Documents

Publication Publication Date Title
AU2019262823B2 (en) Input and output schema mappings
EP2474919B1 (en) System and method for data replication between heterogeneous databases
EP3791276A1 (en) Merging conflict resolution for multi-master distributed databases
US8751442B2 (en) Synchronization associated duplicate data resolution
WO2019070915A1 (en) Partial database restoration
CN106375458B (en) Service calling system, method and device
US20140067772A1 (en) Methods, apparatuses and computer program products for achieving eventual consistency between a key value store and a text index
US11301490B2 (en) Synchronous database replication with asynchronous transaction recovery
US20090083210A1 (en) Exchange of syncronization data and metadata
US10769174B2 (en) Site-consolidated disaster-recovery with synchronous-to-asynchronous traffic conversion
CN111338834B (en) Data storage method and device
US10642530B2 (en) Global occupancy aggregator for global garbage collection scheduling
CN113297201A (en) Index data synchronization method, system and device
CN107612950A (en) A kind of method, apparatus, system, electronic equipment that service is provided
CN109828951A (en) A kind of method and system of combination Kubernetes and Ceph storage snapshot
CN102799503B (en) Recover method applied to large quantity of small files
CN115016931B (en) Data processing method and device
CN113297231B (en) Database processing method and device
CN102187329B (en) Forgetting items with knowledge based synchronization
CN102185717A (en) Service processing equipment, method and system
CN113297322B (en) Data synchronization method, system and device
CN115048140A (en) Version control method and device, electronic equipment and storage medium
CN113722152A (en) Data incremental backup method, device, equipment and storage medium
CN112181719A (en) A data processing method, apparatus, device and computer-readable storage medium
CN111897490A (en) Method and device for deleting data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant