Detailed Description
The following description of the embodiments of the present application is made clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
The AI accelerator is a hardware device specially designed for artificial intelligence applications, and aims to accelerate the execution of artificial intelligence algorithms and improve computational efficiency and performance. AI accelerators typically employ specialized hardware architectures and instruction sets, optimized for the characteristics of artificial intelligence algorithms, to achieve efficient parallel computation.
In the typical use scenario of the AI accelerator, tasks are executed sequentially in a single order, and one task must be completed before the next task is started. When complex tasks are processed, all tasks in this single-stream execution mode must be executed in sequence; when a certain task is waiting for data transmission or other resources, the accelerator resources may sit idle, so that resources are wasted and the resource utilization rate is low. Even if the accelerator has strong computing power, it cannot be effectively utilized while waiting for data transmission.
In the related art, if a data dependency relationship exists between tasks, then in the single-stream execution mode a subsequent task cannot be executed until the previous task has been completed. When a large number of computationally intensive tasks and data-transmission-intensive tasks are processed in this serial execution mode, the mutual waiting between tasks significantly reduces the overall efficiency of the system. In addition, in the related art, the developer is also required to manually manage the scheduling and execution order of tasks in the single-stream execution mode, which increases programming complexity and the risk of errors, makes code difficult to maintain and extend, and raises development and debugging costs. Therefore, there is a need to design a technical solution that overcomes at least one of the technical problems in the related art.
In order to solve at least one technical problem in the related art, the embodiment of the application provides an accelerator-oriented multitasking method and a related device.
Firstly, in the technical scheme provided by the application, after the data processing task to be executed is obtained, each task to be accelerated is distributed into a corresponding task acceleration flow, and each task acceleration flow independently manages its corresponding task to be accelerated. This changes the previous single-stream execution mode in which tasks are executed sequentially in a single order, so that a plurality of tasks to be accelerated can be processed in parallel in different task acceleration flows. The whole system is thereby prevented from stalling because a single task is waiting for data transmission or other resources, the parallel computing capability of the AI accelerator is fully utilized, idle accelerator resources caused by waiting are reduced, and the resource utilization rate is improved.
In the technical scheme provided by the application, a target event corresponding to the data processing task is created, and the dependency relationships among the tasks to be accelerated are added to the target event, so as to synchronously manage the tasks to be accelerated that have dependency relationships across the task acceleration flows. Even when dependency relationships exist among the tasks, the tasks can still be reasonably scheduled and managed through the target event, so that parallel execution is achieved as far as possible while the dependency relationships are satisfied, instead of executing all tasks sequentially as in the single-stream execution mode; the utilization efficiency of resources is thereby further improved and resource waste is reduced.
In the technical scheme provided by the application, the task acceleration flows are adjusted in real time through the target event so as to optimize the utilization rate of the task acceleration flows. This means that the task acceleration flows can be dynamically adjusted according to factors such as the actual execution state of the tasks and the occupation of resources, so that the resources of the AI accelerator are allocated and utilized more reasonably; when complex tasks are processed, the accelerator resources are fully used rather than left idle because certain tasks are in a waiting state, thereby improving the computational efficiency and performance of the whole system.
Finally, in the technical scheme provided by the application, the task calculation results obtained by each task acceleration flow executing its corresponding task to be accelerated are output, thereby realizing parallel acceleration among the tasks to be accelerated. This is in clear contrast to the sequential execution of all tasks in the traditional single-stream execution mode; the parallel acceleration mode fully exploits the advantages of the AI accelerator, improves computational efficiency, and overcomes the problem of low resource utilization in the single-stream execution mode.
According to the technical scheme, the data processing task is accelerated in parallel by the plurality of task acceleration flows, so that accelerator-oriented multitasking is realized, the task execution efficiency is effectively improved, the utilization rate of accelerator resources is increased, waste of accelerator resources is avoided, and the user experience is improved.
The accelerator-oriented multitasking scheme provided by the embodiments of the application may be executed by an electronic device, and the electronic device may be a server, a server cluster, or a cloud server. The electronic device may also be a terminal device such as a cell phone, computer, tablet, or wearable device, or a dedicated device (e.g. a dedicated terminal device equipped with the accelerator-oriented multitasking system). The chips described in the above embodiments may be mounted on these electronic devices, or the electronic devices may install a service program for executing the accelerator-oriented multitasking scheme.
The embodiments of the application are mainly applied to the AI accelerator to realize dynamic resource allocation for the AI accelerator. Here, an AI accelerator may be understood as a computing device designed or deployed for artificial intelligence applications. The computing device may be a hardware device, a virtual resource, or a combination of both. It adopts a special hardware architecture and instruction set optimized for the characteristics of artificial intelligence algorithms, and aims to accelerate the execution of artificial intelligence algorithms and improve computational efficiency and performance. The AI accelerator can realize efficient parallel computation and is suitable for processing large-scale data and complex computing tasks, such as deep learning model training, real-time data processing, scientific computation, and simulation. In the traditional task execution mode, because tasks are executed sequentially in a single order, accelerator resources may sit idle while waiting for data transmission or other resources; through multi-stream parallel execution and event synchronization, the parallel computing capability of the AI accelerator can be fully exploited, so that the resource utilization rate of the AI accelerator is improved.
Fig. 1 is a flow chart of an accelerator-oriented multitasking method according to an embodiment of the present application. As shown in Fig. 1, the method includes the following steps:
101, acquiring a data processing task to be executed;
102, distributing each task to be accelerated in the data processing task to a corresponding task acceleration flow;
103, creating a target event corresponding to the data processing task, wherein the target event is added with a dependency relationship among all tasks to be accelerated, and the dependency relationship is used for synchronously managing the tasks to be accelerated among all task acceleration flows;
104, adjusting each task acceleration flow in real time through the target event to optimize the utilization rate of each task acceleration flow;
and 105, outputting task calculation results obtained by each task acceleration flow executing its corresponding task to be accelerated, so as to realize parallel acceleration among the tasks to be accelerated.
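By way of a non-limiting illustration, the following sketch shows how steps 101 to 105 could map onto a stream/event programming interface. The CUDA runtime API is used here purely as an assumed stand-in for the accelerator interface described in this application (the actual accelerator operations, such as the STREAMCREATE function mentioned below, may differ), and the kernels, buffer sizes and task contents are hypothetical placeholders rather than part of the claimed method.

#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical stand-ins for two tasks to be accelerated.
__global__ void preprocessKernel(float* d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] *= 0.5f; }
__global__ void computeKernel(float* d, int n)    { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] = d[i] * d[i]; }

int main() {
    const int n = 1 << 20;
    float* h; cudaMallocHost(&h, n * sizeof(float));   // pinned host memory for asynchronous copies
    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    float* d; cudaMalloc(&d, n * sizeof(float));

    // 101: the data processing task (copy-in, preprocess, compute, copy-out) is assumed
    //      to have been obtained from some task source (queue, request, event, ...).

    // 102: distribute the tasks to be accelerated into separate task acceleration streams.
    cudaStream_t copyStream, computeStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&computeStream);

    // 103: create the target event and record the dependency: the compute task
    //      depends on the copy-in / preprocessing task.
    cudaEvent_t ready;
    cudaEventCreateWithFlags(&ready, cudaEventDisableTiming);
    cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, copyStream);
    preprocessKernel<<<(n + 255) / 256, 256, 0, copyStream>>>(d, n);
    cudaEventRecord(ready, copyStream);

    // 104: the compute stream is held back only until the event fires, so both
    //      streams stay busy instead of the whole pipeline idling.
    cudaStreamWaitEvent(computeStream, ready, 0);
    computeKernel<<<(n + 255) / 256, 256, 0, computeStream>>>(d, n);

    // 105: output the task calculation result.
    cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, computeStream);
    cudaStreamSynchronize(computeStream);
    printf("result[0] = %f\n", h[0]);

    cudaEventDestroy(ready);
    cudaStreamDestroy(copyStream); cudaStreamDestroy(computeStream);
    cudaFree(d); cudaFreeHost(h);
    return 0;
}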
In the embodiments of the application, a data processing task refers to a set of data operation tasks, where the data operation tasks include calculation tasks, transmission tasks, and the like. For example, a data processing task may comprise a plurality of different types of subtasks, covering operations ranging from data acquisition, transmission and processing to result output. It should be noted that the present application is mainly used for processing data processing tasks in artificial intelligence algorithms; it can equally be applied to parallel acceleration of other types of data processing tasks, and the application is not limited in this respect.
Illustratively, in a real-time data processing scenario, video stream processing and real-time image processing tasks, data loading and model computing tasks in deep learning training, and large-scale computing and data transmission tasks in scientific computing and simulation applications, etc., all belong to the category of data processing tasks.
In the embodiment of the application, the task acceleration flow is a logic unit for managing and executing tasks, and provides support for parallel execution of the tasks. Each task acceleration stream can independently manage and execute the tasks allocated to it, and a plurality of task acceleration streams can be run simultaneously, thereby realizing parallel processing of a plurality of tasks. By distributing different tasks to different task acceleration flows, the idle and performance bottleneck of resources in a single-flow mode can be avoided, the overlapping of calculation and communication is realized, and the calculation capacity and bandwidth resources of an accelerator are fully utilized. For example, during a data processing process, one task acceleration stream may be responsible for data transfer while another task acceleration stream may perform computing tasks.
In the embodiment of the application, the task to be accelerated is a part of the data processing tasks, and the tasks generally have higher computational complexity or data processing capacity, so that the execution efficiency can be improved by means of special hardware equipment (such as an AI accelerator). In the data processing task, the tasks to be accelerated are distributed to the corresponding task acceleration flows for parallel processing, so that the aim of accelerating execution is fulfilled. For example, in a deep learning training task, forward propagation and backward propagation computation of a model, large-scale matrix operation and the like belong to the task to be accelerated.
In the embodiment of the application, each task acceleration flow is used for independently managing the corresponding task to be accelerated. Under the architecture of multi-stream management, the following aspects are mainly embodied:
In multi-stream management, multiple streams are created by specific operations (e.g., a "SYNSETDEVICE" operation switching to the corresponding device), and each task acceleration stream can execute different computing tasks in parallel. A task acceleration stream may be created with a specified priority, with three levels available (high, medium and low) and a default priority of medium. Task acceleration streams with different priorities can be allocated according to the importance and urgency of the tasks; for example, a task to be accelerated with high real-time requirements can be allocated to a task acceleration stream with high priority (see the sketch after this paragraph). After a task acceleration stream is created, it has an independent resource management space for managing its corresponding task to be accelerated and the associated data transmission. This means that the resource usage of each task acceleration stream does not directly affect other streams, which reduces interference between task acceleration streams and provides a basis for independent execution of tasks. Within each task acceleration stream, tasks are guaranteed to execute in sequence. This is to ensure the correctness of the result, because some tasks to be accelerated may have dependency relationships and the correct calculation result can only be obtained by sequential execution. For example, in a data processing flow, the data preprocessing task may need to be completed first so that the subsequent model calculation task can be performed; within the same task acceleration stream, these tasks are executed sequentially in the predetermined order.
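As a minimal sketch of the priority mechanism just described, the following fragment uses the CUDA runtime as an assumed analogue of the accelerator's device-switching and stream-creation operations (the actual "SYNSETDEVICE"/STREAMCREATE interfaces may differ); the choice of device 0 and the mapping of the three priority levels are illustrative assumptions.

#include <cuda_runtime.h>

int main() {
    cudaSetDevice(0);   // switch to the target accelerator (the role the "SYNSETDEVICE"-type operation plays above)

    // Query the supported priority range; in CUDA, a numerically smaller value means a higher priority.
    int least = 0, greatest = 0;
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    cudaStream_t highPrio, midPrio, lowPrio;
    cudaStreamCreateWithPriority(&highPrio, cudaStreamNonBlocking, greatest);               // high priority
    cudaStreamCreateWithPriority(&midPrio,  cudaStreamNonBlocking, (least + greatest) / 2); // medium (default-like)
    cudaStreamCreateWithPriority(&lowPrio,  cudaStreamNonBlocking, least);                  // low priority

    // Tasks with high real-time requirements would be enqueued on highPrio,
    // background tasks on lowPrio; each stream manages its own work independently.

    cudaStreamDestroy(highPrio);
    cudaStreamDestroy(midPrio);
    cudaStreamDestroy(lowPrio);
    return 0;
}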
While tasks are executed sequentially within each task acceleration stream, tasks in different task acceleration streams may be executed concurrently. For example, since a memory copy task and a computing task occupy different resources (the memory copy mainly occupies memory bandwidth and the computing task mainly occupies the computing cores), they can be allocated to different streams for concurrent execution, thereby reducing the overall execution time and improving computational efficiency (a minimal sketch of this overlap follows). The user can distribute different tasks to be accelerated into different task acceleration streams according to the actual situation. For example, if there are three tasks to be accelerated, the user may put them into three different task acceleration streams, respectively. The multi-stream management mechanism automatically handles parallel execution of the tasks, so that efficiency is improved while the correctness of the result is ensured.
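The overlap of a memory copy task and a computing task across two streams can be sketched as follows. This is a simplified, assumed example (CUDA is again used as the stand-in API, and the kernel and buffer sizes are placeholders), not a required implementation.

#include <cuda_runtime.h>

__global__ void computeTask(float* d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] += 1.f; }

int main() {
    const int n = 1 << 20;
    float* hB; cudaMallocHost(&hB, n * sizeof(float));                 // pinned buffer so the copy can be asynchronous
    for (int i = 0; i < n; ++i) hB[i] = 2.f;
    float *dA, *dB;
    cudaMalloc(&dA, n * sizeof(float)); cudaMalloc(&dB, n * sizeof(float));
    cudaMemset(dA, 0, n * sizeof(float));

    cudaStream_t copyStream, computeStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&computeStream);

    // The copy mainly occupies memory bandwidth and the kernel mainly occupies
    // computing cores, so issuing them on different streams lets them run concurrently.
    cudaMemcpyAsync(dB, hB, n * sizeof(float), cudaMemcpyHostToDevice, copyStream);
    computeTask<<<(n + 255) / 256, 256, 0, computeStream>>>(dA, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(copyStream); cudaStreamDestroy(computeStream);
    cudaFree(dA); cudaFree(dB); cudaFreeHost(hB);
    return 0;
}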
The task acceleration streams have the property of executing asynchronously, which enables data transfer and computation tasks to be performed simultaneously in multiple task acceleration streams. For example, when one task acceleration flow transmits data from the memory to the computing device, the other task acceleration flow can simultaneously execute the computing tasks, so that the resources of the computing device are fully utilized, and the overall computing efficiency is improved. When all tasks to be accelerated in one task acceleration flow are completed, the task acceleration flow is destroyed and related resources are released. This is to ensure efficient use of system resources and avoid waste of resources. By timely destroying the task acceleration flow which is not needed any more, the system can distribute the resources to other tasks to be executed, and the overall performance and the resource utilization rate of the system are improved.
Therefore, each task acceleration flow realizes the independent management of the task to be accelerated through the mechanisms of independent resource management, in-flow sequential execution, inter-flow concurrent execution, user task allocation, destruction of flows, resource release and the like, thereby improving the computing efficiency and the utilization rate of system resources while ensuring the accuracy of computing results.
In 101, a data processing task to be executed is acquired. In practical applications, the implementation manner of acquiring the data processing task to be executed may vary according to the application scenario and the system architecture, and the following are some exemplary implementations:
Such as a task queue. The system maintains one or more task queues, encapsulates new data processing tasks as they are generated and adds them to the task queues. The task acquisition module takes tasks out of the task queue periodically or in real time as the data processing tasks to be executed. For example, in a data processing server, data processing requests sent by clients are parsed and packaged into tasks that are placed into a task queue, and a task retrieval component of the server retrieves tasks from the queue for processing according to first-in-first-out or other rules.
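A minimal host-side sketch of this task-queue acquisition path might look as follows; the structure names and fields are hypothetical and only illustrate the first-in-first-out retrieval idea.

#include <mutex>
#include <optional>
#include <queue>
#include <string>

// A data processing task as it might be packaged from a parsed client request (fields are assumptions).
struct DataProcessingTask {
    int id;
    std::string payload;
};

class TaskQueue {
public:
    void submit(DataProcessingTask t) {                    // new tasks are encapsulated and enqueued
        std::lock_guard<std::mutex> lock(mu_);
        q_.push(std::move(t));
    }
    std::optional<DataProcessingTask> fetch() {            // the retrieval component takes tasks out FIFO
        std::lock_guard<std::mutex> lock(mu_);
        if (q_.empty()) return std::nullopt;
        DataProcessingTask t = std::move(q_.front());
        q_.pop();
        return t;
    }
private:
    std::mutex mu_;
    std::queue<DataProcessingTask> q_;
};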
Such as message middleware. Message middleware (e.g., RabbitMQ, Kafka, etc.) is used to communicate data processing tasks. When a task is generated, the task information is sent in the form of a message to a specific topic or queue of the message middleware. The data processing system receives and acquires tasks by subscribing to the corresponding topics or queues. For example, in a distributed data processing system, data processing tasks generated by different data sources are sent through message middleware, from which each data processing node obtains the tasks and executes them.
Such as database queries. The task information is stored in a database, and the data processing system acquires the task to be executed by querying the database. Specific query conditions, such as a task state of "not executed", may be set to screen out data processing tasks to be executed. For example, in an enterprise-level data processing platform, the creation, distribution, and execution status of tasks are recorded in a database, and the data processing module periodically queries the database to obtain the tasks to be executed.
Such as user interface interactions. In some applications, a user manually submits data processing tasks through a user interface. And after receiving a task submitting request of a user, the system processes and packages the task information as a data processing task to be executed. For example, in a data analysis tool, a user may upload data files and set analysis parameters, and the system creates data processing tasks based on the user's inputs and prepares to execute.
Such as event triggering. The system monitors specific events, and when the events occur, the corresponding data processing tasks are triggered. For example, in a monitoring system, when a certain index is detected to exceed a threshold value, a data processing task is triggered to perform deep analysis and processing on relevant data. The event may be a time event (e.g., a timed task), a state change event (e.g., a device state change), etc.
In 102, each task to be accelerated in the data processing task is allocated to a corresponding task acceleration flow.
As an optional embodiment, before each task to be accelerated in the data processing task is allocated to a corresponding task acceleration flow in 102, a plurality of task acceleration flows may be created in advance, and corresponding accelerator resources may be configured for each task acceleration flow. Further, the priority corresponding to each task acceleration flow is set, wherein the higher the priority of a task acceleration flow, the earlier its assigned task to be accelerated is executed within the dependency relationship.
For example, a plurality of task acceleration streams are created using STREAMCREATE functions. The priorities of the task acceleration streams may be multiple levels customized by the user, or default levels configured in advance.
Illustratively, in a deep learning model training scenario, assume a data processing task that includes multiple tasks to be accelerated, such as data loading, data preprocessing, model forward propagation, model backward propagation, and parameter updating. First, three task acceleration flows, stream A, stream B, and stream C, are created in advance, and corresponding accelerator resources are configured for them. For example, stream A is configured with a high-performance GPU for processing computationally intensive tasks, stream B is configured with a portion of CPU resources and an amount of memory suitable for processing data transfers and some simple computing tasks, and stream C is configured with another GPU whose performance is slightly lower than that of the GPU configured for stream A. Next, the priority of each task acceleration stream is set: the priority of stream A is set high, the priority of stream B is set medium, and the priority of stream C is set low. Tasks are then distributed according to the rule that the higher the priority of a task acceleration flow, the earlier its assigned task to be accelerated is executed within the dependency relationship. The data loading task has low computational requirements but needs a certain memory bandwidth for data transmission, so it is allocated to stream B. The data preprocessing task is also relatively simple, involving some data processing operations, and is likewise allocated to stream B. Model forward propagation is a computationally intensive task with high demands on GPU performance; since stream A has a high priority, it is placed into stream A. Model backward propagation is also computationally intensive and is likewise assigned to stream A. The parameter updating task has a moderate amount of computation and is assigned to stream C.
In this example, since stream A has the highest priority, model forward and backward propagation may be scheduled relatively early within the dependency relationship and are given priority in obtaining computing resources. Stream B has a medium priority, and the data loading and preprocessing tasks are executed on it in sequence. Stream C has the lowest priority, and the parameter update task is executed after the preceding dependent tasks are completed.
From the perspective of resource utilization, accelerator resources of different performance levels are reasonably distributed among the task acceleration flows in this way, meeting the requirements of different tasks and avoiding resource waste: the high-performance GPU handles computationally intensive tasks, while the CPU and a certain amount of memory handle data transfer and simple computing tasks. From the perspective of task execution order, the explicit priority setting ensures that the tasks are executed in an orderly manner under the dependency relationship, guaranteeing the correctness of the model training process. Meanwhile, high-priority tasks are executed preferentially, which improves the processing speed of key tasks and the training efficiency of the whole deep learning model, making the training process more efficient and stable.
As an optional embodiment, in 102, the allocation of each task to be accelerated in the data processing task to a corresponding task acceleration flow may be implemented as the following steps:
1021, acquiring each task to be accelerated from the data processing task;
1022, according to the task characteristics of each task to be accelerated and the dependency relationship between each task to be accelerated, each task to be accelerated is distributed to the corresponding task acceleration flow.
In the embodiment of the application, the task characteristics at least comprise task attributes and task requirements. For example, task attributes include, but are not limited to, task type, calculation type, data transfer type. Task requirements include, but are not limited to, task computational complexity, data transmission bandwidth.
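Purely by way of example, the task characteristics above could be represented and mapped to a task acceleration flow roughly as follows; the field names, enumeration values and stream indices are assumptions used for illustration, not a fixed scheme.

#include <vector>

// Task attributes and task requirements of a task to be accelerated (fields are illustrative).
enum class TaskType { DataTransfer, DataProcessing, ComputeIntensive };

struct TaskToAccelerate {
    int id;
    TaskType type;                 // task attribute: task / calculation / data transfer type
    double computeComplexity;      // task requirement: computational complexity
    double transferBandwidthGBs;   // task requirement: required data transmission bandwidth
    std::vector<int> dependsOn;    // ids of the tasks this task depends on
};

// A simple matching rule: transfers go to stream 0, light processing to stream 1,
// compute-intensive work to stream 2 (the indices are placeholders).
int matchTaskAccelerationStream(const TaskToAccelerate& t) {
    if (t.type == TaskType::DataTransfer)     return 0;
    if (t.type == TaskType::ComputeIntensive) return 2;
    return 1;
}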
For example, among the data processing tasks of deep learning model training, tasks to be accelerated such as data loading, data preprocessing, model forward propagation, model backward propagation and parameter updating are identified through 1021.
Data loading reads training data from disk into memory; it is a data transmission task, involves essentially no computation, mainly consists of data reading operations, is of the disk-to-memory data transfer type, has an extremely low computation amount, and requires a relatively high disk read bandwidth. Data preprocessing normalizes and crops the loaded data; it is a data processing task, involves simple mathematical operations, processes and moves data within memory, has a moderate computation amount, and requires a certain memory read/write bandwidth. Model forward propagation inputs the preprocessed data into the deep learning model to calculate an output result; it is a computationally intensive task with a large number of matrix multiplications and nonlinear activation function calculations, transmits data between memory and the computing device (such as a GPU), has a very high computation amount, and requires a high transmission bandwidth between memory and the computing device. Model backward propagation calculates gradients from the model output and the ground-truth labels; it is a computationally intensive task with a large number of matrix multiplications and gradient calculations, transmits data between memory and the computing device, has a very high computation amount, and requires a high transmission bandwidth between memory and the computing device. Parameter updating updates the model parameters based on the calculated gradients; it is a computation-type task involving some simple mathematical operations such as gradient updates, transmits data between memory and the computing device, has a moderate computation amount, and requires a certain transmission bandwidth between memory and the computing device.
That is, there is a dependency relationship between these tasks to be accelerated, the data preprocessing must be performed after the data loading is completed, the model forward propagation must be performed after the data preprocessing is completed, the model backward propagation must be performed after the model forward propagation is completed, and the parameter updating must be performed after the model backward propagation is completed.
In step 1022, it is assumed that the above-mentioned dependency relationship between tasks to be accelerated is that task 2 depends on task 1, i.e. the data preprocessing must be performed after the data loading is completed. Task 3 depends on task 2, i.e. model forward propagation must be done after data preprocessing is complete. Task 4 depends on task 3, i.e., model back propagation must be done after model forward propagation is complete. Task 5 depends on task 4, i.e. the parameter update must be done after model back propagation is completed.
Based on the assumption, tasks can be distributed into different task acceleration streams according to the task characteristics and the dependency relationships. Specifically, task 1 (data loading) is assigned to task acceleration flow 1. Because the task is mainly data transmission from the disk to the memory, the task occupies different resources with other computing tasks and can be independently performed. Task 2 (data preprocessing) is assigned to task acceleration stream 2. After the task acceleration flow 1 finishes data loading, the task acceleration flow 2 can immediately start data preprocessing and use the memory to calculate. Task 3 (model forward propagation), task 4 (model backward propagation) and task 5 (parameter update) are assigned to task acceleration stream 3. All three tasks are computationally intensive tasks and there are sequential dependencies that can be placed in the same stream to ensure continuity of computation.
With this task allocation, task acceleration flow 2 can perform data preprocessing while task acceleration flow 1 performs data loading and task acceleration flow 3 performs model calculation, so that different tasks are executed in parallel and computational efficiency is improved. Meanwhile, because the dependency relationships among tasks are taken into account, the accuracy of the calculation result is guaranteed.
As an optional embodiment, 1022, in which each task to be accelerated is allocated to a corresponding task acceleration flow according to the task characteristics of each task to be accelerated and the dependency relationships between the tasks to be accelerated, may be implemented as follows: a matched task acceleration flow is determined according to the computation density, data transmission quantity and task type of each task to be accelerated; the matching relationship between the tasks to be accelerated and the task acceleration flows is optimized based on the dependency relationships between the tasks to be accelerated, so that tasks to be accelerated having a dependency relationship are preferentially allocated to the same task acceleration flow; and each task to be accelerated is allocated to its corresponding task acceleration flow according to the optimized matching relationship.
Taking a complex image recognition project as an example, the data processing task of the project includes a plurality of tasks to be accelerated, such as loading original image data from a storage device into a memory (task a), performing preprocessing operations such as noise reduction and clipping on an image (task B), inputting the preprocessed image data into a deep neural network model for feature extraction (task C), performing object classification according to a feature extraction result (task D), and performing post-processing on a classification result (task E).
In terms of task characteristics, task A is mainly data transmission, with a large data transmission volume but low computational intensity, and belongs to the data transmission type. Task B involves some simple image algorithm calculations, with moderate computational intensity; data is processed in memory and the data transmission volume is relatively small, so it belongs to the data processing type. Task C involves the operations of a deep neural network, with very high computational intensity; data is transmitted between memory and computing devices such as a GPU (graphics processing unit), the data transmission volume is large, and it belongs to the computationally intensive type. Task D is also computationally intensive, with high computational intensity and moderate data transmission. Task E has a small amount of computation, mainly simple processing of the results, with low computational intensity and a small data transmission volume.
First, the matched task acceleration flows are determined according to computational intensity, data transmission quantity and task type. Task A can be allocated to a task acceleration flow F dedicated to data transmission; because its data transmission volume is large and its computation is small, it is suited to fast transmission in an independent stream that exploits the bandwidth of the storage device. Task B can be allocated to a data-processing-type task acceleration flow G, in which it can make better use of memory resources. Tasks C and D, being computationally intensive with large amounts of computation, can be allocated to a high-performance computing task acceleration flow H, so that they can fully utilize computing resources such as the GPU.
Then, the matching relationship is optimized based on the dependency relationships between the tasks. Task B depends on task A, i.e., image data loading must be completed before the data can be preprocessed; task C depends on task B, as only preprocessed images can be input into the model; task D depends on task C, since the feature extraction result is the basis of target classification; and task E depends on task D, since post-processing is based on the classification result. Therefore, tasks A and B are preferentially allocated to closely coupled streams (e.g., streams F and G may perform a certain amount of cooperative scheduling), and tasks C, D and E are allocated to the same task acceleration flow H, because their dependency relationships are close and the correctness of the execution order can be ensured within the same stream.
This task allocation brings various beneficial effects. In terms of resource utilization, the different types of task acceleration flows can each fully utilize storage bandwidth, memory resources and computing device resources, so that idle and wasted resources are avoided; for example, while the data transmission flow F is transferring data, the computing flow H can simultaneously perform model computation, rather than the computing device sitting idle waiting for data transmission as in sequential single-stream execution. In terms of execution efficiency, tasks in multiple task acceleration flows are executed in parallel, which significantly shortens the processing time of the whole image recognition project. Meanwhile, because the dependency relationships are taken into account, the tasks are guaranteed to execute in the correct order, the accuracy and reliability of the image recognition result are ensured, and erroneous results caused by incorrect task ordering are avoided.
Further optionally, in the above step in which each task to be accelerated is allocated to a matched task acceleration flow according to the optimized matching relationship, the following may be performed: if tasks to be accelerated that have a dependency relationship are allocated to different task acceleration flows, task marks corresponding to those tasks in the different task acceleration flows are added to the target event, and the task marks are synchronized to the corresponding task acceleration flows, so that a marked task to be accelerated only starts executing after the preceding task it depends on has finished.
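A possible sketch of this task-mark mechanism, again using CUDA events as an assumed analogue of the target event (the kernels stand in for task A and task B and are hypothetical), is as follows.

#include <cuda_runtime.h>

// Placeholders for task A (image data loading) and task B (image preprocessing).
__global__ void taskA_load(float* img, int n)    { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) img[i] = (float)i; }
__global__ void taskB_preproc(float* img, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) img[i] *= 2.f; }

// The "task mark" is an event recorded in stream F after task A and waited on in
// stream G before task B, so that B only starts once its preceding task A is done.
void runWithTaskMark(float* dImg, int n, cudaStream_t streamF, cudaStream_t streamG) {
    cudaEvent_t markA;
    cudaEventCreateWithFlags(&markA, cudaEventDisableTiming);

    taskA_load<<<(n + 255) / 256, 256, 0, streamF>>>(dImg, n);      // task A in stream F
    cudaEventRecord(markA, streamF);                                // mark: task A has completed up to here

    cudaStreamWaitEvent(streamG, markA, 0);                         // stream G blocks on the mark
    taskB_preproc<<<(n + 255) / 256, 256, 0, streamG>>>(dImg, n);   // task B in stream G

    cudaEventDestroy(markA);   // the wait has already been enqueued; resources are released after completion
}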
Continuing with the previous image recognition project as an example, assume that when task acceleration streams are initially allocated according to task characteristics, tasks having a dependency relationship are allocated to different task acceleration streams for some reason (e.g., real-time state of system resources at that time, etc.). For example, task B (image preprocessing) is assigned to task acceleration flow G, while task a (original image data loading) on which it depends is assigned to task acceleration flow F, task D (object classification) is assigned to task acceleration flow H, and task C (feature extraction) on which it depends is also in task acceleration flow H, but due to the temporary situation of system resource allocation, the execution order of task D and task C in flow H may be confused.
In this case, according to the above steps, task marks corresponding to the tasks to be accelerated that have dependency relationships in different task acceleration flows are added to the target event. For example, specific task marks are added for task A and task B in the target event to record the dependency between them, and corresponding marks are also added for task C and task D. The task marks are then synchronized into the corresponding task acceleration flows: in task acceleration flow F, after task A finishes, the task mark sends a signal to task acceleration flow G to notify it that the preceding task A on which task B depends has been completed, and only then does task B start executing, which avoids the error of task B starting to process data that has not finished loading. For task C and task D in task acceleration flow H, the task mark ensures that task D starts target classification only after the feature extraction of task C is completed, guaranteeing the correct order of task execution.
From the viewpoint of accuracy of task execution, the steps can avoid task execution errors caused by incomplete consideration of dependency relationships during task acceleration flow distribution, and ensure correct operation of the whole image recognition flow, so that accuracy of a final recognition result is ensured. From the resource utilization point of view, the invalid occupation or waste of resources caused by the wrong execution sequence of the tasks is avoided, for example, the situation that the task B is started in advance but waits due to no data is avoided, so that the system resources can be utilized more reasonably. From the aspects of stability and reliability of the system, the clear task marking and the synchronous mechanism enable the coordination among task acceleration flows to be more orderly, reduce the possibility of system faults or anomalies caused by improper processing of task dependency, improve the stability and reliability of the whole image recognition system, and ensure the correct execution of tasks and the normal operation of the system even under complex and changeable system resource environments.
In some embodiments, if tasks to be accelerated that have a dependency relationship are allocated to different task acceleration flows, the real-time adjustment of each task acceleration flow through the target event in the above step may be implemented as:
after all tasks to be accelerated with the dependency relationship are detected to be executed, deleting target events for managing all the tasks to be accelerated so as to release accelerator resources occupied by the target events.
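A minimal sketch of this clean-up step, under the same CUDA-as-analogue assumption, has a scheduler poll the event and destroy it once all dependent work has finished.

#include <cuda_runtime.h>

// Once every task to be accelerated managed by the target event has finished,
// the event is deleted so that the accelerator resources it occupies are released.
void releaseTargetEvent(cudaEvent_t targetEvent) {
    if (cudaEventQuery(targetEvent) == cudaSuccess) {   // all work recorded before the event is complete
        cudaEventDestroy(targetEvent);
    }
    // otherwise the scheduler would poll again later, or call cudaEventSynchronize() and then destroy.
}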
First, the possible cases that tasks to be accelerated with dependency relationships are allocated to different task acceleration flows are described:
Firstly, considering resource utilization and parallelism: when tasks have dependency relationships but their demands on resources differ greatly, the tasks with dependency relationships may be distributed to different task acceleration flows in order to fully utilize the different types of resources in a given scenario, thereby improving overall parallelism. For example, in some complex scientific computing tasks, one task may have a high demand for the integer computing power of the CPU, while the next task that depends on it has a high demand for the floating-point computing power of the GPU; to fully utilize the CPU and GPU resources at the same time, the two dependent tasks may be allocated to different acceleration streams and processed by the CPU and GPU, respectively.
Secondly, the task execution time characteristics may differ: if the execution time of the preceding task in the dependency relationship is extremely short while that of the following task is extremely long, the tasks may be distributed to different task acceleration streams to prevent the long task from blocking the resource allocation and execution of the short task. For example, in data processing, a data verification task may be completed quickly, but the subsequent data encryption task needs a long time because of the huge data volume; distributing the two tasks to different streams allows the resources to be released quickly to other tasks once data verification is finished, while the data encryption task proceeds slowly in another stream without causing resource waste or task blockage.
Practical applications are not limited to the above cases; the embodiments of the present application are merely examples.
Based on the above, in the task scheduling system, first, a target event is created for the task to be accelerated having a dependency relationship, and information such as the dependency relationship, the task mark, etc. of the tasks is recorded in the target event, so as to manage and track the execution states of the tasks. For example, the execution progress of the related tasks in each task acceleration stream is monitored in real time through a task scheduler and a monitoring module. When all tasks to be accelerated with the dependency relationship are monitored to be in a completed state, corresponding operations are triggered.
After receiving the signals of completing all the tasks, the task scheduler deletes the target event according to the record of the target event, and returns the accelerator resources occupied by the target event, such as the memory space, the position in the task scheduling queue and the like, to the system resource pool for other tasks or events.
Therefore, accelerator resources occupied by the target event are released in time, invalid occupation of the resources is avoided, the resources can be used by other tasks as soon as possible, the resource utilization rate of the whole system is improved, parallel execution of more tasks can be supported, and the overall performance of the system is improved. In addition, the target event which is not needed is deleted, the number of objects which need to be managed and maintained in the system is reduced, and the complexity of task scheduling and system management is reduced, so that the overhead of the system is reduced, and the running stability and efficiency of the system are improved. After releasing the resources, more possibility and flexibility are provided for new task allocation and scheduling, the system can allocate the resources more freely according to new task demands, the task execution flow is optimized, and the adaptability of the system to different task combinations and workloads is improved.
Further optionally, the step in which each task to be accelerated is allocated to a matched task acceleration flow according to the optimized matching relationship may be implemented as:
after dividing a matrix multiplication task into a plurality of mutually independent tasks to be accelerated, distributing the mutually independent tasks to be accelerated into a plurality of task acceleration flows so as to accelerate the matrix multiplication task in parallel.
For example, in some computing scenarios, matrix multiplication tasks are common computing tasks and their computation effort tends to be large. According to a pre-configured task allocation principle, if a matrix multiplication task is split into a plurality of mutually independent tasks to be accelerated, the tasks can be processed according to a strategy of allocating the tasks to streams. For example, in a deep learning model training process, a large number of matrix multiplication operations are involved. It is assumed that there is a large matrix multiplication task that requires calculation of the result of multiplying two large matrices. The characteristic of matrix multiplication can be utilized to split the matrix multiplication into a plurality of mutually independent sub-matrix multiplication tasks. For example, the large matrix is divided by rows or columns, and the multiplication operation of each submatrix can be regarded as a separate task to be accelerated.
Specifically, in the training process of deep learning, matrix multiplication is one of the core operations of neural network computation. Taking a multi-layer perceptron (MLP) as an example, the neuron computation for each layer involves multiplication of the input vector with a weight matrix. When processing large-scale data sets and deep neural network models, the computational effort of these matrix multiplication operations grows exponentially, resulting in a significant increase in training time. To increase computational efficiency, matrix multiplication tasks may be split.
Assume a large matrix multiplication that computes the product C (m×p) of matrix A (m×n) and matrix B (n×p). According to the principle of matrix multiplication, each element C[i][j] of matrix C is the sum of the products of the i-th row of matrix A and the j-th column of matrix B. This calculation process can be split into a number of independent sub-tasks. For example, matrix A is divided by rows and matrix B by columns, and the product calculation of each row with each column can be regarded as an independent task to be accelerated. In this way, the calculation tasks for the elements of matrix C are independent of each other, with no data dependencies between them.
Further, these independent sub-tasks may be assigned to different task acceleration streams according to the policy of task assignment to streams. Each task acceleration stream may independently process its assigned subtasks, leveraging the parallel computing capabilities of a computing device (e.g., GPU). For example, in a computing environment having multiple GPU cores, each GPU core may be responsible for one or more task acceleration streams, performing subtasks of a matrix multiplication in parallel.
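The row-block splitting just described can be sketched as follows; CUDA streams are used as the assumed task acceleration streams, the matrix sizes and the number of streams are arbitrary placeholders, and a naive kernel stands in for the real sub-matrix multiplication.

#include <cuda_runtime.h>
#include <cstdio>

// Computes one row block of C = A * B: the A block is (rows x N), B is (N x P), the C block is (rows x P).
__global__ void matmulRowBlock(const float* A, const float* B, float* C, int rows, int N, int P) {
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < rows && c < P) {
        float acc = 0.f;
        for (int k = 0; k < N; ++k) acc += A[r * N + k] * B[k * P + c];
        C[r * P + c] = acc;
    }
}

int main() {
    const int M = 1024, N = 1024, P = 1024, NUM_STREAMS = 4;
    const int rowsPerBlock = M / NUM_STREAMS;               // each sub-task handles one row block of A and C

    float *hA, *hB, *hC;                                    // pinned host memory so copies can overlap with kernels
    cudaMallocHost(&hA, (size_t)M * N * sizeof(float));
    cudaMallocHost(&hB, (size_t)N * P * sizeof(float));
    cudaMallocHost(&hC, (size_t)M * P * sizeof(float));
    for (size_t i = 0; i < (size_t)M * N; ++i) hA[i] = 1.f;
    for (size_t i = 0; i < (size_t)N * P; ++i) hB[i] = 1.f;

    float *dA, *dB, *dC;
    cudaMalloc(&dA, (size_t)M * N * sizeof(float));
    cudaMalloc(&dB, (size_t)N * P * sizeof(float));
    cudaMalloc(&dC, (size_t)M * P * sizeof(float));
    cudaMemcpy(dB, hB, (size_t)N * P * sizeof(float), cudaMemcpyHostToDevice);   // B is shared by all sub-tasks

    cudaStream_t streams[NUM_STREAMS];
    for (int i = 0; i < NUM_STREAMS; ++i) cudaStreamCreate(&streams[i]);

    dim3 block(16, 16);
    dim3 grid((P + block.x - 1) / block.x, (rowsPerBlock + block.y - 1) / block.y);
    for (int i = 0; i < NUM_STREAMS; ++i) {
        size_t aOff = (size_t)i * rowsPerBlock * N;
        size_t cOff = (size_t)i * rowsPerBlock * P;
        // The h2d copy, kernel and d2h copy of sub-task i are queued on stream i; different streams overlap.
        cudaMemcpyAsync(dA + aOff, hA + aOff, (size_t)rowsPerBlock * N * sizeof(float), cudaMemcpyHostToDevice, streams[i]);
        matmulRowBlock<<<grid, block, 0, streams[i]>>>(dA + aOff, dB, dC + cOff, rowsPerBlock, N, P);
        cudaMemcpyAsync(hC + cOff, dC + cOff, (size_t)rowsPerBlock * P * sizeof(float), cudaMemcpyDeviceToHost, streams[i]);
    }
    for (int i = 0; i < NUM_STREAMS; ++i) { cudaStreamSynchronize(streams[i]); cudaStreamDestroy(streams[i]); }

    printf("C[0][0] = %f (expected %d)\n", hC[0], N);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cudaFreeHost(hA); cudaFreeHost(hB); cudaFreeHost(hC);
    return 0;
}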
In the task execution process, the problems of data transmission and storage also need to be considered. Since matrix multiplication involves a large number of data read-write operations, a reasonable data transmission strategy is critical to improving computational efficiency. The method can adopt an asynchronous data transmission mode, and can transmit data required by the next batch of calculation to the memory of the computing equipment in advance while the task acceleration flow executes the computing task, so that the data waiting time is reduced.
In addition, in order to ensure efficient operation of task acceleration streams, dynamic adjustment is also required according to the characteristics of the tasks and the real-time state of system resources. For example, if the calculation load of a certain task acceleration flow is too high, so that the execution time of the task acceleration flow is too long, part of the tasks of the task acceleration flow can be migrated into other task acceleration flows with lighter loads, and load balancing is achieved. Meanwhile, the execution sequence of the task acceleration flow can be dynamically adjusted according to the priority and the deadline of the task, so that important tasks can be completed in time.
The matrix multiplication task is split into a plurality of mutually independent tasks to be accelerated, the tasks are processed according to a strategy that the tasks are distributed to streams, and the reasonable data transmission and dynamic adjustment strategy are combined, so that the calculation efficiency of the matrix multiplication can be remarkably improved, the training process of the deep learning model is accelerated, and the method is also suitable for other calculation scenes needing a large amount of matrix operation, such as the fields of scientific calculation, data analysis and the like.
In practice, many tasks are suitable for the above steps in addition to the matrix multiplication task. Such as data processing and analysis tasks. In a large-scale image recognition project, feature extraction is required for a large number of images. The image set can be divided into a plurality of subsets according to a certain rule (such as according to folders, according to image numbering intervals and the like), the image feature extraction tasks of each subset are mutually independent, and can be distributed to different task acceleration streams for parallel processing so as to improve the speed of overall feature extraction. When processing massive transaction data to perform association rule mining, the data can be divided according to different dimensions (such as different areas, different time periods and the like), association rule mining tasks on each divided data set are mutually independent, and can be respectively distributed to different task acceleration streams for parallel execution, so that the speed of mining association rules is increased.
Or a scientific computing task. Such as particle motion calculation tasks in molecular dynamics simulation. When simulating the movement of a large number of particles, the particle set can be divided into a plurality of subsets according to factors such as space position and the like, the movement calculation of the particles in each subset is mutually independent in a certain time step, and can be distributed to different task acceleration flows for parallel calculation, and finally, the results are combined, so that the whole molecular dynamics simulation process is accelerated.
Such as grid computing tasks in finite element analysis. When finite element analysis is performed on a complex structure, calculation tasks of different areas after grid division are often independent of each other. For example, stress analysis is carried out on a large building structure, the grid of the building structure can be divided into a plurality of areas, tasks such as stress calculation in each area can be distributed to different task acceleration flows for parallel processing, and analysis efficiency is improved.
In addition, deep learning training tasks are also applicable. In training a deep neural network, training data is typically divided into batches. The data of each batch are mutually independent during forward propagation, backward propagation and other calculations, so the data processing tasks of different batches can be distributed into different task acceleration flows and carried out in parallel, thereby increasing the training speed of the neural network. Similarly, when multi-modal data (such as images, text and audio) is used for model training, the processing tasks of the different modalities can be regarded as mutually independent tasks to be accelerated. For example, feature extraction of image data, word vector calculation of text data and spectrum analysis of audio data can be allocated to different task acceleration streams for parallel processing, and the features of all modalities are finally fused to continue training the model, improving training efficiency.
Therefore, in the step of distributing a plurality of mutually independent tasks to be accelerated into a plurality of task acceleration streams to accelerate the matrix multiplication tasks in parallel, the split sub-matrix multiplication tasks can be respectively distributed into different task acceleration streams. Since these subtasks are independent of each other, there is no dependency, so they can be executed in parallel in different task acceleration streams.
During execution, data transfer operations may be involved. Referring to the application scenario shown in Fig. 2, assume that h2d denotes host-to-device, i.e., data transferred from host memory to the accelerator, and that d2h denotes device-to-host, i.e., data transferred from the accelerator back to memory. For each sub-matrix multiplication task, the relevant data needs to be transferred from memory to the accelerator (an h2d operation) before computation starts, and after computation is completed, the result is transferred from the accelerator back to memory (a d2h operation). Further assume that a kernel is a task performed by the accelerator's computing unit; each sub-matrix multiplication task is then executed by a kernel that performs the specific computing operation on the accelerator.
In this way, the matrix multiplication task is split into a plurality of independent sub-tasks and distributed to a plurality of task acceleration flows for parallel execution, so that the parallel processing capacity of the task acceleration flows can be fully utilized, the calculation speed of the whole matrix multiplication task is increased, and the parallel acceleration effect is realized. Meanwhile, the principle of task allocation according to the task dependency relationship is followed, and mutually independent tasks are allocated to different streams, so that the calculation efficiency and the resource utilization rate of the system are improved.
In the above step, the task allocation manner of the task acceleration flow is the basis for improving the parallel acceleration. The task allocation policy of the task acceleration flow in the embodiment of the present application may be as follows:
First, allocation policies based on task characteristics. For example, tasks with a larger amount of computation and tasks with a smaller amount of computation are reasonably matched into different task acceleration flows according to the computation amount of the tasks, so that the total computation amount of each task acceleration flow is relatively balanced and the situation where some task acceleration flows are overloaded while others are idle is avoided. For example, when processing multiple matrix multiplication tasks of different scales, large matrix multiplication tasks and small matrix multiplication tasks are dispersed into different streams. As another example, tasks with data dependencies are allocated to the same task acceleration stream, ensuring that they execute in the correct order and avoiding data inconsistency, while mutually independent tasks are allocated to different task acceleration streams for parallel execution so as to fully utilize parallel computing resources. For example, in a data processing flow where a data reading task and a data preprocessing task are performed first and then data analysis tasks are performed, the data reading and data preprocessing tasks have a dependency relationship and are allocated to the same stream; if the data analysis tasks are independent of each other, the different data analysis tasks may be allocated to different streams.
And secondly, an allocation strategy based on resource occupation. For example, the CPU resource perception strategy distributes tasks to different task acceleration flows according to the core number, frequency and other resource conditions of the CPU in the system, so that the CPU resource occupied by each task acceleration flow is in a reasonable range, and the parallel processing capacity of the CPU is fully utilized. For example, if the system has multiple CPU cores, tasks may be allocated to task acceleration streams corresponding to different cores, and for computationally intensive tasks, task acceleration streams where the CPU cores with higher performance are located may be preferentially allocated. Such as memory resource aware policies. The task with larger memory occupation and the task with smaller memory occupation are distributed into different task acceleration streams by considering the size of the memory space required by the task, so that excessive competition of memory resources is avoided. When the system memory is limited, tasks are reasonably distributed, so that each task acceleration stream can normally run, and task failure caused by insufficient memory can be avoided. For example, when processing large-scale data storage and calculation tasks, the tasks are distributed to different streams according to the data volume of the tasks, so that the problem caused by overhigh memory occupation of a certain stream is prevented.
Third, a priority-based allocation policy. Such as task priority policies. And setting a priority for each task, and distributing the tasks to different task acceleration streams according to the priority. The tasks with high priority are preferentially distributed into the task acceleration flow with better resources and higher processing speed so as to ensure that the tasks can be completed in time. For example, in a real-time system, a high priority is set for tasks with high real-time requirements (such as video stream processing, real-time monitoring data processing, etc.), and the tasks are distributed to special high-speed task acceleration streams. For example, the user customizes the priority policy. The user is allowed to set priority for the task or task acceleration stream according to own requirements and business logic. The user can define the priority according to factors such as importance of the task, emergency degree and the like, and the system distributes the task according to the priority set by the user. For example, in a production management system of an enterprise, a user can set tasks related to a key production link to be high-priority and distribute the tasks to task acceleration streams which are processed preferentially.
Fourth, based on dynamically adjusted allocation policies. Such as a load balancing policy. And monitoring the load condition of each task acceleration flow in real time, and when the load of a certain task acceleration flow is found to be too high, migrating part of tasks to the task acceleration flow with lower load, so as to realize dynamic balance of the load. For example, in a cloud computing environment, task allocation is dynamically adjusted according to the load of task acceleration streams on different servers, so that the resource utilization rate of the whole system is optimal. Such as an adaptive adjustment strategy. And automatically adjusting the allocation strategy of the task acceleration flow according to the running state of the system and the execution condition of the task. For example, when the system detects that certain tasks are performed longer than expected, or that certain resources are utilized less, the task allocation is automatically adjusted, trying different allocation strategies to improve the overall performance of the system.
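As one possible, simplified sketch of a load-balancing allocation policy, each task carrying an estimated cost is assigned to the task acceleration stream whose accumulated cost is currently smallest; the structure and cost model are assumptions for illustration only.

#include <algorithm>
#include <vector>

struct PendingTask { int id; double estimatedCost; };    // estimated computation amount of the task

// Greedy least-loaded assignment: returns, for each task, the index of the stream it is mapped to.
std::vector<int> balanceAcrossStreams(const std::vector<PendingTask>& tasks, int numStreams) {
    std::vector<double> load(numStreams, 0.0);
    std::vector<int> assignment(tasks.size(), 0);
    for (size_t i = 0; i < tasks.size(); ++i) {
        int target = (int)(std::min_element(load.begin(), load.end()) - load.begin());
        assignment[i] = target;                           // task goes to the currently lightest stream
        load[target] += tasks[i].estimatedCost;           // keep per-stream total computation roughly balanced
    }
    return assignment;
}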
It should be noted that, in practical applications, dynamically adjusting the allocation of task acceleration flows according to the real-time state of system resources is key to improving system performance and resource utilization. The following describes this in detail from four aspects: resource monitoring, task assessment, policy adjustment, and implementation and feedback.
First, resource monitoring. Specifically, the usage of hardware resources such as the CPU, GPU, memory, disk I/O, and network bandwidth is monitored in real time by means of the system's monitoring tools or third-party software. For example, in a deep learning training scenario, GPU utilization and memory occupancy are the primary concerns, while in data processing applications, disk I/O and memory usage are more critical.
In addition to hardware resources, software-level resources, such as task queue length, thread pool status, etc., need to be monitored. This information can reflect the current task load and processing power of the system.
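Illustratively, the following is a minimal sketch of such a monitoring probe, combining a hardware-level query (free device memory via the CUDA runtime, assumed here as the accelerator backend) with a software-level metric (pending-task queue length). The queue type and the way the scheduler would consume these signals are assumptions for illustration only.

```cpp
// Sketch: combined hardware/software resource probe.
#include <cuda_runtime.h>
#include <deque>
#include <cstdio>

struct PendingTask { int id; };

void report_load(const std::deque<PendingTask>& taskQueue) {
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);                 // hardware level: device memory usage
    double memUsedFraction = 1.0 - double(freeBytes) / double(totalBytes);

    // software level: how much work is still waiting for a task acceleration flow
    printf("device memory used: %.1f%%, queued tasks: %zu\n",
           memUsedFraction * 100.0, taskQueue.size());
    // A scheduler could combine these two signals to decide whether new tasks
    // should be admitted to a stream, held back, or migrated elsewhere.
}
```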
Second, task assessment. The characteristics of each task to be accelerated are analyzed, including its computational density, data transmission volume, and estimated execution time. For example, compute-intensive tasks typically require more CPU or GPU resources, while data-transmission-intensive tasks place higher demands on network bandwidth and disk I/O. Further optionally, each task is assigned a priority according to its importance, urgency, and business requirements. High-priority tasks should acquire system resources first to ensure they are completed in time.
Third, policy adjustment. When system resources are insufficient, for example when CPU or GPU utilization reaches its upper limit, part of the tasks can be migrated from a resource-intensive acceleration stream to a relatively resource-rich one. For example, if the task load on one GPU is too high, some tasks with a smaller computational load may be migrated to other GPUs for execution. Low-priority tasks can be temporarily suspended and resumed when system resources become sufficient, and tasks that are no longer needed can be canceled directly.
When idle resources exist in the system, tasks waiting to be executed can be assigned to idle acceleration flows to make full use of system resources. Tasks with a large computational load can be split into multiple subtasks and distributed to different acceleration flows for parallel execution, speeding up their completion.
Fourth, implementation and feedback. The allocation of the task acceleration flows is adjusted in real time according to the adjustment policy. During adjustment, task dependencies must be handled correctly to avoid task execution errors. The effect of the dynamic adjustment is fed back and evaluated in real time: the effectiveness of the adjustment policy is judged by comparing system performance indicators before and after adjustment, such as task completion time and resource utilization. The adjustment policy is then optimized and improved according to the feedback, improving the adaptability and performance of the system.
In practical applications, more intelligent dynamic adjustment can also be realized by combining machine learning algorithms, such as reinforcement learning. The reinforcement learning algorithm can automatically learn the optimal adjustment strategy according to the historical state and feedback information of the system, so that the performance and the resource utilization rate of the system are further improved.
It will be appreciated that, in embodiments of the present application, the overlap of computation and communication is achieved by performing data transmission and computation in different streams. For example, computing tasks may be executed in one stream while data is transferred in another, making full use of the accelerator's computing power and bandwidth. Streams and events are allocated according to the characteristics of the specific tasks. For example, for a large, computation-heavy data transfer workload, the number of streams can be increased and the division of tasks refined, avoiding overload of a single stream. Meanwhile, event positions are set reasonably to reduce unnecessary waiting time and synchronization overhead and to improve overall performance.
For example, assume a large-scale image classification deep learning model is being trained. The data processing task includes data loading (reading image data from disk to memory), data preprocessing (e.g., normalization and cropping), model forward propagation, model backward propagation, and parameter update, all of which are tasks to be accelerated. Multiple task acceleration streams are created in advance, and target events are created to manage the dependencies between tasks. The data loading task is assigned to stream 1, and the model forward and backward propagation computing tasks are assigned to stream 2. In stream 1, the next batch of image data is read from disk to memory, which is a data transfer process. At the same time, stream 2 uses the data loaded and preprocessed in the previous batch to perform forward and backward propagation. In this way, data transmission and computation are performed simultaneously in different streams, overlapping computation and communication.
In the traditional single-stream execution mode, computation can only start after data loading is complete, and the disk is idle during computation, so the accelerator's computing power and bandwidth cannot be fully utilized. With the overlapping approach, data transmission proceeds while computation is in progress, and the overall processing time is greatly reduced. For example, a training iteration that originally took 10 minutes (3 minutes of data loading and 7 minutes of computation) may be shortened to about 7 minutes once data transmission and computation run in parallel.
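Illustratively, the following is a minimal sketch of this overlap pattern using two CUDA streams: one stream copies the next batch to the device while the other computes on the batch already resident. The kernel, buffer layout, and double-buffering scheme are illustrative assumptions, not a prescribed implementation of the method.

```cpp
// Sketch: overlapping data transfer (stream 1) with computation (stream 2).
#include <cuda_runtime.h>

__global__ void process_batch(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;              // stand-in for forward/backward computation
}

// hostBatches should be pinned (cudaMallocHost) so cudaMemcpyAsync can truly overlap.
void train_overlapped(const float* hostBatches, float* devIn[2], float* devOut[2],
                      int batchElems, int numBatches) {
    cudaStream_t copyStream, computeStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&computeStream);
    size_t bytes = (size_t)batchElems * sizeof(float);

    // Preload batch 0 so the first computation has data available.
    cudaMemcpyAsync(devIn[0], hostBatches, bytes, cudaMemcpyHostToDevice, copyStream);
    cudaStreamSynchronize(copyStream);

    for (int b = 0; b < numBatches; ++b) {
        int cur = b % 2, nxt = (b + 1) % 2;
        // Compute on the resident batch...
        process_batch<<<(batchElems + 255) / 256, 256, 0, computeStream>>>(
            devIn[cur], devOut[cur], batchElems);
        // ...while simultaneously transferring the next batch in the other stream.
        if (b + 1 < numBatches)
            cudaMemcpyAsync(devIn[nxt], hostBatches + (size_t)(b + 1) * batchElems,
                            bytes, cudaMemcpyHostToDevice, copyStream);
        cudaStreamSynchronize(copyStream);
        cudaStreamSynchronize(computeStream);
    }
    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(computeStream);
}
```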
Considering that the data loading and preprocessing tasks involve large data transfers while the model computation tasks are compute-intensive, the number of streams is increased and the task division refined: the data loading task is assigned to stream 1, the data preprocessing task to stream 2, and the model forward propagation, backward propagation, and parameter update to streams 3, 4, and 5, respectively.
At the same time, event positions are set reasonably. For example, after stream 1 completes data loading, it triggers an event that notifies stream 2 to start data preprocessing; after stream 2 completes preprocessing, it triggers an event that notifies stream 3 to start model forward propagation. In this way, each task starts executing at the proper time, reducing unnecessary waiting.
If streams and events are not allocated reasonably, a stream may become overloaded. For example, if all tasks are concentrated in one stream, data loading and computation compete for resources and processing slows down. By increasing the number of streams and refining the division of tasks, the load of each stream stays relatively balanced and single-stream overload is avoided. For example, when all tasks were originally handled in a single stream, each training period took 15 minutes and resource utilization reached only 60%; after streams and events were allocated reasonably, the overall time dropped to 8 minutes and resource utilization rose above 90%.
By performing data transmission and computation simultaneously in different streams and allocating streams and events reasonably, the method not only makes full use of the accelerator's computing power and bandwidth, but also reduces unnecessary waiting time and synchronization overhead, significantly improving the overall performance of deep learning model training, accelerating model training, and raising resource utilization.
And 103, creating a target event corresponding to the data processing task. In the embodiment of the application, the dependency relationships among the tasks to be accelerated are added to the target event, so that tasks to be accelerated that have dependency relationships can be synchronously managed across the task acceleration flows.
Specifically, assume the data processing task is a complex gene data analysis task that includes a plurality of tasks to be accelerated: first, a gene data file is read from a storage device (task A); then the read data undergoes format conversion and preliminary cleaning (task B); the cleaned data is then subjected to gene feature extraction using a specific algorithm (task C); and finally gene function prediction is performed based on the extracted features (task D).
Upon receipt of this data processing task, when the step of creating a target event is carried out, the system creates the corresponding target event and adds the dependency relationships between the tasks to be accelerated in detail: task B depends on task A, because format conversion and cleaning can only be performed after the gene data file has been read successfully; task C depends on task B, because feature extraction can only be performed after data cleaning is complete; and task D depends on task C, because functional prediction requires the gene features to be available.
To record such dependencies, the system may employ a data structure such as a directed acyclic graph (DAG), in which nodes represent the tasks to be accelerated and edges represent the dependencies between them. When task A is completed, the system checks the dependency edges related to task A in the target event, finds that task B depends on task A, and notifies the task acceleration flow where task B resides that task B can be started.
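Illustratively, the following is a minimal sketch of such a DAG record and its completion handler. The names (Task, launch, on_task_complete) are hypothetical placeholders for the actual accelerator launch path; only the dependency bookkeeping itself follows the description above.

```cpp
// Sketch: a DAG used as the target event's dependency record.
// An edge A -> B (B listed in A's dependents) means B depends on A.
#include <vector>
#include <functional>

struct Task {
    int id;
    int unmetDeps = 0;                      // number of prerequisite tasks still unfinished
    std::vector<int> dependents;            // tasks that wait for this one
    std::function<void()> launch;           // submits the task to its task acceleration flow
};

// Called when task `done` finishes in its stream: every dependent whose
// prerequisites are now all satisfied is released for execution.
void on_task_complete(std::vector<Task>& dag, int done) {
    for (int next : dag[done].dependents) {
        if (--dag[next].unmetDeps == 0) {
            dag[next].launch();             // e.g. task B starts as soon as task A completes
        }
    }
}
```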
In this way, synchronous management of tasks to be accelerated that have dependency relationships is achieved across the task acceleration flows, with notable benefits. In terms of execution correctness, tasks are guaranteed to execute in the proper order, avoiding erroneous results caused by out-of-order execution; for example, feature extraction is never performed on uncleaned data, which safeguards the reliability of the gene data analysis results. In terms of resource utilization, because tasks execute in order according to their dependencies, the acceleration stream resources of other tasks are not left idle while one task waits for its prerequisite to finish. For example, while task A is executing, other task acceleration flows can process subtasks that have no dependency on task A, and as soon as task A completes, the execution of task B is triggered immediately, so system resources are fully utilized, the execution efficiency of the whole gene data analysis task is improved, and data processing proceeds more efficiently and stably.
And 104, adjusting each task acceleration flow in real time through the target event, so as to optimize the utilization rate of each task acceleration flow.
In an optional embodiment, 104, the task acceleration flows are dynamically detected through the target event, and the task allocation situation in each task acceleration flow is adjusted in real time based on the dynamic detection result so as to balance the utilization rate of each task acceleration flow.
Assume a promotional-activity data processing scenario on a large e-commerce platform, where one data processing task comprises a plurality of tasks to be accelerated. For example, there are tasks for acquiring related data from different data sources such as a user order database, a commodity inventory database, and a user review database (task A, task B, and task C, respectively), a task for integrating the acquired data (task D), and a task for analyzing and predicting the promotional effect based on the integrated data (task E). These tasks require the AI accelerator to be invoked for accelerated processing.
Then, in 104, first, a target event corresponding to the data processing task is created, and a dependency relationship between each task to be accelerated is recorded therein, for example, task D depends on the completion of task a, task B, and task C, and task E depends on the completion of task D. Meanwhile, these tasks are assigned to different task acceleration streams, assuming that task a, task B, and task C are executed in stream 1, stream 2, and stream 3, respectively, task D is executed in stream 4, and task E is executed in stream 5.
During execution, each task acceleration flow is dynamically detected through the target event: the resource usage (such as CPU utilization, memory occupation, and data transfer rate) and the task execution progress of each task acceleration stream are monitored in real time. For example, detection finds that stream 1 (executing task A) is progressing slowly and has low CPU utilization because its data source has a large data volume and the network transmission is unstable, while streams 2 (executing task B) and 3 (executing task C) have already completed their tasks and are idle, and stream 4 (executing task D) cannot start because it is waiting for task A to complete, leaving its resources idle.
Based on the dynamic detection result, the task allocation in each task acceleration stream is adjusted in real time. To balance the utilization of the task acceleration flows, the system can migrate part of the data acquisition work of task A to stream 2 or stream 3, using their idle resources to speed up task A. At the same time, stream 4 is notified to be ready to start task D immediately once task A completes.
Such real-time adjustment yields good results. From the perspective of resource utilization, it avoids both the idleness of streams 2 and 3 after tasks B and C complete and the idleness of stream 4 while it waits for task A, improving overall system resource utilization so that the resources of every task acceleration stream are fully used. From the perspective of task execution efficiency, task A is accelerated, so the subsequent dependent tasks (task D and task E) can start in time, shortening the completion time of the whole data processing task, improving the timeliness and accuracy of the promotional-activity data processing, and providing faster and more effective support for the e-commerce platform's decisions. Moreover, this real-time adjustment mechanism based on the target event can flexibly adjust task allocation according to the actual execution situation, enhancing the adaptability and stability of the system and allowing it to run efficiently even under complex and variable task loads.
Illustratively, the steps described above may be implemented with specific functions. For example, an EventCreate function is used to create events, which are mainly used to record the completion of a task. Before the data processing task begins, the EventCreate function is first used to create the relevant events. For example, for each of the tasks A, B, etc. in the aforementioned e-commerce platform promotional-activity data processing scenario, the EventCreate function may be used to create corresponding events "eventA", "eventB", etc. to record the completion of task A and task B, respectively.
An EventRecord function records an event before the corresponding task begins, for subsequent tracking and management. Just before each task begins execution, the EventRecord function is called to record the corresponding event. For example, before task A begins execution, EventRecord(eventA) is called, indicating that the recording of event information associated with task A has started, so the system knows that task A has entered its execution flow.
The EventSynchronize function waits for an event to complete, ensuring that the program does not continue into parts that should not yet execute until the relevant task is finished; it plays a synchronizing, blocking role. Wherever it is necessary to wait for a task to complete, the EventSynchronize function is called. For example, before task D begins, because it depends on the completion of tasks A, B, and C, the code logic of task D calls EventSynchronize(eventA), EventSynchronize(eventB), and EventSynchronize(eventC), so task D waits for the events of tasks A, B, and C to all complete before continuing execution.
The EventElapsedTime function checks the specific execution time of a task, which is convenient for evaluating and analyzing task execution efficiency. After a task is completed, the EventElapsedTime function is called to obtain the time the task took. For example, after task A completes, calling EventElapsedTime(eventA) yields the time consumed by task A from start to finish, which is very helpful for analyzing task execution efficiency and optimizing system performance.
The StreamWaitEvent function adds event dependencies between multiple streams, ensuring that tasks are performed in the correct order across streams. Assume stream 1 performs task A, stream 2 performs task B, and stream 3 performs task D, where task D relies on the completion of tasks A and B. In the code of task D in stream 3, StreamWaitEvent(stream 1, eventA) and StreamWaitEvent(stream 2, eventB) are used, indicating that task D in stream 3 depends on event "eventA" of task A in stream 1 and event "eventB" of task B in stream 2. Thus, if "eventA" and "eventB" have not completed, task D in stream 3 is blocked until both have completed, ensuring the proper order and dependencies of task execution across the streams.
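Illustratively, the following is a minimal sketch of this pattern written with the CUDA runtime counterparts of the functions above (cudaEventCreate, cudaEventRecord, cudaStreamWaitEvent, cudaEventSynchronize, cudaEventElapsedTime); treating them as the concrete backend is an assumption, and note that in CUDA the waiting stream is the first argument of cudaStreamWaitEvent. The kernels taskA/taskB/taskD are hypothetical stand-ins for the tasks in the e-commerce example.

```cpp
// Sketch: event-based ordering of tasks A, B, D across three streams.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void taskA() {}
__global__ void taskB() {}
__global__ void taskD() {}

int main() {
    cudaStream_t stream1, stream2, stream3;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    cudaStreamCreate(&stream3);

    cudaEvent_t startA, eventA, eventB;
    cudaEventCreate(&startA);                  // EventCreate
    cudaEventCreate(&eventA);
    cudaEventCreate(&eventB);

    cudaEventRecord(startA, stream1);          // EventRecord: mark the start of task A
    taskA<<<1, 1, 0, stream1>>>();
    cudaEventRecord(eventA, stream1);          // completion marker for task A

    taskB<<<1, 1, 0, stream2>>>();
    cudaEventRecord(eventB, stream2);          // completion marker for task B

    // StreamWaitEvent: task D in stream 3 must not start before A and B finish.
    cudaStreamWaitEvent(stream3, eventA, 0);
    cudaStreamWaitEvent(stream3, eventB, 0);
    taskD<<<1, 1, 0, stream3>>>();

    cudaEventSynchronize(eventA);              // EventSynchronize: host waits for task A

    float msA = 0.0f;                          // EventElapsedTime: duration of task A
    cudaEventElapsedTime(&msA, startA, eventA);
    printf("task A took %.3f ms\n", msA);

    cudaDeviceSynchronize();
    return 0;
}
```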
As an alternative embodiment of this step, the real-time task execution situation in each task acceleration stream is dynamically monitored through status-checking functions, and each task acceleration flow is dynamically scheduled according to the predicted future task execution so as to switch unexecuted tasks to be accelerated from task acceleration flows with higher utilization to those with lower utilization.
In actual computing task processing, dynamically adjusting task allocation and execution strategies is a key means of improving system performance and resource utilization. First, the status information of each task acceleration stream and its related events can be obtained dynamically in real time through the stream and event status-checking functions (e.g., EventQuery and StreamQuery). The EventQuery function queries the status of an event, such as whether the event has completed or is still pending. The StreamQuery function reports the running condition of a task acceleration flow, including the execution progress of tasks in the stream and the system resources the stream occupies (such as the usage proportion of the CPU, memory, GPU, and so on).
For example, in a deep learning training system that includes multiple task acceleration streams, these functions can be used to monitor the running state of each stream in real time. If the GPU utilization of a certain stream is found to be low, this may mean that the tasks in the stream involve little computation or that resources are allocated unreasonably, and the task allocation needs further analysis and adjustment.
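Illustratively, the following is a minimal sketch of non-blocking status checks using cudaEventQuery and cudaStreamQuery, the CUDA runtime counterparts of the EventQuery and StreamQuery functions described above (the mapping is an assumption). What the scheduler then does with an "idle" stream is left to the policies described elsewhere in this section.

```cpp
// Sketch: polling stream and event status without blocking the host.
#include <cuda_runtime.h>
#include <cstdio>

bool stream_is_idle(cudaStream_t s) {
    // cudaSuccess: all work submitted to the stream has completed (the stream is idle);
    // cudaErrorNotReady: the stream still has pending work.
    return cudaStreamQuery(s) == cudaSuccess;
}

bool event_is_done(cudaEvent_t e) {
    return cudaEventQuery(e) == cudaSuccess;
}

void poll_and_report(cudaStream_t streams[], int n, cudaEvent_t lastEvent) {
    for (int i = 0; i < n; ++i) {
        printf("stream %d: %s\n", i, stream_is_idle(streams[i]) ? "idle" : "busy");
    }
    if (event_is_done(lastEvent)) {
        printf("last recorded task has completed; its dependents may be released\n");
    }
}
```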
The computational load of tasks must also be considered: different tasks differ greatly in their amount of computation, and to make full use of system resources, tasks must be distributed to different task acceleration streams according to that amount. Tasks with a large computational load, such as the forward and backward propagation of a deep neural network, should be allocated to streams with ample computing resources, such as task acceleration streams backed by high-performance GPUs, while tasks with a small computational load, such as simple data preprocessing, can be allocated to streams with relatively fewer computing resources, avoiding resource waste.
Regarding task dependencies, the dependency relationships between tasks are an important basis for task allocation. After the user specifies the task dependencies, the software uses multi-stream techniques to execute independent tasks in parallel while ensuring, through events, that dependent tasks execute in the correct order. For example, in a data processing flow, the data loading task must complete before the data cleaning task, which in turn must complete before the data analysis task. Through the event mechanism, a corresponding event is triggered after the data loading task completes; only after the stream hosting the data cleaning task detects and confirms this event does the cleaning task start, and so on, ensuring the order and correctness of task execution.
Further alternatively, a heuristic algorithm (such as a greedy algorithm) is employed to optimize task allocation. The core idea of a greedy algorithm is to take the locally optimal decision at each selection step in the hope of obtaining a globally good solution. In task allocation, when a task completes, the system analyzes the tasks that depend on it and allocates them to idle task acceleration streams.
In an image processing system, after the image data reading task completes, the system checks its dependent follow-up tasks, such as image noise reduction and image enhancement. If idle task acceleration flows exist at that moment, the system preferentially distributes those dependent tasks into the idle flows so as to make full use of system resources and reduce task waiting time. This improves the overall execution efficiency of the system to a certain extent and avoids idle resources and task backlogs.
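Illustratively, the following is a minimal sketch of such a greedy re-allocation step: when tasks become ready, each one is assigned to whichever stream is currently idle, falling back to round-robin when none is idle. The data structures, the fallback rule, and the use of cudaStreamQuery as the idleness test are assumptions for illustration only.

```cpp
// Sketch: greedy assignment of ready tasks (e.g. denoising, enhancement) to idle streams.
#include <cuda_runtime.h>
#include <vector>

struct ReadyTask {
    void (*submit)(cudaStream_t);            // launches the task on the chosen stream
};

void greedy_assign(std::vector<cudaStream_t>& streams,
                   std::vector<ReadyTask>& readyTasks,
                   int& roundRobinCursor) {
    for (ReadyTask& t : readyTasks) {
        int chosen = -1;
        for (int i = 0; i < (int)streams.size(); ++i) {
            if (cudaStreamQuery(streams[i]) == cudaSuccess) { chosen = i; break; } // idle stream
        }
        if (chosen < 0) {                     // no idle stream: fall back to round robin
            chosen = roundRobinCursor;
            roundRobinCursor = (roundRobinCursor + 1) % (int)streams.size();
        }
        t.submit(streams[chosen]);            // greedy: best choice available right now
    }
    readyTasks.clear();
}
```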
The dynamic adjustment of task allocation and execution strategies can be realized by a series of measures such as dynamically monitoring the execution condition of the task acceleration flow, carrying out task allocation by combining the task characteristics, optimizing the task allocation by adopting a heuristic algorithm, and the like, so that the utilization rate of the flow is improved, the task waiting time and the resource waste are reduced, and the performance and the efficiency of the whole system are improved.
In another embodiment, in 104, adjusting each task acceleration stream in real time through the target event includes: after each task acceleration stream is started, detecting the completion of the last task to be accelerated in each task acceleration stream through a stream synchronization (StreamSynchronize) function; and, after the last task to be accelerated is detected to have finished executing, either allocating a pending task to be accelerated to the currently detected task acceleration flow, or destroying the currently detected task acceleration flow so as to release the accelerator resources it occupies.
Illustratively, the data processing task is assumed to be in a scenario of large-scale video processing, which includes a plurality of tasks to be accelerated, such as video decoding, video noise reduction, video color correction, video encoding, and the like. Multiple task acceleration streams, such as stream 1, stream 2, stream 3, are created in advance for processing different types or phases of tasks, respectively, and corresponding target events are created to manage the dependencies between the tasks.
After each task acceleration flow is started, the system detects the completion condition of the last task to be accelerated in each task acceleration flow through StreamSynchronize functions. For example, stream 1 is responsible for video decoding, the last task to be accelerated is to decode a segment of video data completely, stream 2 is responsible for video noise reduction, the last task to be accelerated is to complete noise reduction processing for decoded video frames, stream 3 is responsible for video encoding, and the last task to be accelerated is to encode processed video data into a specified format.
After detecting that the last video decoding task in stream 1 has completed, the system acts according to the situation. If other unprocessed video data still needs decoding, the system allocates a new video decoding task to stream 1 so it can continue working; the accelerator resources occupied by stream 1 (such as the GPU's decoding capability) are thus fully utilized, decoding continues uninterrupted, and the efficiency of the overall video processing improves. Conversely, when the last video encoding task in stream 3 is detected to have finished and no new encoding task is waiting, the system destroys stream 3. Because stream 3 occupies certain accelerator resources (e.g., the processing power of the encoding unit and associated memory), destroying it frees those resources so that other task acceleration streams or other data processing tasks can use them, avoiding idle waste of resources.
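Illustratively, the following is a minimal sketch of this choice, using cudaStreamSynchronize and cudaStreamDestroy as the CUDA runtime counterparts of the StreamSynchronize step and stream destruction described above (the mapping is an assumption). decode_next_segment and hasPendingWork are hypothetical placeholders.

```cpp
// Sketch: after the stream's last task completes, either reuse the stream or destroy it.
#include <cuda_runtime.h>

__global__ void decode_next_segment() {}     // stand-in for the next decoding task

void finish_or_release(cudaStream_t& stream, bool hasPendingWork) {
    // Block until every task previously submitted to this stream has completed
    // (the StreamSynchronize detection described above).
    cudaStreamSynchronize(stream);

    if (hasPendingWork) {
        // Stream 1 case: more video data awaits decoding, so keep the stream busy.
        decode_next_segment<<<1, 1, 0, stream>>>();
    } else {
        // Stream 3 case: no further encoding work, so release the accelerator
        // resources held by the stream.
        cudaStreamDestroy(stream);
        stream = nullptr;
    }
}
```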
In this way, by using the target event together with the StreamSynchronize function to adjust the task acceleration flows in real time, the flows and their resources can be managed dynamically according to the actual execution of the tasks. On the one hand, a task acceleration flow keeps working whenever there are tasks it can execute, continuously advancing the data processing task; on the other hand, when a flow has finished its tasks and has no follow-up work, its occupied resources are released in time, improving overall system resource utilization and allowing complex data processing tasks such as large-scale video processing to run more efficiently and rationally.
In the embodiment of the application, parallel acceleration of the data processing task is realized through the plurality of task acceleration flows, thereby achieving accelerator-oriented multitasking, effectively improving task execution efficiency, raising the utilization rate of accelerator resources, avoiding waste of accelerator resources, and improving user experience.
In yet another embodiment of the present application, there is also provided an accelerator-oriented multitasking apparatus applied to a system-on-chip comprising a plurality of modules. Referring to fig. 3, the apparatus comprises the following units:
an acquisition unit configured to acquire a data processing task to be executed;
an allocation unit configured to allocate each task to be accelerated in the data processing task to a corresponding task acceleration flow, where each task acceleration flow is used for independently managing its corresponding task to be accelerated;
a creation unit configured to create a target event corresponding to the data processing task, where the dependency relationships among the tasks to be accelerated are added to the target event so as to synchronously manage, across the task acceleration flows, the tasks to be accelerated that have dependency relationships;
an adjusting unit configured to adjust each task acceleration flow in real time through the target event so as to optimize the utilization rate of each task acceleration flow;
an output unit configured to output the task calculation results obtained by each task acceleration flow executing its corresponding task to be accelerated, thereby realizing parallel acceleration among the tasks to be accelerated.
Further optionally, the adjusting unit, in adjusting each task acceleration flow in real time through the target event, is configured to:
after each task acceleration flow is started, the completion condition of the last task to be accelerated in each task acceleration flow is detected through StreamSynchronize functions;
after the last task to be accelerated is detected to have finished executing, allocate a pending task to be accelerated to the currently detected task acceleration flow, or destroy the currently detected task acceleration flow so as to release the accelerator resources it occupies.
Further optionally, the acquiring unit, before assigning each task to be accelerated in the data processing task to a corresponding task acceleration flow, is further configured to:
A plurality of task acceleration flows are created in advance, and corresponding accelerator resources are configured for each task acceleration flow;
And setting a corresponding priority for each task acceleration flow, where the higher the priority of a task acceleration flow, the earlier in the dependency order the tasks to be accelerated assigned to it are executed.
Further optionally, the adjusting unit is configured to adjust each task acceleration flow in real time through the target event, and is configured to:
dynamically detecting each task acceleration flow through the target event;
and adjusting the task allocation situation in each task acceleration stream in real time based on the dynamic detection result so as to balance the utilization rate of each task acceleration stream.
Further optionally, the adjusting unit is configured to dynamically detect each task acceleration flow through the target event, and is configured to:
dynamically monitoring real-time task execution conditions in each task acceleration stream through a state checking function;
the real-time adjustment of the task allocation situation in each task acceleration stream based on the dynamic detection result comprises the following steps:
Based on the real-time task execution condition and the dependency relationship among the tasks to be accelerated, predicting future task execution conditions in each task acceleration flow by adopting a heuristic algorithm;
and dynamically scheduling each task acceleration flow according to the predicted future task execution condition so as to switch the unexecuted task to be accelerated from the task acceleration flow with higher utilization rate to the task acceleration flow with lower utilization rate.
Further optionally, the allocation unit allocates each task to be accelerated in the data processing task to a corresponding task acceleration flow, and is configured to:
acquiring each task to be accelerated from the data processing task;
According to the task characteristics of each task to be accelerated and the dependency relationship among the tasks to be accelerated, distributing each task to be accelerated into the corresponding task acceleration flow;
The task characteristics at least comprise task attributes and task demands, wherein the task attributes comprise task types, calculation types and data transmission types, and the task demands comprise task calculation amount and data transmission bandwidth.
Further optionally, the allocation unit allocates each task to be accelerated to a corresponding task acceleration flow according to the task characteristics of each task to be accelerated and the dependency relationship between each task to be accelerated, and is configured to:
determining matched task acceleration flows according to the calculation density, the data transmission quantity and the task types of each task to be accelerated;
optimizing the matching relation between the tasks to be accelerated and the task acceleration flow based on the dependency relation among the tasks to be accelerated so as to preferentially distribute the tasks to be accelerated with the dependency relation into the same task acceleration flow;
And distributing each task to be accelerated to the task acceleration flow matched with each task according to the optimized matching relation.
Further optionally, the allocation unit allocates each task to be accelerated to the respective matched task acceleration flow according to the optimized matching relationship, and is configured to:
if the tasks to be accelerated with the dependency relationship are distributed to different task acceleration flows, adding task marks corresponding to the tasks to be accelerated with the dependency relationship in different task acceleration flows in the target event;
And synchronizing the task mark into a corresponding task acceleration flow so that the marked task to be accelerated starts to be executed after the dependent front-end task is executed.
Further optionally, the adjusting unit, in adjusting each task acceleration flow in real time through the target event, is configured to:
and if the tasks to be accelerated with the dependency relationship are distributed to different task acceleration flows, deleting target events for managing all the tasks to be accelerated after detecting that all the tasks to be accelerated with the dependency relationship are executed, so as to release accelerator resources occupied by the target events.
Further alternatively, the data processing task is a matrix multiplication task, and the allocation unit allocates each task to be accelerated to each matched task acceleration flow according to the optimized matching relationship, and is configured to:
after the matrix multiplication task is split into a plurality of mutually independent tasks to be accelerated, the mutually independent tasks to be accelerated are distributed into a plurality of task acceleration flows so as to accelerate the matrix multiplication task in parallel.
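Illustratively, the following is a minimal sketch of splitting a matrix multiplication C = A × B into independent row-block sub-tasks and issuing them on several streams so they can proceed in parallel. The naive kernel, the block sizes, and the number of streams are illustrative assumptions rather than a prescribed partitioning.

```cpp
// Sketch: matrix multiplication split into independent row blocks on multiple streams.
#include <cuda_runtime.h>
#include <vector>
#include <algorithm>

__global__ void matmul_rows(const float* A, const float* B, float* C,
                            int rowBegin, int rowEnd, int K, int N) {
    int row = rowBegin + blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rowEnd && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

void matmul_multistream(const float* dA, const float* dB, float* dC,
                        int M, int K, int N, int numStreams) {
    std::vector<cudaStream_t> streams(numStreams);
    for (auto& s : streams) cudaStreamCreate(&s);

    int rowsPerChunk = (M + numStreams - 1) / numStreams;
    dim3 block(16, 16);
    for (int i = 0; i < numStreams; ++i) {
        int rowBegin = i * rowsPerChunk;
        int rowEnd   = std::min(M, rowBegin + rowsPerChunk);
        if (rowBegin >= rowEnd) break;
        dim3 grid((N + block.x - 1) / block.x,
                  (rowEnd - rowBegin + block.y - 1) / block.y);
        // Each row block is an independent task to be accelerated, launched on its own stream.
        matmul_rows<<<grid, block, 0, streams[i]>>>(dA, dB, dC, rowBegin, rowEnd, K, N);
    }
    for (auto& s : streams) { cudaStreamSynchronize(s); cudaStreamDestroy(s); }
}
```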
The apparatus may implement the various steps of the method embodiments described above, which are not expanded upon herein.
In the embodiment of the application, by adopting the accelerator-oriented multitasking apparatus, task execution efficiency can be improved, the utilization rate of accelerator resources can be raised, waste of accelerator resources can be avoided, and user experience can be improved.
Referring to fig. 4, fig. 4 is a schematic diagram of an embodiment of an electronic device according to an embodiment of the application. As shown in fig. 4, an embodiment of the present application provides an electronic device 500, including a memory 510, a processor 520, and a computer program 511 stored in the memory 510 and capable of running on the processor 520. When the processor 520 executes the computer program 511, the processor 520: obtains the data processing task to be executed; allocates each task to be accelerated in the data processing task to a corresponding task acceleration flow, where each task acceleration flow is used for independently managing its corresponding task to be accelerated; creates a target event corresponding to the data processing task and adds the dependency relationships between the tasks to be accelerated in the target event, so as to synchronously manage, across the task acceleration flows, the tasks to be accelerated that have dependency relationships; adjusts each task acceleration flow in real time through the target event so as to optimize the utilization rate of each task acceleration flow; and outputs the task calculation results obtained by each task acceleration flow executing its corresponding task to be accelerated, realizing parallel acceleration among the tasks to be accelerated.
Referring to fig. 5, fig. 5 is a schematic diagram of an embodiment of a computer readable storage medium according to an embodiment of the application. As shown in fig. 5, this embodiment provides a computer readable storage medium 600 on which a computer program 611 is stored. When the computer program 611 is executed by a processor, the following steps are implemented: obtaining the data processing task to be executed; allocating each task to be accelerated in the data processing task to a corresponding task acceleration flow, where each task acceleration flow is used for independently managing its corresponding task to be accelerated; creating a target event corresponding to the data processing task and adding the dependency relationships between the tasks to be accelerated in the target event, so as to synchronously manage, across the task acceleration flows, the tasks to be accelerated that have dependency relationships; adjusting each task acceleration flow in real time through the target event so as to optimize the utilization rate of each task acceleration flow; and outputting the task calculation results obtained by each task acceleration flow executing its corresponding task to be accelerated, realizing parallel acceleration among the tasks to be accelerated.
In the foregoing embodiments, the descriptions of the embodiments are focused on, and for those portions of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.