CN113076182A - Computing task recovery method and device, user equipment and storage medium - Google Patents
- Publication number
- CN113076182A (CN202110316625.9A)
- Authority
- CN
- China
- Prior art keywords
- information
- queue
- task
- computing
- memory
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/54—Interprogram communication
- G06F9/544—Buffers; Shared memory; Pipes
- G06F9/546—Message passing systems or structures, e.g. queues
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention provides a method and a device for recovering a computing task, user equipment, and a storage medium. The method comprises the following steps: when a task of the system is to be interrupted, suspending the ongoing task and saving the running context information of the task to a dump file; when the task is to be restored, parsing the running context information of the task from the dump file and restoring the task according to that information; and continuing the computation of the restored task. The invention enables computing tasks to be migrated while preserving the progress they have already made.
Description
Technical Field
The present invention relates to the technical field of supercomputing, and in particular to a method and a device for recovering a computing task, user equipment, and a storage medium.
Background
Task migration is divided into offline migration and online (live) migration. In offline migration, the service node must be stopped or suspended during the migration and no longer provides service externally. In online migration, the service node remains available externally throughout the migration.
In the process of implementing the invention, the inventors found that the prior art has at least the following technical problem:
Current migration schemes lack a technique for migrating computing tasks. When a task is computationally intensive, the computing engine often needs to run for a long time. If a planned power-down, computer hardware maintenance or replacement, computing task migration, or a similar operation occurs during that time, the progress of the existing computing task cannot be saved; the earlier progress is discarded, and the computation must start over after the task is interrupted.
Disclosure of Invention
The method, device, user equipment, and storage medium for recovering a computing task provided by the present invention can restore the progress of a computing task when the task is interrupted.
In a first aspect, the present invention provides a method for recovering a computing task, including:
when a task of the system is to be interrupted, suspending the ongoing task of the system, and saving the running context information of the task to a dump file;
when the task is to be restored, parsing the running context information of the task from the dump file, and restoring the task according to the running context information;
and continuing the computation of the restored task.
Optionally, the step of saving the running context information of the task to a dump file includes:
suspending the user-mode computing processes running on user equipment in the system;
suspending the tasks on the computing engine in the system, and storing the running context information of the suspended tasks in a queue memory area of the memory of the system;
saving the user-mode process state information in the system to the dump file;
saving the driver-mode process state information in the system to the dump file;
and saving, process by process, the information of the queues under each suspended driver-mode process to the dump file.
Optionally, the driver-mode process state information includes the following items, stored in correspondence with one another: the process ID, the control information of the process, the page table information of the process, the information of the memory blocks allocated by the process, and the process event information; the information of the memory blocks allocated by the process includes the user data of the process; the control information of the process includes the process address space ID;
the step of saving the driver-mode process state information in the system to the dump file is specifically:
storing the process ID, the control information of the process, the page table information of the process, the information of the memory blocks allocated by the process, and the process event information in correspondence with one another.
Optionally, the information of a queue includes: the queue ID, the ring buffer allocated to the queue, the queue read/write pointers, and the queue memory area information; the queue memory area information includes: the computing context switch buffer, the control stack buffer, and the memory queue descriptor; the queue read/write pointers include: the value of the doorbell register;
the step of saving, process by process, the information of the queues under each suspended driver-mode process to the dump file is specifically:
storing the queue ID, the ring buffer allocated to the queue, the queue read/write pointers, and the queue memory area information in correspondence with one another.
Optionally, the step of restoring the task according to the running context information specifically includes:
parsing the user-mode process state information from the dump file, and restoring the user-mode process according to the user-mode process state information;
parsing the driver-mode process state information from the dump file, and restoring the driver-mode process according to the driver-mode process state information;
parsing, from the dump file, the queue information belonging to the restored process, and restoring the queues according to the queue information;
extracting the running context information of the suspended task from the queue memory area of the memory; restoring the tasks of the queue according to the running context information of the task, and loading the tasks onto the computing engine;
and sending a run signal so that the computing engine and the user-mode process enter the running state at the same time.
Optionally, the step of parsing the driver-mode process state information from the dump file and restoring the driver-mode process according to it includes:
allocating memory blocks for the driver-mode process according to the stored correspondence among the process ID, the control information of the process, the page table information of the process, the memory blocks allocated by the process, and the process event information; copying the user data of the process into the memory blocks; restoring the process page table; and configuring the process address space ID and the process page table into registers of the computing engine.
Optionally, the step of parsing, from the dump file, the queue information belonging to the restored process and restoring the queues according to the queue information includes:
restoring the queue data according to the stored correspondence among the queue ID, the ring buffer allocated to the queue, the queue read/write pointers, and the queue memory area information; configuring the memory queue descriptor into the corresponding hardware queue registers; and configuring the ring buffer, the context switch buffer, the control stack buffer, the queue read/write pointers, and the doorbell register value into the corresponding hardware queue registers.
In a second aspect, the present invention provides a device for recovering a computing task, including:
a saving unit, configured to suspend the ongoing task of the system and save the running context information of the task to a dump file when a task of the system is to be interrupted;
a parsing unit, configured to parse the running context information of the task from the dump file when the task is to be restored, and to restore the task according to the running context information;
and a computing unit, configured to continue the computation of the restored task.
Optionally, the saving unit includes:
a first suspending subunit, configured to suspend the user-mode computing processes running on user equipment in the system;
a second suspending subunit, configured to suspend the tasks on the computing engine in the system and store the running context information of the suspended tasks in a queue memory area of the memory of the system;
a first saving subunit, configured to save the user-mode process state information in the system to the dump file;
a second saving subunit, configured to save the driver-mode process state information in the system to the dump file;
and a third saving subunit, configured to save, process by process, the information of the queues under each suspended driver-mode process to the dump file.
Optionally, the driver-mode process state information includes the following items, stored in correspondence with one another: the process ID, the control information of the process, the page table information of the process, the information of the memory blocks allocated by the process, and the process event information; the information of the memory blocks allocated by the process includes the user data of the process; the control information of the process includes the process address space ID;
correspondingly, the second saving subunit is specifically configured to:
store the process ID, the control information of the process, the page table information of the process, the information of the memory blocks allocated by the process, and the process event information in correspondence with one another.
Optionally, the information of a queue includes: the queue ID, the ring buffer allocated to the queue, the queue read/write pointers, and the queue memory area information; the queue memory area information includes: the computing context switch buffer, the control stack buffer, and the memory queue descriptor; the queue read/write pointers include: the value of the doorbell register;
correspondingly, the third saving subunit is specifically configured to:
store the queue ID, the ring buffer allocated to the queue, the queue read/write pointers, and the queue memory area information in correspondence with one another.
Optionally, the parsing unit specifically includes:
a first restoring subunit, configured to parse the user-mode process state information from the dump file and restore the user-mode process according to the user-mode process state information;
a second restoring subunit, configured to parse the driver-mode process state information from the dump file and restore the driver-mode process according to the driver-mode process state information;
a third restoring subunit, configured to parse, from the dump file, the queue information belonging to the restored process and restore the queues according to the queue information;
a fourth restoring subunit, configured to extract the running context information of the suspended task from the queue memory area of the memory, restore the tasks of the queue according to the running context information of the task, and load the tasks onto the computing engine;
and a sending subunit, configured to send a run signal so that the computing engine and the user-mode process enter the running state at the same time.
Optionally, the second restoring subunit is specifically configured to: allocate memory blocks for the driver-mode process according to the stored correspondence among the process ID, the control information of the process, the page table information of the process, the memory blocks allocated by the process, and the process event information; copy the user data of the process into the memory blocks; restore the process page table; and configure the process address space ID and the process page table into registers of the computing engine.
Optionally, the third restoring subunit is specifically configured to: restore the queue data according to the stored correspondence among the queue ID, the ring buffer allocated to the queue, the queue read/write pointers, and the queue memory area information; configure the memory queue descriptor into the corresponding hardware queue registers; and configure the ring buffer, the context switch buffer, the control stack buffer, the queue read/write pointers, and the doorbell register value into the corresponding hardware queue registers.
In a third aspect, the present invention provides a user equipment, which includes the above-mentioned recovery device for computing tasks.
In a fourth aspect, the present invention provides a device for recovering a computing task, including:
a memory;
and a processor coupled to the memory, the processor configured to perform the above-described recovery method of the computing task based on instructions stored in the memory.
In a fifth aspect, the present invention provides a computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions, and the computer instructions, when executed by a processor, implement the recovery method for the computing task.
According to the method, device, user equipment, and storage medium for recovering a computing task, when a task of the system is to be interrupted, the ongoing task of the system is suspended and the running context information of the task is saved to a dump file; when the task is to be restored, the running context information of the task is parsed from the dump file and the task is restored according to it. In this way, the progress of a computing task that was already executing can be restored after the task is interrupted.
Drawings
FIG. 1 is a flowchart illustrating a method for recovering a computing task according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating steps of saving context information of the task to a dump file in an application scenario of the present invention;
FIG. 3 is a flowchart illustrating the task recovery procedure according to the context information in the application scenario of the present invention;
FIG. 4 is a schematic diagram of a hierarchical structure of a computing process, driver software, and computing hardware in an application scenario of the present invention;
FIG. 5 is a block diagram of a task migration module of a computing device in an application scenario of the present invention;
FIG. 6 is a timing diagram illustrating task context saving in an application scenario of the present invention;
FIG. 7 is a timing diagram illustrating task context restoration in an application scenario of the present invention;
FIG. 8 is a diagram illustrating the connection of a device for recovering a computing task according to an embodiment of the present invention;
FIG. 9 is a connection diagram of a device for recovering a computing task according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For the sake of understanding, several terms referred to in the present invention will be explained.
Running context information of a task: the register data, stack data, intermediate computation data, the kernel program being executed, and other information of a task running on the computing engine.
User data of the process: the data supplied by the user for the computation.
An embodiment of the present invention provides a method for recovering a computing task, as shown in fig. 1, where the method includes:
Step 1, when a task of the system is to be interrupted, suspending the ongoing task of the system, and saving the running context information of the task to a dump file;
Step 2, when the task is to be restored, parsing the running context information of the task from the dump file, and restoring the task according to the running context information;
Step 3, continuing the computation of the restored task.
Through the above operations, a suspended computing task can be restored and then continue to run.
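As a minimal sketch of steps 1 to 3, the following Python example suspends a toy computation partway through, saves its context to a dump file, parses the file, and continues from the saved progress. The JSON dump format, the function names, and the sum-of-squares workload are illustrative assumptions; the invention itself saves the context of tasks on a computing engine, not the state of a Python loop.

```python
import json
import os
import tempfile


def run(total_steps, state=None, interrupt_at=None):
    """Accumulate a sum of squares; return (context, finished)."""
    # The "running context" here is just the loop counter and the partial sum.
    state = dict(state) if state else {"next_step": 0, "acc": 0}
    while state["next_step"] < total_steps:
        if interrupt_at is not None and state["next_step"] == interrupt_at:
            return state, False  # step 1: the ongoing task is suspended
        state["acc"] += state["next_step"] ** 2
        state["next_step"] += 1
    return state, True


def save_context(state, path):
    """Step 1 (second half): write the running context to a dump file."""
    with open(path, "w") as f:
        json.dump(state, f)


def load_context(path):
    """Step 2: parse the running context back out of the dump file."""
    with open(path) as f:
        return json.load(f)


dump_path = os.path.join(tempfile.mkdtemp(), "task.dump")

state, done = run(1000, interrupt_at=400)        # interrupted after 400 steps
save_context(state, dump_path)

restored = load_context(dump_path)               # restore from the dump file
resumed, done_after_resume = run(1000, state=restored)  # step 3: continue

reference, _ = run(1000)                         # uninterrupted run, for comparison
```

The resumed run performs only the remaining 600 steps and reaches the same result as the uninterrupted reference run, which is the property the method aims for.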
The method for recovering a computing task in the embodiment of the invention can be used in scenarios such as computing task migration and hardware maintenance, specifically:
1. In a cloud computing scenario, a virtual machine saves its running state and is migrated and deployed to another node. By applying the technique of the invention, the computing tasks on the computing engine (a graphics processor, a deep computing processor, etc.) can be migrated along with it and resumed on the new node from the state before migration;
2. In a distributed load balancing scenario, when a node becomes overloaded during distributed computation and a computing task needs to be migrated, the technique of the invention can be applied to migrate the computing task to another idle node, where it resumes from the state before migration;
3. In a computing node maintenance (or planned power-off) scenario, the technique of the invention is applied to save the context of the computing task and suspend it; the node is then powered off for maintenance operations such as card replacement or memory expansion, and after power-on the suspended computing task is restored and continues to run.
In the 1st and 2nd scenarios, after step 1 the following is performed: the running context information of the task is copied to the target node of the migration, and step 2 is then executed on that node. Here, "when a task of the system is to be interrupted" in step 1 specifically means: it is detected that a condition requiring migration has occurred; and "when the task is to be restored" in step 2 specifically means: it is detected that copying the running context information of the task to the target node has completed.
In the 3rd scenario, "when a task of the system is to be interrupted" in step 1 specifically means: a user instruction to suspend the computing task, or the like, is detected; and "when the task is to be restored" in step 2 specifically means: a user instruction to resume the computing task is detected.
Optionally, as shown in FIG. 2, the step of saving the running context information of the task to a dump file includes:
Step 11, suspending the user-mode computing processes running on user equipment in the system;
Step 12, suspending the tasks on the computing engine in the system, and storing the running context information of the suspended tasks in a queue memory area of the memory of the system;
Step 13, saving the user-mode process state information in the system to the dump file;
Step 14, saving the driver-mode process state information in the system to the dump file. The driver-mode process state information includes the following items, stored in correspondence with one another: the process ID, the control information of the process, the page table information of the process, the information of the memory blocks allocated by the process, and the process event information; the information of the memory blocks allocated by the process includes the user data of the process; the control information of the process includes the process address space ID. Specifically: the process ID, the control information of the process, the page table information of the process, the information of the memory blocks allocated by the process, and the process event information are stored in correspondence with one another.
The driver-mode process state information is shown in Table 1:
TABLE 1
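The fields named in step 14 can be pictured as one record per process. The sketch below is a hypothetical layout written for illustration and does not reproduce the format of Table 1: the class names, the hex encoding of user data, and the string-keyed page table are all assumptions chosen so the record survives a JSON round trip.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class MemoryBlock:
    address: int   # hypothetical base address of the allocated block
    data_hex: str  # user data of the process, hex-encoded so it fits in JSON


@dataclass
class DriverProcessState:
    process_id: int
    address_space_id: int  # part of the control information of the process
    page_table: dict       # assumed form: virtual page (str) -> physical page (int)
    memory_blocks: list = field(default_factory=list)
    events: list = field(default_factory=list)


def to_dump(state):
    """Serialize one process record; asdict() also converts nested blocks."""
    return json.dumps(asdict(state))


def from_dump(text):
    """Rebuild the record, turning the block dictionaries back into objects."""
    raw = json.loads(text)
    raw["memory_blocks"] = [MemoryBlock(**b) for b in raw["memory_blocks"]]
    return DriverProcessState(**raw)


saved = DriverProcessState(
    process_id=42,
    address_space_id=7,
    page_table={"0x1000": 0x9000},
    memory_blocks=[MemoryBlock(address=0x1000, data_hex=b"input".hex())],
    events=["kernel_done"],
)
restored = from_dump(to_dump(saved))
```

Keeping all five items in one record is one simple way to preserve the correspondence among them that step 14 requires.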
Step 15, saving, process by process, the information of the queues under each suspended driver-mode process to the dump file. The information of a queue includes: the queue ID, the ring buffer allocated to the queue, the queue read/write pointers, and the queue memory area information; the queue memory area information includes: the computing context switch buffer, the control stack buffer, and the memory queue descriptor; the queue read/write pointers include: the value of the doorbell register. Specifically: the queue ID, the ring buffer allocated to the queue, the queue read/write pointers, and the queue memory area information are stored in correspondence with one another.
The information of the queue is shown in Table 2:
TABLE 2
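Step 15 saves queue information process by process. The following sketch shows one way such a grouping could look; it is not the format of Table 2. The field names, the `owner_process_id` link, and the choice to store the doorbell value as a separate field next to the read/write pointers are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import asdict, dataclass


@dataclass
class QueueMemoryArea:
    context_switch_buffer: str     # placeholder for the saved task context
    control_stack_buffer: str      # placeholder for engine control information
    memory_queue_descriptor: dict  # placeholder for saved queue register values


@dataclass
class QueueInfo:
    queue_id: int
    owner_process_id: int  # hypothetical link from a queue to its process
    ring_buffer: list
    read_pointer: int
    write_pointer: int
    doorbell: int
    memory_area: QueueMemoryArea


def group_queues_by_process(queues):
    """Build the queue section of a dump: process ID -> list of queue records."""
    grouped = defaultdict(list)
    for q in queues:
        grouped[q.owner_process_id].append(asdict(q))
    return dict(grouped)


def make_queue(queue_id, pid, write_pointer):
    """Helper that fills in dummy buffers for the example."""
    area = QueueMemoryArea("00", "00", {"base": 0, "size": 4})
    return QueueInfo(queue_id, pid, [None] * 4, 0, write_pointer, write_pointer, area)


dump_section = group_queues_by_process(
    [make_queue(1, 42, 3), make_queue(2, 42, 1), make_queue(3, 43, 0)]
)
```

Grouping by owning process means that restoring a process can be followed directly by restoring exactly the queues that belong to it.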
Optionally, as shown in FIG. 3, the step of restoring the task according to the running context information specifically includes:
Step 21, parsing the user-mode process state information from the dump file, and restoring the user-mode process according to the user-mode process state information;
Step 22, parsing the driver-mode process state information from the dump file, and restoring the driver-mode process according to the driver-mode process state information; specifically: allocating memory blocks for the driver-mode process according to the stored correspondence among the process ID, the control information of the process, the process page table, the memory blocks allocated by the process, and the process event information; copying the user data of the process into the memory blocks; restoring the process page table; and configuring the process address space ID and the process page table into registers of the computing engine.
The driver-mode process state information includes the following items, stored in correspondence with one another: the process ID, the control information of the process, the page table information of the process, the information of the memory blocks allocated by the process, and the process event information; the information of the memory blocks allocated by the process includes the user data of the process; the control information of the process includes the process address space ID.
Step 23, parsing, from the dump file, the queue information belonging to the restored process, and restoring the queues according to the queue information; specifically: restoring the queue data according to the stored correspondence among the queue ID, the ring buffer allocated to the queue, the queue read/write pointers, and the queue memory area information; configuring the memory queue descriptor into the corresponding hardware queue registers; and configuring the ring buffer, the context switch buffer, the control stack buffer, the queue read/write pointers, and the doorbell register value into the corresponding hardware queue registers. The driver software refers to the middle layer of the three-layer structure in the system architecture diagram shown in FIG. 4.
The information of a queue includes: the queue ID, the ring buffer allocated to the queue, the queue read/write pointers, and the queue memory area information; the queue memory area information includes: the computing context switch buffer, the control stack buffer, and the memory queue descriptor; the queue read/write pointers include: the doorbell register value.
Step 24, extracting the running context information of the suspended task from the queue memory area; restoring the tasks of the queue according to the running context information of the task, and loading the tasks onto the computing engine;
Step 25, sending a run signal so that the computing engine and the user-mode process enter the running state at the same time.
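The ordering of steps 21 to 25 can be sketched as follows. This is an illustrative model only: `FakeEngine`, the register names such as `ASID` and `Q1_DOORBELL`, and the dictionary layout of the dump are hypothetical, since the text does not specify register names or a file format.

```python
class FakeEngine:
    """Stand-in for the computing hardware; all register names are hypothetical."""

    def __init__(self):
        self.registers = {}
        self.loaded_contexts = []
        self.running = False


def restore_task(dump, engine):
    # Step 21: restore the user-mode process; it stays suspended for now.
    user_process = {"pid": dump["user_process"]["pid"], "running": False}

    # Step 22: restore the driver-mode process: refill its memory blocks and
    # configure the address space ID and page table into engine registers.
    drv = dump["driver_process"]
    memory = {b["address"]: bytes.fromhex(b["data_hex"]) for b in drv["memory_blocks"]}
    engine.registers["ASID"] = drv["address_space_id"]
    engine.registers["PAGE_TABLE_BASE"] = drv["page_table_base"]

    # Step 23: restore each queue by writing its saved values into the
    # (hypothetical) per-queue hardware registers.
    fields = ("descriptor", "ring_buffer_base", "read_pointer", "write_pointer", "doorbell")
    for q in dump["queues"]:
        for name in fields:
            engine.registers["Q%d_%s" % (q["queue_id"], name.upper())] = q[name]

    # Step 24: load the saved task contexts onto the engine.
    for q in dump["queues"]:
        engine.loaded_contexts.append(q["task_context"])

    # Step 25: only now send the run signal to both sides together.
    engine.running = True
    user_process["running"] = True
    return user_process, memory


dump = {
    "user_process": {"pid": 42},
    "driver_process": {
        "address_space_id": 7,
        "page_table_base": 0x9000,
        "memory_blocks": [{"address": 0x1000, "data_hex": "0a0b"}],
    },
    "queues": [{
        "queue_id": 1, "descriptor": 0x2000, "ring_buffer_base": 0x3000,
        "read_pointer": 2, "write_pointer": 5, "doorbell": 5,
        "task_context": {"registers": [1, 2, 3], "stack": []},
    }],
}
engine = FakeEngine()
process, memory = restore_task(dump, engine)
```

Deferring the run signal to the last step mirrors the requirement that the computing engine and the user-mode process enter the running state at the same time, after every resource has been put back.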
An application scenario of the present invention is described below. The invention concerns offline migration of supercomputing tasks and supports multi-process, multi-task computing. The system is divided into three layers, from top to bottom: user processes, driver software, and computing hardware. FIG. 4 shows this layering of the computing processes, driver software, and computing hardware. A user process pushes a computing task to the computing engine through a computing queue for execution; after the computing engine finishes the task, it writes the result into the memory area designated by the application process and notifies the upper-layer computing process, which completes the task.
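To make the queue mechanism concrete, the sketch below models a computing queue as a ring buffer with read and write pointers and a doorbell value. It is a simplified software analogy under stated assumptions: the class name, the rule that the doorbell mirrors the write pointer, and the callback used for notification are all hypothetical and are not taken from the invention.

```python
class ComputeQueue:
    """Toy ring-buffer queue between a user process and a computing engine."""

    def __init__(self, size):
        self.ring = [None] * size
        self.read_ptr = 0   # advanced by the engine
        self.write_ptr = 0  # advanced by the user process
        self.doorbell = 0   # assumed to carry the latest write position

    def push(self, task):
        """User process side: enqueue a task and ring the doorbell."""
        if self.write_ptr - self.read_ptr == len(self.ring):
            raise BufferError("ring buffer is full")
        self.ring[self.write_ptr % len(self.ring)] = task
        self.write_ptr += 1
        self.doorbell = self.write_ptr

    def engine_drain(self, result_area, notify):
        """Engine side: run pending tasks, write results, notify the process."""
        while self.read_ptr < self.doorbell:
            name, func, args = self.ring[self.read_ptr % len(self.ring)]
            result_area[name] = func(*args)  # result goes to the designated area
            self.read_ptr += 1
            notify(name)                     # tell the upper-layer process


results, finished = {}, []
queue = ComputeQueue(size=4)
queue.push(("dot", lambda a, b: sum(x * y for x, y in zip(a, b)), ([1, 2, 3], [4, 5, 6])))
queue.push(("total", sum, ([10, 20, 30],)))
queue.engine_drain(results, finished.append)
```

In this model, the ring contents, the two pointers, and the doorbell value together are what would have to be captured to resume the queue later, which matches the queue items the method saves.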
In the method for recovering a computing task provided in the embodiment of the present invention, task migration is divided into two steps: saving and restoring the context of the computing task. To save, the ongoing computing task is first suspended and its running context is saved to a dump file. When the computing task needs to be restored, the dump file is parsed, the computing task is restored from the file to its state before saving, and the computing device continues computing from the previously saved progress.
The structure of the four modules that implement computing task migration is shown in FIG. 5. Four modules are added across the application layer, the driver layer, and the microcode layer: the UMS (user migration slave), the MM (migration master), the DMS (driver migration slave), and the CMS (microcode migration slave):
1) UMS module: based on an existing checkpoint/restore technology such as CRIU (Checkpoint/Restore In Userspace) or BLCR (Berkeley Lab Checkpoint/Restart), it receives commands from the MM module and saves user-mode processes to a dump file;
2) MM module: the task migration control module; it sends messages to each slave module (the UMS, DMS, and CMS modules in FIG. 5) and controls the order in which the slave modules save and restore the computing task context;
3) DMS module: receives commands from the MM module and performs the saving and restoring of driver-layer resources;
4) CMS module: receives commands from the MM module and performs the saving and restoring of the hardware context resources of the computing device.
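The master/slave relationship among the four modules can be sketched as a small command dispatcher. The class names, the `"done"` reply string, and the synchronous call style are assumptions made for illustration; the text does not specify how the MM module and the slave modules actually exchange messages.

```python
class MigrationSlave:
    """Stand-in for a UMS, DMS, or CMS module; it only records commands."""

    def __init__(self, name, trace):
        self.name = name
        self.trace = trace

    def handle(self, command):
        self.trace.append((self.name, command))
        return "done"  # hypothetical completion message


class MigrationMaster:
    """Stand-in for the MM module: sends commands and waits for each reply."""

    def __init__(self, slaves):
        self.slaves = {s.name: s for s in slaves}

    def send(self, slave_name, command):
        reply = self.slaves[slave_name].handle(command)
        if reply != "done":
            raise RuntimeError(f"{slave_name} failed on {command!r}")

    def run(self, plan):
        # The next command is sent only after the previous one completed,
        # which is how the master controls the ordering across slaves.
        for slave_name, command in plan:
            self.send(slave_name, command)


trace = []
mm = MigrationMaster([MigrationSlave(n, trace) for n in ("UMS", "DMS", "CMS")])
mm.run([("UMS", "freeze"), ("CMS", "pause"), ("UMS", "save")])
```

Centralizing the sequence in the master keeps each slave simple: a slave only needs to carry out one command and report completion.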
First, as shown in FIG. 6, the steps for saving the computing task context are as follows:
Step 1: The MM module sends a pause (freeze) command to the UMS module. The UMS module suspends all user-mode computing processes running on the card and, once finished, replies to the MM module with a completion message. After the user-mode processes are frozen, no new computing tasks are pushed to the computing engine; step 2 then freezes the computing engine and saves the context of the computing tasks.
Step 2: The MM module sends a pause command to the CMS module. The CMS module configures a control register of the computing engine to issue a HALT command, which pauses all computing tasks on the engine. It then writes a SAVE command into the control register of the computing engine queue, triggering the engine's task save operation: the computing engine stores the running context of the computing tasks in the computing context switch buffer provided by the driver software, stores the engine's control information in the control stack buffer provided by the driver software, and stores the queue hardware register contents in the memory queue descriptor area provided by the driver software. After waiting for the computing engine to finish, the CMS module replies to the MM module with a completion message. Once the computing engine has been frozen and saved, the entire software and hardware system is in a frozen state, and the steps that save the user-mode processes and the driver software resources follow in order.
Step 3: The MM module sends a save command to the UMS module, and the UMS module saves the state of all user-mode computing processes to a dump file; this step can be implemented with existing checkpoint/restore technology. When finished, the UMS module replies to the MM module with a completion message.
Step 4: The MM module sends a save command to the DMS module. Process by process, the DMS module saves the process information, the process page table, and the key computing data in the memory blocks allocated by the process (both host memory and graphics processor memory) to the dump file in the format of Table 1. Queues belong to processes, so after the process information has been saved, all queue information under the process is saved in step 5.
Step 5: Process by process, the DMS module saves the information of all queues under the process. The queue information, which includes the queue descriptor, the ring buffer, the queue read/write pointers, the queue memory area information, and so on, is saved to the dump file in the format of Table 2. When finished, the DMS module replies to the MM module with a completion message.
Step 6: With the above steps, saving of the computing task context is complete, and the computing engine can be shut down at any time.
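The save sequence above can be traced with a small state model. Everything in this sketch is a simplification under assumptions: the system is a plain dictionary, the HALT and SAVE commands are reduced to setting a flag and copying a context string, and the key names are hypothetical.

```python
def save_computing_context(system):
    """Walk through save steps 1 to 6 on a dictionary model of the system."""
    dump = {}

    # Step 1: freeze all user-mode computing processes.
    for proc in system["user_processes"]:
        proc["frozen"] = True

    # Step 2: halt the engine, then have it write each task's running
    # context into the context switch buffer of the owning queue.
    engine = system["engine"]
    engine["halted"] = True
    for queue in system["queues"]:
        queue["context_switch_buffer"] = engine["task_contexts"][queue["queue_id"]]

    # Step 3: save the user-mode process state.
    dump["user_processes"] = [dict(p) for p in system["user_processes"]]

    # Steps 4 and 5: save each driver-mode process, then its queues.
    dump["driver_processes"] = []
    for drv in system["driver_processes"]:
        entry = dict(drv)
        entry["queues"] = [
            dict(q) for q in system["queues"] if q["owner"] == drv["process_id"]
        ]
        dump["driver_processes"].append(entry)

    # Step 6: the context is fully captured; the engine may now be shut down.
    return dump


system = {
    "user_processes": [{"pid": 42, "frozen": False}],
    "engine": {"halted": False, "task_contexts": {1: "ctx-q1", 2: "ctx-q2"}},
    "driver_processes": [{"process_id": 42, "address_space_id": 7}],
    "queues": [
        {"queue_id": 1, "owner": 42, "write_pointer": 5},
        {"queue_id": 2, "owner": 42, "write_pointer": 0},
    ],
}
dump = save_computing_context(system)
```

The order matters: the user-mode processes are frozen first so that no new tasks reach the engine, and the dump is written only after the engine has stored its context into the queue memory areas.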
Second, as shown in fig. 7, the computing task context is restored as follows:
Step 1: The MM module sends a recovery message to the UMS module, and the UMS module restores the user-mode process from the dump file; after restoration the user-mode process is in a suspended state (implemented with existing CRIU or BLCR technology). When the restoration is complete, the UMS module replies to the MM module with a message.
Step 2: The MM module sends a recovery message to the DMS module. The DMS module restores the in-driver process information from the dump file, allocates the memory blocks and fills them with the restored key data, and restores the process page table. Restoring the process information is the reverse of saving it (see step 4 of saving the computing task), and the data format is that of Table 1.
The key data structures of the computing process are restored from Table 1; once this is complete, the process address space ID and the process page table address are configured into the registers of the computing engine. After the process information has been restored, all queue information under the process is restored in step 3.
Step 3: Taking the process as the unit, the DMS module restores all queue information belonging to the process from the dump file, including the queue descriptor, the ring buffer, the doorbell register, the queue read-write pointer, the queue memory area information, and so on. Restoring the queues is the reverse of saving them (see step 5 of saving the computing task), and the data format is shown in Table 2.
After the key queue data structures have been restored, the data in the driver software's memory queue descriptor area is configured into the corresponding hardware queue registers, along with the ring buffer, the context switch buffer, the control stack buffer, the queue read-write pointer, and the doorbell register value. Once the process and queue information has been restored, the software resources needed to restore the computing engine are ready; a completion message is returned to the MM module, and step 4 restores the computing engine.
Step 4: The MM module sends a recovery message to the CMS module. The CMS module configures the computing engine registers from the queue descriptor area, the compute context switch buffer, and the control stack buffer, and loads the computing task onto the computing engine. Once loaded, the computing task is at the same point of progress as when it was saved and can continue the computation from where it was interrupted; at this moment the computing task is still frozen. When the operation is complete, a message is returned to the MM module.
Step 6: The above steps complete the recovery of the whole software and hardware system.
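The message flow of the restore steps can be sketched as the MM module contacting the UMS, DMS, and CMS modules in turn and waiting for each completion reply before moving on. This is a hedged illustration only: the `Module` class, the phase descriptions, and the reply string are hypothetical stand-ins, not the patent's interfaces.

```python
class Module:
    """Stand-in for the UMS, DMS or CMS module; it only records its phases."""

    def __init__(self, name, phases, log):
        self.name = name
        self.phases = phases
        self.log = log

    def handle_restore(self):
        for phase in self.phases:
            self.log.append(f"{self.name}: {phase}")
        return "done"  # completion message back to the MM module


def mm_restore(ums, dms, cms):
    """MM side: user-mode process first, driver state next, engine last."""
    for module in (ums, dms, cms):
        reply = module.handle_restore()
        if reply != "done":
            raise RuntimeError(f"{module.name} did not complete its restore step")
    return "system restored, tasks still frozen"


log = []
ums = Module("UMS", ["restore user-mode process (left suspended)"], log)
dms = Module("DMS", ["restore process info and page table", "restore queues"], log)
cms = Module("CMS", ["load tasks onto computing engine"], log)
```

Calling `mm_restore(ums, dms, cms)` fills `log` in the order UMS, DMS (process, then queues), CMS. The sketch stops where step 4 stops, with the loaded tasks still frozen.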
The scheme for saving and restoring computing tasks can be used in scenarios such as computing task migration and hardware maintenance, for example:
1. In a cloud computing scenario, a virtual machine saves its running state in order to be migrated and deployed to another node. With the technology of the invention, the computing tasks on the computing engine (a graphics processor, a deep computing processor, and so on) can be migrated along with it and resumed on the new node from their pre-migration state;
2. In a distributed load-balancing scenario, a node becomes overloaded during distributed computation and a computing task needs to be migrated. With the technology of the invention, the computing task can be migrated to another idle node and then resumed from its pre-migration state;
3. In a computing-node maintenance (or planned power-off) scenario, the technology of the invention is used to save the context of the computing task and suspend it; the node is then powered off for maintenance such as card replacement or memory expansion, and after power-on the suspended computing task is restored and continues running.
An embodiment of the present invention further provides a device for recovering a computing task, as shown in fig. 8, where the device includes:
a saving unit, configured to detect a task interrupt instruction, suspend the task the system is currently performing, and save the running context information of the task to a dump file;
a parsing unit, configured to obtain a task recovery instruction, parse the running context information of the task from the dump file, and restore the task according to the running context information;
and a computing unit, configured to continue computing the restored task.
Optionally, the saving unit includes:
a first suspending subunit, configured to suspend an operation of a user-mode computing process running on a user device in the system;
a second suspending subunit, configured to suspend the task on the computing engine in the system and store the running context information of the suspended task in a queue memory area of the memory of the system;
a first saving subunit, configured to save the user-mode process state information in the system to the dump file;
a second saving subunit, configured to save the driver-mode process state information in the system to the dump file;
and a third saving subunit, configured to save the information of the queues under the suspended driver-mode process to the dump file, taking the process as the unit.
Optionally, the driver-mode process state information includes the following items, stored in correspondence with one another: process ID, process control information, process page table information, information on the memory blocks allocated by the process, and process event information; the information on the memory blocks allocated by the process includes the user data of the process; the process control information includes the process address space ID;
correspondingly, the second saving subunit is specifically configured to:
store the process ID, the process control information, the process page table information, the information on the memory blocks allocated by the process, and the process event information in correspondence with one another.
Optionally, the information of the queue includes: the queue ID, the ring buffer allocated to the queue, the queue read-write pointer, and the queue memory area information; the queue memory area information includes the compute context switch buffer, the control stack buffer, and the memory queue descriptor; the queue read-write pointer includes the value of the doorbell register;
correspondingly, the third saving subunit is specifically configured to:
store the queue ID, the ring buffer allocated to the queue, the queue read-write pointer, and the queue memory area information in correspondence with one another.
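The process and queue fields listed above can be pictured as a pair of record types. This is a sketch only, assuming Python dataclasses as the container; the class names, field names, and field types are hypothetical, and the real on-disk formats (Table 1 and Table 2) are not reproduced here.

```python
from dataclasses import dataclass, asdict


@dataclass
class ProcessControlInfo:
    address_space_id: int  # the process address space ID


@dataclass
class ProcessRecord:
    process_id: int
    control: ProcessControlInfo
    page_table: bytes
    memory_blocks: list  # each entry carries user data of the process
    events: list


@dataclass
class QueueMemoryArea:
    context_switch_buffer: bytes
    control_stack_buffer: bytes
    memory_queue_descriptor: bytes


@dataclass
class QueuePointers:
    read_ptr: int
    write_ptr: int
    doorbell: int  # value of the doorbell register


@dataclass
class QueueRecord:
    queue_id: int
    ring_buffer: bytes
    pointers: QueuePointers
    memory_area: QueueMemoryArea


# Example records with made-up values.
process = ProcessRecord(
    process_id=7,
    control=ProcessControlInfo(address_space_id=3),
    page_table=b"page-table-bytes",
    memory_blocks=[b"user data"],
    events=[],
)
queue = QueueRecord(
    queue_id=0,
    ring_buffer=bytes(8),
    pointers=QueuePointers(read_ptr=16, write_ptr=48, doorbell=48),
    memory_area=QueueMemoryArea(b"ctx", b"stack", b"mqd"),
)
```

Keeping the items of each record together in one object is one way to express "stored in correspondence with one another"; `asdict(process)` or `asdict(queue)` then yields a nested dictionary that a serializer could write out.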
Optionally, the parsing unit specifically includes:
a first recovery subunit, configured to parse the user-mode process state information from the dump file and restore the user-mode process according to it;
a second recovery subunit, configured to parse the driver-mode process state information from the dump file and restore the driver-mode process according to it;
a third recovery subunit, configured to parse the queue information in the restored process information from the dump file and restore the queues according to the queue information;
a fourth recovery subunit, configured to extract the running context information of the suspended task from the queue memory area of the memory, restore the tasks of the queue according to that running context information, and load the tasks onto the computing engine;
and a sending subunit, configured to send a run signal so that the computing engine and the user-mode process enter the running state simultaneously.
Optionally, the second recovery subunit is specifically configured to: allocate memory blocks for the driver-mode process according to the stored correspondence among the process ID, the process control information, the process page table information, the memory blocks allocated by the process, and the process event information; copy the user data of the process to the memory blocks; restore the process page table; and configure the process address space ID and the process page table into registers of the computing engine.
Optionally, the third recovery subunit is specifically configured to: restore the queue data according to the stored correspondence among the queue ID, the ring buffer allocated to the queue, the queue read-write pointer, and the queue memory area information; configure the memory queue descriptor into the corresponding hardware queue registers; and configure the ring buffer, the context switch buffer, the control stack buffer, the queue read-write pointer, and the doorbell register value into the corresponding hardware queue registers.
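The queue restore performed by the third recovery subunit amounts to replaying saved values into the hardware queue registers. The sketch below assumes a dictionary-backed register file and invented register names; none of these identifiers come from the patent or from a real device.

```python
class HardwareQueueRegisters:
    """Dictionary-backed stand-in for a bank of hardware queue registers."""

    def __init__(self):
        self.values = {}

    def write(self, name, value):
        self.values[name] = value


def restore_queue_registers(regs, queue):
    """Program the registers of one queue from its saved record."""
    # First replay the memory queue descriptor contents.
    for name, value in queue["memory_queue_descriptor"].items():
        regs.write(name, value)
    # Then program the buffers, the read-write pointers and the doorbell value.
    for name in (
        "ring_buffer_base",
        "context_switch_buffer_base",
        "control_stack_buffer_base",
        "read_ptr",
        "write_ptr",
        "doorbell",
    ):
        regs.write(name, queue[name])
    return regs
```

In this sketch the explicitly saved pointers are written after the descriptor, so if the descriptor also carries a pointer value, the separately saved one takes effect. That ordering is a design choice of the example, not something the source specifies.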
The apparatus of this embodiment may be configured to implement the technical solutions of the above method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.
The embodiment of the invention also provides the user equipment which comprises the device for recovering the computing task.
An embodiment of the present invention further provides a device for recovering a computing task, as shown in fig. 9, where the device includes:
a memory;
and a processor coupled to the memory, the processor configured to perform the method for recovery of the computing task based on instructions stored in the memory.
The embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores computer instructions, and the computer instructions, when executed by a processor, implement the method for recovering the computing task.
It will be understood by those skilled in the art that all or part of the processes of the embodiments of the methods described above may be implemented by a computer program, which may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (11)
1. A method for recovery of a computing task, comprising:
when the task of the system is to be interrupted, suspending the ongoing task of the system, and storing the running context information of the task to a dump file;
when the task is to be restored, analyzing the running context information of the task from the dump file, and restoring the task according to the running context information;
and continuing to calculate the recovered task.
2. The method of claim 1, wherein the step of saving the running context information of the task to a dump file comprises:
suspending operation of a user-mode computing process running on user equipment in the system;
suspending a task on a computing engine in the system, and storing running context information of the suspended task in a queue memory area of a memory of the system;
saving user state process state information in the system to the dump file;
saving the driver-mode process state information in the system to the dump file;
and, taking the process as the unit, saving the information of the queues under the suspended driver-mode process to the dump file.
3. The method of claim 2, wherein the driver-mode process state information comprises the following items, stored in correspondence with one another: process ID, process control information, process page table information, information on the memory blocks allocated by the process, and process event information; the information on the memory blocks allocated by the process includes the user data of the process; the process control information includes the process address space ID;
the step of saving the driver-mode process state information in the system to the dump file is specifically:
storing the process ID, the process control information, the process page table information, the information on the memory blocks allocated by the process, and the process event information in correspondence with one another.
4. The method of claim 2, wherein the information of the queue comprises: queue ID, ring buffer distributed by queue, queue read-write pointer and queue memory area information; the queue memory area information includes: computing context switch buffer, control stack buffer and memory queue descriptor; the queue read-write pointer comprises: the value of the doorbell register;
the step of saving the information of the queues under the suspended driver-mode process to the dump file, taking the process as the unit, is specifically:
and storing the queue ID, the ring buffer distributed by the queue, the queue read-write pointer and the queue memory area information according to the corresponding relation.
5. The method according to claim 1, wherein the step of resuming the task according to the running context information specifically comprises:
analyzing the state information of the user mode process from the dump file, and recovering the user mode process according to the state information of the user mode process;
parsing the driver-mode process state information from the dump file, and restoring the driver-mode process according to the driver-mode process state information;
analyzing queue information in the recovered process information from the dump file, and recovering the queue according to the queue information;
extracting the running context information of the suspended task from the queue memory area of the memory; according to the running context information of the tasks, recovering the tasks of the queue, and loading the tasks to a computing engine;
and sending a running signal to enable the computing engine and the user mode process to enter a running state simultaneously.
6. The method of claim 5, wherein the step of parsing the driver-mode process state information from the dump file and restoring the driver-mode process according to the driver-mode process state information comprises:
allocating memory blocks for the driver-mode process according to the stored correspondence among the process ID, the process control information, the process page table information, the memory blocks allocated by the process, and the process event information, copying the user data of the process to the memory blocks, and restoring the process page table; and configuring the process address space ID and the process page table into registers of the computing engine.
7. The method of claim 5, wherein the parsing the dump file for queue information in the recovered process information, and the recovering the queue according to the queue information comprises:
and restoring the queue data according to the corresponding relation among the stored queue ID, the ring buffer area allocated by the queue, the queue read-write pointer and the queue memory area information, configuring the memory queue descriptor to the corresponding hardware queue register, and configuring the ring buffer area, the context switching buffer area, the control stack buffer area, the queue read-write pointer and the doorbell register value to the corresponding hardware queue register.
8. An apparatus for recovery of a computing task, comprising:
a saving unit, configured to suspend the ongoing task of the system and save the running context information of the task to a dump file when the task of the system is to be interrupted;
the analysis unit is used for analyzing the running context information of the task from the dump file when the task is to be recovered, and recovering the task according to the running context information;
and the computing unit is used for continuing computing the recovered task.
9. A user device characterized in that it comprises means for recovery of a computing task as claimed in claim.
10. A device for recovery of a computing task, comprising:
a memory;
and a processor coupled to the memory, the processor configured to perform the method of recovering from a computing task of any of claims 1 to 7 based on instructions stored in the memory.
11. A computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions which, when executed by a processor, implement a recovery method for a computing task as claimed in any one of claims 1 to 7.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110316625.9A CN113076182B (en) | 2021-03-24 | 2021-03-24 | Recovery method and device of computing task, user equipment and storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN113076182A | 2021-07-06 |
| CN113076182B | 2024-03-29 |
Family
ID=76610718
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110316625.9A Active CN113076182B (en) | 2021-03-24 | 2021-03-24 | Recovery method and device of computing task, user equipment and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN113076182B (en) |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115437889A (en) * | 2022-11-08 | 2022-12-06 | 统信软件技术有限公司 | Emergency processing method and system and computing equipment |
| CN115470134A (en) * | 2022-09-20 | 2022-12-13 | 重庆长安汽车股份有限公司 | Automatic driving simulation dynamic operation resource scheduling method and device |
| CN115686776A (en) * | 2022-09-30 | 2023-02-03 | 辉羲智能科技(上海)有限公司 | Method and device for reducing multi-model task queuing time delay |
| WO2023185137A1 (en) * | 2022-03-31 | 2023-10-05 | 苏州浪潮智能科技有限公司 | Task management method and apparatus, and device and storage medium |
| CN117290075A (en) * | 2023-11-23 | 2023-12-26 | 苏州元脑智能科技有限公司 | Process migration method, system, device, communication equipment and storage medium |
Citations (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5201039A (en) * | 1987-09-30 | 1993-04-06 | Mitsubishi Denki Kabushiki Kaisha | Multiple address-space data processor with addressable register and context switching |
| US5390329A (en) * | 1990-06-11 | 1995-02-14 | Cray Research, Inc. | Responding to service requests using minimal system-side context in a multiprocessor environment |
| US5428779A (en) * | 1992-11-09 | 1995-06-27 | Seiko Epson Corporation | System and method for supporting context switching within a multiprocessor system having functional blocks that generate state programs with coded register load instructions |
| US20100161948A1 (en) * | 2006-11-14 | 2010-06-24 | Abdallah Mohammad A | Apparatus and Method for Processing Complex Instruction Formats in a Multi-Threaded Architecture Supporting Various Context Switch Modes and Virtualization Schemes |
| CN103150226A (en) * | 2013-04-01 | 2013-06-12 | 山东鲁能软件技术有限公司 | Abnormal dump and recovery system for computer model and dump and recovery method thereof |
| US20180300158A1 (en) * | 2017-04-18 | 2018-10-18 | International Business Machines Corporation | Management of store queue based on restoration operation |
| US20190347129A1 (en) * | 2018-05-11 | 2019-11-14 | Futurewei Technologies, Inc. | User space pre-emptive real-time scheduler |
| CN110597601A (en) * | 2019-09-16 | 2019-12-20 | 杭州和利时自动化有限公司 | Controller task switching method, device, equipment and readable storage medium |
| CN111432438A (en) * | 2020-03-26 | 2020-07-17 | 中国科学院计算技术研究所 | Base station processing task real-time migration method |
| CN112416536A (en) * | 2020-12-10 | 2021-02-26 | 成都海光集成电路设计有限公司 | Method for extracting processor execution context and processor |
Non-Patent Citations (2)
| Title |
|---|
| ARTEM STAROSTIN et al.: "Verified Process-Context Switch for C-Programmed Kernels", vol. 5295, pages 240, XP019107315 * |
| 王迪: "基于X86体系结构VxWorks SMP调度和中断机制研究与优化", no. 5, pages 138 - 518 * |
Cited By (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2023185137A1 (en) * | 2022-03-31 | 2023-10-05 | 苏州浪潮智能科技有限公司 | Task management method and apparatus, and device and storage medium |
| CN115470134A (en) * | 2022-09-20 | 2022-12-13 | 重庆长安汽车股份有限公司 | Automatic driving simulation dynamic operation resource scheduling method and device |
| CN115686776A (en) * | 2022-09-30 | 2023-02-03 | 辉羲智能科技(上海)有限公司 | Method and device for reducing multi-model task queuing time delay |
| CN115686776B (en) * | 2022-09-30 | 2024-05-10 | 辉羲智能科技(上海)有限公司 | Method and device for reducing queuing time delay of multi-model tasks |
| CN115437889A (en) * | 2022-11-08 | 2022-12-06 | 统信软件技术有限公司 | Emergency processing method and system and computing equipment |
| CN115437889B (en) * | 2022-11-08 | 2023-03-10 | 统信软件技术有限公司 | Emergency processing method, system and computing equipment |
| CN117290075A (en) * | 2023-11-23 | 2023-12-26 | 苏州元脑智能科技有限公司 | Process migration method, system, device, communication equipment and storage medium |
| CN117290075B (en) * | 2023-11-23 | 2024-02-27 | 苏州元脑智能科技有限公司 | Process migration method, system, device, communication equipment and storage medium |
Also Published As
| Publication number | Publication date |
|---|---|
| CN113076182B (en) | 2024-03-29 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||