CN120743630A

CN120743630A - A processor fault recovery method

Info

Publication number: CN120743630A
Application number: CN202511163956.8A
Authority: CN
Inventors: 张萌; 张盛兵; 崔兆玺; 王瀚钦
Original assignee: Northwestern Polytechnical University
Current assignee: Northwestern Polytechnical University
Priority date: 2025-08-20
Filing date: 2025-08-20
Publication date: 2025-10-03

Abstract

The present invention discloses a processor fault recovery method, which relates to the technical field of computer system structure. The method sets multiple levels of hardware checkpoints on a processor superscalar pipeline; there is a preset interval between adjacent hardware checkpoints; when a pending instruction reaches any hardware checkpoint in the processor superscalar pipeline, the instruction state content outputted by the previous stage of the stage where the hardware checkpoint is located is backed up, and fault detection is performed on the processor superscalar pipeline; if a fault occurs in the processor superscalar pipeline, the backup instruction state content is moved to the processor superscalar pipeline to recover from the fault. This method solves the problem of high fault recovery overhead caused by the large depth of the processor superscalar pipeline.

Description

Processor fault recovery method

Technical Field

The present invention relates to the field of computer system structures, and in particular, to a processor fault recovery method.

Background

Checkpointing is an important technique for protecting computer systems from hardware faults. The technique typically creates a checkpoint at system start-up and records backups of the hardware state while the system runs, so that the system can be restored to the last stable and correct state when a failure occurs. In virtualized and real-time systems, the hardware checkpointing mechanism is a key component in ensuring system availability and fault tolerance. By creating hardware checkpoints at regular intervals, a system administrator can restore the system state after a hardware failure, error, or other anomaly, minimizing service disruption time.

In conventional checkpointing mechanisms, the hardware checkpoint is usually located at the end of the pipeline. For a superscalar processor, consider the worst case, in which a soft error occurs in the instruction fetch stage at the beginning of the pipeline: the fault is detected and recovery is started only at the commit stage, where the recovery delay is greatest, and the instructions that have already entered the pipeline are flushed and re-executed, which wastes significant performance. Because superscalar processor pipelines are deep, this leads to significant recovery overhead.

Therefore, there is a need for a processor fault recovery method that solves the problem of high fault recovery overhead of superscalar processor pipelines due to large depth.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a processor fault recovery method. The method can solve the problem of high fault recovery overhead of the superscalar processor pipeline caused by large depth.

The invention adopts the following technical scheme:

The invention provides a processor fault recovery method, which comprises the following steps:

setting multi-level hardware checkpoints on a processor superscalar pipeline, wherein a preset interval exists between adjacent hardware checkpoints;

when an instruction to be processed reaches any hardware checkpoint of the processor superscalar pipeline, backing up the instruction state content output by the stage preceding the stage where the hardware checkpoint is located, and performing fault detection on the processor superscalar pipeline; and

when a fault exists in the processor superscalar pipeline, recovering from the fault by moving the backed-up instruction state content to the processor superscalar pipeline.

Preferably, the multi-level hardware checkpoints comprise a first-level hardware checkpoint and a second-level hardware checkpoint, and setting the multi-level hardware checkpoints on the processor superscalar pipeline specifically comprises:

setting the first-level hardware checkpoint at a dispatch stage of the processor superscalar pipeline and setting the second-level hardware checkpoint at a write-back stage of the processor superscalar pipeline, wherein the interval between the first-level hardware checkpoint and the second-level hardware checkpoint is half of the entire cycle of the processor superscalar pipeline.

Preferably, the hardware checkpoint is the first-level hardware checkpoint, and backing up the instruction state content output by the stage preceding the stage where the hardware checkpoint is located specifically comprises:

obtaining the register renaming mapping table output by the stage preceding the dispatch stage, wherein the register renaming mapping table comprises the information an instruction requires to access the general-purpose registers; and

storing the register renaming mapping table in a first backup circuit, wherein the structure of the first backup circuit is consistent with that of the register renaming mapping table.

Preferably, the hardware checkpoint is the second-level hardware checkpoint, and backing up the instruction state content output by the stage preceding the stage where the hardware checkpoint is located specifically comprises:

obtaining the calculation result output by the stage preceding the write-back stage, wherein the calculation result is the data to be backed up for the general-purpose registers and the control and status registers; and

storing the calculation result in a second backup circuit, wherein the structure of the second backup circuit is a register file structure.

Preferably, the backup of the instruction state content output by the stage preceding the stage at which each stage hardware checkpoint is located is performed in parallel with the processor superscalar pipeline.

The invention provides a processor fault recovery device, comprising:

a setting module, configured to set multi-level hardware checkpoints on the processor superscalar pipeline, wherein a preset interval exists between adjacent hardware checkpoints;

a backup and detection module, configured to, when an instruction to be processed reaches any hardware checkpoint of the processor superscalar pipeline, back up the instruction state content output by the stage preceding the stage where the hardware checkpoint is located, and perform fault detection on the processor superscalar pipeline; and

a recovery module, configured to, when a fault exists in the processor superscalar pipeline, recover from the fault by moving the backed-up instruction state content to the processor superscalar pipeline.

The invention provides a storage medium storing a computer program which, when executed by a processor, implements the processor fault recovery method.

The invention provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the processor fault recovery method when executing the program.

The at least one technical scheme adopted by the invention can achieve the following beneficial effects:

The method sets multi-level hardware checkpoints on a processor superscalar pipeline, with a preset interval between adjacent hardware checkpoints. When an instruction to be processed reaches any hardware checkpoint, the instruction state content output by the stage preceding the stage where the hardware checkpoint is located is backed up, and fault detection is performed on the processor superscalar pipeline. If a fault exists, it is recovered by moving the backed-up instruction state content to the processor superscalar pipeline. Because the backup does not affect the timing of the processor superscalar pipeline, the method helps minimize system interruption and recovery time and improves the availability and fault tolerance of the system. Compared with a single hardware checkpoint, in a superscalar pipeline with many stages the invention reduces the redundant cycles between fault detection and recovery and the processor performance wasted by flushing instructions. Its fault recovery overhead is therefore lower than that of a single hardware checkpoint, which solves the problem of high fault recovery overhead caused by the large depth of the processor superscalar pipeline.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:

FIG. 1 is a schematic flow chart of a processor fault recovery method provided by the invention;

FIG. 2 is a schematic diagram of the location of the first-level hardware checkpoint in the processor superscalar pipeline provided by the present invention;

FIG. 3 is a schematic diagram of the location of the second-level hardware checkpoint in the processor superscalar pipeline provided by the present invention;

FIG. 4 is a schematic diagram of the pipeline lockstep provided by the present invention;

FIG. 5 is a timing diagram of the first-level hardware checkpoint backup mechanism provided by the present invention;

FIG. 6 is a timing diagram of the second-level hardware checkpoint backup mechanism provided by the present invention;

FIG. 7 is a flowchart of the two-level hardware checkpoint backup provided by the present invention;

FIG. 8 is a graph of the performance impact assessment of the two-level hardware checkpointing mechanism provided by the present invention;

FIG. 9 is a schematic diagram of a processor fault recovery apparatus according to the present invention;

FIG. 10 is a schematic diagram of a computer device for implementing a processor fault recovery method according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to specific embodiments of the present invention and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The following describes in detail the technical solutions provided by the embodiments of the present invention with reference to the accompanying drawings.

FIG. 1 is a schematic flow chart of the processor fault recovery method of the present invention, which specifically includes the following steps:

S101, setting multi-level hardware checkpoints on a processor superscalar pipeline, wherein a preset interval exists between adjacent hardware checkpoints.

In one exemplary embodiment, the multi-level hardware checkpoints include a first-level hardware checkpoint and a second-level hardware checkpoint. Setting the multi-level hardware checkpoints on the processor superscalar pipeline specifically includes setting the first-level hardware checkpoint at the dispatch stage of the processor superscalar pipeline and setting the second-level hardware checkpoint at the write-back stage of the processor superscalar pipeline, with the interval between the first-level hardware checkpoint and the second-level hardware checkpoint being half of the entire cycle of the processor superscalar pipeline.

Specifically, as shown in FIG. 2, the first-level hardware checkpoint is set at the dispatch stage of the processor superscalar pipeline. The dispatch stage is the critical point at which instruction execution changes from in-order to out-of-order: the information produced by renaming an instruction is written into the corresponding reservation station, the reorder buffer, and other components, and the stage mainly serves to dispatch decoded instructions to the execution units. The second-level hardware checkpoint is set at the write-back stage of the processor superscalar pipeline. After instructions are issued out of order from the reservation stations, they read the physical register file, execute, and write back out of order, and are finally committed in order by the reorder buffer (ROB). The write-back stage is responsible for writing the calculation results obtained in the execution stage back to the general-purpose registers and the control and status registers (CSRs). For some CSRs, such as counting registers used to monitor system performance and hardware-acceleration or special-function registers customized by the designer, the correctness of the data does not affect the working state of the system. The processor system therefore excludes these registers from the hardware checkpoint backup content, and the system does not detect or recover faults that occur in them.
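
As a rough illustration of why an intermediate checkpoint shortens the worst-case path to detection, the following Python sketch counts how many stages a faulty instruction travels before it reaches the next checkpoint. The stage list, the function names, and the rule that a fault is caught at the first checkpoint at or after the faulty stage are hypothetical simplifications and are not taken from the described hardware.

```python
# Hypothetical stage list for illustration only; the description does not
# enumerate the pipeline stages in this form.
STAGES = ["fetch", "decode", "rename", "dispatch", "issue",
          "regread", "execute", "writeback", "commit"]


def detection_delay(fault_stage, checkpoints):
    """Stages travelled from the faulty stage to the next checkpoint."""
    for cp in sorted(checkpoints):
        if cp >= fault_stage:
            return cp - fault_stage
    raise ValueError("fault occurs after the last checkpoint")


def worst_case_delay(checkpoints):
    """Largest detection delay over all stages covered by the checkpoints."""
    last = max(checkpoints)
    return max(detection_delay(s, checkpoints) for s in range(last + 1))


# Conventional scheme: one checkpoint at the end of the pipeline.
single = [STAGES.index("commit")]
# Two-level scheme: checkpoints at dispatch and write-back.
two_level = [STAGES.index("dispatch"), STAGES.index("writeback")]

single_worst = worst_case_delay(single)
two_level_worst = worst_case_delay(two_level)
```

Under these assumptions the worst case falls from 8 stages with the single end-of-pipeline checkpoint to 3 stages with the two-level arrangement.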

S102, when an instruction to be processed reaches any hardware checkpoint of the processor superscalar pipeline, backing up the instruction state content output by the stage preceding the stage where the hardware checkpoint is located, and performing fault detection on the processor superscalar pipeline.

In an exemplary embodiment, the hardware checkpoint is the first-level hardware checkpoint, and backing up the instruction state content output by the stage preceding the stage where the hardware checkpoint is located specifically includes: obtaining the register renaming mapping table (Static Register Allocation Table, SRAT) output by the stage preceding the dispatch stage, where the register renaming mapping table includes the information an instruction requires to access the general-purpose registers; and storing the register renaming mapping table in a first backup circuit whose structure is consistent with that of the register renaming mapping table.
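
The benefit of a backup circuit with the same structure as the mapping table can be sketched in software: because the two have identical layout, backup and recovery are both a plain entry-for-entry copy. The class name, the register count, and the list-based layout below are assumptions made for illustration, not the actual circuit.

```python
from dataclasses import dataclass, field

NUM_ARCH_REGS = 32  # assumed number of architectural registers


def _identity_mapping():
    return list(range(NUM_ARCH_REGS))


@dataclass
class RenameTable:
    # mapping[i] is the physical register assigned to architectural register i
    mapping: list = field(default_factory=_identity_mapping)

    def copy_from(self, other):
        """Copy every entry in one step, as a same-structure circuit allows."""
        self.mapping = list(other.mapping)


srat = RenameTable()          # live table output by the rename stage
backup = RenameTable()        # first backup circuit, identical structure

srat.mapping[5] = 40          # a rename updates the live table
backup.copy_from(srat)        # checkpoint backup
srat.mapping[5] = 99          # injected fault corrupts the live entry
srat.copy_from(backup)        # recovery restores the backed-up mapping
```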

In an exemplary embodiment, the hardware checkpoint is the second-level hardware checkpoint, and backing up the instruction state content output by the stage preceding the stage where the hardware checkpoint is located specifically includes: obtaining the calculation result output by the stage preceding the write-back stage, where the calculation result is the data to be backed up for the general-purpose registers and the control and status registers; and storing the calculation result in a second backup circuit whose structure is a register file structure.
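
A minimal sketch of the second-level backup, assuming the register state can be modelled as a list of general-purpose registers plus a dictionary of CSRs. The CSR names are hypothetical RISC-V-style examples chosen only to show that performance counters are left out of the backup, as the description states for registers whose correctness does not affect the working state of the system.

```python
# Hypothetical CSR names (RISC-V style) used only for illustration.
EXCLUDED_CSRS = {"mcycle", "minstret"}  # performance counters: not backed up


def backup_state(gprs, csrs):
    """Snapshot GPRs and the CSRs whose correctness affects system state."""
    kept = {name: value for name, value in csrs.items()
            if name not in EXCLUDED_CSRS}
    return {"gprs": list(gprs), "csrs": kept}


def restore_state(gprs, csrs, backup):
    """Overwrite live state in place; excluded CSRs are left untouched."""
    gprs[:] = backup["gprs"]
    csrs.update(backup["csrs"])


gprs = [0] * 32
csrs = {"mstatus": 0x8, "mepc": 0x100, "mcycle": 1234}
snapshot = backup_state(gprs, csrs)

gprs[3] = 7            # corrupted register
csrs["mepc"] = 0xDEAD  # corrupted CSR
csrs["mcycle"] = 2000  # counter keeps running and is never restored
restore_state(gprs, csrs, snapshot)
```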

The instruction state content to be backed up at a hardware checkpoint is the state output by the previous stage after the instruction has been executed in that stage.

The instruction state content backed up at the second-level hardware checkpoint is the data in the general-purpose registers and part of the control and status registers; control and status registers whose data correctness does not affect the working state of the system need not be backed up. In addition, the fault detection result can be returned to the commit stage of the pipeline for the instruction retirement decision.

Specifically, assume that the system is currently running under the Triple Pipeline Lock Step (TPLS) mechanism.

Under TPLS, three pipelines execute instructions synchronously, which is equivalent in performance to a single-core superscalar processor. As shown in FIG. 4, the TPLS management unit is responsible for the clock synchronization, fault detection, majority voting, and synchronized fault recovery control of the pipelines. The three pipelines under the lockstep mechanism execute the same instructions synchronously, so under normal conditions the data of their internal modules should remain consistent. At each designed hardware checkpoint the specified content is backed up, and the signals to be checked are sent to the TPLS management unit for fault detection. When no error occurs, the pipelines continue executing and complete external interaction through the Advanced eXtensible Interface (AXI) bus. If the pause and recovery signals of the synchronous control unit are received, the pipeline in which the error occurred completes the fault recovery operation under the scheduling of the TPLS management unit, while the other two pipelines pause execution of the current instruction and wait for the start signal issued after fault recovery completes before continuing lockstep execution.

The fault detection unit is located in the TPLS management unit. It receives the three groups of interfaces to be checked from the three pipelines, including register-related information and memory access addresses and data, and checks them in real time through combinational logic. Different error detection modes are adopted depending on the system operating state, including a dual pipeline lockstep mechanism and the comparator of the dual pipeline lockstep mechanism. In addition, error detection can be performed at different stages of the pipeline. For example, in a two-level fault detection mode, fault detection is performed at the dispatch stage and at the write-back stage of the pipeline, and combinational logic generates an error signal error_f at each detection position to trigger pipeline recovery.
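
The majority voting and pairwise comparison mentioned above can be sketched as follows. This is a software analogy under assumed function names, not the combinational logic of the TPLS management unit: with three lockstep outputs a single faulty pipeline can be located, whereas a two-way comparator can only report that a mismatch exists.

```python
from collections import Counter


def vote(outputs):
    """Majority-vote three lockstep outputs.

    Returns (majority_value, error_flags); error_flags[i] is True when
    pipeline i disagrees with the majority. With no majority, the value is
    None and every pipeline is flagged.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        return None, [True] * len(outputs)
    return value, [out != value for out in outputs]


def compare_pair(a, b):
    """Dual-lockstep comparator: detects a mismatch but cannot locate it."""
    return a != b
```

For example, `vote([5, 9, 5])` returns `(5, [False, True, False])`, identifying the second pipeline as the faulty one.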

Specifically, FIG. 5 shows the timing of the first-level hardware checkpoint backup mechanism. A soft error occurs in pipeline1 during a renm clock cycle, and the combinational logic circuit generates the error signal error. Because the synchronous control unit registers the signal by one beat, the system obtains the fault detection result signal error_f only in a later renm clock cycle. In addition, the system backup has a delay of one clock cycle: when pipeline1 needs fault recovery in the renm stage, the dispatch-stage system state state_ds_p0 of the zeroth pipeline and the dispatch-stage system state state_ds_p2 of the second pipeline are only starting the backup s1. The backup is therefore started from the output interface of the register renaming stage and is performed in parallel with the logic of the dispatch stage. For example, in FIG. 5 the renaming-stage system state state_renm_p0 of the zeroth pipeline and the renaming-stage system state state_renm_p2 of the second pipeline have already completed the backup s1, which can be used to recover pipeline1 starting in the next clock cycle.

When the backup for the first pipeline takes place in the same clock cycle as the dispatch stage, the backup content is the register renaming mapping table output by the previous stage, which includes the information an instruction requires to access the general-purpose registers. Otherwise, an additional backup circuit and storage bank would be needed; under the triple pipeline mechanism this means three additional sets of backup overhead, so the hardware cost would double compared with accepting a delay of one clock cycle. In addition, the content backed up at the first-level hardware checkpoint must be delayed by one beat, mainly to wait for the state backup of the previous stage to complete, which ensures that the correct backup content is available when a fault occurs and recovery is needed.
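
The one-beat delay between the combinational error signal and the registered signal error_f can be modelled with a simple shift. The cycle numbering and the function name are assumptions for illustration; the sketch only shows that error_f lags error by one clock cycle and that recovery can begin in the cycle after detection.

```python
def register_one_beat(signal):
    """Model a flip-flop: the output at cycle t is the input at cycle t - 1."""
    return [0] + signal[:-1]


# Combinational error flag per clock cycle; the fault appears in cycle 2.
error = [0, 0, 1, 0, 0]
error_f = register_one_beat(error)

detect_cycle = error_f.index(1)    # cycle in which the system sees the fault
recover_cycle = detect_cycle + 1   # recovery can start in the next cycle
```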

Specifically, FIG. 6 shows the timing of the second-level hardware checkpoint backup mechanism. Similarly, a soft error occurs in pipeline1 during the exe2 clock cycle, and the combinational logic circuit generates the error signal error. Because the synchronous control unit registers the signal by one beat, the system obtains the fault detection result signal error_f only in the exe3 clock cycle. In addition, the system backup has a delay of one clock cycle: when pipeline1 needs fault recovery in the exe3 stage, the write-back-stage system state state_rf_p0 of the zeroth pipeline and the write-back-stage system state state_rf_p2 of the second pipeline are only starting the backup s1. The backup is therefore started from the execution stage and is performed in parallel with the logic of the write-back stage. For example, in FIG. 6 the execution-stage system state state_exe_p0 of the zeroth pipeline and the execution-stage system state state_exe_p2 of the second pipeline have already completed the backup s1, which can be used to recover pipeline1 starting in the next clock cycle.

When the backup for the first pipeline takes place in the same clock cycle as the write-back stage, the backup content is the data in the general-purpose registers and part of the control and status registers before the current instruction is written back. Otherwise, an additional backup circuit and storage bank would be needed; under the triple pipeline mechanism this means three additional sets of backup overhead, so the hardware cost would double compared with accepting a delay of one clock cycle. In addition, the content backed up at the second-level hardware checkpoint must be delayed by one beat, mainly to wait for the backup of the content output by the previous stage to complete, which ensures that the correct backup content is available when a fault occurs and recovery is needed.

Specifically, the content backed up at the first-level hardware checkpoint is based on the register renaming mapping table used for speculative execution, so the backup circuit is designed with the same structure as the register renaming mapping table; once the fault recovery enable is asserted, the register mapping table can be restored within one cycle. In addition, after a fault occurs, the faulty pipeline must be flushed and restored from the content of a correct pipeline. The backup circuit of the second-level checkpoint is consistent with the register file structure, which ensures that the processor state can be restored within one cycle during fault recovery.

In one exemplary embodiment, the backup of the instruction state content output by the stage immediately preceding the stage at which each stage hardware checkpoint is located is performed in parallel with the processor superscalar pipeline.

Specifically, the data source for the backup at both hardware checkpoint levels is the output of the previous stage, and the output of the stage where each checkpoint is located is independent of the backup mechanism. The backup logic and the stage logic run in parallel, so the original timing of the pipeline is not affected and, under normal conditions, no additional processor performance is consumed.

Specifically, the instruction state content output by the stage preceding the stage where the hardware checkpoint is located is backed up in real time, that is, in every clock cycle.
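
The claim that the per-cycle backup does not disturb the stage's own output can be illustrated with a toy model. Here each loop iteration stands for one clock cycle, `value + 1` is a placeholder for the checkpoint stage's logic, and the function name is hypothetical; the point is only that the stage output is identical with and without the backup.

```python
def run_stage(prev_stage_outputs, with_backup):
    """Process one value per clock cycle; optionally latch a backup each cycle."""
    outputs, backup = [], None
    for value in prev_stage_outputs:
        if with_backup:
            backup = value           # backup latches the previous stage's output
        outputs.append(value + 1)    # stand-in for the checkpoint stage's logic
    return outputs, backup


plain, _ = run_stage([10, 20, 30], with_backup=False)
checked, last_backup = run_stage([10, 20, 30], with_backup=True)
```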

S103, when a fault exists in the processor superscalar pipeline, recovering from the fault by moving the backed-up instruction state content to the processor superscalar pipeline.

When the first-level hardware checkpoint detects a fault, the backup circuit, which has the same structure as the register renaming mapping table, is used together with the content backed up at the first-level hardware checkpoint in S102 to flush and restore the faulty pipeline, while the remaining pipelines wait. When the second-level hardware checkpoint detects a fault, the backup circuit, which has the same structure as the register file, is used together with the content backed up at the second-level hardware checkpoint in S102 to flush and restore the faulty pipeline, while the remaining pipelines wait.

In a specific embodiment, FIG. 7 shows the two-level hardware checkpoint recovery process, where the first-level checkpoint in FIG. 7 is the first-level hardware checkpoint of the present invention and the second-level checkpoint is the second-level hardware checkpoint of the present invention. When the program executes to the first-level checkpoint, the SRAT output by the stage preceding the first-level checkpoint is backed up (the SRAT includes the information an instruction requires to access the general-purpose registers), and first-level fault detection is performed on the processor superscalar pipeline. If a fault exists, the first backup circuit is used, together with the content backed up at the first-level hardware checkpoint, to recover the processor superscalar pipeline. If no fault exists, the program continues to execute. When the program executes to the second-level checkpoint, the contents of the general-purpose registers and part of the control and status registers are backed up, and second-level fault detection is performed on the processor superscalar pipeline. If a fault exists, the second backup circuit is used, together with the content backed up at the second-level checkpoint, to recover the processor superscalar pipeline; if no fault exists, the program continues to execute.
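
The two-level flow can be summarized as an ordered list of actions. The function name and action strings are hypothetical, and the sketch assumes that execution simply proceeds to the next checkpoint after a recovery, which the description does not spell out.

```python
def two_level_flow(fault_at_l1, fault_at_l2):
    """Return the ordered actions for one pass through both checkpoints."""
    actions = ["backup SRAT", "level-1 detection"]
    if fault_at_l1:
        actions.append("restore SRAT from first backup circuit")
    actions += ["backup GPR/CSR results", "level-2 detection"]
    if fault_at_l2:
        actions.append("restore register file from second backup circuit")
    actions.append("continue execution")
    return actions
```

With no faults the pass consists of the two backups, the two detections, and continued execution; each detected fault adds exactly one restore step at the level where it was found.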

As shown in FIG. 8, the two-level hardware checkpoint mechanism is compared with the single-level hardware checkpoint mechanism on the same processor superscalar pipeline, using the same test program as instruction input, applying identical fault simulation stimuli to both configurations, and strictly controlling all other variables. When executing the same target test program under equivalent faults, the two-level mechanism achieves the same fault detection rate as the single-level mechanism while reducing program running time by 9.54%, which amounts to a corresponding performance improvement.

In a superscalar pipeline with many stages, the two-level mechanism reduces, compared with a single hardware checkpoint, the redundant cycles between fault detection and recovery and the processor performance wasted by flushing instructions. The backup logic of the two-level backup mechanism runs independently of and in parallel with the pipeline logic, so pipeline timing is not affected, system interruption and recovery time is minimized, and the availability and fault tolerance of the system are improved.
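
To relate the reported 9.54% reduction in running time to a speedup factor, one common convention (an assumption here, since the description does not define the metric further) is the ratio of old running time to new running time:

```python
runtime_reduction = 0.0954  # figure reported in the description

# Speedup defined as old_time / new_time, with new_time = old_time * (1 - reduction).
speedup = 1.0 / (1.0 - runtime_reduction)
throughput_gain_percent = (speedup - 1.0) * 100.0
```

Under that convention the speedup is roughly 1.105, that is, about 10.5% more work per unit time.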

The server mentioned in the present invention may be a server provided on a service platform, or a device such as a desktop computer, a notebook computer, etc. capable of executing the solution of the present invention. For convenience of explanation, only the server is used as the execution subject.

When the processor fault recovery method provided by the invention is applied, the steps need not be executed in the order shown in FIG. 1; the specific execution order may be determined as needed, and the invention is not limited in this respect.

The foregoing provides a processor fault recovery method according to one or more embodiments of the present invention, and based on the same concept, the present invention further provides a corresponding processor fault recovery device, as shown in fig. 9.

Fig. 9 is a schematic diagram of a processor fault recovery apparatus according to the present invention, including:

the setting module 901 is configured to set a multi-level hardware check point on the processor superscalar pipeline, where a preset interval exists between adjacent hardware check points.

The backup and detection module 902 is configured to, when an instruction to be processed reaches any hardware checkpoint of the processor superscalar pipeline, backup the instruction state content output at the previous stage of the stage where the hardware checkpoint is located, and perform fault detection on the processor superscalar pipeline.

And the recovery module 903 is configured to, in case of a failure in the processor superscalar pipeline, recover the failure in the processor superscalar pipeline by moving the backed-up instruction status content to the processor superscalar pipeline.

For a specific limitation of a processor fault recovery apparatus, reference may be made to the limitation of a processor fault recovery method hereinabove, and no further description is given here. Each of the modules in the above-described processor fault recovery apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.

The present invention also provides a computer readable storage medium storing a computer program operable to perform a processor fault recovery method as provided in fig. 1 above.

The invention also provides the computer device shown schematically in FIG. 10. As shown in FIG. 10, at the hardware level the computer device includes a processor, an internal bus, a network interface, a memory, and a nonvolatile memory, and may also include hardware required by other services. The processor reads the corresponding computer program from the nonvolatile memory into the memory and runs it to implement the processor fault recovery method shown in FIG. 1.

Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, or the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory. By way of illustration, and not limitation, RAM can be in various forms such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), etc.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to be within the scope of the present invention.

Claims

1. A method of processor fault recovery, the method comprising:

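setting multi-level hardware checkpoints on a processor superscalar pipeline, wherein a preset interval exists between adjacent hardware checkpoints;

when an instruction to be processed reaches any hardware checkpoint of the processor superscalar pipeline, backing up the instruction state content output by the stage preceding the stage where the hardware checkpoint is located, and performing fault detection on the processor superscalar pipeline; and

when a fault exists in the processor superscalar pipeline, recovering from the fault by moving the backed-up instruction state content to the processor superscalar pipeline.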

2. The method of claim 1, wherein the multi-level hardware checkpoints comprise a first level hardware checkpoint and a second level hardware checkpoint, and wherein the setting the multi-level hardware checkpoints on the processor superscalar pipeline comprises:

setting the first-level hardware checkpoint at a dispatch stage of the processor superscalar pipeline and setting the second-level hardware checkpoint at a write-back stage of the processor superscalar pipeline, wherein the interval between the first-level hardware checkpoint and the second-level hardware checkpoint is half of the entire cycle of the processor superscalar pipeline.

3. The method of claim 2, wherein the hardware checkpoint is the first level hardware checkpoint, and wherein the backing up the instruction state content output from a previous stage of the stage in which the hardware checkpoint is located comprises:

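acquiring a register renaming mapping table output by the stage preceding the dispatch stage, wherein the register renaming mapping table comprises the information an instruction requires to access the general-purpose registers; and

storing the register renaming mapping table in a first backup circuit, wherein the structure of the first backup circuit is consistent with that of the register renaming mapping table.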

4. The method of claim 2, wherein the hardware checkpoint is the second level hardware checkpoint, and wherein the backing up the instruction state content output from a previous stage of the stage in which the hardware checkpoint is located comprises:

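acquiring a calculation result output by the stage preceding the write-back stage, wherein the calculation result is the data to be backed up for the general-purpose registers and the control and status registers; and

storing the calculation result in a second backup circuit, wherein the structure of the second backup circuit is a register file structure.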

5. The method of claim 1, wherein the backing up of instruction state content output by a stage previous to a stage at which each stage hardware checkpoint is located is performed in parallel with the processor superscalar pipeline.