CN120255970A

CN120255970A - Baseboard management controller startup method, computer equipment, medium and product

Info

Publication number: CN120255970A
Application number: CN202510705561.XA
Authority: CN
Inventors: 周宁宁; 张中云
Original assignee: Suzhou Metabrain Intelligent Technology Co Ltd
Current assignee: Suzhou Metabrain Intelligent Technology Co Ltd
Priority date: 2025-05-29
Filing date: 2025-05-29
Publication date: 2025-07-04
Anticipated expiration: 2045-05-29
Also published as: CN120255970B

Abstract

The application discloses a baseboard management controller startup method, computer equipment, a medium and a product, and relates to the technical field of server management. The method comprises: after the baseboard management controller is started, reading the value of a first register, and determining a target operation partition and an environment variable based on a specific field of the first register; detecting whether the baseboard management controller has an abnormality in the operation stage; and, if so, storing the reset reason and the abnormal running process in the specific field of the first register and restarting the baseboard management controller. The application detects whether the operation stage is abnormal after the baseboard management controller is started, stores the reset reason and the abnormal running process in the first register when an abnormality exists, and restarts the baseboard management controller; after the restart, the value in the first register is read and combined with the reset reason to determine whether to switch partitions, thereby realizing intelligent switching of the redundant backup system.

Description

Baseboard management controller starting method, computer equipment, medium and product

Technical Field

The present application relates to the field of server management technologies, and in particular, to a method for starting a baseboard management controller, a computer device, a medium, and a product.

Background

The baseboard management controller (Baseboard Management Controller, BMC) is a dedicated microcontroller integrated on a server motherboard and independent of the main system, used for hardware monitoring, remote power management and out-of-band operation and maintenance. To improve the reliability and maintainability of a BMC system, two BMC partitions, a main partition and a backup partition, are typically designed so that the system can recover quickly when a firmware update fails or the main partition is damaged. However, if a kernel exception, an application layer exception or the like occurs while the BMC is running, it is difficult to distinguish the specific restart cause after the BMC is restarted and reset, so it is difficult to determine whether partition switching is required.

Disclosure of Invention

The application provides a method for starting a baseboard management controller, computer equipment, a medium and a product, which at least solve the problem in the related art that it is difficult for the baseboard management controller to switch partitions according to actual conditions after being started.

The application provides a starting method of a baseboard management controller, which comprises the following steps:

after the baseboard management controller is started, reading a value of a first register, and determining a target operation partition and an environment variable based on a specific field of the first register, wherein the environment variable is used for representing the operation partition when the baseboard management controller is started, the specific field of the first register is at least used for representing the operation state of the baseboard management controller, and the operation state at least comprises a reset reason and an abnormal operation progress;

detecting whether the baseboard management controller is abnormal in the operation stage;

if an abnormality exists, storing the reset reason and the abnormal running process in the specific field of the first register, and restarting the baseboard management controller.

The application also provides computer equipment, which comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for realizing the steps of any baseboard management controller starting method when executing the computer program.

The present application also provides a computer readable storage medium having a computer program stored therein, wherein the computer program when executed by a processor implements the steps of any of the baseboard management controller starting methods described above.

The application also provides a computer program product comprising a computer program which when executed by a processor implements the steps of any of the above methods for starting a baseboard management controller.

The baseboard management controller startup method comprises: after the baseboard management controller is started, reading the value of a first register, and determining a target operation partition and an environment variable based on a specific field of the first register; detecting whether the baseboard management controller is abnormal in the operation stage; and, if so, storing the reset reason and the abnormal running process in the specific field of the first register and restarting the baseboard management controller. The invention detects whether the operation stage is abnormal after the baseboard management controller is started, stores the reset reason and the abnormal running process in the first register when an abnormality exists, and restarts the baseboard management controller; after the restart, the value in the first register is read and combined with the reset reason to determine whether to switch partitions, thereby realizing intelligent switching of the redundant backup system.

Drawings

For a clearer description of embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described, it being apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to the drawings without inventive effort for those skilled in the art.

Fig. 1 is a schematic flow chart of a method for starting a baseboard management controller according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a partition switching process according to an embodiment of the present application;

FIG. 3 is a schematic diagram of an operation stage of a baseboard management controller according to an embodiment of the present application;

Fig. 4 is a schematic structural diagram of a baseboard management controller starting device according to an embodiment of the present application;

Fig. 5 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. Based on the embodiments of the present application, all other embodiments obtained by a person of ordinary skill in the art without making any inventive effort are within the scope of the present application.

It should be noted that in the description of the present application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "first," "second," and the like in this specification are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.

The present application will be further described in detail below with reference to the drawings and detailed description for the purpose of enabling those skilled in the art to better understand the aspects of the present application.

The specific application environment architecture or specific hardware architecture upon which the execution of the baseboard management controller startup method depends is described herein.

The baseboard management controller is an independent microcontroller embedded in a server or high-end computer hardware and integrated on the main board. It is responsible for monitoring and managing the physical state of the equipment, and provides remote management functions, including power control, temperature monitoring, fan speed regulation, log recording and fault diagnosis, through an independent out-of-band management interface. The BMC keeps operating even when the host operating system is down, so that an administrator can access and maintain the equipment over the network; it is a key component for efficient operation and maintenance in data centers and cloud computing environments. In a BMC system, designing the Flash chip to include two independent partitions (a main partition and a backup partition) is a typical redundancy architecture that aims to enhance the reliability and fault tolerance of the system. The main partition runs the current BMC firmware, while the backup partition stores a verified stable version or a copy of the last successfully updated firmware. When the main partition fails because of a firmware upgrade abnormality, data damage or hardware failure, the boot loader of the BMC can automatically detect the error and switch to the backup partition for starting, which ensures the continuous availability of the system. However, while the BMC is running, the kernel may crash or hang, or a key process of the BMC application layer may become abnormal, and the BMC then restarts. After the restart, the specific reason for the reset is unclear, so it is difficult to decide whether the partition needs to be switched. Based on the above, the invention provides a baseboard management controller starting method.

The embodiment of the application provides a baseboard management controller starting method, which is described in detail by combining an execution flow of the baseboard management controller starting method.

The embodiment of the invention provides a method for starting a baseboard management controller, which is shown in a flow chart of the method for starting the baseboard management controller in fig. 1, and comprises the following steps:

in step S101, after the baseboard management controller is started, the value of the first register is read, and the target running partition and the environment variable are determined based on the specific field of the first register.

The environment variable is used for representing an operation partition when the baseboard management controller is started, and the specific field of the first register is at least used for representing the operation state of the baseboard management controller, wherein the operation state at least comprises a reset reason and an abnormal operation process.

A register is a storage unit in computer hardware for temporarily storing data. In this embodiment, the first register is a reset register, a specific register whose value is not lost after the baseboard management controller is reset. Different bits of the first register have different meanings; that is, the specific fields represent the operation state of the baseboard management controller.

Taking a 32-bit register as an example: [0] (bit 1) is a reserved bit with a default value of 0. [2:1] (bits 2-3) uses two bits to mark the reset reason; the default value is 00, where 00 indicates normal, 01 indicates a kernel exception, and 10 indicates a BMC exception (baseboard management controller exception). [7:3] (bits 4-8) uses five bits to indicate the running process of the operation stage; the default value is 00000, which indicates that no running process is abnormal. The running processes are, in order, kernel decompression and initialization, device driver loading, root file system mounting, user space initialization, and BMC process monitoring service starting; different running processes correspond to different bits among the five bits, and, for example, 00001 indicates that kernel decompression and initialization are abnormal. [8] (bit 9) uses one bit to mark the (firmware) image refresh state, where 1 indicates that an image refresh is in progress and 0 indicates that the image refresh is finished or no refresh has taken place. [31:9] (bits 10-32) are reserved bits.
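The field layout of the 32-bit example can be illustrated with a minimal Python sketch that unpacks a register value into its three fields. The names `ResetRegister` and `decode` are hypothetical and chosen only for illustration; a real boot loader would read the value from hardware rather than take it as an argument.

```python
from dataclasses import dataclass

# Reset-reason codes from the 32-bit example in the text.
RESET_REASONS = {0b00: "normal", 0b01: "kernel exception", 0b10: "BMC exception"}


@dataclass(frozen=True)
class ResetRegister:
    """Decoded view of the first register (hypothetical helper type)."""
    reset_reason: int       # bits [2:1]
    run_process: int        # bits [7:3], 00000 means no abnormal process
    image_refreshing: bool  # bit [8], True while an image refresh is in progress


def decode(value: int) -> ResetRegister:
    """Split a raw 32-bit register value into the fields described above."""
    return ResetRegister(
        reset_reason=(value >> 1) & 0b11,
        run_process=(value >> 3) & 0b11111,
        image_refreshing=bool((value >> 8) & 0b1),
    )


fields = decode(0x0000000A)
print(RESET_REASONS[fields.reset_reason], format(fields.run_process, "05b"))
```

Bit [0] and bits [31:9] are reserved in the example layout, so the sketch ignores them.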

After the baseboard management controller is started, the value stored in the first register is read; different fields in the first register have different meanings. The value of the specific field is read to judge whether the baseboard management controller was restarted because of an abnormality and, if so, what the reset reason before the restart was and which running process was abnormal. Based on that earlier abnormality, it is judged whether the partition needs to be switched, that is, whether to switch to the other partition or keep the current partition. If the current partition is kept, the current operation partition is taken as the target operation partition.

The value of the environment variable is 0 or 1, which corresponds to different partitions respectively, and after the baseboard management controller is started, the value of the environment variable is read first to be started from the corresponding partition. Illustratively, if the value of the environment variable is 0, then it is started from partition 0.

After the value of the first register is read, if the partition is judged to be required to be switched, the value of the environment variable is modified, and partition switching is realized. If the partition does not need to be switched, the value of the original environment variable is kept unchanged.

Specifically, after the baseboard management controller is started, the value of the first register is read in a stage of a boot loader (for example, u-boot, a kind of boot loader), so that whether to switch the partition is judged according to the value of the first register, and corresponding environment variables are set. The bootloader stage is mainly used for initializing hardware and booting an operating system, and comprises hardware initialization, environment variable setting and the like. In the bootloader stage, a hardware watchdog function is enabled for process detection in a subsequent stage.

Step S102, detecting whether the baseboard management controller is abnormal in the operation stage.

The running stage covers a kernel (Kernel) stage and a BMC user layer stage. Specifically, the kernel stage comprises processes such as kernel decompression and initialization, device driver loading, root file system mounting and user space initialization, and the BMC user layer stage comprises the self-starting process of the BMC process monitoring service. Each process in the kernel stage and the BMC user layer stage is detected, and a watchdog mechanism is started during detection. The baseboard management controller is provided with a hardware watchdog, which is implemented by a dedicated WDT (Watchdog Timer) hardware module.

As an example, a timeout threshold may be set. After the baseboard management controller is started, a process in the user layer stage periodically writes a specific value into the WDT register to reset the watchdog counter, that is, to feed the watchdog. If the process fails to feed the watchdog in time because of a fault, the watchdog counter keeps accumulating until it reaches the preset timeout threshold, and the hardware reset signal socreset is then triggered to restart the controller.
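The feed-or-reset behaviour described above can be modelled in software with a minimal sketch. It is a simulation only: the class `WatchdogTimer`, its tick-based counter and the `reset_triggered` flag are assumptions for illustration, whereas the real WDT is a hardware module whose counter runs on its own.

```python
class WatchdogTimer:
    """Software model of a hardware watchdog counter (illustrative only)."""

    def __init__(self, timeout_ticks: int) -> None:
        self.timeout_ticks = timeout_ticks  # preset timeout threshold
        self.counter = 0
        self.reset_triggered = False        # stands in for the hardware reset signal

    def feed(self) -> None:
        """Called periodically by a healthy process: resets the counter."""
        self.counter = 0

    def tick(self) -> None:
        """Advance the counter by one; trigger a reset once the threshold is reached."""
        if self.reset_triggered:
            return
        self.counter += 1
        if self.counter >= self.timeout_ticks:
            self.reset_triggered = True


wdt = WatchdogTimer(timeout_ticks=3)
wdt.tick()
wdt.tick()
wdt.feed()   # process is alive, so the counter goes back to 0
wdt.tick()
wdt.tick()
wdt.tick()   # no feed for three ticks: the reset fires
print(wdt.reset_triggered)
```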

In the kernel stage, a notification function can be registered through a callback registration mechanism, so that an abnormal event is captured if a process becomes abnormal.

Step S103, if there is an abnormality, storing the reset reason and the abnormal running process in a specific field of the first register, and restarting the baseboard management controller.

When an abnormal process is detected in the operation stage, the abnormal running process and the corresponding reset reason are written into the specific field of the first register. The reset reason identifies the stage in which the abnormal running process is located, namely a kernel stage exception or a user layer stage exception. The running processes comprise the kernel-stage processes of kernel decompression and initialization, device driver loading, root file system mounting and user space initialization, and the user-layer-stage process of starting the BMC process monitoring service.

As an example, the reset reason is marked as 01, indicating a kernel stage exception, and the running process is marked as 00001, indicating that the kernel decompression and initialization process is abnormal.
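Under the 32-bit layout of the earlier example (reset reason in bits [2:1], running process in bits [7:3]), this marking corresponds to the register value (01 << 1) | (00001 << 3) = 0x0A. The sketch below shows one way such a write could be composed; `record_exception` is a hypothetical helper, and preserving the other bits, such as the image refresh bit [8], is an assumption rather than something the text specifies.

```python
RESET_REASON_SHIFT = 1   # reset reason occupies bits [2:1]
RUN_PROCESS_SHIFT = 3    # running process occupies bits [7:3]
FIELD_MASK = (0b11 << RESET_REASON_SHIFT) | (0b11111 << RUN_PROCESS_SHIFT)


def record_exception(reg_value: int, reset_reason: int, run_process: int) -> int:
    """Return the register value with the two exception fields overwritten.

    Bits outside [7:1] are left untouched (assumption for this sketch).
    """
    cleared = reg_value & ~FIELD_MASK & 0xFFFFFFFF
    return (
        cleared
        | ((reset_reason & 0b11) << RESET_REASON_SHIFT)
        | ((run_process & 0b11111) << RUN_PROCESS_SHIFT)
    )


# Kernel stage exception (01) during kernel decompression and initialization (00001).
print(hex(record_exception(0x0, 0b01, 0b00001)))
```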

In some alternative embodiments, before reading the value of the first register in step S101, the method further includes:

Step S201, reading environment variables;

step S202, determining the current operation partition based on the environment variable.

The environment variables are used for representing the operation partitions when the baseboard management controller is started, and different environment variables correspond to different operation partitions. As one example, the environment variable values are 0 and 1. After the baseboard management controller is started, firstly, the environment variable is read, the current operation partition is determined according to the environment variable, if the value of the environment variable is 1, the partition 1 is started, and the partition 1 is the current operation partition.

In some alternative embodiments, after restarting the baseboard management controller, the method further comprises:

the first register is reset, and a specific field of the first register after the resetting represents that the baseboard management controller has no abnormality.

The first register is reset and the previously stored exception information is cleared, so that the value of each specific field in the first register is its default value, which represents that the baseboard management controller has no abnormality.

In some alternative embodiments, the first specific field of the first register characterizes the running process in which the exception occurred, and determining the target running partition and the environment variable based on the specific field of the first register in step S101 includes:

In step S301, if the first specific field indicates that there is an abnormal running process, the environment variable is modified from the first value to the second value, and the current running partition is switched to the target running partition. The first value corresponds to the current operation partition, and the second value corresponds to the target operation partition.

Taking a 32-bit register as an example, [7:3] (bits 4-8) of the first register uses five bits to represent the running process of the operation stage. The first specific field corresponds to bits 4-8 and represents the running process in which an exception occurred; its default value is 00000, which indicates that no running process is abnormal. The running processes of the operation stage are, in order of occurrence, kernel decompression and initialization, device driver loading, root file system mounting, user space initialization, and BMC process monitoring service starting. Different running processes correspond to different bits among the five bits; for example, 00001 indicates that kernel decompression and initialization are abnormal.

First, the first specific field of the first register is read. If its value is not the default value 00000, an abnormal running process exists and the partition needs to be switched. The environment variable corresponding to the current operation partition is the first value; by modifying the environment variable from the first value to the second value, the current operation partition is switched to the target operation partition corresponding to the second value.

Further, determining the target running partition and the environment variable based on the specific field of the first register in step S101 further includes:

Step S302, if the first specific field of the first register indicates that no abnormal running process exists, reading a second specific field and a third specific field of the first register, wherein the second specific field indicates the reset reason, and the third specific field indicates the image refresh state;

Step S303, determining a target operation partition and environment variables based on the read result.

If the first specific field of the first register is the default value, no abnormal running process existed during the previous start, that is, the restart of the baseboard management controller was not caused by an abnormal running process, and the second specific field and the third specific field of the first register are then read.

Taking a 32-bit register as an example, [2:1] (bits 2-3) of the first register uses two bits to mark the reset reason. The second specific field corresponds to bits 2-3; its default value is 00, where 00 indicates normal, 01 indicates a kernel exception, and 10 indicates a BMC exception (baseboard management controller exception).

[8] (bit 9) of the first register uses one bit to mark the image refresh state. The third specific field corresponds to bit 9; its default value is 0, where 0 indicates that the image refresh is completed or no refresh has taken place, and 1 indicates that an image refresh is in progress or abnormal.

Based on the values read from the second specific field and the third specific field, it is judged whether the current operation partition needs to be switched and whether the environment variable needs to be modified.

Further, the step S303 includes:

In step S3031, if the second specific field indicates that a reset reason exists, the third specific field of the first register is read, and the third specific field indicates the image refresh state.

If the value of the second specific field is not 0, that is, it is 01 or 10, a reset reason exists, specifically a kernel exception or a BMC exception, and the third specific field is read.

If the value of the second specific field is 0, no reset reason exists, that is, no abnormal reset occurred during the previous run, so the partition does not need to be switched in this start.

In step S3032, if the third specific field indicates an image refresh exception, the current operation partition is determined as the target operation partition.

If the value of the third specific field is not 0, an image refresh exception occurred during the previous run, so in the current start the partition does not need to be switched, the environment variable does not need to be modified, and the current operation partition is taken as the target operation partition.

Further, after the third specific field of the first register is read in step S3031, the method includes: if the third specific field indicates that the image refresh is not abnormal, modifying the environment variable from the first value to the second value and switching the current operation partition to the target operation partition.

If the value of the third specific field is 0, the image refresh was not abnormal during the previous run, and the partition needs to be switched. The environment variable is modified from the current first value to the second value, so that the current operation partition corresponding to the first value is switched to the target operation partition corresponding to the second value. At the next start, the modified environment variable is read and the controller starts from the target operation partition.

This embodiment fully describes the partition switching logic; fig. 2 is a schematic diagram of the partition switching flow. After the baseboard management controller is started, the value of the first register is read. The first specific field is read first to judge whether the running-process value is non-zero. If it is, the environment variable is modified from the first value to the second value. If it is not, whether a reset reason exists is judged next: if the value corresponding to the reset reason is 0, the system has no abnormality and no partition switch is needed. If the value corresponding to the reset reason is not 0, an abnormal reset occurred, and whether the image refresh is abnormal is judged next. If the corresponding value is 1, the image refresh is abnormal, that is, an image refresh operation was under way during the previous run and the image of the other partition may be damaged, so the partition is not switched and the current partition continues to be used. If the image refresh is normal, the partition is switched. Optionally, a rollback mechanism is added: if a dual-image backup is designed, the damaged target partition is marked as invalid, the last complete image version is pulled again from a remote server or local backup storage and written over the damaged partition, the image integrity is verified (for example by hash check), and the image refresh flag bit is cleared after verification passes. In this way, the image can be recovered quickly when an image refresh fails, which minimizes the risk of service interruption.
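The decision flow of fig. 2 can be sketched as a single function. This is a minimal illustration that assumes the 32-bit field layout of the earlier example and an environment variable that takes the value 0 or 1; the function name `select_partition` is hypothetical, and the optional rollback mechanism is not modelled.

```python
def select_partition(reg_value: int, current_env: int) -> int:
    """Return the environment variable value (0 or 1) to boot from next.

    reg_value   -- raw value read from the first register
    current_env -- environment variable of the current operation partition
    """
    run_process = (reg_value >> 3) & 0b11111   # first specific field, bits [7:3]
    reset_reason = (reg_value >> 1) & 0b11     # second specific field, bits [2:1]
    image_refresh = (reg_value >> 8) & 0b1     # third specific field, bit [8]
    other_env = 1 - current_env

    if run_process != 0:
        return other_env      # an abnormal running process was recorded: switch
    if reset_reason == 0:
        return current_env    # no reset reason: keep the current partition
    if image_refresh == 1:
        return current_env    # the other image may be damaged: do not switch
    return other_env          # abnormal reset with a normal image refresh: switch


print(select_partition(0x0A, current_env=0))   # kernel exception in bits [7:1]
print(select_partition(0x104, current_env=0))  # BMC exception during an image refresh
```

Returning the new environment variable value, rather than writing it, keeps the sketch free of any storage assumptions; a boot loader would persist the value before continuing the boot.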

In some alternative embodiments, the run phase includes a kernel phase, and step S102 includes:

In step S401, it is detected whether the baseboard management controller has an abnormal process in the process of the kernel phase.

While the kernel phase is running, the running state of the processes in the kernel phase is monitored; in particular, panic events can be captured in the callback of the registered notification function. If such an event is captured, the corresponding process is abnormal.

In step S402, if there is an abnormal process, it is determined that the baseboard management controller is abnormal in the kernel stage.

If an abnormal process exists in this stage, it is determined that the kernel stage is abnormal, and the reset reason is a kernel exception.

Further, if it is determined that the baseboard management controller is abnormal in the kernel stage, storing the reset reason and the abnormal running process in the specific field of the first register in step S103 includes modifying the first specific field and the second specific field of the first register, where the modified first specific field indicates that a process in the kernel stage is abnormal, and the modified second specific field indicates that the reset reason of the baseboard management controller is a kernel exception.

If an abnormal process is detected in the kernel phase, the value of the first register is modified accordingly. Specifically, taking a 32-bit register as an example, [7:3] (bits 4-8) of the first register uses five bits to represent the running process of the operation stage; the first specific field corresponds to bits 4-8 and represents the running process in which an exception occurred, and its default value is 00000, which indicates that no running process is abnormal. The running processes of the kernel stage are, in order of occurrence, kernel decompression and initialization, device driver loading, root file system mounting, and user space initialization. Illustratively, 00001 indicates that kernel decompression and initialization are abnormal. [2:1] (bits 2-3) of the first register uses two bits to mark the reset reason; the second specific field corresponds to bits 2-3, and its default value is 00, where 00 indicates normal, 01 indicates a kernel exception, and 10 indicates a BMC exception (baseboard management controller exception).

In some alternative embodiments, the run phase includes a user layer phase, and step S102 includes:

step S403, detecting whether an abnormal process exists in the user layer stage.

The user layer phase is the environment in which application programs run; in the baseboard management controller, it refers to the phase in which the various application programs running on the baseboard management controller are executing. Whether a process in the user layer stage is abnormal is detected. The CPU utilization, memory usage, disk I/O and the like of a process can be monitored; if a process occupies excessive resources for a long time, the process may be abnormal, for example because of a memory leak or an infinite loop. Exceptions can also be detected by analyzing the behavior patterns of processes: a normal management process should perform its tasks according to predetermined logic and frequency, so a behavior pattern that does not match the expected pattern may indicate that the process is abnormal. The log files and error reports of a process can be checked as well, since an abnormal process typically leaves an error message or warning in the log.

Specifically, step S403 includes detecting whether the process of the user layer stage is overtime based on a preset counter, and if so, determining that an abnormal process exists.

In step S404, if there is an abnormal progress in the user layer stage, it is determined that there is an abnormality in the baseboard management controller in the user layer stage.

The preset counter is a WDT counter, a timeout time threshold is set, and in the stage that the baseboard management controller runs to the user layer, the user layer daemon writes a specific value reset timer into the WDT register regularly to feed dogs. If the progress abnormality does not lead the dog timely, the overtime condition occurs, and the numerical value of the preset counter is continuously accumulated until the preset time threshold is reached. After the timeout, a hardware reset signal is triggered to restart.

If an abnormal process exists in the user layer stage, determining that the user layer stage is abnormal according to a rule, namely resetting the reason for the abnormal substrate management controller.

Further, if it is determined that the baseboard management controller has an exception at the user layer stage, in step S103, the specific field of the first register stores the reset reason and the running process with the exception, including modifying the first specific field and the second specific field of the first register, the modified first specific field indicates that the process at the user layer stage has an exception, and the modified second specific field indicates that the reset reason of the baseboard management controller is that the baseboard management controller is abnormal.

If an abnormal process is detected in the user layer stage, the value of the first register is modified accordingly. Taking a 32-bit register as an example, bits [7:3] of the first register (the 4th to 8th bits) use five bits to represent the running process of the operation stage; the first specific field corresponds to these bits and represents the running process in which the abnormality occurred, with a default value of 00000. A value of 00000 in the first specific field indicates that no running process is abnormal. The running process of the user layer stage includes starting the BMC process monitoring service; if it is abnormal, the corresponding value is 10000. Bits [2:1] of the first register (the 2nd and 3rd bits) use two bits to mark the reset reason; the second specific field corresponds to these bits, with a default value of 00. Here 00 indicates normal, 01 indicates a kernel exception, and 10 indicates a BMC exception (baseboard management controller exception).
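The bit layout in this example can be sketched as a pair of helper functions that write and read the two fields. The field positions and values ([7:3], [2:1], 10000, 00/01/10) come from the description; the constant and function names are hypothetical and chosen only for illustration, and the register is modeled as a plain 32-bit integer rather than a hardware register.

```python
# Field positions taken from the description; all names are illustrative.
PROCESS_SHIFT, PROCESS_MASK = 3, 0b11111  # bits [7:3]: abnormal running process
REASON_SHIFT, REASON_MASK = 1, 0b11       # bits [2:1]: reset reason

PROCESS_NONE = 0b00000          # default: no abnormal running process
PROCESS_USER_MONITOR = 0b10000  # BMC process monitoring service (user layer)

REASON_NORMAL = 0b00
REASON_KERNEL = 0b01
REASON_BMC = 0b10


def set_field(reg, shift, mask, value):
    """Return reg with the field at (shift, mask) replaced by value."""
    cleared = reg & ~(mask << shift) & 0xFFFFFFFF
    return cleared | ((value & mask) << shift)


def get_field(reg, shift, mask):
    """Extract the field at (shift, mask) from reg."""
    return (reg >> shift) & mask


def record_user_layer_fault(reg):
    """Mark a user layer process abnormality with reset reason 'BMC exception'."""
    reg = set_field(reg, PROCESS_SHIFT, PROCESS_MASK, PROCESS_USER_MONITOR)
    return set_field(reg, REASON_SHIFT, REASON_MASK, REASON_BMC)
```

Starting from an all-zero register, `record_user_layer_fault(0)` yields `0x84`: the value 10000 in bits [7:3] contributes `0x80` and the value 10 in bits [2:1] contributes `0x04`. Bits outside the two fields are left unchanged.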

An embodiment of the invention provides a method for starting a baseboard management controller, and Fig. 3 is a schematic diagram of the operation stages of the baseboard management controller. Described in time order, after the baseboard management controller is started, the operation stages include:

During the period from start-up until the initialization tasks are complete and the kernel is ready to be loaded (the Boot stage), the system first reads the value of the environment variable and determines the current running partition from it; for example, if the value of the current environment variable is 0, the system starts from partition 0. The system then reads the value of the first register, judges from the result whether the partition needs to be switched, and sets the environment variable accordingly if it does. The partition switching logic is as follows. The first specific field is read first to judge whether the register value for the operation stage is non-zero; if so, the environment variable is modified from the first value to the second value. If not, it is further judged whether a reset reason exists. If the value corresponding to the reset reason is 0, the system is judged to have no abnormality and the partition does not need to be switched. If the value corresponding to the reset reason is not 0, an abnormal reset is indicated, and it is further judged whether the mirror image refreshing (the firmware image update) was abnormal. If the corresponding value is 1, the mirror image refreshing was abnormal, meaning that the image of the other partition may have been damaged by a refresh operation during the previous run; the partition is therefore not switched and the current partition continues to be used. If the mirror image refreshing was normal, the partition is switched.
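The switching logic of the Boot stage can be summarized as a short decision function. This is a minimal sketch, not the implementation of the embodiment: the function name `select_boot_partition` is hypothetical, the two partitions are assumed to be numbered 0 and 1, and the mirror image refreshing state is assumed to sit in bit 0 of the first register, since this passage does not state its position.

```python
def select_boot_partition(env_partition, reg):
    """Return (target_partition, switched) for the Boot stage.

    env_partition: partition number read from the environment variable
                   (ASSUMPTION: two partitions, numbered 0 and 1).
    reg:           value read from the first register.
    """
    process = (reg >> 3) & 0b11111   # first specific field, bits [7:3]
    reason = (reg >> 1) & 0b11       # second specific field, bits [2:1]
    refresh_abnormal = reg & 0b1     # third specific field (ASSUMPTION: bit 0)

    other = 1 - env_partition

    # 1. An abnormal running process was recorded: switch partitions.
    if process != 0:
        return other, True
    # 2. No reset reason recorded: normal start, keep the current partition.
    if reason == 0:
        return env_partition, False
    # 3. Abnormal reset, but the other partition's image may be damaged: stay.
    if refresh_abnormal:
        return env_partition, False
    # 4. Abnormal reset and the image refresh was normal: switch partitions.
    return other, True
```

For example, a register value of `0x84` (user layer process abnormality, BMC exception) read while running from partition 0 gives `(1, True)`, while a value of `0b011` (kernel exception with the refresh flagged as abnormal) gives `(0, False)`.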

The kernel stage includes processes such as kernel decompression and initialization, device driver loading, root file system mounting and user space initialization. In this stage a panic notification function is registered (a function or mechanism in a program or operating system kernel that is triggered when an unrecoverable error or exception is encountered; it performs a series of operations such as recording error information, attempting to release resources and notifying a user or administrator, and eventually causes the program or system to stop running). Panic events are captured in the callback to detect whether an abnormal process exists; if so, the callback function sets the reset reason and the abnormal process in the first register, and the baseboard management controller is restarted. After the restart, the running partition is modified accordingly by reading the value of the first register.
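The order of operations in the kernel stage (register a callback, record the fault inside the callback, then restart) can be modeled in user-space code. The sketch below is an assumption-laden illustration rather than kernel code: `PanicNotifierChain`, `PersistentRegister` and `make_panic_callback` are hypothetical names, and the stage code `0b00001` is a placeholder, because the codes assigned to the individual kernel stage processes are not given in this passage.

```python
REASON_KERNEL = 0b01                      # reset reason: kernel exception
HYPOTHETICAL_KERNEL_STAGE_CODE = 0b00001  # placeholder code, not from the source


class PersistentRegister:
    """Stand-in for the first register, which survives a BMC restart."""

    def __init__(self, value=0):
        self.value = value


class PanicNotifierChain:
    """Minimal model of a list of callbacks invoked when a panic occurs."""

    def __init__(self):
        self._callbacks = []

    def register(self, callback):
        self._callbacks.append(callback)

    def panic(self, message, restart):
        # Callbacks run first so the fault is recorded before the restart.
        for callback in self._callbacks:
            callback(message)
        restart()


def make_panic_callback(reg, stage_code):
    """Build a callback that writes bits [7:3] and [2:1] of the register."""

    def on_panic(message):
        cleared = reg.value & ~0xFE & 0xFFFFFFFF  # clear bits [7:1]
        reg.value = cleared | ((stage_code & 0b11111) << 3) | (REASON_KERNEL << 1)

    return on_panic
```

A caller would create one chain, register the callback built by `make_panic_callback`, and pass the platform's restart routine to `panic`. Bit 0 and bits above bit 7 of the register are preserved by the callback.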

In the BMC user layer stage, the BMC process monitoring service is configured to start automatically and periodically writes a specific value to the WDT count restart register to restart the WDT counter, thereby monitoring the key processes. If an abnormal process is detected, the reset reason, namely a BMC exception, is saved to the first register and the baseboard management controller is restarted. After the restart, the running partition is modified accordingly by reading the value of the first register.

The method provided by the embodiment of the invention realizes fault detection and automatic recovery through cooperative monitoring by the hardware watchdog timer (WDT) and the application layer daemon, avoiding prolonged downtime caused by a kernel or BMC exception. When a watchdog timeout triggers a hardware reset (socreset), the system is forcibly restarted and resumes operation according to a preset strategy, thereby ensuring service continuity. The bit flags of the register (the first register) are not lost when the BMC restarts, so the reset reason (such as a kernel exception or a BMC exception) and the stage in which the fault occurred are accurately recorded, providing key data for subsequent fault diagnosis. The boot partition is selected automatically by combining the reset reason with the environment variable, which realizes intelligent switching of the redundant backup system and improves system availability. The hardware watchdog is deeply integrated with the BootLoader, starts monitoring at the initial stage of system start-up and covers the full life cycle; the boot partition is controlled dynamically through the environment variable and combined with the dual-partition redundancy design, reducing the impact of single points of failure on the system.

According to the method, bits of the first register are allocated to store the corresponding information, including the reset reason and the running process. Because the contents of the first register are not lost when the BMC restarts, fault information can be persisted at the hardware level. The watchdog timer is started as soon as the BMC starts, with a corresponding timeout threshold set, which extends the monitoring range to the initial stage of system start-up and overcomes the limitation that traditional monitoring acts only on the application layer. The cooperation between the application layer daemon and the hardware WDT ensures real-time status feedback during normal operation.

The dynamic partition switching mechanism sets the environment variable to control dual-partition start-up dynamically and, combined with the BMC reset reason judgment logic, realizes automatic isolation of the faulty partition and seamless switching to the backup partition.

From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by software plus the necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the preferred implementation.

The embodiment of the application also provides a baseboard management controller starting device, as shown in fig. 4, which comprises:

The partition determining module is used for reading the value of a first register after the baseboard management controller is started, and determining a target operation partition and an environment variable based on a specific field of the first register, wherein the environment variable is used for representing the operation partition when the baseboard management controller is started, the specific field of the first register is at least used for representing the operation state of the baseboard management controller, and the operation state at least comprises a reset reason and an abnormal operation progress;

The operation detection module is used for detecting whether the baseboard management controller is abnormal in an operation stage;

The restarting module is used for storing the reset reason and the abnormal running process in a specific field of the first register if an abnormality exists, and restarting the baseboard management controller.

In some alternative embodiments, the apparatus further comprises:

The variable reading module is used for reading the environment variable;

And the current partition determining module is used for determining the current running partition based on the environment variable.

In some alternative embodiments, the first specific field of the first register characterizes an executing process in which an exception occurred, and the partition determination module includes:

And the first switching unit is used for modifying the environment variable from a first value to a second value and switching the current operation partition to a target operation partition if the first specific field represents that the abnormal operation process exists, wherein the first value corresponds to the current operation partition, and the second value corresponds to the target operation partition.

In some alternative embodiments, the partition determination module further comprises:

The first reading unit is used for reading a second specific field and a third specific field of the first register if the first specific field of the first register represents that the abnormal running process does not exist, wherein the second specific field represents a reset reason, and the third specific field represents a mirror image refreshing state;

and the partition determining unit is used for determining a target operation partition and environment variables based on the reading result.

In some alternative embodiments, the partition determination unit includes:

a second reading subunit, configured to read a third specific field of the first register if the second specific field indicates that a reset reason exists, where the third specific field indicates a mirror refresh state;

and the partition switching subunit is used for determining the current running partition as the target running partition if the third specific field represents the mirror image refreshing exception.

In some alternative embodiments, the partition determination unit includes:

and the numerical value modification subunit is used for modifying the environment variable from the first numerical value to the second numerical value and switching the current running partition into the target running partition if the third specific field characterizes that the controller has no abnormality in mirror refreshing.

In some alternative embodiments, the run phase includes a kernel phase, and the run detection module includes:

The kernel program detection unit is used for detecting whether an abnormal process exists among the processes of the baseboard management controller in the kernel stage;

and the kernel exception unit is used for determining that the baseboard management controller has exception in the kernel stage if the exception process exists.

In some alternative embodiments, if it is determined that the baseboard management controller has an exception in the kernel phase, the restarting module includes:

The first modification unit is used for modifying a first specific field and a second specific field of the first register, where the modified first specific field represents that a process in the kernel stage is abnormal, and the modified second specific field represents that the reset reason of the baseboard management controller is a kernel exception.

In some alternative embodiments, the run phase includes a user layer phase, and the run detection module includes:

The user layer detection unit is used for detecting whether an abnormal process exists in the user layer stage;

The user layer abnormality unit is used for determining that the baseboard management controller has an abnormality in the user layer stage if the user layer stage has an abnormal process.

In some alternative embodiments, if it is determined that the baseboard management controller has an abnormality in the user layer stage, the restarting module includes:

The second modifying unit is used for modifying a first specific field and a second specific field of the first register, where the modified first specific field represents that a process of the user layer stage is abnormal, and the modified second specific field represents that the reset reason of the baseboard management controller is a baseboard management controller exception.

In some alternative embodiments, the user layer detection unit includes:

The timeout detection subunit is used for detecting, based on a preset counter, whether a process of the user layer stage has timed out;

And the timeout determining subunit is used for determining that an abnormal process exists if the timeout occurs.

In some alternative embodiments, the apparatus further comprises:

And the reset module is used for resetting the first register, and the specific field of the first register after the resetting represents that the baseboard management controller has no abnormality.

The description of the features in the embodiment corresponding to the baseboard management controller starting device can be referred to the related description of the embodiment corresponding to the baseboard management controller starting method, and will not be repeated here.

The embodiment of the application also provides a computer device comprising a memory 10 and a processor 20, as shown in fig. 5, the memory 10 having stored therein a computer program, the processor 20 being arranged to run the computer program to perform the steps of any of the baseboard management controller start-up method embodiments described above.

Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is configured to perform, when run, the steps of any of the baseboard management controller startup method embodiments described above.

In an exemplary embodiment, the computer readable storage medium may include, but is not limited to, a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium in which a computer program may be stored.

Embodiments of the present application also provide a computer program product comprising a computer program which, when executed by a processor, implements the steps of any of the baseboard management controller start-up method embodiments described above.

Embodiments of the present application also provide another computer program product comprising a non-volatile computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the baseboard management controller startup method embodiments described above.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The starting method of the baseboard management controller provided by the application is described in detail. The principles and embodiments of the present application have been described herein with reference to specific examples, the description of which is intended only to facilitate an understanding of the method of the present application and its core ideas. It should be noted that it will be apparent to those skilled in the art that various modifications and adaptations of the application can be made without departing from the principles of the application and these modifications and adaptations are intended to be within the scope of the application as defined in the following claims.

Claims

1. A baseboard management controller start-up method, comprising:

2. The controller start-up method of claim 1, wherein prior to the reading the value of the first register, the method further comprises:

Reading an environment variable;

And determining a current running partition based on the environment variable.

3. The controller boot method of claim 2, wherein the first specific field of the first register characterizes an abnormal running process, wherein the determining the target running partition and the environment variable based on the specific field of the first register comprises:

And if the first specific field characterizes that the abnormal running process exists, modifying the environment variable from a first value to a second value, and switching the current running partition to a target running partition, wherein the first value corresponds to the current running partition, and the second value corresponds to the target running partition.

4. The controller boot method of claim 3, wherein the determining a target run partition and an environment variable based on the specific field of the first register further comprises:

If the first specific field of the first register indicates that no abnormal running process exists, reading a second specific field and a third specific field of the first register, wherein the second specific field indicates a reset reason, and the third specific field indicates a mirror image refreshing state;

and determining a target operation partition and an environment variable based on the read result.

5. The controller startup method of claim 4, wherein the determining the target operating partition and the environment variable based on the read result comprises:

if the second specific field represents that the reset reason exists, reading a third specific field of the first register, wherein the third specific field represents a mirror image refreshing state;

and if the third specific field represents the mirror image refreshing exception, determining the current running partition as the target running partition.

6. The controller start-up method of claim 5, wherein after the reading the third specific field of the first register, the method further comprises:

and if the third specific field indicates that the controller has no abnormality in mirror image refreshing, modifying the environment variable from a first value to a second value, and switching the current running partition to a target running partition.

7. The controller startup method according to claim 1, wherein the operation phase includes a kernel phase, and the detecting whether there is an abnormality in the baseboard management controller in the operation phase includes:

Detecting whether an abnormal process exists among the processes of the baseboard management controller in the kernel stage;

and if the abnormal process exists, determining that the baseboard management controller has an abnormality in the kernel stage.

8. The method according to claim 7, wherein if it is determined that the baseboard management controller has an exception in the kernel phase, storing a reset reason and an abnormal running process in a specific field of the first register, includes:

Modifying a first specific field and a second specific field of the first register, wherein the modified first specific field represents that a process in the kernel stage is abnormal, and the modified second specific field represents that the reset reason of the baseboard management controller is a kernel exception.

9. The controller startup method of claim 1, wherein the operation stage comprises a user layer stage, and wherein detecting whether there is an abnormality in the baseboard management controller in the operation stage comprises:

detecting whether an abnormal process exists in a user layer stage;

and if the user layer stage has an abnormal process, determining that the baseboard management controller has an abnormality in the user layer stage.

10. The controller startup method according to claim 9, wherein if it is determined that the baseboard management controller has an exception at the user layer stage, storing a reset reason and an abnormal running process in a specific field of the first register comprises:

modifying a first specific field and a second specific field of the first register, wherein the modified first specific field represents that an abnormality exists in a process of a user layer stage, and the modified second specific field represents that the reset of the baseboard management controller is caused by the abnormality of the baseboard management controller.

11. The controller startup method according to claim 9, wherein the detecting whether an abnormal process exists in the user layer stage comprises:

Detecting, based on a preset counter, whether a process of the user layer stage has timed out;

If a timeout occurs, determining that an abnormal process exists.

12. The controller start-up method of claim 1, wherein after the restarting the baseboard management controller, the method further comprises:

Resetting the first register, wherein a specific field of the first register after resetting indicates that the baseboard management controller has no abnormality.

13. A computer device, comprising:

A memory for storing a computer program;

A processor for implementing the steps of the baseboard management controller start-up method according to any one of claims 1 to 12 when executing the computer program.

14. A computer readable storage medium, characterized in that a computer program is stored in the computer readable storage medium, wherein the computer program, when being executed by a processor, implements the steps of the baseboard management controller starting method according to any one of claims 1 to 12.

15. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the baseboard management controller start-up method according to any one of claims 1 to 12.