CN118708418A - Server software and hardware information diagnosis system and method - Google Patents
- Publication number
- CN118708418A (CN202411206296.2A)
- Authority
- CN
- China
- Prior art keywords
- hardware
- module
- information
- fault diagnosis
- software
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/22—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
- G06F11/2205—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/22—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
- G06F11/2273—Test methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/302—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3058—Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Computer Hardware Design (AREA)
- Mathematical Physics (AREA)
- Debugging And Monitoring (AREA)
- Hardware Redundancy (AREA)
Abstract
The present application provides a server software and hardware information diagnosis system and method, relating to the computer field. The system includes: a hardware information monitoring module, which monitors the hardware operating status in real time to obtain hardware information; a software information monitoring module, which monitors the service running status and system resource usage status in real time to obtain software information; a hardware fault diagnosis module, which performs fault diagnosis on the hardware information to obtain a hardware fault diagnosis result; a software fault diagnosis module, which performs fault diagnosis on the software information to obtain a software fault diagnosis result; and a fault processing module, which obtains a comprehensive fault diagnosis result based on the hardware fault diagnosis result and the software fault diagnosis result. The present application monitors and diagnoses software faults and hardware faults at the same time, which improves on the accuracy achievable with a single fault diagnosis approach, improves the stability of server system operation, and improves the availability of the system.
Description
Technical Field
The present application relates to the computer field, and in particular to a server software and hardware information diagnosis system and method.
Background Art
Servers are now widely used across many areas of society, mainly to handle critical business workloads. Loss of system data or an unexpected server shutdown can have serious consequences for the affected field. Server availability is therefore subject to demanding requirements, and high availability depends on efficient fault monitoring, fault diagnosis, fault recovery and related techniques.
At present, efforts to achieve efficient fault monitoring and diagnosis focus mainly on the design of fault diagnosis methods or algorithms. However, a method or algorithm designed in isolation does not take the hardware system into account. When a single server suffers a system failure, hardware faults are easily overlooked, which affects the final diagnosis result and reduces its accuracy.
Summary of the Invention
The purpose of this application is to provide a server software and hardware information diagnosis system and method that solve the technical problem in the prior art whereby only the fault diagnosis method or algorithm is designed and developed, while the impact of hardware faults on the final diagnosis result is ignored, leading to poor diagnostic accuracy. By monitoring and diagnosing software faults and hardware faults at the same time, the application improves on the accuracy achievable with a single fault diagnosis approach and improves the stability of server system operation.
The present application provides a server software and hardware information diagnosis system, including: a hardware information monitoring module, which monitors the hardware operating status in real time to obtain hardware information; a software information monitoring module, which monitors the service running status and system resource usage status in real time to obtain software information; a hardware fault diagnosis module, which performs fault diagnosis on the hardware information to obtain a hardware fault diagnosis result; a software fault diagnosis module, which performs fault diagnosis on the software information to obtain a software fault diagnosis result; and a fault processing module, which obtains a comprehensive fault diagnosis result based on the hardware fault diagnosis result and the software fault diagnosis result.
According to a server software and hardware information diagnosis system provided by the present application, the hardware information includes the central processing unit temperature and the memory status; the hardware information monitoring module includes a baseboard management control unit, an I2C switch and a sensor, wherein: the baseboard management control unit monitors the temperature of the central processing unit over a temperature monitoring bus to obtain the central processing unit temperature, and sends the central processing unit temperature to the hardware fault diagnosis module; the baseboard management control unit is connected to the dual in-line memory module to monitor the state of the dual in-line memory module, obtains the memory status, and sends the memory status to the hardware fault diagnosis module.
According to a server software and hardware information diagnosis system provided by the present application, the hardware information includes the mainboard temperature; the hardware information monitoring module also includes a complex programmable logic device, and the complex programmable logic device includes an I2C slave device module; the I2C switch is connected to the baseboard management control unit and communicates with the complex programmable logic device through the I2C slave device module; the I2C switch receives the mainboard temperature sent by the sensor over the I2C bus and, based on a preset rule, sends the mainboard temperature to the I2C slave device module of the complex programmable logic device; the complex programmable logic device compares the received mainboard temperature with a preset mainboard temperature threshold and, when the mainboard temperature exceeds the preset mainboard temperature threshold, sends the corresponding mainboard temperature to the I2C switch through the I2C slave device module, and the I2C switch transmits the mainboard temperature to the baseboard management control unit; the baseboard management control unit sends the received mainboard temperature to the hardware fault diagnosis module.
According to a server software and hardware information diagnosis system provided by the present application, the hardware information also includes voltage information, and the hardware information monitoring module also includes a complex programmable logic device, which includes an analog-to-digital conversion module, a sequence module and a serial peripheral interface module, wherein: the sequence module is connected to the voltage regulator and to the analog-to-digital conversion module, controls the operation of the voltage regulator according to preset logic and timing, and sends an activation instruction to the analog-to-digital conversion module; the analog-to-digital conversion module is connected to the voltage regulator and, based on the activation instruction, samples and converts the voltage signal output by the voltage regulator to obtain voltage information, and sends the voltage information to the complex programmable logic device; the complex programmable logic device determines, based on preset programming logic, whether the voltage information is greater than the preset voltage and, if so, sends the voltage information to the hardware fault diagnosis module through the serial peripheral interface module.
According to a server software and hardware information diagnosis system provided by the present application, the hardware information also includes an enable signal and a first power normal signal; at startup, the sequence module sends the enable signal to the voltage regulator according to preset logic and timing control, and determines whether it receives the first power normal signal that the voltage regulator returns in response to the enable signal. If the signal is not received, the sequence module reports the corresponding fault information to the hardware fault diagnosis module through the serial peripheral interface module.
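The enable and power-normal handshake described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the source does not say how long the sequence module waits, so the timeout, the polling interval, the callback names `assert_enable` and `read_power_good`, and the fault code `"PGOOD_TIMEOUT"` are all hypothetical.

```python
import time
from typing import Callable, Optional


def check_power_good(
    assert_enable: Callable[[], None],
    read_power_good: Callable[[], bool],
    timeout_s: float = 0.1,   # assumed wait budget, not from the source
    poll_s: float = 0.005,    # assumed polling interval, not from the source
) -> Optional[str]:
    """Assert the enable signal, then wait for the power-normal signal.

    Returns None when the signal arrives in time, otherwise a
    hypothetical fault code to report to the hardware fault diagnosis module.
    """
    assert_enable()
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if read_power_good():
            return None
        time.sleep(poll_s)
    return "PGOOD_TIMEOUT"
```

In real hardware this logic would live in the CPLD's sequencing state machine; the Python form only shows the decision being made.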
According to a server software and hardware information diagnosis system provided by the present application, the hardware information also includes at least one of a memory overheat signal, an overheat protection signal, a throttling signal, a second power normal signal and a reset signal, and the complex programmable logic device is connected to the central processing unit, the dual in-line memory module and the power module through preset interfaces, wherein: the complex programmable logic device monitors the memory temperature of the dual in-line memory module and, when the memory temperature exceeds the preset memory temperature threshold, generates a memory overheat signal and sends it to the hardware fault diagnosis module through the serial peripheral interface module; the complex programmable logic device monitors the mainboard temperature and, when the mainboard temperature exceeds the preset overheat threshold, generates an overheat protection signal and sends it to the hardware fault diagnosis module through the serial peripheral interface module; the complex programmable logic device monitors the load and temperature of the dual in-line memory module, the load and temperature of the central processing unit, and the current of the power module, generates a throttling signal according to a preset strategy, and sends it to the hardware fault diagnosis module through the serial peripheral interface module; the complex programmable logic device monitors the output voltage and output current of the power module and determines whether they are within the preset range and, if so, generates a second power normal signal and sends it to the hardware fault diagnosis module through the serial peripheral interface module; the complex programmable logic device monitors the power module and determines whether the power module reaches a stable output within a preset time and, if not, generates a reset signal and sends it to the hardware fault diagnosis module through the serial peripheral interface module.
According to a server software and hardware information diagnosis system provided by the present application, the hardware information also includes fan speed, and the complex programmable logic device also includes a fan module, wherein: the fan module monitors the fan speed and determines whether the monitored fan speed is abnormal. If it is abnormal, the fan speed is sent to the hardware fault diagnosis module through the serial peripheral interface module.
According to a server software and hardware information diagnosis system provided by the present application, the software information includes at least one of central processing unit utilization, memory utilization, network traffic, core system service status, status of pre-selected application services, and status of preset processes to be monitored; based on the software information to be monitored, the software information monitoring module calls the corresponding kernel module access interface to access the kernel space, collects the running status of the corresponding software, and monitors it to obtain the corresponding software information; wherein the software information monitoring module is obtained by using loadable kernel module technology to compile and load a software information collection module that collects the corresponding software information.
According to a server software and hardware information diagnosis system provided by the present application, the hardware fault diagnosis module includes a control unit, which rates faults according to preset hardware fault rating rules based on the hardware information obtained by the hardware information monitoring module, to obtain a hardware fault diagnosis result; the software fault diagnosis module compares the software information obtained by the software information monitoring module with preset status thresholds, and rates the comparison result according to preset software fault rating rules, to obtain a software fault diagnosis result.
According to a server software and hardware information diagnosis system provided by the present application, the hardware information includes at least one of the central processing unit temperature, memory status, mainboard temperature and voltage information; the control unit determines, from the received hardware information, the extent to which the corresponding hardware information exceeds its rated threshold, and determines the corresponding hardware fault diagnosis result in combination with the preset hardware fault rating rules; wherein the preset hardware fault rating rules assign fault levels according to how far the hardware information exceeds the rated value.
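A rating rule of this kind, where the fault level grows with the amount by which a reading exceeds its rated value, might look like the following Python sketch. The source does not give the actual rule, so the relative-excess bands in `LEVEL_BOUNDS`, the four-level scale, and the function name `rate_hardware` are assumptions made for illustration.

```python
from bisect import bisect_right

# Hypothetical bands: relative excess over the rated value that separates
# level 1 from level 2 (5 %) and level 2 from level 3 (15 %).
LEVEL_BOUNDS = (0.05, 0.15)


def rate_hardware(value: float, rated: float) -> int:
    """Return a fault level from 0 (within rating) to 3 (most severe).

    Assumes rated > 0 and that larger readings are worse, as with
    temperature or over-voltage.
    """
    if value <= rated:
        return 0
    excess = (value - rated) / rated
    # Each bound that the excess reaches raises the level by one.
    return 1 + bisect_right(LEVEL_BOUNDS, excess)
```

Keeping the bands in a sorted tuple means a deployment could tune them per sensor without touching the rating logic.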
According to a server software and hardware information diagnosis system provided by the present application, the hardware information includes at least one of an enable signal, a first power normal signal, a memory overheat signal, an overheat protection signal, a throttling signal, a second power normal signal and a reset signal; the control unit determines, from the received hardware information, how the signals have changed, and determines the corresponding hardware fault diagnosis result in combination with the preset hardware fault rating rules; wherein the preset hardware fault rating rules assign fault levels according to the signal changes in the hardware information.
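Rating by signal change can be sketched as a lookup on transitions between two successive snapshots of the signals. The sketch below is illustrative only: the signal names, the transitions listed in `TRANSITION_LEVELS`, and the levels assigned to them are hypothetical, since the source does not disclose the rule table.

```python
# Hypothetical rule table: (signal name, old state, new state) -> fault level.
TRANSITION_LEVELS = {
    ("power_good", True, False): 3,       # power normal signal lost
    ("memory_overheat", False, True): 2,  # memory overheat signal raised
    ("throttle", False, True): 1,         # throttling signal raised
}


def rate_signal_changes(previous: dict, current: dict) -> dict:
    """Map each signal that changed state to a fault level.

    Signals that did not change, are new in `current`, or whose
    transition is not in the rule table are omitted from the result.
    """
    levels = {}
    for name, new in current.items():
        old = previous.get(name)
        if old is None or old == new:
            continue
        level = TRANSITION_LEVELS.get((name, old, new), 0)
        if level:
            levels[name] = level
    return levels
```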
According to a server software and hardware information diagnosis system provided by the present application, obtaining the comprehensive fault diagnosis result based on the hardware fault diagnosis result and the software fault diagnosis result includes: selecting, from the hardware fault diagnosis result and the software fault diagnosis result, the fault diagnosis result with the highest fault level as the comprehensive fault diagnosis result.
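The selection step above reduces to taking the maximum over fault levels. A minimal Python sketch follows; the `Diagnosis` record and its field names are assumptions introduced for illustration, and the tie-breaking behavior (the first result with the highest level wins, so hardware results are preferred on a tie) is a choice of this sketch, not something the source specifies.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Diagnosis:
    source: str  # "hardware" or "software"
    item: str    # what was diagnosed, e.g. "cpu_temperature"
    level: int   # higher means more severe


def comprehensive_result(
    hardware: Iterable[Diagnosis], software: Iterable[Diagnosis]
) -> Optional[Diagnosis]:
    """Return the diagnosis with the highest fault level, or None if there are none."""
    candidates = [*hardware, *software]
    if not candidates:
        return None
    # max() keeps the first maximal element, so hardware wins ties.
    return max(candidates, key=lambda d: d.level)
```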
The present application also provides a server software and hardware information diagnosis method, including: monitoring the hardware operating status in real time to obtain hardware information; monitoring the service running status and system resource usage status in real time to obtain software information; performing fault diagnosis on the hardware information to obtain a hardware fault diagnosis result; performing fault diagnosis on the software information to obtain a software fault diagnosis result; and obtaining a comprehensive fault diagnosis result based on the hardware fault diagnosis result and the software fault diagnosis result.
The present application also provides a computer program product, including a computer program/instructions which, when executed by a processor, implement the steps of any of the above server software and hardware information diagnosis methods.
The present application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of any of the above server software and hardware information diagnosis methods.
The present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the above server software and hardware information diagnosis methods.
In the server software and hardware information diagnosis system and method provided by the present application, the hardware information monitoring module and the software information monitoring module monitor hardware information and software information respectively, so that the hardware fault diagnosis module can perform fault diagnosis based on the hardware information and the software fault diagnosis module can perform fault diagnosis based on the software information. Software faults and hardware faults are thus monitored and diagnosed at the same time, which improves on the accuracy achievable with a single fault diagnosis approach, improves the stability of server system operation, and improves the availability of the system.
Brief Description of the Drawings
In order to illustrate the technical solutions of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a first schematic structural diagram of the server software and hardware information diagnosis system provided by the present application;
FIG. 2 is a second schematic structural diagram of the server software and hardware information diagnosis system provided by the present application;
FIG. 3 is a third schematic structural diagram of the server software and hardware information diagnosis system provided by the present application;
FIG. 4 is a schematic flow chart of the server software and hardware information diagnosis method provided by the present application;
FIG. 5 is a schematic structural diagram of the electronic device provided by the present application.
Detailed Description
In order to make the purpose, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.
The terms "first", "second" and the like in the specification and claims of this application are used to distinguish similar objects, not to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present application can be implemented in an order other than those illustrated or described here. Objects distinguished by "first", "second" and the like are generally of one type, and their number is not limited; for example, the first object can be one or more. In addition, "and/or" in the specification and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the objects before and after it.
The method provided by the embodiments of the present application is described in detail below, with reference to the accompanying drawings, through specific embodiments and their application scenarios.
As shown in FIG. 1, an embodiment of the present application provides a server software and hardware information diagnosis system, the system comprising:
a hardware information monitoring module 11, which monitors the hardware operating status in real time to obtain hardware information;
a software information monitoring module 12, which monitors the service running status and system resource usage status in real time to obtain software information;
a hardware fault diagnosis module 13, which performs fault diagnosis on the hardware information to obtain a hardware fault diagnosis result;
a software fault diagnosis module 14, which performs fault diagnosis on the software information to obtain a software fault diagnosis result;
a fault processing module 15, which obtains a comprehensive fault diagnosis result according to the hardware fault diagnosis result and the software fault diagnosis result.
In an optional embodiment, the hardware information includes the central processing unit (CPU) temperature and the memory status. Accordingly, referring to FIG. 2, the hardware information monitoring module includes a baseboard management control unit (Baseboard Management Controller, BMC), an Inter-Integrated Circuit (I2C) serial communication bus switch, and a sensor, wherein: the baseboard management control unit monitors the temperature of the central processing unit over the temperature monitoring bus (Platform Environment Control Interface, PECI) to obtain the CPU temperature, and sends the CPU temperature to the hardware fault diagnosis module; the baseboard management control unit is connected to the dual in-line memory module (DIMM) to monitor the DIMM state, obtains the memory status, and sends the memory status to the hardware fault diagnosis module.
In an optional embodiment, the hardware information includes the mainboard temperature, and the hardware information monitoring module also includes a complex programmable logic device (Complex Programmable Logic Device, CPLD), which includes an I2C slave device module. The I2C switch is connected to the baseboard management control unit and communicates with the CPLD through the I2C slave device module. The I2C switch receives the mainboard temperature sent by the sensor over the I2C bus and, based on a preset rule, sends the mainboard temperature to the I2C slave device module of the CPLD. The CPLD compares the received mainboard temperature with a preset mainboard temperature threshold and, when the mainboard temperature exceeds the threshold, sends the corresponding mainboard temperature to the I2C switch through the I2C slave device module; the I2C switch then transmits the mainboard temperature to the baseboard management control unit. The baseboard management control unit sends the received mainboard temperature to the hardware fault diagnosis module.
In an optional embodiment, the hardware information also includes voltage information, and the complex programmable logic device also includes an analog-to-digital conversion (ADC) module, a sequence module and a serial peripheral interface (SPI) module, wherein: the sequence module is connected to the voltage regulator and to the analog-to-digital conversion module, controls the operation of the voltage regulator according to preset logic and timing, and sends an activation instruction to the analog-to-digital conversion module; the analog-to-digital conversion module is connected to the voltage regulator and, based on the activation instruction, samples and converts the voltage signal output by the voltage regulator to obtain voltage information, and sends the voltage information to the complex programmable logic device; the complex programmable logic device determines, based on preset programming logic, whether the voltage information is greater than the preset voltage and, if so, sends the voltage information to the hardware fault diagnosis module through the serial peripheral interface module.
It should be noted that the voltage regulator (VR) converts the voltage supplied to the motherboard into the voltages required by the motherboard components. The CPLD obtains the output voltage of the VR through a general-purpose input/output (GPIO) interface and, after processing by the internal ADC module, performs a logical judgment. When the voltage fluctuation exceeds the preset voltage, the CPLD sends the voltage information through the SPI module to the hardware fault diagnosis module for logging and diagnosis.
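The voltage check described above can be illustrated with a minimal sketch. The ADC resolution, the reference voltage, the preset voltage and the function names below are all assumptions chosen for illustration; the source does not specify them, and in the described system this logic runs in the CPLD rather than in software.

```python
from typing import Optional

# All three constants are assumptions for illustration; the source gives no values.
ADC_BITS = 12           # assumed ADC resolution
V_REF = 3.3             # assumed ADC reference voltage, in volts
PRESET_VOLTAGE = 1.26   # assumed preset voltage for a hypothetical 1.2 V rail


def adc_to_volts(raw_code: int, bits: int = ADC_BITS, v_ref: float = V_REF) -> float:
    """Convert a raw ADC code into volts (hypothetical helper)."""
    full_scale = (1 << bits) - 1
    if not 0 <= raw_code <= full_scale:
        raise ValueError("ADC code out of range")
    return raw_code * v_ref / full_scale


def check_voltage(raw_code: int, preset: float = PRESET_VOLTAGE) -> Optional[float]:
    """Return the voltage to report if it is greater than the preset voltage.

    Returning None means nothing is forwarded to the hardware fault
    diagnosis module, mirroring the "only report on exceedance" behavior.
    """
    volts = adc_to_volts(raw_code)
    return volts if volts > preset else None
```

For example, under these assumed constants a raw code of 1489 (about 1.20 V) is not reported, while a raw code of 1700 (about 1.37 V) is.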
In an optional embodiment, the hardware information further includes the fan speed, and the complex programmable logic device further includes a fan module. The fan module monitors the fan speed and determines whether the monitored speed is abnormal; if it is, the fan speed is sent to the hardware fault diagnosis module through the SPI module.
It should be added that determining whether the monitored fan speed is abnormal includes: determining whether the fluctuation range of the fan speed within a first target time exceeds a preset fluctuation range and, if so, determining that the speed is abnormal; and/or determining whether the difference between the fan speed and a preset speed exceeds a preset difference and, if so, determining that the speed is abnormal; and/or determining, from the fan speed within a second target time, whether the fan has stopped abnormally and, if so, determining that the speed is abnormal. In addition, the fan module sends the abnormal fan speed to the hardware fault diagnosis module over the serial peripheral interface, by way of the SPI module.
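The three fan-speed criteria can be sketched as follows. This is only an illustrative sketch: the function name, the window representation (lists of RPM samples) and the choice of "all samples are zero" as the stall test are assumptions, not details taken from the source.

```python
from typing import List, Sequence


def fan_anomalies(
    first_window: Sequence[float],   # RPM samples within the first target time
    second_window: Sequence[float],  # RPM samples within the second target time
    preset_rpm: float,               # preset (expected) speed
    max_fluctuation: float,          # preset fluctuation range
    max_difference: float,           # preset difference from the preset speed
) -> List[str]:
    """Return the list of criteria that flag the fan speed as abnormal."""
    reasons = []
    # Criterion 1: fluctuation within the first target time is too large.
    if first_window and max(first_window) - min(first_window) > max_fluctuation:
        reasons.append("fluctuation")
    # Criterion 2: the latest speed differs too much from the preset speed.
    if first_window and abs(first_window[-1] - preset_rpm) > max_difference:
        reasons.append("deviation")
    # Criterion 3 (assumed stall test): every sample in the second window is zero.
    if second_window and all(rpm == 0 for rpm in second_window):
        reasons.append("stalled")
    return reasons
```

Because the criteria are joined by "and/or" in the text, the sketch evaluates all of them and treats a non-empty result as "abnormal".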
In an optional embodiment, the hardware information further includes an enable signal (ENABLE) and a first power good signal (POWER GOOD). At startup, the sequence module sends the enable signal to the voltage regulator according to preset logic and timing, and determines whether it receives the first power good signal that the voltage regulator returns in response to the enable signal; if the signal is not received, the corresponding fault information is reported to the hardware fault diagnosis module over the serial peripheral interface. It should be noted that the sequence module exchanges signals with the voltage regulator through the GPIO interface. In addition, the enable signal is a control signal that determines whether the voltage regulator starts working or is in the active state; the first power good signal is an indication signal showing that the output voltage of the voltage regulator has reached the expected value and is stable.
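The enable and power good handshake can be sketched as below. The callables, the polling limit and the rail name are hypothetical stand-ins for the GPIO accesses and timing that the sequence module would implement in hardware; the source does not define a timeout mechanism.

```python
from typing import Callable, Dict, Optional


def power_up_rail(
    assert_enable: Callable[[], None],     # hypothetical: drive ENABLE to the regulator
    read_power_good: Callable[[], bool],   # hypothetical: sample POWER GOOD from the regulator
    max_polls: int,                        # assumed timeout, expressed as a number of polls
    rail: str = "VR0",                     # hypothetical rail name
) -> Optional[Dict[str, str]]:
    """Assert ENABLE, then wait for POWER GOOD.

    Returns None on success, or a fault record that would be reported
    to the hardware fault diagnosis module over SPI.
    """
    assert_enable()
    for _ in range(max_polls):
        if read_power_good():
            return None
    return {"rail": rail, "fault": "POWER GOOD not received after ENABLE"}
```

A caller would forward any returned fault record to the diagnosis side and otherwise continue with the next rail in the power sequence.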
In an optional embodiment, the hardware information further includes at least one of a memory overheat signal (MEMHOT), an overheat protection signal (THERMTRIP), a throttling signal (THROTTLE), a second power good signal (PWRGD) and a reset signal (RESET), and the complex programmable logic device is connected to the CPU, the DIMM and the power module through preset interfaces. The complex programmable logic device monitors the memory temperature of the DIMM; when that temperature exceeds a preset memory temperature threshold, it generates the memory overheat signal and sends it to the hardware fault diagnosis module through the SPI module. The complex programmable logic device monitors the motherboard temperature; when that temperature exceeds a preset overheat threshold, it generates the overheat protection signal and sends it to the hardware fault diagnosis module through the SPI module. The complex programmable logic device monitors the load and temperature of the DIMM, the load and temperature of the CPU, and the current of the power module, generates the throttling signal according to a preset strategy, and sends it to the hardware fault diagnosis module through the SPI module. The complex programmable logic device monitors the output voltage and output current of the power module and determines whether they fall within preset ranges; if they do, it generates the second power good signal and sends it to the hardware fault diagnosis module through the SPI module. The complex programmable logic device also monitors the power module and determines whether it reaches a stable output within a preset time; if not, it generates the reset signal and sends it to the hardware fault diagnosis module through the SPI module.
It should be noted that the overheat protection signal indicates that the motherboard temperature exceeds the preset motherboard temperature threshold and prompts protective measures, such as reducing power consumption or shutting down the device, to prevent overheating damage. The memory overheat signal indicates that the memory temperature is outside the safe range. The throttling signal is used to control the power consumption or performance of the device to prevent overheating or other abnormal conditions. The second power good signal indicates that the power supply is normal. The reset signal indicates that the device or system is reset to its initial state.
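How these status signals follow from the monitored quantities can be sketched as below. The limit values and type names are assumptions for illustration only. The throttling signal is deliberately left out because the source does not describe its preset strategy.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass
class Readings:
    dimm_temp_c: float
    board_temp_c: float
    psu_volts: float
    psu_amps: float
    psu_stable_in_time: bool  # whether the power module stabilized within the preset time


@dataclass
class Limits:
    # All defaults are assumed values; the source only calls them "preset".
    memhot_c: float = 85.0
    thermtrip_c: float = 100.0
    volts_range: Tuple[float, float] = (11.4, 12.6)
    amps_range: Tuple[float, float] = (0.0, 60.0)


def derive_signals(r: Readings, lim: Limits) -> Set[str]:
    """Return the set of signals that would be generated for these readings."""
    signals = set()
    if r.dimm_temp_c > lim.memhot_c:
        signals.add("MEMHOT")
    if r.board_temp_c > lim.thermtrip_c:
        signals.add("THERMTRIP")
    volts_ok = lim.volts_range[0] <= r.psu_volts <= lim.volts_range[1]
    amps_ok = lim.amps_range[0] <= r.psu_amps <= lim.amps_range[1]
    if volts_ok and amps_ok:
        signals.add("PWRGD")   # output voltage and current within the preset ranges
    if not r.psu_stable_in_time:
        signals.add("RESET")   # no stable output within the preset time
    return signals
```

Under these assumed limits, a healthy board yields only `PWRGD`, while an overheated board with an unstable supply yields `MEMHOT`, `THERMTRIP` and `RESET`.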
It should be added that monitoring the memory temperature of the DIMM by the complex programmable logic device further includes: after the baseboard management control unit monitors the DIMM and obtains the memory temperature, the memory temperature is sent to the CPLD through the I2C switch and the I2C slave device module. The DIMM load is monitored by the complex programmable logic device in the same way as the DIMM memory temperature, so the description is not repeated here.
Similarly, monitoring the CPU temperature by the complex programmable logic device includes: after the baseboard management control unit monitors the CPU and obtains the CPU temperature, the CPU temperature is sent to the CPLD through the I2C switch and the I2C slave device module. The CPU load is monitored in the same way as the CPU temperature, so the description is not repeated here.
In addition, the complex programmable logic device can monitor the motherboard temperature as described above: the I2C switch receives the motherboard temperature sent by the sensor over the I2C bus and, based on the preset rule, forwards it to the I2C slave device module of the complex programmable logic device. The description is not repeated here.
Furthermore, the preset interfaces may be GPIO interfaces.
In an optional embodiment, the software information includes at least one of CPU usage, memory usage, network traffic, core system service status, the status of preselected application services and the status of preset processes to be monitored. Referring to FIG. 3, the software information monitoring module, based on the software information to be monitored, calls the corresponding kernel module access interface to access the kernel space, collects and monitors the corresponding software running status, and obtains the corresponding software information. The software information monitoring module is obtained by compiling and loading, using loadable kernel module (LKM) technology, the software information collection module that collects the corresponding software information.
For example, when the software information is central processing unit (CPU) usage and/or memory usage, the corresponding software information collection module may be a performance monitoring tool, such as a task manager or a performance monitor (perfmon). When the software information is network traffic, the collection module may be a resource monitor. When the software information is the core system service status, the collection module may be a service manager. When the software information is the status of a preselected application service, the collection module may be a management module or status page provided by that application service. When the software information is the status of a preset process to be monitored, the collection module may be a command-line tool of the corresponding system, such as ps (Linux) or tasklist (Windows). It should be noted that the specific software information collection module depends on the software system actually involved and on the specific software information, and is not further limited here.
It should be noted that LKM is a technology for dynamically extending kernel functionality in Linux systems. With LKM programming, only the relevant modules need to be compiled, rather than the entire kernel. This approach allows system information to be monitored more comprehensively.
In an optional embodiment, the hardware fault diagnosis module includes a control unit. The control unit rates faults according to preset hardware fault rating rules, based on the hardware information obtained by the hardware information monitoring module, to obtain the hardware fault diagnosis result. It should be added that the control unit may be a microprogrammed control unit (MCU).
Furthermore, the hardware information includes at least one of CPU temperature, memory status, motherboard temperature and voltage information. Based on the received hardware information, the control unit determines by how much the corresponding hardware information exceeds its rated threshold and, in combination with the preset hardware fault rating rules, determines the corresponding hardware fault diagnosis result. The preset hardware fault rating rules assign a fault level according to how far the hardware information exceeds the rated value.
For example, when the hardware information includes voltage information and the voltage exceeds the rated value by 20%, it is determined that the voltage of the voltage regulator and its line is unstable, and the fault level may be a minor fault. As another example, when the hardware information is the fan speed and the fan speed exceeds the rated value by 30%, the fault is located as abnormal fan heat dissipation and the fault level may be a minor fault; if the fan has stopped abnormally, the fault is likewise located as abnormal fan heat dissipation, and the fault level is a severe fault. The specific fault levels are determined by the preset hardware fault rating rules of the actual design and are not further limited here.
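The rating examples above can be expressed as a small rule table. The 20% and 30% figures come from the examples in the text; the function name, the level names "normal", "minor" and "severe", and the reading of "exceeds by N%" as "more than N% above the rated value" are assumptions made for this sketch.

```python
# Percentage over the rated value at which a minor fault is raised,
# taken from the two examples in the text.
MINOR_EXCESS_PCT = {"voltage": 20.0, "fan_speed": 30.0}


def rate_hardware(kind: str, measured: float, rated: float, stalled: bool = False) -> str:
    """Return an assumed fault level: "normal", "minor" or "severe"."""
    # An abnormally stopped fan is rated as a severe fault in the example.
    if kind == "fan_speed" and stalled:
        return "severe"
    excess_pct = (measured - rated) / rated * 100.0
    return "minor" if excess_pct > MINOR_EXCESS_PCT[kind] else "normal"
```

For instance, 15.0 V on a 12.0 V rail is 25% over the rated value and rates as a minor fault, whereas 12.5 V rates as normal.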
In addition, the hardware information includes at least one of the enable signal, the first power good signal, the memory overheat signal, the overheat protection signal, the throttling signal, the second power good signal and the reset signal. Based on the received hardware information, the control unit determines how the signals have changed and, in combination with the preset hardware fault rating rules, determines the corresponding hardware fault diagnosis result. Here the preset hardware fault rating rules assign a fault level according to the signal changes in the hardware information.
For example, when the hardware information includes the enable signal and the first power good signal, if the enable signal sent by the sequence module to the voltage regulator is present but the first power good signal received by the sequence module changes from high to low, the fault is located as an abnormal power loss in that voltage regulator and its line, and the fault level may be set to a severe fault. The specific level is determined by the preset hardware fault rating rules of the actual design and is not further limited here.
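The signal-change rule in this example amounts to detecting a falling edge on the power good signal while the enable signal is still asserted. A minimal sketch follows; the sampled-pair representation and the function name are assumptions for illustration.

```python
from typing import Iterable, Optional, Tuple


def abnormal_power_loss(samples: Iterable[Tuple[bool, bool]]) -> bool:
    """Detect POWER GOOD going high-to-low while ENABLE is still asserted.

    Each sample is an (enable, power_good) pair in time order.
    """
    prev_pg: Optional[bool] = None
    for enable, power_good in samples:
        if enable and prev_pg is True and not power_good:
            return True
        prev_pg = power_good
    return False
```

A drop of the power good signal after the enable signal has been removed is treated as a normal shutdown and is not flagged.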
In an optional embodiment, the hardware fault diagnosis module further includes a storage unit, which stores the hardware information obtained by the hardware information monitoring module.
Furthermore, the storage unit may be a flash memory card (TF card), and information is transferred between the flash memory card and the control unit through a data transmission interface (SDMMC).
In an optional embodiment, the software fault diagnosis module compares the software information obtained by the software information monitoring module with preset status thresholds, and rates the comparison result according to preset software fault rating rules to obtain the software fault diagnosis result. It should be noted that software fault diagnosis can follow the hardware fault diagnosis approach described above: the monitored software information is compared with the corresponding set thresholds so that, when the software running status exceeds a threshold, the fault is located and diagnosed.
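A threshold comparison of this kind might look like the following sketch. The metric names, the two-tier thresholds and the level names are assumptions; the source only states that software information is compared with preset status thresholds and then rated.

```python
from typing import Dict, Tuple

# Assumed (minor, severe) thresholds per metric; the source gives no values.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "cpu_usage_pct": (85.0, 95.0),
    "memory_usage_pct": (80.0, 95.0),
}


def diagnose_software(
    metrics: Dict[str, float],
    thresholds: Dict[str, Tuple[float, float]] = THRESHOLDS,
) -> Dict[str, str]:
    """Rate each monitored metric as "normal", "minor" or "severe"."""
    result = {}
    for name, value in metrics.items():
        minor, severe = thresholds[name]
        if value > severe:
            result[name] = "severe"
        elif value > minor:
            result[name] = "minor"
        else:
            result[name] = "normal"
    return result
```

The per-metric keys double as the fault location, since each rated entry names the software item whose status exceeded its threshold.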
In an optional embodiment, obtaining the comprehensive fault diagnosis result from the hardware fault diagnosis result and the software fault diagnosis result includes: selecting, from the hardware fault diagnosis result and the software fault diagnosis result, the fault diagnosis result with the highest fault level as the comprehensive fault diagnosis result.
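Selecting the highest-level result can be sketched in a few lines. The record layout and the ordering of level names are assumptions for illustration; the source does not say how ties between equal levels are resolved, and this sketch simply keeps the first one encountered.

```python
from typing import Dict, List, Optional

# Assumed ordering of fault levels, lowest to highest.
LEVEL_ORDER = {"normal": 0, "minor": 1, "severe": 2}


def comprehensive_result(
    hardware_results: List[Dict[str, str]],
    software_results: List[Dict[str, str]],
) -> Optional[Dict[str, str]]:
    """Return the diagnosis record with the highest fault level, or None if there are none."""
    candidates = list(hardware_results) + list(software_results)
    return max(candidates, key=lambda r: LEVEL_ORDER[r["level"]], default=None)
```

For example, a minor fan fault on the hardware side combined with a severe CPU usage fault on the software side yields the software record as the comprehensive result.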
In summary, the embodiments of the present invention monitor hardware information and software information through the hardware information monitoring module and the software information monitoring module respectively, so that the hardware fault diagnosis module can diagnose faults based on the hardware information and the software fault diagnosis module can diagnose faults based on the software information. Software faults and hardware faults are thus monitored and diagnosed at the same time, which improves the accuracy of the diagnosis result compared with a single fault diagnosis approach, makes the server system run more stably, and improves system availability.
The server software and hardware information diagnosis method provided by the present application is described below. The method described below and the server software and hardware information diagnosis system described above may be read with reference to each other.
FIG. 4 is a schematic flowchart of the server software and hardware information diagnosis method provided in an embodiment of the present application. As shown in FIG. 4, the method specifically includes:
S41, monitoring the hardware running status in real time to obtain hardware information;
S42, monitoring the service running status and the system resource usage status in real time to obtain software information;
S43, performing fault diagnosis on the hardware information to obtain a hardware fault diagnosis result;
S44, performing fault diagnosis on the software information to obtain a software fault diagnosis result;
S45, obtaining a comprehensive fault diagnosis result from the hardware fault diagnosis result and the software fault diagnosis result.
It should be noted that the step numbers "S4N" in this specification do not indicate the order in which the steps of the server software and hardware information diagnosis method are performed. The method of the present invention is described in detail below.
Step S41: monitor the hardware running status in real time to obtain hardware information.
In this embodiment, the hardware information includes the central processing unit (CPU) temperature and the memory status. Real-time monitoring of the hardware running status can be implemented by the hardware information monitoring module of the server software and hardware information diagnosis system, which includes a baseboard management control unit (Baseboard Management Controller, BMC), an I2C switch and a sensor. Monitoring the hardware running status in real time to obtain hardware information includes: monitoring the temperature of the CPU through the baseboard management control unit to obtain the CPU temperature; and monitoring the status of the dual in-line memory module (DIMM) through the baseboard management control unit to obtain the memory status.
In an optional embodiment, the hardware information includes the motherboard temperature, and the hardware information monitoring module further includes a complex programmable logic device (CPLD), which includes an I2C slave device module. Monitoring the hardware running status in real time to obtain hardware information further includes: collecting the motherboard temperature through the sensor; forwarding the collected motherboard temperature through the I2C switch based on a preset rule; comparing, in the complex programmable logic device, the received motherboard temperature with a preset motherboard temperature threshold and, when the temperature exceeds the threshold, sending that temperature through the I2C slave device module to the I2C switch, which transmits it to the baseboard management control unit; and sending the received motherboard temperature from the baseboard management control unit to the hardware fault diagnosis module.
In an optional embodiment, the hardware information further includes voltage information, and the complex programmable logic device further includes an analog-to-digital conversion (ADC) module, a sequence module and a serial peripheral interface (SPI) module. Monitoring the hardware running status in real time to obtain hardware information further includes: controlling, through the sequence module, the operation of the voltage regulator according to preset logic and timing, and sending an activation instruction to the ADC module; sampling and converting, through the ADC module and in response to the activation instruction, the voltage signal output by the voltage regulator to obtain the voltage information, and sending the voltage information to the complex programmable logic device; and determining, through the complex programmable logic device and based on preset programming logic, whether the voltage information is greater than a preset voltage and, if so, sending the voltage information to the hardware fault diagnosis module through the SPI module.
In an optional embodiment, the hardware information further includes the fan speed, and the complex programmable logic device further includes a fan module. Monitoring the hardware running status in real time to obtain hardware information further includes: monitoring the fan speed with the fan module and determining whether the monitored speed is abnormal; if it is, sending the fan speed to the hardware fault diagnosis module through the SPI module.
Further, determining whether the monitored fan speed is abnormal includes: determining whether the fluctuation range of the fan speed within a first target time exceeds a preset fluctuation range and, if so, determining that the speed is abnormal; and/or determining whether the difference between the fan speed and a preset speed exceeds a preset difference and, if so, determining that the speed is abnormal; and/or determining, from the fan speed within a second target time, whether the fan has stopped abnormally and, if so, determining that the speed is abnormal. In addition, the fan module sends the abnormal fan speed to the hardware fault diagnosis module over the serial peripheral interface, by way of the SPI module.
In an optional embodiment, the hardware information further includes the enable signal and the first power good signal. Monitoring the hardware running status in real time to obtain hardware information further includes: at startup, sending the enable signal from the sequence module to the voltage regulator according to preset logic and timing, and determining whether the first power good signal returned by the voltage regulator in response to the enable signal is received; if it is not received, reporting the corresponding fault information to the hardware fault diagnosis module through the SPI module.
In an optional embodiment, the hardware information further includes at least one of the memory overheat signal, the overheat protection signal, the throttling signal, the second power good signal and the reset signal, and the complex programmable logic device is connected to the CPU, the DIMM and the power module through preset interfaces. Monitoring the hardware running status in real time to obtain hardware information further includes: monitoring the memory temperature of the DIMM with the complex programmable logic device and, when that temperature exceeds a preset memory temperature threshold, generating the memory overheat signal and sending it to the hardware fault diagnosis module through the SPI module; monitoring the motherboard temperature with the complex programmable logic device and, when that temperature exceeds a preset overheat threshold, generating the overheat protection signal and sending it to the hardware fault diagnosis module through the SPI module; monitoring the load and temperature of the DIMM, the load and temperature of the CPU, and the current of the power module with the complex programmable logic device, generating the throttling signal according to a preset strategy, and sending it to the hardware fault diagnosis module through the SPI module; monitoring the output voltage and output current of the power module with the complex programmable logic device and determining whether they fall within preset ranges and, if they do, generating the second power good signal and sending it to the hardware fault diagnosis module through the SPI module; and monitoring the power module with the complex programmable logic device and determining whether it reaches a stable output within a preset time and, if not, generating the reset signal and sending it to the hardware fault diagnosis module through the SPI module.
It should be added that monitoring the memory temperature of the DIMM with the complex programmable logic device further includes: after the baseboard management control unit monitors the DIMM and obtains the memory temperature, sending the memory temperature to the CPLD through the I2C switch and the I2C slave device module. The DIMM load is monitored in the same way as the DIMM memory temperature, so the description is not repeated here.
Similarly, monitoring the CPU temperature with the complex programmable logic device includes: after the baseboard management control unit monitors the CPU and obtains the CPU temperature, sending the CPU temperature to the CPLD through the I2C switch and the I2C slave device module. The CPU load is monitored in the same way as the CPU temperature, so the description is not repeated here.
In addition, the motherboard temperature can be monitored with the complex programmable logic device as described above: the I2C switch receives the motherboard temperature sent by the sensor over the I2C bus and, based on the preset rule, forwards it to the I2C slave device module of the complex programmable logic device. The description is not repeated here.
Step S42: monitor the service running status and the system resource usage status in real time to obtain software information.
In an optional embodiment, the software information includes at least one of CPU usage, memory usage, network traffic, core system service status, the status of preselected application services and the status of preset processes to be monitored. It should be noted that the real-time monitoring of the service running status and the system resource usage status can be implemented by the software information monitoring module of the server software and hardware information diagnosis system.
In this embodiment, monitoring the service running status and the system resource usage status in real time to obtain software information includes: using the software information monitoring module, based on the software information to be monitored, to call the corresponding kernel module access interface to access the kernel space, collect and monitor the corresponding software running status, and obtain the corresponding software information. The software information monitoring module is obtained by compiling and loading, using loadable kernel module technology, the software information collection module that collects the corresponding software information.
Step S43: perform fault diagnosis on the hardware information to obtain a hardware fault diagnosis result.
It should be noted that the fault diagnosis of the hardware information can be implemented by the hardware fault diagnosis module of the server software and hardware information diagnosis system.
In an optional embodiment, the hardware fault diagnosis module includes a control unit, and performing fault diagnosis on the hardware information to obtain a hardware fault diagnosis result includes: using the control unit to rate faults, according to preset hardware fault rating rules, based on the hardware information obtained by the hardware information monitoring module, thereby obtaining the hardware fault diagnosis result.
Further, the hardware information includes at least one of central processing unit (CPU) temperature, memory status, motherboard temperature, and voltage information. Using the control unit to rate faults according to the preset hardware fault rating rules, based on the hardware information obtained by the hardware information monitoring module, includes: using the control unit to determine, from the received hardware information, the extent to which the corresponding hardware information exceeds its rated threshold, and to determine the corresponding hardware fault diagnosis result in combination with the preset hardware fault rating rules. Here, the preset hardware fault rating rules assign fault levels according to how far the hardware information exceeds the rated value.
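The threshold-based rating described above can be sketched as follows. The rated threshold, the band boundaries, and the names `RatingRule`, `rate_excess`, and `CPU_TEMP_RULE` are all hypothetical; the patent does not specify concrete values or a data layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RatingRule:
    rated_max: float  # rated threshold for the reading (hypothetical value)
    bands: tuple      # ascending (excess_upper_bound, fault_level) pairs


def rate_excess(value: float, rule: RatingRule) -> int:
    """Return 0 within the rated threshold, else the level of the band
    that the amount of excess falls into."""
    excess = value - rule.rated_max
    if excess <= 0:
        return 0
    for upper, level in rule.bands:
        if excess <= upper:
            return level
    return rule.bands[-1][1]  # beyond the last band: use the highest level


# Hypothetical rule: CPU rated at 85 degrees C; up to 5 over is level 1,
# up to 15 over is level 2, anything beyond that is level 3.
CPU_TEMP_RULE = RatingRule(
    rated_max=85.0,
    bands=((5.0, 1), (15.0, 2), (float("inf"), 3)),
)
```

For example, under these assumed bands `rate_excess(95.0, CPU_TEMP_RULE)` rates a reading that is 10 degrees over the threshold as level 2.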
In addition, the hardware information includes at least one of an enable signal, a first power-good signal, a memory overheat signal, an overheat protection signal, a throttling signal, a second power-good signal, and a reset signal. Using the control unit to rate faults according to the preset hardware fault rating rules, based on the hardware information obtained by the hardware information monitoring module, further includes: using the control unit to determine, from the received hardware information, how the signals have changed, and to determine the corresponding hardware fault diagnosis result in combination with the preset hardware fault rating rules. Here, the preset hardware fault rating rules assign fault levels according to the signal changes in the hardware information.
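A minimal sketch of rating by signal change, assuming each signal is sampled as 0 or 1 and that rules are keyed by a specific transition. The signal names, polarities, and levels in `SIGNAL_RULES` are illustrative assumptions, not values taken from the patent.

```python
# Hypothetical rules: (signal name, old value, new value) -> fault level.
SIGNAL_RULES = {
    ("power_good_1", 1, 0): 3,      # first power-good signal deasserted
    ("power_good_2", 1, 0): 3,      # second power-good signal deasserted
    ("overheat_protect", 0, 1): 3,  # overheat protection asserted
    ("mem_overheat", 0, 1): 2,      # memory overheat asserted
    ("throttle", 0, 1): 1,          # throttling asserted
}


def rate_signal_changes(previous: dict, current: dict) -> dict:
    """Compare two signal snapshots and return {signal: fault level}
    for every transition that matches a rating rule."""
    faults = {}
    for name, new in current.items():
        old = previous.get(name, new)  # unseen signals count as unchanged
        if old != new:
            level = SIGNAL_RULES.get((name, old, new), 0)
            if level:
                faults[name] = level
    return faults
```

Keying the rules on the full transition rather than on the current value alone reflects the text's emphasis on signal changes: a signal returning to its normal state produces no fault entry.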
In an optional embodiment, the hardware fault diagnosis module further includes a storage unit, and after the hardware operating status is monitored in real time to obtain the hardware information, the method includes: sending the hardware information to the storage unit for storage.
Step S44: perform fault diagnosis on the software information to obtain a software fault diagnosis result.
In an optional embodiment, performing fault diagnosis on the software information to obtain a software fault diagnosis result includes: using the software fault diagnosis module to compare the software information monitored by the software information monitoring module with preset state thresholds, and rating the comparison result according to preset software fault rating rules to obtain the software fault diagnosis result. It should be noted that software fault diagnosis can follow the hardware fault diagnosis approach described above: the monitored software information is compared with the corresponding set thresholds, so that when the software running status exceeds a threshold, the fault can be located and diagnosed.
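The comparison against preset state thresholds might look like the following sketch. The threshold values, the "active" service state, the level assignments, and the names `SOFTWARE_THRESHOLDS` and `diagnose_software` are assumptions made for illustration only.

```python
# Hypothetical preset state thresholds, in percent.
SOFTWARE_THRESHOLDS = {"cpu_usage": 90.0, "memory_usage": 85.0}


def diagnose_software(metrics: dict, services: dict) -> list:
    """Return a sorted list of (item, fault level) for every monitored item
    that violates its preset state.

    metrics:  numeric readings, e.g. {"cpu_usage": 97.0}
    services: service or process states, e.g. {"nginx": "failed"}
    """
    findings = []
    for name, value in metrics.items():
        limit = SOFTWARE_THRESHOLDS.get(name)
        if limit is not None and value > limit:
            # Assumed rule: more than 5 points over the threshold is level 2.
            findings.append((name, 2 if value - limit > 5.0 else 1))
    for name, state in services.items():
        if state != "active":
            findings.append((name, 3))  # assumed: a stopped service is level 3
    return sorted(findings)
```

Because each finding names the metric or service that violated its threshold, the result also indicates where the fault is located, which is the localization the paragraph above refers to.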
Step S45: obtain a comprehensive fault diagnosis result based on the hardware fault diagnosis result and the software fault diagnosis result.
In this embodiment, obtaining the comprehensive fault diagnosis result based on the hardware fault diagnosis result and the software fault diagnosis result includes: selecting, from the hardware fault diagnosis result and the software fault diagnosis result, the diagnosis result with the highest fault level as the comprehensive fault diagnosis result.
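Selecting the highest-level result can be sketched in a few lines. The `Diagnosis` record and the tie-breaking behaviour (the first candidate with the top level wins, with hardware results listed first) are assumptions; the patent states only that the result with the highest fault level is chosen.

```python
from typing import Iterable, NamedTuple, Optional


class Diagnosis(NamedTuple):
    source: str  # "hardware" or "software"
    item: str    # what was diagnosed, e.g. "cpu_temp"
    level: int   # fault level; a higher number is more severe


def comprehensive_result(
    hardware: Iterable[Diagnosis], software: Iterable[Diagnosis]
) -> Optional[Diagnosis]:
    """Return the diagnosis with the highest fault level, or None if
    neither side reported anything."""
    candidates = list(hardware) + list(software)
    if not candidates:
        return None
    # max() keeps the first maximal element, so hardware wins ties here.
    return max(candidates, key=lambda d: d.level)
```

With a level 2 hardware finding and a level 3 software finding, for instance, the software finding becomes the comprehensive result.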
In summary, the embodiments of the present invention monitor hardware information and software information separately, so that fault diagnosis can be performed on the hardware monitoring information and on the software information. Software faults and hardware faults are thus monitored and diagnosed at the same time, which improves diagnostic accuracy relative to a single fault diagnosis approach, makes the server system run more stably, and improves system availability.
It should be noted that the server software and hardware information diagnosis method provided in the embodiments of the present application may be executed by the server software and hardware information diagnosis system, or by a control module within that system that is used to execute the method. The embodiments of the present application describe the method using the example of the server software and hardware information diagnosis system executing it.
In addition, in the embodiments of the present application, the server software and hardware information diagnosis method shown in each of the above method drawings is described by way of example with reference to one drawing of the embodiments. In a specific implementation, the method shown in each of those drawings may also be combined with any other combinable drawing illustrated in the above embodiments, which is not repeated here.
FIG. 5 is a schematic diagram of the physical structure of an electronic device. As shown in FIG. 5, the electronic device may include a processor 510, a communications interface 520, a memory 530, and a communication bus 540, where the processor 510, the communications interface 520, and the memory 530 communicate with one another through the communication bus 540. The processor 510 may call logic instructions in the memory 530 to execute the server software and hardware information diagnosis method, which includes: monitoring the hardware operating status in real time to obtain hardware information; monitoring the service running status and the system resource usage status in real time to obtain software information; performing fault diagnosis on the hardware information to obtain a hardware fault diagnosis result; performing fault diagnosis on the software information to obtain a software fault diagnosis result; and obtaining a comprehensive fault diagnosis result based on the hardware fault diagnosis result and the software fault diagnosis result.
In addition, when the logic instructions in the memory 530 are implemented in the form of a software functional unit and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods of the embodiments of the present application. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
In another aspect, the present application further provides a computer program product that includes a computer program stored on a computer-readable storage medium. The computer program includes program instructions, and when the program instructions are executed by a computer, the computer can execute the server software and hardware information diagnosis method provided by the methods above, which includes: monitoring the hardware operating status in real time to obtain hardware information; monitoring the service running status and the system resource usage status in real time to obtain software information; performing fault diagnosis on the hardware information to obtain a hardware fault diagnosis result; performing fault diagnosis on the software information to obtain a software fault diagnosis result; and obtaining a comprehensive fault diagnosis result based on the hardware fault diagnosis result and the software fault diagnosis result.
In yet another aspect, the present application further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the server software and hardware information diagnosis method provided above, which includes: monitoring the hardware operating status in real time to obtain hardware information; monitoring the service running status and the system resource usage status in real time to obtain software information; performing fault diagnosis on the hardware information to obtain a hardware fault diagnosis result; performing fault diagnosis on the software information to obtain a software fault diagnosis result; and obtaining a comprehensive fault diagnosis result based on the hardware fault diagnosis result and the software fault diagnosis result.
The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
From the description of the above implementations, those skilled in the art can clearly understand that each implementation may be realized by software together with a necessary general-purpose hardware platform, or, of course, by hardware. Based on this understanding, the above technical solution, in essence, or the part of it that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in those embodiments, or replace some of their technical features with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Claims (15)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411206296.2A CN118708418B (en) | 2024-08-30 | 2024-08-30 | Server software and hardware information diagnosis system and method |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN118708418A (en) | 2024-09-27 |
| CN118708418B (en) | 2024-11-15 |
Family
ID=92818614
Family Applications (1)
| Application Number | Status | Publication | Priority Date | Filing Date | Title |
|---|---|---|---|---|---|
| CN202411206296.2A | Active | CN118708418B (en) | 2024-08-30 | 2024-08-30 | Server software and hardware information diagnosis system and method |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118708418B (en) |
Patent Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106681869A (en) * | 2016-11-23 | 2017-05-17 | 北海高创电子信息孵化器有限公司 | Computer fault detection system |
| CN109581994A (en) * | 2017-09-28 | 2019-04-05 | 深圳市优必选科技有限公司 | A kind of robot fault diagnosis method, system and terminal equipment |
| CN108199922A (en) * | 2018-01-11 | 2018-06-22 | 承德石油高等专科学校 | A kind of system and method for diagnosing and repairing for the network equipment and server failure |
| CN108880901A (en) * | 2018-06-29 | 2018-11-23 | 合肥微商圈信息科技有限公司 | System and method for diagnosing and repairing network equipment and server fault |
| CN109815103A (en) * | 2019-01-29 | 2019-05-28 | 黄河水利职业技术学院 | A computer fault diagnosis system |
| CN114600088A (en) * | 2019-11-05 | 2022-06-07 | 微软技术许可有限责任公司 | Server condition monitoring system and method using baseboard management controller |
| CN111048138A (en) * | 2019-12-22 | 2020-04-21 | 北京浪潮数据技术有限公司 | Hard disk fault detection method and related device |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119669043A (en) * | 2024-12-02 | 2025-03-21 | 东软睿驰汽车技术(沈阳)有限公司 | Multithreaded software and hardware fault analysis method and device applied to lockstep core |
| CN119669043B (en) * | 2024-12-02 | 2025-09-23 | 东软睿驰汽车技术(沈阳)有限公司 | Multi-threaded software and hardware fault analysis method and device applied to lockstep core |
| CN120353329A (en) * | 2025-06-24 | 2025-07-22 | 苏州元脑智能科技有限公司 | Server power supply monitoring method and device, electronic equipment and storage medium |
Also Published As
| Publication number | Publication date |
|---|---|
| CN118708418B (en) | 2024-11-15 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN118708418A (en) | Server software and hardware information diagnosis system and method | |
| CN106339058B (en) | Method and system for dynamically managing power supply | |
| US10042583B2 (en) | Device management method, device, and device management controller | |
| CN100383748C (en) | Policy-based responses to system errors that occur during OS runtime | |
| US20040003317A1 (en) | Method and apparatus for implementing fault detection and correction in a computer system that requires high reliability and system manageability | |
| CN111880906A (en) | Virtual machine high-availability management method, system and storage medium | |
| US20090138757A1 (en) | Failure recovery method in cluster system | |
| CN112463538A (en) | Liquid leakage detection and alarm system, method, device and equipment | |
| US6584432B1 (en) | Remote diagnosis of data processing units | |
| WO2024222514A1 (en) | Functional safety system and method for functional safety system | |
| CN117453036A (en) | Method, system and device for adjusting power consumption of equipment in server | |
| US20200387428A1 (en) | Information processing system | |
| CN112015597A (en) | Fault isolation method, device, equipment and computer readable storage medium | |
| JP2003173272A (en) | Information processing system, information processing device and maintenance center | |
| CN107506281A (en) | A kind of multiple power supplies monitoring system and method | |
| CN111654401B (en) | Network segment switching method, device, terminal and storage medium of monitoring system | |
| CN116204502B (en) | NAS storage service method and system with high availability | |
| US20050071461A1 (en) | Proxy alerting | |
| CN117931581A (en) | Graphic processor monitoring method, device, medium and server monitoring system | |
| CN115333983B (en) | Heartbeat management method and node | |
| CN116795195A (en) | Main board system with multiple CPU modules, control method of main board and computing equipment | |
| CN117251319A (en) | Power failure analysis method and device, electronic equipment and readable storage medium | |
| CN113535472A (en) | cluster server | |
| CN112558740A (en) | Assembly throttling power standby equipment charging system | |
| US20250199984A1 (en) | Individual power cycle control of accelerator modules configured on a node |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |