CN112256539B - PCIE link error statistical method, device, terminal and storage medium

Info

Publication number
CN112256539B
Authority
CN
China
Prior art keywords
fatal error
error count
changes
count
pcie link
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010990038.3A
Other languages
Chinese (zh)
Other versions
CN112256539A (en)
Inventor
李长飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Metabrain Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202010990038.3A
Publication of CN112256539A
Application granted
Publication of CN112256539B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466 Performance evaluation by tracing or monitoring
    • G06F11/349 Performance evaluation by tracing or monitoring for interfaces, buses
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3027 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a bus
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466 Performance evaluation by tracing or monitoring
    • G06F11/3476 Data logging
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/42 Bus transfer protocol, e.g. handshake; Synchronisation
    • G06F13/4282 Bus transfer protocol, e.g. handshake; Synchronisation on a serial bus, e.g. I2C bus, SPI bus
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0823 Errors, e.g. transmission errors
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/16 Threshold monitoring

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Environmental & Geological Engineering (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a PCIE link error statistics method, device, terminal and storage medium. The non-fatal error count of a PCIE link is monitored in real time. When the count is detected to change continuously within a first preset duration and the number of changes exceeds a first count threshold, an alarm is issued and/or the PCIE link is disconnected. When the count is detected to undergo N segments of change within a second preset duration, where N exceeds a second segment-count threshold and the number of changes in each run of continuous changes does not exceed the first count threshold, an alarm is issued and/or the PCIE link is disconnected. The changes that fall within one first preset duration constitute one segment of change. The invention gathers statistics along two dimensions, the number of errors and the time at which they occur. When the errors meet the statistical conditions, an alarm is raised or the link is disconnected, which prevents excessive errors from causing a serious system failure and greatly improves the stability and reliability of system operation.

Figure 202010990038

Description

Method, device, terminal and storage medium for PCIE link error statistics

Technical field

The invention relates to the field of PCIE link monitoring, and in particular to a PCIE link error statistics method, device, terminal and storage medium.

Background

In recent years, as users' requirements for convergence, unification, efficiency, space and energy consumption have kept rising, PCIE (peripheral component interconnect express, a high-speed serial computer expansion bus) devices have come into wide use in servers and storage. It is therefore increasingly important to monitor the health of a PCIE link effectively and to apply a protection policy based on what is observed, so as to improve the stability and reliability of system operation. Most PCIE devices today provide error data, but how to use that data effectively to judge link health has long been a difficulty in this field, and no effective error statistics method currently exists.

Summary of the invention

To solve the above problem, the invention provides a PCIE link error statistics method, device, terminal and storage medium that count the non-fatal errors on a PCIE link in a reasonable way, so that excessive errors do not cause a serious system failure.

The technical solution of the invention is a PCIE link error statistics method comprising the following steps:

monitoring the non-fatal error count of the PCIE link in real time;

when the non-fatal error count is detected to change continuously within a first preset duration and the number of changes exceeds a first count threshold, issuing an alarm and/or disconnecting the PCIE link;

when the non-fatal error count is detected to undergo N segments of change within a second preset duration, where N exceeds a second segment-count threshold and, in each run of continuous changes, the number of changes does not exceed the first count threshold, issuing an alarm and/or disconnecting the PCIE link; the changes that fall within one first preset duration constitute one segment of change.
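The notion of a "segment" can be illustrated with a minimal Python sketch. It assumes, following the interval rule stated later in the text, that two consecutive changes belong to the same segment when the gap between them is at most the first preset duration. The function name `split_into_segments` and the sample timestamps are hypothetical and not part of the patent.

```python
from typing import List


def split_into_segments(change_times: List[float], first_duration: float) -> List[List[float]]:
    """Group the timestamps of count changes into segments.

    Assumption: a change joins the current segment when it follows the
    previous change by at most `first_duration` seconds; a larger gap
    starts a new segment.
    """
    segments: List[List[float]] = []
    for t in change_times:
        if segments and t - segments[-1][-1] <= first_duration:
            segments[-1].append(t)   # same burst of changes
        else:
            segments.append([t])     # gap too long: new segment
    return segments


# Hypothetical change times in seconds, with a 20 s first preset duration.
times = [0, 5, 12, 100, 300, 310]
segments = split_into_segments(times, first_duration=20)
print(segments)  # [[0, 5, 12], [100], [300, 310]]
```

Here the six changes form three segments: the first condition looks at the length of each segment, the second at how many segments fall inside the second preset duration.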

Further, the method also comprises the following steps:

allocating a number of object pools, the number of object pools being equal to the second segment-count threshold;

when a change in the non-fatal error count is detected, recording monitoring information in the corresponding object pool;

if the non-fatal error count changes continuously within the first preset duration, continuing to update the monitoring information in the current object pool;

if the interval between a change of the non-fatal error count and the previous change is greater than the first preset duration, moving to the next object pool to record monitoring information, so that the object pools are used cyclically in order, with later records overwriting earlier ones;

if all object pools are used within the second preset duration, this indicates that the non-fatal error count has undergone N segments of change within the second preset duration, where N exceeds the second segment-count threshold and, in each run of continuous changes, the number of changes does not exceed the first count threshold, and an alarm is issued and/or the PCIE link is disconnected.

Further, the recorded monitoring information comprises: the time at which a change in the non-fatal error count was most recently detected, the latest value of the non-fatal error count, and the number of changes of the non-fatal error count in the current segment.

Further, the non-fatal error count comprises a data link layer packet error count and a transaction layer packet error count.

The technical solution of the invention also includes a PCIE link error statistics device, comprising:

a count monitoring module, which monitors the non-fatal error count of the PCIE link in real time;

a first exception handling module, which issues an alarm and/or disconnects the PCIE link when the non-fatal error count is detected to change continuously within the first preset duration and the number of changes exceeds the first count threshold;

a second exception handling module, which issues an alarm and/or disconnects the PCIE link when the non-fatal error count is detected to undergo N segments of change within the second preset duration, where N exceeds the second segment-count threshold and, in each run of continuous changes, the number of changes does not exceed the first count threshold; the changes that fall within one first preset duration constitute one segment of change.

Further, the device also comprises:

an object pool allocation module, which allocates a number of object pools, the number of object pools being equal to the second segment-count threshold;

a monitoring information recording module, which records monitoring information in the corresponding object pool when a change in the non-fatal error count is detected; if the non-fatal error count changes continuously within the first preset duration, it continues to update the monitoring information in the current object pool; if the interval between a change of the non-fatal error count and the previous change is greater than the first preset duration, it moves to the next object pool to record monitoring information, so that the object pools are used cyclically in order, with later records overwriting earlier ones.

The second exception handling module checks whether all object pools have been used within the second preset duration. If so, this indicates that the non-fatal error count has undergone N segments of change within the second preset duration, where N exceeds the second segment-count threshold and, in each run of continuous changes, the number of changes does not exceed the first count threshold, and an alarm is issued and/or the PCIE link is disconnected.

Further, the recorded monitoring information comprises: the time at which a change in the non-fatal error count was most recently detected, the latest value of the non-fatal error count, and the number of changes of the non-fatal error count in the current segment.

Further, the non-fatal error count comprises a data link layer packet error count and a transaction layer packet error count.

The technical solution of the invention also includes a terminal, comprising:

a processor;

a memory for storing instructions executable by the processor;

wherein the processor is configured to perform the method of any of the above.

The technical solution of the invention also includes a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of any of the above.

The PCIE link error statistics method, device, terminal and storage medium provided by the invention count the non-fatal errors that occur on the link effectively and reasonably, along two dimensions: the number of errors and the time at which they occur. When the errors meet the statistical conditions, an alarm is raised or the link is disconnected, which prevents excessive errors from causing a serious system failure and greatly improves the stability and reliability of system operation. The method fills a gap in PCIE link error statistics and does not distinguish between PCIE devices, so it is widely applicable and general.

Description of the drawings

Fig. 1 is a schematic flowchart of the method of Embodiment 1 of the invention;

Fig. 2 is a schematic block diagram of the object pool architecture in a specific implementation of Embodiment 1 of the invention;

Fig. 3 is a schematic structural block diagram of Embodiment 2 of the invention.

Detailed description

The invention is described in detail below with reference to the drawings and through specific embodiments. The following embodiments explain the invention, and the invention is not limited to them.

As shown in Fig. 1, this embodiment provides a PCIE link error statistics method comprising the following steps:

S1, monitoring the non-fatal error count of the PCIE link in real time;

S2, when the non-fatal error count is detected to change continuously within the first preset duration and the number of changes exceeds the first count threshold, issuing an alarm and/or disconnecting the PCIE link;

S3, when the non-fatal error count is detected to undergo N segments of change within the second preset duration, where N exceeds the second segment-count threshold and, in each run of continuous changes, the number of changes does not exceed the first count threshold, issuing an alarm and/or disconnecting the PCIE link; the changes that fall within one first preset duration constitute one segment of change.

The method sets a first preset duration, a first count threshold, a second preset duration and a second segment-count threshold. When the count (that is, the non-fatal error count) changes, if it changes continuously within the first preset duration and the number of consecutive changes exceeds the first count threshold, an anomaly has occurred and must be handled. In addition, if more segments of change than the second segment-count threshold occur within the second preset duration, this likewise indicates an anomaly that must be handled. Note that a segment may consist of a single change or of several consecutive changes. In the second case no segment has more changes than the first count threshold, because otherwise the alarm would already have been issued and/or the PCIE link disconnected before the second preset duration was reached.
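The two conditions can be written compactly. The notation below is an assumption introduced only for illustration and does not appear in the patent; it follows the "exceeds" wording of steps S2 and S3, whereas the concrete implementation described later appears to trigger as soon as the configured values are reached.

```latex
% Assumed notation (not from the source):
%   t_1 < t_2 < \dots   times at which the non-fatal error count changes
%   T_1, T_2            first and second preset durations
%   C_1                 first count threshold
%   N_2                 second segment-count threshold
\[
S = \{t_a, \dots, t_b\}\ \text{is a segment if}\ t_{i+1} - t_i \le T_1\ (a \le i < b)
\ \text{and } S \text{ cannot be extended on either side.}
\]
\[
\text{Condition 1 (S2):}\quad |S| > C_1 \ \text{for some segment } S .
\]
\[
\text{Condition 2 (S3):}\quad
\#\{\, S : S \subseteq [\,t - T_2,\ t\,] \,\} > N_2
\quad\text{with}\quad |S| \le C_1 \ \text{for every such } S .
\]
```

Condition 1 captures a dense burst of errors, while Condition 2 captures many separate small bursts spread over a longer window.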

With the above method, the non-fatal errors (Uncorrectable Error) that occur on the link are counted reasonably and effectively, along two dimensions: the number of errors and the time at which they occur. When the errors meet the statistical conditions, an alarm is raised or the link is disconnected, so that excessive errors do not cause a serious system failure.

This embodiment stores and counts the non-fatal error monitoring information in object pools. A number of object pools is allocated first; the number equals the second segment-count threshold, so that the number of segments of change within the second preset duration can be counted. The pools are ordered by index and used cyclically with overwriting. For example, the monitoring information of the first segment is stored in the first pool and that of the second segment in the second pool. If there are M pools and no anomaly has been raised after M segments of change have been observed, the monitoring information of segment M+1 is stored in the first pool, overwriting what was there before.
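The cyclic reuse of pools amounts to modular indexing. A minimal sketch, in which the function name `pool_index` is hypothetical and pools are indexed from 0:

```python
def pool_index(segment_number: int, pool_count: int) -> int:
    """Map a 1-based segment number to a 0-based pool index, wrapping around."""
    return (segment_number - 1) % pool_count


M = 51  # number of pools in the concrete implementation described in the text
print(pool_index(1, M), pool_index(M, M), pool_index(M + 1, M))  # 0 50 0
```

Segment M+1 lands on index 0 again, which is the overwrite of the first pool described above.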

When a change in the non-fatal error count is detected, monitoring information is recorded in the corresponding object pool. The monitoring information comprises: the time at which a change in the non-fatal error count was most recently detected, the latest value of the non-fatal error count, and the number of changes of the non-fatal error count in the current segment.

If the non-fatal error count changes continuously within the first preset duration, the monitoring information in the current object pool keeps being updated. If the interval between a change of the non-fatal error count and the previous change is greater than the first preset duration, recording moves to the next object pool, so that the object pools are used cyclically in order, with later records overwriting earlier ones.

On this basis, if all object pools are used within the second preset duration, this indicates that the non-fatal error count has undergone N segments of change within the second preset duration, where N exceeds the second segment-count threshold and, in each run of continuous changes, the number of changes does not exceed the first count threshold, and an alarm is issued and/or the PCIE link is disconnected.

A specific implementation is given below to aid understanding of the solution.

As shown in Fig. 2, in this specific implementation there are 51 object pools in total, the first preset duration is 20 seconds, the second preset duration is 1 hour, the first count threshold is 4, and the second segment-count threshold is 51.

In addition, the non-fatal error count comprises the data link layer packet error count (Bad Data Link Layer Packet Count, Bad DLLP Count for short) and the transaction layer packet error count (Bad Transaction Layer Packet Count, Bad TLP Count for short).
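The text does not say how the two counters are read. As one possible source, offered purely as an assumption, Linux exposes per-device AER statistics files such as `aer_dev_correctable` under `/sys/bus/pci/devices/<device>/`, laid out as one `name value` pair per line; that file lists BadTLP and BadDLLP among correctable errors, which differs from the terminology used here. The sketch below only parses text in that layout; the function names and the sample text are hypothetical.

```python
from typing import Dict, Tuple


def parse_counters(text: str) -> Dict[str, int]:
    """Parse lines of the form '<name> <integer>' into a dictionary."""
    counters: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def bad_packet_counts(text: str) -> Tuple[int, int]:
    """Return (Bad TLP Count, Bad DLLP Count); missing entries count as 0."""
    counters = parse_counters(text)
    return counters.get("BadTLP", 0), counters.get("BadDLLP", 0)


# Hypothetical file contents, not real measurements.
sample = "RxErr 0\nBadTLP 3\nBadDLLP 1\nTOTAL_ERR_COR 4\n"
print(bad_packet_counts(sample))  # (3, 1)
```

A device that exposes the counters through vendor registers instead would need a different reader, but the rest of the statistics would be unaffected.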

The system allocates 51 statistics object pools, numbered 0 to 50. The 51 pools are used cyclically with overwriting, starting from pool 0. Each pool holds a description of the errors, consisting of: the current error statistics time (that is, the time at which a change in the non-fatal error count was most recently detected), the Bad TLP Count, the Bad DLLP Count, and the number of statistical changes (that is, the number of changes of the non-fatal error count in the segment).
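The per-pool description maps naturally onto a small record type. The following is a sketch only; the class name `PoolRecord`, its field names and the `update` method are hypothetical choices rather than anything specified in the patent.

```python
from dataclasses import dataclass


@dataclass
class PoolRecord:
    last_change_time: float = 0.0  # time of the most recently detected change
    bad_tlp_count: int = 0         # latest Bad TLP Count value
    bad_dllp_count: int = 0        # latest Bad DLLP Count value
    change_count: int = 0          # number of changes seen in this segment

    def update(self, now: float, bad_tlp: int, bad_dllp: int) -> None:
        """Record one more detected change within the current segment."""
        self.last_change_time = now
        self.bad_tlp_count = bad_tlp
        self.bad_dllp_count = bad_dllp
        self.change_count += 1


rec = PoolRecord()
rec.update(10.0, bad_tlp=1, bad_dllp=0)
rec.update(15.0, bad_tlp=1, bad_dllp=2)
print(rec)
# PoolRecord(last_change_time=15.0, bad_tlp_count=1, bad_dllp_count=2, change_count=2)
```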

The statistics process is as follows:

(1) The system repeatedly reads the non-fatal error count values on the PCIE link (Bad TLP Count and Bad DLLP Count). When either of them changes (that is, the count changes), the change is recorded.

(2) If the count keeps changing within 20 s, the record in the current object pool is updated.

(3) If the time between the current count change and the previous one is greater than 20 s, recording moves to the next object pool.

When the count changes 4 times in succession within 20 s, or 51 object pools are used within 1 hour, an anomaly has occurred, and an alarm can be issued and/or the PCIE link disconnected.
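The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class and method names are hypothetical, the first reading is treated as a baseline, the burst check fires when a segment reaches the first count threshold (matching "4 times within 20 s"), and "51 pools used within 1 hour" is interpreted as every pool holding a segment that started within the second preset duration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Segment:
    start_time: float        # time of the first change in the segment
    last_change_time: float  # time of the most recent change
    change_count: int        # number of changes in the segment


class LinkErrorMonitor:
    def __init__(self, first_duration: float = 20.0, first_threshold: int = 4,
                 second_duration: float = 3600.0, pool_count: int = 51) -> None:
        self.first_duration = first_duration
        self.first_threshold = first_threshold
        self.second_duration = second_duration
        self.pools: List[Optional[Segment]] = [None] * pool_count
        self.current = -1  # index of the pool in use; -1 means none yet
        self.last_counts: Optional[Tuple[int, int]] = None

    def sample(self, now: float, bad_tlp: int, bad_dllp: int) -> Optional[str]:
        """Feed one reading; return 'burst', 'frequent' or None."""
        counts = (bad_tlp, bad_dllp)
        if self.last_counts is None:      # first reading is only a baseline
            self.last_counts = counts
            return None
        if counts == self.last_counts:    # step (1): nothing changed
            return None
        self.last_counts = counts

        seg = self.pools[self.current] if self.current >= 0 else None
        if seg is not None and now - seg.last_change_time <= self.first_duration:
            # step (2): still the same segment, update the current pool
            seg.last_change_time = now
            seg.change_count += 1
        else:
            # step (3): gap too long, move to the next pool and overwrite it
            self.current = (self.current + 1) % len(self.pools)
            seg = Segment(now, now, 1)
            self.pools[self.current] = seg

        if seg.change_count >= self.first_threshold:
            return "burst"
        if all(p is not None and now - p.start_time <= self.second_duration
               for p in self.pools):
            return "frequent"
        return None


monitor = LinkErrorMonitor()
readings = [(0, 0, 0), (1, 1, 0), (2, 2, 0), (3, 2, 1), (4, 3, 1)]
results = [monitor.sample(t, tlp, dllp) for t, tlp, dllp in readings]
print(results)  # [None, None, None, None, 'burst']
```

The caller decides what to do with a non-None result, for example raising an alarm, disconnecting the link, or both.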

Note that statistics are collected only while the link is in its normal state; they are not collected during device insertion or removal or while the link is changing. In addition, if all 51 object pools are used up within 1 hour, cyclic overwriting can be stopped for ease of statistics; that is, monitoring information is no longer saved, and it suffices to issue an alarm and/or disconnect the PCIE link.

Embodiment 2

As shown in Fig. 3, on the basis of Embodiment 1, this embodiment provides a PCIE link error statistics device comprising the following functional modules.

Count monitoring module 101: monitors the non-fatal error count of the PCIE link in real time;

First exception handling module 102: when the non-fatal error count is detected to change continuously within the first preset duration and the number of changes exceeds the first count threshold, issues an alarm and/or disconnects the PCIE link;

Second exception handling module 103: when the non-fatal error count is detected to undergo N segments of change within the second preset duration, where N exceeds the second segment-count threshold and, in each run of continuous changes, the number of changes does not exceed the first count threshold, issues an alarm and/or disconnects the PCIE link; the changes that fall within one first preset duration constitute one segment of change;

Object pool allocation module 104: allocates a number of object pools, the number of object pools being equal to the second segment-count threshold;

Monitoring information recording module 105: when a change in the non-fatal error count is detected, records monitoring information in the corresponding object pool; if the non-fatal error count changes continuously within the first preset duration, continues to update the monitoring information in the current object pool; if the interval between a change of the non-fatal error count and the previous change is greater than the first preset duration, moves to the next object pool to record monitoring information, so that the object pools are used cyclically in order, with later records overwriting earlier ones.

The second exception handling module checks whether all object pools have been used within the second preset duration. If so, this indicates that the non-fatal error count has undergone N segments of change within the second preset duration, where N exceeds the second segment-count threshold and, in each run of continuous changes, the number of changes does not exceed the first count threshold, and an alarm is issued and/or the PCIE link is disconnected.

The recorded monitoring information comprises: the time at which a change in the non-fatal error count was most recently detected, the latest value of the non-fatal error count, and the number of changes of the non-fatal error count in the current segment. The non-fatal error count comprises the data link layer packet error count and the transaction layer packet error count.

Embodiment 3

This embodiment provides a terminal comprising a processor and a memory.

The memory stores instructions to be executed by the processor. The memory can be implemented by any type of volatile or non-volatile storage device or a combination of them, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disc. When the instructions in the memory are executed by the processor, the terminal is able to perform some or all of the steps of the method embodiment above.

The processor is the control center of the storage terminal. It connects the parts of the whole electronic terminal through various interfaces and lines, and performs the functions of the electronic terminal and/or processes data by running or executing the software programs and/or modules stored in the memory and by calling the data stored in the memory. The processor may consist of integrated circuits (IC); for example, it may be a single packaged IC, or several packaged ICs with the same or different functions connected together.

Embodiment 4

This embodiment provides a computer storage medium. The computer storage medium can store a program which, when executed, can carry out some or all of the steps of the embodiments provided by the invention. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

The above discloses only preferred embodiments of the invention, but the invention is not limited to them. Any non-inventive variation that a person skilled in the art can conceive, and any improvements and refinements made without departing from the principles of the invention, shall fall within the scope of protection of the invention.

Claims (10)

1. A PCIE link error statistics method, characterized by comprising the following steps: monitoring the non-fatal error count of a PCIE link in real time; when the non-fatal error count is detected to change continuously within a first preset duration and the number of changes exceeds a first count threshold, issuing an alarm and/or interrupting the PCIE link; when N segments of change in the non-fatal error count are detected within a second preset duration, N exceeds a second segment-number threshold, and in each continuous change the number of changes of the non-fatal error count does not exceed the first count threshold, issuing an alarm and/or interrupting the PCIE link; wherein the changes within one first preset duration constitute one segment of change.

2. The PCIE link error statistics method according to claim 1, characterized by further comprising the following steps: allocating several object pools, the number of object pools being equal to the second segment-number threshold; when a change in the non-fatal error count is detected, recording monitoring information in the corresponding object pool; if the non-fatal error count changes continuously within the first preset duration, continuing to update the monitoring information in the current object pool; if the interval between the next change in the non-fatal error count and the previous change is greater than the first preset duration, moving to the next object pool to record monitoring information, so that the object pools are reused cyclically, in order, with overwriting; if all object pools have been overwritten within the second preset duration, this indicates that N segments of change in the non-fatal error count have been detected within the second preset duration, that N exceeds the second segment-number threshold, and that in each continuous change the number of changes of the non-fatal error count does not exceed the first count threshold, whereupon an alarm is issued and/or the PCIE link is interrupted.

3. The PCIE link error statistics method according to claim 2, characterized in that the recorded monitoring information comprises: the time at which a change in the non-fatal error count was most recently detected, the latest value of the non-fatal error count, and the number of changes of the non-fatal error count in the segment.

4. The PCIE link error statistics method according to claim 1, 2 or 3, characterized in that the non-fatal error count comprises a data link layer packet error count and a transport layer packet error count.

5. A PCIE link error statistics device, characterized by comprising: a count monitoring module, which monitors the non-fatal error count of a PCIE link in real time; a first exception handling module, which issues an alarm and/or interrupts the PCIE link when the non-fatal error count is detected to change continuously within a first preset duration and the number of changes exceeds a first count threshold; and a second exception handling module, which issues an alarm and/or interrupts the PCIE link when N segments of change in the non-fatal error count are detected within a second preset duration, N exceeds a second segment-number threshold, and in each continuous change the number of changes of the non-fatal error count does not exceed the first count threshold; wherein the changes within one first preset duration constitute one segment of change.

6. The PCIE link error statistics device according to claim 5, characterized by further comprising: an object pool allocation module, which allocates several object pools, the number of object pools being equal to the second segment-number threshold; and a monitoring information recording module, which records monitoring information in the corresponding object pool when a change in the non-fatal error count is detected, continues to update the monitoring information in the current object pool if the non-fatal error count changes continuously within the first preset duration, and moves to the next object pool to record monitoring information if the interval between the next change in the non-fatal error count and the previous change is greater than the first preset duration, so that the object pools are reused cyclically, in order, with overwriting; wherein the second exception handling module monitors whether all object pools have been overwritten within the second preset duration and, if so, this indicates that N segments of change in the non-fatal error count have been detected within the second preset duration, that N exceeds the second segment-number threshold, and that in each continuous change the number of changes of the non-fatal error count does not exceed the first count threshold, whereupon an alarm is issued and/or the PCIE link is interrupted.

7. The PCIE link error statistics device according to claim 6, characterized in that the recorded monitoring information comprises: the time at which a change in the non-fatal error count was most recently detected, the latest value of the non-fatal error count, and the number of changes of the non-fatal error count in the segment.

8. The PCIE link error statistics device according to claim 5, 6 or 7, characterized in that the non-fatal error count comprises a data link layer packet error count and a transport layer packet error count.

9. A terminal, characterized by comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to perform the method according to any one of claims 1-4.

10. A computer-readable storage medium storing a computer program, characterized in that, when the program is executed by a processor, the method according to any one of claims 1-4 is implemented.
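
The detection logic of claims 1 to 3 can be sketched roughly as follows. This is an illustrative reading and not the patented implementation. The names `NonFatalErrorMonitor`, `Segment`, `observe`, `t1`, `t2`, `max_changes` and `max_segments` are all hypothetical. The sketch assumes that a "continuous change" means successive changes no more than the first preset duration apart, and that "all object pools overwritten within the second preset duration" means every slot holds a segment whose last change lies within that duration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """Monitoring record held in one object-pool slot (fields follow claim 3)."""
    last_change_time: float  # time of the most recently detected change
    latest_count: int        # latest value of the non-fatal error counter
    changes: int             # number of changes seen in this segment


class NonFatalErrorMonitor:
    """Hypothetical monitor; parameter names are assumptions, not from the patent."""

    def __init__(self, t1: float, t2: float, max_changes: int, max_segments: int):
        self.t1 = t1                    # first preset duration
        self.t2 = t2                    # second preset duration
        self.max_changes = max_changes  # first count threshold
        # One slot per segment; the slot count equals the segment threshold.
        self.slots: List[Optional[Segment]] = [None] * max_segments
        self.current = 0
        self.last_count: Optional[int] = None

    def observe(self, now: float, count: int) -> Optional[str]:
        """Feed one counter sample; return 'burst', 'recurring' or None."""
        if self.last_count is None:
            self.last_count = count  # first sample is only a baseline
            return None
        if count == self.last_count:
            return None              # counter did not change
        self.last_count = count

        seg = self.slots[self.current]
        if seg is not None and now - seg.last_change_time <= self.t1:
            # Change continues the current segment: update the same slot.
            seg.last_change_time = now
            seg.latest_count = count
            seg.changes += 1
            return "burst" if seg.changes > self.max_changes else None

        # Gap longer than t1 (or no segment yet): start a new segment,
        # moving to the next slot and overwriting it cyclically.
        if seg is not None:
            self.current = (self.current + 1) % len(self.slots)
        self.slots[self.current] = Segment(now, count, 1)

        used = [s for s in self.slots if s is not None]
        if (len(used) == len(self.slots)
                and all(now - s.last_change_time <= self.t2 for s in used)
                and all(s.changes <= self.max_changes for s in used)):
            return "recurring"
        return None
```

A caller would poll the link's counter periodically, pass each sample to `observe`, and raise an alarm or take the link down when it returns `"burst"` (first condition) or `"recurring"` (second condition). Keeping only one small record per slot means memory use is fixed by the segment threshold, regardless of how long the link is monitored.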

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010990038.3A CN112256539B (en) 2020-09-18 2020-09-18 PCIE link error statistical method, device, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN112256539A CN112256539A (en) 2021-01-22
CN112256539B true CN112256539B (en) 2022-07-19

Family

ID=74232332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010990038.3A Active CN112256539B (en) 2020-09-18 2020-09-18 PCIE link error statistical method, device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN112256539B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113448785B (en) * 2021-05-28 2023-03-28 Shandong Yingxin Computer Technology Co., Ltd. Method, device and equipment for processing bandwidth state exception and readable medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130339825A1 (en) * 2012-06-13 2013-12-19 International Business Machines Corporation External settings that reconfigure the error handling behavior of a distributed pcie switch
CN106201753A (en) * 2016-06-28 2016-12-07 Inspur (Beijing) Electronic Information Industry Co., Ltd. Method and system for handling PCIE errors in Linux
US20180095817A1 (en) * 2015-09-11 2018-04-05 Huawei Technologies Co., Ltd. Method and apparatus for disconnecting link between pcie device and host
CN110532120A (en) * 2019-07-28 2019-12-03 Suzhou Inspur Intelligent Technology Co., Ltd. Method and apparatus for monitoring PCIe uncorrectable errors in a server system

Also Published As

Publication number Publication date
CN112256539A (en) 2021-01-22

Similar Documents

Publication Publication Date Title
TWI229796B (en) Method and system to implement a system event log for system manageability
EP2510439B1 (en) Managing errors in a data processing system
CN101126995B (en) Method and apparatus for processing serious hardware error
CN111949468B (en) A dual-port disk management method, device, terminal and storage medium
CN107077408A (en) Troubleshooting method, computer system, baseboard management controller and system
CN100565470C (en) A blog management method and device
US20110138219A1 (en) Handling errors in a data processing system
CN100395722C (en) A method for storing abnormal state information of a control system
CN118689690A (en) A method and system for processing operating system memory failure
CN112256539B (en) PCIE link error statistical method, device, terminal and storage medium
CN117909109A (en) A memory error information processing method and computing device
CN115617550A (en) Processing device, control unit, electronic device, method, and computer program
CN118646640A (en) Network card fault repair method, device, baseboard management controller, system and medium
CN100470498C (en) Circuit arrangement and method for supporting and monitoring a microcontroller
CN105955864B (en) Power failure processing method, power module, monitoring management module and server
CN116015425B (en) Optical module control method and device, storage medium and electronic device
CN117950895A (en) A memory error information resetting method, computing device, and baseboard management controller
CN213122961U (en) Industrial control system and electronic equipment
CN108415788B (en) Data processing apparatus and method for responding to non-responsive processing circuitry
CN117290149B (en) Main control module reset fault location method, device, equipment, system and medium
CN110519558A (en) Video data processing method and baseboard management controller thereof
CN113075976B (en) Backup heat dissipation system, method and medium for server cluster
CN110795263A (en) Hard disk link protection method and related device
CN115766415A (en) Intelligent network card VR state monitoring device, method, terminal and storage medium
CN118626299A (en) PCIe device fault processing method, BMC, PCIe device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: Building 9, No.1, guanpu Road, Guoxiang street, Wuzhong Economic Development Zone, Wuzhong District, Suzhou City, Jiangsu Province

Patentee after: Suzhou Yuannao Intelligent Technology Co.,Ltd.

Country or region after: China

Address before: Building 9, No.1, guanpu Road, Guoxiang street, Wuzhong Economic Development Zone, Wuzhong District, Suzhou City, Jiangsu Province

Patentee before: SUZHOU LANGCHAO INTELLIGENT TECHNOLOGY Co.,Ltd.

Country or region before: China