CN119376995B - Server, computer system and multi-node management method

Info

Publication number: CN119376995B
Application number: CN202411973567.7A
Authority: CN
Inventors: 王海波; 葛志华; 赵乐森
Original assignee: Suzhou Metabrain Intelligent Technology Co Ltd
Current assignee: Suzhou Metabrain Intelligent Technology Co Ltd
Priority date: 2024-12-30
Filing date: 2024-12-30
Publication date: 2025-04-29
Anticipated expiration: 2044-12-30
Also published as: CN119376995A

Abstract

The embodiment of the application provides a server, a computer system and a multi-node management method, wherein the server comprises a processor node and a connection node, the processor node comprises a plurality of processors and a first controller, the connection node comprises a plurality of cable groups, the connection node is used for connecting reference equipment, a plurality of groups of data links used for carrying out data transmission with the reference equipment are provided for the processors through the plurality of cable groups, the first controller is used for establishing a corresponding relation between the plurality of groups of data links and the plurality of cable groups, and in the process of carrying out data transmission between the plurality of processors and the reference equipment, the target cable group with faults is detected from the plurality of cable groups according to the corresponding relation and fault states of all data links in the plurality of groups of data links. The application solves the problem of lower efficiency of locating the cable group with faults, and further achieves the effect of improving the efficiency of locating the cable group with faults.

Description

Server, computer system and multi-node management method

Technical Field

The embodiment of the application relates to the field of servers, in particular to a server, a computer system and a multi-node management method.

Background

In the related art, a cable group in a server can develop faults during use or installation, such as improper installation or breakage of a high-speed SerDes cable. Taking the Cable Tray as an example, a Cable Tray is complex, with thousands of differential pairs in a single cabinet. As the number of GPUs in a single cabinet increases dramatically, the number of differential pairs in the cabinet increases exponentially, so the number of cables in the Cable Tray, its complexity, and its cost all increase greatly, which brings great challenges to maintenance, fault location and detection of the Cable Tray.

For example, when a specific cable fault (such as a break) or a signal-quality problem must be located, conventional manual checking is slow and error-prone. As a result, the maintenance period of a cable group in a server is long and operation and maintenance efficiency is low, which affects the stability of the server.

In the related art, no effective solution has been proposed for the problem of low efficiency of locating the failed cable set.

Disclosure of Invention

The embodiment of the application provides a server, a computer system and a multi-node management method, which are used for at least solving the problem of low efficiency of locating a cable set with faults in the related technology.

According to one embodiment of the application, a server is provided, which comprises a processor node and a connection node, wherein the processor node comprises a plurality of processors and a first controller, the connection node comprises a plurality of cable groups, the connection node is used for connecting reference equipment and providing a plurality of groups of data links for data transmission with the reference equipment for the plurality of processors through the plurality of cable groups, the first controller is used for establishing a corresponding relation between the plurality of groups of data links and the plurality of cable groups, and in the process of data transmission between the plurality of processors and the reference equipment, a failed target cable group is detected from the plurality of cable groups according to the corresponding relation and the failure state of each data link in the plurality of groups of data links.

In one exemplary embodiment, the processor node comprises a first storage device for storing first firmware of the first controller, wherein the first firmware comprises a corresponding relation between the plurality of groups of data links and the plurality of cable groups, the first storage device is connected with the first controller, and the first controller is used for reading the corresponding relation from the first storage device and detecting the target cable group with faults from the plurality of cable groups according to the corresponding relation and the first data link when detecting that a first data link with faults exists in first candidate data links for indicating faults, wherein each processor in the plurality of processors is used for transmitting data to the reference device through the corresponding data link in the first candidate data links.

In an exemplary embodiment, the processor node comprises a plurality of acquisition devices and a target processor, wherein the acquisition devices are connected with the target processor, and the target processor is connected with the first controller; the acquisition devices are connected with the plurality of cable sets in a one-to-one correspondence and are used for acquiring actual link parameters of each data link in the plurality of groups of data links, the actual link parameters being used for indicating the quality of data transmission between each processor in the plurality of processors and the reference equipment through the corresponding data link in the plurality of groups of data links; and the plurality of processors are configured to, when no data link in the first candidate data links has a fault state indicating a fault, connect in a one-to-one correspondence to the first candidate acquisition devices that correspond to the first candidate data links among the plurality of acquisition devices, extract the link parameters of each data link in the first candidate data links acquired by the first candidate acquisition devices to obtain a first set of link parameters, and transmit the extracted first set of link parameters to the target processor.

In an exemplary embodiment, the target processor is configured to receive the extracted first set of link parameters transmitted by the multiple processors, detect a failure state of each data link in the first candidate data link according to the first set of link parameters, obtain a first set of failure states, and transmit detected first set of failure information to the first controller, where each link parameter in the first set of link parameters is used to indicate a quality of data transmission by each processor in the multiple processors to the reference device through a corresponding data link in the first candidate data link, and the first set of failure information includes the first set of failure states with a correspondence relationship and a link identifier of each data link in the first candidate data link.

In one exemplary embodiment, the processor node includes a first conversion device connected to a corresponding first set of processors of the plurality of processors, and a second conversion device connected to a corresponding second set of processors of the plurality of processors, the second set of processors being the processors of the plurality of processors other than the first set of processors; the first set of processors is connected to a corresponding first set of acquisition devices of the plurality of acquisition devices, the second set of processors is connected to a corresponding second set of acquisition devices of the plurality of acquisition devices, and the second set of acquisition devices is the acquisition devices of the plurality of acquisition devices other than the first set of acquisition devices; the first processor is configured to receive, through the first conversion device, a second set of link parameters of a first set of data links in the first candidate data links extracted by the first set of processors, detect, according to the second set of link parameters, a failure state of each data link in the first set of data links to obtain a second set of failure states, and transmit second set of failure information to the first controller, where each processor in the first set of processors is configured to transmit data to the reference device through a corresponding data link in the first set of data links, and the second set of failure information comprises the second set of failure states and the link identifiers of the first set of data links having a correspondence; the second processor is configured to receive, through the second conversion device, a third set of link parameters of a second set of data links in the first candidate data links extracted by the second set of processors, detect, according to the third set of link parameters, a failure state of each data link in the second set of data links to obtain a third set of failure states, and transmit third set of failure information to the first controller, where the second set of processors is configured to transmit data to the reference device through the second set of data links, and the third set of failure information comprises the third set of failure states and the link identifiers of the second set of data links having a correspondence; the first set of failure states comprises the second set of failure states and the third set of failure states, and the first set of link parameters comprises the second set of link parameters and the third set of link parameters.

In an exemplary embodiment, the first processor is configured to detect a relationship between each link parameter in the second set of link parameters and a corresponding first link parameter threshold, and, when it is detected that a first reference link parameter in the second set of link parameters is smaller than the first link parameter threshold and the difference from the first link parameter threshold is greater than or equal to a first difference threshold, determine the failure state of a first reference data link as indicating that the first reference data link has failed, where the first reference data link is the data link in the first set of data links corresponding to the first reference link parameter, and the first link parameter threshold is the link parameter of the first reference data link in the case where the first reference data link has not failed; the second processor is configured to detect a relationship between each link parameter in the third set of link parameters and a corresponding second link parameter threshold, and, when it is detected that a second reference link parameter in the third set of link parameters is smaller than the second link parameter threshold and the difference from the second link parameter threshold is greater than or equal to a second difference threshold, determine the failure state of a second reference data link as indicating that the second reference data link has failed, where the second reference data link is the data link in the second set of data links corresponding to the second reference link parameter, and the second link parameter threshold is the link parameter of the second reference data link in the case where the second reference data link has not failed.

In an exemplary embodiment, the first controller is configured to connect to a reference acquisition device corresponding to a reference processor in the plurality of acquisition devices when detecting that a fault state exists in the first candidate data link for indicating that a reference data link has failed, where the first controller is configured to extract a reference link parameter of the reference data link corresponding to the reference processor in the first candidate data link acquired by the connected reference acquisition device, and the reference processor is a processor corresponding to the reference data link in the plurality of processors.

In one exemplary embodiment, the processor node comprises a plurality of gating devices, wherein the plurality of gating devices are connected with the plurality of acquisition devices in a one-to-one correspondence, the plurality of gating devices are connected with the plurality of processors in a one-to-one correspondence, and the plurality of gating devices are connected with the first controller; the first controller is configured to, when detecting in the first candidate data links an alternative data link whose fault state indicates that no fault has occurred, send a high-level first control signal to the alternative gating device corresponding to the alternative data link among the plurality of gating devices, wherein, when the first controller sends the high-level first control signal to the alternative gating device, the alternative processor among the plurality of processors is connected, through the alternative gating device, with the corresponding alternative acquisition device among the plurality of acquisition devices.

In an exemplary embodiment, the first controller is configured to send a low-level first control signal to a reference gating device corresponding to the reference processor in the plurality of gating devices when detecting that a fault state exists in the first candidate data link for indicating the reference data link that has failed, where the first controller extracts, through the reference gating device, the reference link parameter acquired by the reference acquisition device when the first controller sends the low-level first control signal to the reference gating device.

In an exemplary embodiment, each of the plurality of gating devices includes a first pin connected to a corresponding one of the plurality of processors, a second pin connected to the first controller, a signal input terminal connected to the first controller, and a data extraction terminal connected to a corresponding one of the plurality of acquisition devices; the second pin is connected to the data extraction terminal when the first controller transmits the low-level first control signal to the signal input terminal, and the first pin is connected to the data extraction terminal when the first controller transmits the high-level first control signal to the signal input terminal.
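The gating behaviour described above can be pictured as a two-way selector. The following Python fragment is only an illustrative sketch: the class name `GatingDevice`, the `Level` enum, and the default level are assumptions made for the example and do not come from the application.

```python
from enum import Enum


class Level(Enum):
    LOW = 0
    HIGH = 1


class GatingDevice:
    """Toy model of one gating device (hypothetical names throughout).

    It sits between one processor, the first controller, and one
    acquisition device, and routes exactly one of the two pins to the
    data extraction terminal.
    """

    def __init__(self) -> None:
        # Assumption: start at HIGH so the processor reads by default.
        self.control = Level.HIGH

    def set_control(self, level: Level) -> None:
        """Model the first control signal arriving at the signal input."""
        self.control = level

    def routed_pin(self) -> str:
        """Return the pin currently connected to the data extraction terminal."""
        if self.control is Level.HIGH:
            return "first_pin"   # processor side
        return "second_pin"      # first controller side
```

In this sketch, driving the control low hands the acquisition device over to the first controller, and driving it high returns it to the processor, which mirrors the two cases stated in the embodiment.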

In one exemplary embodiment, the processor node comprises a plurality of first connectors and a plurality of acquisition devices, wherein the plurality of first connectors are in one-to-one correspondence with the plurality of acquisition devices, and the plurality of first connectors are in one-to-one correspondence with the plurality of cable sets; the first controller is configured to, when detecting a first data link whose fault state indicates that a fault has occurred, extract a first data link identifier of the first data link from first fault information corresponding to the first data link, acquire a first target connector identifier corresponding to the first data link identifier from the data link identifiers and first connector identifiers having a correspondence, acquire a target cable set identifier corresponding to the first target connector identifier from the first connector identifiers and cable set identifiers having a correspondence, and determine the cable set corresponding to the target cable set identifier as the target cable set, wherein the first fault information comprises the fault state of the first data link and the first data link identifier having a correspondence, and the correspondence comprises the correspondence between data link identifiers and first connector identifiers and the correspondence between first connector identifiers and cable set identifiers.

In an exemplary embodiment, the reference device comprises a data switch, wherein the data switch comprises a second controller, the second controller is connected with the plurality of cable groups and used for establishing corresponding relations between the plurality of groups of data links and the plurality of cable groups, and the reference cable groups with faults are detected from the plurality of cable groups according to the corresponding relations and fault states of all data links in the plurality of groups of data links in the process of data transmission between the plurality of processors and the reference device.

In one exemplary embodiment, the data switch comprises a second storage device, wherein the second storage device is used for storing second firmware of the second controller, the second firmware comprises corresponding relations between the plurality of groups of data links and the plurality of cable groups, the second storage device is connected with the second controller, the second controller is used for reading the corresponding relations from the second storage device when a fault state is detected to be used for indicating a fault second data link in the second candidate data links, and the reference cable group which is faulty is detected from the plurality of cable groups according to the corresponding relations and the second data link, wherein the data switch is used for transmitting data to a corresponding processor in the plurality of processors through each data link in the second candidate data links, and the second candidate data link is a data link except the first candidate data link in the plurality of groups of data links.

In one exemplary embodiment, the data switch comprises a switching device and a plurality of second connectors, wherein the switching device is connected with the second controller, the plurality of second connectors are connected with the switching device, the plurality of second connectors are connected with the plurality of cable sets in a one-to-one correspondence manner, and the plurality of cable sets are connected with the plurality of acquisition devices in a one-to-one correspondence manner;

The second controller is configured to send an acquisition instruction to the switching device in a process of performing data transmission between the multiple processors and the reference device, extract a second link parameter of a second candidate data link acquired by a second candidate acquisition device corresponding to a second candidate data link in the multiple acquisition devices, and detect a fault state of the second candidate data link according to the second link parameter, where the acquisition instruction is used for requesting acquisition of the second link parameter, and the switching device is used for executing the acquisition instruction.

In an exemplary embodiment, the second controller is configured to detect a relationship between each of the extracted second link parameters and a corresponding link parameter threshold, and, when it is detected that a third reference link parameter among the second link parameters is smaller than a third link parameter threshold and the difference from the third link parameter threshold is greater than or equal to a third difference threshold, determine the failure state of a third reference data link as indicating that the third reference data link has failed, where the third reference data link is the data link corresponding to the third reference link parameter in the second candidate data link, and the third link parameter threshold is the link parameter of the third reference data link in the case where the third reference data link has not failed.
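The threshold rule in this embodiment can be written as a short predicate. The sketch below is illustrative only: it assumes the link parameter is a single number for which larger means better quality, and the names `link_has_failed`, `healthy_value` and `min_gap` are hypothetical stand-ins for the link parameter threshold and the difference threshold.

```python
from typing import Dict


def link_has_failed(measured: float, healthy_value: float, min_gap: float) -> bool:
    """Apply the two-part test from the embodiment.

    A link is treated as failed when the measured parameter is below the
    value recorded for the healthy link AND the shortfall is at least
    min_gap (the difference threshold).
    """
    return measured < healthy_value and (healthy_value - measured) >= min_gap


def classify_links(
    measured: Dict[str, float],
    healthy: Dict[str, float],
    min_gap: float,
) -> Dict[str, bool]:
    """Return a fault state (True = failed) for every measured link."""
    return {
        link: link_has_failed(value, healthy[link], min_gap)
        for link, value in measured.items()
    }
```

Using a difference threshold in addition to the plain comparison means that a reading only slightly below the healthy value is not reported as a fault.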

In one exemplary embodiment, the second controller is configured to, when detecting that a fault state exists in the second candidate data link for indicating that a second data link is faulty, obtain a second target connector identifier corresponding to the second data link identifier from a data link identifier and a second connector identifier having a correspondence relationship, obtain a reference cable group identifier corresponding to the second target connector identifier from a second connector identifier and a cable group identifier having a correspondence relationship, and determine a cable group corresponding to the reference cable group identifier as the reference cable group, where the correspondence relationship includes a correspondence relationship between a data link identifier and a connector identifier, and a correspondence relationship between the connector identifier and a cable group identifier.
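The two-step lookup described here (data link identifier to connector identifier, then connector identifier to cable group identifier) can be sketched with two dictionaries. All identifiers and the function name `locate_cable_group` are hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional

# Hypothetical correspondence tables; real contents would come from firmware.
LINK_TO_CONNECTOR: Dict[str, str] = {
    "link_7": "connector_3",
    "link_8": "connector_4",
}
CONNECTOR_TO_CABLE_GROUP: Dict[str, str] = {
    "connector_3": "cable_group_3",
    "connector_4": "cable_group_4",
}


def locate_cable_group(
    link_id: str,
    link_to_connector: Dict[str, str],
    connector_to_group: Dict[str, str],
) -> Optional[str]:
    """Resolve a failed link to its cable group, or None if unmapped."""
    connector_id = link_to_connector.get(link_id)
    if connector_id is None:
        return None
    return connector_to_group.get(connector_id)
```

For example, `locate_cable_group("link_7", LINK_TO_CONNECTOR, CONNECTOR_TO_CABLE_GROUP)` resolves through `connector_3` to `cable_group_3`.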

In one exemplary embodiment, the reference device comprises a data switch, wherein the data switch comprises a second controller, the first controller is connected with a management switch, the second controller is connected with the management switch, the first controller is used for transmitting first fault information of the target cable set to the management switch, the second controller is used for transmitting second fault information of the reference cable set to the management switch, and the management switch is used for transmitting the second fault information to the first controller and transmitting the first fault information to the second controller.

In one exemplary embodiment, the processor node comprises a first interface, wherein the first controller is connected with the first interface, the management switch comprises a plurality of management interfaces, the first interface is connected with corresponding first management interfaces in the plurality of management interfaces, and the first controller is used for transmitting the first fault information to the first management interface through the first interface and receiving the second fault information transmitted by the management switch through the first management interface through the first interface.

In one exemplary embodiment, the data switch comprises a second interface, wherein the second controller is connected with the second interface, the second interface is connected with a corresponding second management interface in the plurality of management interfaces, and the second controller is used for transmitting the second fault information to the second management interface through the second interface and receiving the first fault information transmitted by the management switch through the second management interface through the second interface.

In one exemplary embodiment, the first controller is configured to determine the target cable set and the reference cable set as failed cable sets, and the second controller is configured to determine the target cable set and the reference cable set as failed cable sets.
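The exchange of fault information through the management switch, followed by each controller combining both reports, might look like the following sketch. The class `ManagementSwitch`, its methods, and the controller names are assumptions for illustration; the application does not specify a message format or protocol.

```python
from typing import Dict, Set


class ManagementSwitch:
    """Toy relay: forwards each report to every other registered controller."""

    def __init__(self) -> None:
        self._inbox: Dict[str, Set[str]] = {}

    def register(self, controller_id: str) -> None:
        self._inbox[controller_id] = set()

    def publish(self, sender_id: str, failed_groups: Set[str]) -> None:
        """Deliver the sender's failed cable groups to all other controllers."""
        for controller_id, inbox in self._inbox.items():
            if controller_id != sender_id:
                inbox.update(failed_groups)

    def received_by(self, controller_id: str) -> Set[str]:
        return set(self._inbox[controller_id])


def combined_view(local: Set[str], received: Set[str]) -> Set[str]:
    """Union of locally detected and remotely reported failed cable groups."""
    return set(local) | set(received)
```

After both sides publish, each controller holds the same combined set, which corresponds to both controllers determining the target cable set and the reference cable set as failed.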

According to another embodiment of the present application, there is provided a computer system including a management switch, a plurality of power supply devices, and a plurality of servers, a first group of servers among the plurality of servers and the management switch being connected to a first group of power supply devices among the plurality of power supply devices, a second group of servers being connected to a second group of power supply devices, the first group of power supply devices being for supplying power to the first group of servers and the management switch, the second group of power supply devices being for supplying power to the second group of servers, the second group of servers being servers among the plurality of servers other than the first group of servers, the second group of power supply devices being power supply devices among the plurality of power supply devices other than the first group of power supply devices.

According to yet another embodiment of the present application, there is provided a multi-node management method, a server including a plurality of processor nodes and a plurality of connection nodes, each of the processor nodes including a plurality of processors and a first controller, each of the connection nodes including a plurality of cable sets, the connection nodes being configured to connect a reference device, the connection nodes providing the plurality of processors with a plurality of sets of data links for data transmission with the reference device through the plurality of cable sets, the method being applied to the first controller, the method including establishing a correspondence between the plurality of sets of data links and the plurality of cable sets, and detecting a failed target cable set from the plurality of cable sets according to the correspondence and a failure state of each of the plurality of data links during data transmission between the plurality of processors and the reference device.

According to a further embodiment of the application, there is also provided a computer readable storage medium having stored therein a computer program, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.

According to a further embodiment of the application there is also provided an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.

According to a further embodiment of the application, there is also provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of any of the method embodiments described above.

According to the application, the controllers in the servers can establish the correspondence between the plurality of groups of data links and the plurality of cable groups and, during data transmission, promptly identify and locate the failed cable group according to the monitored fault states of the data links and the correspondence. This avoids the slow, time-consuming and error-prone process of manually checking each cable one by one to locate a failed cable group in a server with high-density cable interconnection. Therefore, the problem of low efficiency in locating the failed cable group is solved, and the efficiency of locating the failed cable group is improved.

Drawings

FIG. 1 is a schematic diagram of the connection of a multiprocessor to a reference device in an alternative server, in accordance with an embodiment of the present application;

FIG. 2 is a network management topology of an alternative processor node and reference device node in accordance with an embodiment of the present application;

FIG. 3 is a block diagram of fault localization of an alternative server cable set in accordance with an embodiment of the present application;

FIG. 4 is a schematic flow diagram of an alternative server normal start-up according to an embodiment of the present application;

FIG. 5 is a flow chart of locating a faulty cable assembly, optionally after the cable assembly has failed, according to an embodiment of the present application;

FIG. 6 is a block diagram of the hardware architecture of a server device of a multi-node management method according to an embodiment of the present application;

FIG. 7 is a flow chart of a multi-node management method according to an embodiment of the present application;

FIG. 8 is a block diagram of a structure of a multi-node management apparatus according to an embodiment of the present application.

Detailed Description

Embodiments of the present application will be described in detail below with reference to the accompanying drawings in conjunction with the embodiments.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.

First, terms involved in the embodiments of the present application are explained as follows:

CPU: Central Processing Unit;

GPU: Graphics Processing Unit;

PCIe: Peripheral Component Interconnect Express, a high-speed serial computer expansion bus;

UPI: Ultra Path Interconnect;

FW: Firmware, low-level software written into erasable memory; after a chip is powered on, the FW runs to initialize the chip and its peripheral devices;

BMC: Baseboard Management Controller;

BIOS: Basic Input/Output System;

SerDes: Serializer/Deserializer, which converts between serial and parallel data;

Cable Tray: cable tray.

The embodiment provides a server which comprises a processor node and a connection node, wherein the processor node comprises a plurality of processors and a first controller, the connection node comprises a plurality of cable groups, the connection node is used for connecting reference equipment and providing a plurality of groups of data links for the processors to conduct data transmission with the reference equipment through the plurality of cable groups, the first controller is used for establishing a corresponding relation between the plurality of groups of data links and the plurality of cable groups, and in the process of conducting data transmission between the plurality of processors and the reference equipment, a target cable group with faults is detected from the plurality of cable groups according to the corresponding relation and fault states of all data links in the plurality of groups of data links.

Optionally, in this embodiment, fig. 1 is a schematic diagram illustrating connection between a multiprocessor and a reference device in an optional server according to an embodiment of the present application, and as shown in fig. 1, the server may include, but is not limited to, a processor node and a connection node, where the processor node may include, but is not limited to, n processors and a first controller, the connection node may include, but is not limited to, n cable sets, and the reference device may include, but is not limited to, n reference devices. The plurality of processors may be, but are not limited to, connected to corresponding ones of the n reference devices via a cable set in the connection node for data transmission, e.g., each of the n processors may be, but is not limited to, connected to a corresponding one of the n connection nodes, each of the n connection nodes may be, but is not limited to, connected to the reference device. It should be noted that the number of reference devices may be, but not limited to, one or more, and the number of reference devices and the number of connection nodes may be, but not limited to, the same or different.

Alternatively, in the present embodiment, each of the plurality of processors may be, but is not limited to, configured for data transmission with the reference device via a data link, e.g., each of the plurality of processors may be, but is not limited to including, a GPU processor or other processor, as the application is not limited in this regard.

Optionally, in this embodiment, the first controller may be, but not limited to, configured to establish a correspondence between a plurality of groups of data links and a plurality of cable groups, and detect, from the plurality of cable groups, a failed target cable group according to the correspondence and a failure state of each of the plurality of groups of data links during data transmission between the plurality of processors and the reference device, for example, the first controller may be, but not limited to, a BMC controller or other controller, which is not limited in this application.
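As a minimal sketch of the correspondence-based detection described above, the following Python fragment maps each group of data links to a cable group and returns the cable groups whose links are reported as failed. The dictionary `LINK_TO_CABLE_GROUP`, the identifiers in it, and the function name are hypothetical; the application does not prescribe a data format.

```python
from typing import Dict, Set

# Hypothetical correspondence: data link group id -> cable group id.
LINK_TO_CABLE_GROUP: Dict[str, str] = {
    "link_group_1": "cable_group_1",
    "link_group_2": "cable_group_2",
    "link_group_3": "cable_group_3",
}


def find_failed_cable_groups(
    fault_states: Dict[str, bool],
    correspondence: Dict[str, str],
) -> Set[str]:
    """Return cable groups whose data links are reported as failed.

    fault_states maps a link group id to True when that link has failed.
    Links that are missing from the correspondence are skipped.
    """
    return {
        correspondence[link]
        for link, failed in fault_states.items()
        if failed and link in correspondence
    }
```

Because the correspondence is established in advance, locating the failed cable group reduces to a lookup rather than a manual inspection of each cable.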

Alternatively, in this embodiment, data may be transmitted between each of the plurality of processors and the reference device through a data link. For example, each of the plurality of processors may transmit data to the reference device through a corresponding data link, or the reference device may transmit data to a corresponding one of the plurality of processors through a corresponding data link. As an alternative example, the reference device may include, but is not limited to, a service switch, a router, and the like.

Alternatively, in the present embodiment, each of the plurality of sets of data links may have a corresponding one of the plurality of cable sets; that is, one set of data links corresponds to one cable set. As an alternative example, one cable set may include, but is not limited to, two cables that provide data links in the two transmission directions, respectively. For example, cable set 1 includes cable 1 and cable 2, where cable 1 provides the data link over which the processor transmits data to the reference device and cable 2 provides the data link over which the reference device transmits data to the processor.

Alternatively, in this embodiment, the connection node may be, but is not limited to being, used to connect the reference device and to provide the plurality of processors, through a plurality of cable sets, with a plurality of sets of data links for data transmission with the reference device. For example, the plurality of cable sets may include, but are not limited to, the cable sets in a Cable Tray. The Cable Tray may include, but is not limited to, a plurality of cable sets, and each cable set may include, but is not limited to, a connector for connecting to the processor node side and a connector for connecting to the reference device side. Each cable set may include, but is not limited to, two cables, where one cable may be, but is not limited to being, used to transmit data from each of the plurality of processors to the reference device, and the other cable may be, but is not limited to being, used to transmit data from the reference device to each of the plurality of processors.

Alternatively, in this embodiment, the failure state of the data link may be used to indicate that the data link is failed or that the data link is not failed, for example, in the case where the failure state of the data link is used to indicate that the data link is failed, the data link may transmit data at a lower rate, or the data link may not transmit data, or the like.

Optionally, in this embodiment, after the failed target cable group is detected from the plurality of cable groups according to the correspondence and the fault state of each data link during data transmission between the plurality of processors and the reference device, the failed target cable group may be, but is not limited to being, quickly replaced or maintained. For example, the plurality of cable groups included in the connection node may be, but are not limited to being, designed to be pluggable, so that when the failed target cable group is detected it can be replaced directly, which improves the efficiency of replacing the failed cable group.

With the above server, the controller in the server establishes the correspondence between the plurality of groups of data links and the plurality of cable groups, and identifies and locates the failed cable group in time according to the fault states of the data links monitored during data transmission and the correspondence. This avoids the slow, time-consuming and error-prone manual checking of cables one by one when high-density cables are connected. Therefore, the problem of low efficiency in locating the failed cable group is solved, and the efficiency of locating the failed cable group is improved.
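
As an illustrative, non-limiting sketch of the locating step described above, the correspondence can be modelled as a lookup table from data link identifier to cable group, and the failed cable groups are those that carry at least one failed data link. The names `find_failed_cable_groups`, `link_to_cable` and `link_failed`, as well as the sample identifiers, are assumptions made for illustration and do not appear in the application.

```python
from typing import Dict, Set


def find_failed_cable_groups(
    link_to_cable: Dict[str, str],
    link_failed: Dict[str, bool],
) -> Set[str]:
    """Return the cable groups that carry at least one failed data link.

    link_to_cable: hypothetical correspondence (data link id -> cable group id).
    link_failed:   hypothetical fault state per data link (True = failed).
    """
    return {
        link_to_cable[link]
        for link, failed in link_failed.items()
        if failed and link in link_to_cable
    }


# Hypothetical example: three data links provided by two cable groups.
link_to_cable = {
    "lane0": "cable_group_1",
    "lane1": "cable_group_1",
    "lane2": "cable_group_2",
}
link_failed = {"lane0": False, "lane1": False, "lane2": True}

failed_groups = find_failed_cable_groups(link_to_cable, link_failed)
```

In this sketch a data link without an entry in the table is ignored rather than reported; a real controller might instead raise an alarm for it.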

Alternatively, in the present embodiment, the first candidate data link may be, but is not limited to being, used to transmit data in the reference device to each of a plurality of processors.

Alternatively, in this embodiment, the first storage device may include, but is not limited to, a storage device for storing the first firmware of the first controller. For example, the first storage device may include, but is not limited to, a Flash memory or a solid state disk, such as an SPI Nor Flash (Serial Peripheral Interface NOR Flash, a non-volatile memory with a serial peripheral interface).

Alternatively, in this embodiment, the first firmware may include, but is not limited to, a program persistently stored in the first storage device, and the correspondence between the plurality of sets of data links and the plurality of cable sets may be, but is not limited to being, stored in the first firmware. For example, the first firmware may include, but is not limited to, the FW (firmware) in the first storage device.

Optionally, in this embodiment, the first controller may further, but is not limited to, read, at an initialization stage of the first controller, correspondence between the plurality of sets of data links and the plurality of cable sets stored in the first firmware.

Optionally, in this embodiment, when the correspondence between the plurality of groups of data links and the plurality of cable groups changes, a first adjustment instruction may be, but is not limited to being, sent to the first firmware. The first adjustment instruction is used to adjust the correspondence between the plurality of groups of data links and the plurality of cable groups stored in the first firmware to an updated correspondence, where the updated correspondence includes a correspondence between updated groups of data links and updated cable groups, the updated groups of data links are the same as or different from the plurality of groups of data links, and the updated cable groups are the same as or different from the plurality of cable groups.
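
The read-at-initialization and adjust-on-change behaviour described above can be sketched as follows. This is a simplified model under stated assumptions: the class `FirmwareStore`, its methods `read` and `adjust`, and the JSON encoding are hypothetical stand-ins for the firmware in the storage device, not an interface defined by the application.

```python
import json
from typing import Dict


class FirmwareStore:
    """Hypothetical in-memory stand-in for the correspondence kept in firmware."""

    def __init__(self, blob: str) -> None:
        self._blob = blob  # serialized correspondence table

    def read(self) -> Dict[str, str]:
        """Read the stored correspondence (e.g. at controller initialization)."""
        return json.loads(self._blob)

    def adjust(self, updated: Dict[str, str]) -> None:
        """Apply an adjustment instruction: replace the stored correspondence."""
        self._blob = json.dumps(updated, sort_keys=True)


# Initialization stage: the controller reads the stored correspondence.
store = FirmwareStore(json.dumps({"lane0": "cable_group_1"}))
initial_table = store.read()

# The cabling changes, so the stored correspondence is adjusted.
store.adjust({"lane0": "cable_group_2", "lane1": "cable_group_2"})
updated_table = store.read()
```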

Alternatively, in this embodiment, the acquisition devices may be, but are not limited to being, connected in a one-to-one correspondence with the cable sets and acquire the actual link parameters of the data links. For example, the acquisition device may include, but is not limited to, a PHY RETIMER (physical layer retimer) chip.

For example, in the case where the acquisition devices include PHY RETIMER chips, each of the plurality of PHY RETIMER chips may, but is not limited to, acquire the actual link parameters of the data link used to transmit data from each of the plurality of processors to the reference device, and acquire the actual link parameters of the data link used to transmit data from the reference device to each of the plurality of processors.

Alternatively, in the present embodiment, the target processor may, but is not limited to, receive the extracted first set of link parameters transmitted by the plurality of processors, where the first set of link parameters includes the link parameters of the data links over which the processors transmit data to the reference device. For example, the target processor may include, but is not limited to, a CPU, an FPGA (Field Programmable Gate Array), a CPLD (Complex Programmable Logic Device), or another processor, which is not limited in this application.

Alternatively, in this embodiment, the multiple processors may be, but are not limited to being, connected in a one-to-one correspondence to the multiple acquisition devices by default. Take as an example that the multiple processors include a plurality of GPUs, the acquisition devices include PHY RETIMER chips, the target processor includes a CPU, and the reference device includes a service switch. In the initialization stage of the PHY RETIMER chips, each GPU is connected to its corresponding PHY RETIMER chip and can communicate with that chip through a management bus (e.g., an MDIO bus) to configure its parameters, such as setting the transmission rate of data transmission or adjusting the signal level. In the data transmission stage, as long as the CPU does not detect a failed data link, each GPU remains connected to its corresponding PHY RETIMER chip, so as to ensure that data can be transmitted normally to the service switch.

Optionally, in this embodiment, the link parameter may include, but is not limited to, a parameter indicating the quality of data transmission between each of the plurality of processors and the reference device through the corresponding data link of the plurality of sets of data links, for example, an SI (Signal Integrity) parameter. The SI parameter may include, but is not limited to, a frequency parameter, an impulse response parameter, and a step response parameter.

Alternatively, in this embodiment, the quality of data transmission between each of the plurality of processors and the reference device through the corresponding data link may be, but is not limited to being, determined as normal when the detected actual link parameter is greater than or equal to the link parameter threshold and the difference between the actual link parameter and the link parameter threshold is less than or equal to a difference threshold. The quality of that data transmission may be determined as abnormal when the detected actual link parameter is less than or equal to the link parameter threshold and the difference between the actual link parameter and the link parameter threshold is greater than or equal to the difference threshold.

Alternatively, in the present embodiment, each data link may, but is not limited to, have a link identifier in one-to-one correspondence with it. The link identifier may be, but is not limited to being, used to identify and locate the data link; for example, the link identifier may include, but is not limited to, the number of the data link, the address of the data link, and so on.

Optionally, in this embodiment, each link parameter in the first set of link parameters may be, but is not limited to being, used to indicate a quality of data transmission by each processor in the plurality of processors to the reference device through a corresponding data link in the first candidate data links, where each link parameter in the first set of link parameters and the data link of the data transmission to the reference device may be, but is not limited to being, in a one-to-one correspondence, e.g., one data link corresponds to one link parameter.

Alternatively, in this embodiment, each data link may, but is not limited to, have a fault state in one-to-one correspondence with it, and the fault state of a data link may be, but is not limited to being, detected in real time according to the link parameters acquired by the acquisition device. It is understood that the fault state of a data link may change or remain unchanged; for example, data link 1 recovers from a fault and transmits data normally again, or data link 1 is always normal.
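
As a minimal sketch of deriving a fault state per data link from acquired link parameters, the example below assumes that each data link is summarized by a single scalar parameter for which a larger value means better quality, and that a link counts as failed when its value falls below a threshold. The scalar model, the function name `detect_fault_states` and the sample values are assumptions for illustration; the application does not prescribe this exact rule.

```python
from typing import Dict


def detect_fault_states(
    link_params: Dict[str, float],
    threshold: float,
) -> Dict[str, bool]:
    """Map each data link id to a fault state (True = failed).

    Assumption: one scalar quality parameter per link, larger is better,
    and a value below the threshold indicates a failed link.
    """
    return {link: value < threshold for link, value in link_params.items()}


# Hypothetical samples acquired at two points in time.
first_sample = {"lane0": 0.92, "lane1": 0.41}
second_sample = {"lane0": 0.93, "lane1": 0.88}

states_before = detect_fault_states(first_sample, threshold=0.6)
states_after = detect_fault_states(second_sample, threshold=0.6)
```

Evaluating the rule on each new sample lets the fault state of a link change over time, for instance when `lane1` recovers in the second sample.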

Alternatively, in the present embodiment, the second set of fault conditions may include, but is not limited to, a fault condition for indicating that the data link has failed and a fault condition for indicating that the data link has not failed.

Alternatively, in the present embodiment, the number of processors in the first group of processors and the second group of processors may be, but not limited to, the same, or different, for example, the number of processors in the first group of processors and the second group of processors is 4, or the number of processors in the first group of processors is 3, and the number of processors in the second group of processors is 5.

Alternatively, in this embodiment, the first conversion means may be, but is not limited to, used to connect the first processor to a corresponding first set of processors of the plurality of processors, and the second conversion means may be, but is not limited to, used to connect the second processor to a corresponding second set of processors of the plurality of processors. For example, the conversion device may include, but is not limited to, PCIE SWITCH (switch) chips.

Alternatively, in the present embodiment, the first link parameter threshold and the second link parameter threshold may be, but are not limited to being, the same, or different.

Alternatively, in the present embodiment, in the case where the failure state of a data link indicates that a failure has occurred, abnormalities such as signal quality degradation or signal interruption may occur while each of the plurality of processors transmits data (also referred to as a signal) to the reference device through the corresponding data link.

Optionally, in this embodiment, take as an example that the first controller includes a BMC, the acquisition device includes a PHY RETIMER chip, and the reference processor includes a GPU. When the BMC detects that a reference data link has failed, the BMC is connected to the PHY RETIMER chip corresponding to the GPU, among the multiple GPUs, whose reference data link has failed.

Alternatively, in this embodiment, the reference data link may include, but is not limited to, one or more data links in the first candidate data link, and the reference acquisition device may include, but is not limited to, one or more acquisition devices in the plurality of acquisition devices corresponding to the reference data link.

Optionally, in this embodiment, the candidate data link may include one data link, multiple data links, or all data links in the first candidate data links. Where the candidate data link includes multiple data links in the first candidate data links, the first controller may, but is not limited to, send the high-level first control signal in batches to the candidate gating devices corresponding to the candidate data links, so as to control the candidate gating devices in batches and improve the efficiency of controlling the gating devices.

Alternatively, in the present embodiment, the high level is a logic level. For example, the high level may indicate that the voltage is greater than or equal to a first voltage threshold, and the high level may be represented by a corresponding flag, for example, the flag 1.

Optionally, in this embodiment, the reference data link may, but is not limited to, include one data link or multiple data links in the first candidate data link, where the reference data link includes multiple data links in the first candidate data link, the first controller may, but is not limited to, send, in batches, a low-level first control signal to the reference gating device, so as to implement batch control over the reference gating device, and improve control efficiency over the gating device.

Alternatively, in the present embodiment, the low level is a logic level. For example, the low level may indicate that the voltage is smaller than a second voltage threshold, and the low level may be represented by a corresponding flag, for example, the flag 0.

Optionally, in this embodiment, when the first controller switches from sending the low-level first control signal to sending the high-level first control signal to the signal input terminal, the connection between the second pin and the data extraction terminal is switched to a connection between the first pin and the data extraction terminal. When the first controller switches from sending the high-level first control signal to sending the low-level first control signal to the signal input terminal, the connection between the first pin and the data extraction terminal is switched to a connection between the second pin and the data extraction terminal.

Alternatively, in the present embodiment, the first fault information may include, but is not limited to, fault information detected by the first processor and transmitted to the first controller, and fault information detected by the second processor and transmitted to the first controller.

Alternatively, in the present embodiment, the connector identifier may be used to identify and locate the connector; for example, the connector identifier may also be used to identify and locate the cable set to which the connector belongs.

Alternatively, in this embodiment, the data switch may include, but is not limited to, a service switch or other switch that may be used for data transmission, as the application is not limited in this regard.

Optionally, in this embodiment, the reference device may be configured with a second controller. The second controller may be, but is not limited to being, used to establish a correspondence between the plurality of sets of data links and the plurality of cable sets and, during data transmission between the plurality of processors and the reference device, to detect a failed reference cable set from the plurality of cable sets according to the correspondence and the failure state of each data link in the plurality of sets of data links. For example, the second controller may be, but is not limited to, a BMC or another controller, which is not limited in this application.

Optionally, in this embodiment, after the failed reference cable set is detected from the plurality of cable sets according to the correspondence and the failure state of each data link during data transmission between the plurality of processors and the reference device, the failed reference cable set may be, but is not limited to being, quickly replaced or maintained.

Alternatively, in this embodiment, the reference device may, but is not limited to, transmit data to a corresponding processor of the plurality of processors over each of the second candidate data links.

Alternatively, in this embodiment, the second storage device may include, but is not limited to, a storage device for storing the second firmware of the second controller. For example, the second storage device may include, but is not limited to, a Flash memory or a solid state disk, such as an SPI Nor Flash.

Alternatively, in this embodiment, the second firmware may include, but is not limited to, a program persistently stored in the second storage device, and the correspondence between the plurality of sets of data links and the plurality of cable sets may be, but is not limited to being, stored in the second firmware. For example, the second firmware may include, but is not limited to, the FW (firmware) in the second storage device.

Optionally, in this embodiment, the second controller may further, but is not limited to, read, at an initialization stage of the second controller, the correspondence between the plurality of sets of data links and the plurality of cable sets stored in the second firmware.

Optionally, in this embodiment, when the correspondence between the plurality of groups of data links and the plurality of cable groups changes, a second adjustment instruction may be, but is not limited to being, sent to the second firmware. The second adjustment instruction is used to adjust the correspondence between the plurality of groups of data links and the plurality of cable groups stored in the second firmware to an updated correspondence, where the updated correspondence includes a correspondence between updated groups of data links and updated cable groups, the updated groups of data links are the same as or different from the plurality of groups of data links, and the updated cable groups are the same as or different from the plurality of cable groups.

Alternatively, in this embodiment, the switching device may be, but not limited to being, deployed in a data switch, the switching device may be, but not limited to being, used to execute the acquisition instructions, e.g., the switching device may be, but not limited to comprising an ethernet switching chip.

Alternatively, in this embodiment, the multiple processors in the processor node may, but are not limited to, implement data transmission through the connected switching device, for example, in a case where the processor 1 in the processor node wishes to transmit data to the processor 2 in the processor node, the processor 1 sends the data 1 to the connected switching device, and the switching device transmits the received data 1 to the processor 2, to implement data transmission between the processor 1 and the processor 2.

Alternatively, in the present embodiment, each of the second link parameters may be, but not limited to being, greater than, less than, or equal to the corresponding link parameter threshold.

Alternatively, in the present embodiment, the reference cable set may be the same as or different from the target cable set. For example, the reference cable set and the target cable set are both cable set 1, or the reference cable set is cable set 1 and the target cable set is cable set 2.

Alternatively, in this embodiment, the second data link may, but is not limited to, correspond to the same cable set as the first data link, or to different cable sets, for example, the second data link and the first data link each correspond to cable set 1, or the second data link corresponds to cable set 1, and the first data link corresponds to cable set 3, it being understood that the second data link and the first data link are both data links provided by the same cable set, or the second data link and the first data link are data links provided by different cable sets.

Alternatively, in the present embodiment, the first failure information in the first controller and the second failure information in the second controller may be, but are not limited to, interacted through the management switch.

Optionally, in this embodiment, after the second fault information is transmitted to the first controller and the first fault information is transmitted to the second controller, the first fault information and the second fault information may be, but are not limited to being, displayed on a Web interface of the first controller, which improves the efficiency with which operation and maintenance personnel check all fault information.

Alternatively, in the present embodiment, after the second fault information is transmitted to the first controller and the first fault information is transmitted to the second controller, the second fault information and the first fault information may also be, but are not limited to being, displayed on the Web interface of the second controller.

Alternatively, in this embodiment, one of the processors in the processor node may be, but is not limited to being, connected to a management interface, which may be, but is not limited to including, a GE interface in the management switch.

Alternatively, in this embodiment, a data switch may be, but not limited to being, connected to a management interface, which may be, but not limited to, a GE (Gigabit Ethernet) interface included in the management switch.

Alternatively, in the present embodiment, the first controller may determine the target cable set and the reference cable set as the failed cable set in combination with the first failure information and the second failure information received from the management switch, and the second controller may determine the target cable set and the reference cable set as the failed cable set in combination with the second failure information and the first failure information received from the management switch.

By the embodiment of the application, the cable set fault is determined by combining the first fault information and the second fault information, and the fault cable set can be accurately positioned and identified even in a complex server architecture, so that the fault cable set is ensured to be quickly repaired, and the operation and maintenance efficiency of the cable set in the server is improved.
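
The combination of the first fault information and the second fault information can be sketched as a merge of the cable sets reported by the two controllers, keeping track of which controller reported each one. The function `merge_fault_info`, the source labels and the sample cable set names are hypothetical and serve only as an illustration under these assumptions.

```python
from typing import Dict, Iterable, List


def merge_fault_info(
    first_info: Iterable[str],
    second_info: Iterable[str],
) -> Dict[str, List[str]]:
    """Merge failed cable sets reported by two controllers.

    Returns a mapping: cable set id -> list of reporting controllers.
    """
    merged: Dict[str, List[str]] = {}
    sources = (("first_controller", first_info), ("second_controller", second_info))
    for source, info in sources:
        for cable_set in info:
            merged.setdefault(cable_set, []).append(source)
    return merged


# Hypothetical reports: the target cable set seen by the first controller
# and the reference cable sets seen by the second controller.
first_fault_info = ["cable_set_1"]
second_fault_info = ["cable_set_1", "cable_set_2"]

failed_cable_sets = merge_fault_info(first_fault_info, second_fault_info)
```

A cable set reported from both sides, such as `cable_set_1` here, appears with both sources, while a cable set seen from only one side is still listed as failed.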

The computer system further comprises a management switch, a plurality of power supply devices and a plurality of servers, wherein a first group of servers in the plurality of servers and the management switch are connected with a first group of power supply devices in the plurality of power supply devices, a second group of servers are connected with a second group of power supply devices, the first group of power supply devices are used for supplying power to the first group of servers and the management switch, the second group of power supply devices are used for supplying power to the second group of servers, the second group of servers are servers except the first group of servers in the plurality of servers, and the second group of power supply devices are power supply devices except the first group of power supply devices in the plurality of power supply devices.

Alternatively, in the present embodiment, the power supply device may be, but is not limited to being, used to supply power to the management switch, the plurality of processor nodes, and the plurality of reference devices. For example, the power supply device may include, but is not limited to, a Power shell.

In order to better understand the application scenario and workflow of the server in the embodiment of the present application, they are explained below in conjunction with an optional embodiment, which may be, but is not limited to being, applied to the embodiment of the present application.

The following description takes as an example that the power supply device includes, but is not limited to, a Power shell; the processor node includes a GPU BOX node; each of the plurality of processors includes a GPU; the first storage device includes SPI Nor Flash 1; the first firmware includes the FW of the first controller; the second storage device includes SPI Nor Flash 3; the second firmware includes the FW of the second controller; the first controller includes BMC0; the second controller includes BMC1; the first processor includes CPU0; the second processor includes CPU1; the acquisition device includes PHY RETIMER chips; the conversion device includes PCIE SWITCH chips; the gating device includes Mux chips; the connector includes a 112Gbps high-speed connector; the switching device includes an Ethernet switching chip; and the cable set includes a cable set in the Cable Tray.

Fig. 2 is a network management topology diagram of an optional processor node and a reference device node according to an embodiment of the present application. As shown in fig. 2, an AI Rack server may include, but is not limited to, 2 Power shell (power supply module) nodes, n GPU BOX nodes, n switch nodes, and 1 management switch. The management switch includes multiple GE ports (corresponding to multiple management interfaces), and each GPU BOX node and each switch node has a GE port (corresponding to a first interface and a second interface, respectively) connected to a corresponding one of the GE ports on the management switch for communication.

Fig. 3 is a block diagram of fault location of an optional server cable set according to an embodiment of the present application. As shown in fig. 3, the description below takes, by way of example and not limitation, any one of the processor nodes in fig. 2 and the data switch connected to it.

As shown in fig. 3, the following are designed on the GPU BOX node: 8 GPUs, GPU0 to GPU7 (corresponding to the plurality of processors included in the processor node), where GPU0 to GPU3 correspond to the first group of processors and GPU4 to GPU7 correspond to the second group of processors; 8 PHY RETIMER chips, PHY RETIMER0 to PHY RETIMER7, operating at 112Gbps (gigabits per second); 8 Mux chips, Mux0 to Mux7; 1 BMC chip (for example, BMC0, corresponding to the first controller); 2 PCIE SWITCH chips (corresponding to the first conversion device and the second conversion device); 2 CPU chips (corresponding to the first processor and the second processor); and one SPI Nor Flash chip mounted under the BMC chip (corresponding to the first storage device, for example, SPI Nor Flash 1 mounted under BMC0).

CPU0 (corresponding to the first processor) and CPU1 (corresponding to the second processor) each output a PCIE x16 high-speed link connected to one of the 2 PCIE SWITCH chips, and CPU0 and CPU1 are interconnected through a UPI high-speed link. One PCIE SWITCH chip is connected to GPU0 to GPU3 through 4 downlink PCIE x16 high-speed links, and the other PCIE SWITCH chip is connected to GPU4 to GPU7 through 4 downlink PCIE x16 high-speed links. GPU0 to GPU7 each output x16 SerDes links connected to the PHY RETIMER0 to PHY RETIMER7 chips, respectively, and the PHY RETIMER0 to PHY RETIMER7 chips (corresponding to the plurality of acquisition devices) each output x16 112Gbps SerDes links connected to a 112Gbps high-speed connector. The 112Gbps high-speed connector is connected to the Cable Tray, and the cables inside the Cable Tray carry the 112Gbps SerDes links to the Ethernet switch chip (equivalent to the switching device) of the switch node. SPI Nor Flash 0 is mounted under CPU0 and may, but is not limited to, store the operating system and the firmware of CPU0.

The low-speed management link of the GPU BOX node is designed as follows. The BMC0 chip and the GPU0 module each output an MDIO (Management Data Input/Output) link connected to the Mux0 chip, the Select0 signal (corresponding to the first control signal) of the Mux0 chip is connected to the BMC0 chip, and BMC0 controls, by sending the Select0 signal, whether the MDIO at the output end (corresponding to the data extraction end) of the Mux0 chip comes from BMC0 or from GPU0. When the Select0 signal is at low level, pin 1 (corresponding to the second pin) of the Mux0 chip is connected to the output end of the Mux0 chip, and the MDIO at the output end comes from BMC0. When the Select0 signal is at high level, pin 0 (corresponding to the first pin) of the Mux0 chip is connected to the output end of the Mux0 chip, and the MDIO at the output end comes from GPU0. Similarly, the MDIO links from BMC0 are also connected to the Mux1 to Mux7 chips, the MDIO links from GPU1 to GPU7 are connected to the Mux1 to Mux7 chips, respectively, and BMC0 controls, by sending the Select[7:0] signals, whether the MDIO signals at the output ends of the Mux0 to Mux7 chips come from BMC0 or from GPU0 to GPU7. SPI Nor Flash0 is mounted under CPU0 through an SPI link to store the BIOS FW, SPI Nor Flash1 is mounted under BMC0 through an SPI link to store the FW of BMC0, and BMC0 provides a GE port through an RGMII (Reduced Gigabit Media Independent Interface) link, which is connected to a GE port of the management switch node.
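
The Select[7:0] control described above can be sketched as an 8-bit mask in which bit i drives the Select signal of Mux i: a high bit routes the MDIO from GPU i to the Mux output, and a low bit routes the MDIO from BMC0. The helper names `build_select_mask` and `mdio_source`, and the idea of deriving the mask from a set of indices whose data links were found faulty, are assumptions made for illustration rather than the BMC code of the application.

```python
from typing import Iterable

NUM_MUX = 8  # Mux0 to Mux7, one per GPU / PHY RETIMER pair


def build_select_mask(bmc_owned: Iterable[int], num_mux: int = NUM_MUX) -> int:
    """Build the Select[num_mux-1:0] value.

    Bits default to high (MDIO from the GPU); the bit of every index in
    bmc_owned is driven low so that BMC0 owns the MDIO of that Mux output.
    """
    mask = (1 << num_mux) - 1
    for index in bmc_owned:
        if not 0 <= index < num_mux:
            raise ValueError(f"Mux index out of range: {index}")
        mask &= ~(1 << index)
    return mask


def mdio_source(mask: int, index: int) -> str:
    """Return which device drives the MDIO at the output of Mux `index`."""
    return f"GPU{index}" if (mask >> index) & 1 else "BMC0"


# Hypothetical case: the data links of GPU2 and GPU5 were reported as failed,
# so BMC0 takes over the MDIO of PHY RETIMER2 and PHY RETIMER5.
select = build_select_mask({2, 5})
sources = [mdio_source(select, i) for i in range(NUM_MUX)]
```

Computing the whole mask at once mirrors the batch control of the gating devices: all Select lines can be updated in one step instead of one Mux at a time.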

On the service switch node, an Ethernet switch chip, an ARM CPU chip, a BMC chip (corresponding to the second controller, for example, BMC1), and an SPI Nor Flash chip mounted under the BMC chip (corresponding to the second storage device, for example, SPI Nor Flash 3 mounted under BMC1) are designed. The ARM (Advanced RISC Machine, where RISC stands for Reduced Instruction Set Computing) CPU is connected to the Ethernet switch chip through a PCIE x2 high-speed link, and the SPI Nor Flash 2 chip is mounted under the ARM CPU through an SPI link to store the FW and the OS (Operating System) of the ARM CPU. The SPI Nor Flash 3 chip is mounted under BMC1 through an SPI link to store the FW of BMC1. BMC1 is connected to the Ethernet switch chip through an MDIO link to obtain the link state information of the Ethernet switch chip and of the PHY RETIMER chips of the GPU BOX node, and BMC1 provides a GE port through an RGMII link, which is connected to a GE port of the management switch node. Meanwhile, PHY RETIMER0 to PHY RETIMER7 of the GPU BOX node are connected through on-board 112Gbps SerDes links to the 112Gbps high-speed connector, the high-speed connector is connected to the Cable Tray, and the Cable Tray is connected to the Ethernet switch chip on the service switch. In this way, the 112Gbps SerDes interconnection from the GPU BOX0 to GPU BOXn nodes to service switch node 0 to service switch node n is completed through the Cable Tray.

Through the above embodiments, the embodiment of the present application provides a general architecture design for Cable Tray fault location in an AI Rack (Artificial Intelligence Rack) server (corresponding to a server). With this server implementation, the difficulty of locating and maintaining Cable Tray faults on the AI Rack servers that are currently mainstream in the industry can be addressed. The architecture is simple, flexible and low in cost: it only requires a set of system board designs combined with the interconnection scheme of the nodes in the AI Rack and the BMC code design. This simplifies the system design, improves system reliability, shortens Cable Tray maintenance time and reduces maintenance cost. The architecture is also general and can be ported to Cable Tray fault location in switch and router cabinets, which gives it high market value.

Fig. 4 is a schematic flow chart of an optional normal start-up of a server according to an embodiment of the present application. As shown in fig. 4, the normal start-up flow of the server is as follows:

Firstly, after the system is powered on, it is detected whether the n GPU BOX nodes and the n service switch nodes of the system have been powered on. If not, the system has a hardware fault and needs maintenance, and it is powered on again after the maintenance is completed.

Secondly, after power-on is completed, CPU0 loads the BIOS FW from SPI Nor Flash0; meanwhile, BMC0 and BMC1 load their FW from SPI Nor Flash1 and SPI Nor Flash 3 respectively, wherein the FW of BMC0 includes the correspondence between the Lane numbers of the entire system (equivalent to data link identifiers) and the wafers (equivalent to cable sets) in the Cable Tray, and the FW of BMC1 likewise includes the correspondence between the Lane numbers of the entire system (equivalent to link identifiers) and the wafers in the Cable Tray; the ARM CPU loads its FW from SPI Nor Flash2. CPU0, CPU1, BMC0, BMC1 and the ARM CPU then finish starting. At this time, CPU0 and CPU1 of the GPU BOX node establish PCIE connections with PCIE SWITCH0 and PCIE SWITCH1, and PCIE SWITCH0 and PCIE SWITCH1 establish PCIE connections with GPU0 to GPU7 respectively, completing the recognition of the eight GPUs by CPU0 and CPU1.

Thirdly, BMC0 of the GPU BOX node sends Select [7:0] to the Mux7 to Mux0 chips so that the MDIO output by the Mux chips comes from GPU7 to GPU0. Next, GPU0 to GPU7 of the GPU BOX node complete start-up and configure PHY RETIMER0 to PHY RETIMER7 respectively through the MDIO bus, including the rate, the operating mode, and so on.

Fourthly, PHY RETIMER0 to PHY RETIMER7 of the GPU BOX node each send signals to the Ethernet switch chip of the service switch node over x16 112Gbps SerDes links to establish connections, and the Ethernet switch chip of the service switch node sends signals to each of the PHY RETIMER0 to PHY RETIMER7 chips of the GPU BOX node over x16 112Gbps SerDes links to establish connections.

Fifthly, connection establishment between PHY RETIMER0 to PHY RETIMER7 of the GPU BOX node and the Ethernet switch chip of the service switch node is completed.

Sixthly, if connection establishment succeeds, the system operates normally; if it fails, either a reduced-bandwidth fault, in which the established link width is smaller than x16, or a Link Down fault may occur, and Cable Tray hardware fault location is required.

Through the above steps, after the GPU BOX nodes and the service switch nodes of the whole-cabinet AI Rack (equivalent to a server) are powered on, the PHY RETIMER chips and the Ethernet switch chip are interconnected through 112Gbps SerDes.
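The outcome of the last start-up step can be sketched in Python as a simple classification of the link width reached by each PHY RETIMER. This is an illustrative model only: the names `LinkResult`, `classify_link` and `needs_fault_location` are hypothetical, and treating Link Down as "zero lanes established" is an assumption, since the description only distinguishes a width smaller than x16 from Link Down.

```python
from enum import Enum
from typing import List

EXPECTED_WIDTH = 16  # the x16 link width named in the description


class LinkResult(Enum):
    NORMAL = "normal"
    REDUCED_BANDWIDTH = "reduced bandwidth"
    LINK_DOWN = "link down"


def classify_link(lanes_up: int) -> LinkResult:
    """Classify one RETIMER-to-switch link by the number of lanes that came up.

    Assumption (illustrative only): Link Down means no lane was established.
    """
    if not 0 <= lanes_up <= EXPECTED_WIDTH:
        raise ValueError("lane count must be between 0 and 16")
    if lanes_up == EXPECTED_WIDTH:
        return LinkResult.NORMAL
    if lanes_up == 0:
        return LinkResult.LINK_DOWN
    return LinkResult.REDUCED_BANDWIDTH


def needs_fault_location(lanes_up_per_retimer: List[int]) -> bool:
    """True if any link failed to reach x16, so Cable Tray location is needed."""
    return any(
        classify_link(n) is not LinkResult.NORMAL for n in lanes_up_per_retimer
    )
```

For example, eight links that all reach x16 require no further action, whereas a single link that comes up at x12 is enough to trigger the fault location flow.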

Fig. 5 is an optional flowchart of locating a faulty cable group after a cable group fails according to an embodiment of the present application. As shown in fig. 5, the flow of locating the faulty cable group is as follows:

Firstly, after a Cable Tray link fails, BMC0 of the GPU BOX node sends the Select [7:0] signals so that the MDIO links output by the Mux7 to Mux0 chips come from BMC0 (which is equivalent to sending a low-level first control signal).

Secondly, BMC0 sends an instruction to the PHY RETIMER0 to PHY RETIMER7 chips through the MDIO link to acquire the quality of the 112Gbps SerDes signals sent to the PHY RETIMER chips by the Ethernet switch chip on the service switch node; meanwhile, BMC1 on the service switch node sends an instruction to the Ethernet switch chip through the MDIO link to acquire the quality of the 112Gbps SerDes signals sent to the Ethernet switch chip by the PHY RETIMER0 to PHY RETIMER7 chips on the GPU BOX node. The PHY RETIMER chips can detect the quality of the signals sent by the Ethernet switch chip as well as the link up/down state, and the Ethernet switch chip can likewise detect the quality of the signals sent by the PHY RETIMER chips as well as the link up/down state.

Thirdly, the PHY RETIMER0 to PHY RETIMER7 chips transmit the lane numbers that failed to establish a connection to BMC0 through the MDIO link, and the Ethernet switch chip transmits the lane numbers that failed to establish a connection with the PHY RETIMER0 to PHY RETIMER7 chips to BMC1 through the MDIO link. Next, BMC0 sends the failed lane numbers detected by the PHY RETIMER0 to PHY RETIMER7 chips through the GE port to the management switch, which finally forwards them to BMC1 of the service switch node. BMC0 can also obtain from the PHY RETIMER0 to PHY RETIMER7 chips the causes of the failed lanes, which include poor SI signal quality and a broken cable. Meanwhile, BMC1 sends the failed lane numbers detected on the service switch node through the GE port to the management switch, which finally forwards them to BMC0 of the GPU BOX node. At this point, both BMC0 of the GPU BOX node and BMC1 of the service switch node have acquired the lane numbers that failed to establish a connection and the failure causes.

Fourthly, BMC0 compares the failed lane numbers with the mapping table in BMC0 (corresponding to the correspondence between the plurality of groups of data links and the plurality of cable groups) and prints on the web page of BMC0 which wafer of which connector in the Cable Tray, corresponding to which lane, needs to be replaced; meanwhile, BMC1 compares the failed lane numbers with the mapping table in BMC1 and prints on the web page of BMC1 which wafer of which connector in the Cable Tray, corresponding to which lane, needs to be replaced.

Step six, according to the fault information printed on the web pages of BMC0 and BMC1, it can be seen which wafer (slice group) of the cable, running from which connector of which GPU BOX node to which service switch node, is faulty, and quick replacement and maintenance are carried out.

Step seven, the flow ends.
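The fault location flow above can be summarised in a minimal Python sketch: each BMC holds the failed lane numbers and causes from its own side, the two reports are merged after being exchanged through the management switch, and the merged lanes are looked up in the mapping table to find the wafers to replace. The table contents, the `WaferLocation` fields and the function names are hypothetical placeholders for illustration; the real table is stored in the BMC FW and its format is not given in the description.

```python
from typing import Dict, List, NamedTuple, Tuple


class WaferLocation(NamedTuple):
    gpu_box: str
    connector: str
    wafer: str
    switch_node: str


# Hypothetical excerpt of the lane-to-wafer mapping table held in the BMC FW.
MAPPING_TABLE: Dict[int, WaferLocation] = {
    0: WaferLocation("GPU BOX0", "connector0", "wafer0", "service switch node 0"),
    1: WaferLocation("GPU BOX0", "connector0", "wafer0", "service switch node 0"),
    16: WaferLocation("GPU BOX0", "connector1", "wafer2", "service switch node 0"),
}


def merge_reports(local: Dict[int, str], remote: Dict[int, str]) -> Dict[int, str]:
    """Combine lane -> cause reports from both link ends.

    If both ends report the same lane, the locally detected cause is kept.
    """
    merged = dict(remote)
    merged.update(local)
    return merged


def locate_wafers(
    failed: Dict[int, str],
) -> Tuple[Dict[WaferLocation, List[int]], List[int]]:
    """Group failed lanes by the wafer that carries them.

    Returns the wafers to replace (with their failed lanes) and any lanes
    that are missing from the mapping table.
    """
    by_wafer: Dict[WaferLocation, List[int]] = {}
    unmapped: List[int] = []
    for lane in sorted(failed):
        location = MAPPING_TABLE.get(lane)
        if location is None:
            unmapped.append(lane)
        else:
            by_wafer.setdefault(location, []).append(lane)
    return by_wafer, unmapped
```

Grouping by wafer rather than by lane reflects the fact that the replaceable unit is the wafer, so several failed lanes on the same wafer yield one maintenance action.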

Through this embodiment: the Cable Tray is currently the most advanced scheme for GPU interconnection on AI Rack servers in the industry, but it has the drawbacks of high price, high complexity, high maintenance cost and difficult fault location. The method can effectively address the challenges and problems in current Cable Tray fault location, simplify the system design, shorten AI Rack maintenance time, reduce maintenance cost and improve system reliability.

The method embodiments provided in the embodiments of the present application may be executed in a server apparatus or similar computing device. Taking the example of running on a server device, fig. 6 is a block diagram of the hardware structure of the server device of a multi-node management method according to an embodiment of the present application. As shown in fig. 6, the server device may include one or more (only one is shown in fig. 6) processors 602 (the processor 602 may include, but is not limited to, a microprocessor MCU or a processing device such as a programmable logic device FPGA) and a memory 604 for storing data, where the server device may further include a transmission device 606 for communication functions and an input-output device 608. It will be appreciated by those of ordinary skill in the art that the structure shown in fig. 6 is merely illustrative and is not intended to limit the structure of the server apparatus described above. For example, the server device may also include more or fewer components than shown in fig. 6, or have a different configuration than shown in fig. 6.

The memory 604 may be used to store computer programs, such as software programs of application software and modules, such as computer programs corresponding to the multi-node management method in the embodiment of the present application, and the processor 602 executes the computer programs stored in the memory 604 to perform various functional applications and data processing, that is, implement the method described above. Memory 604 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid state memory. In some examples, memory 604 may further comprise memory located remotely from processor 602, which may be connected to the server device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 606 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the server device. In one example, the transmission device 606 includes a network adapter (Network Interface Controller, NIC for short) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 606 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.

In this embodiment, a multi-node management method is provided. Fig. 7 is a flowchart of a multi-node management method according to an embodiment of the present application. As shown in fig. 7, a server includes a plurality of processor nodes and a plurality of connection nodes; each of the processor nodes includes a plurality of processors and a first controller; each of the connection nodes includes a plurality of cable sets; the connection nodes are used for connecting a reference device and, through the plurality of cable sets, provide the plurality of processors with a plurality of groups of data links for data transmission with the reference device. The method is applied to the first controller, and the flow includes the following steps:

step S702, establishing a correspondence between the plurality of groups of data links and the plurality of cable groups;

Step S704, in the process of data transmission between the plurality of processors and the reference device, detecting a faulty target cable set from the plurality of cable sets according to the correspondence and the fault state of each data link in the plurality of data links.

In the technical solution provided in step S702, each data link may be, but is not limited to being, assigned an explicit identifier at the hardware design stage, for example, a data link ID (identification number), and each cable set may likewise be assigned an explicit identifier, for example, a cable set ID.

Optionally, in this embodiment, the mapping between each data link and its cable set may be, but is not limited to being, recorded to form a data link-cable set mapping table, in which the mapping between data link identifiers and cable set identifiers is recorded in detail.

Optionally, in this embodiment, when the system configuration is changed or a cable set is replaced, the correspondence between data links and cable sets may be, but is not limited to being, adjusted in time to keep the correspondence accurate.
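A data link-cable set mapping table of this kind, together with its update when a cable set is replaced, might look like the following Python sketch. The class `LinkCableMap`, its method names and the string identifiers are hypothetical and chosen only for illustration; the application does not prescribe a data structure.

```python
from typing import Dict, Iterable, Set


class LinkCableMap:
    """Data link ID -> cable set ID table (identifiers are illustrative)."""

    def __init__(self) -> None:
        self._table: Dict[str, str] = {}

    def record(self, link_id: str, cable_set_id: str) -> None:
        """Record (or overwrite) which cable set carries a data link."""
        self._table[link_id] = cable_set_id

    def replace_cable_set(self, old_id: str, new_id: str) -> int:
        """Re-point every link carried by old_id to new_id.

        Returns the number of table entries that were updated.
        """
        updated = 0
        for link_id, cable_set_id in list(self._table.items()):
            if cable_set_id == old_id:
                self._table[link_id] = new_id
                updated += 1
        return updated

    def faulty_cable_sets(self, failed_links: Iterable[str]) -> Set[str]:
        """Return the cable sets that carry at least one failed data link."""
        return {
            self._table[link_id]
            for link_id in failed_links
            if link_id in self._table
        }
```

Keeping the update in one method means that after a cable set is swapped, later lookups for failed links resolve to the new cable set identifier without any other part of the table being touched.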

In the technical solution provided in step S704, after the failed target cable set is detected from the plurality of cable sets, the failed target cable set may be, but is not limited to being, maintained or replaced.

Through the above steps, the fault state of the data links is continuously monitored during data transmission, and the faulty cable group is quickly located through the correspondence, so that abnormalities in data transmission can be responded to in time and prolonged service interruption caused by cable group faults is reduced, thereby improving the running stability of the server.

From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present application.

In this embodiment, a multi-node management device is further provided. The device is used to implement the foregoing embodiments and preferred embodiments, and what has already been described is not repeated here. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the device described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

FIG. 8 is a block diagram of a multi-node management apparatus according to an embodiment of the present application, as shown in FIG. 8, a server including a plurality of processor nodes each including a plurality of processors and a first controller, and a plurality of connection nodes each including a plurality of cable sets for connecting a reference device, the connection nodes providing the plurality of processors with a plurality of sets of data links for data transmission with the reference device through the plurality of cable sets, the apparatus being applied to the first controller, the apparatus including:

A building module 802, configured to build a correspondence between the plurality of groups of data links and the plurality of cable groups;

A detection module 804, configured to detect, in the process of data transmission between the plurality of processors and the reference device, a failed target cable group from the plurality of cable groups according to the correspondence and the fault state of each data link in the plurality of groups of data links.

With this device, the fault state of the data links is continuously monitored during data transmission, and the faulty cable group is quickly located through the correspondence, so that abnormalities in data transmission can be responded to in time and prolonged service interruption caused by cable group faults is reduced, thereby improving the running stability of the server.

It should be noted that each of the above modules may be implemented by software or hardware, and the latter may be implemented by, but not limited to, the above modules all being located in the same processor, or each of the above modules being located in different processors in any combination.

Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.

In an exemplary embodiment, the computer readable storage medium may include, but is not limited to, a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium in which a computer program may be stored.

An embodiment of the application also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.

In an exemplary embodiment, the electronic device may further include a transmission device connected to the processor, and an input/output device connected to the processor.

Embodiments of the application also provide a computer program product comprising a computer program which, when executed by a processor, implements the steps of any of the method embodiments described above.

Embodiments of the present application also provide another computer program product comprising a non-volatile computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the method embodiments described above.

Embodiments of the present application also provide a computer program comprising computer instructions stored in a computer readable storage medium, a processor of a computer device reading the computer instructions from the computer readable storage medium, the processor executing the computer instructions to cause the computer device to perform the steps of any of the method embodiments described above.

Specific examples in this embodiment may refer to the examples described in the foregoing embodiments and the exemplary implementation, and this embodiment is not described herein.

It will be appreciated by those skilled in the art that the modules or steps of the application described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, they may be implemented in program code executable by computing devices, so that they may be stored in a storage device for execution by computing devices, and in some cases, the steps shown or described may be performed in a different order than that shown or described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps of them may be fabricated into a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.

The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the principle of the present application should be included in the protection scope of the present application.

Claims

1. A server, characterized in that it comprises: a processor node and a connection node, wherein the processor node comprises: a plurality of processors, a plurality of first connectors and a first controller, and the connection node comprises: a plurality of cable groups;

The connection node is used to connect to a reference device and provide the multiple processors with multiple groups of data links for data transmission with the reference device through the multiple cable groups;

The plurality of first connectors are connected to the plurality of cable groups in a one-to-one correspondence;

The first controller is used to establish a correspondence between the multiple groups of data links and the multiple cable groups, wherein the correspondence includes a correspondence between a data link identifier and a connector identifier, and a correspondence between the connector identifier and a cable group identifier; during data transmission between the multiple processors and the reference device, a target cable group with a fault is detected from the multiple cable groups according to the correspondence and a fault state of each data link in the multiple groups of data links;

The first controller is further used for, when a fault state is detected in the first candidate data link to indicate a faulty first data link, extracting the first data link identifier of the first data link from the first fault information corresponding to the first data link, obtaining the first target connector identifier corresponding to the first data link identifier from the data link identifier and the first connector identifier having a corresponding relationship, obtaining the target cable group identifier corresponding to the first target connector identifier from the first connector identifier and the cable group identifier having a corresponding relationship, and determining the cable group corresponding to the target cable group identifier as the target cable group, wherein the first fault information includes the fault state of the first data link and the first data link identifier of the first data link.

2. The server according to claim 1, wherein the processor node comprises: a first storage device, the first storage device is used to store a first firmware of the first controller, the first firmware includes a correspondence between the multiple groups of data links and the multiple cable groups, and the first storage device is connected to the first controller;

The first controller is used to read the corresponding relationship from the first storage device when a fault state is detected in the first candidate data link to indicate a faulty first data link, and detect the faulty target cable group from the multiple cable groups based on the corresponding relationship and the first data link, wherein each of the multiple processors is used to transmit data to the reference device through a corresponding data link in the first candidate data link.

3. The server according to claim 2, characterized in that the processor node comprises: a plurality of acquisition devices and a target processor, the plurality of acquisition devices are connected to the target processor, and the target processor is connected to the first controller;

The multiple acquisition devices are used to be connected to the multiple cable groups in a one-to-one correspondence, and to acquire actual link parameters of each data link in the multiple data link groups, wherein the actual link parameters are used to indicate the quality of data transmission between each processor in the multiple processors and the reference device through the corresponding data link in the multiple data link groups;

The multiple processors are used to connect one-to-one with the first candidate acquisition devices corresponding to the first candidate data link in the multiple acquisition devices when there is no fault state in the first candidate data link in the multiple groups of data links to indicate a faulty data link, and extract the link parameters of each data link in the first candidate data link collected by the first candidate acquisition device to obtain a first group of link parameters, and transmit the extracted first group of link parameters to the target processor.

4. The server according to claim 3, characterized in that:

The target processor is used to receive the first group of link parameters extracted and transmitted by the multiple processors, and detect the fault status of each data link in the first candidate data link based on the first group of link parameters to obtain a first group of fault status, and transmit the detected first group of fault information to the first controller, wherein each link parameter in the first group of link parameters is used to indicate the quality of data transmission performed by each processor in the multiple processors to the reference device through the corresponding data link in the first candidate data link, and the first group of fault information includes the first group of fault statuses having a corresponding relationship and the link identifier of each data link in the first candidate data link.

5. The server according to claim 4, characterized in that the processor node comprises: a first conversion device and a second conversion device, the target processor comprises a first processor and a second processor, the first processor and the second processor are both connected to the first controller, the first conversion device is connected to the first processor, the second conversion device is connected to the second processor, the first conversion device is connected to a first group of processors corresponding to the plurality of processors, the second conversion device is connected to a second group of processors corresponding to the plurality of processors, the second group of processors is the processors other than the first group of processors in the plurality of processors, the first group of processors is connected to a first group of acquisition devices corresponding to the plurality of acquisition devices, the second group of processors is connected to a second group of acquisition devices corresponding to the plurality of acquisition devices, and the second group of acquisition devices is the acquisition devices other than the first group of acquisition devices in the plurality of acquisition devices;

The first processor is used to receive, through the first conversion device, a second group of link parameters of the first group of data links corresponding to the first candidate data links extracted by the first group of processors; and detect the fault status of each data link in the first group of data links according to the second group of link parameters, obtain a second group of fault status, and transmit the second group of fault information to the first controller, wherein each processor in the first group of processors is used to transmit data to the reference device through a corresponding data link in the first group of data links, and the second group of fault information includes the second group of fault status and the link identifier of the first group of data links having a corresponding relationship;

The second processor is used to extract, through the second conversion device, a third group of link parameters of the second group of data links corresponding to the first candidate data links extracted by the second group of processors; and detect the fault status of each data link in the second group of data links according to the third group of link parameters, obtain a third group of fault status, and transmit the third group of fault information to the first controller, wherein the second group of processors is used to transmit data to the reference device through the second group of data links, and the third group of fault information includes the third group of fault status and the link identifier of the second group of data links having a corresponding relationship;

The first group of fault states includes the second group of fault states and the third group of fault states, and the first group of link parameters includes the second group of link parameters and the third group of link parameters.

6. The server according to claim 5, characterized in that:

the first processor being configured to detect a relationship between each link parameter in the second group of link parameters and a corresponding link parameter threshold, and, when detecting that there is a first reference link parameter in the second group of link parameters that is less than a first link parameter threshold and whose difference with the first link parameter threshold is greater than or equal to a first difference threshold, determine a failure state of a first reference data link corresponding to the first reference link parameter in the first group of data links as indicating that the first reference data link has failed, wherein the first link parameter threshold is a link parameter of the first reference data link when the first reference data link has not failed;

the second processor is configured to detect a relationship between each link parameter in the third group of link parameters and a corresponding second link parameter threshold, and when detecting that there is a second reference link parameter in the third group of link parameters that is less than the second link parameter threshold and whose difference with the second link parameter threshold is greater than or equal to a second difference threshold, determine a fault state of a second reference data link corresponding to the second reference link parameter in the second group of data links as indicating that the second reference data link has failed, the second link parameter threshold being a link parameter of the second reference data link when the second reference data link has not failed;

The first data link includes the first reference data link and the second reference data link.

7. The server according to claim 3, characterized in that:

The first controller is used to connect to a reference acquisition device corresponding to a reference processor among the multiple acquisition devices when a fault state is detected in the first candidate data link to indicate a faulty reference data link, wherein the first controller is used to extract reference link parameters of the reference data link corresponding to the reference processor in the first candidate data link acquired by the connected reference acquisition device, and the reference processor is a processor corresponding to the reference data link among the multiple processors.

8. The server according to claim 7, characterized in that the processor node comprises: a plurality of gating devices, the plurality of gating devices are connected to the plurality of acquisition devices in a one-to-one correspondence, the plurality of gating devices are connected to the plurality of processors in a one-to-one correspondence, and the plurality of gating devices are all connected to the first controller;

The first controller is used to send a high-level first control signal to an alternative gating device corresponding to the alternative data link among the multiple gating devices when a fault state is detected in the first candidate data link to indicate an alternative data link that has not failed, wherein when the first controller sends the high-level first control signal to the alternative gating device, an alternative processor among the multiple processors is connected to the alternative gating device and a corresponding alternative acquisition device among the multiple acquisition devices.

9. The server according to claim 8, characterized in that:

The first controller is used to send a low-level first control signal to a reference gating device corresponding to the reference processor among the multiple gating devices when a fault state is detected in the first candidate data link to indicate the reference data link that has failed, wherein when the first controller sends the low-level first control signal to the reference gating device, the first controller extracts the reference link parameters collected by the reference acquisition device through the reference gating device.

10. The server according to claim 8, characterized in that each of the plurality of gating devices comprises a first pin, a second pin, a signal input terminal and a data extraction terminal, the first pin is connected to a corresponding processor among the plurality of processors, the second pin is connected to the first controller, the signal input terminal is connected to the first controller, and the data extraction terminal is connected to a corresponding acquisition device among the plurality of acquisition devices;

When the first controller sends the first control signal of a low level to the signal input terminal, the second pin is connected to the data extraction terminal;

When the first controller sends the first control signal of a high level to the signal input terminal, the first pin is connected to the data extraction terminal.
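The switching behaviour recited in claim 10 can be illustrated with a minimal software model. This is only a sketch: the class name `GatingDevice`, the `Level` enum, and the assumed power-on default (high level, so the processor side is routed) are hypothetical and do not come from the claims, which describe a hardware gating device.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    LOW = 0
    HIGH = 1


@dataclass
class GatingDevice:
    """Hypothetical model of one gating device (names are illustrative)."""

    # Assumption: the signal input idles high, so the processor is routed.
    signal_input: Level = Level.HIGH

    def set_control(self, level: Level) -> None:
        """Model the first controller driving the signal input terminal."""
        self.signal_input = level

    def routed_pin(self) -> str:
        """Return which pin is connected to the data extraction terminal."""
        # Low level: second pin (first controller side) is connected.
        # High level: first pin (processor side) is connected.
        if self.signal_input is Level.LOW:
            return "second_pin"
        return "first_pin"


gate = GatingDevice()
normal_route = gate.routed_pin()   # processor reaches the acquisition device
gate.set_control(Level.LOW)
fault_route = gate.routed_pin()    # first controller reaches the acquisition device
```

In this model, the controller drives the signal low only for the gating device of a processor whose data link has failed, which matches the roles of the reference and alternative gating devices in claims 8 and 9.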

11. The server according to claim 2, characterized in that the processor node further comprises: a plurality of acquisition devices, and the plurality of first connectors are connected to the plurality of acquisition devices in a one-to-one correspondence.

12. The server according to claim 1, wherein the reference device comprises: a data switch, the data switch comprises a second controller, and the second controller is connected to the plurality of cable groups;

The second controller is used to establish a correspondence between the multiple groups of data links and the multiple cable groups and, during data transmission between the multiple processors and the reference device, to detect a faulty reference cable group from the multiple cable groups based on the correspondence and the fault status of each data link in the multiple groups of data links.

13. The server according to claim 12, wherein the data switch comprises: a second storage device, the second storage device is used to store a second firmware of the second controller, the second firmware includes a correspondence between the multiple groups of data links and the multiple cable groups, and the second storage device is connected to the second controller;

The second controller is used to read the correspondence from the second storage device when a fault state indicating a failed second data link is detected in the second candidate data link, and to detect the faulty reference cable group from the multiple cable groups based on the correspondence and the second data link, wherein the data switch is used to transmit data to a corresponding processor among the multiple processors through each data link in the second candidate data link, and the second candidate data link is a data link in the multiple groups of data links other than the first candidate data link.

14. The server according to claim 13, characterized in that the data switch comprises: a switching device and a plurality of second connectors, the switching device is connected to the second controller, the plurality of second connectors are connected to the switching device, the plurality of second connectors are connected to the plurality of cable groups in a one-to-one correspondence, and the plurality of cable groups are connected to the plurality of acquisition devices in a one-to-one correspondence;

The second controller is used to send an acquisition instruction to the switching device during data transmission between the multiple processors and the reference device, to extract the second link parameter of the second candidate data link that the switching device obtains, in response to the acquisition instruction, from the second candidate acquisition device corresponding to the second candidate data link among the multiple acquisition devices, and to detect the fault state of the second candidate data link according to the second link parameter, wherein the acquisition instruction is used to request collection of the second link parameter, and the switching device is used to execute the acquisition instruction.

15. The server according to claim 14, characterized in that:

The second controller is used to detect the relationship between each link parameter in the extracted second link parameters and the corresponding link parameter threshold, and when it is detected that there is a third reference link parameter in the second link parameters that is less than a third link parameter threshold and the difference between the third reference link parameter and the third link parameter threshold is greater than or equal to a third difference threshold, determine the fault state of the third reference data link as indicating that the third reference data link has failed, wherein the third reference data link is a data link corresponding to the third reference link parameter in the second candidate data link, and the third link parameter threshold is a link parameter of the third reference data link when the third reference data link has not failed.
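The fault criterion of claim 15 amounts to a two-part comparison, sketched below. The function names, the dictionary layout, and the numeric values are hypothetical; the sketch also assumes that "the difference" means the threshold minus the measured parameter, which the claim does not state explicitly.

```python
from typing import Dict, List


def is_link_failed(value: float, threshold: float, diff_threshold: float) -> bool:
    """Return True when the parameter is below its threshold by at least diff_threshold.

    Assumption: the difference is computed as (threshold - value).
    """
    return value < threshold and (threshold - value) >= diff_threshold


def failed_links(
    params: Dict[str, float],
    thresholds: Dict[str, float],
    diff_threshold: float,
) -> List[str]:
    """Return the identifiers of all links whose parameter meets the fault criterion."""
    return [
        link_id
        for link_id, value in params.items()
        if is_link_failed(value, thresholds[link_id], diff_threshold)
    ]


# Hypothetical link parameters; each threshold is the healthy-state value of that link.
measured = {"link-0": 9.0, "link-1": 7.0, "link-2": 8.0, "link-3": 11.0}
healthy = {"link-0": 10.0, "link-1": 10.0, "link-2": 10.0, "link-3": 10.0}
faulty = failed_links(measured, healthy, diff_threshold=2.0)
```

A small shortfall (such as `link-0`) is tolerated, so the difference threshold acts as a margin against normal measurement variation.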

16. The server according to claim 15, characterized in that:

The second controller is used to, when detecting in the second candidate data link a fault state indicating a failed second data link, obtain the second target connector identifier corresponding to the second data link identifier from the data link identifier and the second connector identifier having a corresponding relationship, obtain the reference cable group identifier corresponding to the second target connector identifier from the second connector identifier and the cable group identifier having a corresponding relationship, and determine the cable group corresponding to the reference cable group identifier as the reference cable group, wherein the correspondence includes the correspondence between the data link identifier and the connector identifier, and the correspondence between the connector identifier and the cable group identifier.

17. The server according to claim 1, characterized in that the reference device comprises: a data switch, the data switch comprises: a second controller, the first controller is connected to a management switch, and the second controller is connected to the management switch;

The first controller is used to transmit the first fault information of the target cable group to the management switch;

The second controller is used to transmit second fault information of the reference cable group to the management switch;

The management switch is used to transmit the second fault information to the first controller, and transmit the first fault information to the second controller.

18. The server according to claim 17, wherein the processor node comprises: a first interface, the first controller is connected to the first interface, the management switch comprises a plurality of management interfaces, the first interface is connected to a corresponding first management interface among the plurality of management interfaces;

The first controller is configured to transmit the first fault information to the first management interface through the first interface, and receive, through the first interface, the second fault information transmitted by the management switch through the first management interface.

19. The server according to claim 18, wherein the data switch comprises: a second interface, the second controller is connected to the second interface, and the second interface is connected to a corresponding second management interface among the plurality of management interfaces;

The second controller is configured to transmit the second fault information to the second management interface through the second interface, and receive, through the second interface, the first fault information transmitted by the management switch through the second management interface.

20. The server according to claim 19, characterized in that:

The first controller is configured to determine the target cable group and the reference cable group as faulty cable groups;

The second controller is configured to determine the target cable group and the reference cable group as faulty cable groups.
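The exchange of fault information described in claims 17 to 20 can be pictured with the following minimal sketch. The classes `Controller` and `ManagementSwitch`, their methods, and the cable group names are hypothetical stand-ins; the claims do not specify a message format or transport.

```python
class Controller:
    """Hypothetical stand-in for the first or second controller."""

    def __init__(self, name: str, locally_detected_group: str) -> None:
        self.name = name
        # Cable group this controller detected as faulty on its own side.
        self.locally_detected_group = locally_detected_group
        # Full set of cable groups this controller treats as faulty.
        self.faulty_groups = {locally_detected_group}

    def receive_fault_info(self, cable_group: str) -> None:
        """Record a faulty cable group reported by the peer controller."""
        self.faulty_groups.add(cable_group)


class ManagementSwitch:
    """Hypothetical relay that forwards each controller's fault information to the other."""

    def exchange(self, first: Controller, second: Controller) -> None:
        first_info = first.locally_detected_group    # first fault information
        second_info = second.locally_detected_group  # second fault information
        second.receive_fault_info(first_info)
        first.receive_fault_info(second_info)


first_controller = Controller("first", "target-cable-group")
second_controller = Controller("second", "reference-cable-group")
ManagementSwitch().exchange(first_controller, second_controller)
```

After the exchange, both controllers hold the same set of faulty cable groups, which corresponds to the outcome recited in claim 20.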

21. A computer system, characterized in that it comprises: a management switch, multiple power supply devices and multiple servers as described in any one of claims 1 to 20, a first group of servers among the multiple servers and the management switch are connected to a first group of power supply devices among the multiple power supply devices, a second group of servers is connected to a second group of power supply devices, the first group of power supply devices is used to power the first group of servers and the management switch, the second group of power supply devices is used to power the second group of servers, the second group of servers is servers other than the first group of servers among the multiple servers, and the second group of power supply devices is power supply devices other than the first group of power supply devices among the multiple power supply devices.

22. A multi-node management method, characterized in that a server comprises: a plurality of processor nodes and a plurality of connection nodes, each of the processor nodes comprises: a plurality of processors, a plurality of first connectors and a first controller, each of the connection nodes comprises: a plurality of cable groups, the plurality of first connectors are connected to the plurality of cable groups in a one-to-one correspondence, the connection nodes are used to connect to a reference device, the connection nodes provide the plurality of processors with a plurality of groups of data links for data transmission with the reference device through the plurality of cable groups, the method is applied to the first controller, and the method comprises:

Establishing a correspondence between the plurality of groups of data links and the plurality of cable groups, the correspondence comprising a correspondence between data link identifiers and connector identifiers, and a correspondence between the connector identifiers and cable group identifiers;

In the process of data transmission between the multiple processors and the reference device, detecting a target cable group with a fault from the multiple cable groups according to the corresponding relationship and the fault status of each data link in the multiple groups of data links;

Wherein, the detecting of a target cable group with a fault from the multiple cable groups includes: in a case where a fault state indicating a failed first data link is detected in a first candidate data link, extracting a first data link identifier of the first data link from first fault information corresponding to the first data link, obtaining a first target connector identifier corresponding to the first data link identifier from a data link identifier and a first connector identifier having a corresponding relationship, obtaining a target cable group identifier corresponding to the first target connector identifier from a first connector identifier and a cable group identifier having a corresponding relationship, and determining the cable group corresponding to the target cable group identifier as the target cable group, wherein the first fault information includes the fault state of the first data link and the first data link identifier of the first data link.
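The two-stage lookup recited in claim 22 (data link identifier to connector identifier, then connector identifier to cable group identifier) might be sketched as follows. All identifiers, the table contents, and the dictionary layout of the fault information are hypothetical; the claim only requires that the fault information carry the fault state and the data link identifier.

```python
from typing import Dict, Optional

# Hypothetical correspondence tables established by the first controller.
LINK_TO_CONNECTOR: Dict[str, str] = {
    "link-0": "connector-A",
    "link-1": "connector-A",
    "link-2": "connector-B",
}
CONNECTOR_TO_CABLE_GROUP: Dict[str, str] = {
    "connector-A": "cable-group-1",
    "connector-B": "cable-group-2",
}


def locate_faulty_cable_group(fault_info: Dict[str, str]) -> Optional[str]:
    """Map first fault information to the target cable group identifier.

    Returns None when the link is not reported as failed or when an
    identifier is missing from the correspondence tables.
    """
    if fault_info.get("fault_state") != "failed":
        return None
    link_id = fault_info.get("link_id")
    connector_id = LINK_TO_CONNECTOR.get(link_id)
    if connector_id is None:
        return None
    return CONNECTOR_TO_CABLE_GROUP.get(connector_id)


target = locate_faulty_cable_group({"fault_state": "failed", "link_id": "link-2"})
```

Keeping the two tables separate mirrors the claim language and means a recabling change only touches the connector-to-cable-group table.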

23. A computer-readable storage medium, characterized in that:

The computer-readable storage medium stores a computer program, wherein the computer program implements the steps of the method described in claim 22 when executed by a processor.

24. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein:

When the processor executes the computer program, the steps of the method according to claim 22 are implemented.

25. A computer program product, comprising a computer program, characterized in that:

When the computer program is executed by a processor, the steps of the method according to claim 22 are implemented.