
CN120763250A - Interconnection system, interconnection system data synchronization method and electronic equipment - Google Patents

Interconnection system, interconnection system data synchronization method and electronic equipment

Info

Publication number
CN120763250A
Authority
CN
China
Prior art keywords
unit
data
access
management
interconnection system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510897385.4A
Other languages
Chinese (zh)
Inventor
张顺顺
李希栓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Metabrain Intelligent Technology Co Ltd
Original Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Metabrain Intelligent Technology Co Ltd filed Critical Suzhou Metabrain Intelligent Technology Co Ltd
Priority to CN202510897385.4A priority Critical patent/CN120763250A/en
Publication of CN120763250A publication Critical patent/CN120763250A/en
Pending legal-status Critical Current

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The present disclosure provides an interconnected system and a method and electronic device for synchronizing data in an interconnected system. The interconnected system includes: multiple resource groups and function convergence units, which establish interconnections between the multiple resource groups; each resource group includes multiple data processing modules and multiple access units with different functions, and the multiple data processing modules are interconnected and perform data interaction based on the interconnection relationship; the multiple access units with different functions are respectively interconnected with the target data processing modules according to a preset topological relationship; in response to data sent by the multiple access units with different functions, the function convergence unit manages the target data processing modules based on the data. Compared with related technologies, the embodiments of the present disclosure can achieve efficient integration and management of resources through the architecture of multiple resource groups and function convergence units, thereby improving data processing efficiency and resource utilization.

Description

Interconnection system, interconnection system data synchronization method and electronic equipment
Technical Field
The disclosure relates to the technical field of servers, and in particular relates to an interconnection system, a method for synchronizing data of the interconnection system and electronic equipment.
Background
With the parameters of artificial intelligence large models reaching the billions, the demand for computing power has become increasingly prominent. In the field of artificial intelligence computing infrastructure, large-scale accelerator card clusters have become a necessary choice for large-model training. In current large-scale thousand-card clusters, the limited computing power and memory bandwidth of the CPU make it difficult to support real-time synchronization of massive parameters across multiple cards; standard network cards copy data multiple times over the PCIe bus, producing high latency; the inefficiency of the traditional protocol stack severely restricts network throughput; and some schemes, owing to networking complexity, still rely on the CPU to relay data for the GPU, so the underlying transmission bottleneck remains unbroken. These problems severely limit the computing power of large-scale accelerator card clusters. Therefore, how to increase the computing power of a large-scale accelerator card cluster is a problem to be solved.
Disclosure of Invention
The disclosure provides an interconnection system, a method for synchronizing data of the interconnection system and an electronic device, mainly aiming to solve the problem of how to construct a large-scale accelerator card cluster and improve its computing power.
According to a first aspect of the present disclosure, an interconnection system is provided, where the interconnection system includes a plurality of resource groups and a function convergence unit, where the function convergence unit establishes interconnection between the plurality of resource groups through corresponding access units with different functions in each resource group;
Each resource group comprises a plurality of data processing modules and a plurality of access units with different functions, wherein the data processing modules are interconnected and perform data interaction based on interconnection relation;
the access units with different functions are respectively interconnected with the target data processing module according to a preset topological relation;
and in response to the received data sent by the plurality of access units with different functions, the function convergence unit respectively manages the target data processing modules based on the data.
Optionally, the plurality of data processing modules include a general purpose computing unit, an acceleration computing unit, and a link switching unit;
the general purpose computing unit performs data interaction with the acceleration computing unit via the link switching unit.
Optionally, the access units with different functions comprise a management access unit, a parameter access unit, a service access unit and a sample access unit;
the parameter access unit is connected with the acceleration calculation unit and transmits the parameter data acquired from the acceleration calculation unit to the parameter convergence unit, wherein the function convergence unit comprises a parameter convergence unit, a management convergence unit, a service convergence unit and a sample convergence unit;
The service access unit is connected with the general computing unit and transmits the service data acquired from the general computing unit to the service convergence unit;
the sample access unit is connected with the general computing unit and transmits sample data acquired from the general computing unit to the sample aggregation unit;
The management access unit is respectively connected with the general computing unit, the acceleration computing unit and the link switching unit and is used for respectively managing the general computing unit, the acceleration computing unit and the link switching unit according to the received management instruction.
Optionally, the management access unit is connected with the controller management network ports corresponding to the general calculation unit, the acceleration calculation unit and the link switching unit respectively;
the management access unit is respectively connected with the exchange management network ports corresponding to the link exchange unit, the management convergence unit and the parameter convergence unit;
The management access unit is connected with the exchange management network port of the sample aggregation unit;
The management access unit is connected with the exchange management network port of the service convergence unit.
Optionally, a target controller is determined from the controllers in the plurality of resource groups, and the target controller is interconnected with the management convergence unit.
Optionally, the target controller constructs the topological relation of each processing component in all the resource groups through the identification information of each processing component in the plurality of resource groups;
A plurality of resource groups are managed based on the topological relation.
Optionally, respectively acquiring uplink and downlink bandwidths of a plurality of access units with different functions and the number of resource groups;
And configuring the target transmission bandwidth of the function convergence unit according to the number of the resource groups and the uplink and downlink bandwidths of a plurality of access units with different functions.
Optionally, according to the number of resource groups and uplink and downlink bandwidths of a plurality of access units with different functions, configuring a parameter access unit, a service access unit and a sample access unit as a first transmission bandwidth;
and configuring a preset second transmission bandwidth for the management access unit, wherein the first transmission bandwidth is larger than the second transmission bandwidth.
Optionally, the acceleration computing unit comprises a plurality of heterogeneous acceleration boards;
dividing the heterogeneous acceleration boards into a plurality of acceleration board groups according to preset dividing conditions, wherein the number of the acceleration board groups is smaller than that of the heterogeneous acceleration boards;
all heterogeneous accelerator cards in the same accelerator card group are connected with the same exchange chip in the link exchange unit.
Optionally, heterogeneous accelerator cards among different accelerator card groups perform data interaction in a bridging manner.
Optionally, the heterogeneous acceleration board card comprises a heterogeneous processor, a data exchange chip and a network module;
The heterogeneous processor is connected with a first end of the data exchange chip, and a second end of the data exchange chip is connected with the network module to form a parameter transmission link;
the heterogeneous processor, the data exchange chip and the network module are integrated in the same circuit board.
Optionally, the parameter data generated by the heterogeneous processor is sent to the parameter access unit via a parameter transmission link.
Optionally, the link switching unit includes a plurality of switching chips, and divides the plurality of switching chips into a plurality of switching chip sets;
and in each exchange chip group, all the exchange chips in the group are interconnected according to a preset topological structure.
Optionally, each switching chip includes a plurality of ports;
The plurality of ports of each switching chip are configured to be the same or different functional ports, wherein the functional ports comprise any one or more of a host port, a device port and an interconnection port.
Optionally, the interconnection between all the switching chips is established in the same switching chipset through the interconnection port;
the exchange chip is connected with a heterogeneous acceleration board card in the acceleration computing unit through an equipment port;
The switching chip is connected with a processor in the general-purpose computing unit through a host port.
Optionally, the general purpose computing unit comprises at least two network adapters, wherein the network adapters perform data transmission by means of direct memory access.
Optionally, the network adapter comprises a first network adapter and a second network adapter;
The first network adapter is connected with the service access unit and is used for transmitting service data generated by the processor in a direct memory access mode;
the second network adapter is connected with the sample access unit and is used for transmitting sample data generated by the processor in a direct memory access mode.
Optionally, the target controller collects the identity of the acceleration computing unit in each resource group and gathers the global identity to generate a routing mapping table;
and respectively sending the route mapping table to a plurality of resource group nodes, and triggering the driving layer to load the configuration of the route mapping table.
According to a second aspect of the present disclosure, there is provided a method of data synchronization of an interconnection system, including:
responding to the received training samples, and respectively issuing the training samples to a plurality of resource groups;
responding to a training completion instruction of a training sample, and transmitting training data generated in a training process to respective corresponding function convergence units through access units with different functions;
Based on the function convergence unit and a preset routing relation, training data are synchronized to a plurality of data processing modules in other resource groups for data interaction.
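Purely as an illustrative reading of the above three steps (the Python class names, field names and the full-mesh routing table below are assumptions made for the sketch, not elements recited by the disclosure), the flow can be modeled as:

```python
# Minimal sketch of the second-aspect synchronization flow, assuming a simplified
# in-memory model of resource groups; all names here are illustrative only.
from dataclasses import dataclass, field


@dataclass
class ResourceGroup:
    group_id: int
    training_data: dict = field(default_factory=dict)   # produced by local training
    synced_data: dict = field(default_factory=dict)      # received from other groups

    def train(self, samples):
        # placeholder for local training on the issued samples
        self.training_data = {"params": f"gradients-over-{len(samples)}-samples"}

    def upload_via_access_units(self):
        # the access units of different functions forward data to their convergence units
        return {"parameter_plane": self.training_data}


def synchronize(groups, samples, routing):
    # Step 1: issue the received training samples to every resource group
    for g in groups:
        g.train(samples)
    # Step 2: on training completion, each group uploads its training data
    #         through the access units with different functions
    uploaded = {g.group_id: g.upload_via_access_units() for g in groups}
    # Step 3: the function convergence unit synchronizes the data to the data
    #         processing modules of the other resource groups per the preset routing
    for src, payload in uploaded.items():
        for dst in routing.get(src, []):
            groups[dst].synced_data[src] = payload["parameter_plane"]


if __name__ == "__main__":
    pods = [ResourceGroup(i) for i in range(4)]
    full_mesh = {i: [j for j in range(4) if j != i] for i in range(4)}
    synchronize(pods, samples=list(range(1024)), routing=full_mesh)
    print(sorted(pods[1].synced_data))   # data received from pods 0, 2, 3
```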
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
At least one processor, and
A memory communicatively coupled to the at least one processor, wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of interconnection system data synchronization of the second aspect described above.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of the preceding first aspect.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the first aspect described above.
Compared with the related art, the embodiments of the present disclosure can realize efficient integration and management of resources through the architecture of a plurality of resource groups and function convergence units, and the data processing modules in each resource group can be closely interconnected and interact with each other, thereby improving data processing efficiency and resource utilization. Secondly, the access units with different functions are interconnected with the target data processing modules according to a preset topological relation, which ensures the accuracy and efficiency of data transmission; meanwhile, the function convergence unit manages the received data and further optimizes the data flow direction and the processing flow. This design not only improves the flexibility and scalability of the system, but also enhances its stability and reliability, providing powerful support for a multi-card interconnection system in high-performance computing and large-scale data processing tasks.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic structural diagram of an interconnection system according to an embodiment of the disclosure;
FIG. 2 is a schematic diagram of another interconnect system according to an embodiment of the present disclosure;
Fig. 3 is a schematic diagram of heterogeneous acceleration card interconnection topology provided in an embodiment of the present disclosure;
fig. 4 is a wiring diagram of a management access unit according to an embodiment of the present disclosure;
Fig. 5 is a schematic diagram of an interconnection system network connection according to an embodiment of the disclosure;
FIG. 6 is a three-view of a heterogeneous accelerator card provided by embodiments of the present disclosure;
fig. 7 is a schematic structural diagram of a heterogeneous accelerator card according to an embodiment of the disclosure;
fig. 8 is a flowchart of a method for synchronizing data in an interconnection system according to an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The following describes an interconnection system, a method for synchronizing data of the interconnection system, and an electronic device according to an embodiment of the disclosure with reference to the accompanying drawings.
Fig. 1 is a schematic structural diagram of an interconnection system according to an embodiment of the disclosure. The interconnection system comprises a plurality of resource groups 1 and a function convergence unit 2, wherein the function convergence unit 2 establishes interconnection among the plurality of resource groups 1 through corresponding access units with different functions in each resource group 1;
each resource group 1 comprises a plurality of data processing modules 11 and a plurality of access units 12 with different functions, wherein the data processing modules 11 are interconnected, and data interaction is performed based on the interconnection relationship.
The access units 12 with different functions are respectively interconnected with the target data processing module 11 according to a preset topological relation.
In response to the received data transmitted by the plurality of access units 12 of different functions, the function convergence unit 2 manages the target data processing modules 11 based on the data, respectively.
In the embodiment of the disclosure, the interconnection system is used to realize resource interconnection and collaborative management in a thousand-card cluster, and its core is composed of a plurality of resource groups 1 and a function convergence unit 2. A resource group 1 is a basic unit with independent data processing capability in the cluster; it comprises a plurality of data processing modules 11 and access units 12 with different functions, realizing pooling and local interaction of computing resources. The function convergence unit 2 is a management center spanning the resource groups; it establishes global interconnection links through the access units 12 in each resource group 1, thereby realizing data convergence and unified control.
Internal structure and data interaction mechanism of the resource group 1. Regarding the interconnection design of the data processing modules 11, the multiple data processing modules 11 contained in each resource group 1 (such as heterogeneous accelerator cards and general purpose computing nodes) implement physical and logical interconnection through a high-speed bus (such as a PCIe Fabric) or a switching network, forming a localized data interaction network. The network supports direct communication (such as point-to-point transmission) among the data processing modules 11, reduces communication delay through topology optimization (such as a 2×4 Fabric architecture), and ensures the real-time performance of data interaction. Regarding the function division of the access units 12 and their interconnection topology, the access units 12 can be divided into, but are not limited to, a parameter plane access unit, a service plane access unit, a sample plane access unit and a management plane access unit, corresponding respectively to the four major network functions in the cluster (parameter transmission, service scheduling, sample reading and device management). Each access unit 12 is interconnected with the target data processing module 11 based on a preset topological relation (such as a Leaf-Spine architecture): it is directly connected to a network port of the data processing module 11 through a high-speed interface (such as a 400G/800G optical module) and accesses the function convergence unit 2 according to a hierarchical topology (such as a two-layer CLOS network), forming a transmission path of data processing module 11 → access unit 12 → function convergence unit 2.
Management mechanism and data processing flow of the function convergence unit 2. For data aggregation and forwarding across resource groups, the function convergence unit 2 (such as a 51.2T parameter plane switch or a 25.6T service plane aggregation switch) receives the data sent by the access units 12 of each resource group 1 and, based on the data type (parameter plane data, service plane data, etc.) and the target address, forwards the data efficiently through a full-mesh topology or equal-cost multi-path routing (ECMP), so that a data processing module 11 in any resource group 1 can establish a non-blocking connection with other modules. For data-based resource management and control, the function convergence unit 2 realizes state monitoring and management of the target data processing module 11 by analyzing the data sent by the access unit 12, for example: power management, i.e., acquiring the power consumption information of the data processing module 11 through the out-of-band management network (gigabit switch) to realize power redundancy and energy consumption optimization; topology discovery, i.e., automatically identifying the interconnection relation of the data processing modules 11 based on the topology mapping data fed back by the BMC management module and dynamically adjusting forwarding paths; and fault isolation, i.e., automatically switching to a standby path when a single module or link fails, thereby guaranteeing the reliability of data transmission. The above examples are only some realizations of the embodiments of the present disclosure, and the present disclosure does not limit the target data processing module 11.
In summary, the embodiment of the disclosure combines the direct interconnection of the data processing modules 11 in the resource group 1, the hierarchical topology design of the access units 12 and the full interconnection architecture of the function convergence unit 2 to realize low-delay, 1:1 non-blocking data transmission within the cluster, meeting the high-throughput, zero-packet-loss communication requirements at thousand-card scale. The decoupled design of the access units 12 and the data processing modules 11 supports physical pooling and cross-node sharing of computing resources (for example, through the access of heterogeneous accelerator cards). The function convergence unit 2 realizes unified monitoring and scheduling of distributed resources through data analysis, improving the utilization rate and management efficiency of cluster resources. Based on the hierarchical architecture, the number of levels of the function convergence unit 2 (such as a three-stage Spine) can be increased to scale to thousand-card size; meanwhile, link redundancy and the failure switching mechanism ensure that when some nodes fail, the cluster can still maintain communication continuity and the impact of failures is reduced.
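For readability, the dispatch behavior described above, in which the function convergence unit 2 handles received access-unit data according to data type and target address, can be sketched as follows; the plane labels, class name and handler callbacks are illustrative assumptions, not part of the disclosed implementation:

```python
# Illustrative-only sketch of a function convergence unit dispatching received
# access-unit data to a per-plane management routine; names are assumptions.
class FunctionConvergenceUnit:
    def __init__(self):
        self.handlers = {}   # plane type -> management routine

    def register(self, plane, handler):
        self.handlers[plane] = handler

    def on_data(self, plane, target_module, payload):
        # manage/forward based on data type and target address (cf. the full-mesh /
        # ECMP forwarding described above); unknown planes are ignored in this sketch
        handler = self.handlers.get(plane)
        if handler is not None:
            handler(target_module, payload)


fcu = FunctionConvergenceUnit()
fcu.register("management", lambda mod, p: print(f"power/health update for {mod}: {p}"))
fcu.register("parameter",  lambda mod, p: print(f"parameter sync toward {mod}: {len(p)} bytes"))
fcu.on_data("management", "gpu-03", {"power_w": 412})
fcu.on_data("parameter", "gpu-07", b"\x00" * 4096)
```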
In one embodiment, the plurality of data processing modules 11 includes a general purpose computing unit 111, an acceleration computing unit 112, a link switching unit 113;
the general purpose computing unit 111 performs data interaction with the acceleration computing unit 112 via the link switching unit 113.
Specifically, referring to fig. 2, the plurality of data processing modules 11 include a general purpose computing unit 111, an acceleration computing unit 112 and a link switching unit 113, which correspond to the core hardware components in a thousand-card cluster.
The general purpose computing unit 111 adopts a CPU general server node integrating a dual-socket HOST processor and provides system control and logic operation functions; the acceleration computing unit 112, based on heterogeneous accelerator cards integrating GPU chips and network chips (such as 16 GPU cards in a chassis), supports tensor computation with high compute density; and the link switching unit 113 constructs a 2×4 Fabric topology with an SW-BOX (Pearl) composed of 8 PCIe Gen5 Switch chips to realize high-speed data exchange.
The general purpose computing unit 111 is connected to the HOST-mode port (1 group of x16 PCIe lanes) of the link switching unit 113 through a PCIe link, and the acceleration computing unit 112 is connected to the link switching unit 113 through the Device-mode ports (4 groups of x16 PCIe lanes), forming a three-level interconnection structure of general purpose computing unit 111 → link switching unit 113 → acceleration computing unit 112. For example, a dual-socket HOST processor may be connected to 32 acceleration computing units 112 through the Fabric network of the link switching unit 113 (each Switch chip supports 4 groups of Device-mode ports). Fig. 3 shows a Fabric network composed of accelerator cards and switching chips according to an embodiment of the present disclosure, where the acceleration computing units 112 are connected to the high-speed cables of the link switching unit 113 through retimer modules (such as CDFP interfaces in a chassis) to implement PCIe signal enhancement and transmission.
In one embodiment, the plurality of access units 12 of different functions includes a management access unit 121, a parameter access unit 122, a service access unit 123, and a sample access unit 124;
the parameter access unit 122 is connected with the acceleration calculation unit 112, and transmits the parameter data acquired from the acceleration calculation unit 112 to the parameter aggregation unit 22, wherein the function aggregation unit 2 comprises the parameter aggregation unit 22, the management aggregation unit 21, the service aggregation unit 23 and the sample aggregation unit 24;
the service access unit 123 is connected to the general-purpose computing unit 111, and transmits the service data acquired from the general-purpose computing unit 111 to the service convergence unit 23;
the sample access unit 124 is connected to the general-purpose computing unit 111, and transmits sample data acquired from the general-purpose computing unit 111 to the sample aggregation unit 24;
The management access unit 121 is connected to the general purpose computing unit 111, the acceleration computing unit 112, and the link switching unit 113, respectively, and is configured to manage the general purpose computing unit 111, the acceleration computing unit 112, and the link switching unit 113, respectively, according to the received management instruction.
Specifically, the multiple access units 12 with different functions include a management access unit 121, a parameter access unit 122, a service access unit 123, and a sample access unit 124, which correspond to access nodes of four large networks in the interconnection system.
Management access unit 121 is a gigabit out-of-band management switch, and connects the BMC network port of general purpose computing unit 111, the management interface of acceleration computing unit 112, and the control port of link switching unit 113.
In a specific embodiment, the management access unit 121 is connected to the controller management interfaces corresponding to the general computing unit 111, the acceleration computing unit 112, and the link switching unit 113, respectively;
The management access unit 121 is connected with the exchange management network ports corresponding to the link exchange unit 113, the management convergence unit 21 and the parameter convergence unit 22 respectively;
the management access unit 121 is connected with the exchange management network port of the sample aggregation unit 24;
the management access unit 121 is connected to the switching management portal of the service convergence unit 23.
Specifically, in the embodiment of the present disclosure, each device unit is connected to the management access unit 121 through a controller management network port and a switching management network port, and is managed and controlled by the management access unit 121, so as to implement status monitoring, configuration issuing and fault management for the general computing unit 111, the acceleration computing unit 112, the link switching unit 113, the management convergence unit 21, the parameter convergence unit 22, the service convergence unit 23 and the sample convergence unit 24. The specific connection relationship between the management access unit 121 and the foregoing respective units is shown in fig. 4. The management access unit 121 is used as a core access node of an out-of-band management network of the interconnection system, and uses a gigabit out-of-band management switch to realize management interconnection of the global device, and the interface configuration is as follows:
a device management interface, provided with multiple gigabit RJ45 network ports, used for connecting the controller management network ports of the general computing unit 111, the acceleration computing unit 112 and the link switching unit 113;
and a switching management interface, configured with high-speed network ports, used for the switching management network ports of the uplink management convergence unit 21 and the downlink parameter convergence unit 22, service convergence unit 23 and sample convergence unit 24.
The BMC module packages the data of CPU state, memory utilization rate and the like into an IPMI protocol packet, and uploads the IPMI protocol packet to the management convergence unit 21 through the management access unit 121 to realize the functions of remote startup and shutdown, firmware upgrading and the like of the general server.
The management access unit 121 uplinks to the management convergence unit 21 (a 48-port gigabit aggregation switch) over 10G optical fiber, forming a two-layer access-aggregation management network: the 16 management access units 121 in a single POD are each redundantly connected to the management convergence unit 21 through two 10G links, and the management convergence unit 21 aggregates the management data of the whole cluster and interfaces with an upper-layer operation and maintenance platform (such as OpenStack) to realize centralized management of thousand-card-scale equipment.
The exchange management network port (e.g., console port, MGMT port) of the parameter aggregation unit 22 is connected to the management access unit 121, and is used for receiving configuration instructions (e.g., VLAN division, ECMP routing table update) of the parameter plane exchange, uploading fault alarms (e.g., optical module failure, port congestion) of the parameter plane link, and forwarding to the operation and maintenance system through the management access unit 121.
The exchange management network ports of the service convergence unit 23 and the sample convergence unit 24 access the out-of-band management network through the management access unit 121, so as to realize the configuration synchronization (such as QoS strategy and ACL rule) of the service plane/sample plane exchange, upload the bandwidth utilization monitoring data, and support the management system to dynamically adjust the priority of the service and the sample flow.
In a specific embodiment, a target controller is determined from among the controllers in the plurality of resource groups 1, and the target controller is interconnected with the management convergence unit 21.
On the basis of the embodiment, the target controller constructs the topological relation of each processing component in all the resource groups 1 through the identification information of each processing component in the plurality of resource groups 1;
the plurality of resource groups 1 are managed based on the topological relation.
Specifically, the target controller is selected from the controllers of the plurality of resource groups 1, specifically the BMC management module of the general computing unit 111 or the management module of the link switching unit 113, and is interconnected with the management convergence unit 21 by accessing the management access unit 121 through a gigabit network port and then uplinking to the management convergence unit 21 (a 48-port gigabit aggregation switch) over 10G optical fiber.
The target controller obtains the unique identification (ID) and physical parameters of each processing component through the out-of-band management network, including: for the general purpose computing unit 111, the CPU model, motherboard SN number and PCIe root port count; for the acceleration computing unit 112, the PCIe address of the GPU card, the HBM capacity and the network chip model (e.g., an integrated 400G NIC); and for the link switching unit 113, the WWN number of the switch chip and its Fabric topology level (e.g., the row/column coordinates in the 2×4 matrix).
According to the three-level architecture of general computing unit 111 → link switching unit 113 → acceleration computing unit 112, the parent-child connection relation of all components is recorded (for example, a HOST port of a Switch chip is connected to a CPU of the general computing unit 111); link parameters are recorded, i.e., the physical attributes of each PCIe link (such as Gen5 x16 lanes and a transmission rate of 16 GB/s) and the optical module type (such as 800G SR8) and length (50 m) of the optical fiber connections are collected; and resource pooling mapping is performed, i.e., the GPU cards of the acceleration computing unit 112 are divided into accelerator card groups by logical group (for example, every 4 cards in a chassis), and each accelerator card group is associated to the corresponding ports of the link switching unit 113.
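A minimal sketch of this topology-construction step, assuming simplified identification records collected over the out-of-band management network (the record fields and the function name below are hypothetical):

```python
# Assemble per-component identifiers into a three-level topology:
# resource group -> switch chip -> accelerator card. Field names are illustrative.
def build_topology(component_records):
    """component_records: iterable of dicts such as
    {"pod": 0, "kind": "switch", "id": "SW0-3"} or
    {"pod": 0, "kind": "gpu", "id": "GPU12", "parent": "SW0-3",
     "link": {"gen": 5, "lanes": 16}}"""
    topo = {}
    for rec in component_records:
        pod = topo.setdefault(rec["pod"], {"switches": {}, "hosts": []})
        if rec["kind"] == "switch":
            pod["switches"].setdefault(rec["id"], {"devices": []})
        elif rec["kind"] == "gpu":
            sw = pod["switches"].setdefault(rec["parent"], {"devices": []})
            sw["devices"].append({"id": rec["id"], "link": rec.get("link", {})})
        elif rec["kind"] == "host":
            pod["hosts"].append(rec["id"])
    return topo


records = [
    {"pod": 0, "kind": "host", "id": "Host0"},
    {"pod": 0, "kind": "switch", "id": "SW0-0"},
    {"pod": 0, "kind": "gpu", "id": "GPU0", "parent": "SW0-0",
     "link": {"gen": 5, "lanes": 16}},
]
print(build_topology(records))
```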
In a specific embodiment, the uplink and downlink bandwidths of the access units 12 with different functions and the number of the resource groups 1 are respectively acquired;
the target transmission bandwidth of the function convergence unit 2 is configured according to the number of the resource groups 1 and the uplink and downlink bandwidths of the access units 12 with different functions.
Specifically, as shown in fig. 5, the uplink and downlink bandwidths of each function access unit 12 are configured as follows. The parameter access unit 122 provides 64 downlink 400G ports (connecting the acceleration calculation unit 112), with a single-port bandwidth of 400 Gbps and a total downlink bandwidth of 25.6 Tbps, and 32 uplink 800G ports (connecting the parameter aggregation unit 22), with a single-port bandwidth of 800 Gbps and a total uplink bandwidth of 25.6 Tbps.
The service/sample access units 123/124 each have 2 downlink 400G ports (connecting the intelligent network cards of the general calculation unit 111) with a total downlink bandwidth of 800 Gbps, and 2 uplink 400G ports (connecting the service/sample aggregation units 23/24) with a total uplink bandwidth of 800 Gbps. The management access unit 121 uses gigabit ports (1 Gbps) for both uplink and downlink, and its total bandwidth meets the out-of-band management requirement.
The number of resource groups 1 is defined as follows. The accelerator cards of the interconnection system form a thousand-card cluster; the interconnection system in this embodiment is composed of 16 PODs, each POD being an independent resource group 1, so the total number of resource groups 1 is 16. Each resource group 1 comprises an acceleration calculation unit 112 composed of 64 heterogeneous accelerator cards, a general calculation unit 111 composed of processors, a link switching unit 113 composed of switching chips, and 4 classes of access units 12 (parameter/service/sample/management).
The parameter access unit 122 (51.2T switch) is connected to the network module of the acceleration calculation unit 112 (such as a heterogeneous accelerator card with an integrated NIC) through 400G optical fiber, and uplinks to the parameter aggregation unit 22 (eight 51.2T aggregation switches) through 800G links. The parameter data of the acceleration calculation unit 112 is transmitted along the path HBM → network module → parameter access unit 122 → parameter aggregation unit 22, supporting 1:1 non-blocking transmission and meeting the high-throughput synchronization requirement of model parameters in distributed training (such as parameter exchange in a 1024-card cluster). The service access unit 123 and sample access unit 124 interact with their corresponding aggregation units. The interface configuration of the general computing unit 111 is to plug in two 400G intelligent network cards, corresponding respectively to data transmission on the service plane and the sample plane; they bypass the CPU and directly access memory through DMA technology. In the networking topology, the service access unit 123 and the sample access unit 124 adopt a Leaf-Spine architecture, and the Leaf-layer 1.6T switches are connected to the service convergence unit 23 / sample convergence unit 24 (25.6T aggregation switches) through 2×400G uplinks, realizing high-speed forwarding of service scheduling data and sample data (such as parallel reading of model training samples).
Based on the above embodiment, according to the number of resource groups 1 and the uplink and downlink bandwidths of the access units 12 with different functions, the parameter access unit 122, the service access unit 123 and the sample access unit 124 are configured to be the first transmission bandwidth;
a preset second transmission bandwidth is configured for the management access unit 121, wherein the first transmission bandwidth is greater than the second transmission bandwidth.
Specifically, the interconnection system includes 16 resource groups 1 (PODs), and each resource group 1 is configured as follows:
the parameter access unit 122 is one 51.2T switch, with 64 downlink 400G ports (400 Gbps per port) and 32 uplink 800G ports (800 Gbps per port);
the service/sample access units 123/124 are one 1.6T switch each, with 2×400G ports for both uplink and downlink;
the management access unit 121 is one gigabit switch, with 1 Gbps ports for uplink and downlink.
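Using the example figures quoted above, the sizing of the convergence-layer bandwidth from the number of resource groups and the per-access-unit uplink bandwidth can be checked with a short back-of-the-envelope calculation (the helper name and dictionary layout are illustrative only):

```python
# Size a convergence unit's aggregate bandwidth from the number of resource groups
# and each access unit's uplink port count and rate, for 1:1 non-blocking transport.
def convergence_bandwidth_tbps(num_resource_groups, uplinks_per_group, uplink_gbps):
    return num_resource_groups * uplinks_per_group * uplink_gbps / 1000.0

# Parameter plane: 16 PODs, each parameter access unit uplinking 32 x 800G ports
print(convergence_bandwidth_tbps(16, 32, 800))   # 409.6 Tbps across the Spine layer

# Service/sample plane: 16 PODs, each access unit uplinking 2 x 400G ports
print(convergence_bandwidth_tbps(16, 2, 400))    # 12.8 Tbps

# First vs. second transmission bandwidth: the data planes get the high rate,
# while the management plane keeps a modest out-of-band rate.
plane_bandwidth_gbps = {"parameter": 400, "service": 400, "sample": 400, "management": 1}
assert all(plane_bandwidth_gbps[p] > plane_bandwidth_gbps["management"]
           for p in ("parameter", "service", "sample"))
```

The 409.6 Tbps figure is consistent with the eight 51.2T parameter aggregation switches mentioned above for the Spine layer.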
In one embodiment, the acceleration computing unit 112 includes a plurality of heterogeneous acceleration boards 1121;
dividing the plurality of heterogeneous acceleration boards 1121 into a plurality of acceleration board groups according to preset dividing conditions, wherein the number of the plurality of acceleration board groups is smaller than that of the plurality of heterogeneous acceleration boards 1121;
all heterogeneous accelerator cards 1121 in the same accelerator card group are connected to the same switch chip in the link switch unit 113.
Specifically, the disclosed embodiment provides a heterogeneous acceleration board 1121 as shown in fig. 6. The acceleration computing unit 112 includes 64 heterogeneous acceleration boards 1121 (such as GPU cards in a chassis), each integrating a GPU chip and 4 network chips and supporting a PCIe Gen5 x16 interface. The specific parameters include: the GPU chip, with a double-width, full-height, three-quarter-length form factor; and the network module, an integrated NIC supporting P2P network transmission and PCIe data offloading.
According to the preset condition of "connection to the same switch chip", the 64 heterogeneous acceleration boards 1121 are divided into N acceleration board groups (N < 16), and each group corresponds to one Broadcom PCIe Gen5 Switch chip in the link switching unit 113, realizing logical grouping of the physical links.
The link switching unit 113 constructs a 2×4 Fabric topology using 8 Switch chips; each chip supports 4 groups of x16 PCIe lanes set to Device mode for connecting heterogeneous accelerator cards, so a single chip can accommodate 4 heterogeneous accelerator boards 1121 (1 card connected to each group of x16 lanes).
As for the grouping number and group size, the 16 boards in a chassis are divided into 4 acceleration board groups (N=4), with each group of 4 boards corresponding to the Device ports of 4 Switch chips. Each heterogeneous acceleration board 1121 is connected to an x16-lane port of a Switch chip through a retimer module (such as a CDFP interface), forming a signal path of board → retimer → Switch chip, and the same group of 4 boards shares one of the 4 groups of x16 lanes of the Switch chips.
A single POD contains 4 chassis (64 boards) and requires 32 Switch chips (8 chips/SW-BOX x 4 SW-BOX) to implement the grouping.
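The grouping rule just described (boards split into groups, each group mapped to one switch chip, with fewer groups than boards) can be sketched as follows; the group size of 4 follows the 4 Device-mode x16 port groups per chip cited above, and all names are illustrative:

```python
# Illustrative grouping of accelerator boards onto switch chips: group size equals
# the number of Device-mode x16 port groups per chip, and every board in a group
# maps to the same chip. Not the disclosed implementation, just the stated rule.
def group_boards(board_ids, cards_per_switch_chip=4):
    groups = {}
    for index, board in enumerate(board_ids):
        chip = index // cards_per_switch_chip      # group number == switch chip number
        groups.setdefault(chip, []).append(board)
    return groups


boards = [f"ACC{i:02d}" for i in range(16)]        # one chassis of 16 boards
assignment = group_boards(boards)                  # 4 groups -> 4 switch chips
assert len(assignment) < len(boards)               # fewer groups than boards
for chip, members in assignment.items():
    print(f"switch chip {chip}: {members}")
```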
In one embodiment, heterogeneous accelerator cards 1121 among different accelerator card groups interact data by way of bridging.
Specifically, bridging interconnection across the accelerator board card group is implemented by a bridging device (e.g., PCIe bridge chip). A single POD contains 2 SW-BOX (16 Switch chips total), 8 bridges (1 between every 2 chips) are deployed, supporting the cross-group interconnection of 64 boards.
In one embodiment, the heterogeneous accelerator board 1121 comprises a heterogeneous processor 11211, a data exchange chip 11212, and a network module 11213;
The heterogeneous processor 11211 is connected with a first end of the data exchange chip 11212, and a second end of the data exchange chip 11212 is connected with the network module 11213 to form a parameter transmission link;
The heterogeneous processor 11211, the data exchange chip 11212, and the network module 11213 are integrated in the same circuit board.
Specifically, the heterogeneous processor 11211 employs a high-performance GPU chip. The data exchange chip 11212 is a PCIe Gen5 Switch chip supporting 16 groups of x16 PCIe Gen5 lanes.
The network module 11213 integrates a network chip, supports QSFP-DD/OSFP optical module interfaces, is compatible with 400G/800G transmission rate, supports protocols such as RoCE v2, TCP/IP, UDP and the like, and meets the non-blocking transmission requirement of a parameter plane.
As shown in FIG. 7, the three are integrated on the same PCB layer, and the layout follows the principle of signal integrity: the heterogeneous processor 11211 and the data exchange chip 11212 are interconnected through a ceramic package substrate with a spacing of no more than 5 mm to reduce signal delay, and the network module 11213 is arranged at the edge of the circuit board and connected to the data exchange chip 11212 through high-speed differential routing to shorten the trace length.
The physical connection of the parameter transmission link is: heterogeneous processor 11211 → data exchange chip 11212 (first end, x16 lanes) → data exchange chip 11212 (second end, x16 lanes) → network module 11213. The heterogeneous processor 11211 interacts with the data exchange chip 11212 through HBM memory, and the second end of the data exchange chip 11212 is connected to the network module 11213 to realize parameter plane data offloading.
The network module 11213 is connected to the parameter access unit 122 (51.2T switch) through optical fiber, supporting Scale-out networking; the network interface of a single heterogeneous accelerator board 1121 can access the Leaf switch, and the parameter plane data can be transmitted simultaneously over the PCIe link (high bandwidth) and the network link (low latency).
In one embodiment, the parameter data generated by heterogeneous processor 11211 is sent to parameter access unit 122 via a parameter transmission link.
Specifically, the parameter data generated by the heterogeneous processor 11211 is transmitted to the parameter access unit 122 along the path heterogeneous processor 11211 → data exchange chip 11212 (PCIe Gen5 x16 lanes) → network module 11213 (400G NIC) → optical fiber → parameter access unit 122 (51.2T switch).
The heterogeneous processor 11211 is integrated on the heterogeneous accelerator board 1121, the data switching chip 11212 is a Broadcom PCIe Gen5 Switch compatible with the Fabric network of the link switching unit 113, and the network module 11213 is a 400G optical module. The heterogeneous processor 11211 interacts with the data exchange chip 11212 through HBM memory, and the PCIe link bandwidth from the data exchange chip 11212 to the network module 11213 is 16 GB/s (x16 lanes).
The network module 11213 of each heterogeneous accelerator board 1121 is connected to a downlink port of the parameter access unit 122 through optical fiber. The specific topology is that a single parameter access unit 122 (51.2T switch) supports 64 400G downlink ports, connecting 64 heterogeneous accelerator boards 1121; 16 parameter access units 122 form the Leaf layer and are interconnected with the parameter aggregation units 22 (Spine layer) to form a two-layer CLOS networking architecture.
At the heterogeneous processor 11211 end, the generated model parameters (such as weight parameters in GPT-4 training) are sent to the data exchange chip 11212 through the HBM memory buffer and the PCIe bus; at the data exchange chip 11212 end, the parameter data is route-marked (such as with the target POD ID and GPU card number) and forwarded to the network module 11213 through a Fabric-mode port; at the network module 11213 end, the parameter data is encapsulated into RoCE frames and sent to the parameter access unit 122 through optical fiber. The corresponding parameter plane data transmission path is heterogeneous accelerator → HBM → network module → network switch.
When the parameter data needs to be transmitted within the same POD, communication goes through the PCIe Fabric network of the link switching unit 113 (the low-delay path); when the parameter data is transmitted across PODs, thousand-card-scale communication is realized through the path network module 11213 → parameter access unit 122 → parameter convergence unit 22.
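As a simple illustration of the two cases above (intra-POD traffic over the PCIe Fabric, cross-POD traffic over the parameter plane network), a hypothetical path-selection helper might look like this; the stage labels are descriptive only:

```python
# Illustrative path selection between the intra-POD PCIe Fabric path and the
# cross-POD parameter plane network path; the function and labels are assumptions.
def choose_parameter_path(src_pod, dst_pod):
    if src_pod == dst_pod:
        # low-latency path within the resource group
        return ["heterogeneous processor", "PCIe Fabric (link switching unit)",
                "destination accelerator"]
    # cross-POD path over the parameter plane network
    return ["heterogeneous processor", "network module (400G NIC)",
            "parameter access unit (Leaf)", "parameter convergence unit (Spine)",
            "destination POD parameter access unit", "destination accelerator"]


print(choose_parameter_path(0, 0))   # intra-POD: stays on the PCIe Fabric
print(choose_parameter_path(0, 7))   # cross-POD: climbs the Leaf-Spine network
```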
In one embodiment, the link switch unit 113 includes a plurality of switch chips 1131, and divides the plurality of switch chips 1131 into a plurality of switch chip groups;
in each exchange chip group, all exchange chips 1131 in the group are interconnected according to a preset topology structure.
Specifically, the link Switch unit 113 employs PCIe Gen5 Switch chips (Switch chips), and a single chip supports 144 PCIe Gen5 lanes (9 groups x16 lanes), 4 groups x16 lanes may be configured in Fabric mode for inter-chip interconnection, and supports a switching bandwidth of up to 2.56 TB/s.
Based on the principle of topology consistency, the 8 switch chips 1131 are partitioned into 2 switch chip sets (4 chips per set) to form a 2×4 Fabric matrix topology. For the Fabric-mode port configuration, 4 groups of x16 lanes of each switch chip 1131 are set to Fabric mode, and interconnection between different chip sets realizes bandwidth aggregation through a bridge to ensure cross-chip communication.
The topology design of the switch chip sets is strongly related to the grouping of the heterogeneous accelerator boards 1121: each switch chip 1131 is connected to 4 heterogeneous accelerator boards 1121 through 4 groups of x16 lanes (Device mode), the 2 switch chip sets support 32 boards to be interconnected, and the intra-set topology delay (no more than 2 hops) ensures that the communication delay between boards in the same group meets the requirement of low-delay communication.
The link switching unit 113 adopts SW-BOX modular packaging, and each SW-BOX integrates 8 switching chips 1131 (2 switching chip sets) and supports 40 high-speed interfaces.
In one embodiment, each switch chip 1131 includes a plurality of ports;
the ports of each switch chip 1131 are configured as the same or different functional ports, wherein the functional ports include any one or more combinations of host ports, device ports, and interconnect ports.
Specifically, the Switch chip 1131 (PCIe Gen5 Switch) includes 144 PCIe Gen5 lanes, packaged as 9 groups of x16 lanes, which can be configured as functional ports: HOST ports connected to the CPU of the general purpose computing unit 111, Device ports connected to devices such as the heterogeneous accelerator board 1121, and interconnection ports (Fabric ports) used for high-speed interconnection between switch chips 1131 to construct the PCIe Fabric network.
Port functions are dynamically configured based on topology demand and traffic type. A typical configuration of a single switch chip 1131 is as follows: 1 group of x16 lanes is configured as the HOST port (connected to the CPU), 4 groups of x16 lanes are configured as Device ports (connected to GPUs), and 4 groups of x16 lanes are configured as Fabric ports (inter-chip interconnection).
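The typical single-chip configuration just listed (1 HOST, 4 Device and 4 Fabric port groups out of 9 groups of x16 lanes) can be expressed as a small port map; the dictionary layout and function name are assumptions for illustration:

```python
# Express the typical per-chip split of 9 x16 lane groups into 1 HOST, 4 Device
# and 4 Fabric port groups as a simple port map. Illustrative only.
def typical_port_config(lane_groups=9):
    config = {}
    roles = ["host"] + ["device"] * 4 + ["fabric"] * 4   # 1 + 4 + 4 = 9 groups
    for group, role in zip(range(lane_groups), roles):
        config[f"x16-group-{group}"] = role
    return config


ports = typical_port_config()
assert list(ports.values()).count("host") == 1
assert list(ports.values()).count("device") == 4
assert list(ports.values()).count("fabric") == 4
print(ports)
```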
In one embodiment, interconnections between all switch chips 1131 are established in the same switch chipset through interconnection ports;
the exchange chip 1131 is connected with the heterogeneous acceleration board 1121 in the acceleration calculation unit 112 through a device port;
The switch chip 1131 is connected to the processor in the general purpose computing unit 111 through a host port.
Specifically, the switch chips 1131 in the same switch chip set construct a 2×4 full interconnect matrix through interconnection ports (Fabric Ports), implemented as follows:
for the interconnection port configuration, each switch chip 1131 sets 4 groups of x16 PCIe Gen5 lanes as interconnection ports and realizes direct connection between chips through high-speed differential routing or a bridge, forming a 32-card 2×4 Fabric interconnection topology; the communication path between any two switch chips 1131 is at most 2 interconnection-port hops.
The switch chips 1131 are interconnected with the heterogeneous accelerator boards 1121 through Device ports (Device ports), and specifically configured to allocate ports, wherein each switch chip 1131 sets 4 groups x16 lanes as Device ports, each group is connected with PCIe gold fingers of 1 heterogeneous accelerator board 1121, a single switch chip supports 4 boards to be accessed, and a physical path is the heterogeneous accelerator boards 1121- & gt retimer module (enhanced signal) and the Device ports of the switch chips 1131.
The switch chips 1131 are connected to the CPU of the general purpose computing unit 111 through HOST ports (HOST ports), and specifically, each switch chip 1131 sets 1 group x16 lanes as a HOST Port, supports PCIe Root Complex functions, and a single SW-BOX (8 switch chips) provides 8 groups of HOST ports to connect 2 general purpose computing units 111 (two-way HOST), and the CPU of the general purpose computing unit 111→the HOST port→the switch chip 1131→the device port→the heterogeneous accelerator board 1121 forms a three-level interconnection path.
The host port supports DMA, allowing the CPU to directly access the HBM data of the heterogeneous accelerator board 1121 without staging it in system memory, reducing data transmission delay; configuration management of the switch chip 1131, such as dynamically adjusting the bandwidth allocation of the device ports and interconnection ports, is also realized through the host port.
In one embodiment, the general purpose computing unit 111 includes at least two network adapters, wherein the network adapters perform data transfer by way of direct memory access.
On the basis of the embodiment, the network adapter comprises a first network adapter and a second network adapter;
the first network adapter is connected with the service access unit 123 and is used for transmitting service data generated by the processor in a direct memory access mode;
The second network adapter is connected to the sample access unit 124 for transmitting the processor-generated sample data in a direct memory access manner.
Specifically, the general computing unit 111 integrates at least two network adapters, adopting 400G intelligent network adapters. The specific specifications are that a single adapter supports 400 Gbps line-rate forwarding, is compatible with 100G/200G rate adaptation, and offloads tasks such as hardware-accelerated TCP/IP protocol stack processing, SR-IOV virtualization and TLS encryption, reducing CPU load. The first network adapter is connected to the service access unit 123 and is dedicated to service data transmission, and the second network adapter is connected to the sample access unit 124 and is dedicated to sample data transmission. The network adapters realize DMA data transmission along the path processor → system memory → network adapter DMA engine → physical link → access unit; the DMA engine bypasses the CPU to access memory directly, reducing the data transmission bottleneck.
In a specific embodiment, the target controller collects the identity of the acceleration computing unit 112 in each resource group 1, gathers the global identity, and generates a routing mapping table;
and respectively sending the route mapping table to a plurality of resource group nodes, and triggering the driving layer to load the configuration of the route mapping table.
Specifically, when all the resource groups 1 (PODs) are powered up, management network initialization and node ID confirmation are performed, and the system enters the initialization state. By default, the Pearl node BMC of resource group 0 (POD 0) is used as the target controller to implement unified ID management for all nodes in the thousand-card cluster, which specifically includes:
Standardized coding of identities. Continuous IDs (such as Host0 and Host1) are allocated to the general purpose computing units 111 (Host), and unique IDs (such as GPU0 to GPU255) are allocated to the acceleration computing units 112 (GPU cards), ensuring that each data processing module 11 (including the general purpose computing unit 111, the acceleration computing unit 112 and the link switching unit 113) has a globally unique identity. The ID coding rule integrates physical location information (such as the serial number of the resource group 1 to which the component belongs and the chassis slot number) and logical topology information (such as the hierarchical coordinates in the CLOS topology), forming a four-dimensional identification system of resource group 1 - device type - physical location - logical coordinate.
Obtaining the network topology and the GPU topology. The target controller obtains the network topology of each resource group 1 (the connection relationships of the parameter plane/service plane/sample plane aggregation switches) and the GPU topology (the interconnection relationship of the acceleration calculation unit 112 within the link switching unit 113) through the management ToR (Top-of-Rack switch; in the implementation of the present disclosure, the management ToR is the management access unit).
Global identity acquisition and routing mapping table generation
Identity acquisition of the acceleration computing unit 112. The target controller collects the identity of the acceleration computing unit 112 through the out-of-band management network (I2C link, MCTP over SMBus protocol), specifically comprising a physical identifier and a logical identifier: the physical identifier is obtained by reading, over the I2C link, the PCIe address (Bus/Device/Function) of the GPU card, its WWN number (worldwide unique identifier) and its slot code in the chassis (such as front window 1-10 and rear window 1-6); the logical identifier is obtained by determining, based on the port mapping relation of the link switching unit 113 (PEARL PCIE switch), the accelerator board group number to which the acceleration computing unit 112 belongs and its hierarchical position in the Fabric topology (such as the Leaf switch port number), matching the hardware architecture of the 'Pearl Inbry BMC intelligent management system integrated super-expansion management software' in the document.
Hierarchical generation logic of the routing mapping table. The target controller generates the global routing mapping table in the following steps: collecting the local GPU address mapping reported by each general purpose computing unit 111 through the H2B (Host to BMC) interface, as in the 'asset information reporting → GPU address acquisition' flow in the figure; global topology modeling, namely using the mCPU to aggregate the GPU topology information of all resource groups 1 over the I2C link and constructing the global topology according to the three-level structure 'resource group 1 → link switching unit 113 → acceleration computing unit 112'; and generating a routing mapping table containing the source device ID, the target device ID, the forwarding path (such as through Pearl switch ports), the bandwidth attribute (512 GB/s aggregate bandwidth) and the delay attribute (less than or equal to 2 μs).
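The sketch below illustrates, in Python, a routing-mapping-table entry with the fields listed above and a naive generation loop over a modeled three-level topology; the data structures and the simple path rule are simplifying assumptions made only for illustration, not the disclosed algorithm.

```python
# Hypothetical sketch: build a global routing mapping table from a modeled
# three-level topology (resource group -> link switching unit -> GPU).
# Entries carry source ID, target ID, forwarding path, bandwidth and delay.
from dataclasses import dataclass
from itertools import permutations

@dataclass
class RouteEntry:
    src_id: str
    dst_id: str
    path: list            # forwarding path, e.g. via Pearl switch ports
    bandwidth_gbps: int    # bandwidth attribute of the path
    max_delay_us: float    # delay attribute of the path

# Modeled topology: {pod: {switch: [gpu ids]}} -- illustrative only.
topology = {
    0: {"pearl0": ["GPU0", "GPU1"]},
    1: {"pearl1": ["GPU128", "GPU129"]},
}

def gpu_locations(topo: dict) -> dict:
    """Map each GPU ID to its (pod, switch) location."""
    return {gpu: (pod, sw) for pod, sws in topo.items()
            for sw, gpus in sws.items() for gpu in gpus}

def build_routing_table(topo: dict) -> list:
    loc = gpu_locations(topo)
    table = []
    for src, dst in permutations(loc, 2):
        (sp, ssw), (dp, dsw) = loc[src], loc[dst]
        # Same switch: one hop; otherwise go via the parameter aggregation layer.
        path = [ssw] if ssw == dsw else [ssw, "param-spine", dsw]
        table.append(RouteEntry(src, dst, path,
                                bandwidth_gbps=400, max_delay_us=2.0))
    return table

if __name__ == "__main__":
    for entry in build_routing_table(topology)[:4]:
        print(entry)
```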
Distribution mechanism and driver layer configuration of the routing mapping table
Cross-resource-group distribution path of the mapping table. The target controller distributes the routing mapping table to each resource group node through the management ToR; the specific transmission path is target controller (POD 0 Pearl BMC) → management ToR → management network port of each resource group → corresponding Pearl BMC → general purpose computing unit 111 BMC and acceleration computing unit 112. The distribution process adopts the IPMI over LAN protocol to ensure transmission reliability at thousand-card scale.
Hardware configuration at the driver layer. After the driver layer of the acceleration computing unit 112 loads the routing mapping table, the following configuration is performed: updating hardware forwarding rules, namely configuring the flow table of the network module 11213 through a TCAM (ternary content addressable memory), for example forwarding a data packet whose target ID is GPU128 from the QSFP-DD1 port to the corresponding Pearl switch port, matching the SW-configuration update function of the SW management software described herein; and deploying a bandwidth priority policy, namely setting the data transmission priority through the PFC (priority flow control) mechanism of the RoCE v2 protocol according to the bandwidth attribute in the mapping table, so as to ensure non-blocking transmission on high-priority paths.
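A minimal Python sketch of this driver-layer step is shown below; the FlowRule structure, the bandwidth-to-priority mapping, and the program_tcam stub are hypothetical stand-ins for vendor-specific flow-table and PFC programming interfaces, which are not specified in this disclosure.

```python
# Hypothetical sketch: translate routing-table entries into flow rules and a
# PFC priority per rule. The actual TCAM/PFC programming calls are vendor
# specific and are represented here only by a print-based stub.
from dataclasses import dataclass

@dataclass
class FlowRule:
    match_dst_id: str   # e.g. "GPU128"
    in_port: str        # e.g. "QSFP-DD1"
    out_port: str       # e.g. the Pearl switch port on the chosen path
    pfc_priority: int   # RoCE v2 / PFC priority class (0-7)

def priority_from_bandwidth(bandwidth_gbps: int) -> int:
    """Map the bandwidth attribute of a route to a PFC priority (illustrative)."""
    return 5 if bandwidth_gbps >= 400 else 3

def rules_from_routes(routes: list) -> list:
    return [FlowRule(match_dst_id=r["dst_id"], in_port=r["in_port"],
                     out_port=r["path"][-1],
                     pfc_priority=priority_from_bandwidth(r["bandwidth_gbps"]))
            for r in routes]

def program_tcam(rules: list) -> None:
    """Stub for the hardware flow-table update performed by the driver layer."""
    for rule in rules:
        print(f"install: dst={rule.match_dst_id} {rule.in_port}->{rule.out_port} "
              f"prio={rule.pfc_priority}")

if __name__ == "__main__":
    routes = [{"dst_id": "GPU128", "in_port": "QSFP-DD1",
               "path": ["pearl0", "param-spine", "pearl1"], "bandwidth_gbps": 400}]
    program_tcam(rules_from_routes(routes))
```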
Dynamic update mechanism upon topology change
When the cluster topology or configuration changes (such as hot plug of an acceleration computing unit 112 or port adjustment of a link switching unit 113), the supernode management software maintains route validity as follows: the target controller monitors the state of each node in real time through the BMC management module and triggers the topology update process when it detects a change in the device asset list (such as a change in the slot status reported by the chassis BMC).
The target controller re-collects the identity of the changed node and generates an incremental routing mapping table that updates only the affected forwarding rules, avoiding the resource consumption of a full update. It then sends an incremental update packet to the related resource group nodes through the management ToR, and the driver layer of each node rapidly updates the hardware flow table based on a differential algorithm, ensuring communication continuity during topology changes.
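The following Python sketch shows one simple way such a differential update could be computed: diff the old and new routing tables keyed by (source, destination) pair and emit only added, removed, or modified entries. The diff format is an assumption for illustration; the disclosure does not define the concrete differential algorithm.

```python
# Hypothetical sketch: compute an incremental update between two routing
# mapping tables so only affected forwarding rules need reprogramming.
def table_index(table: list) -> dict:
    """Index routing entries by (source ID, destination ID)."""
    return {(e["src_id"], e["dst_id"]): e for e in table}

def diff_tables(old: list, new: list) -> dict:
    """Return entries to add, remove, and modify when moving from old to new."""
    old_idx, new_idx = table_index(old), table_index(new)
    added = [new_idx[k] for k in new_idx.keys() - old_idx.keys()]
    removed = [old_idx[k] for k in old_idx.keys() - new_idx.keys()]
    modified = [new_idx[k] for k in new_idx.keys() & old_idx.keys()
                if new_idx[k] != old_idx[k]]
    return {"add": added, "remove": removed, "modify": modified}

if __name__ == "__main__":
    old = [{"src_id": "GPU0", "dst_id": "GPU128", "path": ["pearl0", "spine", "pearl1"]}]
    new = [{"src_id": "GPU0", "dst_id": "GPU128", "path": ["pearl0", "spine", "pearl2"]},
           {"src_id": "GPU0", "dst_id": "GPU200", "path": ["pearl0", "spine", "pearl3"]}]
    delta = diff_tables(old, new)
    print({k: len(v) for k, v in delta.items()})  # {'add': 1, 'remove': 0, 'modify': 1}
```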
In summary, the PCIe Fabric decoupling architecture between the general purpose computing unit 111 and the acceleration computing unit 112 breaks the tight-coupling limitation of the conventional single-machine multi-card arrangement, so that heterogeneous computing resources can be shared across nodes, the space utilization of computing resources is improved, signal loss is reduced, and power utilization is optimized at the same time.
By adopting a two-layer CLOS (Leaf-Spine) networking architecture based on 51.2T switches and 800G optical modules, a single cluster supports non-blocking interconnection of 1024 GPU cards; when expanding to ten thousand cards, a three-level topology can be superimposed to meet the computing power requirements of AIGC core users for thousand-card/ten-thousand-card clusters, at a lower hardware cost than the traditional scheme.
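As a rough plausibility check of these figures, the short Python sketch below works through the port arithmetic of a non-blocking two-layer Leaf-Spine fabric built from 51.2 Tbps switches with 800G ports; the half-down/half-up split of leaf ports is an assumption made for the illustration.

```python
# Hypothetical back-of-the-envelope check: radix of a 51.2T switch with 800G
# ports and the endpoint count of a non-blocking two-layer Leaf-Spine fabric.
SWITCH_CAPACITY_GBPS = 51_200
PORT_SPEED_GBPS = 800

radix = SWITCH_CAPACITY_GBPS // PORT_SPEED_GBPS      # 64 ports per switch
# A two-layer non-blocking Leaf-Spine fabric of k-port switches supports
# k*k/2 endpoint-facing ports (half of every leaf faces down, half faces up).
max_endpoints = radix * radix // 2                   # 2048 x 800G endpoint ports
print(f"radix={radix}, max endpoint ports={max_endpoints}")
# 1024 GPU cards therefore fit with room to spare, and larger (ten-thousand
# card) builds need the third topology level mentioned in the text.
assert max_endpoints >= 1024
```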
The four networks are physically isolated: the parameter plane focuses on high-bandwidth lossless transmission; the service/sample planes offload 30%-50% of the CPU load through the intelligent network cards, improving task scheduling efficiency; and the independent gigabit network of the management plane ensures real-time out-of-band management, so that device control is maintained even if the data plane fails.
Fig. 8 illustrates a method for synchronizing data of an interconnection system according to an embodiment of the present disclosure.
As shown in fig. 8, the method comprises the steps of:
In step 301, in response to the received training samples, the training samples are issued to a plurality of resource groups, respectively.
In the embodiment of the disclosure, efficient distribution of training samples is realized through the sample plane network of the thousand-card cluster; the specific technical path is as follows:
Sample data enters the cluster through the sample access unit 124 (1.6T switch). Based on the Leaf-Spine architecture of the CLOS topology, the sample data is issued from the storage system along the path storage system → sample aggregation unit 24 (25.6T switch) → sample access unit 124 → general purpose computing unit 111 → link switching unit 113 → acceleration computing unit 112. The sample access unit 124 supports a 2 x 400G uplink and forms 1:1 non-blocking transmission with the sample aggregation unit 24, ensuring that sample data is issued to the 16 resource groups 1 (PODs) in parallel at a bandwidth of 800 Gbps each.
During sample plane issuing, the second network adapter 11112 (400G intelligent network card) of the general purpose computing unit 111 bypasses the CPU and accesses memory directly through DMA, reducing transmission delay.
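The small Python sketch below checks the consistency of these sample-plane numbers (16 PODs, 2 x 400G uplinks each, one 25.6T aggregation switch); the assumption that storage-side and POD-side bandwidth are matched 1:1 follows the non-blocking claim in the text.

```python
# Hypothetical consistency check of the sample-plane figures quoted above:
# 16 PODs, each fed by a 2 x 400G uplink, against a 25.6T sample aggregation unit.
PODS = 16
UPLINKS_PER_POD = 2
LINK_GBPS = 400
AGG_CAPACITY_GBPS = 25_600

per_pod_gbps = UPLINKS_PER_POD * LINK_GBPS          # 800 Gbps toward each POD
downstream = PODS * per_pod_gbps                    # 12.8 Tbps toward the PODs
upstream = downstream                               # 1:1 non-blocking from storage
assert downstream + upstream <= AGG_CAPACITY_GBPS   # fits the 25.6T switch exactly
print(f"{per_pod_gbps} Gbps per POD, {downstream/1000:.1f} Tbps aggregate")
```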
Step 302, in response to a training completion instruction of the training sample, training data generated in the training process is transmitted to the respective corresponding function convergence units through the access units with different functions.
In the embodiment of the disclosure, after training is completed, data is transmitted through the corresponding access unit according to its function type. Parameter data is sent by the network module 11213 (400G NIC) of the acceleration computing unit 112 through the parameter access unit 122 (51.2T switch) to the parameter convergence unit 22, along the path acceleration computing unit 112 → parameter access unit 122 → parameter convergence unit 22; with a single-path bandwidth of 400 Gbps and 1024 cards transmitting concurrently, the total bandwidth reaches 409.6 Tbps, and non-blocking convergence is realized through 8 parameter convergence units 22 of 51.2T each. Service data is transmitted by the first network adapter 11111 of the general purpose computing unit 111 via the service access unit 123 to the service convergence unit 23.
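The arithmetic behind these figures can be checked with the short sketch below; the assumption that the 1024 cards are spread evenly over 16 resource groups follows the topology described earlier and is used here only for illustration.

```python
# Hypothetical check of the parameter-plane aggregation figures quoted above:
# 1024 GPU cards with 400G NICs converging through 51.2T switches.
GPUS = 1024
NIC_GBPS = 400
SWITCH_CAPACITY_GBPS = 51_200
PODS = 16  # assumed even spread of cards over resource groups

total_gbps = GPUS * NIC_GBPS                      # 409,600 Gbps = 409.6 Tbps
spines = -(-total_gbps // SWITCH_CAPACITY_GBPS)   # ceiling division -> 8 units

gpus_per_pod = GPUS // PODS                       # 64 cards per resource group
leaf_down_gbps = gpus_per_pod * NIC_GBPS          # 25.6 Tbps per parameter access unit
assert 2 * leaf_down_gbps <= SWITCH_CAPACITY_GBPS # room for an equal non-blocking uplink

print(f"total {total_gbps/1000:.1f} Tbps, {spines} x 51.2T convergence units, "
      f"{leaf_down_gbps/1000:.1f} Tbps down per access unit")
```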
Step 303, based on the function convergence unit and the preset routing relationship, synchronizing the training data to a plurality of data processing modules in other resource groups for data interaction.
In the embodiment of the disclosure, cross-resource-group data synchronization is realized based on the function convergence unit and the routing mapping table; the specific flow is as follows:
Route mapping and data forwarding. The global routing mapping table generated by the target controller (such as the BMC) records the interconnection paths between the data processing modules of each resource group 1, for example resource group A acceleration card → parameter aggregation unit 22 → resource group B acceleration card (hop count less than or equal to 2). Data synchronization is realized through the Fullmesh full-interconnection topology of the parameter plane network, and data interaction between any two resource groups takes at most two hops.
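A minimal Python sketch of such a lookup is given below: given source and destination cards in different resource groups, return the path through a parameter aggregation unit and confirm the two-hop bound. The table layout mirrors the routing-mapping-table sketch earlier in this description and is an illustrative assumption, not the disclosed data format.

```python
# Hypothetical sketch: look up a cross-resource-group path in the routing
# mapping table and verify the "at most two hops" property of the Fullmesh
# parameter-plane topology.
def lookup_path(routing_table: dict, src: str, dst: str) -> list:
    """Return the intermediate nodes recorded for (src, dst); raises if absent."""
    return routing_table[(src, dst)]

def hop_count(path: list) -> int:
    """Link hops between the two endpoints = number of intermediate nodes + 1."""
    return len(path) + 1

if __name__ == "__main__":
    # Illustrative table: a card in resource group A reaches a card in resource
    # group B through one parameter aggregation unit, i.e. two link hops.
    routing_table = {
        ("PODA:GPU3", "PODB:GPU17"): ["param-agg-22"],
        ("PODA:GPU3", "PODC:GPU40"): ["param-agg-23"],
    }
    for key in routing_table:
        path = lookup_path(routing_table, *key)
        assert hop_count(path) <= 2
        print(key, "via", path, f"({hop_count(path)} hops)")
```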
It should be noted that the embodiments of the present disclosure may include a plurality of steps, and the steps are numbered for convenience of description; the numbering does not limit the timing between the steps or their execution order, the steps may be performed in any order, and the embodiments of the present disclosure impose no limitation in this respect.
The foregoing explanation of the method embodiment also applies to the apparatus of this embodiment, since the principle is the same; this embodiment is not limited thereto.
An embodiment of the application also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments of interconnection system data synchronization described above.
Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is configured to perform, when run, the steps of any of the method embodiments of interconnection system data synchronization described above.
In an exemplary embodiment, the computer readable storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium capable of storing a computer program.
Embodiments of the present application also provide a computer program product comprising a computer program which, when executed by a processor, performs the steps of any of the method embodiments of interconnection system data synchronization described above.
Embodiments of the present application also provide another computer program product comprising a non-volatile computer readable storage medium storing a computer program which, when executed by a processor, performs the steps of any of the method embodiments of interconnection system data synchronization described above.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The interconnection system, the method for synchronizing the data of the interconnection system and the electronic equipment provided by the application are described in detail. The principles and embodiments of the present application have been described herein with reference to specific examples, the description of which is intended only to facilitate an understanding of the method of the present application and its core ideas. It should be noted that it will be apparent to those skilled in the art that various modifications and adaptations of the application can be made without departing from the principles of the application and these modifications and adaptations are intended to be within the scope of the application as defined in the following claims.

Claims (20)

1. An interconnection system, characterized in that the interconnection system comprises: a plurality of resource groups and a function convergence unit, wherein the function convergence unit establishes interconnections between the plurality of resource groups through access units with different functions corresponding to the resource groups; each resource group includes multiple data processing modules and multiple access units with different functions, and the multiple data processing modules are interconnected and perform data exchange based on the interconnection relationship; the plurality of access units with different functions are interconnected with the target data processing module respectively according to a preset topological relationship; and in response to receiving data sent by multiple access units with different functions, the function convergence unit manages the target data processing modules respectively based on the data.
2. The interconnection system according to claim 1, wherein the plurality of data processing modules include a general computing unit, an accelerated computing unit, and a link switching unit; the general computing unit exchanges data with the accelerated computing unit via the link switching unit.
3. The interconnection system according to claim 2, wherein the plurality of access units with different functions include: a management access unit, a parameter access unit, a service access unit, and a sample access unit; the parameter access unit is connected to the accelerated computing unit and transmits the parameter data obtained from the accelerated computing unit to the parameter aggregation unit, wherein the function aggregation unit includes: the parameter aggregation unit, the management aggregation unit, the service aggregation unit, and the sample aggregation unit; the service access unit is connected to the general computing unit and transmits the service data obtained from the general computing unit to the service convergence unit; the sample access unit is connected to the general computing unit and transmits the sample data obtained from the general computing unit to the sample aggregation unit; and the management access unit is connected to the general computing unit, the accelerated computing unit, and the link switching unit respectively, and is used to manage the general computing unit, the accelerated computing unit, and the link switching unit respectively according to the received management instructions.
4. The interconnection system according to claim 3, wherein the management access unit is connected to the corresponding controller management network ports of the general computing unit, the accelerated computing unit, and the link switching unit respectively; the management access unit is connected to the corresponding switch management network ports of the link switching unit, the management aggregation unit, and the parameter aggregation unit respectively; the management access unit is connected to the switch management network port of the sample aggregation unit; and the management access unit is connected to the switch management network port of the service convergence unit.
5. The interconnection system according to claim 4, wherein a target controller is determined from controllers in the plurality of resource groups, and the target controller is interconnected with the management convergence unit.
6. The interconnection system according to claim 5, wherein the target controller constructs a topological relationship between the processing components in all resource groups using identification information of the processing components in the plurality of resource groups; and the multiple resource groups are managed based on the topological relationship.
7. The interconnection system according to claim 3, characterized in that the uplink and downlink bandwidths of the plurality of access units with different functions and the number of the resource groups are obtained respectively; and the target transmission bandwidth of the function convergence unit is configured according to the number of the resource groups and the uplink and downlink bandwidths of the plurality of access units with different functions.
8. The interconnection system according to claim 7, characterized in that, according to the number of the resource groups and the uplink and downlink bandwidths of the multiple access units with different functions, the parameter access unit, the service access unit, and the sample access unit are configured to a first transmission bandwidth; and a preset second transmission bandwidth is configured for the management access unit, wherein the first transmission bandwidth is greater than the second transmission bandwidth.
9. The interconnection system according to claim 2, wherein the accelerated computing unit comprises a plurality of heterogeneous acceleration boards; the plurality of heterogeneous acceleration boards are divided into a plurality of acceleration board groups according to a preset division condition, wherein the number of the plurality of acceleration board groups is less than the number of the plurality of heterogeneous acceleration boards; and all heterogeneous acceleration boards in the same acceleration board group are connected to the same switching chip in the link switching unit.
10. The interconnection system according to claim 9, wherein heterogeneous acceleration boards in different acceleration card groups exchange data via a bridge.
11. The interconnection system according to claim 9, wherein the heterogeneous acceleration board comprises: a heterogeneous processor, a data exchange chip, and a network module; the heterogeneous processor is connected to a first end of the data exchange chip, and a second end of the data exchange chip is connected to the network module to form a parameter transmission link; and the heterogeneous processor, the data exchange chip, and the network module are integrated into the same circuit board.
12. The interconnection system according to claim 11, wherein the parameter data generated by the heterogeneous processor is sent to the parameter access unit via the parameter transmission link.
13. The interconnection system according to claim 2, wherein the link switching unit comprises a plurality of switching chips, and the plurality of switching chips are divided into a plurality of switching chip groups; and in each switching chip group, all switching chips in the group are interconnected according to a preset topology.
14. The interconnection system according to claim 13, wherein each switching chip comprises a plurality of ports; and the multiple ports of each switching chip are configured as the same or different functional ports, wherein the functional ports include any one or more combinations of host ports, device ports and interconnection ports.
15. The interconnection system according to claim 14, wherein the interconnection between all switching chips in the same switching chip group is established through the interconnection port; the switching chip is connected to the heterogeneous acceleration board in the accelerated computing unit through the device port; and the switching chip is connected to the processor in the general computing unit through the host port.
16. The interconnection system according to claim 15, wherein the general computing unit comprises at least two network adapters, wherein the network adapters perform data transmission via direct memory access.
17. The interconnection system according to claim 16, wherein the network adapter comprises a first network adapter and a second network adapter; the first network adapter is connected to the service access unit and is used to transmit the service data generated by the processor in a direct memory access manner; and the second network adapter is connected to the sample access unit and is used to transmit the sample data generated by the processor in a direct memory access manner.
18. The interconnection system according to claim 5, wherein the target controller collects the identity identifiers of the accelerated computing units in each resource group, summarizes the global identity identifiers, and generates a routing mapping table; and the routing mapping table is sent to the multiple resource group nodes respectively, and the driver layer is triggered to load the configuration of the routing mapping table.
19. A method for data synchronization of an interconnected system, characterized in that the method is applied to the interconnected system according to any one of claims 1 to 18, and comprises: in response to the received training samples, sending the training samples to a plurality of resource groups respectively; in response to a training completion instruction of the training samples, transmitting the training data generated during the training process to the corresponding function convergence units through the access units with different functions; and based on the function convergence unit and the preset routing relationship, synchronizing the training data to multiple data processing modules in other resource groups for data interaction.
20. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method for synchronizing interconnected system data according to claim 19.
CN202510897385.4A 2025-06-30 2025-06-30 Interconnection system, interconnection system data synchronization method and electronic equipment Pending CN120763250A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510897385.4A CN120763250A (en) 2025-06-30 2025-06-30 Interconnection system, interconnection system data synchronization method and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510897385.4A CN120763250A (en) 2025-06-30 2025-06-30 Interconnection system, interconnection system data synchronization method and electronic equipment

Publications (1)

Publication Number Publication Date
CN120763250A true CN120763250A (en) 2025-10-10

Family

ID=97245195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510897385.4A Pending CN120763250A (en) 2025-06-30 2025-06-30 Interconnection system, interconnection system data synchronization method and electronic equipment

Country Status (1)

Country Link
CN (1) CN120763250A (en)

Similar Documents

Publication Publication Date Title
US9300574B2 (en) Link aggregation emulation for virtual NICs in a cluster server
US9264346B2 (en) Resilient duplicate link aggregation emulation
CN111104775B (en) Network-on-chip topological structure and implementation method thereof
TWI534629B (en) Data transmission method and data transmission system
EP3063903B1 (en) Method and system for load balancing at a data network
US8204054B2 (en) System having a plurality of nodes connected in multi-dimensional matrix, method of controlling system and apparatus
US10477288B2 (en) Data center interconnect as a switch
EP4529126A1 (en) Computing system and communication method
CN107959643B (en) Switching system constructed by switching chip and routing algorithm thereof
RU2513918C1 (en) Cluster router and cluster routing method
US20200314025A1 (en) Hyperscale switch and method for data packet network switching
CN112214445B (en) RapidIO switching network data rate reconfigurable hardware circuit
CN101848154B (en) System based on advanced telecom computation architecture
CN103136141A (en) High speed interconnection method among multi-controllers
CN103257946A (en) High-speed interconnecting method of controllers of tight-coupling multi-control storage system
CN112148663A (en) A data exchange chip and server
CN117914808A (en) A data transmission system, method and switch
CN102171976B (en) Data transmission method on switch device and apparatus thereof
CN105763488A (en) Data center gathering core switcher and backboard thereof
CN104486256A (en) Multi-plane exchange network equipment of converged infrastructure-oriented server
CN120763250A (en) Interconnection system, interconnection system data synchronization method and electronic equipment
CN102710496A (en) Data transmission system, data interface device and data transmission method for multiple servers
CN117221230A (en) Data transmission method and related equipment
CN119276814A (en) A switch, a switch cabinet and a data exchange method
US20050038949A1 (en) Apparatus for enabling distributed processing across a plurality of circuit cards

Legal Events

Date Code Title Description
PB01 Publication