CN119336451B

CN119336451B - A method, device and medium for collective communication control of distributed training

Info

Publication number: CN119336451B
Application number: CN202411863321.4A
Authority: CN
Inventors: 李勇; 何水兵; 秦亦; 薛辉; 赖彦名
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2024-12-17
Filing date: 2024-12-17
Publication date: 2025-06-13
Anticipated expiration: 2044-12-17
Also published as: CN119336451A

Abstract

The present application discloses a method, device and medium for collective communication control of distributed training, the method comprising: when the difference in the number of computing nodes between any two clusters in the clusters participating in data reduction is within a preset range, the data on all computing nodes in the first cluster are reduced to designated computing nodes, and the number of designated computing nodes is the same as the number of computing nodes in the second cluster with the smallest number of computing nodes. The designated computing nodes are controlled to perform data reduction with the computing nodes in the second cluster; the designated computing nodes are controlled to synchronize data with other nodes in the first cluster except the designated computing nodes. Thus, the other clusters except the cluster with the smallest number of computing nodes first perform a reduction within the cluster, and reduce the data to the designated computing nodes with the same number as the smallest number of nodes in each cluster, so as to ensure that the nodes of each cluster are the same when reducing across clusters, avoid some nodes from being reduced with multiple nodes at the same time, and reduce the collective communication overhead.

Description

Distributed training set communication control method, device and medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, and a medium for controlling aggregate communication in distributed training.

Background

With the rapid development of big data and Artificial Intelligence (AI) models, distributed training techniques are widely applied to AI models to jointly complete complex computing tasks of the AI models through cooperation of a plurality of computing nodes in a computing cluster, that is, to decompose tasks that are originally completed on a single computing node into complex computing tasks on an AI scale that are cooperatively completed on a plurality of computing nodes.

Aggregate communication refers to a particular form of communication between a set of processes or computing nodes. In the distributed training of the AI model, optimizing the aggregate communication performance is a main technical means for improving the distributed training efficiency, and improving the distributed training efficiency is important for ensuring that each computing node in the cluster fully plays the computing performance. At present, the aggregate communication is mostly constructed based on the isomorphic network, when the aggregate communication performance of the isomorphic network is optimized, data are uniformly divided among computing nodes, and the data are reduced to each computing node through aggregate communication semantics, so that the optimal communication bandwidth is realized, and the distributed training efficiency is improved.

However, in order to meet the explosive-type increased intelligent computing power demand of the AI model, the AI chips of various types are continuously and iteratively updated, so that the situation of mixed deployment of different AI chips occurs in the AI model, and the computing power of the different AI chips is different, so that the number of computing nodes participating in mixed training among different computing clusters is different. Therefore, when data reduction is performed across clusters, because the number of computing nodes is asymmetric among clusters participating in reduction, some computing nodes need to perform data reduction with a plurality of computing nodes at the same time, so that a network competition phenomenon occurs, the aggregate communication overhead is high, the efficiency of the cross-cluster reduction is low, and the efficiency of distributed training is reduced.

Therefore, how to solve the problems of high aggregate communication overhead, low cross-cluster reduction efficiency and improved distributed training efficiency is a problem to be solved by those skilled in the art.

Disclosure of Invention

In view of this, an aspect of the present application provides a method for controlling collective communication of distributed training, the method comprising:

When the difference of the number of the computing nodes between any two clusters is within a preset range, reducing the data distributed on all the computing nodes in a first cluster to the designated computing nodes in the first cluster, wherein the number of the designated computing nodes is the same as the number of the computing nodes in a second cluster, and the second cluster is the cluster with the minimum number of the computing nodes in the clusters participating in data reduction;

controlling the appointed computing node and the computing nodes in the second cluster to perform data reduction;

and controlling the designated computing node to perform data synchronization with other nodes except the designated computing node in the first cluster.

Optionally, the reducing the data distributed on all the computing nodes in the first cluster to the designated computing nodes in the first cluster includes:

dividing the computing nodes in the first cluster into m node subgroups, wherein each node subgroup comprises k computing nodes, k=floor (n/m), n is the number of computing nodes in the first cluster, and m is the number of computing nodes in the second cluster;

Selecting one computing node from each node subgroup as the designated computing node;

and reducing the data distributed on all the computing nodes in the first cluster to the appointed computing node.

Optionally, the selecting a computing node from each node group as the designated computing node includes:

Sequencing according to the device serial numbers of the computing nodes in the first cluster from small to large;

based on the sorting result, dividing the k computing nodes into the node subgroups in sequence according to the mode that each node subgroup divides one computing node at a time;

and taking the computing node with the smallest equipment serial number in each node group as the appointed computing node.

Optionally, the reducing the data distributed on all the computing nodes in the first cluster to the designated computing node includes:

determining whether a computing node in the first cluster has a target computing node which is not grouped;

if not, reducing the data in each node subgroup to the designated computing nodes in the corresponding group;

and if so, executing the step of reducing the data in each node group to the designated computing nodes in the corresponding group, and equally dividing the data on the target computing nodes into m data, and transmitting the m data to different designated computing nodes in a one-to-one correspondence manner.

Optionally, the distributed training set communication control method further includes:

when the difference is not in the preset range, sequencing according to the sequence from small to large of the number of the computing nodes;

merging the first cluster and the second cluster into a target cluster based on the sorting result;

determining whether the difference between any two clusters in the target cluster and the non-merged cluster is within the preset range;

and entering the step of reducing the data distributed on all the computing nodes in the first cluster to the appointed computing nodes in the first cluster, and executing the subsequent steps.

Optionally, if the difference between any two clusters in the target cluster and the unmixed cluster is not within the preset range, the method further includes:

Determining whether a difference value between the number of computing nodes in the target cluster and the number of computing nodes in a designated cluster exceeds a threshold value, wherein the designated cluster is the cluster with the minimum number of computing nodes in the uncombined cluster;

If the threshold value is exceeded, the step of taking the target cluster and the unmixed cluster as the clusters for participating in data reduction is entered, and the subsequent steps are executed;

If the number of the non-merged clusters is not more than 1, entering the step of sorting the computing nodes in the order from small to large, and executing the subsequent step, and if the number of the non-merged clusters is not more than 1, entering the step of taking the target clusters and the non-merged clusters as the clusters for participating in data reduction, and executing the subsequent step.

Optionally, the difference of the number of computing nodes between clusters participating in data reduction is within a preset range, including:

If the result of the specified calculation of the maximum value and the minimum value of the number of the calculation nodes in the cluster participating in the data reduction is not greater than a preset value, the difference is within the preset range, wherein the specified calculation comprises ratio calculation or difference calculation.

Another aspect of the present application provides a distributed training set communication control apparatus, the apparatus comprising:

The cluster internal reduction module is used for reducing the data distributed on all the computing nodes in a first cluster to the designated computing nodes in the first cluster when the difference of the number of the computing nodes between any two clusters is within a preset range, wherein the number of the designated computing nodes is the same as the number of the computing nodes in a second cluster, and the second cluster is the cluster with the minimum number of the computing nodes in the clusters participating in data reduction;

the cross-cluster reduction module is used for controlling the designated computing node to reduce data with the computing nodes in the second cluster;

And the data synchronization module is used for controlling the designated computing node to perform data synchronization with other nodes except the designated computing node in the first cluster.

Another aspect of the present application provides a distributed training aggregate communication control device, including a memory and a processor, where the memory stores a computer program executable on the processor, and the processor implements the steps of the distributed training aggregate communication control method when executing the program.

Another aspect of the present application provides a computer-readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of the distributed training aggregate communication control method.

The method, the device and the medium for controlling the set communication of the distributed training have the beneficial effects that when the data reduction of the clusters is carried out in the distributed training, if the number of the computing nodes among the clusters is different, the data reduction is carried out in the clusters for one time except for the clusters with the least number of the computing nodes, so that the data is reduced to the designated computing nodes with the same number as the least number of the computing nodes in each cluster, the alignment of the number of the computing nodes among different clusters, namely the same number of the computing nodes, is ensured, the phenomenon that network competition occurs due to the fact that some computing nodes need to carry out data reduction with a plurality of the computing nodes at the same time is avoided, the cost of set communication is reduced, and the efficiency of the reduction of the clusters is improved.

Drawings

Fig. 1 is a schematic flow chart of a distributed training set communication control method according to an embodiment of the present application;

Fig. 2 is a schematic diagram of a distributed training set communication control method according to an embodiment of the present application;

FIG. 3 is a flow chart of a method for controlling aggregate communication of distributed training according to another embodiment of the present application;

FIG. 4 is a schematic diagram of a distributed training set communication control method according to another embodiment of the present application;

FIG. 5 is a schematic diagram of a distributed training set communication control method according to another embodiment of the present application;

FIG. 6 is a flow chart of a method for controlling aggregate communication of distributed training according to still another embodiment of the present application;

fig. 7 is a schematic structural diagram of a distributed training set communication control device according to an embodiment of the present application;

fig. 8 is a schematic structural diagram of a distributed training set communication control device according to another embodiment of the present application.

Reference numeral 80 is a memory, 81 is a processor, 82 is a display screen, 83 is an input/output interface, 84 is a communication interface, 85 is a power supply, 86 is a communication bus, 801 is a computer program, 802 is an operating system, 803 is data.

Detailed Description

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the application. The term "if" as used herein may be interpreted as "at..once" or "when..once" or "in response to a determination", depending on the context.

Fig. 1 is a flow chart of a method for controlling aggregate communication of distributed training according to an embodiment of the present application, as shown in fig. 1, the method includes:

S10, when the difference of the number of computing nodes between any two clusters is within a preset range, reducing the data distributed on all the computing nodes in the first cluster to the designated computing nodes in the first cluster, wherein the number of the designated computing nodes is the same as the number of the computing nodes in the second cluster, and the second cluster is the cluster with the minimum number of the computing nodes in the clusters participating in data reduction;

It can be appreciated that in the distributed training of the hybrid deployment of different AI chips, the number of computing nodes in the cluster participating in the data reduction is different due to the different AI chip computing power. For example, in a deep learning model of a complex image recognition training task, two types of AI chips are required to be jointly completed, including a chip a and a chip B, wherein the intelligent computing power of the chip a is higher than that of the chip B, so that 4 chips a are deployed in the cluster a, and 10 chips B are deployed in the cluster B, so that the image recognition task of the deep learning model can be jointly completed.

In distributed training, communications within the same computing node may be communicated using a high-speed interconnect technology such as NVLink, between different computing nodes within the same cluster may be communicated using a high-speed interconnect technology such as GDR (GPU DIRECT RDMA), and only relatively low-speed RoCE (RDMA over Converged Ethernet) interconnect technology may be used between different clusters.

Therefore, in the mixed distributed training scene with heterogeneous communication links, when the aggregate communication optimization is performed to improve the distributed training efficiency, the number of the computing nodes among clusters participating in data reduction is different, so that some computing nodes need to simultaneously perform data reduction with a plurality of computing nodes when participating in data reduction, and the phenomenon of network competition occurs, thereby influencing the aggregate communication optimization effect.

Because, to address the above-described technical problem, in an alternative embodiment, the number of compute nodes between clusters participating in data reduction is aligned, i.e., the compute nodes between clusters participating in data reduction are the same.

Specifically, data distributed on all computing nodes in a first cluster is reduced to designated computing nodes in the first cluster, wherein the number of the designated computing nodes is the same as the number of the computing nodes in a second cluster, and the second cluster is the cluster with the smallest number of the computing nodes in the clusters participating in data reduction. That is, the number of compute nodes of all clusters that participate in the data reduction are aligned with the cluster that has the smallest number of compute nodes. For ease of understanding, the following will exemplify.

For example, the clusters participating in data reduction include a cluster a, a cluster B, and a cluster C, and the number of computing nodes of the cluster a is 15, the number of computing nodes of the cluster B is 20, and the number of computing nodes of the cluster C is 23. It is apparent that in this example, the first cluster is cluster B and cluster C, while the second cluster with the smallest number of computing nodes is cluster a, and the specified number of computing nodes should be aligned with the number of computing nodes of cluster a, i.e., the specified number of computing nodes is 15.

When the cluster A, the cluster B and the cluster C perform cross-cluster data reduction, in order to ensure that the number of computing nodes among three clusters is the same, the cluster B and the cluster C are controlled to be in the respective clusters, and data distributed on all the computing nodes are reduced to 15 appointed computing nodes.

S11, controlling the appointed computing node and the computing nodes in the second cluster to perform data reduction;

Further, cross-cluster reduction is performed after the number of compute nodes for each cluster is aligned. Specifically, the designated computing node is controlled to perform data reduction with the computing nodes in the second cluster. For example, in the above example, the control cluster B and the 15 designated computing nodes on the cluster C perform data reduction between clusters with 15 computing nodes in the cluster a. I.e. including reduction between cluster B and cluster a, cluster B and cluster C, cluster a and cluster C.

It should be noted that if the number of computing nodes between any two clusters is very different among the clusters participating in data reduction, the phenomenon of insufficient utilization of communication links and resource waste may be caused. For example, there are 5 computing nodes in cluster a and 50 computing nodes in cluster B, then when data is reduced across clusters, cluster B reduces the data to 5 designated computing nodes first, and then reduces the data with the 5 computing nodes in cluster a through the 5 computing nodes.

Obviously, this results in many communication links being idle in cross-cluster reduction and relatively inefficient due to the more data when 5 data is reduced. Thus, in an alternative embodiment, it is determined whether the difference in the number of compute nodes between any two clusters is within a predetermined range in the clusters involved in data reduction prior to reduction of data distributed over all compute nodes in the first cluster to a given compute node in the first cluster.

That is, before the clusters with more computing nodes reduce data internally, it is determined whether the number of computing nodes among all the clusters participating in data reduction is relatively uniform, i.e. the difference between the number of computing nodes is not too large.

It should be noted that when determining whether the difference of the number of the computing nodes between any two clusters is within the preset range, the method can determine according to the difference of the computing nodes between any two clusters, can determine according to the ratio of the computing nodes between any two clusters, and can determine whether the number of the computing nodes between any two clusters is relatively uniform in other manners, which is not limited by the method.

Fig. 2 is a schematic diagram of a distributed training set communication control method according to an embodiment of the present application, in an alternative embodiment, if the number of computing nodes among clusters participating in data reduction is the same, as shown in fig. 2, the number of computing nodes among cluster 1 to cluster N is 4, and when cross-cluster data reduction is performed, data are uniformly split among computing nodes inside each cluster, and each computing node is only responsible for a small portion of data reduction corresponding to itself. Further, data reduction, i.e., data exchange, is then completed between the compute nodes within the cluster by way of the collective communication semantics of the Reduce script between the compute nodes. Then, cross-cluster reduction is carried out to complete data reduction among clusters.

And S12, controlling the designated computing node to perform data synchronization with other nodes except the designated computing node in the first cluster.

Further, after the cross-cluster data reduction is completed, the designated computing node is controlled to perform data synchronization with other nodes in the first cluster except the designated computing node through step S12. For example, in the above example, after the 15 designated computing nodes in the 20 computing nodes in the cluster B complete data reduction with the clusters a and C, the 15 designated computing nodes need to synchronize the data to the other 5 computing nodes that do not perform cross-cluster data reduction. Similarly, the cluster C needs to perform the same data synchronization operation. And in the cluster A, as all the computing nodes participate in cross-cluster data reduction, data synchronization is not needed.

Therefore, in the distributed training, if the number of computing nodes among clusters is different and the number of computing nodes among clusters is the same, the data reduction is firstly carried out once in the clusters except for the clusters with the minimum number of computing nodes, so that the data is reduced to the designated computing nodes with the same number as the minimum number of computing nodes in each cluster, and the alignment of the number of computing nodes among different clusters, namely the same number of computing nodes, is ensured when the data reduction among the clusters is carried out, the phenomenon that some computing nodes need to simultaneously carry out data reduction with a plurality of computing nodes is avoided, the network competition phenomenon is caused, the cost of the aggregate communication is reduced, and the efficiency of the data reduction among the clusters is improved.

Fig. 3 is a flow chart of a method for controlling aggregate communication in distributed training according to another embodiment of the present application, in an alternative embodiment, as shown in fig. 3, the method reduces data distributed on all computing nodes in a first cluster to a designated computing node in the first cluster, including:

S30, dividing the computing nodes in the first cluster into m node subgroups, wherein each node subgroup comprises k computing nodes, k=floor (n/m), n is the number of computing nodes in the first cluster, and m is the number of computing nodes in the second cluster;

In an alternative embodiment, the selection may be performed according to a certain rule when selecting the designated computing node in the first cluster. Specifically, the computing nodes in the first cluster are firstly divided according to the number of the computing nodes in the second cluster. Therefore, the number of computing nodes in the first cluster and the second cluster is counted first, the number of computing nodes in the first cluster is denoted as n, the number of computing nodes in the second cluster is denoted as m, and obviously n > m.

Further, the n computing nodes in the first cluster are divided into m node subgroups, each node subgroup includes k computing nodes, so as to ensure that the number of computing nodes in each node subgroup is the same, and ensure that the computing nodes in each node subgroup are integer numbers. Thus, k=floor (n/m), i.e. the number of compute nodes in each node group is equal to the number of compute nodes in the first cluster divided by the number of compute nodes in the second cluster, rounded down. For example, when n=10, m=3, k=floor (n/m) =3.

S31, selecting one computing node from each node group as a designated computing node;

s32, reducing the data distributed on all the computing nodes in the first cluster to the designated computing nodes.

Further, one computing node is selected from each node group as a designated computing node. The selection may be random selection or may be performed according to a device serial number of the computing node, which is not limited in the present application.

It may be appreciated that the computing nodes in the first cluster are divided into m node subgroups, and a designated computing node is selected from each node subgroup, so that m designated computing nodes may be obtained. And further may reduce the data distributed on all the compute nodes in the first cluster to the designated compute node selected in this embodiment.

Therefore, the set communication control method for distributed training provided by the embodiment of the application aligns the number of computing nodes among clusters participating in data reduction, namely, aligns the clusters with the minimum number of computing nodes. When data reduction is carried out across clusters, the problem that part of computing nodes need to be reduced with a plurality of computing nodes at the same time is avoided, set communication is optimized, and distributed training efficiency is improved.

As an alternative embodiment, selecting a computing node from each node group as a designated computing node includes:

Based on the sequencing result, dividing k computing nodes into each node group in sequence according to the mode that each node group divides one computing node at a time;

and taking the computing node with the smallest equipment serial number in each node group as a designated computing node.

It will be appreciated that different computing nodes have different device numbers, and that computing nodes between the same device number may communicate using one switch, while computing nodes between different device numbers may require multiple switches to communicate.

Therefore, in an alternative embodiment, in order to save resources, when one computing node is selected from each node subgroup as a designated computing node, the computing node with the smallest device serial number in each node subgroup is used as the designated computing node.

Specifically, the method includes the steps of firstly sorting the computing nodes according to the equipment serial numbers of the computing nodes in the first cluster from small to large, and dividing k computing nodes into the node groups sequentially according to the mode that each node group divides one computing node at a time after sorting. That is, after sorting from small to large, one computing node is placed into m node subgroups at a time, and the node subgroup in which the currently placed computing node is located is different from the node subgroup of the computing node placed last time.

It can be appreciated that, since the number of computing nodes in the second cluster is the smallest, the maximum value of the device serial numbers of the computing nodes is smaller than the maximum value of the device serial numbers in the first cluster. Therefore, in order to align with the device serial numbers in the second cluster, the device resources of the switch are saved, and the computing node with the smallest device serial number in each node group is used as the designated computing node. Fig. 4 is a schematic diagram of a distributed training set communication control method according to another embodiment of the present application, and for convenience of understanding, the following will exemplify the method.

For example, as shown in fig. 4, the clusters participating in data reduction include a cluster a and a cluster B, and the number of computing nodes of the cluster a is 7, and the number of computing nodes of the cluster B is 3. It is apparent that in this example, the first cluster is cluster a, while the second cluster with the smallest number of computing nodes is cluster B, and the number of computing nodes is designated as 3. The number of devices of each computing node in cluster B is 0 to 2, the number of devices of each computing node in cluster a is 0 to 6, the number of designated computing nodes is 3, i.e., the number of node subgroups m=3, and the number of computing nodes in each node subgroup k=2.

In determining the designated compute nodes in cluster a, as shown in fig. 4, the 7 compute nodes in cluster a are divided into 3 node subgroups including node subgroup A1, node subgroup A2, and node subgroup A3. And sequencing according to the device serial numbers of the 7 computing nodes from small to large.

Further, the 7 computing nodes are sequentially divided into 3 node groups according to the sorting result, one computing node is divided at a time, and the node groups divided each time are different, namely, the node groups where the computing nodes divided in the front and back two times are located are different. Thus, as shown in fig. 4, the computing nodes of the device number 0 and the device number 3 are in the node group A1, the computing nodes of the device number 1 and the device number 4 are in the node group A2, and the computing nodes of the device number 2 and the device number 5 are in the node group A5.

In order to ensure that the device serial numbers of the designated computing nodes selected from the 3 node groups are aligned with the device serial numbers in the cluster B, the computing node with the smallest device serial number in each node group is used as the designated computing node, that is, the computing node with the device serial number 0 in the node group A1, the computing node with the device serial number 1 in the node group A2, and the computing node with the device serial number 2 in the node group A3 are used as the designated computing nodes.

Further, in an alternative embodiment, when the designated computing node is controlled to perform data reduction with the computing node in the second cluster, the corresponding data reduction is performed on the computing node with the same device serial number in the designated computing node as the computing node device serial number in the second cluster. For example, in the above example, the computing node with device number 0 in node group A1 is reduced to the computing node with device number 0 in cluster B, and the other computing nodes are similar.

Therefore, according to the set communication control method for distributed training provided by the embodiment of the application, the computing node with the smallest equipment serial number in the node group is selected as the designated computing node so as to be aligned with the equipment serial number of the computing node in the second cluster, set communication is optimized to improve distributed efficiency, and equipment resources are saved.

In an alternative embodiment, reducing data distributed across all computing nodes in a first cluster to a designated computing node includes:

determining whether a computing node within the first cluster has a target computing node that is not grouped;

If not, the data in each node group is reduced to the appointed computing nodes in the corresponding group;

If so, the step of reducing the data in each node group to the designated computing nodes in the corresponding group is executed, the data on the target computing node is divided into m data averagely, and the m data are sent to different designated computing nodes in a one-to-one correspondence mode.

It will be appreciated that when grouping computing nodes within a first cluster, there may be an inability to divide all nodes into node groups, where the number of target computing nodes that are not grouped, t=n-k×m. For example, as shown in the example of fig. 4, after 7 clusters in cluster a are divided into 3 node subgroups, 1 computing node cannot perform node subgroup division.

At this time, in an alternative embodiment, when the data distributed on all the computing nodes in the first cluster is reduced to the designated computing node, it is required to determine whether there is a target computing node that is not grouped in the first cluster.

If so, to ensure that all data in the first cluster participates in the reduction while the cross-cluster reduction occurs, the data on the target computing nodes that do not participate in the grouping is divided into m-point data, i.e., the number of divided pieces is the same as the number of designated computing nodes. Further, the m-point data are sent to different designated computing nodes in a one-to-one correspondence mode. For ease of understanding, the following description will be given with reference to fig. 4.

As shown in fig. 4, the target computing node participating in the packet is a node with a device number of 6, and the shaded portion in the different computing node is used to characterize the data in the current computing node, for example, the data on the target computing node with the device number of 6 is divided into 3 points, including data P61, data P62, and data P63 on average. Further, data on the target computing node is sent (Pend) to different designated computing nodes in a one-to-one correspondence. That is, data P61 is transmitted to the computing node of device number 0, data P62 is transmitted to the computing node of device number 1, and data P63 is transmitted to the computing node of device number 2.

Of course, it should be noted that, the corresponding relation between the m pieces of data divided by the target computing node and the designated computing node is not limited by the present application, and it is only required to ensure that different data are sent to different designated computing nodes.

In a specific embodiment, while the target computing node sends data to the designated computing node, the data in each node group is reduced to the designated computing node in the corresponding group by means of aggregate communication semantics All Reduce.

For example, as shown in fig. 4, the data P3 on the computing node of the device number 3 is reduced to the designated computing node of the device number 0 in the corresponding group by means of the aggregate communication semantic All Reduce, the data P3 on the computing node of the device number 4 is reduced to the designated computing node of the device number 1 in the corresponding group, and the data P3 on the computing node of the device number 5 is reduced to the designated computing node of the device number 2 in the corresponding group.

It should be noted that, the steps of reducing the data in each node group to the designated computing node in the corresponding group and the steps of dividing the data on the target computing node into m pieces of data and transmitting the m pieces of data to different designated computing nodes in a one-to-one correspondence manner may be performed simultaneously or in a sequential order, which is not limited to the present application.

In another alternative embodiment, if there is no target computing node that is not grouped in the computing nodes in the first cluster, the number of computing nodes representing the current first cluster may be uniformly divided into m node subgroups, where data in each node subgroup may be directly reduced to a designated computing node in the corresponding group.

Fig. 5 is a schematic diagram of a distributed training set communication control method according to still another embodiment of the present application, further in an alternative embodiment, after data reduction is completed inside a cluster, that is, after data reduction is performed on designated computing nodes in a node group, and data is uniformly sent to each designated computing node by a target computing node that does not perform data reduction, as shown in fig. 4, data reduction is performed between designated computing nodes of each cluster by using a set communication semantic manner of ReduceScatter.

Further, as shown in fig. 5, after cross-cluster data reduction, data reduction is performed between designated computing nodes corresponding to different node groups in the cluster a through a ALL GATHER aggregate communication semantic manner. Further, as shown in fig. 5, in the same node group, the designated computing node broadcasts data to other computing nodes in the node group by means of the aggregate communication semantics of Broadcast.

In another alternative embodiment, when there is a target computing node that is not grouped in the computing node in the first cluster, as shown in fig. 5, after the designated computing node completes cross-cluster reduction, the designated computing node needs to perform data reduction with other designated computing nodes in the node, and needs to Broadcast data to other computing nodes in the same node group, and also needs to Broadcast the data to the target computing node that is not grouped by means of a collective communication semantic manner, so as to complete cross-cluster data reduction.

Fig. 6 is a flow chart of a distributed training set communication control method according to another embodiment of the present application, as an alternative embodiment, as shown in fig. 6, where when the difference is not within a preset range, the distributed training set communication control method further includes:

S60, sorting according to the order of the number of the computing nodes from small to large;

s61, merging the first cluster and the second cluster into a target cluster based on the sequencing result;

In a specific embodiment, if the number of computing nodes between any two clusters in the clusters participating in data reduction is greatly different, the phenomenon of insufficient utilization of communication links and resource waste can be caused. Therefore, in order to solve the technical problem, in an alternative embodiment, when the gap is not within the preset range, the first name cluster and the second name cluster are sorted according to the order from small to large in number of the computing nodes, and the first name cluster and the second name cluster are combined into a target cluster, that is, the clusters with the smallest number and the second number are combined to obtain the target cluster.

S62, determining whether the difference between any two clusters in the target cluster and the non-merged cluster is within a preset range, and if so, entering step S63.

And S63, taking the target cluster and the unmixed cluster as clusters participating in data reduction, entering step S10 in the embodiment, and executing the subsequent step S11 and step S12.

Further, after the target clusters are obtained through combination, whether the difference between any two clusters in the target clusters and the non-combined clusters is within a preset range or not is determined, that is, whether the number of computing nodes among all the clusters is uniform or not after combination is determined. If the difference between the numbers of computing nodes of the clusters after merging is within an acceptable range, i.e. more uniform, cross-cluster data reduction can be performed, and step S63 is performed.

Specifically, the target cluster and each cluster which are not combined are taken as new clusters which participate in data reduction, and cross-cluster reduction is realized through steps S10 to S12 in the embodiment. For ease of understanding, the following will exemplify.

For example, the clusters currently participating in data reduction include cluster 1, cluster 2, and cluster 3, and the number of computing nodes in cluster 1 is 5, the number of computing nodes in cluster 2 is 50, and the number of computing nodes in cluster 3 is 60. Obviously, if the data reduction is performed on the cluster 1, the cluster 2 and the cluster 3, the data needs to be reduced to 5 designated computing nodes in the cluster, and when the data reduction is performed across the clusters, some communication links are seriously idle, and the computing nodes participating in the reduction have low reduction efficiency due to huge data quantity.

Therefore, in the technical scheme provided by the embodiment of the application, because the number of the computing nodes in the three clusters is very different, after the number of the computing nodes in the three clusters is sequenced, the cluster 1 and the cluster 2 are combined to obtain the target cluster, and the obtained target cluster comprises 55 computing nodes.

Furthermore, the target cluster and the cluster 3 are used as new clusters participating in data reduction, obviously, the number of 55 computing nodes and 60 computing nodes is less, and more idle links can be avoided when the cross-cluster reduction is performed.

Therefore, according to the set communication control method for distributed training, when the number of computing nodes among clusters participating in data reduction is greatly different, the clusters with the least number and the least number are combined, a large number of idle links are avoided when clusters are crossed, set communication performance is further optimized, and therefore distributed efficiency is improved.

Based on the foregoing embodiments, as an alternative embodiment, as shown in fig. 6, if the difference between any two clusters is not within the preset range in the target cluster and the non-merged cluster, the method further includes:

S64, determining whether the difference value between the number of computing nodes in the target cluster and the number of computing nodes in the designated cluster exceeds a threshold value, wherein the designated cluster is the cluster with the minimum number of computing nodes in the uncombined cluster, if the threshold value is exceeded, the step S63 is carried out, and if the threshold value is not exceeded, the step S65 is carried out.

On the basis of the embodiment, after the clusters with the minimum number and the small number are combined to obtain the target cluster, if the difference between the number of the calculation nodes of any two clusters is still large in the target cluster and the non-combined clusters. That is, after one merging, the difference between the number of computing nodes between clusters is still large.

At this time, it is necessary to judge first through step S64, and such a reason is caused to occur because the difference between the number of computing nodes in the target cluster and the number of computing nodes in the specified cluster exceeds the threshold. That is, it is determined whether the node distance between the current clusters is large because the number of calculated nodes in the target cluster after merging is too large.

If so, that is, the difference between the number of computing nodes in the target cluster and the number of computing nodes in the designated cluster exceeds the threshold, the cluster continues to be merged at this time, and the step S63 is returned to, that is, the data reduction is performed by taking the target cluster after being merged and the cluster not to be merged as new clusters participating in the reduction.

S65, determining whether the number of the non-merged clusters is greater than 1, if so, entering step S60, and executing subsequent steps, and if not, entering the step of taking the target cluster and the non-merged clusters as clusters participating in data reduction, and executing the subsequent steps.

Of course, in another alternative embodiment, if the difference between the number of computing nodes in the target cluster and the number of computing nodes in the designated cluster does not exceed the threshold, it is indicated that the non-uniformity in the number of computing nodes between the current clusters is not caused by the excessive number of computing nodes in the target cluster obtained by merging.

At this time, it is further determined whether the number of uncombined clusters is greater than 1 through step S65, in fact, after determining that the node non-uniformity between the current clusters is not caused by the target clusters, the clusters may be continuously merged so as to achieve the target of node uniformity between the clusters.

Specifically, it is determined whether the current target cluster and the unmixed clusters are still greater than 2, that is, it is determined whether the number of current clusters can continue to be merged, through step S65. If so, return to step S60 and continue to merge clusters. It should be noted, however, that each time the process returns to step S60, the target clusters after merging and the clusters not merging are sorted and merged.

Of course, if the number of current clusters cannot be merged, i.e. the number of clusters not subjected to data reduction is 1, i.e. only 2 clusters remain before, step S63 can only be returned, the target cluster and the non-merged clusters are regarded as clusters participating in data reduction, step S10 in the above embodiment is entered, and the subsequent steps S11 and S12 are executed.

As an alternative embodiment, the difference of the number of computing nodes between clusters participating in data reduction is within a preset range, including:

If the result of the specified calculation of the maximum value and the minimum value of the number of the calculation nodes in the cluster participating in the data reduction is not greater than a preset value, the difference is in a preset range, wherein the specified calculation comprises ratio calculation or difference calculation.

In a specific embodiment, when determining whether the difference of the number of computing nodes between clusters participating in data reduction is within a preset range, the number of computing nodes in each cluster may be counted, and a maximum value Umax and a minimum value Umin of the number of computing nodes may be determined.

Further, the maximum value Umax and the minimum value Umin are specified, and the specified calculation may include, but is not limited to, ratio calculation or difference calculation. When the ratio calculation is carried out, whether the calculation result of Umax/Umin is larger than 1 or not is determined, if not, the difference of the number of calculation nodes representing each cluster is within an acceptable range, and at the moment, cross-cluster data reduction can be carried out. If the difference between the numbers of the computing nodes representing each cluster is larger than 1, the clusters need to be combined in order to avoid the waste of more communication link resources, and a specific combining method is described in the above embodiment.

In another alternative embodiment, the difference between the maximum value Umax and the minimum value Umin may also be calculated, if the difference is smaller than a preset value, the difference of the number of calculation nodes representing each cluster is within an acceptable range, otherwise, the difference is not considered to be within the acceptable range.

It should be noted that, for different specified calculation modes, the corresponding preset value settings are different. In addition, it should be noted that, besides the calculation mode of the ratio and the difference value of the number of calculation nodes of each cluster within the acceptable range, other calculation modes are also possible, which is not limited by the present application. For example, the difference value of the number of the computing nodes between all the clusters is computed, whether the difference between any two difference values is smaller than a preset value is further determined, if so, the difference of the number of the computing nodes representing each cluster is within an acceptable range, and otherwise, the difference is not considered to be within the acceptable range.

In the above embodiment, the detailed description is given to the method for controlling the set communication of the distributed training, and the application also provides a corresponding embodiment of the device for controlling the set communication of the distributed training.

Fig. 7 is a schematic structural diagram of a distributed training set communication control device according to an embodiment of the present application, where, as shown in fig. 7, the device includes:

The intra-cluster reduction module 70 is configured to reduce, when a difference in the number of computing nodes between any two clusters is within a preset range, data distributed on all computing nodes in a first cluster to a designated computing node in the first cluster, where the number of the designated computing nodes is the same as the number of computing nodes in a second cluster, and the second cluster is a cluster with the smallest number of computing nodes in the clusters participating in data reduction;

A cross-cluster reduction module 71 for controlling the designated computing node to perform data reduction with the computing nodes in the second cluster;

The data synchronization module 72 is configured to control the designated computing node to perform data synchronization with other nodes in the first cluster except the designated computing node.

In addition, the set communication control device for distributed training provided by the embodiment of the application further comprises:

the node subgroup dividing module is used for dividing the computing nodes in the first cluster into m node subgroups, wherein each node subgroup is provided with k computing nodes, k=floor (n/m), n is the number of the computing nodes in the first cluster, and m is the number of the computing nodes in the second cluster;

The designated computing node selection module is used for selecting one computing node from each node group as a designated computing node;

And the intra-cluster reduction sub-module is used for reducing the data distributed on all the computing nodes in the first cluster to the designated computing nodes.

The first ordering module is used for ordering according to the device serial numbers of the computing nodes in the first cluster from small to large;

the computing node dividing module is used for dividing k computing nodes into each node group in sequence according to the mode of dividing one computing node at a time of each node group based on the sequencing result;

And the designated computing node determining module is used for taking the computing node with the smallest equipment serial number in each node group as the designated computing node.

The first processing module is used for determining whether the computing nodes in the first cluster have target computing nodes which are not grouped, if not, reducing the data in each node group to the designated computing nodes in the corresponding group, if so, executing the step of reducing the data in each node group to the designated computing nodes in the corresponding group, dividing the data on the target computing nodes into m data on average, and transmitting the m data to different designated computing nodes in a one-to-one correspondence mode.

The second sequencing module is used for sequencing according to the sequence of the number of the calculation nodes from small to large when the difference is not in the preset range;

the merging module is used for merging the first cluster and the second cluster into a target cluster based on the sequencing result;

The second processing module is used for determining whether the difference between any two clusters in the target cluster and the non-merging cluster is within a preset range, if so, taking the target cluster and the non-merging cluster as clusters participating in data reduction, entering a step of reducing the data distributed on all the computing nodes in the first cluster to the appointed computing nodes in the first cluster, and executing subsequent steps.

The third processing module is used for determining whether the difference value between the number of computing nodes in the target cluster and the number of computing nodes in the designated cluster exceeds a threshold value, wherein the designated cluster is the cluster with the smallest number of computing nodes in the uncombined cluster, if the difference value exceeds the threshold value, the step of taking the target cluster and the uncombined cluster as clusters participating in data reduction is carried out, if the difference value does not exceed the threshold value, the step of determining whether the number of the uncombined clusters is larger than 1, if the difference value is larger than the threshold value, the step of sequencing the number of computing nodes from small to large, and the subsequent step is carried out, and if the difference value is not larger than the threshold value, the step of taking the target cluster and the uncombined cluster as clusters participating in data reduction is carried out.

FIG. 8 is a schematic structural diagram of a distributed training set communication control device according to another embodiment of the present application, where the distributed training set communication control device includes a memory 80 for storing a computer program as shown in FIG. 8;

A processor 81 for implementing the steps of the set communication control method of distributed training as mentioned in the above embodiments when executing a computer program.

The distributed training set communication control device provided in this embodiment may include, but is not limited to, a notebook computer or a desktop computer.

Processor 81 may include one or more processing cores, such as a 4-core processor, an 8-core processor, etc. The Processor 81 may be implemented in at least one hardware form of a digital signal Processor (DIGITAL SIGNAL Processor, DSP), field-Programmable gate array (FPGA), and Programmable logic array (Programmable Logic Array, PLA). The processor 81 may also include a main processor for processing data in the awake state, which is also called a central processor (Central Processing Unit, abbreviated as CPU), and a coprocessor for processing data in the standby state, which is a low-power-consumption processor. In some embodiments, the processor 81 may be integrated with an image processor (Graphics Processing Unit, GPU for short), which is responsible for rendering and drawing of the content that the display screen is required to display. In some embodiments, the processor 81 may also include an artificial intelligence (ARTIFICIAL INTELLIGENCE, AI for short) processor for processing computing operations related to machine learning.

Memory 80 may include one or more computer-readable storage media, which may be non-transitory. Memory 80 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In this embodiment, the memory 80 is at least used to store a computer program 801 that, when loaded and executed by the processor 81, is capable of implementing the relevant steps of the distributed training aggregate communication control method disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 80 may further include an operating system 802, data 803, and the like, where the storage manner may be transient storage or permanent storage. Operating system 802 may include Windows, unix, linux, among other things. The data 803 may include, but is not limited to, related data involved in a distributed training set communication control method, and the like.

In some embodiments, the distributed training set communication control device may further include a display screen 82, an input-output interface 83, a communication interface 84, a power supply 85, and a communication bus 86.

Those skilled in the art will appreciate that the architecture shown in fig. 8 is not limiting of the distributed training aggregate communication control device and may include more or fewer components than illustrated.

The distributed training set communication control device provided by the embodiment of the application comprises a memory and a processor, wherein the processor can realize the distributed training set communication control method in the embodiment when executing a program stored in the memory.

It should be noted that although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Claims

1. A method for controlling collective communication of distributed training, the method comprising:

Controlling the designated computing node and the computing nodes in the second cluster to perform data reduction in a reducing script set communication semantic mode;

2. The method of claim 1, wherein the reducing the data distributed among the first cluster on all computing nodes to a designated computing node in the first cluster comprises:

3. The distributed training aggregate communication control method of claim 2, wherein said selecting a computing node from each of said node subgroups as said designated computing node comprises:

4. The method of distributed training aggregate communication control of claim 2, wherein said reducing data distributed across all computing nodes in the first cluster to the designated computing node comprises:

5. The distributed training aggregate communication control method of claim 1, further comprising:

6. The method for controlling communication of a distributed training set according to claim 5, wherein, if the target cluster and the uncombined cluster, the gap between any two clusters is not within the preset range, and the method further comprises:

7. The method for controlling collective communication of distributed training according to claim 1, characterized in that a gap in the number of computing nodes between clusters participating in data reduction is within a preset range, comprising:

8. A distributed training aggregate communication control device, the device comprising:

The cross-cluster reduction module is used for controlling the designated computing node and the computing nodes in the second cluster to Reduce data in a reducing script set communication semantic mode;

9. A distributed training aggregate communication control device comprising a memory and a processor, the memory having stored thereon a computer program executable on the processor, characterized in that the processor, when executing the program, implements the steps of the distributed training aggregate communication control method of any of claims 1 to 7.

10. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the collective communication control method of distributed training according to any of claims 1 to 7.