Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are illustrated in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The term "comprising" and variations thereof as used herein are open-ended, i.e., "including but not limited to." The term "or" means "and/or" unless specifically stated otherwise. The term "based on" means "based at least in part on." The terms "one embodiment" and "some embodiments" mean "at least one example embodiment." The term "another embodiment" means "at least one additional embodiment." The terms "first," "second," and the like may refer to different or the same objects.
Fig. 1 shows a schematic diagram of an exemplary parallel processor chip 100. As shown in fig. 1, the parallel processor chip 100 may include a plurality of processors 110 and a task scheduler 120 for scheduling tasks for the plurality of processors 110. According to the current programming model, the task scheduler 120 schedules computational tasks on the processors 110 in the form of processor clusters. That is, several processors 110 form a processor cluster, and the task scheduler 120 assigns a computing task to the processor cluster as a whole. Here, in order to distinguish it from the logical processor clusters described below, such a processor cluster divided in fixed form is also referred to as a physical processor cluster, as shown by reference numeral 130 in fig. 1. For example, 16 processors 110 (i.e., processor 110-0, processor 110-1, ..., processor 110-15) are illustratively shown in FIG. 1, where every 4 processors 110 constitute one physical processor cluster 130 (e.g., processors 110-0 through 110-3 constitute physical processor cluster 130-0, processors 110-4 through 110-7 constitute physical processor cluster 130-1, processors 110-8 through 110-11 constitute physical processor cluster 130-2, and processors 110-12 through 110-15 constitute physical processor cluster 130-3). Those skilled in the art will appreciate that the number of processors 110 in the parallel processor chip 100 shown in fig. 1, the number of physical processor clusters 130, the number of processors 110 included in each physical processor cluster 130, and the number of internal interconnects 140 described below are merely exemplary, and may be designed and implemented differently for different system requirements.
Furthermore, in some implementations, where the number of processors 110 contained in the parallel processor chip 100 has been determined, the number of physical processor clusters 130 contained and the number of processors 110 in each physical processor cluster 130 (i.e., the number of processors per cluster described below) may be flexibly configured in a programming model according to different applications.
In some examples, the parallel processor chip 100 may be an AI chip, a graphics processing unit (GPU), a general-purpose GPU (GPGPU), or the like, in which case the processor 110 may be a general-purpose computing unit (CU) therein, and the physical processor cluster 130 may also be referred to as a streaming processing cluster (SPC).
The task scheduler 120 may be directly connected to each of the processors 110 in the plurality of physical processor clusters 130 through a plurality of internal interconnects 140 (internal interconnect 140-0, internal interconnect 140-1, internal interconnect 140-2, and internal interconnect 140-3 as shown in fig. 1), respectively, to uniformly schedule and manage the processors 110.
In some implementations, the task scheduler 120 may be separately connected to each internal interconnect 140. Upon task scheduling, it sends a task scheduling signal only to the internal interconnect 140 corresponding to the physical processor cluster 130 targeted by the task scheduling, and the computing tasks assigned by the task scheduling signal are routed by that internal interconnect 140 to all processors 110 of the physical processor cluster 130. For example, when the task scheduler 120 is to schedule tasks to the physical processor cluster 130-1, the task scheduler 120 may send the task scheduling signal only to the internal interconnect 140-1, which routes the computing tasks to all of its processors 110-4 through 110-7.
In other implementations, a plurality of internal interconnects 140 may communicate with each other. In this case, the task scheduler 120 may be connected to only one internal interconnect 140 and, at the time of task scheduling, transmit a task scheduling signal only to that internal interconnect 140, which forwards the task scheduling signal toward the internal interconnect 140 corresponding to the physical processor cluster 130 that is the task scheduling target. For example, as shown in FIG. 1, assume that the task scheduler 120 is connected only to the first internal interconnect 140-0, and that the first internal interconnect 140-0 is connected to the internal interconnect 140-1, the internal interconnect 140-2, and the internal interconnect 140-3 in this order. In this case, the task scheduler 120 may directly transmit a task scheduling signal to the internal interconnect 140-0, and the internal interconnect 140-0 determines, by parsing the task scheduling signal, whether the physical processor cluster 130-0 connected to it is the target of the signal. If the physical processor cluster 130-0 is the target, the internal interconnect 140-0 may route the computing task assigned by the task scheduling signal to all of the processors 110-0 through 110-3 in the physical processor cluster 130-0. If it is not the target, the internal interconnect 140-0 may route the task scheduling signal to the next internal interconnect 140-1, which performs operations similar to those of the internal interconnect 140-0, and so on, until the task scheduling signal reaches the physical processor cluster 130 that is its target.
In this case, if one or more processors 110 fail during the fabrication of the parallel processor chip 100 (which may be determined by test means), then, since the task scheduler 120 schedules computing tasks at the granularity of physical processor clusters, any physical processor cluster 130 that includes a failed processor will not work properly, thereby greatly reducing the number of actually available processors.
For this situation, a hardware fault tolerance mechanism called a harvest scheme is currently commonly used to enable the whole chip to still work normally even if it includes a certain number of failed processors, so as to improve the yield of the parallel processor chip and reduce its manufacturing cost. In the related art, there are two types of harvest schemes for parallel processor chips. One harvest scheme is to directly mask the entire physical processor cluster that includes a failed processor when the task scheduler 120 performs task scheduling, and to allocate computing tasks only to physical processor clusters in which all processors are intact (valid). However, with this scheme, if most or all of the physical processor clusters include a failed processor (even if only one), there will be very few or no physical processor clusters available for the task scheduler 120 to schedule, so that the entire parallel processor chip still cannot work normally, and the improvement in chip yield is very limited.
Another harvest scheme is to configure redundant processors 110 in each physical processor cluster 130 such that the number of valid processors in each physical processor cluster 130 still meets the number of processors per cluster defined by the programming model. For example, assuming that each physical processor cluster 130 needs to include 4 processors 110, 1 or 2 additional processors 110 may be configured for each physical processor cluster 130 as redundant processors. When 1 or 2 of the 4 processors 110 are failed processors, they may be replaced with the additionally configured 1 or 2 processors 110 to form a physical processor cluster 130 containing 4 valid processors. However, in this scheme, the additionally configured redundant processors occupy additional chip area, causing resource waste, and the appropriate number of redundant processors is difficult to determine: too many redundant processors further increase the chip area, while too few may still not guarantee that every physical processor cluster 130 is available.
In view of the above problems, the present disclosure provides a task scheduling scheme for a parallel processor chip, in which all valid processors in the chip are treated as a processor pool and divided into a plurality of logical processor clusters, and the task scheduler 120 allocates computing tasks at the granularity of logical processor clusters, so that the chip yield can be improved without changing the programming model.
Fig. 2 shows an exemplary flowchart of a task scheduling method 200 for the parallel processor chip 100 according to an embodiment of the invention. The task scheduling method 200 may be performed by the parallel processor chip 100, more particularly, by the task scheduler 120.
As shown in fig. 2, at block 210, the task scheduler 120 may determine the validity status, i.e., valid or invalid, of each processor 110 of the parallel processor chip 100.
As previously described, processor failures caused during the fabrication of the parallel processor chip 100 may be determined by test means. Currently, commonly used test means include, for example, the DFT (Design for Test) test, in which various hardware logic for improving the testability (including controllability and observability) of a chip is inserted into the original design of the chip; testing of the chip and reception of test results can then be initiated by dedicated DFT test software, thereby making the chip easy to test. DFT testing may be performed at the end of the chip fabrication process, prior to packaging. However, the present invention is not so limited; even after the chip is packaged and delivered, a DFT test or other similar test may be performed on it to determine whether each processor 110 is a valid processor.
In this case, block 210 may specifically include performing a DFT test on the plurality of processors 110 of the parallel processor chip 100 to determine whether each processor 110 is a valid processor or an invalid processor, and setting the result of the DFT test in a status register to indicate the validity status of each processor 110.
Here, a status register harvest_reg dedicated to recording the validity states of the plurality of processors 110 may be configured in the task scheduler 120. The size of the status register harvest_reg is equal to the number of processors 110 in the parallel processor chip 100, with each bit indicating the validity state of the corresponding processor 110. For example, in the case of the 16 processors 110 shown in FIG. 1, the status register harvest_reg may include 16 bits, denoted harvest_reg[15:0], with the lowest bit corresponding to processor 110-0 and the highest bit corresponding to processor 110-15; a bit value of 0 indicates that the corresponding processor 110 is an invalid processor, and a bit value of 1 indicates that it is a valid processor.
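The register layout described above can be modeled in software as follows. This is a minimal illustrative sketch in Python; the function names and the 16-processor configuration are assumptions for illustration, since the actual harvest_reg is hardware state written by the DFT test flow rather than software.

```python
# Illustrative software model of the harvest_reg bit layout described above.
NUM_PROCESSORS = 16

def build_harvest_reg(invalid_processors):
    """Build a 16-bit validity mask: bit i = 1 if processor 110-i is valid."""
    reg = (1 << NUM_PROCESSORS) - 1  # start with all processors marked valid
    for i in invalid_processors:
        reg &= ~(1 << i)             # clear the bit of each failed processor
    return reg

def is_valid(reg, i):
    """Read the validity state of processor 110-i from the register."""
    return (reg >> i) & 1 == 1

# Processors 110-0, 110-1, 110-8 and 110-12 failed, as in the FIG. 3 example:
reg = build_harvest_reg([0, 1, 8, 12])
assert format(reg, "016b") == "1110111011111100"
```

The asserted value matches the harvest_reg[15:0] = 1110 1110 1111 1100 example given for fig. 3 below.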
At block 220, the task scheduler 120 may divide the valid processors of the plurality of processors 110 into a plurality of logical processor clusters based on the validity status of each processor.
FIG. 3 illustrates a schematic diagram of logical processor clusters 130' partitioned from the plurality of processors 110 of FIG. 1, according to an embodiment of the present invention. Fig. 3 differs from fig. 1 only in that the physical processor clusters 130 are replaced with the newly partitioned logical processor clusters 130'. Further, to indicate the validity status of each processor 110, the invalid processors 110 are shown in dashed boxes. In the example shown in fig. 3, assuming processors 110-0, 110-1, 110-8, and 110-12 are invalid processors, the status register harvest_reg[15:0] = 1110 1110 1111 1100.
More specifically, dividing the valid processors of the plurality of processors 110 into the plurality of logical processor clusters 130' in block 220 may include assigning sequentially increasing processor logical numbers to the valid processors of the plurality of processors 110 based on the validity status of each processor 110, and dividing the valid processors into the plurality of logical processor clusters 130' based on the processor logical numbers and a predetermined number of processors per cluster.
For example, the task scheduler 120 may traverse the status register harvest_reg in low-to-high order of processor physical number (i.e., physical numbers 110-0, 110-1, ..., 110-15 of the plurality of processors 110), assigning a processor logical number of 0 to the first processor 110 with a validity state of 1 (i.e., a valid processor), a processor logical number of 1 to the second processor 110 with a validity state of 1, and so on, until the entire status register harvest_reg has been traversed. For the processor validity states shown in fig. 3, the following Table 1 is obtained:
TABLE 1
Processor physical number    Processor logical number
110-2                        0
110-3                        1
110-4                        2
110-5                        3
110-6                        4
110-7                        5
110-9                        6
110-10                       7
110-11                       8
110-13                       9
110-14                       10
110-15                       11
For a processor 110 with a validity state of 0 (i.e., an invalid processor), no processor logical number may be assigned, or another marker may be used to indicate that it is an invalid processor.
The task scheduler 120 may then divide the logical processor clusters 130' by processor logical number (instead of processor physical number) and number of processors per cluster. For example, still assuming a per-cluster processor count of 4, i.e., each processor cluster contains 4 processors, the valid processors of Table 1 may be divided into logical processor clusters 130'-0, 130'-1, and 130'-2 as shown in Table 2 below and in FIG. 3, according to their processor logical numbers.
TABLE 2
Logical processor cluster    Processor logical numbers    Processor physical numbers
130'-0                       0, 1, 2, 3                   110-2, 110-3, 110-4, 110-5
130'-1                       4, 5, 6, 7                   110-6, 110-7, 110-9, 110-10
130'-2                       8, 9, 10, 11                 110-11, 110-13, 110-14, 110-15
Here, dividing the logical processor clusters in increasing order of processor logical number is exemplary, and the present invention is not limited thereto; the logical processor clusters may be divided in other manners. For example, the valid processors may be divided into logical processor clusters in an alternating or random manner, so long as a mapping relationship table between processor logical numbers, processor physical numbers, and logical processor clusters, as shown in Table 2 above, is maintained in the task scheduler 120.
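The numbering and partitioning steps of block 220 can be sketched as follows. This is an illustrative Python model under the in-order division described above; the function name partition_logical_clusters and the fixed cluster size are assumptions for illustration.

```python
# Illustrative model of block 220: assign increasing logical numbers to valid
# processors (traversing harvest_reg from low to high) and group consecutive
# logical numbers into clusters of PROCESSORS_PER_CLUSTER members.
NUM_PROCESSORS = 16
PROCESSORS_PER_CLUSTER = 4

def partition_logical_clusters(harvest_reg):
    # Physical numbers of valid processors, in low-to-high order; a processor's
    # logical number is simply its index in this list.
    valid = [i for i in range(NUM_PROCESSORS) if (harvest_reg >> i) & 1]
    # Group consecutive logical numbers into clusters.
    return [valid[i:i + PROCESSORS_PER_CLUSTER]
            for i in range(0, len(valid), PROCESSORS_PER_CLUSTER)]

clusters = partition_logical_clusters(0b1110111011111100)
# Matches Table 2: cluster 130'-0 holds physical processors 110-2 .. 110-5.
assert clusters[0] == [2, 3, 4, 5]
assert clusters[1] == [6, 7, 9, 10]
assert clusters[2] == [11, 13, 14, 15]
```

Note that if the number of valid processors were not a multiple of the cluster size, a trailing group with fewer processors would remain; how such a partial group is handled is not specified here and would be an implementation choice.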
Next, at block 230, the task scheduler 120 may assign a computing task to at least one logical processor cluster 130' of the plurality of logical processor clusters 130', such that the computing task is routed to the processors 110 corresponding to the at least one logical processor cluster 130'.
In some embodiments, the task scheduler 120 may assign at least one logical processor cluster to a computing task based on the computing task's requirement for parallel processing capability.
In some implementations, when the number of processors per cluster is determined for an application in the programming model, the requirement of each computing task for parallel processing capability is simultaneously expressed in the form of processor clusters; that is, when each computing task is scheduled, the task scheduler 120 can know from the requirement information of the computing task how many processor clusters need to be configured for it. Alternatively, the task scheduler 120 may convert a requirement expressed at a smaller granularity (e.g., a processor-level requirement) into the processor cluster level; the invention is not limited in this regard.
Of course, other factors, such as availability (e.g., whether idle) of the logical processor clusters, etc., need to be considered in allocating a certain computing task to one or more logical processor clusters, and specific allocation policies may be determined, for example, in a round robin fashion, randomly from among available processor clusters, or according to a certain rule (e.g., the processor cluster with the least workload) from among available processor clusters, etc., which is not limited herein.
The task scheduler 120 may then generate a computational task barrier signal corresponding to the determined at least one logical processor cluster 130' (e.g., logical processor cluster 130'-0) based on the mapping relationship between the plurality of logical processor clusters 130' (e.g., logical processor clusters 130'-0 through 130'-2) and the plurality of processors 110 (e.g., processors 110-0 through 110-15) of the parallel processor chip 100, wherein the computational task barrier signal indicates the processors 110 corresponding to the at least one logical processor cluster 130'.
For example, assuming that the logical processor cluster assigned to the computing task by the task scheduler 120 is the logical processor cluster 130'-0, the processors corresponding to the logical processor cluster 130'-0 are determined, based on the mapping relationship between the logical processor clusters 130' and the processors 110 shown in Table 2 above, to be processor 110-2, processor 110-3, processor 110-4, and processor 110-5. In this case, the computational task barrier signal generated by the task scheduler 120 may be the 16-bit string 0000 0000 0011 1100, where the bits with a bit value of 1 correspond to processor 110-2, processor 110-3, processor 110-4, and processor 110-5, respectively (in order from low to high).
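The generation of the barrier signal from the cluster-to-processor mapping can be sketched as follows (an illustrative Python model; the function name barrier_signal is an assumption for illustration):

```python
# Illustrative model of block 230: build the computational task barrier signal
# by setting one bit per physical processor assigned to the computing task.
def barrier_signal(cluster_physical_ids):
    """Return a bitmask with bit i set for each assigned processor 110-i."""
    signal = 0
    for i in cluster_physical_ids:
        signal |= 1 << i
    return signal

# Logical cluster 130'-0 maps to physical processors 110-2 .. 110-5 (Table 2):
sig = barrier_signal([2, 3, 4, 5])
assert format(sig, "016b") == "0000000000111100"
```

The asserted value matches the 16-bit string 0000 0000 0011 1100 in the example above.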
With the above method, the valid processors of the parallel processor chip can be dynamically divided into logical processor clusters, so that computing task scheduling can be performed on the chip at the granularity of logical processor clusters without changing the programming model.
Further, in some embodiments, the method 200 may further include routing the computational tasks to the corresponding processors 110 through the plurality of internal interconnects 140 of the parallel processor chip 100 according to the computational task barrier signals described above.
FIG. 4 illustrates an exemplary flow diagram of a computing task routing process 240 according to an embodiment of the present invention. Here, it is assumed that the parallel processor chip 100 includes a plurality of internal interconnects 140 (e.g., a first internal interconnect 140-0, a second internal interconnect 140-1, a third internal interconnect 140-2, and a fourth internal interconnect 140-3, each connected to a corresponding physical processor cluster 130), as shown in figs. 1 and 3.
As shown in FIG. 4, at block 241, the first internal interconnect 140-0 may obtain the lowest N bits of the computational task barrier signal, and at block 242 determine whether the computing task is assigned to one or more processors 110 in the physical processor cluster 130-0 corresponding to the first internal interconnect 140-0. Here, N is the number of processors 110 included in each physical processor cluster 130 (i.e., the number of processors per cluster); for example, N = 4 in the examples of figs. 1 and 3.
For example, assuming that the computational task barrier signal generated by the task scheduler 120 in block 230 is a 16-bit string 0000 0000 0011 1100, the first internal interconnect 140-0 may obtain the lowest 4 bits 1100 of the computational task barrier signal and determine from the bit with a bit value of 1 in the lowest 4 bits that the computational task is assigned to two processors in the physical processor cluster 130-0 corresponding to the first internal interconnect 140-0, namely processor 110-2 and processor 110-3.
If it is determined that the computing task is assigned to one or more processors 110 (e.g., processor 110-2 and processor 110-3) in the physical processor cluster 130-0 corresponding to the first internal interconnect 140-0, then the first internal interconnect 140-0 may route the computing task to processor 110-2 and processor 110-3 at block 243.
Next, at block 244, the first internal interconnect 140-0 may successively shift the computational task barrier signal right by N bits and pass the resulting lowest N bits to each of the internal interconnects following it (e.g., the second internal interconnect 140-1, the third internal interconnect 140-2, and the fourth internal interconnect 140-3), respectively. Here, whether the determination of block 242 is "yes" or "no," the operations of block 244 are performed, so that it can also be determined, based on the computational task barrier signal, whether the computing task is assigned to one or more processors 110 in the other physical processor clusters 130.
For example, in the case where the computational task barrier signal is the 16-bit string 0000 0000 0011 1100, the first internal interconnect 140-0 shifts the signal right by 4 bits to obtain the lowest 4 bits 0011 (i.e., bits 5-8 of the bit string from low to high) and passes these 4 bits to the second internal interconnect 140-1. It then shifts right by another 4 bits to obtain the lowest 4 bits 0000 (i.e., bits 9-12 from low to high) and passes these 4 bits to the third internal interconnect 140-2. Finally, it shifts right by another 4 bits to obtain the lowest 4 bits 0000 (i.e., bits 13-16 from low to high) and passes these 4 bits to the fourth internal interconnect 140-3.
Next, the other internal interconnect 140 may perform operations similar to the first internal interconnect 140-0, i.e., operations described above in blocks 242 through 244, determine whether the computing task is assigned to one or more processors 110 in the physical processor cluster 130 corresponding to the other internal interconnect 140 based on the N-bit signal received from the first internal interconnect 140-0, and route the computing task to the corresponding processor in the case that it is determined that the computing task is assigned to the processor 110 in the corresponding physical processor cluster 130.
For example, the second internal interconnect 140-1 may determine from the received 4-bit signal 0011 that the computing task is assigned to two processors in the physical processor cluster 130-1 corresponding to it, namely processor 110-4 and processor 110-5, and route the computing task to them. The third internal interconnect 140-2 and the fourth internal interconnect 140-3 may determine from the received 4-bit signal 0000 that the computing task is not assigned to any processor 110 in the physical processor clusters 130-2 and 130-3 corresponding to them, respectively.
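The routing process 240 described above can be sketched as a whole as follows. This is an illustrative Python model of the N-bit slicing behavior of the internal interconnects; the function name route and its loop structure are assumptions, as the actual routing is performed by hardware.

```python
# Illustrative model of process 240: each internal interconnect examines the
# lowest N bits of the barrier signal for its own physical cluster, then the
# signal is shifted right by N bits and the remainder is passed onward.
N = 4  # processors per physical cluster, as in the examples of figs. 1 and 3

def route(barrier_signal, num_interconnects=4):
    """Return, per interconnect, the physical processors receiving the task."""
    routed = []
    for interconnect in range(num_interconnects):
        local_bits = barrier_signal & ((1 << N) - 1)   # lowest N bits
        # Local processors whose bit value is 1 are assigned the task.
        targets = [interconnect * N + i for i in range(N)
                   if (local_bits >> i) & 1]
        routed.append(targets)
        barrier_signal >>= N                           # pass the rest onward
    return routed

# Barrier signal 0000 0000 0011 1100 targets processors 110-2 through 110-5:
assert route(0b0000000000111100) == [[2, 3], [4, 5], [], []]
```

The result matches the worked example: interconnect 140-0 routes to processors 110-2 and 110-3, interconnect 140-1 routes to processors 110-4 and 110-5, and interconnects 140-2 and 140-3 receive no assignments.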
By using the scheme of the invention, all valid processors in the parallel processor chip are treated as a processor pool and divided into a plurality of logical processor clusters, and the task scheduler allocates computing tasks at the granularity of logical processor clusters. Thus, more available processor clusters can be obtained for the same number of invalid processors, improving the chip yield without requiring any change to the programming model. In addition, since the mapping relationship between the logical processor clusters and the processors is stored in the task scheduler, the task scheduler can achieve task allocation at the granularity of logical processor clusters without complex logic.
Those skilled in the art will appreciate that the parallel processor chip 100 shown in the figures is merely illustrative and may contain more or fewer components.
The parallel processor chip 100 and the task scheduling method 200 performed in the parallel processor chip 100 according to the present disclosure are described above with reference to the accompanying drawings. Those skilled in the art will appreciate that the parallel processor chip 100 need not include all of the components shown in the figures and may include only those components, or additional components, necessary to perform the functions described in this disclosure; that the manner of connection of these components is not limited to the form shown in the figures; and that the method 200 may include further steps not shown in the figures.
The present invention may be embodied as a method, a computing device (e.g., the parallel processor chip 100 or the task scheduler 120 of the parallel processor chip 100), a computer-readable storage medium, and/or a computer program product. The computer-readable storage medium has stored thereon computer program code which, when executed, carries out the methods of the present disclosure. The computer program product comprises a computer program which, when executed, performs the methods of the present disclosure. The computing device may include at least one processor and at least one memory coupled to the at least one processor, the memory storing instructions for execution by the at least one processor. When the instructions are executed by the at least one processor, the computing device may perform the methods described above.
In one or more exemplary designs, the functions described in this disclosure may be implemented in hardware, software, firmware, or any combination thereof. For example, if implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
Those of ordinary skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both.
The previous description of the disclosure is provided to enable any person of ordinary skill in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.