
CN118277101B - Task scheduling method for parallel processor chip, computing device, computer readable storage medium and computer program product - Google Patents


Info

Publication number
CN118277101B
CN118277101B (application CN202410658287.0A)
Authority
CN
China
Prior art keywords
processor
processors
logical
cluster
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410658287.0A
Other languages
Chinese (zh)
Other versions
CN118277101A (en)
Inventor
Name not disclosed at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd, Beijing Bilin Technology Development Co ltd filed Critical Shanghai Bi Ren Technology Co ltd
Priority to CN202410658287.0A priority Critical patent/CN118277101B/en
Publication of CN118277101A publication Critical patent/CN118277101A/en
Application granted granted Critical
Publication of CN118277101B publication Critical patent/CN118277101B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)

Abstract

The present disclosure provides a task scheduling method for a parallel processor chip, a computing device, a computer-readable storage medium, and a computer program product. The method includes: determining the validity status of each processor in a plurality of processors of the parallel processor chip; dividing the valid processors in the plurality of processors into a plurality of logical processor clusters based on the validity status of each processor; and assigning a computing task to at least one logical processor cluster in the plurality of logical processor clusters so that the computing task is routed to a processor corresponding to the at least one logical processor cluster based on a mapping relationship between the plurality of logical processor clusters and the plurality of processors.

Description

Task scheduling method for parallel processor chip, computing device, computer readable storage medium and computer program product
Technical Field
The present disclosure relates generally to the field of processors, and more particularly, to a task scheduling method for parallel processor chips, a computing device, a computer readable storage medium, and a computer program product.
Background
With advances in integrated circuit manufacturing processes, parallel processor chips that integrate multiple processors (also referred to as processing cores) are now in common use to accelerate various computing tasks. The number of processors integrated on a single parallel processor chip can range from a few to tens or even more; chips that integrate tens or more processors on a single die are also referred to as many-core parallel processor chips. Because the processors in a parallel processor chip share the same design and are numerous, they are well suited to accelerating parallel computing tasks.
Manufacturing defects are inevitably introduced during chip fabrication and can cause circuit failures. Because of their large processor counts, parallel processor chips are more likely to be affected by such defects, and many-core parallel processor chips even more so. If left unaddressed, a defective processor can render the entire chip unusable, driving chip yield down and the cost of each fully functional chip up.
Disclosure of Invention
To address these problems, the present disclosure provides a scheme that constructs the valid processors in a parallel processor chip into logical processor clusters for task scheduling, so that scheduling at processor-cluster granularity can be achieved without changing the programming model, improving chip yield.
According to one aspect of the present disclosure, a task scheduling method for parallel processor chips is provided. The method includes determining a validity state of each of a plurality of processors of the parallel processor chip, dividing the valid processors of the plurality of processors into a plurality of logical processor clusters based on the validity state of each processor, and assigning a computing task to at least one logical processor cluster of the plurality of logical processor clusters such that the computing task is routed to a processor corresponding to the at least one logical processor cluster based on a mapping relationship between the plurality of logical processor clusters and the plurality of processors.
In some implementations, determining the validity state of each of a plurality of processors of the parallel processor chip includes performing a DFT test on the plurality of processors to determine whether each processor is a valid processor or an invalid processor, and setting a result of the DFT test in a status register to indicate the validity state of each processor.
In some implementations, dividing the valid processors of the plurality of processors into a plurality of logical processor clusters based on the validity state of each processor includes assigning sequentially increasing processor logical numbers to the valid processors of the plurality of processors based on the validity state of each processor and dividing the valid processors into the plurality of logical processor clusters based on the processor logical numbers and a predetermined number of processors per cluster.
In some implementations, assigning a computing task to at least one logical processor cluster of the plurality of logical processor clusters includes assigning the at least one logical processor cluster to the computing task according to a requirement of the computing task for parallel processing capability, and generating a computing task barrier signal corresponding to the at least one logical processor cluster based on a mapping relationship between the plurality of logical processor clusters and the plurality of processors, wherein the computing task barrier signal indicates a processor corresponding to the at least one logical processor cluster.
In some implementations, the method further includes routing the computational tasks to corresponding processors through a plurality of internal interconnects of the parallel processor chips according to the computational task barrier signals.
In some implementations, the plurality of internal interconnects are respectively connected to a plurality of physical processor clusters formed from the plurality of processors, and routing the computational task to the corresponding processor through the plurality of internal interconnects of the parallel processor chip according to the computational task barrier signal includes: a first internal interconnect acquiring the lowest N bits of the computational task barrier signal, where N is the number of processors included in each physical processor cluster, and determining whether the computational task is assigned to one or more processors in the physical processor cluster corresponding to the first internal interconnect; routing the computational task to the one or more processors if it is determined that the computational task is assigned to one or more processors in the physical processor cluster corresponding to the first internal interconnect; and successively right-shifting the computational task barrier signal by N bits, acquiring the lowest N bits after each shift, and transmitting them respectively to the internal interconnects following the first internal interconnect, to determine whether the computational task is assigned to one or more processors in the other physical processor clusters corresponding to those internal interconnects.
According to another aspect of the present disclosure there is provided a computing device comprising at least one processor, and at least one memory coupled to the at least one processor and storing instructions for execution by the at least one processor, which when executed by the at least one processor, cause the computing device to perform the steps of the method as described above.
According to yet another aspect of the present disclosure, a computer readable storage medium is provided, having stored thereon computer program code which, when executed, performs the method as described above.
According to yet another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a machine, performs the method as described above.
Drawings
The disclosure will be better understood and other objects, details, features and advantages of the disclosure will become more apparent by reference to the description of specific embodiments thereof given in the following drawings.
Fig. 1 shows a schematic diagram of an exemplary parallel processor chip.
Fig. 2 shows an exemplary flow chart of a task scheduling method for parallel processor chips according to an embodiment of the invention.
FIG. 3 illustrates a schematic diagram of a logical processor cluster that is partitioned among the multiple processors of FIG. 1, in accordance with an embodiment of the present invention.
FIG. 4 illustrates an exemplary flow chart of a computing task routing process according to an embodiment of the invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are illustrated in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The term "comprising" and variations thereof as used herein means open ended, i.e., "including but not limited to. The term "or" means "and/or" unless specifically stated otherwise. The term "based on" means "based at least in part on". The terms "one embodiment" and "some embodiments" mean "at least one example embodiment. The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like, may refer to different or the same object.
Fig. 1 shows a schematic diagram of an exemplary parallel processor chip 100. As shown in fig. 1, the parallel processor chip 100 may include a plurality of processors 110 and a task scheduler 120 for task scheduling for the plurality of processors 110. According to the current programming model, the task scheduler 120 performs computational task scheduling on the processors 110 at the granularity of processor clusters. That is, several processors 110 form a processor cluster, and the task scheduler 120 assigns a computing task to the processor cluster as a whole. Here, in order to distinguish it from the logical processor clusters described below, such a processor cluster divided in fixed form is also referred to as a physical processor cluster, as shown by reference numeral 130 in fig. 1. For example, 16 processors 110 (i.e., processor 110-0, processor 110-1, ..., processor 110-15) are illustratively shown in FIG. 1, where every 4 processors 110 constitute one physical processor cluster 130 (e.g., processors 110-0 through 110-3 constitute physical processor cluster 130-0, processors 110-4 through 110-7 constitute physical processor cluster 130-1, processors 110-8 through 110-11 constitute physical processor cluster 130-2, and processors 110-12 through 110-15 constitute physical processor cluster 130-3). Those skilled in the art will appreciate that the number of processors 110 and the number of physical processor clusters 130 in the parallel processor chip 100 shown in fig. 1, as well as the number of processors 110 in each physical processor cluster 130 (and the number of internal interconnects 140 described below), are merely exemplary, and they may be designed and implemented differently for different system requirements.
Furthermore, in some implementations, where the number of processors 110 contained in the parallel processor chip 100 has been determined, the number of physical processor clusters 130 contained and the number of processors 110 in each physical processor cluster 130 (i.e., the number of processors per cluster described below) may be flexibly configured in a programming model according to different applications.
In some examples, the parallel processor chip 100 may be an AI chip, a GPU (Graphics Processing Unit), a general-purpose GPU (GPGPU), or the like, in which the processor 110 may be a general-purpose computing unit (CU) and the physical processor cluster 130 may also be referred to as a streaming processing cluster (SPC).
The task scheduler 120 may be directly connected to each of the processors 110 in the plurality of physical processor clusters 130 through a plurality of internal interconnects 140 (internal interconnect 140-0, internal interconnect 140-1, internal interconnect 140-2, and internal interconnect 140-3 as shown in fig. 1), respectively, to uniformly schedule and manage the processors 110.
In some implementations, the task scheduler 120 may be connected to each internal interconnect 140 separately; upon task scheduling, it sends a task scheduling signal only to the internal interconnect 140 corresponding to the physical processor cluster 130 targeted by the scheduling, and the computing task assigned by the task scheduling signal is routed by that internal interconnect 140 to all processors 110 of the physical processor cluster 130. For example, when the task scheduler 120 is to schedule tasks to the physical processor cluster 130-1, it may send the task scheduling signal only to the internal interconnect 140-1, which routes the computing task to all of its processors 110-4 through 110-7.
In other implementations, a plurality of internal interconnects 140 may communicate with each other. In this case, the task scheduler 120 may be connected to only one internal interconnect 140, and at the time of task scheduling, transmit a task scheduling signal to only the internal interconnect 140 connected thereto, and forward the task scheduling signal to another internal interconnect 140 corresponding to the physical processor cluster 130 as a task scheduling target by the internal interconnect 140. For example, as shown in FIG. 1, assume that the task scheduler 120 is connected to only the first internal interconnect 140-0, and that the first internal interconnect 140-0 is connected to the internal interconnect 140-1, the internal interconnect 140-2, and the internal interconnect 140-3 in this order. In this case, the task scheduler 120 may directly transmit a task scheduling signal to the internal interconnect 140-0, and the internal interconnect 140-0 determines whether the physical processor cluster 130-0 connected thereto is a target of the task scheduling signal by parsing the task scheduling signal. If the physical processor cluster 130-0 is the target of the task scheduling signal, the internal interconnect 140-0 may route the computing task assigned by the task scheduling signal to all of the processors 110-0 through 110-3 in the physical processor cluster 130-0. If the physical processor cluster 130-0 is not the target of the task scheduling signal, the internal interconnect 140-0 may route the task scheduling signal to the next internal interconnect 140-1 and the next internal interconnect 140-1 performs similar operations as the internal interconnect 140-0. And so on until the task scheduling signal reaches the physical processor cluster 130 that is its target.
In this case, if one or more processors 110 fail during the fabrication of the parallel processor chip 100 (which may be determined by test means), since the task scheduler 120 performs the calculation task scheduling with the physical processor clusters as granularity, all the physical processor clusters 130 including the failed processors will not work properly, thereby greatly reducing the number of actually available processors.
To cope with this situation, a hardware fault-tolerance mechanism known as a harvest scheme is commonly used, which allows the whole chip to keep working normally even when it contains a certain number of failed processors, thereby improving the yield of parallel processor chips and reducing manufacturing cost. In the related art, there are two types of harvest schemes for parallel processor chips. The first is to directly mask every physical processor cluster that includes a failed processor when the task scheduler 120 performs task scheduling, and to allocate computing tasks only to physical processor clusters whose processors are all intact (valid). However, with this scheme, if most or all of the physical processor clusters include a failed processor (even just one), very few or no physical processor clusters remain available for the task scheduler 120 to schedule, so the entire parallel processor chip still cannot work normally, and the improvement in chip yield is very limited.
The second harvest scheme is to configure redundant processors 110 in each physical processor cluster 130 so that the number of valid processors in each physical processor cluster 130 still meets the per-cluster processor count defined by the programming model. For example, assuming each physical processor cluster 130 needs to include 4 processors 110, 1 or 2 additional processors 110 may be configured for each physical processor cluster 130 as redundant processors. When 1 or 2 of the 4 processors 110 fail, the failed processors can be replaced with the additionally configured processors 110 to form a physical processor cluster 130 containing 4 valid processors. However, in this scheme the additionally configured redundant processors occupy extra chip area, wasting resources, and the right number of redundant processors is difficult to determine: too many further increase the chip area, while too few may still fail to guarantee that every physical processor cluster 130 is usable.
In view of the above-mentioned problems, the present disclosure provides a task scheduling scheme for a parallel processor chip, in which all effective processors in the chip are divided into a plurality of logical processor clusters as one processor pool, and the task scheduler 120 performs allocation of computing tasks with the logical processor clusters as granularity, so that the chip yield can be improved without changing a programming model.
Fig. 2 shows an exemplary flowchart of a task scheduling method 200 for the parallel processor chip 100 according to an embodiment of the invention. The task scheduling method 200 may be performed by the parallel processor chip 100, more particularly, by the task scheduler 120.
As shown in fig. 2, at block 210, the task scheduler 120 may determine the validity status, i.e., valid or invalid, of each processor 110 of the parallel processor chip 100.
As previously described, processor failures arising during the fabrication of the parallel processor chip 100 may be determined by test means. A commonly used test technique is the DFT (Design for Test) test: various hardware logic that improves the testability (controllability and observability) of a chip is inserted into the chip's original design, and dedicated DFT test software can then initiate tests on the chip and collect the results, making the chip easy to test. DFT testing may be performed at the end of the chip fabrication process, prior to packaging. However, the present invention is not so limited; even after the chip is packaged and delivered, a DFT test or other similar test may be performed on it to determine whether each processor 110 is a valid processor.
In this case, block 210 may specifically include performing a DFT test on the plurality of processors 110 of the parallel processor chip 100 to determine whether each processor 110 is a valid processor or an invalid processor, and setting the result of the DFT test in a status register to indicate the validity status of each processor 110.
Here, a status register harvest_reg dedicated to recording the validity states of the plurality of processors 110 may be configured in the task scheduler 120. The size of the status register harvest_reg equals the number of processors 110 in the parallel processor chip 100, with each bit indicating the validity state of the corresponding processor 110. For example, for the 16 processors 110 shown in FIG. 1, the status register may comprise 16 bits, denoted harvest_reg[15:0], with the lowest bit corresponding to processor 110-0 and the highest bit corresponding to processor 110-15; a bit value of 0 indicates that the corresponding processor 110 is an invalid processor, and a bit value of 1 indicates that it is a valid processor.
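As an illustrative sketch only (the patent does not specify an implementation, and the function and variable names here are hypothetical), decoding such a harvest_reg bitmask into per-processor validity flags could look like this, using the bit convention from the example (bit i corresponds to processor 110-i):

```python
# Sketch: decoding a harvest register bitmask into per-processor validity flags.
# Register width and bit ordering follow the example in the text; names are illustrative.

NUM_PROCESSORS = 16

def decode_harvest_reg(harvest_reg: int) -> list:
    """Return validity flags, index i -> processor 110-i (True = valid)."""
    return [(harvest_reg >> i) & 1 == 1 for i in range(NUM_PROCESSORS)]

# Example from the text: processors 110-0, 110-1, 110-8, and 110-12 are invalid.
harvest_reg = 0b1110_1110_1111_1100
validity = decode_harvest_reg(harvest_reg)
valid_ids = [i for i, v in enumerate(validity) if v]
```

Running this on the example register yields the twelve valid physical processor numbers used in the tables below.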
At block 220, the task scheduler 120 may divide the active processors of the plurality of processors 110 into a plurality of logical processor clusters based on the availability status of each processor.
FIG. 3 illustrates a schematic diagram of logical processor clusters 130' partitioned from the plurality of processors 110 of FIG. 1, according to an embodiment of the present invention. Fig. 3 differs from fig. 1 only in that the physical processor clusters 130 are replaced with newly partitioned logical processor clusters 130'. Further, to indicate the validity status of each processor 110, the invalid processors 110 are shown in dashed boxes. In the example shown in fig. 3, assuming processors 110-0, 110-1, 110-8, and 110-12 are invalid processors, the status register harvest_reg[15:0] = 1110 1110 1111 1100.
More specifically, dividing the valid processors of the plurality of processors 110 into the plurality of logical processor clusters 130' in block 220 may include assigning sequentially increasing processor logical numbers to the valid processors of the plurality of processors 110 based on the validity state of each processor 110, and dividing the valid processors into the plurality of logical processor clusters 130' based on the processor logical numbers and a predetermined number of processors per cluster.
For example, the task scheduler 120 may traverse the status register harvest_reg in order of processor physical number from low to high (i.e., physical numbers 110-0, 110-1, ..., 110-15 for the plurality of processors 110), assigning processor logical number 0 to the first processor 110 whose validity state is 1 (i.e., a valid processor), processor logical number 1 to the second processor 110 whose validity state is 1, and so on, until the entire status register harvest_reg has been traversed. For the processor validity states shown in fig. 3, the following table 1 is obtained:
TABLE 1

Processor physical number | Validity state | Processor logical number
110-0                     | 0              | -
110-1                     | 0              | -
110-2                     | 1              | 0
110-3                     | 1              | 1
110-4                     | 1              | 2
110-5                     | 1              | 3
110-6                     | 1              | 4
110-7                     | 1              | 5
110-8                     | 0              | -
110-9                     | 1              | 6
110-10                    | 1              | 7
110-11                    | 1              | 8
110-12                    | 0              | -
110-13                    | 1              | 9
110-14                    | 1              | 10
110-15                    | 1              | 11
For a processor 110 with a validity state of 0 (i.e., an invalid processor), it may not be assigned a processor logical number or other indicia may be used to indicate that it is an invalid processor.
The task scheduler 120 may then divide the logical processor clusters 130' by processor logical number (instead of processor physical number) and the number of processors per cluster. For example, still assuming a per-cluster processor count of 4, i.e., each processor cluster contains 4 processors, the valid processors of Table 1 may be divided, according to processor logical numbers, into logical processor clusters 130'-0, 130'-1, and 130'-2 as shown in Table 2 below and in FIG. 3.
TABLE 2

Logical processor cluster | Processor logical numbers | Processor physical numbers
130'-0                    | 0, 1, 2, 3                | 110-2, 110-3, 110-4, 110-5
130'-1                    | 4, 5, 6, 7                | 110-6, 110-7, 110-9, 110-10
130'-2                    | 8, 9, 10, 11              | 110-11, 110-13, 110-14, 110-15
Here, dividing the logical processor clusters in increasing order of the processor logical numbers is merely exemplary, and the present invention is not limited thereto; the logical processor clusters may be divided in other manners. For example, the valid processors may be divided into logical processor clusters in an interleaved or random manner, so long as a mapping table between processor logical numbers, processor physical numbers, and logical processor clusters, as shown in Table 2 above, is maintained in the task scheduler 120.
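The logical-numbering and cluster-division steps described above can be sketched as follows. This is an illustrative sketch, not the patent's implementation; it assumes the sequential assignment order of Table 1, and all names are hypothetical:

```python
# Sketch of blocks 210/220: assign sequentially increasing logical numbers to
# valid processors, then group them into logical clusters by logical number.

def assign_logical_numbers(validity):
    """Map physical index -> logical number, traversing physical numbers low to high."""
    logical = {}
    next_num = 0
    for phys, valid in enumerate(validity):
        if valid:                      # invalid processors get no logical number
            logical[phys] = next_num
            next_num += 1
    return logical

def divide_into_clusters(logical, per_cluster=4):
    """Group physical indices into logical clusters in ascending logical-number order."""
    ordered = sorted(logical, key=logical.get)
    return [ordered[i:i + per_cluster] for i in range(0, len(ordered), per_cluster)]

# Validity states from Fig. 3: processors 0, 1, 8, and 12 are invalid.
validity = [i not in (0, 1, 8, 12) for i in range(16)]
clusters = divide_into_clusters(assign_logical_numbers(validity))
```

For the Fig. 3 example this reproduces the three clusters of Table 2: physical processors (2, 3, 4, 5), (6, 7, 9, 10), and (11, 13, 14, 15).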
Next, at block 230, the task scheduler 120 may assign a computing task to at least one logical processor cluster 130' of the plurality of logical processor clusters 130' such that the computing task is routed to the processor 110 corresponding to the at least one logical processor cluster 130 '.
In some embodiments, the task scheduler 120 may assign at least one logical processor cluster to a computing task based on its requirements for parallel processing capabilities.
In some implementations, when the number of processors per cluster is determined for an application in the programming model, each computing task's requirement for parallel processing capability is simultaneously expressed in processor-cluster form; that is, when scheduling each computing task, the task scheduler 120 can know from the task's requirement information how many processor clusters need to be configured for it. Alternatively, the task scheduler 120 may convert a requirement expressed at a finer granularity (e.g., at the processor level) into a processor-cluster-level requirement; the invention is not limited in this regard.
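One simple way to convert a processor-level requirement into a cluster count, as mentioned above, is ceiling division. The conversion rule below is an assumption for illustration only; the patent does not prescribe it, and the names are hypothetical:

```python
# Sketch: converting a task's processor-level parallelism requirement into
# a number of logical processor clusters (assumed ceiling-division rule).
import math

def clusters_needed(required_processors: int, per_cluster: int = 4) -> int:
    """Smallest number of clusters covering the requested processor count."""
    return math.ceil(required_processors / per_cluster)

n = clusters_needed(6)  # a task needing 6 processors with 4 processors per cluster
```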
Of course, other factors, such as availability (e.g., whether idle) of the logical processor clusters, etc., need to be considered in allocating a certain computing task to one or more logical processor clusters, and specific allocation policies may be determined, for example, in a round robin fashion, randomly from among available processor clusters, or according to a certain rule (e.g., the processor cluster with the least workload) from among available processor clusters, etc., which is not limited herein.
The task scheduler 120 may then generate a computational task barrier signal corresponding to the determined at least one logical processor cluster 130' (e.g., logical processor cluster 130'-0) based on the mapping relationship between the plurality of logical processor clusters 130' (e.g., logical processor clusters 130'-0 through 130'-2) and the plurality of processors 110 (e.g., processors 110-0 through 110-15) of the parallel processor chip 100, wherein the computational task barrier signal indicates the processor 110 corresponding to the at least one logical processor cluster 130'.
For example, assuming the task scheduler 120 assigns logical processor cluster 130'-0 to the computing task, the corresponding physical processors are determined, from the mapping relationship between the logical processor clusters 130' and the processors 110 shown in Table 2 above, to be processor 110-2, processor 110-3, processor 110-4, and processor 110-5. In this case, the computational task barrier signal generated by the task scheduler 120 may be the 16-bit string 0000 0000 0011 1100, where the bits with value 1 correspond (in order from low to high) to processor 110-2, processor 110-3, processor 110-4, and processor 110-5, respectively.
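The barrier-signal generation described above amounts to setting one bit per physical processor of the assigned cluster. A minimal sketch, with the bit convention of the example (bit i corresponds to processor 110-i) and hypothetical names:

```python
# Sketch: building the computing-task barrier bitmask from the physical
# processor numbers of the assigned logical cluster.

def barrier_signal(cluster_physical_ids, width=16):
    """Set bit i for each physical processor 110-i in the assigned cluster."""
    signal = 0
    for phys in cluster_physical_ids:
        signal |= 1 << phys
    return signal

# Logical cluster 130'-0 maps to physical processors 110-2..110-5 (Table 2).
sig = barrier_signal([2, 3, 4, 5])
```

For the Table 2 example this yields exactly the bit string 0000 0000 0011 1100 from the text.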
By the method, the effective processors of the parallel processor chip can be dynamically divided into the logic processor clusters, so that the computing task scheduling can be performed on the chip by taking the logic processor clusters as granularity under the condition of not changing a programming model.
Further, in some embodiments, the method 200 may further include routing the computational tasks to the corresponding processors 110 through the plurality of internal interconnects 140 of the parallel processor chip 100 according to the computational task barrier signals described above.
FIG. 4 illustrates an exemplary flow diagram of a computing task routing process 240 according to an embodiment of the present invention. Here, it is assumed that the parallel processor chip 100 includes a plurality of internal interconnections 140 (e.g., a first internal interconnection 140-0, a second internal interconnection 140-1, a third internal interconnection 140-2, and a fourth internal interconnection 140-3, wherein each internal interconnection 140 is connected to a corresponding physical processor cluster 130) as shown in fig. 1 and 3.
As shown in FIG. 4, at block 241, the first internal interconnect 140-0 may obtain the lowest N bits of the computing task barrier signal and at block 242 determine whether the computing task is assigned to one or more processors 110 in the physical processor cluster 130-0 corresponding to the first internal interconnect 140-0. Here, N is the number of processors 110 (i.e., the number of processors per cluster) included in the physical processor cluster 130, such as n=4 in the examples of fig. 1 and 3.
For example, assuming that the computing task barrier signal generated by the task scheduler 120 at block 230 is the 16-bit string 0000 0000 0011 1100, the first internal interconnect 140-0 may obtain the lowest 4 bits 1100 of the signal and determine, from the bits with a bit value of 1 among those lowest 4 bits, that the computing task is assigned to two processors in the physical processor cluster 130-0 corresponding to the first internal interconnect 140-0, namely processor 110-2 and processor 110-3.
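The lowest-N-bits check at blocks 241 and 242 amounts to masking the signal and scanning for set bits. Below is a minimal sketch under the N=4 assumption; the function name and return convention are illustrative, not taken from the patent.

```python
# Sketch of how an interconnect inspects its slice of the barrier signal.
N = 4  # assumed processors per physical cluster, as in the FIG. 1/3 example

def local_assignment(signal: int, n: int = N) -> list[int]:
    """Return the local processor indices (0..n-1) whose bits are set
    in the lowest n bits of the barrier signal."""
    low = signal & ((1 << n) - 1)      # keep only the lowest N bits
    return [i for i in range(n) if low & (1 << i)]

sig = 0b0000000000111100
print(local_assignment(sig))  # -> [2, 3]
```

Local indices 2 and 3 within cluster 130-0 correspond to processor 110-2 and processor 110-3 in the running example.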
If it is determined that the computing task is assigned to one or more processors 110 (e.g., processor 110-2 and processor 110-3) in the physical processor cluster 130-0 corresponding to the first internal interconnect 140-0, then the first internal interconnect 140-0 may route the computing task to processor 110-2 and processor 110-3 at block 243.
Next, at block 244, the first internal interconnect 140-0 may successively right-shift the computing task barrier signal by N bits, obtain the lowest N bits after each shift, and pass them respectively to the internal interconnects following the first internal interconnect 140-0 (e.g., the second internal interconnect 140-1, the third internal interconnect 140-2, and the fourth internal interconnect 140-3). Note that the operations of block 244 are performed regardless of whether the determination at block 242 is "yes" or "no", so that whether the computing task is assigned to one or more processors 110 in the other physical processor clusters 130 can likewise be determined based on the computing task barrier signal.
For example, where the computing task barrier signal is the 16-bit string 0000 0000 0011 1100, the first internal interconnect 140-0 right-shifts the signal by 4 bits, obtains the resulting lowest 4 bits 0011 (i.e., bits 5-8 of the bit string, counted from low to high), and passes these 4 bits 0011 to the second internal interconnect 140-1. It then right-shifts the signal by another 4 bits, obtains the lowest 4 bits 0000 (bits 9-12 from low to high), and passes them to the third internal interconnect 140-2. Finally, it right-shifts the signal by 4 bits once more, obtains the lowest 4 bits 0000 (bits 13-16 from low to high), and passes them to the fourth internal interconnect 140-3.
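The successive right-shift-and-forward behavior of block 244 can be modeled as slicing the barrier signal into N-bit chunks, one per interconnect. This is a hypothetical illustration assuming four interconnects and N=4; the function name is not from the patent.

```python
# Sketch of the shift-and-forward distribution across four interconnects.
N = 4  # assumed processors per physical cluster

def split_barrier(signal: int, clusters: int = 4, n: int = N) -> list[int]:
    """Split the barrier signal into per-interconnect N-bit slices:
    slice 0 stays at the first interconnect, the rest are forwarded."""
    slices = []
    for _ in range(clusters):
        slices.append(signal & ((1 << n) - 1))  # take the lowest N bits
        signal >>= n                            # right-shift by N for the next slice
    return slices

sig = 0b0000000000111100
print([f"{s:04b}" for s in split_barrier(sig)])  # -> ['1100', '0011', '0000', '0000']
```

Slice 0 (1100) is consumed locally; slices 0011, 0000, and 0000 are the values forwarded to the second, third, and fourth interconnects in the example above.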
Next, the other internal interconnects 140 may perform operations similar to those of the first internal interconnect 140-0 (i.e., the operations of blocks 242 through 244 described above): each determines, based on the N-bit signal received from the first internal interconnect 140-0, whether the computing task is assigned to one or more processors 110 in the physical processor cluster 130 corresponding to that internal interconnect 140, and, if so, routes the computing task to the corresponding processors.
For example, the second internal interconnect 140-1 may determine from the received 4-bit signal 0011 that the computing task is assigned to two processors in its corresponding physical processor cluster 130-1, namely processor 110-4 and processor 110-5, and route the computing task to them. The third internal interconnect 140-2 and the fourth internal interconnect 140-3, by contrast, may each determine from the received 4-bit signal 0000 that the computing task is not assigned to any processor 110 in their corresponding physical processor clusters 130-2 and 130-3.
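Putting the pieces together, the overall routing decision of FIG. 4 can be modeled as decoding each interconnect's slice back into physical processor numbers, assuming (for illustration only) that cluster k's interconnect serves processors k*N through k*N+N-1. This sketch is not the patent's hardware logic, only a software model of it.

```python
# End-to-end sketch: map each interconnect to the physical processors
# it routes the computing task to, given the barrier signal.
N = 4  # assumed processors per physical cluster

def route(signal: int, clusters: int = 4, n: int = N) -> dict[int, list[int]]:
    """Return {interconnect index: [physical processor numbers]} for
    every interconnect that has at least one assigned processor."""
    routes = {}
    for k in range(clusters):
        low = signal & ((1 << n) - 1)                      # this interconnect's slice
        targets = [k * n + i for i in range(n) if low & (1 << i)]
        if targets:
            routes[k] = targets
        signal >>= n                                       # forward the rest
    return routes

print(route(0b0000000000111100))  # -> {0: [2, 3], 1: [4, 5]}
```

The output matches the running example: interconnect 140-0 routes to processors 110-2 and 110-3, interconnect 140-1 to processors 110-4 and 110-5, and interconnects 140-2 and 140-3 route nothing.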
With the scheme of the present invention, all valid processors in the parallel processor chip are treated as a processor pool and divided into a plurality of logical processor clusters, and the task scheduler allocates computing tasks at logical-processor-cluster granularity. As a result, more usable processor clusters can be obtained for the same number of invalid processors, which improves chip yield without requiring any change to the programming model. In addition, because the mapping relationship between the logical processor clusters and the processors is stored in the task scheduler, the task scheduler can allocate tasks at logical-processor-cluster granularity without complex logic.
Those skilled in the art will appreciate that the parallel processor chip 100 shown in the figures is merely illustrative and may contain more or fewer components.
The parallel processor chip 100 and the task scheduling method 200 performed in the parallel processor chip 100 according to the present disclosure are described above with reference to the accompanying drawings. Those skilled in the art will appreciate that the parallel processor chip 100 need not include all of the components shown in the figures, may include only some of them or additional components necessary to perform the functions described in this disclosure, that the manner of connection of these components is not limited to the form shown, and that the method 200 may include further steps not shown in the figures.
The present invention may be embodied as a method, a computing device (e.g., the parallel processor chip 100 or the task scheduler 120 of the parallel processor chip 100), a computer-readable storage medium, and/or a computer program product. The computer-readable storage medium has computer program code stored thereon which, when executed, carries out the methods of the present disclosure. The computer program product comprises a computer program which, when executed, performs the methods of the present disclosure. The computing device may include at least one processor and at least one memory coupled to the at least one processor, the at least one memory storing instructions for execution by the at least one processor. When the instructions are executed by the at least one processor, the computing device may perform the methods described above.
In one or more exemplary designs, the functions described in this disclosure may be implemented in hardware, software, firmware, or any combination thereof. For example, if implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
Those of ordinary skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both.
The previous description of the disclosure is provided to enable any person of ordinary skill in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. A task scheduling method for a parallel processor chip, comprising: determining a validity state of each of a plurality of processors of the parallel processor chip; dividing valid processors of the plurality of processors into a plurality of logical processor clusters based on the validity state of each processor; and allocating a computing task to at least one logical processor cluster of the plurality of logical processor clusters, so that the computing task is routed, based on a mapping relationship between the plurality of logical processor clusters and the plurality of processors, to the processors corresponding to the at least one logical processor cluster, wherein allocating the computing task to the at least one logical processor cluster of the plurality of logical processor clusters comprises: allocating the at least one logical processor cluster to the computing task according to the computing task's requirement for parallel processing capability; and generating, based on the mapping relationship between the plurality of logical processor clusters and the plurality of processors, a computing task barrier signal corresponding to the at least one logical processor cluster, wherein the computing task barrier signal indicates the processors corresponding to the at least one logical processor cluster, and wherein the method further comprises: routing the computing task to the corresponding processors through a plurality of internal interconnects of the parallel processor chip according to the computing task barrier signal, wherein the plurality of internal interconnects are respectively connected to a plurality of physical processor clusters of the plurality of processors, and routing the computing task to the corresponding processors through the plurality of internal interconnects of the parallel processor chip according to the computing task barrier signal comprises: a first internal interconnect obtaining the lowest N bits of the computing task barrier signal and determining whether the computing task is allocated to one or more processors in the physical processor cluster corresponding to the first internal interconnect, where N is the number of processors included in each physical processor cluster; if it is determined that the computing task is allocated to one or more processors in the physical processor cluster corresponding to the first internal interconnect, routing the computing task to the one or more processors; and obtaining the lowest N bits of the computing task barrier signal after it is successively right-shifted by N bits, and passing them respectively to the internal interconnects following the first internal interconnect, to determine whether the computing task is allocated to one or more processors in the physical processor clusters corresponding to those internal interconnects. 2. The method of claim 1, wherein determining the validity state of each of the plurality of processors of the parallel processor chip comprises: performing a DFT test on the plurality of processors to determine whether each processor is a valid processor or an invalid processor; and setting the result of the DFT test in a status register to indicate the validity state of each processor. 3. The method of claim 1, wherein dividing the valid processors of the plurality of processors into the plurality of logical processor clusters based on the validity state of each processor comprises: assigning sequentially increasing processor logical numbers to the valid processors of the plurality of processors based on the validity state of each processor; and dividing the valid processors into the plurality of logical processor clusters based on the processor logical numbers and a predetermined number of processors per cluster. 4. The method of claim 3, further comprising: establishing and storing a mapping relationship among the processor logical numbers of the plurality of processors, the physical numbers of the plurality of processors, and the logical processor clusters. 5. A computing device, comprising: at least one processor; and at least one memory coupled to the at least one processor and storing instructions for execution by the at least one processor, the instructions, when executed by the at least one processor, causing the computing device to perform the steps of the method of any one of claims 1 to 4. 6. A computer-readable storage medium having computer program code stored thereon, the computer program code, when executed, performing the method of any one of claims 1 to 4. 7. A computer program product comprising a computer program which, when executed by a machine, performs the method of any one of claims 1 to 4.
CN202410658287.0A 2024-05-24 2024-05-24 Task scheduling method for parallel processor chip, computing device, computer readable storage medium and computer program product Active CN118277101B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410658287.0A CN118277101B (en) 2024-05-24 2024-05-24 Task scheduling method for parallel processor chip, computing device, computer readable storage medium and computer program product


Publications (2)

Publication Number Publication Date
CN118277101A CN118277101A (en) 2024-07-02
CN118277101B (en) 2025-05-06

Family

ID=91634759


Country Status (1)

Country Link
CN (1) CN118277101B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101278264A (en) * 2005-09-28 2008-10-01 英特尔公司 Reliable computing with a many-core processor
CN118034924A (en) * 2024-02-08 2024-05-14 北京灵汐科技有限公司 Data processing method and device based on many-core system, electronic equipment and medium

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
JP2853736B2 (en) * 1996-04-26 1999-02-03 日本電気株式会社 Cluster number conversion circuit
JP2002182997A (en) * 2000-12-15 2002-06-28 Hitachi Ltd Communication processing method, device for executing the same, and recording medium storing the processing program
CN114546630B (en) * 2020-11-24 2025-07-25 北京灵汐科技有限公司 Task processing method and device, electronic equipment and medium
CN112835719B (en) * 2021-02-10 2023-10-31 北京灵汐科技有限公司 Method and device for task processing, many-core system and computer readable medium
CN116048740A (en) * 2021-10-28 2023-05-02 北京灵汐科技有限公司 Task scheduling method and system based on many-core system, electronic equipment and medium
CN117215740A (en) * 2023-08-28 2023-12-12 蔚来汽车科技(安徽)有限公司 Method and system for constructing scheduling model, and method and system for scheduling target task
CN118011183B (en) * 2024-02-21 2024-07-19 北京力通通信有限公司 Parallel scheduling method, equipment and medium for real-time response multi-chip multi-test task




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant