
CN114548389A - Management method and corresponding processor of computing unit in heterogeneous computing - Google Patents

Management method and corresponding processor of computing unit in heterogeneous computing

Info

Publication number: CN114548389A
Authority: CN (China)
Prior art keywords: command, computing unit, computing, register, bit
Legal status: Granted
Application number: CN202210100383.4A
Other languages: Chinese (zh)
Other versions: CN114548389B (en)
Inventor: 马亮
Current Assignee: Shanghai Denglin Technology Co Ltd
Original Assignee: Shanghai Denglin Technology Co Ltd
Application filed by: Shanghai Denglin Technology Co Ltd
Priority: CN202210100383.4A (patent/CN114548389B/en)
Publications: CN114548389A; CN114548389B (granted)
Current legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003: Arrangements for executing specific machine instructions
    • G06F9/3004: Arrangements for executing specific machine instructions to perform operations on memory
    • G06F9/30047: Prefetch instructions; cache control instructions
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098: Register arrangements
    • G06F9/30101: Special purpose registers
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present disclosure provides a method for managing computing units in heterogeneous computing and a corresponding processor. Each computing unit is provided with a corresponding command management register and command status register. Each bit of the command management register indicates whether the corresponding entry in the command cache of the associated computing unit is idle, and each bit of the command status register indicates the execution status of the corresponding command in that computing unit. The scheme manages multiple computing units on the basis of the command management registers and command status registers, improves the efficiency with which the heterogeneous computing processor executes computing commands, and has good scalability.

Description

Method for managing computing units in heterogeneous computing and corresponding processor

Technical Field

The present application relates to high-performance computing and heterogeneous computing, and in particular to a method for managing computing units in heterogeneous computing and a corresponding processor.

Background

The statements in this section merely provide background information related to the technical solutions of the present application to aid understanding; they do not necessarily constitute prior art with respect to those technical solutions.

In recent years, deep neural networks have become very popular. Their computing tasks are characterized by large data volumes, heavy computation, and many kinds of operations. The computing tasks of a deep neural network are usually organized as a directed acyclic dataflow graph, and they may be executed serially or in parallel. According to the topology of the graph, these computing tasks are typically distributed across multiple command lists. Computing tasks that must execute serially are usually placed in the same command list so that the hardware executes them in order, whereas computing tasks that can execute in parallel are usually placed in several different command lists so that the hardware executes them in parallel and improves computing efficiency. An existing central processing unit (CPU) that executes instructions serially is very inefficient at running parallel algorithms. This has given rise to the heterogeneous computing architecture of "CPU + hardware accelerator", in which the hardware accelerator (also called a coprocessor) is dedicated to handling large volumes of computing tasks, while other, non-computing tasks are handled by the CPU (also called the main processor).

Among the various hardware accelerators for deep neural networks, a heterogeneous computing processor has multiple dedicated hardware computing units that accelerate particular computing tasks, and therefore offers higher computing efficiency and lower power consumption than a conventional graphics processing unit (GPU). However, because a heterogeneous processor contains many kinds of computing units, each with different command-execution characteristics, managing the individual computing units poses many challenges.

It should be noted that the above content is intended only to aid understanding of the technical solutions of the present application and is not a basis for assessing the prior art of the present application.

Summary of the Invention

The purpose of the present application is to provide a method for managing computing units in heterogeneous computing and a heterogeneous computing processor that manage multiple computing units in a simple and efficient manner and improve the execution efficiency of computing commands.

The above purpose is achieved through the following technical solutions:

According to a first aspect of the embodiments of the present application, a method for managing computing units in heterogeneous computing is provided. Each computing unit is provided with a corresponding command management register and command status register; each bit of the command management register indicates whether the corresponding entry in the command cache of the corresponding computing unit is idle, and each bit of the command status register indicates the execution status of the corresponding command in the corresponding computing unit. The method includes:

determining, according to the type of a command to be executed, the command management registers corresponding to the computing units suitable for executing commands of that type; selecting, from the determined command management registers, a command management register that has a bit indicating the idle state, using the index of one such bit within the command management register as the sequence number of the command to be executed, and setting that bit to indicate the non-idle state; sending the command to be executed and its sequence number to the computing unit corresponding to the selected command management register, and setting the bit corresponding to the sequence number in the command status register of that computing unit to indicate the incomplete state; and, in response to receiving a command completion signal from a computing unit, setting the bit corresponding to the sequence number of the completed command in the command status register of that computing unit to indicate the completed state, and setting the bit corresponding to the sequence number of the completed command in the command management register of that computing unit to indicate the idle state.
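The four steps of the first aspect can be sketched in software as follows. This is only an illustrative model, not the hardware design: the class `UnitRegisters`, the functions `dispatch` and `complete`, the first-fit choice of unit, and the choice of the lowest idle bit are all assumptions made for the example. The bit convention (1 = idle in the management register, 1 = completed in the status register) follows the example convention adopted in the detailed description.

```python
class UnitRegisters:
    """Hypothetical software model of one computing unit's register pair."""

    def __init__(self, kind: str, depth: int) -> None:
        self.kind = kind              # command type this unit can execute
        self.depth = depth            # number of command-cache entries
        self.mgmt = (1 << depth) - 1  # management register: 1 = entry idle
        self.status = 0               # status register: 1 = command completed


def dispatch(units, kind):
    """Claim an idle command-cache entry on a unit of the requested kind.

    Returns (unit_id, seq), or None if no suitable unit has an idle entry.
    """
    for unit_id, unit in enumerate(units):
        if unit.kind != kind or unit.mgmt == 0:
            continue
        # Index of the lowest idle bit becomes the command's sequence number.
        seq = (unit.mgmt & -unit.mgmt).bit_length() - 1
        unit.mgmt &= ~(1 << seq)    # entry is now occupied
        unit.status &= ~(1 << seq)  # command is not yet completed
        return unit_id, seq
    return None


def complete(units, unit_id, seq):
    """Handle a completion signal: mark the command done and free its entry."""
    unit = units[unit_id]
    unit.status |= 1 << seq
    unit.mgmt |= 1 << seq
```

With one two-entry "matrix" unit, two calls to `dispatch(units, "matrix")` claim sequence numbers 0 and 1, and a third returns `None` until `complete` frees an entry.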

In some embodiments, selecting a command management register that has a bit indicating the idle state may include: reading the value of each command management register to determine the number of its bits that indicate the idle state; and selecting the command management register with the largest number of bits indicating the idle state.
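A minimal sketch of this selection rule, assuming that registers are modelled as Python integers and that a 1 bit means idle; the function name `pick_most_idle` is hypothetical. Counting the 1 bits of each value gives the number of idle entries.

```python
def pick_most_idle(mgmt_regs):
    """Return the index of the register with the most idle (1) bits.

    Returns None if the list is empty or no register has an idle bit.
    Ties are resolved in favour of the lowest index (an assumption).
    """
    best = max(
        range(len(mgmt_regs)),
        key=lambda i: bin(mgmt_regs[i]).count("1"),
        default=None,
    )
    if best is None or mgmt_regs[best] == 0:
        return None
    return best
```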

In some embodiments, selecting a command management register that has a bit indicating the idle state may include: selecting such a command management register in a round-robin manner.
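Round-robin selection can be sketched as below, again under the assumption that a 1 bit means idle. The function name `pick_round_robin` and the `last` parameter (the index chosen on the previous dispatch) are illustrative and not taken from the source.

```python
def pick_round_robin(mgmt_regs, last):
    """Return the next register after index `last` that has an idle bit.

    The search wraps around and may return `last` itself if it is the
    only register with an idle entry. Returns None if none is available.
    """
    n = len(mgmt_regs)
    for step in range(1, n + 1):
        idx = (last + step) % n
        if mgmt_regs[idx] != 0:
            return idx
    return None
```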

In some embodiments, the method may further include: setting the state of a computing unit to unavailable by setting the value of its corresponding command management register.

In some embodiments, the widths of the command management register and the command status register depend on the length of the command cache of the corresponding computing unit.

In some embodiments, the method may further include: in response to receiving a command completion signal from a computing unit, setting, after a preset period of time has elapsed, the bit corresponding to the sequence number of the completed command in the command management register of that computing unit to indicate the idle state.

In some embodiments, the method may further include: recording the correspondence between the sequence number of the command to be executed and the hardware command queue to which the command belongs.

In some embodiments, the method may further include: in response to receiving a command completion signal from a computing unit, looking up, according to the sequence number of the completed command, the hardware command queue corresponding to that command, and feeding back to that queue a signal indicating that the command has completed.
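The recording and lookup described in the two preceding embodiments can be sketched as a small table. The class `CompletionRouter` is hypothetical. Because a sequence number is an index within one unit's register, the sketch keys the table by the pair (unit, sequence number); that keying is an assumption, since the source only speaks of the sequence number.

```python
class CompletionRouter:
    """Hypothetical table mapping an in-flight command to its hardware queue."""

    def __init__(self) -> None:
        self._owner = {}  # (unit_id, seq) -> queue_id

    def record(self, unit_id, seq, queue_id):
        """Remember which hardware command queue issued this command."""
        self._owner[(unit_id, seq)] = queue_id

    def on_complete(self, unit_id, seq):
        """Return the queue to notify for a completed command, if known.

        The entry is removed so the sequence number can be reused later.
        """
        return self._owner.pop((unit_id, seq), None)
```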

In some embodiments, the method may further include: consecutively dispatching multiple commands to be executed from the same hardware queue to the computing units for processing.

In some embodiments, the method may further include: in response to receiving a command completion signal from a computing unit, determining whether the completed command still needs to wait for other commands; and, only when it is determined that the command does not need to wait for other commands, setting the bit corresponding to the sequence number of the completed command in the command management register of that computing unit to indicate the idle state.

In some embodiments, the method may further include: for commands to be executed that come from multiple hardware command queues, dispatching each command to be executed to the computing units according to a preset priority.

According to a second aspect of the embodiments of the present application, a heterogeneous computing processor is provided. It includes a controller, multiple computing units of different types, and a command management register and a command status register corresponding to each computing unit. Each bit of the command management register indicates whether the corresponding entry in the command cache of the corresponding computing unit is idle, and each bit of the command status register indicates the execution status of the corresponding command in the corresponding computing unit. The controller is configured to perform the method of the first aspect described above.

The technical solutions of the embodiments of the present application may provide the following beneficial effects:

A computing unit management method based on command management registers and command status registers is proposed. It requires no complex software programming and avoids the idle waiting caused by reading external memory. It also achieves non-blocking execution of command queues in a simple way, which improves the efficiency with which the heterogeneous computing processor executes computing commands. In addition, this register-based computing unit management mechanism improves the robustness of the heterogeneous computing processor and has good scalability.

It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present application.

Brief Description of the Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and, together with the description, serve to explain its principles. The drawings described below show only some embodiments of the present application; a person of ordinary skill in the art can derive other drawings from them without creative effort. In the drawings:

FIG. 1 is a schematic block diagram of a heterogeneous computing processor according to an embodiment of the present application.

FIG. 2 is a schematic block diagram of a front-end control engine in a heterogeneous computing processor according to an embodiment of the present application.

FIG. 3 is a schematic flowchart of a method for managing computing units in heterogeneous computing according to an embodiment of the present application.

Detailed Description

To make the purpose, technical solutions, and advantages of the present application clearer, the present application is described in further detail below through specific embodiments with reference to the accompanying drawings. It should be understood that the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in the present application without creative effort fall within the scope of protection of the present application.

Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of the embodiments of the present application. However, those skilled in the art will appreciate that the technical solutions of the present application may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so on. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail, to avoid obscuring aspects of the present application.

The block diagrams shown in the drawings are merely functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

The flowcharts shown in the drawings are only illustrative. They need not include all contents and operations/steps, nor must the operations/steps be performed in the order described. For example, some operations/steps may be decomposed and others may be combined or partially combined, so the actual order of execution may change according to the actual situation.

In a heterogeneous computing architecture, the main processor can distribute the computing tasks to be executed across multiple command lists and instruct the corresponding coprocessor to execute them. These command lists can be stored in memory for the coprocessor to read, parse, and execute. In the embodiments of the present application, a heterogeneous computing processor acts as the coprocessor that executes the command lists prepared by the main processor.

FIG. 1 is a schematic block diagram of a heterogeneous computing processor according to an embodiment of the present application.

The heterogeneous computing processor includes at least two different types of computing units, for example a general-purpose graphics processing unit, a computing unit dedicated to matrix addition and multiplication, a computing unit dedicated to vector operations and differentiation operations, and so on. Each computing unit has a command cache that can hold multiple commands, which avoids wasting computing resources during gaps in command transmission. The command cache includes multiple entries, and each received command is stored in one entry. Because different computing units differ in their capacity to receive commands, the number of entries in each unit's command cache (which can be understood as the depth, length, or size of the command cache) also differs.

The heterogeneous computing processor further includes a host interface, a front-end control engine, a command network, and a memory interface. The host interface is mainly used for command or data communication between the heterogeneous processor and the main processor. The memory interface is used for communication between the heterogeneous computing processor and memory. The front-end control engine is connected to the computing units through the command network and dispatches the command lists to be executed to the computing units for execution. The command network may be implemented as an on-chip data switching network; no limitation is imposed here.

In a heterogeneous computing processor, managing different kinds of computing units poses many challenges. For example, different kinds of computing units may differ in their capacity to receive commands and in the order in which they execute them, so scheduling must take account of the command-execution characteristics of each kind of computing unit. Command queues executing in parallel may call on the same computing unit at the same time, so contention between command queues must be handled. When multiple computing units need to cooperate on a computing task, the waiting relationships between the computing units must be handled. The embodiments of the present application, described in detail below, provide a computing unit management method for heterogeneous computing processors that can improve the execution efficiency of computing commands on the heterogeneous processor while reducing the complexity of managing the computing units, and that has good scalability.

FIG. 2 is a schematic block diagram of a front-end control engine in a heterogeneous computing processor according to an embodiment of the present application. The front-end control engine can be used to read and parse command lists and to dispatch commands according to the parsing results.

As shown in FIG. 2, the front-end control engine includes a hardware queue scheduling module, a computing management module, and multiple hardware command queues that can execute in parallel. The hardware queue scheduling module assigns the command lists to be executed to the hardware command queues. Because the number of command lists that the main processor creates for its computing tasks and stores in memory may exceed the number of hardware command queues in the heterogeneous computing processor, the hardware queue scheduling module has to schedule the hardware command queues, for example by sending command lists to idle hardware command queues according to a scheduling policy. Each hardware command queue parses the commands it receives and then dispatches them to the computing units via the computing management module. Each hardware command queue may also independently read a software-prepared command list from memory, parse its commands, and dispatch them to the computing management module. The computing management module sends each command to a suitable computing unit for processing according to the type of the command and the command-execution state of each computing unit.

In the embodiments of the present application, the computing management module monitors the command-execution state of each computing unit by providing two registers for each computing unit. One register records the state of the computing unit's command cache and is therefore called the command management register. The other records the execution state of the computing unit's commands and is therefore called the command status register.

In this embodiment, each bit of the command management register corresponding to a computing unit corresponds to one entry in that unit's command cache and indicates whether the entry is idle. For example, a value of 1 may indicate that the entry is unallocated and can receive a computing command, and a value of 0 that the entry is occupied and cannot receive a computing command; the opposite convention is equally possible.

In the description below, each bit of the command management register is taken, by way of example, to indicate idle when its value is 1 and occupied when its value is 0.

The number of bits in a command management register (that is, the width of the register) depends on the depth of the computing unit's command cache. Because different computing units differ in their capacity to receive commands, their command caches differ in depth, and the widths of their command management registers may differ accordingly. If every bit of a computing unit's command management register is 1, all entries of that unit's command cache are idle and its current capacity to receive commands is at its maximum; if every bit is 0, the computing unit cannot receive new commands. Thus, when the computing management module is to send a computing command to a computing unit, it can first determine, from the type of the command, which computing units are able to process that type of command, then query the command management registers of those computing units and select one that can currently receive commands as the destination. If no computing unit of that type can receive a new command, the computing management module may block that computing command, but subsequent computing commands of other types can still be sent. Computing units that still have free capacity can therefore receive new computing commands and keep running, and the command network is not blocked, which improves the execution efficiency of the heterogeneous computing processor.
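The type-based, non-blocking behaviour described above can be sketched as a single issue pass over pending commands. The function `issue_pass`, the tuple layout of commands, and the dictionary of registers grouped by type are assumptions made for illustration; a 1 bit means idle, as in the description's example convention.

```python
def issue_pass(pending, mgmt_by_type):
    """Try to issue each pending command once, holding back blocked types.

    pending:      list of (name, kind) tuples in arrival order
    mgmt_by_type: dict mapping kind -> list of management register values
                  (one int per computing unit of that kind, 1 = idle entry)
    Returns (issued, held); issued items are (name, kind, unit_index, seq).
    """
    issued, held = [], []
    for name, kind in pending:
        regs = mgmt_by_type.get(kind, [])
        target = next((i for i, r in enumerate(regs) if r != 0), None)
        if target is None:
            # No unit of this kind can accept a command: hold only this one.
            held.append((name, kind))
            continue
        seq = (regs[target] & -regs[target]).bit_length() - 1
        regs[target] &= ~(1 << seq)  # claim the idle entry
        issued.append((name, kind, target, seq))
    return issued, held
```

In this sketch a full "matrix" unit holds back only matrix commands; a later "vector" command in the same pass is still issued if a vector unit has an idle entry.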

继续参考图2,与每个计算单元对应的命令状态寄存器与上述的命令管理寄存器宽度可以相同。该命令状态寄存器中,每个比特位对应该计算单元的一个计算命令的执行状态,用来记录该命令的完成情况。例如,当值为1时表示计算命令已经完成,值为0时表示计算命令正在执行或者没有启动,反之也可以。Continuing to refer to FIG. 2 , the command status register corresponding to each computing unit may have the same width as the above-mentioned command management register. In the command status register, each bit corresponds to the execution status of a computing command of the computing unit, and is used to record the completion of the command. For example, when the value is 1, it means that the calculation command has been completed, and when the value is 0, it means that the calculation command is being executed or not started, and vice versa.

在下文中,对于命令状态寄存器的每个比特位,以值为1指示命令已完成,值为0指示命令还没有完成或未启动为例来进行说明。在一个实施例中,命令状态寄存器中各个比特位的初始值全为0,当一个计算命令完成后,相应的比特位会被设为1。In the following, for each bit of the command status register, a value of 1 indicates that the command has been completed, and a value of 0 indicates that the command has not been completed or not started as an example for description. In one embodiment, the initial value of each bit in the command status register is all 0, and when a calculation command is completed, the corresponding bit is set to 1.

Through the cooperation of the command status register and the command management register described above, commands from the same hardware command queue can be executed in a non-blocking or out-of-order manner. That is, a hardware command queue does not need to issue one command in queue order, block until that command completes, and only then issue the next command. In the embodiments of the present application, each hardware command queue can dispatch several commands in succession to a computing unit according to the state of that unit's command management register without blocking immediately, which reduces command waiting time and improves command execution efficiency. The execution state of each command is recorded in the command status register corresponding to the computing unit. When a hardware command queue needs to wait for a particular command to complete, it only needs to query the corresponding bit in the corresponding command status register; no additional storage or tracking is required. In addition, the order in which commands complete within a computing unit or across computing units may also vary. Because the command status register tracks every issued command, issued commands may complete out of order, which further improves the efficiency with which the heterogeneous computing processor executes computing commands.

In this embodiment, the computing management module provides each computing unit with a corresponding command management register and command status register. These registers are visible to all hardware command queues; that is, all computing units of the heterogeneous processor are shared by all hardware command queues, and the computing management module can allocate computing units directly according to the state of each unit's command management register. For example, when two hardware command queues compete for the same computing unit, the computing management module sets that unit's command management register and command status register for every command dispatched to it, which ensures that every computing command of the computing unit executes normally and that no command in the unit's command buffer is modified or replaced by a command from another queue. The command status register and command management register thus act somewhat like a hardware "synchronization lock", resolving contention between hardware command queues for the same computing unit simply and conveniently. If this contention were instead resolved in software, each hardware command queue would often have to query memory repeatedly to obtain the command execution status of the computing unit. Moreover, it is difficult for software to accurately predict the real-time execution of each command, so unnecessary synchronization waits often occur and hardware computing resources are wasted. A software approach also tends to hinder hardware efficiency, because the synchronization instructions used between command queues carry their own overhead: most are implemented by reading and writing semaphores in external memory, which requires repeated memory reads and therefore introduces considerable memory access latency or waiting, reducing the execution efficiency of computing commands.

In some embodiments, the state of a computing unit may also be set, in response to a command or configuration information from upper-layer software or a host processor, by setting the value of the command management register corresponding to that computing unit. This design not only reduces the complexity of managing computing units but can also improve the robustness of the heterogeneous computing processor. For example, if some computing units are unusable because of chip yield problems, the initial value of their command management registers is set to all 0s, meaning that no command can be assigned to those units. With such a simple design, the heterogeneous computing processor chip can continue to work even when some computing units are unavailable. Setting registers in this way also gives the heterogeneous computing processor good scalability: when a new kind of computing unit is added or the number of computing units changes, only the corresponding command management registers and command status registers maintained by the computing management module need to be set or adjusted. The present application does not limit how chip yield is detected; external hardware or external software detection means may be used, and the detection of chip yield should not be construed as limiting the present application.
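As a rough illustration of the yield-handling idea above, the following Python sketch computes initial command management register values from a configuration table. The unit names, depths, and the shape of the `config` dictionary are hypothetical; the source does not specify how the configuration is delivered.

```python
def initial_free_mask(depth: int, usable: bool) -> int:
    """Initial command management register value for one computing unit.

    A usable unit starts with all bits 1 (every buffer entry free); an
    unusable unit starts with all bits 0, so nothing is ever dispatched to it.
    """
    return (1 << depth) - 1 if usable else 0


# Hypothetical configuration from upper-layer software: name -> (depth, usable)
config = {"cu0": (4, True), "cu1": (4, False), "cu2": (8, True)}

free_masks = {name: initial_free_mask(d, ok) for name, (d, ok) in config.items()}
dispatchable = [name for name, mask in free_masks.items() if mask != 0]
```

No separate "disabled" flag is needed in this model: the ordinary free-entry check already excludes a unit whose register is all 0s.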

The process by which the computing management module manages multiple computing units is described in more detail below with reference to the flowchart shown in FIG. 3.

In step S1, the computing management module receives a command to be executed from a hardware command queue. Because a heterogeneous computing processor usually contains multiple computing units of different types, the computing management module first checks, according to the type of the received command, the command management registers corresponding to the computing units suitable for executing that type of command, in order to determine the idle state of those units' command buffers. As mentioned above, each bit of a command management register indicates whether the corresponding entry in the command buffer of the associated computing unit is idle. Again taking 1 to indicate idle and 0 to indicate occupied, reading the value of the command management register readily shows which bits are 1 and which are 0, from which the number of free entries in the computing unit's command buffer, that is, the number of commands the unit can still accept, can be counted.
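Counting free entries from a register value is a population count. A minimal Python sketch, assuming the 1 = idle convention used above (the function names are illustrative):

```python
def free_count(free_mask: int) -> int:
    """Number of bits set to 1, i.e. how many more commands the unit can accept."""
    return bin(free_mask).count("1")


def free_entries(free_mask: int, width: int):
    """Indices of the free command buffer entries, lowest index first."""
    return [i for i in range(width) if (free_mask >> i) & 1]
```

For a 4-bit register holding `0b1010`, entries 1 and 3 are free, so the unit can accept two more commands.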

A heterogeneous computing processor usually contains multiple hardware command queues. In some embodiments, the computing management module may also be configured to arbitrate, according to some algorithm, among requests from multiple hardware command queues to be allocated a computing unit, selecting one hardware command queue to serve first and dispatching the commands in that queue to computing units. In one embodiment, the arbitration algorithm may be a round-robin algorithm, which ensures fairness among the hardware command queues. In another embodiment, the arbitration algorithm may be a greedy algorithm, which ensures that the commands of a particular hardware command queue are executed first. In yet another embodiment, the computing management module may schedule according to preset weights for the hardware command queues, which both gives scheduling a degree of priority and allows the commands of every queue to be sent to computing units, so that no queue starves from being blocked forever.
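The round-robin variant of the arbitration described above might look like the following Python sketch. The class name and the representation of requests as a set of queue indices are assumptions for illustration; the greedy and weighted variants are not shown.

```python
class RoundRobinArbiter:
    """Grants one requesting hardware command queue per call, in rotating order."""

    def __init__(self, num_queues: int):
        self.num_queues = num_queues
        self.last = num_queues - 1  # so that queue 0 is considered first

    def grant(self, requesting):
        """Return the next requesting queue after the last one served, or None."""
        for step in range(1, self.num_queues + 1):
            q = (self.last + step) % self.num_queues
            if q in requesting:
                self.last = q
                return q
        return None  # no queue is requesting
```

Because the search always starts just after the most recently served queue, a queue that keeps requesting cannot be passed over indefinitely.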

As mentioned above, the computing management module can determine the idle state of a computing unit's command buffer by reading the value of that unit's command management register. In step S2, the computing management module may select one command management register from among those that have bits indicating the idle state, and take the computing unit corresponding to that register as the target computing unit for the command to be executed. The index of the command management register is the index of the target computing unit.

In one embodiment, the computing management module may select the command management register according to the number of free entries in the computing unit's command buffer (that is, the number of bits in the command management register that indicate the idle state). For example, a computing unit with more free entries may be selected preferentially, so as to balance the load among the computing units as far as possible.
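A minimal Python sketch of the "most free entries" policy follows. The dictionary layout and the tie-breaking rule (lowest unit index wins) are assumptions; the source does not say how ties are resolved.

```python
def select_most_free(free_masks):
    """Pick the unit whose command management register has the most 1 bits.

    free_masks maps a unit index to its register value (1 = free entry).
    Returns the chosen unit index, or None if every register is all 0s.
    Ties go to the lowest unit index (an assumption of this sketch).
    """
    best, best_count = None, 0
    for idx in sorted(free_masks):
        count = bin(free_masks[idx]).count("1")
        if count > best_count:
            best, best_count = idx, count
    return best
```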

In another embodiment, the computing management module may instead use simple round-robin allocation, which avoids the logic and cycles needed to compare the number of idle bits in each command management register.

For the selected command management register, the computing management module assigns one of its bits that indicates the idle state to the command to be executed. For example, the index of that bit within the command management register is used as the sequence number of the command to be executed (also called the command's identification number, or command id), and the bit is set to indicate the non-idle state. As mentioned above, in the command management register corresponding to a computing unit, each bit indicates the state of one entry of that unit's command buffer; that is, the bits correspond one-to-one with the command buffer entries. In this embodiment, therefore, using the bit's index within the command management register as the sequence number identifying the command simplifies the subsequent buffering of the command and the setting of its execution state. Because the bit's index is also the index of the entry that the command will occupy in the target computing unit's command buffer, the received command can be stored directly in the corresponding buffer entry according to its sequence number. Setting the bit to indicate the non-idle state also ensures that the entry is not taken by another command before it is released. Once the current command has finished executing and its bit has been set back to indicate the idle state, the bit's index within the command management register can be assigned to a new command to be executed. In other words, querying or reading the values of the bits in the command management register is enough to tell whether a given index (or identifier) may currently be assigned to the next command to be executed. This approach is simple and efficient and avoids additional access and query steps.
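The bit-index-as-command-id allocation can be sketched as below. Choosing the lowest free bit is an assumption of this example; the source only requires that some idle bit be chosen.

```python
def allocate_command_id(free_mask: int):
    """Claim one free bit and use its index as the command id.

    Returns (command_id, new_free_mask). If no bit is free, returns
    (None, free_mask) unchanged.
    """
    if free_mask == 0:
        return None, free_mask
    lowest = free_mask & -free_mask          # isolate the lowest set bit
    command_id = lowest.bit_length() - 1     # its index doubles as the command id
    return command_id, free_mask & ~lowest   # clear the bit: entry now occupied
```

Since the id equals the buffer entry index, the receiving unit in this model needs no lookup table to decide where to store the command.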

In some embodiments, the computing management module may also record the correspondence between the sequence number assigned to the command to be executed and the hardware command queue to which the command belongs, so that the execution status of the command can later be reported back to that hardware command queue.

Still referring to FIG. 3, in step S3 the computing management module sends the command to be executed, together with its sequence number, over the command network to the computing unit corresponding to the selected command management register. The computing unit stores the received command and sequence number in the corresponding entry of its command buffer. While sending the command, the computing management module may, directly on the basis of the command's sequence number, set the bit corresponding to that sequence number in the computing unit's command status register to indicate the not-completed state. This prevents the computing management module from later misjudging the command's execution status because state left over from a previous operation was not cleared in time. The computing unit reads a computing command from its command buffer and executes it. When the command finishes executing, the computing unit returns a command completion signal and the sequence number of the completed command to the computing management module over the command network.

In step S4, in response to receiving the command completion signal from the computing unit, the computing management module sets the bit corresponding to the sequence number of the completed command in that unit's command status register to indicate the completed state, showing that the computing command has completed.

In one embodiment, the computing management module may also report the completion of the command back to the corresponding hardware command queue. In such an embodiment, when assigning a sequence number to a command to be executed, the computing management module may record which hardware command queue that sequence number belongs to. Then, when the command completion signal is received, the relevant information can be returned to the corresponding hardware command queue.

In another embodiment, the computing management module may refrain from reporting the completion of the command back to the corresponding hardware command queue. When the execution state of a particular command needs to be known, for example because a hardware command queue is blocked or because commands need to synchronize with one another, the computing management module can actively query the corresponding command status register, which saves hardware register resources.

After receiving the command completion signal from the computing unit, the computing management module may also set the bit corresponding to the sequence number of the completed command in that unit's command management register to indicate the idle state, thereby releasing the corresponding entry in the unit's command buffer.

In one implementation, the computing management module may change the corresponding bit of the command management register to indicate the idle state immediately after receiving the command completion signal, so that the command buffer entry of the computing unit is released quickly and subsequent new commands can proceed.
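Putting steps S2 to S4 together with the immediate-release variant, a software model of the two registers for a single computing unit might look like this. The class and method names, the lowest-free-bit policy, and the `owner` dictionary are illustrative assumptions rather than details from the source.

```python
class UnitRegisters:
    """Software model of one unit's command management and status registers."""

    def __init__(self, depth: int):
        self.free = (1 << depth) - 1  # command management register (1 = free)
        self.done = 0                 # command status register (1 = completed)
        self.owner = {}               # command id -> originating hardware queue

    def dispatch(self, queue_id):
        """Claim a free entry for a command; return its id, or None if full."""
        if self.free == 0:
            return None
        bit = self.free & -self.free
        cmd_id = bit.bit_length() - 1
        self.free &= ~bit             # entry is now occupied
        self.done &= ~bit             # clear any stale completion state
        self.owner[cmd_id] = queue_id
        return cmd_id

    def complete(self, cmd_id):
        """Handle a completion signal; return the queue to notify."""
        bit = 1 << cmd_id
        self.done |= bit              # mark the command completed
        self.free |= bit              # release the buffer entry immediately
        return self.owner.pop(cmd_id)
```

In this model commands may complete in any order: completing command 1 before command 0 simply sets and frees bit 1 while bit 0 stays occupied.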

Considering the synchronization operations that may exist between commands, the computing management module may alternatively update the bits of the command management register after receiving the command completion signal in one of the following ways:

In another implementation, after receiving the command completion signal, the computing management module does not change the corresponding bit of the command management register immediately but waits for a period of time before changing it to indicate the idle state (that is, in response to receiving the command completion signal from the computing unit, and after a preset period of time has elapsed, the bit corresponding to the sequence number of the completed command is set to indicate the idle state in that unit's command management register). This accounts for possible synchronization between commands: when a command completes, other commands may still need to wait on it, so the sequence number of the completed command must be retained in the command management register for some time to allow other commands to determine its completion by querying the corresponding command status register. If synchronization were ignored and the bit for the completed command's sequence number were changed to idle right away, the sequence number would very likely be assigned to a subsequent new command and the corresponding bit in the command status register would be reset, making it impossible to query the execution state of the original command.
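The delayed-release variant can be modeled with a per-command countdown. In this sketch the delay is measured in abstract "ticks", every entry is assumed to be occupied at the start, and the class and method names are hypothetical; the source does not say how the preset period is measured.

```python
class DelayedRelease:
    """Completion is visible at once; the entry is freed only after a delay."""

    def __init__(self, delay: int):
        self.free = 0        # assume every entry is currently occupied
        self.done = 0        # command status register
        self.delay = delay   # preset hold time, in ticks (assumed unit)
        self.pending = {}    # command id -> ticks remaining before release

    def complete(self, cmd_id: int):
        self.done |= 1 << cmd_id          # status bit readable immediately
        self.pending[cmd_id] = self.delay

    def tick(self):
        for cmd_id in list(self.pending):
            self.pending[cmd_id] -= 1
            if self.pending[cmd_id] <= 0:
                self.free |= 1 << cmd_id  # only now can the id be reassigned
                del self.pending[cmd_id]
```

During the hold period the status bit stays set and the id cannot be reused, so a dependent command still has a window in which to observe the completion.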

In yet another implementation, after receiving the command completion signal, the computing management module determines whether the completed command still needs to wait for other commands; only when it determines that the command does not need to wait for other commands does it set the bit corresponding to the sequence number of the completed command to indicate the idle state in that unit's command management register. Compared with waiting a fixed length of time before changing the bit, this implementation accommodates a wider variety of synchronization scenarios and is more flexible.
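One possible reading of the dependency-aware variant is sketched below: the entry is released only once no outstanding synchronization still involves the completed command. Tracking this with a waiter count, and the names `complete` and `observed`, are assumptions of this example; the source does not describe the bookkeeping.

```python
class WaiterAwareRelease:
    """Free an entry only when no synchronization still depends on it."""

    def __init__(self):
        self.free = 0       # assume every entry is currently occupied
        self.done = 0       # command status register
        self.waiters = {}   # command id -> dependents yet to observe completion

    def complete(self, cmd_id: int, waiter_count: int = 0):
        self.done |= 1 << cmd_id
        if waiter_count == 0:
            self.free |= 1 << cmd_id      # nothing depends on it: release now
        else:
            self.waiters[cmd_id] = waiter_count

    def observed(self, cmd_id: int):
        """A dependent command has read the status bit for cmd_id."""
        self.waiters[cmd_id] -= 1
        if self.waiters[cmd_id] == 0:
            del self.waiters[cmd_id]
            self.free |= 1 << cmd_id      # last dependent served: release
```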

Although the hardware queue scheduling module and the computing management module are presented as independent modules in the embodiments above, it should be understood that these modules, or the circuits implementing them, may be realized as part of the control circuit, control logic, or controller of the heterogeneous computing processor, or that the control circuit, control logic, or controller of the heterogeneous computing processor may be configured to perform the functions, steps, or methods described above in connection with the hardware queue scheduling module and the computing management module. Based on the same inventive concept, an embodiment of the present application further provides a heterogeneous computing processor. The heterogeneous computing processor includes a controller, a plurality of computing units of different types, and, for each computing unit, a corresponding command management register and command status register as described above. The controller is configured to implement the functions described above in connection with the hardware queue scheduling module and the computing management module, or to perform the process of managing multiple computing units in heterogeneous computing described in the embodiments above. For other details of the heterogeneous computing processor, refer to the relevant description above, which is not repeated here.

References in this specification to "various embodiments", "some embodiments", "one embodiment", or "an embodiment" mean that a particular feature, structure, or property described in connection with the embodiment is included in at least one embodiment. Thus, appearances of the phrases "in various embodiments", "in some embodiments", "in one embodiment", or "in an embodiment" in various places throughout the specification do not necessarily refer to the same embodiment. Furthermore, particular features, structures, or properties may be combined in any suitable manner in one or more embodiments. A particular feature, structure, or property shown or described in connection with one embodiment may therefore be combined, in whole or in part and without limitation, with the features, structures, or properties of one or more other embodiments, provided the combination is not illogical or inoperable.

The terms "comprising" and "having" and expressions of similar meaning in this specification are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device comprising a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product, or device. "A" or "an" does not exclude a plurality. In addition, the elements in the drawings of the present application are for illustration only and are not drawn to scale.

Although the present application has been described by way of the embodiments above, it is not limited to the embodiments described herein and also includes various changes and variations made without departing from the scope of the present application.

Claims (12)

1. A method for managing computing units in heterogeneous computing, wherein each computing unit is provided with a corresponding command management register and command status register, each bit of the command management register indicates whether a respective entry in the command buffer of the corresponding computing unit is idle, and each bit of the command status register indicates the execution state of a respective command in the corresponding computing unit; the method comprising:
determining, according to the type of a command to be executed, the command management registers corresponding to the computing units suitable for executing commands of that type;
selecting, from the determined command management registers, a command management register having a bit indicating the idle state, using the index of one such bit within the command management register as the sequence number of the command to be executed, and setting that bit to indicate the non-idle state;
sending the command to be executed and its sequence number to the computing unit corresponding to the selected command management register, and setting the bit corresponding to the sequence number in the command status register corresponding to that computing unit to indicate the not-completed state;
in response to receiving a command completion signal from the computing unit, setting the bit corresponding to the sequence number of the completed command in the command status register corresponding to that computing unit to indicate the completed state, and setting the bit corresponding to the sequence number of the completed command in the command management register corresponding to that computing unit to indicate the idle state.

2. The method of claim 1, wherein selecting a command management register having a bit indicating the idle state comprises:
identifying, by reading the value of each command management register, the number of bits in it that indicate the idle state;
selecting the command management register having the largest number of bits indicating the idle state.

3. The method of claim 1, wherein selecting a command management register having a bit indicating the idle state comprises:
selecting, in a round-robin manner, a command management register having a bit indicating the idle state.

4. The method of claim 1, further comprising setting the state of a computing unit to unavailable by setting the value of the command management register corresponding to that computing unit.

5. The method of claim 1, wherein the widths of the command management register and the command status register depend on the length of the command buffer of their corresponding computing unit.

6. The method of claim 1, further comprising:
recording the correspondence between the sequence number of the command to be executed and the hardware command queue to which the command belongs.

7. The method of claim 6, further comprising:
in response to receiving a command completion signal from the computing unit, looking up, according to the sequence number of the completed command, the hardware command queue corresponding to that command, and sending to it a signal indicating that the command has completed.

8. The method of claim 6, further comprising:
dispatching multiple commands to be executed from the same hardware command queue in succession to the computing units for processing.

9. The method of any one of claims 1-8, further comprising:
in response to receiving a command completion signal from the computing unit, determining whether the completed command still needs to wait for other commands;
only when it is determined that the command does not need to wait for other commands, setting the bit corresponding to the sequence number of the completed command in the command management register corresponding to that computing unit to indicate the idle state.

10. The method of any one of claims 1-8, further comprising:
in response to receiving a command completion signal from the computing unit, and after a preset period of time has elapsed, setting the bit corresponding to the sequence number of the completed command in the command management register corresponding to that computing unit to indicate the idle state.

11. The method of any one of claims 1-8, further comprising, for commands to be executed from multiple hardware command queues, dispatching each command to be executed to a computing unit according to a preset priority.

12. A heterogeneous computing processor comprising a controller, a plurality of computing units of different types, and a command management register and a command status register corresponding to each computing unit, wherein each bit of the command management register indicates whether a respective entry in the command buffer of the corresponding computing unit is idle, each bit of the command status register indicates the execution state of a respective command in the corresponding computing unit, and the controller is configured to perform the method of any one of claims 1-11.
CN202210100383.4A 2022-01-27 Management method of computing units in heterogeneous computing and corresponding processor Active CN114548389B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210100383.4A CN114548389B (en) 2022-01-27 Management method of computing units in heterogeneous computing and corresponding processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210100383.4A CN114548389B (en) 2022-01-27 Management method of computing units in heterogeneous computing and corresponding processor

Publications (2)

Publication Number Publication Date
CN114548389A (en) 2022-05-27
CN114548389B (en) 2025-10-10

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112204524A (en) * 2018-05-24 2021-01-08 赛灵思公司 Embedded scheduling of hardware resources for hardware acceleration
CN113056729A (en) * 2018-11-19 2021-06-29 赛灵思公司 Programming and control of computational cells in an integrated circuit

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115617533A (en) * 2022-12-14 2023-01-17 上海登临科技有限公司 Process switching management method and computing device in heterogeneous computing
CN115617533B (en) * 2022-12-14 2023-03-10 上海登临科技有限公司 Process switching management method in heterogeneous computing and computing device

Similar Documents

Publication Publication Date Title
US11836524B2 (en) Memory interface for a multi-threaded, self-scheduling reconfigurable computing fabric
CN112106030B (en) Thread state monitoring in systems with multithreaded self-scheduling processors
CN112106026B (en) Multithreaded self-scheduling processor for managing network congestion
CN1188794C (en) Coprocessor with multiple logic interface
CN102609312B (en) Based on the SJF memory request dispatching method that fairness is considered
CN102156665B (en) Differential serving method for virtual system competition resources
US6868087B1 (en) Request queue manager in transfer controller with hub and ports
US20030105901A1 (en) Parallel multi-threaded processing
US20090100200A1 (en) Channel-less multithreaded DMA controller
CN102081551A (en) Micro-architecture sensitive thread scheduling (MSTS) method
US11526767B2 (en) Processor system and method for increasing data-transfer bandwidth during execution of a scheduled parallel process
US8086766B2 (en) Support for non-locking parallel reception of packets belonging to a single memory reception FIFO
US12204774B2 (en) Allocation of resources when processing at memory level through memory request scheduling
WO2025152635A1 (en) Cache request processing method and apparatus, device, storage medium and program
US10713089B2 (en) Method and apparatus for load balancing of jobs scheduled for processing
US11061724B2 (en) Programmable hardware scheduler for digital processing systems
CN111597044A (en) Task scheduling method and device, storage medium and electronic equipment
CN100547572C (en) Method and system for dynamically establishing direct memory access path
WO2023216629A1 (en) Multi-process management method in heterogeneous computing, and computing device
US11113101B2 (en) Method and apparatus for scheduling arbitration among a plurality of service requestors
CN114548389A (en) Management method and corresponding processor of computing unit in heterogeneous computing
CN114548389B (en) Management method of computing units in heterogeneous computing and corresponding processor
CN101647002A (en) Multiprocessing system and method
US8135878B1 (en) Method and apparatus for improving throughput on a common bus
CN118819748A (en) A task scheduling method, scheduling management system and multi-core processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: Room 1101, Building 5, South Bank New Land Phase I, No. 11 Yangfu Road, Suzhou Industrial Park, Suzhou Area, China (Jiangsu) Pilot Free Trade Zone, Suzhou City, Jiangsu Province 215101

Applicant after: Suzhou Denglin Technology Co.,Ltd.

Address before: 201306 room 710, building a, No. 3236, Jiangshan Road, Pudong New Area, Shanghai

Applicant before: Shanghai Denglin Technology Co.,Ltd.

Country or region before: China

GR01 Patent grant