CN118672941B - A task execution method, device, equipment and storage medium - Google Patents
- CN118672941B CN118672941B CN202411154887.XA CN202411154887A CN118672941B CN 118672941 B CN118672941 B CN 118672941B CN 202411154887 A CN202411154887 A CN 202411154887A CN 118672941 B CN118672941 B CN 118672941B
- Authority
- CN
- China
- Prior art keywords
- level cache
- read request
- data
- cache
- read
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0811—Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0813—Multiuser, multiprocessor or multiprocessing cache systems with a network or matrix configuration
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/10—Providing a specific technical effect
- G06F2212/1016—Performance improvement
- G06F2212/1021—Hit rate improvement
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/10—Providing a specific technical effect
- G06F2212/1016—Performance improvement
- G06F2212/1024—Latency reduction
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Description
Technical Field
The present invention relates to the field of cache technology, and in particular to a task execution method, apparatus, device, and storage medium.
Background Art
A GPU (Graphics Processing Unit) typically contains multiple streaming multiprocessors (Stream Multiprocessors, SMP), which significantly accelerate computation by executing different data streams at the same time. As the number of cores grows, however, efficiently managing the memory hierarchy, and especially the level-2 cache (L2-Cache), has become a key challenge for improving overall system efficiency. In traditional GPU designs, each streaming multiprocessor may have its own independent L2 cache. While such a design preserves each core's independence and data-access speed, it suffers from wasted resources, data redundancy, and complex cache-coherence maintenance. Under highly parallel workloads in particular, data is frequently passed between cores, and independent caches cause large amounts of data migration and needless cache-line refreshes, severely degrading system performance and energy efficiency.
It follows that reducing data-transfer latency and improving the cache hit rate are problems to be solved in this field.
Summary of the Invention
In view of this, the purpose of the present invention is to provide a task execution method, apparatus, device, and storage medium, in which a hierarchical cache structure of independent level-1 caches and a shared level-2 cache allows the data corresponding to a computing task to be read efficiently and cache resources to be managed centrally, reducing latency. The specific scheme is as follows:
In a first aspect, the present application provides a task execution method, applied to a streaming multiprocessor, comprising:
generating a read request for the multiprocessor's own level-1 cache based on a computing task generated by a host computer, where each streaming multiprocessor has an independent level-1 cache, and the computing task has corresponding instruction data and operand data;
if the read request misses in the level-1 cache, forwarding the read request through the level-1 cache to a preset crossbar matrix, so that the crossbar forwards the read request to a level-2 cache shared by several streaming multiprocessors, where an arbiter and routing are integrated in the crossbar to forward read requests to the level-2 cache according to an arbitration mechanism and the address space corresponding to each read request;
reading the instruction data or operand data corresponding to the computing task from the level-2 cache via the level-1 cache based on the read request; and
executing the computing task according to the read instruction data or operand data, and caching the corresponding computation result in the level-2 cache so that the host computer can process the computation result of the computing task.
Optionally, generating a read request for the multiprocessor's own level-1 cache based on the computing task generated by the host computer includes:
activating the multiprocessor's own scheduling unit based on the computing task, so as to produce several warps; and
passing the initial instruction address corresponding to the computing task and the names of the warps to the multiprocessor's own instruction-fetch unit, so that the instruction-fetch unit parses the initial instruction address and the warp names and generates a read request for the level-1 cache according to the parsing result.
Optionally, the method further includes:
if the read request hits in the level-1 cache, reading the instruction data or operand data corresponding to the computing task from the level-1 cache through the read request, so as to execute the computing task according to the read instruction data or operand data.
Optionally, reading the instruction data or operand data corresponding to the computing task from the level-1 cache through the read request includes:
after reading the corresponding instruction data or operand data locally from the level-1 cache according to the read request, caching the data read from the level-1 cache into a preset buffer corresponding to the streaming multiprocessor; and
reading the instruction data or operand data corresponding to the computing task from the preset buffer.
Optionally, forwarding the read request through the level-1 cache to the preset crossbar matrix, so that the crossbar forwards the read request to the level-2 cache shared by several streaming multiprocessors, includes:
forwarding the read request from the level-1 cache, via its corresponding master interface, to a first interface coupler of the preset crossbar matrix, so that the crossbar forwards the request to a second interface coupler using its internally integrated read arbiter and routing, and the second interface coupler passes the request to the level-2 cache through the slave interface corresponding to the level-2 cache, thereby completing the process of forwarding the read request to the level-2 cache shared by several streaming multiprocessors.
Optionally, executing the computing task according to the read instruction data or operand data includes:
selecting, through the multiprocessor's own execution unit, a target compute unit that matches the computing task, where the target compute unit includes a logic compute unit, a floating-point compute unit, a load/store unit, and a special-function unit; and
executing the computing task corresponding to the read instruction data or operand data on the target compute unit to obtain the corresponding computation result.
Optionally, caching the corresponding computation result in the level-2 cache includes:
forwarding the computation result through the master interface of the level-1 cache to a third interface coupler of the preset crossbar matrix, so that the crossbar forwards the result to a fourth interface coupler using its internally integrated write arbiter and routing, and the fourth interface coupler caches the result into the level-2 cache through the slave interface corresponding to the level-2 cache.
In a second aspect, the present application provides a task execution apparatus, applied to a streaming multiprocessor, comprising:
a read-request generation module, configured to generate a read request for the multiprocessor's own level-1 cache based on a computing task generated by a host computer, where each streaming multiprocessor has an independent level-1 cache and the computing task has corresponding instruction data and operand data;
a forwarding module, configured to forward the read request through the level-1 cache to a preset crossbar matrix when the read request misses in the level-1 cache, so that the crossbar forwards the request to a level-2 cache shared by several streaming multiprocessors, where an arbiter and routing are integrated in the crossbar to forward read requests to the level-2 cache according to an arbitration mechanism and the address space corresponding to each read request;
a data-reading module, configured to read the instruction data or operand data corresponding to the computing task from the level-2 cache via the level-1 cache based on the read request; and
a task execution module, configured to execute the computing task according to the read instruction data or operand data and cache the corresponding computation result in the level-2 cache, so that the host computer can process the computation result of the computing task.
In a third aspect, the present application provides an electronic device, comprising:
a memory for storing a computer program; and
a processor for executing the computer program to implement the task execution method described above.
In a fourth aspect, the present application provides a computer-readable storage medium for storing a computer program which, when executed by a processor, implements the task execution method described above.
It can thus be seen that the streaming multiprocessor of the present application can generate a read request for its own level-1 cache based on a computing task generated by the host computer; each streaming multiprocessor has an independent level-1 cache, and the computing task has corresponding instruction data and operand data. If the read request misses in the level-1 cache, the request is forwarded through the level-1 cache to a preset crossbar matrix, which forwards it to a level-2 cache shared by several streaming multiprocessors; the crossbar integrates an arbiter and routing that forward read requests to the level-2 cache according to an arbitration mechanism and the address space of each request. The instruction data or operand data corresponding to the computing task is then read from the level-2 cache via the level-1 cache, the computing task is executed on the data read, and the corresponding computation result is cached in the level-2 cache so that the host computer can process it.
In this way, the present application builds a level-1 cache inside each streaming multiprocessor and, according to the number of streaming multiprocessors and their address spaces, allocates a suitable cache region in the level-2 cache for each of them. The hierarchical structure of independent level-1 caches and a shared level-2 cache allows the data for a computing task to be read efficiently, and the shared level-2 cache centralizes cache management: all streaming multiprocessors access the same cache pool, which reduces data copying, lowers data-transfer latency, and improves the cache hit rate. Such a hierarchical cache structure is also highly flexible and adaptive, and can meet the needs of diverse applications.
Brief Description of the Drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a task execution method disclosed in the present application;
FIG. 2 is a flowchart of a specific task execution method disclosed in the present application;
FIG. 3 is a schematic diagram of a crossbar matrix disclosed in the present application;
FIG. 4 is a schematic diagram of a crosspoint-based M×S crossbar switch architecture disclosed in the present application;
FIG. 5 is a schematic diagram of the connection between streaming multiprocessors and the level-2 cache disclosed in the present application;
FIG. 6 is a schematic structural diagram of a task execution apparatus disclosed in the present application;
FIG. 7 is a structural diagram of an electronic device disclosed in the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will now be described clearly and completely with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
As shown in FIG. 1, an embodiment of the present invention discloses a task execution method, applied to a streaming multiprocessor, comprising:
Step S11: generate a read request for the multiprocessor's own level-1 cache based on a computing task generated by the host computer; each streaming multiprocessor has an independent level-1 cache, and the computing task has corresponding instruction data and operand data.
It can be understood that after the host computer generates a parallel computing task, it caches the instruction data and operand data required by the task into the level-1 or level-2 cache of the corresponding streaming multiprocessor. The streaming multiprocessor first generates a read request for its own level-1 cache according to the computing task to be executed, attempting to read the required instruction data and operand data from the level-1 cache.
In a specific embodiment, generating a read request for the multiprocessor's own level-1 cache based on the computing task generated by the host computer may include: activating the multiprocessor's own scheduling unit based on the computing task to produce several warps; and passing the initial instruction address of the computing task and the warp names to the multiprocessor's own instruction-fetch unit, so that the instruction-fetch unit parses the initial instruction address and warp names and generates a read request for the level-1 cache according to the parsing result. Specifically, when a streaming multiprocessor needs to execute a computing task from the host computer, it first activates its internal scheduling unit to produce active warps (suppose, for instance, that two active warps are produced at the same time) and passes parameters such as the initial instruction address and the IDs (identities) of the active warps to the instruction-fetch unit; after parsing the received instruction address and other parameter information, the instruction-fetch unit produces a read-request signal used to read the instruction data and operand data required by the computing task from the level-1 cache.
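The scheduler-to-fetch-unit handoff described above can be sketched in a few lines of Python. All class and field names here (`Scheduler`, `FetchUnit`, `ReadRequest`) are illustrative stand-ins, not structures defined by the patent:

```python
from dataclasses import dataclass


@dataclass
class ReadRequest:
    """A read request aimed at the L1 cache (illustrative)."""
    warp_id: int
    address: int


class Scheduler:
    """Activates warps for a computing task and records each warp's start PC."""
    def __init__(self):
        self.active_warps = []  # list of (warp_id, initial instruction address)

    def activate(self, task_pc, n_warps):
        for wid in range(n_warps):
            self.active_warps.append((wid, task_pc))
        return self.active_warps


class FetchUnit:
    """Parses (warp ID, initial PC) pairs and emits L1 read requests."""
    def issue(self, active_warps):
        return [ReadRequest(warp_id=wid, address=pc) for wid, pc in active_warps]


# Two active warps produced at the same time, as in the example above.
sched = Scheduler()
fetch = FetchUnit()
reqs = fetch.issue(sched.activate(task_pc=0x1000, n_warps=2))
```

Each resulting request carries the warp ID used later to route the returned data into the right per-warp buffer.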
Step S12: if the read request misses in the level-1 cache, forward the read request through the level-1 cache to a preset crossbar matrix, so that the crossbar forwards the request to the level-2 cache shared by several streaming multiprocessors; the crossbar integrates an arbiter and routing, which forward read requests to the level-2 cache according to an arbitration mechanism and the address space corresponding to each request.
In an embodiment of the present application, if the level-1 cache looks up the corresponding cache line according to the address information of the read request but does not hit, it can forward the read request to the preset crossbar matrix, which forwards it to the shared level-2 cache. The crossbar integrates an arbiter and address-forwarding routing; the arbiter comprises a read arbiter for read requests and a write arbiter for write requests, and the crossbar's arbitration mechanism can forward a read request to the appropriate cache region of the level-2 cache based on the request's address space.
In a specific embodiment, the crossbar matrix (AXI-Crossbar, a crossbar based on the AXI (Advanced eXtensible Interface) bus protocol) may include master/slave AXI interface-coupling modules and a Crossbar module. The interface-coupling modules interconnect multiple Master interfaces (master lines, connected to the level-1 caches) or Slave interfaces (slave lines, connected to the level-2 cache) with the Crossbar module to form the crossbar matrix, supporting data buffering and data-width conversion; the Crossbar module internally integrates the arbiter and routing functions and is responsible for routing and forwarding data between master and slave interfaces. If, for example, the AXI-Crossbar forms a 3×3 master-slave interconnection network, then Master 0 (the master line connecting the level-1 cache of streaming multiprocessor 0 to the crossbar) accesses the Crossbar module through its AXI interface-coupling module and can exchange data with any interface among Slave 0 to Slave 3.
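The address-space-based routing performed inside the crossbar can be illustrated with a minimal sketch. The address windows below are invented for the example; in a real AXI crossbar the destination is decoded from the configured slave address map:

```python
class Crossbar:
    """Routes a read request to an L2 slave port by address range (illustrative).

    Each slave port owns a contiguous address window; the request's address
    decides the destination, mirroring the address-space-based routing
    described in the text.
    """
    def __init__(self, windows):
        self.windows = windows  # list of (base, size), one entry per slave port

    def route(self, address):
        for port, (base, size) in enumerate(self.windows):
            if base <= address < base + size:
                return port
        raise ValueError(f"address {address:#x} maps to no slave port")


# Three slave ports, each owning a 16 KiB window (hypothetical layout).
xbar = Crossbar([(0x0000, 0x4000), (0x4000, 0x4000), (0x8000, 0x4000)])
```

An arbiter would sit in front of `route` to serialize simultaneous requests from multiple masters; the routing decision itself depends only on the address.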
In a specific embodiment, forwarding the read request through the level-1 cache to the preset crossbar matrix, so that the crossbar forwards the request to the level-2 cache shared by several streaming multiprocessors, may include: forwarding the read request from the level-1 cache, through its corresponding master interface, to a first interface coupler of the crossbar, so that the crossbar forwards the request to a second interface coupler using its internally integrated read arbiter and routing, and the second interface coupler passes the request to the level-2 cache through the slave interface corresponding to the level-2 cache, completing the process of forwarding the read request to the shared level-2 cache.
Specifically, the level-1 cache of each streaming multiprocessor is connected to the crossbar through a separate master interface, while the crossbar is connected to the level-2 cache through several slave interfaces. The level-1 cache can forward the read request through its master interface to the first interface coupler, which receives read requests forwarded by the level-1 caches; the crossbar then uses its internally integrated read arbiter and the corresponding routing to forward the request, and the second interface coupler delivers it to the appropriate cache region of the level-2 cache.
Step S13: read the instruction data or operand data corresponding to the computing task from the level-2 cache via the level-1 cache based on the read request.
Further, once the read request has been forwarded to the level-2 cache, the level-2 cache returns the instruction data or operand data to the streaming multiprocessor through the level-1 cache. It can be understood that the streaming multiprocessor can cache the data read from the cache into its own corresponding preset buffer for use in task computation.
In a specific embodiment, the method may further include: if the read request hits in the level-1 cache, reading the instruction data or operand data corresponding to the computing task from the level-1 cache through the read request, so as to execute the computing task on the data read. Specifically, once a read request issued by the streaming multiprocessor hits in the level-1 cache, the instruction data or operand data corresponding to the computing task can be read directly from the level-1 cache. Further, reading that data through the read request may include: after reading the corresponding instruction data or operand data locally from the level-1 cache according to the request, caching the data into the preset buffer corresponding to the streaming multiprocessor, and then reading the data for the computing task from that buffer. The level-1 cache can cache the data it reads into the preset buffer of the corresponding warp; the streaming multiprocessor then reads the corresponding instruction data or operand data from the preset buffer to carry out the computation of the task.
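The hit/miss flow above, including the fill into the per-warp preset buffer, can be sketched as follows. The dictionary-backed "L2" stands in for the real shared level-2 cache reached through the crossbar, which is elided here:

```python
class L1Cache:
    """Toy L1 lookup (illustrative): on a hit the data is served locally; on
    a miss it is fetched from a backing L2 store and the L1 line is filled.
    Either way the data lands in the requesting warp's preset buffer."""
    def __init__(self, l2):
        self.lines = {}    # address -> data (cached L1 lines)
        self.buffers = {}  # warp_id -> list of data (per-warp preset buffer)
        self.l2 = l2       # stand-in for the shared L2 cache

    def read(self, warp_id, address):
        if address in self.lines:
            data = self.lines[address]        # hit: read locally
        else:
            data = self.l2[address]           # miss: fetch from shared L2
            self.lines[address] = data        # fill the L1 line
        self.buffers.setdefault(warp_id, []).append(data)
        return data


# Hypothetical instruction data pre-cached in L2 by the host computer.
l2_store = {0x10: "add r1, r2", 0x20: "mul r3, r4"}
l1 = L1Cache(l2_store)
```

The first read of an address misses and fills the L1 line; a repeated read of the same address then hits and is served without touching L2.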
Step S14: execute the computing task according to the read instruction data or operand data, and cache the corresponding computation result in the level-2 cache so that the host computer can process the computation result of the computing task.
In an embodiment of the present application, upon reading the instruction data and operand data corresponding to the computing task, the streaming multiprocessor executes the task and can cache the corresponding computation result in the level-2 cache for processing by the host computer.
In a specific embodiment, executing the computing task according to the read instruction data or operand data may include: selecting, through the multiprocessor's own execution unit, a target compute unit that matches the computing task, the target compute unit including a logic compute unit, a floating-point compute unit, a load/store unit, and a special-function unit; and executing the computing task corresponding to the read instruction data or operand data on the target compute unit to obtain the corresponding computation result. After receiving the operand data and the related instruction data, the execution unit of the streaming multiprocessor selects a target compute unit matching the type of the computing task and uses the selected unit to complete the computation and obtain the result.
Further, caching the corresponding computation result in the level-2 cache may include: forwarding the result through the master interface of the level-1 cache to a third interface coupler of the preset crossbar matrix, so that the crossbar forwards the result to a fourth interface coupler using its internally integrated write arbiter and routing, and the fourth interface coupler caches the result into the level-2 cache through the slave interface corresponding to the level-2 cache. Specifically, after the streaming multiprocessor obtains the computation result of the task, it forwards the result through the level-1 cache's master interface to the third interface coupler, which receives data-write requests; the crossbar then forwards the result through its internally integrated write arbiter and related routing to the fourth interface coupler, which caches the result into the level-2 cache through the corresponding slave interface so that the host computer can process it.
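Selecting a target compute unit by task type amounts to a simple dispatch table. The task-type keys and unit names below are illustrative assumptions, not identifiers taken from the patent:

```python
def dispatch(task_type):
    """Map a task type to the matching compute unit (names are illustrative)."""
    units = {
        "int":   "logic compute unit",
        "float": "floating-point compute unit",
        "mem":   "load/store unit",
        "func":  "special-function unit",
    }
    return units[task_type]


class ExecutionUnit:
    """Selects the matching unit, then performs the operation on the operands."""
    def run(self, task_type, op, a, b):
        unit = dispatch(task_type)
        result = {"add": a + b, "mul": a * b}[op]  # toy operand computation
        return unit, result
```

A real execution unit would decode the operation from the instruction data; here the operation is passed in directly to keep the dispatch logic in focus.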
It can thus be seen that the present application builds a first-level cache inside each streaming multiprocessor and, according to the number of streaming multiprocessors and their address spaces, allocates a suitable cache region in the second-level cache for each of them. The hierarchical structure of independent first-level caches and a shared second-level cache allows the data of a computing task to be read efficiently, and cache resources are managed centrally through the shared second-level cache. When different streaming multiprocessors exchange data, the exchange can be completed simply according to the address space of the corresponding streaming multiprocessor, which optimizes data-access efficiency, reduces conflicts, and improves overall computing performance.
As shown in FIG. 2, an embodiment of the present application discloses a task execution method applied to a streaming multiprocessor, including:
The streaming multiprocessor may include a scheduling unit, an L1-Icache (the instruction cache of the first-level cache), an instruction fetch unit, a decode unit, an operand collection unit, a register file, an L1-Dcache (the data cache of the first-level cache), an execution unit, and a write-back unit. The scheduling unit, which maintains the warp PC table, is responsible for activating warps and generating the related parameter information passed to the instruction fetch unit.
The instruction fetch unit parses the parameters passed from the previous stage and generates a request signal for the L1-Icache. Upon receiving the request signal, the L1-Icache reads the instruction data according to the request information; on a read miss, the request continues onward through the AXI-Crossbar to the L2-Cache (second-level cache), and the instruction data, once read, is returned to the instruction fetch unit. The decode unit parses the instruction data received from the fetch unit, extracting the source operand addresses, the destination address, and other information according to the instruction type, and passes the operand information to the operand collection unit, which reads the register file or the L1-Dcache according to the operand type and address. When an operand resides in the L1-Dcache, the read proceeds as for instruction data: on a miss, a read request is sent to the L2-Cache. Further, once the instruction data and operands have been collected, the operands are passed to the execution unit, which integrates a logic computing unit, a floating-point computing unit, a memory access unit, and a function unit; the unit matching the type of the computing task is selected for the operation, and after execution the result is written back to the corresponding cache according to the destination type and address, completing one full pipeline pass. The L1-Cache may comprise the two types L1-Icache and L1-Dcache, and internally integrates a request arbitration module, a response arbitration module, response buffers, a bypass forwarding module, and cache RAM.
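The pipeline pass just described (fetch, decode, operand collection, execution, write-back) can be sketched as a minimal software model. The instruction encoding, register names, and unit selection here are illustrative assumptions, not the patent's actual formats.

```python
# Minimal model of one pipeline pass: decode -> collect operands -> execute
# with a matching unit -> write the result back to a cacheable destination.
REGISTER_FILE = {"r0": 3, "r1": 4}
L1_DCACHE = {}

def decode(instr):
    # "op dst src1 src2": the decode unit extracts source operand
    # addresses and the destination address from the instruction
    op, dst, a, b = instr.split()
    return op, dst, a, b

def collect_operands(a, b):
    # here both operands happen to live in the register file
    return REGISTER_FILE[a], REGISTER_FILE[b]

def execute(op, x, y):
    # the execution unit selects the computing unit by operation type;
    # both ops below would map to the integer/logic unit
    return {"add": x + y, "mul": x * y}[op]

def write_back(dst, value):
    L1_DCACHE[dst] = value   # write back by destination type and address

for instr in ["add d0 r0 r1", "mul d1 r0 r1"]:
    op, dst, a, b = decode(instr)
    x, y = collect_operands(a, b)
    write_back(dst, execute(op, x, y))

print(L1_DCACHE)
```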
When read requests from multiple warps arrive simultaneously, the request arbitration module assigns priorities by warp ID and services the requests in round-robin order; once the relevant data has been read from the cache RAM, the response arbitration module stores it in the buffer corresponding to the warp ID, again in round-robin fashion. On a miss, the read request is forwarded by the bypass forwarding module through the AXI-Crossbar to the L2-Cache to continue the read. FIG. 3 is a schematic diagram of the AXI-Crossbar, whose crossbar portion includes a read arbiter handling AR/R read requests and a write arbiter handling AW/W write requests, both of which forward data on the basis of routing. Further, FIG. 4 shows a crosspoint-based M×S crossbar switch architecture: each crosspoint carries a switch controlled by a select signal, which opens or closes the crosspoint to determine the connection between an input port and an output port. The select signals are generated by the crossbar router. When multiple nodes simultaneously initiate communication requests to the same node, an arbiter using a round-robin policy decides from which port's interface coupling buffer the data entering the router is taken; when data is taken out of an interface buffer, the corresponding storage resource is released. If a buffer is full, the output deasserts the buffer-available signal and temporarily blocks subsequent communication requests. When a transmission from port M to port S is pending, port M prepares the data to be sent and, at the same time, sends the crossbar controller a transmission request signal together with transmission information such as the destination port number.
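The round-robin arbitration by warp ID described above can be sketched as a rotating-priority arbiter. This is a behavioral sketch under assumed data shapes, not the patent's hardware arbiter.

```python
# Round-robin arbiter: pending requesters are granted in cyclic priority
# order starting from a rotating pointer, so no warp ID starves.
class RoundRobinArbiter:
    def __init__(self, num_warps):
        self.num_warps = num_warps
        self.next_warp = 0              # rotating priority pointer

    def grant(self, pending):
        """pending: set of warp IDs with outstanding requests.
        Returns the granted warp ID, or None if nothing is pending."""
        for i in range(self.num_warps):
            wid = (self.next_warp + i) % self.num_warps
            if wid in pending:
                # advance the pointer past the winner for fairness
                self.next_warp = (wid + 1) % self.num_warps
                return wid
        return None

arb = RoundRobinArbiter(num_warps=4)
pending = {0, 2, 3}
order = []
while pending:
    w = arb.grant(pending)
    order.append(w)
    pending.discard(w)
print(order)    # grants proceed in cyclic order
```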
If the data path is free, the crossbar controller asserts the select signal of the (M, S) switch and, at the same time, returns a ready-to-send feedback signal to port M. Upon receiving this feedback, the data enters the crossbar from port M and leaves the switch network at port S along the data path. While asserting the (M, S) select signal, the crossbar controller also sends a transmission request signal to port S. Once port S has correctly received and buffered the data, it returns a reply feedback signal to the crossbar controller, which relays the reply on to port M, indicating that this communication task is complete; the storage space can then be released, and the corresponding crosspoint switch is turned off.
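The handshake above can be summarized as a fixed sequence of control events per transfer. The event names below are illustrative labels for the steps the text describes, not signal names from the patent.

```python
# One M->S crossbar transfer as an event sequence:
# request -> select asserted + ready feedback -> notify S -> data -> ack ->
# select deasserted (crosspoint turned off, storage released).
def crossbar_transfer(m, s, path_free, dst_buffer_has_room):
    """Returns the control-event list for one transfer, or the blocking
    reason when the path or the destination buffer is unavailable."""
    request = "request(M=%d,S=%d)" % (m, s)
    if not path_free:
        return [request, "blocked: path busy"]
    if not dst_buffer_has_room:
        return [request, "blocked: buffer full"]   # buffer-available deasserted
    return [
        request,                              # M asks controller, with dest port
        "select(M=%d,S=%d)=valid" % (m, s),   # controller closes the crosspoint
        "ready -> M",                         # ready-to-send feedback to M
        "request -> S",                       # controller also notifies S
        "data M -> S",                        # data crosses the switch fabric
        "ack S -> controller -> M",           # S buffered the data correctly
        "select(M=%d,S=%d)=invalid" % (m, s)  # crosspoint off, storage released
    ]

events = crossbar_transfer(m=0, s=1, path_free=True, dst_buffer_has_room=True)
print(len(events))
```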
FIG. 5 is a schematic diagram of the connection between the streaming multiprocessors and the second-level cache. As can be seen, the L1-Cache comprises the two types L1-Icache and L1-Dcache; each streaming multiprocessor has its own L1-Cache, which is connected to the AXI-Crossbar through a master line, while the AXI-Crossbar connects to the second-level cache through slave lines, and the second-level cache maps different cache regions to the address spaces of the streaming multiprocessors. In a specific embodiment, the L1-Cache may internally integrate a request arbitration module, a response arbitration module, response buffers, a bypass forwarding module, and cache RAM. When read requests from multiple warps arrive simultaneously, the request arbitration module assigns priorities by warp ID and services the requests in round-robin order; once the relevant data has been read from the cache RAM, the response arbitration module stores it in the buffer corresponding to the warp ID, again in round-robin fashion. On a miss, the read request is forwarded by the bypass forwarding module through the AXI-Crossbar to the L2-Cache to continue the read.
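The hierarchical read path of FIG. 5 can be sketched as follows: a hit is served from the per-SMP L1, while a miss is bypass-forwarded (through the crossbar, abstracted away here) to the shared L2 and then filled into the L1. Class names and the fill-on-miss policy are illustrative assumptions.

```python
# Sketch of the L1-hit / L1-miss-to-L2 read path of the hierarchical cache.
class L1Cache:
    def __init__(self):
        self.lines = {}

    def read(self, addr):
        return self.lines.get(addr)     # None models a miss

class L2Cache:
    def __init__(self, backing):
        self.backing = backing          # data preloaded by the host

    def read(self, addr):
        return self.backing[addr]

def smp_read(addr, l1, l2):
    data = l1.read(addr)
    if data is None:                    # miss: bypass module -> crossbar -> L2
        data = l2.read(addr)
        l1.lines[addr] = data           # fill L1 so the next read hits
    return data

l2 = L2Cache(backing={0x40: "instr", 0x80: "operand"})
l1 = L1Cache()
print(smp_read(0x40, l1, l2))          # first read: miss, served from L2
print(smp_read(0x40, l1, l2))          # second read: L1 hit
```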
In a specific embodiment, when the host generates a parallel computing task, the instructions and operand data required by the task are first cached in the L2-Cache, and the required streaming multiprocessors and their corresponding L2-Cache regions are allocated for the task. Accordingly, streaming multiprocessor SMP0 activates its internal scheduling unit to produce active warps; assuming two active warps W0 and W1 are produced simultaneously, parameters such as the initial instruction address and the IDs of the active warps are passed to the instruction fetch unit. After parsing the received instruction address and other parameter information, the fetch unit issues read request signals Req.W0 and Req.W1 to read the instruction data in the L1-Icache.
When the request arbitration module in the L1-Icache detects the read request signals, it reads the cache in turn according to the address information. On a hit, the instruction data is stored through the response arbitration module into the response buffers buffer.W0 and buffer.W1, from which the fetch unit takes it in turn. On a miss, by contrast, the bypass forwarding module of the L1-Icache forwards the request signals IReq.W0 and IReq.W1 in turn to the AXI-Crossbar: the read requests of SMP0 enter the interconnect network on the Master0 side, pass through the arbiter and routing, and are delivered on the Slave0 side to the L2-Cache, where the data is read from the cache region assigned to SMP0. The read data is then stored through the response arbitration module of the L1-Icache into the corresponding buffers, and finally the fetch unit reads it out and forwards the instruction data to the decode unit. The decode unit parses the received instruction data, extracting the location of the source operands, the location for storing the computation result, the type of computing unit required, and other information according to the instruction type protocol, and hands this information to the operand collection unit, which reads the register file or the L1-Dcache according to the operand type and address. When an operand is in the register file, it is read directly by register address; when it is in the L1-Dcache, read request signals DReq.W0 and DReq.W1 are generated, and the process follows the same request-and-response flow as for instruction data: on a hit the operand is returned directly, and on a miss the read request is forwarded to the L2-Cache. Once all required operands have been collected, they are passed to the execution unit.
Further, after the execution unit receives the operands and related instruction information, it selects the corresponding computing unit for the operation according to the type of the computing task, from among the logic computing unit, floating-point computing unit, memory access unit, and function unit. After execution, the result is cached through the write-back unit into the corresponding register file or L1-Dcache according to the destination type and address, completing one full pipeline pass of SMP0. It will be understood that when a computing task is assigned to multiple SMPs, each SMP produces multiple active warps to complete the related operations by repeating the above steps, and finally the results of all SMPs are cached in the L2-Cache for processing by the host.
The present application builds a first-level cache inside each streaming multiprocessor and, according to the number of streaming multiprocessors and their address spaces, allocates a suitable cache region in the second-level cache for each of them. The hierarchical structure of independent first-level caches and a shared second-level cache allows the data of a computing task to be read efficiently, and cache resources are managed centrally through the shared second-level cache: all streaming multiprocessors access the same cache pool, which reduces data replication, lowers data-transfer latency, and raises the cache hit rate. Such a hierarchical cache structure is also highly flexible and adaptive and can meet diverse application requirements.
It can thus be seen that the present application builds a first-level cache inside each streaming multiprocessor and, according to the number of streaming multiprocessors and their address spaces, allocates a suitable cache region in the second-level cache for each of them; the hierarchical structure of independent first-level caches and a shared second-level cache allows the data of a computing task to be read efficiently, and cache resources are managed centrally through the shared second-level cache. By comparison, a scheme in which each streaming multiprocessor has its own independent second-level cache guarantees the independence and data-access speed of each core but suffers from wasted resources, data redundancy, and complex cache-coherence maintenance; in particular, under highly parallel workloads where data passes frequently between cores, independent caches cause large amounts of data migration and invalid cache-line refreshes, seriously degrading system performance and energy efficiency. The hierarchical structure with a shared second-level cache in the present application allows all streaming multiprocessors to access the same cache pool, reducing data replication, lowering data-transfer latency, and raising the cache hit rate; such a hierarchical cache structure is also highly flexible and adaptive and can meet diverse application requirements.
As shown in FIG. 6, an embodiment of the present application discloses a task execution device applied to a streaming multiprocessor, including:
a read request generation module 11, configured to generate a read request for the streaming multiprocessor's own first-level cache based on a computing task generated by a host; the streaming multiprocessor has a corresponding independent first-level cache, and the computing task has corresponding instruction data and operand data;
In scenarios such as artificial intelligence, big-data analysis, and high-end graphics rendering, the streaming multiprocessors in a GPU can execute the computing tasks generated by the corresponding host, each task having corresponding instruction data and operand data, and each streaming multiprocessor having its own independent first-level cache. Further, in executing a single computing task, the streaming multiprocessor uses its own read request generation module to generate, from the computing task, a read request for its own first-level cache, so that the instruction data and operand data related to the current computing task can be read from the corresponding independent first-level cache through that request.
a forwarding module 12, configured to forward the read request through the first-level cache to a preset crossbar matrix when the read request misses in the first-level cache, so that the read request is forwarded through the preset crossbar matrix to a second-level cache shared by a plurality of streaming multiprocessors; an arbiter and routing are integrated in the preset crossbar matrix and forward the read request to the second-level cache according to the arbitration mechanism and the address space corresponding to the read request;
When a read request generated by a single streaming multiprocessor misses in its own independent first-level cache, the forwarding module can forward the request to the preset crossbar matrix. The preset crossbar matrix is a crossbar switch connecting the first-level caches to the second-level cache shared by the streaming multiprocessors; with its internally integrated arbiter and routing, it handles data forwarding between the first-level and second-level caches. Through the preset crossbar matrix, the read request is forwarded to the second-level cache so that the corresponding data can be read from it.
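The address-space-based routing inside the crossbar can be sketched as a region lookup: the arbiter-routed request is steered to the L2 slave port whose address region contains the request address. The region layout below is an assumption for illustration only.

```python
# Route a read request to an L2 slave port by the address space it falls in.
REGIONS = [                 # (base, limit, slave_port), one region per SMP
    (0x0000, 0x1000, 0),
    (0x1000, 0x2000, 1),
    (0x2000, 0x3000, 2),
]

def route_read(addr):
    """Return the slave port whose L2 region contains addr."""
    for base, limit, port in REGIONS:
        if base <= addr < limit:
            return port
    raise ValueError("address outside all L2 regions: %#x" % addr)

print(route_read(0x1800))   # falls in the second region
```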
a data reading module 13, configured to read, based on the read request and via the first-level cache, the instruction data or operand data corresponding to the computing task from the second-level cache;
Through the data reading module, the streaming multiprocessor can read, via the first-level cache, the instruction data or operand data corresponding to the current computing task from the second-level cache; through read requests, the instruction data and operand data corresponding to the current computing task can be read from the first-level cache and/or the second-level cache.
a task execution module 14, configured to execute the computing task according to the read instruction data or operand data, and to cache the corresponding operation result in the second-level cache so that the host can process the operation result corresponding to the computing task.
After the streaming multiprocessor has read the instruction data and operand data corresponding to the current computing task, it can execute the task through the task execution module and then cache the corresponding operation result in the second-level cache so that the host can process the result of the current computing task. It will be understood that caching the operation result in the second-level cache likewise relies on data forwarding through the preset crossbar matrix.
It can thus be seen that the present application builds a first-level cache inside each streaming multiprocessor and, according to the number of streaming multiprocessors and their address spaces, allocates a suitable cache region in the second-level cache for each of them. The hierarchical structure of independent first-level caches and a shared second-level cache allows the data of a computing task to be read efficiently, and cache resources are managed centrally through the shared second-level cache: all streaming multiprocessors access the same cache pool, which reduces data replication, lowers data-transfer latency, and raises the cache hit rate. Such a hierarchical cache structure is also highly flexible and adaptive and can meet diverse application requirements.
In a specific embodiment, the read request generation module 11 may include:
a warp generation unit, configured to activate the streaming multiprocessor's own scheduling unit based on the computing task generated by the host, so as to produce a number of warps;
a read request generation unit, configured to pass the initial instruction address corresponding to the computing task and the names of the warps to the streaming multiprocessor's own instruction fetch unit, so that the fetch unit parses the initial instruction address and the warp names and generates a read request for the first-level cache according to the parsing result.
In a specific embodiment, the device may further include:
a first-level cache reading module, configured to read, through the read request, the instruction data or operand data corresponding to the computing task from the first-level cache when the read request hits in the first-level cache, so that the computing task is executed according to the read instruction data or operand data.
In another specific embodiment, the first-level cache reading module may include:
a first data caching unit, configured to read the corresponding instruction data or operand data locally from the first-level cache according to the read request, and to cache the instruction data or operand data so read into a preset buffer corresponding to the streaming multiprocessor;
a first data reading unit, configured to read the instruction data or operand data corresponding to the computing task from the preset buffer.
In a specific embodiment, the forwarding module 12 may include:
a forwarding unit, configured to forward the read request through the first-level cache, via the corresponding master interface, to a first interface coupler of the preset crossbar matrix, so that the crossbar forwards the read request via its internally integrated read arbiter and routing to a second interface coupler, which delivers the request to the second-level cache through the slave interface corresponding to the second-level cache, thereby completing the forwarding of the read request to the second-level cache shared by the streaming multiprocessors.
In a specific embodiment, the task execution module 14 may include:
a computing unit selection unit, configured to select, through the streaming multiprocessor's own execution unit, a target computing unit matching the computing task, the target computing unit being one of a logic computing unit, a floating-point computing unit, a memory access unit, and a function unit;
a task execution unit, configured to execute, with the target computing unit, the computing task corresponding to the read instruction data or operand data, so as to obtain a corresponding operation result.
In a specific embodiment, the task execution module 14 may further include:
a second data caching unit, configured to forward the operation result through the master interface of the first-level cache to a third interface coupler of the preset crossbar matrix, so that the crossbar forwards the result via its internally integrated write arbiter and routing to a fourth interface coupler, which caches the result in the second-level cache through the slave interface corresponding to the second-level cache.
Further, an embodiment of the present application also discloses an electronic device. FIG. 7 is a structural diagram of an electronic device 20 according to an exemplary embodiment; its content is not to be taken as any limitation on the scope of use of the present application.
FIG. 7 is a schematic structural diagram of an electronic device 20 provided in an embodiment of the present application. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input/output interface 25, and a communication bus 26. The memory 22 stores a computer program that is loaded and executed by the processor 21 to implement the relevant steps of the task execution method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in this embodiment may specifically be an electronic computer.
In this embodiment, the power supply 23 provides operating voltage for the hardware devices on the electronic device 20; the communication interface 24 can create data transmission channels between the electronic device 20 and external devices, following any communication protocol applicable to the technical solution of the present application, which is not specifically limited here; and the input/output interface 25 acquires external input data or outputs data to the outside, its specific interface type being selectable according to application needs and likewise not specifically limited here.
In addition, the memory 22, as a carrier for storing resources, may be a read-only memory, a random access memory, a magnetic disk, an optical disc, or the like; the resources stored on it may include an operating system 221 and a computer program 222, and the storage may be transient or permanent.
The operating system 221 manages and controls the hardware devices and the computer program 222 on the electronic device 20 and may be Windows Server, Netware, Unix, Linux, or the like. Besides a computer program for performing the task execution method executed by the electronic device 20 as disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs for performing other specific tasks.
Further, the present application also discloses a computer-readable storage medium for storing a computer program which, when executed by a processor, implements the task execution method disclosed above. For the specific steps of the method, reference may be made to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.
The embodiments in this specification are described in a progressive manner, each focusing on its differences from the others; for the parts that are the same or similar, the embodiments may be referred to one another. Since the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points can be found in the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementations should not be considered beyond the scope of the present application.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. In the absence of further limitations, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The technical solution provided by the present application has been introduced in detail above. Specific examples have been used herein to illustrate the principles and implementations of the present application, and the description of the above embodiments is intended only to help understand the method of the present application and its core idea. Meanwhile, those of ordinary skill in the art may, in accordance with the idea of the present application, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.
Claims (9)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411154887.XA (CN118672941B) | 2024-08-22 | 2024-08-22 | A task execution method, device, equipment and storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN118672941A (en) | 2024-09-20 |
| CN118672941B (en) | 2024-10-22 |
Family
ID=92731332
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202411154887.XA (CN118672941B, active) | A task execution method, device, equipment and storage medium | 2024-08-22 | 2024-08-22 |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118672941B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119311623B (en) * | 2024-09-25 | 2025-05-20 | 成都市芯璨科技有限公司 | Data transmission system and method based on SoC bus |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103914333A (en) * | 2014-04-14 | 2014-07-09 | Suzhou Research Institute, University of Science and Technology of China | Multi-core memory system simulator on basis of network-on-chip interconnection |
| CN104954254A (en) * | 2013-12-27 | 2015-09-30 | Cavium, Inc. | Matrix of on-chip routers interconnecting a plurality of processing engines and a method of routing using thereof |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6988183B1 (en) * | 1998-06-26 | 2006-01-17 | Derek Chi-Lan Wong | Methods for increasing instruction-level parallelism in microprocessors and digital system |
| US9916274B2 (en) * | 2015-07-23 | 2018-03-13 | Cavium, Inc. | Apparatus and method for on-chip crossbar design in a network switch using benes network |
2024-08-22: Chinese application CN202411154887.XA filed; granted as CN118672941B (active)
Also Published As
| Publication number | Publication date |
|---|---|
| CN118672941A (en) | 2024-09-20 |
Similar Documents
| Publication | Title |
|---|---|
| US12229422B2 | On-chip atomic transaction engine |
| US6286090B1 | Mechanism for selectively imposing interference order between page-table fetches and corresponding data fetches |
| Pattnaik et al. | Opportunistic computing in gpu architectures |
| JP4085389B2 | Multiprocessor system, consistency control device and consistency control method in multiprocessor system |
| JP3661761B2 | Non-uniform memory access (NUMA) data processing system with shared intervention support |
| US6526481B1 | Adaptive cache coherence protocols |
| US7475198B2 | Asynchronous symmetric multiprocessing |
| US10210117B2 | Computing architecture with peripherals |
| JPH10149342A | Multiprocess system executing prefetch operation |
| CN118672941B | A task execution method, device, equipment and storage medium |
| JPH10133917A | Multiprocess system with coherency-relative error logging capability |
| JP4307508B2 | System and method for maintaining cache coherency in caches having different cache location lengths |
| CN118519924A | Cache control method and device, electronic equipment and readable storage medium |
| JP5265827B2 | Hybrid coherence protocol |
| CN117194283A | Vector read-write instruction processing method based on RISC-V instruction set |
| Zhao et al. | Hardware support for accelerating data movement in server platform |
| CN117435251B | A post-quantum cryptographic algorithm processor and its system on chip |
| US20070073977A1 | Early global observation point for a uniprocessor system |
| Chaudhuri et al. | The impact of negative acknowledgments in shared memory scientific applications |
| BiTalebi et al. | LARA: Locality-aware resource allocation to improve GPU memory-access time |
| Lee et al. | Global bus design of a bus-based COMA multiprocessor DICE |
| Agarwal et al. | Using CoDeL to rapidly prototype network processor extensions |
| Chatterjee et al. | Optimizing a multi-core processor for message-passing workloads |
| Afsahi et al. | Architectural extensions to support efficient communication using message prediction |
| Pai et al. | Modular Implementation of Directory-Based Cache Coherence for Multicore Processing |
Legal Events
| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |
| TR01 | Transfer of patent right |
TR01 (Transfer of patent right) details:
Effective date of registration: 2025-07-14
Patentee after: Yuanqixin (Shandong) Semiconductor Technology Co.,Ltd.; address after: 5th Floor, Building S01, 1036 Inspur Road, Shunhua Road Street, China (Shandong) Pilot Free Trade Zone, Jinan, Shandong 250000, China
Patentee before: Shandong Inspur Scientific Research Institute Co.,Ltd.; address before: Building S02, 1036 Gaoxin Inspur Road, Jinan, Shandong 250000, China