CN115878553A

CN115878553A - Method for system on chip and related product

Info

Publication number: CN115878553A
Application number: CN202110926703.7A
Authority: CN
Inventors: 请求不公布姓名
Original assignee: Cambricon Technologies Corp Ltd
Current assignee: Cambricon Technologies Corp Ltd
Priority date: 2021-08-12
Filing date: 2021-08-12
Publication date: 2023-03-31

Abstract

The present disclosure relates to a method for a system on chip, a computing device and a board, the computing device being comprised in a combined processing device, which may further comprise interface means and other processing means. The computing device interacts with other processing devices to jointly complete computing operations specified by a user. The combined processing device may further comprise a storage device connected to the computing device and the other processing device, respectively, for storing data of the computing device and the other processing device. The scheme of the disclosure can expand the use scene of the cache memory and improve the use efficiency of the cache memory.

Description

Method for system on chip and related products

技术领域technical field

本公开一般地涉及存储领域。更具体地，本公开涉及一种用于片上系统的方法、对应的片上系统、包括该片上系统的计算装置和包括该计算装置的板卡。The present disclosure relates generally to the field of storage. More specifically, the present disclosure relates to a method for a system-on-chip, a corresponding system-on-chip, a computing device including the system-on-chip, and a board including the computing device.

背景技术Background technique

计算系统的操作性能在相当大程度上取决于内存平均访问延迟。通过提高高速缓冲存储器(简称为“缓存”)的命中率来有效减少内存访问次数，可以显著改善系统性能。为此，处理器通常采用缓存机制，并且利用缓存来调节处理器与低速主存之间速度和性能上的不匹配。当前的缓存实行多级缓存机制，例如采用三级缓存(L1，L2和L3)，并且将最靠近主存的缓存称为最后一级缓存(“Last Level Cache”，LLC)。鉴于缓存在片上系统中的频繁使用及其重要作用，需要提出有效的管理策略，以便提高缓存的利用率，减少对主存的访问次数。另外，如何针对不同的场景来扩展LLC的应用也成为需要解决的问题。The operational performance of a computing system is largely determined by the average memory access latency. System performance can be significantly improved by effectively reducing the number of memory accesses by increasing the hit rate of the cache memory (referred to as "cache"). To this end, processors typically employ a cache mechanism, and use the cache to accommodate the mismatch in speed and performance between the processor and slow main memory. The current cache implements a multi-level cache mechanism, such as three-level cache (L1, L2 and L3), and the cache closest to the main memory is called the last level cache ("Last Level Cache", LLC). In view of the frequent use and important role of cache in SoCs, it is necessary to propose an effective management strategy in order to improve the utilization of cache and reduce the number of visits to main memory. In addition, how to expand the application of LLC for different scenarios has also become a problem that needs to be solved.

发明内容Contents of the invention

鉴于上述背景技术部分所提及的技术问题，扩展高速缓冲存储器的使用场景。本公开在如下的多个方面中提供用于片上系统的方案。In view of the technical problems mentioned in the background technology section above, the usage scenarios of the cache memory are expanded. The present disclosure provides solutions for a system on chip in the following aspects.

在第一方面中，本公开提供了一种用于片上系统的方法，所述片上系统包括至少用于执行运算操作的多个集群和与该多个集群互联的高速缓冲存储器，每个集群包括用于执行所述运算操作的多个处理器核，所述方法包括：将片外存储器的指定存储空间映射到高速缓冲存储器的给定存储区，以将该给定存储区用作集群间数据通信的集群存储区；以及使用所述集群存储区来执行所述集群的操作。In a first aspect, the present disclosure provides a method for a system-on-chip comprising at least a plurality of clusters for performing computational operations and a cache memory interconnected with the plurality of clusters, each cluster comprising A plurality of processor cores for performing the operation, the method includes: mapping a specified storage space of the off-chip memory to a given storage area of the cache memory, so as to use the given storage area as inter-cluster data a cluster storage area for communication; and performing operations of the cluster using the cluster storage area.

在第二方面中，本公开提供了一种片上系统，包括：多个集群，其中每个集群包括至少用于执行运算操作的多个处理器核；以及高速缓冲存储器，其与所述多个集群互联，并且配置成：将给定存储区用作集群间数据通信的集群存储区，其中所述给定存储区与片外存储器的指定存储空间形成映射关系；以及使用所述集群存储区来执行所述集群的操作。In a second aspect, the present disclosure provides a system-on-chip, comprising: a plurality of clusters, wherein each cluster includes a plurality of processor cores for at least performing arithmetic operations; and a cache memory associated with the plurality of The clusters are interconnected and configured to: use a given storage area as a cluster storage area for inter-cluster data communication, wherein the given storage area forms a mapping relationship with a designated storage space of the off-chip memory; and use the cluster storage area to Execute the operation for the cluster.

在第三方面中，本公开提供了一种计算装置，其包括如上所述以及在下文多个实施例中所述的片上系统。In a third aspect, the present disclosure provides a computing device comprising a system-on-chip as described above and in various embodiments below.

在第四方面中，本公开提供了一种板卡，包括如上所述以及在下文多个实施例中所述的计算装置。In a fourth aspect, the present disclosure provides a board, including the computing device as described above and described in various embodiments below.

在第五方面中，本公开提供了一种计算设备，包括如上所述以及在下文多个实施例中所述的板卡。In a fifth aspect, the present disclosure provides a computing device, including the board as described above and described in various embodiments below.

根据本公开上述多个方面中所提供的方案，可以利用高速缓冲存储器的给定存储区来实现片上系统的集群间的高效通信。由此，可以将本需要通过片外存储器来传递的数据直接通过该给定存储区来进行传递，由此加速数据的访存并显著提升缓存命中率。进一步，由于通过给定存储区提升了缓存命中的机率，本公开的方案也显著提升了片上系统的整体性能。另外，该给定存储区的划分简化了高速缓冲存储器的管理，并且扩展了高速缓冲存储器的使用场景。借助于该给定存储区，片上系统的多个集群可以实现多种灵活的通信机制，从而也提升了集群的操作性能。According to the solutions provided in the above-mentioned multiple aspects of the present disclosure, a given storage area of the cache memory can be utilized to realize efficient communication between clusters of a system on chip. Therefore, the data that needs to be transferred through the off-chip memory can be directly transferred through the given storage area, thereby speeding up data access and significantly improving the cache hit rate. Further, since the probability of a cache hit is increased through a given storage area, the solution of the present disclosure also significantly improves the overall performance of the SoC. In addition, the division of the given memory area simplifies the management of the cache memory and expands the usage scenarios of the cache memory. With the help of the given storage area, multiple clusters of the SoC can implement multiple flexible communication mechanisms, thereby also improving the operational performance of the clusters.

附图说明Description of drawings

通过参考附图阅读下文的详细描述，本公开示例性实施方式的上述以及其他目的、特征和优点将变得易于理解。在附图中，以示例性而非限制性的方式示出了本公开的若干实施方式，并且相同或对应的标号表示相同或对应的部分其中：The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present disclosure are shown by way of illustration and not limitation, and the same or corresponding reference numerals indicate the same or corresponding parts wherein:

图1是示出根据本公开实施例的板卡的结构图；FIG. 1 is a structural diagram showing a board according to an embodiment of the present disclosure;

图2是示出根据本公开实施例的集成电路装置的结构图；FIG. 2 is a structural diagram illustrating an integrated circuit device according to an embodiment of the present disclosure;

图3是示出根据本公开实施例的单核计算装置的内部结构示意图；3 is a schematic diagram showing the internal structure of a single-core computing device according to an embodiment of the present disclosure;

图4是示出根据本公开实施例的多核计算装置的内部结构示意图；4 is a schematic diagram showing the internal structure of a multi-core computing device according to an embodiment of the present disclosure;

图5是示出根据本公开实施例的处理器核的内部结构示意图；FIG. 5 is a schematic diagram showing the internal structure of a processor core according to an embodiment of the present disclosure;

图6是示出根据本公开实施例的用于高速缓冲存储器的方法流程图；FIG. 6 is a flowchart illustrating a method for a cache memory according to an embodiment of the present disclosure;

图7是示出根据本公开实施例的高速缓冲存储器的简化框图；Figure 7 is a simplified block diagram illustrating a cache memory according to an embodiment of the disclosure;

图8是示出根据本公开实施例的片上系统的简化框图；8 is a simplified block diagram illustrating a system-on-chip according to an embodiment of the disclosure;

图9是示出根据本公开实施例的片上系统的详细框图；9 is a detailed block diagram illustrating a system-on-chip according to an embodiment of the present disclosure;

图10是示出根据本公开实施例的页模式的示意框图；FIG. 10 is a schematic block diagram illustrating a page mode according to an embodiment of the present disclosure;

图11是示出根据本公开实施例的窗口模式下的哈希操作示意图Fig. 11 is a schematic diagram showing a hash operation in window mode according to an embodiment of the present disclosure

图12是示出根据本公开实施例的片上系统的简化框图；Figure 12 is a simplified block diagram illustrating a system-on-chip according to an embodiment of the disclosure;

图13是示出根据本公开实施例的用于片上系统的方法流程图；以及13 is a flowchart illustrating a method for a system on a chip according to an embodiment of the present disclosure; and

图14是示出根据本公开实施例的片上系统的操作框图。FIG. 14 is a block diagram illustrating an operation of a system on chip according to an embodiment of the present disclosure.

具体实施方式Detailed ways

下面将结合本披露实施例中的附图，对本披露实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例是本披露一部分实施例，而不是全部的实施例。基于本披露中的实施例，本领域技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都属于本披露保护的范围。The following will clearly and completely describe the technical solutions in the embodiments of the present disclosure with reference to the drawings in the embodiments of the present disclosure. Obviously, the described embodiments are part of the embodiments of the present disclosure, not all of them. Based on the embodiments in the present disclosure, all other embodiments obtained by those skilled in the art without creative efforts fall within the protection scope of the present disclosure.

应当理解，本披露的权利要求、说明书及附图中可能使用的术语“第一”、“第二”和“第三”等是用于区别不同对象，而不是用于描述特定顺序。本披露的说明书和权利要求书中使用的术语“包括”和“包含”指示所描述特征、整体、步骤、操作、元素和/或组件的存在，但并不排除一个或多个其它特征、整体、步骤、操作、元素、组件和/或其集合的存在或添加。It should be understood that the terms "first", "second" and "third" that may be used in the claims, specification and drawings of the present disclosure are used to distinguish different objects, rather than to describe a specific sequence. The terms "comprising" and "comprises" used in the specification and claims of this disclosure indicate the presence of described features, integers, steps, operations, elements and/or components, but do not exclude one or more other features, integers , steps, operations, elements, components, and/or the presence or addition of collections thereof.

还应当理解，在此本披露说明书中所使用的术语仅仅是出于描述特定实施例的目的，而并不意在限定本披露。如在本披露说明书和权利要求书中所使用的那样，除非上下文清楚地指明其它情况，否则单数形式的“一”、“一个”及“该”意在包括复数形式。还应当进一步理解，在本披露说明书和权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合，并且包括这些组合。It should also be understood that the terminology used in this disclosure description is for the purpose of describing specific embodiments only, and is not intended to limit the present disclosure. As used in this disclosure and the claims, the singular forms "a", "an" and "the" are intended to include plural referents unless the context clearly dictates otherwise. It should also be further understood that the term "and/or" used in the present disclosure and the claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.

如在本说明书和权利要求书中所使用的那样，术语“如果”可以依据上下文被解释为“当...时”或“一旦”或“响应于确定”或“响应于检测到”。类似地，短语“如果确定”或“如果检测到[所描述条件或事件]”可以依据上下文被解释为意指“一旦确定”或“响应于确定”或“一旦检测到[所描述条件或事件]”或“响应于检测到[所描述条件或事件]”。As used in this specification and claims, the term "if" may be interpreted as "when" or "once" or "in response to determining" or "in response to detecting" depending on the context. Similarly, the phrase "if determined" or "if [the described condition or event] is detected" may be construed, depending on the context, to mean "once determined" or "in response to the determination" or "once detected [the described condition or event] ]” or “in response to detection of [described condition or event]”.

下面结合附图来详细描述本公开的具体实施方式。Specific embodiments of the present disclosure will be described in detail below in conjunction with the accompanying drawings.

图1示出根据本公开实施例的一种板卡10的结构示意图。可以理解的是图1所示结构和组成仅仅是一种示例，其并不用于在任何方面对本公开的方案进行限制。FIG. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure. It can be understood that the structure and composition shown in FIG. 1 are only an example, and are not intended to limit the solution of the present disclosure in any respect.

如图1所示，板卡10包括芯片101，其可以是一种系统级芯片(System on Chip，SoC)，也即本公开上下文中所描述的片上系统。在一个实施场景中，其可以集成有一个或多个组合处理装置。前述组合处理装置可以是一种人工智能运算单元，用以支持各类深度学习和机器学习算法，满足计算机视觉、语音、自然语言处理、数据挖掘等领域复杂场景下的智能处理需求，特别是深度学习技术大量应用在云端智能领域。云端智能应用的一个显著特点是输入数据量大，对平台的存储能力和计算能力有很高的要求，而本实施例的板卡10适用在云端智能应用，具有庞大的片外存储、片上存储和强大的计算能力。As shown in FIG. 1 , the board 10 includes a chip 101 , which may be a system-on-chip (System on Chip, SoC), that is, a system-on-chip described in the context of the present disclosure. In one implementation scenario, it may be integrated with one or more combined processing devices. The aforementioned combination processing device can be an artificial intelligence computing unit, which is used to support various deep learning and machine learning algorithms, and meet the intelligent processing requirements in complex scenarios in the fields of computer vision, voice, natural language processing, data mining, etc., especially in-depth Learning technology is widely used in the field of cloud intelligence. A notable feature of cloud intelligent applications is the large amount of input data, which has high requirements on the storage and computing capabilities of the platform. The board 10 of this embodiment is suitable for cloud intelligent applications and has huge off-chip storage and on-chip storage. and powerful computing power.

进一步如图中所示，芯片101通过对外接口装置102与外部设备103相连接。根据不同的应用场景，外部设备103例如可以是服务器、计算机、摄像头、显示器、鼠标、键盘、网卡或wifi接口等。待处理的数据可以由外部设备103通过对外接口装置102传递至芯片101。芯片101的计算结果可以经由对外接口装置102传送回外部设备103。根据不同的应用场景，对外接口装置102可以具有不同的接口形式，例如PCIe接口等。As further shown in the figure, the chip 101 is connected to an external device 103 through an external interface device 102 . According to different application scenarios, the external device 103 may be, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card or a wifi interface, and the like. The data to be processed can be transmitted to the chip 101 by the external device 103 through the external interface device 102 . The calculation result of the chip 101 can be sent back to the external device 103 via the external interface device 102 . According to different application scenarios, the external interface device 102 may have different interface forms, such as a PCIe interface and the like.

板卡10还可以包括用于存储数据的存储器件104，其包括一个或多个存储单元105。存储器件104通过总线与控制器件106和芯片101进行连接和数据传输。板卡10中的控制器件106可以配置用于对芯片101的状态进行调控。为此，在一个应用场景中，控制器件106可以包括单片机(Micro Controller Unit，MCU)。The board 10 may also include a storage device 104 for storing data, which includes one or more storage units 105 . The storage device 104 is connected and data transmitted with the control device 106 and the chip 101 through the bus. The control device 106 in the board 10 may be configured to regulate the state of the chip 101 . To this end, in an application scenario, the control device 106 may include a microcontroller (Micro Controller Unit, MCU).

图2是示出根据上述实施例的芯片101中的组合处理装置的结构图。如图2中所示，组合处理装置20可以包括计算装置201、接口装置202、处理装置203和动态随机存取存储器(Dynamic Random Access Memory，DRAM)DRAM 204。FIG. 2 is a structural diagram showing a combination processing device in the chip 101 according to the above-described embodiment. As shown in FIG. 2 , the combined processing device 20 may include a computing device 201 , an interface device 202 , a processing device 203 and a Dynamic Random Access Memory (Dynamic Random Access Memory, DRAM) DRAM 204 .

计算装置201可以配置成执行用户指定的操作，主要实现为单核智能处理器或者多核智能处理器。在一些操作中，其可以用于执行深度学习或机器学习方面的计算，并且还可以通过接口装置202与处理装置203进行交互，以共同完成用户指定的操作。The computing device 201 can be configured to perform user-specified operations, and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor. In some operations, it can be used to perform deep learning or machine learning calculations, and can also interact with the processing device 203 through the interface device 202 to jointly complete operations specified by the user.

接口装置202可以用于在计算装置201与处理装置203间传输数据和控制指令。例如，计算装置201可以经由接口装置202从处理装置203中获取输入数据，写入计算装置201片上的存储装置。进一步，计算装置201可以经由接口装置202从处理装置203中获取控制指令，写入计算装置201片上的控制缓存中。替代地或可选地，接口装置202也可以读取计算装置201的存储装置中的数据并传输给处理装置203。The interface device 202 can be used to transmit data and control instructions between the computing device 201 and the processing device 203 . For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it into a storage device on the computing device 201 . Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into the control cache on the chip of the computing device 201 . Alternatively or optionally, the interface device 202 may also read data in the storage device of the computing device 201 and transmit it to the processing device 203 .

处理装置203作为通用的处理装置，执行包括但不限于数据搬运、对计算装置201的开启和/或停止等基本控制。根据实现方式的不同，处理装置203可以是中央处理器(Central Processing Unit，CPU)、图形处理器(Graphics Processing Unit，GPU)或其他通用和/或专用处理器中的一种或多种类型的处理器，这些处理器包括但不限于数字信号处理器(Digital Signal Processor，DSP)、专用集成电路(Application SpecificIntegrated circuit，ASIC)、现场可编程门阵列(Field-Programmable Gate Array，FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等，并且其数目可以根据实际需要来确定。如前所述，仅就本公开的计算装置201而言，其可以视为具有单核结构或者同构多核结构。然而，当将计算装置201和处理装置203整合共同考虑时，二者视为形成异构多核结构。As a general processing device, the processing device 203 performs basic control including but not limited to data transfer, starting and/or stopping the computing device 201 . According to different implementations, the processing device 203 may be one or more types of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU) or other general-purpose and/or special-purpose processors. Processors, these processors include but not limited to Digital Signal Processor (Digital Signal Processor, DSP), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (Field-Programmable Gate Array, FPGA) or other possible Programming logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and the number thereof can be determined according to actual needs. As mentioned above, as far as the computing device 201 of the present disclosure is concerned, it can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when considering the integration of the computing device 201 and the processing device 203 together, they are considered to form a heterogeneous multi-core structure.

DRAM 204用以存储待处理的数据，为DDR内存，大小通常为16G或更大，用于保存计算装置201和/或处理装置203的数据。The DRAM 204 is used to store data to be processed, and is a DDR memory, usually 16G or larger in size, for storing data of the computing device 201 and/or the processing device 203 .

图3示出了计算装置201为单核的内部结构示意图。单核计算装置301用以处理计算机视觉、语音、自然语言、数据挖掘等输入数据，单核计算装置301包括三大模块：控制模块31、运算模块32及存储模块33。FIG. 3 shows a schematic diagram of the internal structure of the computing device 201 as a single core. The single-core computing device 301 is used to process input data such as computer vision, speech, natural language, data mining, etc. The single-core computing device 301 includes three modules: a control module 31 , an operation module 32 and a storage module 33 .

控制模块31用以协调并控制运算模块32和存储模块33的工作，以完成深度学习的任务，其包括取指单元(Instruction Fetch Unit，IFU)311及指令译码单元(InstructionDecode Unit，IDU)312。取指单元311用以获取来自处理装置203的指令，指令译码单元312则将获取的指令进行译码，并将译码结果作为控制信息发送给运算模块32和存储模块33。The control module 31 is used to coordinate and control the work of the operation module 32 and the storage module 33 to complete the task of deep learning, which includes an instruction fetch unit (Instruction Fetch Unit, IFU) 311 and an instruction decoding unit (Instruction Decode Unit, IDU) 312 . The instruction fetching unit 311 is used to obtain instructions from the processing device 203 , and the instruction decoding unit 312 decodes the obtained instructions and sends the decoding results to the computing module 32 and the storage module 33 as control information.

运算模块32包括向量运算单元321及矩阵运算单元322。向量运算单元321用以执行向量运算，可支持向量乘、加、非线性变换等复杂运算；矩阵运算单元322负责深度学习算法的核心计算，即矩阵乘及卷积。存储模块33用来存储或搬运相关数据，包括神经元存储单元(Neuron RAM，NRAM)331、参数存储单元(Weight RAM，WRAM)332、直接内存访问模块(Direct Memory Access，DMA)333。NRAM 331用以存储输入神经元、输出神经元和计算后的中间结果；WRAM 332则用以存储深度学习网络的卷积核，即权值；DMA 333通过总线34连接DRAM 204，负责单核计算装置301与DRAM 204间的数据搬运。The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322 . The vector operation unit 321 is used to perform vector operations, and can support complex operations such as vector multiplication, addition, and nonlinear transformation; the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, namely matrix multiplication and convolution. The storage module 33 is used to store or transfer relevant data, including a neuron storage unit (Neuron RAM, NRAM) 331 , a parameter storage unit (Weight RAM, WRAM) 332 , and a direct memory access module (Direct Memory Access, DMA) 333 . NRAM 331 is used to store input neurons, output neurons and calculated intermediate results; WRAM 332 is used to store convolution kernels of deep learning networks, namely weights; DMA 333 is connected to DRAM 204 through bus 34 and is responsible for single-core calculations Data transfer between the device 301 and the DRAM 204 .

图4示出了计算装置201为多核的内部结构示意图。多核计算装置41采用分层结构设计，多核计算装置41作为一个片上系统，其包括根据本公开的至少一个集群(cluster)，每个集群又包括多个处理器核。换言之，多核计算装置41是以片上系统-集群-处理器核的层次所构成的。以片上系统的层级来看，如图4所示，多核计算装置41包括外部存储控制器401、外设通信模块402、片上互联模块403、同步模块404以及多个集群405。FIG. 4 shows a schematic diagram of the internal structure of the computing device 201 as a multi-core. The multi-core computing device 41 adopts a hierarchical structure design, and the multi-core computing device 41 is a system on chip, which includes at least one cluster (cluster) according to the present disclosure, and each cluster includes multiple processor cores. In other words, the multi-core computing device 41 is constituted at the level of SoC-cluster-processor core. Viewed from the system-on-chip level, as shown in FIG. 4 , the multi-core computing device 41 includes an external storage controller 401 , a peripheral communication module 402 , an on-chip interconnection module 403 , a synchronization module 404 and multiple clusters 405 .

外部存储控制器401可以有多个(如图中示例性地示出2个)，其用以响应处理器核发出的访问请求，访问外部存储设备，也即本公开上下文中的片外存储器(例如图2中的DRAM 204)，从而自片外读取数据或是将数据写入。外设通信模块402用以通过接口装置202接收来自处理装置203的控制信号，启动计算装置201执行任务。片上互联模块403将外部存储控制器401、外设通信模块402及多个集群405连接起来，用以在各个模块间传输数据和控制信号。同步模块404是一种全局同步屏障控制器(Global Barrier Controller，GBC)，用以协调各集群的工作进度，确保信息的同步。本公开的多个集群405是多核计算装置41的计算核心。尽管在图4中示例性地示出4个集群，然而，随着硬件的发展，本公开的多核计算装置41还可以包括8个、16个、64个、甚至更多的集群405。在一个应用场景中，集群405可以用于高效地执行深度学习算法。There may be multiple external memory controllers 401 (two are exemplarily shown in the figure), which are used to respond to the access request issued by the processor core, and access the external memory device, that is, the off-chip memory in the context of the present disclosure ( For example, the DRAM 204 in FIG. 2 ), so as to read or write data from off-chip. The peripheral communication module 402 is used for receiving a control signal from the processing device 203 through the interface device 202 to start the computing device 201 to execute tasks. The on-chip interconnection module 403 connects the external storage controller 401, the peripheral communication module 402 and multiple clusters 405 to transmit data and control signals among the modules. The synchronization module 404 is a global synchronization barrier controller (Global Barrier Controller, GBC), which is used to coordinate the work progress of each cluster and ensure the synchronization of information. The plurality of clusters 405 of the present disclosure are the computing cores of the multi-core computing device 41 . Although 4 clusters are exemplarily shown in FIG. 4 , with the development of hardware, the multi-core computing device 41 of the present disclosure may also include 8, 16, 64, or even more clusters 405 . In an application scenario, the cluster 405 can be used to efficiently execute deep learning algorithms.

以集群的层级来看，如图4所示，每个集群405可以包括多个处理器核(IPU core)406及一个存储核(MEM core)407，其例如可以包括本公开上下文所描述的高速缓冲存储器(例如LLC)。Viewed at the cluster level, as shown in FIG. 4 , each cluster 405 may include a plurality of processor cores (IPU core) 406 and a storage core (MEM core) 407, which may include, for example, the high-speed Buffer memory (eg LLC).

处理器核406在图中示例性地示出为4个，本公开不限制处理器核406的数量，并且其内部架构如图5所示。每个处理器核406类似于图3的单核计算装置301，并且同样可以包括三个模块：控制模块51、运算模块52和存储模块53。控制模块51、运算模块52及存储模块53的功用及结构大致与控制模块31、运算模块32及存储模块33相同，此处不再赘述。需特别说明的是，存储模块53可以包括输入/输出直接内存访问模块(Input/Output DirectMemory Access，IODMA)533、搬运直接内存访问模块(Move Direct Memory Access，MVDMA)534。IODMA 533通过广播总线409控制NRAM 531/WRAM 532与DRAM 204的访存；MVDMA 534则用以控制NRAM 531/WRAM 532与存储单元(SRAM)408的访存。The number of processor cores 406 is exemplarily shown in the figure as four, and the present disclosure does not limit the number of processor cores 406, and its internal architecture is shown in FIG. 5 . Each processor core 406 is similar to the single-core computing device 301 in FIG. 3 , and may also include three modules: a control module 51 , an operation module 52 and a storage module 53 . The functions and structures of the control module 51 , computing module 52 and storage module 53 are roughly the same as those of the control module 31 , computing module 32 and storage module 33 , and will not be repeated here. It should be noted that the storage module 53 may include an input/output direct memory access module (Input/Output Direct Memory Access, IODMA) 533 and a moving direct memory access module (Move Direct Memory Access, MVDMA) 534 . The IODMA 533 controls the memory access of the NRAM 531 /WRAM 532 and the DRAM 204 through the broadcast bus 409 ; the MVDMA 534 is used to control the memory access of the NRAM 531 /WRAM 532 and the storage unit (SRAM) 408 .

回到图4，存储核407主要用以存储和通信，即存储处理器核406间的共享数据或中间结果、以及执行集群405与DRAM 204之间的通信、集群405间彼此的通信、处理器核406间彼此的通信等。在其他实施例中，存储核407可以具有标量运算的能力，用以执行标量运算。Returning to FIG. 4, the storage core 407 is mainly used for storage and communication, that is, storing shared data or intermediate results between the processor cores 406, executing communication between the cluster 405 and the DRAM 204, communication between the clusters 405, processors communication between the cores 406 and the like. In other embodiments, the storage core 407 may have a scalar operation capability to perform scalar operations.

存储核407可以包括静态随机存取存储器(Static Random-Access Memory，SRAM)408、广播总线409、集群直接内存访问模块(Cluster Direct Memory Access，CDMA)410及全局直接内存访问模块(Global Direct Memory Access，GDMA)411。在一个实施场景中，SRAM 408可以承担高性能数据中转站的角色。由此，在同一个集群405内不同处理器核406之间所复用的数据不需要通过处理器核406各自向DRAM 204获得，而是经SRAM 408在处理器核406间中转。进一步，存储核407仅需要将复用的数据从SRAM 408迅速分发给多个处理器核406即可，从而可以提高核间通信效率，并显著减少片上片外的输入/输出访问。The storage core 407 may include a static random access memory (Static Random-Access Memory, SRAM) 408, a broadcast bus 409, a cluster direct memory access module (Cluster Direct Memory Access, CDMA) 410 and a global direct memory access module (Global Direct Memory Access , GDMA) 411. In an implementation scenario, the SRAM 408 can take on the role of a high-performance data transfer station. Thus, the data multiplexed between different processor cores 406 in the same cluster 405 does not need to be obtained from the DRAM 204 by each of the processor cores 406 , but is transferred between the processor cores 406 via the SRAM 408 . Further, the storage core 407 only needs to quickly distribute the multiplexed data from the SRAM 408 to multiple processor cores 406, thereby improving the communication efficiency between cores and significantly reducing on-chip and off-chip input/output access.

广播总线409、CDMA 410及GDMA 411则分别用来执行处理器核406间的通信、集群405间的通信和集群405与DRAM 204的数据传输。以下将分别说明。The broadcast bus 409 , CDMA 410 and GDMA 411 are respectively used to perform communication between processor cores 406 , communication between clusters 405 and data transmission between clusters 405 and DRAM 204 . They will be described separately below.

广播总线409用以完成集群405内各处理器核406间的高速通信，此实施例的广播总线409支持核间通信方式包括单播、多播与广播。单播是指点对点(例如单一处理器核至单一处理器核)的数据传输，多播是将一份数据从SRAM 408传输到特定几个处理器核406的通信方式，而广播则是将一份数据从SRAM 408传输到所有处理器核406的通信方式，属于多播的一种特例。The broadcast bus 409 is used to complete high-speed communication among the processor cores 406 in the cluster 405 . The broadcast bus 409 in this embodiment supports inter-core communication methods including unicast, multicast and broadcast. Unicast refers to point-to-point (such as single processor core to single processor core) data transmission, multicast is a communication method that transmits a piece of data from SRAM 408 to specific processor cores 406, and broadcast is a The communication method in which a copy of data is transmitted from the SRAM 408 to all processor cores 406 is a special case of multicast.

CDMA 410用以控制在同一个计算装置201内不同集群405间的SRAM 408的访存。GDMA 411与外部存储控制器401协同，用以控制集群405的SRAM 408到DRAM 204的访存，或是将数据自DRAM 204读取至SRAM 408中。从前述可知，DRAM 204与NRAM 431或WRAM 432间的通信可以经由2种方式来实现。第一种方式是通过IODAM 433直接和DRAM 204与NRAM 431或WRAM 432通信；第二种方式是先经由GDMA 411使得数据在DRAM 204与SRAM 408间传输，再经过MVDMA 534使得数据在SRAM 408与NRAM 431或WRAM 432间传输。尽管第二种方式可能需要更多的元件参与且数据流较长，但实际上在部分实施例中，第二种方式的带宽远大于第一种方式，因此通过第二种方式来执行DRAM 204与NRAM 431或WRAM 432间的通信可能更为有效。可以理解的是，这里所描述的数据传输方式仅仅是示例性的，并且本领域技术人员根据本公开的教导，也可以根据硬件的具体布置来灵活地选择和适用各种数据传输方式。CDMA 410 is used to control the access of SRAM 408 between different clusters 405 in the same computing device 201 . The GDMA 411 cooperates with the external memory controller 401 to control memory access from the SRAM 408 of the cluster 405 to the DRAM 204 , or to read data from the DRAM 204 to the SRAM 408 . It can be seen from the foregoing that the communication between the DRAM 204 and the NRAM 431 or WRAM 432 can be realized in two ways. The first way is to directly communicate with DRAM 204 and NRAM 431 or WRAM 432 through IODAM 433; the second way is to first transmit data between DRAM 204 and SRAM 408 through GDMA 411, and then make data transfer between SRAM 408 and SRAM 408 through MVDMA 534 transfer between NRAM 431 or WRAM 432 . Although the second way may require more components to participate and the data flow is longer, in fact, in some embodiments, the bandwidth of the second way is much larger than the first way, so the second way is used to implement the DRAM 204 Communication with NRAM 431 or WRAM 432 may be more efficient. It can be understood that the data transmission methods described here are only exemplary, and those skilled in the art can flexibly select and apply various data transmission methods according to the specific arrangement of hardware according to the teaching of the present disclosure.

在其他的实施例中，GDMA 411的功能和IODMA 533的功能可以整合在同一部件中。尽管本公开为了方便描述，将GDMA 411和IODMA 533视为不同的部件，然而对于本领域技术人员来说，只要其实现的功能以及达到的技术效果与本公开类似，即属于本公开的保护范围。进一步地，GDMA 411的功能、IODMA 533的功能、CDMA 410的功能、MVDMA 534的功能也可以由同一部件来实现。In other embodiments, the function of GDMA 411 and the function of IODMA 533 can be integrated in the same component. Although this disclosure regards GDMA 411 and IODMA 533 as different components for the convenience of description, for those skilled in the art, as long as their functions and technical effects are similar to those of this disclosure, they belong to the protection scope of this disclosure . Furthermore, the functions of GDMA 411, IODMA 533, CDMA 410, and MVDMA 534 can also be implemented by the same component.

以上结合图1-图5对本公开的硬件架构及其内部结构进行了详细的描述。可以理解的是上述描述仅仅是示例性的而非限制性的。根据不同的应用场景和硬件规格，本领域技术人员也可以对本公开的板卡及其内部结构进行改变，而这些改变依然落入本公开的保护范围内。例如，在本公开下面将用描述的方案中，其对应的硬件架构可以是不包括用以控制在同一个计算装置201内不同集群405间对SRAM 408访存的CDMA410。取而代之，本公开下面的方案涉及对例如设置在SRAM408和DRAM204之间的高速缓冲存储器进行改进和优化，以通过高速缓冲存储器实现高效的数据按需锁存和不同集群间的通信。The hardware architecture and its internal structure of the present disclosure have been described in detail above with reference to FIGS. 1-5 . It is to be understood that the foregoing description is illustrative only and not restrictive. According to different application scenarios and hardware specifications, those skilled in the art may also make changes to the board card and its internal structure of the present disclosure, and these changes still fall within the protection scope of the present disclosure. For example, in the solution described below in the present disclosure, the corresponding hardware architecture may not include the CDMA 410 used to control the access to the SRAM 408 between different clusters 405 in the same computing device 201 . Instead, the underlying approach of the present disclosure involves improving and optimizing the cache, eg, disposed between SRAM 408 and DRAM 204, to enable efficient on-demand latching of data and communication between different clusters through the cache.

为了高效地使用高速缓冲存储器(例如LLC)并提高数据访问的命中率，本公开下面的方案提出将高速缓冲存储器中的特定存储空间配置成锁存区，以用于数据的锁存操作，特别是针对于将要被频繁使用的数据。例如，前述频繁使用的数据可以是将要在具有数据依赖关系的至少一个任务之间重复使用的数据。可以理解的是，当只需使用数据一次时，则可以不将该数据锁存在高速缓冲存储器中。In order to efficiently use the cache memory (such as LLC) and improve the hit rate of data access, the following scheme of the present disclosure proposes to configure a specific storage space in the cache memory as a latch area for data latch operations, especially It is for data that will be frequently used. For example, the aforementioned frequently used data may be data to be reused between at least one task having a data dependency. It will be appreciated that data need not be locked in the cache memory when the data need only be used once.

进一步，在前述配置锁存区以用于数据锁存的基础上，本公开下面的方案还提出将高速缓冲存储器配置成支持多种锁存模式，以便在接收到锁存相关请求时，令高速缓冲存储器操作在与前述锁存相关请求对应的锁存模式下。根据不同的应用场景和需求，本公开的多种锁存模式可以具有特定的优先级顺序，以满足不同的锁存相关操作。另外，为了使得高速缓冲存储器支持多种锁存模式，本公开的方案也提出多种不同的配置方法，使得可以更为灵活地使用高速缓冲存储器并且利用其实现集群间的通信。Further, on the basis of configuring the latch area for data latching, the following solution of the present disclosure also proposes to configure the cache memory to support multiple latch modes, so that when a latch-related request is received, the high-speed The buffer memory operates in a latch mode corresponding to the aforementioned latch-related request. According to different application scenarios and requirements, various latch modes of the present disclosure may have a specific priority order to satisfy different latch-related operations. In addition, in order to enable the cache memory to support multiple latch modes, the solution of the present disclosure also proposes a variety of different configuration methods, so that the cache memory can be used more flexibly and used to realize inter-cluster communication.

图6是示出根据本公开实施例的用于高速缓冲存储器的方法600的流程图。如图6中所示，方法600包括在步骤S602处，将所述高速缓冲存储器中的特定存储空间配置为支持多种锁存模式的锁存区。在一个实施例中，前述的多种锁存模式可以包括但不限于基于硬件指令来执行锁存相关操作的指令模式、基于窗口属性来执行锁存相关操作的窗口模式、基于数据流来执行锁存相关操作的流模式和/或基于缓存页来执行锁存相关操作的页模式。在一个实施例中，前述的数据流可以是具有不同类型的指令流或数据流。以数据流为例，在神经网络的应用场景中，该数据流可以是神经网络模型的神经元数据流、权重数据流、输出结果数据流等。另外，在本公开的上下文场景中，锁存相关操作所针对的数据是将要由片上系统的处理器多次使用到的数据，其相对于未进行锁存操作的数据具有相对高的优先级。通过将这样多次使用的数据锁存(或称驻留)于本公开的锁存区中，可以显著提升缓存命中率，由此改善系统的整体性能。另外，通过将重复使用到的数据驻留在LLC的锁存区中，可以减小数据在片上系统和片外存储器(例如DDR或DRAM)之间的读写操作，从而也提高了访存效率。FIG. 6 is a flowchart illustrating a method 600 for a cache memory according to an embodiment of the disclosure. As shown in FIG. 6 , the method 600 includes, at step S602 , configuring a specific storage space in the cache memory as a latch area supporting multiple latch modes. In one embodiment, the aforementioned multiple latch modes may include, but not limited to, an instruction mode for performing latch-related operations based on hardware instructions, a window mode for performing latch-related operations based on window attributes, and a lock mode for performing latch-related operations based on data streams. Streaming mode for store-related operations and/or page mode for latch-related operations based on cache pages. In one embodiment, the aforementioned data streams may be instruction streams or data streams of different types. Taking the data stream as an example, in the application scenario of the neural network, the data stream may be the neuron data stream, weight data stream, output result data stream, etc. of the neural network model. In addition, in the context of the present disclosure, the data targeted by the latch-related operation is data that will be used multiple times by the processor of the system-on-chip, and has relatively higher priority than the data that is not subjected to the latch operation. By latching (or residing in) such multiple-used data in the latch area of the present disclosure, the cache hit rate can be significantly improved, thereby improving the overall performance of the system. In addition, by residing the reused data in the latch area of the LLC, the read and write operations of data between the on-chip system and off-chip memory (such as DDR or DRAM) can be reduced, thereby improving memory access efficiency .

在一个应用场景中，上述的多种锁存模式可以根据用户的偏好或系统的优选项而设置为具有不同的优先级。例如，在一个实施方式中，优先级的高低顺序可以是指令模式—>窗口模式—>流模式—>页模式；在另一个实施方式中，优先级的高低顺序可以是指令模式—>页模式—>流模式—>窗口模式。通过这样的多模式和优先级设置，可以以更多方式来使用高速缓冲存储器中的锁存区，增加了锁存区使用的灵活性以应对不同的应用场景和系统要求。进一步地，可以根据上述锁存模式的优先级顺序依次遍历，当高优先级的锁存模式被禁用时，可以采用低优先级的锁存模式。In an application scenario, the above-mentioned multiple latch modes can be set to have different priorities according to user preferences or system preferences. For example, in one implementation, the order of priority may be instruction mode -> window mode -> stream mode -> page mode; in another implementation, the order of priority may be instruction mode -> page mode —> Stream Mode —> Window Mode. Through such multi-mode and priority setting, the latch area in the cache memory can be used in more ways, increasing the flexibility of using the latch area to cope with different application scenarios and system requirements. Further, it may traverse sequentially according to the priority order of the above-mentioned latch modes, and when the high-priority latch mode is disabled, the low-priority latch mode may be adopted.

在一个实施例中，可以根据接收到的多种配置指令中的一种配置指令将特定存储空间配置成支持对应的一种锁存模式的锁存区。在一个场景中，该配置指令可以包括一个或多个配置项，以实现对前述锁存区的配置。例如，该多个配置项可以包括启用锁存区、禁用锁存区和/或锁存区大小的配置项。进一步，可以在前述的指令模式、窗口模式、流模式或页模式来配置相应的锁存策略(例如锁存数据的大小或需锁存的具体数据)，以用于锁存不同类型或特定的指令、数据或数据流等。在不同模式下来配置相应的锁存策略具体可参见下文的描述。通过这样的启用、禁用和多种具体的配置，本公开的方案可以实现灵活地使用高速缓冲存储器，使得其可以根据需要操作于本公开的多种锁存模式之一，或者操作于常规模式。In an embodiment, a specific storage space may be configured as a latch area supporting a corresponding latch mode according to one configuration instruction among the received configuration instructions. In one scenario, the configuration instruction may include one or more configuration items, so as to realize the configuration of the aforementioned latch area. For example, the plurality of configuration items may include configuration items for enabling a latch area, disabling a latch area, and/or a size of a latch area. Further, the corresponding latch strategy (such as the size of the latch data or the specific data to be latched) can be configured in the aforementioned instruction mode, window mode, stream mode or page mode, so as to latch different types or specific Instructions, data or data flow, etc. For details on configuring the corresponding latch strategy in different modes, please refer to the description below. Through such enabling, disabling and various specific configurations, the scheme of the present disclosure can realize the flexible use of the cache memory, so that it can operate in one of the various latch modes of the present disclosure, or operate in the normal mode as required.

返回到图6中的流程图，在完成上述步骤S602处的配置操作后，在步骤S604处，接收在锁存区中对数据进行锁存相关操作的锁存相关请求。根据本公开的实施例，该锁存相关请求可以由旨在将特定的数据驻留于锁存区的操作来触发。替代地，该锁存相关请求也可以由旨在将特定的数据从锁存区移除或释放的操作来触发。如前文所详细描述的，当操作于不同的锁存模式时，本公开的锁存相关请求也可以具有不同的表达形式或内容。例如，对于指令模式、窗口模式或流模式，锁存相关请求可以包括用于指示高速缓冲存储器的行为属性的配置项等。Returning to the flow chart in FIG. 6, after the configuration operation at step S602 is completed, at step S604, a latch-related request for performing latch-related operations on data in the latch area is received. According to an embodiment of the present disclosure, the latch-related request may be triggered by an operation intended to reside specific data in a latch region. Alternatively, the latch-related request may also be triggered by an operation intended to remove or release specific data from the latch area. As described in detail above, when operating in different latch modes, the latch-related requests of the present disclosure may also have different expressions or contents. For example, for an instruction mode, a window mode, or a stream mode, the latch-related request may include a configuration item for indicating a behavior attribute of the cache memory, and the like.

在一个实施例中，上述用于指示高速缓冲存储器的行为属性的配置项至少包括以下多个配置属性中的一个：In one embodiment, the above-mentioned configuration item for indicating the behavior attribute of the cache memory includes at least one of the following multiple configuration attributes:

瞬态(Transient)属性：不在LLC内进行缓存，也即与片外存储器(如DDR)直接进行数据的读写操作；用以对于某些只访问一次的数据，不在LLC内进行缓存，从而避免占用LLC资源；Transient (Transient) attribute: do not cache in LLC, that is, directly perform data read and write operations with off-chip memory (such as DDR); for some data that is only accessed once, do not cache in LLC, thereby avoiding Occupy LLC resources;

锁定(Lock)属性：将特定的数据驻留于锁存区，从命中的缓存行(cacheline)中读写数据。若缓存行属于锁存区，则缓存行属性配置为持久(persisting)属性；若缓存行不属于锁存区，则缓存行的属性不变，即保持下面的常规(normal)属性；应当清楚的是，上述的锁存区的缓存行具有两种属性，即持久(persisting)属性和常规(normal)属性。该锁存区中持久(persisting)属性的缓存行仅可以被附带Lock属性的锁存相关请求访问和替换。Lock (Lock) attribute: Reside specific data in the latch area, and read and write data from the hit cache line (cacheline). If the cache line belongs to the latch area, the attribute of the cache line is configured as a persistent (persisting) attribute; if the cache line does not belong to the latch area, the attribute of the cache line remains unchanged, that is, the following normal (normal) attributes are maintained; it should be clear Yes, the above-mentioned cache line in the latch area has two attributes, namely a persistent (persisting) attribute and a normal (normal) attribute. A cache line with a persistent (persisting) attribute in the lock area can only be accessed and replaced by a lock-related request with a Lock attribute.

解锁(Unlock)属性：将从命中的缓存行中读写数据后，释放LLC中锁存区内数据的对应存储空间，并将锁存区中的相应的缓存行属性设置为下面的常规属性；Unlock (Unlock) attribute: After reading and writing data from the hit cache line, release the corresponding storage space of the data in the latch area in the LLC, and set the corresponding cache line attribute in the latch area to the following general attributes;

常规属性：在LLC内正常缓存的请求，可以直接与片外存储器进行读写数据；General attributes: requests that are normally cached in the LLC can directly read and write data with off-chip memory;

无效(Invalid)属性：读取之后直接无效数据，避免被替换写入到片外存储器；Invalid attribute: Invalid data directly after reading to avoid being replaced and written to the off-chip memory;

干净(Clean)属性：在执行写操作时，可以将数据写入命中的缓存行中，并将整个高速缓冲存储器(cache)的存储内容写回到片外存储器，缓存行的属性保持不变；读操作时，从命中的缓存行中读取数据。当该命中的缓存行为脏(dirty)时，则将其写回到片外存储器中；Clean (Clean) attribute: When performing a write operation, data can be written into the hit cache line, and the storage content of the entire cache memory (cache) can be written back to the off-chip memory, and the attributes of the cache line remain unchanged; During a read operation, data is read from the hit cache line. When the hit cache line is dirty (dirty), write it back to the off-chip memory;

默认(Default)属性：该默认项可以用以指示忽略关于锁存模式的配置。Default (Default) attribute: the default item can be used to indicate that the configuration about the latch mode is ignored.

通过在锁存相关请求中附带上述示例性的可配置属性后，本公开的方案可以根据这些附带的属性来执行指令模式下的对应锁存相关操作。By attaching the above exemplary configurable attributes to the latch-related request, the solution of the present disclosure can execute corresponding latch-related operations in the instruction mode according to these attached attributes.

再例如，对于页模式，锁存相关请求可以指示与特定的页相关的数据将被锁存于锁存区中以用于后续的多次使用，或可以指示与特定的页相关的数据在经多次使用后，从锁存区解锁以释放出更多的存储空间来用于后续的数据锁存。可以理解的是，通过释放操作，锁存区的存储空间可以被灵活地使用，从而提高了本公开锁存区的使用效率。For another example, for the page mode, the latch related request may indicate that the data related to the specific page will be latched in the latch area for subsequent multiple use, or may indicate that the data related to the specific page will be After multiple uses, unlock from the latch area to release more storage space for subsequent data latches. It can be understood that, through the release operation, the storage space of the latch area can be used flexibly, thereby improving the utilization efficiency of the latch area of the present disclosure.

返回到图6的流程，响应于上述步骤S604的锁存相关请求，在步骤S606处，可以根据锁存相关请求，以对应的锁存模式在锁存区中对数据执行锁存相关操作。根据本公开的实施例，前述的锁存相关操作可以包括针对于锁存区的读操作和写操作。在一个实施方式中，针对于锁存区的写操作，方法600还可以包括根据锁存相关请求将数据或选定的部分数据锁存于锁存区的指定区域内，以便用于后续的多次读取。在另一个实施方式中，针对于锁存区的读操作，方法600还可以包括在执行完读操作后，根据锁存相关请求将数据或选定的部分所述数据从所述锁存区的指定区域释放。Returning to the process of FIG. 6 , in response to the latch-related request in step S604 , at step S606 , according to the latch-related request, a latch-related operation may be performed on data in the latch area in a corresponding latch mode. According to an embodiment of the present disclosure, the aforementioned latch-related operations may include a read operation and a write operation for the latch area. In one embodiment, for the write operation of the latch area, the method 600 may also include latching data or a selected part of the data in a specified area of the latch area according to a latch-related request, so as to be used in subsequent multiple reads. In another embodiment, for the read operation of the latch area, the method 600 may further include, after the read operation is completed, transferring the data or a selected part of the data from the latch area according to the latch-related request The specified area is released.

关于前述选定的部分数据，在一个实施例中，可以以随机的方式从数据中选择预定比例的数据形成前述的部分数据来锁存于锁存区中。在另一个实施例中，可以利用哈希算法从数据中选择预定比例的数据作为前述的部分数据来锁存于锁存区中。在进一步的实施例中，当待执行锁存相关操作的数据的访存地址处于所述锁定窗口的地址范围之内，可以采用前述的哈希算法选定能够锁存在锁存区的部分数据。关于哈希算法的具体使用，将再稍后结合附图11来进行详细描述。Regarding the aforementioned selected partial data, in one embodiment, a predetermined proportion of data may be randomly selected from the data to form the aforementioned partial data to be latched in the latch area. In another embodiment, a predetermined proportion of data may be selected from the data by using a hash algorithm as the aforementioned partial data to be latched in the latch area. In a further embodiment, when the access address of the data to be latch-related operations is within the address range of the lock window, the aforementioned hash algorithm may be used to select part of the data that can be locked in the latch area. The specific use of the hash algorithm will be described in detail later in conjunction with FIG. 11 .

利用上述结合图6所描述的方法，本公开的方案使得高速缓冲存储器支持多个锁存模式，从而扩大了高速缓冲存储器的应用场景并且显著提升了缓存命中率。进一步，由于多种锁存模式的引入，也令锁存区的使用更具灵活性和适应性，从而满足不同的应用场景和用户需求。另外，由于在锁存区对数据进行有效的锁存，也促进了数据在生产者内核(“producer kernel”)和一个或多个消费者内核(“consumer kernel”)之间的共享，提升了数据的可访问性和使用率。此处的生产者内核和消费者内核可以理解为具有依赖性的两个任务，其中生产者内核的输出将作为传递至消费者内核的输入，以便消费者内核使用该输入来完成相应的任务。此时，由于生产者内核的输出将作为后续运算的输入，因此可以将该生产者内核的输出作为后续需要多次使用的数据，该后续需要多次使用的数据可以暂存在高速缓冲存储器的锁存区内，以便于消费者内核可以直接从该高速缓冲存储器中获取该输入，而无需访问片外存储器，从而减少了人工智能处理器与片外存储器之间的访存交互，降低了IO访存开销，进而可以提高人工智能处理器的处理效率及性能。Using the method described above in conjunction with FIG. 6 , the solution of the present disclosure enables the cache memory to support multiple latch modes, thereby expanding the application scenarios of the cache memory and significantly improving the cache hit rate. Furthermore, due to the introduction of multiple latch modes, the use of the latch area is more flexible and adaptable, so as to meet different application scenarios and user requirements. In addition, due to the effective latching of data in the latch area, the sharing of data between the producer kernel ("producer kernel") and one or more consumer kernels ("consumer kernel") is also promoted, improving the Data Accessibility and Usage. The producer kernel and the consumer kernel here can be understood as two dependent tasks, where the output of the producer kernel will be used as the input to the consumer kernel, so that the consumer kernel can use the input to complete the corresponding task. At this time, since the output of the producer core will be used as the input of subsequent operations, the output of the producer core can be used as data that needs to be used multiple times in the future, and the data that needs to be used multiple times in the future can be temporarily stored in the lock of the cache memory memory area, so that the consumer core can directly obtain the input from the cache memory without accessing the off-chip memory, thereby reducing the memory interaction between the artificial intelligence processor and the off-chip memory, and reducing the IO access memory overhead, which in turn can improve the processing efficiency and performance of artificial intelligence processors.

图7是示出根据本公开实施例的高速缓冲存储器700的简化框图。可以理解的是图7所示的高速缓冲存储器700可以是结合图6所描述的高速缓冲存储器，因此关于图6所描述的高速缓冲存储器也同样适用于下面关于图7的描述。FIG. 7 is a simplified block diagram illustrating a cache memory 700 according to an embodiment of the disclosure. It can be understood that the cache memory 700 shown in FIG. 7 may be the cache memory described in conjunction with FIG. 6 , so the cache memory described in FIG. 6 is also applicable to the following description in relation to FIG. 7 .

如图7中所示，本公开的高速缓冲存储器700可以包括配置模块701和锁存执行模块702。进一步，高速缓冲存储器700中还包括用于执行缓存操作的存储空间，例如，图中所示出的将存储空间平均划分成8份的8路(way0～way7)，其中每路中包括若干数目的缓存行(cacheline)。As shown in FIG. 7 , the cache memory 700 of the present disclosure may include a configuration module 701 and a latch execution module 702 . Further, the cache memory 700 also includes a storage space for performing cache operations, for example, as shown in the figure, the storage space is equally divided into 8 ways (way0-way7), wherein each way includes a number of The cache line (cacheline).

在一个实施例中，上述的配置模块可以用于将高速缓冲存储器中的特定存储空间配置成支持多种锁存模式的锁存区，其中该特定存储空间的大小小于高速缓冲存储器的总存储大小。例如，图7中的way0～way5可以被配置为支持锁存的特定存储空间。对应地，图7中的way6～7可以保持高速缓冲存储器的普通属性，也即作为一般缓存使用。如前所述，该锁存模式可以是指令模式、窗口模式、流模式和/或页模式。进一步，锁存执行模块可以用于接收在锁存区中对数据进行锁存相关操作的锁存相关请求。接着，该锁存执行模块可以根据锁存相关请求，以对应的锁存模式在锁存区内对数据执行锁存相关操作。与前文描述相同，此处的锁存相关操作可以包括针对于锁存区的写操作(即将数据写入到锁存区)或将锁存区中的数据从锁存区释放。例如，当消费者内核使用完锁存区的数据，并且该数据将不再被其他消费者内核使用时，则可以将该锁存区内存储数据的空间进行释放，以用于锁存其他数据。In one embodiment, the above-mentioned configuration module can be used to configure a specific storage space in the cache memory as a latch area supporting multiple latch modes, wherein the size of the specific storage space is smaller than the total storage size of the cache memory . For example, way0-way5 in FIG. 7 can be configured as a specific storage space that supports latching. Correspondingly, ways6-7 in FIG. 7 can maintain the common attributes of the cache memory, that is, be used as a general cache. As previously mentioned, the latch mode can be instruction mode, window mode, stream mode and/or page mode. Further, the latch execution module may be configured to receive a latch-related request for performing latch-related operations on data in the latch area. Next, the latch execution module can perform latch-related operations on data in the latch area in a corresponding latch mode according to the latch-related request. Same as the previous description, the latch-related operations here may include a write operation for the latch area (that is, writing data into the latch area) or releasing data in the latch area from the latch area. For example, when the consumer core has used up the data in the lock area and the data will no longer be used by other consumer cores, the space for storing data in the lock area can be released for locking other data .

图8是示出根据本公开实施例的片上系统800的简化框图。如图8中所示，本公开的片上系统800可以包括如图7中所示出的高速缓冲存储器700和处理器(或处理器核)802。在一个实施方式中，高速缓冲存储器的锁存执行模块可以用于根据锁存相关请求，以对应的所述锁存模式在所述锁存区内对所述数据执行锁存相关操作。关于高速缓冲存储器700，前文结合图6和图7对其进行了描述，此处将不再赘述。关于处理器802，根据本公开的方案，其可以是各种类型的处理器，并且可以包括一个或多个处理器核以生成锁存相关请求。在操作中，高速缓冲存储器的锁存执行模块用于根据生成的锁存相关请求，以对应的锁存模式在锁存区内对数据执行锁存相关操作。例如，当锁存模式是指令模式时，则处理器可以用于根据接收到的硬件指令来生成锁存相关请求。又例如，当锁存模式是页模式时，则处理器可以用于根据缓存页配置来生成锁存相关请求。再如，当锁存模式是窗口模式或流模式时，则处理可以用于配置锁定窗口，并根据锁定窗口来生成锁存相关请求。FIG. 8 is a simplified block diagram illustrating a system-on-chip 800 according to an embodiment of the disclosure. As shown in FIG. 8 , a system-on-chip 800 of the present disclosure may include a cache memory 700 and a processor (or processor core) 802 as shown in FIG. 7 . In one embodiment, the latch execution module of the cache memory may be configured to perform a latch-related operation on the data in the latch area in the corresponding latch mode according to the latch-related request. Regarding the cache memory 700 , it has been described above in conjunction with FIG. 6 and FIG. 7 , and will not be repeated here. Regarding the processor 802, according to the aspects of the present disclosure, it may be various types of processors, and may include one or more processor cores to generate latch-related requests. In operation, the latch execution module of the cache memory is configured to perform latch-related operations on data in the latch area in a corresponding latch mode according to the generated latch-related request. For example, when the latch mode is an instruction mode, the processor can be configured to generate latch-related requests according to received hardware instructions. For another example, when the latch mode is the page mode, the processor may be configured to generate a latch-related request according to the cache page configuration. For another example, when the latch mode is a window mode or a stream mode, the processing may be used to configure a lock window, and generate a latch-related request according to the lock window.

根据不同的实施方式，处理器802还可以是包括多个计算核的智能处理器或智能处理单元(“Intelligence Processing Unit”，简写为“IPU”)，其可以配置成执行各类人工智能领域(例如神经网络方面)的计算。According to different implementations, the processor 802 may also be an intelligent processor or an intelligent processing unit ("Intelligence Processing Unit", abbreviated as "IPU") including multiple computing cores, which may be configured to execute various artificial intelligence fields ( such as neural network calculations).

图9是示出根据本公开实施例的片上系统900的详细框图。可以理解的是这里所示出的片上系统900可以是图8所示片上系统的一种具体实现方式，因此关于图8所描述的内容也同样适用于图9。进一步，仅为了示例的目的，将以多个锁存模式中的窗口模式(或流模式)来描述片上系统900的操作。FIG. 9 is a detailed block diagram illustrating a system on chip 900 according to an embodiment of the present disclosure. It can be understood that the system-on-chip 900 shown here may be a specific implementation of the system-on-chip shown in FIG. 8 , and therefore the content described with respect to FIG. 8 is also applicable to FIG. 9 . Further, for the purpose of example only, the operation of the system-on-chip 900 will be described in a window mode (or stream mode) among a plurality of latch modes.

如图9中所示，片上系统900可以包括任务调度器(“Job Scheduler”)902，其包括调度单元903和配置器904。在一个实施例中，配置器904可以用于根据分配的配置任务(例如可以从任务队列中获得)来生成配置指令，以便向高速缓冲存储器(即“LLC”906)的配置模块(如CLR)发送。在一个实施例中，调度单元903可以用于对任务调度器中的多个任务(也即将在人工智能处理器上执行的“kernel”)进行调度，以便向本公开的片上系统中的智能处理器(IPU)905发送。在本公开的方案中，这里的智能处理器905可以是包括多个处理器核，多个处理器核可以构成如图4中所示出的一个群集(cluster)。在一个实施场景中，在如前的多处理器核架构中，调度单元可以根据多个处理器核的空闲度(例如利用率)来将任务分配给合适的处理器核。As shown in FIG. 9 , a system on chip 900 may include a task scheduler (“Job Scheduler”) 902 including a scheduling unit 903 and a configurator 904 . In one embodiment, the configurator 904 may be configured to generate configuration instructions according to assigned configuration tasks (e.g., obtainable from a task queue) to be sent to a configuration module (such as a CLR) in a cache memory (that is, "LLC" 906) send. In one embodiment, the scheduling unit 903 can be used to schedule multiple tasks in the task scheduler (that is, the "kernel" to be executed on the artificial intelligence processor), so as to provide intelligent processing in the system on chip of the present disclosure processor (IPU) 905 to send. In the solution of the present disclosure, the intelligent processor 905 here may include multiple processor cores, and the multiple processor cores may form a cluster as shown in FIG. 4 . In an implementation scenario, in the previous multi-processor core architecture, the scheduling unit may allocate tasks to appropriate processor cores according to the idleness (eg utilization) of the multiple processor cores.

进一步，该片上系统900还包括系统内存管理单元(“System Memory ManagementUnit”,简写为“SMMU”)，该系统内存管理单元用于将访存数据的虚拟地址转换为物理地址，以根据该物理地址实现对相关存储位置的访问。在一个实施方式中，该系统内存管理单元中包括设置有地址转换缓冲器TLB(Translation Lookaside Buffer，也称为快表)。该TLB中维护有页表，该页表包括至少一个页表项，每个页表项包括页(page)以及该页对应的页框(Frame)。在操作中，该系统内存管理单元可以根据接收到的虚拟地址来确定该虚拟地址对应的页，并且接着可以通过页与页框的映射关系确定该虚拟地址对应的物理地址PA(Physcial Address)，从而可以根据该物理地址实现对高速缓冲存储器的相关存储位置的访问。Further, the system-on-chip 900 also includes a system memory management unit ("System Memory ManagementUnit", abbreviated as "SMMU"), which is used to convert the virtual address of the memory access data into a physical address, so that according to the physical address Implements access to the associated storage location. In one embodiment, the system memory management unit includes an address translation buffer TLB (Translation Lookaside Buffer, also called fast table). A page table is maintained in the TLB, and the page table includes at least one page table entry, and each page table entry includes a page (page) and a page frame (Frame) corresponding to the page. In operation, the system memory management unit can determine the page corresponding to the virtual address according to the received virtual address, and then can determine the physical address PA (Physical Address) corresponding to the virtual address through the mapping relationship between the page and the page frame, Therefore, the access to the relevant storage location of the cache memory can be realized according to the physical address.

在一个实施例中，可以通过上述的窗口模式或流模式实现对高速缓冲存储器的访问。此时，智能处理器可以从存储器中获取参数表，并根据参数表来配置与待执行锁存相关操作的数据关联的锁定窗口(“Lock window”),以及根据配置的锁定窗口生成锁存相关请求(即，例如附带lock/unlock属性的IO访问请求)。接着，SMMU可以根据该IO访问请求来对LLC执行锁存相关操作。具体地，SMMU可以将前述的IO访问请求发送至LLC 906的缓存策略模块907(其执行类似于图7中锁存执行模块702的相同操作)来进行执行。在一个实施方式中，参数表可以包括用于配置锁定窗口或流模式中的流锁存属性的参数项。例如，参数项可以包括但不限于锁定/解锁窗口(“lock/unlock window”)、每数据流的锁定/解锁(“perstream lock/unlock”)、锁存比率(“Lock Ratio”)、锁定窗口标识(“lock window flag”)等信息。在一个实施场景中，该参数表中的参数可以是用户自定义设置的。由此，可以在程序的运行阶段获得该参数表中的相关参数，并且将该参数表存储在存储器(例如DDR)中，以便智能处理器(如图中的IPU 905)在执行阶段使用。In one embodiment, access to the cache memory can be implemented through the above-mentioned window mode or stream mode. At this time, the intelligent processor can obtain the parameter table from the memory, and according to the parameter table, configure a lock window ("Lock window") associated with the data of the latch-related operation to be performed, and generate a lock window according to the configured lock window. Requests (ie, eg IO access requests with lock/unlock attributes attached). Then, the SMMU can perform latch-related operations on the LLC according to the IO access request. Specifically, the SMMU may send the aforementioned IO access request to the cache policy module 907 of the LLC 906 (which performs the same operation as the latch execution module 702 in FIG. 7 ) for execution. In one embodiment, the parameter table may include parameter items for configuring a lock window or a stream latch attribute in a stream mode. For example, parameter items may include, but are not limited to, lock/unlock window (“lock/unlock window”), lock/unlock per stream (“perstream lock/unlock”), lock ratio (“Lock Ratio”), lock window Identification ("lock window flag") and other information. In an implementation scenario, the parameters in the parameter table may be user-defined. Therefore, the relevant parameters in the parameter table can be obtained during the running phase of the program, and the parameter table can be stored in a memory (such as DDR), so that the intelligent processor (such as IPU 905 in the figure) can use it during the execution phase.

在一个实施方式中，上述的锁定窗口用于表示软件用户希望锁存的存储空间，该锁定窗口的大小可以大于高速缓冲存储器上的锁存区的大小。上述锁定窗口包括以下中的一项或多项：窗口的基地址和大小，其中窗口的基地址可以是上层软件配置的虚拟地址(例如虚拟地址“Virtual Address”，简写“VA”)，该窗口的基地址与待执行锁存相关操作的数据起始地址相对应，而窗口的大小可以与待锁存的数据大小相对应。In one embodiment, the above-mentioned lock window is used to represent the storage space that the software user wishes to lock, and the size of the lock window may be larger than the size of the lock area on the cache memory. The above-mentioned locked window includes one or more of the following: the base address and size of the window, wherein the base address of the window can be a virtual address configured by the upper layer software (such as virtual address "Virtual Address", abbreviated "VA"), the window The base address of the window corresponds to the starting address of the data to be latched, and the size of the window may correspond to the size of the data to be latched.

具体地，在窗口模式下，智能处理器可以根据任务调度器下发的任务，确定该任务中数据的访存地址(该访存地址可以是虚拟地址)，并将该任务中数据的访存地址与窗口的锁定窗口定义的地址范围进行比较。如果该任务中数据的访存地址在锁定窗口的地址范围，则表示命中该锁定窗口，此时可以使能锁定窗口(如“Enabled”)。否则，若该任务中数据的访存地址在锁定窗口的地址范围之外，则表示未命中该锁定窗口。此时，该锁定窗口可以被忽略，也即表明该任务中的数据将不会被暂存在高速缓冲存储器中。进一步地，在数据的访存地址命中该锁定窗口时，则可以利用哈希算法从数据中选择预定比例的数据作为前述的部分数据来存储于锁存区中。关于哈希算法的具体使用，将再稍后结合附图11来进行详细描述。之后，智能处理器可以通过SMMU将附带Lock属性的锁存相关请求发送给高速缓冲存储器LLC。其中，该附带Lock属性的锁存相关请求可以用于指示将特定的数据驻留于锁存区，该特定的数据可以是根据哈希算法选定的部分数据。Specifically, in the window mode, the intelligent processor can determine the memory access address of the data in the task (the memory access address can be a virtual address) according to the task issued by the task scheduler, and make the memory access address of the data in the task The address is compared to the address range defined by the window's lock window. If the access address of the data in the task is within the address range of the lock window, it means that the lock window is hit, and the lock window can be enabled (such as "Enabled") at this time. Otherwise, if the access address of the data in the task is outside the address range of the lock window, it means that the lock window is not hit. At this time, the lock window can be ignored, which means that the data in the task will not be temporarily stored in the cache memory. Further, when the access address of the data hits the lock window, a predetermined proportion of data may be selected from the data by using a hash algorithm as the aforementioned partial data and stored in the latch area. The specific use of the hash algorithm will be described in detail later in conjunction with FIG. 11 . Afterwards, the intelligent processor can send the lock-related request attached with the Lock attribute to the cache memory LLC through the SMMU. Wherein, the lock-related request attached with the Lock attribute may be used to indicate that specific data resides in the lock area, and the specific data may be part of data selected according to a hash algorithm.

下面以窗口模式，结合图9来描述LLC的锁存过程和释放过程。The latching process and release process of the LLC will be described below in the window mode with reference to FIG. 9 .

LLC驻留(或称锁定)过程：LLC residency (or lock) process:

步骤1:任务调度器借助配置器来配置LLC(例如经由缓存策略模块)以便启用锁定区(“Lock enable”)、禁用锁定区(“Lock disable”)和锁定区的大小，即图中所示出的路(“Way”)数(例如Way0～Way7)。Step 1: The task scheduler configures the LLC with the help of the configurator (e.g. via the cache policy module) to enable the locked region ("Lock enable"), disable the locked region ("Lock disable") and the size of the locked region, as shown in the figure The number of ways ("Ways") to go out (eg Way0-Way7).

步骤2：任务调度器下发任务kernel给IPU；Step 2: The task scheduler sends the task kernel to the IPU;

步骤3：IPU从参数表中获取锁定窗口标识(“lock window flag”)，读取并配置锁定窗口。在一个实施场景中，这里的参数表可以通过软件来配置并存储在片外的动态随机存取存储器(“Dynamic Random Access Memory”，简写为“DRAM”)的一个存储地址处。接着，任务调度器可以向IPU传送该地址并且IPU可以根据该地址来读取参数表，以便完成对锁定窗口的配置。Step 3: The IPU obtains the lock window flag (“lock window flag”) from the parameter table, reads and configures the lock window. In an implementation scenario, the parameter table here may be configured by software and stored at a storage address of an off-chip dynamic random access memory (“Dynamic Random Access Memory”, abbreviated as “DRAM”). Then, the task scheduler can transmit the address to the IPU and the IPU can read the parameter table according to the address, so as to complete the configuration of the locking window.

步骤4：IPU通过内存管理单元SMMU生成锁存相关请求，并且在向LLC的缓存策略模块发送该请求时，可以令该请求依据锁定窗口信息来附带lock属性。Step 4: The IPU generates a lock-related request through the memory management unit SMMU, and when sending the request to the cache policy module of the LLC, the request can be attached with a lock attribute according to the lock window information.

步骤5：LLC的缓存策略模块在接收到带有lock属性的锁存相关请求之后，将对应的数据存储在相应的缓存行中，并标记该缓存行(也即锁存区)的lock属性，例如设置为如前所述的“持久(persisting)”。Step 5: After receiving the lock-related request with the lock attribute, the cache policy module of the LLC stores the corresponding data in the corresponding cache line, and marks the lock attribute of the cache line (that is, the lock area), For example set to "persisting" as described above.

LLC解驻留(或称释放)过程：LLC deresidence (or release) process:

步骤6：任务调度器下发kernel给IPU；Step 6: The task scheduler sends the kernel to the IPU;

步骤7：IPU从参数表中获取解锁窗口标识，读取并配置解锁窗口；Step 7: The IPU obtains the unlock window ID from the parameter table, reads and configures the unlock window;

步骤8：IPU发射请求的时候，依据解锁窗口信息来附带解锁(“unlock”)属性；Step 8: When the IPU transmits the request, it attaches the unlock (“unlock”) attribute according to the unlock window information;

步骤9：LLC的缓存策略模块在接收到带有unlock属性的请求之后，将命中的lock属性的缓存行切换为常规属性，如前面结合指令模式所描述的常规(Normal)属性；Step 9: After receiving the request with the unlock attribute, the cache policy module of the LLC switches the cache line of the hit lock attribute to a normal attribute, such as the normal (Normal) attribute described in conjunction with the instruction mode above;

步骤10：任务调度器借助配置器并且通过CLR模块来禁用锁存区即(即，LLC lockdisable)。在一个实现场景中，CLR模块可以根据配置器的指示来清空先前的锁定属性配置。Step 10: The task scheduler disables the lock area (ie, LLC lockdisable) by means of the configurator and through the CLR module. In an implementation scenario, the CLR module may clear the previous locking attribute configuration according to the instruction of the configurator.

以上结合图9对本公开的片上系统在窗口模式下的锁存方案进行了详细的描述。通过这样的锁存操作，可以显著提升缓存命中的机率并且改善了高速缓冲存储器的使用效率和扩大了应用场景。The latch scheme of the system on chip of the present disclosure in the window mode has been described in detail above with reference to FIG. 9 . Through such a latch operation, the probability of a cache hit can be significantly increased, the utilization efficiency of the cache memory is improved, and the application scenarios are expanded.

本公开实施例还支持流模式下的锁存相关操作，当本公开的任务中数据流对应的使能比特位为低，则视为默认情形，也即不执行流模式下的锁存相关操作。反之，当使能比特位为高时，则可以在流模式下对该数据流执行相应的锁存相关操作。具体地，本公开的窗口模式和流模式具有相类似的操作，可以利用哈希算法和该数据流的锁存比例从该数据流中选择预定比例的数据作为前述的部分数据来存储于锁存区中。关于哈希算法的具体使用，将再稍后结合附图11来进行详细描述。The embodiments of the present disclosure also support latch-related operations in stream mode. When the enable bit corresponding to the data stream in the task of the present disclosure is low, it is regarded as the default situation, that is, the latch-related operations in stream mode are not performed. . Conversely, when the enable bit is high, the corresponding latch-related operations can be performed on the data stream in stream mode. Specifically, the window mode and the stream mode of the present disclosure have similar operations, and a predetermined proportion of data can be selected from the data stream as the aforementioned partial data to be stored in the latch by using the hash algorithm and the lock ratio of the data stream. in the district. The specific use of the hash algorithm will be described in detail later in conjunction with FIG. 11 .

如前所述，在一个实施例中，本公开实施例还支持页模式下的锁存相关操作，下面结合图10来描述该页模式。As mentioned above, in one embodiment, the embodiment of the present disclosure also supports latch-related operations in the page mode, and the page mode will be described below with reference to FIG. 10 .

图10是示出根据本公开实施例的页模式的示意框图。如图10所示，根据本申请的方案，可以对缓存页直接进行配置以使得其具有本公开的锁定属性，从而使得与存储器(如“内存”)形成映射关系的缓存页可以用于多个内核kernel(如图中所示内核0～2)之间共享访问数据使用。在一个实施方式中，程序员可以利用指令(例如Malloc)对缓存页进行锁定属性标记。当一个内核访问标记为锁定的缓存页时，SMMU可以将缓存页所对应的数据锁定于本公开的锁存区中。接着，当后续的内核需要再次访问到前述的缓存页时，其可以从锁存区的对应缓存行读取先前锁定的数据，从而实现缓存命中。由此，通过页模式，本公开的方案提升了数据在多个内核之间的共享和可访问性。FIG. 10 is a schematic block diagram illustrating a page mode according to an embodiment of the present disclosure. As shown in FIG. 10, according to the solution of the present application, the cache page can be directly configured so that it has the lock attribute of the present disclosure, so that the cache page that forms a mapping relationship with the memory (such as "memory") can be used for multiple The kernel kernels (kernel 0-2 shown in the figure) share access data for use. In one embodiment, the programmer may use an instruction (such as Malloc) to mark the cache page with a lock attribute. When a kernel accesses a cache page marked as locked, the SMMU can lock the data corresponding to the cache page in the latch area of the present disclosure. Then, when the subsequent core needs to access the aforementioned cache page again, it can read the previously locked data from the corresponding cache line in the latch area, thereby achieving a cache hit. Thus, through the page mode, the disclosed scheme improves the sharing and accessibility of data among multiple cores.

具体地，在页模式下，软件驱动程序可以通过指令直接配置(“System MemoryManagement Unit”,简写为“SMMU”)页表中的信息，并且根据该信息来确定执行基于页的锁存操作或者常规(normal)操作两种配置。当页表中的信息指示SMMU被旁路(bypass)时，则表示无需对高速缓冲存储器进行锁存，此时高速缓冲存储器中的缓冲行的属性可以是常规(Normal)属性。当信息指示SMMU为线性映射时，此时则可以根据SMMU线性映射窗口配置来设置基于页的锁存操作。例如，该线性映射窗口内的缓存页所对应的数据锁定于本公开的锁存区中。该SMMU可以基于页表中的信息生成相应的锁存相关请求，并将该锁存相关请求发送至LLC，LLC的缓存策略模块可以根据该锁存相关请求对LLC的缓存行进行配置，以执行相应的缓存相关操作。Specifically, in the page mode, the software driver can directly configure ("System Memory Management Unit", abbreviated as "SMMU") information in the page table through instructions, and determine to perform page-based latch operations or conventional (normal) Operate both configurations. When the information in the page table indicates that the SMMU is bypassed (bypass), it means that the cache memory does not need to be latched. At this time, the attribute of the buffer line in the cache memory can be a normal (Normal) attribute. When the information indicates that the SMMU is linearly mapped, then the page-based latch operation may be set according to the SMMU linearly mapped window configuration. For example, the data corresponding to the cache page in the linear mapping window is locked in the latch area of the present disclosure. The SMMU can generate a corresponding lock-related request based on the information in the page table, and send the lock-related request to the LLC, and the cache policy module of the LLC can configure the cache line of the LLC according to the lock-related request to execute The corresponding cache-related operations.

在一个实施例中，本公开实施例还支持指令模式，此时，该片上系统可以通过指令集中的访存指令(IO指令)来配置LLC中的锁存区。In one embodiment, the embodiment of the present disclosure also supports an instruction mode, at this time, the system-on-chip can configure the latch area in the LLC through a memory access instruction (IO instruction) in the instruction set.

例如IO指令附带锁存相关属性的至少一种配置域，从而借助于该配置域来灵活地配置LLC。这里，各种配置域可以代表当对片外存储器(例如DDR空间)执行数据访存时，LLC可以对应执行的操作行为。在一个实施场景中，指令中包括上述的配置属性：瞬态(Transient)属性、锁定(Lock)属性、解锁(Unlock)属性、常规(Normal)属性、无效(Invalid)属性、干净(Clean)属性或默认(Default)属性等等。由于指令模式为最高优先级，因此当IO访存指令指示为Default属性时，则意味着可以由其他模式(如窗口模式、流模式或页模式)来执行锁存相关操作。For example, the IO instruction is accompanied by at least one configuration domain that latches related attributes, so that the LLC can be flexibly configured by means of the configuration domain. Here, various configuration domains may represent corresponding operation behaviors that the LLC may perform when performing data access to off-chip memory (such as DDR space). In an implementation scenario, the above configuration attributes are included in the instruction: Transient (Transient) attribute, Lock (Lock) attribute, Unlock (Unlock) attribute, General (Normal) attribute, Invalid (Invalid) attribute, Clean (Clean) attribute Or the default (Default) attribute and so on. Since the instruction mode is the highest priority, when the IO access instruction is indicated as the Default attribute, it means that other modes (such as window mode, stream mode or page mode) can perform latch-related operations.

当任务调度器将任务下发至智能处理器IPU时，IPU可以根据任务中IO指令确定锁存相关请求。具体地，当IO指令中的Lock属性的配置域使能时，此时可以在锁存相关请求中附带Lock属性，以使得LLC根据该附带Lock属性的锁存相关请求来将特定的数据存储于锁定区。当IO指令中的Unlock属性的配置域使能时，此时可以在锁存相关请求中附带Unlock属性，以使得LLC根据该附带Unlock属性的锁存相关请求来释放锁定区。根据应用场景不同，这里的锁存相关请求还可以类似地附带其他的属性。When the task scheduler sends the task to the intelligent processor IPU, the IPU can determine the latch related request according to the IO instruction in the task. Specifically, when the configuration domain of the Lock attribute in the IO instruction is enabled, the Lock attribute can be attached to the lock-related request at this time, so that the LLC can store specific data in the lock-related request according to the Lock attribute. locked area. When the configuration field of the Unlock attribute in the IO command is enabled, the Unlock attribute can be attached to the lock-related request at this time, so that the LLC can release the locked area according to the lock-related request attached with the Unlock attribute. According to different application scenarios, the latch-related request here can also have other attributes similarly attached.

进一步地，在一些操作场景中，当指令中还包括用于指示锁存比例的特定配置域。当该指令中的特定配置域(例如特定比特位inst_ratio_en)为低时，则可以认为锁存操作取决于指令配置，即根据任务中具体的IO指令来确定锁存相关请求。如果前述比特位为高，则可以根据哈希(hash)算法与该指令指示的锁存比率(lock ratio)进行比较，从该数据流中选择预定比例的数据作为前述的部分数据来存储于锁存区中。关于哈希算法的具体使用，下文结合附图11来进行详细描述。Further, in some operation scenarios, when the instruction also includes a specific configuration field for indicating the latch ratio. When a specific configuration field in the instruction (for example, a specific bit inst_ratio_en) is low, it can be considered that the latch operation depends on the instruction configuration, that is, the latch-related request is determined according to the specific IO instruction in the task. If the aforementioned bit is high, it can be compared with the lock ratio (lock ratio) indicated by the instruction according to the hash (hash) algorithm, and a predetermined proportion of data is selected from the data stream as the aforementioned partial data to be stored in the lock in storage area. The specific use of the hash algorithm will be described in detail below in conjunction with FIG. 11 .

图11是示出根据本公开实施例的窗口模式或流模式下的哈希操作。本公开的方案使用哈希操作来执行一定比例的驻留(即锁定)是因为LLC驻留关键问题之一是带宽与容量的权衡(“tradeoff”)。因此，本公开提出执行一定比例的驻留(即Lock Ratio)，从而可以针对不同的任务，获得不同的带宽和驻留容量。假定预设Lock Ratio的值为P(例如以百分比计)，则预期带宽为B＝6T*P+2T*(1-P)，这里6T是数据驻留于LLC上的读取速率，而2T是数据存储于内存(如DRAM)上的读取速度，其中T＝1000Gbit/s。如前所述，Lock Ratio可以在lock/unlock window中配置或者是针对具体的数据流进行配置。另外，尽管下文对窗口模式或流模式下的哈希操作进行了描述，但类似的操作也适用于指令模式下的哈希操作。FIG. 11 illustrates a hash operation in window mode or stream mode according to an embodiment of the present disclosure. The scheme of the present disclosure uses a hash operation to enforce a certain percentage of residency (ie locking) because one of the key issues with LLC residency is the bandwidth versus capacity tradeoff ("tradeoff"). Therefore, the present disclosure proposes to implement a certain ratio of residency (ie, Lock Ratio), so that different bandwidths and residency capacities can be obtained for different tasks. Assuming that the value of the preset Lock Ratio is P (for example, in percentage), the expected bandwidth is B=6T*P+2T*(1-P), where 6T is the read rate at which data resides on the LLC, and 2T It is the reading speed of data stored in memory (such as DRAM), where T=1000Gbit/s. As mentioned earlier, Lock Ratio can be configured in the lock/unlock window or for specific data streams. Also, although hash operations in window mode or stream mode are described below, similar operations are also applicable to hash operations in instruction mode.

具体来说，在窗口模式或流模式下，智能处理器核首先数据的访存地址与锁定窗口限定的地址范围进行比较，以确定请求的地址是否在锁定窗口的地址范围内。当请求的地址在锁定窗口的地址范围内时，则如图11中所示，可以对命中的窗口地址范围执行哈希运算。这里，每个数据的访存地址可以是虚拟地址。Specifically, in the window mode or stream mode, the intelligent processor core first compares the access address of the data with the address range defined by the lock window to determine whether the requested address is within the address range of the lock window. When the requested address is within the address range of the locked window, as shown in FIG. 11 , a hash operation may be performed on the hit window address range. Here, the access address of each data may be a virtual address.

具体来说，借助于全局固定的Hash规则，可以将该访存地址的VA映射到Hash空间上(即图中的“Hash Map”)，并且该Hash过程可以优先保留地址低位信息。接着，可以将在1102处获得的Hash值与锁存比例Lock Ratio在1104处进行比较，以随机选取出对应比例的数据。具体地，当访存地址的hash值小于该锁存比例时，则认为命中，并且因此该部分数据(即符合比例的数据)可以锁存在高速缓冲存储器中。与之相反，当访存地址的hash值大于或等于该锁存比例时，则认为未命中，并且因此该部分数据将不会锁存在高速缓冲存储器中。Specifically, with the help of globally fixed Hash rules, the VA of the memory access address can be mapped to the Hash space (that is, the "Hash Map" in the figure), and the Hash process can preferentially retain the low-order information of the address. Next, the Hash value obtained at 1102 can be compared with the lock ratio Lock Ratio at 1104 to randomly select data of a corresponding ratio. Specifically, when the hash value of the access address is smaller than the latch ratio, it is considered a hit, and therefore the part of data (ie, data conforming to the ratio) can be latched in the cache memory. On the contrary, when the hash value of the access address is greater than or equal to the latch ratio, it is considered a miss, and therefore this part of data will not be latched in the cache memory.

例如，当锁存比例Lock Ratio设定为10％，则可以依顺序从Hash值中选取前10％值所对应的部分数据，即数据的锁存地址的哈希值小于该锁存比例的部分数据进行锁存相关操作。当前在其他示例中，锁存比例还可以是其他数值，该锁存比例可以由软件用户自定义设置，前述的选取操作也可以根据Hash算法的设定来实现。例如，该锁存比例还可以是20％～30％，此时可以依顺序从Hash值中选取前20％～30％值所对应的部分数据进行锁存相关操作。此后，可以在1106处按照指定的请求类型来处理，即将部分数据进行锁定或解锁。For example, when the lock ratio Lock Ratio is set to 10%, you can select the part of data corresponding to the first 10% value from the Hash value in order, that is, the part whose hash value of the latch address of the data is smaller than the lock ratio Data is latched for related operations. Currently, in other examples, the latch ratio can also be other values, and the latch ratio can be customized by the software user, and the aforementioned selection operation can also be implemented according to the setting of the Hash algorithm. For example, the latch ratio may also be 20%-30%, and at this time, partial data corresponding to the first 20%-30% of the Hash values may be sequentially selected to perform latch-related operations. Thereafter, at 1106, it can be processed according to the specified request type, that is, to lock or unlock some data.

以上结合图6-图11对本公开的高速缓冲存储器的锁存方案进行了详细地描述。基于前述锁存方案的思想，并且作为前述锁存方案的补充，下面将结合图12-图14对本公开针对高速缓冲存储器的另一扩展应用进行描述，即如何通过高速缓冲器来实现片上系统内集群间的通信。The latch scheme of the cache memory of the present disclosure has been described in detail above with reference to FIGS. 6-11 . Based on the idea of the aforementioned latch scheme, and as a supplement to the aforementioned latch scheme, the following will describe another extended application of the present disclosure for the cache memory in conjunction with Fig. 12-Fig. Inter-cluster communication.

图12是示出根据本公开实施例的片上系统的简化框图。结合前文的描述，可以理解的是这里的片上系统可以是包括在图2所示计算装置201中的片上系统，例如是由多核计算装置41所构成的片上系统。如图6中所示，该片上系统1200包括示例性示出的四个集群0-集群4。鉴于前文已经对集群进行了详细的描述，此处不再赘述。进一步示出的是高速缓冲存储器1201，其例如可以设置在如前图5中所示出的SRAM 408中，用于执行集群间的数据传输操作。在一个实施场景中，该高速缓冲存储器1201也可以与DRAM(例如DDR)进行片上与片外之间的双向通信，包括各种类型数据或指令的传递。12 is a simplified block diagram illustrating a system-on-chip according to an embodiment of the disclosure. In combination with the foregoing description, it can be understood that the system on chip here may be the system on chip included in the computing device 201 shown in FIG. 2 , for example, the system on chip constituted by the multi-core computing device 41 . As shown in FIG. 6 , the system-on-chip 1200 includes four clusters 0 - cluster 4 exemplarily shown. Since the cluster has been described in detail above, it will not be repeated here. Further shown is a cache memory 1201 , which may be disposed, for example, in the SRAM 408 as previously shown in FIG. 5 , for performing inter-cluster data transfer operations. In an implementation scenario, the cache memory 1201 can also perform on-chip and off-chip bidirectional communication with DRAM (such as DDR), including the transfer of various types of data or instructions.

图13是示出根据本公开实施例的用于片上系统的方法1300的流程图。这里的片上系统可以是如图12中所示出的片上系统。具体来说，该片上系统至少包括用于执行运算操作的多个集群和与该多个集群互联的高速缓冲存储器。在一个实施场景中，每个集群可以包括用于执行所述运算操作的多个处理器核。在一个实施场景中，上述在高速缓冲存储器中确定的锁存区可以用于完成集群间的数据通信，从而使得该片上系统可以不再设置CDMA410和GDMA 411等通信模块。FIG. 13 is a flowchart illustrating a method 1300 for a system on chip according to an embodiment of the present disclosure. The system on chip here may be the system on chip as shown in FIG. 12 . Specifically, the system-on-chip includes at least a plurality of clusters for performing computing operations and a cache memory interconnected with the plurality of clusters. In an implementation scenario, each cluster may include multiple processor cores for performing the computing operations. In an implementation scenario, the above-mentioned latch area determined in the cache memory can be used to complete inter-cluster data communication, so that the system-on-chip does not need to configure communication modules such as CDMA410 and GDMA411.

在一个实施方式中，上述锁存区可以用于在具有依赖关系的任务之间传递数据，例如，锁存区可以用于在生产者内核和消费者内核之间传递数据。具体来说，处理器可以通过配置的锁定窗口，将生产者内核需要交换给消费者内核的数据锁存在LLC中。在一个场景中，当处理器执行完生产者内核后，可以将需要传递给消费者内核的数据(其可能是生产者内核的输入数据或输出数据)进行锁存。鉴于此，处理器可以如前所述那样通过配置的锁定窗口并且借助于例如SMMU来对LLC进行本公开的锁存相关操作，从而在窗口模式下将需要交换的上述数据锁存于LLC中，以供消费者内核稍后使用。对应地，处理器还可以根据消费者内核中配置的解锁窗口来释放锁存区，即处理器在通过对于LLC中锁存的数据执行读取操作完成消费者内核的执行时，可以释放LLC中锁存区内数据的对应存储空间。In one embodiment, the above-mentioned latch area can be used to transfer data between tasks with dependencies, for example, the latch area can be used to transfer data between a producer core and a consumer core. Specifically, the processor can lock the data that the producer core needs to exchange to the consumer core in the LLC through the configured lock window. In one scenario, after the processor finishes executing the producer kernel, it may latch data that needs to be delivered to the consumer kernel (which may be input data or output data of the producer kernel). In view of this, the processor can perform the latch-related operations of the present disclosure on the LLC through the configured lock window and by means of, for example, the SMMU, so as to latch the above-mentioned data that needs to be exchanged in the LLC in the window mode, for later use by the consumer kernel. Correspondingly, the processor can also release the latch area according to the unlock window configured in the consumer kernel, that is, when the processor completes the execution of the consumer kernel by performing a read operation on the data latched in the LLC, it can release the latch area in the LLC. The corresponding storage space of the data in the latch area.

基于上述锁存区能够配置用于在具有依赖关系的任务之间传递数据，该锁存区还可以用于片间通信的应用场景。例如，该处理器的一个集群或处理器核中经由锁存区发射数据(该数据可以是生产者内核需要交换给消费者内核的数据)给其他集群中的处理器进行归并处理。其他集群中的处理器从锁存区读取数据进行处理，从而实现数据间的片间传输。采用该锁存区进行集群间通信的方式具体可参见下文的描述。Based on the above-mentioned latch area can be configured to transfer data between tasks with dependencies, the latch area can also be used in the application scenario of inter-chip communication. For example, a cluster or processor core of the processor transmits data (the data may be data that the producer core needs to exchange to the consumer core) via the latch area to processors in other clusters for merge processing. Processors in other clusters read data from the latch area for processing, thereby realizing inter-chip data transfer. For the manner in which inter-cluster communication is performed using the latch area, please refer to the description below.

如图13所示，本公开还包括使用高速缓冲存储器的锁存区来进行集群间通信的方法，该方法包括：As shown in FIG. 13 , the present disclosure also includes a method for performing inter-cluster communication using a latch area of a cache memory, the method comprising:

在步骤S1302处，将片外存储器的指定存储空间映射到高速缓冲存储器(“cache”)的给定存储区(其物理属性与前述结合附图所描述的锁定区相同)，以将该给定存储区用作集群间数据通信的集群存储区。在如图8所示出的一个实施场景中，高速缓冲存储器可以包括LLC，并且片外存储器包括DDR。基于此，指定的存储空间可以是图14中1402处所指定的存储空间。对应地，集群存储区可以是图14中1404处高速缓冲存储器中的给定存储区。在一个实施场景中，可以通过软件配置来指定DDR的指定存储空间，并且将该DDR的指定存储空间映射到cache上的给定空间，用作cluster间(例如图14中所示出的集群0和集群1)通信。在完成集群存储区的划分和确定后，在步骤S1304处，可以使用确定的所述集群存储区来执行集群的操作。At step S1302, the specified storage space of the off-chip memory is mapped to a given storage area of the high-speed cache ("cache") (its physical properties are the same as the locking area described above in conjunction with the accompanying drawings), so that the given The storage area is used as the cluster storage area for inter-cluster data communication. In an implementation scenario as shown in FIG. 8 , the cache memory may include LLC, and the off-chip memory may include DDR. Based on this, the specified storage space may be the storage space specified at 1402 in FIG. 14 . Correspondingly, the cluster storage area may be a given storage area in the cache memory at 1404 in FIG. 14 . In an implementation scenario, the specified storage space of the DDR can be specified through software configuration, and the specified storage space of the DDR can be mapped to a given space on the cache for inter-cluster (for example, the cluster 0 shown in Figure 14 Communicate with cluster 1). After the cluster storage area is divided and determined, at step S1304, the determined cluster storage area may be used to perform cluster operations.

在一个实施例中，使用集群存储区来执行集群的操作可以包括将所述集群存储区用于集群间通信。在该情形中，将所述集群存储区用于集群间通信可以具体包括：利用所述集群存储区来实现集群之间的点对点通信。附加地，可以利用所述集群存储区来实现所述多个集群之一对其余集群的广播通信。当点对点通信的场景中，集群存储区可以用于接收来自于第一集群针对写入数据的写操作以及响应于第二集群的读操作，向所述第二集群发送第一集群先前的写入数据。In one embodiment, using the cluster store to perform operations of the cluster may include using the cluster store for inter-cluster communication. In this case, using the cluster storage area for inter-cluster communication may specifically include: using the cluster storage area to implement point-to-point communication between clusters. Additionally, the cluster storage area may be used to implement broadcast communication from one of the multiple clusters to other clusters. In the scenario of point-to-point communication, the cluster storage area can be used to receive the write operation of the first cluster for writing data and respond to the read operation of the second cluster, and send the previous write of the first cluster to the second cluster data.

在上述写操作的一个示例实现中，集群存储区还可以用于接收将上述写操作所关联的写入数据驻留于所述集群存储区的锁定指示，例如图14中所示出的写锁定(“writelock”)，即上述的附带Lock属性的锁定相关请求。接着，可以基于锁定指示将写入数据驻留于集群存储区中，其中该集群存储区可以是上述实施例中确定的锁存区。通过这样的驻留方式，可以显著提高将被多次读取的数据在高速缓冲存储器中的命中率。In an example implementation of the above-mentioned write operation, the cluster storage area may also be used to receive a lock indication that the write data associated with the above-mentioned write operation resides in the cluster storage area, such as the write lock shown in FIG. 14 ("writelock"), that is, the above-mentioned lock-related request with the Lock attribute. Then, based on the lock indication, the written data may reside in the cluster storage area, wherein the cluster storage area may be the latch area determined in the above embodiment. Through such a residency manner, the hit ratio of data to be read many times in the cache memory can be significantly improved.

在一中实施场景中，在其中一个集群中执行的生产者内核可以通过上述的写锁定，将需要交换给消费者内核的数据锁存在LLC中，以供消费者内核稍后使用，例如生产者内核经由LLC发射数据给其他集群中的处理器进行归并处理。其他集群中的处理器可以从该集群存储区读取数据进行处理，从而实现数据间的片间传输。In one implementation scenario, the producer kernel executing in one of the clusters can lock the data that needs to be exchanged to the consumer kernel in the LLC through the above-mentioned write lock for later use by the consumer kernel, such as the producer The core transmits data via LLC to processors in other clusters for merge processing. Processors in other clusters can read data from the cluster storage area for processing, thereby realizing inter-slice transmission of data.

在上述读操作的一个示例实现中，集群存储区还可以用于接收令所述写入数据不写回片外存储器的读无效指示，例如从图14中集群1所发出的读无效(“read invalid”)。其中，该读无效指示可以是附带invalid属性的锁存相关请求，该锁存相关请求的生成方式具体可参见上文的描述。在不同的锁存模式下，其锁存相关请求可以不同。接着，集群存储区可以在向集群1发送所述写入数据后，基于所述读无效指示来令与所述写入数据关联的高速缓存行无效。In an example implementation of the read operation described above, the cluster memory area can also be used to receive a read invalidation indication that the write data is not written back to the off-chip memory, such as a read invalidation (“read invalidation” issued by cluster 1 in FIG. 14 ). invalid"). Wherein, the read invalid indication may be a latch-related request with an invalid attribute, and the generation method of the latch-related request may refer to the above description for details. In different latch modes, the latch-related requests can be different. Then, after sending the write data to cluster 1, the cluster storage area may invalidate the cache line associated with the write data based on the read invalidation indication.

为了实现上述集群之间的数据传递(或者说通信)的同步，向集群存储区写入数据的集群(如集群0)可以在写操作后完成后向另一个集群(如集群1)发送同步指令，例如如图14中的hsem(“硬件信号量”)。接收到同步指令后，集群1可以发送针对于集群存储区的上述读无效请求，以便在读取集群0写入进集群存储区的数据后，令高速缓存行无效，从而防止前述数据的写回。In order to achieve the synchronization of data transfer (or communication) between the above clusters, the cluster (such as cluster 0) that writes data to the cluster storage area can send a synchronization command to another cluster (such as cluster 1) after the write operation is completed , such as hsem ("hardware semaphore") in Figure 14. After receiving the synchronization command, cluster 1 can send the above-mentioned read invalidation request for the cluster storage area to invalidate the cache line after reading the data written into the cluster storage area by cluster 0, thereby preventing the write-back of the aforementioned data .

在本披露的上下文中，上述将数据写入到集群存储区和从集群存储区读取数据的行为也可以统称为锁存相关请求所触发的锁存相关操作，该锁存相关请求的确认方式可参见上文的描述。具体地，该锁存相关的请求可以用于指示锁存操作。通过锁存操作，数据将被锁存于集群存储区中以用于后续的多次使用。进一步，该锁存相关的请求可以用于指示释放操作，并且通过该释放操作，数据可以从集群存储区解锁以释放出更多的存储空间来用于后续的数据锁存。可以理解的是，通过释放操作，集群存储区的存储空间可以被灵活地使用，从而提高了本公开集群存储区的使用效率。In the context of this disclosure, the above-mentioned behaviors of writing data to and reading data from the cluster storage area can also be collectively referred to as lock-related operations triggered by lock-related requests, and the confirmation method of the lock-related requests See description above. Specifically, the latch-related request may be used to indicate a latch operation. Through the latch operation, the data will be latched in the cluster storage area for subsequent multiple uses. Further, the latch-related request can be used to indicate a release operation, and through the release operation, data can be unlocked from the cluster storage area to release more storage space for subsequent data latches. It can be understood that, through the release operation, the storage space of the cluster storage area can be used flexibly, thereby improving the usage efficiency of the cluster storage area in the present disclosure.

在一个实施方式中，针对于集群存储区的读操作，还可以在执行完读操作后，根据锁存相关请求将数据或选定的部分所述数据从所述集群存储区的指定区域释放。关于前述选定的部分数据，在一个实施例中，可以以随机的方式从数据中选择预定比例的数据形成前述的部分数据来锁存于锁存区中。在另一个实施例中，可以利用哈希算法从数据中选择预定比例的数据作为前述的部分数据来锁存于集群存储区中，具体可参见上文图11部分的描述。In one embodiment, for the read operation of the cluster storage area, after the read operation is completed, the data or a selected part of the data may be released from the specified area of the cluster storage area according to a latch-related request. Regarding the aforementioned selected partial data, in one embodiment, a predetermined proportion of data may be randomly selected from the data to form the aforementioned partial data to be latched in the latch area. In another embodiment, hash algorithm can be used to select a predetermined proportion of data from the data as the aforementioned partial data to be latched in the cluster storage area. For details, please refer to the description in the part of FIG. 11 above.

以上结合附图对本公开的方案进行了详细的描述。根据不同的应用场景，本披露的电子设备或装置可以包括服务器、云端服务器、服务器集群、数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、PC设备、物联网终端、移动终端、手机、行车记录仪、导航仪、传感器、摄像头、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、视觉终端、自动驾驶终端、交通工具、家用电器、和/或医疗设备。所述交通工具包括飞机、轮船和/或车辆；所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机；所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。本披露的电子设备或装置还可以被应用于互联网、物联网、数据中心、能源、交通、公共管理、制造、教育、电网、电信、金融、零售、工地、医疗等领域。进一步，本披露的电子设备或装置还可以用于云端、边缘端、终端等与人工智能、大数据和/或云计算相关的应用场景中。在一个或多个实施例中，根据本披露方案的算力高的电子设备或装置可以应用于云端设备(例如云端服务器)，而功耗小的电子设备或装置可以应用于终端设备和/或边缘端设备(例如智能手机或摄像头)。在一个或多个实施例中，云端设备的硬件信息和终端设备和/或边缘端设备的硬件信息相互兼容，从而可以根据终端设备和/或边缘端设备的硬件信息，从云端设备的硬件资源中匹配出合适的硬件资源来模拟终端设备和/或边缘端设备的硬件资源，以便完成端云一体或云边端一体的统一管理、调度和协同工作。The solution of the present disclosure has been described in detail above with reference to the accompanying drawings. According to different application scenarios, the electronic equipment or devices disclosed in this disclosure may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, IoT terminals, mobile Terminals, mobile phones, driving recorders, navigators, sensors, cameras, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, visual terminals, automatic driving terminals, vehicles, household appliances, and/or medical equipment. Said vehicles include airplanes, ships and/or vehicles; said household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, range hoods; said medical equipment includes nuclear magnetic resonance instruments, Ultrasound and/or electrocardiograph. The electronic equipment or device disclosed herein can also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and medical care. Further, the electronic device or device disclosed herein can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge, and terminal. In one or more embodiments, electronic devices or devices with high computing power according to the disclosed solutions can be applied to cloud devices (such as cloud servers), while electronic devices or devices with low power consumption can be applied to terminal devices and/or Edge devices (such as smartphones or cameras). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that according to the hardware information of the terminal device and/or the edge device, the hardware resources of the cloud device can be Match appropriate hardware resources to simulate the hardware resources of terminal devices and/or edge devices, so as to complete the unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-end integration.

需要说明的是，为了简明的目的，本披露将一些方法及其实施例表述为一系列的动作及其组合，但是本领域技术人员可以理解本披露的方案并不受所描述的动作的顺序限制。因此，依据本披露的公开或教导，本领域技术人员可以理解其中的某些步骤可以采用其他顺序来执行或者同时执行。进一步，本领域技术人员可以理解本披露所描述的实施例可以视为可选实施例，即其中所涉及的动作或模块对于本披露某个或某些方案的实现并不一定是必需的。另外，根据方案的不同，本披露对一些实施例的描述也各有侧重。鉴于此，本领域技术人员可以理解本披露某个实施例中没有详述的部分，也可以参见其他实施例的相关描述。It should be noted that, for the purpose of brevity, the present disclosure expresses some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art can understand that the solution of the present disclosure is not limited by the order of the described actions . Therefore, according to the disclosure or teaching of the present disclosure, those skilled in the art may understand that certain steps may be performed in other orders or simultaneously. Further, those skilled in the art can understand that the embodiments described in the present disclosure can be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily required for the realization of one or some solutions of the present disclosure. In addition, according to different schemes, the description of some embodiments in this disclosure also has different emphases. In view of this, those skilled in the art may understand the part that is not described in detail in a certain embodiment of the present disclosure, and may also refer to related descriptions of other embodiments.

在具体实现方面，基于本披露的公开和教导，本领域技术人员可以理解本披露所公开的若干实施例也可以通过本文未公开的其他方式来实现。例如，就前文所述的电子设备或装置实施例中的各个单元来说，本文在考虑了逻辑功能的基础上对其进行划分，而实际实现时也可以有另外的划分方式。又例如，可以将多个单元或组件结合或者集成到另一个系统，或者对单元或组件中的一些特征或功能进行选择性地禁用。就不同单元或组件之间的连接关系而言，前文结合附图所讨论的连接可以是单元或组件之间的直接或间接耦合。在一些场景中，前述的直接或间接耦合涉及利用接口的通信连接，其中通信接口可以支持电性、光学、声学、磁性或其它形式的信号传输。In terms of specific implementation, based on the disclosure and teachings of the present disclosure, those skilled in the art may understand that several embodiments disclosed in the present disclosure may also be implemented in other ways not disclosed herein. For example, with respect to each unit in the above-mentioned electronic device or device embodiment, this paper divides them on the basis of considering logical functions, but there may be other division methods in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions in units or components may be selectively disabled. As far as the connection relationship between different units or components is concerned, the connections discussed above in conjunction with the drawings may be direct or indirect couplings between units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection using an interface, where the communication interface may support electrical, optical, acoustic, magnetic or other forms of signal transmission.

在本披露中，作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元示出的部件可以是或者也可以不是物理单元。前述部件或单元可以位于同一位置或者分布到多个网络单元上。另外，根据实际的需要，可以选择其中的部分或者全部单元来实现本披露实施例所述方案的目的。另外，在一些场景中，本披露实施例中的多个单元可以集成于一个单元中或者各个单元物理上单独存在。In the present disclosure, a unit described as a separate component may or may not be physically separated, and a component shown as a unit may or may not be a physical unit. The aforementioned components or units may be located at the same location or distributed over multiple network units. In addition, according to actual needs, some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure. In addition, in some scenarios, multiple units in the embodiments of the present disclosure may be integrated into one unit, or each unit exists physically independently.

在一些实现场景中，上述集成的单元可以采用软件程序模块的形式来实现。如果以软件程序模块的形式实现并作为独立的产品销售或使用时，所述集成的单元可以存储在计算机可读取存储器中。基于此，当本披露的方案以软件产品(例如计算机可读存储介质)的形式体现时，该软件产品可以存储在存储器中，其可以包括若干指令用以使得计算机设备(例如个人计算机、服务器或者网络设备等)执行本披露实施例所述方法的部分或全部步骤。前述的存储器可以包括但不限于U盘、闪存盘、只读存储器(“Read Only Memory”，简写为ROM)、随机存取存储器(“Random Access Memory”，简写为RAM)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。In some implementation scenarios, the above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer readable memory. Based on this, when the solution of the present disclosure is embodied in the form of a software product (such as a computer-readable storage medium), the software product can be stored in a memory, and it can include several instructions to make a computer device (such as a personal computer, a server, or Network devices, etc.) execute some or all of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory may include but not limited to U disk, flash disk, read-only memory ("Read Only Memory", abbreviated as ROM), random access memory ("Random Access Memory", abbreviated as RAM), mobile hard disk, magnetic disk Or various media such as CDs that can store program codes.

在另外一些实现场景中，上述集成的单元也可以采用硬件的形式实现，即为具体的硬件电路，其可以包括数字电路和/或模拟电路等。电路的硬件结构的物理实现可以包括但不限于物理器件，而物理器件可以包括但不限于晶体管或忆阻器等器件。鉴于此，本文所述的各类装置(例如计算装置或其他处理装置)可以通过适当的硬件处理器来实现，例如CPU、GPU、FPGA、DSP和ASIC等。进一步，前述的所述存储单元或存储装置可以是任意适当的存储介质(包括磁存储介质或磁光存储介质等)，其例如可以是可变电阻式存储器(“Resistive Random Access Memory”，简写为RRAM)、动态随机存取存储器(“DynamicRandom Access Memory”，简写为DRAM)、静态随机存取存储器(“Static Random AccessMemory”，简写为SRAM)、增强动态随机存取存储器(“Enhanced Dynamic Random AccessMemory”，简写为“EDRAM”)、高带宽存储器(“High Bandwidth Memory”，简写为“HBM”)、混合存储器立方体(“Hybrid Memory Cube”，简写为“HMC”)、ROM和RAM等。In other implementation scenarios, the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits. The physical realization of the hardware structure of the circuit may include but not limited to physical devices, and the physical devices may include but not limited to devices such as transistors or memristors. In view of this, various devices (such as computing devices or other processing devices) described herein may be implemented by appropriate hardware processors, such as CPU, GPU, FPGA, DSP, and ASIC. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), which may be, for example, a variable resistance memory ("Resistive Random Access Memory", abbreviated as RRAM), dynamic random access memory ("Dynamic Random Access Memory", abbreviated as DRAM), static random access memory ("Static Random AccessMemory", abbreviated as SRAM), enhanced dynamic random access memory ("Enhanced Dynamic Random AccessMemory" , abbreviated as "EDRAM"), high bandwidth memory ("High Bandwidth Memory", abbreviated as "HBM"), hybrid memory cube ("Hybrid Memory Cube", abbreviated as "HMC"), ROM and RAM, etc.

依据以下条款可更好地理解前述内容：The foregoing can be better understood in light of the following terms:

条款A1.一种用于片上系统的方法，所述片上系统包括至少用于执行运算操作的多个集群和与该多个集群互联的高速缓冲存储器，每个集群包括用于执行所述运算操作的多个处理器核，所述方法包括：Clause A1. A method for a system on chip comprising at least a plurality of clusters for performing computational operations and a cache memory interconnected with the plurality of clusters, each cluster comprising A plurality of processor cores, the method comprising:

将片外存储器的指定存储空间映射到高速缓冲存储器的给定存储区，以将该给定存储区用作集群间数据通信的集群存储区；以及mapping a specified storage space of the off-chip memory to a given storage area of the cache memory to use the given storage area as a cluster storage area for inter-cluster data communication; and

使用所述集群存储区来执行所述集群的操作。Operations of the cluster are performed using the cluster storage area.

条款A2.根据条款A1所述的方法，其中使用所述集群存储区来执行所述集群的操作包括将所述集群存储区用于集群间通信。Clause A2. The method of Clause A1, wherein using the cluster storage area to perform operations of the cluster comprises using the cluster storage area for inter-cluster communication.

条款A3.根据条款A2所述的方法，其中将所述集群存储区用于集群间通信包括：Clause A3. The method of Clause A2, wherein using the cluster storage area for inter-cluster communication comprises:

利用所述集群存储区来实现集群之间的点对点通信；或者Utilizing the cluster storage area for peer-to-peer communication between clusters; or

利用所述集群存储区来实现所述多个集群之一对其余集群的广播通信。The cluster storage area is used to implement broadcast communication from one of the multiple clusters to other clusters.

条款A4.根据条款A3所述的方法，其中利用所述集群存储区来实现集群之间的点对点通信包括：Clause A4. The method of Clause A3, wherein utilizing the cluster storage area to enable peer-to-peer communication between clusters comprises:

接收来自于第一集群针对写入数据的写操作；以及receiving a write operation for writing data from the first cluster; and

响应于第二集群的读操作，向所述第二集群发送所述写入数据。The write data is sent to the second cluster in response to a read operation by the second cluster.

条款A5.根据条款A4所述的方法，其中在所述写操作中，所述方法还包括：Clause A5. The method of Clause A4, wherein in the write operation, the method further comprises:

接收将所述写操作所关联的写入数据驻留于所述集群存储区的锁定指示；以及receiving a lock indication that write data associated with the write operation resides in the cluster storage area; and

基于所述锁定指示将所述写入数据驻留于所述集群存储区。The write data is resident in the cluster storage area based on the lock indication.

条款A6.根据条款A4或A5所述的方法，其中在所述读操作中，所述方法还包括：Clause A6. The method of clause A4 or A5, wherein in the read operation, the method further comprises:

接收令所述写入数据不写回片外存储器的读无效指示；以及receiving a read invalidation indication that the write data is not written back to the off-chip memory; and

在向所述第二集群发送所述写入数据后，基于所述读无效指示来令与所述写入数据关联的高速缓存行无效。A cache line associated with the write data is invalidated based on the read invalidation indication after the write data is sent to the second cluster.

条款A7.一种片上系统，包括：Clause A7. A system on a chip comprising:

多个集群，其中每个集群包括至少用于执行运算操作的多个处理器核；以及a plurality of clusters, wherein each cluster includes at least a plurality of processor cores for performing computational operations; and

高速缓冲存储器，其与所述多个集群互联，并且配置成：a cache memory interconnected with the plurality of clusters and configured to:

将给定存储区用作集群间数据通信的集群存储区，其中所述给定存储区与片外存储器的指定存储空间形成映射关系；以及using a given storage area as a cluster storage area for inter-cluster data communication, wherein the given storage area forms a mapping relationship with a specified storage space of the off-chip memory; and

条款A8.根据条款A7所述的片上系统，其中所述集群存储区配置成用于集群间通信。Clause A8. The system-on-chip of Clause A7, wherein the cluster memory area is configured for inter-cluster communication.

条款A9.根据条款A8所述的片上系统，其中所述集群存储区配置成用于集群间的点对点通信或所述多个集群之一对其余集群的广播通信。Clause A9. The system-on-a-chip of Clause A8, wherein the cluster memory area is configured for inter-cluster point-to-point communication or broadcast communication from one of the plurality of clusters to the remaining clusters.

条款A10.根据条款A9所述的片上系统，其中在所述点对点通信中，集群存储区配置成：Clause A10. The system-on-chip of Clause A9, wherein in the peer-to-peer communication, the cluster storage area is configured to:

条款A11.根据条款A10所述的片上系统，其中所述第二集群配置成：Clause A11. The system-on-chip of Clause A10, wherein the second cluster is configured to:

接收来自于所述第一集群的硬件信号量；以及receiving a hardware semaphore from the first cluster; and

响应于接收到所述硬件信号量，对所述集群存储区执行所述读操作。In response to receiving the hardware semaphore, the read operation is performed on the cluster memory area.

条款A12.根据条款A10所述的片上系统，其中在所述写操作中，所述第一集群配置成向所述集群存储区发送将所述写入数据驻留于所述集群存储区的锁定指示，以便所述集群存储区基于所述锁定指示来驻留所述写入数据。Clause A12. The system-on-a-chip of clause A10, wherein in the write operation, the first cluster is configured to send a lock to the cluster memory area to reside the write data in the cluster memory area indication, so that the cluster store resides the write data based on the lock indication.

条款A13、根据条款A12所述的片上系统，其中在所述读操作中，所述第二集群配置成向所述集群存储区发送令所述写入数据不写回片处存储器的读无效指示，以便所述集群存储区基于所述读无效指示来令与所述写入数据关联的高速缓存行无效。Clause A13. The system-on-chip according to clause A12, wherein in said read operation, said second cluster is configured to send a read invalidation indication to said cluster storage area causing said write data not to be written back to off-chip memory , such that the cluster store invalidates the cache line associated with the write data based on the read invalidation indication.

条款A14.一种计算装置，包括根据条款A7-A13的任意一项所述的片上系统。Clause A14. A computing device comprising the system-on-chip according to any one of clauses A7-A13.

条款A15.一种板卡，包括根据条款A14所述的计算装置。Clause A15. A board comprising the computing device according to Clause A14.

条款A16、一种计算设备，包括根据条款A15所述的板卡。Clause A16. A computing device comprising the board according to Clause A15.

虽然本文已经示出和描述了本披露的多个实施例，但对于本领域技术人员显而易见的是，这样的实施例只是以示例的方式来提供。本领域技术人员可以在不偏离本披露思想和精神的情况下想到许多更改、改变和替代的方式。应当理解的是在实践本披露的过程中，可以采用对本文所描述的本披露实施例的各种替代方案。所附权利要求书旨在限定本披露的保护范围，并因此覆盖这些权利要求范围内的等同或替代方案。While various embodiments of the present disclosure have been shown and described herein, it would be obvious to those skilled in the art that such embodiments are provided by way of example only. Many modifications, changes and substitutions may occur to those skilled in the art without departing from the idea and spirit of the present disclosure. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure. It is intended that the appended claims define the scope of protection of the present disclosure and therefore cover equivalents or alternatives within the scope of these claims.

Claims

1. A method for a system on chip comprising at least a plurality of clusters for performing arithmetic operations and a cache memory interconnected with the plurality of clusters, each cluster comprising a cluster for performing said arithmetic operations A plurality of processor cores, the method comprising:

mapping a specified storage space of the off-chip memory to a given storage area of the cache memory to use the given storage area as a cluster storage area for inter-cluster data communication; and

Operations of the cluster are performed using the cluster storage area.

2. The method of claim 1, wherein using the cluster storage area to perform operations of the cluster comprises using the cluster storage area for inter-cluster communication.

3. The method of claim 2, wherein using the cluster storage area for inter-cluster communication comprises:

Utilizing the cluster storage area for peer-to-peer communication between clusters; or

The cluster storage area is used to implement broadcast communication from one of the multiple clusters to other clusters.

4. The method according to claim 3, wherein utilizing the cluster storage area to implement peer-to-peer communication between clusters comprises:

receiving a write operation for writing data from the first cluster; and

The write data is sent to the second cluster in response to a read operation by the second cluster.

5. The method according to claim 4, wherein in the write operation, the method further comprises:

receiving a lock indication that write data associated with the write operation resides in the cluster storage area; and

The write data is resident in the cluster storage area based on the lock indication.

6. The method according to claim 4 or 5, wherein in the read operation, the method further comprises:

receiving a read invalidation indication that the write data is not written back to the off-chip memory; and

A cache line associated with the write data is invalidated based on the read invalidation indication after the write data is sent to the second cluster.

7. A system on a chip comprising:

a plurality of clusters, wherein each cluster includes at least a plurality of processor cores for performing computational operations; and

a cache memory interconnected with the plurality of clusters and configured to:

using a given storage area as a cluster storage area for inter-cluster data communication, wherein the given storage area forms a mapping relationship with a specified storage space of the off-chip memory; and

Operations of the cluster are performed using the cluster storage area.

8. The system-on-chip of claim 7, wherein the cluster memory area is configured for inter-cluster communication.

9. The system-on-chip according to claim 8, wherein the cluster storage area is configured for point-to-point communication between clusters or broadcast communication from one of the plurality of clusters to the rest of the clusters.

10. The system on chip according to claim 9, wherein in the point-to-point communication, the cluster storage area is configured to:

receiving a write operation for writing data from the first cluster; and

11. The system-on-chip of claim 10, wherein the second cluster is configured to:

receiving a hardware semaphore from the first cluster; and

In response to receiving the hardware semaphore, the read operation is performed on the cluster memory area.

12. The system-on-chip of claim 10 , wherein in the write operation, the first cluster is configured to send a lock to the cluster memory area to reside the write data in the cluster memory area indication, so that the cluster store resides the write data based on the lock indication.

13. The system-on-chip according to claim 12, wherein in the read operation, the second cluster is configured to send a read invalid indication to the cluster storage area so that the write data is not written back to the off-chip memory , such that the cluster store invalidates the cache line associated with the write data based on the read invalidation indication.

14. A computing device comprising a system-on-chip according to any one of claims 7-13.

15. A board comprising the computing device according to claim 14.