CN104252425B - Instruction cache management method and processor - Google Patents
- Publication number: CN104252425B (application CN201310269557.0A)
- Authority: CN (China)
- Prior art keywords: cache, instruction, hardware thread, instruction cache, private
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F12/0875 — Addressing of a memory level requiring associative addressing means (caches), with dedicated cache, e.g. instruction or stack
- G06F12/0842 — Multiuser, multiprocessor or multiprocessing cache systems for multiprocessing or multitasking
- G06F9/3802 — Concurrent instruction execution; instruction prefetching
- G06F9/3851 — Instruction issuing from multiple instruction streams, e.g. multistreaming
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
Embodiments of the present invention provide an instruction cache management method and a processor in the field of computers, which can expand the instruction cache capacity available to each hardware thread, reduce the instruction cache miss rate, and improve system performance. In the processor, a hardware thread identifier in the shared instruction cache identifies the hardware thread corresponding to each cache line in the shared instruction cache; a private instruction cache stores instruction cache lines evicted from the shared instruction cache; and a miss buffer is also included. When a hardware thread of the processor fetches an instruction from the instruction cache, it accesses the shared instruction cache and the hardware thread's private instruction cache simultaneously, determines whether the instruction is present in either, and, according to the result, obtains the instruction from the shared instruction cache or from the hardware thread's private instruction cache. The embodiments of the present invention are used for managing the instruction cache of a processor.
Description
Technical Field
The present invention relates to the field of computers, and in particular to an instruction cache management method and a processor.
Background Art
A CPU (Central Processing Unit) cache (cache memory) is temporary storage located between the CPU and main memory. Its capacity is much smaller than that of main memory, but it resolves the mismatch between CPU computation speed and memory read/write speed and so speeds up CPU reads.
In a multi-threaded processor, multiple hardware threads fetch instructions from the same I-Cache (instruction cache). When the requested instruction is not in the I-Cache, a miss request is sent to the next-level cache while the processor switches to another hardware thread, which continues fetching from the I-Cache; this reduces pipeline stalls caused by I-Cache misses and improves pipeline efficiency. However, when the shared I-Cache resources allotted to each hardware thread are insufficient, the I-Cache miss rate increases and miss requests from the I-Cache to the next-level cache become frequent. Moreover, when a line retrieved from the next-level cache is backfilled while the number of threads is large, the line filled into the I-Cache may not be used immediately, whereas the line it replaces may well be used again.
In addition, when the thread scheduling policy is adjusted according to cache hit behavior, threads whose memory-access instructions hit frequently in the cache are given scheduling priority for a period of time; but the underlying problem, that the shared I-Cache resources allotted to each hardware thread are insufficient, is not improved.
Summary of the Invention
Embodiments of the present invention provide an instruction cache management method and a processor, which can expand the instruction cache capacity available to each hardware thread, reduce the instruction cache miss rate, and improve system performance.
To achieve the above object, the embodiments of the present invention adopt the following technical solutions.
In a first aspect, a processor is provided, comprising a program counter, a register file, an instruction prefetch unit, an instruction decode unit, an instruction issue unit, an address generation unit, an arithmetic logic unit, a shared floating-point unit, a data cache and an internal bus, and further comprising:
a shared instruction cache, configured to store shared instructions of all hardware threads, comprising a tag storage array and a data storage array, the tag storage array being used to store tags and the data storage array comprising the stored instructions and hardware thread identifiers, each hardware thread identifier identifying the hardware thread corresponding to a cache line in the shared instruction cache;
a private instruction cache, configured to store instruction cache lines evicted from the shared instruction cache, the private instruction caches corresponding one-to-one with the hardware threads; and
a miss buffer, configured to hold, when the fetched instruction is not present in the shared instruction cache, the cache line retrieved from the next-level cache of the shared instruction cache, and, when the hardware thread corresponding to the fetched instruction next fetches, to backfill the cache line in the miss buffer into the shared instruction cache, the miss buffers corresponding one-to-one with the hardware threads.
With reference to the first aspect, in a first possible implementation of the first aspect, the processor further comprises:
tag comparison logic, configured to compare, when a hardware thread fetches an instruction, the tags in the private instruction cache corresponding to that hardware thread with the physical address translated by the translation lookaside buffer, the private instruction cache being connected to the tag comparison logic so that the hardware thread accesses the private instruction cache at the same time as it accesses the shared instruction cache.
With reference to the first possible implementation of the first aspect, in a second possible implementation, the processor is a multi-threaded processor, and the private instruction cache is fully associative, meaning that any cache line in main memory may map to any line in the private instruction cache.
With reference to the second possible implementation of the first aspect, in a third possible implementation, the shared instruction cache, the private instruction cache and the miss buffer are static or dynamic memory chips.
In a second aspect, an instruction cache management method is provided, comprising:
when a hardware thread of a processor fetches an instruction from the instruction cache, accessing the shared instruction cache in the instruction cache and the private instruction cache corresponding to the hardware thread simultaneously; and
determining whether the instruction is present in the shared instruction cache and in the private instruction cache corresponding to the hardware thread, and, according to the result, obtaining the instruction from the shared instruction cache or from the private instruction cache corresponding to the hardware thread.
With reference to the second aspect, in a first possible implementation of the second aspect, the shared instruction cache comprises a tag storage array and a data storage array, the tag storage array being used to store tags and the data storage array comprising the stored instructions and hardware thread identifiers, each hardware thread identifier identifying the hardware thread corresponding to a cache line in the shared instruction cache;
the private instruction cache is fully associative, meaning that any cache line in main memory may map to any line in the private instruction cache, and the private instruction caches correspond one-to-one with the hardware threads.
With reference to the first possible implementation of the second aspect, in a second possible implementation, determining whether the instruction is present in the shared instruction cache and in the private instruction cache corresponding to the hardware thread, and obtaining the instruction from the shared instruction cache or from the private instruction cache corresponding to the hardware thread according to the result, comprises:
if the instruction is present in both the shared instruction cache and the private instruction cache corresponding to the hardware thread, obtaining the instruction from the shared instruction cache;
if the instruction is present in the shared instruction cache but not in the private instruction cache corresponding to the hardware thread, obtaining the instruction from the shared instruction cache; and
if the instruction is present in the private instruction cache corresponding to the hardware thread but not in the shared instruction cache, obtaining the instruction from the private instruction cache corresponding to the hardware thread.
With reference to the second possible implementation of the second aspect, in a third possible implementation, the method further comprises:
if the instruction is present in neither the shared instruction cache nor the private instruction cache, sending, by the hardware thread, a cache miss request to the next-level cache of the shared instruction cache;
if the instruction is present in the next-level cache, obtaining, by the hardware thread, the instruction from the next-level cache, storing the cache line containing the instruction in the miss buffer corresponding to the hardware thread, and backfilling the cache line into the shared instruction cache when the hardware thread next fetches; and
if the instruction is not present in the next-level cache, sending, by the hardware thread, the miss request to main memory, obtaining the instruction from main memory, storing the cache line containing the instruction in the miss buffer corresponding to the hardware thread, and backfilling the cache line into the shared instruction cache when the hardware thread next fetches;
wherein the miss buffers correspond one-to-one with the hardware threads.
With reference to the third possible implementation of the second aspect, in a fourth possible implementation, when the cache line is backfilled into the shared instruction cache and the shared instruction cache has no free resources, the cache line replaces a first cache line in the shared instruction cache and is backfilled into the shared instruction cache; at the same time, according to the hardware thread identifier of the first hardware thread that fetched the first cache line, the first cache line is stored in the private instruction cache corresponding to the first hardware thread;
wherein the first cache line is determined by a least-recently-used algorithm.
With reference to the fourth possible implementation of the second aspect, in a fifth possible implementation, when the evicted first cache line is stored in the private instruction cache corresponding to the first hardware thread and that private instruction cache has no free resources, the first cache line replaces a second cache line in the private instruction cache corresponding to the first hardware thread and is filled into that private instruction cache;
wherein the second cache line is determined by the least-recently-used algorithm.
Embodiments of the present invention provide an instruction cache management method and a processor. The processor comprises a program counter, a register file, an instruction prefetch unit, an instruction decode unit, an instruction issue unit, an address generation unit, an arithmetic logic unit, a shared floating-point unit, a data cache and an internal bus, and further comprises a shared instruction cache, private instruction caches, miss buffers and tag comparison logic. The shared instruction cache stores the shared instructions of all hardware threads and comprises a tag storage array and a data storage array; the data storage array comprises the stored instructions and hardware thread identifiers, each of which identifies the hardware thread corresponding to a cache line in the shared instruction cache. The private instruction caches, corresponding one-to-one with the hardware threads, store instruction cache lines evicted from the shared instruction cache. The tag comparison logic compares, when a hardware thread fetches, the tags in that thread's private instruction cache with the physical address translated by the translation lookaside buffer; the private instruction cache is connected to the tag comparison logic so that a hardware thread accesses the private instruction cache at the same time as the shared instruction cache. When a hardware thread of the processor fetches an instruction from the instruction cache, it accesses the shared instruction cache and its private instruction cache simultaneously, determines whether the instruction is present in either, and obtains the instruction from the shared instruction cache or from its private instruction cache according to the result. This expands the instruction cache capacity available to each hardware thread, reduces the instruction cache miss rate, and improves system performance.
Brief Description of the Drawings
In order to describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of a processor according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of an instruction cache management method according to an embodiment of the present invention;
FIG. 3 is a logic diagram of simultaneously accessing a shared instruction cache and a private instruction cache according to an embodiment of the present invention;
FIG. 4 is a logic diagram of retrieving a cache line in response to a cache miss request according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Evidently, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In modern multi-threaded processor design, as the number of hardware threads grows, the share of common resources available to each hardware thread becomes insufficient; this is especially true of the L1 (Level 1) cache, an important shared resource. When the L1 instruction cache capacity allotted to each hardware thread is too small, L1 misses occur and the L1 miss rate rises, so communication between the L1 cache and the L2 cache increases as instructions are fetched from the L2 cache, or even from main memory, and processor power consumption grows.
An embodiment of the present invention provides a processor 01, as shown in FIG. 1, comprising a program counter 011, a register file 012, an instruction prefetch unit 013, an instruction decode unit 014, an instruction issue unit 015, an address generation unit 016, an arithmetic logic unit 017, a shared floating-point unit 018, a data cache 019 and an internal bus, and further comprising:
a shared instruction cache (I-Cache) 020, a private instruction cache 021, a miss buffer (Miss Buffer) 022 and tag (Tag) comparison logic 023.
The shared instruction cache 020 stores the shared instructions of all hardware threads and comprises a tag storage array (Tag Array) 0201 and a data storage array (Data Array) 0202. The tag storage array 0201 stores tags; the data storage array 0202 comprises the stored instructions 02021 and hardware thread identifiers (Thread IDs) 02022, each Thread ID 02022 identifying the hardware thread corresponding to a cache line in the shared instruction cache 020.
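The tag array / data array split with a per-line Thread ID can be sketched in software as follows. This is a minimal illustrative model, not the patent's hardware design; the names (`CacheLine`, `SharedICache`) and the direct-mapped organization are assumptions made for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CacheLine:
    tag: int          # entry from the tag storage array
    data: bytes       # instruction bytes from the data storage array
    thread_id: int    # hardware thread whose miss request fetched this line


class SharedICache:
    """Toy direct-mapped shared I-Cache whose lines carry a Thread ID."""

    def __init__(self, num_sets: int, line_size: int = 64):
        self.num_sets = num_sets
        self.line_size = line_size
        self.sets: List[Optional[CacheLine]] = [None] * num_sets

    def index_and_tag(self, addr: int) -> Tuple[int, int]:
        # Split an address into (set index, tag) at cache-line granularity.
        line_addr = addr // self.line_size
        return line_addr % self.num_sets, line_addr // self.num_sets

    def lookup(self, addr: int) -> Optional[CacheLine]:
        # A hit requires a valid line whose stored tag matches the address tag.
        idx, tag = self.index_and_tag(addr)
        line = self.sets[idx]
        return line if line is not None and line.tag == tag else None
```

The `thread_id` field is what later allows an evicted line to be routed to the private cache of the thread that originally fetched it.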
The private instruction cache 021 stores instruction cache lines evicted from the shared instruction cache 020; the private instruction caches 021 correspond one-to-one with the hardware threads.
The miss buffer 022 holds, when the fetched instruction is not present in the shared instruction cache 020, the cache line retrieved from the next-level cache of the shared instruction cache 020; when the hardware thread corresponding to the fetched instruction next fetches, the cache line in the miss buffer 022 is backfilled into the shared instruction cache. The miss buffers 022 correspond one-to-one with the hardware threads.
The tag comparison logic compares, when a hardware thread fetches, the tags in that hardware thread's private instruction cache with the PA (physical address) translated by the TLB (Translation Look-aside Buffer). The private instruction cache 021 is connected to the tag comparison logic so that the hardware thread accesses the private instruction cache 021 at the same time as the shared instruction cache 020.
The TLB, also called a page table buffer, stores page table entries (a mapping from virtual addresses to physical addresses). The virtual address of the fetched instruction is translated into a physical address through the TLB; the physical address is then compared with the tags in the private instruction cache, and if the physical address matches a tag, the hardware thread obtains the instruction from the private instruction cache while accessing the shared instruction cache.
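The translate-then-compare step can be illustrated with a toy model: a TLB represented as a page-number map, and a fully associative private cache whose every stored tag is compared against the translated physical address. The page and line sizes and all function names are assumptions for illustration only.

```python
from typing import Dict, Optional

PAGE_SIZE = 4096   # assumed page size
LINE_SIZE = 64     # assumed cache-line size


def tlb_translate(va: int, tlb: Dict[int, int]) -> Optional[int]:
    """Translate a virtual address via a toy TLB; None models a TLB miss."""
    vpn, offset = divmod(va, PAGE_SIZE)
    pfn = tlb.get(vpn)
    return None if pfn is None else pfn * PAGE_SIZE + offset


def private_cache_hit(pa: int, entries: Dict[int, bytes]) -> Optional[bytes]:
    """Fully associative lookup: the PA's line tag is checked against
    every stored tag (modeled here as a dict keyed by tag)."""
    tag = pa // LINE_SIZE
    return entries.get(tag)
```

Because the private cache is fully associative, the comparison involves no set indexing: any line of main memory may sit in any private-cache entry.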
For example, there are 16 PCs (Program Counters), PC0 through PC15; the number of logical processor cores (hardware threads) in a processor core equals the number of PCs.
Each logical processor core within a processor core corresponds to one GRF (General Register File), so the number of GRFs likewise equals the number of PCs.
The Fetch unit (instruction prefetch unit) fetches instructions; the Decoder (instruction decode unit) decodes them; Issue is the instruction issue unit, which issues instructions; and the AGU (Address Generation Unit) is the module that performs all address computation, generating the addresses used to control memory accesses. The ALU (Arithmetic Logic Unit) is the execution unit of the CPU and can be built from AND gates and OR gates. The shared floating-point unit (Shared Floating Point Unit) is the circuit in the processor dedicated to floating-point arithmetic; the data cache (D-Cache) stores data; and the internal bus connects the components within the processor.
For example, the processor 01 is a multi-threaded processor, and the private instruction cache 021 is fully associative: any cache line in main memory may map to any line in the private instruction cache.
For example, the shared instruction cache 020, the private instruction cache 021 and the miss buffer 022 are static or dynamic memory chips.
For example, a Thread ID (hardware thread identifier) can be added to the I-Cache data array; the Thread ID indicates which hardware thread's cache miss request retrieved the cache line.
For example, when a hardware thread's access to the L1 I-Cache misses, i.e. the instruction the hardware thread wants is not in the I-Cache, L1 sends a cache miss request to its next-level cache, the L2 cache. If the L2 cache hits, i.e. the instruction the hardware thread wants is present in the L2 cache, the hardware thread backfills the cache line containing the instruction from the L2 cache into the L1 cache. Alternatively, on receiving the returned cache line, the hardware thread does not fill it into the L1 cache directly, but saves it in that hardware thread's miss buffer, filling it into the L1 cache only when that hardware thread's turn to fetch comes.
In this way, when backfilling the cache line from the L2 cache into the L1 cache causes a replacement, the evicted cache line is not simply discarded; instead, according to the Thread ID of the hardware thread corresponding to the evicted cache line, the evicted line is filled into the private instruction cache of that hardware thread.
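The backfill-with-migration step described above can be sketched as follows. The shared and private caches are simplified to dicts, and the victim-selection policy is passed in as a function (in the patent this would be LRU); all names are illustrative, not the patent's implementation.

```python
from typing import Callable, Dict, Tuple


def backfill(shared: Dict[int, Tuple[bytes, int]],
             private: Dict[int, Dict[int, bytes]],
             capacity: int,
             tag: int, data: bytes, thread_id: int,
             pick_victim: Callable[[Dict], int]) -> None:
    """Backfill a returned line into the shared I-Cache; if a victim must be
    evicted, route it to the private cache of the thread named by its
    Thread ID instead of discarding it."""
    if len(shared) >= capacity and tag not in shared:
        victim_tag = pick_victim(shared)              # e.g. the LRU line
        victim_data, victim_tid = shared.pop(victim_tag)
        # The victim's Thread ID decides which private cache receives it.
        private.setdefault(victim_tid, {})[victim_tag] = victim_data
    shared[tag] = (data, thread_id)
```

The key point the sketch captures is that eviction from the shared cache demotes a line to per-thread private storage rather than dropping it, so a line that is soon reused can still hit.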
For example, the replacement may occur because the L1 cache has no free resources; the line to evict can be chosen by the LRU (Least Recently Used) algorithm.
The LRU algorithm works as follows: whenever the instruction cache misses, the instruction that has gone unused the longest is replaced out of the cache; in other words, the cache preferentially retains the most recently used instructions.
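The LRU policy just described can be sketched with an `OrderedDict`: a hit moves the line to the most-recently-used end, and on a miss with a full cache the entry at the least-recently-used end is the one replaced. This is a generic LRU sketch, not the patent's circuit.

```python
from collections import OrderedDict
from typing import Optional


class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lines: "OrderedDict[int, bytes]" = OrderedDict()

    def access(self, tag: int, fill: bytes) -> Optional[int]:
        """Touch (or fill) a line; return the evicted tag, if any."""
        if tag in self.lines:
            self.lines.move_to_end(tag)    # refresh recency on a hit
            return None
        evicted = None
        if len(self.lines) >= self.capacity:
            # Evict the least-recently-used line (front of the OrderedDict).
            evicted, _ = self.lines.popitem(last=False)
        self.lines[tag] = fill
        return evicted
```

In the patent's scheme, the tag returned as `evicted` is exactly the "first cache line" that gets demoted into the owning thread's private instruction cache.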
举例来说,一个硬件线程在取指时,可以同时访问I-Cache和该硬件线程对应的私有Cache。For example, when a hardware thread fetches an instruction, it can simultaneously access the I-Cache and the private cache corresponding to the hardware thread.
If the I-Cache holds the fetched instruction and the thread's private Cache does not, the instruction is obtained from the I-Cache;
if the I-Cache does not hold the fetched instruction but the thread's private Cache does, the instruction is obtained from the thread's private Cache;
if both the I-Cache and the thread's private Cache hold the fetched instruction, the instruction is obtained from the I-Cache;
if neither the I-Cache nor the thread's private Cache holds the fetched instruction, the hardware thread sends a Cache Miss request to the I-Cache's next-level cache to obtain it.
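Taken together, the four cases above amount to a fixed priority: the shared I-Cache wins on any hit, the private Cache is consulted next, and only a double miss goes downstream. A minimal Python sketch, with dicts standing in for the caches and a callable standing in for the Cache Miss request:

```python
def fetch(addr, icache, private_cache, next_level):
    """Resolve an instruction fetch with the priority described above.
    icache and private_cache map addr -> instruction; next_level is a
    callable standing in for the Cache Miss request to the next cache."""
    if addr in icache:                # shared hit wins, even on a double hit
        return icache[addr], "I-Cache"
    if addr in private_cache:         # private hit on shared miss
        return private_cache[addr], "private"
    return next_level(addr), "miss"   # double miss: request downstream
```

The same priority order is applied whichever combination of hits occurs, so the shared cache stays the primary source and the private cache acts purely as a per-thread victim store.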
For example, according to the thread-scheduling policy, when the next cycle switches to another thread for fetching, the new hardware thread's private Cache is connected to the Tag (tag) comparison logic while the shared instruction cache is accessed. The Tags read out of that private Cache are compared against the PA (Physical Address) output by the TLB (Translation Look-aside Buffer), generating a private Cache Miss signal and a private Cache data output. When the new thread's private Cache holds the fetched instruction, the private Cache Miss signal indicates a hit and the instruction is output.
Therefore, an embodiment of the present invention provides a processor that includes a program counter, a register file, an instruction prefetch unit, an instruction decode unit, an instruction issue unit, an address generation unit, an arithmetic logic unit, a shared floating-point unit, a data cache, and an internal bus, and further includes a shared instruction cache, private instruction caches, Miss Buffers, and Tag comparison logic. A hardware thread identifier is added to the data storage array of the shared instruction cache to record which hardware thread's cache miss request retrieved each cache line. When a replacement occurs in the shared instruction cache, the evicted cache line is stored, according to its hardware thread identifier, into the private instruction cache of the corresponding hardware thread. The Miss Buffer allows a hardware thread, on receiving the cache line returned by a cache miss request, to hold the line rather than backfill it into the shared instruction cache immediately; the line is backfilled only when it is that thread's turn to fetch. This lowers the probability that a cache line about to be accessed is evicted from the instruction cache; in addition, the added private instruction caches enlarge each hardware thread's cache capacity, improving system performance.
Another embodiment of the present invention provides an instruction cache management method, as shown in FIG. 2, including:
101. When a hardware thread of a processor fetches an instruction from the instruction cache, the processor simultaneously accesses the shared instruction cache in the instruction cache and the private instruction cache corresponding to the hardware thread.
Exemplarily, the processor (Central Processing Unit, CPU) may be a multi-threaded processor. A physical core may have multiple hardware threads, also called logical cores or logical processors, but a hardware thread is not a physical core: Windows treats each hardware thread as a schedulable logical processor, and each logical processor can run the code of a software thread. The instruction cache may comprise the shared instruction cache (I-Cache) in the processor's L1 Cache together with the hardware threads' private instruction caches; the L1 Cache consists of a data cache (D-Cache) and an instruction cache (I-Cache).
Specifically, a fully associative private Cache may be provided in each hardware thread, i.e. private Caches correspond one-to-one with hardware threads. In a fully associative structure, any block of main memory may map to any entry of the private instruction cache.
In addition, Tag (tag) comparison logic can be added: when a hardware thread fetches an instruction, its private Cache is actively connected to the Tag comparison logic, so the thread accesses the I-Cache and its private Cache at the same time.
102. The processor determines whether the instruction is present in the shared instruction cache and in the hardware thread's private instruction cache, then proceeds to step 103, 104, 105, or 106.
Exemplarily, when a hardware thread accesses the I-Cache and its private Cache simultaneously, it checks both at once for the fetched instruction.
Suppose the multi-threaded processor has 32 hardware threads sharing one 64 KB I-Cache, i.e. the shared instruction cache capacity is 64 KB. Each hardware thread has a 32-way fully associative private Cache that can store 32 evicted Cache Lines of 64 bytes each, so each private Cache holds 2 KB.
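The sizing in this example can be checked with a little arithmetic (using the stated figures: 64-byte lines and 32 ways per private Cache):

```python
threads = 32                 # hardware threads sharing one I-Cache
shared_kb = 64               # shared I-Cache capacity, in KB
ways_per_private = 32        # fully associative: one way per stored line
line_bytes = 64              # one Cache Line

private_kb = ways_per_private * line_bytes / 1024
total_kb = shared_kb + threads * private_kb
print(private_kb, total_kb)  # 2.0 128.0
```

So each private Cache contributes 2 KB, and the private Caches collectively double the effective instruction cache capacity from 64 KB to 128 KB.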
When 32-way Tag comparison logic is added, the hardware thread, while accessing the shared I-Cache, compares the 32 Tags it reads out against the PA (Physical Address) output by the TLB and generates a private Cache Miss signal and a private Cache data output. If one of the 32 Tags equals the PA, the private Cache Miss signal indicates that the thread's private Cache holds the fetched instruction, and the private Cache data is a valid instruction, as shown in FIG. 3.
The TLB, also called the page table buffer, stores page table entries (virtual-to-physical address translations). The virtual address of the fetched instruction is translated into a physical address through the TLB and compared against the tags in the private instruction cache; if the physical address matches a tag, the hardware thread hits the private instruction cache while it is also accessing the shared instruction cache.
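A hypothetical software model of this translated-tag lookup, assuming 4 KB pages and 64-byte lines (concrete parameters the passage does not fix):

```python
def private_lookup(vaddr, tlb, tags, data, line_bits=6):
    """Compare the TLB-translated physical address against every tag of a
    fully associative private cache, mimicking the parallel Tag comparison.
    tlb maps virtual page -> physical frame; tags/data are per-way lists."""
    page = vaddr >> 12                  # 4 KB pages assumed
    if page not in tlb:
        return None                     # TLB miss: not modeled further here
    paddr = (tlb[page] << 12) | (vaddr & 0xFFF)
    tag = paddr >> line_bits            # drop the 64-byte line offset
    for way, t in enumerate(tags):      # all ways compared "at once"
        if t == tag:
            return data[way]            # private cache hit: output data
    return None                         # private Cache Miss signal asserted
```

In hardware the per-way comparisons happen in parallel; the loop here is only a sequential stand-in for that comparison array.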
103. If the instruction is present in both the shared instruction cache and the hardware thread's private instruction cache, the processor obtains the instruction from the shared instruction cache.
Exemplarily, when the I-Cache and the private Cache are accessed simultaneously and both hold the fetched instruction, the instruction is obtained from the I-Cache.
104. If the instruction is present in the shared instruction cache but not in the hardware thread's private instruction cache, the processor obtains the instruction from the shared instruction cache.
Exemplarily, if the fetched instruction is in the I-Cache but not in the thread's private Cache, the instruction is obtained from the I-Cache.
105. If the instruction is present in the hardware thread's private instruction cache but not in the shared instruction cache, the processor obtains the instruction from that private instruction cache.
Exemplarily, if the I-Cache misses, i.e. does not hold the fetched instruction, but the thread's private Cache does hold it, the instruction is obtained from the private Cache. By actively selecting the thread's private Cache to participate in the Tag comparison, the Cache capacity available to each hardware thread is expanded and the thread's instruction cache hit rate is increased.
106. If the instruction is in neither the shared instruction cache nor the private instruction cache, the processor, through the hardware thread, sends a cache miss request to the next-level cache below the shared instruction cache.
Exemplarily, if neither the I-Cache nor the thread's private Cache holds the fetched instruction, the thread issues a Cache Miss to the I-Cache's next-level cache.
For example, if neither the L1 Cache nor the thread's private Cache holds the fetched instruction, the thread issues a Cache Miss to the L2 Cache, the level below the L1 Cache, in order to obtain the instruction from the L2 Cache.
107. If the instruction is present in the next-level cache, the processor obtains it from that cache through the hardware thread, stores the cache line containing the instruction in the thread's corresponding Miss Buffer, and backfills the cache line into the shared instruction cache when the thread next fetches.
Exemplarily, when the L2 Cache holds the fetched instruction, the instruction is obtained from the L2 Cache, and the Cache Line containing it is not backfilled into the L1 Cache directly but saved in the thread's corresponding Miss Buffer until it is that thread's turn to fetch, at which point the Cache Line is filled into the L1 Cache.
Miss Buffers correspond one-to-one with hardware threads: each thread has one Miss Buffer and uses it to hold the Cache Line returned by its Cache Miss request. The reason is that, were the Cache Line backfilled into the L1 Cache immediately and a replacement occurred, the evicted Cache Line might be one that is about to be accessed. The Miss Buffer optimizes the backfill timing of Cache Lines and lowers the probability that a soon-to-be-accessed Cache Line is evicted from the cache.
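A minimal sketch of the per-thread Miss Buffer behavior described above (the class and method names are hypothetical):

```python
class MissBuffer:
    """Per-thread holding buffer: a Cache Line returned by a Cache Miss
    request waits here until the thread's next fetch slot, and only then
    is backfilled into the shared instruction cache."""
    def __init__(self):
        self.pending = []               # (addr, line) pairs awaiting backfill

    def hold(self, addr, line):
        """Called when the miss response arrives: do NOT backfill yet."""
        self.pending.append((addr, line))

    def drain(self, icache_fill):
        """Called when it is this thread's turn to fetch: backfill now."""
        for addr, line in self.pending:
            icache_fill(addr, line)
        self.pending.clear()
```

Deferring the fill to the thread's own fetch slot is what avoids displacing a line another thread is about to use mid-rotation.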
108. If the instruction is not present in the next-level cache, the processor, through the hardware thread, sends a miss request to main memory, obtains the instruction from main memory, stores the cache line containing the instruction in the thread's corresponding Miss Buffer, and backfills the cache line into the shared instruction cache when the thread next fetches.
Exemplarily, if the L2 Cache also lacks the fetched instruction, the thread issues a Cache Miss request to main memory in order to obtain the instruction there. If main memory holds the instruction, it is obtained, and the Cache Line containing it is saved in the thread's Miss Buffer until the thread's turn to fetch, when the Cache Line is filled into the L1 Cache.
Alternatively, when the L2 Cache lacks the fetched instruction, the hardware thread may issue a Cache Miss request to the L3 Cache; if the L3 Cache holds the instruction, it is obtained there, and if not, a Cache Miss request is sent to main memory to obtain it.
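The L2, then L3, then main-memory fallback chain can be sketched as a simple walk down the levels (dicts stand in for the caches; main memory is modeled as always backing valid addresses):

```python
def fetch_through_hierarchy(addr, levels):
    """Walk the memory hierarchy in order (e.g. L2 -> L3 -> main memory),
    issuing a Cache Miss at each level until the instruction is found.
    levels is an ordered list of (name, {addr: instruction}) pairs."""
    for name, store in levels:
        if addr in store:
            return store[addr], name    # hit at this level
    raise LookupError("address not backed by main memory")
```

Each level consulted corresponds to one Cache Miss request in the scheme above; a real implementation would also trigger the line fills along the way.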
The unit of exchange between the CPU and the Cache is the word. When the CPU reads a word from main memory, the memory address it issues reaches the Cache and main memory simultaneously; the L1, L2, or L3 Cache control logic uses the Tag portion of the address to determine whether the word is present. On a hit, the CPU obtains the word; on a miss, a main-memory read cycle reads the word out of main memory and delivers it to the CPU. Even though the CPU is reading only one word, the Cache controller copies the complete Cache line containing that word from main memory into the Cache; this transfer of one line of data into the Cache is called a Cache line fill.
In addition, when backfilling the cache line into the shared instruction cache, if the shared instruction cache has no free resources, the cache line replaces a first cache line in the shared instruction cache and is backfilled in its place, and the first cache line is stored, according to the hardware thread identifier of the first hardware thread that fetched it, into the first hardware thread's private instruction cache. The first cache line is determined by the LRU (Least Recently Used) algorithm.
Exemplarily, a Thread ID (hardware thread identifier) can be added to the I-Cache Data Array (instruction cache data storage array) to record which hardware thread's Cache Miss request retrieved each Cache Line. Then, with a fully associative private Cache provided per thread, a Cache Line evicted on an I-Cache replacement is not discarded directly; instead, guided by its Thread ID, it is filled into the private Cache of the hardware thread that the Thread ID identifies, because the evicted Cache Line may well be accessed again soon, as shown in FIG. 4.
When the evicted first cache line is stored in the first hardware thread's private instruction cache, if that private instruction cache has no free resources, the first cache line replaces a second cache line within it and is then filled into the first hardware thread's private instruction cache; the second cache line is determined by the LRU algorithm.
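The two-stage replacement above, first an LRU eviction from the shared cache, then parking the victim in its owner's private cache (possibly evicting there too), can be sketched as follows. Dict insertion order stands in for the LRU state, a simplification of true recency tracking:

```python
def backfill_shared(icache, capacity, addr, line, tid, private_caches,
                    priv_capacity=32):
    """Backfill a line into the shared cache; on overflow, migrate the
    victim to its owner's private cache instead of discarding it.
    icache maps addr -> (line, thread_id); private_caches maps
    thread_id -> {addr: line}."""
    if len(icache) >= capacity:
        v_addr, (v_line, v_tid) = next(iter(icache.items()))  # victim
        del icache[v_addr]
        priv = private_caches[v_tid]        # owner found via its Thread ID
        if len(priv) >= priv_capacity:      # private cache also full
            priv.pop(next(iter(priv)))      # drop its own victim
        priv[v_addr] = v_line               # keep: may be accessed soon
    icache[addr] = (line, tid)              # tag new line with its owner
```

Storing the Thread ID alongside each line is what makes the migration possible: the eviction logic never has to guess which thread's working set the victim belonged to.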
As before, the LRU algorithm evicts the instruction that has gone unused the longest whenever the instruction cache misses; in other words, the cache preferentially retains the most recently used instructions.
In this way, adding the private Caches effectively enlarges the instruction Cache capacity allotted to each hardware thread, raises each thread's instruction Cache hit rate, and reduces traffic between the I-Cache and the next-level Cache; the added Miss Buffers optimize the backfill timing of Cache Lines and lower the probability that a soon-to-be-accessed Cache Line is evicted; and the added Tag comparison logic lets an access to the I-Cache probe the shared instruction cache and the private instruction cache at once, increasing the instruction cache hit rate.
An embodiment of the present invention provides an instruction cache management method. When a hardware thread of a processor fetches an instruction from the instruction cache, the shared instruction cache and the thread's private instruction cache are accessed simultaneously; whether each holds the instruction is determined, and the instruction is obtained from the shared or private instruction cache according to the result. If neither holds the instruction, the hardware thread sends a cache miss request to the next-level cache below the shared instruction cache, the cache line containing the instruction is stored in the thread's Miss Buffer, and the cache line is backfilled into the shared instruction cache when the thread next fetches. During backfill, if the shared instruction cache has no free resources, the cache line replaces a first cache line there, and the first cache line is stored, according to the hardware thread identifier of the first hardware thread that fetched it, in that thread's private instruction cache. This enlarges the hardware threads' instruction cache capacity, lowers the instruction cache miss rate, and improves system performance.
In the several embodiments provided in this application, it should be understood that the disclosed processor and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation; multiple units or components may be combined or integrated into another system, and some features may be omitted or not performed. Moreover, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or of other forms.
In addition, in the devices and systems of the various embodiments of the present invention, the functional units may be integrated into one processing unit, may each exist physically on their own, or two or more may be integrated into one unit. Each of the above units may be implemented in the form of hardware, or in the form of hardware plus software functional units.
All or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (9)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201310269557.0A CN104252425B (en) | 2013-06-28 | 2013-06-28 | The management method and processor of a kind of instruction buffer |
| PCT/CN2014/080059 WO2014206217A1 (en) | 2013-06-28 | 2014-06-17 | Management method for instruction cache, and processor |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN104252425A CN104252425A (en) | 2014-12-31 |
| CN104252425B true CN104252425B (en) | 2017-07-28 |