
CN112749120A - Techniques for efficiently transferring data to a processor - Google Patents

Techniques for efficiently transferring data to a processor

Info

Publication number
CN112749120A
CN112749120A (application number CN202010490360.XA)
Authority
CN
China
Prior art keywords
memory
data
shared memory
cache
registers
Prior art date
Legal status
Granted
Application number
CN202010490360.XA
Other languages
Chinese (zh)
Other versions
CN112749120B (en)
Inventor
A·克尔
J·肖凯特
小刚·邱
O·帕兰杰佩
P·饶
S·加德雷
S·J·海因里希
M·帕特尔
O·吉普
A·卡茨
Current Assignee
Nvidia Corp
Original Assignee
Nvidia Corp
Priority date
Filing date
Publication date
Priority claimed from US16/712,083 (US11080051B2)
Application filed by Nvidia Corp
Publication of CN112749120A
Application granted
Publication of CN112749120B
Current legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/167Interprocessor communication using a common memory, e.g. mailbox
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/17Interprocessor communication using an input/output type connection, e.g. channel, I/O port

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract


The invention discloses a technology for efficiently transmitting data to a processor, and specifically discloses a technology for block data transmission, which can reduce data transmission and memory access overhead, and significantly reduce multiprocessor activity and energy consumption. A thread executing on a multiprocessor that needs data stored in global memory can request the required data and store it in on-chip shared memory, which the thread can access multiple times. Data can be loaded from global memory and stored in shared memory using instructions that direct data into shared memory without storing data in the multiprocessor's registers and/or cache memory during data transfers.


Description

Techniques for efficiently transferring data to a processor
Cross Reference to Related Applications
The present application claims the benefit of provisional application serial No. 62/927,417, filed on October 29, 2019, and provisional application serial No. 62/927,511, filed on October 29, 2019, each of which is incorporated herein by reference. This application is related to commonly assigned, co-pending U.S. patent application serial No. 16/712,236, filed in December 2019.
Technical Field
The technology relates to efficient use of processor memory, and to processor Instruction Set Architectures (ISAs) that include instructions facilitating such efficient use. More specifically, the techniques herein relate to efficient data storage in a memory shared by multiple cores, and to dedicated or additional memory access instructions that reduce register pressure and/or data transfer latency during memory transfers.
Background
Large-scale parallel, high-performance, multithreaded, multi-core processing systems (systems comprising many processing cores running in parallel) process data much faster than in the past. These processing systems can break complex computations down into smaller tasks, which can then be performed simultaneously by multiple parallel processing cores. This "divide and conquer" approach allows complex calculations to be performed in a fraction of the time that would be needed if only one or a few processors processed the same calculations sequentially.
As the number of concurrent processing tasks increases, the amount of data required to support these many parallel computations also increases, creating a bottleneck for data access in "nearby" on-chip memory.
Modern processing chips (e.g., GPUs) typically contain a large amount of memory near the parallel processors, thereby reducing memory access latency. For example, some NVIDIA GPUs include about 12GB or more of local on-chip high-bandwidth storage (including, for example, 4GB of cache/shared memory) to service some 5000 cores running in parallel.
To further reduce data access latency, modern processing systems typically organize storage as a hierarchy (e.g., level 1 (L1) cache, level 2 (L2) cache, level 3 (L3) cache, main memory, etc.). Such a memory hierarchy stores data that is currently being processed by the processing cores, closer to the processing cores for easier access. The cache memory closest to the processing cores (e.g., the L1 cache) may be partitioned, distributed, or otherwise organized such that each processing core or group of processing cores accesses its own cache exclusively, thereby avoiding latency due to contention of memory with other cores. Such cache memory is typically supported by hardware circuitry that maintains tags and is responsible for automatically writing a "dirty" cache line back to main memory before the line is flushed, thereby saving the need for a software programmer to explicitly manage the cache.
There is also a need to share local memory between processing cores. Sometimes one or more processing cores need to access values computed by one or more different processing cores, or to use memory as a communication channel to signal another processing core. While one core may write data back to main memory for another core to access, accessing main memory typically requires thousands of cycles. Hierarchical cache memory architectures can sometimes be used to share such data between processors without waiting for a main memory access. But to better facilitate data sharing among multiple processors, modern GPUs typically also have local on-chip memory that is shared among the processing cores. Furthermore, such shared memory is typically supported/accelerated by copy (direct memory access, or DMA) hardware.
Using shared memory in addition to or instead of cache memory may provide certain advantages. For example, applications using shared memory support more coalesced accesses and may achieve a higher degree of memory-level parallelism. See, for example, the following technical article, incorporated by reference herein as if expressly set forth: C. Li, Y. Yang, H. Dai, S. Yan, F. Mueller, and H. Zhou, "Understanding the Tradeoffs Between Software-Managed vs. Hardware-Managed Caches in GPUs," 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Monterey, CA, 2014, pp. 231-242.
In some systems, physical memory blocks may be flexibly allocated between different local memory functions (e.g., between shared memory and cache memory). In one example, an embodiment of the NVIDIA Volta GPU architecture provides a software programming option for dynamically (at runtime) adjusting the local memory allocation. In particular, the NVIDIA Volta architecture combines the functionality of the L1 and texture caches into a unified L1/texture cache, which acts as a coalescing buffer for memory accesses, collecting the data requested by the threads in a warp (thread bundle) before passing the data to the warp. In the Volta architecture, the L1 cache, texture cache, and shared memory are backed by a combined data cache. The portion of the combined data cache dedicated to shared memory (referred to as the "carveout" in CUDA) can be selected at runtime using the CUDA call cudaFuncSetAttribute() with the attribute cudaFuncAttributePreferredSharedMemoryCarveout, which effectively selects how much shared memory should be allocated for each Streaming Multiprocessor (SM).
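As an illustration of the runtime carveout selection just described, a minimal host-side CUDA sketch might look like the following; the kernel name and the 50% carveout value are placeholders chosen for illustration, not values from the patent.

    #include <cuda_runtime.h>

    __global__ void myKernel() { /* kernel body omitted; placeholder only */ }

    void configureCarveout()
    {
        // Request that roughly 50% of the unified L1/shared memory storage be
        // reserved as shared memory for this kernel. The driver treats the value
        // as a hint; cudaSharedmemCarveoutMaxShared or cudaSharedmemCarveoutMaxL1
        // may be used instead of a percentage.
        cudaFuncSetAttribute(myKernel,
                             cudaFuncAttributePreferredSharedMemoryCarveout,
                             50);
    }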
The benefit of combining the L1 cache with shared memory is improved latency and bandwidth. For many applications, this approach narrows the performance gap between explicitly managed shared memory and direct access to device memory. Such shared memory allows parallel threads executing on the same or different processing cores to temporarily store and exchange data. Thus, the shared memory provides a communication mechanism that enables otherwise independent threads to communicate with each other. In addition, the cost of register spills is lower, and the balance between occupancy and spilling can be re-evaluated to provide optimal performance.
Traditionally, storing data from main memory to shared memory requires a multi-step process. First, the processor executes a memory load instruction from main memory, which causes the addressed data to be read from main memory and stored into one or more cache lines of cache memory, and then written from cache memory to one or more registers of the processor. Such registers may be allocated in a register file (which may be another block of local memory), with different registers in the register file being allocated to different processors or processor cores. The processor may then execute the store instruction to store the data in its registers in the shared memory.
When common operations such as matrix multiplication require large data transfers, this conventional method of loading data into shared memory consumes a large number of registers for an extended and often indeterminate period of time. During this time (which in some cases may last thousands of cycles due to long main memory latencies or other dependencies), the registers are tied up and cannot be used for any other purpose. Tying up registers in this way prevents the associated processor from doing useful work until the registers are released.
Thus, while using shared memory for some computations may reduce data availability bottlenecks, existing methods of moving data into and out of shared memory typically consume resources (i.e., registers) that could otherwise be used for data processing, which may reduce the overall processing rate of the processing system when performing some data-intensive computations. Therefore, there is a need to manage memory bandwidth requirements more efficiently while achieving higher mathematical throughput in areas such as Artificial Intelligence (AI) and Deep Learning (DL).
Drawings
The following detailed description of exemplary, non-limiting illustrative embodiments will be read in connection with the following drawings:
FIGS. 1A-1B illustrate an example, non-limiting parallel processing architecture in which data is routed into shared memory via caches and registers.
FIGS. 2A-2B illustrate an example, non-limiting parallel processing architecture in which data is written to shared memory, bypassing registers.
FIGS. 3A-3B illustrate an example, non-limiting parallel processing architecture in which data is written to shared memory, bypassing both registers and the cache.
FIG. 4 illustrates an example, non-limiting block diagram of a memory control system.
FIG. 5 illustrates an example of a shared memory including a plurality of memory banks coupled to an interconnect circuit.
FIG. 6 illustrates an example representation of how data retrieved from main memory or global memory may be laid out in shared memory to reduce conflicting requests.
Fig. 7 illustrates an exemplary sector pattern including four sectors that may be applied to a shared memory.
FIG. 8 illustrates an example parallel processing unit.
FIG. 9A illustrates an example general processing cluster in the parallel processing unit of FIG. 8.
FIG. 9B illustrates an example memory partition unit of the parallel processing unit of FIG. 8.
FIG. 10A illustrates the example streaming multiprocessor of FIG. 9A.
FIG. 10B is an exemplary conceptual diagram of a processing system implemented using the Parallel Processing Unit (PPU) of FIG. 8.
FIG. 10C is a block diagram of an exemplary system in which the various architectures and/or functions of the foregoing various embodiments can be implemented.
FIG. 11 is a conceptual diagram of an example graphics processing pipeline implemented by the PPU of FIG. 8, according to an embodiment.
Detailed Description
The example non-limiting techniques herein provide a solution for efficient data access bandwidth to feed streaming multiprocessor core math units at "speed of light" rates. In streaming multiprocessors (SMs), there is a need to reduce memory input/output interface (MIO) bandwidth and to reduce the register pressure of moving data operands, in order to increase math throughput by a factor of 2 over previous architectures.
The illustrative, non-limiting embodiments provide block data transfer techniques as a way to reduce data transfer and memory access overhead, thereby significantly reducing multiprocessor (e.g., SM-level) activity and power consumption. In a typical computing program, threads cooperatively load data from global memory and then store it in shared memory so that the data can be accessed and processed multiple times later. Example embodiments provide a new instruction that directs data loaded from global memory to be stored directly into shared memory. This eliminates moving data through the SM registers and minimizes data movement through the MIO. The illustrative, non-limiting embodiments allow an executing application to selectively direct requested data from global memory to the shared memory (1) bypassing the registers and the L1 cache, (2) bypassing the registers but passing through the L1 cache, or (3) via the L1 cache and registers. In the illustrative non-limiting embodiment, the shared memory is software managed and the L1 cache memory is hardware managed.
Accordingly, one aspect of the exemplary, non-limiting techniques herein provides an Instruction Set Architecture (ISA) that provides (at least) three different types of load and store from memory to memory instructions or sequences of instructions:
(a) a typical conventional load instruction, where data is retrieved from main or global memory and written to a cache line of an on-chip cache memory (which may be managed by hardware), and then data is written from the cache memory to one or more registers of the processing core (which, once resident in one or more registers of the processing core, may be stored from the one or more registers to the on-chip shared memory using conventional store-to-memory instructions); or
(b) a first type or kind of load-from-memory bypass instruction that bypasses one or more processor core registers, i.e., data written to a cache line is copied into shared memory without first being written to one or more processing core registers; or
(c) a second type or kind of bypass load instruction that bypasses one or more processor core registers and the cache memory, i.e., data retrieved from main memory or global memory is written directly to shared memory without first being stored in processor core registers or cache memory.
A developer may select different ones of these instructions to perform load-from-memory operations according to application requirements and performance requirements.
For example, in many cases where it is important to maintain memory coherency using hardware support of on-chip caches, developers will want to explicitly control the storage of individual values to specific shared memory locations using application-defined addresses. In these cases, the developer will cause the execution of conventional load and store instructions to provide a conventional memory load path of global memory > cache > register > shared memory.
On the other hand, in some cases, developers may wish to obtain the benefits of hardware-maintained cache memory, but applications do not require the loaded data as values loaded by a particular thread. It need only be data that is commonly loaded and explicitly synchronized in shared memory. In this case, the developer may cause a first type or kind of bypass load instruction to be executed to provide a memory load path for global memory > cache > shared memory.
In other cases, the developer may wish to load data into shared memory as efficiently as possible and wish to do so with minimal additional resources, in which case the developer may cause execution of a second type or kind of bypass load instruction to provide a global memory > shared memory load path.
A typical conventional transfer, in which data is retrieved from main memory or global memory and written through a cache line of an on-chip cache memory, is implemented by two separate instructions: a load from global memory instruction, followed by a separate store to shared memory instruction. The load from global memory instruction loads data from global memory into the data registers of the requesting thread. The store to shared memory instruction stores the data in those registers to shared memory.
The illustrative, non-limiting embodiments provide a fused load and store instruction (LDGSTS) that can load data from global memory (LDG) and store the data in shared memory (STS) while bypassing the processor core registers. The fused load and store instruction (LDGSTS) may load data from global memory and store the data into shared memory with the option of bypassing one or more processor core registers, or of bypassing both one or more processor core registers and the cache memory. A first type or class of the fused load/store instruction, LDGSTS.BYPASS, bypasses both the processor core registers and the cache memory; a second type, LDGSTS.ACCESS, bypasses the processor core registers but moves the data through the cache memory. The instruction codes LDGSTS.ACCESS and/or LDGSTS.BYPASS, like those of legacy loads and stores, may be loaded into the instruction register, and the instruction decoder may convert the instructions into hardware settings. Based on the type or kind of instruction issued, the load and store of data from global memory to shared memory may control the hardware to bypass one or more processor core registers and/or the cache memory.
In some example non-limiting embodiments, the decision of which of the different types or kinds of load-from-memory instructions to use may be made by an optimizing compiler, e.g., based on a switch included in the source code by the developer, on tracking where the data goes (e.g., whether the data is ever altered and written back to main or global memory), on whether the data is accessed by multiple cores/warps (rather than a single core/warp) after the load instruction is executed, and/or on other optimization factors.
In some examples, the decision of which of the different types or classes of load-from-memory instructions to use may be made dynamically at runtime by evaluating conditions the developer places in the source code. For example, the decision may be made using a memory descriptor (included as an instruction parameter), which serves as a hint to convert LDGSTS.ACCESS into LDGSTS.BYPASS. LDGSTS.ACCESS may be overridden in a variant with memory descriptors, for example: if the memory descriptor has l1InVDontAllocate set, and .sz is specified as 128, then the global memory access will be treated as if BYPASS were specified. There may also be instruction attributes, such as the operand sizes supported by LDGSTS.
Those skilled in the art will appreciate that the same thread of execution may use any or all of these different memory load instructions, depending on the context. Thus, the same executing thread may load some values from global memory into shared memory using a conventional load instruction, it may load other values in global memory into shared memory using a first type of bypass load instruction, and it may still load other values in global memory into shared memory using a second type of bypass load instruction.
Example non-limiting embodiments provide one or more of the following:
Multithreaded copy with asynchronous DMA-like behavior. The copy may be directed to access the cache through L1 or to bypass it.
Detection and encoding of fill patterns. The data is "swizzled" as it is stored to shared memory so that subsequent accesses are conflict-free. Rather than staging the data through the SM registers and swizzling it when it is stored in shared memory, the example embodiments detect the swizzle pattern generated between cooperating threads and encode it for use during the data fill.
Byte-level zero fill of data that exceeds a boundary.
A programming model that, among other things, allows efficient software pipelining.
The example non-limiting techniques herein provide a block data transfer technique that reduces data transfer and memory access overhead, thereby significantly reducing streaming multiprocessor SM-level activity and energy consumption. In a typical computing program, threads cooperatively load data from global memory and then store it in shared memory so that the data can be accessed and processed multiple times later.
FIGS. 1A-3B illustrate an example, non-limiting parallel processing architecture showing software-controlled transfer of data from global memory 511 to shared memory 574, which may be accessed by multiple functional units 512(0)-512(N). The parallel processing architecture includes a plurality of SMs 440 (SM0-SMN) that can access global memory 511 external to the SMs 440. Global memory 511 may include hierarchical cache memory (e.g., a level 2 cache and/or a level 3 cache) and/or DRAM memory. In some examples, referring to FIGS. 8-10C, global memory 511 may include Memory Management Unit (MMU) 490, X-Bar 370, memory partition unit 380, and/or memory 304.
Each SM 440 includes multiple cores, such as functional units 512(0)-512(N), configured to process multiple threads in parallel. A thread (e.g., an execution thread) is an instance of a set of instructions configured to be executed by the functional units 512(0)-512(N) on a particular data set. The threads in a thread block may be executed simultaneously, and multiple thread blocks may be executed simultaneously. In some embodiments, single-instruction, multiple-data (SIMD) instruction issue techniques are used to support parallel execution of a large number of threads without providing multiple independent instruction units. In other embodiments, single-instruction, multiple-thread (SIMT) techniques are used to support parallel execution of a large number of generally synchronized threads, using a common instruction unit configured to issue instructions to a set of cores. Unlike a SIMD execution regime, where all cores typically execute the same instructions, SIMT execution allows different threads to more easily follow divergent execution paths through a given program. Those of ordinary skill in the art will appreciate that a SIMD processing regime represents a functional subset of a SIMT processing regime.
Each functional unit 512(0)-512(N) is coupled to cache memory 572, shared memory 574, and register file 520 via an interconnection network, such as a hierarchical crossbar having one or more read and/or write crossbars. Cache memory 572, which may be a level 1 (L1) cache, and shared memory 574 provide low-latency on-chip memory near each functional unit 512(0)-512(N). Register file 520 may include data registers that can be allocated by software to the different functional units 512(0)-512(N) and/or to different threads executed by the SM 440. Register file 520 provides temporary storage for the functional units 512(0)-512(N).
Parallel processing architectures may support multiple address spaces, including local, shared, and global, to support data visibility for threads. Additional read-only address spaces, including constants and textures, may be supported. Each thread has its own local or dedicated memory per thread that is controllable by register allocation (see, e.g., U.S. patent No. 8,555,035 and U.S. patent No. 7,634,621, which are incorporated herein by reference as if explicitly set forth).
Each thread in the same thread block or a different thread block may access global memory 511 through the hierarchical cache memory. Each thread in the same thread block may access an allocated portion of shared memory 574, which may be allocated on a per-thread-block basis. Each executing thread block may have its own allocated portion of shared memory 574. Shared memory 574 is a software-managed cache used to load data from global memory in order to reduce the number of off-chip memory accesses made by the executing threads. The software explicitly allocates and accesses shared memory 574. The threads in a thread block are synchronized (e.g., after cooperatively loading data from global memory into shared memory) to avoid critical resource usage conflicts.
When multiple threads in a block are expected to use the same data from global memory 511, shared memory 574 may be used to store this data, thereby reducing the number of requests to global memory 511. Shared memory 574 may also be used to avoid non-coalesced memory accesses: data is loaded and stored from global memory 511 in a coalesced pattern and then reordered in shared memory 574 to improve thread access to the data.
In some embodiments, unified cache system 570 may include a data RAM that is configurable as both shared memory 574 and local cache memory 572. The unified cache system may be provided in the same on-chip memory (e.g., SRAM) used for both the L1 cache and the shared memory, and includes a mechanism for allocating how much of the unified memory is dedicated to the L1 cache versus shared memory for each kernel call. Combining the L1 data cache with shared memory into one unified memory reduces latency and provides higher bandwidth. In some examples, the unified cache system may also include a dynamically configurable register file (e.g., register file 520). For information about the unified cache system and how to configure it, see the following references, which are incorporated by reference as if expressly set forth: U.S. Patent Application Publication No. 2018/0322078; and the CUDA C Programming Guide, PG-02829-.
When software operating in one or more of the functional units 512(0)-512(N) needs data stored in global memory 511, the software initiates a thread with a load from memory command. The load from memory command may load data from global memory 511 and store the data in shared memory 574 so that it is visible to all threads (e.g., all threads in a block). Once the data is stored in the shared memory, the threads may access the data multiple times.
FIGS. 1A and 1B illustrate the conventional stages of transferring data from global memory (e.g., L2) to the SM's shared memory 574 through a hierarchical crossbar. Each SM 440 may have one or more read ports, each providing 32 bytes of fill bandwidth.
Data is loaded using a load from global memory (LDG) operation followed by a store to shared memory (STS) operation. In some examples, the SM's L1 cache has a unified physical data random access memory (RAM) for both tagged data and shared memory. Such a data RAM organization may have, for example, 32 banks of 32 bits each. However, even with unified memory, conventional approaches temporarily stage the data through multiple memory locations (i.e., the L1 cache and registers) before storing the data in shared memory.
In previous GPUs, incoming read port data (32-byte sector fills) was first written back to cache memory 572 (e.g., the L1 cache). An L1 cache line may be 128 bytes, with four 32-byte sectors. An incoming fill sector is directed to one of the four cache line sectors. When all sectors associated with the load operation are received, the load operation completes and the data is read back from the fill sectors in L1, aligned to the requesting threads through the read crossbar, and then written back to the destination registers. An STS operation is then issued, in which data is read from the SM register file and then written back to the shared memory data banks after traversing the write crossbar to align it with the shared memory addresses. Thus, the entire sequence involves two wide RAM reads and writes and two traversals of the local crossbar.
As shown in FIGS. 1A and 1B, both cache memory 572 and register file 520 are used during the data transfer. This approach can compromise performance for a variety of reasons. When a load from global memory 511 operation is issued, registers are allocated to the thread to hold the results. Since a load operation may request data from one hundred different locations, and the latencies should overlap as much as possible, one hundred different registers may be tied up. Furthermore, these registers remain tied up until the load completes, which can take thousands of cycles. While fewer registers could be allocated, allocating fewer registers for such load operations would serialize the memory read and load operations, increasing transfer time.
Example code to load data into shared memory via global memory Load (LDG) and then store to shared memory (STS) instructions may include:
(The code listing appears in the original publication only as an image.)
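As a substitute for the listing above, the following is a hedged CUDA sketch of the same conventional LDG + STS + LDS/FFMA pattern; the kernel name, tile size, and arithmetic are illustrative inventions, not the patent's code.

    #define TILE 32

    __global__ void tiledAccumulate(const float* __restrict__ gA,
                                    const float* __restrict__ gB,
                                    float* __restrict__ out)
    {
        __shared__ float sA[TILE];
        __shared__ float sB[TILE];

        int idx = blockIdx.x * TILE + threadIdx.x;

        float rA = gA[idx];          // LDG: global memory -> thread register
        float rB = gB[idx];          // LDG: a second register is tied up as well
        sA[threadIdx.x] = rA;        // STS: register -> shared memory
        sB[threadIdx.x] = rB;        // STS
        __syncthreads();             // barrier before the tile is reused

        float acc = 0.0f;
        for (int k = 0; k < TILE; ++k)
            acc = fmaf(sA[k], sB[(threadIdx.x + k) % TILE], acc);   // LDS feeds FFMA

        out[idx] = acc;
    }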
As the code illustrates, the LDG instructions load data from global memory into registers, and the STS instructions store data from the registers into shared memory. Following the LDG and STS instructions, load from shared memory (LDS) instructions are executed to load data into registers for performing a fused multiply-add (FFMA). Matrix multiplication is a key computation in many scientific applications, especially in deep learning. For more information on efficient matrix multiplication on GPUs, see, for example, "CUTLASS: Fast Linear Algebra in CUDA C++," December 5, 2017 (updated May 21, 2018), https://devblogs.nvidia.com/cutlass-linear-algebra-cuda/, which is incorporated herein by reference. The present technique is not limited to FFMA instructions and may be implemented using other instructions that use data loaded into shared memory.
To optimize loading of data into shared memory, the illustrative non-limiting embodiments do not require staging the data through the SM register file. In an exemplary, non-limiting embodiment, this is implemented as a single LDGSTS instruction.
FIGS. 2A and 2B illustrate loading data into shared memory 574 without first loading the data into registers 520 during the data transfer. The data is transferred using a fused instruction that loads from global memory 511 and stores the data into shared memory 574 via cache memory 572. In this embodiment, a normal load is performed using the global address, and the incoming fill data sectors are written back to the data cache. The data is then fetched from the cache data line, aligned by the read data crossbar, and written back to the shared memory destination location. A write crossbar (which typically aligns store data) may be used to align the write addresses. From an energy efficiency perspective, this instruction eliminates the register file read and write energy compared to separate LDG and STS instructions. Furthermore, from a performance perspective, the instruction reduces register pressure by not having to reserve registers for long-latency loads. Since registers are no longer needed to hold transient load values, they can be used for other tasks. The only registers required for the fused memory load and store operation may be those used for address computation in the SM.
As will be discussed in more detail below, the exemplary non-limiting embodiment allows for the tracking of the completion of instructions in an asynchronous copy/DMA manner, independent of other memory operations.
Exemplary code to bypass the registers, using the fused LDGSTS instruction via the L1 cache, may include:
(The code listing appears in the original publication only as an image.)
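Because the LDGSTS machine instruction is not written directly in CUDA C++ source, the sketch below uses the inline PTX cp.async.ca instruction, which copies global data into shared memory through the L1 cache without a staging register, as an assumed analogue of this register-bypassing path; the names, sizes, and sm_80 target are illustrative assumptions, not the patent's listing.

    #include <cstdint>

    #define TILE 32   // one 16-byte element per thread; requires -arch=sm_80 or newer

    __global__ void copyThroughL1(const float4* __restrict__ gSrc,
                                  float* __restrict__ out)
    {
        __shared__ float4 sTile[TILE];

        // Convert the generic shared-memory pointer to a 32-bit shared-space address.
        uint32_t sAddr = static_cast<uint32_t>(
            __cvta_generic_to_shared(&sTile[threadIdx.x]));
        const float4* gPtr = &gSrc[blockIdx.x * TILE + threadIdx.x];

        // 16-byte asynchronous copy: global -> (L1) -> shared; no thread register holds the data.
        asm volatile("cp.async.ca.shared.global [%0], [%1], 16;\n"
                     :: "r"(sAddr), "l"(gPtr));
        asm volatile("cp.async.commit_group;\n" ::);
        asm volatile("cp.async.wait_group 0;\n" ::);
        __syncthreads();

        out[blockIdx.x * TILE + threadIdx.x] = sTile[threadIdx.x].x;  // consume from shared memory
    }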
to further optimize loading data into shared memory, the exemplary non-limiting embodiment does not require register files through the SM and L1 cache staging data. Fig. 3A and 3B illustrate loading data into shared memory 574 without first loading the data into register file 520 and cache memory 572 during a data transfer. This instruction has a single ldgs.
In this exemplary non-limiting embodiment, energy may be further reduced by writing the incoming fill data directly back to the final shared memory destination. This also eliminates the L1 data RAM read and write energy present in the embodiment shown in FIGS. 2A and 2B. From a performance perspective, this example embodiment also translates into a reduction in shared memory crossbar and L1 data bandwidth requirements. In dense math kernels such as GEneral Matrix-Matrix multiplication (GEMM), the L1 data RAM can become a bottleneck when using warp-wide (thread bundle wide) Matrix Multiply Accumulate (MMA) instructions. Because the data path is implemented without staging the data in registers or L1, a full-blown data crossbar from the read port (which, as an artifact of the example embodiment, cannot stall) may be utilized. In FIG. 2A, the interface between global memory 511 and cache memory 572 includes a read port. In FIG. 3B, the interface between the fill steering unit (Fill Steer) and L1 includes a read port.
Example code that bypasses the L1 cache and registers using the fused LDGSTS.BYPASS instruction may include:
(The code listing appears in the original publication only as an image.)
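Again as a substitute for the image, a hedged sketch of the fully bypassing path changes only the cache-control suffix, using cp.async.cg, which skips the L1 cache and fills shared memory directly from L2/global memory (an assumed analogue of LDGSTS.BYPASS; the .cg form is restricted to 16-byte copies).

    #include <cstdint>

    #define TILE 32   // requires -arch=sm_80 or newer

    __global__ void copyBypassL1(const float4* __restrict__ gSrc,
                                 float* __restrict__ out)
    {
        __shared__ float4 sTile[TILE];

        uint32_t sAddr = static_cast<uint32_t>(
            __cvta_generic_to_shared(&sTile[threadIdx.x]));
        const float4* gPtr = &gSrc[blockIdx.x * TILE + threadIdx.x];

        // .cg ("cache global") keeps the line out of L1: global/L2 -> shared directly.
        asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
                     :: "r"(sAddr), "l"(gPtr));
        asm volatile("cp.async.commit_group;\n" ::);
        asm volatile("cp.async.wait_group 0;\n" ::);
        __syncthreads();

        out[blockIdx.x * TILE + threadIdx.x] = sTile[threadIdx.x].x;
    }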
In some examples, a processing system (such as a GPU) may include decode logic to receive and decode instructions. The decoded instructions may be used to control how the hardware loads and stores data. The GPU may decode a fused load/store instruction format that specifies whether the GPU should (a) bypass the processor registers or (b) bypass both the processor registers and the cache memory. Based on the decoding, the requested data may be loaded and stored by the hardware. In one example, in response to the decoding, the GPU may retrieve a data word from a first memory (e.g., global memory) and store the retrieved data word into a second memory (e.g., shared memory) without first storing the retrieved data in processor registers and/or cache memory (as selected by the decoded fused load/store instruction format specification). The second memory may be shared by multiple threads (e.g., multiple threads in a thread block) and/or provided in a common on-chip physical memory address space along with the cache memory.
The fused load/store instruction format may include a source address (e.g., a global memory address), a destination address (e.g., an address in the second memory), and a data transfer path. In one example, the fused load/store instruction may include (i) a source word address, (ii) a destination word address, and (iii) at least one bit specifying whether the GPU should (a) bypass a processor register or (b) bypass both the processor register and a cache memory to retrieve a data word from the source word address and store the retrieved data word into the destination word address, and the decoding includes decoding the at least one bit.
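Purely as an illustration of the instruction format described above, the following sketch models the fields and the path-selecting decode in plain host code; the field names, widths, and layout are hypothetical and are not taken from the patent.

    #include <cstdint>

    // Hypothetical, simplified model of a fused load/store instruction.
    struct FusedLoadStore {
        uint64_t srcGlobalAddr;    // (i) source word address in global memory
        uint32_t dstSharedAddr;    // (ii) destination word address in shared memory
        uint8_t  bypassRegisters;  // 1 = do not stage the data in processor registers
        uint8_t  bypassL1Cache;    // 1 = also skip the L1 / tagged data cache
    };

    enum class DataPath { ViaCacheAndRegisters, ViaCacheOnly, DirectToShared };

    // Decode step: select a data transfer path from the bypass bits.
    DataPath decodePath(const FusedLoadStore& insn)
    {
        if (insn.bypassRegisters && insn.bypassL1Cache) return DataPath::DirectToShared;
        if (insn.bypassRegisters)                       return DataPath::ViaCacheOnly;
        return DataPath::ViaCacheAndRegisters;
    }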
Tag and shared memory control system
Fig. 4 illustrates an example block diagram of a memory control system that may be indicated by instructions disclosed herein (e.g., load, store, and/or fused load and store instructions). Examples of the present technique are not limited to the particular memory control system shown in FIG. 4, but may be applied to other memory control systems. The memory control system shown in FIG. 4 includes address logic 22, a tag pipeline 24, a data tag queue 26, a data memory store 30, an interconnect network 580, and a register file 520. One or more of the operations described with reference to fig. 4 may be implemented by one or more load/store units (LSUs) 554 (shown in fig. 10A). The dataram storage 30 can include a shared dataram storage 32 and an L1 dataram storage 34, which are provided in a configurable unified cache architecture.
The address logic 22 is configured to receive memory transactions from execution threads, generate addresses for servicing the transactions, and route memory transactions based on the type of memory transaction. In some examples, the address logic 22 may include an Address Generation Unit (AGU). The AGU may be configured to compute a global address for accessing the global memory and a local address for accessing the shared memory.
The memory transactions received by address logic 22 may include load transactions, store transactions, and fused load/store transactions. Load transactions may include loads from shared memory and loads from the global memory space. The data returned in response to a load transaction may be provided to the register file associated with the requesting thread.
The store transaction may include a store to shared memory and a store to global memory. The data in the registers associated with a particular functional unit may be stored in shared memory and/or global memory.
Depending on the type of memory instruction and the system architecture, the data for load and/or store instructions may take a particular path. For example, when address logic 22 receives an instruction to load data from shared memory into a register associated with a particular thread, address logic 22 calculates a shared memory address and routes the instruction directly to shared data memory 32 along data path 22a, thereby providing a low-latency memory transaction. Whereas accessing external memory may take hundreds or thousands of cycles, accessing shared memory 32 may take on the order of a single cycle for the requesting thread. Data requested from the shared data memory 32 is provided to the register file 520 associated with the requesting thread via the interconnection network 580.
As previously discussed, when data is loaded into shared memory using conventional load from global memory instructions, the data travels from global memory to the L1 cache, from the L1 cache to registers, and from registers to shared memory. The instructions disclosed herein provide the option to load to shared memory instructions to bypass registers and/or the L1 cache because the requested data is stored from global memory into shared memory.
When address logic 22 receives an instruction (e.g., an LDGSTS.ACCESS instruction) that loads data from global memory into shared memory while bypassing the registers, the instruction is broken into two separate sub-instructions by an Address Generation Unit (AGU), which computes the source global address and the destination shared memory address. The source global address is used for the load operation and the destination shared memory address is used for the store to shared memory operation. The LDG sub-instruction may be treated as a normal load in the tag stage, performing a 32-address coalesce across four separate tag requests. The STS sub-instruction may be treated as a pseudo-instruction that is delivered down the pipeline carrying only a shared memory address.
The address logic 22 routes the load instruction along the data path 22b to the tag pipeline 24. The tag pipeline 24 determines whether the requested data is available in the L1 dataram 34 based on the tag information. When a hit (hit) occurs because the requested data is available in the L1 dataram 34, the tag pipeline 24 routes the memory transaction to the L1 dataram 34 via the data path 24 a. Data is pulled from a cache data line in L1 dataram 34, aligned through interconnect network 580 (e.g., a write crossbar), and then written back to the destination shared memory address in shared dataram 32. This data path differs from conventional load instructions in which data requested in L1 dataram 34 is first stored to register file 520 via interconnect network 580 and then moved from register file 520 to shared dataram 32. In the event of a hit, the storage path (e.g., including a side collector) is used to synthesize an atomic operation similar to read/modify/write, which reads from the tagged line and stores to the shared memory line in the data cache.
When a miss occurs because the requested data is not available in the L1 dataram 34, the tag pipeline 24 allocates a tag and requests the missed data from the global memory interface via data path 24b. The global memory interface may include hardware (e.g., Direct Memory Access (DMA) I/O) for retrieving data from other caches and main memory. The memory transaction is also pushed onto queue 26, which may include a tag-to-data (t2d) first-in-first-out (FIFO) queue. Store transactions in queue 26 stall until the missed data is returned to the L1 dataram 34 and the associated store transaction reaches the top of queue 26.
The tag pipeline 24 may include a merger 40, a tag memory 42, and a tag processor 44 to service memory transactions received from the address logic 22. The merger 40 may reorder and/or merge individual write requests (e.g., sector requests) into a large cache line request.
The tag processor 44 determines whether a tag for the given memory transaction exists in the tag memory 42. If the tag processor 44 determines that a tag is present and the data corresponding to the tag is present in the L1 dataram 34, a cache hit occurs and the memory transaction is routed to the L1 dataram 34. If the tag processor 44 determines that no tag is present, the tag processor 44 allocates a tag in the tag memory 42 and an associated location in the L1 dataram 34 into which the data will be placed when it is returned from global memory.
When address logic 22 receives an instruction (e.g., an LDGSTS.BYPASS instruction) that loads data from global memory into shared memory while bypassing both the registers and the L1 cache, the instruction is divided into two separate sub-instructions by the AGU, which computes the source global address and the destination shared memory address. The source global address is used for the load operation and the destination shared memory address is used for the store to shared memory operation. The global address and the shared memory address used by the instruction may be specified in registers of the requesting thread.
The address logic 22 routes the load instruction along data path 22b to the tag pipeline 24. The tag pipeline 24 determines, based on the tag information, whether the requested data is stored in the L1 dataram 34. When a hit occurs because the requested data is available in the L1 dataram 34, the tag pipeline 24 routes the memory transaction to the L1 dataram 34 via data path 24a. Data is pulled from the cache data line of L1 dataram 34, aligned through interconnect network 580 (e.g., a write crossbar), and written back to the destination shared memory address in shared data memory 32, and the tag for that data is invalidated. The tag may be invalidated because the data is now stored in shared memory and the application is not expected to request the same data from global memory again in the short term.
In the event of a miss, the tag pipeline 24 allocates a tag and requests the missed data from the global memory interface via data path 24b. The memory transaction is also pushed onto queue 26. The memory transaction stalls until the missed data is returned. When the data is returned via data path 511b, the fill data is written directly back to the final shared memory destination in shared data memory 32 without storing the data in L1 dataram 34, and the tag is invalidated. The data is written directly from the read port to the shared data memory 32 via the interconnect network 580 (see the read port discussed with respect to FIGS. 2A and 2B).
A direct global memory to shared memory path (one that does not pass through the cache or registers) may not maintain line-level coherency, potentially leaving stale values in the cache that, in some implementations, are inconsistent with the values visible in shared memory. To avoid such inconsistencies, the example embodiments described above provide an implementation in which the system probes and invalidates (snoops) L1 lines.
In some embodiments, the interconnection network 580 may include more than one crossbar. For example, in one embodiment, interconnect network 580 includes a crossbar for general loads and stores, and a separate crossbar for writing data received from the memory interface via data path 511b to the shared data memory 32. In this embodiment, a separate crossbar may be used because the crossbar functionality needed for this path is simpler than what is generally required for loads and stores. The data path in this example may include data path 511b from the memory interface, through the separate fill crossbar, to the dataram storage 30. This allows fills from memory to proceed simultaneously with reads from the shared dataram 32 or the L1 dataram 34.
Loading data from shared memory provides flexibility in the granularity of the returned data. For example, if 32 words are read from shared memory, the 32 words may be returned from 32 different rows if the requested words are stored in different banks of the shared memory. In L1, the 32 words would come from the same cache line. Even though the physical structure of L1 is similar to shared memory, this level of granularity cannot be achieved using L1 because L1 is limited in the number of tags that can be used (e.g., four tags). L1 accesses may be limited to at most a certain number of different non-conflicting lines, such as the number of tag banks. In the example of an L1 using four tag banks, up to four cache lines may be accessed.
To increase bandwidth in the shared data memory 32, the on-chip memory is organized into multiple banks. Fig. 5 illustrates an example of shared memory, which includes a plurality of memory banks coupled to an interconnection network 580. Since shared memory is close to the processing core, shared memory provides higher bandwidth and lower latency than off-chip memory. However, how data is distributed among multiple shared memory banks affects bandwidth.
As shown in FIG. 5, the shared memory is divided into a plurality of banks (e.g., 32 banks, each with a bandwidth of 32 or 64 bits per clock cycle), each of which includes a read port and a write port, so that writes and reads can be processed simultaneously across the banks. A memory request composed of N addresses that fall into N different banks can be processed simultaneously. For example, if a read request is made for data at A0, A1, and A3, the interconnection network 580 can read the data from bank 0, bank 1, and bank 2 simultaneously via the respective read ports. If multiple addresses of a memory request fall within the same memory bank, the individual address requests are serialized. For example, if read requests are made for data at B0, A1, and C0, there is a bank conflict, and the read requests for B0, A1, and C0 would be serialized into three reads.
Data may be stored in the shared memory banks such that consecutive words of a returned cache line map to consecutive banks. FIG. 6 illustrates an example representation of how data from global memory may be laid out in shared memory to reduce conflicting requests. The top diagram shows how the data is stored in global memory, and the bottom diagram shows how the data is stored in shared memory after the fused LDGSTS instruction executes. Each row in the upper diagram corresponds to data as it would be laid out in a cache line (in the L1 cache or in off-chip memory). Each thread would access global memory in this linear manner. A block A in global memory may represent a block 8 two-byte elements wide and 8 elements high. The blocks of A may represent sub-blocks of an overall matrix.
When the threads access a cache line, the data is rearranged into "swizzled" shared memory addresses. For example, a row is retrieved as it is laid out in global memory. Each element in the row is returned and stored into a different bank of shared memory, according to the shared memory address specified by each thread, which is encoded as a token that is sent with the request and returned with the data.
As shown in FIG. 6, the rows of A are distributed across the width of the shared memory. This allows any one sub-block within the same block to be accessed in a conflict-free manner. If all of the data from block A were stacked vertically in shared memory, it could only be read serially. By scattering the data, all of the A sub-blocks can be accessed without conflict. FIG. 6 also shows how the contiguous elements of block C are distributed to different banks of the shared memory.
With the data distributed across the banks, when 32 words are read from a shared memory comprising 32 banks, the 32 words can be read from different rows. A similar request to the L1 cache would read from the same line. As discussed above, in some implementations, L1 accesses are limited to at most a certain number of different non-conflicting lines, such as the number of tag banks.
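An XOR-based index transform is one common way to express the kind of conflict-free "swizzled" placement sketched in FIG. 6; the following CUDA sketch is a generic illustration under assumed 32x32 tile dimensions, not the patent's exact fill pattern or encoding.

    // Store element (row, col) at a column permuted by the row index so that
    // both row-wise and column-wise (transposed) accesses of the tile fall in
    // distinct shared memory banks.
    __device__ __forceinline__ int swizzledIndex(int row, int col)
    {
        return row * 32 + (col ^ (row & 31));
    }

    __global__ void loadSwizzled(const float* __restrict__ g, float* __restrict__ out)
    {
        __shared__ float tile[32 * 32];
        int row = threadIdx.y, col = threadIdx.x;    // launch with blockDim = (32, 32)

        tile[swizzledIndex(row, col)] = g[row * 32 + col];   // swizzled store
        __syncthreads();

        // Transposed read: within a warp the bank index (row ^ col) differs per
        // thread, so the column access proceeds without serialization.
        out[row * 32 + col] = tile[swizzledIndex(col, row)];
    }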
Fig. 7 illustrates an exemplary sector pattern applicable to shared memory (including four sectors). The merge buffer may be configured to determine a pattern between the fill sector and the destination shared memory address. The detected patterns are optimized for common patterns observed in existing workloads.
The merge buffer may be configured to detect and apply a permutation mode and/or intra-sector swizzling when processing memory transactions. In one example, a 128B global cache line is further arranged into four 32B sectors. When a global cache line sector (e.g., sector 0) is returned in response to the LDGSTS instruction, the cache line sector may be stored directly into sector 0 of the shared memory via the interconnection network, or rotated and stored into sector 1, 2, or 3 of the shared memory.
In addition, the granularity at which the returned global cache line sectors are stored in shared memory may be controlled. For example, the instructions may have an option to swap the halves of a returned global cache line sector (e.g., store the low 16B in the high portion of the shared memory sector and the high 16B in the low portion). Further, the instructions may write only the low 16B or only the high 16B into a shared memory sector.
When multiple threads request data, the merge buffer looks at the addresses of the requested data and determines whether the requested data belongs to the same cache line. If it is determined that the addresses belong to the same cache line, the requests may be merged and a sector pattern set so that, when the data is returned, it will be stored into different banks of the shared memory. As an example, a first sector is written to the upper portion of a first memory bank, and a shift and rotation is applied so that the next sector is written to the lower portion of a second memory bank (see, e.g., FIG. 6). If no pattern can be recognized, the hardware may issue the data requests individually.
The interconnection network 580 may include one or more crossbars, including read crossbars and/or write crossbars. The interconnection network 580 supports multiple parallel read and write operations and supports access to data stored in the shared data memory 32 using direct addressing and access to data stored in the L1data memory 34 using tag lookup. Interconnect network 580 may support multiple simultaneous reads and writes corresponding to the number of banks in shared memory, L1 cache, and/or registers. In one example, the interconnection network 580 may support multiple simultaneous reads and writes equal in number to the number of memory banks in the data memory store 30 (e.g., unified memory).
Partitioning memory requests
When the data requested by a thread exceeds a predetermined size, the memory request may be split into separate requests to provide a certain level of granularity for each request. For example, when the data requested by a thread is greater than a predetermined size (e.g., 16 or 32 bytes), the thread issuing the load/store request may fork into one or more helper threads, each independently issuing a 16-byte or 32-byte memory request. Each memory request is broken down into independently issued cache line requests. The threads may be joined after completion of each request.
Padding byte zeroing
LDGSTS may support an option (.ZFILL) to fill the shared memory (SMEM) target with zero bytes. This is useful if the footprint of a thread reading from global memory crosses a maximum boundary condition. It is not that the global address or the data itself is invalid; rather, the data beyond the boundary is simply not applicable to the current work of the given thread, so zero fill is acceptable. The advantage is that the kernel can use a consistent access size (e.g., .sz: {.32, .64, .128}) without needing custom code or branching to handle differently sized loads.
A number of alternative methods could be used to specify the byte zeroing, e.g., as an explicit mask or a byte count in instruction parameters provided by the thread. Instead, the .ZFILL option embeds the count in the global address LSBs, indicating the number of bytes at the end of the stored region to zero. This keeps parameter movement and bandwidth the same as for non-.ZFILL operations, achieving minimal design disruption and the same issue performance.
When .ZFILL is specified, the global address alignment check is suspended and alignment is forced to the .sz boundary. This frees up the least-significant log2(.sz) bits of the address and allows the global address LSBs to be interpreted as a count of the number of bytes at the end of the stored region to zero. When .ZFILL is specified in an instruction, the .ZFILL parameter value may be specified per thread and may independently affect the region each thread stores. The global address is defined as the final sum of (RF + UR + Imm).
Given the access size and the byte count to zero that is embedded in the global address LSBs, a mask may be generated indicating which bytes retain the fill data (mask bit 1) and which are zeroed (mask bit 0). If .ZFILL is not used, the mask is implied to be all 1s.
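As a worked illustration of the mask generation just described, the helper below derives the byte mask from the access size and the zero count; it is a hypothetical software utility written for clarity, not the patent's hardware logic, and assumes .sz access sizes of 4, 8, or 16 bytes.

    #include <cstdint>

    // accessSize: bytes per access (.32 -> 4, .64 -> 8, .128 -> 16).
    // zeroCount:  byte count embedded in the global address LSBs; bytes at the
    //             end of the stored region are zeroed.
    // Returns a mask with bit i = 1 if byte i keeps the fill data, 0 if zeroed.
    __host__ __device__ inline uint32_t zfillByteMask(uint32_t accessSize,
                                                      uint32_t zeroCount)
    {
        uint32_t kept = accessSize - zeroCount;   // leading bytes that keep data
        return (1u << kept) - 1u;                 // all 1s when zeroCount == 0
    }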
Enabling byte writes for storage
A feature complementary to .ZFILL may be provided for global stores, to allow actual store bytes to be skipped (.WSKIP) from the data sent per thread. The same technique can be used as for .ZFILL: a byte count is embedded in the global address LSBs, where the count indicates the number of bytes at the end of the .sz access from a given thread that are skipped (not written). This also allows the kernel to use a uniform .sz value (e.g., STG.128.WSKIP) to handle maximum boundary conditions, rather than staging stores of different .sz.
Programming model and software exposure
Simple LDG/STS/barrier mode: A common and relatively simple mode is for threads to cooperatively load global data, immediately store it in shared memory, and consume it using barrier synchronization. Although memory latency is exposed here, applications rely on the occupancy of multiple cooperative thread arrays (CTAs) in the SM to hide the exposed latency. This is a relatively inefficient use of resources because all resources, such as threads, register files, and shared memory, are duplicated. However, the compiler may fuse the LDG and STS operations. The compiler has a mechanism to make barrier synchronization dependent on STS completion. A scoreboard release capability is provided for the LDGSTS instruction so that the barrier instruction may depend on a scoreboard register.
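A minimal CUDA sketch of this cooperative load-then-barrier pattern is shown below, using the public __pipeline asynchronous-copy primitives as an assumed stand-in for a compiler-fused LDGSTS whose completion the barrier depends on; the tile size and names are illustrative.

    #include <cuda_pipeline.h>

    #define TILE 128

    __global__ void loadThenConsume(const float* __restrict__ g, float* __restrict__ out)
    {
        __shared__ float sTile[TILE];

        // Cooperative copy: each thread moves one word global -> shared without
        // staging it in its own registers.
        __pipeline_memcpy_async(&sTile[threadIdx.x],
                                &g[blockIdx.x * TILE + threadIdx.x],
                                sizeof(float));
        __pipeline_commit();        // close the batch (LDGDEPBAR-like fence)
        __pipeline_wait_prior(0);   // wait until all committed batches complete
        __syncthreads();            // make the tile visible block-wide

        out[blockIdx.x * TILE + threadIdx.x] = sTile[(threadIdx.x + 1) % TILE];
    }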
Batch loading mode: An advanced usage mode is one where a thread issues multiple LDG operations followed by multiple STS operations before barrier synchronization is performed. The fused LDGSTS operations may complete out of order with respect to one another. To provide correct dependency tracking for the barrier operations, a new instruction named LDGDEPBAR may be added that places a fence on the completion of all previous LDGSTS operations.
Software-pipelined loading: a more efficient way to hide memory latency is to use software pipelining and loop unrolling to prefetch data several iterations ahead. This is accomplished by allocating sets of registers for multiple load batches and explicitly delaying the corresponding STS operations. This technique may be difficult for a compiler to target, especially when it involves hoisting memory operations above a barrier. The combination of LDGSTS and LDGDEPBAR makes this process possible.
The unrolling process becomes easier because the batches are allocated and managed without named registers. The LDGDEPBAR instruction is defined to have ordered semantics with respect to previous LDGDEPBAR instructions. This borrows from techniques for texture batch management using DEPBAR dependency tracking. A sketch of such a pipelined loop appears below.
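A hedged sketch of such a software-pipelined loop, expressed with the same CUDA pipeline primitives (assuming CUDA 11+ and sm_80+; the single-block layout and TILE size are illustrative simplifications):

#include <cuda_pipeline.h>

#define TILE 128

// Two-stage pipeline: while the current tile is consumed from shared memory,
// the next tile is already in flight from global memory.
__global__ void pipelined(const float* __restrict__ gmem, float* out, int num_tiles)
{
    __shared__ float buf[2][TILE];
    int t = threadIdx.x;

    // Prime the pipeline with the first tile.
    __pipeline_memcpy_async(&buf[0][t], &gmem[t], sizeof(float));
    __pipeline_commit();

    float acc = 0.0f;
    for (int i = 0; i < num_tiles; ++i) {
        int cur = i & 1;
        int nxt = (i + 1) & 1;

        if (i + 1 < num_tiles) {
            // Issue the next batch before waiting on the current one.
            __pipeline_memcpy_async(&buf[nxt][t], &gmem[(i + 1) * TILE + t], sizeof(float));
            __pipeline_commit();
            __pipeline_wait_prior(1);   // everything except the batch just issued is complete
        } else {
            __pipeline_wait_prior(0);   // last tile: drain all outstanding copies
        }
        __syncthreads();                // guard for the general case of cross-thread reads

        acc += buf[cur][t];
        __syncthreads();
    }
    out[t] = acc;
}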
This process also enables more powerful asynchronous DMA transfer primitives and shared-memory-based synchronization objects. In this case, in the example shown, the barrier state object is implemented in shared memory rather than in a dedicated counter.
A thread may update this barrier state using a custom atomic operation (atom).
Similar to LDGDEPBAR, another fence operation, "ARRIVES.LDGSTSBAR.64", is defined that updates the shared memory barrier object after a batch of LDGSTS operations completes.
These constructs allow LDGSTS to be exposed as a thread-cooperative asynchronous DMA operation whose completion can be determined by querying a shared memory barrier state object [see U.S. provisional patent application serial No. 62/927,511 filed on 29/10/2019 and U.S. patent application serial No. 16/712,236 filed on 12/2019, which are incorporated herein by reference]. This allows application programmers to take advantage of asynchronous DMA transfers while being easily supported by the driver/compiler. A sketch of this barrier-based form appears below.
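The following is a minimal sketch of the barrier-object form as it is exposed through libcu++ (cuda::barrier together with the cooperative cuda::memcpy_async overload; CUDA 11+ and an sm_80-class device are assumed, and TILE is illustrative):

#include <cuda/barrier>
#include <cooperative_groups.h>

#define TILE 128

__global__ void async_dma(const float* __restrict__ gmem, float* out)
{
    namespace cg = cooperative_groups;
    auto block = cg::this_thread_block();

    __shared__ float tile[TILE];
    // Barrier state object lives in shared memory, not in a dedicated counter.
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    if (block.thread_rank() == 0) init(&bar, block.size());
    block.sync();

    // Cooperative asynchronous copy; completion is tracked by the barrier object.
    cuda::memcpy_async(block, tile, gmem + blockIdx.x * TILE,
                       sizeof(float) * TILE, bar);

    bar.arrive_and_wait();   // join: wait for the copy batch to complete

    out[blockIdx.x * TILE + block.thread_rank()] = tile[block.thread_rank()];
}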
When the LDGSTS instruction is issued, the executing thread may fork off a logical helper thread, which in turn copies data from global memory to shared memory (without using SM resources such as thread bundle (warp) slots and register file). However, helper threads compete for memory bandwidth with other threads. The executing thread then observes completion of the helper thread through a join operation, which takes one of three forms:
1. ARRIVES.LDGSTSBAR with an Arrive-Wait Barrier
2. A batch-count scoreboard implementation using LDGDEPBAR and DEPBAR.LE
3. A write scoreboard (&req) that is updated by the LDGSTS instruction
Order of instruction issue
All instructions prior to the LDGSTS may be issued before the helper thread is spawned. There is no guaranteed order of execution between a) the helper thread and b) instructions in the executing thread that fall between the LDGSTS instruction and the join instruction. The helper thread is guaranteed to complete before the executing thread issues the instructions that follow the join operation in program order.
A BAR instruction in the executing thread (whether before the LDGSTS instruction, between the LDGSTS instruction and the join operation, or after the join operation) does not guarantee any additional execution ordering with respect to the helper thread. In other words, helper threads (which do not belong to the CTA of the executing thread) do not participate in CTA-wide synchronization barriers such as bar.sync and bar.syncall.
Example non-limiting parallel processing GPU architecture for performing the above-described operations and processing
An exemplary illustrative architecture will now be described on which the load-to-shared-memory instructions disclosed in this application can be implemented. The following information is set forth for illustrative purposes and should not be construed as limiting in any way. Any of the following features may optionally be incorporated with, or without excluding, other features described.
FIG. 8 illustrates a Parallel Processing Unit (PPU)300 according to one embodiment. In one embodiment, the PPU 300 is a multi-threaded processor implemented on one or more integrated circuit devices. The PPU 300 is a latency hiding architecture designed for parallel processing of many threads. A thread (i.e., an execution thread) is an instance of a set of instructions configured to be executed by the PPU 300. In one embodiment, PPU 300 is a Graphics Processing Unit (GPU) configured to implement a graphics rendering pipeline for processing three-dimensional (3D) graphics data in order to generate two-dimensional (2D) image data for display on a display device, such as a Liquid Crystal Display (LCD) device. In other embodiments, the PPU 300 may be used to perform general purpose computations.
One or more PPUs 300 may be configured to accelerate thousands of High Performance Computing (HPC), data centers, and machine learning applications. The PPU 300 may be configured to accelerate a wide variety of deep learning systems and applications, including auto-driven automotive platforms, deep learning, high-precision speech, image and text recognition systems, intelligent video analysis, molecular simulation, drug development, disease diagnosis, weather forecasting, big data analysis, astronomy, molecular dynamics simulation, financial modeling, robotics, factory automation, real-time language translation, online search optimization, and personalized user recommendations, among others.
As shown in FIG. 8, PPU 300 includes input/output (I/O) unit 305, front end unit 315, scheduler unit 320, work distribution unit 325, hub 330, crossbar (Xbar)370, one or more general purpose processing clusters (GPCs) 350, and one or more partition units 380. PPUs 300 may be connected to host processors or other PPUs 300 via one or more high-speed nvlinks 310 interconnects. PPU 300 may be connected to a host processor or other peripheral device via interconnect 302. The PPU 300 may also be connected to local memory including a plurality of memory devices 304. In one embodiment, the local memory may include a plurality of Dynamic Random Access Memory (DRAM) devices. DRAM devices may be configured as High Bandwidth Memory (HBM) subsystems, with multiple DRAM dies (die) stacked within each device.
The NVLink 310 interconnect enables the system to scale to include one or more PPUs 300 combined with one or more CPUs, and supports cache coherency between the PPUs 300 and CPUs, as well as CPU mastering. Data and/or commands may be sent by NVLink 310 through hub 330 to and from other units of PPU 300, such as one or more copy engines, video encoders, video decoders, power management units, etc. (not explicitly shown). NVLink 310 is described in more detail in conjunction with FIG. 10B.
The I/O unit 305 is configured to send and receive communications (e.g., commands, data, etc.) from a host processor (not shown) over the interconnect 302. The I/O unit 305 may communicate with the host processor directly via the interconnect 302, or through one or more intermediate devices (such as a memory bridge). In one embodiment, I/O unit 305 may communicate with one or more other processors (e.g., one or more PPUs 300) via interconnect 302. In one embodiment, I/O unit 305 implements a peripheral component interconnect express (PCIe) interface for communicating over a PCIe bus, and interconnect 302 is a PCIe bus. In alternative embodiments, the I/O unit 305 may implement other types of known interfaces for communicating with external devices.
The I/O unit 305 decodes data packets received via the interconnect 302. In one embodiment, the data packets represent commands configured to cause the PPU 300 to perform various operations. The I/O unit 305 transmits the decoded command to various other units of the PPU 300 as specified by the command. For example, some commands may be sent to the front end unit 315. Other commands may be sent to hub 330 or other units of PPU 300, such as one or more replication engines, video encoders, video decoders, power management units, and the like (not expressly shown). In other words, the I/O unit 305 is configured to route communications between and among the various logic units of the PPU 300.
In one embodiment, a program executed by a host processor encodes a stream of commands in a buffer that provides the workload to the PPU 300 for processing. The workload may include a number of instructions and data to be processed by those instructions. A buffer is an area of memory that is accessible (e.g., read/write) by both the host processor and the PPU 300. For example, I/O unit 305 may be configured to access buffers in system memory connected to interconnect 302 via memory requests transmitted over interconnect 302. In one embodiment, the host processor writes the command stream to a buffer and then sends a pointer to the beginning of the command stream to the PPU 300. The front end unit 315 receives pointers to one or more command streams. The front end unit 315 manages one or more streams, reads commands from the streams and forwards the commands to the various units of the PPU 300.
The front end units 315 are coupled to a scheduler unit 320, which configures various GPCs 350 to process tasks defined by one or more streams. The scheduler unit 320 is configured to track status information related to various tasks managed by the scheduler unit 320. The status may indicate which GPC 350 the task is assigned to, whether the task is active or inactive, a priority associated with the task, and so on. The scheduler unit 320 manages the execution of multiple tasks on one or more GPCs 350.
The scheduler unit 320 is coupled to a work allocation unit 325 configured to dispatch tasks for execution on the GPCs 350. The work assignment unit 325 may track several scheduled tasks received from the scheduler unit 320. In one embodiment, the work allocation unit 325 manages a pending (pending) task pool and an active task pool for each GPC 350. The pool of pending tasks may include a number of time slots (e.g., 32 time slots) that contain tasks assigned to be processed by a particular GPC 350. The active task pool may include a number of slots (e.g., 4 slots) for tasks being actively processed by the GPCs 350. When a GPC 350 completes execution of a task, the task is evicted from the active task pool of the GPC 350, and one of the other tasks from the pending task pool is selected and scheduled for execution on the GPC 350. If the active task on the GPC 350 is already idle, e.g., while waiting for a data dependency to be resolved, the active task may be evicted from the GPC 350 and returned to the pending task pool, while another task in the pending task pool is selected and scheduled for execution on the GPC 350.
The work distribution unit 325 communicates with one or more GPCs 350 via XBar 370. XBar 370 is an interconnection network that couples many of the elements of PPU 300 to other elements of PPU 300. For example, XBar 370 may be configured to couple work allocation unit 325 to a particular GPC 350. Although not explicitly shown, one or more other units of PPU 300 may also be connected to XBar 370 via hub 330.
Tasks are managed by the scheduler unit 320 and dispatched to GPCs 350 by the work distribution unit 325. GPCs 350 are configured to process tasks and generate results. The results may be consumed by other tasks within the GPC 350, routed to different GPCs 350 via the XBar 370, or stored in the memory 304. The results may be written to memory 304 via partition unit 380, partition unit 380 implementing a memory interface for reading data from memory 304 and writing data to memory 304. The results may be sent to another PPU 300 or CPU via NVLink 310. In one embodiment, the PPU 300 includes a number U of partition units 380 that is equal to the number of independent and distinct memory devices 304 coupled to the PPU 300. Partition unit 380 is described in more detail below in conjunction with FIG. 9B.
In one embodiment, the host processor executes a driver kernel that implements an Application Programming Interface (API) that enables one or more applications to execute on the host processor to schedule operations for execution on the PPU 300. In one embodiment, multiple computing applications are executed simultaneously by the PPU 300, and the PPU 300 provides isolation, quality of service (QoS), and independent address spaces for the multiple computing applications. The application may generate instructions (e.g., API calls) that cause the driver kernel to generate one or more tasks to be executed by the PPU 300. The driver kernel exports tasks to one or more streams being processed by the PPU 300. Each task may include one or more related thread groups, referred to herein as thread bundles (warp). In one embodiment, the thread bundle includes 32 related threads that may be executed in parallel. Cooperative threads may refer to multiple threads that include instructions to perform tasks and may exchange data through a shared memory. Threads and cooperative threads are described in more detail in conjunction with FIG. 10A.
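As a hedged host-side illustration of this model (standard CUDA stream and launch syntax only; the kernels, sizes, and device pointers are placeholders), an application can submit independent command streams whose work the driver forwards to the PPU:

__global__ void taskA(float* d) { d[threadIdx.x] += 1.0f; }
__global__ void taskB(float* d) { d[threadIdx.x] *= 2.0f; }

void submit(float* dA, float* dB)   // dA, dB: previously allocated device buffers
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Each stream carries its own ordered sequence of tasks; work in different
    // streams may be scheduled onto the GPCs/SMs concurrently.
    taskA<<<1, 32, 0, s0>>>(dA);    // one thread bundle (warp) of 32 related threads
    taskB<<<1, 32, 0, s1>>>(dB);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}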
FIG. 9A illustrates a GPC 350 of the PPU 300 of FIG. 8 according to one embodiment. As shown in fig. 9A, each GPC 350 includes multiple hardware units for processing tasks. In one embodiment, each GPC 350 includes a pipeline manager 410, a pre-raster operations unit (PROP)415, a raster engine 425, a work distribution crossbar (WDX)480, a Memory Management Unit (MMU)490, and one or more Data Processing Clusters (DPCs) 420. It should be understood that the GPCs 350 of fig. 9A may include other hardware units in place of or in addition to the units shown in fig. 9A.
In one embodiment, the operation of GPCs 350 is controlled by a pipeline manager 410. The pipeline manager 410 manages the configuration of one or more DPCs 420 for processing tasks assigned to the GPC 350. In one embodiment, pipeline manager 410 may configure at least one of the one or more DPCs 420 to implement at least a portion of a graphics rendering pipeline. For example, DPC 420 may be configured to execute a vertex shading program on programmable Streaming Multiprocessor (SM) 440. The pipeline manager 410 may also be configured to route data packets received from the work distribution unit 325 to the appropriate logical unit in the GPC 350. For example, some packets may be routed to fixed function hardware units in the PROP 415 and/or raster engine 425, while other packets may be routed to the DPC 420 for processing by the primitive engine 435 or SM 440. In one embodiment, the pipeline manager 410 may configure at least one of the one or more DPCs 420 to implement a neural network model and/or a compute pipeline.
The PROP unit 415 is configured to route data generated by the raster engine 425 and the DPC 420 to a Raster Operations (ROP) unit, described in more detail in connection with fig. 9B. The PROP unit 415 may also be configured to perform optimization of color mixing, organize pixel data, perform address conversion, and the like.
Graphics processing pipeline
In one embodiment, PPU 300 includes a Graphics Processing Unit (GPU). PPU 300 is configured to receive commands specifying a shading program for processing graphics data. Graphics data may be defined as a set of primitives, such as points, lines, triangles, quadrilaterals, triangle strips, and the like. Typically, a primitive includes data that specifies a plurality of vertices (e.g., in a model spatial coordinate system) of the primitive and attributes associated with each vertex of the primitive. The PPU 300 may be configured to process the primitives to generate a frame buffer (e.g., pixel data for each of the pixels of the display).
The application writes model data (e.g., a set of vertices and attributes) for the scene to a memory, such as system memory or memory 304. The model data defines each of the objects that may be visible on the display. The application then makes an API call to the driver kernel, which requests the model data to be rendered and displayed. The driver kernel reads the model data and writes commands to one or more streams to perform operations to process the model data. These commands may reference different shading programs to be implemented on the SM 440 of the PPU 300, including one or more of vertex shading, shell shading, domain shading, geometric shading, and pixel shading. For example, one or more of the SMs 440 may be configured to execute a vertex shading program that processes a plurality of vertices defined by model data. In one embodiment, different SMs 440 may be configured to execute different shading programs simultaneously. For example, a first subset of SMs 440 may be configured to execute a vertex shading program, while a second subset of SMs 440 may be configured to execute a pixel shading program. The first subset of SMs 440 processes the vertex data to produce processed vertex data and writes the processed vertex data to L2 cache 460 and/or memory 304. After the processed vertex data is rasterized (e.g., converted from three-dimensional data to two-dimensional data in screen space) to generate fragment data, a second subset of the SMs 440 performs pixel shading to generate processed fragment data, which is then mixed with other processed fragment data and written to a frame buffer in memory 304. The vertex shading program and the pixel shading program may execute concurrently, processing different data from the same scene in a pipelined manner until all model data for the scene has been rendered to the frame buffer. The contents of the frame buffer are then transferred to the display controller for display on the display device.
FIG. 11 is a conceptual diagram of a graphics processing pipeline 600 implemented by the PPU 300 of FIG. 8, according to one embodiment. Graphics processing pipeline 600 is an abstract flow diagram of processing steps implemented to generate 2D computer-generated images from 3D geometric data. It is well known that pipelined architectures can more efficiently perform long-latency operations by dividing the operation into multiple stages, with the output of each stage coupled to the input of the next successive stage. Thus, graphics processing pipeline 600 receives input data 601 that is passed from one stage to the next stage of graphics processing pipeline 600 to generate output data 602. In one embodiment, graphics processing pipeline 600 may be a graphics processing pipeline defined by a graphics API (an image placeholder, "Figure BDA0002520730860000291", appears in the original in place of the API name). Alternatively, graphics processing pipeline 600 may be implemented in the context of the functionality and architecture of the previous figure and/or any subsequent figure or figures.
As shown in FIG. 11, graphics processing pipeline 600 comprises a pipeline architecture comprising a plurality of stages. These stages include, but are not limited to, a data assembly stage 610, a vertex shading stage 620, a primitive assembly stage 630, a geometry shading stage 640, a viewport scale, culling, and clip (VSCC) stage 650, a rasterization stage 660, a fragment shading stage 670, and a raster operations stage 680. As described above, software shading algorithms that work in conjunction with such shading hardware may be optimized to reduce computation time.
In one embodiment, input data 601 includes commands that configure processing units to implement stages of graphics processing pipeline 600 and configure geometric primitives (e.g., points, lines, triangles, quadrilaterals, triangle strips, or sectors, etc.) to be processed by these stages. The output data 602 may include pixel data (i.e., color data) that is copied into a frame buffer or other type of surface data structure in memory.
The data assembly stage 610 receives input data 601 specifying vertex data for high-order surfaces, primitives, and the like. The data assembly stage 610 collects vertex data in temporary storage or queues, such as by receiving a command from a host processor that includes a pointer to a buffer in memory and reading the vertex data from the buffer. The vertex data is then passed to the vertex shading stage 620 for processing.
Vertex shading stage 620 processes vertex data by performing a set of operations (e.g., a vertex shader or program) once for each of the vertices. A vertex may, for example, be specified as a 4-coordinate vector (e.g., <x, y, z, w>) associated with one or more vertex attributes (e.g., color, texture coordinates, surface normal, etc.). Vertex shading stage 620 may manipulate various vertex attributes, such as position, color, texture coordinates, and the like. In other words, vertex shading stage 620 performs operations on the vertex coordinates or other vertex attributes associated with a vertex. These operations typically include lighting operations (e.g., modifying color attributes of a vertex) and transformation operations (e.g., modifying the coordinate space of a vertex). For example, vertices may be specified using coordinates in an object coordinate space, which are transformed by multiplying the coordinates by a matrix that converts the coordinates from the object coordinate space into world space or normalized device coordinate (NDC) space. Vertex shading stage 620 generates transformed vertex data that is passed to primitive assembly stage 630.
Primitive assembly stage 630 collects the vertices output by vertex shading stage 620 and groups the vertices into geometric primitives for processing by geometry shading stage 640. For example, primitive assembly stage 630 may be configured to group every three consecutive vertices into geometric primitives (e.g., triangles) for delivery to geometric shading stage 640. In some embodiments, a particular vertex may be reused for consecutive geometric primitives (e.g., two consecutive triangles in a triangle strip may share two vertices). Primitive assembly stage 630 passes the geometric primitives (e.g., the set of associated vertices) to geometry shading stage 640.
Geometry shading phase 640 processes geometric primitives by performing a set of operations (e.g., a geometry shader or program) on the geometric primitives. A tessellation (tessellation) operation may generate one or more geometric primitives from each geometric primitive. In other words, geometric shading stage 640 may subdivide each geometric primitive into a finer grid of two or more geometric primitives for processing by the remainder of graphics processing pipeline 600. The geometry shading stage 640 passes the geometric primitives to the viewport SCC stage 650.
In one embodiment, graphics processing pipeline 600 may operate within a streaming multiprocessor, and vertex shading stage 620, primitive assembly stage 630, geometry shading stage 640, fragment shading stage 670, and/or hardware/software associated therewith may sequentially perform the processing operations. Once the sequential processing operations are complete, in one embodiment, the viewport SCC stage 650 may utilize the data. In one embodiment, primitive data processed by one or more of the stages in graphics processing pipeline 600 may be written to a cache (e.g., an L1 cache, a vertex cache, etc.). In this case, in one embodiment, the viewport SCC stage 650 may access the data in the cache. In one embodiment, the viewport SCC stage 650 and the rasterization stage 660 are implemented as fixed function circuitry.
The viewport SCC stage 650 performs viewport scaling, culling, and clipping of geometric primitives. Each surface being rendered is associated with an abstract camera position. The camera position represents the position of a viewer viewing the scene and defines the view frustum of the object surrounding the scene. The view frustum may include a viewing plane, a back plane, and four clipping planes. Any geometric primitives that lie completely outside the view frustum may be culled (e.g., discarded) because they will not contribute to the final rendered scene. Any geometric primitives that are partially within the viewing frustum and partially outside the viewing frustum may be cropped (e.g., converted to new geometric primitives that are enclosed within the viewing frustum). Furthermore, each geometric primitive may be scaled based on the depth of the view frustum. All possible visible geometric primitives are then passed to the rasterization stage 660.
The rasterization stage 660 converts the 3D geometric primitive into 2D fragments (e.g., capable of being used for display, etc.). The rasterization stage 660 may be configured to set a set of plane equations with vertices of the geometric primitives from which various attributes may be interpolated. The rasterization stage 660 may also calculate a coverage mask for a plurality of pixels that indicates whether one or more sample positions of the pixels intercept the geometric primitive. In one embodiment, a z-test may also be performed to determine if a geometric primitive is occluded by other geometric primitives that have been rasterized. The rasterization stage 660 generates fragment data (e.g., interpolated vertex attributes associated with a particular sample position for each covered pixel) that is passed to the fragment shading stage 670.
The fragment shading phase 670 processes fragment data by performing a set of operations (e.g., a fragment shader or program) on each of the fragments. The fragment shading stage 670 may generate pixel data (e.g., color values) for the fragment, such as by performing a lighting operation or sampling a texture map using interpolated texture coordinates for the fragment. Fragment shading stage 670 generates pixel data, which is sent to raster operations stage 680.
Raster operations stage 680 may perform various operations on the pixel data, such as performing alpha testing, stencil testing, and blending the pixel data with other pixel data corresponding to other fragments associated with the pixel. When raster operations stage 680 has completed processing the pixel data (e.g., output data 602), the pixel data may be written to a render target, such as a frame buffer, color buffer, or the like. The raster engine 425 includes a number of fixed function hardware units configured to perform various raster operations. In one embodiment, the raster engine 425 includes a setup engine, a coarse raster engine, a culling engine, a clipping engine, a fine raster engine, and a tile merge engine. The setup engine receives the transformed vertices and generates plane equations associated with the geometric primitives defined by the vertices. The plane equations are sent to the coarse raster engine to generate coverage information for the primitive (e.g., an x, y coverage mask for a tile). The output of the coarse raster engine is sent to the culling engine, where fragments associated with primitives that fail the z-test are culled, and the unculled fragments are sent to the clipping engine, where fragments lying outside the view frustum are clipped. Those fragments that survive clipping and culling may be passed to the fine raster engine to generate attributes for the pixel fragments based on the plane equations generated by the setup engine. The output of the raster engine 425 includes fragments to be processed, e.g., by a fragment shader implemented in the DPC 420.
It should be appreciated that one or more additional stages may be included in graphics processing pipeline 600 in addition to or in place of one or more of the stages described above. Various implementations of the abstract graphics processing pipeline may implement different stages. Furthermore, in some embodiments, one or more of the stages described above may be excluded from the graphics processing pipeline (such as geometry shading stage 640). Other types of graphics processing pipelines are considered to be contemplated within the scope of the present disclosure. Further, any stage of graphics processing pipeline 600 may be implemented by one or more dedicated hardware units within a graphics processor (such as PPU 300). Other stages of graphics processing pipeline 600 may be implemented by programmable hardware units, such as SM 440 of PPU 300.
Graphics processing pipeline 600 may be implemented via an application program executed by a host processor (e.g., CPU). In one embodiment, the device driver may implement an Application Programming Interface (API) that defines various functions that may be used by an application to generate graphical data for display. The device driver is a software program that includes a plurality of instructions that control the operation of the PPU 300. The API provides an abstraction for programmers to generate graphics data using specialized graphics hardware (e.g., PPU 300) without requiring the programmer to use a specific instruction set of PPU 300. The application may include API calls routed to the device driver of PPU 300. The device driver interprets the API calls and performs various operations in response to the API calls. In some cases, a device driver may perform operations by executing instructions on a CPU. In other cases, the device driver may perform operations at least in part by initiating operations on the PPU 300 using an input/output interface between the CPU and the PPU 300. In one embodiment, the device driver is configured to implement graphics processing pipeline 600 using the hardware of PPU 300.
Various programs may be executed within PPU 300 to implement the various stages of graphics processing pipeline 600. For example, the device driver may launch a kernel on the PPU 300 to perform the vertex shading stage 620 on one SM 440 (or multiple SMs 440). The device driver (or the initial kernel executed by PPU 300) may also launch other kernels on PPU 300 to execute other stages of graphics processing pipeline 600, such as the geometry shading stage 640 and the fragment shading stage 670. Furthermore, some stages of graphics processing pipeline 600 may be implemented on fixed-unit hardware, such as a rasterizer or a data assembler implemented within PPU 300. It will be appreciated that results from one kernel may be processed by one or more intervening fixed function hardware units before being processed by a subsequent kernel on an SM 440.
Each DPC 420 included in the GPC 350 includes an M-pipe controller (MPC)430, a primitive engine 435, and one or more SMs 440. The MPC 430 controls the operation of the DPC 420 and routes data packets received from the pipeline manager 410 to the appropriate elements in the DPC 420. For example, packets associated with the vertices may be routed to primitive engine 435, primitive engine 435 configured to fetch the vertex attributes associated with the vertices from memory 304. Instead, packets associated with the shading program may be sent to SM 440.
SM 440 includes a programmable streaming processor configured to process tasks represented by a plurality of threads. Each SM 440 is multithreaded and configured to concurrently execute multiple threads (e.g., 32 threads) from a particular thread group. In one embodiment, SM 440 implements a SIMD (single instruction, multiple data) architecture, where each thread in a thread group (e.g., a thread bundle) is configured to process different sets of data based on the same instruction set. All threads in a thread group execute the same instruction. In another embodiment, the SM 440 implements a SIMT (single instruction, multi-threaded) architecture, where each thread in a thread group is configured to process different sets of data based on the same instruction set, but where individual threads in the thread group are allowed to diverge during execution. In one embodiment, a program counter, call stack, and execution state are maintained for each thread bundle, enabling concurrency between thread bundles and serial execution within a thread bundle when threads within a thread bundle diverge. In another embodiment, program counters, call stacks, and execution states are maintained for each individual thread, thereby achieving equal concurrency among all threads within and between thread bundles. When the execution state is maintained for each individual thread, threads executing the same instructions may be converged and executed in parallel for maximum efficiency. SM 440 is described in more detail below in conjunction with fig. 10A.
The MMU 490 provides an interface between the GPCs 350 and the partition units 380. The MMU 490 may provide translation of virtual addresses to physical addresses, memory protection, and arbitration of memory requests. In one embodiment, the MMU 490 provides one or more Translation Lookaside Buffers (TLBs) for performing translations from virtual addresses to physical addresses in memory 304.
FIG. 9B illustrates a memory partitioning unit 380 of the PPU 300 of FIG. 8, according to one embodiment. As shown in FIG. 9B, memory partition unit 380 includes a Raster Operations (ROP) unit 450, a level two (L2) cache 460, and a memory interface 470. A memory interface 470 is coupled to the memory 304. Memory interface 470 may implement a 32, 64, 128, 1024 bit data bus, etc. for high speed data transfer. In one embodiment, PPU 300 incorporates U memory interfaces 470, one memory interface 470 for each pair of partition units 380, where each pair of partition units 380 is connected to a corresponding memory device 304. For example, the PPU 300 may be connected to up to Y memory devices 304, such as a high bandwidth memory stack or a graphics double data rate version 5 synchronous dynamic random access memory or other type of persistent memory.
In one embodiment, memory interface 470 implements the HBM2 memory interface, and Y equals half of U. In one embodiment, the HBM2 memory stack is located on the same physical package as the PPU 300, providing significant power and area savings compared to conventional GDDR5 SDRAM systems. In one embodiment, each HBM2 stack includes four memory dies and Y equals 4, where the HBM2 stack includes two 128-bit channels per die, for a total of 8 channels and a data bus width of 1024 bits.
In one embodiment, memory 304 supports Single Error Correction Double Error Detection (SECDED) Error Correction Codes (ECC) to protect data. For computing applications that are sensitive to data corruption, ECC provides higher reliability. Reliability is particularly important in large cluster computing environments where the PPU 300 handles very large data sets and/or long running applications.
In one embodiment, PPU 300 implements a multi-level memory hierarchy. In one embodiment, memory partition unit 380 supports unified memory to provide a single unified virtual address space for the CPU and PPU 300 memory, enabling data sharing between virtual memory systems. In one embodiment, the frequency of accesses by the PPU 300 to memory located on other processors is tracked to ensure that a page of memory is moved to the physical memory of the PPU 300 that accesses the page more frequently. In one embodiment, NVLink310 supports an address translation service that allows PPU 300 to directly access CPU's page tables and provides full access to CPU memory by PPU 300.
In one embodiment, the copy engine transfers data between multiple PPUs 300 or between a PPU 300 and a CPU. The copy engine may generate a page fault for an address that is not mapped into a page table. The memory partition unit 380 may then service the page fault, mapping the address into the page table, after which the copy engine may perform the transfer. In conventional systems, memory is pinned (e.g., non-pageable) for multiple copy engine operations between multiple processors, which significantly reduces the available memory. With hardware page faulting, addresses can be passed to the copy engine without worrying about whether the memory pages are resident, and the copy process is transparent.
Data from memory 304 or other system memory may be retrieved by memory partition unit 380 and stored in L2 cache 460, L2 cache 460 being on-chip and shared among GPCs 350. As shown, each memory partition unit 380 includes a portion of the L2 cache 460 associated with a corresponding memory device 304. The lower-level cache may then be implemented in multiple units within the GPC 350. For example, each SM 440 may implement a level one (L1) cache. The L1 cache is a private memory dedicated to a particular SM 440. Data from L2 cache 460 may be fetched and stored in each L1 cache for processing in the functional units of SM 440. L2 cache 460 is coupled to memory interface 470 and XBR 370.
ROP unit 450 performs graphics raster operations related to pixel colors, such as color compression, pixel blending, and the like. ROP unit 450 also implements a depth test with raster engine 425, receiving the depth of sample locations associated with pixel fragments from a culling engine of raster engine 425. The sample locations associated with the fragments are tested for depth relative to the corresponding depth in the depth buffer. If the fragment passes the depth test for the sample location, ROP unit 450 updates the depth buffer and sends the results of the depth test to raster engine 425. It will be understood that the number of partition units 380 may be different than the number of GPCs 350, and thus each ROP unit 450 may be coupled to each GPC 350. ROP unit 450 tracks packets received from different GPCs 350 and determines to which GPC 350 the results generated by ROP unit 450 are routed through Xbar 370. Although ROP unit 450 is included within memory partition unit 380 in fig. 9B, ROP unit 450 may be external to memory partition unit 380 in other embodiments. For example, ROP unit 450 may reside in the GPC 350 or another unit.
FIG. 10A illustrates the streaming multiprocessor 440 of FIG. 9A according to one embodiment. As shown in fig. 10A, SM 440 includes an instruction cache 505, one or more scheduler units 510, a register file 520, one or more processing cores 550, one or more Special Function Units (SFUs) 552, one or more load/store units (LSUs) 554, an interconnection network 580, a shared memory/L1 cache 570.
As described above, the work distribution unit 325 schedules tasks to execute on GPCs 350 of PPUs 300. A task is assigned to a particular DPC 420 within the GPC 350, and if the task is associated with a shader program, the task may be assigned to the SM 440. Scheduler unit 510 receives tasks from work allocation unit 325 and manages the scheduling of instructions assigned to one or more thread blocks of SM 440. Scheduler unit 510 schedules thread blocks to execute as bundles of parallel threads, where each thread block is assigned at least one bundle. In one embodiment, 32 threads are executed per bundle. Scheduler unit 510 may manage multiple different thread blocks, assign thread bundles to different thread blocks, and then dispatch instructions from multiple different cooperative groups to various functional units (i.e., cores 550, SFUs 552, and LSUs 554) during each clock cycle.
Cooperative Groups is a programming model for organizing groups of communicating threads that allows developers to express the granularity at which threads communicate, enabling richer, more efficient parallel decompositions to be expressed. Cooperative launch APIs support synchronization among thread blocks for the execution of parallel algorithms. Conventional programming models provide a single, simple construct for synchronizing cooperating threads: a barrier across all threads of a thread block (e.g., the syncthreads() function). However, programmers often want to define groups of threads at smaller than thread-block granularity and synchronize within the defined groups, to enable greater performance, design flexibility, and software reuse in the form of collective group-wide function interfaces.
Cooperative Groups enables programmers to define groups of threads explicitly at sub-block (e.g., as small as a single thread) and multi-block granularities, and to perform collective operations, such as synchronization, on the threads in a cooperative group. The programming model supports clean composition across software boundaries, so that libraries and utility functions can synchronize safely within their local context without having to make assumptions about convergence. Cooperative Groups primitives enable new patterns of cooperative parallelism, including producer-consumer parallelism, opportunistic parallelism, and global synchronization across an entire grid of thread blocks. A minimal sketch follows.
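A minimal sketch of this model using the standard cooperative_groups API (the warp-level reduction is illustrative; it assumes a single block whose size is a multiple of 32 and a data array with one element per thread):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void cg_example(const int* data, int* out)
{
    cg::thread_block block = cg::this_thread_block();                 // whole CTA
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);  // sub-block group

    int v = data[block.thread_rank()];

    // Collective operation scoped to the warp-sized group.
    for (int offset = warp.size() / 2; offset > 0; offset /= 2)
        v += warp.shfl_down(v, offset);

    // Block-wide synchronization expressed on the group object.
    block.sync();

    if (warp.thread_rank() == 0)
        out[block.thread_rank() / 32] = v;   // one partial sum per warp
}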
Dispatch unit 515 is configured to transmit instructions to one or more functional units. In this embodiment, the scheduler unit 510 includes two dispatch units 515 that enable two different instructions from the same thread bundle to be scheduled during each clock cycle. In alternative embodiments, each scheduler unit 510 may include a single dispatch unit 515 or additional dispatch units 515.
Each SM 440 includes a register file 520 that provides a set of registers for the functional units of the SM 440. In one embodiment, register file 520 is divided among each functional unit such that each functional unit is allocated a dedicated portion of register file 520. In another embodiment, the register file 520 is divided between different thread bundles executed by the SM 440. Register file 520 provides temporary storage for operands connected to the data paths of the functional units.
Each SM 440 includes L processing cores 550. In one embodiment, SM 440 includes a large number (e.g., 128, etc.) of distinct processing cores 550. Each core 550 may include a fully-pipelined, single-precision, double-precision, and/or mixed-precision processing unit that includes a floating point arithmetic logic unit and an integer arithmetic logic unit. In one embodiment, the floating point arithmetic logic units implement the IEEE 754-2008 standard for floating point arithmetic. In one embodiment, the cores 550 include 64 single-precision (32-bit) floating point cores, 64 integer cores, 32 double-precision (64-bit) floating point cores, and 8 tensor cores.
The tensor cores are configured to perform matrix operations, and in one embodiment, one or more tensor cores are included in core 550. In particular, the tensor core is configured to perform deep learning matrix operations, such as convolution operations for neural network training and reasoning. In one embodiment, each tensor core operates on a 4 × 4 matrix and performs a matrix multiply and accumulate operation D ═ a × B + C, where A, B, C and D are 4 × 4 matrices.
In one embodiment, the matrix multiply inputs A and B are 16-bit floating point matrices, while the accumulation matrices C and D may be 16-bit floating point or 32-bit floating point matrices. The tensor cores operate on 16-bit floating point input data with 32-bit floating point accumulation. The 16-bit floating point multiply requires 64 operations and results in a full precision product that is then accumulated using 32-bit floating point addition with the other intermediate products for a 4×4×4 matrix multiply. In practice, the tensor cores are used to perform much larger two-dimensional or higher-dimensional matrix operations built up from these smaller elements. APIs (such as the CUDA 9 C++ API) expose specialized matrix load, matrix multiply and accumulate, and matrix store operations to efficiently use tensor cores from a CUDA-C++ program. At the CUDA level, the thread bundle (warp) level interface assumes 16×16 size matrices spanning all 32 threads of the thread bundle.
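A hedged sketch of that warp-level interface using the nvcuda::wmma API (the tile shape, layouts, and leading dimensions are illustrative; the kernel assumes the pointers reference 16x16 matrices and that it is launched with at least one full thread bundle of 32 threads):

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Computes D = A*B + C on a 16x16x16 tile with half-precision inputs
// and single-precision accumulation.
__global__ void wmma_16x16x16(const half* a, const half* b, const float* c, float* d)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> facc;

    wmma::load_matrix_sync(fa, a, 16);                          // specialized matrix load
    wmma::load_matrix_sync(fb, b, 16);
    wmma::load_matrix_sync(facc, c, 16, wmma::mem_row_major);

    wmma::mma_sync(facc, fa, fb, facc);                         // D = A*B + C on tensor cores

    wmma::store_matrix_sync(d, facc, 16, wmma::mem_row_major);  // specialized matrix store
}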
In some embodiments, transpose hardware is included in the processing cores 550 or another functional unit (e.g., the SFUs 552 or LSUs 554) and is configured to generate the original matrix and/or the transposed matrix from matrix data stored in diagonal form. The transpose hardware may be provided in the load path from the shared memory 570 to the register file 520 of the SM 440.
In one example, matrix data stored in diagonal form may be retrieved from DRAM and stored in the shared memory 570. As instructions that perform processing using the diagonally-stored matrix data are processed, the transpose hardware disposed in the path between the shared memory 570 and the register file 520 may provide the original matrix, the transposed matrix, a compacted original matrix, and/or a compacted transposed matrix. A single diagonally-stored copy of the matrix data may be maintained up to the last store before the instruction, and the matrix type specified by the instruction is generated in the register file 520 as needed.
Each SM 440 also includes M SFUs 552 that perform special functions (e.g., attribute evaluation, inverse square root, etc.). In one embodiment, the SFUs 552 may include a tree traversal unit configured to traverse a hierarchical tree data structure. In one embodiment, the SFUs 552 may include texture units configured to perform texture map filtering operations. In one embodiment, the texture units are configured to load texture maps (e.g., a 2D array of texels) from memory 304 and sample the texture maps to produce sampled texture values for use in shader programs executed by the SM 440. In one embodiment, the texture maps are stored in the shared memory/L1 cache 570. The texture units implement texture operations, such as filtering operations using mip-maps (i.e., texture maps of varying levels of detail). In one embodiment, each SM 440 includes two texture units.
Each SM 440 also includes N LSUs 554 that implement load and store operations between shared memory/L1 cache 570 and register file 520. Each SM 440 includes an interconnection network 580 that connects each functional unit to register file 520 and LSU 554 to register file 520, shared memory/L1 cache 570. In one embodiment, interconnect network 580 is a crossbar that may be configured to connect any functional unit to any register in register file 520, and to connect LSU 554 to register file 520 and memory locations in shared memory/L1 cache 570.
The shared memory/L1 cache 570 is an on-chip memory array that allows data storage and communication between the SM 440 and the primitive engine 435 and between threads in the SM 440. In one embodiment, shared memory/L1 cache 570 comprises 128KB of storage capacity and is in the path from SM 440 to partition unit 380. Shared memory/L1 cache 570 may be used for cache reads and writes. One or more of shared memory/L1 cache 570, L2 cache 460, and memory 304 are backing stores.
Combining data cache and shared memory functionality into a single memory block provides the best overall performance for both types of memory accesses. The capacity is usable as a cache by programs that do not use shared memory. For example, if the shared memory is configured to use half of the capacity, texture and load/store operations can use the remaining capacity. Integration within the shared memory/L1 cache 570 enables the shared memory/L1 cache 570 to function as a high-throughput conduit for streaming data while simultaneously providing high-bandwidth and low-latency access to frequently reused data.
When configured for general-purpose parallel computing, a simpler configuration can be used compared to graphics processing. In particular, the fixed function graphics processing unit shown in FIG. 3 is bypassed, creating a simpler programming model. In a general-purpose parallel computing configuration, the work allocation unit 325 assigns and allocates thread blocks directly to the DPCs 420. The threads in the block execute the same program, use unique thread IDs in the computations to ensure that each thread generates a unique result, execute the program and perform the computations using the SM 440, use the shared memory/L1 cache 570 to communicate between the threads, and use the LSU 554 to read and write to global memory through the shared memory/L1 cache 570 and the memory partition unit 380. When configured for general purpose parallel computing, SM 440 may also write commands that scheduler unit 320 may use to initiate new work on DPC 420.
The PPU 300 may be included in a desktop computer, a laptop computer, a tablet computer, a server, a supercomputer, a smartphone (e.g., wireless, handheld device), a Personal Digital Assistant (PDA), a digital camera, a vehicle, a head-mounted display, a handheld electronic device, and so forth. In one embodiment, the PPU 300 is included on a single semiconductor substrate. In another embodiment, the PPU 300 is included on a system on a chip (SoC) with one or more other devices, such as an additional PPU 300, memory 304, a Reduced Instruction Set Computer (RISC) CPU, a Memory Management Unit (MMU), a digital-to-analog converter (DAC), and so forth.
In one embodiment, PPU 300 may be included on a graphics card that includes one or more memory devices 304. The graphics card may be configured to interface with a PCIe slot on a motherboard of the desktop computer. In yet another embodiment, the PPU 300 may be an Integrated Graphics Processing Unit (iGPU) or a parallel processor contained in a chipset of a motherboard.
Exemplary computing System
Systems with multiple GPUs and CPUs are used in various industries as developers expose and exploit more parallelism in applications, such as artificial intelligence computing. High performance GPU acceleration systems with tens of thousands to more compute nodes are deployed in data centers, research institutions, and supercomputers to address larger problems. As the number of processing devices increases within high performance systems, communication and data transfer mechanisms need to be extended to support the increased bandwidth.
FIG. 10B is a conceptual diagram of a processing system 500 implemented using the PPU 300 of FIG. 8, according to one embodiment. Exemplary system 500 may be configured to implement the methods disclosed herein (e.g., the methods shown in fig. 5, 6, or 8). Processing system 500 includes a CPU 530, a switch 555, and each of a plurality of PPUs 300, as well as a corresponding memory 304. NVLink310 provides a high-speed communication link between each PPU 300. Although a particular number of NVLink310 and interconnect 302 connections are shown in FIG. 10B, the number of connections to each PPU 300 and CPU 530 may vary. Switch 555 interfaces between interconnect 302 and CPU 530. PPU 300, memory 304, and NVLink310 may be located on a single semiconductor platform to form parallel processing module 525. In one embodiment, switch 555 supports two or more protocols that interface between various different connections and/or links.
In another embodiment (not shown), NVLink310 provides one or more high speed communication links between each PPU 300 and CPU 530, and switch 555 interfaces between interconnect 302 and each PPU 300. PPU 300, memory 304, and interconnect 302 may be located on a single semiconductor platform to form parallel processing module 525. In yet another embodiment (not shown), interconnect 302 provides one or more communication links between each PPU 300 and CPU 530, and switch 555 interfaces between each PPU 300 using NVLink310 to provide one or more high-speed communication links between PPUs 300. In another embodiment (not shown), NVLink310 provides one or more high-speed communication links between PPU 300 and CPU 530 through switch 555. In yet another embodiment (not shown), interconnect 302 provides one or more communication links directly between each PPU 300. One or more NVLink310 high speed communication links may be implemented as physical NVLink interconnects or on-chip or bare-die interconnects using the same protocol as NVLink 310.
In the context of this specification, a single semiconductor platform may refer to only a single semiconductor-based integrated circuit fabricated on a die or chip. It should be noted that the term single semiconductor platform may also refer to multi-chip modules with increased connections that simulate on-chip operation and are substantially improved by utilizing conventional bus implementations. Of course, the various circuits or devices may also be placed separately or in various combinations of semiconductor platforms, depending on the needs of the user. Alternatively, the parallel processing module 525 may be implemented as a circuit board substrate, and each of the PPU 300 and/or the memory 304 may be packaged devices. In one embodiment, CPU 530, switch 555, and parallel processing module 525 are located on a single semiconductor platform.
In one embodiment, the signaling rate for each NVLink310 is 20 to 25 gbits/sec, and each PPU 300 includes six NVLink310 interfaces (as shown in fig. 10B, each PPU 300 includes five NVLink310 interfaces). Each NVLink310 provides a data transfer rate of 25 gbits/sec in each direction, with six links providing 300 gbits/sec. When CPU 530 also includes one or more NVLink310 interfaces, NVLink310 may be dedicated to PPU communications as shown in FIG. 10B, or some combination of PPU to PPU and PPU to CPU.
In one embodiment, NVLink310 allows direct load/store/atomic access from CPU 530 to memory 304 of each PPU 300. In one embodiment, NVLink310 supports coherency operations, allowing data read from memory 304 to be stored in the cache hierarchy of CPU 530, reducing cache access latency of CPU 530. In one embodiment, NVLink310 includes support for Address Translation Services (ATS), allowing PPU 300 direct access to page tables within CPU 530. One or more nvlinks 310 may also be configured to operate in a low power mode.
Fig. 10C illustrates an exemplary system 565 in which the various architectures and/or functionalities of the various previous embodiments may be implemented. Exemplary system 565 may be configured to implement methods disclosed herein (e.g., the methods shown in fig. 5, 6, or 8).
As shown, a system 565 is provided that includes at least one central processing unit 530 coupled to a communication bus 575. The communication bus 575 may be implemented using any suitable protocol, such as PCI (peripheral component interconnect), PCI-Express, AGP (accelerated graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s). The system 565 also includes a main memory 540. Control logic (software) and data are stored in main memory 540, and main memory 540 may take the form of Random Access Memory (RAM).
System 565 also includes an input device 560, a parallel processing system 525, and a display device 545, such as a conventional CRT (cathode ray tube), LCD (liquid crystal display), LED (light emitting diode), plasma display, and the like. User input may be received from an input device 560 (e.g., a keyboard, mouse, touchpad, microphone, etc.). Each of the aforementioned modules and/or devices may even be located on a single semiconductor platform to form system 565. Alternatively, the various modules may also be placed separately or in various combinations of semiconductor platforms, depending on the needs of the user.
Further, system 565 can be coupled for communication purposes to a network (e.g., a telecommunications network, a Local Area Network (LAN), a wireless network, a Wide Area Network (WAN) such as the internet, a peer-to-peer network, a cable network, etc.) through network interface 535.
System 565 can also include secondary storage (not shown). The secondary storage includes, for example, a hard disk drive and/or a removable storage drive representing a floppy disk drive, a magnetic tape drive, an optical disk drive, a digital versatile disk (DVD) drive, a recording device, or a universal serial bus (USB) flash memory. The removable storage drive reads from and/or writes to a removable storage unit in a well-known manner.
Computer programs, or computer control logic algorithms, may be stored in main memory 540 and/or secondary storage. Such computer programs, when executed, enable system 565 to perform various functions. Memory 540, storage, and/or any other storage are possible examples of computer-readable media.
The architecture and/or functionality of the various previous figures may be implemented in the context of a general purpose computer system, a circuit board system, a game console system dedicated for entertainment purposes, a dedicated system, and/or any other desired system. For example, system 565 may take the form of a desktop computer, laptop computer, tablet computer, server, supercomputer, smartphone (e.g., wireless, handheld device), Personal Digital Assistant (PDA), digital camera, vehicle, head-mounted display, handheld electronic device, mobile phone device, television, workstation, game console, embedded system, and/or any other type of logic.
All patents and printed publications mentioned above are incorporated herein by reference as if set forth in full.
While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (29)

1. A method performed by a processing system comprising a multithreaded processor and on-chip memory including cache memory and software-managed shared memory, the method comprising: simultaneously executing a plurality of threads, the plurality of threads including a thread executing a fused load and store instruction for loading data stored in a memory external to the processing system and storing the data into the shared memory, the fused load and store instruction being configurable to bypass (1) processor registers associated with the thread executing the fused load and store instruction or (2) the processor registers and the cache memory, wherein the instruction directs the system to: retrieve the data from the external memory; and store the retrieved data in the shared memory without first storing the retrieved data in the processor registers, or without first storing the retrieved data in the processor registers and the cache memory.

2. The method of claim 1, wherein a plurality of threads execute the fused load and store instruction for loading data stored in the external memory into the shared memory, and the instruction directs the system to determine shared memory destination addresses for the data requested by the instruction, determine a pattern between filled sectors of the retrieved data and the shared memory destination addresses, and store the filled sectors of the retrieved data to shared memory sectors identified by the determined pattern.

3. The method of claim 1, wherein the shared memory comprises a plurality of sectors arranged in banks, and data stored into at least one of the sectors has an intra-sector swap applied.

4. The method of claim 1, wherein the fused load and store instruction is a single instruction identifying a destination shared memory address and a source global address.

5. The method of claim 1, wherein, when (1) the fused load and store instruction is configured to bypass the processor registers and the cache memory and (2) the data requested by the instruction is stored in the cache memory, the fused load and store instruction directs the system to store the requested data from the cache memory into the shared memory and to invalidate the tag of the requested data in the cache memory.

6. The method of claim 1, wherein the fused load and store instruction is dynamically configured at runtime.

7. A processing system comprising: a multithreaded processor configured to simultaneously execute a plurality of threads; a plurality of data registers, each of the data registers being assigned to an executing thread; and on-chip memory including cache memory and shared memory, the cache memory being configured to allow the plurality of executing threads to access tagged data stored in the cache memory, and the shared memory being configured to allow the plurality of executing threads to access untagged data stored in the shared memory, wherein the system is configured, in response to at least one of the threads executing an instruction for loading data stored in a memory external to the processing system and selectively storing the data into the shared memory without first storing the data in the plurality of data registers or without first storing the data in the plurality of data registers and the cache memory, to retrieve the data from the external memory and selectively store the retrieved data in the shared memory (1) without first storing the retrieved data in the plurality of data registers or (2) without first storing the retrieved data in the plurality of data registers and the cache memory.

8. The processing system of claim 7, wherein, based on a dynamic selection at runtime, the retrieved data is stored in the shared memory (1) without first storing the retrieved data in the plurality of data registers, or (2) without first storing the retrieved data in the plurality of data registers and the cache memory.

9. The processing system of claim 7, wherein the cache memory and the shared memory are a unified physical random access memory.

10. The processing system of claim 9, wherein the unified physical random access memory includes a register file comprising the plurality of data registers dynamically allocated to the executing threads.

11. The processing system of claim 7, wherein the processing system includes hardware circuitry configured to receive a plurality of instructions for loading data stored in the external memory and storing the data into the shared memory, and to determine, based on the plurality of instructions, a shared memory sector fill pattern for storing the retrieved data in the shared memory.

12. The processing system of claim 7, wherein the shared memory comprises a plurality of sectors arranged in banks, and the retrieved data is stored in at least one of the sectors with an intra-sector swap.

13. The processing system of claim 7, wherein the shared memory comprises a plurality of sectors arranged in banks, and portions of the retrieved data are stored into sectors in different banks.

14. The processing system of claim 7, wherein the instruction loads data from one sector of the external memory and stores the loaded data into portions of a plurality of sectors in the shared memory.

15. The processing system of claim 7, wherein the shared memory is a software-managed cache.

16. The processing system of claim 7, wherein the retrieved data is stored in one or more cache lines of the cache memory, the system further comprising interconnect circuitry configured to transfer the data stored in the one or more cache lines into sectors of the shared memory without first storing the retrieved data in the plurality of data registers.

17. A method performed by a processing system comprising a multithreaded processor configured to simultaneously execute a plurality of threads, a plurality of registers assigned to executing threads, a hardware-managed cache memory configured to allow the plurality of executing threads to access tagged data stored in the cache memory, and a software-managed shared memory configured to allow the plurality of executing threads to access untagged data stored in the shared memory, the method comprising: simultaneously executing a plurality of threads, wherein at least one of the threads executes an instruction for loading data stored in a memory external to the processing system and storing the data into the shared memory, the instruction being configured to (1) store the data into the shared memory without first storing the data in the registers assigned to the executing thread, or (2) store the data into the shared memory without first storing the data in the registers and the cache memory; in response to the instruction being configured to store the data into the shared memory without first storing the data in the registers, retrieving the data from the external memory and storing the retrieved data into the shared memory without first storing the retrieved data in the registers; and in response to the instruction being configured to store the data into the shared memory without first storing the data in the registers and the cache memory, retrieving the data from the external memory and storing the retrieved data into the shared memory without first storing the retrieved data in the registers and the cache memory.

18. The method of claim 17, further comprising two or more of the executing threads executing instructions for loading data stored in the shared memory into registers assigned to the two or more executing threads and performing computations using the data stored in the registers.

19. The method of claim 18, wherein the computations are matrix multiply and accumulate operations.

20. The method of claim 17, wherein the instruction for loading data stored in the external memory of the processing system and storing the data into the shared memory is a single instruction that includes a destination shared memory address and a source global address.

21. The method of claim 17, wherein the shared memory comprises a plurality of sectors arranged in banks, and the retrieved data is stored into at least one of the sectors with an intra-sector swap applied.

22. The method of claim 17, wherein the shared memory comprises a plurality of sectors arranged in banks, and the retrieved data is stored into sectors in different banks.

23. A GPU execution method comprising: decoding a fused load/store instruction format that specifies whether the GPU should (a) bypass processor registers or (b) bypass both the processor registers and cache memory; and in response to the decoding, retrieving a data word from a first memory and storing the retrieved data word into a second memory, without first storing the retrieved data word in the processor registers or without first storing the retrieved data word in the processor registers and the cache memory, as selected by the decoded fused load/store instruction format specification.

24. The GPU execution method of claim 23, further comprising sharing the second memory among a plurality of threads.

25. The GPU execution method of claim 23, wherein the cache memory and the second memory are disposed in a common on-chip physical memory address space.

26. The GPU execution method of claim 23, wherein the fused load/store instruction format includes (i) a source word address, (ii) a destination word address, and (iii) at least one bit specifying whether the GPU should (a) bypass the processor registers or (b) bypass both the processor registers and the cache memory when retrieving the data word from the source word address and storing the retrieved data word to the destination word address, and wherein the decoding includes decoding the at least one bit.

27. A GPU execution method comprising: decoding a load/store memory access instruction format; in response to the decoding, initiating a memory access to GPU shared memory, including selectively bypassing intermediate storage in at least one of processor registers and cache memory; and tracking completion of the memory access as an asynchronous copy/direct memory access operation.

28. The method of claim 27, further comprising writing to the shared memory while dynamically selecting at runtime whether to bypass each intermediate store.

29. The method of claim 27, wherein the memory access comprises reading a sector of global memory and writing it to portions of a plurality of sectors in the shared memory.
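A concrete illustration may help here. The following minimal CUDA sketch shows how a global-to-shared copy that bypasses per-thread registers, of the kind recited in claims 1, 23, and 27, is typically expressed at the source level using the cooperative groups memcpy_async/wait API available in CUDA 11 and later. The kernel name, the 256-element tile size, and the doubling computation are illustrative assumptions only and are not part of the claimed method; whether the copy actually bypasses the register file and/or the cache is determined by the generated instructions and the target hardware.

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

// Illustrative kernel: each thread block copies one tile from global memory
// directly into shared memory, then computes on the shared-memory copy.
__global__ void scale_tile(const float* __restrict__ in, float* __restrict__ out, int n)
{
    extern __shared__ float tile[];              // software-managed shared memory
    cg::thread_block block = cg::this_thread_block();

    int base  = blockIdx.x * blockDim.x;         // first element of this block's tile
    int count = min((int)blockDim.x, n - base);  // uniform across the block
    if (count <= 0) return;

    // Asynchronous global-to-shared copy; on supporting hardware this avoids
    // staging the data in per-thread registers.
    cg::memcpy_async(block, tile, in + base, sizeof(float) * count);
    cg::wait(block);                             // wait until the tile has landed in shared memory

    // Threads may now reuse the shared-memory tile as often as needed.
    if (threadIdx.x < count)
        out[base + threadIdx.x] = 2.0f * tile[threadIdx.x];
}

// Host-side launch (dynamic shared memory sized to the tile):
//   scale_tile<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_out, n);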
CN202010490360.XA 2019-10-29 2020-06-02 Technology that efficiently transfers data to the processor Active CN112749120B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201962927417P 2019-10-29 2019-10-29
US62/927,417 2019-10-29
US16/712,083 2019-12-12
US16/712,083 US11080051B2 (en) 2019-10-29 2019-12-12 Techniques for efficiently transferring data to a processor

Publications (2)

Publication Number Publication Date
CN112749120A true CN112749120A (en) 2021-05-04
CN112749120B CN112749120B (en) 2024-09-20

Family

ID=75645265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010490360.XA Active CN112749120B (en) 2019-10-29 2020-06-02 Technology that efficiently transfers data to the processor

Country Status (1)

Country Link
CN (1) CN112749120B (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001236221A (en) * 2000-02-21 2001-08-31 Keisuke Shindo Pipe line parallel processor using multi-thread
US7680988B1 (en) * 2006-10-30 2010-03-16 Nvidia Corporation Single interconnect providing read and write access to a memory shared by concurrent threads
US20110072213A1 (en) * 2009-09-23 2011-03-24 Nickolls John R Instructions for managing a parallel cache hierarchy
US20140013060A1 (en) * 2012-07-06 2014-01-09 International Business Machines Corporation Ensuring causality of transactional storage accesses interacting with non-transactional storage accesses
US20140149718A1 (en) * 2012-11-28 2014-05-29 Christopher J. Hughes Instruction and logic to provide pushing buffer copy and store functionality
CN107980118A (en) * 2015-06-10 2018-05-01 无比视视觉技术有限公司 Multi-core processor devices using multithreading
US20170322887A1 (en) * 2016-05-04 2017-11-09 Nvidia Corporation Method to control cache replacement for decoupled data fetch
CN108694153A (en) * 2017-04-01 2018-10-23 英特尔公司 Engine for enabling high speed context switching via on-die storage
CN108694684A (en) * 2017-04-01 2018-10-23 英特尔公司 Shared local storage block mechanism
CN108694688A (en) * 2017-04-07 2018-10-23 英特尔公司 Apparatus and method for managing data bias in a graphics processing architecture
CN108734646A (en) * 2017-04-24 2018-11-02 英特尔公司 Across efficient data that processing system carries out is shared and companding
US20190102254A1 (en) * 2017-09-29 2019-04-04 Nvidia Corporation Securing against errors in an error correcting code (ecc) implemented in an automotive system
US20190171604A1 (en) * 2017-10-31 2019-06-06 Micron Technology, Inc. System Having a Hybrid Threading Processor, a Hybrid Threading Fabric Having Configurable Computing Elements, and a Hybrid Interconnection Network
US20190206023A1 (en) * 2017-12-28 2019-07-04 Nvidia Corporation Multi-gpu frame rendering

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG Yida; ZHAO Changhai; LI Chao; ZHANG Jianlei; YAN Haihua; ZHANG Weiyi: "A hybrid-domain parallel algorithm for 3D Kirchhoff prestack depth migration in heterogeneous computing environments", Oil Geophysical Prospecting (石油地球物理勘探), no. 03, pages 53-61 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113325398A (en) * 2021-05-13 2021-08-31 英博超算(南京)科技有限公司 Multi-core communication system of ultrasonic radar and use method
CN113407324A (en) * 2021-06-28 2021-09-17 浙江太美医疗科技股份有限公司 Method and device for processing database operation data
CN113407324B (en) * 2021-06-28 2024-03-29 浙江太美医疗科技股份有限公司 Database operation data processing method and device
CN113986777A (en) * 2021-10-25 2022-01-28 北京恒安嘉新安全技术有限公司 Data processing method and device, electronic equipment and storage medium
CN114490133A (en) * 2022-02-10 2022-05-13 上海阵量智能科技有限公司 Data acquisition method and device, electronic equipment and processing system
CN115048047A (en) * 2022-05-30 2022-09-13 蚂蚁区块链科技(上海)有限公司 Data processing system and method
CN115168247B (en) * 2022-09-02 2022-12-02 北京登临科技有限公司 Method for dynamically sharing memory space in parallel processor and corresponding processor
CN115168247A (en) * 2022-09-02 2022-10-11 北京登临科技有限公司 Method for dynamically sharing memory space in parallel processors and corresponding processor
WO2024073228A1 (en) * 2022-09-29 2024-04-04 Advanced Micro Devices, Inc. Data reuse cache
US12066940B2 (en) 2022-09-29 2024-08-20 Advanced Micro Devices, Inc. Data reuse cache
CN117851278A (en) * 2024-03-08 2024-04-09 上海芯联芯智能科技有限公司 Method for sharing static random access memory and central processing unit
CN117851278B (en) * 2024-03-08 2024-06-18 上海芯联芯智能科技有限公司 Method for sharing static random access memory and central processing unit
CN118885272A (en) * 2024-07-23 2024-11-01 上海壁仞科技股份有限公司 Computing device and execution core termination method
CN118885272B (en) * 2024-07-23 2025-03-18 上海壁仞科技股份有限公司 Computing device and execution core termination method
CN119005274A (en) * 2024-10-24 2024-11-22 上海壁仞科技股份有限公司 Processor operating method and device, electronic device and program product

Also Published As

Publication number Publication date
CN112749120B (en) 2024-09-20

Similar Documents

Publication Publication Date Title
US11604649B2 (en) Techniques for efficiently transferring data to a processor
CN112749120B (en) Technology that efficiently transfers data to the processor
US11907717B2 (en) Techniques for efficiently transferring data to a processor
US11934867B2 (en) Techniques for divergent thread group execution scheduling
CN118135077A (en) Programmable ray tracing with hardware acceleration on a graphics processor
US20220157010A1 (en) Apparatus and method for efficiently merging bounding volume hierarchy data
CN113610697A (en) Scalable sparse matrix multiplication acceleration using systolic arrays with feedback inputs
CN116775518A (en) Method and apparatus for efficient access to multidimensional data structures and/or other large data blocks
US10853989B2 (en) Coarse compute shading
CN116775519A (en) Methods and apparatus for efficient access to multidimensional data structures and/or other large blocks of data
CN111708718A (en) Memory compressed hashing mechanism
US20230144553A1 (en) Software-directed register file sharing
US20230115044A1 (en) Software-directed divergent branch target prioritization
US12354205B2 (en) Efficient caching of resource state for a shared function of a three-dimensional pipeline of a graphics processing unit
US11429534B2 (en) Addressing cache slices in a last level cache
CN111798362A (en) Hardware index mapping mechanism
CN111724294A (en) Distributed copy engine
CN111754382A (en) Controlling surface access using planar memory mapping
US10839478B2 (en) Accumulator pooling mechanism
US10776986B2 (en) Apparatus and method for data-parallel ray tracing using volume proxies
CN113032159A (en) Compiler assisted register file write reduction
JP2022151634A (en) Tessellation redistribution to reduce delays in the processor
US20250200859A1 (en) Software-directed divergent branch target prioritization
US11625279B2 (en) Read-write page replication for multiple compute units
CN117561542A (en) A unified stateless compression system for general-purpose consumable compression

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant