
CN113377538B - A storage computing collaborative scheduling method and system for GPU data reuse - Google Patents

A storage computing collaborative scheduling method and system for GPU data reuse

Info

Publication number
CN113377538B
CN113377538B
Authority
CN
China
Prior art keywords
gpu
data
data page
kernel
reverse
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110649358.7A
Other languages
Chinese (zh)
Other versions
CN113377538A (en)
Inventor
李晨
李宣佚
郭阳
鲁建壮
陈小文
刘胜
张洋
刘畅
曹壮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202110649358.7A
Publication of CN113377538A
Application granted
Publication of CN113377538B

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a storage computing collaborative scheduling method and system for GPU data reuse. The method comprises: when a kernel program is launched, flipping the kernel's inversion flag; in the GPU's thread block scheduler, for the kernel's thread block scheduling, selecting in turn, according to the kernel's inversion flag, one of a forward thread block dispatch policy and a reverse thread block dispatch policy to select the next thread block to issue from the pending thread block queue; and in the GPU driver, for the kernel's data page replacement, selecting in turn, according to the kernel's inversion flag, one of a forward data page replacement policy and a reverse data page replacement policy to select the GPU-side data page to evict from the GPU-side data page queue. The invention realizes cooperative scheduling of thread blocks and data pages, reduces the impact of memory oversubscription on system performance by reusing shared data, and can effectively improve system performance.

Description

A storage computing collaborative scheduling method and system for GPU data reuse

Technical Field

The present invention relates to computing scheduling technology for computers, and in particular to a storage computing collaborative scheduling method and system for GPU data reuse.

Background

Owing to their high computational throughput and good programmability, GPUs have been widely adopted in high-performance domains including machine learning, object detection, and image denoising. However, the limited memory capacity on a GPU can no longer hold the ever-growing working sets of applications (the volume of GPU data accessed per unit time). The introduction of unified virtual memory and on-demand paging provides good support for memory oversubscription, but the extra data page transfers between CPU memory and GPU memory cost system performance, so reducing these redundant data migrations is crucial to performance. After studying a large number of benchmark suites, we found that many applications share data between kernel programs (kernels), and that in most such programs every kernel accesses the same data region in a similar access order. When GPU memory cannot hold a kernel's entire working set, old data pages are swapped out to CPU memory and the required pages are fetched into GPU memory. When a kernel finishes, only the most recently accessed pages remain in GPU memory, yet subsequent kernels will again access the pages that were already evicted to CPU memory. We found that although these applications share a large amount of data between kernels, this data-sharing property disappears once memory oversubscription occurs, causing a sharp decline in system performance.

Effectively using the data already resident in GPU memory is the key to avoiding the long-latency overhead caused by page faults, especially under memory oversubscription. Figure 1 shows the performance degradation caused by memory oversubscription for applications with inter-kernel data sharing. Such applications are insensitive to the degree of oversubscription: even slight oversubscription causes a sharp drop in performance. Figure 2 shows the data page access pattern and page fault rate of the FFT program when GPU memory can hold only 75% of the data it accesses (the kernels access the same data in the same order; the dashed lines mark the end of each kernel). Every kernel in FFT has similar data access characteristics and order, and the page fault rate is very high at the start boundary of each kernel (the circled regions in Figure 2), because pages accessed earlier were replaced by newly accessed pages, so subsequent kernels fault again when they revisit the evicted pages. To fundamentally reduce the number of page migrations under memory oversubscription, many techniques have been proposed, including prefetching, hiding transfer time behind computation time, and batching page faults. However, these techniques do little for this class of applications: prefetching evicts useful pages and causes system thrashing, and the long latency caused by massive page faults cannot be hidden by pre-evicting pages or batching faults. Based on the above analysis, we arrive at three observations about applications with inter-kernel data sharing. First, once memory oversubscription occurs, performance drops sharply. Second, previous optimizations for performance under memory oversubscription do not apply to such programs. Third, the page fault rate at kernel boundaries is very high. Our research goal is therefore to reduce the page fault rate at kernel boundaries by reusing the data shared between kernels.

Summary of the Invention

The technical problem to be solved by the present invention: in view of the above problems in the prior art, a storage computing collaborative scheduling method and system for GPU data reuse are provided. The present invention realizes cooperative scheduling of thread blocks and data pages, reduces the impact of memory oversubscription on system performance by reusing shared data, and can effectively improve system performance.

To solve the above technical problem, the technical solution adopted by the present invention is:

A storage computing collaborative scheduling method for GPU data reuse, comprising:

1) Under the conditions that the current program oversubscribes the GPU memory capacity and data sharing exists between kernels, detecting whether a kernel is launched, and if a kernel is launched, flipping that kernel's inversion flag;

2) In the GPU driver, for data page replacement of the kernel, selecting in turn, according to the kernel's inversion flag, one of two preset data page replacement policies, a forward data page replacement policy and a reverse data page replacement policy, to select the GPU-side data page to evict from the GPU-side data page queue, wherein the forward and reverse data page replacement policies select GPU-side data pages in opposite directions; and in the GPU's thread block scheduler, for thread block scheduling of the kernel, selecting in turn, according to the kernel's inversion flag, one of two preset thread block dispatch policies, a forward thread block dispatch policy and a reverse thread block dispatch policy, to select the next thread block to issue from the pending thread block queue, wherein the forward and reverse thread block dispatch policies select thread blocks in opposite directions.
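
The control flow of steps 1) and 2) can be summarized in a short sketch. The following Python fragment is a minimal illustration under the assumptions stated above (the flag is created lazily with an initial value of 0 or 1, flipped on every qualifying launch, and a value of 0 selects the forward pair of policies while 1 selects the reverse pair); the class and method names are illustrative, not part of the patent.

```python
class KernelLaunchControl:
    """Illustrative model of the inversion-flag logic in steps 1) and 2)."""

    def __init__(self):
        self.inversion_flag = None           # created on first qualifying launch

    def on_kernel_launch(self, oversubscribed: bool, shares_data: bool):
        # Cooperative scheduling applies only when both trigger conditions hold.
        if not (oversubscribed and shares_data):
            return ("forward dispatch", "forward replacement")
        if self.inversion_flag is None:
            self.inversion_flag = 0          # initial value may be 0 or 1
        else:
            self.inversion_flag ^= 1         # flip: 0 -> 1, 1 -> 0
        if self.inversion_flag == 0:
            return ("forward dispatch", "forward replacement")
        return ("reverse dispatch", "reverse replacement")


ctl = KernelLaunchControl()
for kernel in "ABCDE":                       # successive launches alternate
    print(kernel, ctl.on_kernel_launch(oversubscribed=True, shares_data=True))
```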

Optionally, flipping the kernel's inversion flag in step 1) comprises: first detecting whether an inversion flag for the kernel already exists; if it does not exist, initializing an inversion flag for the kernel; and if it does exist, flipping it.

Optionally, when the inversion flag is initialized, its initial value is 0 or 1.

Optionally, flipping the kernel's inversion flag means: if the original value of the kernel's inversion flag is 0, changing it from 0 to 1; and if the original value is 1, changing it from 1 to 0.

Optionally, in step 2), when one of the two preset data page replacement policies is selected in turn according to the kernel's inversion flag to select the GPU-side data page to evict from the GPU-side data page queue, the forward data page replacement policy is selected if the inversion flag is 0, and the reverse data page replacement policy is selected if the inversion flag is 1.

Optionally, in step 2), when one of the two preset thread block dispatch policies is selected in turn according to the inversion flag, the forward thread block dispatch policy is selected if the inversion flag is 0, and the reverse thread block dispatch policy is selected if the inversion flag is 1.

Optionally, the current program oversubscribing the GPU memory capacity in step 1) means: when the GPU driver on the host side observes a data page request from the GPU and the data page queue maintained by the driver has reached the GPU memory capacity, data pages in GPU memory must be evicted to CPU memory according to the data page replacement policy, and an oversubscription flag is set to indicate that the current program has oversubscribed memory.

Optionally, data sharing existing between kernels in step 1) means: judging from compile-time information whether kernels share the same pointers, using this as the indicator of inter-kernel data sharing, and assigning each kernel to be launched a data-sharing flag that indicates whether it shares data with the preceding kernel; when a kernel is launched, whether the oversubscription flag and the kernel's data-sharing flag are both 1 is checked, and if so, it is determined that the cooperative scheduling method is needed.

In addition, the present invention provides a storage computing collaborative scheduling system for GPU data reuse, comprising a processing unit and a memory connected to each other, the processing unit being programmed or configured to perform the steps of the aforementioned storage computing collaborative scheduling method for GPU data reuse.

In addition, the present invention provides a computer-readable storage medium storing a computer program programmed or configured to perform the aforementioned storage computing collaborative scheduling method for GPU data reuse.

Compared with the prior art, the present invention has the following advantages. Owing to their high computational throughput and good programmability, GPUs have been widely adopted in high-performance domains including machine learning, object detection, and image denoising, yet the limited memory capacity on a GPU can no longer hold the ever-growing working sets of applications (the volume of GPU data accessed per unit time). Unified virtual memory and on-demand paging provide good support for memory oversubscription, but the extra data page transfers between CPU memory and GPU memory cost system performance, so reducing these redundant migrations is crucial. After studying a large number of benchmark suites, we found that many applications share data between kernels, and that in most such programs every kernel accesses the same data region in a similar access order. When GPU memory cannot hold a kernel's entire working set, old data pages are swapped out to CPU memory and the required pages are fetched in; when a kernel finishes, only the most recently accessed pages remain in GPU memory, yet subsequent kernels will again access the pages that were evicted to CPU memory. Although these applications share a large amount of data between kernels, this data-sharing property disappears once memory oversubscription occurs, causing a sharp drop in system performance. Based on these observations, the present invention proposes a thread block and data page cooperative scheduling method that improves system performance by effectively exploiting the data shared between kernels: coordinating the switching of the thread block dispatch order (that is, changing the order in which thread blocks are allocated) with the switching of the data page replacement policy makes full use of the shared data. We evaluated the method on a large set of GPU benchmarks; the results show that it outperforms the latest research by 65%.

Brief Description of the Drawings

Figure 1 shows the impact of GPU memory oversubscription on system performance.

Figure 2 shows the data page access pattern and page fault rate of the FFT program when GPU memory can hold only 75% of the data it accesses.

Figure 3 is a schematic diagram of the basic principle of the method of the embodiment of the present invention.

Figure 4 is a schematic performance comparison between the method of the embodiment of the present invention and existing methods.

Figure 5 shows the data page access pattern and data page fault rate during FFT execution when the GPU memory capacity is set to 75% of the data accessed by FFT and the method of this embodiment (reverse allocation) is used.

Figure 6 shows the data page access pattern and data page fault rate during FFT execution when the GPU memory capacity is set to 75% of the data accessed by FFT and the cooperative inversion method is used.

Figure 7 shows the execution process under the baseline configuration in the embodiment of the present invention.

Figure 8 shows the execution process when reverse allocation is used in the embodiment of the present invention.

Figure 9 shows the execution process when the cooperative method is used in the embodiment of the present invention.

Figure 10 is a schematic diagram of the load-time-based least-recently-used data page replacement policy in the embodiment of the present invention.

Figure 11 is a schematic diagram of the reverse load-time-based least-recently-used data page replacement policy in the embodiment of the present invention.

Figure 12 shows the IPC comparison of different optimization mechanisms in the embodiment of the present invention, using the baseline configuration as the performance reference.

Detailed Description

As shown in Figure 3, the storage computing collaborative scheduling method for GPU data reuse of this embodiment comprises:

1) Under the conditions that the current program oversubscribes the GPU memory capacity and data sharing exists between kernels, detecting whether a kernel is launched, and if a kernel is launched, flipping that kernel's inversion flag;

2) In the GPU driver, for data page replacement of the kernel, selecting in turn, according to the kernel's inversion flag, one of two preset data page replacement policies, a forward data page replacement policy and a reverse data page replacement policy, to select the GPU-side data page to evict from the GPU-side data page queue, wherein the forward and reverse data page replacement policies select GPU-side data pages in opposite directions; and in the GPU's thread block scheduler, for thread block scheduling of the kernel, selecting in turn, according to the kernel's inversion flag, one of two preset thread block dispatch policies, a forward thread block dispatch policy and a reverse thread block dispatch policy, to select the next thread block to issue from the pending thread block queue, wherein the forward and reverse thread block dispatch policies select thread blocks in opposite directions.

As shown above, matching thread blocks to the data already in memory is the key to reducing the page fault rate at kernel boundaries, but determining which thread blocks are related to that data is the biggest design challenge. Based on observation and analysis of this class of applications, we find that in most such programs every kernel has similar data access characteristics and order (as shown in Figure 2), and that within each kernel the data page access behavior is tied to the thread block scheduling mechanism, because each thread generally uses its thread index and thread block index to determine the data location it operates on. In view of this, this embodiment proposes a thread block dispatch mechanism called reverse allocation, shown at mark b in Figure 3, which dynamically changes the thread block dispatch policy by adjusting the thread block scheduler. The default dispatch policy, shown on the left of mark b in Figure 3, dispatches thread blocks in ascending index order. Whenever a kernel is launched, the inversion flag is flipped, and the scheduler chooses between the two dispatch policies shown at mark b in Figure 3 according to the inversion flag in the command processor: when the flag is 0, the forward dispatch policy is selected; otherwise, the reverse dispatch policy is selected, so that the dispatched thread blocks match the data retained in memory and data reuse is maximized.
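
A minimal sketch of the reverse allocation order just described, assuming 0-based thread block indices: blocks are issued in ascending order when the inversion flag is 0 and in descending order when it is 1. In hardware this choice is made by the thread block scheduler next to the command processor; the function here is only an illustration.

```python
def dispatch_order(num_thread_blocks: int, inversion_flag: int) -> list:
    """Thread block issue order for one kernel: ascending indices under the
    forward dispatch policy, descending under the reverse dispatch policy."""
    order = list(range(num_thread_blocks))
    return order if inversion_flag == 0 else order[::-1]


print(dispatch_order(5, 0))   # [0, 1, 2, 3, 4]: forward (default) dispatch
print(dispatch_order(5, 1))   # [4, 3, 2, 1, 0]: reverse dispatch
```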

In this embodiment, the current program oversubscribing the GPU memory capacity in step 1) means: when the GPU driver on the host side observes a data page request from the GPU and the data page queue maintained by the driver has reached the GPU memory capacity, data pages in GPU memory must be evicted to CPU memory according to the data page replacement policy, and an oversubscription flag is set to indicate that the current program has oversubscribed memory.

In this embodiment, data sharing existing between kernels in step 1) means: judging from compile-time information whether kernels share the same pointers, using this as the indicator of inter-kernel data sharing, and assigning each kernel to be launched a data-sharing flag that indicates whether it shares data with the preceding kernel; when a kernel is launched, whether the oversubscription flag and the kernel's data-sharing flag are both 1 is checked, and if so, it is determined that the cooperative scheduling method is needed.

The application scenario of the cooperative scheduling method proposed in this embodiment is when an application oversubscribes memory and data is shared between multiple kernels, so detecting memory oversubscription and inter-kernel data sharing is the key to triggering this mechanism. When the GPU driver on the host side observes a data page request from the GPU and the data page queue maintained by the driver has reached the GPU memory capacity, data pages in GPU memory must be evicted to CPU memory according to the replacement policy; at the same time, the oversubscription flag is set to indicate that the current program has oversubscribed memory, and this flag serves as one of the triggers of the cooperative scheduling mechanism of this embodiment. Compile-time information is used to judge whether kernels share the same pointers, which serves as the indicator of inter-kernel data sharing, and each kernel to be launched is assigned a data-sharing flag indicating whether it shares data with the preceding kernel. When a kernel is launched, the cooperative scheduling mechanism proposed in this embodiment checks whether the oversubscription flag and the kernel's data-sharing flag are both 1; if so, the cooperative scheduling mechanism is triggered, otherwise execution proceeds under the default configuration.
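
The two trigger checks described above can be sketched as follows. This is an illustrative model rather than driver code: the resident-page list, the capacity constant, and the flag representation are assumptions made for the example.

```python
GPU_CAPACITY_PAGES = 3                       # illustrative capacity

def on_page_request(resident_pages: list, flags: dict) -> bool:
    """Driver-side check on each GPU page request: once the resident-page
    queue has reached capacity, an eviction is needed and the memory
    oversubscription flag is latched. Returns True if eviction is required."""
    if len(resident_pages) >= GPU_CAPACITY_PAGES:
        flags["oversubscribed"] = 1
        return True
    return False

def coscheduling_triggered(flags: dict, shares_data_with_prev: int) -> bool:
    """Launch-time check: the cooperative mechanism is triggered only when the
    oversubscription flag and the kernel's data-sharing flag are both 1."""
    return flags.get("oversubscribed", 0) == 1 and shares_data_with_prev == 1
```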

In this embodiment, flipping the kernel's inversion flag in step 1) comprises: first detecting whether an inversion flag for the kernel already exists; if it does not exist, initializing an inversion flag for the kernel; and if it does exist, flipping it.

In this embodiment, when the inversion flag is initialized, its initial value is 0 or 1.

In this embodiment, flipping the kernel's inversion flag means: if the original value of the kernel's inversion flag is 0, changing it from 0 to 1; and if the original value is 1, changing it from 1 to 0.

In this embodiment, in step 2), when one of the two preset data page replacement policies is selected in turn according to the kernel's inversion flag to select the GPU-side data page to evict from the GPU-side data page queue, the forward data page replacement policy is selected if the inversion flag is 0, and the reverse data page replacement policy is selected if the inversion flag is 1.

In this embodiment, in step 2), when one of the two preset thread block dispatch policies is selected in turn according to the inversion flag, the forward thread block dispatch policy is selected if the inversion flag is 0, and the reverse thread block dispatch policy is selected if the inversion flag is 1.

To demonstrate the effectiveness of the method of this embodiment (reverse allocation), we compare it against two existing techniques, Oracle and ETC, where Oracle fully exploits the inter-kernel shared data retained in memory. Figure 4 compares the average performance of reverse allocation, Oracle, and ETC, using the default configuration (Baseline) as the reference. Figure 5 shows the data page access pattern and page fault behavior of the FFT program under reverse allocation. Two observations follow from Figure 4. First, compared with ETC, the latest prior method, reverse allocation improves performance by 20.2% on average, consistent with the drop in the page fault rate at kernel boundaries shown in Figure 5. Second, reverse allocation still performs 58.5% worse than Oracle, indicating considerable room for improvement. Figure 5 also shows that even-numbered kernels execute in less time than odd-numbered kernels. In Figure 5, the GPU memory capacity is set to 75% of the data accessed by FFT, the kernels access the same data in the same order, and the dashed lines mark the end of each kernel.

To investigate the large performance gap between reverse allocation and Oracle, as well as the differences in execution time across kernels, this embodiment analyzes the execution of a simple test program that contains multiple kernels, each accessing the same data; the execution process is shown in Figures 7 to 9. To simplify the analysis, we make some assumptions: 1) the GPU executes only one thread block at a time; 2) five kernels (A, B, C, D, E) access the same data, and each kernel consists of five thread blocks (C1-C5), each accessing exactly one data page (P1-P5, respectively); 3) the GPU memory capacity is three data pages. We first analyze execution under the baseline configuration shown in Figure 7: the kernels launch in order and their thread blocks are dispatched in order to the execution units. Initially, the requested pages are loaded into the not-yet-full GPU memory. However, because of the limited capacity and the load-time-based least-recently-used replacement policy, the pages loaded earliest are replaced by newly accessed pages, so page faults occur throughout the entire execution and cause a large performance loss, consistent with the results shown in Figure 1.
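
The baseline walkthrough can be reproduced with a short, runnable sketch. The code below models exactly the assumptions listed above (one thread block at a time, five kernels over pages P1-P5, a three-page memory, eviction of the earliest-loaded page) and confirms that every access faults under the baseline configuration.

```python
from collections import deque

PAGES, CAPACITY, KERNELS = [1, 2, 3, 4, 5], 3, 5    # P1-P5, 3-page memory

def run_baseline():
    memory = deque()                 # driver's load-order list, head = oldest
    faults_per_kernel = []
    for _ in range(KERNELS):         # kernels A-E launch back to back
        faults = 0
        for page in PAGES:           # thread blocks C1-C5 issued in order
            if page not in memory:
                faults += 1
                if len(memory) == CAPACITY:
                    memory.popleft()         # evict the earliest-loaded page
                memory.append(page)          # record the new page's load time
            # a hit does not update the load-order list
        faults_per_kernel.append(faults)
    return faults_per_kernel


print(run_baseline())                # [5, 5, 5, 5, 5]: every access faults
```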

The execution process under reverse allocation is shown in Figure 8. Unlike in Figure 7, the thread block dispatch order alternates from kernel to kernel. We make two observations. First, compared with the baseline execution, the page fault rate at kernel boundaries is very low, which matches the results in Figure 5. Second, for odd-numbered kernel executions, the data retained in GPU memory is not fully utilized, which explains the performance gap between reverse allocation and Oracle. In short, although reverse allocation improves performance considerably, reaching performance close to Oracle with it alone is impossible; coordinating the data page replacement policy with reverse allocation is the key to closing this gap.

Odd-numbered kernel executions do not fully exploit the shared data in memory. Figure 10 shows the basic principle of the default data page replacement policy: the GPU driver on the host maintains a linear list that records the order in which data pages migrate from CPU memory to GPU memory. Although this list does not reflect the access order of the pages, it has lower overhead than an ideal least-recently-used policy. Because reverse allocation switches the thread block dispatch order, the basic replacement policy no longer cooperates well with it. As shown in Figure 8, under the load-time-based least-recently-used policy, data page 3 (P3) rather than data page 5 (P5) is evicted to CPU memory while thread block C2 of kernel B executes. For kernel B, however, it would be preferable to evict data page 5 (P5), so that subsequent kernels could fully exploit the shared data.

To solve this problem, we propose a data page replacement policy called reverse replacement (mark a in Figure 3). The cooperative inversion strategy builds on thread block reverse allocation by introducing a reverse load-time-based least-recently-used data page replacement policy, shown in Figure 11. Compared with the default policy, this new policy switches the direction in which data pages are allocated and evicted. As shown at mark a in Figure 3, whenever a new kernel is launched, the GPU driver on the host side selects the corresponding data page replacement policy according to the inversion flag in the command processor: when the flag is 0, the forward data page replacement policy shown in Figure 10 is used; otherwise, the reverse data page replacement policy shown in Figure 11 is used, which better matches the reverse allocation mechanism and makes full use of the shared data.
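
The two eviction directions over the driver's load-order list can be sketched as follows; the class and its interface are illustrative assumptions. The forward policy appends newly loaded pages at the tail and evicts at the head (the earliest load), while the reverse policy inserts at the head and evicts at the tail, switching both directions as described above.

```python
from collections import deque

class LoadOrderList:
    """Illustrative model of the driver's page list ordered by load time."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.pages = deque()                 # head = earliest-loaded page

    def access(self, page: int, reverse: bool) -> bool:
        """Handle one page access; returns True on a page fault."""
        if page in self.pages:
            return False                     # hit: load order is unchanged
        if len(self.pages) == self.capacity:
            if reverse:
                self.pages.pop()             # reverse policy: evict at tail
            else:
                self.pages.popleft()         # forward policy: evict at head
        if reverse:
            self.pages.appendleft(page)      # reverse policy: insert at head
        else:
            self.pages.append(page)          # forward policy: insert at tail
        return True
```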

Figure 9 shows the execution of the simple test program introduced above when the cooperative inversion mechanism, which applies the reverse allocation mechanism and the reverse replacement mechanism together, is used. As expected, data page 5 (P5) is evicted to CPU memory while thread block C2 of kernel B executes. Consequently, when kernel C executes, all the data previously retained in GPU memory is fully used, and no page faults occur at the kernel boundary. Figure 6 shows the data page access pattern and page fault rate of the FFT program under the cooperative inversion strategy, from which we make two observations. First, the performance of the odd-numbered kernels improves greatly, yielding better performance than reverse allocation alone. Second, more shared data is reused, so fewer data page faults occur at kernel boundaries. We conclude that reusing shared data effectively reduces data page faults at kernel boundaries.
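
Extending the baseline sketch to flip both the dispatch order and the eviction direction at every launch reproduces the walkthrough above on the five-kernel toy example. The per-kernel fault counts in the comments come from running this toy model, not from the patent's measurements: with reverse allocation alone the forward-order kernels still fault mid-run, while cooperative inversion leaves only kernel A's cold misses plus two capacity misses per later kernel and no faults at kernel boundaries.

```python
from collections import deque

PAGES, CAPACITY, KERNELS = [1, 2, 3, 4, 5], 3, 5

def run(flip_replacement: bool):
    mem = deque()                            # load-order list, head = oldest
    per_kernel_faults = []
    flag = 0                                 # inversion flag, flipped per launch
    for _ in range(KERNELS):
        order = PAGES if flag == 0 else PAGES[::-1]     # reverse allocation
        faults = 0
        for page in order:
            if page not in mem:
                faults += 1
                if len(mem) == CAPACITY:
                    if flip_replacement and flag == 1:
                        mem.pop()            # reverse replacement: evict tail
                    else:
                        mem.popleft()        # forward replacement: evict head
                if flip_replacement and flag == 1:
                    mem.appendleft(page)     # reverse replacement: insert head
                else:
                    mem.append(page)         # forward replacement: insert tail
        per_kernel_faults.append(faults)
        flag ^= 1                            # the flag flips at each launch
    return per_kernel_faults


print(run(flip_replacement=False))   # [5, 2, 3, 2, 3]: reverse allocation only
print(run(flip_replacement=True))    # [5, 2, 2, 2, 2]: cooperative inversion
```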

In this embodiment, GPGPU-Sim v4.0.0 was extended for evaluation. Table 1 lists the configuration of the GPU system, including cores and memory. We carefully model the on-demand migration of data between CPU memory and GPU memory, setting the page fault handling latency to a most optimistic 20 microseconds. If GPU memory is full, the GPU driver evicts the corresponding data pages according to the applicable replacement policy. To experiment under different degrees of memory oversubscription, we set the GPU memory capacity to a fraction (75%-95%) of the footprint of each application. We selected 12 applications from benchmark suites including the CUDA SDK, Rodinia, ISPASS, and PolyBench; their footprints range from 1 MB to 96 MB, with an average of 18.5 MB. Limited simulation speed prevented us from simulating applications with larger footprints.

Table 1. Basic simulator configuration (the table is reproduced as an image in the original document).


To evaluate the effectiveness of our method, we implemented the state-of-the-art technique ETC. Figure 12 compares the performance of the applications under reverse allocation, cooperative inversion, ETC, and Oracle, using the baseline configuration as the reference. We make three observations. First, ETC shows the weakest optimization capability, improving performance by 15% on average; compared with ETC, reverse allocation and cooperative inversion deliver 20.2% and 65% higher performance, respectively. Second, for almost all applications, cooperative inversion achieves performance close to Oracle, mainly because most of the data shared between kernels is effectively utilized. Finally, for the BFS and FWT programs, cooperative inversion does not improve performance much, because the kernels of these two programs have irregular data access patterns that cooperative inversion cannot capture and exploit well. In summary, applications with extensive inter-kernel data sharing suffer a severe performance loss under memory oversubscription; the method of this embodiment therefore proposes thread block and data page cooperative scheduling, aiming to reduce the impact of memory oversubscription on system performance by reusing shared data.

In summary, although the latest GPUs are equipped with ever larger memories, they still cannot hold the entire working set of large applications. With the support of unified virtual memory and on-demand page placement, such large programs can still run without active intervention from the programmer, but this convenience incurs a performance cost. We find that, despite extensive research on reducing the massive page migration between the CPU and the GPU, applications with inter-kernel data sharing still see little improvement. This embodiment therefore proposes a thread block and data page cooperative scheduling method that mitigates the impact of memory oversubscription in a manner transparent to the programmer. The basic principle of the storage computing collaborative scheduling method for GPU data reuse of this embodiment is to use the data shared between kernels effectively by coordinating the thread block dispatch order with the data page replacement order; experiments show that a system using this method performs 65% better than the latest research method.

此外,本实施例还提供一种面向GPU数据重用的存储计算协同调度系统,包括:In addition, this embodiment also provides a GPU data reuse-oriented storage computing collaborative scheduling system, including:

an inversion flag management program unit, configured to detect whether a kernel is launched and, if so, to flip that kernel's inversion flag;

a data page replacement program unit, configured to, in the GPU driver and for data page replacement of the kernel, select in turn, according to the kernel's inversion flag, one of two preset data page replacement policies, a forward data page replacement policy and a reverse data page replacement policy, to select the GPU-side data page to evict from the GPU-side data page queue, the forward and reverse data page replacement policies selecting GPU-side data pages in opposite directions; and

a thread block selection program unit, configured to, in the GPU's thread block scheduler and for thread block scheduling of the kernel, select in turn, according to the kernel's inversion flag, one of two preset thread block dispatch policies, a forward thread block dispatch policy and a reverse thread block dispatch policy, to select the next thread block to issue from the pending thread block queue, the forward and reverse thread block dispatch policies selecting thread blocks in opposite directions.

In addition, this embodiment provides a storage computing collaborative scheduling system for GPU data reuse, comprising:

a CPU, configured to detect whether a kernel is launched and, if so, to flip that kernel's inversion flag, and further configured to, in the GPU driver and for data page replacement of the kernel, select in turn, according to the kernel's inversion flag, one of two preset data page replacement policies, a forward data page replacement policy and a reverse data page replacement policy, to select the GPU-side data page to evict from the GPU-side data page queue, the forward and reverse data page replacement policies selecting GPU-side data pages in opposite directions; and

a GPU, configured to, in its thread block scheduler and for thread block scheduling of the kernel, select in turn, according to the kernel's inversion flag, one of two preset thread block dispatch policies, a forward thread block dispatch policy and a reverse thread block dispatch policy, to select the next thread block to issue from the pending thread block queue, the forward and reverse thread block dispatch policies selecting thread blocks in opposite directions;

the CPU and the GPU being connected to each other.

In addition, this embodiment provides a storage computing collaborative scheduling system for GPU data reuse, comprising a processing unit and a memory connected to each other, the processing unit being programmed or configured to perform the steps of the storage computing collaborative scheduling method for GPU data reuse.

In addition, this embodiment provides a computer-readable storage medium storing a computer program programmed or configured to perform the storage computing collaborative scheduling method for GPU data reuse.

Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code. The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

The above are only preferred embodiments of the present invention, and the scope of protection of the present invention is not limited to the above examples; all technical solutions within the concept of the present invention fall within its scope of protection. It should be noted that, for those of ordinary skill in the art, several improvements and refinements made without departing from the principles of the present invention shall also be regarded as falling within the scope of protection of the present invention.

Claims (10)

1. A storage computing collaborative scheduling method for GPU data reuse, characterized by comprising the following steps:
1) under the conditions that the current program oversubscribes the GPU memory capacity and data sharing exists between kernel programs, detecting whether a kernel program is launched, and flipping the inversion flag of the kernel program if it is launched;
2) in the GPU driver, selecting in turn, according to the inversion flag of the kernel program, one of two preset data page replacement policies, namely a forward data page replacement policy and a reverse data page replacement policy, to select a GPU-side data page from the GPU-side data page queue for replacement, wherein the forward data page replacement policy and the reverse data page replacement policy select data pages in opposite directions; in a thread block scheduler of the GPU, for thread block scheduling of the kernel program, selecting in turn, according to the inversion flag of the kernel program, one of two preset thread block dispatch policies, namely a forward thread block dispatch policy and a reverse thread block dispatch policy, to select the thread block to issue from the pending thread block queue, wherein the forward thread block dispatch policy and the reverse thread block dispatch policy select thread blocks in opposite directions; and the data page replacement policies introduce, on the basis of the GPU's reverse thread block dispatch, a reverse data page replacement policy that is a reverse load-time-based least-recently-used policy, so that the allocation and eviction direction of data pages is switched relative to the forward data page replacement policy.
2. The storage-computing cooperative scheduling method for GPU data reuse according to claim 1, wherein flipping the reversal flag of the kernel program in step 1) comprises: first detecting whether a reversal flag exists for the kernel program; if no reversal flag exists, initializing a reversal flag for the kernel program; if a reversal flag already exists, flipping the reversal flag of the kernel program.
3. The storage-computing cooperative scheduling method for GPU data reuse according to claim 2, wherein when the reversal flag is initialized, the initial value of the reversal flag is 0 or 1.
4. The storage-computing cooperative scheduling method for GPU data reuse according to claim 2, wherein flipping the reversal flag of the kernel program means: the reversal flag of the kernel program is changed from 0 to 1 if its original value is 0, and from 1 to 0 if its original value is 1.
5. The storage-computing cooperative scheduling method for GPU data reuse according to claim 1, wherein in step 2), when one of the two preset data page replacement policies, namely the forward data page replacement policy and the reverse data page replacement policy, is selected in turn according to the reversal flag of the kernel program to select a GPU-side data page from the GPU-side data page queue for replacement, the forward data page replacement policy is selected if the reversal flag is 0, and the reverse data page replacement policy is selected if the reversal flag is 1.
6. The storage-computing cooperative scheduling method for GPU data reuse according to claim 1, wherein in step 2), when one of the two preset thread block dispatch policies, namely the forward thread block dispatch policy and the reverse thread block dispatch policy, is selected in turn according to the reversal flag, the forward thread block dispatch policy is selected if the reversal flag is 0, and the reverse thread block dispatch policy is selected if the reversal flag is 1.
7. The storage-computing cooperative scheduling method for GPU data reuse according to claim 1, wherein the current program oversubscribing the GPU memory capacity in step 1) means: when the GPU driver on the host monitors a data page request generated by the GPU, if the length of the data page queue maintained by the GPU driver has already reached the GPU memory capacity, a data page in the GPU memory needs to be evicted into the CPU memory according to the data page replacement policy, and a memory oversubscription flag is set to indicate that memory oversubscription has occurred in the current program.
8. The storage-computing cooperative scheduling method for GPU data reuse according to claim 1, wherein the data sharing between kernel programs in step 1) means: judging whether kernel programs share the same pointer according to compile-time information, taking the pointer as an indicator of data sharing between kernel programs, and assigning a data sharing flag to each kernel program to be started to indicate whether the kernel program shares data with the preceding kernel program; when a kernel program is started, judging whether the memory oversubscription flag bit and the data sharing flag bit corresponding to the kernel program are both 1, and if so, determining that the cooperative scheduling method is needed.
9. A storage-computing cooperative scheduling system for GPU data reuse, comprising a processing unit and a memory connected to each other, wherein the processing unit is programmed or configured to perform the steps of the storage-computing cooperative scheduling method for GPU data reuse according to any one of claims 1 to 8.
10. A computer-readable storage medium having stored therein a computer program programmed or configured to perform the storage-computing cooperative scheduling method for GPU data reuse according to any one of claims 1 to 8.
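To make the flag mechanism of claims 2 to 6 concrete, the following C++ sketch shows one possible way to maintain a per-kernel reversal flag and map it to a scheduling policy. This is an illustration only, not code from the patent; the names KernelScheduler, Policy, and onKernelLaunch are hypothetical.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical names; the patent does not prescribe an API.
enum class Policy { Forward, Reverse };

class KernelScheduler {
public:
    // Claims 2-4: on each launch, initialize the kernel's reversal flag
    // if it does not exist yet, otherwise flip it (0 -> 1, 1 -> 0).
    Policy onKernelLaunch(std::uint64_t kernelId) {
        auto it = flags_.find(kernelId);
        if (it == flags_.end()) {
            flags_[kernelId] = 0;              // claim 3: initial value 0 (or 1)
        } else {
            it->second = it->second ? 0 : 1;   // claim 4: flip the flag
        }
        // Claims 5 and 6: flag 0 selects the forward policy, flag 1 the
        // reverse policy, for both page replacement and block dispatch.
        return flags_[kernelId] == 0 ? Policy::Forward : Policy::Reverse;
    }

private:
    std::unordered_map<std::uint64_t, std::uint8_t> flags_;  // one reversal flag per kernel
};
```

Because the flag flips on every repeated launch of the same kernel, consecutive launches walk the data page queue and the thread block queue in opposite directions, so the pages left resident in GPU memory at the end of one kernel are exactly the ones the next kernel touches first.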
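Claim 7's oversubscription check and the direction-switched eviction of claims 1 and 5 might be modeled as in the sketch below. The deque ordered by loading time and the function names are assumptions; the claims fix only the behavior: once the driver-maintained page queue reaches GPU memory capacity, a page is evicted to CPU memory from one end or the other according to the selected policy, and the oversubscription flag is set.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

enum class Policy { Forward, Reverse };   // as in the previous sketch

struct DataPage { std::uint64_t addr; };

class PageQueue {
public:
    explicit PageQueue(std::size_t gpuCapacityPages)
        : capacity_(gpuCapacityPages) {}

    // Claim 7: on a GPU data page request, if the driver-maintained queue
    // has reached GPU memory capacity, evict a page to CPU memory and set
    // the memory oversubscription flag.
    bool onPageRequest(const DataPage& page, Policy policy) {
        if (pages_.size() >= capacity_) {
            oversubscribed_ = true;
            if (policy == Policy::Forward) {
                // Forward LRU: evict the page loaded earliest.
                evictToCpu(pages_.front());
                pages_.pop_front();
            } else {
                // Claim 1: the reverse policy evicts in reverse order of
                // loading time, switching the allocation/eviction direction.
                evictToCpu(pages_.back());
                pages_.pop_back();
            }
        }
        pages_.push_back(page);   // queue stays ordered by loading time
        return oversubscribed_;
    }

private:
    void evictToCpu(const DataPage& /*page*/) { /* copy the page to host memory */ }

    std::deque<DataPage> pages_;
    std::size_t capacity_;
    bool oversubscribed_ = false;
};
```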
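Finally, the compile-time data sharing test of claim 8 reduces to checking whether two consecutive kernels reference the same pointer. A minimal sketch follows, assuming the compiler has already recorded each kernel's pointer arguments; KernelInfo and its field names are invented for the example.

```cpp
#include <unordered_set>
#include <vector>

// Claim 8: compile-time information records which pointers each kernel
// uses; a kernel that shares a pointer with the preceding kernel is
// marked as data-sharing. KernelInfo and its fields are invented names.
struct KernelInfo {
    std::vector<const void*> pointerArgs;  // pointers known at compile time
    bool sharesData = false;               // data sharing flag
};

// Set the data sharing flag of `next` by comparing its pointers with
// those of the previously launched kernel.
void markDataSharing(const KernelInfo& prev, KernelInfo& next) {
    std::unordered_set<const void*> prevPtrs(prev.pointerArgs.begin(),
                                             prev.pointerArgs.end());
    next.sharesData = false;
    for (const void* p : next.pointerArgs) {
        if (prevPtrs.count(p) != 0) {
            next.sharesData = true;
            break;
        }
    }
}

// Claim 8, second half: the cooperative scheduling path is taken only if
// both flag bits are 1 (memory oversubscription and data sharing).
bool needsCooperativeScheduling(bool oversubscribed, const KernelInfo& k) {
    return oversubscribed && k.sharesData;
}
```

When a kernel launches with both its memory oversubscription flag and its data sharing flag set, the cooperative path with alternating policies is taken; otherwise the ordinary forward policies apply.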
CN202110649358.7A 2021-06-10 2021-06-10 A storage computing collaborative scheduling method and system for GPU data reuse Active CN113377538B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110649358.7A 2021-06-10 2021-06-10 A storage computing collaborative scheduling method and system for GPU data reuse

Publications (2)

Publication Number Publication Date
CN113377538A (en) 2021-09-10
CN113377538B (en) 2023-06-20

Family

ID=77573750

Country Status (1)

Country Link
CN (1) CN113377538B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114691339A (en) * 2022-04-12 2022-07-01 UnionTech Software Technology Co., Ltd. Process scheduling method and computing device
CN115361451B (en) * 2022-10-24 2023-03-24 National University of Defense Technology A network communication parallel processing method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103365796A (en) * 2012-04-05 2013-10-23 Siemens AG Volume rendering on shared memory systems with multiple processors by optimizing cache reuse
CN112181689A (en) * 2020-09-30 2021-01-05 East China Normal University Runtime system for efficiently scheduling GPU kernel under cloud
CN112801849A (en) * 2019-11-14 2021-05-14 Intel Corporation Method and apparatus for scheduling thread order to improve cache efficiency

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11016861B2 (en) * 2019-04-11 2021-05-25 International Business Machines Corporation Crash recoverability for graphics processing units (GPU) in a computing environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant