CN117742679A - Kernel fusion method and system based on deep neural network - Google Patents
- Publication number: CN117742679A
- Application number: CN202311724858.8A
- Authority: CN (China)
- Prior art keywords: intermediate code, code file, host-side intermediate, fused
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The present invention provides a kernel fusion method based on a deep neural network, comprising: compiling source code into a host-side intermediate code file and a device-side intermediate code file through a compilation framework, and inputting the two files into a fusion framework to generate a fused device-side intermediate code file; optimizing and compiling the fused device-side intermediate code file to obtain a host-side intermediate code file carrying device-side information; inputting the host-side intermediate code file carrying device-side information and the device-side intermediate code file into the fusion framework to generate a fused host-side intermediate code file; and optimizing and compiling the fused host-side intermediate code file to obtain a corresponding executable file. The present invention further provides a kernel fusion system based on a deep neural network, a storage medium, and an electronic device. Thereby, the present invention can reduce performance overhead and improve the utilization of parallel resources, improving the inference performance of a deep neural network system.
Description
Technical field
The present invention relates to the technical field of artificial-intelligence neural networks, and in particular to a kernel fusion method, system, storage medium, and electronic device based on a deep neural network.
Background
In recent years, DNNs (Deep Neural Networks) have been widely applied in many fields, including speech recognition, image classification, neural machine translation, and autonomous driving. Given the widespread deployment of DNN models and the huge demand for their efficient inference, both industry and academia have optimized DNN models from multiple angles, including hardware, software, and algorithms, proposing a series of DNN-oriented hardware accelerators, acceleration libraries, compilers, programming frameworks, and algorithms.
Optimization techniques at the programming-framework and compiler level include the graph-based TACO (Tencent's AI computing acceleration suite) and APOLLO (Baidu's open autonomous-driving platform). TACO efficiently replaces operators in the computation graph with equivalent ones, thereby improving performance. APOLLO partitions the computation graph into subgraphs and performs kernel fusion through analysis, improving inference and training performance.
Existing operator fusion techniques include pattern-based fusion, rule-based fusion, and fusion based on program dependence analysis. In pattern- or rule-based schemes, users must pre-define the patterns or rules that may be fused, so newly emerging operators are not handled well. Schemes based on program dependence analysis demand powerful analysis capabilities from the compiler, which is a major challenge, and the compiler may not accurately identify every operator that could be fused. Choosing a fusion scheme also requires a tuning process, which is itself very time-consuming. Moreover, when kernels have global dependencies between them, existing fusion frameworks never fuse them, even though research has found that fusing kernels with global dependencies can still yield performance gains. For example, Rammer, also known as NNFusion, is a well-known industrial DNN compiler; during compilation it merges operators horizontally into different thread blocks so that multiple operators execute in parallel, and it also explores inter-operator and intra-operator co-scheduling. Apollo is currently the most advanced operator fusion framework for neural-network inference optimization; it considers the fusion of both memory-bound and compute-bound operators, uses hand-written rules for operator fusion, and also optimizes the parallel execution of independent operators.
Existing DNN operator fusion techniques either require users to pre-define fusion patterns or rules, or require the compiler to have powerful analysis capabilities. However, predefined patterns or rules cannot cover all operators, and the compiler's analysis capabilities are imperfect and cannot identify every fusible operator.
In summary, the existing technology is clearly inconvenient and deficient in actual use, and improvement is necessary.
Summary of the invention
In view of the above defects, the purpose of the present invention is to provide a kernel fusion method, system, storage medium, and electronic device based on a deep neural network that can reduce performance overhead and improve the utilization of parallel resources, thereby improving the inference performance of a deep neural network system.
To solve the above technical problems, the present invention is implemented as follows.
In a first aspect, an embodiment of the present invention provides a kernel fusion method based on a deep neural network, including:
compiling source code, through a compilation framework, into a host-side intermediate code file and a device-side intermediate code file corresponding to the host side and the device side respectively;
inputting the host-side intermediate code file and the device-side intermediate code file as parameters into a fusion framework to generate a fused device-side intermediate code file;
optimizing and compiling the fused device-side intermediate code file through the compilation framework to obtain a host-side intermediate code file carrying device-side information;
inputting the host-side intermediate code file carrying device-side information and the device-side intermediate code file as parameters into the fusion framework to generate a fused host-side intermediate code file;
optimizing and compiling the fused host-side intermediate code file through the compilation framework to obtain a corresponding executable file.
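Although the method above operates on intermediate code, the effect of fusing kernels can be illustrated at the CUDA source level. The following sketch uses hypothetical kernel names and is not the patent's generated code; it shows two dependent element-wise kernels and the single kernel that their fusion amounts to:

```cuda
#include <cuda_runtime.h>

// Two unfused kernels: each launch pays kernel-launch overhead, and the
// intermediate buffer y must round-trip through global memory.
__global__ void scale_kernel(const float* x, float* y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i];
}
__global__ void bias_kernel(const float* y, float* z, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = y[i] + b;
}

// Fused kernel in the spirit of the first aspect: one launch replaces two,
// and the intermediate value stays in a register.
__global__ void fused_scale_bias_kernel(const float* x, float* z,
                                        float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = a * x[i] + b;  // scale and bias in one pass
}
```

Beyond saving one kernel launch, the fused version avoids materializing the intermediate result in global memory, which is the source of the launch-overhead and data-transfer savings the method targets.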
According to the method of the present invention, the step of optimizing and compiling the fused device-side intermediate code file through the compilation framework to obtain a host-side intermediate code file carrying device-side information includes:
optimizing the fused device-side intermediate code file through the compilation framework to generate an optimized device-side intermediate code file;
compiling the optimized device-side intermediate code file through the compilation framework to generate a fused device-side binary file;
compiling the source code and the fused device-side binary file through the compilation framework to generate the host-side intermediate code file carrying device-side information.
According to the method of the present invention, the host-side intermediate code file carrying device-side information is a host-side intermediate code file carrying device-side binary information.
According to the method of the present invention, the step of generating the fused host-side intermediate code file includes:
when generating the fused host-side intermediate code file, extracting the grid, thread-block, and thread dimension information of each kernel launch;
determining the new grid dimension information of the fused kernel according to the fusion mode used;
setting the new dimension information in the fused host-side intermediate code file;
generating a new name for the fused host-side kernel, and then generating the intermediate code of the call instruction according to the new name.
According to the method of the present invention, the step of optimizing and compiling the fused host-side intermediate code file through the compilation framework to obtain the corresponding executable file includes:
optimizing the fused host-side intermediate code file through the compilation framework to generate an optimized host-side intermediate code file;
compiling the optimized host-side intermediate code file through the compilation framework to generate the executable file.
According to the method of the present invention, global synchronization primitives are inserted into the fused device-side intermediate code file to synchronize all threads of the kernel.
According to the method of the present invention, the compilation framework is the LLVM compilation framework; and/or
the executable file is an executable file for the Compute Unified Device Architecture (CUDA).
In a second aspect, an embodiment of the present invention provides a kernel fusion system based on a deep neural network constructed according to the above method, including:
an intermediate code compilation module, configured to compile source code, through a compilation framework, into a host-side intermediate code file and a device-side intermediate code file corresponding to the host side and the device side respectively;
a first fusion code module, configured to input the host-side intermediate code file and the device-side intermediate code file as parameters into a fusion framework to generate a fused device-side intermediate code file;
a first optimization and compilation module, configured to optimize and compile the fused device-side intermediate code file through the compilation framework to obtain a host-side intermediate code file carrying device-side information;
a second fusion code module, configured to input the host-side intermediate code file carrying device-side information and the device-side intermediate code file as parameters into the fusion framework to generate a fused host-side intermediate code file;
a second optimization and compilation module, configured to optimize and compile the fused host-side intermediate code file through the compilation framework to obtain a corresponding executable file.
In a third aspect, an embodiment of the present invention provides a storage medium storing a computer program for executing any one of the kernel fusion methods based on a deep neural network described above.
In a fourth aspect, an embodiment of the present invention provides an electronic device, including a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, where the processor, when executing the computer program, implements any one of the kernel fusion methods based on a deep neural network described above.
In the embodiments of the present invention, source code is first compiled through a compilation framework into a host-side intermediate code file and a device-side intermediate code file, which are input as parameters into a fusion framework to generate a fused device-side intermediate code file; the fused device-side intermediate code file is optimized and compiled to obtain a host-side intermediate code file carrying device-side information; the host-side intermediate code file carrying device-side information and the device-side intermediate code file are input into the fusion framework to generate a fused host-side intermediate code file; and the fused host-side intermediate code file is optimized and compiled to obtain a corresponding executable file. Thereby, the present invention proposes an operator fusion method that fuses all kernels in a deep neural network into a single kernel, including fused code generation at the intermediate code level, which can reduce performance overhead such as kernel launch overhead and data transfer overhead and improve the utilization of parallel resources, thereby improving the inference performance of a deep neural network system.
Brief description of the drawings
Figure 1 is a schematic flowchart of the kernel fusion method based on a deep neural network provided by Embodiment 1 of the present invention;
Figure 2 is a schematic flowchart of the kernel fusion method based on a deep neural network provided by Embodiment 2 of the present invention;
Figure 3 is a schematic structural diagram of the kernel fusion system based on a deep neural network provided by Embodiment 1 of the present invention;
Figure 4 is a schematic structural diagram of the kernel fusion system based on a deep neural network provided by Embodiment 2 of the present invention;
Figure 5 is a schematic diagram of the hardware structure of an electronic device provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it.
It should be noted that references in this specification to "one embodiment", "an embodiment", "an example embodiment", and so on indicate that the described embodiment may include particular features, structures, or characteristics, but not every embodiment necessarily includes them. Moreover, such expressions do not necessarily refer to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, whether or not it is explicitly described, it is within the knowledge of those skilled in the art to incorporate such a feature, structure, or characteristic into other embodiments.
In addition, certain words are used in the specification and the following claims to refer to particular components or parts. Those of ordinary skill in the art will understand that a manufacturer may use different nouns or terms for the same component or part. This specification and the following claims do not distinguish components or parts by differences in name, but by differences in function. The words "include" and "comprise" used throughout the specification and the following claims are open-ended terms and should therefore be interpreted as "including but not limited to". In addition, the word "connect" here covers any direct or indirect means of electrical connection; indirect electrical connection includes connection through other devices.
The kernel fusion method based on a deep neural network provided by the embodiments of the present invention is described in detail below with reference to the accompanying drawings, through specific embodiments and their application scenarios.
Figure 1 is a schematic flowchart of the kernel fusion method based on a deep neural network provided by Embodiment 1 of the present invention. The method includes the following steps.
Step S101: compile the source code, through a compilation framework, into a host-side intermediate representation (IR) file and a device-side IR file corresponding to the host side and the device side respectively.
Optionally, the compilation framework is the LLVM (Low Level Virtual Machine) compilation framework.
Step S102: input the host-side intermediate code file and the device-side intermediate code file as parameters into a fusion framework to generate a fused device-side intermediate code file.
Optionally, global synchronization primitives are inserted into the fused device-side intermediate code file to synchronize all threads of the kernel.
Step S103: optimize and compile the fused device-side intermediate code file through the compilation framework to obtain a host-side intermediate code file carrying device-side information.
Step S104: input the host-side intermediate code file carrying device-side information and the device-side intermediate code file as parameters into the fusion framework to generate a fused host-side intermediate code file.
Step S105: optimize and compile the fused host-side intermediate code file through the compilation framework to obtain a corresponding executable file.
Optionally, the executable file is an executable file for CUDA (Compute Unified Device Architecture).
In the kernel fusion method based on a deep neural network provided by this embodiment of the present invention, the fusion framework is implemented on top of the compilation framework and is responsible for code generation and optimization. First, the source code is compiled through the compilation framework into a host-side intermediate code file and a device-side intermediate code file, which are input as parameters into the fusion framework to generate a fused device-side intermediate code file; the fused device-side intermediate code file is optimized and compiled to obtain a host-side intermediate code file carrying device-side information; the host-side intermediate code file carrying device-side information and the device-side intermediate code file are input into the fusion framework to generate a fused host-side intermediate code file; and the fused host-side intermediate code file is optimized and compiled to obtain a corresponding executable file. Thereby, the present invention proposes an operator fusion method that fuses all kernels in a deep neural network into a single kernel, including fused code generation at the intermediate code level, which can reduce performance overhead such as kernel launch overhead and data transfer overhead and improve the utilization of parallel resources, thereby improving the inference performance of a deep neural network system.
Figure 2 is a schematic flowchart of the kernel fusion method based on a deep neural network provided by Embodiment 2 of the present invention. The present invention implements the fusion framework on top of the LLVM compilation framework, which is responsible for code generation and optimization. The fusion framework runs on an x86 machine, and the fused DNN program can run on an NVIDIA GPU. In this embodiment, an LSTM (Long Short-Term Memory) network implementation is used as an example to illustrate the basic idea of the present invention. The method includes the following steps.
Step S201: compile the source code, through the compilation framework, into a host-side intermediate code file and a device-side intermediate code file corresponding to the host side and the device side respectively.
Optionally, the compilation framework in this embodiment is the LLVM compilation framework. Specifically, the LLVM compilation framework compiles the C++ and CUDA (Compute Unified Device Architecture) source code into the host-side and device-side LLVM IR files lstm-host.ll and lstm-kernel.ll respectively.
Step S202: input the host-side intermediate code file and the device-side intermediate code file as parameters into the fusion framework to generate a fused device-side intermediate code file.
This step involves device-side fused code generation. Optionally, the host-side LLVM IR file lstm-host.ll and the device-side LLVM IR file lstm-kernel.ll are input as parameters into the fusion framework to obtain the fused device-side LLVM IR file fused-lstm-kernel.ll.
Optionally, when generating code for the device side, global synchronization primitives may need to be inserted into the fused device-side intermediate code file fused-lstm-kernel.ll to synchronize all threads of a single kernel, ensuring that all threads have produced the results they are responsible for before the computation of the next operator begins. Preferably, the global synchronization primitives can be implemented through the Cooperative Groups API of CUDA.
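As an illustration of such a primitive, the following device-code sketch uses the Cooperative Groups API to place a grid-wide barrier between two fused operators. It is illustrative only: the patent's framework emits the equivalent at the LLVM IR level, and the two operators here are hypothetical stand-ins.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void fused_kernel(float* buf, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) buf[i] = buf[i] * 2.0f;  // operator A writes its results

    // Grid-wide barrier: every thread of the kernel finishes operator A
    // before any thread starts operator B, which reads A's outputs.
    grid.sync();

    if (i < n) buf[i] = buf[i] + buf[(i + 1) % n];  // operator B reads across blocks
}
```

Note that grid.sync() is only valid when the kernel is launched through cudaLaunchCooperativeKernel and all thread blocks of the grid are co-resident on the device at the same time.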
Step S203: optimize the fused device-side intermediate code file through the compilation framework to generate an optimized device-side intermediate code file.
Optionally, traditional compiler optimizations, including constant propagation, are applied to the fused device-side LLVM IR file to obtain the optimized device-side LLVM IR file fused-lstm-kernel-opt.ll.
Step S204: compile the optimized device-side intermediate code file through the compilation framework to generate a fused device-side binary (fatbin) file.
Optionally, the LLVM compilation framework, ptxas, and fatbinary are used to compile the optimized device-side LLVM IR file into the fused device-side fatbin file fused-lstm-kernel.fatbin.
Step S205: compile the source code and the fused device-side binary file through the compilation framework to generate a host-side intermediate code file carrying device-side information.
Optionally, the host-side intermediate code file carrying device-side information is a host-side intermediate code file carrying device-side binary information. Preferably, LLVM is used to compile the C++/CUDA source code together with the fused device-side fatbin file into the host-side LLVM IR file carrying device-side fatbin information: lstm-fatbin-host.ll.
Step S206: input the host-side intermediate code file carrying device-side information and the device-side intermediate code file as parameters into the fusion framework to generate a fused host-side intermediate code file.
This step involves host-side fused code generation. Optionally, the host-side LLVM IR file carrying device-side fatbin information (lstm-fatbin-host.ll) and the device-side LLVM IR file (lstm-kernel.ll) are input into the fusion framework to generate the fused host-side LLVM IR file fused-lstm-fatbin-host.ll, which carries the device-side fatbin information.
Optionally, the step of generating the fused host-side intermediate code file includes:
(1) extracting the grid, thread-block, and thread dimension information of each kernel launch (LaunchKernel);
(2) determining the new grid dimension information of the fused kernel according to the fusion mode used;
(3) generating the fused kernel-launch intermediate code (LaunchKernel IR);
(4) setting the new dimension information of the fused kernel launch and supplying the new parameters, where the new parameters are the result of concatenating the original parameters in order;
(5) generating a new name for the fused kernel, and then generating the intermediate code of the call instruction according to the new name.
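Steps (1) to (5) above can be sketched on the host side as follows. This is illustrative C++/CUDA rather than the LLVM IR the fusion framework actually generates; fused_kernel, the grid-combination policy, and the argument names are assumptions made for the example.

```cuda
#include <cuda_runtime.h>

// Renamed fused kernel produced by the device-side pass (hypothetical).
extern __global__ void fused_kernel(float*, float*, int);

cudaError_t launch_fused(dim3 gridA, dim3 gridB, dim3 block,
                         float* argsA, float* argsB, int n) {
    // Step (2): one possible policy — for a horizontal fusion of two
    // independent launches, the new grid covers both original grids,
    // so their block counts are summed.
    dim3 fusedGrid(gridA.x + gridB.x, 1, 1);

    // Step (4): the fused parameter list is the original parameter
    // lists concatenated in order.
    void* fusedArgs[] = { &argsA, &argsB, &n };

    // Steps (3)/(5): a single launch of the renamed fused kernel; a
    // cooperative launch is used because the fused kernel may contain
    // grid-wide synchronization.
    return cudaLaunchCooperativeKernel((void*)fused_kernel, fusedGrid, block,
                                       fusedArgs, /*sharedMem=*/0,
                                       /*stream=*/0);
}
```

Other fusion modes would derive the new grid differently, which is why step (2) depends on the fusion mode chosen.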
Step S207: optimize the fused host-side intermediate code file through the compilation framework to generate an optimized host-side intermediate code file.
Preferably, traditional compiler optimizations, including constant propagation, are applied to the fused host-side LLVM IR file carrying device-side fatbin information to obtain the optimized host-side LLVM IR file fused-lstm-fatbin-host-opt.ll.
Step S208: compile the optimized host-side intermediate code file through the compilation framework to generate an executable file.
Optionally, the LLVM compilation framework is used to compile and link the optimized host-side LLVM IR file into the executable file for CUDA (Compute Unified Device Architecture): fused-lstm.
本发明的目的是解决现有技术的内核(kernel)融合方案中需要人工定义模式或规则;或需要精准的编译分析,在调优(tuning)阶段耗时过长;以及有全局依赖的内核不会被融合的问题。本发明提出了一种将DNN系统中的所有开销都融合为一个内核的算子融合方法,包括在中间代码级别上的融合代码生成,降低启动内核开销和数据传输开销,提高并行资源利用率,从而提升DNN系统的推理性能。The purpose of the present invention is to solve the problem that the existing technology's kernel fusion solution requires manual definition of modes or rules; or requires precise compilation and analysis, which takes too long in the tuning phase; and the kernel with global dependencies does not require issues that will be integrated. The present invention proposes an operator fusion method that integrates all overhead in the DNN system into one kernel, including fusion code generation at the intermediate code level, reducing startup kernel overhead and data transmission overhead, and improving parallel resource utilization. Thereby improving the reasoning performance of the DNN system.
对LSTM网络的不同算子融合方案进行评估后，发现LSTM在TensorRT系统下推理性能为6.30ms，在Rammer系统下推理性能为1.72ms。本发明将LSTM网络融合为一个内核，并进行了编译优化，推理性能为0.80ms，明显优于其他的融合方案。After evaluating different operator fusion schemes for the LSTM network, it was found that LSTM inference took 6.30 ms under the TensorRT system and 1.72 ms under the Rammer system. The present invention fuses the LSTM network into a single kernel and applies compilation optimizations, achieving an inference time of 0.80 ms, which is significantly better than the other fusion schemes.
需要说明的是，本发明实施例提供的基于深度神经网络的内核融合方法，执行主体可以为电子设备、基于深度神经网络的内核融合系统，或者该基于深度神经网络的内核融合系统中的用于执行基于深度神经网络的内核融合方法的控制模块。本发明实施例中以基于深度神经网络的内核融合系统执行基于深度神经网络的内核融合方法为例，说明本发明实施例提供的基于深度神经网络的内核融合系统。It should be noted that the deep neural network-based kernel fusion method provided by the embodiments of the present invention may be executed by an electronic device, by a deep neural network-based kernel fusion system, or by a control module within that system used to execute the deep neural network-based kernel fusion method. In the embodiments of the present invention, a deep neural network-based kernel fusion system executing the deep neural network-based kernel fusion method is taken as an example to describe the deep neural network-based kernel fusion system provided by the embodiments of the present invention.
图3是本发明实施例一提供的基于深度神经网络的内核融合系统的结构示意图，所述系统100至少包括中间代码编译模块10、第一融合代码模块20、第一优化编译模块30、第二融合代码模块40、第二优化编译模块50，其中：Figure 3 is a schematic structural diagram of the deep neural network-based kernel fusion system provided in Embodiment 1 of the present invention. The system 100 includes at least an intermediate code compilation module 10, a first fusion code module 20, a first optimization compilation module 30, a second fusion code module 40, and a second optimization compilation module 50, wherein:
所述中间代码编译模块10,用于通过编译框架将源码分别编译为主机端、设备端对应的主机端中间代码文件和设备端中间代码文件。优选的,所述编译框架为LLVM编译框架。The intermediate code compilation module 10 is used to compile the source code into host-side intermediate code files and device-side intermediate code files corresponding to the host side and the device side through the compilation framework. Preferably, the compilation framework is an LLVM compilation framework.
所述第一融合代码模块20,用于将主机端中间代码文件和设备端中间代码文件作为参数输入到融合框架,生成融合后的设备端中间代码文件。The first fusion code module 20 is used to input the host-side intermediate code file and the device-side intermediate code file as parameters into the fusion framework, and generate a fused device-side intermediate code file.
所述第一优化编译模块30,用于通过编译框架将融合后的设备端中间代码文件进行优化和编译,得到带有设备端信息的主机端中间代码文件。The first optimization and compilation module 30 is used to optimize and compile the merged device-side intermediate code file through a compilation framework to obtain a host-side intermediate code file with device-side information.
所述第二融合代码模块40,将带有设备端信息的主机端中间代码文件和设备端中间代码文件作为参数输入到融合框架,生成融合后的主机端中间代码文件。The second fusion code module 40 inputs the host-side intermediate code file with device-side information and the device-side intermediate code file as parameters into the fusion framework, and generates the fused host-side intermediate code file.
所述第二优化编译模块50,用于通过编译框架将融合后的主机端中间代码文件进行优化和编译,得到对应的可执行文件。优选的,所述可执行文件为应用于统一计算设备架构的可执行文件。The second optimization and compilation module 50 is used to optimize and compile the merged host-side intermediate code file through the compilation framework to obtain a corresponding executable file. Preferably, the executable file is an executable file applied to a unified computing device architecture.
图4是本发明实施例二提供的基于深度神经网络的内核融合系统的结构示意图，所述系统100至少包括中间代码编译模块10、第一融合代码模块20、第一优化编译模块30、第二融合代码模块40、第二优化编译模块50，其中：Figure 4 is a schematic structural diagram of the deep neural network-based kernel fusion system provided in Embodiment 2 of the present invention. The system 100 includes at least an intermediate code compilation module 10, a first fusion code module 20, a first optimization compilation module 30, a second fusion code module 40, and a second optimization compilation module 50, wherein:
所述中间代码编译模块10,用于通过编译框架将源码分别编译为主机端、设备端对应的主机端中间代码文件和设备端中间代码文件。优选的,所述编译框架为LLVM编译框架。The intermediate code compilation module 10 is used to compile the source code into host-side intermediate code files and device-side intermediate code files corresponding to the host side and the device side through the compilation framework. Preferably, the compilation framework is an LLVM compilation framework.
所述第一融合代码模块20,用于将主机端中间代码文件和设备端中间代码文件作为参数输入到融合框架,生成融合后的设备端中间代码文件。优选的,融合后的设备端中间代码文件中插入有全局同步原语以对本内核的所有线程之间进行同步。The first fusion code module 20 is used to input the host-side intermediate code file and the device-side intermediate code file as parameters into the fusion framework, and generate a fused device-side intermediate code file. Preferably, a global synchronization primitive is inserted into the merged device-side intermediate code file to synchronize all threads of this kernel.
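全局同步原语的插入可用如下Python片段示意：在拼接各原内核的设备端中间代码时，于相邻内核体之间插入一条全网格同步指令。其中的同步指令名@global_sync为说明用的假设，实际的原语形式取决于目标GPU平台。The insertion of the global synchronization primitive can be sketched as follows: when concatenating the device-side intermediate code of the original kernels, a grid-wide synchronization instruction is inserted between consecutive kernel bodies. The name @global_sync is an assumption for illustration; the actual primitive depends on the target GPU platform.

```python
def fuse_kernel_bodies(bodies, sync_instr="call void @global_sync()"):
    """Concatenate device-side kernel bodies into one fused body,
    inserting a global synchronization primitive between consecutive
    kernels so that all threads of the fused kernel synchronize
    before the next original kernel's code runs."""
    fused = []
    for i, body in enumerate(bodies):
        if i > 0:
            fused.append(sync_instr)  # barrier across the whole grid
        fused.extend(body)
    return fused

k1 = ["%a = load ...", "store %a ..."]
k2 = ["%b = load ...", "store %b ..."]
fused = fuse_kernel_bodies([k1, k2])
# the sync instruction sits between the two original kernel bodies
```

这样，原先由内核边界隐式提供的全局依赖顺序，在融合后由显式同步原语保证。In this way, the global-dependency ordering that kernel boundaries used to provide implicitly is preserved after fusion by the explicit synchronization primitive.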
所述第一优化编译模块30,用于通过编译框架将融合后的设备端中间代码文件进行优化和编译,得到带有设备端信息的主机端中间代码文件。优选的,第一优化编译模块30进一步包括:The first optimization and compilation module 30 is used to optimize and compile the merged device-side intermediate code file through a compilation framework to obtain a host-side intermediate code file with device-side information. Preferably, the first optimization compilation module 30 further includes:
第一优化子模块31,用于通过编译框架将融合后的设备端中间代码文件进行优化,生成优化后的设备端中间代码文件。The first optimization sub-module 31 is used to optimize the fused device-side intermediate code file through the compilation framework and generate an optimized device-side intermediate code file.
第一编译子模块32,用于通过编译框架将优化后的设备端中间代码文件进行编译,生成融合后的设备端二进制文件。The first compilation sub-module 32 is used to compile the optimized device-side intermediate code file through the compilation framework to generate a merged device-side binary file.
第二编译子模块33,用于通过编译框架将源码和融合后的设备端二进制文件进行编译,生成带有设备端信息的主机端中间代码文件。优选的,所述带有设备端信息的主机端中间代码文件为带有设备端二进制信息的主机端中间代码文件。The second compilation sub-module 33 is used to compile the source code and the fused device-side binary file through the compilation framework to generate a host-side intermediate code file with device-side information. Preferably, the host-side intermediate code file with device-side information is a host-side intermediate code file with device-side binary information.
所述第二融合代码模块40，将带有设备端信息的主机端中间代码文件和设备端中间代码文件作为参数输入到融合框架，生成融合后的主机端中间代码文件。优选的，所述第二融合代码模块40用于在生成融合后的主机端中间代码文件时，提取每次启动内核时的网络、线程块、线程的维度信息。根据不同的融合方式确定融合后的内核的网络的新的维度信息；设置融合后的主机端中间代码文件的新的维度信息；生成融合后的主机端中间代码文件的新名字，然后根据新名字生成调用指令的中间代码。The second fusion code module 40 inputs the host-side intermediate code file with device-side information and the device-side intermediate code file as parameters into the fusion framework, and generates the fused host-side intermediate code file. Preferably, when generating the fused host-side intermediate code file, the second fusion code module 40 extracts the grid, thread-block, and thread dimension information of each kernel launch; determines the new grid dimension information of the fused kernel according to the fusion method used; sets the new dimension information in the fused host-side intermediate code file; and generates a new name for the fused host-side intermediate code file, then generates the intermediate code of the call instruction based on the new name.
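上述"根据不同的融合方式确定融合后内核的新维度信息"可用如下Python片段示意。此处的"horizontal/vertical"两种融合方式及其维度计算规则为说明用的简化假设，实际维度信息由融合框架从中间代码中提取并确定。The computation of the fused kernel's new launch dimensions under different fusion methods can be sketched as follows; the "horizontal"/"vertical" modes and their dimension rules here are simplifying assumptions for illustration, while the actual dimension information is extracted from the intermediate code by the fusion framework.

```python
def fused_launch_dims(launches, mode="horizontal"):
    """Given per-kernel (grid, block) launch dimensions, compute the
    fused kernel's launch dimensions. Under horizontal fusion the
    original grids run side by side (sum of block counts); under
    vertical fusion the kernels run back to back and the grid must
    cover the largest of them (max)."""
    grids = [g for g, _ in launches]
    blocks = [b for _, b in launches]
    new_grid = sum(grids) if mode == "horizontal" else max(grids)
    # The fused block size must accommodate every original kernel
    return new_grid, max(blocks)

launches = [(80, 256), (40, 128)]   # two original kernel launches
# horizontal: (120, 256); vertical: (80, 256)
```

确定新维度后，融合框架据此改写主机端中间代码中的内核启动参数。Once the new dimensions are determined, the fusion framework rewrites the kernel-launch parameters in the host-side intermediate code accordingly.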
所述第二优化编译模块50,用于通过编译框架将融合后的主机端中间代码文件进行优化和编译,得到对应的可执行文件。优选的,第二优化编译模块50进一步包括:The second optimization and compilation module 50 is used to optimize and compile the merged host-side intermediate code file through the compilation framework to obtain a corresponding executable file. Preferably, the second optimization compilation module 50 further includes:
第二优化子模块51,用于通过编译框架将融合后的主机端中间代码文件进行优化,生成优化后的主机端中间代码文件。The second optimization submodule 51 is used to optimize the merged host-side intermediate code file through a compilation framework to generate an optimized host-side intermediate code file.
第三编译子模块52,用于通过编译框架将优化后的主机端中间代码文件进行编译,生成可执行文件。优选的,所述可执行文件为应用于统一计算设备架构的可执行文件。The third compilation sub-module 52 is used to compile the optimized host-side intermediate code file through the compilation framework to generate an executable file. Preferably, the executable file is an executable file applied to a unified computing device architecture.
本发明实施例提供的基于深度神经网络的内核融合系统能够实现基于深度神经网络的内核融合方法实施例实现的各个过程，为避免重复，这里不再赘述。The deep neural network-based kernel fusion system provided by the embodiments of the present invention can implement the various processes implemented by the embodiments of the deep neural network-based kernel fusion method; to avoid repetition, they are not described again here.
本发明实施例提供的基于深度神经网络的内核融合系统，首先通过编译框架将源码分别编译为主机端中间代码文件和设备端中间代码文件，并作为参数输入到融合框架，生成融合后的设备端中间代码文件；将融合后的设备端中间代码文件进行优化和编译，得到带有设备端信息的主机端中间代码文件；将带有设备端信息的主机端中间代码文件和设备端中间代码文件输入到融合框架，生成融合后的主机端中间代码文件；将融合后的主机端中间代码文件进行优化和编译，得到对应的可执行文件。借此，本发明提出了一种将深度神经网络中的所有内核都融合为一个内核的算子融合系统，包括在中间代码级别上的融合代码生成，能够降低内核启动开销、数据传输开销等性能开销，提高并行资源利用率，从而提升深度神经网络系统的推理性能。The deep neural network-based kernel fusion system provided by the embodiments of the present invention first compiles the source code, through a compilation framework, into a host-side intermediate code file and a device-side intermediate code file, which are input as parameters into the fusion framework to generate a fused device-side intermediate code file; optimizes and compiles the fused device-side intermediate code file to obtain a host-side intermediate code file with device-side information; inputs the host-side intermediate code file with device-side information and the device-side intermediate code file into the fusion framework to generate a fused host-side intermediate code file; and optimizes and compiles the fused host-side intermediate code file to obtain the corresponding executable file. Thus, the present invention proposes an operator fusion system that fuses all kernels of a deep neural network into a single kernel, including fused code generation at the intermediate-code level, which can reduce performance overheads such as kernel launch overhead and data transfer overhead and improve parallel resource utilization, thereby improving the inference performance of deep neural network systems.
本发明还提供一种存储介质，用于存储如图1~图2所述任意一种基于深度神经网络的内核融合方法的计算机程序。例如计算机程序指令，当其被计算机执行时，通过该计算机的操作，可以调用或提供根据本发明的方法和/或技术方案，且能达到相同的技术效果，为避免重复，这里不再赘述。而调用本发明的方法的程序指令，可能被存储在固定的或可移动的存储介质中，和/或通过广播或其他信号承载媒体中的数据流而被传输和/或被存储在根据程序指令运行的计算机设备的存储介质中。The present invention also provides a storage medium for storing a computer program of any of the deep neural network-based kernel fusion methods shown in Figures 1 to 2. Such computer program instructions, when executed by a computer, can, through the operation of that computer, invoke or provide the methods and/or technical solutions according to the present invention and achieve the same technical effects; to avoid repetition, they are not described again here. The program instructions that invoke the method of the present invention may be stored in a fixed or removable storage medium, and/or transmitted as a data stream through broadcast or other signal-bearing media, and/or stored in the storage medium of a computer device that runs according to the program instructions.
根据本发明的一个实施例，本发明还提供一种如图5所示的电子设备400，所述电子设备400可选包括用于存储计算机程序的存储介质200和用于执行计算机程序的处理器300，其中，当该计算机程序被该处理器300执行时实现上述任一项基于深度神经网络的内核融合方法，触发该电子设备400执行基于前述多个实施例中的方法和/或技术方案，且能达到相同的技术效果，为避免重复，这里不再赘述。需要注意的是，本发明实施例中的电子设备包括移动电子设备和非移动电子设备。示例性的，移动电子设备可以为手机、平板电脑、笔记本电脑、掌上电脑、车载电子设备、可穿戴设备、超级移动个人计算机、上网本或者个人数字助理等，非移动电子设备可以为服务器、网络附属存储器(NetworkAttachedStorage,NAS)、个人计算机(personal computer,PC)、电视机(television,TV)、柜员机或者自助机等，本发明实施例不作具体限定。According to an embodiment of the present invention, the present invention also provides an electronic device 400 as shown in Figure 5. The electronic device 400 optionally includes a storage medium 200 for storing a computer program and a processor 300 for executing the computer program, wherein, when the computer program is executed by the processor 300, any of the above deep neural network-based kernel fusion methods is implemented, triggering the electronic device 400 to execute the methods and/or technical solutions of the foregoing embodiments and achieving the same technical effects; to avoid repetition, they are not described again here. It should be noted that the electronic devices in the embodiments of the present invention include mobile electronic devices and non-mobile electronic devices. For example, a mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer, a netbook, or a personal digital assistant; a non-mobile electronic device may be a server, network-attached storage (NAS), a personal computer (PC), a television (TV), a teller machine, or a self-service machine, which are not specifically limited in the embodiments of the present invention.
需要注意的是,本发明可在软件和/或软件与硬件的组合体中被实施,例如,可采用专用集成电路(ASIC)、通用目的计算机或任何其他类似硬件设备来实现。在一个实施例中,本发明的软件程序可以通过处理器执行以实现上文步骤或功能。同样地,本发明的软件程序(包括相关的数据结构)可以被存储到计算机可读记录介质中,例如,RAM存储器,磁或光驱动器或软磁盘及类似设备。另外,本发明的一些步骤或功能可采用硬件来实现,例如,作为与处理器配合从而执行各个步骤或功能的电路。It should be noted that the present invention may be implemented in software and/or a combination of software and hardware, for example, using an application specific integrated circuit (ASIC), a general purpose computer or any other similar hardware device. In one embodiment, the software program of the present invention can be executed by a processor to implement the above steps or functions. Likewise, the software program of the present invention (including associated data structures) may be stored in a computer-readable recording medium, such as a RAM memory, a magnetic or optical drive or a floppy disk and similar devices. In addition, some steps or functions of the present invention may be implemented in hardware, for example, as a circuit that cooperates with a processor to perform each step or function.
本发明可以作为计算机实现方法在计算机上实现、或者在专用硬件中实现、或以两者的组合的方式实现。用于根据本发明的方法的可执行代码或其部分可以存储在计算机程序产品上。计算机程序产品的示例包括存储器设备、光学存储设备、集成电路、服务器、在线软件等。可选的,计算机程序产品包括存储在计算机可读介质上以便当所述程序产品在计算机上执行时执行根据本发明的方法的非临时程序代码部件。The present invention can be implemented on a computer as a computer implementation method, or in dedicated hardware, or in a combination of both. Executable code for the method according to the invention or parts thereof may be stored on a computer program product. Examples of computer program products include memory devices, optical storage devices, integrated circuits, servers, online software, and the like. Optionally, the computer program product includes non-transitory program code means stored on a computer-readable medium for performing the method according to the invention when the program product is executed on a computer.
在可选实施例中,计算机程序包括适合于当计算机程序在计算机上运行时执行根据本发明的方法的所有步骤的计算机程序代码部件。可选地,在计算机可读介质上体现计算机程序。In an alternative embodiment, the computer program comprises computer program code means adapted to perform all steps of the method according to the invention when the computer program is run on a computer. Optionally, the computer program is embodied on a computer-readable medium.
需要说明的是，在本文中，术语"包括"、"包含"或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者装置不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者装置所固有的要素。在没有更多限制的情况下，由语句"包括一个……"限定的要素，并不排除在包括该要素的过程、方法、物品或者装置中还存在另外的相同要素。此外，需要指出的是，本发明实施方式中的方法和装置的范围不限按示出或讨论的顺序来执行功能，还可包括根据所涉及的功能按基本同时的方式或按相反的顺序来执行功能，例如，可以按不同于所描述的次序来执行所描述的方法，并且还可以添加、省去、或组合各种步骤。另外，参照某些示例所描述的特征可在其他示例中被组合。It should be noted that, in this document, the terms "comprising", "comprises", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the statement "comprises a..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes that element. In addition, it should be pointed out that the scope of the methods and apparatuses in the embodiments of the present invention is not limited to performing functions in the order shown or discussed; depending on the functions involved, functions may also be performed in a substantially simultaneous manner or in the reverse order. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
当然，本发明还可有其它多种实施例，在不背离本发明精神及其实质的情况下，熟悉本领域的技术人员当可根据本发明作出各种相应的改变和变形，但这些相应的改变和变形都应属于本发明所附的权利要求的保护范围。Of course, the present invention may have various other embodiments. Without departing from the spirit and essence of the present invention, those skilled in the art may make various corresponding changes and modifications according to the present invention, but all such changes and modifications shall fall within the protection scope of the appended claims of the present invention.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311724858.8A CN117742679A (en) | 2023-12-14 | 2023-12-14 | Kernel fusion method and system based on deep neural network |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN117742679A true CN117742679A (en) | 2024-03-22 |
Family
ID=90257152
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202311724858.8A Pending CN117742679A (en) | 2023-12-14 | 2023-12-14 | Kernel fusion method and system based on deep neural network |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN117742679A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118819543A (en) * | 2024-09-19 | 2024-10-22 | 山东浪潮科学研究院有限公司 | A joint compilation optimization method, device and medium for heterogeneous computing |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US8799871B2 (en) | Computation of elementwise expression in parallel | |
| WO2021000970A1 (en) | Deep learning algorithm compiling method, device, and related product. | |
| US10269087B2 (en) | Language translation using preprocessor macros | |
| US9841953B2 (en) | Pluggable components for runtime-image generation | |
| WO2021000971A1 (en) | Method and device for generating operation data and related product | |
| CN113805882A (en) | Method and device for developing application program, electronic equipment and storage medium | |
| CN112925587B (en) | Method and device for initializing applications | |
| US10521209B2 (en) | Machine-based normalization of machine instructions | |
| CN118034660B (en) | Graph compiling method and device for large language model fusion operator and storage medium | |
| US20210173641A1 (en) | Generation of knowledge graphs based on repositories of code | |
| EP2700005A1 (en) | Method and apparatus for generating resource efficient computer program code | |
| US20200110984A1 (en) | Avoiding cycles in neural networks | |
| US20220101194A1 (en) | Method, electronic device, and computer program product for processing machine learning model | |
| CN112270176A (en) | Method, apparatus, and computer storage medium for mode conversion in a deep learning framework | |
| CN114676051A (en) | Program testing method and device based on fuzzy test | |
| CN117742679A (en) | Kernel fusion method and system based on deep neural network | |
| CN117667424A (en) | Memory management method, device and storage medium | |
| US20160139901A1 (en) | Systems, methods, and computer programs for performing runtime auto parallelization of application code | |
| CN118585200A (en) | Operator fusion method, electronic device and storage medium | |
| CN118092878A (en) | Artificial intelligent acceleration card adaptation method, device, equipment and storage medium | |
| CN113031962A (en) | Compiling method, compiling apparatus, electronic device, storage medium, and program product | |
| CN113626035B (en) | Neural network compiling method facing RISC-V equipment based on TVM | |
| Chen et al. | WeInfer: Unleashing the Power of WebGPU on LLM Inference in Web Browsers | |
| TWI874169B (en) | Method and apparatus of compiling a deep learning model, a non-transitory computer readable medium | |
| CN118426834A (en) | Compilation method, device, equipment, medium and program product of deep learning model |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||