CN107341540B - An apparatus and method for executing a Hessian-Free training algorithm - Google Patents
- Publication number
- CN107341540B (application CN201610283885.XA)
- Authority
- CN
- China
- Prior art keywords
- theta
- instruction
- updated
- data processing
- processing unit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Neurology (AREA)
- Memory System Of A Hierarchy Structure (AREA)
- Advance Control (AREA)
- Complex Calculations (AREA)
Abstract
A computing device and method, the device including a controller unit and a data processing unit. The data processing unit includes an operation-control submodule and a basic-operation submodule, together with at least one of a gradient operation submodule, a damping-term operation submodule, a Gauss-Newton matrix operation submodule, and a conjugate-gradient-method operation submodule. The device can train a variety of neural networks, such as auto-encoders and recurrent neural networks (RNNs). The invention addresses the insufficient arithmetic performance of general-purpose processors on such data and the large front-end decoding overhead, accelerating the execution of related applications; at the same time, the data cache unit avoids repeatedly reading data from memory and reduces memory-access bandwidth.
Description
Technical Field
The present invention relates to the technical field of neural network operations, and more particularly to an apparatus and method for executing a Hessian-Free training algorithm.
Background
Gradient descent is widely used in function approximation, optimization, pattern recognition, and image processing. The mainstream method for training neural networks today is gradient descent (combined with the back-propagation algorithm), but this method ignores the curvature information of the error function: parameter updates easily become excessively flat, so that the iteration fails to converge to a local optimum, and error functions with "ill-conditioned curvature" (such as the Rosenbrock function) are handled poorly. The Hessian-Free training algorithm solves this problem well and, through a number of refinements, avoids computation that grows quadratically with the number of parameters (its cost grows linearly, as with gradient descent).
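For intuition (an editor's illustration, not part of the original text), the short NumPy sketch below evaluates the Hessian of the Rosenbrock function at its minimum; the condition number of roughly 2.5×10³ is what makes plain gradient descent crawl along the valley floor while curvature-aware methods such as Hessian-Free do not:

```python
import numpy as np

def rosenbrock(x, y):
    # Classic ill-conditioned test function: f(x, y) = (1 - x)^2 + 100 * (y - x^2)^2
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def rosenbrock_hessian(x, y):
    # Analytic Hessian of the Rosenbrock function
    return np.array([
        [2 - 400 * (y - x ** 2) + 800 * x ** 2, -400 * x],
        [-400 * x, 200.0],
    ])

H = rosenbrock_hessian(1.0, 1.0)   # curvature at the minimum (1, 1)
print(np.linalg.cond(H))           # ~2.5e3: a long, narrow valley
```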
At present, one known way to execute the Hessian-Free training algorithm is to use a general-purpose processor, which supports the algorithm by executing general-purpose instructions with a general-purpose register file and general-purpose functional units. One disadvantage of this approach is the low arithmetic performance of a single general-purpose processor; when multiple general-purpose processors execute in parallel, the communication among them becomes a performance bottleneck. In addition, a general-purpose processor must decode the operations of the Hessian-Free training algorithm into a long sequence of arithmetic and memory-access instructions, and this front-end decoding imposes a large power overhead.
Another known way to execute the Hessian-Free training algorithm is to use a graphics processing unit (GPU), which supports the algorithm by executing general-purpose SIMD instructions with a general-purpose register file and general-purpose stream processing units. Because the GPU is a device dedicated to graphics operations and scientific computing, with no special support for the operations of the Hessian-Free training algorithm, a large amount of front-end decoding is still required to execute those operations, bringing considerable extra overhead. In addition, the GPU has only a small on-chip cache, so the data needed during the computation (such as the Gauss-Newton matrix) must be moved repeatedly from off-chip; off-chip bandwidth becomes the main performance bottleneck and also incurs a large power overhead.
Summary of the Invention
In view of this, the purpose of the present invention is to provide an apparatus and method for executing a Hessian-Free training algorithm, so as to solve at least one of the above technical problems.
To achieve the above object, as one aspect of the present invention, there is provided an apparatus for executing a Hessian-Free training algorithm, comprising:
a controller unit, configured to decode a fetched instruction into micro-instructions that control the corresponding modules and to send them to those modules;
a data cache unit, configured to store intermediate variables produced during the operation and to initialize and update those intermediate variables;
a data processing unit, configured to perform arithmetic operations under the control of the controller unit and to store intermediate variables in the data cache unit.
The data processing unit includes an operation-control submodule, a gradient operation submodule, a damping-term operation submodule, a Gauss-Newton matrix operation submodule, a conjugate-gradient-method operation submodule, and a basic-operation submodule, where the basic-operation submodule performs elementary addition and multiplication operations on matrices and vectors.
Preferably, the gradient operation submodule, the damping-term operation submodule, the Gauss-Newton matrix operation submodule, and the conjugate-gradient-method operation submodule can all call the basic-operation submodule and, as required, are also allowed to call one another.
When the apparatus is initialized, the data cache unit initializes the second-order estimate f̂n(δn) of f(θ). Before the n-th update of the parameter vector θn begins, f̂n(δn) is read out into the data processing unit; after the update vector δn is obtained in the data processing unit, the updated f̂n+1(δn+1) is written back. Here θ is the parameter vector to be updated, θn is its value at the n-th update, f(θ) is the error function, that is, the function measuring the deviation of actual results from predicted values, δn is the update vector, and θn+1 = θn + δn.
In the initialization step, the data cache unit initializes the quantities inside f̂n(δn): the gradient ∇f(θn), the Gauss-Newton matrix Gf, the damping coefficient λ, and the damping function R̂(δn). Here ∇f(θn) is the gradient value of f at θn and Gf is the Gauss-Newton matrix of f at θn; the damping function R̂(δn) is the value at θn of a function predetermined according to the training model; and the damping coefficient λ is obtained by a Levenberg-Marquardt-style heuristic.
The data processing unit reads f̂n(δn) from the data cache unit and reads the parameter vector θn to be updated from the externally designated space. It obtains the update vector δn internally, updates θn to θn+1, and correspondingly updates f̂n(δn) to f̂n+1(δn+1); it then writes f̂n+1(δn+1) to the data cache unit and writes θn+1 to the externally designated space. Here θn+1 is the parameter vector at the (n+1)-th update, and f̂n+1(δn+1) is the second-order estimate of f at θn+1.
As another aspect of the present invention, there is further provided a method for executing a Hessian-Free training algorithm, comprising the following steps:
Step (1): through an instruction, complete the initialization of the data cache unit, i.e. initialize the second-order estimate f̂n(δn) of f(θ), where θ is the parameter vector to be updated, θn is its value at the n-th update, f(θ) is the error function, that is, the function measuring the deviation of actual results from predicted values, δn is the update vector, and θn+1 = θn + δn;
Step (2): through an IO instruction, have the direct memory access unit read the parameter vector to be updated from the external space;
Step (3): according to the corresponding instruction, the data processing unit performs a second-order Taylor expansion of the error function f(θ) at θn and adds a damping term λR̂(δn), obtaining the estimate of f(θ) near θn, namely

f̂n(δn) = f(θn) + ∇f(θn)ᵀδn + (1/2)δnᵀGf δn + λR̂(δn),

where Gf is the Gauss-Newton matrix of f at θn, the damping coefficient λ is obtained by a Levenberg-Marquardt-style heuristic, and the damping function R̂(δn) is the value at θn of a function predetermined according to the training model;
Step (4): according to the corresponding instruction, the data processing unit runs the preconditioned conjugate gradient method to find the δn that minimizes f̂n(δn), and updates θn to θn+1; the specific update is:
θn+1 = θn + δn;
Step (5): the data processing unit judges whether the updated parameter vector has converged; if it has, the operation ends; otherwise, execution returns to step (2).
In step (1), completing the initialization of the data cache unit includes zeroing the gradient ∇f(θn), the Gauss-Newton matrix Gf, the damping coefficient λ, and the damping function R̂(δn).
When RNN training is performed in step (3), the damping function takes the structural-damping form

R̂(δn) = (1/2)δnᵀ(I + μGS)δn,

where S, like f, is a distance function, GS is the Gauss-Newton matrix of S at θn, and μ is a predetermined positive number.
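Since f̂n(δn) is quadratic in δn, minimizing it in step (4) amounts to solving a linear system. Spelled out (an editor's derivation from the definitions above, using the reconstructed damping form), setting the gradient of the model to zero gives:

```latex
\nabla_{\delta_n}\hat{f}_n(\delta_n) = 0
\;\Longrightarrow\;
\bigl(G_f + \lambda I + \lambda\mu\,G_S\bigr)\,\delta_n = -\nabla f(\theta_n).
```

Both Gf and GS are Gauss-Newton matrices and hence positive semidefinite, so the system matrix is well suited to the conjugate gradient method; this is precisely why the algorithm uses the Gauss-Newton matrix rather than the possibly indefinite Hessian.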
In the step of running the preconditioned conjugate gradient method of step (4) to find the δn that minimizes f̂n(δn), only a "mini-batch" rather than all samples is used, and the Gauss-Newton matrix-vector products involved are all computed implicitly and approximately via Pearlmutter's R{}-method.
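A hedged illustration of what "implicit" means here: the curvature matrix is never formed, only its action on a vector is computed. The exact R{}-operator requires forward-over-reverse automatic differentiation; the finite difference of gradients below is a common stand-in (it approximates the Hessian-vector rather than the Gauss-Newton-vector product) that conveys the idea, with `grad` an assumed gradient routine:

```python
import numpy as np

def curvature_vector_product(grad, theta, v, eps=1e-4):
    """Approximate H(theta) @ v by a central difference of gradients.
    Two gradient calls and O(dim) memory: no dim x dim matrix is ever
    built, so the cost stays linear in the number of parameters."""
    return (grad(theta + eps * v) - grad(theta - eps * v)) / (2.0 * eps)
```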
As a further aspect of the present invention, there is also provided a method for executing a Hessian-Free training algorithm, characterized by comprising the following steps:
Step S1: an IO instruction is pre-stored at the head address of the instruction cache unit.
Step S2: when the operation starts, the controller unit reads this IO instruction from the head address of the instruction cache unit; according to the decoded micro-instruction, the direct memory access unit reads from the external address space all instructions related to the Hessian-Free computation and caches them in the instruction cache unit;
Step S3: the controller unit reads an IO instruction from the instruction cache unit; according to the decoded micro-instruction, the direct memory access unit reads the initial parameter vector θ0 to be updated from the external space into the data processing unit;
Step S4: the controller unit reads an assignment instruction from the instruction cache unit; according to the decoded micro-instruction, f̂n(δn) in the data cache unit is initialized and the iteration counter n in the data processing unit is set to 0, where θ is the parameter vector to be updated, θn is its value at the n-th update, f(θ) is the error function, that is, the function measuring the deviation of actual results from predicted values, δn is the update vector, θn+1 = θn + δn, and f̂n(δn) is the second-order estimate of f(θ);
Step S5: the controller unit reads an IO instruction from the instruction cache unit; according to the decoded micro-instruction, the direct memory access unit reads the parameter vector θn to be updated from the external space into the data processing unit;
Step S6: the controller unit reads from the instruction cache unit an instruction for the second-order estimation of the error function near the current parameter vector value; according to the decoded micro-instruction, the second-order estimate of f(θ) near θn is computed. In this operation the instruction is sent to the operation-control submodule, which issues the corresponding instructions for the following: the gradient operation submodule computes ∇f(θn); the Gauss-Newton operation submodule, using the matrix multiplication of the basic-operation submodule, obtains the Gauss-Newton matrix Gf of f at θn; the damping-term operation submodule and the basic-operation submodule execute the LM heuristic to obtain the damping coefficient λ and hence the damping term λR̂(δn); finally,

f̂n(δn) = f(θn) + ∇f(θn)ᵀδn + (1/2)δnᵀGf δn + λR̂(δn)

is obtained and its expression is stored in the data cache unit, where the damping function R̂(δn) is the value at θn of a function predetermined according to the training model;
Step S7: the controller unit reads a data transfer instruction from the instruction cache unit; according to the decoded micro-instruction, f̂n(δn) is transferred from the data cache unit to the data processing unit;
Step S8: the controller unit reads a parameter-update operation instruction from the instruction cache unit; according to the decoded micro-instruction, the preconditioned conjugate gradient method is used to find the δn that minimizes f̂n(δn), and θn is updated to θn+1. The direct memory access unit reads the parameter vector θn to be updated from the external space into the data processing unit; the operation-control submodule directs the relevant operation modules as follows: the conjugate-gradient operation submodule and the basic-operation submodule obtain the update vector δn; finally, the vector addition of the basic-operation submodule updates θn to θn+1;
Step S9: the controller unit reads an IO instruction from the instruction cache unit; according to the decoded micro-instruction, the updated parameter vector θn+1 is transferred from the data processing unit through the direct memory access unit to the externally designated space;
Step S10: the controller unit reads a convergence-judgment instruction from the instruction cache unit; according to the decoded micro-instruction, the data processing unit judges whether the updated parameter vector θn+1 has converged: if it has, the operation ends; otherwise, the iteration counter n is incremented by 1 and execution returns to step S5.
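Read as a program for the apparatus, steps S1 to S10 form a short instruction stream. The sketch below lists it with invented opcode names; the patent does not disclose an actual instruction-set encoding, so this is purely illustrative:

```python
# Hypothetical instruction stream mirroring steps S1-S10.
HESSIAN_FREE_PROGRAM = [
    "IO",                 # S2: fetch all Hessian-Free instructions into the instruction cache
    "IO",                 # S3: load the initial parameter vector theta_0
    "ASSIGN",             # S4: initialize the cached second-order estimate, set n = 0
    "IO",                 # S5: load theta_n                      <- loop re-entry point
    "SECOND_ORDER_EST",   # S6: build f_hat_n around theta_n
    "DATA_TRANSFER",      # S7: move f_hat_n from the data cache to the data processing unit
    "PARAM_UPDATE",       # S8: preconditioned CG, theta_n -> theta_{n+1}
    "IO",                 # S9: store theta_{n+1} to external space
    "CONVERGE_JUDGE",     # S10: done, or n += 1 and jump back to S5
]
```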
As yet another aspect of the present invention, there is further provided an apparatus for executing a Hessian-Free training algorithm, in whose controller a program executing the above method for executing a Hessian-Free training algorithm is embedded as firmware.
From the above technical solutions, the apparatus and method of the present invention have the following beneficial effects: the apparatus implements the Hessian-Free training algorithm and can train a variety of neural networks, such as auto-encoders and recurrent neural networks (RNNs); by adopting a device dedicated to executing the Hessian-Free training algorithm, the insufficient arithmetic performance of general-purpose processors on such data and the large front-end decoding overhead are overcome, accelerating the execution of related applications; at the same time, the data cache unit avoids repeatedly reading data from memory and reduces memory-access bandwidth.
Brief Description of the Drawings
Fig. 1 is an example block diagram of the overall structure of an apparatus for implementing applications related to the Hessian-Free training algorithm according to an embodiment of the present invention;
Fig. 2 is an example block diagram of the data processing unit in an apparatus for implementing applications related to the Hessian-Free training algorithm according to an embodiment of the present invention;
Fig. 3 is a flow chart of the operations for implementing applications related to the Hessian-Free training algorithm according to an embodiment of the present invention.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to specific embodiments and the accompanying drawings. Other aspects, advantages, and salient features of the present invention will become apparent to those skilled in the art from the following detailed description.
In this specification, the various embodiments described below to explain the principles of the present invention are merely illustrative and should not be construed in any way as limiting the scope of the invention. The following description with reference to the accompanying drawings is intended to assist in a comprehensive understanding of exemplary embodiments of the invention as defined by the claims and their equivalents. The description includes numerous specific details to assist understanding, but these details should be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Furthermore, the same reference numerals are used for similar functions and operations throughout the drawings.
The present invention discloses an apparatus for executing a Hessian-Free training algorithm, comprising an instruction cache unit, an instruction decoding unit, a direct memory access unit, a data processing unit, and a data cache module. With this apparatus the Hessian-Free training algorithm can be implemented and various neural networks trained, such as auto-encoders and recurrent neural networks (RNNs). At each iteration, a second-order Taylor expansion of the error function (the objective function) is formed and a damping term is added, yielding an estimate of the objective function; then, from the current gradient, Gauss-Newton matrix, damping function, and damping constant, the update vector is obtained with the preconditioned conjugate gradient method (preconditioning CG-minimize) and the parameters to be updated are updated. Iteration continues until the parameter vector to be updated converges.
More specifically, the apparatus of the present invention comprises a direct memory access unit, an instruction cache unit, a controller unit, a data cache unit, and a data processing unit. The direct memory access unit can access the external address space and can read and write data to each cache unit inside the apparatus to complete the loading and storing of data; specifically, it reads instructions into the instruction cache unit, reads the parameters to be updated and the corresponding gradient values from the designated storage units into the data processing unit, and writes the updated parameter vector from the data processing unit directly to the externally designated space. The instruction cache unit reads instructions through the direct memory access unit and caches them. The controller unit reads instructions from the instruction cache unit, decodes them into micro-instructions that control the behavior of the other modules, and sends them to those modules, such as the direct memory access unit, the data cache unit, and the data processing unit. The data cache unit stores the intermediate variables needed while the apparatus runs and initializes and updates them. The data processing unit performs the corresponding arithmetic operations according to the instructions.
In addition, the present invention also discloses a method for executing a Hessian-Free training algorithm, comprising the following steps:
Step (1): through an instruction, complete the initialization of the data cache unit, i.e. initialize the second-order estimate f̂n(δn) of f(θ); specifically, zero the gradient ∇f(θn), the Gauss-Newton matrix Gf, the damping coefficient λ, and the damping function R̂(δn) inside it.
Step (2): through an IO instruction, the direct memory access unit reads the parameter vector to be updated from the external space.
Step (3): according to the corresponding instruction, the data processing unit performs a second-order Taylor expansion of the error function f(θ) at θn and adds a damping term λR̂(δn), obtaining the estimate of f(θ) near θn, namely

f̂n(δn) = f(θn) + ∇f(θn)ᵀδn + (1/2)δnᵀGf δn + λR̂(δn),

where Gf is the Gauss-Newton matrix of f at θn; δn is the update vector; the damping coefficient λ is obtained with Levenberg-Marquardt-style heuristics; and the damping function R̂(δn) is the value at θn of a function predetermined according to the training model. For example, when training an RNN, R̂(δn) = (1/2)δnᵀ(I + μGS)δn, where S, like f, is a distance function, GS is the Gauss-Newton matrix of S at θn, and μ (the weighting constant) is a predetermined positive number.
Step (4): according to the corresponding instruction, the data processing unit runs the preconditioned conjugate gradient method to find the δn that minimizes f̂n(δn), and updates θn to θn+1. The update is as follows:
θn+1 = θn + δn;
It is worth mentioning that, during the preconditioned conjugate gradient method, only a "mini-batch" rather than all samples is used, and the Gauss-Newton matrix-vector products involved are all computed implicitly and approximately via Pearlmutter's R{}-method. This both improves the efficiency of learning on large data sets, that is, the efficiency of the data operation module, and avoids computation that grows quadratically with the number of parameters.
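For concreteness (an editor's sketch, not the hardware datapath), a textbook preconditioned conjugate gradient routine is shown below. `apply_A(v)` is assumed to return the damped curvature product (Gf + λI)·v implicitly, for example via the finite-difference product sketched earlier, and `M_inv` applies an optional preconditioner:

```python
import numpy as np

def pcg(apply_A, b, M_inv=None, tol=1e-10, max_iter=250):
    """Solve A x = b where A is available only as a matrix-vector product."""
    if M_inv is None:
        M_inv = lambda r: r            # identity preconditioner by default
    x = np.zeros_like(b)
    r = b - apply_A(x)                 # residual
    z = M_inv(r)
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = apply_A(p)
        alpha = rz / (p @ Ap)          # step length along the search direction
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p      # new conjugate search direction
        rz = rz_new
    return x
```

On a mini-batch, `apply_A` simply evaluates its products on the batch; in practice CG is also often truncated early and warm-started from the previous δn.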
Step (5): the data processing unit judges whether the updated parameter vector has converged; if it has, the operation ends; otherwise, execution returns to step (2).
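Putting steps (1) to (5) together, a minimal NumPy sketch of the outer loop follows. It is an editor's illustration under stated assumptions, not the claimed hardware: `grad(theta)` is assumed to return ∇f(θn), `gvp(theta, v)` the implicit product Gf·v (for example, the finite-difference routine above), `pcg` is the solver just sketched, and the damping is simplified to the Tikhonov term (λ/2)·δᵀδ:

```python
import numpy as np

def hessian_free_train(grad, gvp, theta0, lam=1.0, tol=1e-6, max_iter=100):
    """Outer loop of steps (1)-(5): build the damped quadratic model at
    theta_n, minimize it with preconditioned CG, and apply the update."""
    theta = theta0.copy()
    for _ in range(max_iter):
        g = grad(theta)                                   # step (3): gradient at theta_n
        # step (4): minimizing f_hat amounts to solving (G_f + lam*I) delta = -g
        delta = pcg(lambda v: gvp(theta, v) + lam * v, -g)
        theta = theta + delta                             # theta_{n+1} = theta_n + delta_n
        if np.linalg.norm(delta) < tol:                   # step (5): convergence test
            break
    return theta
```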
The apparatus implementing the Hessian-Free training algorithm according to embodiments of the present invention can be used to support applications of the Hessian-Free training algorithm. A space is opened in the data cache unit to store the second-order estimate of the error function near each generation's parameters to be updated; each time the preconditioned conjugate gradient method is run, this second-order estimate is used to compute an update vector, and the vector to be updated is then updated. The above steps are repeated until the vector to be updated converges.
The technical solutions of the present invention are further explained below with reference to the accompanying drawings.
Fig. 1 shows an example block diagram of the overall structure of an apparatus for implementing the Hessian-Free training algorithm according to an embodiment of the present invention. As shown in Fig. 1, the apparatus includes a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing unit 5, all of which may be implemented as hardware circuits.
The direct memory access unit 1 can access the external address space and can read and write data to each cache unit inside the apparatus to complete the loading and storing of data. Specifically, it reads instructions into the instruction cache unit 2, reads the parameters to be updated from the designated storage units into the data processing unit 5, and writes the updated parameter vector from the data processing unit 5 directly to the externally designated space.
The instruction cache unit 2 reads instructions through the direct memory access unit 1 and caches them.
The controller unit 3 reads instructions from the instruction cache unit 2, decodes them into micro-instructions that control the behavior of the other modules, and sends them to those modules, such as the direct memory access unit 1, the data cache unit 4, and the data processing unit 5.
The data cache unit 4 initializes f̂n(δn) when the apparatus is initialized; specifically, it initializes the gradient ∇f(θn), the Gauss-Newton matrix Gf, the damping coefficient λ, and the damping function R̂(δn) inside it. Before the n-th update of the parameter vector θn begins, f̂n(δn) is read out into the data processing unit 5. The data processing unit 5 obtains the update vector δn, updates θn to θn+1, and correspondingly updates f̂n(δn) to f̂n+1(δn+1); f̂n+1(δn+1) is then written back to the data cache unit 4 (the new data overwriting the corresponding old data) for the next use.
The data processing unit 5 reads f̂n(δn) from the data cache unit 4 and reads the parameter vector θn to be updated from the externally designated space through the direct memory access unit 1. It obtains the update vector δn internally, updates θn to θn+1, and correspondingly updates f̂n(δn) to f̂n+1(δn+1); it then writes f̂n+1(δn+1) to the data cache unit 4 and writes θn+1 to the externally designated space through the direct memory access unit 1.
Fig. 2 shows an example block diagram of the data processing unit in an apparatus for implementing applications related to the Hessian-Free training algorithm according to an embodiment of the present invention. As shown in Fig. 2, the data processing unit includes an operation-control submodule 51, a gradient operation submodule 52, a damping-term operation submodule 53, a Gauss-Newton matrix operation submodule 54, a conjugate-gradient-method operation submodule 55, and a basic-operation submodule 56. The basic-operation submodule 56 performs elementary operations such as matrix and vector additions and multiplications; submodules 52, 53, 54, and 55 all call submodule 56 and, as required, are also allowed to call one another.
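The calling relationships of Fig. 2 can be caricatured in software as below; all class and method names are the editor's invention, intended only to show that submodules 52 to 55 share submodule 56 and may also depend on one another:

```python
class BasicOps:                                   # submodule 56
    """Elementary matrix/vector additions and multiplications."""
    def add(self, a, b): return a + b
    def matvec(self, A, v): return A @ v

class GradientOp:                                 # submodule 52
    def __init__(self, basic): self.basic = basic

class GaussNewtonOp:                              # submodule 54 (may call submodule 52)
    def __init__(self, basic, grad_op):
        self.basic, self.grad_op = basic, grad_op

class DampingOp:                                  # submodule 53
    def __init__(self, basic): self.basic = basic

class ConjugateGradientOp:                        # submodule 55 (may call submodule 54)
    def __init__(self, basic, gn_op):
        self.basic, self.gn_op = basic, gn_op

class OperationControl:                           # submodule 51: dispatches micro-instructions
    def __init__(self):
        basic = BasicOps()
        self.grad = GradientOp(basic)
        self.gn = GaussNewtonOp(basic, self.grad)
        self.damp = DampingOp(basic)
        self.cg = ConjugateGradientOp(basic, self.gn)
```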
Fig. 3 shows the overall flow chart of the apparatus performing the relevant operations according to the Hessian-Free training algorithm.
In step S1, an IO instruction is pre-stored at the head address of the instruction cache unit 2.
In step S2, the operation starts: the controller unit 3 reads this IO instruction from the head address of the instruction cache unit 2, and according to the decoded micro-instruction, the direct memory access unit 1 reads from the external address space all instructions related to the Hessian-Free computation and caches them in the instruction cache unit 2.
In step S3, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the decoded micro-instruction, the direct memory access unit 1 reads the initial parameter vector θ0 to be updated from the external space into the data processing unit 5.
In step S4, the controller unit 3 reads an assignment instruction from the instruction cache unit 2; according to the decoded micro-instruction, f̂n(δn) in the data cache unit 4 is initialized, and the iteration counter n in the data processing unit 5 is set to 0.
In step S5, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the decoded micro-instruction, the direct memory access unit 1 reads the parameter vector θn to be updated from the external space into the data processing unit 5.
In step S6, the controller unit 3 reads from the instruction cache unit 2 an instruction for the second-order estimation of the error function near the current parameter vector value; according to the decoded micro-instruction, the second-order estimate of f(θ) near θn is computed. In this operation the instruction is sent to the operation-control submodule 51, which issues the corresponding instructions for the following: the gradient operation submodule 52 computes ∇f(θn); the Gauss-Newton operation submodule 54, using the matrix multiplication of the basic-operation submodule 56, obtains the Gauss-Newton matrix Gf of f at θn; the damping-term operation submodule 53 and the basic-operation submodule 56 execute the LM heuristic to obtain the damping coefficient λ and hence the damping term λR̂(δn). Finally, f̂n(δn) = f(θn) + ∇f(θn)ᵀδn + (1/2)δnᵀGf δn + λR̂(δn) is obtained and its expression is stored in the data cache unit 4.
In step S7, the controller unit 3 reads a data transfer instruction from the instruction cache unit 2; according to the decoded micro-instruction, f̂n(δn) is transferred from the data cache unit 4 to the data processing unit 5.
In step S8, the controller unit 3 reads a parameter-update operation instruction from the instruction cache unit 2; according to the decoded micro-instruction, the preconditioned conjugate gradient method is used to find the δn that minimizes f̂n(δn), and θn is updated to θn+1. The direct memory access unit 1 reads the parameter vector θn to be updated from the external space into the data processing unit 5. The operation-control submodule 51 directs the relevant operation modules as follows: the conjugate-gradient operation submodule 55 and the basic-operation submodule 56 obtain the update vector δn (depending on the expression of the damping function, the Gauss-Newton operation module may also need to be called, as in the RNN example mentioned earlier); finally, the vector addition of the basic-operation submodule 56 updates θn to θn+1.
In step S9, the controller unit 3 reads an IO instruction from the instruction cache unit 2; according to the decoded micro-instruction, the updated parameter vector θn+1 is transferred from the data processing unit 5 through the direct memory access unit 1 to the externally designated space.
In step S10, the controller unit reads a convergence-judgment instruction from the instruction cache unit 2; according to the decoded micro-instruction, the data processing unit judges whether the updated parameter vector θn+1 has converged: if it has, the operation ends; otherwise, the iteration counter n is incremented by 1 and execution returns to step S5.
The processes or methods depicted in the preceding figures may be performed by processing logic comprising hardware (for example, circuits or dedicated logic), firmware, software (for example, software carried on a non-transitory computer-readable medium), or a combination of the two. Although the processes or methods are described above in a certain order, it should be understood that some of the described operations may be performed in a different order; moreover, some operations may be performed in parallel rather than sequentially.
In the foregoing specification, embodiments of the present invention have been described with reference to specific exemplary embodiments thereof. Obviously, various modifications may be made to the embodiments without departing from the broader spirit and scope of the invention as set forth in the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.
Claims (15)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201610283885.XA CN107341540B (en) | 2016-04-29 | 2016-04-29 | An apparatus and method for executing a Hessian-Free training algorithm |
| PCT/CN2016/081842 WO2017185413A1 (en) | 2016-04-29 | 2016-05-12 | Device and method for executing hessian-free training algorithm |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201610283885.XA CN107341540B (en) | 2016-04-29 | 2016-04-29 | An apparatus and method for executing a Hessian-Free training algorithm |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN107341540A CN107341540A (en) | 2017-11-10 |
| CN107341540B true CN107341540B (en) | 2021-07-20 |
Family
ID=60160584
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201610283885.XA Active CN107341540B (en) | 2016-04-29 | 2016-04-29 | An apparatus and method for executing a Hessian-Free training algorithm |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN107341540B (en) |
| WO (1) | WO2017185413A1 (en) |
Families Citing this family (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109934208B (en) * | 2019-04-22 | 2024-07-23 | 江苏邦融微电子有限公司 | Hardware acceleration system and method for fingerprint identification |
| US11444483B2 (en) * | 2020-01-14 | 2022-09-13 | Hitachi Energy Switzerland Ag | Adaptive state estimation for power systems |
| CN111626434B (en) * | 2020-05-15 | 2022-06-07 | 浪潮电子信息产业股份有限公司 | Distributed training parameter updating method, device, equipment and storage medium |
| CN112990958B (en) * | 2021-01-19 | 2024-10-18 | 腾讯科技(深圳)有限公司 | Data processing method, device, storage medium and computer equipment |
| CN114358180B (en) * | 2021-12-31 | 2025-05-06 | 海光信息技术股份有限公司 | Processor pre-fetch training method, processing device, processor and computing equipment |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9390370B2 (en) * | 2012-08-28 | 2016-07-12 | International Business Machines Corporation | Training deep neural network acoustic models using distributed hessian-free optimization |
| US9601109B2 (en) * | 2013-12-06 | 2017-03-21 | International Business Machines Corporation | Systems and methods for accelerating hessian-free optimization for deep neural networks by implicit preconditioning and sampling |
- 2016-04-29: Application CN201610283885.XA filed in China; granted as patent CN107341540B (active)
- 2016-05-12: International application PCT/CN2016/081842 filed (WO2017185413A1, application filing)
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN1658550A (en) * | 2004-04-16 | 2005-08-24 | 威盛电子股份有限公司 | Apparatus and method for performing cryptographic operations |
| CN1834898A (en) * | 2005-05-16 | 2006-09-20 | 威盛电子股份有限公司 | Microprocessor apparatus and method for modular exponentiation |
| CN102156637A (en) * | 2011-05-04 | 2011-08-17 | 中国人民解放军国防科学技术大学 | Vector crossing multithread processing method and vector crossing multithread microprocessor |
| WO2016037351A1 (en) * | 2014-09-12 | 2016-03-17 | Microsoft Corporation | Computing system for training neural networks |
Non-Patent Citations (4)
| Title |
|---|
| Yunji Chen et al., "DaDianNao: A Machine-Learning Supercomputer," 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014, pp. 609-622. * |
| James Martens, "Deep learning via Hessian-free optimization," Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 1-8. * |
| James Martens et al., "Training Deep and Recurrent Networks with Hessian-Free Optimization," Neural Networks: Tricks of the Trade, 2012, pp. 1-58. * |
| Ryan Kiros, "Training Neural Networks with Stochastic Hessian-Free Optimization," arXiv:1301.3641v3 [cs.LG], 2013, pp. 1-11. * |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2017185413A1 (en) | 2017-11-02 |
| CN107341540A (en) | 2017-11-10 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN107341540B (en) | An apparatus and method for executing a Hessian-Free training algorithm | |
| CN111860812B (en) | Apparatus and method for performing convolutional neural network training | |
| KR102331978B1 (en) | Device and method for executing forward calculation of artificial neural network | |
| CN110929863B (en) | Apparatus and method for performing LSTM operations | |
| KR102175044B1 (en) | Apparatus and method for running artificial neural network reverse training | |
| CN109375951B (en) | Device and method for executing forward operation of full-connection layer neural network | |
| CN113537481B (en) | Device and operation method for performing LSTM neural network operation | |
| CN107341132B (en) | Device and method for executing AdaGrad gradient descent training algorithm | |
| WO2017185347A1 (en) | Apparatus and method for executing recurrent neural network and lstm computations | |
| WO2017185257A1 (en) | Device and method for performing adam gradient descent training algorithm | |
| CN107704267A (en) | A kind of convolutional neural networks operational order and its method | |
| CN108334944B (en) | Artificial neural network operation device and method | |
| WO2017185336A1 (en) | Apparatus and method for executing pooling operation | |
| CN107315570B (en) | Apparatus and method for executing Adam gradient descent training algorithm | |
| WO2017185248A1 (en) | Apparatus and method for performing auto-learning operation of artificial neural network | |
| WO2018112892A1 (en) | Device and method for supporting fast artificial neural network operation | |
| CN111860814B (en) | Apparatus and method for performing batch normalization operations | |
| WO2017185256A1 (en) | Rmsprop gradient descent algorithm execution apparatus and method | |
| CN107329733B (en) | Apparatus and method for performing posing operations | |
| CN107315569B (en) | An apparatus and method for performing RMSprop gradient descent algorithm |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| CB02 | Change of applicant information |
Address after: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing Applicant after: Zhongke Cambrian Technology Co.,Ltd. Address before: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing Applicant before: Beijing Zhongke Cambrian Technology Co.,Ltd. |
|
| CB02 | Change of applicant information | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant | ||
| TG01 | Patent term adjustment | ||
| TG01 | Patent term adjustment |