
CN107341540B - An apparatus and method for executing a Hessian-Free training algorithm - Google Patents


Info

Publication number
CN107341540B
Authority
CN
China
Prior art keywords: theta, instruction, updated, data processing, processing unit
Prior art date
Legal status: Active
Application number
CN201610283885.XA
Other languages
Chinese (zh)
Other versions
CN107341540A (en)
Inventor
张士锦
郭崎
陈天石
陈云霁
Current Assignee
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd
Priority to CN201610283885.XA
Priority to PCT/CN2016/081842 (published as WO2017185413A1)
Publication of CN107341540A
Application granted
Publication of CN107341540B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Advance Control (AREA)
  • Complex Calculations (AREA)

Abstract

A computing device and method, the device comprising a controller unit and a data processing unit. The data processing unit includes an operation control submodule and a basic operation submodule, together with at least one of a gradient operation submodule, a damping-term operation submodule, a Gauss-Newton matrix operation submodule, and a conjugate gradient method operation submodule. The device can complete the training of various neural networks, such as auto-encoders and recurrent neural networks (RNNs). The invention addresses the insufficient arithmetic performance of general-purpose processors on such data and their large front-end decoding overhead, accelerating the execution of related applications; at the same time, the use of the data cache unit avoids repeatedly reading data from memory and reduces the memory-access bandwidth required.

Description

An apparatus and method for executing a Hessian-Free training algorithm

Technical Field

The present invention relates to the technical field of neural network operations, and more particularly to an apparatus and method for executing a Hessian-Free training algorithm.

Background

Gradient descent is widely used in function approximation, optimization, pattern recognition, and image processing. The mainstream method for training neural networks today is gradient descent (combined with the backpropagation algorithm), but this method ignores the curvature information of the error function: it is prone to excessively flat parameter updates that fail to converge to a local optimum, and it handles error functions with "pathological curvature" (such as the Rosenbrock function) poorly. The Hessian-Free training algorithm solves this problem well, and with some refinements of detail its computational cost grows only linearly in the number of parameters (as with gradient descent) rather than quadratically.

One known way to execute the Hessian-Free training algorithm is to use a general-purpose processor, which supports the algorithm by executing general-purpose instructions through a general-purpose register file and general-purpose functional units. One disadvantage of this approach is that the arithmetic performance of a single general-purpose processor is low, and when multiple general-purpose processors execute in parallel, the communication between them becomes a performance bottleneck. In addition, a general-purpose processor must decode the operations of the Hessian-Free training algorithm into a long sequence of arithmetic and memory-access instructions, and this front-end decoding brings a large power overhead.

Another known way to execute the Hessian-Free training algorithm is to use a graphics processing unit (GPU), which supports the algorithm by executing general-purpose SIMD instructions through a general-purpose register file and general-purpose stream processing units. Because the GPU is a device specialized for graphics operations and scientific computing, with no dedicated support for the operations of the Hessian-Free training algorithm, a large amount of front-end decoding work is still needed to execute them, bringing substantial overhead. Moreover, the GPU has only a small on-chip cache, so the data required by the computation (such as the Gauss-Newton matrix) must be transferred from off-chip repeatedly; off-chip bandwidth becomes the main performance bottleneck and incurs a large power overhead.

Summary of the Invention

In view of this, the purpose of the present invention is to provide an apparatus and method for executing a Hessian-Free training algorithm, so as to solve at least one of the above technical problems.

To achieve the above object, as one aspect of the present invention, an apparatus for executing a Hessian-Free training algorithm is provided, comprising:

a controller unit, which decodes each instruction it reads into microinstructions that control the corresponding modules, and sends them to those modules;

a data cache unit, which stores intermediate variables of the computation and performs initialization and update operations on those intermediate variables;

a data processing unit, which performs arithmetic operations under the control of the controller unit and stores intermediate variables in the data cache unit.

The data processing unit includes an operation control submodule, a gradient operation submodule, a damping-term operation submodule, a Gauss-Newton matrix operation submodule, a conjugate gradient method operation submodule, and a basic operation submodule; the basic operation submodule performs the basic operations of addition and multiplication between matrices and vectors.

Preferably, the gradient operation submodule, the damping-term operation submodule, the Gauss-Newton matrix operation submodule, and the conjugate gradient method operation submodule can all call the basic operation submodule, and, as the situation requires, these four submodules are also allowed to call one another.

The data cache unit initializes the second-order estimate of f(θ), written here as f̂, when the device is initialized. Before the n-th update of the parameter vector θ_n to be updated begins, f̂ is read out to the data processing unit, and after the update vector is obtained in the data processing unit, f̂ is written back. Here θ is the parameter vector to be updated, θ_n is the parameter vector at the n-th update, f(θ) is the error function, i.e. a function measuring how far the actual value of the result deviates from the predicted value, δ_n is the update vector, and θ_{n+1} = θ_n + δ_n.

In the step of initializing f̂, the data cache unit initializes the gradient ∇f(θ_n), the Gauss-Newton matrix G_f, the damping coefficient λ, and the damping function R̂_{θ_n} within it, where

f̂(θ_n + δ_n) = f(θ_n) + ∇f(θ_n)ᵀ δ_n + ½ δ_nᵀ G_f δ_n + λ R̂_{θ_n}(δ_n).

The gradient ∇f(θ_n) refers to the gradient value of f at θ_n, and G_f is the Gauss-Newton matrix of f at θ_n; the damping function R̂_{θ_n} is the value at θ_n of a function predetermined according to the training model; and the damping coefficient λ is obtained by a Levenberg-Marquardt (LM) style heuristic.

The data processing unit reads f̂ from the data cache unit and reads the parameter vector θ_n to be updated from the externally designated space; it obtains the update vector δ_n within the module, updates θ_n to θ_{n+1}, and updates f̂ correspondingly; it then writes f̂ back to the data cache unit and writes θ_{n+1} to the externally designated space. Here θ_{n+1} is the (n+1)-th parameter vector to be updated, and the rewritten f̂ is the second-order estimate of f at θ_{n+1}.

As another aspect of the present invention, a method for executing a Hessian-Free training algorithm is also provided, comprising the following steps:

Step (1): through an instruction, complete the initialization of the data cache unit, i.e. initialize the second-order estimate f̂ of f(θ); here θ is the parameter vector to be updated, θ_n is the parameter vector at the n-th update, f(θ) is the error function, i.e. a function measuring how far the actual value of the result deviates from the predicted value, δ_n is the update vector, and θ_{n+1} = θ_n + δ_n;

Step (2): through an IO instruction, the direct memory access unit completes the operation of reading the parameter vector to be updated from the external space;

Step (3): according to the corresponding instruction, the data processing unit performs a second-order Taylor expansion of the error function f(θ) at θ_n and adds a damping term λ R̂_{θ_n}(δ_n), obtaining the estimate f̂ of f(θ) near θ_n, namely

f̂(θ_n + δ_n) = f(θ_n) + ∇f(θ_n)ᵀ δ_n + ½ δ_nᵀ G_f δ_n + λ R̂_{θ_n}(δ_n);

where G_f is the Gauss-Newton matrix of f at θ_n, the damping coefficient λ is obtained by an LM-style heuristic, and the damping function R̂_{θ_n} is the value at θ_n of a function predetermined according to the training model;

Step (4): according to the corresponding instruction, the data processing unit runs a preconditioned conjugate gradient method to find the δ_n that minimizes f̂(θ_n + δ_n), and updates θ_n to θ_{n+1}; the specific update operation is:

θ_{n+1} = θ_n + δ_n;

Step (5): the data processing unit judges whether the updated parameter vector has converged; if it has, the computation ends; otherwise, execution returns to step (2).
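Taken together, steps (1) through (5) form the outer loop below, given as a minimal Python sketch; grad, gauss_newton_vec, damping_vec, and pcg are hypothetical stand-ins for the gradient, Gauss-Newton, damping-term, and conjugate gradient method submodules, and the gradient-norm convergence test is an illustrative choice that the text leaves open:

```python
import numpy as np

def hessian_free_train(theta, grad, gauss_newton_vec, damping_vec, pcg,
                       lmbda=1.0, tol=1e-6, max_iter=100):
    """Outer loop of steps (1)-(5): build the damped second-order model
    of f around theta_n, minimize it with preconditioned CG, update theta."""
    for _ in range(max_iter):
        g = grad(theta)  # step (3): gradient of the error function at theta_n
        # Damped curvature operator d -> (G_f + lambda * D) d, applied
        # implicitly; minimizing f_hat over delta reduces to solving A d = -g.
        A = lambda d: gauss_newton_vec(theta, d) + lmbda * damping_vec(theta, d)
        delta = pcg(A, -g)        # step (4): minimize the quadratic model
        theta = theta + delta     # step (4): theta_{n+1} = theta_n + delta_n
        if np.linalg.norm(grad(theta)) < tol:
            break                 # step (5): converged
    return theta
```

Minimizing f̂ over δ is equivalent to solving (G_f + λD) δ = -∇f(θ_n) whenever the damping function is quadratic with curvature D, which is why a conjugate gradient solver is the right inner routine.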

In step (1), the step of completing the initialization of the data cache unit includes performing a zero-setting operation on the gradient ∇f(θ_n), the Gauss-Newton matrix G_f, the damping coefficient λ, and the damping function R̂_{θ_n}.

In step (3), when RNN training is performed, the damping function R̂_{θ_n} is built from the Gauss-Newton matrix G_S of S at θ_n, where S and f are both distance functions and μ is a predetermined positive number; a reconstruction of this damping function is sketched below.
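The formula for this damping function survives only as an image in the source. A plausible reconstruction, assuming the structural-damping form of Martens and Sutskever (2011) that the surrounding definitions point to, is:

```latex
\hat{R}_{\theta_n}(\delta_n)
  = \frac{1}{2}\,\delta_n^{\top}\delta_n
  + \frac{\mu}{2}\,\delta_n^{\top} G_S\,\delta_n
```

Under this reading, the first term penalizes large parameter steps and the second penalizes steps that, measured through the distance function S, change the network's hidden states too abruptly.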

In step (4), in the step of running the preconditioned conjugate gradient method to find the δ_n that minimizes f̂(θ_n + δ_n), only a "mini-batch" rather than all samples is used during the preconditioned conjugate gradient procedure, and the Gauss-Newton matrix-vector products involved are all computed by an implicit approximate calculation, without forming the matrix explicitly.

As a further aspect of the present invention, a method for executing a Hessian-Free training algorithm is also provided, comprising the following steps:

Step S1: an IO instruction is pre-stored at the first address of the instruction cache unit.

Step S2: the computation starts; the controller unit reads that IO instruction from the first address of the instruction cache unit, and, according to the decoded microinstructions, the direct memory access unit reads all instructions related to the Hessian-Free computation from the external address space and caches them in the instruction cache unit;

Step S3: the controller unit reads an IO instruction from the instruction cache unit, and, according to the decoded microinstructions, the direct memory access unit reads the initial parameter vector θ_0 to be updated from the external space into the data processing unit;

Step S4: the controller unit reads an assignment instruction from the instruction cache unit; according to the decoded microinstructions, f̂ in the data cache unit is initialized and the iteration count n in the data processing unit is set to 0; here θ is the parameter vector to be updated, θ_n is the parameter vector at the n-th update, f(θ) is the error function, i.e. a function measuring how far the actual value of the result deviates from the predicted value, δ_n is the update vector, θ_{n+1} = θ_n + δ_n, and f̂ is the second-order estimate of f(θ);

Step S5: the controller unit reads an IO instruction from the instruction cache unit, and, according to the decoded microinstructions, the direct memory access unit reads the parameter vector θ_n to be updated from the external space and passes it to the data processing unit;

Step S6: the controller unit reads from the instruction cache unit an instruction that performs a second-order estimate of the error function near the current parameter vector value, and, according to the decoded microinstructions, carries out the operation of computing the second-order estimate f̂ of f(θ) near θ_n. In this operation the instruction is sent to the operation control submodule, which issues the corresponding instructions to perform the following operations: computing ∇f(θ_n) with the gradient operation submodule; obtaining the Gauss-Newton matrix G_f of f at θ_n using the Gauss-Newton operation submodule and the matrix multiplication of the basic operation submodule; and executing the LM heuristic with the damping-term operation submodule and the basic operation submodule to obtain the damping coefficient λ and hence the damping term λ R̂_{θ_n}(δ_n). Finally, through

f̂(θ_n + δ_n) = f(θ_n) + ∇f(θ_n)ᵀ δ_n + ½ δ_nᵀ G_f δ_n + λ R̂_{θ_n}(δ_n),

the expression for f̂ is stored in the data cache unit; here the damping function R̂_{θ_n} is the value at θ_n of a function predetermined according to the training model;

Step S7: the controller unit reads a data-transfer instruction from the instruction cache unit, and, according to the decoded microinstructions, f̂ is transferred from the data cache unit to the data processing unit;

Step S8: the controller unit reads a parameter-update operation instruction from the instruction cache unit and, according to the decoded microinstructions, runs the preconditioned conjugate gradient method to find the δ_n that minimizes f̂(θ_n + δ_n) and to update θ_n to θ_{n+1} (a sketch of such a conjugate gradient routine follows step S10); the direct memory access unit reads the parameter vector θ_n to be updated from the external space into the data processing unit; the operation control submodule directs the relevant operation modules to perform the following operations: obtaining the update vector δ_n with the conjugate gradient operation submodule and the basic operation submodule; finally, updating θ_n to θ_{n+1} using the vector addition of the basic operation submodule;

Step S9: the controller unit reads an IO instruction from the instruction cache unit, and, according to the decoded microinstructions, the updated parameter vector θ_{n+1} is transferred from the data processing unit through the direct memory access unit to the externally designated space;

Step S10: the controller unit reads a convergence-judgment instruction from the instruction cache unit, and, according to the decoded microinstructions, the data processing unit judges whether the updated parameter vector θ_{n+1} has converged: if it has, the computation ends; otherwise, the iteration count n is increased by 1 and execution returns to step S5.
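Step S8 treats the preconditioned conjugate gradient routine as a black box. Below is a minimal textbook sketch of what the conjugate gradient method operation submodule computes; the implicit operator A, the preconditioner M_inv, the tolerance, and the iteration cap are illustrative assumptions rather than details fixed by the text:

```python
import numpy as np

def pcg(A, b, M_inv=lambda r: r, tol=1e-8, max_iter=250):
    """Preconditioned conjugate gradient: solve A x = b for the update
    vector, where A(v) applies the damped Gauss-Newton operator implicitly."""
    x = np.zeros_like(b)
    r = b - A(x)               # residual
    z = M_inv(r)               # preconditioned residual
    p = z.copy()               # search direction
    rz = r @ z
    for _ in range(max_iter):
        Ap = A(p)
        alpha = rz / (p @ Ap)  # exact minimizer along direction p
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p  # keep directions A-conjugate
        rz = rz_new
    return x
```

With the identity preconditioner this reduces to plain CG; Hessian-Free implementations commonly use a diagonal preconditioner built from accumulated squared gradients.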

As yet another aspect of the present invention, an apparatus for executing a Hessian-Free training algorithm is also provided, in whose controller a program executing the method for executing the Hessian-Free training algorithm described above is hard-wired.

Based on the above technical solutions, the apparatus and method of the present invention have the following beneficial effects: with this apparatus the Hessian-Free training algorithm can be implemented and the training of various neural networks completed, such as auto-encoders and recurrent neural networks (RNNs); by using a device dedicated to executing the Hessian-Free training algorithm, the problems of insufficient arithmetic performance of general-purpose processors on such data and of large front-end decoding overhead can be solved, accelerating the execution of related applications; at the same time, the use of the data cache unit avoids repeatedly reading data from memory and reduces the memory-access bandwidth required.

Brief Description of the Drawings

FIG. 1 is an example block diagram of the overall structure of an apparatus for implementing applications related to the Hessian-Free training algorithm according to an embodiment of the present invention;

FIG. 2 is an example block diagram of the data processing unit in an apparatus for implementing applications related to the Hessian-Free training algorithm according to an embodiment of the present invention;

FIG. 3 is a flow chart of the computation for implementing applications related to the Hessian-Free training algorithm according to an embodiment of the present invention.

Detailed Description

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below in conjunction with specific embodiments and with reference to the accompanying drawings. Other aspects, advantages, and salient features of the present invention will become apparent to those skilled in the art from the following detailed description.

In this specification, the various embodiments described below to illustrate the principles of the invention are illustrative only and should not be construed in any way as limiting the scope of the invention. The following description, which refers to the accompanying drawings, is intended to assist a comprehensive understanding of the exemplary embodiments of the invention as defined by the claims and their equivalents. The description includes numerous specific details to assist understanding, but these details are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness, and the same reference numerals are used for similar functions and operations throughout the drawings.

The present invention discloses an apparatus for executing a Hessian-Free training algorithm, comprising an instruction cache unit, an instruction decoding unit, a direct memory access unit, a data processing unit, and a data cache module. With this apparatus the Hessian-Free training algorithm can be implemented and various neural networks trained, such as auto-encoders and recurrent neural networks (RNNs). At each iteration, a second-order Taylor expansion of the error function (the objective function) is formed and a damping term is added, giving an estimate of the objective function; then, from the current gradient, Gauss-Newton matrix, damping function, and damping constant, the update vector is obtained with a preconditioned conjugate gradient method (Preconditioning CG-Minimize) and the parameters to be updated are updated. Iteration continues until the parameter vector to be updated converges.

More specifically, the apparatus of the present invention includes a direct memory access unit, an instruction cache unit, a controller unit, a data cache unit, and a data processing unit. The direct memory access unit can access the external address space and can read and write data to each cache unit inside the apparatus to complete the loading and storing of data; specifically, it reads instructions into the instruction cache unit, reads the parameters to be updated and the corresponding gradient values from designated storage into the data processing unit, and writes the updated parameter vector from the data processing unit directly to the externally designated space. The instruction cache unit reads instructions through the direct memory access unit and caches them. The controller unit reads instructions from the instruction cache unit, decodes them into microinstructions that control the behavior of the other modules, and sends them to the other modules, such as the direct memory access unit, the data cache unit, and the data processing unit. The data cache unit stores intermediate variables needed while the apparatus runs and initializes and updates these variables. The data processing unit performs the corresponding arithmetic operations according to the instructions.

In addition, the present invention also discloses a method for executing a Hessian-Free training algorithm, comprising the following steps:

Step (1): through an instruction, complete the initialization of the data cache unit, i.e. initialize the second-order estimate f̂ of f(θ); specifically, perform a zero-setting operation on the gradient ∇f(θ_n), the Gauss-Newton matrix G_f, the damping coefficient λ, and the damping function R̂_{θ_n} within it.

Step (2): through an IO instruction, the direct memory access unit completes the operation of reading the parameter vector to be updated from the external space.

Step (3): according to the corresponding instruction, the data processing unit performs a second-order Taylor expansion of the error function f(θ) at θ_n and adds a damping term λ R̂_{θ_n}(δ_n), obtaining the estimate f̂ of f(θ) near θ_n, namely

f̂(θ_n + δ_n) = f(θ_n) + ∇f(θ_n)ᵀ δ_n + ½ δ_nᵀ G_f δ_n + λ R̂_{θ_n}(δ_n),

where G_f is the Gauss-Newton matrix of f at θ_n, δ_n is the update vector, the damping coefficient λ is obtained with Levenberg-Marquardt style heuristics, and the damping function R̂_{θ_n} is the value at θ_n of a function predetermined according to the training model. For example, when RNN training is performed, R̂_{θ_n} takes the structural-damping form sketched earlier, where S, like f, is a distance function, G_S is the Gauss-Newton matrix of S at θ_n, and μ (a weighting constant) is a predetermined positive number.

Step (4): according to the corresponding instruction, the data processing unit runs a preconditioned conjugate gradient method to find the δ_n that minimizes f̂(θ_n + δ_n), and updates θ_n to θ_{n+1}. The update operation is:

θ_{n+1} = θ_n + δ_n.

It is worth mentioning that, during the preconditioned conjugate gradient method, only a "mini-batch" rather than all samples is used, and the Gauss-Newton matrix-vector products involved are all computed by an implicit approximate calculation (Pearlmutter's R{}-method) rather than by forming the matrix. This improves the efficiency of learning on large data sets, i.e. the computational efficiency of the data operation module, and also avoids the amount of computation growing quadratically with the number of parameters.

Step (5): the data processing unit judges whether the updated parameter vector has converged; if it has, the computation ends; otherwise, execution returns to step (2).

The apparatus implementing the Hessian-Free training algorithm according to embodiments of the present invention can be used to support applications of the Hessian-Free training algorithm. A space is opened in the data cache unit to store the second-order estimate of the error function near the parameters to be updated in each generation; each time the preconditioned conjugate gradient method is run, an update vector is computed from that second-order estimate and the vector to be updated is then updated. These steps are repeated until the vector to be updated converges.

The technical solutions of the present invention are further described below with reference to the accompanying drawings.

FIG. 1 shows an example block diagram of the overall structure of an apparatus for implementing the Hessian-Free training algorithm according to an embodiment of the present invention. As shown in FIG. 1, the apparatus includes a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing unit 5, all of which can be implemented as hardware circuits.

The direct memory access unit 1 can access the external address space and can read and write data to each cache unit inside the apparatus to complete the loading and storing of data. Specifically, it reads instructions into the instruction cache unit 2, reads the parameters to be updated from designated storage into the data processing unit 5, and writes the updated parameter vector from the data processing unit 5 directly to the externally designated space.

The instruction cache unit 2 reads instructions through the direct memory access unit 1 and caches them.

The controller unit 3 reads instructions from the instruction cache unit 2, decodes them into microinstructions that control the behavior of the other modules, and sends them to the other modules, such as the direct memory access unit 1, the data cache unit 4, and the data processing unit 5.

The data cache unit 4 initializes f̂ at device initialization; specifically, it initializes the gradient ∇f(θ_n), the Gauss-Newton matrix G_f, the damping coefficient λ, and the damping function R̂_{θ_n} within it. Before the n-th update of the parameter vector θ_n to be updated begins, f̂ is read out to the data processing unit 5. The update vector δ_n is obtained in the data processing unit 5, θ_n is updated to θ_{n+1}, and f̂ is updated correspondingly; f̂ is then written back to the data cache unit 4 (the new data overwrites the previous corresponding data) for the next use.

The data processing unit 5 reads f̂ from the data cache unit 4 and reads the parameter vector θ_n to be updated from the externally designated space through the direct memory access unit 1. The update vector δ_n is obtained inside the module, θ_n is updated to θ_{n+1}, and f̂ is updated correspondingly; f̂ is then written back to the data cache unit 4, and θ_{n+1} is written to the externally designated space through the direct memory access unit 1.

FIG. 2 shows an example block diagram of the data processing unit in an apparatus for implementing applications related to the Hessian-Free training algorithm according to an embodiment of the present invention. As shown in FIG. 2, the data processing unit includes an operation control submodule 51, a gradient operation submodule 52, a damping-term operation submodule 53, a Gauss-Newton matrix operation submodule 54, a conjugate gradient method operation submodule 55, and a basic operation submodule 56. The basic operation submodule 56 performs the basic operations of addition and multiplication between matrices and vectors; submodules 52, 53, 54, and 55 all call submodule 56 and, as the situation requires, are also allowed to call one another.

FIG. 3 shows the overall flow chart of the apparatus performing the relevant computation according to the Hessian-Free training algorithm.

In step S1, an IO instruction is pre-stored at the first address of the instruction cache unit 2.

In step S2, the computation starts: the controller unit 3 reads that IO instruction from the first address of the instruction cache unit 2, and, according to the decoded microinstructions, the direct memory access unit 1 reads all instructions related to the Hessian-Free computation from the external address space and caches them in the instruction cache unit 2.

In step S3, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and, according to the decoded microinstructions, the direct memory access unit 1 reads the initial parameter vector θ_0 to be updated from the external space into the data processing unit 5.

In step S4, the controller unit 3 reads an assignment instruction from the instruction cache unit 2; according to the decoded microinstructions, f̂ in the data cache unit 4 is initialized, and the iteration count n in the data processing unit 5 is set to 0.

In step S5, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and, according to the decoded microinstructions, the direct memory access unit 1 reads the parameter vector θ_n to be updated from the external space and passes it to the data processing unit 5.

In step S6, the controller unit 3 reads from the instruction cache unit 2 an instruction that performs a second-order estimate of the error function near the current parameter vector value, and, according to the decoded microinstructions, carries out the operation of computing the second-order estimate f̂ of f(θ) near θ_n. In this operation the instruction is sent to the operation control submodule 51, which issues the corresponding instructions to perform the following operations: computing ∇f(θ_n) with the gradient operation submodule 52; obtaining the Gauss-Newton matrix G_f of f at θ_n using the Gauss-Newton operation submodule 54 and the matrix multiplication of the basic operation submodule 56; and executing the LM heuristic with the damping-term operation submodule 53 and the basic operation submodule 56 to obtain the damping coefficient λ and hence the damping term λ R̂_{θ_n}(δ_n). Finally, the resulting expression for f̂ is stored in the data cache unit 4.

In step S7, the controller unit 3 reads a data-transfer instruction from the instruction cache unit 2, and, according to the decoded microinstructions, f̂ is transferred from the data cache unit 4 to the data processing unit 5.

In step S8, the controller unit 3 reads a parameter-update operation instruction from the instruction cache unit 2 and, according to the decoded microinstructions, runs the preconditioned conjugate gradient method to find the δ_n that minimizes f̂(θ_n + δ_n) and to update θ_n to θ_{n+1}. The direct memory access unit 1 reads the parameter vector θ_n to be updated from the external space into the data processing unit 5. The operation control submodule 51 directs the relevant operation modules as follows: the update vector δ_n is obtained with the conjugate gradient operation submodule 55 and the basic operation submodule 56; depending on the expression of the damping function R̂_{θ_n}, the Gauss-Newton operation module may also need to be called (as in the RNN example mentioned earlier). Finally, θ_n is updated to θ_{n+1} using the vector addition of the basic operation submodule 56.

In step S9, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and, according to the decoded microinstructions, the updated parameter vector θ_{n+1} is transferred from the data processing unit 5 through the direct memory access unit 1 to the externally designated space.

In step S10, the controller unit reads a convergence-judgment instruction from the instruction cache unit 2, and, according to the decoded microinstructions, the data processing unit judges whether the updated parameter vector θ_{n+1} has converged: if it has, the computation ends; otherwise, the iteration count n is increased by 1 and execution returns to step S5.

The processes or methods depicted in the preceding figures may be performed by processing logic comprising hardware (e.g., circuits, dedicated logic), firmware, software (e.g., software carried on a non-transitory computer-readable medium), or a combination of the two. Although the processes or methods are described above in a certain order, it should be understood that some of the described operations can be performed in a different order; moreover, some operations may be performed in parallel rather than sequentially.

In the foregoing specification, embodiments of the invention have been described with reference to specific exemplary embodiments thereof. Obviously, various modifications may be made to the embodiments without departing from the broader spirit and scope of the invention as set forth in the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.

Claims (15)

1. A computing device for executing a Hessian-Free training algorithm, the computing device being for performing neural network operations; the computing device comprises a controller unit and a data processing unit, and is characterized in that:
the controller unit is used for decoding each read instruction into microinstructions for controlling the data processing unit and sending the microinstructions to the data processing unit;
the data processing unit is used for performing computation on input data according to the microinstructions to obtain the result of the computation instruction; the data processing unit comprises an operation control submodule and a basic operation submodule, as well as a gradient operation submodule, a damping-term operation submodule, a Gauss-Newton matrix operation submodule, and a conjugate gradient method operation submodule;
the data cache unit is used for storing intermediate variables of the computation and performing initialization and update operations on the intermediate variables; wherein the data cache unit initializes the second-order estimate f̂ of f(θ) at device initialization; before the n-th update of the parameter vector θ_n to be updated begins, f̂ is read out to the data processing unit, and after the update vector is obtained in the data processing unit, f̂ is written back; wherein θ is the parameter vector to be updated, θ_n is the parameter vector to be updated at the n-th time, f(θ) is the error function, namely a function of the deviation of the actual value of the result from the predicted value, δ_n is the update vector, and θ_{n+1} = θ_n + δ_n.
2. The computing device of claim 1, wherein the basic operation submodule performs basic addition and multiplication operations between matrices and/or vectors.
3. The computing device of claim 1, wherein the gradient operation sub-module, the damping term operation sub-module, the gauss-newton matrix operation sub-module, and the conjugate gradient method operation sub-module are all capable of invoking the basic operation sub-module, and the gradient operation sub-module, the damping term operation sub-module, the gauss-newton matrix operation sub-module, and the conjugate gradient method operation sub-module are allowed to invoke each other as appropriate.
4. The computing device of claim 1,
the data processing unit stores an intermediate variable in the data cache unit when performing an arithmetic operation under the control of the controller unit.
5. The computing device of claim 1, wherein, in the step in which the data cache unit initializes f̂, it initializes the gradient ∇f(θ_n), the Gauss-Newton matrix G_f, the damping coefficient λ, and the damping function R̂_{θ_n} therein, wherein

f̂(θ_n + δ_n) = f(θ_n) + ∇f(θ_n)ᵀ δ_n + ½ δ_nᵀ G_f δ_n + λ R̂_{θ_n}(δ_n);

the gradient ∇f(θ_n) refers to the gradient value of f at θ_n, and G_f is the Gauss-Newton matrix of f at θ_n; the damping function R̂_{θ_n} is the value at θ_n of a function predetermined according to the training model; the damping coefficient λ is obtained by an LM-style heuristic method;
the data processing unit reads f̂ from the data cache unit and reads the parameter vector θ_n to be updated from the externally designated space; it obtains the update vector δ_n within the module, updates θ_n to θ_{n+1}, and updates f̂ correspondingly; it then writes f̂ into the data cache unit and writes θ_{n+1} into the externally designated space; wherein θ_{n+1} is the (n+1)-th parameter vector to be updated, and the updated f̂ is the second-order estimate of f at θ_{n+1}.
6. A computing method for executing a Hessian-Free training algorithm, the computing method being for performing neural network operations, the computing method comprising the steps of:
the controller unit decodes each read instruction into microinstructions for controlling the data processing unit and sends the microinstructions to the data processing unit;
the data processing unit performs computation on input data according to the microinstructions to obtain the result of the computation instruction; the data processing unit comprises an operation control submodule and a basic operation submodule, and a gradient operation submodule, a damping-term operation submodule, a Gauss-Newton matrix operation submodule, and a conjugate gradient method operation submodule;
a data cache unit stores intermediate variables generated during the computation and performs initialization and update operations on the intermediate variables, wherein the data cache unit initializes the second-order estimate f̂ of f(θ) at device initialization; before the n-th update of the parameter vector θ_n to be updated begins, f̂ is read out to the data processing unit, and after the update vector is obtained in the data processing unit, f̂ is written back; wherein θ is the parameter vector to be updated, θ_n is the parameter vector to be updated at the n-th time, f(θ) is the error function, namely a function of the deviation of the actual value of the result from the predicted value, δ_n is the update vector, and θ_{n+1} = θ_n + δ_n.
7. The computing method of claim 6, wherein the basic operation submodule performs basic addition and multiplication operations between matrices and/or vectors.
8. The computing method of claim 6, wherein the gradient operation submodule, the damping-term operation submodule, the Gauss-Newton matrix operation submodule, and the conjugate gradient method operation submodule are all capable of calling the basic operation submodule, and the gradient operation submodule, the damping-term operation submodule, the Gauss-Newton matrix operation submodule, and the conjugate gradient method operation submodule are allowed to call each other as the case may be.
9. The computing method of claim 6, wherein, in the step in which the data cache unit initializes f̂, it initializes the gradient ∇f(θ_n), the Gauss-Newton matrix G_f, the damping coefficient λ, and the damping function R̂_{θ_n} therein, wherein

f̂(θ_n + δ_n) = f(θ_n) + ∇f(θ_n)ᵀ δ_n + ½ δ_nᵀ G_f δ_n + λ R̂_{θ_n}(δ_n);

the gradient ∇f(θ_n) refers to the gradient value of f at θ_n, and G_f is the Gauss-Newton matrix of f at θ_n; the damping function R̂_{θ_n} is the value at θ_n of a function predetermined according to the training model; the damping coefficient λ is obtained by an LM-style heuristic method;
the data processing unit reads f̂ from the data cache unit and reads the parameter vector θ_n to be updated from the externally designated space; it obtains the update vector δ_n within the module, updates θ_n to θ_{n+1}, and updates f̂ correspondingly; it then writes f̂ into the data cache unit and writes θ_{n+1} into the externally designated space; wherein θ_{n+1} is the (n+1)-th parameter vector to be updated, and the updated f̂ is the second-order estimate of f at θ_{n+1}.
10. A computing method for executing a Hessian-Free training algorithm, comprising the following steps:
step (1): through an instruction, complete the initialization of the data cache unit, i.e. initialize the second-order estimate f̂ of f(θ); wherein θ is the parameter vector to be updated, θ_n is the parameter vector to be updated at the n-th time, f(θ) is the error function, namely a function of the deviation of the actual value of the result from the predicted value, δ_n is the update vector, and θ_{n+1} = θ_n + δ_n;
step (2): through an IO instruction, complete the operation of reading the parameter vector to be updated from the external space;
step (3): according to the corresponding instruction, the data processing unit performs a second-order Taylor expansion of the error function f(θ) at θ_n and adds a damping term λ R̂_{θ_n}(δ_n), obtaining the estimate f̂ of f(θ) near θ_n, namely

f̂(θ_n + δ_n) = f(θ_n) + ∇f(θ_n)ᵀ δ_n + ½ δ_nᵀ G_f δ_n + λ R̂_{θ_n}(δ_n),

wherein G_f is the Gauss-Newton matrix of f at θ_n, the damping coefficient λ is obtained by an LM-style heuristic method, and the damping function R̂_{θ_n} is the value at θ_n of a function predetermined according to the training model;
step (4): according to the corresponding instruction, the data processing unit finds, by a preconditioned conjugate gradient method, the δ_n that minimizes f̂(θ_n + δ_n), and updates θ_n to θ_{n+1}; the specific update operation is:
θ_{n+1} = θ_n + δ_n;
step (5): the data processing unit judges whether the updated parameter vector has converged; if it has, the computation ends; otherwise, execution returns to step (2).
11. The computing method of claim 10, wherein the step of completing the initialization of the data cache unit in step (1) comprises: performing a zero-setting operation on the gradient ∇f(θ_n), the Gauss-Newton matrix G_f, the damping coefficient λ, and the damping function R̂_{θ_n}.
12. The computing method of claim 10, wherein, in step (3), when RNN training is performed, the damping function R̂_{θ_n} is built from the Gauss-Newton matrix G_S of S at θ_n, where S and f are both distance functions and μ is a predetermined positive number.
13. The computing method of claim 10, wherein, in the step (4) of finding, by the preconditioned conjugate gradient method, the δ_n that minimizes f̂(θ_n + δ_n), only a "mini-batch" rather than all samples is used during the preconditioned conjugate gradient procedure, and the Gauss-Newton matrix-vector products involved therein are computed by an implicit approximate calculation.
14. A computational method for performing a Hessian-Free training algorithm, comprising the steps of:
step S1, an IO instruction is pre-stored in the first address of the instruction cache unit;
step S2, when the operation starts, the controller unit reads the IO instruction from the first address of the instruction cache unit, and according to the translated micro instruction, the direct memory access unit reads all instructions related to Hessian-Free calculation from the external address space and caches the instructions into the instruction cache unit;
step S3, the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction, the direct memory access unit reads the initial parameter vector θ_0 to be updated from the external space into the data processing unit;
step S4, the controller unit reads an assignment instruction from the instruction cache unit, and according to the translated microinstruction the data cache unit initializes f̂(θ), and the iteration count n in the data processing unit is set to 0; wherein θ is the parameter vector to be updated, θ_n is the parameter vector to be updated at the n-th iteration, f(θ) is the error function, i.e. a function measuring the deviation of the actual value of the result from the predicted value, δ_n is the update vector with θ_{n+1} = θ_n + δ_n, and f̂(θ) is the second-order estimate of f(θ);
step S5, the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction, the direct memory access unit reads the parameter vector θ_n to be updated from the external space into the data processing unit;
step S6, the controller unit reads, from the instruction cache unit, an instruction for second-order estimation of the error function near the current parameter vector value, and according to the translated microinstruction performs the computation of the second-order estimate f̂(θ_n + δ) of f(θ) near θ_n; in this operation, the instruction is sent to the operation control submodule in the data processing unit, and the operation control submodule issues the corresponding instructions to perform the following operations: compute the gradient ∇f(θ_n) using the gradient operation submodule; obtain the Gauss-Newton matrix G_f of f at θ_n using the matrix multiplication of the Gauss-Newton operation submodule and the basic operation submodule; execute the LM heuristic using the damping term operation submodule and the basic operation submodule to obtain the damping coefficient λ and thereby the damping term λ·R̂_{θ_n}(δ); finally, obtain the expression

f̂(θ_n + δ) = f(θ_n) + ∇f(θ_n)^T·δ + (1/2)·δ^T·G_f·δ + λ·R̂_{θ_n}(δ)

and store it in the data cache unit; wherein the damping function R̂_{θ_n} is a function predetermined according to the training model, evaluated at θ_n;
in step S7, the controller unit reads a data transfer instruction from the instruction cache unit, and according to the translated microinstruction, f̂(θ_n + δ) is transferred from the data cache unit to the data processing unit;
step S8, the controller unit reads a parameter update operation instruction from the instruction cache unit, and according to the translated microinstruction computes δ_n by the preconditioned conjugate gradient method such that f̂(θ_n + δ_n) reaches its minimum value, and updates θ_n to θ_{n+1}; the direct memory access unit reads the parameter vector θ_n to be updated from the external space into the data processing unit; the operation control submodule controls the related operation modules to perform the following operations: obtain the update vector δ_n using the conjugate gradient operation submodule and the basic operation submodule; finally, update θ_n to θ_{n+1} by vector addition in the basic operation submodule;
step S9, the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction, the updated parameter vector θ_{n+1} is transferred from the data processing unit to the externally designated space through the direct memory access unit;
in step S10, the controller unit reads a convergence judgment instruction from the instruction cache unit, and according to the translated microinstruction, the data processing unit judges whether the updated parameter vector θ_{n+1} has converged: if converged, the operation ends; otherwise, the iteration count n is incremented by 1, and the process returns to step S5.
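Steps (3) and S6 invoke an "LM type heuristic" for the damping coefficient λ without spelling it out. The standard Levenberg-Marquardt rule from the Hessian-Free literature cited by the patent, sketched below as an assumption rather than the patented rule, compares the actual decrease of f with the decrease predicted by the quadratic model f̂ and relaxes or tightens the damping accordingly:

```python
def lm_adjust(lam, f_old, f_new, predicted_reduction):
    """Levenberg-Marquardt style update of the damping coefficient.
    rho near 1 means the quadratic model matched the true decrease of f,
    so damping can be relaxed; rho small or negative means the model
    overshot, so damping is tightened. predicted_reduction must be the
    (positive) decrease promised by the model fhat."""
    rho = (f_old - f_new) / predicted_reduction
    if rho > 0.75:
        lam *= 2.0 / 3.0
    elif rho < 0.25:
        lam *= 3.0 / 2.0
    return lam
```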
15. A computing device for executing a Hessian-Free training algorithm, wherein a program embodied in a controller of the computing device executes the computing method of any one of claims 6 to 14.
CN201610283885.XA 2016-04-29 2016-04-29 An apparatus and method for executing a Hessian-Free training algorithm Active CN107341540B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610283885.XA CN107341540B (en) 2016-04-29 2016-04-29 An apparatus and method for executing a Hessian-Free training algorithm
PCT/CN2016/081842 WO2017185413A1 (en) 2016-04-29 2016-05-12 Device and method for executing hessian-free training algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610283885.XA CN107341540B (en) 2016-04-29 2016-04-29 An apparatus and method for executing a Hessian-Free training algorithm

Publications (2)

Publication Number Publication Date
CN107341540A CN107341540A (en) 2017-11-10
CN107341540B (en) 2021-07-20

Family

ID=60160584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610283885.XA Active CN107341540B (en) 2016-04-29 2016-04-29 An apparatus and method for executing a Hessian-Free training algorithm

Country Status (2)

Country Link
CN (1) CN107341540B (en)
WO (1) WO2017185413A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934208B (en) * 2019-04-22 2024-07-23 江苏邦融微电子有限公司 Hardware acceleration system and method for fingerprint identification
US11444483B2 (en) * 2020-01-14 2022-09-13 Hitachi Energy Switzerland Ag Adaptive state estimation for power systems
CN111626434B (en) * 2020-05-15 2022-06-07 浪潮电子信息产业股份有限公司 Distributed training parameter updating method, device, equipment and storage medium
CN112990958B (en) * 2021-01-19 2024-10-18 腾讯科技(深圳)有限公司 Data processing method, device, storage medium and computer equipment
CN114358180B (en) * 2021-12-31 2025-05-06 海光信息技术股份有限公司 Processor pre-fetch training method, processing device, processor and computing equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9390370B2 (en) * 2012-08-28 2016-07-12 International Business Machines Corporation Training deep neural network acoustic models using distributed hessian-free optimization
US9601109B2 (en) * 2013-12-06 2017-03-21 International Business Machines Corporation Systems and methods for accelerating hessian-free optimization for deep neural networks by implicit preconditioning and sampling

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1658550A (en) * 2004-04-16 2005-08-24 威盛电子股份有限公司 Apparatus and method for performing cryptographic operations
CN1834898A (en) * 2005-05-16 2006-09-20 威盛电子股份有限公司 Microprocessor apparatus and method for modular exponentiation
CN102156637A (en) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 Vector crossing multithread processing method and vector crossing multithread microprocessor
WO2016037351A1 (en) * 2014-09-12 2016-03-17 Microsoft Corporation Computing system for training neural networks

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Yunji Chen et al., "DaDianNao: A Machine-Learning Supercomputer," 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014, pp. 609-622. *
James Martens, "Deep learning via Hessian-free optimization," Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 1-8. *
James Martens et al., "Training Deep and Recurrent Networks with Hessian-Free Optimization," Neural Networks: Tricks of the Trade, 2012, pp. 1-58. *
Ryan Kiros, "Training Neural Networks with Stochastic Hessian-Free Optimization," arXiv:1301.3641v3 [cs.LG], 2013, pp. 1-11. *

Also Published As

Publication number Publication date
WO2017185413A1 (en) 2017-11-02
CN107341540A (en) 2017-11-10

Similar Documents

Publication Publication Date Title
CN107341540B (en) An apparatus and method for executing a Hessian-Free training algorithm
CN111860812B (en) Apparatus and method for performing convolutional neural network training
KR102331978B1 (en) Device and method for executing forward calculation of artificial neural network
CN110929863B (en) Apparatus and method for performing LSTM operations
KR102175044B1 (en) Apparatus and method for running artificial neural network reverse training
CN109375951B (en) Device and method for executing forward operation of full-connection layer neural network
CN113537481B (en) Device and operation method for performing LSTM neural network operation
CN107341132B (en) Device and method for executing AdaGrad gradient descent training algorithm
WO2017185347A1 (en) Apparatus and method for executing recurrent neural network and lstm computations
WO2017185257A1 (en) Device and method for performing adam gradient descent training algorithm
CN107704267A Convolutional neural network operation instruction and method therefor
CN108334944B (en) Artificial neural network operation device and method
WO2017185336A1 (en) Apparatus and method for executing pooling operation
CN107315570B (en) Apparatus and method for executing Adam gradient descent training algorithm
WO2017185248A1 (en) Apparatus and method for performing auto-learning operation of artificial neural network
WO2018112892A1 (en) Device and method for supporting fast artificial neural network operation
CN111860814B (en) Apparatus and method for performing batch normalization operations
WO2017185256A1 (en) Rmsprop gradient descent algorithm execution apparatus and method
CN107329733B Apparatus and method for performing pooling operations
CN107315569B (en) An apparatus and method for performing RMSprop gradient descent algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing

Applicant after: Zhongke Cambrian Technology Co.,Ltd.

Address before: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing

Applicant before: Beijing Zhongke Cambrian Technology Co.,Ltd.

GR01 Patent grant
TG01 Patent term adjustment