CN107341540B

CN107341540B - An apparatus and method for executing a Hessian-Free training algorithm

Info

Publication number: CN107341540B
Application number: CN201610283885.XA
Authority: CN
Inventors: 张士锦; 郭崎; 陈天石; 陈云霁
Original assignee: Cambricon Technologies Corp Ltd
Current assignee: Cambricon Technologies Corp Ltd
Priority date: 2016-04-29
Filing date: 2016-04-29
Publication date: 2021-07-20
Anticipated expiration: 2036-04-29
Also published as: WO2017185413A1; CN107341540A

Abstract

A computing device and method including a controller unit and a data processing unit. The data processing unit includes an operation control submodule, a basic operation submodule, and at least one of a gradient operation submodule, a damping term operation submodule, a Gauss-Newton matrix operation submodule, and a conjugate gradient method operation submodule. Using this device, the training of various neural networks, such as auto-encoder, convolutional neural network RNN, etc., can be completed. The invention can solve the problems of insufficient computing performance of the general-purpose processor for data and large overhead for decoding in the front section, and accelerate the execution speed of related applications; at the same time, the application of the data cache unit avoids repeatedly reading data from the memory and reduces memory access. bandwidth.

Description

An apparatus and method for executing a Hessian-Free training algorithm

技术领域technical field

本发明涉及神经网络运算技术领域，更具体地涉及一种用于执行Hessian-Free训练算法的装置和方法。The present invention relates to the technical field of neural network operations, and more particularly to a device and method for executing a Hessian-Free training algorithm.

背景技术Background technique

梯度下降法在函数逼近、优化计算、模式识别和图像处理等领域被广泛应用。目前主流的神经网络的训练方法是梯度下降法(结合BP算法)，但是这种方法忽略了误差函数的曲率信息，不仅容易出现参数变化过度平缓，从而无法收敛到局部最优点的情况，而且无法很好的处理“病态曲率”的误差函数(比如Rosenbrock函数)。Hessian-Free训练算法很好的解决了这个问题，并且通过一些细节改进，使之没有出现运算量关于参数数量平方增长(和梯度下降法一样是线性增长)的情况。Gradient descent is widely used in the fields of function approximation, optimization calculation, pattern recognition and image processing. The current mainstream neural network training method is the gradient descent method (combined with the BP algorithm), but this method ignores the curvature information of the error function. Very good handling of "ill-conditioned curvature" error functions (such as the Rosenbrock function). The Hessian-Free training algorithm solves this problem very well, and through some detailed improvements, it does not appear that the amount of computation increases squarely with respect to the number of parameters (linear growth like the gradient descent method).

目前，一种执行Hessian-Free训练算法的已知方法是使用通用处理器。该方法通过使用通用寄存器堆和通用功能部件执行通用指令来支持上述算法。该方法的缺点之一是单个通用处理器的运算性能较低。而多个通用处理器并行执行时，通用处理器之间相互通信又成为了性能瓶颈。另外，通用处理器需要把Hessian-Free训练算法对应的相关运算译码成一长列运算及访存指令序列，处理器前端译码带来了较大的功耗开销。Currently, one known way to perform the Hessian-Free training algorithm is to use a general-purpose processor. The method supports the above-described algorithms by executing general-purpose instructions using a general-purpose register file and general-purpose functional units. One of the disadvantages of this method is the low computational performance of a single general-purpose processor. When multiple general-purpose processors execute in parallel, the mutual communication between general-purpose processors has become a performance bottleneck. In addition, the general-purpose processor needs to decode the related operations corresponding to the Hessian-Free training algorithm into a long sequence of operations and memory access instructions, and the front-end decoding of the processor brings a large power consumption overhead.

另一种执行Hessian-Free训练算法的已知方法是使用图形处理器(GPU)。该方法通过使用通用寄存器堆和通用流处理单元执行通用SIMD指令来支持上述算法。由于GPU是专门用来执行图形图像运算以及科学计算的设备，没有对Hessian-Free训练算法相关运算的专门支持，仍然需要大量的前端译码工作才能执行Hessian-Free训练算法中相关的运算，带来了大量的额外开销。另外，GPU只有较小的片上缓存，运算中所需数据(如高斯-牛顿矩阵等)需要反复从片外搬运，片外带宽成为了主要性能瓶颈，同时带来了巨大的功耗开销。Another known method of performing the Hessian-Free training algorithm is to use a graphics processing unit (GPU). The method supports the above algorithms by executing general-purpose SIMD instructions using a general-purpose register file and a general-purpose stream processing unit. Since the GPU is a device specially used to perform graphics and image operations and scientific computing, there is no special support for the related operations of the Hessian-Free training algorithm, and a large amount of front-end decoding work is still required to perform the related operations in the Hessian-Free training algorithm. Comes with a lot of extra overhead. In addition, the GPU has only a small on-chip cache, and the data required in the operation (such as Gauss-Newton matrix, etc.) needs to be repeatedly transferred from the off-chip. The off-chip bandwidth has become the main performance bottleneck, and it also brings huge power consumption.

发明内容SUMMARY OF THE INVENTION

有鉴于此，本发明的目的在于提供一种用于执行Hessian-Free训练算法的装置和方法，以期坚决上述技术问题中的至少之一。In view of this, the purpose of the present invention is to provide an apparatus and method for executing a Hessian-Free training algorithm, so as to resolve at least one of the above technical problems.

为了实现上述目的，作为本发明的一个方面，本发明提供了一种用于执行Hessian-Free训练算法的装置，包括：In order to achieve the above object, as an aspect of the present invention, the present invention provides a device for executing a Hessian-Free training algorithm, comprising:

控制器单元，用于将读取的指令译码为控制相应模块的微指令，并将其发送给相应模块；The controller unit is used to decode the read instruction into a micro-instruction for controlling the corresponding module, and send it to the corresponding module;

数据缓存单元，用于存储运算过程中的中间变量，并对所述中间变量执行初始化及更新操作；a data cache unit, used for storing intermediate variables in the operation process, and performing initialization and update operations on the intermediate variables;

数据处理单元，用于在所述控制器单元的控制下执行运算操作，并将中间变量存储于所述数据缓存单元中。A data processing unit, configured to perform arithmetic operations under the control of the controller unit, and store intermediate variables in the data buffer unit.

其中，所述数据处理单元包括运算控制子模块、梯度运算子模块、阻尼项运算子模块、高斯-牛顿矩阵运算子模块、共轭梯度法运算子模块及基本运算子模块；其中，所述基本运算子模块进行的是矩阵、向量之间的加、乘基础运算；Wherein, the data processing unit includes an operation control sub-module, a gradient operation sub-module, a damping term operation sub-module, a Gauss-Newton matrix operation sub-module, a conjugate gradient method operation sub-module and a basic operation sub-module; wherein the basic operation sub-module The operation sub-module performs basic operations of addition and multiplication between matrices and vectors;

作为优选，所述梯度运算子模块、阻尼项运算子模块、高斯-牛顿矩阵运算子模块、共轭梯度法运算子模块均能够调用所述基本运算子模块，且根据情况所述梯度运算子模块、阻尼项运算子模块、高斯-牛顿矩阵运算子模块、共轭梯度法运算子模块之间允许互相调用。Preferably, the gradient operation sub-module, the damping term operation sub-module, the Gauss-Newton matrix operation sub-module, and the conjugate gradient method operation sub-module can all call the basic operation sub-module, and the gradient operation sub-module according to the situation , Damping term operation sub-module, Gauss-Newton matrix operation sub-module, and conjugate gradient method operation sub-module are allowed to call each other.

其中，所述数据缓存单元在装置初始化时初始化f(θ)的二阶估计

在第n次待更新参数向量θ_n的更新开始前，将

读出到数据处理单元中，并在所述数据处理单元中得到更新向量后将

再次写入；其中，θ为待更新参数向量，θ_n为第n次待更新参数向量，f(θ)为误差函数，即衡量结果的实际值与预测值偏离的函数；δ_n是更新向量，且θ_n+1＝θ_n+δ_n。Wherein, the data buffer unit initializes the second-order estimation of f(θ) when the device is initialized

Before the nth update of the parameter vector θ _n to be updated starts, the

read out to the data processing unit, and after the update vector is obtained in the data processing unit,

Write again; among them, θ is the parameter vector to be updated, θ _n is the parameter vector to be updated for the nth time, and f(θ) is the error function, that is, the function that measures the deviation between the actual value of the result and the predicted value; δ _n is the update vector , and θ _n+1 = θ _n +δ _n .

其中，所述数据缓存单元在初始化

的步骤中，初始化其中的梯度

高斯-牛顿矩阵G_f、阻尼系数λ和阻尼函数

其中，

所述梯度

指f在θ_n处的梯度值，G_f是f在θ_n处的高斯-牛顿矩阵；阻尼函数

是根据训练模型预先确定好的函数在θ_n处的值；阻尼系数λ通过LM式启发式方法求得；Wherein, the data cache unit is initialized

In the steps of , initialize the gradient in which

Gauss-Newton matrix G _f , damping coefficient λ and damping function

in,

the gradient

refers to the gradient value of f at θ _n , G _f is the Gauss-Newton matrix of f at θ _n ; damping function

is the value of the pre-determined function at θ _n according to the training model; the damping coefficient λ is obtained by the LM heuristic method;

所述数据处理单元从所述数据缓存单元中读取

从外部指定空间中读取待更新参数向量θ_n；在模块内得到更新向量δ_n，将θ_n更新为θ_n+1，对应的

更新为

然后将

写入至所述数据缓存单元，将θ_n+1写入到外部指定空间中；其中，θ_n+1为第n+1次待更新参数向量，

为f(θ+1)的二阶估计。the data processing unit reads from the data cache unit

Read the parameter vector θ _n to be updated from the external specified space; get the update vector δ _n in the module, update θ _n to θ _n+1 , the corresponding

update to

followed by

Write to the data cache unit, and write θ _n+1 into the external specified space; wherein, θ _n+1 is the n+1 th parameter vector to be updated,

is the second-order estimate of f(θ+1).

作为本发明的另一个方面，本发明还提供了一种执行Hessian-Free训练算法的方法，包括以下步骤：As another aspect of the present invention, the present invention also provides a method for executing the Hessian-Free training algorithm, comprising the following steps:

步骤(1)，通过指令，完成数据缓存单元的初始化操作，即初始化f(θ)的二阶估计

其中，θ为待更新参数向量，θ_n为第n次待更新参数向量，f(θ)为误差函数，即衡量结果的实际值与预测值偏离的函数；δ_n是更新向量，且θ_n+1＝θ_n+δ_n；Step (1), through the instruction, complete the initialization operation of the data cache unit, that is, initialize the second-order estimation of f(θ)

Among them, θ is the parameter vector to be updated, θ _n is the parameter vector to be updated for the nth time, f(θ) is the error function, that is, the function that measures the deviation between the actual value of the result and the predicted value; δ _n is the update vector, and θ _{n +1} = θ _n +δ _n ;

步骤(2)，通过IO指令，完成直接内存访问单元从外部空间读取待更新参数向量的操作；Step (2), through the IO instruction, completes the operation that the direct memory access unit reads the parameter vector to be updated from the external space;

步骤(3)，数据处理单元根据相应指令，在θ_n处对误差函数f(θ)进行二阶泰勒展开，并添加阻尼项

得到f(θ)在θ_n附近的估计

即Step (3), the data processing unit performs a second-order Taylor expansion on the error function f(θ) at θ _n according to the corresponding instruction, and adds a damping term

get an estimate of f(θ) around θ _n

which is

其中，G_f是f在θ_n处的高斯-牛顿矩阵；阻尼系数λ通过LM式启发式方法求得；阻尼函数

是根据训练模型预先确定好的函数在θ_n处的值；Among them, G _f is the Gauss-Newton matrix of f at θ _n ; the damping coefficient λ is obtained by the LM heuristic method; the damping function

is the value of the pre-determined function at θ _n according to the training model;

步骤(4)，数据处理单元根据相应的指令，进行有预条件的共轭梯度法求δ_n使得

达到最小值，并把θ_n更新为θ_n+1，具体更新操作为：In step (4), the data processing unit performs a preconditioned conjugate gradient method to obtain δ _n according to the corresponding instructions such that

Reach the minimum value, and update θ n to θ _n ₊₁ , the specific update operation is:

θ_n+1＝θ_n+δ_n；θ _n+1 = θ _n +δ _n ;

步骤(5)，数据处理单元判断更新后的参数向量是否收敛，若收敛，运算结束，否则，转到步骤(2)处继续执行。In step (5), the data processing unit judges whether the updated parameter vector has converged, and if it converges, the operation ends, otherwise, go to step (2) and continue to execute.

其中，步骤(1)中所述完成数据缓存单元的初始化操作的步骤包括：对梯度

高斯-牛顿矩阵G_f、阻尼系数λ和阻尼函数

进行置零操作。Wherein, the step of completing the initialization operation of the data cache unit in step (1) includes:

Gauss-Newton matrix G _f , damping coefficient λ and damping function

Perform zero-setting operation.

其中，步骤(3)中当进行RNN训练时，Wherein, when performing RNN training in step (3),

阻尼函数

damping function

其中S和f均是距离函数，G_S是S在θ_n处的高斯-牛顿矩阵，μ是一个预先确定好的正数。where S and f are distance functions, G _S is the Gauss-Newton matrix of S at θ _n , and μ is a predetermined positive number.

其中，步骤(4)所述进行有预条件的共轭梯度法求δ_n使得

达到最小值的步骤中，在进行有预条件的共轭梯度法过程中，只用“mini-batch”而不是所有样本，且其中涉及的高斯-牛顿矩阵乘向量的运算都是通过

做隐式近似计算。Wherein, in step (4), the preconditioned conjugate gradient method is performed to obtain δ _n such that

In the step of reaching the minimum value, in the process of the preconditioned conjugate gradient method, only "mini-batch" is used instead of all samples, and the Gauss-Newton matrix multiplication vector operation involved is through

Do implicit approximations.

作为本发明的再一个方面，本发明还提供了一种执行Hessian-Free训练算法的方法，其特征在于，包括以下步骤：As a further aspect of the present invention, the present invention also provides a method for executing the Hessian-Free training algorithm, which is characterized in that it includes the following steps:

步骤S1，在指令缓存单元的首地址处预先存入一条IO指令。In step S1, an IO instruction is pre-stored at the first address of the instruction cache unit.

步骤S2，运算开始，控制器单元从指令缓存单元的首地址读取该条IO指令，根据译出的微指令，直接内存访问单元从外部地址空间读取所有与Hessian-Free计算有关的所有指令，并将其缓存入指令缓存单元中；Step S2, the operation starts, the controller unit reads the IO instruction from the first address of the instruction cache unit, and according to the translated microinstruction, the direct memory access unit reads all the instructions related to the Hessian-Free calculation from the external address space. , and cache it in the instruction cache unit;

步骤S3，控制器单元从指令缓存单元读入一条IO指令，根据译出的微指令，直接内存访问单元从外部空间读取初始待更新参数向量θ₀到数据处理单元中；Step S3, the controller unit reads in an IO instruction from the instruction cache unit, and according to the microinstruction translated, the direct memory access unit reads the initial parameter vector θ ₀ to be updated from the external space into the data processing unit;

步骤S4，控制器单元从指令缓存单元读入赋值指令，根据译出的微指令，数据缓存单元中的

初始化，数据处理单元中的迭代次数n被设置为0；其中，θ为待更新参数向量，θ_n为第n次待更新参数向量，f(θ)为误差函数，即衡量结果的实际值与预测值偏离的函数；δ_n是更新向量，且θ_n+1＝θ_n+δ_n；

为f(θ)的二阶估计；Step S4, the controller unit reads the assignment instruction from the instruction cache unit, and according to the translated microinstruction, the

Initialization, the number of iterations n in the data processing unit is set to 0; among them, θ is the parameter vector to be updated, θ _n is the nth parameter vector to be updated, and f(θ) is the error function, that is, the actual value of the measurement result is the same as the A function of the deviation of the predicted value; δ _n is the update vector, and θ _n+1 = θ _n +δ _n ;

is the second-order estimate of f(θ);

步骤S5，控制器单元从指令缓存单元读入一条IO指令，根据译出的微指令，直接内存访问单元从外部空间读取待更新参数向量θ_n传入到数据处理单元中；Step S5, the controller unit reads in an IO instruction from the instruction cache unit, and according to the translated microinstruction, the direct memory access unit reads the parameter vector θ _n to be updated from the external space and passes it into the data processing unit;

步骤S6，控制器单元从指令缓存单元读入一条对误差函数在当前参数向量值附近进行二阶估计的指令，根据译出的微指令，进行求f(θ)在θ_n附近的二阶估计

的操作；该操作中，指令被送至运算控制子模块，运算控制子模块发送相应指令进行以下操作：利用梯度运算子模块计算

利用高斯-牛顿运算子模块和基本运算子模块中的矩阵乘法运算，得到f在θ_n处的高斯-牛顿矩阵G_f；利用阻尼项运算子模块和基本运算子模块执行LM启发式方法得到阻尼系数λ，进而得到阻尼项

最后，通过Step S6, the controller unit reads an instruction for performing second-order estimation of the error function near the current parameter vector value from the instruction cache unit, and performs a second-order estimation of f(θ) near θ _n according to the translated microinstruction.

operation; in this operation, the instruction is sent to the operation control sub-module, and the operation control sub-module sends the corresponding instruction to perform the following operations: use the gradient operation sub-module to calculate

Using the matrix multiplication operation in the Gauss-Newton operation sub-module and the basic operation sub-module, the Gauss-Newton matrix G _f of f at θ _n is obtained; the damping term operation sub-module and the basic operation sub-module are used to perform the LM heuristic method to obtain the damping coefficient λ, and then the damping term is obtained

Finally, by

得到

的表达式存入数据缓存单元；其中，阻尼函数

是根据训练模型预先确定好的函数在θ_n处的值；

get

The expression of is stored in the data cache unit; among them, the damping function

步骤S7，控制器单元从指令缓存单元读取一条数据传输指令，根据译出的微指令，将

从数据缓存单元传送到数据处理单元中；Step S7, the controller unit reads a data transmission instruction from the instruction cache unit, and according to the translated microinstruction,

Transferred from the data buffer unit to the data processing unit;

步骤S8，控制器单元从指令缓存单元读取一条参数更新运算指令，根据译出的微指令，进行用有预条件的共轭梯度法求δ_n使得

达到最小值，并把θ_n更新为θ_n+1的操作；直接内存访问单元从外部空间读取待更新参数向量θ_n传入到数据处理单元中；运算控制子模块控制相关运算模块进行如下操作：利用共轭梯度运算子模块和基本运算子模块得到更新向量δ_n；最后，利用基本运算子模块中的向量加法将θ_n更新为θ_n+1；Step S8, the controller unit reads a parameter update operation instruction from the instruction cache unit, and according to the decoded microinstruction, uses the preconditioned conjugate gradient method to obtain δ _n such that

The operation of reaching the minimum value and updating θ _n to θ _n+1 ; the direct memory access unit reads the parameter vector θ _n to be updated from the external space and transfers it to the data processing unit; the operation control sub-module controls the relevant operation module to perform the following steps Operation: use the conjugate gradient operation sub-module and the basic operation sub-module to obtain the update vector δ _n ; finally, use the vector addition in the basic operation sub-module to update θ _n to θ _n+1 ;

步骤S9，控制器单元从指令缓存单元读取一条IO指令，根据译出的微指令，更新后的参数向量θ_n+1从数据处理单元通过直接内存访问单元传送至外部指定空间；Step S9, the controller unit reads an IO instruction from the instruction cache unit, and according to the microinstruction translated, the updated parameter vector θ _n+1 is transferred from the data processing unit to the external designated space by the direct memory access unit;

步骤S10，控制器单元从指令缓存单元读取一条收敛判断指令，根据译出的微指令，数据处理单元判断更新后的参数向量θ_n+1是否收敛：若收敛，运算结束；否则，迭代次数n的值增长1，转回执行步骤S5。Step S10, the controller unit reads a convergence judgment instruction from the instruction cache unit, and according to the translated microinstruction, the data processing unit judges whether the updated parameter vector θ _n+1 converges: if converged, the operation ends; otherwise, the number of iterations The value of n increases by 1, and the process returns to step S5.

作为本发明的还一个方面，本发明还提供了一种执行Hessian-Free训练算法的装置，所述装置的控制器中固化有执行如上所述的执行Hessian-Free训练算法的方法的程序。As a further aspect of the present invention, the present invention also provides a device for executing the Hessian-Free training algorithm, and a program for executing the above-mentioned method for executing the Hessian-Free training algorithm is solidified in the controller of the device.

基于上述技术方案可知，本发明的装置和方法具有如下有益效果：使用该装置可以实现Hessian-Free训练算法，完成对各种神经网络的训练，如自动编码器(auto-encoders)、卷积神经网络(RNN)等；通过采用专门用于执行Hessian-Free训练算法的设备，可以解决数据的通用处理器运算性能不足，前段译码开销大的问题，加速相关应用的执行速度；同时，对数据缓存单元的应用，避免了反复向内存读取数据，降低了内存访问的带宽。Based on the above technical solutions, the device and method of the present invention have the following beneficial effects: using the device can realize the Hessian-Free training algorithm and complete the training of various neural networks, such as auto-encoders, convolutional neural networks Network (RNN), etc.; by using the equipment specially used to execute the Hessian-Free training algorithm, the problem of insufficient computing performance of the general-purpose processor for the data and the large overhead of the front-end decoding can be solved, and the execution speed of the related applications can be accelerated; at the same time, the data The application of the cache unit avoids repeatedly reading data from the memory and reduces the bandwidth of memory access.

附图说明Description of drawings

图1为根据本发明一实施例的用于实现Hessian-Free训练算法相关应用的装置的整体结构示例框图；1 is a block diagram showing an example of the overall structure of an apparatus for implementing applications related to the Hessian-Free training algorithm according to an embodiment of the present invention;

图2为根据本发明一实施例的用于实现Hessian-Free训练算法相关应用的装置中数据处理单元的示例框图；2 is an exemplary block diagram of a data processing unit in an apparatus for implementing applications related to the Hessian-Free training algorithm according to an embodiment of the present invention;

图3为根据本发明一实施例的用于实现Hessian-Free训练算法相关应用的运算流程图。FIG. 3 is an operation flow chart for implementing a related application of the Hessian-Free training algorithm according to an embodiment of the present invention.

具体实施方式Detailed ways

为使本发明的目的、技术方案和优点更加清楚明白，以下结合具体实施例，并参照附图，对本发明作进一步的详细说明。通过以下详细描述，本发明的其它方面、优势和突出特征对于本领域技术人员将变得显而易见。In order to make the objectives, technical solutions and advantages of the present invention more clearly understood, the present invention will be further described in detail below in conjunction with specific embodiments and with reference to the accompanying drawings. Other aspects, advantages and salient features of the present invention will become apparent to those skilled in the art from the following detailed description.

在本说明书中，下述用于描述本发明原理的各种实施例只是说明，不应该以任何方式解释为限制发明的范围。参照附图的下述描述用于帮助全面理解由权利要求及其等同物限定的本发明的示例性实施例。下述描述包括多种具体细节来帮助理解，但这些细节应认为仅仅是示例性的。因此，本领域普通技术人员应认识到，在不背离本发明的范围和精神的情况下，可以对本文中描述的实施例进行多种改变和修改。此外，为了清楚和简洁起见，省略了公知功能和结构的描述。此外，贯穿附图，相同参考数字用于相似功能和操作。In this specification, the various embodiments described below to describe the principles of the invention are illustrative only and should not be construed in any way to limit the scope of the invention. The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of exemplary embodiments of the present invention as defined by the claims and their equivalents. The following description includes numerous specific details to assist in that understanding, but these details should be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Furthermore, the same reference numerals are used for similar functions and operations throughout the drawings.

本发明公开了一种用于执行Hessian-Free训练算法的装置，包括指令缓存单元、指令译码单元、直接内存访问单元、数据处理单元和数据缓存模块。使用该装置可以实现Hessian-Free训练算法，完成对各种神经网络的训练，如自动编码器(auto-encoders)、卷积神经网络(RNN)等。每次迭代时，对误差函数(目标函数)做二阶泰勒展开，并且添加阻尼项，作为目标函数的估计，之后根据当前的梯度、高斯-牛顿矩阵、阻尼函数(damplingfunction)和阻尼系数(dampling constant)，用有预条件的共轭梯度法(PreconditioningCG-Minimize)求得更新向量，更新待更新参数。持续迭代直至待更新参数向量收敛。The invention discloses a device for executing a Hessian-Free training algorithm, comprising an instruction cache unit, an instruction decoding unit, a direct memory access unit, a data processing unit and a data cache module. Using the device, the Hessian-Free training algorithm can be implemented to complete the training of various neural networks, such as auto-encoders, convolutional neural networks (RNN), and the like. At each iteration, a second-order Taylor expansion is performed on the error function (objective function), and a damping term is added as an estimate of the objective function, and then based on the current gradient, Gauss-Newton matrix, damping function (dampling function) and damping coefficient (dampling constant), use the preconditioned conjugate gradient method (PreconditioningCG-Minimize) to obtain the update vector, and update the parameters to be updated. Continue to iterate until the parameter vector to be updated converges.

更具体地，本发明的装置包括直接内存控制单元、指令缓存单元、控制器单元、数据缓存单元和数据处理单元。其中，直接内存访问单元能够访问外部地址空间，可以向装置内部各个缓存单元读写数据，完成数据的加载和存储，具体包括向指令缓存单元读取指令，从指定存储单元之间读取待更新参数和对应的梯度值到数据处理单元，将更新后的参数向量从数据处理单元直接写入外部指定空间；指令缓存单元通过直接内存访问单元读取指令，并缓存读入的指令；控制器单元从指令缓存单元中读取指令，将指令译码为控制其他模块行为的微指令并将其发送给其他模块如直接内存访问单元、数据缓存单元、数据处理单元等；数据缓存单元存储装置运行中需要的一些中间变量，并对这些变量做初始化及更新操作；数据处理单元根据指令做相应的运算操作。More specifically, the apparatus of the present invention includes a direct memory control unit, an instruction cache unit, a controller unit, a data cache unit and a data processing unit. Among them, the direct memory access unit can access the external address space, can read and write data to each cache unit inside the device, and complete the loading and storage of data, which specifically includes reading instructions from the instruction cache unit, and reading from the designated storage units to be updated. The parameters and corresponding gradient values are sent to the data processing unit, and the updated parameter vector is directly written into the external designated space from the data processing unit; the instruction cache unit reads the instructions through the direct memory access unit, and caches the read instructions; the controller unit Read instructions from the instruction cache unit, decode the instructions into micro-instructions that control the behavior of other modules and send them to other modules such as direct memory access units, data cache units, data processing units, etc.; the data cache unit storage device is running Some intermediate variables are required, and initialize and update these variables; the data processing unit performs corresponding operations according to the instructions.

此外，本发明还公开了一种执行Hessian-Free训练算法的方法，包括以下步骤：In addition, the present invention also discloses a method for executing the Hessian-Free training algorithm, comprising the following steps:

具体来说，是对其中的梯度

高斯-牛顿矩阵G_f、阻尼系数λ和阻尼函数

进行置零操作。Step (1), through the instruction, complete the initialization operation of the data cache unit, that is, initialize the second-order estimation of f(θ)

Specifically, is the gradient of

Gauss-Newton matrix G _f , damping coefficient λ and damping function

Perform zero-setting operation.

步骤(2)，通过IO指令，完成直接内存访问单元从外部空间读取待更新参数向量的操作。In step (2), through the IO instruction, the operation of the direct memory access unit reading the parameter vector to be updated from the external space is completed.

得到f(θ)在θ_n附近的估计

即

Step (3), the data processing unit performs a second-order Taylor expansion on the error function f(θ) at θ _n according to the corresponding instruction, and adds a damping term

get an estimate of f(θ) around θ _n

which is

其中，G_f是f在θ_n处的高斯-牛顿矩阵；δ_n是更新向量；阻尼系数λ是用LM式启发式(Levenburg-Marquardt style heuristics)的方法求得；阻尼函数

是根据训练模型预先确定好的函数在θ_n处的值，比如进行RNN训练时，

S和f类似，是距离函数，G_S是S在θ_n处的高斯-牛顿矩阵，μ(weighting constant)是一个预先确定好的正数。Among them, G _f is the Gauss-Newton matrix of _f at θ _n ;

is the value of the pre-determined function at θ _n according to the training model, such as when training RNN,

Similar to f, S is a distance function, G _S is the Gauss-Newton matrix of S at θ _n , and μ (weighting constant) is a predetermined positive number.

达到最小值，并把θ_n更新为θ_n+1的操作。更新操作如下：In step (4), the data processing unit performs a preconditioned conjugate gradient method to obtain δ _n according to the corresponding instructions such that

The operation to reach the minimum value and update θ _n to θ _n+1 . The update operation is as follows:

θ_n+1＝θ_n+δ_n；θ _n+1 = θ _n +δ _n ;

值得一提的是，在进行用有预条件的共轭梯度法过程中，只用“mini-batch”而不是所有样本，并且其中涉及的高斯-牛顿矩阵乘向量的运算都是通过

做隐式近似计算(Pearlmutter的R{}-method)。这样既提高了大数据学习的效率或者说提高数据运算模块的运算效率，也避免了运算量随参数数量平方增长的情况。It is worth mentioning that in the process of using the preconditioned conjugate gradient method, only "mini-batch" is used instead of all samples, and the Gauss-Newton matrix multiplication vector operation involved is all done by

Do an implicit approximation (Pearlmutter's R{}-method). This not only improves the efficiency of big data learning or the operation efficiency of the data operation module, but also avoids the situation that the amount of operation increases with the square of the number of parameters.

根据本发明实施例实现的Hessian-Free训练算法的装置，可以用以支持使用Hessian-Free训练算法方面的应用。在数据缓存单元开辟一个空间存储误差函数在每一代待更新参数附近的二阶估计，在每次进行有预条件的共轭梯度法时，利用该二阶估计计算一个更新向量，然后进行对待更新向量的更新操作。重复进行上述步骤，直至待更新向量收敛。The apparatus for the Hessian-Free training algorithm implemented according to the embodiment of the present invention can be used to support applications using the Hessian-Free training algorithm. Open up a space in the data cache unit to store the second-order estimation of the error function near the parameters to be updated in each generation, and use the second-order estimation to calculate an update vector each time the preconditioned conjugate gradient method is performed, and then perform the update to be updated. The update operation of the vector. Repeat the above steps until the vector to be updated converges.

下面结合附图的本发明的技术方案进行进一步阐述说明。The technical solutions of the present invention will be further elaborated below in conjunction with the accompanying drawings.

图1示出了根据本发明一实施例的用于实现Hessian-Free训练算法的装置的整体结构示例框图。如图1所示，该装置包括直接内存访问单元1、指令缓存单元2、控制器单元3、数据缓存单元4和数据处理单元5，均可以通过硬件电路实现。FIG. 1 shows a block diagram of an exemplary overall structure of an apparatus for implementing a Hessian-Free training algorithm according to an embodiment of the present invention. As shown in FIG. 1 , the device includes a direct memory access unit 1 , an instruction cache unit 2 , a controller unit 3 , a data cache unit 4 and a data processing unit 5 , all of which can be implemented by hardware circuits.

直接内存访问单元1能够访问外部地址空间，可以向装置内部各个缓存单元读写数据，完成数据的加载和存储。具体包括向指令缓存单元2读取指令，从指定存储单元之间读取待更新参数到数据处理单元5，将更新后的参数向量从数据处理单元5直接写入外部指定空间。The direct memory access unit 1 can access the external address space, can read and write data to each cache unit inside the device, and complete the loading and storage of data. Specifically, it includes reading instructions from the instruction cache unit 2 , reading the parameters to be updated from the specified storage units to the data processing unit 5 , and directly writing the updated parameter vector from the data processing unit 5 into the external specified space.

指令缓存单元2通过直接内存访问单元1读取指令，并缓存读入的指令。The instruction cache unit 2 reads the instruction through the direct memory access unit 1, and caches the read instruction.

控制器单元3从指令缓存单元2中读取指令，将指令译码为控制其他模块行为的微指令并将其发送给其他模块如直接内存访问单元1、数据缓存单元4、数据处理单元5等。The controller unit 3 reads the instructions from the instruction cache unit 2, decodes the instructions into micro-instructions that control the behavior of other modules and sends them to other modules such as the direct memory access unit 1, the data cache unit 4, the data processing unit 5, etc. .

数据缓存单元4在装置初始化时初始化

具体来说，是初始化其中的梯度

高斯-牛顿矩阵G_f、阻尼系数λ和阻尼函数

在第n次待更新参数向量θ_n的更新开始前，会将

读出到数据处理单元5中。在数据处理单元5中得到更新向量δ_n，将θ_n更新为θ_n+1，对应的

更新为

然后将

写入至数据缓存单元4(新的数据将以前的相对应数据覆盖)，用于下次使用。Data cache unit 4 is initialized at device initialization

Specifically, it is to initialize the gradient in

Gauss-Newton matrix G _f , damping coefficient λ and damping function

Before the nth update of the parameter vector θ _n to be updated starts, the

Read out into the data processing unit 5 . The update vector δ _n is obtained in the data processing unit 5 , and θ _n is updated to θ _n+1 , corresponding to

update to

followed by

Write to the data cache unit 4 (the new data will overwrite the previous corresponding data) for the next use.

数据处理单元5从数据缓存单元4中读取

通过直接内存访问单元1从外部指定空间中读取待更新参数向量θ_n。在模块内得到更新向量δ_n，将θ_n更新为θ_n+1，对应的

更新为

然后将

写入至数据缓存单元4，将θ_n+1通过直接内存访问单元1写入到外部指定空间中。The data processing unit 5 reads from the data buffer unit 4

The parameter vector θ _n to be updated is read from the external specified space through the direct memory access unit 1 . Get the update vector δ _n in the module, update θ _n to θ _n+1 , the corresponding

update to

followed by

Write to the data cache unit 4, and write θ _n+1 into the external designated space through the direct memory access unit 1.

图2示出了根据本发明实施例的用于实现Hessian-Free训练算法相关应用的装置中数据处理单元的示例框图。如图2所示，数据处理单元包括运算控制子模块51、梯度运算子模块52、阻尼项运算子模块53、高斯-牛顿矩阵运算子模块54、共轭梯度法运算子模块55以及基本运算子模块56。其中，基本运算子模块56进行的是矩阵、向量之间的加乘等基础运算，52、53、54、55子模块都会调用56子模块，并且根据情况，这些模块互相之间也允许互相调用。FIG. 2 shows an example block diagram of a data processing unit in an apparatus for implementing applications related to a Hessian-Free training algorithm according to an embodiment of the present invention. As shown in FIG. 2, the data processing unit includes an operation control sub-module 51, a gradient operation sub-module 52, a damping term operation sub-module 53, a Gauss-Newton matrix operation sub-module 54, a conjugate gradient method operation sub-module 55 and a basic operator module 56. Among them, the basic operation sub-module 56 performs basic operations such as addition and multiplication between matrices and vectors. The 52, 53, 54, and 55 sub-modules will call the 56 sub-module, and according to the situation, these modules are also allowed to call each other. .

图3示出了根据Hessian-Free训练算法进行相关运算的装置的总体流程图。FIG. 3 shows the overall flow chart of the apparatus for performing correlation operations according to the Hessian-Free training algorithm.

在步骤S1，在指令缓存单元2的首地址处预先存入一条IO指令。In step S1, an IO instruction is pre-stored at the first address of the instruction cache unit 2.

在步骤S2，运算开始，控制单元3从指令缓存单元2的首地址读取该条IO指令，根据译出的微指令，直接内存访问单元1从外部地址空间读取所有与Hessian-Free计算有关的所有指令，并将其缓存入指令缓存单元2中。In step S2, the operation starts, the control unit 3 reads the IO instruction from the first address of the instruction cache unit 2, and according to the decoded microinstruction, the direct memory access unit 1 reads all the information related to the Hessian-Free calculation from the external address space. and cache all instructions in instruction cache unit 2.

在步骤S3，控制器单元3从指令缓存单元2读入一条IO指令，根据译出的微指令，直接内存访问单元1从外部空间读取初始待更新参数向量θ₀到数据处理单元5中。In step S3 , the controller unit 3 reads an IO instruction from the instruction cache unit 2 , and according to the decoded microinstruction, the direct memory access unit 1 reads the initial parameter vector θ ₀ to be updated into the data processing unit 5 from the external space.

在步骤S4，控制器单元3从指令缓存单元2读入赋值指令，根据译出的微指令，数据缓存单元4中的

初始化，数据处理单元5中的迭代次数n被设置为0。In step S4, the controller unit 3 reads the assignment instruction from the instruction cache unit 2, and according to the decoded microinstruction, the data in the data cache unit 4

Initially, the iteration number n in the data processing unit 5 is set to zero.

在步骤S5，控制器单元3从指令缓存单元2读入一条IO指令，根据译出的微指令，直接内存访问单元1从外部空间读取待更新参数向量θ_n传入到数据处理单元5中。In step S5, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the decoded microinstruction, the direct memory access unit 1 reads the parameter vector θ _n to be updated from the external space and transfers it to the data processing unit 5 .

在步骤S6，控制器单元3从指令缓存单元2读入一条对误差函数在当前参数向量值附近进行二阶估计的指令，根据译出的微指令，进行求f(θ)在θ_n附近的二阶估计

的操作。该操作中，指令被送至运算控制子模块51，运算控制子模块51发送相应指令进行以下操作：利用梯度运算子模块52计算

利用高斯-牛顿运算子模块54和基本运算子模块56中的矩阵乘法运算，得到f在θ_n处的高斯-牛顿矩阵G_f；利用阻尼项运算子模块53和基本运算子模块56执行LM启发式方法得到阻尼系数λ，进而得到阻尼项

最后，得到

的表达式存入数据缓存单元4。In step S6, the controller unit 3 reads from the instruction buffer unit 2 an instruction for second-order estimation of the error function near the current parameter vector value, and according to the decoded microinstruction, performs the calculation of f(θ) near θ _n . second-order estimate

operation. In this operation, the instruction is sent to the operation control sub-module 51, and the operation control sub-module 51 sends corresponding instructions to perform the following operations: use the gradient operation sub-module 52 to calculate

Utilize the matrix multiplication operation in the Gauss-Newton operation sub-module 54 and the basic operation sub-module 56 to obtain the Gauss-Newton matrix G _f of f at θ _n ; utilize the damping term operation sub-module 53 and the basic operation sub-module 56 to perform LM heuristic The damping coefficient λ is obtained by the formula method, and then the damping term is obtained

Finally, get

The expression is stored in the data cache unit 4.

在步骤S7，控制器单元3从指令缓存单元2读取一条数据传输指令，根据译出的微指令，将

从数据缓存单元4传送到数据处理单元5中。In step S7, the controller unit 3 reads a data transfer instruction from the instruction cache unit 2,

It is transferred from the data buffer unit 4 to the data processing unit 5 .

在步骤S8，控制器单元3从指令缓存单元2读取一条参数更新运算指令，根据译出的微指令，进行用有预条件的共轭梯度法求δ_n使得

达到最小值，并把θ_n更新为θ_n+1的操作。直接内存访问单元1从外部空间读取待更新参数向量θ_n传入到数据处理单元5中。运算控制子模块51控制相关运算模块进行如下操作：利用共轭梯度运算子模块55和基本运算子模块56得到更新向量δ_n。这其中，根据阻尼函数

的表达式，也可能需要调用高斯牛顿-运算模块(比如之前提到的RNN的例子)。最后，利用基本运算子模块56中的向量加法将θ_n更新为θ_n+1。In step S8, the controller unit 3 reads a parameter update operation instruction from the instruction cache unit 2, and according to the decoded microinstruction, uses the preconditioned conjugate gradient method to obtain δ _n such that

The operation to reach the minimum value and update θ _n to θ _n+1 . The direct memory access unit 1 reads the parameter vector θ _n to be updated from the external space and transfers it to the data processing unit 5 . The operation control sub-module 51 controls the correlation operation module to perform the following operations: using the conjugate gradient operation sub-module 55 and the basic operation sub-module 56 to obtain the update vector δ _n . Among them, according to the damping function

The expression of , may also need to call the Gauss-Newton-computation module (such as the RNN example mentioned earlier). Finally, θ _n is updated to θ _n+1 using vector addition in the basic operation sub-module 56 .

在步骤S9，控制器单元3从指令缓存单元2读取一条IO指令，根据译出的微指令，更新后的参数向量θ_n+1从数据处理单元5通过直接内存访问单元1传送至外部指定空间。In step S9, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the decoded microinstruction, the updated parameter vector θ _n+1 is transmitted from the data processing unit 5 through the direct memory access unit 1 to the external specified space.

在步骤S10，控制器单元从指令缓存单元2读取一条收敛判断指令，根据译出的微指令，数据处理单元判断更新后的参数向量θ_n+1是否收敛：若收敛，运算结束；否则，迭代次数n的值增长1，转回执行步骤S5。In step S10, the controller unit reads a convergence judgment instruction from the instruction cache unit 2, and according to the translated microinstruction, the data processing unit judges whether the updated parameter vector θ _n+1 is converged: if it converges, the operation ends; otherwise, The value of the number of iterations n is increased by 1, and the process returns to step S5.

前面的附图中所描绘的进程或方法可通过包括硬件(例如，电路、专用逻辑等)、固件、软件(例如，被承载在非瞬态计算机可读介质上的软件)，或两者的组合的处理逻辑来执行。虽然上文按照某些顺序操作描述了进程或方法，但是，应该理解，所描述的某些操作能以不同顺序来执行。此外，可并行地而非顺序地执行一些操作。The processes or methods depicted in the preceding figures may be implemented by means of processes including hardware (eg, circuits, special purpose logic, etc.), firmware, software (eg, software carried on a non-transitory computer-readable medium), or both. Combined processing logic to execute. Although a process or method is described above as operating in certain orders, it should be understood that certain operations described can be performed in a different order. Furthermore, some operations may be performed in parallel rather than sequentially.

在前述的说明书中，参考其特定示例性实施例描述了本发明的各实施例。显然，可对各实施例做出各种修改，而不悖离所附权利要求所述的本发明的更广泛的精神和范围。相应地，说明书和附图应当被认为是说明性的，而不是限制性的。In the foregoing specification, embodiments of the invention have been described with reference to specific exemplary embodiments thereof. Obviously, various modifications may be made to the various embodiments without departing from the broader spirit and scope of the invention as set forth in the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.

Claims

1. A computing device for performing a Hessian-Free training algorithm, the computing device for performing neural network operations; the computing device comprises a controller unit and a data processing unit, and is characterized in that:

the controller unit is used for decoding the read instruction into a micro instruction for controlling the data processing unit and sending the micro instruction to the data processing unit;

the data processing unit is used for performing calculation on input data according to the micro instruction to obtain a result of the calculation instruction; the data processing unit comprises an operation control submodule and a basic operation submodule, as well as a gradient operation submodule, a damping term operation submodule, a Gaussian-Newton matrix operation submodule and a conjugate gradient method operation submodule;

the data cache unit is used for storing intermediate variables in the operation process and executing initialization and updating operations on the intermediate variables; wherein the data buffer unit initializes a second order estimate of f (θ) at device initialization

Parameter vector theta to be updated at nth time_nBefore the update of (2) is started, will

Read out to a data processing unit and will obtain an update vector in said data processing unit

Writing again; wherein, theta is a parameter vector to be updated, and theta_nF (theta) is an error function, namely a function of deviation of an actual value of a measurement result and a predicted value, for the parameter vector to be updated at the nth time; delta_nIs an update vector, and θ_n+1＝θ_n+δ_n。

2. The computing device of claim 1, wherein the basic operation submodule performs addition, multiplication basis operations between matrices and/or vectors.

3. The computing device of claim 1, wherein the gradient operation sub-module, the damping term operation sub-module, the gauss-newton matrix operation sub-module, and the conjugate gradient method operation sub-module are all capable of invoking the basic operation sub-module, and the gradient operation sub-module, the damping term operation sub-module, the gauss-newton matrix operation sub-module, and the conjugate gradient method operation sub-module are allowed to invoke each other as appropriate.

4. The computing device of claim 1,

the data processing unit stores an intermediate variable in the data cache unit when performing an arithmetic operation under the control of the controller unit.

5. The computing device of claim 1, wherein the data cache unit is initialized

In the step (2), the gradient therein is initialized

Gauss-Newton matrix G_fDamping coefficient lambda and damping function

Wherein,

the gradient

Finger f at θ_nGradient value of (G)_fIs f at θ_nA gauss-newton matrix of (d); damping function

Is a function predetermined according to a training model at theta_nThe value of (d); the damping coefficient lambda is obtained by an LM type heuristic method;

the data processing unit reads from the data cache unit

Reading parameter vector theta to be updated from external designated space_n(ii) a Obtaining an update vector delta within a module_nWill theta_nIs updated to theta_n+1Corresponding to

Is updated to

Then will be

Writing into the data buffer unit, and storing theta_n+1Writing into an external designated space; wherein, theta_n+1For the (n +1) th parameter vector to be updated,

is a second order estimate of f (θ + 1).

6. A computational method for performing a Hessian-Free training algorithm, the computational method for performing neural network operations, the computational method comprising the steps of:

the controller unit decodes the read instruction into a microinstruction for controlling the data processing unit and sends the microinstruction to the data processing unit;

the data processing unit executes calculation on input data according to the micro instruction to obtain a result of the calculation instruction; the data processing unit comprises an operation control submodule and a basic operation submodule, and a gradient operation submodule, a damping term operation submodule, a Gaussian-Newton matrix operation submodule and a conjugate gradient method operation submodule;

a data cache unit for storing intermediate variables generated during operation, and performing initialization and update operations on the intermediate variables, wherein the data cache unit initializes second-order estimation of f (theta) during device initialization

7. The computing method of claim 6, wherein the basic operation submodule performs addition and multiplication basic operation steps between matrices and/or vectors.

8. The computing method of claim 6, wherein the gradient operation sub-module, the damping term operation sub-module, the Gauss-Newton matrix operation sub-module, and the conjugate gradient method operation sub-module are all capable of calling the basic operation sub-module, and the gradient operation sub-module, the damping term operation sub-module, the Gauss-Newton matrix operation sub-module, and the conjugate gradient method operation sub-module are allowed to be called each other as the case may be.

9. The computing method of claim 6, wherein the data cache unit is initialized

In the step (2), the gradient therein is initialized

Gauss-Newton matrix G_fDamping coefficient lambda and damping function

Wherein,

the gradient

Is predetermined according to a training modelFunction at theta_nThe value of (d); the damping coefficient lambda is obtained by an LM type heuristic method;

the data processing unit reads from the data cache unit

Is updated to

Then will be

is a second order estimate of f (θ + 1).

10. A computational method for performing a Hessian-Free training algorithm, comprising the steps of:

step (1), finishing the initialization operation of the data cache unit through an instruction, namely initializing the second-order estimation of f (theta)

Wherein, theta is a parameter vector to be updated, and theta_nF (theta) is an error function, namely a function of deviation of an actual value of a measurement result and a predicted value, for the parameter vector to be updated at the nth time; delta_nIs an update vector, and θ_n+1＝θ_n+δ_n；

Step (2), completing the operation of reading the parameter vector to be updated from the external space through an IO instruction;

step (3), the data processing unit executes the corresponding instruction according to the corresponding instruction at theta_nPerforming second-order Taylor expansion on the error function f (theta), and adding a damping term

F (theta) at theta_nEstimation of proximity

Namely, it is

Wherein G is_fIs f at θ_nA gauss-newton matrix of (d); the damping coefficient lambda is obtained by an LM type heuristic method; damping function

Is a function predetermined according to a training model at theta_nThe value of (d);

step (4), the data processing unit calculates delta by a preconditioned conjugate gradient method according to corresponding instructions_nSo that

Reaches a minimum value, and_nis updated to theta_n+1The specific update operation is:

θ_n+1＝θ_n+δ_n；

and (5) judging whether the updated parameter vector is converged by the data processing unit, if so, finishing the operation, otherwise, turning to the step (2) to continue executing.

11. The computing method of claim 10, wherein the step of completing initialization operations of the data cache unit in step (1) comprises: for gradient

Gauss-Newton matrix G_fDamping coefficient lambda and damping function

A zero operation is performed.

12. The computing method of claim 10, wherein in step (3) when RNN training is performed,

damping function

Where S and f are both distance functions, G_SIs S at θ_nAnd a gauss-newton matrix where μ is a predetermined positive number.

13. The method of claim 10, wherein step (4) comprises performing a preconditioned conjugate gradient method for δ_nSo that

In the step of reaching the minimum, only the "mini-batch" is used instead of all samples during the implementation of the preconditioned conjugate gradient method, and the gaussian-newton matrix multiplication vector involved therein is calculated by

And performing implicit approximate calculation.

14. A computational method for performing a Hessian-Free training algorithm, comprising the steps of:

step S1, an IO instruction is pre-stored in the first address of the instruction cache unit;

step S2, when the operation starts, the controller unit reads the IO instruction from the first address of the instruction cache unit, and according to the translated micro instruction, the direct memory access unit reads all instructions related to Hessian-Free calculation from the external address space and caches the instructions into the instruction cache unit;

step S3, the controller unit reads an IO instruction from the instruction cache unit, and the dma unit reads the initial parameter vector θ to be updated from the external space according to the translated microinstruction₀To a data processing unit;

step S4, the controller unit reads the assignment instruction from the instruction cache unit, and the data cache unit stores the assignment instruction in accordance with the translated microinstruction

Initializing, wherein the iteration number n in the data processing unit is set to be 0; wherein, theta is a parameter vector to be updated, and theta_nF (theta) is an error function, namely a function of deviation of an actual value of a measurement result and a predicted value, for the parameter vector to be updated at the nth time; delta_nIs an update vector, and θ_n+1＝θ_n+δ_n；

A second order estimate of f (θ);

step S5, the controller unit reads an IO instruction from the instruction cache unit, and the dma unit reads the parameter vector θ to be updated from the external space according to the translated microinstruction_nTransmitting the data into a data processing unit;

step S6, the controller unit reads in an instruction for second-order estimation of the error function near the current parameter vector value from the instruction cache unit, and performs the calculation of f (θ) at θ according to the translated microinstruction_nNearby second order estimation

The operation of (1); in the operation, the instruction is sent to an operation control submodule in the data processing unit, and the operation control submodule sends a corresponding instruction to perform the following operations: using a gradient arithmetic submodule to calculate

Matrix multiplication in the Gauss-Newton operation submodule and the basic operation submodule is utilized to obtain f at theta_nGauss-Newton matrix G of (G)_f(ii) a Utilizing the damping item operation submodule and the basic operation submodule to execute an LM heuristic method to obtain a damping coefficient lambda so as to obtain a damping item

Finally, by

To obtain

The expression of (2) is stored in a data cache unit; wherein the damping function

in step S7, the controller unit reads a data transfer instruction from the instruction cache unit and translates the read data transfer instruction into a microinstruction

The data is transmitted from the data buffer unit to the data processing unit;

step S8, the controller unit reads a parameter updating operation instruction from the instruction cache unit, and calculates delta by using a preconditioned conjugate gradient method according to the translated microinstruction_nSo that

Reaches a minimum value, and_nis updated to theta_n+1The operation of (1); direct memory access unit reads parameter vector theta to be updated from external space_nTransmitting the data into a data processing unit; operation controlThe submodule controls the correlation operation module to perform the following operations: obtaining an update vector delta by using a conjugate gradient operation submodule and a basic operation submodule_n(ii) a Finally, theta is added by vector addition in the basic operation submodule_nIs updated to theta_n+1；

Step S9, the controller unit reads an IO instruction from the instruction cache unit, and updates the parameter vector theta according to the translated microinstruction_n+1Transmitting the data from the data processing unit to an external designated space through the direct memory access unit;

in step S10, the controller unit reads a convergence judgment instruction from the instruction cache unit, and the data processing unit judges the updated parameter vector θ according to the translated microinstruction_n+1Whether to converge or not: if the convergence is reached, the operation is ended; otherwise, the value of the iteration number n is increased by 1, and the process returns to step S5.

15. A computing device for executing a Hessian-Free training algorithm, the computing device having a program embodied in a controller that executes the computing method of any of claims 6 to 14.