Softmax hardware implementation method and system of logic resource limited platform
Technical Field
The invention relates to a software max hardware implementation method and system of a logic resource limited platform, belonging to the technical field of hardware implementation of a neural network activation function.
Background
In the field of big data, deep Neural Networks (DNNs) have achieved great success, and efficient hardware architecture has been the goal of academia and industry. Wherein the Softmax layer is widely used for different DNNs. The Softmax function is typically used as an activation function for the output layer in classification tasks, which maps the output of multiple neurons into (0, 1) intervals, which has a very wide range of applications in machine learning and deep learning. Especially in dealing with multi-classification (C > 2) problems, the final output unit of the classifier requires a Softmax function for numerical processing. The expression is as follows: the exponentiation and division calculations in the Softmax function are quite expensive, especially in embedded systems, and many look-up table based implementations of the prior art introduce excessive resource consumption and are complex. There is therefore a need for a rational way to deploy functions on hardware that ensures low resource consumption and that can be implemented efficiently.
Disclosure of Invention
Aiming at the problems, the invention provides a software max hardware implementation of a logic resource limited platform, which solves the problems of high cost and low efficiency of software max function hardware implementation by utilizing function fitting and serial accumulation. The technical proposal is as follows:
The complete technical scheme of the invention is that a software max hardware implementation method of a logic resource limited platform comprises the following steps:
1) The original Softmax function expression And (3) performing transformation:
Wherein n is the total number of inputs to Softmax, i is the index of the corresponding inputs and outputs, i=1.
2) Calculating an exponential function of each gating input x 1,x2,....,xn Where radix e is replaced with radix 2:
3) Calculating the accumulated sum of the exponentiations of each input x 1,x2,....,xn in step 2):
4) Calculating the natural logarithm of the accumulated result in the step 3):
f_ln=ln(f)
5) Calculating the exponent power of each strobe input x 1,x2,....,xn added to the calculation result f_ln of step 4), respectively, where radix e is replaced by radix 2:
6) Storing each result R (i) of the calculation of step 5) in a register to obtain the final total output R.
Further, in steps 2) and 5), the same exponent operation module is used for calculation, the radix e is replaced by the radix 2, and the exponent operation module is composed of two adders, a constant multiplier and a shifter:
u is the integer portion of |y.log 2 e|, i.e., the integer portion of the truncated fixed point number, and v is the fractional portion of |y.log 2 e|, i.e., the fractional portion of the truncated fixed point number. The absolute value is represented in fixed-point hardware by judging the sign bit, the absolute value of the positive number is the original value, and the absolute value of the negative number is inverted and added by one. Such as in step 2) and step 5) Or (b)The first adder calculates x i -0 or x i -f_ln, the constant multiplier calculates (x i-f_ln).log2 e or x i.log2 e; intercept integer part of absolute value to get u, the second adder realizes fitting function 2 v≈v+b1, where b 1 is constant, and finally the calculated result is obtained by shifting u to left or right of 2 v.
Further, in step 3), a serial accumulation module is used for calculation, and an accumulation enable signal counter is used for control.
Further, in step 4), a logarithmic operation module is used to calculate, and the base e is replaced by the base 2, where the logarithmic operation module is composed of a leading 1 detector (LOD), a decoder and a right shifter, and a constant adder and adder:
ln(f)=ln2*log2f=ln2*(w+log2k)
t is the intermediate value where the most significant bit of f is 1 and the other bits are 0, w is the index where the most significant bit of f is, k is the remainder of f after scaling, and k e [1, 2), for example for sixteen fixed point numbers:
If f=16' b0000_1011_1111.0011,
Then t=16 'b0000_1000_0000.0000, w=4, k=16' b0000_1.011_1111_0011;
if f=16' b0000_0011_1111.0011,
Then t=16 'b0000_0010_0000.0000, w= 6,k =16' b0000_001.1_1111_0011.
Further, in step 4), ln (f) is calculated, LOD is used to obtain an intermediate value t with only the most significant bit of f being 1 and the other bits being 0, the decoder input t obtains the index w with the most significant bit of f being located, the right shifter shifts f right by w to obtain k, the constant adder implements a fitting function log 2k≈k+b2, where b 2 is a constant, and finally the constant multiplier and adder calculate ln2 x (w+k+b 2).
The method and the system for realizing the software max hardware of the logic resource limited platform have the beneficial effects that the complexity of realizing the software max hardware is effectively reduced, the complex functions are realized only by using a limited basic operation logic unit through function equivalent transformation, exponentiation base number and logarithmic base number replacement, function fitting, serial accumulation and exponential operation unit multiplexing, and the hardware realization area and the power consumption cost are effectively reduced.
Compared with the prior art, the method has the advantages that the circuit complexity is greatly reduced, the power consumption and the area cost are reduced, meanwhile, the requirement for parameter storage based on the LUT lookup table method is not met, the requirement for processing time based on iteration of the CORDIC method is not met, the throughput rate is increased, and the problem that Softmax hardware is difficult to realize is solved.
Drawings
FIG. 1 is a block diagram of a Softmax hardware implementation of a logical resource constrained platform of the present invention;
FIG. 2 is a schematic diagram of an exponential function circuit implementation;
fig. 3 is a schematic diagram of a logarithmic function circuit implementation.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
As shown in fig. 1, the Softmax hardware implementation system of the logic resource limited platform according to the present invention has an input of x 1,x2,....,xn and an output of r 1,r2,....,rn. The method comprises the following steps:
(1) After operand x 1,x2,....,xn is input, lnF_en is set to 0, a counter Cnt [ $clog2 (n) -1:0] is enabled, x 1,x2,....,xn is sequentially gated by using Cnt [ $clog2 (n) -1:0] as gating signals, and each gating input x 1,x2,....,xn exponential function is calculated Where radix e is replaced with radix 2:
(2) Calculating the accumulated sum of the exponentiations of each input x 1,x2,....,xn in the step (1), setting Acc_en to 1, and setting Acc_en to 0 when Cnt [ $clog2 (n) -1:0] is equal to the input number n, so as to finish the accumulation of all the exponentiations of x 1,x2,....,xn.
(3) Calculating the natural logarithm f_ln=ln (f) of the accumulated result in the step (2), wherein the base e is replaced by the base 2.
(4) Setting and re-enabling a counter Cnt [ $ clone 2 (n) -1:0], setting LnF_en to 1, calculating an exponential power of each gating input x 1,x2,....,xn added to the calculation result f_ln of step (3), respectively, wherein the base e is replaced by the base 2:
(5) Storing each result R (i) of the calculation in the step (4) in a register to obtain a final total output R.
Through the design of the invention, the performance of the Softmax hardware circuit in the practical application scene with high efficiency, low complexity, low area and low energy consumption is met, and the Softmax calculation of n inputs is theoretically completed once in 2n clock cycles, wherein n is the number of the inputs, and meanwhile, the advantages in area and power consumption are ensured.
As shown in fig. 2, the calculation process of the exponential function circuit:
(1) The input x i or x i -f_ln is multiplied by the fixed-point constant log 2 e to obtain a fixed-point output { u i,vi }, where u i is the integer portion and v i is the fractional portion, where-1<v i <1.
(2) The f (v i)=vi +b) is used for performing function fitting calculation 2 v, and although x i is larger than or equal to 0, x i -f_ln has a positive value and a negative value, so that a piecewise function fitting is needed to be used for determining a coefficient of 0.9919 according to different fitting parameters b 1 or b 2 of positive and negative gating of u i, piecewise intervals are (-1, 0) and [0, 1), for example, if the input is 16-bit fixed point number decimal bit width is 4, b 1=6'b01_0100,b2 =6' b00_1111 and [0,1 ] interval fitting is taken. The coefficients are slightly worse in the (-1, 0) interval, f (v i)=a*vi +b) can be used to improve accuracy in the (-1, 0) interval by piecewise fitting, where a=0.4966, b= 0.9711, for example, the input is 16-bit fixed point number decimal place width is 4, a is realized by right shifting v i by one bit, no additional cost is left on hardware, b 2 =6' b00_1111 is taken, and the fitting is determined to be 0.9909.
(3) Judging positive and negative through the highest sign bit of u i, if the sign bit is negative, inverting and adding one to obtain an absolute value, gating, and performing left shift or right shift on the 2 v calculated in the step (2) according to the positive and negative of u i to obtain a final exponential function calculation result:
as shown in fig. 3, the calculation process of the logarithmic function circuit:
(1) The LOD is used to obtain an intermediate value t where the most significant bit of the input f is located at 1 and the other bits are 0.
(2) And decoding the input t by using a decoder to obtain an index w where the f most significant bit is located.
(3) The input f is shifted to the right by w bits using a shifter to get k, where k e (1, 2).
(4) The adder is used to calculate log 2k:log2k≈k+b2, where b 2 is a constant, such as a fixed point number system, and the total bit width of the output f_ln of the logarithmic function circuit is 6 decimal places and 4, then b= -0.9485 is taken to be b=6' b11.0001 after being fixed, and the fitting determination coefficient is 0.9906.
(5) Adding w to log 2 k obtained by the function fitting calculation in the step (4), and multiplying ln2 by the accumulated value by using a constant multiplier to obtain a final log function calculation result that ln (f) =ln2log 2f=ln2*(w+log2 k
The invention adopts the equivalent transformation of the functions, the replacement of the exponentiation base number and the logarithmic base number, the fitting of the functions, the serial accumulation and the multiplexing of the exponential operation units, only uses a limited basic operation logic unit to realize complex functions, converts the combination of the exponentiation function and the division of the original functions into the combination of the exponentiation function and the logarithmic function, simultaneously carries out the function fitting with controllable precision according to the operation characteristics and the data range, saves a great amount of calculation time and iteration process, and effectively reduces the hardware realization area and the power consumption cost by utilizing the multiplexing of the serial accumulation and the function unit.