Disclosure of Invention
In view of this, the present invention provides an apparatus and a system for accelerating floating-point separable convolution computation, and an image classification method, which improve the FPGA-based floating-point separable convolution computation process so that the depthwise convolution computation achieves line-rate computation throughput.
In order to achieve this purpose, the technical scheme of the invention is as follows: a convolution multiply-accumulate hardware accelerator for a convolutional neural network comprises a convolution multiplication unit, a convolution addition-tree unit, and a convolution forward addition-chain unit.
The convolution multiplication unit comprises PE × SIMD floating-point multipliers. The inputs of each floating-point multiplier are input feature map data and input parameter data, and the latency of each floating-point multiplier may be more than one clock cycle. PE is the number of input feature map data, SIMD is the number of input parameter data, and both PE and SIMD are powers of 2. The convolution multiplication unit outputs the PE × SIMD multiplication results as the input data of the convolution addition-tree unit.
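For illustration only, the following minimal behavioral sketch in Python (not RTL; the function name, list representation, and the example values of PE and SIMD are assumptions) models one clock cycle of the multiplication unit:

```python
# Behavioral sketch of the convolution multiplication unit for one clock
# cycle: PE x SIMD independent floating-point products. Names illustrative.
PE, SIMD = 4, 8  # both powers of 2, per the disclosure

def multiply_unit(features, weights):
    """features: PE input feature map values; weights: SIMD parameter values.
    Returns a PE x SIMD grid of products, one per floating-point multiplier."""
    assert len(features) == PE and len(weights) == SIMD
    return [[f * w for w in weights] for f in features]
```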
The convolution addition-tree unit comprises PE floating-point addition trees. Its input data are the PE × SIMD multiplication results; the SIMD multiplication results corresponding to each input feature map datum form one group, and each of the PE groups is fed into one floating-point addition tree.
Each floating-point addition tree is composed of SIMD − 1 floating-point adders and sums the SIMD multiplication results in its group; the latency of each floating-point adder may be more than one clock cycle. The convolution addition-tree unit outputs the PE addition-tree results as the input of the convolution forward addition-chain unit.
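A minimal behavioral sketch of one such addition tree, assuming SIMD is a power of 2 (pairwise reduction uses SIMD/2 + SIMD/4 + … + 1 = SIMD − 1 adders):

```python
# Behavioral sketch of one floating-point addition tree: SIMD inputs are
# reduced pairwise, using SIMD - 1 adders arranged in log2(SIMD) levels.
def addition_tree(group):
    level = list(group)           # the SIMD products for one feature datum
    while len(level) > 1:         # each while-iteration is one tree level
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]               # one sum per tree; PE trees run in parallel
```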
The convolution forward addition-chain unit comprises PE addition chains. The input data of each addition chain is the corresponding addition-tree result; each chain accumulates input data spanning more than one clock cycle, and the length of the chain is determined by the number of clock cycles to be accumulated. The output of the convolution forward addition-chain unit is the output feature map data, composed of the PE floating-point values output by all the addition chains; this output feature map data is the convolution multiply-accumulate result.
Furthermore, each addition chain accumulates input data spanning more than one clock cycle, and its length is determined by the number of clock cycles to be accumulated. Specifically, each addition chain comprises log2(n) adders, where n is the number of clock cycles to be accumulated. For the pe-th addition chain (pe = 1, …, PE), the input data is the pe-th addition-tree result. The first adder accumulates the input data of adjacent clock cycles and outputs the accumulated data as the input of the second adder; the second adder accumulates its inputs of adjacent clock cycles and outputs the result as the input of the next adder; and so on, until the last adder outputs the result of the current addition chain.
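For example, with n = 8 cycles the chain has log2(8) = 3 adders: the first pairs cycles (1,2), (3,4), (5,6), (7,8); the second pairs those partial sums; the third produces the final accumulation. A stream-level (not cycle-accurate) behavioral sketch, assuming n is a power of 2:

```python
import math

def addition_chain(stream, n):
    """Accumulate n consecutive addition-tree results using log2(n) pairwise
    stages; `stream` holds the per-cycle results for one of the PE chains."""
    assert n & (n - 1) == 0, "assumes n is a power of 2"
    level = list(stream[:n])
    for _ in range(int(math.log2(n))):   # one pass per adder in the chain
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]                      # the chain's accumulated output
```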
Another embodiment of the present invention further provides a convolution multiply-accumulate hardware acceleration system for a convolutional neural network, which includes a host and an FPGA convolution computing system connected by a PCIe bus.
The FPGA convolution computing system consists of a memory and on-chip convolution inference computing logic.
The on-chip convolution inference computing logic comprises a system control unit, an input feature map data cache scheduling unit, an input parameter data cache scheduling unit, a convolution multiply-accumulate computing unit, and an output data cache scheduling unit.
The input feature map data cache scheduling unit stores the input feature map data and feeds it to the convolution multiply-accumulate computing unit.
The input parameter data cache scheduling unit stores the input parameter data and feeds it to the convolution multiply-accumulate computing unit.
The convolution multiply-accumulate computing unit adopts the convolution multiply-accumulate hardware accelerator structure for a convolutional neural network described above; it performs the convolution multiply-accumulate operation on the input feature map data with the input parameter data and sends the output feature map data, i.e., the convolution multiply-accumulate result, to the output data cache scheduling unit.
The output data cache scheduling unit sends the convolution multiply-accumulate result to the memory for storage.
The host reads the convolution multiply-accumulate result from the memory through the PCIe bus.
Another embodiment of the present invention further provides a convolution multiply-accumulate hardware acceleration method for a convolutional neural network, including the following steps:
Step one: acquire PE input feature map data and SIMD input parameter data respectively, where both PE and SIMD are powers of 2; perform a floating-point multiplication of each input feature map datum with each input parameter datum using PE × SIMD floating-point multipliers, and output the PE × SIMD multiplication results as the input data of the convolution addition-tree unit.
The latency of each floating-point multiplier may be more than one clock cycle.
Step two: take the SIMD multiplication results corresponding to each input feature map datum as one group, dividing the PE × SIMD multiplication results into PE groups; feed each group into one floating-point addition tree (PE floating-point addition trees in total), and output the PE addition-tree results as the input of the convolution forward addition-chain unit.
Each floating-point addition tree is composed of SIMD − 1 floating-point adders and sums the SIMD multiplication results in its group; the latency of each floating-point adder may be more than one clock cycle.
Step three: employ PE addition chains, where the input data of each addition chain is the corresponding addition-tree result; each addition chain accumulates input data spanning more than one clock cycle, and its length is determined by the number of clock cycles to be accumulated. The PE floating-point values output by all the addition chains form the output feature map data, which is the convolution multiply-accumulate result.
Further, each addition chain accumulates input data spanning more than one clock cycle, and its length is determined by the number of clock cycles to be accumulated. Specifically, each addition chain comprises log2(n) adders, where n is the number of clock cycles to be accumulated. For the pe-th addition chain (pe = 1, …, PE), the input data is the pe-th addition-tree result. The first adder accumulates the input data of adjacent clock cycles and outputs the accumulated data as the input of the second adder; the second adder accumulates its inputs of adjacent clock cycles and outputs the result as the input of the next adder; and so on, until the last adder outputs the result of the current addition chain.
Advantageous effects:
The floating-point separable convolution computation acceleration apparatus and system provided by the invention improve the depthwise convolution within separable convolution. The principle of the improvement is as follows: according to the general input requirements of floating-point depthwise convolution computation, the input feature map floating-point data and the input parameter floating-point data are scheduled to prepare the inputs of the depthwise convolution multiplication matrix of the next step; the depthwise convolution multiplication matrix completes the multiplication of corresponding terms of the depthwise convolution input feature map floating-point data and the input parameter floating-point data and outputs the multiplication results, which are floating-point data; the multiplication results are then input to the depthwise forward accumulation tree matrix unit, which completes the accumulation of the corresponding terms and finally produces the output feature map floating-point data. The invention removes the computation-throughput bottleneck caused by the existing closed-loop implementation of floating-point depthwise convolution multiply-accumulate, achieving line-rate computation throughput in floating-point depthwise convolution multiply-accumulate computation.
Detailed Description
The invention is described in detail below by way of example with reference to the accompanying drawings.
The invention provides a floating-point separable convolution computation acceleration apparatus, which comprises a pointwise convolution module used for performing accelerated convolution computation on first input feature map data with first input parameter data and outputting first output feature map data.
The pointwise convolution module performs accelerated convolution computation on the first input feature map data and outputs the first output feature map data. The pointwise convolution module in the embodiment of the present invention may use a conventional pointwise convolution acceleration module in the art, or may adopt the pointwise convolution multiply-accumulate hardware acceleration apparatus structure described in patent application No. 202011587375.4 (entitled "Convolution multiply-accumulate hardware acceleration apparatus, system and method of a convolutional neural network") to accelerate the pointwise convolution multiply-accumulate operation. The input parameter data in the pointwise convolution module is the first input parameter data, used for the accelerated convolution computation of the first input feature map data. According to the disclosure of patent application No. 202011587375.4, the apparatus includes a convolution multiplication unit, a convolution addition-tree unit, and a convolution forward addition-chain unit; in the present invention, the corresponding units in the pointwise convolution module are referred to as the pointwise convolution multiplication unit, the pointwise convolution addition-tree unit, and the pointwise convolution forward addition-chain unit.
According to the disclosure of patent application No. 202011587375.4, the pointwise convolution multiplication unit comprises PE × SIMD floating-point multipliers. The inputs of each floating-point multiplier are first input feature map data and first input parameter data, and the latency of each floating-point multiplier may be more than one clock cycle. PE is the number of first input feature map data, SIMD is the number of first input parameter data, and both PE and SIMD are powers of 2. The pointwise convolution multiplication unit outputs the PE × SIMD pointwise convolution multiplication results as the input data of the pointwise convolution addition-tree unit.
The pointwise convolution addition-tree unit comprises PE floating-point addition trees. Its input data are the PE × SIMD pointwise convolution multiplication results; the SIMD multiplication results corresponding to each first input feature map datum form one group, and each of the PE groups is fed into one floating-point addition tree.
Each floating-point addition tree is composed of SIMD − 1 floating-point adders and sums the SIMD multiplication results in its group; the latency of each floating-point adder may be more than one clock cycle. The pointwise convolution addition-tree unit outputs the PE addition-tree results as the input of the pointwise convolution forward addition-chain unit.
The pointwise convolution forward addition-chain unit comprises PE addition chains. The input data of each addition chain is the corresponding addition-tree result; each chain accumulates input data spanning more than one clock cycle, and the length of the chain is determined by the number of clock cycles to be accumulated. The output of the pointwise convolution forward addition-chain unit is the first output feature map data, composed of the PE floating-point values output by all the addition chains; this first output feature map data is the convolution multiply-accumulate result.
Each addition chain accumulates input data spanning more than one clock cycle, and its length is determined by the number of clock cycles to be accumulated. Specifically, each addition chain comprises log2(n) adders, where n is the number of clock cycles to be accumulated. For the pe-th addition chain (pe = 1, …, PE), the input data is the pe-th addition-tree result. The first adder accumulates the input data of adjacent clock cycles and outputs the accumulated data as the input of the second adder; the second adder accumulates its inputs of adjacent clock cycles and outputs the result as the input of the next adder; and so on, until the last adder outputs the result of the current addition chain.
In the embodiment of the invention, the floating-point separable convolution computation apparatus further comprises a depthwise convolution module. The input of the depthwise convolution module is second input feature map data; the depthwise convolution module performs accelerated convolution computation on the second input feature map data with configured second input parameter data to obtain second output feature map data, and the second output feature map data serves as the first input feature map data.
The workflow of the depthwise convolution module is shown in FIG. 1 and its structural composition in FIG. 2; the module comprises a depthwise convolution multiplication matrix unit and a depthwise convolution forward accumulation tree matrix.
The structure of the depthwise convolution multiplication matrix unit is shown in FIG. 3. It comprises PE × 1 floating-point convolution multipliers; the multipliers are general-purpose floating-point multipliers customized from fixed-point DSP hard cores, and no constraint is placed on their latency, which may be several clock cycles. The inputs of the depthwise convolution multiplication matrix are the second input feature map data and the second input parameter data: on each clock cycle the input feature map data is one floating-point value (Ipt0), and the input parameter data are PE floating-point values Wgt0 through Wgt(PE−1) (FIG. 3 shows Wgt0, Wgt1, Wgt2, and Wgt3). The output is the depthwise convolution multiplication result, containing PE × 1 floating-point values. Because the depthwise convolution multiplication matrix built from general-purpose floating-point multipliers is an open-loop design that introduces no closed-loop operation, the unit places no constraint on the latency of the particular floating-point multiplier, which may be several clock cycles, and the unit achieves line-rate computation throughput.
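A one-cycle behavioral sketch of this matrix (illustrative names; not RTL), showing the single feature value broadcast across the PE per-channel weights:

```python
# Behavioral sketch of the depthwise convolution multiplication matrix:
# each clock, one feature value Ipt0 is multiplied by PE weights in parallel.
def depthwise_multiply(ipt0, wgt):
    """ipt0: the one input feature value this cycle; wgt: PE per-channel
    weights (Wgt0..Wgt3 in FIG. 3). Returns PE x 1 products."""
    return [ipt0 * w for w in wgt]
```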
The structure of the depthwise convolution forward accumulation tree matrix unit is shown in FIG. 4. It comprises PE (processing element) forward accumulation trees and completes the accumulation of the depthwise convolution multiplication results for a k × k convolution kernel. The input of the matrix is the depthwise convolution multiplication result from the previous step, containing PE × 1 floating-point values; its output is the second output feature map data, also containing PE × 1 floating-point values. Each forward accumulation tree consists of two input timing scheduling modules and two addition trees connected in series, in the order: first input timing scheduling module, first addition tree, second input timing scheduling module, second addition tree. The depthwise convolution multiplication result output by the depthwise convolution multiplication matrix unit enters the first input timing scheduling module, which registers timing data over k clock cycles and prepares k inputs for the first addition tree; the first addition tree accumulates these k inputs. The sum from the first addition tree is fed to the second input timing scheduling module, which registers k such sums, covering k × k clock cycles of timing data in total, and prepares k inputs for the second addition tree; the second addition tree accumulates the k sums of the first addition tree. The forward accumulation tree thus completes the accumulation of timing data over k × k clock cycles and outputs the second output feature map data, which is passed to the pointwise convolution module as the first input feature map data. Because the forward accumulation tree is built from addition trees, it is an open-loop design that introduces no closed-loop accumulation; it places no constraint on the latency of the addition trees, which may be several clock cycles, and achieves line-rate computation throughput.
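A stream-level illustration of one forward accumulation tree (assuming the k × k per-cycle products for one output arrive in row order; names are illustrative):

```python
def forward_accumulation_tree(products, k):
    """products: the k*k per-cycle multiplication results for one PE channel.
    Stage 1 sums each group of k (first scheduler + first addition tree);
    stage 2 sums the k stage-1 results (second scheduler + second tree)."""
    assert len(products) == k * k
    row_sums = [sum(products[i * k:(i + 1) * k]) for i in range(k)]  # stage 1
    return sum(row_sums)                                             # stage 2
```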
The first and second input timing scheduling modules have the same structure, consisting of an input register, a counter, and k internal registers. The input timing scheduling structure is shown in FIG. 5, which illustrates the scheduling of k data, where k is typically the convolution kernel size; for k = 3 it consists of one input register (Ipt0), one counter (Counter), three internal registers (R0, R1, R2), and three output registers (Opt0, Opt1, Opt2). The input register is connected to the counter, and the k internal registers are directly connected to the counter.
The input register receives input data over k clock cycles and feeds it to the counter. The input data of the first input timing scheduling module is the multiplication result data; the input data of the second input timing scheduling module is the sums produced by the first addition tree.
The counter writes the input data into the k internal registers in turn, indexing modulo k; once k input data have been registered, the k internal registers output all k values at once, preparing the inputs of the addition tree. The input and output rates of this stage are matched.
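A behavioral sketch of this scheduling module, assuming one datum arrives per cycle (class and method names are illustrative):

```python
class InputTimingScheduler:
    """Models FIG. 5: a mod-k counter steers each cycle's input into one of
    k internal registers; after k cycles all k values are output together."""
    def __init__(self, k):
        self.k = k
        self.counter = 0
        self.regs = [0.0] * k       # internal registers R0..R(k-1)

    def clock(self, datum):
        self.regs[self.counter % self.k] = datum
        self.counter += 1
        if self.counter % self.k == 0:
            return list(self.regs)  # k outputs ready for the addition tree
        return None                 # still filling
```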
An example of the addition tree design is shown in FIG. 6. Its inputs are k floating-point values, where k is the convolution kernel size (k = 3 in the example); the tree comprises k − 1 floating-point adders, sums the k inputs, and outputs one floating-point value. The floating-point adders are general-purpose floating-point adders customized from fixed-point DSP hard cores. Because the floating-point addition tree built from general-purpose floating-point adders is an open-loop design that introduces no closed-loop operation, the unit places no constraint on the latency of the particular floating-point adder, which may be several clock cycles, and the unit achieves line-rate computation throughput.
The output of the pointwise convolution module, the first output feature map data, is the final separable convolution output result. This completes the floating-point depthwise separable convolution computation acceleration.
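As a purely functional reference for what the two modules compute together (the mathematics, not the hardware schedule; shapes, padding, and names are illustrative assumptions), a depthwise k × k convolution followed by a 1 × 1 pointwise convolution can be sketched with NumPy:

```python
import numpy as np

def separable_conv(x, dw_w, pw_w):
    """x: (C, H, W) second input feature map; dw_w: (C, k, k) depthwise
    kernels (second input parameters); pw_w: (Cout, C) pointwise weights
    (first input parameters). 'Valid' padding, stride 1, no bias."""
    C, H, W = x.shape
    k = dw_w.shape[1]
    Ho, Wo = H - k + 1, W - k + 1
    dw = np.empty((C, Ho, Wo))
    for c in range(C):                     # depthwise: one kernel per channel
        for i in range(Ho):
            for j in range(Wo):
                dw[c, i, j] = np.sum(x[c, i:i+k, j:j+k] * dw_w[c])
    # pointwise: 1x1 convolution mixing channels at every spatial position
    return np.einsum('oc,chw->ohw', pw_w, dw)
```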
An FPGA-accelerated separable convolution inference computing system is shown in FIG. 7; it includes a host and an FPGA separable convolution computing system connected by a PCIe bus.
The FPGA separable convolution computing system consists of a memory and on-chip separable convolution computing logic.
The on-chip separable convolution computing logic comprises a system control unit, an input feature map data cache scheduling unit, an input parameter data cache scheduling unit, a depthwise convolution module, a pointwise convolution module, and an output data cache scheduling unit.
The input feature map data cache scheduling unit stores the second input feature map data and feeds it to the depthwise convolution module.
The input parameter data cache scheduling unit stores the input parameter data, including the first input parameter data used by the pointwise convolution module and the second input parameter data used by the depthwise convolution module.
The depthwise convolution module adopts the depthwise convolution module of the floating-point separable convolution computation acceleration apparatus described above. It performs the depthwise convolution computation on the second input feature map data with the second input parameter data and outputs the second output feature map data, which is fed to the pointwise convolution module as the first input feature map data. The pointwise convolution module performs the convolution computation on the first input feature map data with the first input parameter data and outputs the first output feature map data, i.e., the final separable convolution output result, which is sent to the output data cache scheduling unit.
The output data cache scheduling unit sends the final separable convolution output result to the memory for storage.
The host reads the final separable convolution output from the memory over the PCIe bus.
In practical applications, one FPGA separable convolution computing system may include more than one block of on-chip separable convolution computing logic, so that multilayer convolution operations can be realized.
The FPGA-accelerated separable convolution inference computation flow is shown in FIG. 8. First, the system is initialized to a normal initial state. Second, the host writes the computation data, comprising the input feature map data and the input parameter data, into the memory. Third, the input feature map data and weight data are scheduled from the memory into the caches. Fourth, the floating-point depthwise convolution multiply-accumulate computation is executed, completing the depthwise convolution function. Fifth, the floating-point pointwise convolution multiply-accumulate computation is executed, forming the separable convolution multiply-accumulate result. Sixth, the output feature map data is scheduled in preparation for writing to the memory. Seventh, the data is written into the memory. Eighth, the host reads the convolution computation result from the memory.
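The host-side sequence can be sketched as follows; the `fpga` object, its methods, and the address constants are hypothetical placeholders, not an actual vendor driver API:

```python
# Hypothetical host-side driver sequence mirroring the FIG. 8 flow.
FEATURE_ADDR, PARAM_ADDR, RESULT_ADDR = 0x0000, 0x4000, 0x8000  # illustrative

def run_separable_conv(fpga, feature_map, dw_params, pw_params):
    fpga.reset()                                   # step 1: initialize
    fpga.write(FEATURE_ADDR, feature_map)          # step 2: write inputs
    fpga.write(PARAM_ADDR, dw_params + pw_params)  # ... and parameters
    fpga.start()                                   # steps 3-7 run on-chip
    fpga.wait_done()
    return fpga.read(RESULT_ADDR)                  # step 8: read result
```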
The invention applies the above floating-point separable convolution computation acceleration apparatus structure in an image processing method accelerated by floating-point separable convolution computation: first, an image processing model based on a MobileNet backbone network is constructed, in which the convolution layers adopt the floating-point separable convolution computation acceleration apparatus structure provided by any of the above embodiments; the image to be processed is taken as the first input feature map data, the input parameter data of the depthwise convolution multiplication matrix unit are obtained through learning and training, and the output of the image processing model is the processing result of the image to be processed. Application examples of the above separable convolution multiply-accumulate hardware acceleration scheme for the convolutional neural network are as follows.
the separable convolutional neural network image classification application scene based on deep learning is particularly applied to an image classification model based on a MobileNet backbone network, a separable convolutional multiply-accumulate hardware accelerating device completes one layer of operation of a multilayer convolutional neural network, the input of the separable convolutional neural network is picture data or characteristic map data, the input parameter is a specific parameter of each layer of the convolutional neural network, the parameter can be obtained by training a separable convolutional neural network image classification algorithm through deep learning, and the output of the separable convolutional neural network image classification application scene is characteristic map data or final image classification result data.
In a deep-learning-based separable convolutional neural network image target detection application scenario, specifically an image target detection model based on a MobileNet backbone network, the separable convolution multiply-accumulate hardware acceleration apparatus likewise completes one layer of the multilayer convolutional neural network. Its input is picture data or feature map data; the input parameters are the specific parameters of each layer of the convolutional neural network, which can be obtained by training the separable convolutional neural network image target detection algorithm through deep learning; and its output is feature map data or the final image target detection result data.
In summary, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.