
CN112836793A - Floating point separable convolution calculation acceleration device, system and image processing method


Info

Publication number
CN112836793A
CN112836793A
Authority
CN
China
Prior art keywords
convolution
input
data
output
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110061071.2A
Other languages
Chinese (zh)
Other versions
CN112836793B (en)
Inventor
张志超
刘忠麟
王志乾
喻金桃
蒋丽婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Clp Taiji Group Co Ltd
CETC 15 Research Institute
Original Assignee
CETC 15 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 15 Research Institute filed Critical CETC 15 Research Institute
Priority to CN202110061071.2A
Publication of CN112836793A
Application granted
Publication of CN112836793B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7817 Specially adapted for signal processing, e.g. Harvard architectures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483 Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G06F7/485 Adding; Subtracting
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Nonlinear Science (AREA)
  • Signal Processing (AREA)
  • Neurology (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a floating-point separable convolution calculation acceleration device, a system, and an image processing method. The principle is as follows: according to the general input requirements of floating-point depthwise convolution calculation, the input feature map floating-point data and the input parameter floating-point data are scheduled in preparation for the input of the depthwise convolution multiplication matrix in the next step; the depthwise convolution multiplication matrix multiplies the corresponding items of the depthwise convolution input feature map floating-point data and the input parameter floating-point data and outputs the multiplication results, the result values being floating-point data; the multiplication results are input to the depthwise forward accumulation tree matrix unit, which accumulates the corresponding items of the output feature map floating-point data, finally producing the output feature map floating-point data. The invention achieves line-rate computing throughput in floating-point depthwise convolution multiply-accumulate calculation.

Description

Floating point separable convolution calculation accelerating device, system and image processing method
Technical Field
The invention relates to the technical field of artificial neural networks, and in particular to a floating-point separable convolution calculation acceleration device, a floating-point separable convolution calculation acceleration system, and an image processing method.
Background
Convolutional neural network technology is widely applied to image classification and target identification, and general image classification models are gradually developing toward lightweight convolutional neural networks. Compared with conventional convolutional neural networks, a lightweight convolutional neural network such as MobileNet uses separable convolution in its network structure, so both the number of parameters and the computation cost are lower. Separable convolution generally consists of a depthwise convolution (Depthwise Convolution) and a pointwise convolution (Pointwise Convolution), as sketched below.
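As a quick illustration of why this factorization is cheaper, here is a minimal Python sketch (illustrative only, not part of the patent; the channel counts and kernel size are assumptions) comparing parameter counts of a standard convolution with its depthwise-separable equivalent:

```python
def standard_conv_params(c_in, c_out, k):
    # one k x k kernel per (input channel, output channel) pair
    return c_in * c_out * k * k

def separable_conv_params(c_in, c_out, k):
    depthwise = c_in * k * k   # one k x k kernel per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

c_in, c_out, k = 32, 64, 3
print(standard_conv_params(c_in, c_out, k))   # 18432
print(separable_conv_params(c_in, c_out, k))  # 2336 = 288 + 2048, ~8x fewer
```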
In FPGA (Field-Programmable Gate Array) based hardware-accelerated computation of convolutional neural networks, convolution is the dominant computational operation. Constrained by acceleration performance and by the resources available on an FPGA chip, most FPGA convolution designs compute with fixed-point operands, using the FPGA's fixed-point DSP (Digital Signal Processing) hard cores to perform the fixed-point convolution multiply-accumulate operations; such designs can reach line-rate computing throughput for fixed-point convolution multiply-accumulate calculation. The conventional convolution multiply-accumulate design based on fixed-point DSP hard cores contains a closed-loop multiply-accumulate operation. In fixed-point precision this operation has low latency, one clock cycle, so line-rate throughput is achievable; in floating-point precision the latency is several clock cycles, and because of the closed-loop multiply-accumulate structure the line-rate throughput requirement cannot be met, greatly reducing the computing throughput of FPGA-accelerated floating-point convolutional neural networks.
Therefore, for floating-point depthwise separable convolutional neural network accelerated computation, where higher precision is required, the floating-point depthwise convolution multiply-accumulate calculation needs to be redesigned so that it can achieve line-rate computing throughput.
Disclosure of Invention
In view of this, the present invention provides a floating-point separable convolution calculation acceleration device and system, and an image processing method, which improve the FPGA-based floating-point separable convolution calculation process so that the depthwise convolution calculation achieves line-rate computing throughput.
To achieve this purpose, the technical scheme of the invention is as follows: a convolution multiply-accumulate hardware accelerator for a convolutional neural network, comprising a convolution multiplication unit, a convolution addition tree unit, and a convolution forward addition chain unit.
The convolution multiplication unit comprises PE × SIMD floating-point multipliers. The inputs of each floating-point multiplier are input feature map data and input parameter data, and the latency of each floating-point multiplier is more than one clock cycle; here PE is the number of input feature map data, SIMD is the number of input parameter data, and PE and SIMD are both powers of 2. The convolution multiplication unit outputs the PE × SIMD multiplication results as the input data of the convolution addition tree unit.
The convolution addition tree unit comprises PE floating-point addition trees. Its input data are the PE × SIMD multiplication results; the SIMD multiplication results corresponding to each input feature map datum form one group, and each of the PE groups is fed into one floating-point addition tree.
Each floating-point addition tree is composed of SIMD-1 floating-point adders and sums the SIMD multiplication results within its group; the latency of the floating-point adders is more than one clock cycle. The convolution addition tree unit outputs the PE addition tree results as the input of the convolution forward addition chain unit, as sketched below.
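A behavioral sketch of one such addition tree (Python standing in for the RTL, assuming SIMD is a power of 2): SIMD-1 two-input adders arranged in a balanced tree reduce the SIMD products of one group to a single sum:

```python
def adder_tree(products):
    """Reduce SIMD products to one sum using SIMD-1 two-input adders."""
    level = list(products)
    while len(level) > 1:
        # each pass models one tree level; it consumes len(level) // 2 adders
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

group = [1.5, -0.25, 2.0, 0.75]  # SIMD = 4, so the tree uses 3 adders
print(adder_tree(group))         # 4.0
```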
The convolution forward addition chain unit comprises PE addition chains. The input data of each addition chain is the corresponding addition tree result; each addition chain accumulates input data across more than one clock cycle, and the length of the addition chain is determined by the number of clock cycles to be accumulated. The output of the convolution forward addition chain unit is the output feature map data, composed of the PE floating-point data output by all the addition chains; this output feature map data is the convolution multiply-accumulate result.
Furthermore, each addition chain accumulates the input data of more than one clock cycle, and the length of the addition chain is determined by the number of clock cycles to be accumulated, specifically as follows: each addition chain comprises log2(n) adders, where n is the number of clock cycles to be accumulated. For the pe-th addition chain (pe ranging from 1 to PE), the corresponding input data is the pe-th addition tree result. The first adder accumulates the input data of adjacent clock cycles and its output serves as the input of the second adder; the second adder adds its inputs from adjacent clock cycles and its output serves as the input of the next adder; and so on, until the last adder outputs the result of the current addition chain.
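A behavioral sketch of one forward addition chain under this reading (an assumption; n must be a power of 2 here): stage i of the chain combines partial sums that are 2^i cycles apart, so the adder tree results of n consecutive cycles collapse into one total through log2(n) pipelined adders, with no closed accumulation loop:

```python
import math

def forward_addition_chain(samples):
    """Fold the adder tree results of n consecutive clock cycles into one sum."""
    n = len(samples)
    assert n & (n - 1) == 0, "n must be a power of 2"
    stage = list(samples)
    for _ in range(int(math.log2(n))):
        # one chain adder: sum results of adjacent cycles (or cycle groups)
        stage = [stage[i] + stage[i + 1] for i in range(0, len(stage), 2)]
    return stage[0]

print(forward_addition_chain([1.0, 2.0, 3.0, 4.0]))  # 10.0 via log2(4) = 2 adders
```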
Another embodiment of the present invention further provides a convolution multiply-accumulate hardware acceleration system for a convolutional neural network, comprising a host and an FPGA convolution computing system connected by a PCIe bus.
The FPGA convolution computing system consists of a memory and on-chip convolution inference computing logic.
The on-chip convolution inference computing logic comprises a system control unit, an input feature map data cache scheduling unit, an input parameter data cache scheduling unit, a convolution multiply-accumulate calculating unit, and an output data cache scheduling unit.
The input feature map data cache scheduling unit stores the input feature map data and feeds it into the convolution multiply-accumulate calculating unit.
The input parameter data cache scheduling unit stores the input parameter data and feeds it into the convolution multiply-accumulate calculating unit.
The convolution multiply-accumulate calculating unit adopts the convolution multiply-accumulate hardware accelerator structure described above, performs the convolution multiply-accumulate operation on the input feature map data with the input parameter data, and sends the output feature map data, i.e. the convolution multiply-accumulate result, to the output data cache scheduling unit.
The output data cache scheduling unit sends the convolution multiply-accumulate result to the memory for storage.
The host reads the convolution multiply-accumulate result from the memory through the PCIe bus.
Another embodiment of the present invention further provides a convolution multiply-accumulate hardware acceleration method for a convolutional neural network, comprising the following steps:
Step one, respectively acquire PE input feature map data and SIMD input parameter data, where PE and SIMD are both powers of 2; perform floating-point multiplication of each input feature map datum with each input parameter datum using PE × SIMD floating-point multipliers, and output the PE × SIMD multiplication results as the input data of the convolution addition tree unit.
The latency of the floating-point multipliers is more than one clock cycle.
Step two, group the PE × SIMD multiplication results so that the SIMD multiplication results corresponding to each input feature map datum form one group, giving PE groups; feed each group into one floating-point addition tree, PE floating-point addition trees in total, and output the PE addition tree results as the input of the convolution forward addition chain unit.
Each floating-point addition tree is composed of SIMD-1 floating-point adders and sums the SIMD multiplication results within its group; the latency of the floating-point adders is more than one clock cycle.
Step three, use PE addition chains, where the input data of each addition chain is the corresponding addition tree result; each addition chain accumulates input data across more than one clock cycle, and the length of the addition chain is determined by the number of clock cycles to be accumulated. The PE floating-point data output by all the addition chains form the output feature map data, which is the convolution multiply-accumulate result.
Further, each addition chain accumulates the input data of more than one clock cycle, and the length of the addition chain is determined by the number of clock cycles to be accumulated, specifically: each addition chain comprises log2(n) adders, where n is the number of clock cycles to be accumulated; for the pe-th addition chain (pe ranging from 1 to PE), the corresponding input data is the pe-th addition tree result; the first adder accumulates the input data of adjacent clock cycles and its output serves as the input of the second adder; the second adder accumulates its inputs from adjacent clock cycles and its output serves as the input of the next adder; and so on, until the last adder outputs the result of the current addition chain.
Advantageous effects:
The floating-point separable convolution calculation acceleration device and system provided by the invention improve the depthwise convolution within separable convolution. The principle of the improvement is as follows: according to the general input requirements of floating-point depthwise convolution calculation, schedule the input feature map floating-point data and the input parameter floating-point data in preparation for the input of the depthwise convolution multiplication matrix of the next step; use the depthwise convolution multiplication matrix to multiply the corresponding items of the depthwise convolution input feature map floating-point data and the input parameter floating-point data, and output the multiplication results as floating-point data; input the multiplication results to the depthwise forward accumulation tree matrix unit to accumulate the corresponding items of the output feature map floating-point data, finally producing the output feature map floating-point data. The invention removes the computing throughput bottleneck caused by the existing closed-loop floating-point depthwise convolution multiply-accumulate operation, and achieves line-rate computing throughput in floating-point depthwise convolution multiply-accumulate calculation.
Drawings
FIG. 1 is a flow chart of floating point depth separable convolution calculation based on a forward accumulation tree according to an embodiment of the present invention;
FIG. 2 is a diagram of a depth separation convolution calculation unit based on a forward accumulation tree;
FIG. 3 is a diagram of a depth separation convolution multiplication matrix structure;
FIG. 4 is a diagram of a depth separation convolution forward accumulation tree matrix structure;
FIG. 5 is a diagram of an input timing schedule;
FIG. 6 is an example adder tree design;
FIG. 7 is a separable convolution inference computation system based on FPGA acceleration;
FIG. 8 is a flow chart of separable convolution inference computation based on FPGA acceleration.
Detailed Description
The invention is described in detail below by way of example with reference to the accompanying drawings.
The invention provides a floating-point separable convolution calculation acceleration device, which comprises a point convolution module. The point convolution module performs convolution acceleration calculation on first input feature map data using first input parameter data and outputs first output feature map data.
The point convolution module in the embodiment of the present invention may use a conventional point convolution acceleration calculation module known in the art, and may also use the multiply-accumulate hardware acceleration apparatus structure for point convolution described in patent application No. 202011587375.4 (titled "Convolution multiplication and accumulation hardware acceleration apparatus, system and method of convolutional neural network") to implement the accelerated point convolution multiply-accumulate operation. The input parameter data in the point convolution module is the first input parameter data, used to perform convolution acceleration calculation on the first input feature map data. According to the disclosure of patent application No. 202011587375.4, the apparatus includes a convolution multiplication unit, a convolution addition tree unit, and a convolution forward addition chain unit; in the present invention, the units in the point convolution module can be named: a point convolution multiplication unit, a point convolution addition tree unit, and a point convolution forward addition chain unit.
According to the disclosure in patent application No. 202011587375.4, the point convolution multiplication unit includes PE × SIMD floating-point multipliers; the inputs of each floating-point multiplier are first input feature map data and first input parameter data, and the latency of each floating-point multiplier is more than one clock cycle; here PE is the number of first input feature map data, SIMD is the number of first input parameter data, and PE and SIMD are both powers of 2. The point convolution multiplication unit outputs the PE × SIMD point convolution multiplication results as the input data of the point convolution addition tree unit.
The point convolution addition tree unit comprises PE floating-point addition trees; its input data are the PE × SIMD point convolution multiplication results, where the SIMD multiplication results corresponding to each first input feature map datum form one group, and each of the PE groups is fed into one floating-point addition tree.
Each floating-point addition tree is composed of SIMD-1 floating-point adders and sums the SIMD multiplication results within its group; the latency of the floating-point adders is more than one clock cycle. The point convolution addition tree unit outputs the PE addition tree results as the input of the point convolution forward addition chain unit.
The point convolution forward addition chain unit comprises PE addition chains; the input data of each addition chain is the corresponding addition tree result, each addition chain accumulates input data across more than one clock cycle, and the length of the addition chain is determined by the number of clock cycles to be accumulated. The output of the point convolution forward addition chain unit is the first output feature map data, composed of the PE floating-point data output by all the addition chains; the first output feature map data is the convolution multiply-accumulate result.
Each addition chain accumulates the input data of more than one clock cycle, and the length of the addition chain is determined by the number of clock cycles to be accumulated, specifically: each addition chain comprises log2(n) adders, where n is the number of clock cycles to be accumulated; for the pe-th addition chain (pe ranging from 1 to PE), the corresponding input data is the pe-th addition tree result; the first adder accumulates the input data of adjacent clock cycles and its output serves as the input of the second adder; the second adder accumulates its inputs from adjacent clock cycles and its output serves as the input of the next adder; and so on, until the last adder outputs the result of the current addition chain.
In the embodiment of the invention, the floating-point separable convolution calculation module further comprises a depthwise (depth separation) convolution module. The input of the depthwise convolution module is second input feature map data; the depthwise convolution module performs the convolution acceleration operation on the second input feature map data with second input parameter data to obtain second output feature map data, and the second output feature map data serves as the first input feature map data.
The workflow of the depthwise convolution module is shown in FIG. 1, and its structure is shown in FIG. 2: it includes a depthwise convolution multiplication matrix unit and a depthwise convolution forward accumulation tree matrix.
The structure of the depthwise convolution multiplication matrix unit is shown in FIG. 3. It comprises PE × 1 floating-point convolution multipliers; the multipliers are general-purpose floating-point multipliers customized on fixed-point DSP hard cores, with no requirement on their latency, which may be several clock cycles. The inputs of the depthwise convolution multiplication matrix are the second input feature map data and the second input parameter data: in each clock cycle the input feature map data is 1 floating-point datum (Ipt0), and the input parameter data are PE floating-point data Wgt0 to WgtPE (in FIG. 3: Wgt0, Wgt1, Wgt2, Wgt3). The output is the depthwise convolution multiplication result, containing PE × 1 floating-point data. Because the depthwise convolution multiplication matrix built from general-purpose floating-point multipliers is an open-loop design that introduces no closed-loop operation, the unit places no constraint on the latency of the particular floating-point multiplier, which may be several clock cycles, and the unit can reach line-rate computing throughput.
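A one-line behavioral sketch of this multiplication matrix (my paraphrase in Python, not RTL), assuming the PE weights are already resident and one feature map datum arrives per cycle:

```python
def depthwise_mult_matrix(ipt0, weights):
    """PE x 1 multipliers: broadcast one input datum against PE weights."""
    return [ipt0 * w for w in weights]  # PE floating-point products per cycle

print(depthwise_mult_matrix(0.5, [1.0, 2.0, 3.0, 4.0]))  # [0.5, 1.0, 1.5, 2.0]
```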
The structure of the depthwise convolution forward accumulation tree matrix unit is shown in FIG. 4. It includes PE (PE: Process Element) forward accumulation trees and completes the accumulation of the depthwise convolution multiplication results for a k × k convolution kernel. The input of the depthwise convolution forward accumulation tree matrix is the multiplication result of the preceding depthwise convolution step, containing PE × 1 floating-point data; the output of the matrix is the second output feature map data, containing PE × 1 floating-point data. Each forward accumulation tree is formed by connecting two groups of input timing scheduling modules and addition trees in series, in order: a first input timing scheduling module, a first addition tree, a second input timing scheduling module, and a second addition tree. The depthwise convolution multiplication result output by the depthwise convolution multiplication matrix unit is input to the first input timing scheduling module, which registers timing data over k clock cycles and prepares k timed inputs for the first addition tree; the first addition tree accumulates these k inputs, and its sum is input to the second input timing scheduling module; the second input timing scheduling module registers k sums of the first addition tree, completing the timing data registration over k × k clock cycles, and prepares k inputs for the second addition tree; the second addition tree accumulates the k sums of the first addition tree. In total, the forward accumulation tree accumulates the timing data of k × k clock cycles and outputs the second output feature map data, which is input to the point convolution module as the first input feature map data. The unit builds the forward accumulation with addition trees, an open-loop design that introduces no closed-loop accumulation, so there is no constraint on the latency of the addition trees, which may be several clock cycles, and the unit can reach line-rate computing throughput.
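A behavioral sketch of one forward-accumulation-tree lane under this description (a functional simplification that ignores pipeline latency), assuming the k × k products for one output pixel arrive one per clock cycle: the first stage sums each group of k products, the second stage sums the k first-stage sums:

```python
def forward_accumulation_tree(products, k):
    """Accumulate k*k per-cycle depthwise products into one output value."""
    assert len(products) == k * k
    # first scheduler + first addition tree: one partial sum every k cycles
    partial_sums = [sum(products[i:i + k]) for i in range(0, k * k, k)]
    # second scheduler + second addition tree: one output every k*k cycles
    return sum(partial_sums)

k = 3
prods = [float(v) for v in range(1, k * k + 1)]   # products of cycles 1..9
print(forward_accumulation_tree(prods, k))        # 45.0
```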
The first input timing scheduling module and the second input timing scheduling module have the same structure, consisting of an input register, a counter, and k internal registers. The input timing scheduling structure is shown in FIG. 5, which illustrates the scheduling of k data, k typically being the size of the convolution kernel; with k = 3 it is composed of one input register (Ipt0), one counter (Counter), three internal registers (R0, R1, R2), and three output registers (Opt0, Opt1, Opt2). The input register is connected to the counter, and the k internal registers are directly connected to the counter.
The input register receives the input data of k clock cycles and feeds it into the counter; the input data of the first input timing scheduling module is the multiplication result data, and the input data of the second input timing scheduling module is the sum produced by the first addition tree.
The counter registers the input data into the k internal registers modulo k; after k input data have been registered, the k internal registers output the k data at once, preparing the input of the addition tree, with the input rate of this stage matched to the output rate.
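A cycle-level sketch of this scheduler (behavioral Python, with class and method names of my own choosing), assuming one datum arrives per clock cycle: a modulo-k counter steers each datum into one of k internal registers, and all k are released together on every k-th cycle:

```python
class InputTimingScheduler:
    """Collects k per-cycle inputs and releases them as one k-wide output."""

    def __init__(self, k):
        self.k = k
        self.regs = [0.0] * k  # internal registers R0..R(k-1)
        self.count = 0         # the modulo-k counter

    def clock(self, ipt):
        self.regs[self.count % self.k] = ipt  # counter selects register mod k
        self.count += 1
        if self.count % self.k == 0:
            return list(self.regs)  # k data ready for the addition tree
        return None                 # still filling

sched = InputTimingScheduler(k=3)
for cycle, x in enumerate([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]):
    out = sched.clock(x)
    if out is not None:
        print(cycle, out)  # cycle 2: [1.0, 2.0, 3.0]; cycle 5: [4.0, 5.0, 6.0]
```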
An example of the addition tree design is shown in FIG. 6. The input is k floating-point data, where k is the size of the convolution kernel; with k = 3, the example includes k-1 floating-point adders that sum the k data, and the output of the addition tree is 1 floating-point datum. The floating-point adders are general-purpose floating-point adders customized on fixed-point DSP hard cores. Because the floating-point addition tree built from general-purpose floating-point adders is an open-loop design that introduces no closed-loop operation, the unit places no constraint on the latency of the particular floating-point adder, which may be several clock cycles, and the unit can reach line-rate computing throughput.
The output of the point convolution module is the first output feature map data, which is the final separable convolution output result. Thus the floating-point separable convolution calculation acceleration function is completed.
An FPGA-acceleration-based separable convolution inference computing system is shown in FIG. 7, and includes a host and an FPGA separable convolution computing system connected by a PCIe bus.
The FPGA separable convolution computing system consists of a memory and on-chip separable convolution computing logic.
The on-chip separable convolution computing logic comprises a system control unit, an input feature map data cache scheduling unit, an input parameter data cache scheduling unit, a depthwise convolution module, a point convolution module, and an output data cache scheduling unit.
The input feature map data cache scheduling unit stores the second input feature map data and feeds it into the depthwise convolution module.
The input parameter data cache scheduling unit stores the input parameter data, including the first input parameter data used by the point convolution module and the second input parameter data used by the depthwise convolution module.
The depthwise convolution module adopts the depthwise convolution module of the floating-point separable convolution calculation acceleration device described above; it performs the depthwise convolution calculation on the second input feature map data with the second input parameter data and outputs the second output feature map data, which is fed into the point convolution module as the first input feature map data. The point convolution module performs the convolution calculation on the first input feature map data with the first input parameter data and outputs the first output feature map data, i.e. the final separable convolution output result, which is sent to the output data cache scheduling unit.
The output data cache scheduling unit sends the final separable convolution output result to the memory for storage.
The host reads the final separable convolution output result from the memory through the PCIe bus.
In practical applications, one FPGA separable convolution computing system can include more than one on-chip separable convolution computing logic, so that multilayer convolution operations can be realized.
The FPGA-accelerated separable convolution inference computation flow is shown in FIG. 8. In the first step, the system is initialized into a normal initial state. In the second step, the host writes the computation data into the memory; the computation data include the input feature map data and the input parameter data. In the third step, the input feature map data and weight data are scheduled from the memory into the caches. In the fourth step, the floating-point depthwise convolution multiply-accumulate calculation is executed, completing the depthwise convolution function. In the fifth step, the floating-point point convolution multiply-accumulate calculation is executed, forming the final separable convolution multiply-accumulate result. In the sixth step, the output feature map data is scheduled in preparation for writing to the memory. In the seventh step, the data is written to the memory. In the eighth step, the host reads the convolution calculation result from the memory.
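Steps four and five together compute one separable convolution layer; a NumPy reference model of what they produce (an assumption-level functional sketch, not the hardware implementation) is:

```python
import numpy as np

def separable_conv(x, dw_w, pw_w):
    """Depthwise k x k convolution followed by pointwise 1 x 1 convolution.

    x:    (C, H, W) input feature map
    dw_w: (C, k, k) depthwise kernels, one per input channel
    pw_w: (C_out, C) pointwise weights ('valid' padding, stride 1 assumed)
    """
    C, H, W = x.shape
    k = dw_w.shape[1]
    Ho, Wo = H - k + 1, W - k + 1
    dw_out = np.zeros((C, Ho, Wo), dtype=np.float32)
    for c in range(C):                      # depthwise: per-channel convolution
        for i in range(Ho):
            for j in range(Wo):
                dw_out[c, i, j] = np.sum(x[c, i:i + k, j:j + k] * dw_w[c])
    # pointwise: a 1 x 1 convolution is a matrix multiply over channels per pixel
    return np.einsum('oc,chw->ohw', pw_w, dw_out)

rng = np.random.default_rng(0)
x = rng.random((8, 6, 6), dtype=np.float32)
y = separable_conv(x, rng.random((8, 3, 3), dtype=np.float32),
                   rng.random((16, 8), dtype=np.float32))
print(y.shape)  # (16, 4, 4)
```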
The invention applies the above floating-point separable convolution acceleration device structure to an image processing method accelerated by floating-point separable convolution calculation: first, an image processing model based on a MobileNet backbone network is constructed, in which the convolution layers adopt the floating-point separable convolution calculation acceleration device structure provided by any of the embodiments above; the image to be processed is taken as the first input feature map data, the input parameter data of the depthwise convolution multiplication matrix unit are obtained through learning and training, and finally the output of the image processing model is the processing result of the image to be processed. Examples of hardware acceleration schemes for separable convolution multiply-accumulate using the convolutional neural network described above are as follows:
the separable convolutional neural network image classification application scene based on deep learning is particularly applied to an image classification model based on a MobileNet backbone network, a separable convolutional multiply-accumulate hardware accelerating device completes one layer of operation of a multilayer convolutional neural network, the input of the separable convolutional neural network is picture data or characteristic map data, the input parameter is a specific parameter of each layer of the convolutional neural network, the parameter can be obtained by training a separable convolutional neural network image classification algorithm through deep learning, and the output of the separable convolutional neural network image classification application scene is characteristic map data or final image classification result data.
The separable convolutional neural network image target detection application scene based on deep learning is particularly applied to an image target detection model based on a MobileNet backbone network, a separable convolutional multiply-accumulate hardware accelerating device completes one layer of operation of a multilayer convolutional neural network, the input of the separable convolutional neural network is picture data or characteristic map data, the input parameter is a specific parameter of each layer of the convolutional neural network, the parameter can be obtained by training a separable convolutional neural network image target detection algorithm through deep learning, and the output of the separable convolutional neural network image target detection application scene is characteristic map data or final image target detection result data.
In summary, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. A floating-point separable convolution calculation acceleration device, comprising a point convolution module, wherein the point convolution module is used to perform convolution acceleration calculation on first input feature map data using first input parameter data and to output first output feature map data; characterized in that the floating-point separable convolution calculation device further comprises a depthwise convolution module;
the input of the depthwise convolution module is second input feature map data; the depthwise convolution module performs the convolution acceleration operation on the second input feature map data with second input parameter data to obtain second output feature map data; wherein the second output feature map data serves as the first input feature map data;
the depthwise convolution module includes: a depthwise convolution multiplication matrix unit and a depthwise convolution forward accumulation tree matrix;
the depthwise convolution multiplication matrix unit comprises PE × 1 floating-point convolution multipliers, and the latency of the floating-point convolution multipliers is not limited to 1 clock cycle; the inputs of the depthwise convolution multiplication matrix unit are the second input feature map data and the second input parameter data, the second input feature map data of each clock cycle being 1 floating-point datum Ipt0, and the second input parameter data being PE floating-point data, namely Wgt0 to WgtPE; the output is the depthwise convolution multiplication result, containing PE × 1 floating-point data;
the depthwise convolution forward accumulation tree matrix unit comprises PE forward accumulation trees and completes the accumulation of the depthwise convolution multiplication results for a k × k convolution kernel; the input of the depthwise convolution forward accumulation tree matrix unit is the depthwise convolution multiplication result, and the output is the second output feature map data, containing PE × 1 floating-point data;
each forward accumulation tree is formed by connecting two groups of input timing scheduling modules and addition trees in series, in order: a first input timing scheduling module, a first addition tree, a second input timing scheduling module, and a second addition tree;
the depthwise convolution multiplication result output by the depthwise convolution multiplication matrix unit is input to the first input timing scheduling module; the first input timing scheduling module registers timing data over k clock cycles and prepares k timed inputs for the first addition tree; the first addition tree accumulates the k timed inputs, and its sum is input to the second input timing scheduling module; the second input timing scheduling module registers k sums of the first addition tree, completing the timing data registration over k × k clock cycles, and prepares k inputs for the second addition tree; the second addition tree accumulates the k sums of the first addition tree; the forward accumulation tree in total accumulates the timing data of k × k clock cycles and outputs the second output feature map data, which serves as the first input feature map data and is input to the point convolution module;
the output of the point convolution module is the first output feature map data, which is the final separable convolution output result.
2. The device of claim 1, wherein the first input timing scheduling module is structurally identical to the second input timing scheduling module, consisting of an input register, a counter, and k internal registers;
the input register is connected to the counter, and the k internal registers are directly connected to the counter;
the input register receives the input data of k clock cycles and feeds it into the counter;
the counter registers the input data into the k internal registers modulo k, and after k input data have been registered, the k internal registers output the k data at once.
3. The device of claim 2, wherein each addition tree includes k-1 floating-point adders that perform the accumulated sum of k data.
4. A floating-point separable convolution calculation acceleration system, characterized by comprising a host and an FPGA separable convolution computing system, wherein the host and the FPGA separable convolution computing system are connected through a PCIe bus;
the FPGA separable convolution computing system consists of a memory and on-chip separable convolution computing logic;
the on-chip separable convolution computing logic comprises a system control unit, an input feature map data cache scheduling unit, an input parameter data cache scheduling unit, a depthwise convolution module, a point convolution module, and an output data cache scheduling unit;
the input feature map data cache scheduling unit is used to store second input feature map data and feed it into the depthwise convolution module;
the input parameter data cache scheduling unit is used to store input parameter data, including first input parameter data used by the point convolution module and second input parameter data used by the depthwise convolution module;
the depthwise convolution module adopts the depthwise convolution module of the floating-point separable convolution calculation acceleration device according to claim 1, 2 or 3; it performs the depthwise convolution calculation on the second input feature map data with the second input parameter data and outputs second output feature map data, which is fed into the point convolution module as first input feature map data; the point convolution module performs the convolution calculation on the first input feature map data with the first input parameter data and outputs first output feature map data, namely the final separable convolution output result, which is sent to the output data cache scheduling unit;
the output data cache scheduling unit sends the final separable convolution output result to the memory for storage;
the host reads the final separable convolution output result from the memory through the PCIe bus.
5. An image processing method accelerated by floating-point separable convolution calculation, characterized by comprising the following steps:
constructing an image processing model based on a MobileNet backbone network, in which the convolution layers adopt the structure of the floating-point separable convolution calculation acceleration device according to claim 1, 2 or 3; taking an image to be processed as the first input feature map data; obtaining the input parameter data of the depthwise convolution multiplication matrix unit through learning and training; and finally taking the output of the image processing model as the processing result of the image to be processed.
CN202110061071.2A 2021-01-18 2021-01-18 Floating point separable convolution calculation accelerating device, system and image processing method Active CN112836793B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110061071.2A CN112836793B (en) 2021-01-18 2021-01-18 Floating point separable convolution calculation accelerating device, system and image processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110061071.2A CN112836793B (en) 2021-01-18 2021-01-18 Floating point separable convolution calculation accelerating device, system and image processing method

Publications (2)

Publication Number Publication Date
CN112836793A true CN112836793A (en) 2021-05-25
CN112836793B CN112836793B (en) 2022-02-08

Family

ID=75928582

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110061071.2A Active CN112836793B (en) 2021-01-18 2021-01-18 Floating point separable convolution calculation accelerating device, system and image processing method

Country Status (1)

Country Link
CN (1) CN112836793B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180357744A1 (en) * 2015-12-16 2018-12-13 Stc.Unm System and methods for computing 2-d convolutions and cross-correlations
US20180053091A1 (en) * 2016-08-17 2018-02-22 Hawxeye, Inc. System and method for model compression of neural networks for use in embedded platforms
CN106940815A (en) * 2017-02-13 2017-07-11 西安交通大学 A kind of programmable convolutional neural networks Crypto Coprocessor IP Core
CN109146067A (en) * 2018-11-19 2019-01-04 东北大学 A kind of Policy convolutional neural networks accelerator based on FPGA
CN111178519A (en) * 2019-12-27 2020-05-19 华中科技大学 Convolutional neural network acceleration engine, convolutional neural network acceleration system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
邵清 (Shao Qing): "CNN全连接层FPGA硬件实现技术研究" [Research on FPGA hardware implementation of CNN fully-connected layers], China Master's Theses Full-text Database, Information Science and Technology *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113298236A (en) * 2021-06-18 2021-08-24 中国科学院计算技术研究所 Low-precision neural network computing device based on data stream structure and acceleration method
CN116933851A (en) * 2023-06-26 2023-10-24 西安电子科技大学 Depthwise convolution calculation device and convolution calculation method based on deconvolution mapping
CN116933851B (en) * 2023-06-26 2025-08-26 西安电子科技大学 Depthwise convolution calculation device and convolution calculation method based on deconvolution mapping

Also Published As

Publication number Publication date
CN112836793B (en) 2022-02-08

Similar Documents

Publication Publication Date Title
Yepez et al. Stride 2 1-D, 2-D, and 3-D Winograd for convolutional neural networks
CN111684473B (en) Improve the performance of neural network arrays
CN107239829B (en) A Method for Optimizing Artificial Neural Networks
CN107609641B (en) Sparse neural network architecture and implementation method thereof
CN110263925B (en) A hardware acceleration implementation device for forward prediction of convolutional neural network based on FPGA
CN110163353B (en) Computing device and method
US10698657B2 (en) Hardware accelerator for compressed RNN on FPGA
US10691996B2 (en) Hardware accelerator for compressed LSTM
US20180260710A1 (en) Calculating device and method for a sparsely connected artificial neural network
US20180046903A1 (en) Deep processing unit (dpu) for implementing an artificial neural network (ann)
US20240265234A1 (en) Digital Processing Circuits and Methods of Matrix Operations in an Artificially Intelligent Environment
Farrukh et al. Power efficient tiny yolo cnn using reduced hardware resources based on booth multiplier and wallace tree adders
CN108090565A (en) Accelerated method is trained in a kind of convolutional neural networks parallelization
CN110807522B (en) General calculation circuit of neural network accelerator
CN110543939B (en) Hardware acceleration realization device for convolutional neural network backward training based on FPGA
CN110580519B (en) Convolution operation device and method thereof
CN107203808B (en) A kind of two-value Convole Unit and corresponding two-value convolutional neural networks processor
CN112734020B (en) Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
CN115423081A (en) Neural network accelerator based on CNN _ LSTM algorithm of FPGA
CN111126569A (en) A convolutional neural network device and computing method supporting pruning and sparse compression
CN110765413A (en) Matrix summation structure and neural network computing platform
CN112836793B (en) Floating point separable convolution calculation accelerating device, system and image processing method
KR20200020117A (en) Deep learning apparatus for ANN with pipeline architecture
Véstias et al. Hybrid dot-product calculation for convolutional neural networks in FPGA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220704

Address after: 100083 No. 211 middle Fourth Ring Road, Haidian District, Beijing

Patentee after: NO.15 INSTITUTE OF CHINA ELECTRONICS TECHNOLOGY Group Corp.

Patentee after: CLP Taiji (Group) Co., Ltd

Address before: 100083 No. 211 middle Fourth Ring Road, Haidian District, Beijing

Patentee before: NO.15 INSTITUTE OF CHINA ELECTRONICS TECHNOLOGY Group Corp.