CN112529171B - In-memory computing accelerator and optimization method thereof - Google Patents
- Publication number
- CN112529171B (application CN202011406904.6A)
- Authority
- CN
- China
- Prior art keywords
- resolution
- processing unit
- adc
- dac
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Neurology (AREA)
- Memory System (AREA)
Abstract
The application discloses an in-memory computing accelerator and an optimization method thereof, and relates to the technical field of in-memory computing accelerators. The in-memory computing accelerator includes a plurality of processing units. Each processing unit comprises memory cells arranged in arrays, with a corresponding ADC at the output port of each column of memory cells, the ADCs being resolution-configurable, and a plurality of resolution control modules, each controlling the resolution of its corresponding ADC. The application also discloses an optimization method for the in-memory computing accelerator, which comprises: training and quantizing a neural network model; determining the sparsity of the whole neural network while deploying the model; calculating the optimal ADC resolution in each processing unit according to the sparsity of each neural network layer; and dynamically adjusting the resolution of the ADC in each processing unit to the optimal resolution. The method and the device serve to improve the performance of the in-memory computing accelerator.
Description
Technical Field
The application relates to the technical field of in-memory computing accelerators, in particular to an in-memory computing accelerator and an optimization method thereof.
Background
In recent years, neural networks have achieved significant success in practical applications such as image classification and object detection, but this success largely relies on complex neural network models with huge numbers of parameters and computations. Deploying such models, which require extensive computation and data movement, on a neural network accelerator based on the von Neumann architecture (e.g., CPU, GPU, FPGA) runs into the "memory wall" problem: data movement cannot keep pace with data processing, and moving data costs far more energy than processing it.
In-memory computing is an emerging computing architecture. In contrast to a traditional von Neumann accelerator, which separates storage from computation, in-memory computing integrates the two: the computation is completed inside the memory cells themselves. In-memory computing is therefore expected to resolve the "memory wall" problem of the von Neumann architecture. Because storage and computation are fused, in-memory computing can execute neural networks, which consist of large numbers of multiply-accumulate operations, with higher performance.
In existing in-memory computing accelerators, the analog-to-digital converters (ADCs) used to convert between analog and digital signals can account for 50% or more of total power consumption. ADC design is thus a major bottleneck for current in-memory computing accelerators. The prior art generally reduces ADC power consumption by optimizing the ADC's internal circuitry in isolation, but the effect of this approach is limited.
Disclosure of Invention
The embodiments of the present application provide an in-memory computing accelerator and an optimization method thereof that can dynamically optimize the resolution of the ADCs in the accelerator according to the per-layer sparsity of a quantized neural network, thereby reducing the power consumption of in-memory computing and improving its computing capability.
To achieve the above object, in one aspect, an embodiment of the present application provides an in-memory computing accelerator comprising a plurality of processing units. Each processing unit includes: memory cells arranged in a plurality of arrays, the output port of each column of memory cells being provided with a corresponding ADC, wherein the ADCs are resolution-configurable; and a plurality of resolution control modules, each used to control the resolution of its corresponding ADC.
Further, the in-memory computing accelerator includes a DAC (digital-to-analog converter) at the input port of each row of memory cells, wherein the DACs are resolution-configurable; the resolution control module also controls the resolution of the corresponding DAC.
On the other hand, an embodiment of the present application provides an optimization method for the in-memory computing accelerator, comprising the following steps: training and quantizing a neural network model; determining the sparsity of the whole neural network while deploying the model; calculating the optimal resolution of the ADC in each processing unit according to the sparsity of each neural network layer; and dynamically adjusting the resolution of the ADC in each processing unit to the optimal resolution.
Further, the step of calculating the optimal resolution of the ADC in each processing unit according to the sparsity of each neural network layer includes: calculating the optimal resolution of the ADC in each processing unit according to equation (1), wherein RowNum is the number of rows of the memory cell array in each processing unit, W denotes the weights, and Density is the weight density of each layer of the neural network; and storing the calculated optimal ADC resolution of each processing unit into the resolution control module of the corresponding processing unit.
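Equation (1) appears only as an image in the original publication and is not reproduced in this text. The sketch below is therefore a hypothetical reconstruction, assuming the formula follows the bit-line range argument given in the description; the function name, the weight_bits parameter, and the one-bit-input-per-cycle assumption are illustrative, not the patent's:

```python
import math

def adc_optimal_resolution(row_num: int, density: float, weight_bits: int) -> int:
    # With a fraction `density` of the weights nonzero, the bit-line sum is
    # bounded by density * row_num * (2**weight_bits - 1) (one-bit inputs),
    # so the ADC only needs enough bits to cover that reduced range.
    max_bitline_sum = density * row_num * (2 ** weight_bits - 1)
    return max(1, math.ceil(math.log2(max_bitline_sum + 1)))

# e.g. 128 rows, 1-bit weights: density 1.0 -> 8 bits, density 0.25 -> 6 bits
print(adc_optimal_resolution(128, 1.0, 1), adc_optimal_resolution(128, 0.25, 1))
```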
Further, after dynamically adjusting the resolution of the ADC in each processing unit to the optimal resolution, the method further includes: calculating the optimal resolution of the DAC in each processing unit according to the input excitation precision of each neural network layer; and dynamically adjusting the resolution of the DAC in each processing unit to the optimal DAC resolution.
Further, the step of calculating the optimal resolution of the DAC in each processing unit according to the input excitation precision of each neural network layer includes: calculating the optimal resolution of the DAC in each processing unit according to equation (2):

DAC optimal resolution = input excitation precision + 1    (2)

and storing the calculated optimal DAC resolution of each processing unit into the resolution control module of the corresponding processing unit.
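For example, if a layer's input excitations are quantized to 4 bits, equation (2) gives an optimal DAC resolution of 4 + 1 = 5 bits.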
Compared with the prior art, the application has the following beneficial effects:
1. By integrating resolution-configurable ADCs, together with resolution control modules that set their resolution, into the in-memory computing accelerator, the ADC resolution can be dynamically optimized according to the sparsity of each layer of the neural network. This flexibly reduces the power consumption and conversion time of the ADCs, thereby lowering the power consumption and improving the computing capability of the in-memory computing accelerator.
2. By further integrating resolution-configurable DACs and dynamically optimizing both the ADC resolution and the DAC resolution through the resolution control modules, the power consumption and conversion time of the ADCs and DACs are reduced, further lowering the power consumption and improving the computing capability of the in-memory computing accelerator.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an in-memory computing accelerator based on quantized neural network sparsity according to the present application;
FIG. 2 is a schematic diagram of a processing unit in an in-memory computing accelerator based on quantized neural network sparsity according to the present application;
FIG. 3 is a flowchart of an optimization method for an in-memory computing accelerator based on quantized neural network sparsity according to one embodiment of the present application;
FIG. 4 is a flowchart of an optimization method for an in-memory computing accelerator based on quantized neural network sparsity according to another embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
In existing in-memory computing accelerators, the ADC that converts between analog and digital signals can account for more than 50% of total power consumption, so the design of the ADC is the main bottleneck of current in-memory computing accelerators. However, existing research on optimizing the ADC in an in-memory computing accelerator has focused almost exclusively on the ADC's internal circuit implementation, neglecting optimization based on the characteristics of the neural network itself. Related experimental studies have shown that neural network weights exhibit a high degree of sparsity, particularly in quantized neural networks. Furthermore, at the circuit design level, the power consumption of a Flash ADC grows exponentially as its resolution increases, while the power consumption and conversion time of a successive approximation (SAR) ADC grow approximately linearly.
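As a first-order illustration of these scaling trends (a sketch based on textbook approximations, assuming a unit-power comparator; none of the constants come from the patent):

```python
# Flash ADC: ~2**N - 1 parallel comparators, so power grows exponentially
# with resolution N.  SAR ADC: one bit resolved per cycle, so power and
# conversion time grow roughly linearly with N.
def flash_adc_power(n_bits: int, comparator_power: float = 1.0) -> float:
    return (2 ** n_bits - 1) * comparator_power

def sar_adc_cycles(n_bits: int) -> int:
    return n_bits  # one successive-approximation cycle per bit
```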
The present application exploits the sparsity present in common neural network models, in particular quantized neural networks, in which a large number of weights have the value 0. During computation, that is, when the word lines (row lines) of the memory array are activated, a memory cell storing "0" neither adds nor subtracts any voltage or current on its bit line, so the voltage or current range on the bit line shrinks, which in turn relaxes the resolution requirement on the ADC. Therefore, the sparser the weights stored on a given bit line (i.e., the more weights with value 0), the lower the ADC resolution can be set, reducing the power consumption and conversion time of the ADC, and thereby reducing the power consumption and improving the computing capability of the in-memory computing accelerator.
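A small numerical check of this range argument (a sketch with illustrative parameters, assuming 1-bit weights and all 128 word lines driven with 1-bit inputs; none of the numbers are from the patent):

```python
import numpy as np

rng = np.random.default_rng(0)
ROWS = 128  # word lines per bit line, illustrative

def observed_adc_bits(density: float, trials: int = 10_000) -> int:
    # 1-bit weights drawn at the given density; with all inputs high,
    # the bit-line level is just the count of nonzero weights on it.
    sums = (rng.random((trials, ROWS)) < density).sum(axis=1)
    return int(np.ceil(np.log2(sums.max() + 1)))

for d in (1.0, 0.5, 0.25):
    print(f"density {d:.2f}: {observed_adc_bits(d)}-bit ADC suffices")
```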
Referring to fig. 1, an embodiment of the present application provides an in-memory computing accelerator that includes a plurality of slice modules, a pooling module, an accumulation module, an activation module, and a global buffer. The weights of each layer of the neural network are deployed to the slice modules. Each slice module contains multiple processing units (PEs), as well as a slice buffer, accumulators, and an output buffer.
Referring to fig. 2, the processing unit mainly includes a memory cell array, a plurality of ADCs and resolution control modules thereof, a bit line decoder, a word line decoder, an analog multiplexer, a shift register, and the like.
The memory cell array comprises a plurality of memory cells arranged in an array, with an ADC provided at the output port of each column of memory cells; the ADCs are resolution-configurable. During neural network deployment, the optimal ADC resolution of each processing unit, computed in software, is stored in the resolution control module of that processing unit, and the ADC resolution is adjusted to the optimal value. During in-array computation, the bit line decoder and the word line decoder jointly control the memory cell array. After the multiply-accumulate computation completes inside the memory cells, the ADC converts the analog result into a digital signal; the resolution used during conversion is determined by the optimal ADC resolution stored in the resolution control module. In addition, each bit line may have its own ADC, or multiple bit lines may share one ADC through the analog multiplexer. In this way, the embodiment of the present application can dynamically optimize the ADC resolution according to the sparsity of each layer of the neural network, greatly reducing the power consumption and conversion time of the ADCs, thereby reducing the power consumption and improving the computing capability of the in-memory computing accelerator.
In some embodiments, the processing unit further includes a digital-to-analog converter (DAC) at the input port of each row of memory cells, as part of the word line decoder. The DAC is likewise resolution-configurable, and its resolution is also adjusted by the resolution control module. The resolution control module can therefore dynamically optimize both the ADC resolution and the DAC resolution, reducing the power consumption and conversion time of the ADCs and DACs in the in-memory computing accelerator, further reducing the accelerator's power consumption and improving its computing capability.
Referring to fig. 3, an embodiment of the present application also provides an optimization method for the above in-memory computing accelerator, which includes the following steps:
Step S1: training and quantizing a neural network model.
Step S2: determining the sparsity of the whole neural network during deployment of the neural network model.
Step S3: calculating the optimal resolution of the ADC in each processing unit according to the sparsity of each neural network layer.
Step S31: calculating the optimal resolution of the ADC in each processing unit according to equation (1), wherein RowNum is the number of rows of the memory cell array in each processing unit, W denotes the weights, and Density is the weight density of each layer of the neural network.
Step S32: storing the calculated optimal ADC resolution of each processing unit into the resolution control module of the corresponding processing unit.
Step S4: dynamically adjusting the resolution of the ADC in each processing unit to the optimal resolution.
It should be noted that, in step S2, the sparsity of the entire neural network is determined by a sparsity calculation module, which may be integrated into the in-memory computing accelerator or placed outside it. In step S4, the resolution control module of each processing unit dynamically adjusts the ADC resolution of that unit to the optimal resolution. The optimization method of this embodiment is thus based on the sparsity characteristics of the quantized neural network and dynamically optimizes the ADC resolution according to the sparsity of each layer, greatly reducing the power consumption and conversion time of the ADCs in the in-memory computing accelerator, thereby reducing the accelerator's power consumption and improving its computing capability.
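Putting steps S1 through S4 together with the DAC steps S5 and S6 described below, deployment might be orchestrated as in the following sketch; the processing-unit interface (set_adc_resolution, set_dac_resolution) is an illustrative assumption, and adc_optimal_resolution refers to the hypothetical reconstruction of equation (1) sketched earlier:

```python
import numpy as np

def layer_density(weights: np.ndarray) -> float:
    # Step S2: fraction of nonzero weights in the trained, quantized layer.
    return float(np.count_nonzero(weights)) / weights.size

def deploy(layer_weights, processing_units, row_num, weight_bits, activation_bits):
    for weights, pe in zip(layer_weights, processing_units):
        # Steps S3/S4: compute and program the per-layer ADC resolution.
        pe.set_adc_resolution(
            adc_optimal_resolution(row_num, layer_density(weights), weight_bits))
        # Steps S5/S6: DAC resolution from equation (2).
        pe.set_dac_resolution(activation_bits + 1)
```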
Referring to fig. 4, in some embodiments the optimization method further includes:
Step S5: calculating the optimal resolution of the DAC in each processing unit according to the input excitation precision of each neural network layer.
Step S51: calculating the optimal resolution of the DAC in each processing unit according to equation (2):

DAC optimal resolution = input excitation precision + 1    (2)

Step S52: storing the calculated optimal DAC resolution of each processing unit into the resolution control module of the corresponding processing unit.
Step S6: dynamically adjusting the resolution of the DAC in each processing unit to the optimal DAC resolution.
It should be noted that in step S6, the resolution control module of each processing unit dynamically adjusts the DAC resolution of that unit to the optimal resolution. The resolution control module thus dynamically optimizes both the ADC resolution and the DAC resolution, reducing the power consumption and conversion time of the ADCs and DACs in the in-memory computing accelerator, further reducing the accelerator's power consumption and improving its computing capability.
The foregoing is merely a specific embodiment of the present application, but the protection scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered in the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (3)
1. A method of optimizing an in-memory computing accelerator, the in-memory computing accelerator comprising a plurality of processing units, each processing unit comprising: memory cells arranged in a plurality of arrays, the output port of each column of memory cells being provided with a corresponding ADC, wherein the ADCs are resolution-configurable; a plurality of resolution control modules, the resolution control modules being used to control the resolution of the corresponding ADCs; and a DAC at the input port of each row of memory cells, wherein the DACs are resolution-configurable, the resolution control modules also being used to control the resolution of the corresponding DACs; the optimization method comprising the following steps:
training and quantizing a neural network model;
determining the sparsity of the whole neural network during deployment of the neural network model;
calculating the optimal resolution of the ADC in each processing unit according to the sparsity of each neural network layer;
dynamically adjusting the resolution of the ADC in each processing unit to the optimal resolution;
wherein the step of calculating the optimal resolution of the ADC in each processing unit according to the sparsity of each neural network layer comprises:
calculating the optimal resolution of the ADC in each processing unit according to equation (1), wherein RowNum is the number of rows of the memory cell array in each processing unit, W denotes the weights, and Density is the weight density of each layer of the neural network;
and storing the calculated optimal ADC resolution of each processing unit into the resolution control module of the corresponding processing unit.
2. The optimization method according to claim 1, wherein after the step of dynamically adjusting the resolution of the ADC in each processing unit to the optimal resolution, the method further comprises:
calculating the optimal resolution of the DAC in each processing unit according to the input excitation precision of each neural network layer;
dynamically adjusting the resolution of the DAC in each processing unit to the optimal DAC resolution.
3. The optimization method according to claim 2, wherein the step of calculating the optimal resolution of the DAC in each processing unit according to the input excitation precision of each neural network layer comprises:
calculating the optimal resolution of the DAC in each processing unit according to equation (2):

DAC optimal resolution = input excitation precision + 1    (2)

and storing the calculated optimal DAC resolution of each processing unit into the resolution control module of the corresponding processing unit.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011406904.6A CN112529171B (en) | 2020-12-04 | 2020-12-04 | In-memory computing accelerator and optimization method thereof |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011406904.6A CN112529171B (en) | 2020-12-04 | 2020-12-04 | In-memory computing accelerator and optimization method thereof |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN112529171A CN112529171A (en) | 2021-03-19 |
| CN112529171B (en) | 2024-01-05 |
Family
ID=74997550
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202011406904.6A Active CN112529171B (en) | 2020-12-04 | 2020-12-04 | In-memory computing accelerator and optimization method thereof |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN112529171B (en) |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP4367665B1 (en) * | 2021-07-05 | 2025-05-07 | Silicon Storage Technology Inc. | Programmable output blocks for analog neural memory in a deep learning artificial neural network |
| CN114300014B (en) * | 2021-12-30 | 2024-08-02 | 厦门半导体工业技术研发有限公司 | In-memory data processing circuit and resistive random access memory |
| US12198766B2 (en) | 2023-02-22 | 2025-01-14 | Macronix International Co., Ltd. | Artificial neural network operation circuit and in-memory computation device thereof |
| CN116402106B (en) * | 2023-06-07 | 2023-10-24 | 深圳市九天睿芯科技有限公司 | Neural network acceleration method, neural network accelerator, chip and electronic equipment |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN104506195A (en) * | 2014-12-25 | 2015-04-08 | 北京兆易创新科技股份有限公司 | SAR ADC (successive approximation register analog-to-digital converter) with resolution configurable |
| US9906232B1 (en) * | 2017-03-10 | 2018-02-27 | Xilinx, Inc. | Resolution programmable SAR ADC |
| US10009035B1 (en) * | 2017-04-24 | 2018-06-26 | Huawei Technologies Co., Ltd. | Dynamic control of ADC resolution |
| CN111026700A (en) * | 2019-11-21 | 2020-04-17 | 清华大学 | Memory computing architecture for realizing acceleration and acceleration method thereof |
| CN111079919A (en) * | 2019-11-21 | 2020-04-28 | 清华大学 | Memory computing architecture supporting weight sparsity and data output method thereof |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8369458B2 (en) * | 2007-12-20 | 2013-02-05 | Ralink Technology Corporation | Wireless receiving system with an adaptively configurable analog to digital converter |
| JP2013021599A (en) * | 2011-07-13 | 2013-01-31 | Renesas Electronics Corp | Data processing system |
| CN111630527B (en) * | 2017-11-14 | 2024-12-31 | 技术研发基金会有限公司 | Analog-to-digital converters using memory in neural networks |
- 2020-12-04: CN application CN202011406904.6A (granted as CN112529171B, active)
Patent Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN104506195A (en) * | 2014-12-25 | 2015-04-08 | 北京兆易创新科技股份有限公司 | SAR ADC (successive approximation register analog-to-digital converter) with resolution configurable |
| US9906232B1 (en) * | 2017-03-10 | 2018-02-27 | Xilinx, Inc. | Resolution programmable SAR ADC |
| US10009035B1 (en) * | 2017-04-24 | 2018-06-26 | Huawei Technologies Co., Ltd. | Dynamic control of ADC resolution |
| CN110546887A (en) * | 2017-04-24 | 2019-12-06 | 华为技术有限公司 | Dynamic Control of ADC Resolution |
| CN111026700A (en) * | 2019-11-21 | 2020-04-17 | 清华大学 | Memory computing architecture for realizing acceleration and acceleration method thereof |
| CN111079919A (en) * | 2019-11-21 | 2020-04-28 | 清华大学 | Memory computing architecture supporting weight sparsity and data output method thereof |
Also Published As
| Publication number | Publication date |
|---|---|
| CN112529171A (en) | 2021-03-19 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN112529171B (en) | In-memory computing accelerator and optimization method thereof | |
| Fick et al. | Analog matrix processor for edge AI real-time video analytics | |
| US10241971B2 (en) | Hierarchical computations on sparse matrix rows via a memristor array | |
| Wang et al. | Low power convolutional neural networks on a chip | |
| EP3754561A1 (en) | Reconfigurable memory compression techniques for deep neural networks | |
| CN112329910B (en) | Deep convolution neural network compression method for structure pruning combined quantization | |
| CN111026700B (en) | Memory computing architecture for realizing acceleration and acceleration method thereof | |
| WO2021036905A1 (en) | Data processing method and apparatus, computer equipment, and storage medium | |
| WO2021036904A1 (en) | Data processing method, apparatus, computer device, and storage medium | |
| CN115018062B (en) | A convolutional neural network accelerator based on FPGA | |
| WO2021036908A1 (en) | Data processing method and apparatus, computer equipment and storage medium | |
| CN111079919B (en) | An in-memory computing architecture supporting weight sparse and its data output method | |
| KR20240046492A (en) | Sparsity-aware in-memory computing | |
| CN110442323A (en) | Carry out the architecture and method of floating number or fixed-point number multiply-add operation | |
| KR20240025523A (en) | Computation in memory (CIM) architecture and data flow supporting depth-specific convolutional neural network (CNN) | |
| US12217184B2 (en) | Low-power, high-performance artificial neural network training accelerator and acceleration method | |
| CN112181895A (en) | Reconfigurable Architectures, Accelerators, Circuit Deployment, and Computational Dataflow Methods | |
| CN115664422B (en) | Distributed successive approximation type analog-to-digital converter and operation method thereof | |
| CN119152906B (en) | A high-density near-memory computing and in-memory computing hybrid architecture and computing method based on eDRAM | |
| Zhang et al. | YOLOv3-tiny object detection SOC based on FPGA platform | |
| CN113516235B (en) | Deformable convolution accelerator and deformable convolution acceleration method | |
| CN113723044A (en) | Data sparsity-based extra row activation and storage integrated accelerator design | |
| CN221200393U (en) | Small chip device and artificial intelligent accelerator device | |
| Ji et al. | Compacc: Efficient hardware realization for processing compressed neural networks using accumulator arrays | |
| Chang et al. | BFP-CIM: Runtime Energy-Accuracy Scalable Computing-in-Memory-Based DNN Accelerator Using Dynamic Block-Floating-Point Arithmetic |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||