
CN113032723B - Matrix multiplier realizing method and matrix multiplier device - Google Patents


Info

Publication number
CN113032723B
Authority
CN
China
Prior art keywords
matrix
multiplication
operation module
partial products
carry
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110568171.4A
Other languages
Chinese (zh)
Other versions
CN113032723A (en
Inventor
陈钦树
孙继芬
智扬
刘玉佳
廖述京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Communications and Networks Institute
Original Assignee
Guangdong Communications and Networks Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Communications and Networks Institute
Priority to CN202110568171.4A
Publication of CN113032723A
Application granted
Publication of CN113032723B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52 Multiplying; Dividing
    • G06F7/523 Multiplying only

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a method for implementing a matrix multiplier. The method comprises: configuring a first multiplication operation module, a second multiplication operation module, a carry-save addition operation module and a carry look-ahead addition operation module; dividing the plurality of multipliers to be operated on, according to the requirements of the matrix multiplication operation, into small matrices that meet the requirements of the first multiplication operation module and the second multiplication operation module; performing the matrix multiplication on the small matrices to generate a plurality of partial products; compressing the plurality of partial products into two partial products according to their different weights by means of the carry-save addition operation module; and operating on the two partial products by means of the carry look-ahead addition operation module to generate the elements that make up the matrix multiplication result. The disclosed method reduces the number of clock cycles required for matrix operations, improves the utilization of the computing modules, and reduces the waste of computing resources.

Figure 202110568171

Description

Matrix multiplier realizing method and matrix multiplier device
Technical Field
The present invention relates to the field of computer algorithm technologies, and in particular, to a matrix multiplier implementation method and a matrix multiplier device.
Background
Matrix multiplication is widely used in deep learning, image recognition, autonomous driving, 5G/B5G wireless baseband signal processing and other fields. With the rapid development of these fields, especially the application of artificial intelligence and 5G in vertical industries, the demands on the speed of matrix multiplication keep increasing. Existing matrix multiplication is generally implemented on a general-purpose CPU (central processing unit). Because a CPU is designed mainly for control and is not well suited to matrix multiplication, each element of the two matrices must be read out one by one, multiple times, and multiplied according to the rules of matrix multiplication under a hand-written algorithm. The computation is therefore slow and cannot meet the requirements of deep learning, 5G signal processing and the like.
To solve this problem, systolic arrays have been developed for the needs of deep learning to implement matrix multiplication, and implementing matrix multiplication with a systolic array does solve the low operation speed of traditional processors.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a method for implementing a matrix multiplier and a matrix multiplier device, which can reduce the number of clock cycles required by matrix operations, improve the utilization efficiency of the computing modules, and reduce the waste of computing resources.
In order to solve the above technical problem, a first aspect of the present invention discloses a method for implementing a matrix multiplier, where the method includes: configuring a first multiplication operation module, a second multiplication operation module, a carry-save addition operation module and a carry look-ahead addition operation module; dividing a plurality of multipliers to be operated on into small matrices meeting the requirements of the first multiplication operation module and the second multiplication operation module according to the requirements of the matrix multiplication operation; performing the matrix multiplication on the small matrices to generate a plurality of partial products; compressing the plurality of partial products into two partial products according to their different weights by the carry-save addition operation module; and operating on the two partial products through the carry look-ahead addition operation module to generate the elements that form the matrix multiplication result.
In some embodiments, the multipliers comprise matrices, and after the first multiplication operation module, the second multiplication operation module, the carry-save addition operation module and the carry look-ahead addition operation module are configured, the plurality of matrices to be operated on are divided into small matrices meeting the requirements of the first multiplication operation module and the second multiplication operation module according to the requirements of the matrix multiplication operation.
In some embodiments, performing the matrix multiplication on the small matrices to generate a plurality of partial products includes: alternately encoding the even bits and the adjacent bits of the multiplier of the small matrix to generate a plurality of partial products and the weights corresponding to the partial products; and combining according to the weights of the partial products to generate the plurality of partial products.
In some embodiments, the method further comprises: storing, in row mode, the plurality of multipliers to be operated on and the multiplication result composed of the elements that form the matrix multiplication result.
In some embodiments, the method further comprises: storing, in block-column mode according to the divided small matrices, the plurality of multipliers to be operated on and the multiplication result composed of the elements that form the matrix multiplication result.
According to a second aspect of the present invention, a matrix multiplier device is disclosed, the device comprising: a matrix operation control module, used for dividing a plurality of multipliers to be operated on into small matrices meeting the requirements of the first multiplication operation module and the second multiplication operation module according to the requirements of the matrix multiplication operation; the first multiplication operation module and the second multiplication operation module, used for performing the matrix multiplication on the small matrices to generate a plurality of partial products; a carry-save addition operation module, used for compressing the plurality of partial products into two partial products according to their different weights; and a carry look-ahead addition operation module, used for operating on the two partial products to generate the elements that form the matrix multiplication result.
In some embodiments, the first and second multiplication operation modules are implemented to: alternately encode the even bits and the adjacent bits of the multiplier of the small matrix to generate a plurality of partial products and the weights corresponding to the partial products; and combine according to the weights of the partial products to generate the plurality of partial products.
In some embodiments, the device further comprises: a first storage module, used for storing, in row mode, the plurality of multipliers to be operated on and the multiplication result composed of the elements that form the matrix multiplication result.
In some embodiments, the device further comprises: a second storage module, used for storing, in block-column mode according to the divided small matrices, the plurality of multipliers to be operated on and the multiplication result composed of the elements that form the matrix multiplication result.
According to a third aspect of the present invention, a computer-readable storage medium is disclosed, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps in the implementation method of the matrix multiplier as described above.
Compared with the prior art, the invention has the following beneficial effects:
The invention can complete one matrix multiply-add operation per clock cycle. It automatically divides two large matrices into small matrices that meet the requirements of the operation unit according to the particular matrix multiplication, and it compresses partial products to save operation time, which greatly improves the performance of matrix multiplication.
Drawings
FIG. 1 is a schematic flowchart of a method for implementing a matrix multiplier according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating an implementation of a matrix multiplier according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating an implementation of another matrix multiplier according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating an implementation of another matrix multiplier according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a matrix multiplier apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a matrix multiplier implementation apparatus according to an embodiment of the present invention.
Detailed Description
For better understanding and implementation, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "comprises," "comprising," and any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not expressly listed or inherent to such process, method, article, or apparatus.
Implementing matrix multiplication with a conventional systolic array does solve the low operation speed of a traditional processor, but a systolic array built for large matrix operations needs many multiplier resources and is costly, and it does not deliver the required speed for small matrices. For example, the systolic array of Google's TPU is a 256x256 array: it requires 511 clock cycles for any matrix multiplication of size up to 256x256, so even the multiplication of a small 16x16 matrix takes 511 clock cycles.
The embodiments of the invention disclose a matrix multiplier implementation method and a matrix multiplier device that can complete one matrix multiply-add operation per clock cycle, automatically divide two large matrices into small matrices that meet the requirements of the operation unit according to the particular matrix multiplication, and compress partial products to save operation time, which greatly improves the performance of matrix multiplication.
Example one
Referring to fig. 1, fig. 1 is a schematic flow chart illustrating a matrix multiplier implementation method according to an embodiment of the present invention. The matrix multiplier can be applied to different matrix operation combinations, and the application operation combination of the matrix multiplier is not limited in the embodiment of the invention. As shown in fig. 1, the matrix multiplier implementation method may include the following operations:
101. Configure a first multiplication operation module, a second multiplication operation module, a carry-save addition operation module and a carry look-ahead addition operation module.
To provide the multiply-add operations required by matrix operations, this embodiment first configures a first multiplication operation module and a second multiplication operation module for the matrix multiplication, and a carry-save addition operation module and a carry look-ahead addition operation module for the matrix addition.
102. And dividing a plurality of multipliers to be operated into small matrixes meeting the requirements of the first multiplication operation module and the second multiplication operation module according to the requirements of matrix multiplication operation.
To illustrate how the division is performed, this embodiment describes the multiplication of two integers. Assume the multiplication to be performed is p × q, where p and q both have bit width n. If n is odd, q is sign-extended by one bit; if n is even, q is unchanged. The bit width of the processed multiplier is therefore always even, which facilitates the matrix operation. The processed multiplier is named r and its bit width is h. Then r can be expressed in binary as:
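\[
\begin{aligned}
r &= -r_{h-1}\cdot 2^{h-1} + r_{h-2}\cdot 2^{h-2} + \cdots + r_1\cdot 2^{1} + r_0\cdot 2^{0} \\
  &= (-r_{h-1}+r_{h-2})\cdot 2^{h-1} + (-r_{h-2}+r_{h-3})\cdot 2^{h-2} + \cdots + (-r_1+r_0)\cdot 2^{1} + (-r_0+0)\cdot 2^{0} \\
  &= (-2r_{h-1}+r_{h-2}+r_{h-3})\cdot 2^{h-2} + (-2r_{h-3}+r_{h-4}+r_{h-5})\cdot 2^{h-4} + \cdots + (-2r_3+r_2+r_1)\cdot 2^{2} + (-2r_1+r_0+0)\cdot 2^{0} \\
  &= \sum_{n=0}^{h/2-1} (-2r_{2n+1}+r_{2n}+r_{2n-1})\cdot 2^{2n} \qquad (\text{where } r_{-1}=0) \\
  &= \sum_{n=0}^{h/2-1} e_n\cdot 4^{n}
\end{aligned}
\]
where \(e_n = -2r_{2n+1} + r_{2n} + r_{2n-1}\), \(0 \le n \le h/2-1\), \(r_{-1}=0\).
The small matrix resulting from the multiplication of the two multipliers can be expressed as:
\[
p\cdot r = p\cdot\Bigl(\sum_{n=0}^{h/2-1} e_n\cdot 4^{n}\Bigr) = \sum_{n=0}^{h/2-1} (p\cdot e_n)\cdot 4^{n}
\]
The truth table is as follows:
[Figure: truth table]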
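The recoding above can be checked numerically. The following is a minimal Python sketch, not taken from the patent: the function names `booth_radix4_digits` and `truth_table` are hypothetical, and the table it produces is simply the formula for e_n evaluated on all eight bit combinations, which may differ in layout from the original table image.

```python
def booth_radix4_digits(r, h):
    """Return [e_0, ..., e_{h/2-1}] for an h-bit two's-complement value r (h even)."""
    if h % 2 != 0:
        raise ValueError("h must be even; sign-extend the multiplier by one bit first")
    # Python's >> on negative ints is arithmetic, so this yields two's-complement bits.
    bits = [(r >> i) & 1 for i in range(h)]
    digits = []
    for n in range(h // 2):
        r_prev = bits[2 * n - 1] if n > 0 else 0  # r_{-1} = 0
        digits.append(-2 * bits[2 * n + 1] + bits[2 * n] + r_prev)
    return digits


def truth_table():
    """Enumerate (r_{2n+1}, r_{2n}, r_{2n-1}, e_n) for all eight input combinations."""
    return [(hi, mid, lo, -2 * hi + mid + lo)
            for hi in (0, 1) for mid in (0, 1) for lo in (0, 1)]


if __name__ == "__main__":
    for row in truth_table():
        print(row)
    digits = booth_radix4_digits(-93, 8)
    # The weighted digit sum reproduces the original value.
    print(digits, sum(e * 4 ** n for n, e in enumerate(digits)))
```

Every digit falls in the set {-2, -1, 0, 1, 2}, which is why each partial product needs only a shift, an inversion, or a zero rather than a general multiplication.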
103. Perform the matrix multiplication on the small matrices to generate a plurality of partial products.
From the multiplier obtained in the above steps, the even bits of the multiplier and their adjacent bits are encoded alternately to generate a plurality of partial products and the weights corresponding to them. For example, p*r can be decomposed into p*4^n multiplied by one of five values: -2, -1, 0, 1 and 2. Here p*4^n is p shifted left by 2n bits. Multiplying by 2 is a left shift by one bit; multiplying by 1 leaves the value unchanged; multiplying by 0 sets the partial product to 0; multiplying by -1 inverts each bit of the multiplicand and then adds 1; and multiplying by -2 inverts each bit of the multiplicand, shifts the result left by one bit and then adds 2. The add-1 and add-2 terms generated by the individual partial products can be combined: because the weights of adjacent partial products differ by a factor of 4, each add-1 or add-2 term occupies only 2 bits, so the terms can be spliced together, two bits each, into a single partial product with weight 0. Therefore, if the multiplier has h bits, the encoding finally generates h/2+1 partial products, with the weights of adjacent partial products differing by a factor of 4.
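A minimal Python sketch of this step follows, under stated assumptions: the function name `booth_partial_products` is hypothetical, and Python's unbounded integers stand in for sign-extended hardware words, so the bit inversion `~p` behaves like an inversion with infinite sign extension. It shows the h/2 shifted partial products plus the single spliced correction word that collects the add-1 and add-2 terms.

```python
def booth_partial_products(p, r, h):
    """Return h/2 + 1 integers whose sum equals p * r (r is h-bit two's complement, h even)."""
    products = []
    correction = 0  # spliced "+1"/"+2" terms, two bits per digit, weight 2^0
    prev = 0        # r_{-1} = 0
    for n in range(h // 2):
        b0 = (r >> (2 * n)) & 1
        b1 = (r >> (2 * n + 1)) & 1
        e = -2 * b1 + b0 + prev
        prev = b1
        if e == 0:
            pp, fix = 0, 0
        elif e == 1:
            pp, fix = p, 0
        elif e == 2:
            pp, fix = p << 1, 0
        elif e == -1:
            pp, fix = ~p, 1          # invert every bit, add 1 later
        else:  # e == -2
            pp, fix = (~p) << 1, 2   # invert, shift left by one, add 2 later
        products.append(pp << (2 * n))   # weight 4^n
        correction |= fix << (2 * n)     # fields never overlap: fix < 4
    products.append(correction)
    return products


if __name__ == "__main__":
    pps = booth_partial_products(37, -93, 8)
    print(len(pps), sum(pps), 37 * -93)
```

Deferring the add-1 and add-2 terms into one word keeps each individual partial product a pure shift or inversion of the multiplicand, at the cost of one extra operand for the compression stage.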
104. And compressing the partial products into two partial products according to different weights by a carry-save addition operation module.
The partial products generated in the above steps are added according to their weights to obtain the multiplication result. In the prior art, carry look-ahead adders are generally used for this addition, but this incurs carry propagation from the lowest bit to the highest bit. In this embodiment, carry-save adders are used throughout for compression, and a carry look-ahead adder is not used until only two partial products remain; these two are then added by a carry look-ahead adder to obtain the multiplication result. The carry-save addition operation module has three input ports, I1, I2 and Ci (the two addends and the carry from the adjacent lower bit), and two output ports, Co and SUM. SUM has the same weight as I1, I2 and Ci and is the accumulation result of the current bit; Co has twice the weight of SUM and is the carry signal to the next higher bit. The truth table of the operation unit is shown in the following table:
[Figure: truth table of the CSA32 unit]
Illustratively, as shown in FIG. 2, the partial products can be compressed by arranging a plurality of CSA32 adders (CSA: carry-save adder) in a row. If the bit width of the partial products is 32 bits, 32 CSA32 units are arranged in a row, and three partial products are compressed into two partial products, where the weight of Co is 2.
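The behaviour of a CSA32 row can be sketched in Python by treating each integer as a whole row of bits, so that one bitwise expression models all the full-adder cells at once. This is an illustration only; the function name `csa32_row` is hypothetical and the operands are assumed to be non-negative.

```python
def csa32_row(i1, i2, ci):
    """Compress three non-negative operands into (SUM, Co) with i1 + i2 + ci == SUM + 2 * Co."""
    s = i1 ^ i2 ^ ci                         # per-bit sum, same weight as the inputs
    co = (i1 & i2) | (i1 & ci) | (i2 & ci)   # per-bit carry, twice the weight of SUM
    return s, co


if __name__ == "__main__":
    s, co = csa32_row(0x1234, 0x0FF0, 0x00FF)
    # The two outputs together carry the same value as the three inputs.
    print(s + 2 * co == 0x1234 + 0x0FF0 + 0x00FF)
```

No bit of the output depends on a carry from a neighbouring cell, which is why a row of any width has the delay of a single full adder.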
As an additional embodiment, the carry-save addition operation module may also be implemented as a CSA53 counter with five inputs (I1, I2, I3, I4, Ci) and three outputs (SUM, C, Co). Arranging CSA53 counters in a row gives a CSA53 counting row; if the Co of the adjacent lower bit is connected to the Ci of the current bit, the row becomes a CSA42 compressor, which reduces the number of operands by two.
The counter algebraic expression is as follows:
SUM + C * 2 + Co * 2 = I1 + I2 + I3 + I4 + Ci
That is, I1, I2, I3, I4, Ci and SUM have weight 1; C and Co have weight 2.
The truth table of the operation unit is shown in the following table:
[Figure: truth table of the CSA53 counter]
SUM = I1 ^ I2 ^ I3 ^ I4 ^ Ci
C = (I1 ^ I2 ^ I3 ^ I4) & Ci | (~ (I1 ^ I2 ^ I3 ^ I4)) & (I1 & I2 | I3 & I4)
= (I1 ^ I2 ^ I3 ^ I4) & Ci | (~ ((I1 ^ I2 ^ I3 ^ I4) | (~ (I1 & I2 | I3 & I4))))
Co = (I1 | I2) & (I3 | I4)
As can be seen from the above equations, when the unit is used as a CSA42 compressor, connecting the Co of the adjacent lower bit to the Ci of the current bit does not affect the delay, because Ci is formed just when it is needed. The four input data are processed by the 4:2 compressor to obtain two outputs: the result SUM with weight 1 and the data C with weight 2. Since the Co signal does not depend on the Ci signal, no carry-chain propagation delay is produced. Similarly, by arranging CSA42 units in a row, four partial products can be compressed into two partial product results, one of which has weight 2.
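The logic equations above can be exercised directly. The sketch below is a hedged Python model, with the hypothetical function name `csa42_row`, that applies the SUM, C and Co expressions to whole integers and wires each bit's Co to the Ci of the next higher bit by a one-bit left shift; operands are assumed to be non-negative.

```python
from itertools import product


def csa42_row(i1, i2, i3, i4):
    """Compress four non-negative operands into (SUM, C) with i1+i2+i3+i4 == SUM + 2*C."""
    x = i1 ^ i2 ^ i3 ^ i4
    co = (i1 | i2) & (i3 | i4)   # depends only on the four data inputs, not on Ci
    ci = co << 1                 # Co of bit k-1 becomes Ci of bit k
    s = x ^ ci                   # SUM = I1 ^ I2 ^ I3 ^ I4 ^ Ci
    c = (x & ci) | (~x & ((i1 & i2) | (i3 & i4)))
    return s, c


if __name__ == "__main__":
    ok = all(sum(v) == csa42_row(*v)[0] + 2 * csa42_row(*v)[1]
             for v in product(range(16), repeat=4))
    print(ok)
```

Because Co is computed from I1 to I4 alone, the shift that feeds it into the neighbouring Ci never forms a chain longer than one cell, which matches the statement that no carry-chain propagation delay is produced.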
105. Operate on the final two compressed partial products through the carry look-ahead addition operation module to generate the elements that form the matrix multiplication result. Each element of the multiplication result is computed according to the above steps, so that the matrix operation can be performed efficiently and with little resource usage.
According to another preferred embodiment, in practical applications there are two methods for storing the two multiplier matrices, for example A and B. In the first method, the A matrix and the B matrix are stored in DDR memory row by row. During the operation, the elements must be fetched according to the rows of the matrix, so a buffer may be provided inside the matrix operation controller to prefetch several small matrices at a time; the number of small matrices is determined by balancing operation speed against cost, and each small matrix is then sent to the operation modules as described above. The operation result may be stored in DDR memory as small matrix blocks, each stored by row or by column according to configuration. Alternatively, the result may first be buffered in the matrix operation controller and written to DDR memory after several result small matrices have been spliced into full matrix rows. In the second method, the A matrix is divided into small matrix blocks according to the requirements of the matrix multiply-add module and each block is stored by rows, while the B matrix is divided into small matrix blocks according to the requirements of the matrix multiply-add unit and each block is stored by columns. The storage method can be made configurable when the matrix operation controller is designed; the relevant register is then configured according to the user's actual needs to select one of the methods for the matrix multiplication.
According to the method provided by the embodiment, one-time matrix multiplication and addition operation can be completed through one single clock cycle, then two large matrixes are automatically divided into small matrixes meeting the requirements of an operation unit according to different matrix multiplication requirements to perform matrix multiplication, and partial products are ingeniously utilized to perform compression, so that the operation time is saved, and the performance of matrix multiplication is greatly improved.
Example two
Referring to FIG. 3, FIG. 3 is a schematic diagram illustrating an application of a matrix multiplier implementation method according to an embodiment of the present invention. As shown in FIG. 3, the multipliers to be operated on may be a plurality of matrices, specifically the matrices A, B and C, with D denoting the result. The required operation is D = A × B + C, where A, B and C may be matrices of any size as long as the rows and columns of A and B satisfy the matrix multiplication rule. For convenience of description, this embodiment assumes that A, B, C and D are all 32 × 32 square matrices.
It can be seen that the first element \(d_{0,0}\) of the D matrix is obtained by multiplying the corresponding elements of the first row of the A matrix and the first column of the B matrix and then adding the first element of the C matrix; the operation on the specific elements is shown in expression (1).
\[ d_{0,0} = a_{0,0}\,b_{0,0} + a_{0,1}\,b_{1,0} + \cdots + a_{0,31}\,b_{31,0} + c_{0,0} \tag{1} \]
\[ d_{0,1} = a_{0,0}\,b_{0,1} + a_{0,1}\,b_{1,1} + \cdots + a_{0,31}\,b_{31,1} + c_{0,1} \tag{2} \]
\[ d_{31,31} = a_{31,0}\,b_{0,31} + a_{31,1}\,b_{1,31} + \cdots + a_{31,31}\,b_{31,31} + c_{31,31} \tag{3} \]
It can be seen from the operation rule of the D matrix that 32 multipliers are needed in total to multiply the corresponding elements of the first row of the A matrix and the first column of the B matrix. In the specific implementation, all partial products obtained by multiplying the A matrix and the B matrix are compressed by carry-save adders (CSA32 or CSA42) according to the method described in Example one. The first element of the C matrix, which serves as the addend, is also treated as a partial product and compressed along with them. This yields two compressed partial products, which are then accumulated by a carry look-ahead adder to obtain the first element of the D matrix.
Illustratively, take the first element of the result matrix D as an example. First, the 32 partial products with weight 0 of a_{0,0}*b_{0,0}, a_{0,1}*b_{1,0}, …, a_{0,31}*b_{31,0} and the 32 partial products generated by the add-1 or add-2 terms are compressed by CSA42 rows. Every four partial products pass through one CSA42 row compressor, generating a first-stage partial product P1_0Pj with weight 0 and a first-stage partial product P1_2Pj with weight 2, where 0 ≤ j ≤ 15. The 16 P1_0Pj and c_{0,0}, 17 partial products with weight 0 in total, are compressed by 3 CSA42 and 2 CSA32 rows to obtain second-stage partial products P2_0Pk with weight 0 and P2_2Pk with weight 2, where 0 ≤ k ≤ 4.
The 16 partial products P1_2Pj with weight 2 from the first stage and the 5 second-stage partial products P2_2Pk with weight 2 are compressed by 4 CSA42 and 2 CSA32 rows to obtain 6 P3_4Pl with weight 4 and 6 P3_2Pl with weight 2, where 0 ≤ l ≤ 5. The 32 partial products with weight 4 of a_{0,0}*b_{0,0}, a_{0,1}*b_{1,0}, …, a_{0,31}*b_{31,0} and the 6 P3_4Pl with weight 4 are compressed by 8 CSA42 and 2 CSA32 rows to obtain the fourth-stage partial products, namely 10 partial products P3_6Pm with weight 6 and partial products P4_4Pm with weight 4, where 0 ≤ m ≤ 9. The other partial products are compressed in the same way. Finally, once all partial products of a_{0,0}*b_{0,0} + a_{0,1}*b_{1,0} + … + a_{0,31}*b_{31,0} + c_{0,0} have been compressed down to only 2 remaining partial products, these are added by a carry look-ahead adder to obtain the result of expression (1). It can be seen that in this embodiment expression (1) can be implemented with only one multiplier, so only 1024 multipliers are needed to compute the whole D matrix, whereas the conventional design method needs 32 multipliers for expression (1) and 32768 multipliers for the whole D matrix.
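The overall structure (compress everything to two operands, then do a single carry-propagating addition) can be sketched in Python. This is a simplified model and not the stage-by-stage schedule described above: it uses only 3:2 compression, takes complete products a*b as operands instead of Booth partial products, and assumes non-negative inputs. The names `csa32`, `compress_to_two` and `mac_element` are hypothetical.

```python
def csa32(a, b, c):
    """3:2 compression; the carry word is returned already shifted to its weight."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def compress_to_two(operands):
    """Repeatedly apply 3:2 compression until at most two operands remain."""
    ops = list(operands)
    while len(ops) > 2:
        full = len(ops) - len(ops) % 3
        nxt = []
        for k in range(0, full, 3):
            nxt.extend(csa32(ops[k], ops[k + 1], ops[k + 2]))
        nxt.extend(ops[full:])  # leftovers pass through to the next stage
        ops = nxt
    while len(ops) < 2:
        ops.append(0)
    return ops[0], ops[1]


def mac_element(a_row, b_col, c):
    """d = sum(a_k * b_k) + c, with c compressed alongside the products."""
    operands = [x * y for x, y in zip(a_row, b_col)] + [c]
    s, carry = compress_to_two(operands)
    return s + carry  # the only carry-propagating addition


if __name__ == "__main__":
    a_row = list(range(32))
    b_col = [3 * k + 1 for k in range(32)]
    print(mac_element(a_row, b_col, 7))
```

Treating the addend c as one more operand of the compression tree, as the text describes, means the accumulation costs no extra adder stage of its own.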
In other embodiments, the computing capability of the matrix multiplication unit can be increased by pipelining. Because the partial products are compressed by carry-save adders, which have no carry propagation delay, the actual combinational logic path is very short: the partial products generated by 32 multiplications and one element of the C matrix can be compressed in one clock cycle, leaving the final two partial products. A 32-bit multiplication produces a 64-bit result. The carry chain of a 64-bit carry look-ahead adder has to propagate from the lowest bit to the highest bit, and the resulting combinational delay is too large, so the 64-bit carry look-ahead adder is implemented as a two-stage pipeline: the low 32 bits are added in the first clock cycle and the high 32 bits in the second. Together with the compression cycle, one element of the D matrix is therefore completed in three clock cycles. All elements of the D matrix are computed in parallel, so the whole D matrix also needs only three clock cycles to produce its result. Because the matrix multiply-add operation is pipelined, the data of three matrices A, B and C can be sent in the first clock cycle and the data of three new matrices A, B and C in the next clock cycle. That is, the result for the first set of data is available in the third clock cycle and the result for the second set in the fourth clock cycle. This is equivalent to completing one matrix multiply-accumulate operation per clock cycle, with a result latency of three clock cycles.
Therefore, a 32x32 matrix multiply-accumulate operation takes the equivalent of only one clock cycle, which greatly improves the performance of matrix multiplication.
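The two-stage split of the 64-bit addition can be illustrated with a short Python sketch. It models only the division into a low-half stage and a high-half stage joined by one carry bit, not the internals of a carry look-ahead adder; the name `add64_two_stage` is hypothetical and the result wraps modulo 2^64.

```python
MASK32 = (1 << 32) - 1


def add64_two_stage(x, y):
    """Add two 64-bit words in two steps, as a two-stage pipeline would."""
    # Stage 1 (first adder cycle): low 32 bits, producing one carry bit.
    low = (x & MASK32) + (y & MASK32)
    carry = low >> 32
    # Stage 2 (second adder cycle): high 32 bits plus the registered carry.
    high = ((x >> 32) & MASK32) + ((y >> 32) & MASK32) + carry
    return ((high & MASK32) << 32) | (low & MASK32)


if __name__ == "__main__":
    x, y = 0x00000001FFFFFFFF, 0x0000000000000001
    print(hex(add64_two_stage(x, y)))
```

Each stage only has to resolve a 32-bit carry chain, which is the reason given in the text for splitting the adder across two clock cycles.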
EXAMPLE III
Referring to fig. 4, fig. 4 is a schematic application diagram of a matrix multiplier implementation method according to an embodiment of the present invention. For other matrices with an arbitrary number of rows and columns, if the number of rows or columns is not an integral multiple of 32, the rows and columns are zero-padded until both are integral multiples of 32; if they already are integral multiples of 32, the next step is carried out directly. The matrices are first divided by rows or columns according to the requirements of the matrix multiplication operation, and the pieces are sent into the matrix multiplication operation unit for the multiply-add operation. For example, the A matrix is a 64x96 matrix (64 columns by 96 rows, so that its column count matches the 64 rows of B) and the B matrix is a 64x64 matrix. The A matrix can then be divided into 6 small 32x32 matrices, namely A00, A01, ..., A21; the specific division is shown in fig. 4.
The B matrix is likewise divided into 4 small 32x32 matrices, namely B00, B01, B10 and B11; the specific division is shown in fig. 4. When the product of the A matrix and the B matrix is calculated, the first small matrix D00 of the result matrix D is calculated first; D00 is a 32x32 matrix. First, A00 and B00 are taken out and sent into the matrix multiply-add operation unit to compute A00*B00; since no accumulation is required at this point, all values of C00 are 0. Next, A01 and B10 are taken out and sent into the matrix multiply-add operation unit together with the previous result C = A00*B00, which yields the first small matrix of the D matrix, D00 = A01*B10 + A00*B00. The other small matrices of the D matrix are shown in fig. 4 and are computed in the same way, so that the complete multiplication result is obtained.
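The zero padding, the block division and the accumulation D00 = A00*B00 + A01*B10 can be sketched in software as follows. This is an illustrative model only: the helper names (`pad_to_multiple`, `block`, `mac_unit`, `blocked_matmul`) are hypothetical, matrices are plain lists of lists, and `mac_unit` merely stands in for the hardware multiply-add unit. The block size is a parameter so that small inputs can be tried; the embodiment uses 32.

```python
def pad_to_multiple(m, bs):
    """Zero-pad rows and columns of m up to the next multiple of bs."""
    rows, cols = len(m), len(m[0])
    pad_rows, pad_cols = -rows % bs, -cols % bs
    padded = [list(row) + [0] * pad_cols for row in m]
    padded += [[0] * (cols + pad_cols) for _ in range(pad_rows)]
    return padded


def block(m, i, j, bs):
    """Return the bs x bs small matrix at block row i, block column j."""
    return [row[j * bs:(j + 1) * bs] for row in m[i * bs:(i + 1) * bs]]


def mac_unit(a, b, c):
    """Software stand-in for the multiply-add unit: returns a*b + c."""
    bs = len(a)
    return [[c[i][j] + sum(a[i][k] * b[k][j] for k in range(bs))
             for j in range(bs)] for i in range(bs)]


def blocked_matmul(a, b, bs=32):
    """Multiply a by b block by block, accumulating through mac_unit."""
    rows, cols = len(a), len(b[0])
    a, b = pad_to_multiple(a, bs), pad_to_multiple(b, bs)
    block_rows, block_inner, block_cols = len(a) // bs, len(b) // bs, len(b[0]) // bs
    d = [[0] * (block_cols * bs) for _ in range(block_rows * bs)]
    for i in range(block_rows):
        for j in range(block_cols):
            acc = [[0] * bs for _ in range(bs)]  # C starts as all zeros
            for k in range(block_inner):
                acc = mac_unit(block(a, i, k, bs), block(b, k, j, bs), acc)
            for r in range(bs):
                d[i * bs + r][j * bs:(j + 1) * bs] = acc[r]
    # Drop the rows and columns that exist only because of zero padding.
    return [row[:cols] for row in d[:rows]]
```

For the 96-row by 64-column A and 64x64 B of the example, the inner loop runs twice per result block, which corresponds to the two steps described for D00.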
As a comparison with the prior art, suppose for example that a matrix multiply-add unit is designed that completes A × B + C in one clock cycle, where A, B and C are all 32 × 32 small matrices. If a 256x256 matrix multiplication is carried out on a processor, then even if the processor completes one multiplication per clock cycle, at least 65536 clock cycles are needed, not counting the time spent on data access and addition. If the matrix multiplication is carried out with a systolic array, 511 clock cycles are needed. If the matrix multiplication is carried out by the method of the present embodiment, 514 clock cycles are needed. However, the systolic array requires 65536 multipliers, whereas the invention can be implemented with only 1024 multipliers. Therefore, with the method provided by this embodiment, one matrix multiply-add operation can be completed in a single clock cycle, two large matrices are automatically divided into small matrices that meet the requirements of the operation unit according to the different matrix multiplication requirements, and the partial products are compressed rather than added immediately, which saves operation time and greatly improves the performance of the matrix multiplication operation.
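One way to arrive at the figure of 514 clock cycles is sketched below. The accounting is an assumption made for illustration and is not spelled out in the text: it supposes that one 32x32 block multiply-add is issued per clock cycle and that the last result appears after the three-cycle pipeline latency described in the pipelined embodiment. The function name `blocked_cycles` is hypothetical.

```python
def blocked_cycles(n, bs=32, latency=3):
    """Clock cycles for an n x n by n x n product with bs x bs block multiply-adds."""
    blocks_per_side = n // bs
    # Each of the blocks_per_side**2 result blocks needs blocks_per_side
    # multiply-add operations, issued one per clock cycle.
    issued = blocks_per_side ** 3
    # The last issued operation completes latency - 1 cycles after it is issued.
    return issued + latency - 1
```

Under these assumptions, `blocked_cycles(256)` gives 8 * 8 * 8 + 2 = 514, in agreement with the number stated above.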
Example four
Referring to fig. 5, fig. 5 is a schematic diagram of a matrix multiplier device according to an embodiment of the invention. The matrix multiplier device can be applied to different matrix operation combinations, and the application operation combination of the matrix multiplier device is not limited in the embodiment of the invention. As shown in fig. 5, the matrix multiplier apparatus includes:
the matrix operation control module 1 is used for dividing a plurality of multipliers or a plurality of matrices to be operated on into the small matrices required by the first multiplication operation module and the second multiplication operation module, according to the requirements of the matrix multiplication operation. To illustrate how the division follows from these requirements, this embodiment first considers the multiplication of two integers. Consider p × q, where the bit widths of p and q are both n. If n is odd, q is sign-extended by one bit; if n is even, q is left unchanged. The bit width of the processed multiplier is therefore always even, which is convenient for the matrix operation. The processed multiplier is named r and its bit width is h. Then r can be represented in binary (two's complement) as:
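$$
\begin{aligned}
r &= -r_{h-1}\cdot 2^{h-1} + r_{h-2}\cdot 2^{h-2} + \cdots + r_{1}\cdot 2^{1} + r_{0}\cdot 2^{0} \\
  &= (-r_{h-1} + r_{h-2})\cdot 2^{h-1} + (-r_{h-2} + r_{h-3})\cdot 2^{h-2} + \cdots + (-r_{1} + r_{0})\cdot 2^{1} + (-r_{0} + 0)\cdot 2^{0} \\
  &= (-2 r_{h-1} + r_{h-2} + r_{h-3})\cdot 2^{h-2} + (-2 r_{h-3} + r_{h-4} + r_{h-5})\cdot 2^{h-4} + \cdots + (-2 r_{3} + r_{2} + r_{1})\cdot 2^{2} + (-2 r_{1} + r_{0} + 0)\cdot 2^{0} \\
  &= \sum_{n=0}^{h/2-1} (-2 r_{2n+1} + r_{2n} + r_{2n-1})\cdot 2^{2n} \qquad (\text{where } r_{-1} = 0) \\
  &= \sum_{n=0}^{h/2-1} e_{n}\cdot 4^{n}
\end{aligned}
$$
where $e_{n} = -2 r_{2n+1} + r_{2n} + r_{2n-1}$, $0 \le n \le h/2 - 1$, $r_{-1} = 0$.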
The small matrix resulting from the multiplication of the two multipliers can be expressed as:
$$
p \cdot r = p \cdot \left( \sum_{n=0}^{h/2-1} e_{n}\cdot 4^{n} \right) = \sum_{n=0}^{h/2-1} e_{n}\cdot p \cdot 4^{n}
$$
the truth table is as follows:
[Truth table figure: encoded value e_n for each combination of the multiplier bits r_{2n+1}, r_{2n}, r_{2n-1}]
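Since the table itself is an image, the encoding can be recomputed directly from the definition e_n = -2*r_{2n+1} + r_{2n} + r_{2n-1}. The sketch below does that and also checks that the digits reproduce r. The function names `booth_digits` and `truth_table` are hypothetical, and the code is a software illustration of the formula rather than the circuit of the embodiment.

```python
def booth_digits(r, h):
    """Digits e_n (n = 0 .. h/2 - 1) of the h-bit two's complement value r, h even."""
    bits = [(r >> k) & 1 for k in range(h)]
    digits = []
    for n in range(h // 2):
        prev = bits[2 * n - 1] if n > 0 else 0  # r_{-1} = 0
        digits.append(-2 * bits[2 * n + 1] + bits[2 * n] + prev)
    return digits


def truth_table():
    """Rows (r_{2n+1}, r_{2n}, r_{2n-1}, e_n) for all eight bit combinations."""
    rows = []
    for high in (0, 1):
        for mid in (0, 1):
            for low in (0, 1):
                rows.append((high, mid, low, -2 * high + mid + low))
    return rows


if __name__ == "__main__":
    for row in truth_table():
        print(row)
```

Every digit lies in {-2, -1, 0, 1, 2}, which is why only five multiples of the multiplicand are ever needed.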
the first multiplication operation module 2 and the second multiplication operation module 3 are used for carrying out matrix multiplication on the small matrices to generate a plurality of partial products. Using the multiplier processed as described above, the even bits and the adjacent bits of the multiplier of the small matrix are alternately encoded to generate a plurality of partial products and the weights corresponding to them. For example, p*r can be decomposed into terms in which p*4^n is multiplied by one of the five values -2, -1, 0, 1, 2. Here p*4^n is p shifted left by 2n bits. Multiplying by 2 is a left shift by 1 bit; multiplying by 1 leaves the value unchanged; multiplying by 0 sets the partial product to 0; multiplying by -1 inverts each bit of the multiplicand and then adds 1; multiplying by -2 inverts each bit of the multiplicand, shifts the result left by one bit and then adds 2. The "add 1" and "add 2" operations of the individual partial products can be merged: because adjacent partial products differ in weight by a factor of 4, and each "add 1" or "add 2" occupies only 2 bits, the corrections of all partial products can be spliced together, two bits each, into a single additional partial product whose weight is 0. So if the multiplier has h bits, the encoding finally generates h/2+1 partial products, with the weights of adjacent partial products differing by a factor of 4.
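The generation of the h/2+1 partial products, including the spliced correction word, can be sketched as follows. The function name `booth_partial_products` is hypothetical, and the sketch relies on Python's unbounded two's complement integers, so sign extension to the full product width (which fixed-width hardware would need) is implicit here.

```python
def booth_partial_products(p, r, h):
    """Return h/2 shifted partial products plus one merged correction word.

    p is the multiplicand, r the h-bit two's complement multiplier (h even).
    The sum of the returned values equals p * r.
    """
    products = []
    correction = 0
    for n in range(h // 2):
        prev = (r >> (2 * n - 1)) & 1 if n > 0 else 0  # r_{-1} = 0
        e = -2 * ((r >> (2 * n + 1)) & 1) + ((r >> (2 * n)) & 1) + prev
        if e == 0:
            value, fix = 0, 0
        elif e == 1:
            value, fix = p, 0
        elif e == 2:
            value, fix = p << 1, 0
        elif e == -1:
            value, fix = ~p, 1           # invert, then add 1
        else:  # e == -2
            value, fix = (~p) << 1, 2    # invert, shift left, then add 2
        products.append(value << (2 * n))  # weight 4**n
        # Each correction needs only two bits, so the corrections of all
        # digits fit side by side in a single word.
        correction |= fix << (2 * n)
    return products + [correction]
```

For an 8-bit multiplier this yields five values, for example `sum(booth_partial_products(37, -91, 8))` equals `37 * -91`.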
the carry-save addition operation module 3 is used for compressing the plurality of partial products into two partial products according to their different weights. Once the plurality of partial products generated in the above steps has been obtained, all partial products are added according to their weights to obtain the multiplication result. In the prior art, carry look-ahead adders are generally used for this addition, but they incur carry propagation from the lowest bit to the highest bit. In this embodiment, carry-save adders are used for all of the compression, and a carry look-ahead adder is used only at the end: once the partial products have been compressed to just two, those two partial products are added by a carry look-ahead adder to obtain the multiplication result. The carry-save addition operation module has three input ports, I1, I2 and Ci, which are respectively the augend, the addend and the carry from the adjacent lower bit, and two output ports, Co and SUM. SUM has the same weight as I1, I2 and Ci and is the sum of the current bit; Co has twice the weight of SUM and is the carry signal to the next higher bit. The truth table of the arithmetic unit is shown in the following table:
[Truth table figure: outputs SUM and Co for each combination of the inputs I1, I2 and Ci]
SUM = I1 ^ I2 ^ Ci
Co = I1 & I2 | I1 & Ci | I2 & Ci
= I1 & I2 | (I1 | I2) & Ci
Illustratively, as shown in fig. 2, the partial products can be compressed by arranging a plurality of CSA32 adders (CSA: Carry Save Adder) in a row. If the bit width of the partial products is 32 bits, 32 CSA32 adders are arranged in a row, and three partial products are compressed into two partial products, where the weight of Co is 2.
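A row of CSA32 adders can be modelled with bitwise operations on whole words, where every bit position computes SUM and Co independently. The sketch below uses the two equations given above; the names `csa32` and `compress_to_two` are hypothetical, and the order in which operands are grouped is an arbitrary choice made for illustration rather than the arrangement of fig. 2.

```python
def csa32(i1, i2, ci):
    """Compress three words into two: a sum word and a carry word of weight 2."""
    total = i1 ^ i2 ^ ci
    carry = (i1 & i2) | ((i1 | i2) & ci)
    # Co has twice the weight of SUM, so the carry word is shifted left by one.
    return total, carry << 1


def compress_to_two(operands):
    """Repeatedly apply csa32 until only two partial products remain."""
    pending = list(operands)
    while len(pending) > 2:
        total, carry = csa32(pending[0], pending[1], pending[2])
        pending = pending[3:] + [total, carry]
    return pending
```

No carry travels along the word inside `csa32`; the only carry-propagating addition is the final sum of the two remaining words.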
As an additional embodiment, the carry-save addition operation module may also be implemented with CSA53 counters, each having five inputs (I1, I2, I3, I4, Ci) and three outputs (SUM, C, Co). Arranging CSA53 counters in a row gives a CSA53 counter row; if the Co of the adjacent lower bit is connected to the Ci of the current bit, the row becomes a CSA42 compressor. This reduces the number of operands by two (four operands in, two out).
The counter algebraic expression is as follows:
SUM + C * 2 + Co * 2 = I1 + I2 + I3 + I4 + Ci
That is, I1, I2, I3, I4, Ci and SUM have weight 1; C and Co have weight 2.
The truth table of the arithmetic unit is shown in the following table:
[Truth table figure: outputs SUM, C and Co for each combination of the inputs I1, I2, I3, I4 and Ci]
SUM = I1 ^ I2 ^ I3 ^ I4 ^ Ci
C = (I1 ^ I2 ^ I3 ^ I4) & Ci | (~ (I1 ^ I2 ^ I3 ^ I4)) & (I1 & I2 | I3 & I4)
= (I1 ^ I2 ^ I3 ^ I4) & Ci | (~ ((I1 ^ I2 ^ I3 ^ I4) | (~ (I1 & I2 | I3 & I4))))
Co = (I1 | I2) & (I3 | I4)
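The counter equations can be checked against the algebraic expression SUM + C * 2 + Co * 2 = I1 + I2 + I3 + I4 + Ci by enumerating all 32 input combinations. The following sketch does so on single bits; the function names `csa53` and `counter_identity_holds` are hypothetical.

```python
from itertools import product


def csa53(i1, i2, i3, i4, ci):
    """Single-bit CSA53 counter using the equations for SUM, C and Co."""
    x = i1 ^ i2 ^ i3 ^ i4
    total = x ^ ci
    c = (x & ci) | ((x ^ 1) & ((i1 & i2) | (i3 & i4)))  # x ^ 1 is ~x on one bit
    co = (i1 | i2) & (i3 | i4)  # depends only on I1..I4, not on Ci
    return total, c, co


def counter_identity_holds():
    """True if SUM + 2*C + 2*Co equals the number of ones for every input."""
    return all(
        total + 2 * c + 2 * co == sum(bits)
        for bits in product((0, 1), repeat=5)
        for total, c, co in [csa53(*bits)]
    )
```

The enumeration also makes it visible that Co never depends on Ci, which is the property the delay argument relies on.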
As can be seen from the above equations, when the counter is used as a CSA42 compressor, connecting the Co of the adjacent lower bit to the Ci of the current bit has no influence on the delay, because Ci has already been formed by the time it is needed. The 42 compressor processes 4 input data and produces two outputs: the result SUM with weight 1 and the data C with weight 2. Since the Co signal does not depend on the Ci signal, no carry chain propagation delay is generated. Similarly, by arranging CSA42 compressors in a row, 4 partial products can be compressed into two partial product results, one of which has a weight of 2.
the carry look-ahead addition operation module 4 is used for operating on the two partial products to generate the elements that make up the matrix multiplication result. Once the last two partial products have been obtained, they can be added directly by the carry look-ahead addition operation module. Each element of the multiplication result is computed according to the above steps, so that the matrix operation can be performed efficiently without occupying excessive resources.
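The behaviour of a CSA42 row on whole words can be sketched with bitwise operations: the Co word, shifted left by one position, is fed back as the Ci word, and four operands come out as two. The function name `csa42_row` is hypothetical, and the sketch assumes non-negative Python integers of unbounded width, so the Co of the top bit is not dropped as it might be in fixed-width hardware.

```python
def csa42_row(a, b, c, d):
    """Compress four non-negative words into a sum word and a weight-2 carry word."""
    x = a ^ b ^ c ^ d
    co = (a | b) & (c | d)
    ci = co << 1  # Co of each bit becomes Ci of the next higher bit
    total = x ^ ci
    carry = (x & ci) | (~x & ((a & b) | (c & d)))
    # The carry word C has weight 2 relative to SUM, hence the shift.
    return total, carry << 1
```

Because `co` is computed from the four data words alone, `ci` is available without waiting for any other bit position, which mirrors the statement that no carry chain is formed.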
In other embodiments, the apparatus further comprises two storage modules: a first storage module (not shown), used for storing the plurality of matrices to be operated on, and the multiplication result composed of the elements of the matrix multiplication result, row by row; and a second storage module (not shown), used for storing the plurality of matrices to be operated on, and the multiplication result composed of the elements of the matrix multiplication result, in blocks and by columns according to the division into small matrices. In practical application, the first storage module 5 stores the matrix A and the matrix B in DDR memory row by row. During the operation, the elements have to be fetched according to the rows of the matrix, so a cache may be provided inside the matrix operation controller to prefetch a plurality of small matrices each time; the specific number of small matrices can be chosen by balancing operation speed against cost. Each small matrix is then sent into the operation modules for operation in the manner described above. The operation result can also be stored in the DDR memory in small-matrix blocks, with the blocks stored by rows or by columns according to the configuration; alternatively, the operation result is first cached in the matrix operation controller and is stored in the DDR memory after a plurality of small result matrices have been spliced into matrix rows. The second storage module 6 divides the matrix A into small matrix blocks according to the requirements of the matrix multiply-add unit and stores the small matrix blocks by rows, and divides the matrix B into small matrix blocks according to the requirements of the matrix multiply-add unit and stores the small matrix blocks by columns.
The storage mode to be used can be made configurable when the matrix operation controller is designed; the relevant register is then configured according to the actual needs of the user to select one of the modes for carrying out the matrix multiplication.
According to the device provided by the embodiment, one-time matrix multiplication and addition operation can be completed through one single clock cycle, then two large matrixes are automatically divided into small matrixes meeting the requirements of an operation unit according to different matrix multiplication requirements to perform matrix multiplication, and partial products are ingeniously utilized to perform compression, so that the operation time is saved, and the performance of matrix multiplication is greatly improved.
EXAMPLE five
Referring to fig. 6, fig. 6 is a schematic structural diagram of a matrix multiplier implementation apparatus according to an embodiment of the present invention. The matrix multiplier implementation apparatus described in fig. 6 can be applied to a system, and the application system of the matrix multiplier implementation apparatus is not limited in the embodiment of the present invention. As shown in fig. 6, the apparatus may include:
a memory 601 in which executable program code is stored;
a processor 602 coupled to a memory 601;
the processor 602 calls the executable program code stored in the memory 601 to perform the matrix multiplier implementation method described in the first embodiment.
EXAMPLE six
The embodiment of the invention discloses a computer-readable storage medium which stores a computer program for electronic data exchange, wherein the computer program enables a computer to execute the implementation method of the matrix multiplier described in the first embodiment.
EXAMPLE seven
An embodiment of the present invention discloses a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute the matrix multiplier implementation method described in the first embodiment or the second embodiment.
The above-described embodiments are only illustrative, and the modules described as separate components may or may not be physically separate, and the components displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above detailed description of the embodiments, those skilled in the art will clearly understand that the embodiments may be implemented by software plus a necessary general hardware platform, and may also be implemented by hardware. Based on such understanding, the above technical solutions may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, where the storage medium includes a Read-Only Memory (ROM), a Random Access Memory (RAM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-time Programmable Read-Only Memory (OTPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disc memory, magnetic disk memory, tape memory, or any other computer-readable medium that can be used to carry or store data.
Finally, it should be noted that the matrix multiplier implementation method and apparatus disclosed in the embodiments of the present invention are only preferred embodiments of the present invention, which are used only to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A method for implementing a matrix multiplier, the method comprising:
configuring a first multiplication operation module, a second multiplication operation module, a carry-save addition operation module and a carry look-ahead addition operation module;
dividing a plurality of multipliers to be operated into small matrixes required by the first multiplication operation module and the second multiplication operation module according to the requirements of multiplication operation;
carrying out matrix multiplication operation through the small matrix to generate a plurality of partial products, wherein even bits and adjacent bits of a multiplier of the small matrix are alternately encoded to generate weights corresponding to the partial products;
combining according to the weights of the partial products to generate a plurality of partial products;
compressing the partial products to two partial products according to different weights by the carry-save addition operation module;
and operating the two partial products through the carry look-ahead addition operation module to generate elements for forming a matrix multiplication result.
2. The method of claim 1, wherein the multiplier comprises a matrix, and the configuring of the first multiply operation module, the second multiply operation module, the carry-save add operation module, and the carry-look-ahead add operation module further comprises:
and dividing a plurality of matrixes to be operated into small matrixes meeting the requirements of the first multiplication operation module and the second multiplication operation module according to the requirements of matrix multiplication operation.
3. The method of claim 2, wherein the method further comprises:
storing a plurality of multipliers to be operated and a matrix multiplication result composed of elements for composing the matrix multiplication result according to a row mode.
4. The method of claim 2, wherein the method further comprises:
and storing a plurality of multipliers to be operated and matrix multiplication results formed by elements for forming the matrix multiplication results in a block column mode according to the divided small matrix mode.
5. A matrix multiplier device, characterized in that the device comprises:
the matrix operation control module is used for dividing a plurality of multipliers to be operated into small matrixes meeting the requirements of the first multiplication operation module and the second multiplication operation module according to the requirements of matrix multiplication operation;
the first multiplication operation module and the second multiplication operation module are used for carrying out matrix multiplication operation through the small matrix to generate a plurality of partial products;
the carry-save addition operation module is used for compressing the partial products into two partial products according to different weights;
the carry look ahead addition operation module is used for operating the two partial products to generate elements for forming a matrix multiplication result;
wherein the first and second multiplication modules are implemented as:
alternately encoding the even bits and the adjacent bits of the multiplier of the small matrix to generate a plurality of partial products and weights corresponding to the partial products;
and combining according to the weights of the partial products to generate a plurality of partial products.
6. The matrix multiplier device according to claim 5, characterized in that the device further comprises:
the first storage module is used for storing a plurality of multipliers to be operated and multiplication results formed by elements for forming matrix multiplication results according to a row mode.
7. The matrix multiplier device according to claim 5, characterized in that the device further comprises:
and the second storage module is used for storing a plurality of multipliers to be operated and multiplication results formed by elements for forming matrix multiplication results in a block column mode according to the divided small matrix mode.
8. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method for implementing a matrix multiplier according to any one of claims 1 to 4.
CN202110568171.4A 2021-05-25 2021-05-25 Matrix multiplier realizing method and matrix multiplier device Active CN113032723B (en)


Publications (2)

Publication Number Publication Date
CN113032723A CN113032723A (en) 2021-06-25
CN113032723B true CN113032723B (en) 2021-08-10

Family

ID=76455656


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113805842B (en) * 2021-11-17 2022-02-22 中科南京智能技术研究院 Integrative device of deposit and calculation based on carry look ahead adder realizes
CN115857873B (en) * 2023-02-07 2023-05-09 兰州大学 Multiplier, multiplication calculation method, processing system, and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101625634A (en) * 2008-07-09 2010-01-13 中国科学院半导体研究所 Reconfigurable multiplier
CN110377876A (en) * 2019-07-19 2019-10-25 广东省新一代通信与网络创新研究院 Matrix multiplication operation method, apparatus and computer readable storage medium
CN110825346A (en) * 2019-10-31 2020-02-21 南京大学 Low-logic-complexity unsigned approximate multiplier
CN111652359A (en) * 2020-05-25 2020-09-11 北京大学深圳研究生院 Multiplier Arrays for Matrix Operations and Multiplier Arrays for Convolution Operations
CN111859273A (en) * 2017-12-29 2020-10-30 华为技术有限公司 matrix multiplier
CN111897513A (en) * 2020-07-29 2020-11-06 上海芷锐电子科技有限公司 Multiplier based on reverse polarity technology and code generation method thereof
CN112685001A (en) * 2020-12-30 2021-04-20 中科院微电子研究所南京智能技术研究院 Booth multiplier and operation method thereof

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10671349B2 (en) * 2017-07-24 2020-06-02 Tesla, Inc. Accelerated mathematical engine
US11194886B2 (en) * 2019-05-09 2021-12-07 Applied Materials, Inc. Bit-ordered binary-weighted multiplier-accumulator
CN111428863B (en) * 2020-03-23 2023-05-16 河海大学常州校区 Low-power-consumption convolution operation circuit based on approximate multiplier



KR100900790B1 (en) Method and Apparatus for arithmetic of configurable processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant