Detailed Description
For better understanding and implementation, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "comprises," "comprising," and any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not expressly listed or inherent to such process, method, article, or apparatus.
A conventional systolic array does solve the low operation speed of a traditional processor for matrix multiplication. However, because it is built for large matrix operations, it needs many multiplier resources and is costly, and it cannot deliver the required operation speed for small matrices. For example, the systolic array of Google's TPU is a 256x256 array: a matrix multiplication of any size up to 256x256 requires 511 clock cycles, so even a small 16x16 matrix multiplication still takes 511 clock cycles.
The embodiment of the invention discloses a matrix multiplier implementation method and a matrix multiplier device. One matrix multiply-add operation is completed in a single clock cycle; two large matrices are automatically divided, according to the matrix multiplication requirements, into small matrices that meet the requirements of the operation unit; and the partial products are compressed to save operation time, which greatly improves the performance of the matrix multiplication operation.
Example one
Referring to FIG. 1, FIG. 1 is a schematic flow chart of a matrix multiplier implementation method according to an embodiment of the present invention. The matrix multiplier can be applied to different combinations of matrix operations, and the embodiment of the invention does not limit the combination of operations to which it is applied. As shown in FIG. 1, the matrix multiplier implementation method may include the following operations:
101. Configure a first multiplication operation module, a second multiplication operation module, a carry-save addition operation module, and a carry-look-ahead addition operation module.
To provide the multiply-add operations required by the matrix operation, this embodiment first configures a first multiplication operation module and a second multiplication operation module for the matrix multiplication, and a carry-save addition operation module and a carry-look-ahead addition operation module for the matrix addition.
102. Divide the plurality of multipliers to be operated on into small matrices that meet the requirements of the first multiplication operation module and the second multiplication operation module, according to the requirements of the matrix multiplication operation.
To illustrate how the division is performed, this embodiment uses the multiplication of two integers as an example. Assume the multiplication to be performed is p × q, where the bit widths of p and q are both n. If n is odd, q is sign-extended by one bit; if n is even, q is unchanged. The bit width of the processed multiplier is therefore always even, which facilitates the matrix operation. The processed multiplier is named r and its bit width is h. Then r can be represented in binary (two's complement) as:
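$$
\begin{aligned}
r &= -r_{h-1} 2^{h-1} + r_{h-2} 2^{h-2} + \cdots + r_1 2^1 + r_0 2^0 \\
  &= (-r_{h-1} + r_{h-2}) 2^{h-1} + (-r_{h-2} + r_{h-3}) 2^{h-2} + \cdots + (-r_1 + r_0) 2^1 + (-r_0 + 0) 2^0 \\
  &= (-2 r_{h-1} + r_{h-2} + r_{h-3}) 2^{h-2} + (-2 r_{h-3} + r_{h-4} + r_{h-5}) 2^{h-4} + \cdots + (-2 r_3 + r_2 + r_1) 2^2 + (-2 r_1 + r_0 + 0) 2^0 \\
  &= \sum_{n=0}^{h/2-1} e_n 4^n
\end{aligned}
$$
where $e_n = -2 r_{2n+1} + r_{2n} + r_{2n-1}$, $0 \le n \le h/2 - 1$, and $r_{-1} = 0$.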
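The result of multiplying the two multipliers can then be expressed as:
$$p \times r = \sum_{n=0}^{h/2-1} e_n \cdot p \cdot 4^n$$
The truth table of $e_n$, which follows directly from its definition, is as follows:
$$
\begin{array}{ccc|c}
r_{2n+1} & r_{2n} & r_{2n-1} & e_n \\ \hline
0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 \\
0 & 1 & 0 & 1 \\
0 & 1 & 1 & 2 \\
1 & 0 & 0 & -2 \\
1 & 0 & 1 & -1 \\
1 & 1 & 0 & -1 \\
1 & 1 & 1 & 0
\end{array}
$$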
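As a quick numerical check of the recoding of r into the digits e_n (commonly known as radix-4 Booth encoding), the following Python sketch computes the digits for an h-bit two's-complement value and confirms that they sum back to r. The function names `booth_digit` and `booth_digits` and the choice h = 8 are illustrative assumptions and do not come from the original text.

```python
def booth_digit(r_hi, r_mid, r_lo):
    """e_n = -2*r_{2n+1} + r_{2n} + r_{2n-1} for one overlapping 3-bit group."""
    return -2 * r_hi + r_mid + r_lo


def booth_digits(r, h):
    """Digits e_0 .. e_{h/2-1} of an h-bit two's-complement integer r (h even)."""
    assert h % 2 == 0, "sign-extend by one bit first if the width is odd"
    bits = [(r >> i) & 1 for i in range(h)]      # r_0 .. r_{h-1}
    digits = []
    for n in range(h // 2):
        r_lo = bits[2 * n - 1] if n > 0 else 0   # r_{-1} = 0
        digits.append(booth_digit(bits[2 * n + 1], bits[2 * n], r_lo))
    return digits


def recombine(digits):
    """Sum of e_n * 4**n, which should reproduce r."""
    return sum(e * 4 ** n for n, e in enumerate(digits))


print(booth_digits(127, 8))   # [-1, 0, 0, 2], i.e. -1 + 2*64 = 127
```

Every digit stays in the range -2 to 2, which is what allows each partial product to be formed by shifting and inverting only.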
103. Generate a plurality of partial products by performing the multiplication on the small matrices.
Using the encoded multiplier obtained in the above step, each even-numbered bit of the multiplier is encoded together with its adjacent bits to generate a plurality of partial products and their corresponding weights. For example, p × r can be decomposed into operations in which $p \cdot 4^n$ is multiplied by one of five values: -2, -1, 0, 1, or 2. Here $p \cdot 4^n$ is p shifted left by 2n bits. Multiplying by 2 shifts the value left by 1 bit; multiplying by 1 leaves it unchanged; multiplying by 0 sets the partial product to 0; multiplying by -1 inverts every bit of the multiplicand and then adds 1; and multiplying by -2 inverts every bit of the multiplicand, shifts the result left by 1 bit, and then adds 2. The add-1 and add-2 corrections generated by the individual partial products can be merged. Since adjacent partial products differ in weight by a factor of 4, and each add-1 or add-2 correction occupies only 2 bits, the corrections of all partial products can be concatenated, two bits each, into a single additional partial product with weight 0 (aligned to bit position 0). Therefore, if the multiplier is h bits wide, the encoding finally generates h/2 + 1 partial products, with the weights of adjacent partial products differing by a factor of 4.
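The partial-product generation described above can be sketched in Python as follows. This is a minimal software model, not the hardware design: it assumes p and r are h-bit two's-complement integers, works modulo 2^(2h), and uses the hypothetical names `booth_partial_products` and `to_signed`. Negative multiples are formed by bit inversion, and the deferred add-1 or add-2 corrections are packed two bits apiece into one extra word, giving h/2 + 1 words in total.

```python
def booth_partial_products(p, r, h):
    """Return h/2 + 1 words (modulo 2**(2h)) whose sum equals p * r."""
    assert h % 2 == 0
    width = 2 * h
    mask = (1 << width) - 1
    words = []
    corrections = 0                       # packed +1 / +2 terms, 2 bits per digit
    for n in range(h // 2):
        r_lo = (r >> (2 * n - 1)) & 1 if n > 0 else 0
        e = -2 * ((r >> (2 * n + 1)) & 1) + ((r >> (2 * n)) & 1) + r_lo
        if e == 0:
            word, fix = 0, 0
        elif e == 1:
            word, fix = p, 0
        elif e == 2:
            word, fix = p << 1, 0         # left shift by one bit
        elif e == -1:
            word, fix = ~p, 1             # invert, correction +1
        else:                             # e == -2
            word, fix = (~p) << 1, 2      # invert, shift, correction +2
        words.append((word << (2 * n)) & mask)   # weight 4**n
        corrections |= fix << (2 * n)
    words.append(corrections)
    return words


def to_signed(value, width):
    """Interpret a width-bit word as a two's-complement integer."""
    return value - (1 << width) if value >> (width - 1) else value


pp = booth_partial_products(-3, 5, 4)
print(len(pp), to_signed(sum(pp) & 0xFF, 8))   # 3 -15
```

Here the final `sum` stands in for the compression and addition stages that follow in the text.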
104. Compress the plurality of partial products into two partial products, according to their weights, by the carry-save addition operation module.
After the partial products are generated in the above steps, all of them must be added according to their weights to obtain the multiplication result. In the prior art, carry-look-ahead adders are generally used for this addition, but each such addition incurs carry propagation from the lowest bit to the highest bit. In this embodiment, all compression is instead performed by carry-save adders, and the carry-look-ahead adder is used only once, after the partial products have been compressed down to two, to add them and obtain the multiplication result. The carry-save addition operation module has three input ports, I1, I2, and Ci, which are the augend, the addend, and the carry from the adjacent lower bit, and two output ports, Co and SUM. SUM has the same weight as I1, I2, and Ci and is the accumulation result of the current bit; Co has twice the weight of SUM and is the carry signal to the next higher bit. The logic expressions of the arithmetic unit are as follows:
SUM = I1 ^ I2 ^ Ci
Co = I1 & I2 | I1 & Ci | I2 & Ci
   = I1 & I2 | (I1 | I2) & Ci
Illustratively, as shown in FIG. 2, the partial products can be compressed by arranging a plurality of CSA32 adders (CSA: carry-save adder) in a row. If the bit width of the partial products is 32 bits, 32 CSA32 adders are arranged in a row, and the row compresses three partial products into two partial products, where the Co output has a weight of 2 relative to SUM.
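A row of CSA32 cells can be modelled bitwise on whole words, as in the following Python sketch. It applies the three-input carry-save expressions SUM = I1 ^ I2 ^ Ci and Co = I1 & I2 | (I1 | I2) & Ci to every bit position at once and shifts the carry word left by one bit to reflect its doubled weight. The function name `csa32_row` and the explicit `width` parameter are assumptions made for illustration.

```python
def csa32_row(i1, i2, ci, width=32):
    """Compress three width-bit words into a sum word and a carry word.

    No carry travels between bit positions: each output bit depends only on
    the three input bits of its own column.
    """
    mask = (1 << width) - 1
    sum_word = (i1 ^ i2 ^ ci) & mask
    co = (i1 & i2) | ((i1 | i2) & ci)
    carry_word = (co << 1) & mask          # Co has twice the weight of SUM
    return sum_word, carry_word


a, b, c = 0b1011, 0b0110, 0b1101
s, carry = csa32_row(a, b, c, width=8)
print(s, carry)                            # 0 30, and 0 + 30 == 11 + 6 + 13
```

The equality `s + carry == a + b + c` holds modulo 2^width; the single carry-propagating addition is deferred until only two words remain.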
As an additional embodiment, the carry-save addition operation module may also be implemented as a CSA53 counter with five inputs (I1, I2, I3, I4, Ci) and three outputs (SUM, C, Co). Arranging CSA53 counters in a row forms a CSA53 counting row; if the Co of the adjacent lower bit is connected to the Ci of the current bit, the row becomes a CSA42 compressor, which reduces four operands to two.
The counter algebraic expression is as follows:
SUM + C * 2 + Co * 2 = I1 + I2 + I3 + I4 + Ci
That is, I1, I2, I3, I4, Ci, and SUM have a weight of 1; C and Co have a weight of 2.
The logic expressions of the arithmetic unit are as follows:
SUM = I1 ^ I2 ^ I3 ^ I4 ^ Ci
C = (I1 ^ I2 ^ I3 ^ I4) & Ci | (~ (I1 ^ I2 ^ I3 ^ I4)) & (I1 & I2 | I3 & I4)
= (I1 ^ I2 ^ I3 ^ I4) & Ci | (~ ((I1 ^ I2 ^ I3 ^ I4) | (~ (I1 & I2 | I3 & I4))))
Co = (I1 | I2) & (I3 | I4)
As can be seen from the above expressions, when the unit is used as a CSA42 compressor, connecting the Co of the adjacent lower bit to the Ci of the current bit has no effect on the delay, because Ci is already available by the time it is needed. The four input operands are processed by the CSA42 compressor to produce two outputs: the result SUM with a weight of 1 and the data C with a weight of 2. Since Co does not depend on Ci, no carry chain propagation delay is generated. Similarly, by arranging CSA42 compressors in a row, four partial products can be compressed into two partial product results, one of which has a weight of 2.
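The CSA53 expressions and the Co-to-Ci chaining can be checked with a short Python sketch. It evaluates the logic expressions given above for all 32 input combinations, confirms the identity SUM + 2·C + 2·Co = I1 + I2 + I3 + I4 + Ci, and then chains the cells bit by bit into a 4:2 row. The names `csa53` and `csa42_row` are illustrative, and the row is a bit-serial software model rather than a description of the circuit layout.

```python
from itertools import product


def csa53(i1, i2, i3, i4, ci):
    """One CSA53 counter cell; all arguments and results are single bits."""
    x = i1 ^ i2 ^ i3 ^ i4
    s = x ^ ci
    c = (x & ci) | ((x ^ 1) & ((i1 & i2) | (i3 & i4)))
    co = (i1 | i2) & (i3 | i4)            # does not depend on ci
    return s, c, co


def csa42_row(a, b, c, d, width=32):
    """Compress four width-bit words into two by wiring Co of bit k to Ci of bit k+1."""
    mask = (1 << width) - 1
    sum_word = carry_word = 0
    ci = 0
    for k in range(width):
        bits = [(w >> k) & 1 for w in (a, b, c, d)]
        s, cc, co = csa53(*bits, ci)
        sum_word |= s << k
        carry_word |= cc << (k + 1)       # C has twice the weight of SUM
        ci = co                           # feeds the next higher bit
    return sum_word, carry_word & mask


# Exhaustive check of the counter identity.
for i1, i2, i3, i4, ci in product((0, 1), repeat=5):
    s, c, co = csa53(i1, i2, i3, i4, ci)
    assert s + 2 * c + 2 * co == i1 + i2 + i3 + i4 + ci
```

Because Co is computed from I1 to I4 only, the value passed to the next cell never waits on that cell's own carry input, which is why the chain adds no ripple delay.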
105. Operate on the final two compressed partial products by the carry-look-ahead addition operation module to generate the elements that form the matrix multiplication result. Each element of the multiplication result is computed according to the above steps, so that the matrix operation is performed efficiently without occupying excessive resources.
According to another preferred embodiment, in practical applications the two multiplier matrices, for example A and B, may be stored in either of two ways. In the first way, the A matrix and the B matrix are stored in the DDR memory row by row. When the operation is performed, the elements must be fetched according to the rows of the matrix, so a cache may be provided inside the matrix operation controller to prefetch several small matrices at a time; the number of small matrices prefetched can be chosen to balance operation speed against cost. Each small matrix is then sent to the operation modules and processed in the manner described above. The operation result may be stored in the DDR memory as small matrix blocks, each stored by row or by column according to the configuration; alternatively, the result may first be cached in the matrix operation controller and written to the DDR memory after several small result matrices have been spliced into matrix rows. In the second way, the A matrix is divided into small matrix blocks according to the requirements of the matrix multiply-add module and each block is stored by row, while the B matrix is divided into small matrix blocks in the same way and each block is stored by column. The storage mode can be made configurable when the matrix operation controller is designed, so that the user selects one of the modes by configuring the relevant register according to actual needs.
According to the method provided by this embodiment, one matrix multiply-add operation can be completed in a single clock cycle, two large matrices are automatically divided into small matrices that meet the requirements of the operation unit according to the matrix multiplication requirements, and the partial products are compressed to save operation time, which greatly improves the performance of the matrix multiplication operation.
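The difference between the two storage modes can be illustrated with address calculations. The Python sketch below is only one possible reading of the text: it assumes that blocks are laid out one after another in block-row order, that "stored by row" means row-major order inside a block (as for A), and that "stored by column" means column-major order inside a block (as for B). The function names and the flag `column_major_within_tile` are hypothetical.

```python
def row_major_offset(i, j, cols):
    """First mode: the whole matrix is stored row by row."""
    return i * cols + j


def blocked_offset(i, j, cols, tile, column_major_within_tile=False):
    """Second mode: the matrix is stored block by block.

    Assumes cols is a multiple of tile. Each tile occupies tile*tile
    consecutive locations, so one burst read fetches a whole small matrix.
    """
    tiles_per_row = cols // tile
    bi, bj = i // tile, j // tile          # which tile the element is in
    ii, jj = i % tile, j % tile            # position inside that tile
    tile_index = bi * tiles_per_row + bj
    if column_major_within_tile:           # e.g. blocks of B
        inner = jj * tile + ii
    else:                                  # e.g. blocks of A
        inner = ii * tile + jj
    return tile_index * tile * tile + inner


# Element (1, 0) of a 4x6 matrix with 2x2 tiles:
print(row_major_offset(1, 0, 6))                                   # 6
print(blocked_offset(1, 0, 6, 2))                                  # 2
print(blocked_offset(1, 0, 6, 2, column_major_within_tile=True))   # 1
```

With the blocked layout, the column of B that a multiply unit consumes is contiguous in memory, which is the usual motivation for storing the second operand by column.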
Example two
Referring to FIG. 3, FIG. 3 is a schematic diagram of an application of the matrix multiplier implementation method according to an embodiment of the present invention. As shown in FIG. 3, the multipliers to be operated on may be a plurality of matrices, specifically the matrices A, B, and C, with D denoting the result. The required operation is D = A × B + C, where A, B, and C may be matrices of any dimensions as long as the rows and columns of A and B satisfy the matrix multiplication rule. For convenience of description, this embodiment assumes that A, B, C, and D are all 32 × 32 square matrices.
The first element $d_{0,0}$ of the D matrix is obtained by multiplying the corresponding elements of the first row of the A matrix and the first column of the B matrix, summing the products, and adding the first element of the C matrix, as shown in expression (1).
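$$d_{0,0} = a_{0,0} b_{0,0} + a_{0,1} b_{1,0} + \cdots + a_{0,31} b_{31,0} + c_{0,0} \tag{1}$$
$$d_{0,1} = a_{0,0} b_{0,1} + a_{0,1} b_{1,1} + \cdots + a_{0,31} b_{31,1} + c_{0,1} \tag{2}$$
$$\vdots$$
$$d_{31,31} = a_{31,0} b_{0,31} + a_{31,1} b_{1,31} + \cdots + a_{31,31} b_{31,31} + c_{31,31} \tag{3}$$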
It can be seen from the operation rule of the D matrix that multiplying the corresponding elements of the first row of the A matrix and the first column of the B matrix requires 32 multiplications in total. In the specific implementation, all partial products of these multiplications are compressed by carry-save adders (CSA32 or CSA42) according to the method described in the first embodiment. The first element of the C matrix, which is the addend, is treated as one more partial product and compressed together with them. Two compressed partial products result, and a carry-look-ahead adder then completes the accumulation to obtain the first element of the D matrix.
Illustratively, take the first element of the result matrix D as an example. In the first stage, the 32 partial products with weight 0 of $a_{0,0} b_{0,0}, a_{0,1} b_{1,0}, \ldots, a_{0,31} b_{31,0}$ and the 32 partial products generated by the add-1 or add-2 corrections are compressed by CSA42 rows. Every four partial products are compressed by one CSA42 row compressor to generate a first-stage partial product P1_0Pj with weight 0 and a first-stage partial product P1_2Pj with weight 2, where 0 ≤ j ≤ 15.
In the second stage, the 16 partial products P1_0Pj and $c_{0,0}$, 17 partial products with weight 0 in total, are compressed by 3 CSA42 rows and 2 CSA32 rows to obtain the second-stage partial products P2_0Pk with weight 0 and P2_2Pk with weight 2, where 0 ≤ k ≤ 4.
In the third stage, the 16 partial products P1_2Pj with weight 2 from the first stage and the 5 second-stage partial products P2_2Pk with weight 2 are compressed by 4 CSA42 rows and 2 CSA32 rows to obtain 6 partial products P3_4Pl with weight 4 and 6 partial products P3_2Pl with weight 2, where 0 ≤ l ≤ 5.
In the fourth stage, the 32 partial products with weight 4 of $a_{0,0} b_{0,0}, a_{0,1} b_{1,0}, \ldots, a_{0,31} b_{31,0}$ and the 6 partial products P3_4Pl with weight 4 are compressed by 8 CSA42 rows and 2 CSA32 rows to obtain 10 fourth-stage partial products P4_6Pm with weight 6 and 10 fourth-stage partial products P4_4Pm with weight 4, where 0 ≤ m ≤ 9.
The remaining partial products are compressed in the same way. Finally, after all partial products of $a_{0,0} b_{0,0} + a_{0,1} b_{1,0} + \cdots + a_{0,31} b_{31,0} + c_{0,0}$ have been compressed until only 2 partial products remain, these are added by a carry-look-ahead adder to obtain the result of expression (1). It can be seen that in this embodiment expression (1) can be implemented with only one multiplier, so only 1024 multipliers are needed to compute the whole D matrix, whereas the conventional design method needs 32 multipliers to compute expression (1) and 32768 multipliers to compute the whole D matrix.
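The idea of folding all 32 products and the addend into one compression tree, followed by a single carry-propagating addition, can be modelled in Python as below. This sketch makes several simplifying assumptions that are not in the original: operands are 8-bit signed integers (h = 8), the tree uses only 3:2 carry-save rows applied greedily rather than the specific mix of CSA42 and CSA32 rows per stage listed above, and all names (`booth_words`, `csa32`, `fused_dot`) are hypothetical.

```python
import random


def booth_words(p, r, h, mask):
    """Radix-4 Booth partial products of p*r as words modulo mask+1, plus one correction word."""
    words, corrections = [], 0
    for n in range(h // 2):
        r_lo = (r >> (2 * n - 1)) & 1 if n > 0 else 0          # r_{-1} = 0
        e = -2 * ((r >> (2 * n + 1)) & 1) + ((r >> (2 * n)) & 1) + r_lo
        # (word before correction, deferred correction) for each digit value
        word, fix = {0: (0, 0), 1: (p, 0), 2: (p << 1, 0),
                     -1: (~p, 1), -2: ((~p) << 1, 2)}[e]
        words.append((word << (2 * n)) & mask)
        corrections |= fix << (2 * n)
    return words + [corrections]


def csa32(a, b, c, mask):
    """3:2 carry-save row: a + b + c == sum + carry (modulo mask+1)."""
    return (a ^ b ^ c) & mask, (((a & b) | ((a | b) & c)) << 1) & mask


def fused_dot(a_row, b_col, c, h=8):
    """Compute sum(a*b) + c using carry-save compression and one final addition."""
    width = 2 * h + len(a_row).bit_length() + 1    # head-room for the accumulation
    mask = (1 << width) - 1
    words = [c & mask]                             # c joins as one more partial product
    for a, b in zip(a_row, b_col):
        words += booth_words(a, b, h, mask)
    while len(words) > 2:                          # carry-save compression stages
        full = len(words) - len(words) % 3
        nxt = []
        for k in range(0, full, 3):
            nxt += csa32(words[k], words[k + 1], words[k + 2], mask)
        words = nxt + words[full:]
    total = sum(words) & mask                      # the only carry-propagating add
    return total - (1 << width) if total >> (width - 1) else total


random.seed(0)
a_row = [random.randint(-128, 127) for _ in range(32)]
b_col = [random.randint(-128, 127) for _ in range(32)]
assert fused_dot(a_row, b_col, 1000) == sum(x * y for x, y in zip(a_row, b_col)) + 1000
```

In this model the 32 products never exist as separate finished results; only their partial products do, which is the sense in which one element of D costs one "multiplier" worth of final addition.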
In other embodiments, the computing power of the matrix multiplication unit can be further improved by pipelining. Because the partial products are compressed by carry-save adders, which have no carry propagation delay, the actual combinational logic path is very short. The partial products generated by the 32 multiplications, together with one element of the C matrix, can therefore be compressed into the final two partial products within one clock cycle. A 32-bit multiplication yields a 64-bit result, and the carry chain of a 64-bit carry-look-ahead adder must propagate from the lowest bit to the highest bit, so its combinational logic delay is too large for a single cycle. The 64-bit carry-look-ahead addition is therefore implemented as a two-stage pipeline: the lower 32 bits are added in the first of its clock cycles and the upper 32 bits in the second, so that one element of the D matrix is completed in three clock cycles. All elements of the D matrix are computed in parallel, so the whole D matrix is also obtained in three clock cycles. Because the matrix multiply-add operation is pipelined, the data of three matrices A, B, and C can be sent in the first clock cycle and the data of three new matrices A, B, and C in the next clock cycle. That is, the result for the data sent first becomes available in the third clock cycle, and the result for the data sent second becomes available in the fourth clock cycle. This is equivalent to completing one matrix multiply-accumulate operation per clock cycle, with a result latency of three clock cycles.
Therefore, 32x32 matrix multiply-accumulate operations are completed at an equivalent rate of one per clock cycle, which greatly improves the performance of the matrix multiplication operation.
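The relationship between the three-cycle latency and the one-per-cycle throughput can be illustrated with a small cycle-by-cycle model in Python. It assumes three pipeline stages (compression, lower-half addition, upper-half addition) and one new job accepted per clock cycle; the function name `simulate_pipeline` and the job labels are illustrative.

```python
def simulate_pipeline(jobs, stages=3):
    """Return (job, cycle_in_which_its_result_is_available) for each job.

    pipe[0] models the compression stage and pipe[-1] the last adder stage.
    Cycles are counted from 1.
    """
    queue = list(jobs)
    pipe = [None] * stages
    results = []
    cycle = 0
    while queue or any(slot is not None for slot in pipe[:-1]):
        cycle += 1
        # Every job advances one stage; a new job enters if one is waiting.
        pipe = [queue.pop(0) if queue else None] + pipe[:-1]
        if pipe[-1] is not None:
            results.append((pipe[-1], cycle))
    return results


print(simulate_pipeline(["job1", "job2", "job3"]))
# [('job1', 3), ('job2', 4), ('job3', 5)]
```

N jobs therefore finish in N + 2 cycles: after the initial latency, one result is produced in every clock cycle.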
Example three
Referring to FIG. 4, FIG. 4 is a schematic application diagram of the matrix multiplier implementation method according to an embodiment of the present invention. For matrices with arbitrary numbers of rows and columns, if the number of rows or columns is not an integral multiple of 32, the rows and columns are zero-padded until both are integral multiples of 32; if they already are, the next step is performed directly. The matrices are first divided by rows and columns according to the requirements of the matrix multiplication operation, and the resulting blocks are sent to the matrix multiplication operation unit for the multiply-add operation. For example, if the A matrix has 96 rows and 64 columns and the B matrix is a 64x64 matrix, then the A matrix can be divided into 6 small 32x32 matrices, A00, A01, ..., A21, as shown in FIG. 4.
The B matrix is likewise divided into 4 small 32x32 matrices, B00, B01, B10, and B11, as shown in FIG. 4. When the product of the A matrix and the B matrix is computed, the first small matrix D00 of the result matrix D, which is a 32x32 matrix, is computed first. First, A00 and B00 are sent into the matrix multiply-add operation unit to compute A00 × B00; since no accumulation is required at this point, all values of C00 are 0. Then A01 and B10 are sent in together with the previous result C = A00 × B00, and the matrix multiply-add operation unit produces the first small matrix of the D matrix, D00 = A01 × B10 + A00 × B00. The other small matrices of the D matrix are computed in the same way, as shown in FIG. 4, to obtain the complete multiplication result.
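The block-wise procedure above (zero padding, division into tiles, and accumulation through the C input) can be sketched in Python with plain nested lists. The function `mac_unit` stands in for the hardware multiply-add unit and simply returns a × b + c for one tile; it and the other names are assumptions made for illustration. A tile size of 32 matches the text, but any positive tile size works in this model.

```python
def pad_to_multiple(m, tile):
    """Zero-pad matrix m so that both dimensions are multiples of tile."""
    rows = -(-len(m) // tile) * tile
    cols = -(-len(m[0]) // tile) * tile
    return [[m[i][j] if i < len(m) and j < len(m[0]) else 0
             for j in range(cols)] for i in range(rows)]


def tile_of(m, bi, bj, tile):
    """Extract the tile in block row bi and block column bj."""
    return [row[bj * tile:(bj + 1) * tile] for row in m[bi * tile:(bi + 1) * tile]]


def mac_unit(a, b, c):
    """Stand-in for the tile multiply-add unit: returns a*b + c."""
    t = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(t)) + c[i][j]
             for j in range(t)] for i in range(t)]


def tiled_matmul(a, b, tile=32):
    """Multiply a by b using only tile-sized multiply-add operations."""
    n_rows, n_cols = len(a), len(b[0])
    ap, bp = pad_to_multiple(a, tile), pad_to_multiple(b, tile)
    rb, kb, cb = len(ap) // tile, len(bp) // tile, len(bp[0]) // tile
    d = [[0] * (cb * tile) for _ in range(rb * tile)]
    for bi in range(rb):
        for bj in range(cb):
            acc = [[0] * tile for _ in range(tile)]    # C is all zeros for the first product
            for bk in range(kb):                       # previous result is fed back as C
                acc = mac_unit(tile_of(ap, bi, bk, tile), tile_of(bp, bk, bj, tile), acc)
            for i in range(tile):
                d[bi * tile + i][bj * tile:(bj + 1) * tile] = acc[i]
    return [row[:n_cols] for row in d[:n_rows]]        # drop the padding again
```

For a 96-row by 64-column A and a 64x64 B, the inner loop runs twice per output tile, reproducing D00 = A00 × B00 + A01 × B10.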
For example, suppose a matrix multiply-add unit is designed to complete A × B + C in one clock cycle, where A, B, and C are all 32 × 32 small matrices, and consider a 256x256 matrix multiplication. With a processor, even if one multiplication is completed per clock cycle, at least 65536 clock cycles are needed, not counting the time for data access and addition. With a systolic array, 511 clock cycles are needed. With the method of this embodiment, 514 clock cycles are needed. However, the systolic array requires 65536 multipliers, whereas the present invention can be implemented with only 1024 multipliers. Therefore, according to the method provided by this embodiment, one matrix multiply-add operation can be completed in a single clock cycle, two large matrices are automatically divided into small matrices that meet the requirements of the operation unit according to the matrix multiplication requirements, and the partial products are compressed to save operation time, which greatly improves the performance of the matrix multiplication operation.
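One way to arrive at the figure of 514 clock cycles and the multiplier counts is sketched below. The calculation assumes a single 32x32 multiply-add unit that accepts one tile operation per clock cycle, the three-cycle result latency of the pipelined embodiment, and matrix dimensions that are multiples of the tile size; these assumptions and the function names are illustrative rather than taken from the original.

```python
def tile_operations(n, tile=32):
    """Number of tile multiply-add operations for an n x n by n x n product."""
    blocks = n // tile
    return blocks * blocks * blocks        # output tiles times accumulation steps


def tiled_total_cycles(n, tile=32, latency=3):
    """One operation issued per cycle; the last result appears latency-1 cycles later."""
    return tile_operations(n, tile) + latency - 1


def multipliers_for_array(side):
    """One multiplier per cell of a side x side array of multiply units."""
    return side * side


print(tile_operations(256))            # 512
print(tiled_total_cycles(256))         # 514
print(multipliers_for_array(32))       # 1024
print(multipliers_for_array(256))      # 65536
```

Under these assumptions the tiled unit is only three cycles slower than the 511-cycle systolic array on a full 256x256 product while using one sixty-fourth of the multipliers, and it is far faster on small matrices because a single 32x32 product needs only three cycles.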
Example four
Referring to FIG. 5, FIG. 5 is a schematic diagram of a matrix multiplier device according to an embodiment of the invention. The matrix multiplier device can be applied to different combinations of matrix operations, and the embodiment of the invention does not limit the combination of operations to which it is applied. As shown in FIG. 5, the matrix multiplier device includes:
the matrix operation control module 1, used for dividing a plurality of multipliers or a plurality of matrices to be operated on into the small matrices required by the first multiplication operation module and the second multiplication operation module, according to the requirements of the matrix multiplication operation. To illustrate how the division is performed, this embodiment uses the multiplication of two integers as an example. Assume the multiplication is p × q, where the bit widths of p and q are both n. If n is odd, q is sign-extended by one bit; if n is even, q is unchanged. The bit width of the processed multiplier is therefore always even, which facilitates the matrix operation. The processed multiplier is named r and its bit width is h. Then r can be represented in binary (two's complement) as:
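$$
\begin{aligned}
r &= -r_{h-1} 2^{h-1} + r_{h-2} 2^{h-2} + \cdots + r_1 2^1 + r_0 2^0 \\
  &= (-r_{h-1} + r_{h-2}) 2^{h-1} + (-r_{h-2} + r_{h-3}) 2^{h-2} + \cdots + (-r_1 + r_0) 2^1 + (-r_0 + 0) 2^0 \\
  &= (-2 r_{h-1} + r_{h-2} + r_{h-3}) 2^{h-2} + (-2 r_{h-3} + r_{h-4} + r_{h-5}) 2^{h-4} + \cdots + (-2 r_3 + r_2 + r_1) 2^2 + (-2 r_1 + r_0 + 0) 2^0 \\
  &= \sum_{n=0}^{h/2-1} e_n 4^n
\end{aligned}
$$
where $e_n = -2 r_{2n+1} + r_{2n} + r_{2n-1}$, $0 \le n \le h/2 - 1$, and $r_{-1} = 0$.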
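The result of multiplying the two multipliers can then be expressed as:
$$p \times r = \sum_{n=0}^{h/2-1} e_n \cdot p \cdot 4^n$$
The truth table of $e_n$, which follows directly from its definition, is as follows:
$$
\begin{array}{ccc|c}
r_{2n+1} & r_{2n} & r_{2n-1} & e_n \\ \hline
0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 \\
0 & 1 & 0 & 1 \\
0 & 1 & 1 & 2 \\
1 & 0 & 0 & -2 \\
1 & 0 & 1 & -1 \\
1 & 1 & 0 & -1 \\
1 & 1 & 1 & 0
\end{array}
$$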
the first multiplication module 2 and the second multiplication module 3, used for performing the multiplication on the small matrices to generate a plurality of partial products. Using the encoded multiplier obtained above, each even-numbered bit of the multiplier is encoded together with its adjacent bits to generate a plurality of partial products and their corresponding weights. For example, p × r can be decomposed into operations in which $p \cdot 4^n$ is multiplied by one of five values: -2, -1, 0, 1, or 2. Here $p \cdot 4^n$ is p shifted left by 2n bits. Multiplying by 2 shifts the value left by 1 bit; multiplying by 1 leaves it unchanged; multiplying by 0 sets the partial product to 0; multiplying by -1 inverts every bit of the multiplicand and then adds 1; and multiplying by -2 inverts every bit of the multiplicand, shifts the result left by 1 bit, and then adds 2. The add-1 and add-2 corrections generated by the individual partial products can be merged. Since adjacent partial products differ in weight by a factor of 4, and each add-1 or add-2 correction occupies only 2 bits, the corrections of all partial products can be concatenated, two bits each, into a single additional partial product with weight 0 (aligned to bit position 0). Therefore, if the multiplier is h bits wide, the encoding finally generates h/2 + 1 partial products, with the weights of adjacent partial products differing by a factor of 4.
the carry-save addition operation module 3, used for compressing the plurality of partial products into two partial products according to their weights. After the partial products are generated, all of them must be added according to their weights to obtain the multiplication result. In the prior art, carry-look-ahead adders are generally used for this addition, but each such addition incurs carry propagation from the lowest bit to the highest bit. In this embodiment, all compression is instead performed by carry-save adders, and the carry-look-ahead adder is used only once, after the partial products have been compressed down to two, to add them and obtain the multiplication result. The carry-save addition operation module has three input ports, I1, I2, and Ci, which are the augend, the addend, and the carry from the adjacent lower bit, and two output ports, Co and SUM. SUM has the same weight as I1, I2, and Ci and is the accumulation result of the current bit; Co has twice the weight of SUM and is the carry signal to the next higher bit. The logic expressions of the arithmetic unit are as follows:
SUM = I1 ^ I2 ^ Ci
Co = I1 & I2 | I1 & Ci | I2 & Ci
= I1 & I2 | (I1 | I2) & Ci
Illustratively, as shown in FIG. 2, the partial products can be compressed by arranging a plurality of CSA32 adders (CSA: carry-save adder) in a row. If the bit width of the partial products is 32 bits, 32 CSA32 adders are arranged in a row, and the row compresses three partial products into two partial products, where the Co output has a weight of 2 relative to SUM.
As an additional embodiment, the carry-save addition operation module may also be implemented as a CSA53 counter with five inputs (I1, I2, I3, I4, Ci) and three outputs (SUM, C, Co). Arranging CSA53 counters in a row forms a CSA53 counting row; if the Co of the adjacent lower bit is connected to the Ci of the current bit, the row becomes a CSA42 compressor, which reduces four operands to two.
The counter algebraic expression is as follows:
SUM + C * 2 + Co * 2 = I1 + I2 + I3 + I4 + Ci
That is, I1, I2, I3, I4, Ci, and SUM have a weight of 1; C and Co have a weight of 2.
The logic expressions of the arithmetic unit are as follows:
SUM = I1 ^ I2 ^ I3 ^ I4 ^ Ci
C = (I1 ^ I2 ^ I3 ^ I4) & Ci | (~ (I1 ^ I2 ^ I3 ^ I4)) & (I1 & I2 | I3 & I4)
= (I1 ^ I2 ^ I3 ^ I4) & Ci | (~ ((I1 ^ I2 ^ I3 ^ I4) | (~ (I1 & I2 | I3 & I4))))
Co = (I1 | I2) & (I3 | I4)
As can be seen from the above expressions, when the unit is used as a CSA42 compressor, connecting the Co of the adjacent lower bit to the Ci of the current bit has no effect on the delay, because Ci is already available by the time it is needed. The four input operands are processed by the CSA42 compressor to produce two outputs: the result SUM with a weight of 1 and the data C with a weight of 2. Since Co does not depend on Ci, no carry chain propagation delay is generated. Similarly, by arranging CSA42 compressors in a row, four partial products can be compressed into two partial product results, one of which has a weight of 2.
the carry-look-ahead addition operation module 4, used for operating on the two partial products to generate the elements that form the matrix multiplication result. Once the final two partial products are obtained, they can be added directly by the carry-look-ahead addition operation module. Each element of the multiplication result is computed in this way, so that the matrix operation is performed efficiently without occupying excessive resources.
In other embodiments, the device further comprises two storage modules: a first storage module (not shown), used for storing, row by row, the plurality of matrices to be operated on and the multiplication result formed from the elements of the matrix multiplication result; and a second storage module (not shown), used for storing the same data in blocks, by row or by column, according to the divided small matrices. In practical applications, the first storage module 5 stores the A matrix and the B matrix in the DDR memory row by row. When the operation is performed, the elements must be fetched according to the rows of the matrix, so a cache may be provided inside the matrix operation controller to prefetch several small matrices at a time; the number of small matrices prefetched can be chosen to balance operation speed against cost. Each small matrix is then sent to the operation modules and processed in the manner described above. The operation result may be stored in the DDR memory as small matrix blocks, each stored by row or by column according to the configuration; alternatively, the result may first be cached in the matrix operation controller and written to the DDR memory after several small result matrices have been spliced into matrix rows. The second storage module 6 divides the A matrix into small matrix blocks according to the requirements of the matrix multiply-add module and stores each block by row, and divides the B matrix into small matrix blocks in the same way and stores each block by column.
The storage mode can be made configurable when the matrix operation controller is designed, so that the user selects one of the modes by configuring the relevant register according to actual needs.
According to the device provided by this embodiment, one matrix multiply-add operation can be completed in a single clock cycle, two large matrices are automatically divided into small matrices that meet the requirements of the operation unit according to the matrix multiplication requirements, and the partial products are compressed to save operation time, which greatly improves the performance of the matrix multiplication operation.
Example five
Referring to FIG. 6, FIG. 6 is a schematic structural diagram of a matrix multiplier implementation apparatus according to an embodiment of the present invention. The matrix multiplier implementation apparatus described in FIG. 6 can be applied to a system, and the embodiment of the present invention does not limit the system to which it is applied. As shown in FIG. 6, the apparatus may include:
a memory 601 in which executable program code is stored;
a processor 602 coupled to a memory 601;
the processor 602 calls the executable program code stored in the memory 601 to perform the matrix multiplier implementation method described in the first embodiment.
Example six
The embodiment of the invention discloses a computer-readable storage medium which stores a computer program for electronic data exchange, wherein the computer program enables a computer to execute the implementation method of the matrix multiplier described in the first embodiment.
Example seven
An embodiment of the present invention discloses a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute the matrix multiplier implementation method described in the first embodiment or the second embodiment.
The above-described embodiments are only illustrative, and the modules described as separate components may or may not be physically separate, and the components displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above detailed description of the embodiments, those skilled in the art will clearly understand that the embodiments may be implemented by software plus a necessary general hardware platform, and may also be implemented by hardware. Based on such understanding, the above technical solutions may be embodied in the form of a software product, which may be stored in a computer-readable storage medium. The storage medium includes a Read-Only Memory (ROM), a Random Access Memory (RAM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-time Programmable Read-Only Memory (OTPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disc memory, magnetic disk memory, magnetic tape memory, or any other computer-readable medium that can be used to carry or store data.
Finally, it should be noted that the matrix multiplier implementation method and device disclosed in the embodiments of the present invention are only preferred embodiments of the present invention. They are used only to illustrate the technical solutions of the present invention and do not limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.