CN109446478B

CN109446478B - Complex covariance matrix calculation system based on iteration and reconfigurable mode

Info

Publication number: CN109446478B
Application number: CN201811284263.4A
Authority: CN
Inventors: 李丽; 陈辉; 傅玉祥; 陈沁雨; 何国强; 何书专; 李伟
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2018-10-30
Filing date: 2018-10-30
Publication date: 2021-09-28
Anticipated expiration: 2038-10-30
Also published as: CN109446478A

Abstract

The invention relates to a complex covariance matrix computing system based on an iterative and reconfigurable manner, including on-chip SRAM memory, off-chip DDR memory, reconfigurable unit, DMA controller and acceleration core, the acceleration core includes: matrix covariance calculation The module polls the source data of each area of the on-chip SRAM memory through iterative calculation, and calculates the lower triangular covariance matrix; the conjugate symmetric module, according to the conjugate symmetry property of the covariance matrix, maps the lower triangular covariance matrix through the address mapping The complete complex covariance matrix is obtained by reconstructing the storage method to form the final operation result; the DMA interface function module stores the data read from the off-chip DDR memory by DMA into the on-chip SRAM memory by partition. Beneficial effects: the present invention supports the covariance operation of complex matrices with any number of columns, which reduces the calculation amount of the source data in the traditional hardware implementation and the time for writing the result data back to the DDR for many times.

Description

Complex covariance matrix calculation system based on iteration and reconfigurable mode

Technical Field

The invention relates to the technical field of computers, in particular to a complex covariance matrix calculation system based on iteration and reconfigurable modes.

Background

The calculation of the covariance matrix is typical operation in the field of signal processing, is a key part for realizing multi-level nested wiener filters, spatial spectrum estimation, coherent source number estimation and affine invariant pattern recognition, and is widely applied to the fields of radar, sonar, digital image processing and the like. In addition, the covariance matrix is widely applied to the fields of image matching, image pattern recognition and the like, but the calculation process of the covariance matrix is complex, and if the covariance matrix of a multidimensional random variable composed of a plurality of eigenvectors in each region of an image needs to be calculated, the time consumption is long, and the calculation of the covariance matrix of the image on a personal universal PC platform in real time is also a great obstacle.

With the rapid development of the integrated circuit industry, high performance and real time are constantly pursued in the embedded field. Currently, most of the hardware implementation for the complex covariance matrix is designed based on platforms such as a DSP, a GPU, and an FPGA. The invention provides a hardware implementation method for computing a complex covariance matrix based on an iteration and reconfigurable mode based on a reconfigurable intelligent acceleration core, and compared with the traditional hardware implementation mode, the hardware implementation method has the advantages of high resource utilization rate and high hardware implementation speed. As typical operation in the field of signal processing, the hardware implementation method has good reference significance and wide application prospect.

In statistics and probability theory, the covariance matrix is a matrix. Each element of which is the covariance between the elements of the respective vector. Let X ═ X₁,X₂,X₃,...,X_N)^TFor n-dimensional random variables, called matrices

Covariance matrix, denoted as d (X), for n-dimensional random variable X. Wherein, C_ij＝Cov(X_i,X_j) I, j is 1, …, n is the component X of X_iAnd X_jThe covariance of (a). Due to C_ij＝C_jiTherefore, the covariance matrix is a symmetric matrix.

As shown in fig. 1, the conventional hardware implementation is as follows: for an mxn complex matrix a ═ a_ij]Its complex covariance matrix B may be composed ofThe following formula gives: b ═ AA^H. Namely, firstly, the matrix A is transposed, and then conjugate is solved to obtain A^HFinally, each element of the matrix B is obtained as follows:

because the method is realized according to matrix multiplication, when the covariance of a complex matrix with a large number of points is calculated, data is carried in and carried out for multiple times, which can cause the access and storage time to be overlong, and the realization method does not reduce the operation amount according to the conjugate symmetry property of the covariance matrix, which are reasons of long calculation time of the covariance matrix. In many practical application scenarios, the low-efficiency calculation of the covariance matrix can become a great obstacle, and the method has certain reference significance and application prospect.

Disclosure of Invention

The invention aims to overcome the defects of the prior art, provides a complex covariance matrix calculation system based on iteration and reconfigurable modes, which effectively reduces the calculation amount of source data, fully utilizes storage resources, accelerates the calculation speed, and further integrally improves the performance of an application algorithm, and is specifically realized by the following technical scheme:

the complex covariance matrix computing system based on iteration and reconfigurable mode comprises an on-chip SRAM (static random access memory), an off-chip DDR (double data rate) memory, a reconfigurable unit, a DMA (direct memory access) controller and an acceleration core, wherein the acceleration core is respectively in communication connection with the on-chip SRAM memory and the reconfigurable unit, the DMA controller is in communication connection with the reconfigurable unit, the off-chip DDR memory is in communication connection with the DMA controller through a bus, and the acceleration core comprises:

the matrix covariance operation module polls the source data of each area of the on-chip SRAM memory in an iterative computation mode and computes a lower triangular covariance matrix;

the conjugate symmetry module obtains a complete complex covariance matrix from the lower triangular covariance matrix in an address mapping and reconstruction storage mode according to the conjugate symmetry property of the covariance matrix to form a final operation result;

the DMA interface function module is used for storing data read from the off-chip DDR memory in a DMA mode into the on-chip SRAM memory in a partition mode; and writing the operation result back to the off-chip DDR memory in a DMA mode.

The complex covariance matrix calculation system based on the iteration and reconfigurable mode is further designed in that the on-chip SRAM memory sets storage resources to be divided into k banks, the depth of each bank is d, if M banks are allocated for storing a lower triangular covariance matrix, covariance operation is carried out on a complex matrix with the size of M multiplied by N, and the requirement of M multiplied by N is met²Md is less than or equal to md, and N is an arbitrary value; if the calculation parallelism is b, the condition of defining the complex matrix to be solved as the decimal point is as follows: m²Bd is less than or equal to, if the condition is not met, the complex matrix to be solved is judged to be a large point number.

The complex covariance matrix calculation system based on the iteration and reconfigurable mode is further designed in that if the complex matrix to be solved is a small point number, a one-dimensional data transmission mode of DMA is adopted; and if the complex matrix to be solved is a large number of points, adopting a DMA two-dimensional data transmission mode.

The complex covariance matrix calculation system based on the iteration and reconfigurable mode is further designed in that the matrix covariance operation module stores original data according to a rule that one column of a complex matrix to be solved is stored in a single bank, all banks in a single time are stored in one column, the current column is divided into one zone, all banks in a single time are divided into one section, and the total size of a partition in a single time, the total times of the partition in a single time, the number of the zones in the last section and the number of the columns in the last zone are calculated.

The complex covariance matrix computing system based on the iteration and reconfigurable mode is further designed in that a zero filling mechanism is adopted if any area column number is not filled; if the source data storage exceeds the single maximum storage point number, ping-pong operation segmented processing is adopted, and if the source data storage exceeds the single maximum storage point number, batch processing is adopted.

The complex covariance matrix calculation system based on the iteration and reconfigurable mode is further designed in that the matrix covariance operation module adopts the reconfigurable mode to construct a complex multiplication accumulation calculation unit, and iterative calculation is carried out on each segment, each partition and each bank temporary storage result data to obtain a lower triangular covariance matrix.

The complex covariance matrix calculation system based on the iteration and the reconfigurable mode is further designed in that the using number of complex multipliers and complex adders in the complex multiplication accumulation calculation unit is equal to the bank number of source data storage, b is set, and the number of input data of each calculation unit is as follows: 2b + 1; reading the Mth address of all the source data banks as the input A of a complex multiplier; and respectively reading the conjugate of the data of the 1 st to M th addresses of all the source data banks as the input B of the complex multiplier, wherein the other input data is the value of the corresponding storage address in the last storage result bank, and the calculation result needs to be written back to the same address in the same bank again every time.

The complex covariance matrix calculation system based on the iteration and reconfigurable mode is further designed in that the conjugate symmetry module reads data from a bank where the lower triangular covariance matrix is located in sequence, analyzes rows and columns of the data in the lower triangular covariance matrix, and stores the data and conjugate symmetry data thereof into a multiplexed source data bank according to a rule that the rows and the columns of the matrix are stored in one bank until a complete complex covariance matrix is obtained.

The complex covariance matrix computing system based on the iteration and the reconfigurable mode is further designed in that when the acceleration core distributes data read from the DDR, the rows and the columns of the incoming data in the complex matrix to be solved are analyzed according to the addresses incoming by the DMA, and then the incoming data are stored in the corresponding source data bank according to the column distribution principle.

The complex covariance matrix calculation system based on iteration and reconfigurable mode is further designed in that the address mapping adopted by the conjugate symmetry module is as follows: and sequentially reading the data of the lower triangular covariance matrix from the result bank, resolving the row and the column of the original lower triangular covariance matrix according to the bank number of the data and the address in the bank, and storing the current data and the conjugate symmetric data thereof into the reconstructed bank according to the distribution principle that one column of the complete covariance matrix is stored in one bank.

The invention has the following advantages:

the complex covariance matrix calculation system based on the iteration and reconfigurable mode is designed in a parameterization mode, and a scheme with the maximum resource utilization rate is reconfigured for different storage and calculation resources to carry out complex covariance matrix operation. The method reduces the source data calculation amount, improves the resource utilization rate, accelerates the calculation speed of hardware realization, and provides a good reference for the design realization of the calculation covariance matrix in the field of signal processing.

Drawings

FIG. 1 is a hardware implementation architecture diagram of the computational complex covariance matrix of the present invention.

FIG. 2 is a diagram of a hardware implementation architecture for computing a complex covariance matrix in a conventional manner.

FIG. 3 is a schematic diagram of the arrangement of source data in the present invention.

FIG. 4 is a diagram illustrating a source data transport address mapping mechanism according to the present invention.

FIG. 5 is a flow chart of the storage of the dot source data in the present invention.

FIG. 6 is a schematic diagram of a computing unit design in the present invention.

FIG. 7 is a schematic diagram of a source data read of the computing process of the present invention.

FIG. 8 is a schematic diagram of a lower triangular matrix conjugate symmetric address mapping mechanism according to the present invention.

FIG. 9 is a comparison of performance evaluation for the computation of a complex covariance matrix according to the present invention and the conventional method.

Detailed Description

The following describes the present invention in detail with reference to the accompanying drawings.

Referring to fig. 1, the complex covariance matrix calculation system based on iterative and reconfigurable mode in this example mainly includes main modules such as DMA interface function, matrix covariance operation, and conjugate symmetry. The DMA interface function module is mainly responsible for: firstly, data read from an off-chip DDR memory (DDR) by a DMA mode is stored into an on-chip SRAM memory (SRAM) in a partitioning mode, and secondly, the final operation result is written back to the DDR by the DMA mode. The matrix covariance operation module is mainly responsible for polling source data of each region of the SRAM in an iterative computation mode to calculate a lower triangular covariance matrix. The conjugate symmetry module is mainly responsible for obtaining a complete complex covariance matrix by the lower triangular covariance matrix through address mapping and a reconstruction storage mode according to the conjugate symmetry property of the covariance matrix. The method supports the covariance operation of the complex matrix with any column number, reduces the source data calculation amount of the traditional hardware implementation mode and the time for writing back result data to DDR for many times, balances calculation and storage resources to realize maximized multipath parallelism, constructs the calculation unit by using a reconfigurable mode, and sets a specific address mapping rule to obtain the complex covariance matrix.

The on-chip SRAM memory sets storage resources to be divided into k banks, the depth of each bank is d, if M banks are allocated for storing a lower triangular covariance matrix, covariance operation is carried out on a complex matrix with the size of M multiplied by N, and the requirement of M is met²Md is less than or equal to md, and N is an arbitrary value; if the calculation parallelism is b, the condition of defining the complex matrix to be solved as the decimal point is as follows: m²Bd is less than or equal to, if the condition is not met, the complex matrix to be solved is judged to be a large point number.

The following is a detailed description of an example implementation of the invention, and a cycle accurate system level simulation model based on SystemC language is set up for verification.

The invention calculates a complex covariance matrix based on the hardware implementation architecture shown in FIG. 2, in the example, it is assumed that the matrix X is M × N (M is less than or equal to 256, N is less than or equal to 8K), and the generalized covariance result Y is M × M: (hereinafter, "%" indicates a modulo operation)

If the memory size is 2MB of SRAM as an example (divided into 32 banks, each bank depth is 8k), the data processing is carried out in a partitioning mode (large number of points and also segmentation) during hardware implementation, and the source data area is filled in a partitioning mode for multiple times according to columns. Considering supporting the ping-pong operation with large dot count, the generalized covariance result Y is 256 × 256 dot counts at most, i.e. the lower triangular matrix is 256 × 256 (256+1)/2 — 32896 dot counts at most, so at least 5 bank storages are required. Therefore, the available 32 banks remove 5 banks storing the calculation results, and all the remaining banks store the source data through the ping-pong operation, so that the number of banks storing the source data at a time is a floor [ (32-5)/2] ═ 13. (floor function: return the largest integer smaller than the parameter).

Since each bank is 8k deep and data is stored in each bank in a single row of M (M is 256 at most), B ═ floor [8k/M ] zones are defined, i.e. a banks of each zone can store a row data, and all banks can store C ═ B row data at most. (definition of large dot number: data size of complex matrix to be solved exceeds C x M)

Taking an M N matrix as an example (N ≦ C), the source data is arranged as shown in FIG. 3. If N > C, then according to the cycle, D ═ ceil (N/C) times are transmitted in total, and the ceil function: the smallest integer greater than or equal to the specified expression is returned. The number of banks occupied by the last zone is (N-1)% A +1, and the remaining banks of the zone are filled with 0.

The AXI bus data bit width adopted in the implementation of this example is 256 bits, and the DMA splits the AXI bus data into 4 data of 64 bits and transmits the data to the DMA interface function of the matrix covariance operation, and the function stores the data into the corresponding bank according to the address mapping mechanism, as shown in fig. 4. Therefore, when the large dot count is operated, the partition size of each section is preferably multiple of 4, so that the number of columns transmitted each time is just multiple of 4, 2 data cannot be written into the same bank at the same time, and the influence of delay memory on the overall operation speed is avoided; for the small dot count operation, since the DMA is transferred once and the number of columns is not necessarily exactly a multiple of 4, 2 data are written to the same bank at the same time, and at this time, the data need to be written in turn with a delay, as shown in fig. 5. For the last section of last area data transmission of large dot number operation, because the DMA adopts a two-dimensional data transmission mode, if the number of the remaining columns is not a multiple of 4, zero padding can be automatically carried out to be a multiple of 4, although zero padding columns are written into the banks, zero padding is just needed to be carried out on the last area, so that the method is not only irrelevant, but also shortens the number of zero padding columns and shortens the storage time of source data.

The reconfigurable computing unit needs to be reconfigured into a 27 complex data input and 5-level full-flow water complex multiplication accumulation unit, as shown in fig. 6, 13 complex multipliers and 13 complex adders are used in total.

According to the conjugate symmetry characteristic of the generalized covariance result, in order to shorten the operation time, only the lower triangle of the complex covariance matrix is calculated, and the upper triangle is expanded by adopting conjugate symmetry. The source data reading form in the calculation process is shown in fig. 7, and the specific operation steps are as follows:

1) the matrix is divided into D sections according to columns, 1 section of data is imported each time, and the source data bank is placed according to the form of a graph 3;

2) the following operation is performed for each area occupied by the current segment: (input Pre _ Region Data is Data of the corresponding position in the same area)

Reading the 1 st number from the A banks simultaneously as an input I1, and taking the conjugate of the 1 st number as an input I2;

reading the 2 nd number from the A banks simultaneously as an input I1, and sequentially reading the 1 st and 2 nd numbers to take a conjugate as an input I2;

reading the 3 rd number from the A banks simultaneously as an input I1, and sequentially reading the 1 st, 2 nd and 3 rd numbers to take conjugation as an input I2;

……；

reading the Mth number from the A banks simultaneously as an input I1, and sequentially reading the 1 st, the 2 nd and the … …, wherein the Mth number takes a conjugate as an input I2;

3) repeating the step 2) to finish the operation of all the areas of the current section;

4) repeating the step 2) and the step 3) to complete calculation of all the sections to obtain M (M +1)/2 results;

5) and expanding the M (M +1)/2 results into M results through conjugate symmetry and storing the M results into a new storage array again.

Since the source data storage area can be reused after the calculation is completed, a new storage array is constructed to store the covariance calculation results of M matrix/M matrix, in order to correspond to that the DMA transfers data with 256 bits each time, and in consideration of the convenience of taking out 4 data each time, the new storage array is planned as bank0-bank15, the calculated M matrix/M complex covariance matrix is stored in the new storage array in sequence according to columns, and a specific lower triangular matrix conjugate symmetric address mapping mechanism is shown in FIG. 8.

The performance of this example was evaluated as follows: 1) the number of the sections occupying the B areas is floor (N/C); 2) the number of zones occupied by the last stage is ceil ((N% C)/A); 3) taking the number of cycles of the input I1 as M during operation of each Region, and taking the number of cycles of the input I2 and the input Pre _ Region Data as M x (M +1)/2 in parallel; 4) the time of conjugate symmetric expansion of the lower triangular matrix is M (M + 1)/2; 5) the complex multiply accumulate unit calculates time as T1, and reads and stores data from the bank as T2 and T3, respectively. The total number of operation cycles of the complex covariance matrix is as follows: floor (N/C) [ (M +1)/2+ M) × B ] + ceil [ (N% C)/a ] × [ M × (M +1)/2+ M ] + M × (M +1)/2+ T1+ T2+ T3. While the conventional way calculates the number of cycles of the complex covariance matrix as follows: m N (M + 4)/4.

T1, T2, T3 are negligible because they are small relative to the total number of cycles. The performance pairs for the above 2 implementations are shown in fig. 9 when covariance operations are performed on complex matrices of different sizes. As is apparent from the figure, the cycle number of calculating the complex covariance matrix based on the iteration and the reconfigurable mode is greatly reduced compared with the traditional mode. The complex covariance matrix calculation system based on the iteration and reconfigurable mode supports complex matrixes with any column number to perform covariance calculation, reduces the source data calculation amount of the traditional hardware implementation mode and the time of writing back result data to DDR for multiple times, balances calculation and storage resources to realize maximized multipath parallelism, constructs a calculation unit by using the reconfigurable mode, sets a specific address mapping rule to obtain the complex covariance matrix, and greatly improves the resource utilization rate and the calculation speed of hardware implementation. As typical operation in the field of signal processing, the hardware implementation method has good reference significance and wide application prospect.

Claims

1. A complex covariance matrix computing system based on iterative and reconfigurable mode, comprising on-chip SRAM memory, off-chip DDR memory, reconfigurable unit, DMA controller and acceleration core, the acceleration core and on-chip SRAM memory are respectively , the reconfigurable unit is communicated and connected, the DMA controller is communicated with the reconfigurable unit, and the off-chip DDR memory is communicated with the DMA controller through a bus, and is characterized in that the acceleration core includes:

The matrix covariance operation module polls the source data of each area of the on-chip SRAM memory by iterative calculation, and calculates the lower triangular covariance matrix;

The conjugate symmetry module, according to the conjugate symmetry property of the covariance matrix, obtains a complete complex covariance matrix by means of address mapping and reconstruction storage of the lower triangular covariance matrix, and forms the final operation result;

The DMA interface function module stores the data read in from the off-chip DDR memory by DMA into the on-chip SRAM memory by partition; and writes the operation result back to the off-chip DDR memory by DMA;

The matrix covariance computing module processes the original data according to the rules of storing one column of the complex matrix to be obtained in a single bank, storing all banks in one column at a time, dividing the current column into one area, and dividing all banks into one segment at a time. Store, calculate the total size of a single partition, the total number of segmentations, the number of zones in the last segment, and the number of columns in the last zone.

2. the complex covariance matrix computing system based on iteration and reconfigurable mode according to claim 1, is characterized in that described on-chip SRAM memory setting storage resource is divided into k banks, and the depth of each bank is d, If m banks are allocated to store the lower triangular covariance matrix, the covariance operation is performed on a complex matrix of size M×N, satisfying M ² ≤md, and N is an arbitrary value; if the calculation parallelism is b, the The condition for finding a complex matrix to be a small number of points is: M ² ≤bd, if this condition is not satisfied, it is determined that the complex matrix to be found is a large number of points.

3. the complex covariance matrix computing system based on iterative and reconfigurable mode according to claim 2 is characterized in that if the complex matrix to be sought is a small number of points, then the one-dimensional data transmission mode of DMA is adopted; If the matrix is large, the two-dimensional data transfer mode of DMA is adopted.

4. The complex covariance matrix computing system based on iterative and reconfigurable mode according to claim 1, is characterized in that if the number of columns in any region is not filled up, then a zero-filling mechanism is adopted; if the source data storage exceeds a single maximum The number of storage points is processed in sections by ping-pong operation. If the source data storage exceeds the maximum number of storage points at a time, batch processing is used.

5. the complex covariance matrix computing system based on iterative and reconfigurable mode according to claim 1, is characterized in that the matrix covariance calculation module adopts reconfigurable mode to construct complex multiply-accumulate calculation unit, and each segment, each Partition and each bank temporarily store the result data for iterative calculation to obtain the lower triangular covariance matrix.

6. the complex covariance matrix computing system based on iterative and reconfigurable mode according to claim 5 is characterized in that in the complex multiplying and accumulating computing unit, the number of use of complex multiplier and complex adder is equal to the bank of source data storage The number is set to b, and the number of input data of each calculation unit is: 2b+1; read the Mth address of all source data banks as input A of the complex multiplier; read all source data banks 1st to Mth respectively The conjugate of the data of one address is input to B as a complex multiplier, and the other input data is the value of the corresponding storage address in the bank of the previous storage result, and each calculation result needs to be rewritten back to the same address in the same bank.

7. the complex covariance matrix computing system based on iterative and reconfigurable mode according to claim 1, is characterized in that described conjugate symmetry module reads data successively from the bank where the lower triangular covariance matrix is located, and parses out that data is in the following For the rows and columns in the triangular matrix, according to the rule that one column of the matrix is stored in a bank, the data and its conjugate symmetric data are stored in the multiplexed source data bank until a complete complex covariance matrix is obtained.

8. the complex covariance matrix computing system based on iterative and reconfigurable mode according to claim 1, is characterized in that when the described acceleration core distributes the data read from DDR, according to the address that DMA passes in, resolves outgoing transmission. The input data is stored in the corresponding source data bank according to the row and column of the complex matrix to be calculated.

9. the complex covariance matrix computing system based on iterative and reconfigurable mode according to claim 1, it is characterized in that the address mapping that the conjugated symmetric module adopts is: read lower triangular covariance matrix successively from the result bank According to the bank number of the data and the address in the bank, the row and column of the original lower triangular matrix are parsed out, and then the current data and its conjugate are symmetric according to the allocation principle that one column of the complete covariance matrix is stored in a bank The data is stored in the reconstructed bank.