Background
The calculation of the covariance matrix is typical operation in the field of signal processing, is a key part for realizing multi-level nested wiener filters, spatial spectrum estimation, coherent source number estimation and affine invariant pattern recognition, and is widely applied to the fields of radar, sonar, digital image processing and the like. In addition, the covariance matrix is widely applied to the fields of image matching, image pattern recognition and the like, but the calculation process of the covariance matrix is complex, and if the covariance matrix of a multidimensional random variable composed of a plurality of eigenvectors in each region of an image needs to be calculated, the time consumption is long, and the calculation of the covariance matrix of the image on a personal universal PC platform in real time is also a great obstacle.
With the rapid development of the integrated circuit industry, high performance and real time are constantly pursued in the embedded field. Currently, most of the hardware implementation for the complex covariance matrix is designed based on platforms such as a DSP, a GPU, and an FPGA. The invention provides a hardware implementation method for computing a complex covariance matrix based on an iteration and reconfigurable mode based on a reconfigurable intelligent acceleration core, and compared with the traditional hardware implementation mode, the hardware implementation method has the advantages of high resource utilization rate and high hardware implementation speed. As typical operation in the field of signal processing, the hardware implementation method has good reference significance and wide application prospect.
In statistics and probability theory, the covariance matrix is a matrix. Each element of which is the covariance between the elements of the respective vector. Let X ═ X1,X2,X3,...,XN)TFor n-dimensional random variables, called matrices
Covariance matrix, denoted as d (X), for n-dimensional random variable X. Wherein, Cij=Cov(Xi,Xj) I, j is 1, …, n is the component X of XiAnd XjThe covariance of (a). Due to Cij=CjiTherefore, the covariance matrix is a symmetric matrix.
As shown in fig. 1, the conventional hardware implementation is as follows: for an mxn complex matrix a ═ a
ij]Its complex covariance matrix B may be composed ofThe following formula gives: b ═ AA
H. Namely, firstly, the matrix A is transposed, and then conjugate is solved to obtain A
HFinally, each element of the matrix B is obtained as follows:
because the method is realized according to matrix multiplication, when the covariance of a complex matrix with a large number of points is calculated, data is carried in and carried out for multiple times, which can cause the access and storage time to be overlong, and the realization method does not reduce the operation amount according to the conjugate symmetry property of the covariance matrix, which are reasons of long calculation time of the covariance matrix. In many practical application scenarios, the low-efficiency calculation of the covariance matrix can become a great obstacle, and the method has certain reference significance and application prospect.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, provides a complex covariance matrix calculation system based on iteration and reconfigurable modes, which effectively reduces the calculation amount of source data, fully utilizes storage resources, accelerates the calculation speed, and further integrally improves the performance of an application algorithm, and is specifically realized by the following technical scheme:
the complex covariance matrix computing system based on iteration and reconfigurable mode comprises an on-chip SRAM (static random access memory), an off-chip DDR (double data rate) memory, a reconfigurable unit, a DMA (direct memory access) controller and an acceleration core, wherein the acceleration core is respectively in communication connection with the on-chip SRAM memory and the reconfigurable unit, the DMA controller is in communication connection with the reconfigurable unit, the off-chip DDR memory is in communication connection with the DMA controller through a bus, and the acceleration core comprises:
the matrix covariance operation module polls the source data of each area of the on-chip SRAM memory in an iterative computation mode and computes a lower triangular covariance matrix;
the conjugate symmetry module obtains a complete complex covariance matrix from the lower triangular covariance matrix in an address mapping and reconstruction storage mode according to the conjugate symmetry property of the covariance matrix to form a final operation result;
the DMA interface function module is used for storing data read from the off-chip DDR memory in a DMA mode into the on-chip SRAM memory in a partition mode; and writing the operation result back to the off-chip DDR memory in a DMA mode.
The complex covariance matrix calculation system based on the iteration and reconfigurable mode is further designed in that the on-chip SRAM memory sets storage resources to be divided into k banks, the depth of each bank is d, if M banks are allocated for storing a lower triangular covariance matrix, covariance operation is carried out on a complex matrix with the size of M multiplied by N, and the requirement of M multiplied by N is met2Md is less than or equal to md, and N is an arbitrary value; if the calculation parallelism is b, the condition of defining the complex matrix to be solved as the decimal point is as follows: m2Bd is less than or equal to, if the condition is not met, the complex matrix to be solved is judged to be a large point number.
The complex covariance matrix calculation system based on the iteration and reconfigurable mode is further designed in that if the complex matrix to be solved is a small point number, a one-dimensional data transmission mode of DMA is adopted; and if the complex matrix to be solved is a large number of points, adopting a DMA two-dimensional data transmission mode.
The complex covariance matrix calculation system based on the iteration and reconfigurable mode is further designed in that the matrix covariance operation module stores original data according to a rule that one column of a complex matrix to be solved is stored in a single bank, all banks in a single time are stored in one column, the current column is divided into one zone, all banks in a single time are divided into one section, and the total size of a partition in a single time, the total times of the partition in a single time, the number of the zones in the last section and the number of the columns in the last zone are calculated.
The complex covariance matrix computing system based on the iteration and reconfigurable mode is further designed in that a zero filling mechanism is adopted if any area column number is not filled; if the source data storage exceeds the single maximum storage point number, ping-pong operation segmented processing is adopted, and if the source data storage exceeds the single maximum storage point number, batch processing is adopted.
The complex covariance matrix calculation system based on the iteration and reconfigurable mode is further designed in that the matrix covariance operation module adopts the reconfigurable mode to construct a complex multiplication accumulation calculation unit, and iterative calculation is carried out on each segment, each partition and each bank temporary storage result data to obtain a lower triangular covariance matrix.
The complex covariance matrix calculation system based on the iteration and the reconfigurable mode is further designed in that the using number of complex multipliers and complex adders in the complex multiplication accumulation calculation unit is equal to the bank number of source data storage, b is set, and the number of input data of each calculation unit is as follows: 2b + 1; reading the Mth address of all the source data banks as the input A of a complex multiplier; and respectively reading the conjugate of the data of the 1 st to M th addresses of all the source data banks as the input B of the complex multiplier, wherein the other input data is the value of the corresponding storage address in the last storage result bank, and the calculation result needs to be written back to the same address in the same bank again every time.
The complex covariance matrix calculation system based on the iteration and reconfigurable mode is further designed in that the conjugate symmetry module reads data from a bank where the lower triangular covariance matrix is located in sequence, analyzes rows and columns of the data in the lower triangular covariance matrix, and stores the data and conjugate symmetry data thereof into a multiplexed source data bank according to a rule that the rows and the columns of the matrix are stored in one bank until a complete complex covariance matrix is obtained.
The complex covariance matrix computing system based on the iteration and the reconfigurable mode is further designed in that when the acceleration core distributes data read from the DDR, the rows and the columns of the incoming data in the complex matrix to be solved are analyzed according to the addresses incoming by the DMA, and then the incoming data are stored in the corresponding source data bank according to the column distribution principle.
The complex covariance matrix calculation system based on iteration and reconfigurable mode is further designed in that the address mapping adopted by the conjugate symmetry module is as follows: and sequentially reading the data of the lower triangular covariance matrix from the result bank, resolving the row and the column of the original lower triangular covariance matrix according to the bank number of the data and the address in the bank, and storing the current data and the conjugate symmetric data thereof into the reconstructed bank according to the distribution principle that one column of the complete covariance matrix is stored in one bank.
The invention has the following advantages:
the complex covariance matrix calculation system based on the iteration and reconfigurable mode is designed in a parameterization mode, and a scheme with the maximum resource utilization rate is reconfigured for different storage and calculation resources to carry out complex covariance matrix operation. The method reduces the source data calculation amount, improves the resource utilization rate, accelerates the calculation speed of hardware realization, and provides a good reference for the design realization of the calculation covariance matrix in the field of signal processing.
Detailed Description
The following describes the present invention in detail with reference to the accompanying drawings.
Referring to fig. 1, the complex covariance matrix calculation system based on iterative and reconfigurable mode in this example mainly includes main modules such as DMA interface function, matrix covariance operation, and conjugate symmetry. The DMA interface function module is mainly responsible for: firstly, data read from an off-chip DDR memory (DDR) by a DMA mode is stored into an on-chip SRAM memory (SRAM) in a partitioning mode, and secondly, the final operation result is written back to the DDR by the DMA mode. The matrix covariance operation module is mainly responsible for polling source data of each region of the SRAM in an iterative computation mode to calculate a lower triangular covariance matrix. The conjugate symmetry module is mainly responsible for obtaining a complete complex covariance matrix by the lower triangular covariance matrix through address mapping and a reconstruction storage mode according to the conjugate symmetry property of the covariance matrix. The method supports the covariance operation of the complex matrix with any column number, reduces the source data calculation amount of the traditional hardware implementation mode and the time for writing back result data to DDR for many times, balances calculation and storage resources to realize maximized multipath parallelism, constructs the calculation unit by using a reconfigurable mode, and sets a specific address mapping rule to obtain the complex covariance matrix.
The on-chip SRAM memory sets storage resources to be divided into k banks, the depth of each bank is d, if M banks are allocated for storing a lower triangular covariance matrix, covariance operation is carried out on a complex matrix with the size of M multiplied by N, and the requirement of M is met2Md is less than or equal to md, and N is an arbitrary value; if the calculation parallelism is b, the condition of defining the complex matrix to be solved as the decimal point is as follows: m2Bd is less than or equal to, if the condition is not met, the complex matrix to be solved is judged to be a large point number.
The complex covariance matrix calculation system based on the iteration and reconfigurable mode is further designed in that if the complex matrix to be solved is a small point number, a one-dimensional data transmission mode of DMA is adopted; and if the complex matrix to be solved is a large number of points, adopting a DMA two-dimensional data transmission mode.
The complex covariance matrix calculation system based on the iteration and reconfigurable mode is further designed in that the matrix covariance operation module stores original data according to a rule that one column of a complex matrix to be solved is stored in a single bank, all banks in a single time are stored in one column, the current column is divided into one zone, all banks in a single time are divided into one section, and the total size of a partition in a single time, the total times of the partition in a single time, the number of the zones in the last section and the number of the columns in the last zone are calculated.
The complex covariance matrix computing system based on the iteration and reconfigurable mode is further designed in that a zero filling mechanism is adopted if any area column number is not filled; if the source data storage exceeds the single maximum storage point number, ping-pong operation segmented processing is adopted, and if the source data storage exceeds the single maximum storage point number, batch processing is adopted.
The complex covariance matrix calculation system based on the iteration and reconfigurable mode is further designed in that the matrix covariance operation module adopts the reconfigurable mode to construct a complex multiplication accumulation calculation unit, and iterative calculation is carried out on each segment, each partition and each bank temporary storage result data to obtain a lower triangular covariance matrix.
The complex covariance matrix calculation system based on the iteration and the reconfigurable mode is further designed in that the using number of complex multipliers and complex adders in the complex multiplication accumulation calculation unit is equal to the bank number of source data storage, b is set, and the number of input data of each calculation unit is as follows: 2b + 1; reading the Mth address of all the source data banks as the input A of a complex multiplier; and respectively reading the conjugate of the data of the 1 st to M th addresses of all the source data banks as the input B of the complex multiplier, wherein the other input data is the value of the corresponding storage address in the last storage result bank, and the calculation result needs to be written back to the same address in the same bank again every time.
The complex covariance matrix calculation system based on the iteration and reconfigurable mode is further designed in that the conjugate symmetry module reads data from a bank where the lower triangular covariance matrix is located in sequence, analyzes rows and columns of the data in the lower triangular covariance matrix, and stores the data and conjugate symmetry data thereof into a multiplexed source data bank according to a rule that the rows and the columns of the matrix are stored in one bank until a complete complex covariance matrix is obtained.
The complex covariance matrix computing system based on the iteration and the reconfigurable mode is further designed in that when the acceleration core distributes data read from the DDR, the rows and the columns of the incoming data in the complex matrix to be solved are analyzed according to the addresses incoming by the DMA, and then the incoming data are stored in the corresponding source data bank according to the column distribution principle.
The complex covariance matrix calculation system based on iteration and reconfigurable mode is further designed in that the address mapping adopted by the conjugate symmetry module is as follows: and sequentially reading the data of the lower triangular covariance matrix from the result bank, resolving the row and the column of the original lower triangular covariance matrix according to the bank number of the data and the address in the bank, and storing the current data and the conjugate symmetric data thereof into the reconstructed bank according to the distribution principle that one column of the complete covariance matrix is stored in one bank.
The following is a detailed description of an example implementation of the invention, and a cycle accurate system level simulation model based on SystemC language is set up for verification.
The invention calculates a complex covariance matrix based on the hardware implementation architecture shown in FIG. 2, in the example, it is assumed that the matrix X is M × N (M is less than or equal to 256, N is less than or equal to 8K), and the generalized covariance result Y is M × M: (hereinafter, "%" indicates a modulo operation)
If the memory size is 2MB of SRAM as an example (divided into 32 banks, each bank depth is 8k), the data processing is carried out in a partitioning mode (large number of points and also segmentation) during hardware implementation, and the source data area is filled in a partitioning mode for multiple times according to columns. Considering supporting the ping-pong operation with large dot count, the generalized covariance result Y is 256 × 256 dot counts at most, i.e. the lower triangular matrix is 256 × 256 (256+1)/2 — 32896 dot counts at most, so at least 5 bank storages are required. Therefore, the available 32 banks remove 5 banks storing the calculation results, and all the remaining banks store the source data through the ping-pong operation, so that the number of banks storing the source data at a time is a floor [ (32-5)/2] ═ 13. (floor function: return the largest integer smaller than the parameter).
Since each bank is 8k deep and data is stored in each bank in a single row of M (M is 256 at most), B ═ floor [8k/M ] zones are defined, i.e. a banks of each zone can store a row data, and all banks can store C ═ B row data at most. (definition of large dot number: data size of complex matrix to be solved exceeds C x M)
Taking an M N matrix as an example (N ≦ C), the source data is arranged as shown in FIG. 3. If N > C, then according to the cycle, D ═ ceil (N/C) times are transmitted in total, and the ceil function: the smallest integer greater than or equal to the specified expression is returned. The number of banks occupied by the last zone is (N-1)% A +1, and the remaining banks of the zone are filled with 0.
The AXI bus data bit width adopted in the implementation of this example is 256 bits, and the DMA splits the AXI bus data into 4 data of 64 bits and transmits the data to the DMA interface function of the matrix covariance operation, and the function stores the data into the corresponding bank according to the address mapping mechanism, as shown in fig. 4. Therefore, when the large dot count is operated, the partition size of each section is preferably multiple of 4, so that the number of columns transmitted each time is just multiple of 4, 2 data cannot be written into the same bank at the same time, and the influence of delay memory on the overall operation speed is avoided; for the small dot count operation, since the DMA is transferred once and the number of columns is not necessarily exactly a multiple of 4, 2 data are written to the same bank at the same time, and at this time, the data need to be written in turn with a delay, as shown in fig. 5. For the last section of last area data transmission of large dot number operation, because the DMA adopts a two-dimensional data transmission mode, if the number of the remaining columns is not a multiple of 4, zero padding can be automatically carried out to be a multiple of 4, although zero padding columns are written into the banks, zero padding is just needed to be carried out on the last area, so that the method is not only irrelevant, but also shortens the number of zero padding columns and shortens the storage time of source data.
The reconfigurable computing unit needs to be reconfigured into a 27 complex data input and 5-level full-flow water complex multiplication accumulation unit, as shown in fig. 6, 13 complex multipliers and 13 complex adders are used in total.
According to the conjugate symmetry characteristic of the generalized covariance result, in order to shorten the operation time, only the lower triangle of the complex covariance matrix is calculated, and the upper triangle is expanded by adopting conjugate symmetry. The source data reading form in the calculation process is shown in fig. 7, and the specific operation steps are as follows:
1) the matrix is divided into D sections according to columns, 1 section of data is imported each time, and the source data bank is placed according to the form of a graph 3;
2) the following operation is performed for each area occupied by the current segment: (input Pre _ Region Data is Data of the corresponding position in the same area)
Reading the 1 st number from the A banks simultaneously as an input I1, and taking the conjugate of the 1 st number as an input I2;
reading the 2 nd number from the A banks simultaneously as an input I1, and sequentially reading the 1 st and 2 nd numbers to take a conjugate as an input I2;
reading the 3 rd number from the A banks simultaneously as an input I1, and sequentially reading the 1 st, 2 nd and 3 rd numbers to take conjugation as an input I2;
……;
reading the Mth number from the A banks simultaneously as an input I1, and sequentially reading the 1 st, the 2 nd and the … …, wherein the Mth number takes a conjugate as an input I2;
3) repeating the step 2) to finish the operation of all the areas of the current section;
4) repeating the step 2) and the step 3) to complete calculation of all the sections to obtain M (M +1)/2 results;
5) and expanding the M (M +1)/2 results into M results through conjugate symmetry and storing the M results into a new storage array again.
Since the source data storage area can be reused after the calculation is completed, a new storage array is constructed to store the covariance calculation results of M matrix/M matrix, in order to correspond to that the DMA transfers data with 256 bits each time, and in consideration of the convenience of taking out 4 data each time, the new storage array is planned as bank0-bank15, the calculated M matrix/M complex covariance matrix is stored in the new storage array in sequence according to columns, and a specific lower triangular matrix conjugate symmetric address mapping mechanism is shown in FIG. 8.
The performance of this example was evaluated as follows: 1) the number of the sections occupying the B areas is floor (N/C); 2) the number of zones occupied by the last stage is ceil ((N% C)/A); 3) taking the number of cycles of the input I1 as M during operation of each Region, and taking the number of cycles of the input I2 and the input Pre _ Region Data as M x (M +1)/2 in parallel; 4) the time of conjugate symmetric expansion of the lower triangular matrix is M (M + 1)/2; 5) the complex multiply accumulate unit calculates time as T1, and reads and stores data from the bank as T2 and T3, respectively. The total number of operation cycles of the complex covariance matrix is as follows: floor (N/C) [ (M +1)/2+ M) × B ] + ceil [ (N% C)/a ] × [ M × (M +1)/2+ M ] + M × (M +1)/2+ T1+ T2+ T3. While the conventional way calculates the number of cycles of the complex covariance matrix as follows: m N (M + 4)/4.
T1, T2, T3 are negligible because they are small relative to the total number of cycles. The performance pairs for the above 2 implementations are shown in fig. 9 when covariance operations are performed on complex matrices of different sizes. As is apparent from the figure, the cycle number of calculating the complex covariance matrix based on the iteration and the reconfigurable mode is greatly reduced compared with the traditional mode. The complex covariance matrix calculation system based on the iteration and reconfigurable mode supports complex matrixes with any column number to perform covariance calculation, reduces the source data calculation amount of the traditional hardware implementation mode and the time of writing back result data to DDR for multiple times, balances calculation and storage resources to realize maximized multipath parallelism, constructs a calculation unit by using the reconfigurable mode, sets a specific address mapping rule to obtain the complex covariance matrix, and greatly improves the resource utilization rate and the calculation speed of hardware implementation. As typical operation in the field of signal processing, the hardware implementation method has good reference significance and wide application prospect.