Disclosure of Invention
The main object of the invention is to provide a JPEG image compression system and method based on an FPGA double-matrix shared pipeline. For image data compression in the YUV color space, a double 8-row matrix segmentation architecture and a timing-aligned shared pipeline allow the Y, U and V components to time-share the core modules, including the discrete cosine transform (DCT), ZigZag scanning, quantization and Huffman coding, which solves the technical problem of hardware resource redundancy in traditional schemes.
In order to achieve the above object, the JPEG image compression system based on the FPGA double-matrix shared pipeline according to the present invention includes:
The YUV420 preprocessing module is used for converting YUV444 format data into YUV420 format data, completely reserving a Y component, performing 2:1 horizontal/vertical downsampling on a UV component, and generating a Y component data stream and a UV component data stream subjected to 2:1 horizontal/vertical downsampling;
The double 8-row matrix segmentation storage module is used for alternately storing the Y component data streams into Y1 and Y2 cache areas according to rows to form a double 8-row matrix structure, synchronously storing the UV component data streams to obtain Y1, Y2 and U, V data blocks, and realizing Y1/Y2 double matrix segmentation of the Y component and 8-row alignment storage of the UV component;
the shared pipeline time sequence scheduling module is used for reading Y1, Y2 and U, V data blocks from the Y1 and Y2 buffer areas according to the sequence of Y1-Y2-U-V, generating a time sequence control signal of a shared processing pipeline, realizing pipeline type continuous processing and converting parallel multi-component data into serial data streams;
the shared processing module is used for performing DCT transform, quantization, ZigZag scanning, run length coding and Huffman coding processing on the Y1, Y2, U and V data blocks in a time-sharing multiplexing mode according to the time sequence control signals;
and the merging and encoding module is used for splicing the encoded output into a continuous bit stream, generating output according to byte alignment, realizing the formatting integration of encoded data and generating a compressed data stream conforming to the JPEG standard.
Optionally, by means of row count control and a split counter, the dual 8-row matrix segmentation storage module alternately generates two 8×8 Y1 data blocks and two 8×8 Y2 data blocks each time 16 rows of Y component data are received; the UV component data stream is buffered at a depth of 8 rows and, together with the Y1 and Y2 data blocks, forms a standard input group of six 8×8 data blocks, providing standardized 8×8 data blocks for the shared pipeline.
Optionally, the shared pipeline timing scheduling module adopts a three-stage state machine, and the three-stage state machine includes:
an IDLE state, in which a jump to the read enable generation state is made when the data quantity in the UV component FIFO is detected to be greater than or equal to 8;
a GEN_RD_EN state, in which read enable signals are generated in the order Y1→Y2→U→V by a 48-period counter;
and a WAIT_BACK state, in which the empty flag of the back-end FIFO is monitored to ensure that the shared pipeline has no backlog.
Optionally, the sharing processing module includes:
The shared two-dimensional DCT conversion module is used for decomposing the two-dimensional DCT into row conversion and column conversion through a row-column separation algorithm, realizing 8-point DCT calculation through a 3-level butterfly network by adopting a Loeffler algorithm, and converting a floating point coefficient of a cosine base function into a 16-bit fixed point number;
The dynamic quantization module is used for executing non-uniform quantization on DCT coefficients based on the time sequence control signals, dynamically switching the brightness/chromaticity quantization table and reserving human eye sensitive low-frequency information;
The Zigzag pipeline scanning module is used for rearranging the quantized 8 multiplied by 8 quantization coefficient matrix into a one-dimensional sequence according to the Zigzag path and performing DPCM differential encoding on the DC coefficient;
The shared run length coding module is used for compressing the one-dimensional sequence scanned by the ZigZag, executing zero run length coding and reducing the data quantity;
The four-table shared Huffman coding module is used for performing variable length coding on the run-length-coded Y, U and V component data using four independent code tables (DC and AC tables for each of the Y and UV components), assigning codes of different lengths according to the occurrence probability of the data so as to realize lossless compression.
Optionally, the dynamic quantization module dynamically switches the luminance/chrominance quantization table according to a 48-period counter, wherein:
During counter periods 0-31, the Y1 and Y2 data blocks are processed using the luminance quantization table;
During counter periods 32-47, the U and V data blocks are processed using the chrominance quantization table.
On the other hand, the invention also provides a JPEG image compression method based on the FPGA double-matrix sharing pipeline, which is performed by adopting the JPEG image compression system based on the FPGA double-matrix sharing pipeline, and comprises the following steps:
converting YUV444 format data into YUV420 format, and generating Y component data stream and UV component data stream subjected to 2:1 horizontal/vertical downsampling;
The Y component data stream is alternately stored in Y1 and Y2 buffer areas according to rows to form a double-8-row matrix structure, UV component data streams are synchronously stored, Y1/Y2 double-matrix segmentation of the Y component and 8-row aligned storage of the UV component are realized, and standardized 8X 8 data blocks are provided for a shared pipeline;
Reading data blocks from the Y1 and Y2 buffer areas according to the sequence of Y1, Y2, U and V, generating a time sequence control signal of a shared processing pipeline, and outputting according to the sequence of Y1, Y2, U and V to realize pipeline type continuous processing;
Based on the time sequence control signal, performing DCT transform, quantization, ZigZag scanning, run length coding and Huffman coding processing on the Y1, Y2, U and V data blocks in a time-sharing multiplexing mode;
The encoded outputs are spliced into a continuous bit stream, generating a compressed data stream conforming to the JPEG standard.
Optionally, the converting YUV444 format data into YUV420 format, generating the Y component data stream and the UV component data stream subjected to 2:1 horizontal/vertical downsampling includes the following steps:
constructing a UV component cache with 2 lines of depth to form a 2X 2 pixel window matrix;
mean calculation: summing the 4 UV values in the window and computing the mean efficiently by a right shift of 2 bits;
data-aligned output: generating a YUV420 format data stream in which the Y resolution is twice the UV resolution in each direction, matching the double 8-row matrix segmentation requirement.
Optionally, storing the Y component data stream alternately in rows in Y1 and Y2 buffers to form a dual 8-row matrix structure, including the steps of:
by row counting control, two 8×8 Y1 data blocks and two 8×8 Y2 data blocks are alternately generated each time 16 rows of Y component data are received;
the UV component data stream is buffered at a depth of 8 rows and, together with the Y1/Y2 data blocks, forms a standard input group of six 8×8 data blocks.
Optionally, performing the DCT transform, quantization, ZigZag scanning, run-length encoding and Huffman encoding processing on the Y1, Y2, U and V data blocks in a time-division multiplexing manner based on the timing control signal includes the steps of:
Based on the 48-period counter value, the luminance quantization table is used when processing Y1, Y2 data blocks in 0-31 periods, and the chrominance quantization table is used when processing U, V data blocks in 32-47 periods;
Huffman coding is performed on the different component data based on four independent code tables (DC and AC tables for each of the Y and UV components).
Optionally, the performing the DCT transform of the Y1, Y2, U, V data blocks based on the timing control signal by time-division multiplexing includes the steps of:
Row-column separation: the two-dimensional DCT is decomposed into 8 one-dimensional row transforms and 8 one-dimensional column transforms, where f(x, y) is the spatial-domain 8×8 pixel value and C_i(x) and C_j(y) are the cosine basis functions; a one-dimensional DCT unit is invoked for each row of the 8×8 image block to generate an intermediate frequency-domain matrix, the row transform result is transposed through a dual-port BRAM into column data format, and the one-dimensional DCT unit is then invoked on the transposed column data to output the complete 8×8 frequency-domain coefficient matrix;
Butterfly operation: using the Loeffler algorithm, the 8-point DCT calculation is completed through a 3-level butterfly network, each level requiring only 4 multiplications and 8 additions, and the symmetry of the cosine function is used to break the matrix multiplication down into iterated additions, subtractions and a small number of multiplications;
Fixed-point implementation: the floating-point coefficients of the cosine basis functions are converted into 16-bit fixed-point numbers.
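For reference, the row-column decomposition described above can be written in the standard separable form of the 8×8 DCT. This is a sketch under assumptions: the output symbol F(i, j) and the normalization factor c(k) are not defined in the text and are taken from the usual orthonormal DCT-II definition, chosen because c(0) = 1/(2√2) ≈ 0.3536 agrees with the value C(0) = 0.3536 used in the fixed-point example of the detailed description.

```latex
F(i,j) = \sum_{x=0}^{7} C_i(x) \left[ \sum_{y=0}^{7} f(x,y)\, C_j(y) \right],
\qquad
C_k(n) = c(k)\cos\frac{(2n+1)k\pi}{16},
\qquad
c(k) = \begin{cases} \dfrac{1}{2\sqrt{2}}, & k = 0 \\[4pt] \dfrac{1}{2}, & k \neq 0 \end{cases}
```

The bracketed inner sum is one one-dimensional transform along a row of the block; the outer sum is the one-dimensional transform applied along the columns of that intermediate result.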
The technical scheme of the invention has the following advantages. It breaks with the traditional independent-module architecture: through the double 8-row matrix segmentation and timing-aligned shared pipeline design, the three YUV components time-share the core modules such as the DCT transform, quantization, ZigZag scanning and Huffman coding. Compared with the traditional three-channel architecture, the computing circuit uses more than 50% fewer logic units and DSP resources. The storage system does not need to buffer the intermediate results of the three components simultaneously, which reduces the on-chip SRAM requirement by 30%. For power consumption, the time-sharing multiplexing mechanism reduces the average power consumption by 40% and the data bus bandwidth requirement by 30%. For implementation complexity, the data path is simplified, the FPGA/ASIC routing difficulty is reduced, and the development period is shortened by about 25%. By optimizing the data organization, processing flow and timing control, the invention forms a complete and efficient coding system that significantly reduces the computational complexity and data handling overhead of JPEG, and it is particularly suitable for resource-constrained embedded image compression applications.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that all directional indicators (such as up, down, left, right, front and rear) used in the embodiments of the present invention merely explain the relative positional relationship, movement and the like between the components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indicators change accordingly.
In addition, the technical solutions of the embodiments may be combined with each other, provided that the combination can be realized by those skilled in the art; when a combination of technical solutions is contradictory or cannot be realized, the combination should be considered not to exist and falls outside the scope of protection claimed by the present invention.
The invention provides a JPEG image compression system and method based on an FPGA double-matrix sharing pipeline.
As shown in fig. 1 to 6, in an embodiment of the present invention, the JPEG image compression system based on the FPGA dual matrix shared pipeline includes:
the YUV420 preprocessing module 101 is configured to convert YUV444 format data into YUV420 format data, completely reserve a Y component, perform 2:1 horizontal/vertical downsampling on a UV component, and generate a Y component data stream and a UV component data stream subjected to 2:1 horizontal/vertical downsampling;
The dual 8-row matrix segmentation storage module 102 is used for alternately storing Y component data streams into Y1 and Y2 cache areas according to rows to form a dual 8-row matrix structure, synchronously storing UV component data streams to obtain Y1, Y2 and U, V data blocks, and realizing Y1/Y2 dual matrix segmentation of Y components and 8-row aligned storage of UV components;
The shared pipeline time sequence scheduling module 103 is used for reading Y1, Y2 and U, V data blocks from the Y1 and Y2 buffer areas according to the sequence of Y1, Y2, U and V, generating a time sequence control signal of a shared processing pipeline, realizing pipeline type continuous processing and converting parallel multicomponent data into serial data streams;
The shared processing module 104 is configured to perform DCT transform, quantization, ZigZag scanning, run-length encoding, and Huffman encoding processing on the Y1, Y2, U and V data blocks in a time-division multiplexing manner according to the timing control signal;
And the merging and encoding module 105 is used for splicing the encoded output into a continuous bit stream, generating output according to byte alignment, realizing the formatting integration of encoded data and generating a compressed data stream conforming to the JPEG standard.
The JPEG image compression system based on the FPGA double-matrix shared pipeline builds a JPEG compression architecture whose core is the double 8-row matrix segmentation and the timing-aligned shared pipeline, and realizes efficient compression of the three YUV components through the cooperation of 9 major modules; the JPEG image compression flow is shown in FIG. 3.
Specifically, the YUV420 preprocessing module 101 (dual matrix preparation) is configured to implement format conversion from YUV444 to YUV420, completely reserve the Y component, perform 2:1 horizontal/vertical downsampling on the UV component, and use 2×2 block mean filtering to ensure chroma smoothing, and provide preprocessed data for dual 8-row matrix segmentation, where the implementation process specifically includes the following steps:
(1) Constructing a UV component cache with 2 lines of depth to form a 2X 2 pixel window matrix;
(2) Mean calculation: the 4 UV values in the window are summed, and the mean is computed efficiently by a right shift of 2 bits (fixed-point division by 4);
(3) Data-aligned output: a YUV420 format data stream is generated in which the Y resolution is twice the UV resolution in each direction, matching the double 8-row matrix segmentation requirement.
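The three preprocessing steps above can be sketched in software as follows. This is a minimal behavioural model, not the hardware design: the function name `downsample_2x2_mean` and the list-of-rows plane layout are assumptions introduced for illustration, and even plane dimensions are assumed.

```python
def downsample_2x2_mean(plane):
    """Average each 2x2 window of a chroma plane using (sum >> 2)."""
    height, width = len(plane), len(plane[0])
    assert height % 2 == 0 and width % 2 == 0, "even dimensions assumed"
    out = []
    for r in range(0, height, 2):
        top, bottom = plane[r], plane[r + 1]      # models the 2-line cache
        row = []
        for c in range(0, width, 2):
            total = top[c] + top[c + 1] + bottom[c] + bottom[c + 1]
            row.append(total >> 2)                # divide by 4 via right shift
        out.append(row)
    return out


# A 2x4 chroma plane shrinks to 1x2; the Y plane would be left untouched.
u_plane = [[10, 12, 20, 22],
           [14, 16, 24, 26]]
u_420 = downsample_2x2_mean(u_plane)
```

The right shift truncates rather than rounds, which mirrors the shift-only mean described in step (2).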
Specifically, the dual 8-row matrix segmentation storage module 102 realizes the Y1/Y2 dual matrix segmentation of the Y component and the 8-row aligned storage of the UV component through independent caching, row count judgment, write enable triggering and FIFO write control, provides standardized 8×8 data blocks for the shared pipeline, and supports efficient operation of subsequent stages such as encoding and algorithm processing. The implementation flowchart is shown in FIG. 4, and the specific steps are as follows:
(1) Parallel channel input: the Y, U and V component data are input continuously through independent channels. When i_de_Y is valid, i_data_Y is written row by row into the Y cache 8×8 sliding window; similarly, data are written into the U and V cache 8×8 sliding windows through i_de_U/i_data_U and i_de_V/i_data_V;
(2) Row count and write enable control: for the Y component, the rising edge of i_de_Y triggers the row count, and the 3-bit counter row_cnt_Y cycles through 0-7. When row_cnt_Y = 7, o_de_Y = i_de_Y is output, and the counter automatically resets to zero to enter the next row count period. The U and V components are handled in the same way, outputting o_de_U = i_de_U and o_de_V = i_de_V when the row count reaches 7;
(3) Y component double-buffer discrimination: the falling edges of o_de_Y are counted, and Y1/Y2 switching is realized with a 1-bit counter row_cnt_Y (cycling 0/1). When row_cnt_Y = 0, o_de_Y1 = o_de_Y; when row_cnt_Y = 1, o_de_Y2 = o_de_Y. The first 8 rows and the last 8 rows of Y are thus written into their corresponding FIFOs, realizing the double 8-row matrix splitting strategy;
(4) Data bit width and transmission period: the single-pixel bit width is 8 bits/pixel, in line with the YUV420 standard; the bus bit width is 64 bits (8 pixels × 8 bits/pixel), 1 row of data is transmitted in each clock period, and the writing of one 8×8 data block is completed in 8 periods.
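The row counting and Y1/Y2 switching of steps (2) and (3) can be illustrated with a small behavioural model. This is a sketch under assumptions: the names `split_double_8_rows`, `to_blocks`, `row_cnt` and `bank_sel` are hypothetical (the text uses row_cnt_Y for both counters), edge-triggered enables are replaced by a plain loop over rows, and a 16-pixel-wide luminance region is assumed so that each group of 8 rows yields two 8×8 blocks.

```python
def split_double_8_rows(y_rows):
    """Route rows 0-7 of each 16-row group to Y1 and rows 8-15 to Y2."""
    y1_fifo, y2_fifo = [], []
    row_cnt = 0      # models the 3-bit row counter (0..7)
    bank_sel = 0     # models the 1-bit Y1/Y2 selector (hypothetical name)
    for row in y_rows:
        (y1_fifo if bank_sel == 0 else y2_fifo).append(row)
        if row_cnt == 7:
            bank_sel ^= 1                 # switch buffer after every 8th row
        row_cnt = (row_cnt + 1) & 0x7     # 3-bit wrap-around
    return y1_fifo, y2_fifo


def to_blocks(rows8):
    """Cut 8 buffered rows into 8x8 blocks, left to right."""
    width = len(rows8[0])
    return [[r[c:c + 8] for r in rows8] for c in range(0, width, 8)]


# 16x16 luminance region whose pixel value encodes its position.
y = [[r * 16 + c for c in range(16)] for r in range(16)]
y1, y2 = split_double_8_rows(y)
y_blocks = to_blocks(y1) + to_blocks(y2)   # Y1, Y1, Y2, Y2
```

Together with one U block and one V block, the four luminance blocks make up the six-block input group read by the scheduler.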
Specifically, the shared pipeline timing scheduling module 103 is a dedicated timing scheduling module designed for the YUV420 16×16 matrix operation requirement in the JPEG compression process. It converts parallel multi-component data (the Y1/Y2/U/V matrices) into a serial data stream and outputs it in the order Y1→Y1→Y2→Y2→U→V, realizing pipelined continuous processing, eliminating the data waiting bottleneck and ensuring orderly and efficient processing of each component's data.
In the specific implementation, a three-state machine (IDLE→GEN_RD_EN→WAIT_BACK) realizes this function; the state machine flow is shown in FIG. 5, and the specific steps are as follows:
(1) Initial state (IDLE)
Function: wait for data to be ready, ensuring that 8×8 data are stored in both the UV component and Y component FIFOs.
Trigger condition: owing to the YUV420 sampling characteristic, the Y component data amount is 4 times that of U or V, so the UV components are written into their FIFOs more slowly than the Y component. Only the amount of data in the U component FIFO therefore needs to be detected; when it is greater than or equal to 8, the state machine jumps to the read enable generation state (GEN_RD_EN).
(2) Read enable generation state (GEN_RD_EN)
Function: the read enable signal of each component is generated in the order Y1→Y2→U→V to control the data order, strictly matching the 16×16 matrix processing flow in JPEG compression, ensuring that each 8×8 data block is read completely and avoiding data misalignment.
Trigger condition: a single read must complete six 8×8 blocks (Y1×2 + Y2×2 + U×1 + V×1), 384 pixels in total. With a 64-bit bus and 8 cycles per block, the 6 blocks take 48 cycles, so the counter cnt runs from 0 to 47. When cnt = 47 (the 48-cycle read is complete), the state machine jumps to WAIT_BACK to wait for the back-end FIFO to empty. Following the read order and the 8 cycles per block, cnt = 0-15 enables Y1 reads, cnt = 16-31 enables Y2 reads, cnt = 32-39 enables U reads and cnt = 40-47 enables V reads.
(3) Waiting for back-end FIFO empty state (WAIT_BACK)
Function: a flow control mechanism that prevents the back-end FIFO from overflowing, avoids data backlog caused by slow back-end processing and keeps the pipeline running continuously and smoothly.
Trigger condition: the empty flag (empty_back) of the back-end FIFO is monitored, and the state machine returns to the initial state to prepare for the next round of reading only when the back end has enough space.
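The three-state flow above can be modelled cycle by cycle as follows. This is a sketch, not the RTL: the class name `Scheduler`, the method `step` and the argument `u_fifo_level` are assumptions introduced for illustration, while the state names, the 0-47 counter, the threshold of 8 and the empty_back flag come from the description.

```python
IDLE, GEN_RD_EN, WAIT_BACK = "IDLE", "GEN_RD_EN", "WAIT_BACK"
READ_ORDER = ["Y1", "Y1", "Y2", "Y2", "U", "V"]   # one entry per 8-cycle block


class Scheduler:
    def __init__(self):
        self.state = IDLE
        self.cnt = 0

    def step(self, u_fifo_level, empty_back):
        """Advance one clock; return the component being read, or None."""
        if self.state == IDLE:
            if u_fifo_level >= 8:             # U FIFO holds a full 8x8 block
                self.state = GEN_RD_EN
                self.cnt = 0
            return None
        if self.state == GEN_RD_EN:
            target = READ_ORDER[self.cnt // 8]
            if self.cnt == 47:                # 48-cycle read finished
                self.state = WAIT_BACK
            else:
                self.cnt += 1
            return target
        if empty_back:                        # WAIT_BACK: back end drained
            self.state = IDLE
        return None


sched = Scheduler()
trace = [sched.step(u_fifo_level=8, empty_back=False) for _ in range(50)]
```

In this trace the first call leaves IDLE, the next 48 calls issue reads for Y1, Y1, Y2, Y2, U and V in turn, and the model then stays in WAIT_BACK until `empty_back` is asserted.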
Specifically, the shared processing module 104 includes a shared two-dimensional DCT transformation module 1041, a dynamic quantization module 1042, a ZigZag pipeline scanning module 1043, a shared run-length encoding module 1044, and a four-table shared Huffman encoding module 1045.
Specifically, the shared two-dimensional DCT transform module 1041 is configured to implement the two-dimensional discrete cosine transform (2D-DCT) of an 8×8 image block, converting spatial-domain pixels into frequency-domain coefficients, and achieves energy concentration, parallel computation and resource optimization through a row-column separation algorithm, the Loeffler algorithm and fixed-point processing. The specific implementation flow steps are as follows:
(1) Row-column separation algorithm: the two-dimensional DCT is decomposed into 8 one-dimensional row transforms and 8 one-dimensional column transforms, where f(x, y) is the spatial-domain 8×8 pixel value and C_i(x) and C_j(y) are the cosine basis functions.
Each row of the 8×8 image block is passed to the one-dimensional DCT unit to generate an intermediate frequency-domain matrix; the row transform result is transposed through a dual-port BRAM into column data format; the transposed column data are then passed to the one-dimensional DCT unit, which outputs the complete 8×8 frequency-domain coefficient matrix;
(2) Butterfly operation: using the Loeffler algorithm, the 8-point DCT calculation is completed through a 3-level butterfly network. Each level requires only 4 multiplications and 8 additions; the symmetry of the cosine function is used to break the matrix multiplication down into iterated additions, subtractions and a small number of multiplications, which reduces the amount of calculation, matches the parallel pipeline characteristics of the FPGA and improves the processing throughput;
(3) Fixed-point implementation: to avoid the high resource consumption of floating-point operations, the floating-point coefficients of the cosine basis functions are converted into 16-bit fixed-point numbers. For example, the floating-point coefficient C(0) = 0.3536 is approximated by 0.3536 × 2^9 = 181.0432, giving the fixed-point number 0xB5 (decimal 181).
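The row transform, transpose and column transform sequence, together with the 2^9 coefficient scaling, can be sketched as below. Note the assumptions: the one-dimensional transform here is a direct multiply-accumulate rather than the Loeffler butterfly network, the basis normalization (1/(2√2) for k = 0, 1/2 otherwise) is the usual orthonormal choice that reproduces the 0xB5 example, and the names `basis`, `COEF`, `dct_1d` and `dct_2d` are hypothetical.

```python
import math

FRAC_BITS = 9   # coefficients scaled by 2**9, as in 0.3536 * 2**9 ~ 181


def basis(k, n):
    """Floating-point cosine basis value C_k(n) for an 8-point DCT."""
    scale = math.sqrt(1 / 8) if k == 0 else 0.5
    return scale * math.cos((2 * n + 1) * k * math.pi / 16)


# Fixed-point table: COEF[k][n] ~ C_k(n) * 2**FRAC_BITS
COEF = [[round(basis(k, n) * (1 << FRAC_BITS)) for n in range(8)]
        for k in range(8)]


def dct_1d(vec):
    """8-point DCT by direct multiply-accumulate (not the butterfly form)."""
    return [sum(COEF[k][n] * vec[n] for n in range(8)) >> FRAC_BITS
            for k in range(8)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dct_2d(block):
    rows = [dct_1d(r) for r in block]              # 8 row transforms
    cols = [dct_1d(c) for c in transpose(rows)]    # transpose, 8 column transforms
    return transpose(cols)                         # back to [i][j] order


flat = [[100] * 8 for _ in range(8)]
coeffs = dct_2d(flat)
```

For a flat block all energy lands in the DC term; the exact value would be 800, and the truncating shifts leave the fixed-point result a few counts below that, which illustrates the precision cost of the 9-bit scaling.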
Specifically, the dynamic quantization module 1042 (shared table switching) is used for performing non-uniform quantization on the DCT coefficients, dynamically switching the luminance/chrominance quantization table according to the 48-period counter, and retaining the human eye sensitive low frequency information. The specific implementation flow steps are as follows:
(1) Quantization table storage and reading: each quantization table is kept in an 8 × 8-bit wide data store, covering the luminance component (Y) and chrominance component (U/V) quantization tables. A column-first parallel processing architecture is adopted: one column (8 elements) of the quantization table is read in each clock period and an element-by-element fixed-point division (equivalent to multiplication by the reciprocal of the quantization table entry) is applied to the corresponding column of the DCT coefficient matrix, so that the quantization of the 8×8 matrix is completed in 8 periods;
(2) Quantization table switching mechanism: the system uses a 0-47 cycle counter as the timing reference, synchronized with the processing rhythm of the DCT module. One 8×8 block is processed every 8 counts, and the luminance table (Y) or the chrominance table (UV) is selected dynamically according to the counter value:
Counter = 0-7: the first 8×8 block of the Y1 component is processed, using the luminance table.
Counter = 8-15: the second 8×8 block of the Y1 component is processed, using the luminance table.
Counter = 16-23: the first 8×8 block of the Y2 component is processed, using the luminance table.
Counter = 24-31: the second 8×8 block of the Y2 component is processed, using the luminance table.
Counter = 32-39: the 8×8 block of the U component is processed, using the chrominance table.
Counter = 40-47: the 8×8 block of the V component is processed, using the chrominance table.
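The counter-driven table selection and the reciprocal-multiply quantization can be sketched as follows. The function names `select_table` and `quantize_column`, the 16-bit reciprocal scaling and the example step size are assumptions for illustration; the text does not state the reciprocal word length or the table contents.

```python
BLOCK_LABELS = ["Y1 block 1", "Y1 block 2", "Y2 block 1", "Y2 block 2",
                "U block", "V block"]


def select_table(cnt):
    """Map a counter value 0..47 to (block label, quantization table name)."""
    if not 0 <= cnt <= 47:
        raise ValueError("counter runs from 0 to 47")
    table = "luma" if cnt <= 31 else "chroma"
    return BLOCK_LABELS[cnt // 8], table


def quantize_column(coeffs, recip_q, shift=16):
    """Quantize one column of 8 coefficients by reciprocal multiplication.

    recip_q holds round(2**shift / q) for each table entry q (assumed scaling).
    The arithmetic shift rounds negative products toward minus infinity.
    """
    return [(c * r) >> shift for c, r in zip(coeffs, recip_q)]


# Hypothetical step size q = 16 for every entry: reciprocal = 2**16 / 16 = 4096.
column = quantize_column([160] * 8, [4096] * 8)
```

One call to `quantize_column` corresponds to one clock period of the column-first architecture, so eight calls cover a block.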
Specifically, the Zigzag pipeline scanning module 1043 is configured to implement Zigzag rearrangement, DPCM differential encoding, and data pipeline processing of an 8×8 quantization coefficient matrix, convert a two-dimensional matrix into a one-dimensional sequence, reduce inter-block redundancy, and ensure continuous output of data. The specific implementation flow steps are as follows:
(1) Matrix cache architecture: an 8-stage row shift register group is adopted; 8 pixels are received at a time and stored in the first row, the earlier rows move along the register chain by one row per period, and a complete 8×8 matrix is formed after 8 periods, ensuring data timing alignment during ZigZag scanning;
(2) ZigZag path decomposition: the matrix is split into 8 subsequences along the ZigZag path of FIG. 4, each containing 8 data items;
(3) Pipeline buffering and output
Multistage buffering: each subsequence is delayed through registers to ensure timing alignment;
Timing control: the subsequences are gated cyclically by a counter, and a complete sequence is output in 8 periods;
Enable delay: the input enable signal is delayed through a 9-stage register to stay synchronized with the data;
Data output: the processed data are written into the FIFO in sequence;
(4) FIFO control
Read enable trigger: read enable is asserted when the amount of data remaining in the FIFO reaches a threshold;
Rhythm control: read enable is controlled by a counter, reading 1 pixel per cycle to ensure continuous data output;
(5) DPCM processing
DC value latching: when the start position of a block is detected, the DC value of the current block is latched into the register of the corresponding component;
Difference calculation: the difference between the current DC and the DC of the previous block is calculated;
Output control: the DC difference is output at the start position of the block, and the AC components are output at the remaining positions;
Component switching: the block type is tracked by a counter, and 4 Y blocks, 1 U block and 1 V block are processed in sequence;
The DC value of each component is managed independently.
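The ZigZag reordering and the per-component DPCM of the DC coefficient can be sketched in software as follows. This models only the data transformation, not the register chain or FIFO timing; the names `zigzag_order`, `DcPredictor` and `scan_group` are hypothetical, and a starting predictor of 0 for each component is an assumption.

```python
def zigzag_order(n=8):
    """Return (row, col) pairs in ZigZag order for an n x n block."""
    order = []
    for s in range(2 * n - 1):                  # s = row + col (anti-diagonal)
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        order.extend(diag if s % 2 else diag[::-1])
    return order


class DcPredictor:
    """Keeps one latched DC value per component."""

    def __init__(self):
        self.prev = {"Y": 0, "U": 0, "V": 0}

    def diff(self, comp, dc):
        d = dc - self.prev[comp]
        self.prev[comp] = dc                    # latch the current block's DC
        return d


BLOCK_COMPONENTS = ["Y", "Y", "Y", "Y", "U", "V"]   # 4 Y blocks, 1 U, 1 V


def scan_group(blocks, predictor):
    """ZigZag-scan six blocks and replace each DC with its DPCM difference."""
    order = zigzag_order()
    out = []
    for comp, block in zip(BLOCK_COMPONENTS, blocks):
        seq = [block[r][c] for r, c in order]
        seq[0] = predictor.diff(comp, seq[0])
        out.append(seq)
    return out


ramp = [[r * 8 + c for c in range(8)] for r in range(8)]
ramp_scan = [ramp[r][c] for r, c in zigzag_order()]
```

Because the predictor is keyed by component, the U and V differences are unaffected by the four Y blocks that precede them in each group.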
Specifically, the shared run-length encoding module 1044 is configured to compress the one-dimensional sequence produced by the ZigZag scan, representing consecutive zero values as "zero run length + non-zero value" to reduce the data amount; it is optimized in particular for the AC components, whose zero values cluster in the high-frequency region. The specific implementation flow steps are as follows:
(1) Data input: the one-dimensional sequence processed by DPCM (DC difference plus AC components) is received;
(2) Zero run count: the counter is incremented by 1 when a 0 is encountered; when a non-zero value is encountered, the current count value and the non-zero value are output and the counter is reset;
(3) Code output: (run length, non-zero value) pairs are generated, and (0, 0) is output at the end of the block as the end-of-block symbol (EOB);
(4) Special processing: an all-zero block outputs the EOB directly, and the DC difference is output separately and does not take part in the zero run count.
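The four run-length steps can be sketched as below. The function name `run_length_encode` is hypothetical, and the sketch follows only what the steps state: it always closes a block with (0, 0) and does not model the handling of zero runs longer than 15 that baseline JPEG requires, since the text does not describe it. Trailing zeros are trimmed with a look-ahead for brevity, which a streaming hardware design would handle differently.

```python
def run_length_encode(seq):
    """Encode one 64-entry block: seq[0] is the DC difference, seq[1:] the AC terms.

    Returns (dc_difference, [(run, value), ..., (0, 0)]).
    """
    dc, ac = seq[0], seq[1:]
    last_nonzero = max((i for i, v in enumerate(ac) if v != 0), default=-1)
    pairs = []
    run = 0
    for v in ac[:last_nonzero + 1]:     # zeros after the last non-zero fold into EOB
        if v == 0:
            run += 1                    # count consecutive zeros
        else:
            pairs.append((run, v))      # emit (run length, non-zero value)
            run = 0
    pairs.append((0, 0))                # end-of-block symbol (EOB)
    return dc, pairs


block = [5, 0, 0, 3, 0, -1] + [0] * 58
dc_diff, ac_pairs = run_length_encode(block)
```

An all-zero AC sequence produces only the EOB pair, matching the special case in step (4).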
Specifically, the four-table shared Huffman coding module 1045 is configured to perform variable length coding on the Y, U, V component data after run-length coding, and allocate codes with different lengths according to the probability of occurrence of the data. The Y component DC/AC and the UV component DC/AC respectively use independent Huffman tables to output the code length (size) and the binary code (code) corresponding to each symbol, thereby realizing lossless compression. The specific implementation flow steps are as follows:
(1) Loading of the four Huffman tables: the Y-DC, Y-AC, UV-DC and UV-AC tables are read from ROM; each table stores the mapping from symbols to (code length, code);
(2) Table lookup coding: the corresponding table is selected according to the component (Y/UV) and the type (DC/AC) of the input data; a DC difference is looked up by its amplitude, and an AC (run length, non-zero value) pair is looked up by its combined key, yielding the corresponding code length and binary code;
(3) Output control: one (size, code) group is output per cycle, where the code length (size) indicates the number of coded bits and the binary code is output left-aligned;
(4) Special symbol processing: when an end of block (EOB) is encountered, a fixed code length and code are output (e.g., size = 2, code = 00).
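The DC branch of the table lookup can be sketched as follows. Several assumptions apply: the ROM is assumed to hold the typical luminance DC table from Annex K of the JPEG standard (the text does not say which tables are stored), the table is rebuilt here from its code-length counts using the canonical code assignment, amplitude bits are appended after the Huffman code as in baseline JPEG, and the names `build_codes`, `Y_DC` and `encode_dc` are hypothetical.

```python
def build_codes(bits, values):
    """Build {symbol: (length, code_string)} from counts of codes per length 1..16."""
    table, code, idx = {}, 0, 0
    for length, count in enumerate(bits, start=1):
        for _ in range(count):
            table[values[idx]] = (length, format(code, f"0{length}b"))
            code += 1
            idx += 1
        code <<= 1                      # move on to the next code length
    return table


# Assumed ROM contents: typical luminance DC table (categories 0..11).
LUMA_DC_BITS = [0, 1, 5, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
Y_DC = build_codes(LUMA_DC_BITS, list(range(12)))


def encode_dc(diff, table):
    """Return (total bit count, bit string) for one DC difference."""
    size = abs(diff).bit_length()       # amplitude category
    length, code = table[size]
    if size == 0:
        return length, code
    # Negative values are sent as the low `size` bits of (diff - 1).
    extra = diff if diff > 0 else diff + (1 << size) - 1
    return length + size, code + format(extra, f"0{size}b")


size_and_code = encode_dc(2, Y_DC)
```

The AC branch would follow the same pattern with a key formed from the run length and the amplitude category, using the Y-AC or UV-AC table.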
Specifically, the merging and encoding module 105 is configured to splice a code length (size) and a binary code (code) output by the Huffman coding module into a continuous bitstream, and generate output according to byte alignment, so as to realize formatting integration of encoded data. The specific implementation flow steps are as follows:
(1) Input buffer architecture: a 32-bit buffer register (able to hold two concatenated codes of the maximum 16-bit length) is used together with a 5-bit counter that records the number of bits currently in the buffer. At initialization both the buffer and the counter are set to 0, ensuring the continuity of codes that cross byte boundaries;
(2) Dynamic bit stream splicing: each code is spliced into the buffer according to its code length, that is, the new code is combined with the buffer after a left shift by the current number of buffered bits, and the buffer and bit count are updated. Codes of up to 16 bits are supported (compatible with the longest Huffman code), and the shift operation keeps the code order correct;
(3) Byte-aligned output control: when the number of bits in the buffer is greater than or equal to 8, the upper 8 bits are taken as the output byte, the buffer is updated to the remaining bits and the counter is decreased by 8. Multi-stage register delays keep the output data synchronized with the enable signal, avoiding timing conflicts;
(4) Cross-byte code processing: for cases where the code length exceeds 8 bits, a state machine with conditional checks splits the code dynamically, that is, a complete byte is output first and the remaining bits are kept for the next splice (for example, with 3 bits in the buffer and a new 7-bit code, the upper 8 bits are output after splicing and the remaining 2 bits are kept). Combinations of different code lengths and residual bit counts are handled with predefined case statements so that the splicing logic covers all scenarios;
(5) End of block and data statistics: when the end-of-block flag is detected, the remaining bits in the buffer are output, padded to a byte boundary, and an end-of-block signal is generated. The total code length is accumulated for subsequent compression ratio analysis or status monitoring.
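The splicing and byte-aligned output steps can be sketched with a small bit packer. This is an MSB-first software model under assumptions: the class name `BitPacker` and its methods are hypothetical, the existing bits are shifted up to make room for each new code, padding with 1 bits at flush time is an assumption (the text only says the bits are padded to a byte boundary), and 0xFF byte stuffing is left out because the text does not describe it.

```python
class BitPacker:
    def __init__(self):
        self.buffer = 0            # pending bits; fewer than 8 remain between calls
        self.count = 0             # number of valid bits in the buffer
        self.out = bytearray()
        self.total_bits = 0        # accumulated code length for statistics

    def put(self, size, code):
        """Append a code of `size` bits (1..16) and emit any complete bytes."""
        assert 0 < size <= 16 and 0 <= code < (1 << size)
        self.buffer = (self.buffer << size) | code
        self.count += size
        self.total_bits += size
        while self.count >= 8:
            self.count -= 8
            self.out.append((self.buffer >> self.count) & 0xFF)   # upper 8 bits
        self.buffer &= (1 << self.count) - 1                      # keep the remainder

    def flush(self, pad_bit=1):
        """Pad the remaining bits to a byte boundary and emit them."""
        if self.count:
            pad = 8 - self.count
            fill = (1 << pad) - 1 if pad_bit else 0
            self.out.append(((self.buffer << pad) | fill) & 0xFF)
            self.buffer = self.count = 0


# The example from step (4): 3 buffered bits plus a new 7-bit code.
packer = BitPacker()
packer.put(3, 0b101)
packer.put(7, 0b1100110)
leftover = packer.count            # 2 bits stay in the buffer
packer.flush()
```

At most 7 + 16 = 23 bits are ever held at once in this model, which fits within the 32-bit buffer described in step (1).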
On the other hand, as shown in fig. 7, the invention also provides a JPEG image compression method based on an FPGA double-matrix sharing pipeline, which is performed by adopting the JPEG image compression system based on the FPGA double-matrix sharing pipeline, and the JPEG image compression method comprises the following steps:
S100, converting YUV444 format data into YUV420 format, and generating a Y component data stream and a UV component data stream subjected to 2:1 horizontal/vertical downsampling;
S200, alternately storing the Y component data stream into Y1 and Y2 buffer areas according to rows to form a double-8-row matrix structure, synchronously storing the UV component data stream, realizing Y1/Y2 double-matrix segmentation of the Y component and 8-row aligned storage of the UV component, and providing standardized 8×8 data blocks for a shared pipeline;
S300, reading data blocks from the Y1 and Y2 cache areas according to the sequence of Y1, Y2, U and V, generating a time sequence control signal of a shared processing pipeline, and outputting according to the sequence of Y1, Y2, U and V to realize pipeline type continuous processing;
S400, based on the time sequence control signals, performing DCT conversion, quantization, zigZag scanning, run length coding and Huffman coding processing on the Y1, Y2 and U, V data blocks in a time-sharing multiplexing mode;
S500, splicing the coded output into a continuous bit stream to generate a compressed data stream conforming to the JPEG standard.
Specifically, converting the YUV444 format data into YUV420 format and generating the Y component data stream and the 2:1 horizontally/vertically downsampled UV component data stream comprises the following steps:
constructing a UV component buffer with a depth of 2 lines to form a 2×2 pixel window matrix;
mean calculation: the 4 UV values in the window are summed, and the sum is right-shifted by 2 bits (fixed-point division by 4) to compute the average efficiently;
aligned data output: a YUV420 format data stream is generated in which the Y resolution is twice the UV resolution, matching the double 8-row matrix segmentation requirement.
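The 2×2 window averaging above can be sketched in software as follows. This is an illustrative model only: the function name `downsample_420` and the representation of a chroma plane as a list of integer rows are assumptions, and each chroma component (U or V) is assumed to be averaged separately.

```python
def downsample_420(plane):
    """Average each 2x2 window of one chroma plane (list of rows of ints)."""
    height, width = len(plane), len(plane[0])
    if height % 2 or width % 2:
        raise ValueError("plane dimensions must be even")
    out = []
    for r in range(0, height, 2):
        top, bottom = plane[r], plane[r + 1]    # models the 2-line buffer
        row = []
        for c in range(0, width, 2):
            total = top[c] + top[c + 1] + bottom[c] + bottom[c + 1]
            row.append(total >> 2)              # right shift by 2 = divide by 4, truncating
        out.append(row)
    return out
```

Because the shift truncates, the result is rounded down; a hardware design could add 2 before shifting if round-to-nearest were preferred, which the source text does not specify.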
Specifically, alternately storing the Y component data stream into the Y1 and Y2 buffer areas by rows to form the double 8-row matrix structure comprises the following steps:
under row-count control, each time 16 rows of Y component data are received, two Y1 and two Y2 data blocks of 8×8 are alternately generated;
the UV component data stream is buffered at a depth of 8 lines and, together with the Y1/Y2 data blocks, forms a standard input set of six 8×8 data blocks.
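To make the block grouping concrete, the following sketch splits one 16×16 luminance region and its 8×8 U and V regions into the six-block input set. It assumes that Y1 holds rows 0 to 7 and Y2 holds rows 8 to 15, and that the region is exactly 16 pixels wide; the function name `split_macroblock` and the list-based data layout are hypothetical, and a real design would stream full image lines rather than a single region.

```python
def split_macroblock(y_rows, u_rows, v_rows):
    """Split a 16x16 Y region plus 8x8 U and V regions into six labelled 8x8 blocks."""
    if len(y_rows) != 16 or any(len(r) != 16 for r in y_rows):
        raise ValueError("expected a 16x16 Y region")
    y1, y2 = y_rows[:8], y_rows[8:]     # assumed: rows 0-7 -> Y1, rows 8-15 -> Y2

    def halves(rows):
        # Left and right 8-column halves of an 8-row matrix.
        return [[r[:8] for r in rows], [r[8:] for r in rows]]

    blocks = [("Y1", b) for b in halves(y1)]
    blocks += [("Y2", b) for b in halves(y2)]
    blocks.append(("U", [list(r) for r in u_rows]))
    blocks.append(("V", [list(r) for r in v_rows]))
    return blocks
```

The returned list is already in the Y1, Y1, Y2, Y2, U, V order that the shared pipeline consumes.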
Specifically, performing DCT transform, quantization, ZigZag scanning, run-length coding and Huffman coding on the Y1, Y2, U, V data blocks in a time-division multiplexing manner based on the timing control signal comprises the following steps:
based on the value of a 48-cycle counter, the luminance quantization table is used while the Y1 and Y2 data blocks are processed in cycles 0-31, and the chrominance quantization table is used while the U and V data blocks are processed in cycles 32-47;
Huffman coding is performed on the data of the different components using four independent code tables (DC and AC tables for the Y component, DC and AC tables for the UV components).
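The table switching above amounts to a lookup keyed on the counter value. The sketch below shows one way to express it; the function name `select_tables` and the table identifiers are placeholders chosen for illustration, not names from the invention.

```python
def select_tables(cycle):
    """Map a 48-cycle counter value to the quantization and Huffman tables in use."""
    if not 0 <= cycle < 48:
        raise ValueError("cycle must be in 0..47")
    if cycle < 32:
        # Cycles 0-31: Y1 and Y2 blocks use the luminance tables.
        return {"quant": "luma", "dc": "luma_dc", "ac": "luma_ac"}
    # Cycles 32-47: U and V blocks use the chrominance tables.
    return {"quant": "chroma", "dc": "chroma_dc", "ac": "chroma_ac"}
```

In hardware the same decision reduces to a comparison of the counter against 32, which drives the multiplexers in front of the quantizer and the Huffman encoder.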
Specifically, performing the DCT transform on the Y1, Y2, U, V data blocks in a time-division multiplexing manner based on the timing control signal comprises the following steps:
Row-column separation: the two-dimensional DCT is decomposed into 8 one-dimensional row transforms and 8 one-dimensional column transforms, namely:
$$F(i,j)=\sum_{x=0}^{7} C_i(x)\left[\sum_{y=0}^{7} f(x,y)\,C_j(y)\right]$$
wherein F(i, j) is the frequency-domain coefficient, f(x, y) is the spatial-domain 8×8 pixel value, and C_i(x) and C_j(y) are cosine basis functions. A one-dimensional DCT unit is called for each row of the 8×8 image block to generate an intermediate frequency-domain matrix; the row transform result is transposed through a dual-port BRAM into column data format; the one-dimensional DCT unit is then called on the transposed column data to output the complete 8×8 frequency-domain coefficient matrix;
Butterfly operation: the Loeffler algorithm is adopted, and the 8-point DCT is computed through a 3-stage butterfly network in which each stage requires only 4 multiplications and 8 additions. The symmetry of the cosine function is used to break the matrix multiplication down into iterated additions, subtractions and a small number of multiplications, which reduces the amount of computation, suits the parallel pipelined nature of the FPGA, and improves processing throughput;
Fixed-point implementation: to avoid the high resource cost of floating-point operations, the floating-point coefficients of the cosine basis functions are converted into 16-bit fixed-point numbers.
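The row-column separation and the fixed-point coefficients can be checked with a short software model. Note that this sketch uses a direct multiply-accumulate form of the one-dimensional DCT in place of the Loeffler butterfly network, so it illustrates the data flow (row pass, transpose, column pass) rather than the reduced operation count. The Q14 scaling (`FRAC_BITS = 14`), the orthonormal scale factors and all function names are assumptions chosen so that every coefficient fits in a signed 16-bit word.

```python
import math

FRAC_BITS = 14  # assumed Q-format; the largest coefficient is about 8035, within 16 bits


def _coeff(i, x):
    """Cosine basis value C_i(x), scaled to an integer (orthonormal DCT-II assumed)."""
    scale = math.sqrt(1 / 8) if i == 0 else math.sqrt(2 / 8)
    return round(scale * math.cos((2 * x + 1) * i * math.pi / 16) * (1 << FRAC_BITS))


C = [[_coeff(i, x) for x in range(8)] for i in range(8)]


def dct_1d(v):
    """8-point DCT by direct multiply-accumulate, rounded back to integers."""
    half = 1 << (FRAC_BITS - 1)
    return [(sum(C[i][x] * v[x] for x in range(8)) + half) >> FRAC_BITS
            for i in range(8)]


def transpose(m):
    """Row/column interchange (the role played by the dual-port BRAM)."""
    return [list(col) for col in zip(*m)]


def dct_2d(block):
    rows = [dct_1d(r) for r in block]             # 8 one-dimensional row transforms
    cols = [dct_1d(c) for c in transpose(rows)]   # 8 one-dimensional column transforms
    return transpose(cols)
```

For a constant block the model concentrates all energy in the DC coefficient, which is a quick sanity check on both the separation and the coefficient scaling.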
Specifically, the basic principle and process of the technical scheme of the invention are as follows:
For YUV420 data characteristics, a vertical matrix segmentation strategy is proposed:
(1) The 16 rows of Y component data are divided into a Y1 matrix (the first 8 rows) and a Y2 matrix (the last 8 rows), and 8 rows of the U/V components are extracted synchronously to form independent matrices;
(2) The Y1/Y2 write enable is toggled cyclically by a 1-bit row counter, so that every 16 rows of Y data received yield two 8×8 Y matrices, one 8×8 U matrix, and one 8×8 V matrix;
(3) Beneficial effects: the data layout directly matches the 8×8 DCT operation, the conventional row-column conversion overhead is eliminated, and the data caching requirement is reduced by 30%.
2. Timing-aligned shared pipeline
A dedicated timing control module is designed to schedule the multi-component data in order:
(1) A 48-cycle state machine outputs six 8×8 data blocks (Y1×2, Y2×2, U×1 and V×1) in the order Y1→Y1→Y2→Y2→U→V, and a DCT transform is performed on each block;
(2) Data reading is triggered based on the UV component FIFO depth threshold, ensuring that the YUV component data stay synchronized;
(3) Shared processing unit: only one set of DCT transform, quantization, ZigZag scanning, RLE coding and Huffman coding modules is needed, and the data blocks of the different components are processed by time-division multiplexing.
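The 48-cycle schedule can be written out as a simple generator. This sketch assumes that each of the six blocks occupies 8 consecutive cycles (48 / 6), with one 8-sample row handled per cycle; that allocation is inferred from the cycle ranges given for the quantization tables and is not stated explicitly. The names `BLOCK_ORDER`, `CYCLES_PER_BLOCK` and `schedule` are illustrative.

```python
BLOCK_ORDER = ("Y1", "Y1", "Y2", "Y2", "U", "V")
CYCLES_PER_BLOCK = 8  # assumed: 48 cycles shared evenly by 6 blocks


def schedule():
    """Yield (cycle, block_index, label, row) for one 48-cycle pass of the state machine."""
    for cycle in range(len(BLOCK_ORDER) * CYCLES_PER_BLOCK):
        index, row = divmod(cycle, CYCLES_PER_BLOCK)
        yield cycle, index, BLOCK_ORDER[index], row
```

Iterating over `schedule()` gives the single shared processing unit exactly one block row per cycle, so no second DCT or coding unit is needed for the chrominance components.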
3. Hardware resource optimization mechanism
Efficient resource multiplexing is achieved through architectural innovation:
(1) Computing circuit: compared with the traditional three-channel architecture, logic units and DSP resources are reduced by more than 50%;
(2) Memory system: the intermediate results of the three components need not be buffered at the same time, and the on-chip SRAM requirement is reduced by 30%;
(3) Power consumption control: the time-division multiplexing mechanism reduces the average power consumption by 40% and the data bus bandwidth requirement by 30%.
4. Complete coding flow optimization
(1) YUV444 to YUV420 conversion: 2×2 block mean-filter downsampling, which preserves the Y component and compresses the UV components;
(2) Two-dimensional DCT transform: a row-column separation algorithm is combined with a Loeffler butterfly network and implemented in 16-bit fixed point;
(3) Quantization and Huffman coding: the luminance/chrominance quantization tables are switched dynamically, and entropy coding is realized with four independent code tables.
Specifically, the invention forms a complete and efficient JPEG coding system by optimizing data organization, processing flow and time sequence control, remarkably reduces the computational complexity and the data carrying cost, and is particularly suitable for embedded image compression application scenes with limited resources.
In particular, compared with the prior art, the invention has the following advantages:
The invention achieves efficient resource multiplexing by using the serial processing order Y1, Y2, U, V:
(1) Hardware area saving: only one DCT/quantization/ZigZag/Huffman coding processing unit is needed (the traditional scheme needs 3 parallel units), the computing circuit area is reduced by more than 50%, the control logic is simplified, and the timing synchronization complexity is reduced.
(2) Power consumption reduction: time-division multiplexing of the computing unit reduces the average power consumption by about 40%, and the lower data bus bandwidth requirement reduces memory access power consumption.
(3) Memory requirement reduction: the intermediate results of the three Y, U, V components need not be cached simultaneously, which reduces the memory requirement by about 30%.
(4) Complexity optimization: the data path is simplified, FPGA/ASIC routing difficulty is reduced, and the development cycle is shortened by about 25%.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention, and all equivalent structural changes made by the description of the present invention and the accompanying drawings or direct/indirect application in other related technical fields are included in the scope of the invention.