CN120602650A

CN120602650A - JPEG image compression system and method based on an FPGA dual-matrix shared pipeline

Info

Publication number: CN120602650A
Application number: CN202511109247.1A
Authority: CN
Inventors: 欧洋; 李洪威; 张俊佳
Original assignee: Beijing Zhongxing Times Technology Co ltd
Current assignee: Beijing Zhongxing Times Technology Co ltd
Priority date: 2025-08-08
Filing date: 2025-08-08
Publication date: 2025-09-05
Anticipated expiration: 2045-08-08

Abstract

The present invention discloses a JPEG image compression system and method based on an FPGA dual-matrix shared pipeline. The JPEG image compression system includes a YUV420 preprocessing module, a dual 8-row matrix segmentation storage module, a shared pipeline timing scheduling module, a shared processing module, and a merge encoding module. The present invention also proposes a JPEG image compression method based on an FPGA dual-matrix shared pipeline. The technical solution of the present invention can break through the traditional independent module architecture. Through the dual 8-row matrix segmentation and timing alignment shared pipeline design, it realizes efficient time-sharing multiplexing of core modules such as DCT transformation, quantization, ZigZag scanning, and Huffman encoding for the three YUV components. By optimizing data organization, processing flow, and timing control, a complete and efficient JPEG encoding system is formed, which significantly reduces computational complexity and data handling overhead. It is particularly suitable for resource-constrained embedded image compression application scenarios.

Description

JPEG image compression system and method based on an FPGA dual-matrix shared pipeline

Technical Field

The invention relates to the technical field of image processing and data compression, in particular to a JPEG image compression system and method based on an FPGA double-matrix sharing pipeline.

Background

Existing image compression algorithms fall into two main categories: lossless compression and lossy compression. Lossless compression relies purely on the statistical properties of the data stream for encoding and loses no image information, but it often cannot meet high compression ratio requirements. Lossy compression discards certain redundant or perceptually insensitive information during compression; subsequent image processing is not unduly affected, and a high compression ratio can be obtained. JPEG is the most widely applied lossy compression algorithm. It offers high compression efficiency and is simple to implement, and it is widely used in digital photography, network transmission, remote sensing communication and other fields.

At present, existing FPGA-based JPEG image compression schemes generally adopt a component-independent processing architecture for the YUV color space, in which the luminance component (Y) and the chrominance components (U, V) are each processed by their own DCT module, Zig-Zag scanning module, quantization module and Huffman coding module. This architecture suffers from the following significant drawbacks:

Hardware resource redundancy: each component must be configured with a complete group of processing modules. Taking 8-bit image data as an example, a single DCT module needs 64 multipliers and 192 adders, and configuring the three components independently triples the resource usage, which seriously increases the consumption of FPGA logic elements (LE), storage units (RAM) and multipliers (DSP).

Isolated coding modules: Huffman coding is the core entropy coding stage, but existing schemes do not consider the statistical correlation among the YUV component code streams. Independent coding leads to duplicated code table caches (3 sets of independent code tables must be stored) and duplicated coding control logic, which wastes further resources.

Disclosure of Invention

The main object of the invention is to provide a JPEG image compression system and method based on an FPGA dual-matrix shared pipeline. For image data compression in the YUV color space, a dual 8-row matrix segmentation architecture and a timing-aligned shared pipeline design allow the three components Y, U and V to share, by efficient time-division multiplexing, core modules such as the discrete cosine transform (DCT), ZigZag scanning, quantization and Huffman coding, thereby solving the technical problem of hardware resource redundancy in conventional schemes.

In order to achieve the above object, the JPEG image compression system based on the FPGA double-matrix shared pipeline according to the present invention includes:

The YUV420 preprocessing module is used for converting YUV444 format data into YUV420 format data, completely reserving a Y component, performing 2:1 horizontal/vertical downsampling on a UV component, and generating a Y component data stream and a UV component data stream subjected to 2:1 horizontal/vertical downsampling;

The double 8-row matrix segmentation storage module is used for alternately storing the Y component data streams into Y1 and Y2 cache areas according to rows to form a double 8-row matrix structure, synchronously storing the UV component data streams to obtain Y1, Y2 and U, V data blocks, and realizing Y1/Y2 double matrix segmentation of the Y component and 8-row alignment storage of the UV component;

the shared pipeline time sequence scheduling module is used for reading Y1, Y2 and U, V data blocks from the Y1 and Y2 buffer areas according to the sequence of Y1-Y2-U-V, generating a time sequence control signal of a shared processing pipeline, realizing pipeline type continuous processing and converting parallel multi-component data into serial data streams;

the shared processing module is used for performing DCT transform, quantization, ZigZag scanning, run-length coding and Huffman coding on the Y1, Y2, U and V data blocks in a time-division multiplexing manner according to the timing control signal;

and the merging and encoding module is used for splicing the encoded output into a continuous bit stream, generating output according to byte alignment, realizing the formatting integration of encoded data and generating a compressed data stream conforming to the JPEG standard.

Optionally, by means of row count control and a split counter, the dual 8-row matrix segmentation storage module alternately generates two 8×8 data blocks each for Y1 and Y2 every time 16 rows of Y component data are received. The UV component data stream is buffered at a depth of 8 rows and, together with the Y1 and Y2 data blocks, forms a standard input group of six 8×8 data blocks, providing standardized 8×8 data blocks to the shared pipeline.

Optionally, the shared pipeline timing scheduling module adopts a three-state state machine, which includes:

IDLE state: when the amount of data in the UV component FIFO is detected to be greater than or equal to 8, jump to the read enable generation state;

GEN_RD_EN state: generate read enable signals in the order Y1→Y2→U→V by means of a 48-period counter;

WAIT_BACK state: monitor the back-end FIFO empty flag to ensure that the shared pipeline has no backlog.

Optionally, the sharing processing module includes:

The shared two-dimensional DCT conversion module is used for decomposing the two-dimensional DCT into row conversion and column conversion through a row-column separation algorithm, realizing 8-point DCT calculation through a 3-level butterfly network by adopting a Loeffler algorithm, and converting a floating point coefficient of a cosine base function into a 16-bit fixed point number;

The dynamic quantization module is used for executing non-uniform quantization on DCT coefficients based on the time sequence control signals, dynamically switching the brightness/chromaticity quantization table and reserving human eye sensitive low-frequency information;

The Zigzag pipeline scanning module is used for rearranging the quantized 8 multiplied by 8 quantization coefficient matrix into a one-dimensional sequence according to the Zigzag path and performing DPCM differential encoding on the DC coefficient;

The shared run length coding module is used for compressing the one-dimensional sequence scanned by the ZigZag, executing zero run length coding and reducing the data quantity;

The four-table shared Huffman coding module is used for performing variable-length coding on the run-length-coded Y, U and V component data using four sets of independent DC/AC code tables, assigning codes of different lengths according to the occurrence probability of the data so as to realize lossless compression.

Optionally, the dynamic quantization module dynamically switches the luminance/chrominance quantization table according to a 48-period counter, wherein:

The counter processes Y1 and Y2 data blocks in 0-31 period and uses brightness quantization table;

The counter processes U, V blocks of data at 32-47 cycles, using the chroma quantization table.

On the other hand, the invention also provides a JPEG image compression method based on the FPGA double-matrix sharing pipeline, which is performed by adopting the JPEG image compression system based on the FPGA double-matrix sharing pipeline, and comprises the following steps:

converting YUV444 format data into YUV420 format, and generating Y component data stream and UV component data stream subjected to 2:1 horizontal/vertical downsampling;

The Y component data stream is alternately stored in Y1 and Y2 buffer areas according to rows to form a double-8-row matrix structure, UV component data streams are synchronously stored, Y1/Y2 double-matrix segmentation of the Y component and 8-row aligned storage of the UV component are realized, and standardized 8X 8 data blocks are provided for a shared pipeline;

Reading data blocks from the Y1 and Y2 buffer areas according to the sequence of Y1, Y2, U and V, generating a time sequence control signal of a shared processing pipeline, and outputting according to the sequence of Y1, Y2, U and V to realize pipeline type continuous processing;

Based on the time sequence control signal, performing DCT conversion, quantization, zigZag scanning, run length coding and Huffman coding processing of Y1, Y2 and U, V data blocks in a time-sharing multiplexing mode;

The encoded outputs are spliced into a continuous bit stream, generating a compressed data stream conforming to the JPEG standard.

Optionally, the converting YUV444 format data into YUV420 format, generating the Y component data stream and the UV component data stream subjected to 2:1 horizontal/vertical downsampling includes the following steps:

Cache construction: constructing a UV component cache with a depth of 2 rows to form a 2×2 pixel window matrix;

Mean calculation: summing the 4 UV values in the window and computing the average efficiently by a right shift of 2 bits;

Data-aligned output: generating a YUV420 format data stream in which the Y resolution is twice the UV resolution in each dimension, matching the dual 8-row matrix segmentation requirement.

Optionally, storing the Y component data stream alternately in rows in Y1 and Y2 buffers to form a dual 8-row matrix structure, including the steps of:

by row count control, every time 16 rows of Y component data are received, two 8×8 data blocks each for Y1 and Y2 are alternately generated;

the UV component data stream is buffered at a depth of 8 rows and forms a standard input group of six 8×8 data blocks together with the Y1/Y2 data blocks.

Optionally, performing the DCT transform, quantization, ZigZag scanning, run-length encoding and Huffman encoding of the Y1, Y2, U and V data blocks by time-division multiplexing based on the timing control signal includes the following steps:

Based on the 48-period counter value, the luminance quantization table is used when processing Y1, Y2 data blocks in 0-31 periods, and the chrominance quantization table is used when processing U, V data blocks in 32-47 periods;

Huffman coding is performed on the different component data based on four sets of independent code tables (DC/AC for the Y and UV components).

Optionally, the performing the DCT transform of the Y1, Y2, U, V data blocks based on the timing control signal by time-division multiplexing includes the steps of:

Row-column separation: the two-dimensional DCT is decomposed into 8 one-dimensional row transforms and 8 one-dimensional column transforms, namely:

Where f(x, y) is a pixel value of the 8×8 spatial-domain block and C_i(x), C_j(y) are cosine basis functions. A one-dimensional DCT unit is called for each row of the 8×8 image block to generate an intermediate frequency-domain matrix; the row transform result is transposed (rows and columns interchanged) through a dual-port BRAM into a column data format; the one-dimensional DCT unit is then called on the transposed column data to output the complete 8×8 frequency-domain coefficient matrix;

Butterfly operation: the Loeffler algorithm is adopted and the 8-point DCT calculation is completed through a 3-level butterfly network; each level needs only 4 multiplications and 8 additions, and the symmetry of the cosine function is used to break the matrix multiplication down into iterated additions, subtractions and a small number of multiplications;

Fixed-point implementation: the floating-point coefficients of the cosine basis functions are converted into 16-bit fixed-point numbers.

The technical solution of the invention has the following advantages. It breaks through the traditional independent-module architecture: through dual 8-row matrix segmentation and a timing-aligned shared pipeline, the three YUV components share core modules such as DCT transform, quantization, ZigZag scanning and Huffman coding by efficient time-division multiplexing. Compared with the traditional three-channel architecture, the computing circuit uses more than 50% fewer logic units and DSP resources. The storage system does not need to buffer the intermediate results of all three components at the same time, which reduces the on-chip SRAM requirement by 30%. In terms of power consumption, the time-division multiplexing mechanism reduces average power consumption by 40% and the data bus bandwidth requirement by 30%. In terms of implementation complexity, the data path is simplified, FPGA/ASIC routing difficulty is reduced, and the development period is shortened by about 25%. By optimizing data organization, processing flow and timing control, the invention forms a complete and efficient JPEG encoding system that significantly reduces computational complexity and data handling overhead, and it is particularly suitable for resource-constrained embedded image compression scenarios.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to the structures shown in these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of an overall module frame structure of a JPEG image compression system based on an FPGA double-matrix sharing pipeline according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a partial module frame structure of a JPEG image compression system based on an FPGA double-matrix sharing pipeline according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating the operation of JPEG image compression in a JPEG image compression system based on FPGA double matrix shared pipeline according to an embodiment of the present invention;

FIG. 4 is a flowchart of a YUV data storage module of a JPEG image compression system based on an FPGA double matrix shared pipeline according to an embodiment of the present invention;

FIG. 5 is a flow chart of the state machine of the shared pipeline timing scheduling module of a JPEG image compression system based on an FPGA dual-matrix shared pipeline according to an embodiment of the present invention;

FIG. 6 is a ZigZag path diagram of a JPEG image compression system based on an FPGA double matrix shared pipeline according to an embodiment of the present invention;

FIG. 7 is a schematic flow chart of a JPEG image compression method based on an FPGA dual-matrix shared pipeline according to an embodiment of the present invention.

The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

It should be noted that all directional indicators used in the embodiments of the present invention (such as up, down, left, right, front and rear) merely explain the relative positional relationship, movement and the like between the components in a certain specific posture (as shown in the drawings); if the specific posture changes, the directional indicators change accordingly.

In addition, the technical solutions of the embodiments may be combined with each other, provided that the combination can be realized by those skilled in the art; when a combination of technical solutions is contradictory or cannot be realized, the combination should be considered not to exist and falls outside the scope of protection claimed by the present invention.

The invention provides a JPEG image compression system and method based on an FPGA double-matrix sharing pipeline.

As shown in FIGS. 1 to 6, in an embodiment of the present invention, the JPEG image compression system based on the FPGA dual-matrix shared pipeline includes:

the YUV420 preprocessing module 101 is configured to convert YUV444 format data into YUV420 format data, completely reserve a Y component, perform 2:1 horizontal/vertical downsampling on a UV component, and generate a Y component data stream and a UV component data stream subjected to 2:1 horizontal/vertical downsampling;

The dual 8-row matrix segmentation storage module 102 is used for alternately storing Y component data streams into Y1 and Y2 cache areas according to rows to form a dual 8-row matrix structure, synchronously storing UV component data streams to obtain Y1, Y2 and U, V data blocks, and realizing Y1/Y2 dual matrix segmentation of Y components and 8-row aligned storage of UV components;

The shared pipeline time sequence scheduling module 103 is used for reading Y1, Y2 and U, V data blocks from the Y1 and Y2 buffer areas according to the sequence of Y1, Y2, U and V, generating a time sequence control signal of a shared processing pipeline, realizing pipeline type continuous processing and converting parallel multicomponent data into serial data streams;

The shared processing module 104 is configured to perform DCT transform, quantization, ZigZag scanning, run-length encoding and Huffman encoding on the Y1, Y2, U and V data blocks in a time-division multiplexing manner according to the timing control signal;

And the merging and encoding module 105 is used for splicing the encoded output into a continuous bit stream, generating output according to byte alignment, realizing the formatting integration of encoded data and generating a compressed data stream conforming to the JPEG standard.

The JPEG image compression system based on the FPGA dual-matrix shared pipeline constructs a JPEG compression architecture built around dual 8-row matrix segmentation and a timing-aligned shared pipeline, and realizes efficient compression of the three YUV components through the cooperation of 9 modules; the JPEG image compression flow is shown in FIG. 3.

Specifically, the YUV420 preprocessing module 101 (dual matrix preparation) is configured to implement format conversion from YUV444 to YUV420, completely reserve the Y component, perform 2:1 horizontal/vertical downsampling on the UV component, and use 2×2 block mean filtering to ensure chroma smoothing, and provide preprocessed data for dual 8-row matrix segmentation, where the implementation process specifically includes the following steps:

(1) Cache construction: a UV component cache with a depth of 2 rows is constructed to form a 2×2 pixel window matrix;

(2) Mean calculation: the 4 UV values in the window are summed, and the average is computed efficiently by a right shift of 2 bits (fixed-point division by 4);

(3) Data-aligned output: a YUV420 format data stream is generated in which the Y resolution is twice the UV resolution in each dimension, matching the dual 8-row matrix segmentation requirement.
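
The 2×2 mean filtering described in these steps can be illustrated with a short software sketch. It is only a behavioural model under stated assumptions: the function name `downsample_2x2` is hypothetical, the chroma plane is given as a list of rows of integers, and its width and height are assumed to be even. It is not the hardware implementation of the embodiment.

```python
def downsample_2x2(plane):
    """Average each 2x2 window of a chroma plane with integer add and shift.

    Assumption: `plane` is a list of equal-length rows of integers whose
    height and width are both even.
    """
    height, width = len(plane), len(plane[0])
    out = []
    for r in range(0, height, 2):           # the 2-row cache supplies rows r and r+1
        row = []
        for c in range(0, width, 2):
            total = (plane[r][c] + plane[r][c + 1]
                     + plane[r + 1][c] + plane[r + 1][c + 1])
            row.append(total >> 2)          # right shift by 2 bits: truncating divide by 4
        out.append(row)
    return out


if __name__ == "__main__":
    u_plane = [[10, 20, 30, 40],
               [10, 20, 30, 40]]
    print(downsample_2x2(u_plane))          # one output row, half the input width
```

Applying the same function to the U plane and the V plane halves both dimensions, while the Y plane is passed through unchanged.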

Specifically, the dual 8-row matrix segmentation storage module 102 realizes the Y1/Y2 dual-matrix segmentation of the Y component and the 8-row aligned storage of the UV components through independent caching, row count judgment, write enable triggering and FIFO write control. It provides standardized 8×8 data blocks for the shared pipeline and supports efficient operation of the subsequent encoding and algorithm processing stages. The flowchart is shown in FIG. 4, and the specific steps are as follows:

(1) Parallel channel input: Y, U and V component data are input continuously through independent channels. When i_de_Y is valid, i_data_Y is written row by row into the Y cache 8×8 sliding window; similarly, data are written into the U/V cache 8×8 sliding windows through i_de_U/i_data_U and i_de_V/i_data_V;

(2) Row count and write enable control: the Y component triggers row counting on the rising edge of i_de_Y, and the 3-bit counter row_cnt_Y cycles through 0-7. When row_cnt_Y = 7, o_de_Y = i_de_Y, and the counter automatically resets to zero for the next row count period. The UV components work the same way as the Y component, outputting o_de_U = i_de_U and o_de_V = i_de_V when the row count reaches 7;

(3) Y component double-buffer discrimination: the falling edges of o_de_Y are counted, and Y1/Y2 switching is realized with a 1-bit counter row_cnt_Y (cycling 0/1). When row_cnt_Y = 0, o_de_Y1 = o_de_Y; when row_cnt_Y = 1, o_de_Y2 = o_de_Y. The first 8 rows and the last 8 rows of Y are thus written into their respective FIFOs, realizing the dual 8-row matrix splitting strategy;

(4) Data bit width and transmission period: the single-pixel bit width is 8 bits/pixel, conforming to the YUV420 standard; the bus bit width is 64 bits (8 pixels × 8 bits/pixel); 1 row of data is transmitted per clock cycle, and writing one 8×8 data block is completed in 8 cycles.
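
The resulting data organization can be sketched in software as follows. This is a simplified model, not the FIFO and counter logic itself: it assumes a single 16×16 luminance region held as a list of 16 rows, and the function name `split_dual_8row` is hypothetical. Rows 0-7 are routed to Y1 and rows 8-15 to Y2, and each 8-row half is cut into two 8×8 blocks.

```python
def split_dual_8row(y_rows):
    """Split 16 rows of 16 luminance samples into Y1 and Y2 block pairs.

    Assumption: `y_rows` is a list of 16 rows, each with 16 samples.
    Returns (y1_blocks, y2_blocks), each a list of two 8x8 blocks.
    """
    if len(y_rows) != 16 or any(len(row) != 16 for row in y_rows):
        raise ValueError("expected a 16x16 luminance region")

    def to_blocks(rows8):
        left = [row[0:8] for row in rows8]      # first 8 pixels of each row
        right = [row[8:16] for row in rows8]    # last 8 pixels of each row
        return [left, right]

    y1_blocks = to_blocks(y_rows[:8])           # first 8 rows go to the Y1 buffer
    y2_blocks = to_blocks(y_rows[8:])           # last 8 rows go to the Y2 buffer
    return y1_blocks, y2_blocks


if __name__ == "__main__":
    region = [[r * 16 + c for c in range(16)] for r in range(16)]
    y1, y2 = split_dual_8row(region)
    print(len(y1) + len(y2), "luminance blocks of size 8x8")
```

Together with one 8×8 U block and one 8×8 V block, the four luminance blocks make up the group of six 8×8 blocks handed to the shared pipeline.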

Specifically, the shared pipeline timing scheduling module 103 is a dedicated timing scheduling module designed for the YUV420 16×16 matrix operation requirement of the JPEG compression process. It converts the parallel multi-component data (Y1/Y2/U/V matrices) into a serial data stream and outputs it in the order Y1→Y1→Y2→Y2→U→V, realizing pipelined continuous processing, eliminating the data waiting bottleneck and ensuring orderly and efficient processing of each component's data.

The function is implemented with a three-state state machine (IDLE→GEN_RD_EN→WAIT_BACK); the state machine flow is shown in FIG. 5, and the specific steps are as follows:

(1) Initial state (IDLE)

Function: wait for data to be ready, ensuring that 8×8 data are stored in both the UV component and Y component FIFOs.

Trigger condition: owing to the YUV420 sampling characteristic, the amount of Y component data is 4 times that of U or V, so the UV components fill their FIFOs more slowly than the Y component. Only the amount of data in the U component FIFO therefore needs to be checked; when it is greater than or equal to 8, the state machine jumps to the read enable generation state (GEN_RD_EN).

(2) Read enable generation state (GEN_RD_EN)

Function: generate the read enable signal of each component in the order Y1→Y2→U→V to control the data sequence, strictly matching the 16×16 matrix processing flow of JPEG compression, ensuring that each 8×8 data block is read completely and avoiding data misalignment.

Trigger condition: a single read pass must cover six 8×8 blocks (Y1×2 + Y2×2 + U×1 + V×1), 384 pixels in total. With a 64-bit bus and 8 cycles per block, the 6 blocks take 48 cycles, so the counter cnt counts from 0 to 47. When cnt = 47 (the 48-cycle read is complete), the state machine jumps to WAIT_BACK to wait for the back-end FIFO to empty. The specific relationship between the cnt range and the read enable control is as follows:

(3) Wait for back-end FIFO empty state (WAIT_BACK)

Function: a flow control mechanism that prevents the back-end FIFO from overflowing, avoids data backlog caused by slow back-end processing and keeps the pipeline running continuously and smoothly.

Trigger condition: the empty flag (empty_back) of the back-end FIFO is monitored, and the state machine returns to the initial state to prepare for the next round of reading only when the back end has enough space.
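
The three states and their transitions can be modelled cycle by cycle as in the sketch below. Several details are assumptions made for illustration: the class and method names (`Scheduler`, `step`, `read_target`) are hypothetical, the FIFO level is taken to be counted in 64-bit rows, and the mapping from cnt ranges to components (0-15 Y1, 16-31 Y2, 32-39 U, 40-47 V) is inferred from the counter schedule given for the quantization table switching rather than from a table in this section.

```python
from enum import Enum


class State(Enum):
    IDLE = 0
    GEN_RD_EN = 1
    WAIT_BACK = 2


def read_target(cnt):
    """Component whose read enable is active at counter value cnt (assumed mapping)."""
    if cnt < 16:
        return "Y1"
    if cnt < 32:
        return "Y2"
    if cnt < 40:
        return "U"
    return "V"


class Scheduler:
    """Behavioural model of the IDLE -> GEN_RD_EN -> WAIT_BACK scheduler."""

    def __init__(self):
        self.state = State.IDLE
        self.cnt = 0

    def step(self, u_fifo_level, empty_back):
        """Advance one clock cycle; return the active read enable or None."""
        if self.state is State.IDLE:
            if u_fifo_level >= 8:            # U FIFO holds at least 8 rows
                self.state = State.GEN_RD_EN
                self.cnt = 0
            return None
        if self.state is State.GEN_RD_EN:
            target = read_target(self.cnt)
            if self.cnt == 47:               # 48-cycle read pass finished
                self.state = State.WAIT_BACK
            else:
                self.cnt += 1
            return target
        if empty_back:                       # WAIT_BACK: back end has drained
            self.state = State.IDLE
        return None


if __name__ == "__main__":
    sched = Scheduler()
    enables = [sched.step(u_fifo_level=8, empty_back=False) for _ in range(50)]
    print(enables[1], enables[17], enables[33], enables[41], sched.state.name)
```

In this model one cycle is spent leaving IDLE, the next 48 cycles each assert exactly one read enable, and the scheduler then stays in WAIT_BACK until `empty_back` is asserted.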

Specifically, the shared processing module 104 includes a shared two-dimensional DCT transformation module 1041, a dynamic quantization module 1042, a ZigZag pipeline scanning module 1043, a shared run-length encoding module 1044, and a four-table shared Huffman encoding module 1045.

Specifically, the shared two-dimensional DCT transform module 1041 is configured to implement the two-dimensional discrete cosine transform (2D-DCT) of an 8×8 image block, converting spatial-domain pixels into frequency-domain coefficients, and to achieve energy concentration, parallel computation and resource optimization through a row-column separation algorithm, the Loeffler algorithm and fixed-point processing. The specific implementation steps are as follows:

(1) Row-column separation algorithm: the two-dimensional DCT is decomposed into 8 one-dimensional row transforms and 8 one-dimensional column transforms, namely:

Where f(x, y) is a pixel value of the 8×8 spatial-domain block, and C_i(x), C_j(y) are cosine basis functions.
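
For reference, a row-column decomposition consistent with the symbols defined here can be written as follows. This is a sketch under the assumption that C_i(x) and C_j(y) denote the standard orthonormal 8-point DCT-II basis functions with their normalization factors folded in; the symbols F, k, n and α are introduced only for illustration and are not taken from the text.

```latex
F(i,j) \;=\; \sum_{x=0}^{7} C_i(x) \left[ \sum_{y=0}^{7} f(x,y)\, C_j(y) \right],
\qquad
C_k(n) \;=\; \alpha(k)\,\cos\frac{(2n+1)k\pi}{16},
\qquad
\alpha(0)=\sqrt{\tfrac{1}{8}},\quad \alpha(k)=\tfrac{1}{2}\ \ (k=1,\dots,7)
```

The inner sum is the one-dimensional transform applied along each row of f, and the outer sum is the same one-dimensional transform applied along each column of the intermediate result. Under this normalization α(0) ≈ 0.3536.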

Each row of the 8×8 image block is passed to the one-dimensional DCT unit to generate an intermediate frequency-domain matrix; the row transform result is transposed (rows and columns interchanged) through a dual-port BRAM into a column data format; the transposed column data are then passed to the one-dimensional DCT unit, which outputs the complete 8×8 frequency-domain coefficient matrix;
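
The row pass, transpose and column pass can be checked against a direct double-sum evaluation with a floating-point sketch. The assumptions are as follows: the basis is the standard orthonormal 8-point DCT-II, all function names are hypothetical, and the one-dimensional transform is evaluated directly rather than with the Loeffler butterfly network used in the embodiment.

```python
import math

N = 8


def basis(k, n):
    """Orthonormal DCT-II basis value for frequency k at sample n (assumed form)."""
    alpha = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
    return alpha * math.cos((2 * n + 1) * k * math.pi / (2 * N))


def dct_1d(vec):
    """Direct 8-point 1-D DCT; stands in for the shared one-dimensional DCT unit."""
    return [sum(basis(k, n) * vec[n] for n in range(N)) for k in range(N)]


def transpose(mat):
    """Row/column interchange, the role played by the dual-port BRAM."""
    return [list(col) for col in zip(*mat)]


def dct_2d_separable(block):
    """2-D DCT as a row pass, a transpose, and a second pass over the columns."""
    row_pass = [dct_1d(row) for row in block]
    col_pass = [dct_1d(col) for col in transpose(row_pass)]
    return transpose(col_pass)


def dct_2d_direct(block):
    """Reference 2-D DCT evaluated as a full double sum."""
    return [[sum(block[x][y] * basis(i, x) * basis(j, y)
                 for x in range(N) for y in range(N))
             for j in range(N)] for i in range(N)]


if __name__ == "__main__":
    flat = [[128] * N for _ in range(N)]
    coeffs = dct_2d_separable(flat)
    print(round(coeffs[0][0], 6), round(abs(coeffs[3][5]), 6))
```

For a flat block all energy lands in the DC coefficient, which illustrates the energy concentration mentioned above; the separable form needs 16 one-dimensional transforms instead of a 64-term sum per coefficient.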

(2) Butterfly operation: the 8-point DCT calculation is completed through a 3-level butterfly network using the Loeffler algorithm. Each level needs only 4 multiplications and 8 additions; the symmetry of the cosine function is used to break the matrix multiplication down into iterated additions, subtractions and a small number of multiplications, which reduces the amount of computation, matches the parallel pipeline characteristics of the FPGA and improves processing throughput;

(3) Fixed-point implementation: to avoid the high resource consumption of floating-point arithmetic, the floating-point coefficients of the cosine basis functions are converted into 16-bit fixed-point numbers. For example, the fixed-point number 0xB5 (decimal 181) approximates the floating-point coefficient C(0) = 0.3536 via 0.3536 × 2^9 = 181.0432.
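
The coefficient conversion in this step can be reproduced with a few lines of integer arithmetic. The sketch assumes 9 fractional bits, as implied by the 2^9 scale factor in the example, and round-to-nearest when scaling back; the helper names `to_fixed` and `fixed_mul` are hypothetical, and the rounding mode of the actual hardware is not stated in the text.

```python
FRAC_BITS = 9  # scale factor 2^9, as in the 0xB5 example


def to_fixed(value, frac_bits=FRAC_BITS):
    """Convert a floating-point coefficient to a scaled integer (round to nearest)."""
    return int(round(value * (1 << frac_bits)))


def fixed_mul(sample, coeff_fixed, frac_bits=FRAC_BITS):
    """Multiply an integer sample by a fixed-point coefficient and rescale.

    Assumption: half an LSB is added before the shift so that the result is
    rounded rather than truncated.
    """
    return (sample * coeff_fixed + (1 << (frac_bits - 1))) >> frac_bits


if __name__ == "__main__":
    c0 = to_fixed(0.3536)
    print(hex(c0), c0)
    print(fixed_mul(1024, c0))   # integer stand-in for 1024 * 0.3536
```

Only an integer multiplier, an adder and a shift are needed per coefficient, which is what lets the DCT map onto DSP slices without floating-point logic.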

Specifically, the dynamic quantization module 1042 (shared table switching) is used for performing non-uniform quantization on the DCT coefficients, dynamically switching the luminance/chrominance quantization table according to the 48-period counter, and retaining the human eye sensitive low frequency information. The specific implementation flow steps are as follows:

(1) Quantization table storage and reading: each quantization table is held in an 8×8 data store, and both the luminance component (Y) table and the chrominance component (U/V) table are stored. A column-first parallel processing architecture is adopted: one column (8 elements) of the quantization table is read in each clock cycle, and an element-wise fixed-point division (implemented as multiplication by the reciprocal of the quantization table entry) is applied to the corresponding column of the DCT coefficient matrix, so that quantization of the 8×8 matrix is completed in 8 cycles;

(2) Quantization table switching mechanism: the system uses a 0-47 cycle counter as the timing reference, synchronized with the processing rhythm of the DCT module. One 8×8 block is processed every 8 counts, and the luminance table (Y) or the chrominance table (UV) is selected dynamically according to the counter value:

Counter = 0-7: the first 8×8 block of the Y1 component is processed, using the luminance table.

Counter = 8-15: the second 8×8 block of the Y1 component is processed, using the luminance table.

Counter = 16-23: the first 8×8 block of the Y2 component is processed, using the luminance table.

Counter = 24-31: the second 8×8 block of the Y2 component is processed, using the luminance table.

Counter = 32-39: the 8×8 block of the U component is processed, using the chrominance table.

Counter = 40-47: the 8×8 block of the V component is processed, using the chrominance table.
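
The counter-driven table selection and the reciprocal-multiply form of quantization can be sketched as below. The function names, the 16-bit reciprocal precision and the rounding behaviour are assumptions made for illustration; the text only states that division is implemented as multiplication by the reciprocal of the table entry.

```python
def table_for_cycle(cnt):
    """Select the quantization table for counter value cnt (0-47)."""
    if not 0 <= cnt <= 47:
        raise ValueError("counter must be in 0..47")
    return "luma" if cnt <= 31 else "chroma"   # 0-31: Y1/Y2 blocks, 32-47: U/V blocks


def quantize(coef, q, frac_bits=16):
    """Divide coef by table entry q using a precomputed fixed-point reciprocal.

    Assumption: the reciprocal is round(2^frac_bits / q) and the product is
    rounded to nearest on the magnitude, with the sign restored afterwards.
    """
    recip = ((1 << frac_bits) + q // 2) // q
    magnitude = (abs(coef) * recip + (1 << (frac_bits - 1))) >> frac_bits
    return -magnitude if coef < 0 else magnitude


if __name__ == "__main__":
    for cnt in (0, 31, 32, 47):
        print(cnt, table_for_cycle(cnt))
    print(quantize(160, 16), quantize(-35, 11))
```

In hardware the reciprocals would be stored in place of the table entries, so each coefficient costs one multiplication and one shift instead of a divider.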

Specifically, the ZigZag pipeline scanning module 1043 is configured to implement ZigZag rearrangement, DPCM differential encoding and pipelined data processing of the 8×8 quantized coefficient matrix, converting the two-dimensional matrix into a one-dimensional sequence, reducing inter-block redundancy and ensuring continuous data output. The specific implementation steps are as follows:

(1) Matrix cache architecture: an 8-stage row shift register group is adopted; 8 data elements are received at a time and stored in the first row, and each subsequent row is updated one cycle later through the register chain, so that a complete 8×8 matrix is formed after 8 cycles, ensuring data timing alignment during ZigZag scanning;

(2) ZigZag path decomposition: the matrix is split into 8 subsequences along the ZigZag path of FIG. 6, each containing 8 data elements;

(3) Pipeline buffering and output

Multistage buffering: each subsequence is delayed through registers to ensure timing alignment;

Timing control: the subsequences are gated cyclically by a counter, and a complete sequence is output in 8 cycles;

Enable delay: the input enable signal is delayed through a 9-stage register to keep it synchronized with the data;

Data output: the processed data are written into the FIFO in order;

(4) FIFO control

Read enable trigger: read enable is asserted when the amount of data remaining in the FIFO reaches a threshold;

Rate control: read enable is controlled by a counter, reading 1 value per cycle to ensure continuous data output;

(5) DPCM processing

DC latching: when the start position of a block is detected, the DC value of the current block is latched into the register of the corresponding component;

Difference calculation: the difference between the current DC value and the DC value of the previous block is calculated;

Output control: the DC difference is output at the start position of the block, and the AC components are output at the remaining positions;

Component switching: the block type is tracked by a counter, and 4 Y blocks, 1 U block, and 1 V block are processed in sequence;

Independent management: the DC value of each component is managed independently.
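The ZigZag rearrangement and per-component DPCM described above can be modeled in software as follows. This is a behavioral sketch only: it assumes the path of FIG. 4 is the standard JPEG zigzag order, it does not model the shift registers or the FIFO, and the names `zigzag_order`, `DcPredictor`, and `scan_block` are hypothetical.

```python
def zigzag_order(n=8):
    """Return the (row, col) pairs of an n x n matrix in standard JPEG zigzag order."""
    order = []
    for s in range(2 * n - 1):  # s = row + col identifies one anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        diag = [(r, s - r) for r in rows]
        # Even diagonals are walked from bottom-left to top-right, odd ones the reverse.
        order.extend(reversed(diag) if s % 2 == 0 else diag)
    return order


class DcPredictor:
    """Keeps one previous-DC register per component (Y, U, V)."""

    def __init__(self):
        self.prev = {"Y": 0, "U": 0, "V": 0}

    def diff(self, component, dc):
        """Return the DC difference and latch the current DC value."""
        d = dc - self.prev[component]
        self.prev[component] = dc
        return d


def scan_block(block, component, predictor):
    """Zigzag-scan an 8x8 block and replace the DC term with its DPCM difference."""
    seq = [block[r][c] for r, c in zigzag_order(8)]
    seq[0] = predictor.diff(component, seq[0])
    return seq
```

Keeping a separate register per component matters because Y, U, and V blocks are interleaved on the shared pipeline; a single register would difference a U block against the preceding Y block.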

Specifically, the shared run-length encoding module 1044 is configured to compress the one-dimensional sequence produced by the ZigZag scan, representing consecutive zero values as "zero run length + non-zero value" so as to reduce the data amount. It is optimized in particular for the AC components, whose zero values are concentrated in the high-frequency region. The implementation steps are as follows:

(1) Data input: the one-dimensional sequence (DC difference plus AC components) processed by DPCM is received;

(2) Zero run count: the counter is incremented by 1 when a 0 is encountered; when a non-zero value is encountered, the current count and the non-zero value are output and the counter is reset;

(3) Code output: (run length, non-zero value) pairs are generated, and (0, 0) is output at the end of the block as the end-of-block (EOB) symbol;

(4) Special handling: an all-zero block outputs EOB directly, and the DC difference is output separately and does not take part in zero-run counting.
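The four steps above can be sketched as follows. This is a minimal software model under the rules stated in the text; it does not split long runs (baseline JPEG limits a run to 15 and uses a separate symbol for longer runs, which the text does not describe), and the function name `run_length_encode` is hypothetical.

```python
def run_length_encode(seq):
    """Encode one scanned block.

    seq[0] is the DC difference and seq[1:] are the AC components.
    Returns (dc_diff, pairs), where pairs is a list of (run, value)
    tuples terminated by the (0, 0) end-of-block marker.
    """
    dc_diff, ac = seq[0], seq[1:]
    pairs = []
    run = 0
    for value in ac:
        if value == 0:
            run += 1  # extend the current zero run
        else:
            pairs.append((run, value))
            run = 0  # reset after emitting a non-zero value
    pairs.append((0, 0))  # EOB: any trailing zeros are implied
    return dc_diff, pairs
```

For an all-zero AC region the loop emits nothing and only the EOB pair is produced, which matches the special handling in step (4).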

Specifically, the four-table shared Huffman coding module 1045 is configured to perform variable length coding on the Y, U, V component data after run-length coding, and allocate codes with different lengths according to the probability of occurrence of the data. The Y component DC/AC and the UV component DC/AC respectively use independent Huffman tables to output the code length (size) and the binary code (code) corresponding to each symbol, thereby realizing lossless compression. The specific implementation flow steps are as follows:

(1) Loading of the four Huffman tables: the Y-DC, Y-AC, UV-DC, and UV-AC tables are read from ROM; each table stores the mapping from symbols to (code length, code);

(2) Table lookup coding: the corresponding table is selected according to the component (Y/UV) and the type (DC/AC) of the input data. A DC difference is looked up by its amplitude, and an AC value is looked up using the (run length, non-zero value) pair as a composite key, yielding the corresponding code length and binary code;

(3) Output control: one (size, code) pair is output per cycle, where the code length (size) indicates the number of coded bits and the binary code is output left-aligned;

(4) Special symbol handling: a fixed code length and code (e.g., size=2, code=00) are output when an end of block (EOB) is encountered.
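The table selection in steps (1) and (2) can be illustrated with the sketch below. The table contents are small placeholder entries chosen only for illustration; a real design would load the complete tables from ROM. The names `TABLES`, `select_table`, `amplitude_category`, and `lookup` are hypothetical, and keying the AC tables by (run length, amplitude category) rather than by the raw value is an assumption borrowed from baseline JPEG practice.

```python
# Placeholder tables: symbol -> (size, code). Illustrative entries only.
TABLES = {
    "Y-DC": {0: (2, 0b00), 1: (3, 0b010)},
    "Y-AC": {(0, 0): (4, 0b1010), (0, 1): (2, 0b00)},
    "UV-DC": {0: (2, 0b00), 1: (2, 0b01)},
    "UV-AC": {(0, 0): (2, 0b00), (0, 1): (2, 0b01)},
}


def amplitude_category(value: int) -> int:
    """Number of bits needed to represent |value| (0 when value is 0)."""
    return abs(value).bit_length()


def select_table(component: str, kind: str) -> str:
    """Map a component (Y, U, V) and a type (DC, AC) to one of the four tables."""
    group = "Y" if component == "Y" else "UV"  # U and V share the chrominance tables
    return f"{group}-{kind}"


def lookup(component: str, kind: str, key):
    """Return the (size, code) pair for a symbol from the selected table."""
    return TABLES[select_table(component, kind)][key]
```

With these placeholder entries, `lookup("V", "AC", (0, 0))` returns the end-of-block entry of the UV-AC table.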

Specifically, the merge encoding module 105 is configured to splice the code lengths (size) and binary codes (code) output by the Huffman coding module into a continuous bitstream and to generate byte-aligned output, so as to realize formatted integration of the encoded data. The implementation steps are as follows:

(1) Input buffer architecture: a 32-bit-wide buffer register (large enough to splice two maximum-length 16-bit codes) is paired with a 5-bit counter that records the number of bits currently held in the buffer. At initialization both the buffer and the counter are set to 0, which ensures the continuity of codes that cross byte boundaries;

(2) Dynamic bitstream splicing: each code is spliced into the buffer according to its code length. The current buffer contents are shifted left by the length of the new code, the new code is merged into the vacated low-order bits, and the buffer and bit count are updated. Codes of up to 16 bits are supported (compatible with the longest Huffman codes), and the shift operation keeps the codes in the correct order;

(3) Byte-aligned output control: when the number of bits in the buffer is greater than or equal to 8, the upper 8 bits are extracted and output as one byte, the buffer is updated to hold the remaining bits, and the counter is decreased by 8. Multistage register delays keep the output data synchronized with the enable signal and avoid timing conflicts;

(4) Cross-byte code handling: when a code spans a byte boundary, a state machine and conditional logic split it dynamically, outputting a complete byte first and retaining the remaining bits for the next splice (for example, if the buffer holds 3 bits and the new code has 7 bits, the upper 8 bits are output after splicing and the remaining 2 bits are retained). Predefined case statements handle the combinations of code length and residual bit count, so that the splicing logic covers all scenarios;

(5) End of block and data statistics: when the end-of-block flag is detected, the residual bits in the buffer are output, padded to a byte boundary, and an end-of-block signal is generated. The total code length is accumulated for subsequent compression ratio analysis or status monitoring.
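The splicing and byte-output behavior of steps (1) to (5) can be modeled as follows. This is a software sketch, not the register-level design: the class name `BitPacker` is hypothetical, padding with zero bits is an assumption (the text does not state the padding value), and byte stuffing after 0xFF is not modeled.

```python
class BitPacker:
    """Accumulates (size, code) pairs and emits whole bytes, most significant bit first."""

    def __init__(self):
        self.buffer = 0      # pending bits, right-aligned
        self.count = 0       # number of valid bits in the buffer
        self.total_bits = 0  # running total for compression statistics

    def push(self, size, code):
        """Append a code of `size` bits and return any complete bytes."""
        if not 1 <= size <= 16:
            raise ValueError("code length must be 1..16 bits")
        self.buffer = (self.buffer << size) | (code & ((1 << size) - 1))
        self.count += size
        self.total_bits += size
        out = []
        while self.count >= 8:
            self.count -= 8
            out.append((self.buffer >> self.count) & 0xFF)  # take the upper 8 bits
            self.buffer &= (1 << self.count) - 1            # keep the remainder
        return out

    def flush(self):
        """Pad the residual bits with zeros to a byte boundary and return them."""
        if self.count == 0:
            return []
        byte = (self.buffer << (8 - self.count)) & 0xFF
        self.buffer = 0
        self.count = 0
        return [byte]
```

Replaying the example in step (4): pushing a 3-bit code and then a 7-bit code yields one output byte and leaves 2 bits in the buffer.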

In another aspect, as shown in FIG. 7, the invention also provides a JPEG image compression method based on an FPGA double-matrix sharing pipeline, which is performed using the JPEG image compression system based on the FPGA double-matrix sharing pipeline, and the JPEG image compression method comprises the following steps:

S100, converting YUV444 format data into YUV420 format, and generating a Y component data stream and a UV component data stream subjected to 2:1 horizontal/vertical downsampling;

S200, alternately storing the Y component data stream into the Y1 and Y2 buffer areas by rows to form a dual 8-row matrix structure, and synchronously storing the UV component data stream, thereby realizing Y1/Y2 dual-matrix segmentation of the Y component and 8-row aligned storage of the UV component, and providing standardized 8×8 data blocks for the shared pipeline;

S300, reading data blocks from the Y1 and Y2 buffer areas in the order Y1, Y2, U, V, generating the timing control signals of the shared processing pipeline, and outputting in the order Y1, Y2, U, V to realize pipelined continuous processing;

S400, based on the timing control signals, performing DCT transformation, quantization, ZigZag scanning, run-length coding, and Huffman coding on the Y1, Y2, U, and V data blocks in a time-division multiplexing manner;

S500, splicing the coded output into a continuous bitstream to generate a compressed data stream conforming to the JPEG standard.

Specifically, the converting YUV444 format data into YUV420 format, generating a Y component data stream and a 2:1 horizontal/vertical downsampled UV component data stream includes the following steps:

Average calculation: the 4 UV values in the window are summed, and the average is computed efficiently by a right shift of 2 bits (fixed-point division by 4);
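The sum-and-shift averaging can be sketched for one chroma plane as follows. This is a minimal model that assumes non-overlapping 2×2 windows and even plane dimensions; the function name `downsample_2x2` is hypothetical.

```python
def downsample_2x2(plane):
    """Average each non-overlapping 2x2 window of a chroma plane.

    plane is a list of rows of integer samples. The sum of the four
    samples is shifted right by 2 bits, which divides by 4 and truncates.
    """
    height, width = len(plane), len(plane[0])
    out = []
    for r in range(0, height - 1, 2):
        row = []
        for c in range(0, width - 1, 2):
            total = (plane[r][c] + plane[r][c + 1]
                     + plane[r + 1][c] + plane[r + 1][c + 1])
            row.append(total >> 2)
        out.append(row)
    return out
```

The shift truncates rather than rounds; a design that needs rounding would add 2 to the sum before shifting, which the text does not mention.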

Specifically, the Y component data stream is alternately stored in the Y1 and Y2 buffer areas according to rows, so as to form a dual 8-row matrix structure, which comprises the following steps:

Specifically, the time-division multiplexing performs DCT transform, quantization, zigZag scanning, run-length encoding and Huffman encoding processing on the Y1, Y2, U, V data blocks based on the timing control signal, and includes the following steps:

Specifically, the performing DCT transform of the Y1, Y2, U, V data blocks based on the timing control signal in a time-division multiplexing manner includes the following steps:

Butterfly operation: the Loeffler algorithm is adopted, and the 8-point DCT is computed through a 3-level butterfly network in which each level requires only 4 multiplications and 8 additions. The symmetry of the cosine function is used to decompose the matrix multiplication into iterated additions, subtractions, and a small number of multiplications, which reduces the amount of computation, matches the parallel pipelining characteristics of the FPGA, and improves processing throughput;

Fixed-point implementation: to avoid the high resource consumption of floating-point operations, the floating-point coefficients of the cosine basis functions are converted into 16-bit fixed-point numbers.
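The effect of 16-bit fixed-point coefficients can be examined with the reference model below. Note that this sketch uses the direct matrix form of the 8-point DCT applied by rows and then by columns, not the Loeffler butterfly network, so it illustrates the arithmetic only. The Q14 scaling (`FRAC_BITS = 14`), the orthonormal normalization, and all function names are assumptions.

```python
import math

FRAC_BITS = 14  # assumed scaling; every coefficient then fits in a signed 16-bit word


def dct_matrix_fixed():
    """Build the 8x8 DCT-II coefficient matrix as scaled integers."""
    matrix = []
    for k in range(8):
        scale = math.sqrt(1 / 8) if k == 0 else math.sqrt(2 / 8)
        matrix.append([
            round(scale * math.cos((2 * n + 1) * k * math.pi / 16) * (1 << FRAC_BITS))
            for n in range(8)
        ])
    return matrix


C = dct_matrix_fixed()


def dct_1d(vec):
    """8-point DCT with integer multiply-accumulate and rounding back to integers."""
    half = 1 << (FRAC_BITS - 1)
    return [(sum(C[k][n] * vec[n] for n in range(8)) + half) >> FRAC_BITS
            for k in range(8)]


def dct_2d(block):
    """Row transforms followed by column transforms (row-column separation)."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(col)) for col in zip(*rows)]  # cols[v][u]
    return [list(r) for r in zip(*cols)]              # back to out[u][v]
```

A constant input block should concentrate its energy in the DC term, with a small rounding error introduced by the intermediate truncation between the row and column passes.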

Specifically, the basic principle and process of the technical scheme of the invention are as follows:

For YUV420 data characteristics, a vertical matrix segmentation strategy is proposed:

(1) The 16 rows of Y component data are divided into a first 8-row Y1 matrix and a second 8-row Y2 matrix, and 8 rows of U/V components are extracted synchronously to form independent matrices;

(2) The Y1/Y2 write enable is switched cyclically by a 1-bit row counter, so that every 16 rows of Y data received generate two 8×8 Y matrices, one 8×8 U matrix, and one 8×8 V matrix;

(3) Beneficial effects: the structure matches the 8×8 DCT operation directly, eliminates the traditional row-column conversion overhead, and reduces the data caching requirement by 30%.
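The segmentation in items (1) and (2) can be sketched as follows. This is a behavioral model only; it assumes that the 1-bit selector toggles after every 8 rows, and the names `split_dual_matrix` and `blocks_8x8` are hypothetical.

```python
def split_dual_matrix(y_rows):
    """Route 16 rows of Y samples into the Y1 and Y2 buffers, 8 rows each."""
    if len(y_rows) != 16:
        raise ValueError("expected exactly 16 rows of Y data")
    y1, y2 = [], []
    select = 0  # 1-bit write-enable selector: 0 -> Y1, 1 -> Y2
    for index, row in enumerate(y_rows):
        (y1 if select == 0 else y2).append(row)
        if index % 8 == 7:
            select ^= 1  # toggle after every 8 rows (assumption)
    return y1, y2


def blocks_8x8(rows8):
    """Cut an 8-row strip into consecutive 8x8 blocks, left to right."""
    width = len(rows8[0])
    return [[row[c:c + 8] for row in rows8] for c in range(0, width, 8)]
```

For a 16-sample-wide strip, each of Y1 and Y2 yields two 8×8 blocks, which together with one U block and one V block gives the six-block group consumed by the shared pipeline.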

2. Timing-aligned shared pipeline

A dedicated timing control module is designed to schedule the multi-component data in order:

(1) A 48-cycle state machine outputs six 8×8 data blocks (Y1×2, Y2×2, U×1, and V×1) in the order Y1→Y1→Y2→Y2→U→V, and DCT transformation is performed on each block;

(2) Data reading is triggered based on the UV component FIFO depth threshold, ensuring that the YUV component data stay synchronized;

(3) Shared processing unit: only one set of DCT transformation, quantization, ZigZag scanning, RLE coding, and Huffman coding modules is needed, processing the data blocks of the different components by time-division multiplexing.
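The scheduling behavior can be modeled with the small state machine below. It borrows the state names IDLE, GEN_RD_EN, and WAIT_BACK and the FIFO threshold of 8 from the claims; the return to IDLE once the backend FIFO is empty, the one-cycle latency of the IDLE transition, and the class name `Scheduler` are assumptions of this sketch.

```python
BLOCK_ORDER = ["Y1", "Y1", "Y2", "Y2", "U", "V"]  # one entry per 8 cycles


class Scheduler:
    """Cycle-level model of the 48-cycle read-enable generator."""

    def __init__(self, threshold=8):
        self.state = "IDLE"
        self.counter = 0
        self.threshold = threshold

    def step(self, uv_fifo_level, backend_empty):
        """Advance one clock; return the block being read, or None."""
        if self.state == "IDLE":
            if uv_fifo_level >= self.threshold:
                self.state = "GEN_RD_EN"
                self.counter = 0
            return None
        if self.state == "GEN_RD_EN":
            label = BLOCK_ORDER[self.counter // 8]
            self.counter += 1
            if self.counter == 48:
                self.state = "WAIT_BACK"
            return label
        # WAIT_BACK: hold until the backend FIFO has drained
        if backend_empty:
            self.state = "IDLE"
        return None
```

Stepping the model 48 times after the threshold is met produces eight consecutive reads for each entry of `BLOCK_ORDER`, after which the scheduler waits for the backend before accepting the next group.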

3. Hardware resource optimization mechanism

Efficient resource multiplexing is achieved through architectural innovation:

(1) Computing circuit: compared with the traditional three-channel architecture, logic units and DSP resources are reduced by more than 50%;

(2) Memory system: the intermediate results of the three components do not need to be buffered simultaneously, and the on-chip SRAM requirement is reduced by 30%;

(3) Power consumption control: the time-division multiplexing mechanism reduces average power consumption by 40% and the data bus bandwidth requirement by 30%.

4. Complete coding flow optimization

(1) YUV444 to YUV420: 2×2 block mean-filter downsampling, which preserves the Y component and compresses the UV components;

(2) Two-dimensional DCT transformation: a row-column separation algorithm combined with the Loeffler butterfly network, with a 16-bit fixed-point implementation;

(3) Quantization and Huffman coding: the luminance/chrominance quantization tables are switched dynamically, and entropy coding is implemented with four independent code tables.
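For reference, the row-column separation named in item (2) rests on the separability of the two-dimensional DCT. Under the usual orthonormal definition it can be written as below, where f(x,y) and the cosine basis functions C_i(x), C_j(y) follow the notation of the claims, and the output symbol F(i,j) and the normalization factor α are assumptions not stated in the text.

```latex
F(i,j) = \sum_{x=0}^{7}\sum_{y=0}^{7} f(x,y)\,C_i(x)\,C_j(y)
       = \sum_{x=0}^{7} C_i(x)\left[\sum_{y=0}^{7} f(x,y)\,C_j(y)\right],
\qquad
C_k(n) = \alpha(k)\cos\frac{(2n+1)k\pi}{16},\quad
\alpha(0)=\sqrt{1/8},\quad \alpha(k)=\sqrt{2/8}\ \ (k>0).
```

The bracketed inner sum is a one-dimensional transform along one index and the outer sum is a one-dimensional transform along the other, so a single 8-point DCT unit can serve both passes.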

Specifically, the invention forms a complete and efficient JPEG coding system by optimizing data organization, processing flow and time sequence control, remarkably reduces the computational complexity and the data carrying cost, and is particularly suitable for embedded image compression application scenes with limited resources.

In particular, compared with the prior art, the invention has the following advantages:

The invention achieves efficient resource multiplexing through the serial processing order Y1, Y2, U, V:

(1) Hardware area savings: only one DCT/quantization/ZigZag/Huffman coding processing unit is needed (the traditional scheme needs 3 parallel units), the computing circuit area is reduced by more than 50%, the control logic is simplified, and the timing synchronization complexity is reduced.

(2) Power reduction: time-division multiplexing of the computing unit reduces average power consumption by about 40%, lowers the data bus bandwidth requirement, and reduces memory access power consumption.

(3) Reduced memory requirements: the intermediate results of the Y, U, and V components do not need to be cached simultaneously, which reduces the memory requirement by about 30%.

(4) Reduced complexity: the data path is simplified, FPGA/ASIC routing difficulty is reduced, and the development cycle is shortened by about 25%.

The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention, and all equivalent structural changes made by the description of the present invention and the accompanying drawings or direct/indirect application in other related technical fields are included in the scope of the invention.

Claims

1. A JPEG image compression system based on an FPGA dual-matrix shared pipeline, comprising:

The YUV420 preprocessing module is used to convert YUV444 format data into YUV420 format data, completely retain the Y component, perform 2:1 horizontal/vertical downsampling on the UV component, and generate a Y component data stream and a 2:1 horizontal/vertical downsampling UV component data stream;

A dual 8-row matrix segmentation storage module is used to store the Y component data stream alternately in the Y1 and Y2 buffer areas by row to form a dual 8-row matrix structure, and synchronously store the UV component data stream to obtain Y1, Y2, U, and V data blocks, thereby realizing Y1/Y2 dual matrix segmentation of the Y component and 8-row aligned storage of the UV component;

A shared pipeline timing scheduling module is used to read Y1, Y2, U, and V data blocks from the Y1 and Y2 buffers in the order of Y1 → Y1 → Y2 → Y2 → U → V, generate timing control signals for the shared processing pipeline, implement pipeline-type continuous processing, and convert parallel multi-component data into a serial data stream;

A shared processing module, configured to perform DCT transformation, quantization, ZigZag scanning, run-length coding, and Huffman coding of the Y1, Y2, U, and V data blocks in a time-division multiplexing manner according to the timing control signal; and

The merge encoding module is used to splice the encoded output into a continuous bit stream and generate output by byte alignment, so as to realize the formatted integration of the encoded data and generate a compressed data stream that conforms to the JPEG standard.

2. The JPEG image compression system based on the FPGA dual-matrix shared pipeline according to claim 1 is characterized in that the dual 8-row matrix segmentation storage module uses row count control and a segmentation counter to alternately generate two 8×8 Y1 and Y2 data blocks every time 16 rows of Y component data are received. The UV component data stream is cached at an 8-row depth and forms a standard input group of six 8×8 data blocks with the Y1 and Y2 data blocks, providing standardized 8×8 data blocks for the shared pipeline.

3. The JPEG image compression system based on the FPGA dual-matrix shared pipeline according to claim 1, wherein the shared pipeline timing scheduling module adopts a three-stage state machine, and the three-stage state machine comprises:

IDLE state: when the UV component FIFO data volume is detected to be ≥8, it jumps to the read enable generation state;

GEN_RD_EN state: Generates the read enable signal in the order of Y1→Y1→Y2→Y2→U→V through the 48-cycle counter;

WAIT_BACK state: monitors the backend FIFO empty flag to ensure that there is no data backlog in the shared pipeline.

4. The JPEG image compression system based on FPGA dual-matrix shared pipeline according to claim 1, wherein the shared processing module comprises:

Shared 2D DCT transform module, used to decompose the 2D DCT into row transform and column transform using the row-column separation algorithm, implement 8-point DCT calculation through a 3-level butterfly network using the Loeffler algorithm, and convert the cosine basis function floating-point coefficients into 16-bit fixed-point numbers;

Dynamic quantization module, which is used to perform non-uniform quantization on DCT coefficients based on timing control signals, dynamically switch the luminance/chrominance quantization table, and retain low-frequency information that is sensitive to the human eye;

Zigzag pipeline scanning module, used to rearrange the quantized 8×8 quantization coefficient matrix into a one-dimensional sequence according to the ZigZag path, and perform DPCM differential coding on the DC coefficient;

A shared run-length encoding module is used to compress the one-dimensional sequence after ZigZag scanning, perform zero run-length encoding, and reduce the amount of data; and

A four-table shared Huffman encoding module, used to perform variable-length encoding on the run-length-encoded Y, U, and V component data using four independent DC/AC code tables, assigning codes of different lengths according to the probability of data occurrence to achieve lossless compression.

5. The JPEG image compression system based on FPGA dual-matrix shared pipeline according to claim 4, wherein the dynamic quantization module dynamically switches the luminance/chrominance quantization table according to a 48-cycle counter, wherein:

Counter in cycles 0-31: Process Y1 and Y2 data blocks and use the brightness quantization table;

Counter in cycles 32-47: Processing U, V data blocks and using the chromaticity quantization table.

6. A JPEG image compression method based on an FPGA dual-matrix shared pipeline, performed using the JPEG image compression system based on an FPGA dual-matrix shared pipeline according to any one of claims 1 to 5, characterized in that the JPEG image compression method comprises the following steps:

Convert YUV444 format data to YUV420 format, generate Y component data stream and UV component data stream after 2:1 horizontal/vertical downsampling;

The Y component data stream is alternately stored in the Y1 and Y2 buffer areas by row to form a dual 8-row matrix structure, and the UV component data stream is stored synchronously to achieve Y1/Y2 dual matrix segmentation of the Y component and 8-row aligned storage of the UV component, providing a standardized 8×8 data block for the shared pipeline;

Read data blocks from the Y1 and Y2 buffers in the order of Y1 → Y1 → Y2 → Y2 → U → V, generate timing control signals for the shared processing pipeline, and output them in the order of Y1 → Y1 → Y2 → Y2 → U → V to achieve pipeline continuous processing;

Based on the timing control signal, DCT transformation, quantization, ZigZag scanning, run-length coding and Huffman coding processing of Y1, Y2, U and V data blocks are performed in a time-division multiplexing manner;

The encoded output is spliced into a continuous bit stream to generate a compressed data stream that conforms to the JPEG standard.

7. The JPEG image compression method based on FPGA dual-matrix shared pipeline according to claim 6, wherein the step of converting YUV444 format data into YUV420 format to generate a Y component data stream and a UV component data stream that has been horizontally/vertically downsampled by 2:1 comprises the following steps:

Data buffer: Build a UV component buffer with a depth of 2 rows, forming a 2×2 pixel window matrix;

Average calculation: sum the 4 UV values in the window and perform efficient average calculation by right shifting 2 bits;

Data alignment output: Generates a YUV420 format data stream with a Y resolution 2×2 times that of UV, matching the dual 8-row matrix segmentation requirements.

8. The JPEG image compression method based on an FPGA dual-matrix shared pipeline according to claim 6, wherein the step of alternately storing the Y component data stream in the Y1 and Y2 buffers by row to form a dual 8-row matrix structure comprises the following steps:

Through line count control, every time 16 lines of Y component data are received, two 8×8 Y1 and Y2 data blocks are alternately generated;

The UV component data stream is buffered as 8 lines of depth and forms a standard input group of 6 8×8 data blocks with the Y1/Y2 data blocks.

9. The JPEG image compression method based on FPGA dual-matrix shared pipeline according to claim 6, characterized in that the DCT transformation, quantization, ZigZag scanning, run-length encoding and Huffman encoding processing of the Y1, Y2, U and V data blocks are performed in a time-division multiplexing manner based on the timing control signal, comprising the following steps:

Based on the 48-cycle counter value, the luminance quantization table is used when processing the Y1 and Y2 data blocks in cycles 0-31, and the chrominance quantization table is used when processing the U and V data blocks in cycles 32-47;

Based on four independent code tables for Y/UV components DC/AC, Huffman encoding is performed on different component data.

10. The JPEG image compression method based on FPGA dual-matrix shared pipeline according to claim 6, wherein the time-division multiplexing execution of DCT transform of Y1, Y2, U, and V data blocks based on the timing control signal comprises the following steps:

Row and column separation: Decompose the two-dimensional DCT into 8 one-dimensional row transforms + 8 one-dimensional column transforms, that is:

Where f(x,y) is the 8×8 pixel value in the spatial domain, C_i(x) and C_j(y) are cosine basis functions, and a one-dimensional DCT unit is called for each row of the 8×8 image block to generate an intermediate frequency domain matrix. The row transform result is converted to column data format by swapping the row and column dimensions through a dual-port BRAM. The one-dimensional DCT unit is then called on the transposed column data to output a complete 8×8 frequency domain coefficient matrix.

Butterfly operation: Using the Loeffler algorithm, a three-level butterfly network is used to complete 8-point DCT calculations. Each level requires only four multiplications and eight additions. Utilizing the symmetry of the cosine function, matrix multiplication is decomposed into iterative operations of addition, subtraction, and a small number of multiplications.

Fixed-point implementation: Convert the cosine basis function floating-point coefficients to 16-bit fixed-point numbers.