CN118354076A

CN118354076A - PNG image coding architecture based on FPGA

Info

Publication number: CN118354076A
Application number: CN202410448747.7A
Authority: CN
Inventors: 张飞飞; 卢孟; 陶长青
Original assignee: Jiangsu Brmico Electronics Co ltd
Current assignee: Jiangsu Brmico Electronics Co ltd
Priority date: 2024-04-15
Filing date: 2024-04-15
Publication date: 2024-07-16

Abstract

The application discloses an FPGA-based PNG image coding architecture and relates to the field of FPGAs. A filtering module receives an image byte stream, determines the filter type of each scan line, outputs a filtered byte stream, and sends the filtered byte stream to a data caching module. The code stream to be spliced is input to a code stream splicing module, and the data to be calculated is sent to an adler32 module to generate an adler32 check code. The adler32 check code and the code stream to be spliced are sent together to the code stream splicing module, which extracts the image data information and sends it to a crc32 module to generate a crc32 check code; the PNG image bit stream is generated by splicing and merging the code stream to be spliced, the crc32 check code and the adler32 check code. The PNG image filtering and coding architecture built on FPGA hardware achieves highly concurrent and highly parallel data processing.

Description

PNG image coding architecture based on FPGA

Technical Field

The embodiments of the application relate to the technical field of FPGAs (field-programmable gate arrays), and in particular to an FPGA-based PNG (Portable Network Graphics) image coding architecture.

Background

PNG (Portable Network Graphics) is a bitmap graphics format that supports lossless compression. Compared with other common graphic formats such as RAW, JPEG and GIF, PNG offers lossless compression, support for multiple color types, support for transparency, optimized display during network transmission, and freedom from patent restrictions. It has become one of the mainstream image formats and is widely used.

The PNG image compression standard is a specification that describes the format to which PNG bitstreams must conform and the process of recovering the original image pixels from the bitstream (parsing and decoding). Version 1.0 of PNG was published in 1996 and has been revised several times since. The current version is an international standard (ISO/IEC 15948:2003), released in 2003 as a W3C Recommendation and available online.

The PNG standard is implemented mainly as a PNG encoder, which is responsible for encoding original image pixels into a PNG standard compliant bitstream, and a PNG decoder, which is responsible for decoding the PNG standard compliant bitstream into original image pixels. The main open source software references of the current PNG standard are:

Libpng, official PNG open source reference library, supports almost all features of the PNG standard.

Lodepng, a PNG open source codec library without dependency, supports most of the features of PNG standards.

However, current implementations of the PNG standard are all based on software algorithms. Software-based PNG encoding and decoding has no advantage in high-speed computation, because software depends on execution by a pipelined processor such as a CPU and cannot meet the requirements of high-bandwidth transmission.

Disclosure of Invention

The embodiments of the application provide an FPGA-based PNG image coding architecture, which solves the problem that a CPU executing a software algorithm cannot meet high-bandwidth and high-compute requirements.

The architecture is a PNG image encoder implemented on an FPGA and comprises a filtering module, a data compression module, a data verification module and a code stream splicing module. The filtering module receives the scanned and serialized image byte stream, determines the filter type of each scan line according to the pixel data, and outputs the filtered byte stream to the data compression module. The data compression module performs lossless compression based on a dictionary sliding window; its compressed output code stream comprises a code stream to be spliced and data to be calculated. The code stream to be spliced is input to the code stream splicing module. The data verification module is divided into a crc32 verification module and an adler32 verification module, and the data to be calculated is sent to the adler32 verification module to generate an adler32 check code. The adler32 check code and the code stream to be spliced are sent together to the code stream splicing module, which extracts the image data information from the code stream to be spliced, sends it to the crc32 verification module, receives the crc32 check code calculated and fed back by that module, and splices and merges the code stream to be spliced, the crc32 check code and the adler32 check code to generate a complete PNG image bit stream.

Specifically, the filtering module performs pre-filtering processing according to the input current byte to be filtered, the previous filtering byte of the current scanning line and the filtering byte at the corresponding position of the previous scanning line, and determines the target filtering type of the current scanning line; and the data caching module caches the current byte to be filtered which is output after the filtering processing, takes the current byte to be filtered as a historical filtering byte of the next time sequence, and carries out filtering based on the target filtering type.

Specifically, the data buffer module comprises three FIFO buffers, the first FIFO buffer and the second FIFO buffer alternately buffer the history filter bytes output by the pre-filtering according to the time sequence of the FIFO buffers, the history filter bytes are used as reference pixels to be input into the filtering module, and the filtering module performs filtering based on the bytes to be filtered, the pixel width data and the history filter bytes in the next row; the third FIFO buffer buffers the filtered byte stream output by the filtering module and sends the filtered byte stream to the data compression module.

Specifically, the data buffer module pre-filters and outputs a gating control signal, two paths of FIFO enabling signals, a compression enabling signal and a pre-filtering data output end;

The first FIFO enable signal of the data buffer module is connected to one input end of each of the first selector MUX1 and the second selector MUX2, the second FIFO enable signal is connected to one input end of each of the third selector MUX3 and the fourth selector MUX4, and the pre-filtering data output end is connected to one input end of each of the fifth selector MUX5 and the sixth selector MUX6; the other input ends of MUX1, MUX2, MUX3, MUX4, MUX5 and MUX6 receive an invalid signal, and the gating control signal controls the gated output;

The output ends of MUX1 and MUX2 are connected to the first FIFO buffer and the second FIFO buffer respectively; in the same time sequence the gated outputs of MUX1 and MUX2 differ, so that the first FIFO buffer and the second FIFO buffer are alternately enabled to start data buffering;

The second FIFO enable signal is a read enable signal, and the output ends of MUX3 and MUX4 are connected to the first FIFO buffer and the second FIFO buffer respectively; in the same time sequence the gated outputs of MUX3 and MUX4 differ, so that the first FIFO buffer and the second FIFO buffer are alternately enabled to start data output;

The output ends of MUX5 and MUX6 are connected to the first FIFO buffer and the second FIFO buffer respectively and are used for writing the historical filtering bytes according to the write enable signal; in the same time sequence the gated outputs of MUX5 and MUX6 differ;

the compression enabling signal output is connected with the data compression module to synchronize lossless compression processes.

Specifically, the first FIFO buffer and the second FIFO buffer also each receive pixel width data. The output end of the first FIFO buffer is connected to one data input end of each of the seventh selector MUX7 and the eighth selector MUX8, and the output end of the second FIFO buffer is connected to the other data input end of MUX7 and MUX8; the gated output ends of MUX7 and MUX8 are connected to the data input end of the filtering module and supply the historical filtering data to it; MUX7 and MUX8 are gated by the gating control signal.

Specifically, an inner layer state machine, an outer layer state machine, an encoding register and an encoding core are arranged in a data compression module, the encoding register is divided into a buffer area to be encoded and a data sliding area, data to be encoded and sliding window data divided by filtering data are respectively stored, the data to be encoded are moved to the data sliding area according to time sequence, and lossless compression of the data is carried out on the sliding window data through the encoding core;

The inner layer state machine is used for controlling the sliding operation process in the coding core, and the outer layer state machine is used for controlling the coding of different chunk block data and the internal crc32 check code calculation process.

Specifically, the encoder core comprises a sliding compression unit, a state mark detection unit, a least significant bit detection unit and a data coding output unit;

The outer layer state machine inputs the pixel width and height data, determines and controls the progress state of image coding, and the state signals output by the outer layer state machine are sent to the inner layer state machine and the sliding compression unit;

The sliding compression unit determines a coding process based on the output of the coding register and the state signals output by the outer layer state machine, and shifts and codes the filtered input data;

The sliding compression unit sends the coded data output by the shift operation to the state mark detection unit and the least significant bit detection unit; the state mark detection unit determines the sliding state according to the coded data and the state signal output by the inner layer state machine, and outputs a state mark; the state mark is fed back to the inner layer state machine and is used for determining and updating the state signal;

The least significant bit detection unit determines the least significant bit of the encoded data according to the encoded data and the state signal output by the inner layer state machine and outputs the least significant bit; the data coding output unit determines coded chunk block data based on the shift coding output, the length range of the least significant bit and the output signal of the outer layer state machine, and performs coding output.

Specifically, a feedback enabling signal is output at the filtering module, and the data caching module performs caching historical filtering data and data compression input based on the feedback enabling signal.

Specifically, the crc32 verification module comprises a counting unit, a check calculation unit and a check output unit. The check calculation unit performs bit operations on the received pixel height data, width data, state signals of the outer layer state machine and encoded output data, and outputs the chunk check data for the target chunk type. The chunk check data is input to the check output unit, which determines and outputs the crc32 check code according to the target chunk type and the pixel parameter information. The counting unit counts and clears the chunk count and determines the target chunk type according to the state signal.
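The bit operations performed by a CRC-32 check can be illustrated in software. The following is a minimal Python sketch, not the circuit of the application: it assumes the standard PNG CRC-32 (reflected polynomial 0xEDB88320, computed over the chunk type and chunk data), and the function name crc32_bitwise is hypothetical.

```python
import zlib

def crc32_bitwise(data, crc=0):
    """Bit-serial CRC-32 as used by PNG chunks (reflected polynomial 0xEDB88320)."""
    crc ^= 0xFFFFFFFF                      # pre-inversion of the running register
    for byte in data:
        crc ^= byte
        for _ in range(8):                 # one shift per input bit
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF                # post-inversion

# The CRC of a chunk covers the 4-byte chunk type followed by the chunk data.
iend_crc = crc32_bitwise(b"IEND")          # IEND has no data bytes
ihdr_body = b"IHDR" + bytes(13)            # placeholder 13-byte IHDR payload
ihdr_crc = crc32_bitwise(ihdr_body)
```

The result can be compared against zlib.crc32 from the Python standard library; a hardware version would typically process several bits per clock, which this sketch does not attempt to model.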

Specifically, the adler32 verification module comprises a buffer unit, a storage unit and a splicing unit; the buffer unit receives the coded output and the state signal of the outer layer state machine, and outputs buffer data to the storage unit according to the state;

The buffer unit comprises a data buffer and a counting buffer, wherein the data buffer is used for buffering filtering data, and the counting buffer is used for counting the buffering times; the storage unit comprises a first register and a second register, wherein the first register is used for registering the check byte sum in the data buffer;

and the splicing unit takes the check byte sum of the first register as the low-order data and the check byte sum of the second register as the high-order data, and outputs the shifted and spliced result to obtain the adler32 check code.
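The two-register structure corresponds to the two running sums of the standard Adler-32 checksum. The sketch below is a software illustration under that assumption (modulus 65521, low half initialised to 1, as in the zlib format); the names s1, s2 and adler32_two_registers are hypothetical and do not come from the application.

```python
import zlib

MOD_ADLER = 65521  # largest prime below 2**16, as defined for Adler-32

def adler32_two_registers(data):
    """Adler-32 with two accumulators, spliced as (s2 << 16) | s1."""
    s1 = 1   # first register: running byte sum, becomes the low 16 bits
    s2 = 0   # second register: running sum of s1, becomes the high 16 bits
    for byte in data:
        s1 = (s1 + byte) % MOD_ADLER
        s2 = (s2 + s1) % MOD_ADLER
    return (s2 << 16) | s1                 # shift and splice

checksum = adler32_two_registers(b"Wikipedia")
```

In zlib the checksum is computed over the uncompressed data, which here would be the filtered byte stream.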

The technical scheme provided by the embodiment of the application has the beneficial effects that at least:

Constructing a PNG image filtering coding framework based on hardware logic by using an FPGA, and realizing high concurrency and parallel processing efficiency of data;

the filtering stage uses primary pre-filtering to firstly determine the target filtering type of the actual scanning line, and formally carries out filtering treatment during secondary filtering to obtain a filtering byte stream;

aiming at the throughput rate problem of the module, a buffer device comprising a FIFO buffer is arranged to solve the time sequence problem;

A design that trades time for area is adopted to perform the two-pass filtering scheme; at the same time, two groups of FIFOs alternately cache the historical filtering data, which avoids the throughput delay caused by moving data through a single buffer in rotation;

and the sliding window and the coding register are used for carrying out data temporary storage and data carrying, and the design size area of the FPGA is reduced on the premise of maintaining certain compression efficiency.

Drawings

Fig. 1 is a schematic flow chart of PNG image encoding according to an embodiment of the present application;

FIG. 2 is a representation of a PNG image-based encoded bitstream provided by the present application;

FIG. 3 shows a schematic diagram of an FPGA-based PNG image encoding architecture;

FIG. 4 is a detailed schematic diagram of an FPGA-based PNG image encoding architecture;

FIG. 5 is a schematic diagram of the connection between a filtering module and a buffering module;

FIG. 6 is a schematic diagram of a data compression module;

FIG. 7 is a state machine transition diagram of the PNG image encoding architecture;

FIG. 8 is a detailed block diagram of the lz77_top interior;

FIG. 9 is a schematic diagram of the steps performed by lz77_top;

FIG. 10 is a schematic diagram of the structure of a crc32 verification module;

FIG. 11 is a schematic diagram of the architecture of an adler32 verification module.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.

References herein to "a plurality" mean two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may indicate that A exists alone, that A and B exist together, or that B exists alone. The character "/" generally indicates an "or" relationship between the objects before and after it.

The FPGA (Field Programmable Gate Array) has found wide application in image compression owing to its parallel processing capability and its advantages in high-speed computing. A PNG encoder encodes the original image pixels into a bit stream conforming to the PNG standard; the bit stream is much smaller than the original image pixels, which saves storage space and transmission bandwidth. Hardware encoders tend to perform better than software encoders and are of great practical value. Based on the characteristics of the FPGA, the application designs a coding and decoding framework on FPGA hardware so as to realize a PNG image coding and decoding process with high concurrency and large bandwidth.

As shown in fig. 1, for the PNG image encoding process, the present application divides the compression encoding process of an image into the following steps based on PNG standard (ISO/IEC 15948:2003):

S1, the image pixel data are serialized in scan-line order, and the scan lines are arranged into a byte stream of image pixels.

PNG image display is classified into progressive display and non-progressive display. Progressive display relies on interlaced scanning, which lets a PNG image appear gradually, from blurred to clear, during network transmission or loading. Unlike conventional sequential scanning (scanning each row of pixels in turn), interlaced scanning divides the pixel data of an image into multiple passes, each of which skips certain rows or pixels. Only part of the image is scanned at first; as scanning proceeds, more pixels are captured and rendered, and the image gradually becomes clearer and more complete. This is particularly useful for network transmission, because the user can see the approximate content of the image first when the connection is slow or unstable. Depending on whether this function is required, the PNG image sender or encoding device splits the original image pixels into sub-images and arranges the sub-images in the order defined by the Adam7 algorithm. When the encoder does not need progressive display, this step can be skipped.
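For reference, the Adam7 split mentioned above can be sketched as follows. This is an illustrative Python sketch based on the pass geometry given in the PNG specification; the function name adam7_subimages is hypothetical, and the embodiment described here skips this step.

```python
# Adam7 pass geometry from the PNG specification:
# (x_start, y_start, x_step, y_step) for passes 1 to 7.
ADAM7_PASSES = [
    (0, 0, 8, 8),
    (4, 0, 8, 8),
    (0, 4, 4, 8),
    (2, 0, 4, 4),
    (0, 2, 2, 4),
    (1, 0, 2, 2),
    (0, 1, 1, 2),
]

def adam7_subimages(width, height):
    """Return, for each pass, the (x, y) coordinates it contains in scan order."""
    passes = []
    for x0, y0, dx, dy in ADAM7_PASSES:
        coords = [(x, y)
                  for y in range(y0, height, dy)
                  for x in range(x0, width, dx)]
        passes.append(coords)
    return passes

subimages = adam7_subimages(8, 8)
pass_sizes = [len(p) for p in subimages]
```

For an 8 x 8 image each pass roughly doubles the number of pixels delivered, and every pixel belongs to exactly one pass.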

Pixel data serialization divides the image pixel data of the original PNG image, or of each PNG sub-image, into scan lines. The pixels of each scan line are taken in order from left to right, and their color information is encoded in that order to form a byte stream. The application is implemented with true-color RGBA, which includes a transparency channel, and the data bit width is set to 8 bits per channel.

S2, filtering processing is carried out on each scanning line, and a filtering byte stream is generated according to the filtering type and the filtering value executed on each scanning line.

The purpose of the filtering process is to reduce redundancy of the data by encoding the difference between each pixel and its surrounding pixels. The filter types are None (no filtering), Sub (difference from the byte to the left), Up (difference from the byte above), Average and Paeth. The filter type may be chosen freely for each scan line or restricted to a subset; different filter types produce different filtered values, and the compression efficiency obtained in the subsequent stages differs accordingly. The filtered values form a new byte stream for each scan line, and one byte is added before each line to indicate the filter type of that line.

S3, compressing the filtered byte stream, combining the image information and PNG standard to form a corresponding chunk, and constructing a complete PNG image bit stream according to the set signature, the head data block, the image data block and the tail data block.

The compression step is lossless compression implemented with standard compression algorithms such as LZ77 and Huffman coding. LZ77 replaces the current data with matching information that refers to data that has already appeared, and Huffman coding then encodes and compresses the symbols produced by LZ77. The compressed bit stream must be packaged into a bit stream conforming to the zlib standard, which is in turn packaged into the corresponding chunk data blocks.
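The replacement of repeated data by matching information can be illustrated with a small software model. The sketch below is a greedy sliding-window LZ77 in Python; the window size, minimum match length and token format are assumptions chosen for readability, not the parameters of lz77_top, and Huffman coding of the tokens is omitted.

```python
def lz77_compress(data, window=32, min_match=3, max_match=258):
    """Greedy LZ77: emit ('lit', byte) or ('match', length, distance) tokens."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Search the sliding window that precedes position i.
        for j in range(max(0, i - window), i):
            length = 0
            while (length < max_match and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append(("match", best_len, best_dist))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens

def lz77_decompress(tokens):
    """Rebuild the byte string; overlapping matches are copied byte by byte."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, length, dist = tok
            for _ in range(length):
                out.append(out[-dist])
    return bytes(out)

sample = b"abcabcabcabcxyz"
tokens = lz77_compress(sample)
```

In this example the nine bytes following the first "abc" collapse into a single match token with length 9 and distance 3, and decompression restores the input exactly.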

In this embodiment, chunks are divided by type into an image header chunk (IHDR chunk), an image data chunk (IDAT chunk) and an image trailer chunk (IEND chunk). For the data stream which has completed the blocking, the complete PNG image bit stream is constructed from the signature, the head data block, the image data block and the tail data block set by the encoder or the sender, and is then sent to the receiver or the decoder for decoding.

Fig. 2 shows a PNG image bitstream. For a decoder to decode the bitstream correctly, check codes must be added during encoding. The encoder design therefore needs an encoding module and a bitstream splicing module, and the resulting bitstream contains both the compressed data and the check codes used by the decoder. Information such as the width and height of the image, the sample bit width, the color type and whether progressive display is used is placed in the IHDR chunk data. The IDAT chunk contains the compressed image data; its chunk data is a zlib bitstream with a variable number of bytes. The zlib bitstream consists, in sequence, of a one-byte CMF, a one-byte FLG, a deflate bitstream with a variable number of bytes, and a four-byte ADLER32. The deflate bitstream consists, in sequence, of a one-bit BFINAL, a two-bit BTYPE and a compressed bitstream with a variable number of bits.
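The byte layout described above can be reproduced in a few lines of software. The following Python sketch assembles a minimal RGBA PNG; it uses the standard library zlib module as a stand-in for the compression and check hardware, and the helper names make_chunk and encode_rgba_png are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def make_chunk(chunk_type, data):
    """Chunk layout: 4-byte length, 4-byte type, data, CRC-32 over type and data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)

def encode_rgba_png(width, height, filtered_scanlines):
    """filtered_scanlines: rows already prefixed with their filter-type byte."""
    # IHDR: width, height, bit depth 8, colour type 6 (RGBA),
    # compression method 0, filter method 0, interlace method 0 (no Adam7).
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 6, 0, 0, 0)
    # zlib stream: CMF, FLG, deflate data, then the 4-byte ADLER32.
    idat = zlib.compress(filtered_scanlines)
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))

# 2 x 1 image, filter type 0 (None): an opaque red pixel and a half-transparent blue pixel.
row = bytes([0]) + bytes([255, 0, 0, 255]) + bytes([0, 0, 255, 128])
png = encode_rgba_png(2, 1, row)
```

Writing png to a file should give an image that ordinary viewers can open; the ADLER32 at the end of the IDAT payload is computed over the filtered bytes, and each chunk CRC over the chunk type and data.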

Fig. 3 shows a schematic diagram of the FPGA-based PNG image coding architecture, which includes at least a filtering module filter_top, a data buffer module fifo_flt, a data compression module lz77_top, a data verification module and a code stream splicing module bs_top. The architecture implements the method of FIG. 1. The filtering module takes as input the pixel data data_in of the original image, that is, it receives the scanned and serialized image byte stream. The data is filtered before compression encoding in order to reduce redundancy in the data. The filtered byte stream output by the filtering module can be sent directly to the data compression module, or it can be buffered by the data buffer module fifo_flt and then sent to the data compression module at a rate matched to the throughput. The data compression module can perform the compression through a Huffman coding core computing unit, which performs lossless compression based on a dictionary sliding window. The compressed output code stream comprises two parts: the code stream to be spliced and the data to be calculated. The code stream to be spliced is input to the code stream splicing module. The data verification module is divided into a crc32 verification module and an adler32 verification module, and the data to be calculated is sent to the adler32 verification module to generate an adler32 check code.

The adler32 check code and the code stream to be spliced are sent together to the code stream splicing module. The code stream splicing module extracts the image data information from the code stream to be spliced and sends it to the crc32 verification module, which calculates a crc32 check code from the image data information and feeds it back to the code stream splicing module. The code stream splicing module then splices and merges the code stream to be spliced, the crc32 check code and the adler32 check code to generate a complete PNG image bit stream.

In the filtering stage, the filtering module must adaptively select the optimal filter type for each scan line according to the picture content; the application supports the 5 filter types specified in the standard. The filtering module determines the filter type of the current scan line from the input current byte to be filtered, the previous filtering byte of the current scan line and the filtering byte at the corresponding position of the previous scan line, and then performs the filtering.

In some embodiments, the bit width of the input data and the output data is set to 32 bits; the input data is one original pixel containing the four RGBA channels, and the output data is the spliced code stream. Because the input data to be filtered contains the four RGBA channels, the filtering module needs to identify the channel to which each byte belongs.

The filtered data for four channels of RGBA includes the following:

Wherein X represents the current byte to be filtered, a represents the corresponding byte of the pixel immediately before X in the current scan line, b represents the byte at the same position as X in the previous scan line, and c represents the byte at the same position as a in the previous scan line. X, a, b and c are bytes of the same channel, and the channels can be filtered in parallel. If X lies on the top or left boundary, the reference bytes that fall outside the image are invalid and are set to 0.
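Using these definitions, the five filter types of the PNG standard can be written per byte as in the following Python sketch. It follows the formulas of the PNG specification; the function names are hypothetical, and the code illustrates the arithmetic rather than the circuit of filter_top.

```python
def paeth_predictor(a, b, c):
    """Paeth predictor from the PNG specification: pick the neighbour closest to a + b - c."""
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    if pb <= pc:
        return b
    return c

def filter_byte(filter_type, x, a, b, c):
    """Filtered value of byte x given a (left), b (up) and c (upper left)."""
    if filter_type == 0:      # None
        pred = 0
    elif filter_type == 1:    # Sub
        pred = a
    elif filter_type == 2:    # Up
        pred = b
    elif filter_type == 3:    # Average
        pred = (a + b) // 2
    elif filter_type == 4:    # Paeth
        pred = paeth_predictor(a, b, c)
    else:
        raise ValueError("unknown filter type")
    return (x - pred) & 0xFF  # difference modulo 256

# Example neighbourhood: x = 100, a = 90, b = 110, c = 95.
results = [filter_byte(t, 100, 90, 110, 95) for t in range(5)]
```

Because a, b and c come from the same channel as X, the four channels of a 32-bit RGBA word can be evaluated with four independent copies of this arithmetic.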

The filtering process of the filtering module may include the steps of:

A. Acquire the pixel data at corresponding positions in two adjacent scan lines, take the original pixel data of the previous line as reference pixels, and pre-filter the original pixels of the current line;

B. Accumulate the absolute values of the filtered data over each scan line as an error sum, and select the filter type with the minimum error sum as the target filter type of that line;

C. Perform formal filtering on the current line according to the determined target filter type, and output the filtered byte stream.
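The three steps above can be sketched in software as a two-pass procedure. The Python sketch below assumes the common minimum-sum-of-absolute-differences heuristic, with each filtered byte interpreted as a signed value; the function names and the tie-breaking rule (lowest filter type wins) are assumptions, not details taken from the application.

```python
def paeth(a, b, c):
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    return a if (pa <= pb and pa <= pc) else (b if pb <= pc else c)

# Predictors for filter types 0..4: None, Sub, Up, Average, Paeth.
PREDICTORS = [
    lambda a, b, c: 0,
    lambda a, b, c: a,
    lambda a, b, c: b,
    lambda a, b, c: (a + b) // 2,
    paeth,
]

def filter_row(ftype, cur, prev, bpp=4):
    """Filter one scan line; prev holds the original bytes of the previous line."""
    out = bytearray()
    for i, x in enumerate(cur):
        a = cur[i - bpp] if i >= bpp else 0    # left neighbour, 0 at the boundary
        b = prev[i]
        c = prev[i - bpp] if i >= bpp else 0
        out.append((x - PREDICTORS[ftype](a, b, c)) & 0xFF)
    return bytes(out)

def prefilter_error_sums(cur, prev, bpp=4):
    """Pass 1: only the five error sums are kept, not the filtered rows."""
    return [sum(v if v < 128 else 256 - v for v in filter_row(t, cur, prev, bpp))
            for t in range(5)]

def encode_row(cur, prev, bpp=4):
    sums = prefilter_error_sums(cur, prev, bpp)
    best = min(range(5), key=lambda t: sums[t])          # step B
    return bytes([best]) + filter_row(best, cur, prev, bpp)  # pass 2, step C

flat = bytes([10, 20, 30, 255] * 2)        # two identical RGBA pixels
first_line = encode_row(flat, bytes(8))    # no line above: references are 0
repeated = bytes([10, 20, 30, 255, 40, 50, 60, 255])
second_line = encode_row(repeated, repeated)
```

For the first example the Sub filter wins because the second pixel repeats the first; for the second example the line equals the one above it, so the Up filter yields all zeros.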

The application applies all 5 filter types to one scan line of data and compares their error sums; the type with the minimum error sum is the filter type of that line, and the corresponding filtered data is selected. In a software algorithm, the filtered data for all 5 filter types can be stored and the selected one output at the end. In an FPGA hardware design, however, storing this intermediate data requires 5 blocks of RAM of 32 × 512 bits = 2 KB each, and the resulting area is too large for convenient design, development and deployment. Therefore, in the FPGA-based PNG image filtering and coding structure, filtering is performed scan line by scan line, and each line is filtered twice. The first pass is pre-filtering: the error sums are accumulated as the filtering proceeds, and when the pass finishes the minimum error sum identifies the optimal target filter type. The second pass is formal filtering: the line is filtered with the optimal filter type and the filtered byte stream is output.

Since the first and second filtering calculations are identical, the same filtering logic can be reused to reduce area; that is, time is traded for area here. In other embodiments the design can be adapted to the scenario: where real-time requirements are high, area is traded for time and filtering is performed only once, but 5 blocks of 2 KB RAM must then be integrated into the FPGA. When each scan line is filtered twice, the original pixels of the previous line serve as reference pixels for certain filter types in both passes, and the original pixels of the current line serve as the pixels to be filtered in both passes. This design therefore requires at least 2 RAM buffers of 32 × 512 bits = 2 KB each.
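The storage figures can be checked with simple arithmetic. The short Python sketch below assumes that 512 is the depth of each line buffer in 32-bit words (one RGBA pixel per word); the variable names are illustrative only.

```python
WORD_BITS = 32   # one RGBA pixel per word
DEPTH = 512      # words per line buffer, as assumed from the 32 x 512 figure

ram_bytes = WORD_BITS * DEPTH // 8    # size of one buffer in bytes
single_pass_bytes = 5 * ram_bytes     # one-pass design: keep all 5 filtered rows
two_pass_bytes = 2 * ram_bytes        # two-pass design: previous row + current row
saved_bytes = single_pass_bytes - two_pass_bytes
```

Under these assumptions the two-pass design needs 4 KB of line storage instead of 10 KB, at the cost of filtering each scan line twice.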

In particular, the throughput rates of the filtering module and the compression module in the FPGA may differ. When the throughput rates are the same, no storage is needed and the output of the filtering module can be sent directly to the data compression module; when they differ, a data buffer module must be placed between the filtering module and the data compression module.

As shown in fig. 4, the data buffer module includes several FIFO buffers, buffers the historical filtering bytes in time order, and feeds the data compression module in scan-line order for lossless compression. filter_top pre-filters data_in to determine the target filter type while writing the original data to be filtered (fifo_wr_dat) into fifo_flt; in the secondary filtering stage, filter_top reads back the original data (rd_dat). The filtered byte stream lz77_data output by the secondary filtering is then written into lz77_top for lossless compression.

Because of the timing data transfer involved, the present application also sets a feedback enable signal rd_val between lz77_top and fifo_flt, i.e., fifo_flt performs fifo_wr_dat and rd_dat operations based on the valid feedback enable signal. Accordingly, because the time sequence control is performed between the hardware logic devices, a bs_val enabling signal is also arranged between the bs_top and the lz77_top to instruct the bs_top to execute the code stream splicing task.

Because the lower of any two adjacent lines in the filtering process becomes the upper line in the next stage, the FIFOs must rotate roles to avoid copying data. Two FIFO buffers are therefore used: they alternately buffer the original pixels of the previous scan line in time order, and the buffered original pixels serve as reference pixels when the original pixels of the next scan line are filtered.

Fig. 5 is a schematic diagram of the connection between the filtering module and the buffering module. fifo_flt includes three FIFO buffers and eight selectors. The first FIFO buffer fifo_0 and the second FIFO buffer fifo_1 alternately buffer the history filter bytes of the pre-filter output according to the timing sequence of the FIFO buffer fifo_1, and input the history filter bytes as reference pixels to the filter module. The filtering module filters based on the next row of bytes to be filtered and the historical filtering bytes. The filtered output is buffered in a third FIFO buffer fifo_2, and then the byte stream is fed into the data compression module according to the timing sequence.

The data buffer module pre-filters and outputs a gating control signal cnt_h, two FIFO enable signals (the write enable signal fifo_cur_wr_val and the read enable signal fifo_pre_rd_val), a compression enable signal bs_val and a pre-filtering data output (corresponding to the fifo_flt_wr_dat filtered byte stream).

The enable signals are generally implemented in accordance with the FIFO interleaving rule. The first FIFO enable signal fifo_cur_wr_val of the data buffer module is connected to one input terminal of each of the first selector MUX1 and the second selector MUX2, the second FIFO enable signal fifo_pre_rd_val is connected to one input terminal of each of the third selector MUX3 and the fourth selector MUX4, and the pre-filtering data output terminal fifo_cur_wr_dat is connected to one input terminal of each of the fifth selector MUX5 and the sixth selector MUX6. Each of the six selectors has only one valid input; the other input receives an invalid signal (for example, tied to ground where high is active and low is inactive), and the select terminals of these selectors are all controlled by cnt_h.

fifo_cur_wr_val is a write enable signal. The output ends of MUX1 and MUX2 are connected to fifo_0 and fifo_1 respectively, and in any given timing the gated outputs of MUX1 and MUX2 differ, so that fifo_0 and fifo_1 are alternately enabled to start data buffering. fifo_pre_rd_val is a read enable signal. The output ends of MUX3 and MUX4 are connected to fifo_0 and fifo_1 respectively, and in any given timing the gated outputs of MUX3 and MUX4 differ, so that fifo_0 and fifo_1 are alternately enabled to start data output.

Because access is controlled alternately in this way, the history filter bytes corresponding to fifo_pre_rd_val and fifo_cur_wr_val are adjacent, i.e., they are the pre-filter data currently being output and the pre-filter data output immediately before it. In the data storage stage, the pre-filter data output end outputs the history filter bytes. The output ends of MUX5 and MUX6 are connected to fifo_0 and fifo_1 respectively and are used to write the history filter bytes according to the write enable signal; in any given timing the gated outputs of MUX5 and MUX6 differ.

Taking Fig. 5 as an example, when filter_top outputs fifo_cur_wr_dat, fifo_cur_wr_val is gated through MUX2 to fifo_1, i.e., the wr_val signal of fifo_1 is valid, and fifo_cur_wr_dat is stored into fifo_1 through MUX6 as the wr_dat data. While fifo_1 is being written, fifo_pre_rd_val is gated through MUX3 to fifo_0, so the rd_val signal of fifo_0 is active and fifo_0 sends out the previously cached history data rd_dat, which is output via MUX7 or MUX8 (depending on the cnt_h setting). Assuming MUX7 gates the fifo_pre_rd_dat output, this data is fed into filter_top and filtered together with the data_i input to filter_top in the same timing, and the result fifo_flt_wr_dat is output into fifo_2.

In the subsequent timing, fifo_cur_wr_val is gated through MUX1 to fifo_0, so fifo_0 performs the buffering step and fifo_1 performs the output step. The whole process runs cyclically and alternately, which keeps filter_top executing continuously, balances throughput, and improves execution efficiency. The bs_val signal output from filter_top acts on lz77_top to coordinate the start of the lossless compression process.
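The alternation between fifo_0 and fifo_1 can be illustrated with a small behavioural model in software. This is only a sketch under stated assumptions, not the disclosed hardware: the function name `filter_lines_ping_pong` is hypothetical, the two `deque` objects stand in for fifo_0 and fifo_1, the output list stands in for fifo_2, and a simple "subtract the byte above" filter is assumed in place of the actual filter selection logic.

```python
from collections import deque


def filter_lines_ping_pong(lines):
    """Filter scan lines against the previous line using two alternating FIFOs.

    Assumption: each output byte is (current - byte above) mod 256, which is
    used here only to show how the reference line is read back.
    """
    fifos = (deque(), deque())           # stand-ins for fifo_0 and fifo_1
    out = []                             # stand-in for fifo_2
    for cnt_h, line in enumerate(lines):
        wr = fifos[cnt_h & 1]            # FIFO that stores the current line
        rd = fifos[(cnt_h & 1) ^ 1]      # FIFO that replays the previous line
        filtered = []
        for cur in line:
            ref = rd.popleft() if rd else 0   # the first line has no reference
            wr.append(cur)                    # keep the byte for the next line
            filtered.append((cur - ref) & 0xFF)
        out.append(filtered)
    return out


if __name__ == "__main__":
    print(filter_lines_ping_pong([[10, 20, 30], [12, 20, 25], [12, 21, 25]]))
```

Because the roles of the two FIFOs swap with the parity of the line counter, no line is ever copied from one buffer to the other, which is the point of the rotation described above.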

The filtered byte stream is sent continuously to the data compression module. Data compression relies mainly on a Huffman coding core unit and a dictionary; the application does not detail their execution and focuses instead on the cache control logic of the coding core.

Referring to Fig. 6, a coding register reg and a coding core are provided in the data compression module. The coding register receives the filtered byte stream from the upper layer and is internally divided into a buffer area to be coded and a data sliding area, which respectively store the data to be encoded and the sliding window data into which the filtered data is divided.

In some embodiments, when the filter output rate and the coding throughput rate differ, fifo_2 acts as a buffer. lz77_top encodes in units of bytes (8 bits), while the bit width of the FIFO is set to 32 bits, so the amount of data fetched from the FIFO must be rounded up to a multiple of 4; the size of the buffer to be encoded therefore needs to be extended from 64bits to 67bits. To realize data sliding between the sliding window and the buffer to be encoded, the data of the sliding window and of the buffer to be encoded are first cached as a whole in a register variable reg, and each time data is fetched from fifo_2 it is shifted and registered into reg. Suppose the buffer to be encoded holds 66 data items before a given encoding step. If that step finds no match, or the best match length is less than 3, the filtered data is recorded directly and the buffer to be encoded then holds 65 items. No new data needs to be fetched from the FIFO, i.e., reg is not updated, but the recorded filtered data must be moved from the buffer to be encoded into the sliding window before the next encoding step.

In implementing the above process, the chunks of the filtering output are of several types, and the processing and code splicing differ between types, so a state machine is introduced for logic control. The state transition diagram of Fig. 7 is derived from the sliding window encoding and the subsequent check code calculations. The states include the IDLE state; the UPT state, in which the window and the buffer to be encoded are updated; and the SCH state, in which the sliding window is searched for the data in the buffer to be encoded. After each complete chunk, the crc32 check module must be invoked to calculate the crc32 check code; within the IDAT chunk, the adler32 check value must also be calculated; finally, splicing and packaging are performed.

Fig. 8 is a detailed structure diagram of the inside of the lz77_top, which includes a sliding compression unit dat_win_shift_w, a status flag detection unit flg_mat_w, a least significant bit detection unit lz77_detect_one, and a data encoding output unit lz77 output. In addition, two layers of state machines, namely an inner layer state machine and an outer layer state machine, are arranged in the data compression module. The inner layer state machine is used for controlling the sliding operation process in the coding core, and the outer layer state machine is used for controlling the coding of different chunk block data and the internal crc32 check code calculation process. The control of the internal and external states of the two layers can orderly control each part of the hardware according to the time sequence.

The outer layer state machine fsm inputs the pixel width cfg_w_i and the height data cfg_h_i to determine and control the progress state of image encoding, and a state signal cur_state output by the outer layer state machine fsm is fed into the inner layer state machine fsm_sch and the sliding compression unit dat_win_shift_w.

The sliding compression unit dat_win_shift_w determines the coding process based on the output dat_inp_r of the coding register and the state signal cur_state output by the outer layer state machine, and performs shifting and coding operations on the filtered input data. It feeds the encoded data flg_mat_w_i produced by the shift operation to the state flag detection unit flg_mat_w and to the least significant bit detection unit lz77_detect_one.

The state flag detection unit flg_mat_w determines a sliding state from the encoded data flg_mat_w_i and the state signal flg_mat output from the inner state machine, and outputs the state flag flg_mat_w_o. The flg_mat_w_o is again fed back into the inner state machine for fsm_sch determination and updating of the state signal.

The least significant bit detection unit lz77_detect_one determines the least significant bit of the encoded data according to the encoded data and the state signal flg_mat output by the inner layer state machine and outputs the corresponding data bst_dst_w. The data encoding output unit lz77 output determines encoded chunk block data based on the shift encoding output flg_mat_w_i, the length range of the least significant bits bst_dst_w, and the output signal flg_mat of the outer layer state machine, and encodes output bs_dat.

In this logic step, each chunk has its own crc32 value, so the control logic of the inner layer state machine must assist in calculating the crc32 check codes of the three chunks IHDR, IDAT and IEND. The crc32 check module is therefore implemented as a core module that is multiplexed, with state machine control logic at the outer layer controlling its input data. The crc32 check module works similarly to the adler32 check module: it is controlled by a state machine, processes a 32-bit data input at a time, uses a signal to indicate the number of valid bytes in the input, and outputs the current processing result four clock cycles later. The outer layer control logic uses a state machine; after the start signal, the inner layer state machine identifies and controls the encoding of the IDAT block. The input data required by this process are the type field and the data field of the IDAT block, where the data field is the zlib bit stream output by the bit stream module. After the crc32 check code of the IDAT block is output, the crc32 check module is reset, and the crc32 check codes of the IHDR and IEND blocks are then calculated in sequence.
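For reference, the relationship between a chunk and its crc32 value can be sketched in software. The sketch below follows the general PNG chunk layout (length, type, data, crc32 over type and data) and uses Python's standard `zlib.crc32`; the helper names `make_chunk` and `make_ihdr` and the default IHDR parameters are assumptions made for illustration and do not describe the hardware data path.

```python
import struct
import zlib


def make_chunk(chunk_type, data=b""):
    """Assemble one PNG chunk: 4-byte length, type, data, crc32(type + data)."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def make_ihdr(width, height, bit_depth=8, color_type=2):
    """Build an IHDR chunk; bit depth and colour type defaults are assumed."""
    # Width, height, bit depth, colour type, compression, filter, interlace.
    payload = struct.pack(">IIBBBBB", width, height, bit_depth, color_type, 0, 0, 0)
    return make_chunk(b"IHDR", payload)


if __name__ == "__main__":
    print(make_chunk(b"IEND").hex())
```

Note that the length field is excluded from the crc32 input, which is why only the type field and the data field are fed to the check module.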

The detailed procedure of lz77_top is shown in Fig. 9 and includes the following steps:

1. To realize data sliding between the sliding window and the buffer to be encoded, the data of the sliding window and of the buffer to be encoded are first cached as a whole in register variables, and a shift-register operation is performed each time data is fetched from the FIFO.

2. Search the sliding window for the data in the buffer to be encoded. For this description, the sliding window and the buffer to be encoded are each assumed to be 8 bytes in size, and the minimum matching length is set to 2.

3. In N0, compare inp[7] (inp: input) with each of win[7] to win[0] (win: sliding window) to obtain an 8-bit flag signal (1 where equal, 0 otherwise). If a flag bit is 1, the matching length at the corresponding position is 1.

4. In N1, compare inp[6] with each of win[6] to win[0], zero-pad the low end of the resulting flag, and logically AND it with the flag obtained in N0. If a flag bit is 1, the matching length at the corresponding position is 2.

5. Similarly, in N2, compare inp[5] with each of win[5] to win[0]. If a flag bit is 1, the matching length is 3. An example is shown in Fig. 9. Let win = DBCCDDBC and inp = DDBCAAAA. Then flag(N0) = 10001100: there are matches at positions 2, 3 and 7, the lowest matching position is 2, and the matching length is 1; flag(N1) = 00011000 & 10001100 = 00001000: there is a match at position 3 and the matching length is 2; flag(N2) = 00001000 & 00001000 = 00001000: there is a match at position 3 and the matching length is 3; flag(N3) = 10001000 & 00001000 = 00001000: there is a match at position 3 and the matching length is 4; flag(N4) = 00000000 & 00001000 = 00000000: no match; flag(N5) = 00000000 & 00000000 = 00000000: no match; flag(N6) = 00000000 & 00000000 = 00000000: no match; flag(N7) = 00000000 & 00000000 = 00000000: no match. In summary, the best match length is 4 and the distance is 4 (addresses start from 0, distances start from 1, distance = address + 1).

6. If there are multiple potential matches at a time, the minimum match, i.e., the one with the shortest distance, is recorded. This requires a module that detects the position of the lowest 1 bit in the code string, namely lz77_detect_one. For example, for an 8-bit code string: when the input is 11110100, the lowest 1 is at position 3 (counting from 1); when the input is 00000000, no 1 is present and the output is set to the maximum value plus 1, that is, 9.

7. In each matching process, the shortest distance and the longest matching length are updated, if the longest matching length exceeds the interval range of [3,64], invalid matching is considered, and filtered data is directly output. As can be seen from the example in step 5, if there is no match for a certain time, the matching process can be finished in advance to improve the real-time performance.
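Steps 3 to 7 can be reproduced with a short software model. This is an illustrative sketch rather than the RTL: the function names `detect_lowest_one` and `find_best_match` are hypothetical, strings are written with index 7 on the left as in the example of step 5, and the [3,64] validity check of step 7 is left to the caller.

```python
def detect_lowest_one(flags, width=8):
    """Return the 1-based position of the lowest set bit, or width + 1 if none."""
    for pos in range(width):
        if (flags >> pos) & 1:
            return pos + 1
    return width + 1


def find_best_match(win, inp):
    """Return (best match length, shortest distance) for the flag-AND search.

    win and inp are equal-length strings whose leftmost character has the
    highest index, e.g. win[7] ... win[0] for 8 characters.
    """
    n = len(win)
    w = win[::-1]                    # w[i] corresponds to win[i]
    x = inp[::-1]                    # x[i] corresponds to inp[i]
    acc = (1 << n) - 1               # accumulated flag, all ones before N0
    best_len, best_dist = 0, n + 1
    for k in range(n):               # stage N_k compares inp[n-1-k]
        raw = 0
        for p in range(k, n):        # low k bits stay zero (zero padding)
            if w[p - k] == x[n - 1 - k]:
                raw |= 1 << p
        acc &= raw                   # AND with the flag of the previous stage
        if acc == 0:                 # no candidate survives: finish early
            break
        best_len = k + 1
        best_dist = detect_lowest_one(acc, n)
    return best_len, best_dist


if __name__ == "__main__":
    print(find_best_match("DBCCDDBC", "DDBCAAAA"))
```

With the values from step 5 the model returns a match length of 4 at distance 4, and the loop stops at N4 because the accumulated flag becomes zero, which corresponds to the early termination mentioned in step 7.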

Fig. 10 is a schematic structural diagram of the crc32 check module, which includes a counting unit count, a check computation unit crc32 core, and a check output unit crc32 mon. Software implementations typically use a look-up table so that one input byte of the bit stream is processed at a time. If a similar table look-up method were adopted in hardware, a table of 256 entries of 32 bits each would have to be stored, which is costly, whereas hardware has the advantage of convenient and flexible bit operations. The present design therefore uses bit manipulation to process the data bits of the input in parallel, and the required bit operations can be readily derived from the algorithm described above.
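The table-free approach can be illustrated with a bit-serial software model of the standard CRC-32 used by PNG. The sketch below is an assumption-laden illustration: the name `crc32_bitwise` is hypothetical, and the loop processes one bit per iteration, whereas a hardware design would typically unroll the same XOR network to consume a whole input word per clock cycle.

```python
import zlib

CRC32_POLY_REFLECTED = 0xEDB88320  # standard CRC-32 polynomial, bit-reflected


def crc32_bitwise(data, crc=0):
    """Compute CRC-32 using only shifts and XORs, without a lookup table."""
    crc ^= 0xFFFFFFFF                       # initial inversion
    for byte in data:
        crc ^= byte                         # fold the next byte into the low bits
        for _ in range(8):                  # one shift/XOR step per bit
            if crc & 1:
                crc = (crc >> 1) ^ CRC32_POLY_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF                 # final inversion


if __name__ == "__main__":
    sample = b"123456789"
    print(hex(crc32_bitwise(sample)), hex(zlib.crc32(sample)))
```

The result agrees with `zlib.crc32`, so such a model can serve as a reference when checking the hardware output.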

The check computation unit crc32 core receives the pixel width data cfg_w_i, the height data cfg_h_i and the encoded output data bs_dat, and performs the bit manipulation operations. Furthermore, because the crc32 check module is multiplexed, it must also acquire the state signal cur_state_r of the outer layer state machine fsm, which is used to determine the chunk type currently being processed. In addition, the crc32 core needs the count value of the counting unit in order to determine when the operation has finished and to clear it.

The check output unit crc32 mon receives the chunk data chunk_dat_i_w output by the check computation unit crc32 core, and splices and outputs it according to the chunk type, the width and height data and the like, to obtain the crc32 check code of the target chunk.

Fig. 11 is a schematic diagram of the adler32 check module, which includes a buffer unit, a storage unit calculation, and a splicing unit. The buffer unit receives the encoded output and a state signal of the outer layer state machine, and outputs buffered data to the storage unit according to the state. In the present invention, the last four bytes of the zlib bit stream are fixed as the adler32 checksum, which is computed byte by byte over the filtered byte stream.

The buffer unit comprises a data buffer (dat_i buffer) and a counting buffer (num_i buffer); the data buffer buffers the filtered data, and the counting buffer counts the number of buffering operations. The storage unit includes a first register adler32 S1, which registers the check byte sum s1_nxt_w of the data in the data buffer, and a second register adler32 S2, which registers the check byte sum s2_nxt_w of the first register. The splicing unit takes the check byte sum of the first register as the low-order data and the check byte sum of the second register as the high-order data, and shifts and splices them to output the adler32 check code adler32_dat. In particular, the buffer unit is further provided with a buffer sub-unit din_w; the outputs of dat_i buffer and num_i buffer are buffered in din_w and then output to the storage unit calculation.

The main difficulty in implementing this logic is that two modulo operations are required for each byte processed. Software implementations typically hold s1 and s2 in wide variables; since the increment of each sum per step is bounded, the modulo operation need only be performed when the number of bytes summed reaches a pre-calculated overflow limit, or when the byte stream ends, which greatly reduces the number of modulo operations. In hardware, a modulo operation whose divisor is a positive integer power of 2 is typically implemented by simply taking the low bits of the dividend, but the divisor of this algorithm (65521) is not a positive integer power of 2. A method similar to the software one would require wide registers and additional control logic, which is costly; moreover, one byte is processed per clock cycle and any byte may be the last one, requiring an immediate modulo operation, so the modulo path would always be present. That method is therefore not suitable for hardware implementation. In this design, the values of s1 and s2 after each modulo operation do not exceed 65520, so the maximum value of s1 after summation does not exceed 65520+255 and the maximum value of s2 after summation does not exceed 65520+65520+255. The immediate modulo operation can therefore be realized simply by comparison and subtraction, which avoids dedicated modulo logic.
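The compare-and-subtract reduction can be checked against a reference implementation with a short software sketch. The function name `adler32_compare_subtract` is hypothetical, and the sketch assumes that s2 is updated with the already reduced value of s1, so that a single subtraction suffices for each sum; the hardware may order these operations differently.

```python
import zlib

ADLER_MOD = 65521  # largest prime below 2**16, the Adler-32 modulus


def adler32_compare_subtract(data):
    """Adler-32 with each modulo replaced by one comparison and one subtraction."""
    s1, s2 = 1, 0
    for byte in data:
        s1 += byte                  # at most 65520 + 255 before reduction
        if s1 >= ADLER_MOD:
            s1 -= ADLER_MOD
        s2 += s1                    # at most 65520 + 65520 (reduced s1 assumed)
        if s2 >= ADLER_MOD:
            s2 -= ADLER_MOD
    return (s2 << 16) | s1          # s2 in the high half, s1 in the low half


if __name__ == "__main__":
    sample = bytes(range(256)) * 300
    print(adler32_compare_subtract(sample) == zlib.adler32(sample))
```

Because both running sums stay below twice the modulus before reduction, neither a divider nor a wide accumulator is needed, which matches the reasoning given above.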

The splicing module splices a series of separate output data items to form the final hardware output bit stream. The separate output data items arrive in different clock cycles and do not exceed 32 bits each, and a signal indicates the number of valid bits in each input. Internally, the splicing module uses a 64-bit register as a buffer and a counter pointer that marks the number of data bits in the buffer not yet output. Each time a data item arrives, it is shifted into the buffer from the low-order end and the pointer is updated. The pointer is then checked: if the buffer holds at least 32 bits not yet output, the highest 32 of those bits are output, an output valid signal lasting one clock cycle is generated, and 32 is subtracted from the pointer. In addition, according to the standard, when the bit stream of a deflate compressed data block is formed into a byte stream, the bits are packed into each byte starting from the least significant bit and proceeding toward the most significant bit, whereas for other byte data the bit order is unchanged. Therefore, to simplify the splicing logic, all input data sent to the splicing module in this design is converted to least-significant-bit-first form; that is, ordinary byte data is bit-reversed within each byte before being sent, so that the splicing module can process all input data in a uniform form.
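The buffer-and-pointer mechanism can be modelled in a few lines of software. The class name `SpliceBuffer`, its attribute names, and the helper `reverse_bits_in_byte` are assumptions made for this sketch; only the 64-bit buffer, the pointer update and the 32-bit output rule are taken from the description above.

```python
def reverse_bits_in_byte(b):
    """Reverse the bit order inside one byte (bit 0 becomes bit 7, and so on)."""
    r = 0
    for i in range(8):
        r = (r << 1) | ((b >> i) & 1)
    return r


class SpliceBuffer:
    """Behavioural model of a 64-bit splice register with a pending-bit pointer."""

    def __init__(self):
        self.buf = 0        # 64-bit shift register
        self.ptr = 0        # number of bits in buf not yet output
        self.words = []     # 32-bit words emitted so far

    def push(self, value, nbits):
        """Shift in nbits (1..32) from the low end; emit a word if 32 are pending."""
        assert 0 < nbits <= 32
        value &= (1 << nbits) - 1
        self.buf = ((self.buf << nbits) | value) & ((1 << 64) - 1)
        self.ptr += nbits
        if self.ptr >= 32:
            # Output the highest 32 pending bits and keep the remainder.
            self.words.append((self.buf >> (self.ptr - 32)) & 0xFFFFFFFF)
            self.ptr -= 32


if __name__ == "__main__":
    sb = SpliceBuffer()
    sb.push(0xABC, 12)
    sb.push(0xDEF01, 20)
    print([hex(w) for w in sb.words], sb.ptr)
```

Since at most 31 bits can be pending before a push and each push adds at most 32 bits, the pending data never exceeds 63 bits, so a 64-bit register is sufficient.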

In summary, the beneficial effects brought by the scheme provided by the application include as follows:

The foregoing describes preferred embodiments of the present invention; it is to be understood that the invention is not limited to the specific embodiments described above, wherein devices and structures not described in detail are to be understood as being implemented in a manner common in the art; any person skilled in the art will make many possible variations and modifications, or adaptations to equivalent embodiments without departing from the technical solution of the present invention, which do not affect the essential content of the present invention; therefore, any simple modification, equivalent variation and modification of the above embodiments according to the technical substance of the present invention still fall within the scope of the technical solution of the present invention.

Claims

1. The PNG image coding architecture based on the FPGA is characterized by comprising a filtering module, a data caching module, a data compression module, a data verification module and a code stream splicing module;

The filtering module receives the scanned and serialized image byte stream, determines the filtering type of each scanning line according to the pixel data, outputs the filtered byte stream after filtering, and sends the filtered byte stream into the data caching module;

The data caching module caches the intermediate data and the result filtering byte stream of the filtering module according to the time sequence, and sends the filtering byte stream to the data compression module;

the data compression module reads the filtered byte stream according to the time sequence, carries out lossless compression based on a sliding window of the dictionary, and the compressed output code stream comprises a code stream to be spliced and data to be calculated;

The data verification module comprises a crc32 verification module and an adler32 verification module, and the code stream to be spliced is input into the code stream splicing module; the data to be calculated is sent to an adler32 check module to generate an adler32 check code; the adler32 check code and the code stream to be spliced are jointly sent to a code stream splicing module, the code stream splicing module intercepts image data information from the code stream to be spliced, sends the image data information to a crc32 check module, receives the crc32 check code calculated and fed back by the crc32 check module, and splices and merges according to the code stream to be spliced, the crc32 check code and the adler32 check code to generate a complete PNG image bit stream.

2. The PNG image encoding architecture based on FPGA of claim 1, wherein the filtering module performs pre-filtering processing according to the input current byte to be filtered, a byte filtered before the current scan line, and a byte filtered at a position corresponding to the previous scan line, so as to determine a target filtering type of the current scan line; and the data caching module caches the current byte to be filtered which is output after the filtering processing, takes the current byte to be filtered as a historical filtering byte of the next time sequence, and carries out filtering based on the target filtering type.

3. The FPGA-based PNG image encoding architecture according to claim 2, wherein the data buffering module comprises three FIFO buffers, the first FIFO buffer and the second FIFO buffer alternately buffering the history filter bytes of the pre-filter output according to the FIFO buffer timing, inputting the history filter bytes as reference pixels to the filtering module, and the filtering module performs filtering based on the next row of bytes to be filtered, the pixel width data, and the history filter bytes; the third FIFO buffer buffers the filtered byte stream output by the filtering module and sends the filtered byte stream to the data compression module.

4. A PNG image coding architecture based on FPGA of claim 3 wherein the data buffer module pre-filters output with a strobe control signal, a two-way FIFO enable signal, a compression enable signal, and a pre-filtered data output;

5. The FPGA-based PNG image encoding architecture according to claim 4, wherein the first FIFO buffer and the second FIFO buffer further respectively input pixel width data, and wherein an output of the first FIFO buffer is connected to one data input of the seventh selector MUX7 and the eighth selector MUX8, and an output of the second FIFO buffer is connected to the other data input of MUX7 and MUX8; the gating output ends of MUX7 and MUX8 are connected to the data input end of the filtering module, and historical filtering data is input to the filtering module; wherein MUX7 and MUX8 are gated out by a gating control signal.

6. The PNG image encoding architecture based on FPGA of claim 1, wherein an inner layer state machine, an outer layer state machine, an encoding register and an encoding core are provided in the data compression module, the encoding register is divided into a buffer area to be encoded and a data sliding area, the data to be encoded and the sliding window data divided by the filtering data are stored respectively, the data to be encoded is moved to the data sliding area according to a time sequence, and the sliding window data is subjected to lossless compression of the data by the encoding core;

7. The FPGA-based PNG image encoding architecture according to claim 6, wherein the encoder core comprises a sliding compression unit, a status flag detection unit, a least significant bit detection unit, and a data encoding output unit;

8. The FPGA-based PNG image encoding architecture of claim 6, wherein a feedback enable signal is output at the filtering module, and a data buffering module performs buffering historical filtering data and data compression input based on the feedback enable signal.

9. The FPGA-based PNG image encoding architecture according to claim 6, wherein the crc32 checking module comprises a counting unit, a checking calculation unit, and a checking output unit; the checking calculation unit executes bit operations based on the received pixel height data, width data, state signal of the outer layer state machine and encoded output data, and outputs chunk check data under the target chunk type; the chunk check data is input to the checking output unit, which determines and outputs a crc32 check code according to the target chunk type and pixel parameter information; the counting unit is used for the chunk count and zero clearing operation and for determining the target chunk type according to the state signal.

10. The FPGA-based PNG image encoding architecture according to claim 6, wherein the adler32 verification module comprises a buffer unit, a storage unit, and a stitching unit; the buffer unit receives the coded output and the state signal of the outer layer state machine, and outputs buffer data to the storage unit according to the state;