Disclosure of Invention
The present invention aims to solve at least one of the technical problems existing in the prior art. The invention therefore provides a multi-scale semantic guided image compression method, system and storage medium, which address the low modeling efficiency of current image compression algorithms on long-range dependencies, the difficulty of preserving image quality in key semantic regions, and the severe distortion that arises in high-resolution image compression.
According to an embodiment of the first aspect of the invention, a multi-scale semantic guided image compression method comprises the following steps:
S100, acquiring input image data, and preprocessing the input image to obtain standardized image data X;
S200, inputting the standardized image data X into a pre-trained semantic segmentation network to generate multi-scale semantic feature maps S_k and the corresponding semantic weight maps W_k;
S300, constructing a three-level pyramid encoder, and executing step by step on the standardized image data X:
S301, depth separable convolution downsampling to generate multi-scale features X_0, X_1, X_2;
S302, a reversible neural network performing a nonlinear transformation on the multi-scale features X_k;
S303, an adaptive discrete wavelet transform decomposing the nonlinearly transformed multi-scale features Y_k into low-frequency subbands L_k and high-frequency subbands H_k;
S400, performing dynamic selective state space modeling on the high-frequency subbands H_k based on the semantic weight maps W_k, comprising:
updating the state by a bidirectional scanning mechanism;
generating a dynamic convolution kernel;
enhancing features by channel-space dual attention gating;
S500, performing non-uniform quantization and entropy coding on the modeled features of each layer in step S400 to generate a compressed code stream;
S600, inputting the compressed code stream into a decoder, decoding based on a lightweight Mamba module, and reconstructing an image by combining the inverse wavelet transform with the semantic weight maps W_k.
According to some embodiments of the invention, in S200, inputting the standardized image data X into a pre-trained semantic segmentation network to generate the multi-scale semantic feature maps S_k and the corresponding semantic weight maps W_k comprises the following steps:
extracting multi-scale features with the Xception65 backbone of a DeepLabv3+ network, generating a semantic feature map after fusion by an atrous spatial pyramid pooling (ASPP) module, aligning it with each coding layer through bilinear interpolation, and taking the maximum value along the channel dimension to generate the semantic weight maps W_k.
According to some embodiments of the invention, in S300, the depth separable convolution downsampling generating multi-scale features comprises:
original-scale features X_0 ∈ R^(H×W×C), half-scale features X_1 ∈ R^((H/2)×(W/2)×C) and quarter-scale features X_2 ∈ R^((H/4)×(W/4)×C);
wherein R is the real space, H is the feature map height, and W is the feature map width.
According to some embodiments of the invention, in S300, the reversible neural network performing the nonlinear transformation on the multi-scale features X_k comprises:
the reversible neural network performs the forward transform T:
[X_a, X_b] = split(X_k);
Y_a = X_a + F(X_b);
Y_b = X_b + G(Y_a);
the reversible neural network performs the inverse transform T^(−1):
[Y_a, Y_b] = split(Y_k);
X_b = Y_b − G(Y_a);
X_a = Y_a − F(X_b);
Wherein:
X_a, X_b are the two parts of the multi-scale feature X_k divided along the channel dimension;
Y_k = (Y_a, Y_b) is the output feature;
F, G are three-layer convolutional residual blocks;
T represents the forward mapping function;
T^(−1) represents the inverse mapping function.
According to some embodiments of the invention, in S400, the bidirectional scanning mechanism comprises:
the forward state update equation:
h_t^f = a ⊙ h_(t−1)^f + W_g · DSC(x_t);
the backward state update equation:
h_t^b = a ⊙ h_(t+1)^b + W_g · DSC(x_t);
Wherein:
h_t^f, h_t^b respectively represent the forward and backward states;
a ∈ R^(C×1) is the state memory weight;
W_g ∈ R^(C×C) is the input gating weight matrix;
DSC(·) is a depth separable convolution operation with a dynamically generated convolution kernel;
⊙ is the element-wise product.
According to some embodiments of the invention, in S400, the dynamic convolution kernel generation comprises:
generating a query Q, a key K and a value V based on the semantic weight map W_k and the high-frequency subband H_k, and computing the convolution kernel parameter matrix through multi-head attention:
Θ = Softmax(QK^T / √d)V;
wherein d is the dimension of the attention head, Softmax(·) is the normalized exponential function, and T denotes matrix transposition.
According to some embodiments of the invention, in S400, the channel-space dual attention gating enhancement feature comprises:
channel attention:
A_c = σ(MLP(GAP(X)));
spatial attention:
A_s = σ(Conv([Mean_c(X); Max_c(X)]));
final output feature:
Y = X ⊙ A_c ⊙ A_s;
Wherein:
X is the input feature;
GAP is global average pooling;
MLP is a multi-layer perceptron;
σ(·) is the sigmoid activation function;
⊙ denotes the element-wise product.
According to some embodiments of the invention, in S500, the non-uniform quantization comprises:
adaptively adjusting the quantization step size according to the semantic weight map:
Δ_(i,j) = Δ_0 / (1 + λ·W(i,j));
Wherein:
Δ_0 is the base quantization step size;
W(i,j) is the response intensity at position (i,j) in the semantic weight map;
λ is the adjustment coefficient;
the quantization operation is defined as: Ẑ_(i,j) = round(Z_(i,j) / Δ_(i,j));
wherein (i,j) denotes the position; Z_(i,j) is the feature value to be quantized, output by the dynamic SSM modeling; round(·) is the rounding function; Δ_(i,j) is the adaptive quantization step size, dynamically adjusted by the semantic weights.
According to some embodiments of the invention, in S500, the entropy encoding comprises:
constructing a joint probability model for the latent variables based on a state space modeling super-prior network:
p(Ẑ | S) = ∏_i N(Ẑ_i; μ_i, σ_i²);
Wherein:
N denotes a Gaussian distribution;
μ_i, σ_i respectively denote the predicted mean and standard deviation of the i-th feature dimension, generated by the super-prior network:
μ_i = f_μ(concat(Ẑ_dec, S)); σ_i = exp(f_σ(concat(Ẑ_dec, S)));
Wherein:
f_μ is a lightweight state space modeling neural network module for estimating the parameter μ_i at each feature location;
f_σ is a lightweight state space modeling neural network module for estimating the parameter σ_i at each feature location;
concat(·,·) denotes the channel concatenation operation;
Ẑ_dec is the decoded feature.
According to some embodiments of the invention, in S600, inputting the compressed code stream into a decoder, decoding based on the lightweight Mamba module, and reconstructing an image by combining the inverse wavelet transform with the semantic weight maps W_k comprises the following steps:
the Mamba module decodes the state update:
h_t = g(W) ⊙ (A ⊙ h_(t−1) + B ⊙ x_t);
Wherein:
h_t is the state vector at the current moment;
A, B are state transition coefficients generated under semantic guidance;
g(·) is the gating function;
the inverse wavelet transform reconstructs the features:
X̂_k = IDWT_θ(L_k, H̃_k);
Wherein:
X̂_k is the reconstructed feature map of the k-th layer;
IDWT_θ is the learnable inverse wavelet transform operator;
the multi-scale semantic fusion outputs the image:
X̂ = Σ_(k=0)^(2) β_k · Up(W_k ⊙ X̂_k);
Wherein:
β_k is a learnable fusion weight.
According to some embodiments of the invention, in S500, the method further comprises median deviation mapping quantization encoding, comprising the following steps:
mapping the latent feature value Z_p to a median reference coordinate system, determining the interval B_n to which it belongs and the median m_n of that interval, and calculating the deviation value:
δ = Z_p − m_n;
symmetrically and discretely quantizing the deviation:
δ̂ = round(δ / γ);
wherein γ is the quantization step size;
generating a triple (p, n, δ̂) for the compressed representation;
wherein p is the spatial position index, n is the interval index, and δ̂ is the quantized deviation.
According to an embodiment of the second aspect of the invention, a multi-scale semantic guided image compression system comprises a memory and a processor, wherein the processor implements the above multi-scale semantic guided image compression method when executing a computer program stored in the memory.
According to an embodiment of the third aspect of the invention, a storage medium stores a program of the multi-scale semantic guided image compression method which, when executed by a processor, implements the method.
The multi-scale semantic guided image compression method, system and storage medium have the following advantages. By introducing a dynamic selective state space modeling mechanism that combines bidirectional scanning with semantics-sensitive attention gating, the computational complexity is effectively reduced, the detail capture and global context understanding of key image regions such as faces and text are enhanced, and the image detail blurring caused by high-frequency information loss in traditional compression methods is resolved. The multi-scale wavelet joint coding architecture, built by fusing a reversible neural network with an adaptive wavelet transform, achieves lossless compression of the low-frequency subbands and avoids low-frequency distortion, while lightweight dynamic convolutional coding of the high-frequency subbands preserves texture details and meets the real-time processing requirements of edge devices. At the decoding end, a selective state space activation mechanism based on the Mamba decoding structure is introduced, in which only key channels participate in image reconstruction; this significantly reduces the decoding computation, and the reconstructed output image is obtained by fusing the inverse wavelet transform results under the guidance of the semantic weight maps W_k.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
In the description of the present invention, it should be understood that orientation or positional terms such as up and down are based on the orientations or positional relationships shown in the drawings; they are used merely for convenience of description and simplification, do not indicate or imply that the referenced apparatus or element must have a specific orientation or be constructed and operated in a specific orientation, and therefore should not be construed as limiting the present invention.
In the description of the present invention, plural means two or more. The description of the first and second is for the purpose of distinguishing between technical features only and should not be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated or implicitly indicating the precedence of the technical features indicated.
In the description of the present invention, unless explicitly defined otherwise, terms such as arrangement, installation, connection, etc. should be construed broadly and the specific meaning of the terms in the present invention can be reasonably determined by a person skilled in the art in combination with the specific contents of the technical scheme.
Referring to fig. 1, fig. 1 is a schematic diagram of a computer device structure of a hardware running environment according to an embodiment of the present application.
As shown in FIG. 1, the computer device may include a processor 1001, such as a central processing unit (Central Processing Unit, CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display (Display) and an input unit such as a keyboard (Keyboard), and optionally may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (Random Access Memory, RAM) or a stable nonvolatile memory (Non-Volatile Memory, NVM), such as a disk memory. The memory 1005 may optionally also be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the architecture shown in fig. 1 is not limiting of a computer device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
As shown in fig. 1, the memory 1005, as one storage medium, may include an operating system, a network communication module, a user interface module, and a multi-scale semantic guided image compression program.
In the computer device shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server, the user interface 1003 is mainly used for data interaction with a user, the processor 1001 and the memory 1005 in the application can be arranged in the computer device, and the computer device calls a multi-scale semantic guidance image compression program based on dynamic state space modeling stored in the memory 1005 through the processor 1001 and executes the multi-scale semantic guidance image compression method based on dynamic state space modeling provided by the embodiment of the application.
Referring to fig. 2, the invention discloses a multi-scale semantic guided image compression method, which comprises the following steps:
S100, acquiring input image data, and preprocessing the input image to obtain standardized image data X;
S200, inputting the standardized image data X into a pre-trained semantic segmentation network to generate multi-scale semantic feature maps S_k and the corresponding semantic weight maps W_k;
S300, constructing a three-level pyramid encoder, and executing step by step on the standardized image data X:
S301, depth separable convolution downsampling to generate multi-scale features X_0, X_1, X_2;
S302, a reversible neural network performing a nonlinear transformation on the multi-scale features X_k;
S303, the adaptive discrete wavelet transform decomposing the nonlinearly transformed multi-scale features Y_k into low-frequency subbands L_k and high-frequency subbands H_k;
S400, performing dynamic selective state space modeling on the high-frequency subbands H_k based on the semantic weight maps W_k, comprising:
updating the state by a bidirectional scanning mechanism;
generating a dynamic convolution kernel;
enhancing features by channel-space dual attention gating;
S500, performing non-uniform quantization and entropy coding on the modeled features of each layer in step S400 to generate a compressed code stream;
S600, inputting the compressed code stream into a decoder, decoding based on a lightweight Mamba module, and reconstructing an image by combining the inverse wavelet transform with the semantic weight maps W_k.
In this embodiment, the input RGB image is subjected to color space conversion (RGB→YUV), normalization ([0,255] → [−1,1]) and anomaly filtering to generate a standardized image X. The semantic segmentation network (DeepLabv3+) extracts the multi-scale feature maps S_k, and the semantic weight maps W_k are generated by taking the maximum along the channel dimension. The encoder adopts a three-level pyramid structure:
depth separable convolution downsampling (stride = 2);
the INN block performs the coupled transformation Y_a = X_a + F(X_b), Y_b = X_b + G(Y_a);
the adaptive discrete wavelet transform decomposes the nonlinearly transformed multi-scale features Y_k into low-frequency subbands L_k and high-frequency subbands H_k.
The dynamic SSM coding module fuses dynamic convolution and an attention mechanism into the state space model, performing state space modeling and dynamic coding of the image data and enhancing the modeling capacity for key regions. The dynamic SSM coding module performs semantic-guided state space modeling on the high-frequency subbands H_k; entropy coding adopts a non-uniform quantization strategy, and the decoding end reconstructs the image with a lightweight Mamba module and the inverse discrete wavelet transform (IDWT). The semantic weight map W_k serves as a spatial importance prior for the dynamic SSM coding module's modeling, quantization step allocation and decoding path activation, realizing on-demand resource allocation.
In some embodiments of the present invention, in step S100, input image data is acquired by an image acquisition module, and color space conversion, normalization and abnormal-image filtering are performed to generate standardized image data. The generation of standardized image data comprises: converting the RGB image into the YUV color space using a standard conversion matrix; separating the luminance (Y) and chrominance (UV) components, retaining the Y channel for subsequent compression and sub-sampling the UV components to reduce the data volume; linearly normalizing the YUV channel pixel values while preserving the sign bits to support the negative-value computation of the subsequent wavelet transform; computing an image sharpness score with the Laplacian operator, counting the proportion of luminance-channel pixel values exceeding a threshold, determining abnormal images and filtering them out; and finally outputting the preprocessed standardized image data X ∈ R^(H×W×3), whose dimensions are consistent with the original input.
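As an illustrative, non-limiting sketch of the S100 preprocessing (NumPy/SciPy; the BT.601 conversion matrix, the normalization constants and the blur threshold are assumptions, not values fixed by the invention):

```python
import numpy as np
from scipy.ndimage import laplace

# Assumed BT.601 RGB -> YUV conversion matrix
RGB2YUV = np.array([[ 0.299,  0.587,  0.114],
                    [-0.147, -0.289,  0.436],
                    [ 0.615, -0.515, -0.100]])

def preprocess(rgb, blur_threshold=50.0):
    """rgb: (H, W, 3) uint8 image. Returns normalized YUV in roughly [-1, 1],
    or None if the frame is rejected as abnormal (blurred)."""
    yuv = rgb.astype(np.float32) @ RGB2YUV.T       # color space conversion
    sharpness = laplace(yuv[..., 0]).var()         # Laplacian sharpness score on Y
    if sharpness < blur_threshold:
        return None                                # filter out abnormal images
    y = yuv[..., 0:1] / 127.5 - 1.0                # luminance mapped to [-1, 1]
    uv = yuv[..., 1:] / 112.0                      # signed chrominance kept symmetric
    return np.concatenate([y, uv], axis=-1)        # X with original spatial dims
```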
In step S200, inputting the standardized image data X into the pre-trained semantic segmentation network to generate the multi-scale semantic feature maps S_k and the corresponding semantic weight maps W_k comprises the following steps:
extracting multi-scale features with the Xception65 backbone of the DeepLabv3+ network, generating the semantic feature map after fusion by the atrous spatial pyramid pooling (ASPP) module, aligning it with each coding layer through bilinear interpolation, and taking the maximum value along the channel dimension to generate the semantic weight maps W_k.
In this embodiment, a pre-trained network is used to extract multi-scale semantic features, and the semantic weight maps are generated via atrous spatial pyramid pooling for the semantic guidance of subsequent encoding.
It should be noted that a pre-trained DeepLabv3+ model is used, with Xception65 as the backbone network. The standardized image data X passes sequentially through the DeepLabv3+ convolution layers for feature extraction, yielding low-level features, mid-level features and high-level features, whose channel numbers are C1 = 256, C2 = 512 and C3 = 1024, respectively.
It should be noted that, for multi-scale context modeling, the high-level features are input to the ASPP module, which, as shown in fig. 4, processes them through five parallel branches:
branch 1: a 1×1 standard convolution, keeping the spatial dimensions unchanged;
branches 2 to 4: 3×3 atrous convolutions with dilation rates of 6, 12 and 18, respectively;
branch 5: global average pooling followed by upsampling to the original spatial dimensions, with the channel dimension restored through a 1×1 convolution.
The five branch outputs are concatenated along the channel dimension and fused through a 1×1 convolution to obtain the multi-scale feature fusion map F_ASPP.
Semantic segmentation map generation: F_ASPP is input to a 3×3 convolution and a Softmax layer to generate the semantic prediction map S, whose channel number Cs represents the number of semantic categories.
S is upsampled through bilinear interpolation to align with each level of the encoder's input resolution, generating the three-level semantic feature maps {S_0, S_1, S_2}, corresponding to the original, 1/2 and 1/4 resolutions, respectively.
Subsequently, for each multi-scale semantic feature map S_k, the maximum value is taken along the channel dimension to obtain the weight map:
W_k(i,j) = max_c S_k(i,j,c);
The semantic weight maps W_k will be used for dynamic attention generation and important-region guidance during the encoding and compression stages.
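The weight map generation can be illustrated by the following sketch (PyTorch; the class count and pyramid resolutions are hypothetical examples, not values fixed by the invention):

```python
import torch
import torch.nn.functional as F

def semantic_weight_maps(seg_logits, encoder_sizes):
    """seg_logits: (B, Cs, H, W) semantic prediction map from the ASPP head.
    encoder_sizes: (h, w) resolutions of the three pyramid levels."""
    probs = torch.softmax(seg_logits, dim=1)            # class probabilities S
    weights = []
    for (h, w) in encoder_sizes:
        s_k = F.interpolate(probs, size=(h, w),
                            mode="bilinear", align_corners=False)
        w_k, _ = s_k.max(dim=1, keepdim=True)           # max over the channel (class) dim
        weights.append(w_k)                              # (B, 1, h, w) weight map W_k
    return weights

# usage with a hypothetical 21-class output and a three-level pyramid
logits = torch.randn(1, 21, 128, 128)
W0, W1, W2 = semantic_weight_maps(logits, [(128, 128), (64, 64), (32, 32)])
```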
In some embodiments of the invention, in S300, the depth separable convolution downsampling generating multi-scale features comprises:
original-scale features X_0, half-scale features X_1 and quarter-scale features X_2.
In this embodiment, the multi-scale wavelet joint coding module builds a three-level pyramid structure with stride-2 depth separable convolutions for progressive downsampling; the adaptive wavelet transform is then applied to the feature map at each scale to decompose the image features into low-frequency subbands L_k and high-frequency subbands H_k, and lossless compression is achieved through the reversible neural network.
Specifically, each stage of the pyramid encoder performs the following operations, as shown in fig. 5:
Initial downsampling and feature extraction: the standardized image data X is input to a three-stage depth separable convolution network, generating three levels of feature expression:
first stage: input the standardized image data X, output X_0;
second stage: input X_0, output X_1;
third stage: input X_1, output X_2;
wherein the depth separable convolution operation is defined as:
DSC(X) = Conv_1×1(DWConv_3×3(X));
wherein DSC(·) is formed by a 3×3 depthwise convolution and a 1×1 pointwise convolution connected in series; this structure significantly reduces the computational cost and is suitable for edge computing devices.
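An illustrative sketch of the DSC(·) building block (PyTorch; the kernel sizes follow the series structure described above, while the default stride of 2 and the padding are assumptions):

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution;
    stride=2 realizes the pyramid downsampling at each encoder level."""
    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride,
                                   padding=1, groups=in_ch)  # per-channel filtering
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)         # cross-channel mixing

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```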
Reversible neural network (INN) transformation: each level of features X_k is decomposed into two parts X_a and X_b, and the following coupling-layer forward mapping is performed:
Y_a = X_a + F(X_b);
Y_b = X_b + G(Y_a);
the output is Y_k = (Y_a, Y_b), where F and G are three-layer convolutional residual blocks, each layer containing BatchNorm (batch normalization), LeakyReLU (leaky rectified linear unit) and a 3×3 convolution.
Adaptive wavelet decomposition: the INN output features Y_k are input to a custom wavelet decomposition module, which performs a learnable discrete wavelet transform (Learnable DWT):
(L_k, H_k) = DWT_θ(Y_k);
Wherein:
L_k is the low-frequency subband, which preserves the main structural information of the image;
H_k is the high-frequency subband, which preserves texture and edge details and is further sent to the state space modeling module for processing.
The high-frequency subband has half the spatial resolution of Y_k and three times its channel count, where the triple channels correspond to the horizontal, vertical and diagonal detail directions of the wavelet transform.
It should be noted that the reversible neural network INN supports exact inverse transformation, with the inverse function defined as:
X_b = Y_b − G(Y_a);
X_a = Y_a − F(X_b);
X_k = (X_a, X_b);
T^(−1) denotes the inverse mapping function, ensuring that the encoder output Y_k can be recovered by the INN in the decoding stage, satisfying the lossless compression requirement of the low-frequency subband.
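The coupling layer and its exact inverse can be sketched as follows (PyTorch; the residual-branch layout follows the BatchNorm/LeakyReLU/3×3 description above, while the channel split ratio and negative slope are assumptions):

```python
import torch
import torch.nn as nn

def residual_branch(ch):
    # three convolutional layers with BatchNorm + LeakyReLU, as described above
    return nn.Sequential(
        nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.LeakyReLU(0.2),
        nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.LeakyReLU(0.2),
        nn.Conv2d(ch, ch, 3, padding=1))

class AdditiveCoupling(nn.Module):
    """Invertible coupling layer: forward map T and its exact inverse T^{-1}."""
    def __init__(self, ch):
        super().__init__()
        self.F = residual_branch(ch // 2)
        self.G = residual_branch(ch // 2)

    def forward(self, x):
        xa, xb = x.chunk(2, dim=1)       # split along the channel dimension
        ya = xa + self.F(xb)
        yb = xb + self.G(ya)
        return torch.cat([ya, yb], dim=1)

    def inverse(self, y):
        ya, yb = y.chunk(2, dim=1)
        xb = yb - self.G(ya)             # undo the second coupling step
        xa = ya - self.F(xb)             # undo the first coupling step
        return torch.cat([xa, xb], dim=1)
```

Because the coupling is additive, the inverse recovers the input exactly regardless of what F and G compute, which is what permits lossless recovery of the low-frequency path.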
In some embodiments of the present invention, in S400, the bidirectional scanning mechanism comprises:
the forward state update equation:
h_t^f = a ⊙ h_(t−1)^f + W_g · DSC(x_t);
the backward state update equation:
h_t^b = a ⊙ h_(t+1)^b + W_g · DSC(x_t);
Wherein:
h_t^f, h_t^b respectively represent the forward and backward states;
a ∈ R^(C×1) is the state memory weight;
W_g ∈ R^(C×C) is the input gating weight matrix;
DSC(·) is a depth separable convolution operation with a dynamically generated convolution kernel;
⊙ is the element-wise product.
In this embodiment, the dynamic SSM coding module fuses dynamic convolution and an attention mechanism into the state space model to perform state space modeling and dynamic coding of the image data, enhancing the modeling capacity for key regions.
A bidirectional scanning mechanism and a dynamic convolution kernel driven by the semantic weight map are introduced, and the modeling of key regions is enhanced through channel-space dual attention gating, reducing the computational complexity. The method specifically comprises the following steps:
As shown in fig. 6, the implementation of the dynamic SSM module includes:
receiving the high-frequency subband features H_k from step S300 and the multi-scale semantic weight maps W_k generated in step S200, and constructing a dynamic selective state space modeling (Dynamic Selective SSM) module to realize dynamic modeling and compressed representation of features in high-semantic regions.
The semantic alignment process first employs a bilinear interpolation function resize(·) to adjust the spatial dimensions of the semantic weight map to coincide with the corresponding high-frequency features H_k:
W̃_k = resize(W_k, size(H_k));
wherein W̃_k is the aligned semantic weight map and resize(·) is the spatial dimension adjustment function.
This operation ensures a one-to-one correspondence between the semantic guidance and the feature space, enhancing the guidance precision.
The bidirectional state update mechanism includes:
for feature modeling along the scanning sequence, a bidirectional state space structure is introduced, comprising forward and backward state updates. The specific update formulas are as follows:
the forward state update equation is:
h_t^f = a ⊙ h_(t−1)^f + W_g · DSC(x_t);
the backward state update equation is:
h_t^b = a ⊙ h_(t+1)^b + W_g · DSC(x_t);
wherein h_t^f and h_t^b respectively denote the forward and backward states; a ∈ R^(C×1) is the state memory weight, guided by the multi-scale semantic weight map W̃_k; W_g ∈ R^(C×C) is the input gating weight matrix; DSC(·) is the depth separable convolution operation with a dynamically generated kernel; ⊙ is the element-wise product.
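A minimal sketch of the bidirectional scan (PyTorch; the sequence flattening order, the assumption that the input already holds the DSC-processed values DSC(x_t), and the simple sum fusion of the two directions are all illustrative choices):

```python
import torch

def bidirectional_scan(x, a, W_g):
    """x:   (T, C) token sequence, assumed pre-processed by DSC(.)
    a:   (C,)  state memory weights (semantically modulated in the full method)
    W_g: (C, C) input gating weight matrix."""
    T, C = x.shape
    hf = torch.zeros(T, C)
    hb = torch.zeros(T, C)
    h = torch.zeros(C)
    for t in range(T):                 # forward scan: h_t = a*h_{t-1} + W_g x_t
        h = a * h + x[t] @ W_g.T
        hf[t] = h
    h = torch.zeros(C)
    for t in reversed(range(T)):       # backward scan over the reversed sequence
        h = a * h + x[t] @ W_g.T
        hb[t] = h
    return hf + hb                     # illustrative fusion of the two directions
```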
In some embodiments of the present invention, in S400, the dynamic convolution kernel generation comprises:
generating a query Q, a key K and a value V from the semantic weight map W̃_k and the high-frequency subband H_k, and computing the convolution kernel parameter matrix through multi-head attention:
Θ = Softmax(QK^T / √d)V;
where d is the dimension of the attention head, Softmax(·) is the normalized exponential function, and T denotes matrix transposition.
In this embodiment, the dynamic convolution kernel generation mechanism is as follows:
the weights of the dynamic convolution kernel are generated by an attention mechanism. First, 1×1 convolutions are applied to the semantic map and the high-frequency features respectively to obtain the query Q, key K and value V:
Q = Conv_1×1(W̃_k);
K = Conv_1×1(H_k);
V = Conv_1×1(H_k);
based on the standard multi-head attention mechanism, the dynamic convolution kernel parameter matrix Θ is computed:
Θ = Softmax(QK^T / √d)V;
where d is the dimension of the attention head, used for normalization to prevent gradient explosion.
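An illustrative single-head version of this kernel generation (PyTorch; the head dimension d and the single-head simplification are assumptions relative to the multi-head formulation above):

```python
import torch
import torch.nn as nn

class DynamicKernelGenerator(nn.Module):
    """Generate convolution-kernel parameters from the semantic map (query)
    and high-frequency subband (key/value) via scaled dot-product attention."""
    def __init__(self, ch, d=64):
        super().__init__()
        self.q = nn.Conv2d(1, d, 1)     # 1x1 conv on the semantic weight map
        self.k = nn.Conv2d(ch, d, 1)    # 1x1 conv on the high-frequency subband
        self.v = nn.Conv2d(ch, d, 1)
        self.d = d

    def forward(self, w_map, h_feat):
        Q = self.q(w_map).flatten(2).transpose(1, 2)   # (B, N, d)
        K = self.k(h_feat).flatten(2).transpose(1, 2)
        V = self.v(h_feat).flatten(2).transpose(1, 2)
        attn = torch.softmax(Q @ K.transpose(1, 2) / self.d ** 0.5, dim=-1)
        return attn @ V                                 # (B, N, d) kernel parameters
```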
In some embodiments of the present invention, in step S400, the channel-space dual attention gating enhancement feature comprises:
channel attention:
A_c = σ(MLP(GAP(X)));
spatial attention:
A_s = σ(Conv([Mean_c(X); Max_c(X)]));
final output feature:
Y = X ⊙ A_c ⊙ A_s;
Wherein:
X is the input feature;
GAP is global average pooling;
MLP is a multi-layer perceptron;
σ(·) is the sigmoid activation function;
⊙ denotes the element-wise product.
Specifically, the channel-space dual attention gating mechanism includes:
attention gating enhancement is applied to the feature X generated by the dynamic convolution kernel, in two stages of channel attention (CA) and spatial attention (SA):
the channel attention is computed as: A_c = σ(MLP(GAP(X)));
the spatial attention is computed as: A_s = σ(Conv([Mean_c(X); Max_c(X)]));
the final output feature is the dual-path enhanced feature:
Y = X ⊙ A_c ⊙ A_s;
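A sketch of the dual attention gate (PyTorch; the channel branch follows the GAP + MLP + σ description above, while the CBAM-style mean/max spatial branch and the reduction ratio are assumptions, since the spatial formula is not detailed here):

```python
import torch
import torch.nn as nn

class DualAttentionGate(nn.Module):
    """Channel attention followed by spatial attention, applied as gates."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(ch, ch // reduction), nn.ReLU(),
                                 nn.Linear(ch // reduction, ch))
        self.spatial = nn.Conv2d(2, 1, 7, padding=3)

    def forward(self, x):
        B, C, H, W = x.shape
        a_c = torch.sigmoid(self.mlp(x.mean(dim=(2, 3)))).view(B, C, 1, 1)
        x_c = x * a_c                                   # channel-gated feature
        pooled = torch.cat([x_c.mean(1, keepdim=True),
                            x_c.max(1, keepdim=True).values], dim=1)
        a_s = torch.sigmoid(self.spatial(pooled))       # spatial gate
        return x_c * a_s
```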
In some embodiments of the present invention, in step S500, the non-uniform quantization comprises:
adaptively adjusting the quantization step size according to the semantic weight map:
Δ_(i,j) = Δ_0 / (1 + λ·W(i,j));
Wherein:
Δ_0 is the base quantization step size;
W(i,j) is the response intensity at position (i,j) in the semantic weight map;
λ is an adjustment coefficient for enhancing the resolution of high-semantic regions;
the quantization operation is defined as: Ẑ_(i,j) = round(Z_(i,j) / Δ_(i,j));
wherein (i,j) denotes the position; Z_(i,j) is the feature value to be quantized, output by the dynamic SSM modeling; round(·) is the rounding function; Δ_(i,j) is the adaptive quantization step size, dynamically adjusted by the semantic weights.
In this embodiment, the semantic-guidance-based non-uniform quantization strategy and joint probability modeling process comprise receiving the dynamic features Z output by S400 and performing quantization and modeling to balance the compression rate and fidelity.
The adaptive quantization step size at each position is calculated using the formula:
Δ_(i,j) = Δ_0 / (1 + λ·W(i,j));
the quantization operation is defined as:
Ẑ_(i,j) = round(Z_(i,j) / Δ_(i,j));
The quantized features Ẑ form an entropy-encodable compressed representation. This strategy guarantees that semantically important regions (e.g., faces, text) receive finer coding, while background regions can be processed more coarsely to save bit rate.
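A minimal sketch of the adaptive quantization (PyTorch; the values of Δ_0 and λ are illustrative placeholders):

```python
import torch

def semantic_quantize(z, w, delta0=1.0, lam=4.0):
    """Non-uniform quantization: finer steps where the semantic weight is high.
    z: (B, C, H, W) dynamic SSM features; w: (B, 1, H, W) semantic weight map."""
    delta = delta0 / (1.0 + lam * w)    # per-position adaptive step size
    z_hat = torch.round(z / delta)      # quantized symbols for entropy coding
    return z_hat, delta                 # decoder recovers z ~= z_hat * delta
```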
In some embodiments of the present invention, in S500, the entropy encoding comprises:
constructing a joint probability model for the latent variables based on a state space modeling super-prior network:
p(Ẑ | S) = ∏_i N(Ẑ_i; μ_i, σ_i²);
Wherein:
N denotes a Gaussian distribution;
μ_i, σ_i respectively denote the predicted mean and standard deviation of the i-th feature dimension, generated by the super-prior network:
μ_i = f_μ(concat(Ẑ_dec, S)); σ_i = exp(f_σ(concat(Ẑ_dec, S)));
Wherein:
f_μ is a lightweight state space modeling neural network module for estimating the parameter μ_i at each feature location;
f_σ is a lightweight state space modeling neural network module for estimating the parameter σ_i at each feature location;
concat(·,·) denotes the channel concatenation operation;
Ẑ_dec is the decoded feature.
In the present embodiment, to improve compression efficiency, the joint probability distribution of the features is modeled as:
p(Ẑ | S) = ∏_i N(Ẑ_i; μ_i, σ_i²);
the super-prior network adopts multi-scale feature fusion: the decoded features Ẑ_dec and the semantic feature map S_k at the corresponding scale are concatenated and then input to the state space modeling module (Mamba) for parameter prediction, specifically:
μ_i = f_μ(concat(Ẑ_dec, S_k)); σ_i = exp(f_σ(concat(Ẑ_dec, S_k)));
where concat(·,·) denotes the channel concatenation operation, and an exponential function is used to ensure that the predicted standard deviation is positive.
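The bitrate implied by this Gaussian model can be estimated as follows (PyTorch; modeling each symbol's probability as the Gaussian mass over a unit quantization bin is an assumption consistent with common learned-compression practice, not a detail fixed by the text above):

```python
import torch

def rate_estimate(z_hat, mu, sigma):
    """Expected bits under the factorized Gaussian model from the super-prior."""
    normal = torch.distributions.Normal(mu, sigma)
    # probability mass of the quantization bin [z_hat - 0.5, z_hat + 0.5]
    p = normal.cdf(z_hat + 0.5) - normal.cdf(z_hat - 0.5)
    return -torch.log2(p.clamp_min(1e-9)).sum()   # total bits for the tensor
```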
In some embodiments of the present invention, in S600, inputting the compressed code stream into a decoder, decoding based on the lightweight Mamba module, and reconstructing an image by combining the inverse wavelet transform with the semantic weight maps W_k comprises the following steps:
the Mamba module decodes the state update:
h_t = g(W) ⊙ (A ⊙ h_(t−1) + B ⊙ x_t);
Wherein:
h_t is the state vector at the current moment;
A, B are state transition coefficients generated under semantic guidance;
g(·) is the gating function;
the inverse wavelet transform reconstructs the features:
X̂_k = IDWT_θ(L_k, H̃_k);
Wherein:
IDWT_θ is the learnable inverse wavelet transform operator;
the multi-scale semantic fusion outputs the image:
X̂ = Σ_(k=0)^(2) β_k · Up(W_k ⊙ X̂_k);
Wherein:
β_k is a learnable fusion weight.
In an implementation, as shown in fig. 8, the semantic guided image reconstruction process based on the Mamba decoder includes:
receiving the code stream representation Ẑ output by the encoding stage, combining the semantic weight maps W_k and the wavelet subbands (L_k, H_k), and reconstructing the original image through the Mamba decoder and the inverse wavelet transform module.
The Mamba decoder based on state space modeling uses a lightweight state space mechanism that selectively activates channels under semantic guidance to update the states; the computation process is as follows:
h_t = g(W) ⊙ (A ⊙ h_(t−1) + B ⊙ x_t);
wherein h_t ∈ R^C is the current channel state, and A, B are the state transition coefficients generated under semantic guidance;
g(·) is the gating function, which suppresses invalid feature channels.
The inverse wavelet reconstruction module receives the output features of the Mamba decoder, matches them with the low-frequency subbands L_k and high-frequency subbands H̃_k, and performs the inverse wavelet transform:
X̂_k = IDWT_θ(L_k, H̃_k);
where IDWT_θ is the learnable inverse wavelet transform operator and X̂_k is the reconstructed feature map at the current scale.
Multi-scale fusion and semantic guided reconstruction: the decoding results of the three scales are weighted and fused with the semantic maps to generate the final reconstructed image:
X̂ = Σ_(k=0)^(2) β_k · Up(W_k ⊙ X̂_k);
wherein β_k ∈ R is the fusion weight and W_k is the semantic weight map at the corresponding scale.
The fusion strategy gives a larger influence to semantically high-weight regions, ensuring that key structures (such as face contours and text edges) are preserved more completely in the reconstructed image.
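A sketch of the multi-scale semantic fusion (PyTorch; the (1 + W_k) modulation that keeps background regions non-zero is an assumption about how the weighted fusion behaves, and β_k is passed as a plain list here rather than a learnable parameter):

```python
import torch.nn.functional as F

def fuse_scales(recons, weights, betas):
    """recons: reconstructed feature maps per scale, coarsest last.
    weights: matching semantic weight maps W_k; betas: fusion weights beta_k."""
    H, W = recons[0].shape[-2:]
    out = 0.0
    for x_k, w_k, beta in zip(recons, weights, betas):
        x_up = F.interpolate(x_k, size=(H, W), mode="bilinear", align_corners=False)
        w_up = F.interpolate(w_k, size=(H, W), mode="bilinear", align_corners=False)
        out = out + beta * (1.0 + w_up) * x_up   # semantic regions get a larger share
    return out
```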
In a specific implementation, the loss functions added are as follows:
To effectively improve the restoration quality of semantic regions, structural fidelity and visual consistency during image compression, multiple loss functions are introduced between the reconstructed image X̂ and the original image X in the model training stage to construct a composite loss objective L_total, which includes a semantic edge loss, a structure preservation loss and conventional reconstruction loss terms. Specifically:
Pixel-level reconstruction loss (MSE):
this loss measures the pixel-level error between the reconstructed image and the original, defined as:
L_MSE = (1/N) Σ_i (x̂_i − x_i)²;
where N is the total number of pixels in the image, and x̂_i and x_i are respectively the values of pixel i in the reconstructed image and the original image.
Structure preservation loss (SSIM Loss):
the structural similarity index (Structural Similarity Index, SSIM) is used to measure the structural consistency of the images, defined as:
L_SSIM = 1 − SSIM(X, X̂);
this loss term mainly constrains the luminance, contrast and structural information of the image, ensuring that the reconstructed image remains perceptually similar to the original.
Considering that the edges of semantic regions typically carry important structural information, a semantic edge guidance loss is introduced:
L_sem = (1/M) Σ_k Σ_((i,j)∈Ω_k) w_k(i,j) · |∇X̂(i,j) − ∇X(i,j)|;
Wherein:
M is a normalization factor;
k is the pyramid level index;
(i,j) are the pixel spatial coordinates;
∇X̂(i,j) is the gradient value of the reconstructed image;
∇X(i,j) is the gradient value of the original image;
Ω_k denotes the high-response semantic region at the k-th level;
w_k(i,j) ∈ [0,1] is the guidance strength at that position in the semantic weight map;
this loss term emphasizes the fidelity of semantic region edge reconstruction.
During the training process, the compression rate target is considered synchronously by adding a rate control term:
L_rate = E[−log₂ p(Ẑ)];
Wherein:
E[·] denotes the expectation over the quantized features Ẑ;
p(Ẑ) is the probability value of the quantized features;
−log₂ p(·) is the information content function.
This loss term measures the average number of bits of the encoded compressed code and is derived from the joint probability model established in step S500.
It will be appreciated that the total loss function is combined as follows:
L_total = λ_1·L_MSE + λ_2·L_SSIM + λ_3·L_sem + λ_4·L_rate;
Wherein:
λ_1 ~ λ_4 are loss term balance coefficients, optimized by cross-validation during training; λ_3 > λ_1 is generally set to highlight the reconstruction accuracy of semantic edge regions, and λ_4 can be increased appropriately for real-time compression applications to control the overall code rate.
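An illustrative composite-loss sketch (PyTorch; the λ values are placeholders, the SSIM term is left as a stub to be supplied by an external implementation such as pytorch_msssim, and the finite-difference gradient is an assumed stand-in for ∇):

```python
import torch
import torch.nn.functional as F

def composite_loss(x, x_hat, edge_w, bits, n_pixels,
                   lambdas=(1.0, 0.5, 2.0, 0.05)):
    """x, x_hat: (B, C, H, W) original and reconstruction; edge_w: (B, 1, H, W)
    semantic weight map restricted to high-response regions; bits: output of the
    entropy model's rate estimate."""
    l1, l2, l3, l4 = lambdas
    mse = F.mse_loss(x_hat, x)
    ssim_loss = torch.tensor(0.0)        # stub: substitute 1 - SSIM(x, x_hat)
    # finite differences along both axes as a gradient proxy
    gx = (x[..., :, 1:] - x[..., :, :-1]) - (x_hat[..., :, 1:] - x_hat[..., :, :-1])
    gy = (x[..., 1:, :] - x[..., :-1, :]) - (x_hat[..., 1:, :] - x_hat[..., :-1, :])
    edge = (edge_w[..., :, 1:] * gx.abs()).mean() + (edge_w[..., 1:, :] * gy.abs()).mean()
    rate = bits / n_pixels               # bits per pixel from the entropy model
    return l1 * mse + l2 * ssim_loss + l3 * edge + l4 * rate
```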
Through the multi-loss combined optimization strategy, the structure retaining capacity and reconstruction quality of an image compression system in a semantic significant region (such as a human face, a text and an object outline) can be effectively improved, and meanwhile, the overall image compression efficiency is considered, so that the method is applicable to various scenes requiring high compression ratio and high fidelity reconstruction.
In some embodiments of the present invention, step S500 further comprises median deviation mapping quantization encoding, comprising the following steps:
mapping the latent feature value Z_p to a median reference coordinate system, determining the interval B_n to which it belongs and the median m_n of that interval, and calculating the deviation value:
δ = Z_p − m_n;
wherein Z_p is the original feature value;
symmetrically and discretely quantizing the deviation:
δ̂ = round(δ / γ);
wherein γ is the quantization step size;
generating a triple (p, n, δ̂) for the compressed representation;
wherein p is the spatial position index, n is the interval index, and δ̂ is the quantized deviation.
The distribution model is used in the encoder to optimize bit rate allocation, and provides prior guidance for reconstruction in the decoder.
In some embodiments of the present invention, a discrete coding mechanism based on the deviation representation is further introduced on top of the non-uniform quantization module, as shown in fig. 7. The latent feature or pixel value range [0,255] is first divided into N interval segments, each interval defined as:
B_n = [n·(256/N), (n+1)·(256/N));
an intermediate value m_n is set for each interval as the reconstruction reference value. For each encoded feature point Z_p, the interval B_n to which it belongs is looked up and its deviation relative to the median is calculated:
δ = Z_p − m_n;
the deviation is symmetrically and discretely quantized:
δ̂ = round(δ / γ);
finally, the triple (p, n, δ̂) is encoded, in which p is the spatial position index, n is the interval index, and δ̂ is the quantized deviation. Entropy encoding or variable-length encoding may further be employed to compress the triple data for a more efficient code stream expression.
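A minimal sketch of the triple encoding and its decoding (NumPy; the interval count N, the step γ and the use of the interval midpoint as m_n are illustrative choices):

```python
import numpy as np

def median_deviation_encode(z, n_intervals=16, gamma=0.5):
    """z: flat array of values in [0, 255]. Returns (p, n, dev) triples."""
    width = 256.0 / n_intervals
    idx = np.clip((z // width).astype(int), 0, n_intervals - 1)  # interval B_n
    median = (idx + 0.5) * width                                  # midpoint m_n
    dev_q = np.round((z - median) / gamma).astype(int)            # symmetric quantization
    return [(p, int(n), int(d)) for p, (n, d) in enumerate(zip(idx, dev_q))]

def median_deviation_decode(triples, n_intervals=16, gamma=0.5):
    width = 256.0 / n_intervals
    out = np.empty(len(triples))
    for p, n, d in triples:
        out[p] = (n + 0.5) * width + d * gamma   # reference value plus deviation
    return out
```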
According to the application, by introducing a dynamic selective state space modeling mechanism and combining bidirectional scanning and semantic sensitive attention gating, the computational complexity is effectively reduced, the detail capturing and global context understanding capability of key areas of images such as faces and characters is enhanced, and the problem of image detail blurring caused by high-frequency information loss in the traditional compression method is solved. Furthermore, based on a semantic-guided non-uniform quantization strategy and a median deviation mapping quantization coding mechanism, quantization step sizes can be dynamically adjusted according to a semantic weight graph, three-dimensional coordinate coding is performed by utilizing spatial position and deviation information, finer quantization levels are distributed to a high semantic value area, and balance between compression efficiency and reconstruction quality is optimized. The multi-scale-wavelet joint coding architecture constructed by fusing the reversible neural network and the adaptive wavelet transformation effectively realizes lossless compression of the low-frequency sub-band, avoids low-frequency distortion, and simultaneously retains texture details by virtue of lightweight dynamic convolution coding of the high-frequency sub-band so as to meet the real-time processing requirement of edge equipment. At the decoding end, a selective state space activation mechanism based on Mamba structures is introduced, only key channels are reserved to participate in image reconstruction, the decoding calculation amount is obviously reduced, and an output image is reconstructed through inverse wavelet transformation and multi-scale semantic fusion. The technical breakthroughs enable the application to have important practical value and wide application prospect in application scenes such as security monitoring, mobile communication, medical images and the like which need to balance compression rate and visual fidelity.
Referring to fig. 9, the invention also discloses a multi-scale semantic guided image compression system, which comprises a memory and a processor, wherein the processor realizes the multi-scale semantic guided image compression method when executing the computer program stored in the memory.
Further, to implement the image compression system end to end, at the system architecture level the image compression system provided by the invention further includes the following modules:
And the image preprocessing module is used for acquiring an original image, performing color space conversion, normalization processing and abnormal image rejection and outputting a standardized image.
And the semantic guidance module is used for extracting semantic features by utilizing a DeepLabv3+ semantic segmentation model and obtaining a multi-scale semantic weight map through bilinear interpolation to guide subsequent encoding.
And the multi-scale wavelet coding module is used for constructing a three-level coding pyramid, wherein each level consists of depth separable convolution, a reversible neural network and self-adaptive wavelet transformation and outputs a low-frequency sub-band and a high-frequency sub-band.
And the dynamic SSM modeling module is used for integrating semantic graph guidance and a bidirectional scanning mechanism, realizing state space modeling and outputting dynamic convolution coding characteristics.
And the entropy coding module is used for extracting the context characteristics by using the super prior network, estimating probability distribution by combining the context characteristics with the semantic weight map, and executing a non-uniform quantization and median deviation mapping quantization coding strategy to generate a code stream.
And the decoding and reconstructing module combines Mamba state updating and a semantic gating mechanism, and finally generates a reconstructed image through inverse wavelet transformation and semantic weighted fusion.
The invention also discloses a storage medium, which stores a program of the multi-scale semantic guided image compression method; when executed by a processor, the program implements the multi-scale semantic guided image compression method.
The multi-scale semantic guidance image compression system and the storage medium adopt all the technical schemes of the multi-scale semantic guidance image compression method of the above embodiment, so that the multi-scale semantic guidance image compression system and the storage medium at least have all the beneficial effects brought by the technical schemes of the above embodiment, and are not repeated herein.
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of one of ordinary skill in the art without departing from the spirit of the present invention.