Disclosure of Invention
The invention aims to solve the technical problem of providing an active defense method based on a face identity watermark and a mixed attention module, which realizes the detection and tracing functions of face tampering and greatly improves the practicability and the functionality of a model.
The invention adopts the following technical scheme to realize the aim of the invention:
An active defense method based on a face identity watermark and a mixed attention module, characterized by comprising the following steps:
S1, extracting a face identity code;
S2, generating a face identity code;
S3, a mixed attention encoder module;
S4, designing a loss function;
S5, analyzing the correlation of the facial features.
As a further limitation of the present technical solution, the specific steps of S1 are:
The primary step of face identity watermark generation is to accurately extract stable face features from the face region of the input image. A RetinaFace detector, a deep-learning-based face detection and alignment algorithm, is adopted to detect and locate the face in the input image. The main loss for locating the face region is the face classification loss:
(1);
where the terms are, in order: the ground-truth label; the probability, predicted by the model, that the prior box contains a face; the weight of positive samples; and the focusing parameter that adjusts the relative weight of positive and negative samples;
A two-dimensional Mel face feature extraction structure is constructed for image texture feature extraction; it mainly comprises three parts: a multi-resolution image pyramid, a multi-scale two-dimensional Mel filter bank, and logarithmic energy spectrum feature extraction;
The multi-resolution image pyramid downsamples the image at different resolutions to form a series of images at different scales; the process mainly comprises Gaussian smoothing and downsampling operations;
The core of Gaussian smoothing is convolving the image with a Gaussian kernel, whose expression is:
G(x, y) = (1 / (2πσ²)) · exp( −(x² + y²) / (2σ²) )    (2);
where: (x, y) are the pixel coordinates within the Gaussian kernel; σ is the standard deviation of the Gaussian function;
The multi-scale two-dimensional Mel filter is constructed as follows:
(3);
where the parameters are, respectively: the wavelength of the sinusoidal factor; the orientation of the filter; the phase offset; a variable parameter controlling the degree of spatial localization of the filter; and the spatial aspect ratio, which controls the filter shape in different directions;
For each resolution level produced by the multi-resolution pyramid, two-dimensional Mel filters of different scales are convolved with the image of that level to obtain response images at different scales. Finally, the frequency distribution and energy information of the image are extracted with a logarithmic energy spectrum, computed as follows:
Energy spectrum calculation:
(4);
where the terms denote, respectively: the multi-scale response image; the frequency-domain coordinates (u, v) of the corresponding image; and the absolute-value (modulus) operation;
Logarithmic extraction:
LogEnergy(u, v) = log( Energy(u, v) + ε )    (5);
where Energy(u, v) is the face information extracted at frequency-domain coordinates (u, v), and ε is a small constant.
As a further limitation of the present technical solution, the specific steps of S2 are:
The average hash algorithm is adopted to convert the extracted face feature vector into a binary hash value of a specified length. The generation procedure is as follows:
The face feature vector array is:
A = [a₁, a₂, …, a_N]    (6);
The mean value E is computed as:
E = (1/N) · Σ_{i=1..N} a_i    (7);
For each element a_i in the face feature array, if a_i ≥ E, the corresponding binary value h_i is 1, and 0 otherwise; the mathematical expression is:
h_i = 1 if a_i ≥ E, otherwise h_i = 0    (8).
as a further limitation of the present technical solution, the specific steps of S3 are:
S31, a multiscale identity watermark preprocessing module;
In the identity code watermark transformation module, binary watermark information of length L is processed into a format consistent with the dimensions of the image tensor: the watermark is reshaped into a two-dimensional array of shape (H/2, W/2). Several scale factors are then selected and the reshaped watermark is upsampled, where the upsampling formula is:
(9);
where the two terms denote, respectively, the upsampling operation and the different scale factors;
After multi-scale upsampling, the upsampled feature maps at different scales are fused by weighted summation, where the weighted fusion formula is:
(10);
where the weight corresponds to the respective scale factor;
S32, a mixed attention embedding module;
The module receives the input original image and the watermark message after multi-scale preprocessing. A visual Mamba-like linear attention U-shaped network, combining a Mamba-like linear attention mechanism with a U-shaped network, is constructed to handle watermark embedding. The model mainly consists of three parts: a channel-attention-based feature initial processing module (SE-Stem) for initial feature extraction, a linear attention module, and a multi-scale dilated downsampling convolution block;
S321, a feature initial processing module SE-Stem based on channel attention;
A dual-branch structure is constructed to process the input image and the watermark independently. Each branch processes its input through convolution operations and a channel attention mechanism, gradually reducing the spatial dimensions while increasing the channel dimension and retaining important information. Finally, the two feature maps are concatenated in preparation for subsequent operations; the formula is:
(11);
where the operations are, respectively: the channel attention operation; two convolution operations performed sequentially; feature concatenation; the input source image; and the preprocessed identity watermark;
S322, a linear attention module;
The fused features of the source image and the identity watermark, preliminarily processed by the channel-attention-based feature initial processing module, are input into a Mamba-like linear attention block for further processing;
The two linear blocks in the original Mamba-like linear attention module are replaced with a row-direction feature cooperation module and a column-direction feature cooperation module, forming a new MLLA-structured linear attention module;
The row-direction feature cooperation module consists of a feature extraction convolution layer, a horizontal position encoding layer, and a linear layer. The feature vector input into the row-direction feature cooperation module first passes through the convolution layer to extract row-wise local features:
(12);
where the output is the extracted row-wise local feature, Conv() represents a convolution operation, and the input is the feature map preliminarily processed by SE-Stem;
The result then enters the horizontal position encoding layer for a matrix operation:
(13);
where the added term is the position encoding in the horizontal direction, calculated as follows: for an input feature map, horizontal position encoding parameters are defined up to a preset maximum number of horizontal positions, which represents the maximum horizontal position range the model can process; the portion corresponding to the width W of the current feature map is selected from these horizontal position encoding parameters;
For the i-th row feature vector, the calculation of the linear layer is:
(14);
where the three projection matrices are applied to the i-th row feature vector to produce three weighted representations, the output is the processed row feature, and Softmax denotes the activation applied to the key matrix;
The column-direction feature cooperation module consists of a feature extraction convolution layer, a vertical position encoding layer, and a linear layer. The feature vector input into the module first passes through the convolution layer to extract column-wise local features:
(15);
where the output is the extracted column-wise local feature and the input is the feature map preliminarily processed by SE-Stem;
The result then enters the vertical position encoding layer for a matrix operation:
(16);
where the added term is the position encoding in the vertical direction, calculated as follows: for an input feature map, vertical position encoding parameters are defined up to a preset maximum number of vertical positions, which represents the maximum vertical position range the model can process; the portion corresponding to the height H of the current feature map is selected from these vertical position encoding parameters;
For the j-th column feature vector, the calculation of the linear layer is:
(17);
where the output is the processed column feature, the three projection matrices are applied to the j-th column feature vector to produce three weighted representations, and Softmax denotes the activation applied to the key matrix;
S323, a multi-scale dilated downsampling convolution block;
Combining dilated convolution with depthwise separable convolution enlarges the receptive field without significantly reducing resolution. The dilated convolutions capture features at three different scales (local, medium, and global) by using different dilation rates in the convolution kernels while preserving image detail. The idea of dynamically adjusting the dilation rate is adopted: a learnable parameter is introduced to dynamically determine the optimal dilation rate of each convolution layer, adjusted according to the characteristics of the input image, so that more context information is captured at different scales; a feature analysis layer analyses the statistical information of the input features;
First, the gradient of the input features is calculated; then the dynamic dilation rate d is computed from the gradient and the learnable parameter:
(18);
Finally, dilated convolutions with different dilation rates are dynamically selected according to d;
The entire downsampling process can be expressed as:
(19);
where the terms denote, respectively: the final output feature map; a flattening layer; convolution layer 2; the multi-scale dilated convolution; the depthwise separable convolution; convolution layer 1; a preliminary reshaping of the input features; and the input features.
As a further limitation of the present technical solution, the specific steps of S4 are:
The loss function consists of three parts, namely weighted cross entropy loss, multi-scale image pixel loss and self-adaptive message loss;
S41, weighted cross entropy loss;
The weighted cross entropy loss solves the class imbalance problem by assigning different weights to different classes, as follows:
(20);
where the terms denote, respectively: the true class label of the face region detected in the i-th face sample; the positive-class probability predicted by the model; and the class weight;
S42, multi-scale image pixel loss;
By introducing a multi-scale idea into the mean square error, the detail and structural information of the image are better captured by computing the mean square error at different scales; the formula is:
(21);
where the terms denote, respectively: the number of scales; the number of pixels at the s-th scale; and the representations of the original image and the watermarked image at the s-th scale;
S43, self-adaptive texture feature message loss;
An innovative self-adaptive loss function is constructed;
First, the local texture features of the image are extracted; these help characterize the complexity of the image and thus determine in which regions a stronger watermark should be embedded. The formula is:
T(i, j) = sqrt( G_x(i, j)² + G_y(i, j)² )    (22);
where: G_x and G_y are the gradients of the input image along its two axes, respectively; T(i, j) is the local texture feature at image position (i, j);
Then, the adaptive weights are computed from the local texture features, increasing the watermark embedding strength in texture-rich regions:
w(i, j) = ( T(i, j) − T_min ) / ( T_max − T_min )    (23);
where: w(i, j) is the adaptive weight at image position (i, j); T_min and T_max are the minimum and maximum values of the texture feature, respectively;
Finally, the adaptive message loss function is defined:
(24);
where the terms denote, respectively: the original watermark message; the extracted watermark message; the class label; the decoding threshold; the decoding confidence; the adaptive weight of position i; and the total number of watermark pixels;
S44, the overall loss function is the weighted sum of the weighted cross entropy loss, the multi-scale image loss, and the adaptive message loss:
L_total = λ₁ · L_wce + λ₂ · L_img + λ₃ · L_msg    (25);
where λ₁, λ₂ and λ₃ are the weights of the respective loss terms.
As a further limitation of the present technical solution, the specific step of S5 is:
Two decoding methods are used in the present model: a decoder decodes the watermark embedded in the face image, and the same face recognition algorithm together with the two-dimensional Mel face extraction structure regenerates the watermark from the face region of the face image. Conventional correlation comparison relies on the Hamming distance, but a small bit difference between the two vectors can increase the Hamming distance significantly; even if two watermarks are visually very similar, differences in individual bits may cause them to be misjudged as dissimilar. To accurately evaluate the similarity between the watermarks obtained by the two decoding modes, a regularized Pearson correlation coefficient is adopted to quantify the linear correlation between them, as follows:
r(x, y) = Σ_{k=1..n} (x_k − x̄)(y_k − ȳ) / ( sqrt( Σ_{k=1..n} (x_k − x̄)² ) · sqrt( Σ_{k=1..n} (y_k − ȳ)² ) + λ )    (26);
where: λ is a regularization term; x_k and y_k are the k-th elements of the feature vectors x and y, respectively; n is the dimension of the feature vectors; x̄ and ȳ are the mean values of x and y; a value of 1 indicates that the two vectors are completely positively correlated, and a value of 0 indicates that the two feature vectors have no linear relationship.
Compared with the prior art, the invention has the advantages and positive effects that:
1. Compared with the traditional method, the method can realize the detection and tracing functions of face tampering through different decoding structures, and greatly improves the practicability and the functionality of the model.
2. The invention provides a novel method for extracting the face identity watermark, which has good robustness to conventional noise operation through experimental verification.
3. The invention proposes VM-UNet, a novel architecture (a state-space-model-style linear attention UNet) that, by innovatively combining linear attention, channel attention, the Mamba model, and the UNet architecture to handle the watermark embedding task, achieves notable improvements in the visual naturalness and robustness of watermark images.
Detailed Description
One embodiment of the present invention will be described in detail below with reference to the attached drawings, but it should be understood that the scope of the present invention is not limited by the embodiment.
The invention provides a face-forgery active defense method based on a face identity watermark and a mixed attention module, which can simultaneously detect and trace forgery by embedding a dedicated face identity watermark into a face image. Specifically, the method first takes a source target face image as input, segments the face region using a face recognition technique, generates a face identity code from the rich texture features of the face region, and embeds the extracted identity code into the whole image as a robust watermark I_A after shape adjustment and average-hash processing. For the embedding process, a new state-space-model framework, VM-UNet, is proposed, in which a Mamba-like linear attention mechanism (MLLA) is combined with a UNet network to handle watermark embedding; the framework improves feature processing and watermark embedding accuracy by combining the advantages of linear attention, channel attention, the state space model (SSM), and the efficient symmetric sampling structure of UNet. The image then passes through a noise layer (compression, blurring, cropping) and other processes that enhance model robustness, producing an image containing the face identity code watermark. In the construction of the watermark decoder, the method adopts two different decoding modes for tamper detection and traceability analysis respectively: on one hand, the identity code watermark I_B is extracted from the face region of the possibly tampered image by the same face recognition algorithm used to generate the face identity code; on the other hand, a decoder extracts the embedded identity code watermark I_A' from the image. Whether the image has been tampered with is determined by comparing the correlation between I_B and I_A', and the source of the image can be further tracked by comparing I_A and I_A', thereby realizing traceability analysis.
The invention comprises the following steps:
s1, extracting face identity codes.
The specific steps of the S1 are as follows:
The primary step of face identity watermark generation is to accurately extract stable face features from the face region of the input image. A RetinaFace detector, a deep-learning-based face detection and alignment algorithm, is adopted to detect and locate the face in the input image. The main loss for locating the face region is the face classification loss:
(1);
where the terms are, in order: the ground-truth label; the probability, predicted by the model, that the prior box contains a face; the weight of positive samples; and the focusing parameter that adjusts the relative weight of positive and negative samples;
Inspired by the Mel-frequency cepstral coefficients used in audio anti-spoofing, the face region of the input image is located by the RetinaFace detection algorithm, and a two-dimensional Mel face feature extraction structure dedicated to image texture feature extraction is constructed, as shown in figure 3. The two-dimensional Mel face feature extraction structure adopts a multi-scale feature extraction method: it extracts face features at different scales and captures multi-scale texture information in the image, so that the extracted face features are more stable and remain robust under various distortions such as JPEG compression, noise, and scaling. The structure mainly comprises three parts: a multi-resolution image pyramid, a multi-scale two-dimensional Mel filter bank, and logarithmic energy spectrum feature extraction;
The multi-resolution image pyramid downsamples the image at different resolutions to form a series of images at different scales; the process mainly comprises Gaussian smoothing and downsampling operations. Gaussian smoothing refers to Gaussian filtering of the face region image to remove high-frequency noise and smooth the image. Downsampling then subsamples the Gaussian-smoothed image, selecting pixels from every other row and column to obtain images at different resolution scales.
The core of Gaussian smoothing is convolving the image with a Gaussian kernel, whose expression is:
G(x, y) = (1 / (2πσ²)) · exp( −(x² + y²) / (2σ²) )    (2);
where: (x, y) are the pixel coordinates within the Gaussian kernel; σ is the standard deviation of the Gaussian function;
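As an illustration of the pyramid construction described above, the following is a minimal sketch (not the patented implementation) that applies Gaussian smoothing followed by factor-of-two subsampling to a grayscale face region; the kernel standard deviation and the number of levels are assumed values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(face_region: np.ndarray, levels: int = 3, sigma: float = 1.0):
    """Build a multi-resolution pyramid: Gaussian smoothing, then keep every
    other row and column at each level (assumed settings, grayscale input)."""
    pyramid = [face_region.astype(np.float32)]
    for _ in range(levels - 1):
        smoothed = gaussian_filter(pyramid[-1], sigma=sigma)  # remove high-frequency noise
        pyramid.append(smoothed[::2, ::2])                    # drop every other row and column
    return pyramid
```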
The multi-scale two-dimensional Mel filter is constructed as follows:
(3);
where the parameters are, respectively: the wavelength of the sinusoidal factor; the orientation of the filter; the phase offset; a variable parameter controlling the degree of spatial localization of the filter; and the spatial aspect ratio, which controls the filter shape in different directions;
For each resolution level produced by the multi-resolution pyramid, two-dimensional Mel filters of different scales are convolved with the image of that level to obtain response images at different scales. These response images reflect the texture features and energy information of the image at different scales and orientations. Finally, the frequency distribution and energy information of the image are extracted with a logarithmic energy spectrum, computed as follows:
Energy spectrum calculation:
(4);
where the terms denote, respectively: the multi-scale response image; the frequency-domain coordinates (u, v) of the corresponding image; and the absolute-value (modulus) operation;
Logarithmic extraction:
LogEnergy(u, v) = log( Energy(u, v) + ε )    (5);
where Energy(u, v) is the face information extracted at frequency-domain coordinates (u, v) in the image frequency domain, and ε is a small constant used to prevent numerical problems in the logarithm.
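A minimal sketch of the multi-scale filtering and log-energy extraction described above, assuming Gabor-style sine-carrier kernels stand in for the two-dimensional Mel filters and a squared-magnitude spectrum for the energy; the kernel size, wavelengths, and orientations are illustrative values, not the patented parameters.

```python
import numpy as np
import cv2

def log_energy_features(pyramid, wavelengths=(4, 8, 16),
                        orientations=(0, np.pi / 4, np.pi / 2), eps=1e-8):
    """Convolve each pyramid level with a bank of sine-carrier (Gabor-style)
    filters, then take the log energy spectrum of each response (assumed form)."""
    features = []
    for level in pyramid:
        for lam in wavelengths:
            for theta in orientations:
                kernel = cv2.getGaborKernel(ksize=(15, 15), sigma=0.56 * lam,
                                             theta=theta, lambd=lam, gamma=0.5, psi=0)
                response = cv2.filter2D(level, cv2.CV_32F, kernel)
                energy = np.abs(np.fft.fft2(response)) ** 2     # energy spectrum, eq. (4)-like
                features.append(np.log(energy + eps).ravel())   # log extraction, eq. (5)-like
    return np.concatenate(features)
```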
S2, generating a face identity code.
The specific steps of the S2 are as follows:
The face feature vector generated by the above extraction steps is too long, and embedding it directly into the image would degrade the watermark embedding quality. Therefore, the average hash algorithm is adopted to convert the generated face feature vector into a binary hash value of a specified length. The generation procedure is as follows:
The face feature vector array is:
A = [a₁, a₂, …, a_N]    (6);
The mean value E is computed as:
E = (1/N) · Σ_{i=1..N} a_i    (7);
For each element a_i in the face feature array, if a_i ≥ E, the corresponding binary value h_i is 1, and 0 otherwise; the mathematical expression is:
h_i = 1 if a_i ≥ E, otherwise h_i = 0    (8).
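The average-hash step above maps the feature vector to a fixed-length bit string. The sketch below is an illustrative implementation under the assumption that the vector is first linearly resampled to the desired watermark length L; the resampling choice is not specified in the text.

```python
import numpy as np

def average_hash(features: np.ndarray, length: int = 64) -> np.ndarray:
    """Binarize a face feature vector against its mean value (eqs. (6)-(8)).
    The vector is linearly resampled to `length` bits (assumed choice)."""
    resampled = np.interp(np.linspace(0, 1, length),
                          np.linspace(0, 1, features.size), features)
    mean_value = resampled.mean()                        # E in eq. (7)
    return (resampled >= mean_value).astype(np.uint8)    # h_i in eq. (8)
```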
S3, a mixed attention encoder module.
The specific steps of the S3 are as follows:
In the image identity code watermark embedding process, the image must retain high visual quality and integrity, preserving its original detail and sharpness as much as possible, while the embedded watermark must remain strongly robust so that it survives stably and completely under common image processing attacks such as compression, cropping, and filtering, or under noise interference. The encoder module therefore comprises a multi-scale identity watermark preprocessing module and a mixed attention embedding module.
S31, a multiscale identity watermark preprocessing module;
To enable the binary hash watermark generated from the face feature code to fuse better with the image features, the identity code watermark transformation module converts the binary watermark information of length L into a format consistent with the dimensions of the image tensor; this conversion ensures that the watermark information can be effectively fused with the image features in the same dimensions. Specifically, the watermark is reshaped into a two-dimensional array of shape (H/2, W/2). Several scale factors are then selected and the reshaped watermark is upsampled, where the upsampling formula is:
(9);
where the two terms denote, respectively, the upsampling operation and the different scale factors;
After multi-scale upsampling, the upsampled feature maps at different scales are fused by weighted summation, where the weighted fusion formula is:
(10);
where the weight corresponds to the respective scale factor;
Preprocessing the watermark at multiple scales represents it, to some extent, at several scales; this effectively enhances the redundancy and robustness of the watermark while ensuring dimensional consistency with the image features.
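The following is a minimal sketch of this multi-scale preprocessing, assuming the watermark length satisfies L = (H/2)·(W/2), nearest-neighbour upsampling, example scale factors of 1, 2, and 4, and example fusion weights; all of these are illustrative assumptions, and each upsampled map is brought back to a common H×W grid before the weighted summation so the maps can be added.

```python
import torch
import torch.nn.functional as F

def preprocess_watermark(bits: torch.Tensor, H: int, W: int,
                         scales=(1, 2, 4), weights=(0.5, 0.3, 0.2)) -> torch.Tensor:
    """Reshape an L-bit watermark to (H/2, W/2), upsample it at several scale
    factors, and fuse the results by weighted summation (eqs. (9)-(10)-like).
    Scale factors, weights, and interpolation mode are assumed values."""
    base = bits.float().reshape(1, 1, H // 2, W // 2)
    fused = torch.zeros(1, 1, H, W)
    for s, w in zip(scales, weights):
        up = F.interpolate(base, scale_factor=s, mode='nearest')   # eq. (9): upsampling
        up = F.interpolate(up, size=(H, W), mode='nearest')        # bring to a common grid
        fused = fused + w * up                                      # eq. (10): weighted fusion
    return fused
```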
S32, a mixed attention embedding module;
The module receives the input original image (in RGB three-channel format, with H and W denoting the height and width of the image) and the watermark message after multi-scale preprocessing. To characterize the image content more fully, a visual Mamba-like linear attention U-shaped network (Vision-Mamba-U-Net, VM-UNet), combining a Mamba-like linear attention mechanism (MLLA) with a U-shaped network (U-Net), is constructed to handle watermark embedding. The model mainly consists of three parts: a channel-attention-based feature initial processing module (Squeeze-and-Excitation Network Stem, SE-Stem) for initial feature extraction, a linear attention module (HLMLLA), and a multi-scale dilated downsampling convolution block (MSDC);
s321, a feature initial processing module SE-Stem based on channel attention;
To make the proposed architecture dedicated to watermark embedding, the Stem module is improved. Specifically, a dual-branch structure is constructed to process the input image and the watermark separately, as shown in fig. 4. Each branch processes its input through convolution operations and a channel attention mechanism (SENet), gradually reducing the spatial dimensions while increasing the channel dimension and retaining important information. Finally, the two feature maps are concatenated in preparation for subsequent operations; the formula is:
(11);
where the operations are, respectively: the channel attention operation; two convolution operations performed sequentially; feature concatenation; the input source image; and the preprocessed identity watermark;
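A minimal PyTorch-style sketch of the dual-branch SE-Stem idea described above; the channel counts, kernel sizes, strides, and the squeeze-and-excitation reduction ratio are illustrative assumptions rather than the patented configuration.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation channel attention block."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(x)  # reweight channels by their learned importance

class SEStem(nn.Module):
    """Dual-branch stem: one branch for the image, one for the watermark; each
    uses two strided convolutions plus channel attention, then the two feature
    maps are concatenated (eq. (11)-like)."""
    def __init__(self, out_channels: int = 32):
        super().__init__()
        def branch(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, out_channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                SEBlock(out_channels))
        self.image_branch = branch(3)       # RGB source image
        self.watermark_branch = branch(1)   # preprocessed identity watermark

    def forward(self, image, watermark):
        return torch.cat([self.image_branch(image),
                          self.watermark_branch(watermark)], dim=1)
```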
S322, a linear attention module (HLMLLA);
The fused features of the source image and the identity watermark, preliminarily processed by the channel-attention-based feature initial processing module, are input into a Mamba-like Linear Attention block (MLLA) for further processing;
MLLA (Mamba-Like Linear Attention) is an improved method combining the Mamba model with a linear attention mechanism, aiming to improve model performance on visual tasks; MLLA mainly integrates two key factors from Mamba, the "forgetting gate" and the block design. However, forgetting gates require recurrent computation and may not be well suited to modeling non-causal data such as images: images are two-dimensional spatial data whose information is mainly carried by spatial relationships between pixels and by local features, and, unlike sequence data with clear temporal or causal relationships, they do not inherently require recursion. The goal is therefore to model the spatial features in a face image effectively while avoiding the unnecessary time cost and complexity that recurrent computation would introduce when processing non-causal data.
The two linear blocks in the original Mamba-like linear attention module are replaced with a row-direction feature cooperation module and a column-direction feature cooperation module, forming a new MLLA-structured linear attention module, as shown in fig. 6. The two components capture image sequence information through a linear attention mechanism, combined with horizontal and vertical position encodings that enhance the model's understanding of pixel position relationships, without relying on a recurrent structure.
The row-direction feature cooperation module consists of a feature extraction convolution layer, a horizontal position encoding layer, and a linear layer. The feature vector input into the row-direction feature cooperation module first passes through the convolution layer to extract row-wise local features:
(12);
where the output is the extracted row-wise local feature, Conv() represents a convolution operation, and the input is the feature map preliminarily processed by SE-Stem;
The result then enters the horizontal position encoding layer for a matrix operation:
(13);
where the added term is the position encoding in the horizontal direction, calculated as follows: for an input feature map whose dimensions are the channel, height, and width, horizontal position encoding parameters are defined up to a preset maximum number of horizontal positions, which represents the maximum horizontal position range the model can process; the portion corresponding to the width W of the current feature map is selected from these horizontal position encoding parameters;
For the i-th row feature vector, the calculation of the linear layer is:
(14);
where the three projection matrices are applied to the i-th row feature vector to produce three weighted representations, the output is the processed row feature, and Softmax denotes the activation applied to the key matrix;
The column-direction feature cooperation module consists of a feature extraction convolution layer, a vertical position encoding layer, and a linear layer. The feature vector input into the module first passes through the convolution layer to extract column-wise local features:
(15);
where the output is the extracted column-wise local feature and the input is the feature map preliminarily processed by SE-Stem;
The result then enters the vertical position encoding layer for a matrix operation:
(16);
where the added term is the position encoding in the vertical direction, calculated as follows: for an input feature map, vertical position encoding parameters are defined up to a preset maximum number of vertical positions, which represents the maximum vertical position range the model can process; the portion corresponding to the height H of the current feature map is selected from these vertical position encoding parameters;
For the j-th column feature vector, the calculation of the linear layer is:
(17);
where the output is the processed column feature, the three projection matrices are applied to the j-th column feature vector to produce three weighted representations, and Softmax denotes the activation applied to the key matrix;
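As a concrete illustration of the row-direction branch described above (the column-direction branch is symmetric in the height dimension), the following is a hedged sketch that assumes a standard linear-attention form with a learnable horizontal position embedding, three projection matrices used as query, key, and value, and softmax applied to the keys; the exact projection and activation scheme of the patented module is not fully specified in the text.

```python
import torch
import torch.nn as nn

class RowFeatureCooperation(nn.Module):
    """Row-direction feature cooperation (sketch): a convolution extracts
    row-local features, a learnable horizontal position encoding is added,
    and each image row is processed by a linear-attention-style layer."""
    def __init__(self, channels: int, max_width: int = 256):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # eq. (12)-like
        self.h_pos = nn.Parameter(torch.zeros(1, channels, 1, max_width))    # horizontal encoding
        self.to_q = nn.Linear(channels, channels, bias=False)
        self.to_k = nn.Linear(channels, channels, bias=False)
        self.to_v = nn.Linear(channels, channels, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        feat = self.conv(x) + self.h_pos[..., :w]              # eq. (13)-like: add encoding for width W
        rows = feat.permute(0, 2, 3, 1).reshape(b * h, w, c)    # one sequence per image row
        q = self.to_q(rows)
        k = torch.softmax(self.to_k(rows), dim=1)               # softmax over the key matrix (assumed)
        v = self.to_v(rows)
        context = k.transpose(1, 2) @ v                          # per-row context, cost linear in w
        out = q @ context                                        # linear attention output, eq. (14)-like
        return out.reshape(b, h, w, c).permute(0, 3, 1, 2)
```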
S323, a multi-scale dilated downsampling convolution block;
Common downsampling methods use max pooling or average pooling; while these reduce resolution, they tend to lose some feature information. Combining dilated convolution with depthwise separable convolution enlarges the receptive field without significantly reducing resolution; the specific structure is shown in figure 5. The dilated convolutions capture features at three different scales (local, medium, and global) by using different dilation rates in the convolution kernels while preserving image detail. To improve the adaptability and performance of the dilated convolution, the idea of dynamically adjusting the dilation rate is adopted: a learnable parameter is introduced to dynamically determine the optimal dilation rate of each convolution layer, adjusted according to the characteristics of the input image, so that more context information is captured at different scales; a feature analysis layer analyses the statistical information of the input features;
First, the gradient of the input features is calculated; then the dynamic dilation rate d is computed from the gradient and the learnable parameter:
(18);
Finally, dilated convolutions with different dilation rates are dynamically selected according to d;
The entire downsampling process can be expressed as:
(19);
where the terms denote, respectively: the final output feature map; a flattening layer; convolution layer 2; the multi-scale dilated convolution; the depthwise separable convolution; convolution layer 1; a preliminary reshaping of the input features; and the input features.
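A hedged sketch of the multi-scale dilated downsampling idea: three dilation rates for local, medium, and global context, a depthwise separable convolution, and a strided convolution for the actual downsampling. The dilation rates, channel counts, and the way the dynamic rate d of eq. (18) would select among branches are assumptions; the dynamic-rate computation is only indicated by a comment.

```python
import torch
import torch.nn as nn

class MultiScaleDilatedDown(nn.Module):
    """Multi-scale dilated downsampling block (sketch): parallel dilated
    convolutions at three rates, a depthwise separable convolution, then a
    strided convolution that halves the spatial resolution."""
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 2, 4)):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.dilated = nn.ModuleList(
            [nn.Conv2d(out_ch, out_ch, 3, padding=r, dilation=r) for r in rates])  # local/medium/global
        self.depthwise = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1, groups=out_ch),  # depthwise convolution
            nn.Conv2d(out_ch, out_ch, 1))                            # pointwise convolution
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)  # downsampling step

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # In the patented block, a dynamic rate d (eq. (18)) derived from the input
        # gradient would weight or select among the dilated branches; here they are summed.
        y = self.conv1(x)
        y = sum(branch(y) for branch in self.dilated)
        y = self.depthwise(y)
        return self.conv2(y)
```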
S4, designing a loss function.
The specific steps of the S4 are as follows:
The loss function consists of three parts: weighted cross entropy loss, multi-scale image pixel loss, and self-adaptive message loss; together they allow watermark messages to be effectively embedded and extracted while maintaining high image fidelity;
S41, weighted cross entropy loss;
In the face detection stage, to reduce the class imbalance caused by the varying locations of face classes, a weighted cross entropy loss is used, assigning different weights to different classes; the formula is:
(20);
where the terms denote, respectively: the true class label of the face region detected in the i-th face sample; the positive-class probability predicted by the model; and the class weight, which allows the loss contribution of each category to be adjusted according to its importance;
S42, multi-scale image pixel loss;
Image loss measures the difference between the original image and the watermarked image and ensures that the watermark embedding process does not significantly degrade image quality. It is typically computed with the mean square error (MSE), but MSE usually focuses only on pixel-value differences and ignores the structural and texture information of the image. By introducing a multi-scale idea into the mean square error, computing the MSE at different scales captures the detail and structural information of the image better; the formula is:
(21);
where the terms denote, respectively: the number of scales, with the scales typically set to 1, 2, and 4; the number of pixels at the s-th scale; and the representations of the original image and the watermarked image at the s-th scale;
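A minimal sketch of the multi-scale pixel loss described above, assuming the scales 1, 2, and 4 mentioned in the text are downsampling factors and that average pooling produces each scale; both are assumptions about details the text leaves open.

```python
import torch
import torch.nn.functional as F

def multiscale_mse(original: torch.Tensor, watermarked: torch.Tensor,
                   scales=(1, 2, 4)) -> torch.Tensor:
    """Mean square error averaged over several image scales (eq. (21)-like)."""
    loss = original.new_zeros(())
    for s in scales:
        if s == 1:
            a, b = original, watermarked
        else:
            a = F.avg_pool2d(original, kernel_size=s)      # assumed way of forming the s-th scale
            b = F.avg_pool2d(watermarked, kernel_size=s)
        loss = loss + F.mse_loss(a, b)
    return loss / len(scales)
```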
S43, self-adaptive texture feature message loss;
To ensure that the watermark message can be accurately extracted, the message loss function plays a critical role in the watermark embedding process. However, a conventional message loss function may in some cases cause the watermark to be embedded too strongly, degrading image quality. To solve this problem, an innovative self-adaptive loss function is constructed that flexibly adjusts the watermark embedding strength according to the local texture characteristics of the image. This adaptive mechanism not only helps maintain overall image quality but also significantly improves watermark robustness, ensuring that the watermark information can be effectively extracted under a variety of conditions;
First, the local texture features of the image are extracted; these help characterize the complexity of the image and thus determine in which regions a stronger watermark can be embedded. The formula is:
T(i, j) = sqrt( G_x(i, j)² + G_y(i, j)² )    (22);
where: G_x and G_y are the gradients of the input image along its two axes, respectively; T(i, j) is the local texture feature at image position (i, j);
Then, the adaptive weights are computed from the local texture features, increasing the watermark embedding strength in texture-rich regions:
w(i, j) = ( T(i, j) − T_min ) / ( T_max − T_min )    (23);
where: w(i, j) is the adaptive weight at image position (i, j); T_min and T_max are the minimum and maximum values of the texture feature, respectively;
Finally, the adaptive message loss function is defined:
(24);
where the terms denote, respectively: the original watermark message; the extracted watermark message; the class label that converts each watermark message bit into a classification target (1 or −1); the decoding threshold; the decoding confidence, which measures the gap between the extracted watermark message bit and the threshold; the adaptive weight of position i; and the total number of watermark pixels;
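A hedged sketch of the texture-adaptive weighting of eqs. (22)-(23) together with one plausible margin-style reading of the message loss of eq. (24); since the exact form of eq. (24) is not reproduced in the text, the hinge-like penalty and the default threshold of 0.5 below are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def texture_weights(image: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Gradient-magnitude texture map, min-max normalized (eqs. (22)-(23))."""
    gray = image.mean(dim=1, keepdim=True)
    gx = F.pad(gray[..., :, 1:] - gray[..., :, :-1], (0, 1, 0, 0))  # horizontal gradient
    gy = F.pad(gray[..., 1:, :] - gray[..., :-1, :], (0, 0, 0, 1))  # vertical gradient
    t = torch.sqrt(gx ** 2 + gy ** 2)
    return (t - t.amin()) / (t.amax() - t.amin() + eps)

def adaptive_message_loss(decoded: torch.Tensor, target_bits: torch.Tensor,
                          weights: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Assumed margin-style message loss: bits become +1/-1 class labels, the
    decoding confidence is the signed distance of each decoded bit from the
    threshold, and each term is weighted by the local texture weight."""
    labels = target_bits * 2.0 - 1.0             # {0,1} bits -> {-1,+1} class labels
    confidence = decoded - threshold             # decoding confidence w.r.t. the threshold
    per_bit = F.relu(1.0 - labels * confidence)  # hinge-style penalty (assumption)
    return (weights * per_bit).mean()
```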
S44, the overall loss function is the weighted sum of the weighted cross entropy loss, the multi-scale image loss, and the adaptive message loss:
L_total = λ₁ · L_wce + λ₂ · L_img + λ₃ · L_msg    (25);
where λ₁, λ₂ and λ₃ are the weights of the respective loss terms.
S5, analyzing the correlation of the facial features.
The specific steps of the S5 are as follows:
Two decoding methods are used in the present model: a decoder decodes the watermark embedded in the face image, and the same face recognition algorithm together with the two-dimensional Mel face extraction structure regenerates the watermark from the face region of the face image. Conventional correlation comparison relies on the Hamming distance, but a small bit difference between the two vectors can increase the Hamming distance significantly; even if two watermarks are visually very similar, differences in individual bits may cause the Hamming distance to misjudge them as dissimilar. To accurately evaluate the similarity between the watermarks obtained by the two decoding modes, a regularized Pearson correlation coefficient is adopted to quantify the linear correlation between them; the formula is:
r(x, y) = Σ_{k=1..n} (x_k − x̄)(y_k − ȳ) / ( sqrt( Σ_{k=1..n} (x_k − x̄)² ) · sqrt( Σ_{k=1..n} (y_k − ȳ)² ) + λ )    (26);
where: λ is a regularization term that avoids a zero denominator and improves numerical stability; x_k and y_k are the k-th elements of the feature vectors x and y, respectively; n is the dimension of the feature vectors; x̄ and ȳ are the mean values of x and y; a value of 1 indicates that the two vectors are completely positively correlated, and a value of 0 indicates that the two feature vectors have no linear relationship.
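A short sketch of the regularized Pearson correlation used for the watermark comparison, with the regularization term added to the denominator as described; the value of λ below is an assumed placeholder.

```python
import numpy as np

def regularized_pearson(x: np.ndarray, y: np.ndarray, lam: float = 1e-6) -> float:
    """Pearson correlation with a regularization term in the denominator (eq. (26)-like)."""
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() /
                 (np.sqrt((xc ** 2).sum()) * np.sqrt((yc ** 2).sum()) + lam))
```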
Experimental setup
Training and testing were performed on the CelebA face dataset; all pictures were cropped to 128 × 128. The dataset was divided into training and test sets at a ratio of 0.8:0.2.
Training is configured for 100 epochs. For the first 50 epochs, a higher initial learning rate is adopted; this strategy helps the model quickly explore the parameter space early in training and accelerates convergence. For the remaining 50 epochs, the learning rate is reduced so that the model can be fine-tuned more precisely later in training.
The batch size is set to 64, which maintains sufficient sample diversity at each gradient update and effectively improves the generalization ability of the model. The optimizer is Adam.
In the construction of the loss function, initial values are set for the weight coefficients of the three loss terms.
Experimental results:
Table 1: visual quality comparison of watermark images
Table 2: conventional noise robustness test
Table 3: generalization test (trained on the CelebA dataset, tested on the FFHQ dataset)
The above disclosure is merely illustrative of specific embodiments of the present invention, but the present invention is not limited thereto, and any variations that can be considered by those skilled in the art should fall within the scope of the present invention.