CN112381083A - Saliency perception image clipping method based on potential region pair - Google Patents
- Publication number
- CN112381083A (application number CN202010538411.1A)
- Authority
- CN
- China
- Prior art keywords
- saliency
- network
- roi
- map
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a saliency-aware image cropping method based on potential region pairs, which generates attractive crops by constructing a deep-learning-based cropping framework. The framework includes a multi-scale CNN feature extractor, a saliency-aware deformable position-sensitive ROI (ROD) alignment operator, a Siamese (twin) fully-connected network, and a hybrid loss function. The method makes full use of the saliency map: saliency information is used to eliminate poor candidate crops and prevent the model from overfitting, and it is integrated into the pooling operator to help build a saliency-aware receptive field that encodes content preference. The invention reveals the intrinsic mechanism of the cropping process as well as the internal relationship between potential region pairs. It not only achieves better aesthetic cropping results, but also adds a negligible computational burden.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence, and relates to a saliency-aware image cropping method based on potential region pairs.
Background
Image cropping, which aims to find the crop of an image with the best aesthetic quality, is an important technique widely used in image post-processing, visual recommendation, and image selection. When a large number of images need to be cropped, manual cropping becomes a laborious task. Thus, automatic image cropping has recently attracted increasing attention from both the research community and industry.
Early cropping methods explicitly designed various hand-crafted features based on photographic knowledge (e.g., the rule of thirds and center composition). With the development of deep learning, many researchers have devoted themselves to developing cropping methods in a data-driven manner, and the release of several benchmark datasets for comparison has greatly facilitated related research.
However, obtaining the best candidate crop remains extremely difficult, mainly for the following three reasons. 1) The potential of image saliency information is not fully exploited. Previous saliency-based cropping methods focus on preserving the most important content in the best crop, but ignore the case in which the bounding rectangle of the saliency region lies near the boundary of the source image, so that the saliency region and the best crop overlap only partially. Moreover, saliency information is used only for generating candidate crops and is not reused in subsequent cropping modules. 2) Potential region pairs (the region of interest (ROI) and the region of discard (ROD)) and their internal relationships are not well modeled. Pairwise cropping methods typically form image pairs from the source image explicitly and feed them into an automatic cropping model, but their performance is often limited because the selection of such pairs depends heavily on detail and is uncertain. 3) Traditional metrics for evaluating cropping methods are unreliable and inaccurate; in some cases, intersection over union (IoU) and boundary displacement error (BDE) are insufficient to reliably evaluate the performance of a cropping method.
Disclosure of Invention
The invention aims to provide a saliency-aware image cropping method based on potential region pairs, so as to overcome the defects of the prior art.
In order to achieve the purpose, the invention adopts the following technical scheme:
A saliency-aware image cropping method based on potential region pairs comprises the following steps:
Step 1) generating grid-anchor-based candidate crops according to the criteria and procedures of professional photography.
Step 2) describing the features of the source image with a multi-scale, lightweight feature extraction network, and then cropping the extracted features using deformable ROI pooling and deformable ROD pooling.
Step 3) training a Siamese aesthetic evaluation network and predicting the aesthetic scores of the candidate crops by minimizing a hybrid loss function.
Further, saliency-based candidate crops are generated: an initial crop is first created from the saliency region, and candidate crops are then generated in a grid-anchor manner.
Further, the algorithm for creating the initial crop is as follows:
inputting: the size of the image (I) is wide (W) x high (H), and the magnification is lambdalargeReduction ratio lambdasmallArea function area (·), two rectangles Re1And Re2The closest distance between the outlines of (a) and (b) Clo _ Dis (Re)1, Re2)。
Output: the initial crop S_init_crop.
where s1 ∈ (0,1] and d1 ∈ [0,1] are the thresholds for case (b) and case (a), respectively.
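Since the full pseudocode of this algorithm is not reproduced here, the following is only a minimal Python sketch of how such an initial-crop rule might look, assuming the three cases illustrated in Fig. 2; the function name make_initial_crop, the default threshold values, and the exact enlargement/shrinking behaviour for cases (a) and (b) are illustrative assumptions rather than the patented algorithm.

```python
# Hypothetical sketch of the initial-crop rule; the branching below is an
# assumption based on the three cases of Fig. 2, not the exact patented logic.

def make_initial_crop(sal_box, W, H, s1=0.1, d1=0.1,
                      lambda_large=1.2, lambda_small=0.8):
    """sal_box = (x0, y0, x1, y1): saliency bounding box of an image I of size W x H."""
    x0, y0, x1, y1 = sal_box
    sal_area = (x1 - x0) * (y1 - y0)
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    # Case (a): the saliency box lies near the image boundary.
    dist_to_border = min(x0, y0, W - x1, H - y1) / min(W, H)
    if dist_to_border < d1:
        # Assumption: shrink the box around its centre using lambda_small.
        w, h = (x1 - x0) * lambda_small, (y1 - y0) * lambda_small
        return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
    # Case (b): the salient region covers only a small portion of the image.
    if sal_area / (W * H) < s1:
        # Assumption: enlarge the box around its centre using lambda_large,
        # clamped to the image extent.
        w, h = (x1 - x0) * lambda_large, (y1 - y0) * lambda_large
        return (max(0, cx - w / 2), max(0, cy - h / 2),
                min(W, cx + w / 2), min(H, cy + h / 2))
    # Case (c): the saliency box is used directly as the initial crop.
    return sal_box
```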
Further, the method for generating candidate crops in the grid-anchor manner is shown in Fig. 2:
where the input image of size W×H is divided into M×N bins, and m1, m2, n1, n2 respectively denote the numbers of bins from the initial crop to the boundaries of the source image, which together determine the total number of candidate crops. A constraint is then imposed: a qualified crop must cover at least a certain proportion of the input image, so as to exclude candidates of unsuitable size:
area(S_crop) ≥ ρ · area(I)   (1)
where area(·) is the area function, and S_crop and S_sal denote the crop region and the saliency bounding-box region, respectively. The aesthetic quality of the crops is further improved by constraining the aspect ratio of the image:
where α1 and α2 are set to 0.5 and 2, respectively.
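A minimal sketch of grid-anchor candidate generation under the two constraints above is given below; the bin counts M and N, the value of ρ, and the enumeration scheme (growing the initial crop one bin at a time towards the image boundaries) are illustrative assumptions, while ρ, α1, and α2 play the roles defined in the text.

```python
# Sketch of grid-anchor candidate generation under the area-ratio and
# aspect-ratio constraints described above. Parameter values are assumptions.

def generate_candidates(init_crop, W, H, M=12, N=16,
                        rho=0.4, alpha1=0.5, alpha2=2.0):
    """init_crop = (x0, y0, x1, y1); the W x H image is split into M x N bins."""
    bin_w, bin_h = W / N, H / M
    x0, y0, x1, y1 = init_crop
    candidates = []
    # Move each corner of the initial crop outward towards the image
    # boundary, one grid bin at a time.
    for dx0 in range(int(x0 // bin_w) + 1):
        for dy0 in range(int(y0 // bin_h) + 1):
            for dx1 in range(int((W - x1) // bin_w) + 1):
                for dy1 in range(int((H - y1) // bin_h) + 1):
                    cx0, cy0 = x0 - dx0 * bin_w, y0 - dy0 * bin_h
                    cx1, cy1 = x1 + dx1 * bin_w, y1 + dy1 * bin_h
                    w, h = cx1 - cx0, cy1 - cy0
                    # Constraint (1): the crop must cover enough of the image.
                    if w * h < rho * W * H:
                        continue
                    # Aspect-ratio constraint: alpha1 <= w/h <= alpha2.
                    if not (alpha1 <= w / h <= alpha2):
                        continue
                    candidates.append((cx0, cy0, cx1, cy1))
    return candidates

# Example: candidates for a 640 x 480 image with a centred initial crop.
crops = generate_candidates((160, 120, 480, 360), W=640, H=480)
```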
Further, a multi-scale, lightweight feature extraction network is used to describe the features of the source image, as shown in module 1 of Fig. 1. Through this network, a source image is converted into an information-rich feature map that represents both the global context and the local context. The feature extraction network consists of two modules: a base network and a Feature Aggregation Module (FAM).
Further, by considering spatial relationships, the multi-scale features can effectively suppress local interference and enhance the discriminative power of the features.
Further, the base network may be any effective convolutional neural network (CNN) model that captures image features while preserving a sufficient receptive field. The n-th and (n-1)-th layers are the last two layers of the base network, and skip connections between them provide a degree of global context information.
Further, the FAM aims to compensate for the loss of global and multi-scale context during feature extraction. The FAM is executed as follows:

Step 1: apply average pooling at different scales to generate several feature maps, each followed by a 3×3 convolutional layer.

Step 2: upsample the low-resolution feature maps directly by bilinear interpolation to the same size as the original feature map of the n-th layer.

Step 3: finally, concatenate the upsampled feature maps from the different sub-branches into the final output feature map.
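The following is a small PyTorch sketch of such a feature aggregation module, assuming a MobileNetV2-style input feature map; the channel sizes and pooling scales are illustrative assumptions, but the three steps (multi-scale average pooling with 3×3 convolutions, bilinear upsampling, concatenation) follow the description above.

```python
# Sketch of the Feature Aggregation Module (FAM) described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FAM(nn.Module):
    def __init__(self, in_channels=1280, branch_channels=256, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        # Step 1: average pooling at several scales, each followed by a 3x3 conv.
        self.convs = nn.ModuleList(
            [nn.Conv2d(in_channels, branch_channels, kernel_size=3, padding=1)
             for _ in scales])

    def forward(self, x):
        h, w = x.shape[2:]
        branches = [x]
        for scale, conv in zip(self.scales, self.convs):
            pooled = F.adaptive_avg_pool2d(x, output_size=scale)  # step 1
            feat = conv(pooled)
            # Step 2: bilinear upsampling back to the size of the n-th layer map.
            feat = F.interpolate(feat, size=(h, w), mode='bilinear',
                                 align_corners=False)
            branches.append(feat)
        # Step 3: concatenate the upsampled maps into the final output map.
        return torch.cat(branches, dim=1)

# Example: a MobileNetV2-style feature map of shape (1, 1280, 16, 16).
out = FAM()(torch.randn(1, 1280, 16, 16))
```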
Further, the cropped regions are attended to using two saliency-guided alignment operators, namely saliency-aware deformable position-sensitive ROI align and ROD align. The saliency information is combined with deformable PS ROI (PS ROD) pooling and a lightweight head design to take full advantage of the feature representation.
Further, saliency-aware deformable PS ROI (ROD) pooling is defined as:
where f'(i, j) and f(i, j) are the output pooled feature map and the original feature map, respectively; (x_lf, y_lf) is the top-left corner of the ROI (ROD); n is the number of pixels in a bin; (Δx, Δy) is the fractional offset learned from a fully connected (fc) layer; S is the saliency map; and S_i,j(x, y) takes the value 0 or 1.
Further, as shown in Fig. 3, C is set to 8 to reduce the computation of the subsequent subnetwork, and k is fixed to 3 in accordance with the 3×3 grid composition pattern. Bilinear interpolation is used to compute the exact values in ROI (ROD) align, which resolves the rounding and misalignment issues of saliency-aware deformable PS ROI (ROD) pooling; the resulting operator is named saliency-aware deformable PS ROI (ROD) align.
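Because the pooling equation itself is not reproduced in this text, the following is only a simplified sketch of what saliency-aware position-sensitive pooling over a single region could look like; the handling of the learned offsets and the integer bin windows are assumptions, and a production version would use bilinear sampling as described above rather than integer indexing.

```python
# Simplified sketch of saliency-aware deformable PS ROI (ROD) pooling for one
# region, with one position-sensitive channel per bin for readability.
import torch

def saliency_ps_roi_pool(feat, sal, roi, offsets, k=3):
    """feat: (k*k, H, W) position-sensitive score maps; sal: (H, W) in {0, 1};
    roi: (x_lf, y_lf, w, h) with (x_lf, y_lf) the top-left corner;
    offsets: (k, k, 2) learned fractional offsets (dx, dy) per bin."""
    x_lf, y_lf, w, h = roi
    H, W = sal.shape
    out = torch.zeros(k, k)
    bin_w, bin_h = w / k, h / k
    for i in range(k):
        for j in range(k):
            dx, dy = offsets[i, j]
            # Pixel window of bin (i, j), shifted by the learned offset and
            # clamped to the image extent.
            x0 = max(0, min(W - 1, int(x_lf + j * bin_w + dx)))
            y0 = max(0, min(H - 1, int(y_lf + i * bin_h + dy)))
            x1 = min(W, int(x0 + bin_w) + 1)
            y1 = min(H, int(y0 + bin_h) + 1)
            patch = feat[i * k + j, y0:y1, x0:x1]
            mask = sal[y0:y1, x0:x1]          # S_i,j(x, y) in {0, 1}
            n = patch.numel()
            # Average of the saliency-masked responses over the n pixels in the bin.
            out[i, j] = (patch * mask).sum() / max(n, 1)
    return out

# Example: 3x3 position-sensitive maps over a 32x32 feature/saliency grid.
pooled = saliency_ps_roi_pool(torch.randn(9, 32, 32),
                              (torch.rand(32, 32) > 0.5).float(),
                              roi=(4, 4, 18, 18),
                              offsets=torch.zeros(3, 3, 2))
```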
Further, as shown in block 2 of Fig. 1, F represents the entire feature map generated by the feature extraction network, and F_ROI and F_ROD are the feature maps of the ROI and ROD, respectively. In branch 1, saliency-aware deformable PS ROI align at 8×8 resolution converts F_ROI into an aligned ROI feature map. In branch 2, the ROD is first reconstructed according to mode 4, and four separable saliency-aware component ROD aligns are performed to generate the corresponding feature maps, each followed by a 1×1 convolutional layer to reduce the channel size; all four feature maps are then concatenated into an aligned ROD feature map. On the one hand, F_ROI and the aligned ROD feature map are concatenated and fed into two fully connected layers for the final MOS prediction. On the other hand, a copy of this concatenated feature, denoted ROI_D_P4, is fed together with the aligned ROD feature map ROD_P4 into the Siamese evaluation network.
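A rough PyTorch sketch of this two-branch head is given below, with ordinary adaptive pooling standing in for the saliency-aware deformable PS ROI/ROD align operators; the channel sizes, the width of the MOS head, and the argument names roi_feat and rod_parts are illustrative assumptions.

```python
# Sketch of the two-branch head of block 2 (Fig. 1), with adaptive pooling as a
# stand-in for the saliency-aware deformable PS ROI/ROD align operators.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CroppingHead(nn.Module):
    def __init__(self, channels=8, k=3):
        super().__init__()
        # Branch 2: one 1x1 conv per separable ROD component to shrink channels.
        self.rod_reduce = nn.ModuleList(
            [nn.Conv2d(channels, channels // 4, 1) for _ in range(4)])
        # Two fully connected layers for the final MOS prediction.
        in_dim = channels * 8 * 8 + 4 * (channels // 4) * k * k
        self.fc = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                nn.Linear(256, 1))

    def forward(self, roi_feat, rod_parts):
        # Branch 1: 8x8 aligned ROI feature map (stand-in for PS ROI align).
        roi_aligned = F.adaptive_avg_pool2d(roi_feat, 8)
        # Branch 2: four separable ROD components, aligned, reduced, concatenated.
        rod_aligned = torch.cat(
            [conv(F.adaptive_avg_pool2d(p, 3))
             for conv, p in zip(self.rod_reduce, rod_parts)], dim=1)
        fused = torch.cat([roi_aligned.flatten(1), rod_aligned.flatten(1)], dim=1)
        return self.fc(fused)   # predicted MOS of the candidate crop

head = CroppingHead()
mos = head(torch.randn(1, 8, 16, 16), [torch.randn(1, 8, 16, 16)] * 4)
```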
Further, the Siamese network is shown in block 3 of Fig. 1. It consists of two identical fully-connected networks that share weights when extracting features from ROI_D_P4 and ROD_P4. The Siamese network takes the aligned feature maps as input and outputs the predicted aesthetic scores. ROI_D_P4 and ROD_P4 denote the input feature maps of the ROI and ROD, respectively, and their predicted scores are denoted Φ(ROI_D_P4) and Φ(ROD_P4). The Siamese aesthetic evaluation network is trained under the following constraint:
Here, area(·) denotes the area function and γ is an area ratio, empirically set to 2/3. After the Siamese network, the ranking loss is defined for each potential pair as follows:
l_rank(ROI_D_P4, ROD_P4) = max{0, Φ(ROD_P4) - Φ(ROI_D_P4)}   (7)
Let e_ij = g_ij - p_ij, where g_ij and p_ij are the mean opinion score (MOS) and the predicted aesthetic score of the j-th crop of image i, respectively. To enhance robustness to outliers, the Huber loss is defined as follows:
the final overall loss function is:
where the balance parameter is empirically set to 1. If a saliency map is unavailable, all values of the saliency map are set to 0.
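The following sketch shows one way the Siamese scoring and the hybrid loss could be assembled, using the ranking loss of Eq. (7) together with a Huber (smooth L1) regression loss against the MOS; the network width and the exact form in which the two terms are combined are assumptions, since the overall loss equation is not reproduced here, but the balance weight of 1 follows the text.

```python
# Sketch of the Siamese evaluation branches (shared weights) and a hybrid loss
# combining the ranking term of Eq. (7) with a Huber regression term.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseScorer(nn.Module):
    """Two identical fully connected branches that share weights."""
    def __init__(self, in_dim=584):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1))

    def forward(self, roi_d_p4, rod_p4):
        # Returns Phi(ROI_D_P4) and Phi(ROD_P4).
        return self.phi(roi_d_p4), self.phi(rod_p4)

def hybrid_loss(score_roi, score_rod, pred_mos, gt_mos, balance=1.0):
    # Ranking loss (Eq. (7)): the ROI-based feature should not score lower
    # than the ROD-based one.
    l_rank = F.relu(score_rod - score_roi).mean()
    # Huber (smooth L1) loss on e_ij = g_ij - p_ij, robust to outliers.
    l_huber = F.smooth_l1_loss(pred_mos, gt_mos)
    return l_huber + balance * l_rank
```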
Compared with the prior art, the invention has the following beneficial technical effects:
the saliency perception image clipping method based on the potential region pair fully utilizes the saliency map, considers saliency information to eliminate poor candidate clipping maps, prevents the problem of excessive fitting of a model, and integrates the saliency information into a pooling operator to help build a saliency perception sense field capable of coding content preference.
The saliency-aware image cropping method based on potential region pairs reveals the intrinsic mechanism of the cropping process and the internal relationship between potential region pairs. Specifically, four different ROD modes and various combinations of ROIs and RODs are designed for different cases, and the relative ranking order of ROIs and RODs is then learned via a ranking loss.
The saliency-aware image cropping method based on potential region pairs constructs a deep-learning-based cropping framework to generate attractive crops. The framework includes a multi-scale CNN feature extractor, a saliency-aware deformable position-sensitive ROI (ROD) alignment operator, a Siamese fully-connected network, and a hybrid loss function.
The saliency-aware image cropping method based on potential region pairs outperforms other methods on most metrics while incurring a negligible computational burden.
Drawings
Fig. 1 is a diagram of the overall network architecture of the present invention.
Fig. 2 illustrates saliency-based generation of candidate crops. The solid red boxes denote saliency regions, the dashed red boxes denote candidate crops, and the solid blue boxes denote initial crops. (a) The bounding box of the saliency region lies near the boundary of the given image. (b) The salient region covers only a small portion of the source image. (c) The salient region is used directly as the initial crop.
Fig. 3 illustrates saliency-aware deformable position-sensitive ROI pooling.
Fig. 4 illustrates the four ROD modes.
Detailed Description
GAICD _ S dataset: the GAICD dataset first captured about 50K images from the Flickr website and then manually reduced to 10K images with a good composition. For each image, 19 annotators were invited to assign aesthetic scores to the various aspect ratio crop maps using the annotation tool. Among 1,236 images, there are a total of 106,800 candidate cropping patterns. As a condensed version of GAICD, GAICD _ S contains 1,236 photographs containing 100,641 reasonable annotated cutmaps.
For all samples, the short edge is resized to 256 by bilinear interpolation, and data augmentation is performed with several conventional operators (random adjustment of contrast, saturation, brightness, and hue, and horizontal flipping).
In addition, the values of all samples are normalized to [0,1] using the mean and standard deviation computed on the ImageNet dataset. During training, a pretrained MobileNetV2 model is loaded into the feature extraction network of the present invention to mitigate overfitting. The network is trained with the Adam optimizer by minimizing the hybrid loss, with all hyper-parameters set to their default values. The initial learning rate lr is 1e-4, and the maximum number of epochs is set to 100. For saliency maps, PoolNet is used to produce satisfactory saliency bounding boxes. Furthermore, batch normalization and dropout are used in the Siamese evaluation network.
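A sketch of this training configuration in PyTorch/torchvision is shown below; the augmentation strengths and the placeholder regression head are assumptions, while the optimizer, learning rate, epoch budget, ImageNet normalization, and MobileNetV2 initialization follow the description above.

```python
# Sketch of the training setup described above (torchvision >= 0.13 assumed).
import torch
import torch.nn as nn
from torchvision import transforms
from torchvision.models import mobilenet_v2

train_tf = transforms.Compose([
    transforms.Resize(256),                       # short edge resized to 256
    transforms.ColorJitter(0.2, 0.2, 0.2, 0.05),  # brightness/contrast/saturation/hue
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],   # ImageNet mean
                         [0.229, 0.224, 0.225]),  # ImageNet std
])

# Feature extractor initialized from pretrained MobileNetV2 to mitigate overfitting;
# the regression head below is only a placeholder for the full cropping model.
backbone = mobilenet_v2(weights="IMAGENET1K_V1").features
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(1280, 1))
model = nn.Sequential(backbone, head)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
max_epoch = 100
```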
Claims (10)
1. A saliency-aware image cropping method based on potential region pairs, characterized by comprising the following steps:
Step 1) generating saliency-based candidate crop anchor boxes according to the criteria and procedures of professional photography.
Step 2) describing the features of the source image with a multi-scale, lightweight feature extraction network, and then cropping the extracted features using deformable ROI pooling and deformable ROD pooling.
Step 3) training a Siamese aesthetic evaluation network and predicting the aesthetic scores of the candidate crops by minimizing a hybrid loss function.
2. The method according to claim 1, characterized in that an initial crop is first created based on the saliency region, and candidate crops are then generated in a grid-anchor manner.
3. The method according to claim 2, characterized in that the algorithm for creating the initial crop is as follows:
inputting: the size of the image (I) is wide (W) x high (H), and the magnification is lambdalargeReduction ratio lambdasmallArea function area (·), two rectangles Re1And Re2The closest distance between the outlines of (a) and (b) Clo _ Dis (Re)1,Re2)。
Output: the initial crop region S_init_crop.
where s1 ∈ (0,1] and d1 ∈ (0,1] are the thresholds for case (b) and case (a), respectively.
4. The method according to claim 2, characterized in that candidate crops are generated in the grid-anchor manner as shown in Fig. 2:
where the input image of size W×H corresponds to an anchor grid of M×N blocks, and m1, m2, n1, n2 respectively denote the numbers of blocks from the initial crop to the boundaries of the source image, which together determine the total number of candidate crops; and a constraint is set such that a qualified crop must cover at least a certain proportion of the input image, so as to exclude candidates of unsuitable size:
area(S_crop) ≥ ρ · area(I)   (1)
where area(·) is the area function, and S_crop and S_sal denote the crop region and the saliency bounding-box region, respectively; and the aesthetic quality of the crops is improved by constraining the aspect ratio of the image:
where α1 and α2 are set to 0.5 and 2, respectively.
5. The method according to claim 1, characterized in that the features of the source image are described by a multi-scale, lightweight feature extraction network and the cropped regions are attended to by two saliency-guided alignment operators; through the feature extraction network, the source image is converted into an information-rich feature map that represents both the global context and the local context; and the feature extraction network consists of two modules: a base network and a Feature Aggregation Module (FAM).
6. The method according to claim 5, characterized in that the base network may be any effective convolutional neural network (CNN) model that captures image features while preserving a sufficient receptive field; the n-th and (n-1)-th layers are the last two layers of the base network, and skip connections between them provide a degree of global context information.
7. The method according to claim 5, characterized in that the FAM aims to compensate for the loss of global and multi-scale context during feature extraction, and is executed as follows:
Step 1: apply average pooling at different scales to generate several feature maps, each followed by a 3×3 convolutional layer.
Step 2: upsample the low-resolution feature maps directly by bilinear interpolation to the same size as the original feature map of the n-th layer.
Step 3: finally, concatenate the upsampled feature maps from the different sub-branches into the final output feature map.
8. The method according to claim 1, characterized in that saliency-aware deformable position-sensitive ROI align and ROD align are used, combining saliency information with deformable PS ROI (PS ROD) pooling and a lightweight head design to fully exploit the feature representation; saliency-aware deformable PS ROI (ROD) pooling is defined as:
where f'(i, j) and f(i, j) are the output pooled feature map and the original feature map, respectively; (x_lf, y_lf) is the top-left corner of the ROI (ROD); n is the number of pixels in a bin; (Δx, Δy) is the offset learned from a fully connected (fc) layer; S is the saliency map; and S_i,j(x, y) takes the value 0 or 1.
Further, as shown in Fig. 3, C is set to 8 to reduce the computation of the subsequent sub-network, and k is fixed to 3 in accordance with the 3×3 grid composition pattern; bilinear interpolation is used to compute the exact values in ROI (ROD) align, which resolves the rounding and misalignment issues of saliency-aware deformable PS ROI (ROD) pooling, and the resulting operator is named saliency-aware deformable PS ROI (ROD) align.
9. The method according to claim 1, characterized in that, as shown in block 2 of Fig. 1, F denotes the entire feature map generated by the feature extraction network, and F_ROI and F_ROD are the feature maps of the ROI and ROD, respectively; in branch 1, saliency-aware deformable PS ROI align at 8×8 resolution converts F_ROI into an aligned ROI feature map; in branch 2, the ROD is first reconstructed according to mode 4, four separable saliency-aware component ROD aligns are performed to generate the corresponding feature maps, each followed by a 1×1 convolutional layer to reduce the channel size, and all four feature maps are concatenated into an aligned ROD feature map; on the one hand, F_ROI and the aligned ROD feature map are concatenated and fed into the fully connected layers for the final MOS prediction; on the other hand, a copy of this concatenated feature, denoted ROI_D_P4, is fed together with the aligned ROD feature map ROD_P4 into the Siamese evaluation network.
10. The method according to claim 1, characterized in that, as shown in block 3 of Fig. 1, the Siamese network consists of two identical fully-connected networks that share weights when extracting features from ROI_D_P4 and ROD_P4; the Siamese network takes the aligned feature maps as input and outputs the predicted aesthetic scores; ROI_D_P4 and ROD_P4 denote the input feature maps of the ROI and ROD, respectively, and their predicted scores are denoted Φ(ROI_D_P4) and Φ(ROD_P4); and the Siamese aesthetic evaluation network is trained under the following constraint:
here, area(·) denotes the area function and γ is an area ratio, empirically set to 2/3; after the Siamese network, the ranking loss is defined for each potential pair as follows:
l_rank(ROI_D_P4, ROD_P4) = max{0, Φ(ROD_P4) - Φ(ROI_D_P4)}   (7)
let e_ij = g_ij - p_ij, where g_ij and p_ij are the mean opinion score (MOS) and the predicted aesthetic score of the j-th crop of image i, respectively; to enhance robustness to outliers, the Huber loss is defined as follows:
the final overall loss function is:
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010538411.1A CN112381083A (en) | 2020-06-12 | 2020-06-12 | Saliency perception image clipping method based on potential region pair |
| CN202110400578.6A CN113159028B (en) | 2020-06-12 | 2021-04-14 | Saliency-aware image cropping method and apparatus, computing device, and storage medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010538411.1A CN112381083A (en) | 2020-06-12 | 2020-06-12 | Saliency perception image clipping method based on potential region pair |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN112381083A true CN112381083A (en) | 2021-02-19 |
Family
ID=74586331
Family Applications (2)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010538411.1A Withdrawn CN112381083A (en) | 2020-06-12 | 2020-06-12 | Saliency perception image clipping method based on potential region pair |
| CN202110400578.6A Active CN113159028B (en) | 2020-06-12 | 2021-04-14 | Saliency-aware image cropping method and apparatus, computing device, and storage medium |
Family Applications After (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110400578.6A Active CN113159028B (en) | 2020-06-12 | 2021-04-14 | Saliency-aware image cropping method and apparatus, computing device, and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (2) | CN112381083A (en) |
Families Citing this family (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113763391B (en) * | 2021-09-24 | 2024-03-19 | 华中科技大学 | Intelligent image cutting method and system based on visual element relation |
| CN115115941B (en) * | 2021-11-09 | 2023-04-18 | 腾晖科技建筑智能(深圳)有限公司 | Laser radar point cloud map rod-shaped target extraction method based on template matching |
| CN116168207A (en) * | 2021-11-24 | 2023-05-26 | 北京字节跳动网络技术有限公司 | Image clipping method, model training method, device, electronic equipment and medium |
| CN114119373A (en) * | 2021-11-29 | 2022-03-01 | 广东维沃软件技术有限公司 | Image cropping method and device and electronic equipment |
| CN114529558B (en) * | 2022-02-09 | 2025-07-11 | 维沃移动通信有限公司 | Image processing method, device, electronic device and readable storage medium |
| CN117152409B (en) * | 2023-08-07 | 2024-09-27 | 中移互联网有限公司 | Image clipping method, device and equipment based on multi-mode perception modeling |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8311364B2 (en) * | 2009-09-25 | 2012-11-13 | Eastman Kodak Company | Estimating aesthetic quality of digital images |
| US10002415B2 (en) * | 2016-04-12 | 2018-06-19 | Adobe Systems Incorporated | Utilizing deep learning for rating aesthetics of digital images |
| WO2020034663A1 (en) * | 2018-08-13 | 2020-02-20 | The Hong Kong Polytechnic University | Grid-based image cropping |
| CN109544524B (en) * | 2018-11-15 | 2023-05-23 | 中共中央办公厅电子科技学院 | Attention mechanism-based multi-attribute image aesthetic evaluation system |
| CN110084284A (en) * | 2019-04-04 | 2019-08-02 | 苏州千视通视觉科技股份有限公司 | Target detection and secondary classification algorithm and device based on region convolutional neural networks |
- 2020-06-12: application CN202010538411.1A filed in CN, published as CN112381083A (status: withdrawn)
- 2021-04-14: application CN202110400578.6A filed in CN, published as CN113159028B (status: active)
Cited By (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113222904A (en) * | 2021-04-21 | 2021-08-06 | 重庆邮电大学 | Concrete pavement crack detection method for improving PoolNet network structure |
| WO2022256020A1 (en) * | 2021-06-04 | 2022-12-08 | Hewlett-Packard Development Company, L.P. | Image re-composition |
| CN113724261A (en) * | 2021-08-11 | 2021-11-30 | 电子科技大学 | Fast image composition method based on convolutional neural network |
| CN113642710A (en) * | 2021-08-16 | 2021-11-12 | 北京百度网讯科技有限公司 | Network model quantification method, device, equipment and storage medium |
| CN113642710B (en) * | 2021-08-16 | 2023-10-31 | 北京百度网讯科技有限公司 | A quantification method, device, equipment and storage medium for a network model |
| CN113706546A (en) * | 2021-08-23 | 2021-11-26 | 浙江工业大学 | Medical image segmentation method and device based on lightweight twin network |
| CN113706546B (en) * | 2021-08-23 | 2024-03-19 | 浙江工业大学 | Medical image segmentation method and device based on lightweight twin network |
| CN114025099A (en) * | 2021-11-25 | 2022-02-08 | 努比亚技术有限公司 | A method, device and computer-readable storage medium for controlling composition of photographed images |
Also Published As
| Publication number | Publication date |
|---|---|
| CN113159028B (en) | 2022-04-05 |
| CN113159028A (en) | 2021-07-23 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | WW01 | Invention patent application withdrawn after publication | Application publication date: 20210219 |