Background
With the continuous progress of society, people expect computers to possess human-like logical reasoning and decision-making abilities that relieve people of various complex tasks. Image semantic segmentation divides an image into regions representing different semantic information by assigning to each pixel a predefined semantic label that represents a category. An image subjected to semantic segmentation can be used for scene understanding tasks such as image semantic recognition and target tracking, making segmentation an important means of image processing. At present, image semantic segmentation technology has been applied in fields such as medical imaging, automatic driving, smart homes, and image search engines.
Image semantic segmentation methods can be divided into conventional methods and methods based on deep learning, the latter hereinafter referred to as deep semantic segmentation methods. Most conventional methods segment according to image characteristics such as gray scale and texture, for example segmentation based on pixel thresholds, on object edges, on regions, or on superpixels. Conventional algorithms require features to be extracted manually, and the quality of the segmentation result depends on the quality of the extracted features, so conventional methods are time-consuming and labor-intensive and produce rough results. Deep semantic segmentation methods exploit the powerful feature extraction capability of a convolutional neural network (CNN) to extract image features automatically and thus train end to end. DeepLabv3+ adopts ResNet101 as its feature extraction network, aggregates multi-scale context information through spatial pyramid pooling, and finally uses a simple and efficient decoder to restore the prediction score map to the size of the input image as the prediction result.
In the DeepLabv3+ network, the decoder fuses only the feature map with a downsampling stride of 4. Shallow feature map information produced during convolution is not fully utilized, and learning cannot be focused on the targets of interest, so the segmentation of target boundary pixels is rough and small targets are easily missed.
Image semantic segmentation is one of the key problems in computer vision, and convolutional neural networks are the mainstream approach to the task. The encoder of the DeepLabv3+ semantic segmentation network extracts high-level features effectively, but the decoder directly fuses a single low-level feature map from the feature extraction network with the high-level feature map. This fusion is too simple to recover the detail information of the image effectively, so target edge pixels are positioned inaccurately in the segmentation result, and missed and wrong segmentations occur.
Disclosure of Invention
In view of the defects of the prior art, the technical problem to be solved by the invention is to provide an attention mechanism feature fusion segmentation method for an image that helps form clearer target segmentation boundaries and finer segmentation results, and that fully utilizes the detail information in low-level feature maps to improve the semantic segmentation precision of the image.
To this end, the invention provides an attention mechanism feature fusion segmentation method for an image, which comprises the following steps:
S1, uniformly cropping the input image to a resolution of 513 × 513; an input image whose original size is smaller than 513 × 513 is cropped after a zero padding operation;
S2, reducing the input image size from 513 × 513 to 257 × 257 by a 7 × 7 convolution with a stride of 2;
S3, performing a pooling operation with a stride of 2 and a 3 × 3 pooling kernel on the output feature map of step S2, downsampling the feature map to 129 × 129;
S4, marking four different convolution stages in the feature extraction network ResNet101;
S5, aggregating multi-scale context information from the output feature map of step S4 through spatial pyramid pooling;
S6, performing 2× upsampling on the output feature map of Conv2_x by bilinear interpolation, and fusing the result with the output feature map of Conv3_x through a channel attention mechanism;
S7, performing 2× upsampling on the output feature map of step S6 by bilinear interpolation, and fusing the result with the output feature map of Conv4_x through a channel attention mechanism, to serve as the output feature map of the channel attention mechanism feature fusion module;
and S8, splicing the output feature maps of step S7 and step S5 along the channel dimension, generating a dense feature map through a 3 × 3 convolution, performing 4× upsampling on the feature map by bilinear interpolation to restore it to the resolution of the input image, and generating the final prediction result.
Preferably, in step S4, the four different convolution stages are respectively denoted Conv2_x, Conv3_x, Conv4_x, and Conv5_x, and the numbers of residual structures in Conv2_x, Conv3_x, Conv4_x, and Conv5_x are adjusted to {8, 8, 9, 8}. The shallow feature maps of the adjusted feature extraction network contain more high-level semantic information, which can effectively guide the fusion between shallow and deep features.
Preferably, in step S5, the spatial pyramid pooling comprises four feature extraction paths with different sampling rates and a global average pooling channel, the global average pooling having a global receptive field. The output feature maps of the different sampling paths are spliced along the channel dimension and then upsampled 4× by bilinear interpolation to serve as the encoder output.
Therefore, the attention mechanism feature fusion segmentation method for the image has the following beneficial effects:
the method adopts the convolutional neural network to automatically extract the image characteristics without manually selecting the characteristics, so that the image semantic segmentation process can be trained end to end, and compared with the traditional segmentation algorithm, the method is simpler. Compared with the feature map with only the downsampling step length of 4 in the DeepLabv3+, the method adopts the channel attention feature fusion module to aggregate the multi-scale shallow feature map in a cascading mode, obtains richer context information, efficiently recovers the spatial position information of the pixel in a decoder, and generates a finer segmentation result. The invention adopts ResNet101 to extract image characteristics in an encoder, performs spatial pyramid pooling and multi-scale context information, and then performs level-by-level fusion on information in a deep layer characteristic diagram and a shallow layer characteristic diagram in a decoder through a channel attention characteristic fusion module, aiming at reserving more complete low-level characteristic information and generating a clearer target boundary.
The foregoing description is only an overview of the technical solutions of the present invention. In order that the technical means of the present invention may be understood more clearly and implemented in accordance with the content of the description, and in order that the above and other objects, features, and advantages of the present invention may be appreciated more readily, a detailed description is given below in conjunction with the preferred embodiments and the accompanying drawings.
Detailed Description
Other aspects, features and advantages of the present invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, which form a part of this specification, and which illustrate, by way of example, the principles of the invention. In the referenced drawings, the same or similar components in different drawings are denoted by the same reference numerals.
The invention provides an attention mechanism feature fusion segmentation method for an image, which comprises the following steps:
s1, the input image is uniformly cropped to 513 × 513 resolution, and the input image with the original size smaller than 513 × 513 is cropped after the zero padding operation.
S2, the input image size is reduced from 513 × 513 to 257 × 257 by a 7 × 7 convolution with a stride of 2.
S3, a pooling operation with a stride of 2 and a 3 × 3 pooling kernel is performed on the output feature map of step S2; this downsampling reduces the feature map size to 129 × 129.
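For concreteness, the following PyTorch sketch reproduces the encoder stem of steps S1 to S3 under stated assumptions: the crop position, the padding values, and the 64-channel stem width are not specified in the text and follow the standard ResNet stem.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def preprocess(img: torch.Tensor, size: int = 513) -> torch.Tensor:
    """S1 sketch: zero-pad images smaller than `size`, then crop to size x size.
    The exact crop position is an assumption (top-left here)."""
    _, _, h, w = img.shape
    pad_h, pad_w = max(0, size - h), max(0, size - w)
    img = F.pad(img, (0, pad_w, 0, pad_h))                  # zero padding (S1)
    return img[:, :, :size, :size]                          # uniform crop to 513 x 513

# Encoder stem (steps S2 and S3): 513 -> 257 -> 129
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),  # S2: 513 -> 257
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),                  # S3: 257 -> 129
)

x = preprocess(torch.randn(1, 3, 480, 480))
assert stem(x).shape[-2:] == (129, 129)
```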
S4, the four different convolution stages in the feature extraction network ResNet101 are respectively denoted Conv2_x, Conv3_x, Conv4_x, and Conv5_x. The numbers of residual structures in Conv2_x, Conv3_x, Conv4_x, and Conv5_x are adjusted to {8, 8, 9, 8}, respectively. The shallow feature maps of the adjusted feature extraction network contain more high-level semantic information, which can effectively guide the fusion between shallow and deep features.
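A minimal sketch of the adjusted backbone, assuming the standard torchvision ResNet implementation with Bottleneck residual structures (the residual block type is not stated in the text):

```python
from torchvision.models.resnet import ResNet, Bottleneck

# Standard ResNet101 uses {3, 4, 23, 3} residual structures in Conv2_x..Conv5_x;
# the adjustment redistributes them to {8, 8, 9, 8} so that the high-resolution
# shallow stages carry more high-level semantic information.
backbone = ResNet(Bottleneck, layers=[8, 8, 9, 8])
```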
S5, multi-scale context information is aggregated from the output feature map of step S4 through spatial pyramid pooling, which comprises four feature extraction paths with different sampling rates and a global average pooling channel, the global average pooling having a global receptive field. The output feature maps of the different sampling paths are spliced along the channel dimension and then upsampled 4× by bilinear interpolation to serve as the encoder output.
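The sketch below illustrates a spatial pyramid pooling module of the kind described in step S5; the concrete sampling (dilation) rates (1, 6, 12, 18) and the 256-channel width are assumptions carried over from the DeepLabv3+ convention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPyramidPooling(nn.Module):
    """Four parallel paths with different sampling rates plus a global
    average pooling channel (step S5)."""
    def __init__(self, in_ch: int, out_ch: int = 256, rates=(1, 6, 12, 18)):
        super().__init__()
        self.paths = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=1 if r == 1 else 3,
                      padding=0 if r == 1 else r, dilation=r, bias=False)
            for r in rates
        ])
        self.gap = nn.Sequential(                      # global receptive field
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        )
        self.project = nn.Conv2d(out_ch * 5, out_ch, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        feats = [path(x) for path in self.paths]
        feats.append(F.interpolate(self.gap(x), size=(h, w),
                                   mode='bilinear', align_corners=False))
        y = self.project(torch.cat(feats, dim=1))      # channel-dimension splice
        # 4x bilinear upsampling as the encoder output
        return F.interpolate(y, scale_factor=4, mode='bilinear', align_corners=False)
```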
S6, the output feature map of Conv2_x is upsampled 2× by bilinear interpolation and fused with the output feature map of Conv3_x through a channel attention mechanism.
S7, the output feature map of step S6 is upsampled 2× by bilinear interpolation and fused with the output feature map of Conv4_x through a channel attention mechanism. The result is taken as the output feature map of the channel attention mechanism feature fusion module.
S8, the output feature maps of step S7 and step S5 are spliced along the channel dimension, a dense feature map is generated through a 3 × 3 convolution, and the feature map is upsampled 4× by bilinear interpolation to restore it to the resolution of the input image and generate the final prediction result.
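As a sketch of the cascade in steps S6 and S7, the helper below aligns one feature map to the other by bilinear interpolation before attention fusion. The helper name, the element-wise summation used to combine the two maps before fusion, and the assumption that both maps have already been reduced to the same channel count (see step 2 below) are illustrative; the text does not spell them out.

```python
import torch
import torch.nn.functional as F

def fuse_pair(a: torch.Tensor, b: torch.Tensor, fuse) -> torch.Tensor:
    # Resize `a` to the spatial size of `b`, combine, then apply the channel
    # attention feature fusion module (see FIG. 4 below). Summation as the
    # pre-fusion combination is an assumption.
    a = F.interpolate(a, size=b.shape[-2:], mode='bilinear', align_corners=False)
    return fuse(a + b)

# Cascade of steps S6 and S7, with conv2/conv3/conv4 the stage outputs:
#   y = fuse_pair(conv2, conv3, fuse)   # S6
#   y = fuse_pair(y, conv4, fuse)       # S7
```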
FIG. 1 is a process diagram of the image semantic segmentation algorithm with the channel attention mechanism of the present invention. The image semantic segmentation with the channel attention mechanism mainly comprises the following steps:
step 1, in the stage of an encoder, ResNet101 is used as a backbone network for extracting image features, feature graphs with down-sampling step sizes of 4, 8 and 16 are selected for feature fusion, and low-level information of the image is fully utilized. And the number of residual error structures at different convolution stages in ResNet101 is adjusted, and the proportion of detail information and semantic information in the characteristic diagram is balanced.
Step 2, in the decoder, the number of channels of the feature map is reduced to 256 by a 1 × 1 convolution before feature fusion. The shallow feature map fused by the channel attention mechanism and the output feature map of the encoder are spliced along the channel dimension, and a 3 × 3 convolution enables the decoder to fully learn the mapping relation between the features and the segmentation target. Finally, the feature map is upsampled 4×, converted to the resolution of the input image, and output as the prediction map.
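A sketch of this decoder stage under illustrative assumptions: `low` stands for the attention-fused shallow feature map, `enc` for the encoder output, and the 256-channel width of each branch follows the text and the spatial pyramid pooling sketch above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Decoder sketch (step 2 / step S8)."""
    def __init__(self, low_ch: int, num_classes: int):
        super().__init__()
        self.reduce = nn.Conv2d(low_ch, 256, kernel_size=1, bias=False)  # shrink to 256 channels
        self.refine = nn.Sequential(                    # learn the feature-to-target mapping
            nn.Conv2d(512, 256, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
        )
        self.classify = nn.Conv2d(256, num_classes, kernel_size=1)

    def forward(self, low: torch.Tensor, enc: torch.Tensor) -> torch.Tensor:
        low = self.reduce(low)
        if low.shape[-2:] != enc.shape[-2:]:            # align scales before splicing
            enc = F.interpolate(enc, size=low.shape[-2:],
                                mode='bilinear', align_corners=False)
        y = self.refine(torch.cat([low, enc], dim=1))   # channel splice + 3 x 3 conv
        y = self.classify(y)
        # 4x bilinear upsampling back to the input resolution
        return F.interpolate(y, scale_factor=4, mode='bilinear', align_corners=False)
```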
FIG. 2 is a diagram of the global channel attention calculation process of the present invention. The global channel attention module obtains a weight vector with a global receptive field through global average pooling, so that the neural network models the importance of each channel of the feature map during training and automatically determines which channels carry useful information and which carry noise. The global channel attention module mainly comprises the following steps:
and 3, firstly, performing global average pooling on the input feature map, changing the feature map with the scale of H multiplied by W multiplied by C into a vector of 1 multiplied by C, wherein each element in the vector corresponds to one channel in the input feature map and has a global receptive field. The result of the global average pooling is:
wherein Z represents a diagonalThe result of global average pooling of the feature maps X with the channel number C, XC(i, j) represents the element in the ith row and jth column on the C channel in the input feature map.
Step 4, the global average pooling vector from step 3 is taken as input to generate the global channel attention matrix M_Global:

M_Global = F_Global(X, W) = W_2 ReLU(β(W_1 Z))

where β denotes batch normalization, and W_1 and W_2 denote the 1 × 1 convolution operations that reduce and restore, respectively, the channel dimension of the feature map. M_Global stores the weight coefficients describing the semantic segmentation network's degree of interest in each channel of the input feature map, and these weight coefficients are learned during network training.
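A minimal PyTorch sketch of the global channel attention computation of FIG. 2; the channel reduction ratio r = 16 is an assumption, since the text states only that W_1 reduces and W_2 restores the channel dimension.

```python
import torch
import torch.nn as nn

class GlobalChannelAttention(nn.Module):
    """M_Global = W_2 ReLU(BN(W_1 GAP(X))), producing a 1 x 1 x C weight vector."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)             # H x W x C -> 1 x 1 x C (step 3)
        self.w1 = nn.Conv2d(channels, channels // r, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels // r)        # beta: batch normalization
        self.relu = nn.ReLU(inplace=True)
        self.w2 = nn.Conv2d(channels // r, channels, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.gap(x)                                # Z_c: mean over all (i, j)
        return self.w2(self.relu(self.bn(self.w1(z)))) # global weight vector (step 4)
```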
FIG. 3 is a diagram of the local channel attention calculation process of the present invention. In contrast to global channel attention, which generates a single weight for each channel, local channel attention learns specialized weight information for pixels at different positions on each channel. The local channel attention is calculated as follows:
and 5, directly reducing the number of channels of the input feature map to the original 1/16 by adopting 1 × 1 convolution. And modeling the correlation among the channels through convolution, and restoring the number of the channels of the feature map to C. Finally, outputting a weight matrix M with the result that elements at different positions on the same channel are subjected to differential characterizationLocal:
M_Local(X) = Conv_2(ReLU(β(Conv_1(X))))
Conv_1 and Conv_2 denote the 1 × 1 convolution operations that reduce and restore, respectively, the channel dimension of the input feature map, and β is batch normalization. Local channel attention changes the pooling window size of global channel attention; a pooling operation with a 1 × 1 window is adopted in the figure, and pooling kernels with other window sizes can be selected according to the specific situation.
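A corresponding sketch of the local channel attention computation of FIG. 3, assuming the 1/16 channel reduction stated in step 5:

```python
import torch
import torch.nn as nn

class LocalChannelAttention(nn.Module):
    """M_Local(X) = Conv_2(ReLU(BN(Conv_1(X)))), producing an H x W x C weight
    matrix with per-position weights on each channel."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels // r, kernel_size=1, bias=False)  # C -> C/16
        self.bn = nn.BatchNorm2d(channels // r)        # beta: batch normalization
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(channels // r, channels, kernel_size=1, bias=False)  # C/16 -> C

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv2(self.relu(self.bn(self.conv1(x))))
```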
FIG. 4 is a block diagram of the channel attention mechanism feature fusion module of the present invention. Input features of the same size pass through a global channel attention module and a local channel attention module, which model multi-scale channel correlation information to generate the output feature map X_out:
The input and output feature maps both have size H × W × C and are denoted X_in and X_out, respectively. The input feature map passes through the parallel global and local channel attention modules to generate the weight matrices M_Global and M_Local. M_Global is a 1 × 1 × C weight vector while M_Local is an H × W × C weight matrix; because their scales differ, the two cannot be added directly. M_Global must first be expanded (broadcast) to an H × W × C weight matrix before the element-wise summation is performed. A Sigmoid function then performs threshold softening on the summed weight matrix, mapping its coefficients into the range (0, 1).
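Combining the two sketches above, the following illustrates the FIG. 4 fusion module. Broadcasting M_Global over the spatial dimensions implements the expansion to H × W × C; applying the resulting (0, 1) weights to X_in by element-wise multiplication is an assumption, since the text stops at the Sigmoid mapping.

```python
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    """Channel attention mechanism feature fusion module (FIG. 4 sketch),
    reusing the GlobalChannelAttention and LocalChannelAttention sketches above."""
    def __init__(self, channels: int):
        super().__init__()
        self.global_att = GlobalChannelAttention(channels)
        self.local_att = LocalChannelAttention(channels)

    def forward(self, x_in: torch.Tensor) -> torch.Tensor:
        m_global = self.global_att(x_in)               # 1 x 1 x C, broadcast to H x W x C
        m_local = self.local_att(x_in)                 # H x W x C
        weights = torch.sigmoid(m_global + m_local)    # threshold softening into (0, 1)
        return x_in * weights                          # X_out, same size as X_in
```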
In the encoder, the invention adjusts the number of residual structures in each convolution stage of the feature extraction network ResNet101 so that the high-resolution low-level feature maps carry more high-level semantic information, and adopts depthwise separable convolution in place of ordinary convolution to keep the model as lightweight as possible. In the decoder, low-level feature maps of three different scales are used and fused in a cascading manner. During feature fusion, a channel attention module models the correlation between channels, enhancing the feature expression of targets of interest and weakening the influence of useless information in the low-level feature maps. The validity of the model was verified on the PASCAL VOC 2012 dataset. The results show that AFF-DeepLab generates fine segmentation boundaries and classifies confusable pixels better than DeepLabv3+. The mean intersection-over-union (mIoU) of the method reaches 81.08%, an improvement of 0.86% over DeepLabv3+, achieving higher semantic segmentation precision without increasing computational complexity.
While the foregoing is directed to the preferred embodiment of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.