
CN112989958A - Helmet wearing identification method based on YOLOv4 and significance detection - Google Patents

Helmet wearing identification method based on YOLOv4 and significance detection Download PDF

Info

Publication number
CN112989958A
CN112989958A · CN202110195098.0A
Authority
CN
China
Prior art keywords
saliency
target
detection
image
yolov4
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110195098.0A
Other languages
Chinese (zh)
Inventor
李岳阳
兰天
罗海驰
杜鹏
朱一昕
樊启高
毕恺韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute Of Technology Robot Group Wuxi Science And Technology Innovation Base Research Institute
Original Assignee
Harbin Institute Of Technology Robot Group Wuxi Science And Technology Innovation Base Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute Of Technology Robot Group Wuxi Science And Technology Innovation Base Research Institute filed Critical Harbin Institute Of Technology Robot Group Wuxi Science And Technology Innovation Base Research Institute
Priority to CN202110195098.0A priority Critical patent/CN112989958A/en
Publication of CN112989958A publication Critical patent/CN112989958A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/136Segmentation; Edge detection involving thresholding
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/30Noise filtering
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20112Image segmentation details
    • G06T2207/20132Image cropping
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract


The invention discloses a helmet wearing recognition method based on YOLOv4 and saliency detection, comprising the following steps: labeling an existing data set and training a YOLOv4 target detection model; downloading and expanding a saliency data set and training a saliency detection model; using the trained target detection model to obtain the target recognition results and target box information; using the trained saliency detection model to obtain the saliency estimate of the image; cropping the saliency estimate using the target box position information; and re-checking the single small pictures of all targets. In this way, the saliency detection method of the invention yields a dynamic saliency estimate of the image, which effectively removes the influence of the background and gives a pixel-level, highly salient result for moving targets. Re-checking the target detection results against the saliency detection results greatly reduces the probability of false detection, effectively distinguishes background items that interfere with target detection, and improves the target detection accuracy.


Description

Helmet wearing identification method based on YOLOv4 and significance detection
Technical Field
The invention relates to the field of machine vision and pattern recognition, and in particular to a helmet wearing recognition method based on YOLOv4 and saliency detection.
Background
The safety helmet is the most common and practical piece of personal protective equipment and can effectively prevent and reduce head injuries caused by external hazards. For a long time, construction workers in China have generally had limited overall training and weak safety awareness, and awareness of wearing basic protective equipment such as safety helmets has been lacking, which greatly increases operational risk and leads to occasional safety accidents. According to published network data, safety incidents caused by unsafe behavior of construction workers account for 95% of all incidents.
At present, to address the safety hazard of helmets not being worn, enterprises rely mainly on inspections by the relevant managers or on security personnel watching the surveillance video; this wastes manpower and material resources and is inefficient.
In recent years, artificial intelligence technology has matured, and successful applications of deep learning and computer vision, such as the widely known speech recognition, fingerprint recognition and face recognition, have become numerous. Such methods have the advantages of full automation, freedom from human interference and high accuracy, and can be applied in fields such as supervision and security. Once popularized, this technology will bring great change to society, free people from simple repetitive labor, and greatly improve social productivity.
Disclosure of Invention
The invention mainly solves the technical problem of providing a helmet wearing identification method based on YOLOv4 and saliency detection. The method addresses the influence of complex backgrounds that affects conventional target detection and saliency detection, and can reliably identify workers wearing helmets. Finally, the corresponding saliency estimate maps are cropped using the box positions from the target detection to obtain a saliency estimate map for each single target, and these maps are used to re-check the targets, ultimately improving the recognition accuracy.
In order to solve the above technical problem, the invention adopts the following technical scheme: a helmet wearing identification method based on YOLOv4 and saliency detection, comprising the following steps:
step 1: labeling an existing data set, and training a Yolov4 target detection model;
labeling the data with annotation software to obtain, for each picture, a file recording the target positions, sizes and labels, and then dividing the data into two parts, a training set and a test set;
adopting a YOLOv4 network as the target detection model; during training, iteratively minimizing the loss function value of the detection model based on the target detection output, and obtaining the trained target detection model after a predetermined number of iterations;
step 2: downloading and expanding a saliency data set, and training a saliency detection model;
the training set used to train the saliency detection model comes from saliency detection data sets published on the Internet; the existing saliency detection data sets are expanded to obtain a usable video training data set, and a saliency detection model with higher accuracy is trained;
step 3: obtaining the recognition result of the target and the target box information by using the trained target detection model;
using the YOLOv4 target detection model trained in step 1, feeding a picture taken by a camera, or a single frame taken from a video, into the target detection model to obtain the target detection output;
step 4: obtaining the saliency estimate of the image by using the trained saliency detection model;
performing the saliency detection task with the saliency model trained in step 2: each frame of the surveillance video is input into the neural network, which then outputs a saliency map; the static saliency model takes a single frame as input and generates a pixel-level saliency estimate; the input of the dynamic saliency model comprises two adjacent video frames and the static saliency map output by the static saliency model, and the final time-sequential dynamic saliency result is generated;
step 5: cropping the saliency estimate by using the position information of the target boxes;
because the original image, the result image of the target detection model and the estimate image produced by the saliency detection model all have the same size, the target box positions in the target detection result correspond to the target positions in the saliency estimate image, so the target positions can be marked in the saliency estimate image directly from the output obtained at the end of step 3;
step 6: rechecking single small pictures of all targets;
using the saliency estimate map of each target obtained in step 5, judging whether a target is present in the cropped single saliency map.
In a preferred embodiment of the present invention, in step 1, the YOLOv4 network mainly includes a backbone feature extraction network and an enhanced feature extraction network;
the backbone feature extraction network adopts the CSPDarkNet53 architecture and takes a 3-channel picture as input; to keep the input consistent, the original picture is scaled proportionally, and to keep the picture from being distorted, the aspect ratio is not changed when the short side is adjusted; instead, gray areas are padded above and below, or to the left and right of, the short side; in the backbone feature extraction network, residual blocks improved by CSPNet are used repeatedly for convolution, and the three final feature extraction results are the input of the subsequent enhanced feature extraction network;
the enhanced feature extraction network comprises an SPP structure and a PANet structure; in the SPP structure, the last feature layer of CSPDarknet53 is convolved three times and then processed with max pooling at four different scales, which greatly enlarges the receptive field and separates out the most salient contextual features; PANet is an instance segmentation algorithm whose structure repeatedly enhances the features;
the three effective feature layers obtained from the backbone feature extraction network and the SPP structure undergo repeated convolution with up-sampling and down-sampling to extract features effectively, finally giving three YOLO Head outputs.
In a preferred embodiment of the present invention, in step 1, the loss function value is calculated by the following three parts:
1) the error between the predicted box position and the ground-truth box position, calculated with CIoU; compared with IoU, CIoU also accounts for the errors caused by different positions and different box aspect ratios when the intersection-over-union of the two boxes is the same, making the result more accurate;
2) errors due to target confidence; when a target is correctly detected, the higher the confidence score is, the smaller the error is, otherwise, the larger the error is;
3) the error caused by class identification, i.e. the comparison of the predicted class with the actual class.
In a preferred embodiment of the present invention, each public saliency detection data set in step 2 includes thousands of pictures and corresponding saliency maps, which are used to train the static saliency detection network; training the dynamic saliency detection network additionally requires video frames adjacent to these pictures;
the pixel-level difference information between adjacent frames is represented by an optical flow field V = (u, v), where u is the vertical component and v the horizontal component, and X(x, y) denotes the position of a point; the difference relationship between adjacent frames I and I' can therefore be expressed by the following formula:
I'(X + V) = I(X), i.e. I'(x + u, y + v) = I(x, y)
taking the vertical direction u as an example (the principle in the horizontal direction is identical), the pixels of a picture are divided into foreground pixels f and background pixels b, which are processed separately; 10% of the background pixels b are given randomly initialized motion values in the range [-d, d], where d = h/10 and h is the picture height, so that some background pixels jitter slightly to simulate background noise in a real video; for the foreground pixels f, a main foreground motion mode m is first assumed, the value of m being the main motion direction and distance of the foreground target between the two frames, and the value of each pixel is then drawn randomly from the interval [m - d/10, m + d/10] to produce motion differences between pixels, so that all foreground pixels share the same overall movement trend while the exact displacement of each pixel differs, which matches reality; in this way a new picture is generated in which the foreground target has moved relative to the original picture; after the foreground and background pixels are processed, combining them generates a multi-frame video data set with a moving target.
In a preferred embodiment of the present invention, the overall operation formula of the static saliency network is:
Y = D_S(F_S(I; Θ_F); Θ_D)
where Y is the image output, I is the image input, F_S is the feature output produced by the convolutional layers, and D_S is the deconvolution operation, which ensures that the output Y has the same size as the input image I; Θ_F and Θ_D denote the parameters used during convolution and deconvolution;
each deconvolution corresponds to a convolution in the first half of the network: after expansion, the feature matrix is multiplied by the transpose of the convolution kernel used in the corresponding convolution step, producing a feature map whose size is doubled.
In a preferred embodiment of the present invention, the dynamic saliency estimate is obtained by inputting the static saliency estimate map, the original frame and the frame adjacent to it into a dynamic saliency detection network; the three are concatenated along the channel dimension to obtain an input of format (A, B, C), which enters the convolution operation of the first layer according to the following formula:
F = W ∗ [I_t, I_{t+1}, P_t] + b
where ∗ denotes convolution and [·] channel-wise concatenation, W is the convolution weight and b the bias term; I_t and I_{t+1} are two adjacent frames, and P_t is the static saliency image temporally corresponding to I_t; the convolution and deconvolution operations of the subsequent dynamic saliency network are consistent with the structure of the static saliency network; by comparing the pixel-level optical flow of two adjacent frames, dynamic highly salient targets can be detected better and the saliency recognition accuracy is improved.
In a preferred embodiment of the present invention, in step 3 YOLOv4 produces outputs at three scales for each input image; each scale corresponds to three prior boxes, so the three outputs use nine prior boxes in total for detection;
the output of the first scale has undergone the most convolution operations and the strongest compression, so it is suited to recognizing and detecting large targets and corresponds to the three largest prior boxes;
the second scale lies in the middle of the three outputs, is suited to medium-sized targets, and uses the three medium-sized prior boxes;
the third scale is the output with the fewest convolutions, so the three smallest prior boxes are used, giving a better recognition effect for small targets in the picture;
for locating the recognized targets, YOLOv4 obtains the box position information with the following formulas:
b_x = σ(t_x) + c_x,  b_y = σ(t_y) + c_y,  b_w = p_w · exp(t_w),  b_h = p_h · exp(t_h)
where (t_x, t_y, t_w, t_h) is the prediction output of the model, i.e. the network's learning target; (c_x, c_y) is the coordinate offset of the cell, in units of the cell side length; (p_w, p_h) are the preset side lengths of the anchor box; (b_x, b_y, b_w, b_h) are the centre coordinates and the width and height of the finally obtained predicted bounding box; and σ denotes the sigmoid function.
In a preferred embodiment of the present invention, in step 4, in order to effectively distinguish the background part and the moving target part in the image, the model is divided into a static model and a dynamic model, which are combined with each other to capture the space and time information of the image at the same time, and a pixel-level saliency map is directly generated through a full convolution network;
the static saliency model takes a single-frame image as input and generates pixel-level saliency estimation; the input of the dynamic significance model comprises two adjacent video frames in the video and a static significance map output by the static significance model, and a final dynamic significance result according to a time sequence is generated.
In a preferred embodiment of the present invention, in step 6 highlighted white regions in the saliency estimate are regarded as valid targets and black regions as background; if the white proportion is large, the recognized target is a valid target and passes the re-check; if the black proportion is large, the recognized target is background and the recognition result is judged to be a false detection.
The invention has the beneficial effects that: according to the method, a target detection model is established by using YOLOv4, then a saliency detection model is used for rechecking a YOLOv4 target detection result, and a Convolutional Neural Network (CNN) is adopted for both models, so that the method has good robustness and accuracy.
Compared with methods that perform the target detection task with YOLOv4 or similar detectors alone, the method of the invention also uses saliency detection to obtain a dynamic saliency estimate of the image, which effectively removes the influence of the background and yields a pixel-level, highly salient result for moving targets. Re-checking the target detection results against the saliency detection results greatly reduces the probability of false detection, effectively distinguishes background items that interfere with target detection, and improves the target detection accuracy.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without inventive efforts, wherein:
FIG. 1 is a flow chart of the detection of the present invention;
FIG. 2 is a network structure of a Yolov4 target detection model;
FIG. 3 is a static saliency model network structure used by the present invention;
FIG. 4 is a combined structure of a static saliency module and a dynamic saliency module of the present invention;
FIG. 5 is an example of the recognition result of the present invention;
FIG. 6 is a diagram illustrating an exemplary review operation performed on a false positive test picture according to the present invention;
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present invention, it should be noted that the terms "front" and "back" and the like indicate orientations and positional relationships based on orientations and positional relationships shown in the drawings or orientations and positional relationships where the products of the present invention are conventionally placed in use, and are used for convenience in describing the present invention and simplifying the description, but do not indicate or imply that the devices or elements to be referred must have a specific orientation, be constructed in a specific orientation, and be operated, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like are used merely to distinguish one description from another, and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless otherwise explicitly specified or limited, the terms "disposed" and "connected" are to be interpreted broadly, e.g., as being either fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
In the present invention, unless otherwise expressly stated or limited, the first feature may be present on or under the second feature in direct contact with the first and second feature, or may be present in the first and second feature not in direct contact but in contact with another feature between them. Also, the first feature being above, on or above the second feature includes the first feature being directly above and obliquely above the second feature, or merely means that the first feature is at a higher level than the second feature. A first feature that underlies, and underlies a second feature includes a first feature that is directly under and obliquely under a second feature, or simply means that the first feature is at a lesser level than the second feature.
The embodiment of the invention comprises the following steps:
A helmet wearing identification method based on YOLOv4 and saliency detection comprises the following steps:
step 1: labeling the existing data set, and training a Yolov4 target detection model.
Annotation software is used to label the data; such software focuses on object-detection labeling tasks, allows the box and label of a single picture to be marked quickly and conveniently, and is well suited to the target detection task. After labeling, an xml file recording the target positions, sizes and labels of each picture is obtained, which completes the preparation of the data required for training. These data are then divided into two parts, a training set and a test set.
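By way of illustration, the following minimal sketch parses one such xml annotation file and splits the data set; it assumes a Pascal VOC-style layout of the kind produced by common annotation tools (tag names such as "object" and "bndbox", the 85/15 split and the helper names are illustrative assumptions, not details taken from the invention):

    import random
    import xml.etree.ElementTree as ET
    from pathlib import Path

    def parse_annotation(xml_path):
        # Return (filename, [(label, xmin, ymin, xmax, ymax), ...]) from one annotation file.
        root = ET.parse(xml_path).getroot()
        filename = root.findtext("filename")
        boxes = []
        for obj in root.iter("object"):
            label = obj.findtext("name")                  # e.g. "hat" or "person"
            bb = obj.find("bndbox")
            boxes.append((label,
                          int(bb.findtext("xmin")), int(bb.findtext("ymin")),
                          int(bb.findtext("xmax")), int(bb.findtext("ymax"))))
        return filename, boxes

    def split_dataset(xml_dir, train_ratio=0.85, seed=0):
        # Shuffle all annotation files and split them into a training list and a test list.
        files = sorted(Path(xml_dir).glob("*.xml"))
        random.Random(seed).shuffle(files)
        cut = int(len(files) * train_ratio)
        return files[:cut], files[cut:]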
In step 1, a YOLOv4 network is used as the target detection model, as shown in fig. 2; it mainly comprises a backbone feature extraction network and an enhanced feature extraction network. The backbone feature extraction network adopts the CSPDarkNet53 architecture and takes 416 × 416 3-channel pictures as input; to keep the input consistent, the original picture is scaled proportionally so that its long side becomes 416. Then, to keep the picture from being distorted, the aspect ratio is not changed when the short side is adjusted; instead, gray areas are padded above and below, or to the left and right of, the short side so that the whole picture reaches the 416 × 416 input size. In the backbone network, residual blocks improved by CSPNet are used repeatedly for convolution; the greatest strength of a residual network is that it is easy to optimize while allowing additional depth to improve accuracy, and the three final feature extraction results are the input of the subsequent enhanced feature extraction network.
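A minimal sketch of this letterbox-style resizing is given below; the use of OpenCV arrays and the gray value 128 are illustrative assumptions rather than details fixed by the invention:

    import cv2
    import numpy as np

    def letterbox(image, target=416, pad_value=128):
        # Scale the long side to `target`, keep the aspect ratio, pad the short side gray.
        h, w = image.shape[:2]
        scale = target / max(h, w)
        new_w, new_h = int(round(w * scale)), int(round(h * scale))
        resized = cv2.resize(image, (new_w, new_h))
        canvas = np.full((target, target, 3), pad_value, dtype=np.uint8)
        top, left = (target - new_h) // 2, (target - new_w) // 2
        canvas[top:top + new_h, left:left + new_w] = resized
        return canvas, scale, (left, top)   # scale and offsets are needed to map boxes back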
The enhanced feature extraction network comprises an SPP structure and a PANet structure. In the SPP structure, the last feature layer of CSPDarknet53 is convolved three times and then processed with max pooling at four different scales, with kernel sizes of 13 × 13, 9 × 9, 5 × 5 and 1 × 1 (1 × 1 means no processing); this greatly enlarges the receptive field and separates out the most salient contextual features. PANet is an instance segmentation algorithm whose structure repeatedly enhances the features. The three effective feature layers obtained from the backbone feature extraction network and the SPP structure undergo repeated convolution with up-sampling and down-sampling to extract features effectively, finally giving three YOLO Head outputs. This multi-scale detection approach effectively improves the detection accuracy.
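The SPP idea described above may be sketched as follows; PyTorch and the stride-1, "same"-padding realisation of the 13/9/5 max-pooling kernels are assumptions for illustration, not the exact implementation of the invention:

    import torch
    import torch.nn as nn

    class SPP(nn.Module):
        def __init__(self, kernel_sizes=(13, 9, 5)):
            super().__init__()
            # stride 1 with "same" padding keeps the spatial size, so outputs can be concatenated
            self.pools = nn.ModuleList(
                nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernel_sizes)

        def forward(self, x):
            # the unpooled input plays the role of the 1x1 (no-processing) branch
            return torch.cat([p(x) for p in self.pools] + [x], dim=1)

    # e.g. a (1, 512, 13, 13) feature map becomes (1, 2048, 13, 13)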
When YOLOv4 is trained, the loss function value of the detection model is iteratively minimized based on the three YOLO Head outputs, and the trained target detection model is obtained once the predetermined number of iterations is reached. The loss function value is calculated from three parts:
1) The error between the predicted box and the ground-truth box position, calculated with CIoU. Compared with IoU, CIoU also accounts for the errors caused by different positions and different box aspect ratios when the intersection-over-union of the two boxes is the same, making the result more accurate (a sketch of the CIoU computation follows this list).
2) Target confidence. When an object is correctly detected, the higher the confidence score, the smaller the error and vice versa.
3) The error caused by class identification, i.e. the comparison of the predicted class with the actual class.
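The CIoU term mentioned in item 1) can be illustrated with the following sketch, which follows the standard CIoU definition for boxes given as (centre x, centre y, width, height) and is not code taken from the invention:

    import math

    def ciou(box_a, box_b, eps=1e-9):
        # boxes are (centre_x, centre_y, width, height)
        ax, ay, aw, ah = box_a
        bx, by, bw, bh = box_b
        ax1, ay1, ax2, ay2 = ax - aw / 2, ay - ah / 2, ax + aw / 2, ay + ah / 2
        bx1, by1, bx2, by2 = bx - bw / 2, by - bh / 2, bx + bw / 2, by + bh / 2
        # ordinary IoU
        iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
        ih = max(0.0, min(ay2, by2) - max(ay1, by1))
        inter = iw * ih
        union = aw * ah + bw * bh - inter + eps
        iou = inter / union
        # squared centre distance over squared diagonal of the smallest enclosing box
        cw, ch = max(ax2, bx2) - min(ax1, bx1), max(ay2, by2) - min(ay1, by1)
        rho2 = (ax - bx) ** 2 + (ay - by) ** 2
        c2 = cw ** 2 + ch ** 2 + eps
        # aspect-ratio consistency term
        v = 4 / math.pi ** 2 * (math.atan(bw / (bh + eps)) - math.atan(aw / (ah + eps))) ** 2
        alpha = v / (1 - iou + v + eps)
        return iou - rho2 / c2 - alpha * v   # the position loss then uses 1 - CIoU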
Step 2: downloading and expanding a saliency data set, and training a saliency detection model.
The training set used to train the saliency detection model comes from saliency detection data sets published on the Internet, including ECSSD, HKU-IS, THUR15K, PASCAL-S, DUTS and the like. These data sets contain original images together with corresponding pictures annotated with the ground-truth salient target, with backgrounds ranging from simple solid colors to complex and varied scenes. Based on these existing public data sets, a saliency detection model with higher accuracy can be trained.
Each public data set includes thousands of pictures and corresponding saliency maps, which are sufficient to train the static saliency detection network well. Training the dynamic saliency detection network additionally requires video frames adjacent to these pictures, so in the invention the existing saliency picture data sets are expanded to obtain a usable video training data set.
For different pictures, the present invention represents the pixel-level difference information between adjacent frames by an optical flow field V = (u, v), where u is the vertical component and v the horizontal component, and X(x, y) denotes the position of a point. The difference relationship between adjacent frames I and I' can therefore be expressed by the following formula:
I'(X + V) = I(X), i.e. I'(x + u, y + v) = I(x, y)
Taking the vertical direction u as an example (the principle in the horizontal direction is identical), the pixels of a picture are divided into foreground pixels f and background pixels b, which are processed separately. 10% of the background pixels b are given randomly initialized motion values in the range [-d, d], where d = h/10 and h is the picture height, so that some background pixels jitter slightly to simulate background noise in a real video. For the foreground pixels f, a main foreground motion mode m is first assumed, the value of m being the main motion direction and distance of the foreground target between the two frames; the value of each pixel is then drawn randomly from the interval [m - d/10, m + d/10] to produce motion differences between pixels, so that all foreground pixels share the same overall movement trend while the exact displacement of each pixel differs, which matches reality. In this way a new picture is generated in which the foreground target has moved relative to the original picture. After the foreground and background pixels are processed, combining them generates a multi-frame video data set with a moving target; a rough sketch of this augmentation follows.
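A rough sketch of this augmentation, under the assumptions that the image is an H × W × 3 array, that the binary saliency ground truth serves as the foreground mask, and that a remap-based warp is one acceptable realisation (not the invention's exact code), is:

    import cv2
    import numpy as np

    def synthesize_next_frame(image, mask, m=8, seed=0):
        # `image` is an H x W x 3 array, `mask` the binary saliency ground truth (foreground = 1).
        rng = np.random.default_rng(seed)
        h, w = mask.shape
        d = h // 10
        dy = np.zeros((h, w), dtype=np.float32)

        fg = mask > 0
        # foreground: common trend m plus a per-pixel variation drawn from [m - d/10, m + d/10]
        dy[fg] = rng.uniform(m - d / 10, m + d / 10, size=int(fg.sum()))
        # background: 10% of the pixels get a random jitter in [-d, d] to imitate video noise
        bg_idx = np.flatnonzero(~fg)
        chosen = rng.choice(bg_idx, size=len(bg_idx) // 10, replace=False)
        dy.ravel()[chosen] = rng.uniform(-d, d, size=len(chosen))

        # warp the image with the vertical displacement field (the horizontal direction is analogous)
        xs, ys = np.meshgrid(np.arange(w, dtype=np.float32), np.arange(h, dtype=np.float32))
        return cv2.remap(image, xs, ys - dy, interpolation=cv2.INTER_LINEAR,
                         borderMode=cv2.BORDER_REPLICATE)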
In the saliency detection model, a picture or video frame is input into a neural network, which outputs a saliency map, where lighter pixels represent objects with high saliency values and darker pixels represent the background.
As shown in fig. 3, the network structure of the static saliency model is mainly divided into two parts, and the convolution of the left half part is used for extracting picture features; and the right half part is deconvoluted, so that the size of the network output is the same as that of the input.
In the saliency detection network, the input picture is scaled to (224, 224, 3) and then passed through 3 × 3 convolutions 13 times, finally yielding a feature output of format (14, 14, 512); the corresponding deconvolutions then return the feature map to the original size. Compared with convolutions with a large kernel, using repeated convolutions with a small kernel increases the network depth without changing the receptive field, thereby improving the learning accuracy of the network. The overall operation of the static saliency network is:
Y = D_S(F_S(I; Θ_F); Θ_D)
where Y is the image output, I is the image input, F_S is the feature output produced by the convolutional layers, and D_S is the deconvolution operation, which ensures that the output Y has the same size as the input image I; Θ_F and Θ_D denote the parameters used during convolution and deconvolution.
Each deconvolution corresponds to a convolution in the first half of the network: after expansion, the feature matrix is multiplied by the transpose of the convolution kernel used in the corresponding convolution step, giving a feature map whose size is doubled, and finally an output whose height and width are 224 is obtained.
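A toy encoder-decoder sketch of this idea follows; PyTorch, the exact layer counts and the channel widths are assumptions, and the sketch only illustrates the 224 → 14 → 224 shape flow rather than the exact network of the invention:

    import torch
    import torch.nn as nn

    class StaticSaliencyNet(nn.Module):
        def __init__(self):
            super().__init__()
            chans = [3, 64, 128, 256, 512]
            enc = []
            for cin, cout in zip(chans[:-1], chans[1:]):
                enc += [nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
                        nn.MaxPool2d(2)]                   # halve the resolution each stage
            self.encoder = nn.Sequential(*enc)             # (B, 3, 224, 224) -> (B, 512, 14, 14)

            dec = []
            for cin, cout in zip(chans[:0:-1], chans[-2::-1]):
                dec += [nn.ConvTranspose2d(cin, cout, 2, stride=2), nn.ReLU(inplace=True)]
            self.decoder = nn.Sequential(*dec,             # (B, 512, 14, 14) -> (B, 3, 224, 224)
                                         nn.Conv2d(3, 1, 1), nn.Sigmoid())   # 1-channel saliency map

        def forward(self, x):
            return self.decoder(self.encoder(x))

    # smoke test: a random 224 x 224 RGB frame in, a 224 x 224 saliency map out
    print(StaticSaliencyNet()(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 1, 224, 224])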
To obtain the dynamic saliency estimate, as shown in fig. 4, the static saliency estimate map obtained above, the original frame and the frame adjacent to it are input simultaneously to the dynamic saliency detection network; the three are concatenated along the channel dimension to obtain an input of format (224, 224, 7), which enters the convolution operation of the first layer according to the following formula:
F = W ∗ [I_t, I_{t+1}, P_t] + b
where ∗ denotes convolution and [·] channel-wise concatenation, W is the convolution weight and b the bias term; I_t and I_{t+1} are two adjacent frames, and P_t is the static saliency image corresponding to I_t. The subsequent convolution and deconvolution operations of the dynamic saliency network are consistent with the structure of the static saliency network.
Through the optical flow comparison of the pixel levels of two adjacent frames, a dynamic high-significance target can be better detected, and the significance recognition accuracy is improved.
The output of the static saliency network is used as part of the input of the dynamic saliency network, and the spatio-temporal saliency result of the picture is generated directly; the dynamic and static saliency are fused and explicitly embedded into the dynamic saliency network, rather than training spatio-temporal features with a two-stream network, which reduces repeated computation and redundant network parameters. The model uses the optical-flow relation to infer the temporal information of two adjacent video frames directly, instead of the traditional approach of mainly comparing the color differences of individual pixels, thereby achieving higher computational efficiency and precision.
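How the 7-channel input of the dynamic branch can be assembled is sketched below; PyTorch tensors are an assumption, and the body of the dynamic network is taken to mirror the static sketch given earlier:

    import torch
    import torch.nn as nn

    def dynamic_input(frame_t, frame_t1, static_map):
        # concatenate two 3-channel frames and the 1-channel static saliency map -> 7 channels
        return torch.cat([frame_t, frame_t1, static_map], dim=1)   # (B, 3 + 3 + 1, 224, 224)

    first_layer = nn.Conv2d(7, 64, kernel_size=3, padding=1)        # the "W * input + b" of the formula
    x = dynamic_input(torch.randn(1, 3, 224, 224),
                      torch.randn(1, 3, 224, 224),
                      torch.randn(1, 1, 224, 224))
    print(first_layer(x).shape)   # torch.Size([1, 64, 224, 224])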
Step 3: obtaining the recognition result of the target and the target box information by using the trained target detection model.
The trained YOLOv4 target detection model detects the test data, and the output image can show three kinds of results: first, a worker correctly wearing a safety helmet, in which case a box with the label "hat" is drawn on the worker's head; second, a worker not wearing a safety helmet, in which case a box with the label "person" is drawn on the worker's head; and third, a safety helmet that is not being worn on a worker's head, which is a false-detection target; in general the trained YOLOv4 does not detect and mark such a target, and if it is marked with a "hat" label it can be screened out in the subsequent re-check operation of the invention.
During detection, YOLOv4 mainly uses a multi-scale detection method, which effectively handles the fact that targets at different distances from the camera appear at different sizes in the image.
YOLOv4 produces outputs at three scales for each input image; each scale corresponds to three prior boxes, so the three outputs use nine prior boxes in total for detection. The first scale is 13 × 13; its output has undergone the most convolution operations and the strongest compression, so it is suited to recognizing and detecting large targets and corresponds to the three largest prior boxes.
The second scale is 26 × 26; it lies in the middle of the three outputs, is suited to medium-sized targets, and likewise uses the three medium-sized prior boxes.
The third scale is 52 × 52, the output with the fewest convolutions, so the three smallest prior boxes are used, giving a better recognition effect for small targets in the picture. YOLOv4 detects the fused feature maps of the several scales independently and finally achieves good recognition accuracy for targets of different sizes; an illustrative grouping of the nine prior boxes is sketched below.
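One illustrative grouping of the nine prior boxes over the three output scales is shown below; the widths and heights are the COCO defaults commonly shipped with YOLOv4, not values specified by the invention:

    # widths/heights are the COCO defaults commonly shipped with YOLOv4, not values from the patent
    ANCHORS = [(12, 16), (19, 36), (40, 28),        # 52 x 52 head: small targets
               (36, 75), (76, 55), (72, 146),       # 26 x 26 head: medium targets
               (142, 110), (192, 243), (459, 401)]  # 13 x 13 head: large targets

    SCALES = {13: ANCHORS[6:], 26: ANCHORS[3:6], 52: ANCHORS[:3]}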
For locating the recognized targets, YOLOv4 obtains the box position information with the following formulas:
b_x = σ(t_x) + c_x,  b_y = σ(t_y) + c_y,  b_w = p_w · exp(t_w),  b_h = p_h · exp(t_h)
where (t_x, t_y, t_w, t_h) is the prediction output of the model, i.e. the network's learning target; (c_x, c_y) is the coordinate offset of the cell, in units of the cell side length; (p_w, p_h) are the preset side lengths of the anchor box; (b_x, b_y, b_w, b_h) are the centre coordinates and the width and height of the finally obtained predicted bounding box; and σ denotes the sigmoid function.
In step 3, using the trained helmet detection model obtained in step 1, a picture taken by the camera, or a single frame taken from its video, is fed into the target detection model to obtain the "hat" and "person" detection outputs, from which it can be seen directly whether the photographed workers are wearing their helmets correctly.
To facilitate the subsequent cropping in step 5, all detected targets and the position information of each target, namely left, top, right and bottom, are marked on the original image. These four values differ from the model's direct outputs (b_x, b_y, b_w, b_h): (left, top) is the pixel coordinate of the upper-left corner of the box and (right, bottom) is the pixel coordinate of the lower-right corner, and they are obtained by a simple conversion, as sketched below.
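Both coordinate steps, decoding (t_x, t_y, t_w, t_h) into a centre/size box and converting it to (left, top, right, bottom), can be sketched as follows (the helper names are illustrative):

    import math

    def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
        # the formulas above: centre from the sigmoid-squashed offsets, size from the scaled anchor
        sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
        bx = sigmoid(tx) + cx
        by = sigmoid(ty) + cy
        bw = pw * math.exp(tw)
        bh = ph * math.exp(th)
        return bx, by, bw, bh

    def to_corners(bx, by, bw, bh):
        # (centre, size) -> (left, top, right, bottom), the form used for cropping in step 5
        return bx - bw / 2, by - bh / 2, bx + bw / 2, by + bh / 2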
Step 4: obtaining the saliency estimate of the image by using the trained saliency detection model.
In step 4, the saliency detection task is performed with the trained saliency model obtained in step 2: each frame of the surveillance video is input into the neural network, which outputs a saliency map in which bright white marks highly salient regions and black marks the non-salient background.
As shown in fig. 4, in order to effectively distinguish a background portion and a moving target portion in an image, a model is divided into a static model and a dynamic model, and the static model and the dynamic model are combined with each other to capture space and time information of the image at the same time, so that a moving high-saliency target in the image is effectively identified, and a pixel-level saliency map is directly generated through a full convolution network.
The static saliency model takes a single frame image as input and generates pixel-level saliency estimation. The input of the dynamic significance model comprises two adjacent video frames in the video and a static significance map output by the static significance model, and a final dynamic significance result according to a time sequence is generated.
Compared with the traditional approach of comparing the color differences between pixels, the saliency estimate picture obtained in this way greatly improves the recognition accuracy for moving targets. By comparing adjacent frames, the background can be effectively judged to have low saliency, which greatly improves the accuracy of the recognition result.
To facilitate the subsequent step 5 of cropping the image, the output of the dynamic saliency model is scaled to the original image size.
Step 5: cropping the saliency estimate by using the position information of the target boxes.
Since the original image, the result image of the target detection model and the estimate image produced by the saliency detection model all have the same size, the box positions in the target detection result correspond to the target positions in the saliency estimate image, and the target positions can be marked in the saliency estimate image directly using the box coordinates left, top, right and bottom obtained at the end of step 3.
As shown in fig. 5, the detection results of the present invention include (a) the original input image used for testing and (b) the result image detected by the YOLOv4 target detection model: after the original image is input into the trained YOLOv4 helmet detection model, workers correctly wearing helmets are marked "hat" and workers without helmets are marked "person", and the box positions of these targets are saved. (c) is the static saliency picture and (d) the dynamic saliency estimate map. (e) shows how, using the coordinate information obtained in (b) to find the corresponding positions in the saliency detection result (d), all targets are cut out separately by a cropping operation, giving a small picture of each single target that is stored locally to facilitate the subsequent re-check; a sketch of this cropping step follows.
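A minimal sketch of this cropping step is given below, assuming the saliency map is a grayscale array already resized to the original image size and that the boxes come from the detector output in (label, left, top, right, bottom) form:

    import os
    import cv2

    def crop_targets(saliency, boxes, out_dir="crops"):
        # cut one small saliency picture per detected box, given (label, left, top, right, bottom)
        os.makedirs(out_dir, exist_ok=True)
        crops = []
        h, w = saliency.shape[:2]
        for i, (label, left, top, right, bottom) in enumerate(boxes):
            left, top = max(0, int(left)), max(0, int(top))
            right, bottom = min(w, int(right)), min(h, int(bottom))
            crop = saliency[top:bottom, left:right]
            cv2.imwrite(f"{out_dir}/{i}_{label}.png", crop)
            crops.append((label, crop))
        return crops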
Step 6: re-checking the single small pictures of all targets.
In step 6, the re-check operation uses the saliency estimate map of each target obtained in step 5. Because bright white in the saliency estimate marks a valid target and black marks the background, the re-check mainly judges whether a target is present in each cropped single saliency image. If the white proportion is large, the recognized target is a valid target and passes the re-check; if the black proportion is large, the recognized target is background and the recognition result is judged to be a false detection. Specifically, based on the experimental results, each pixel of a saliency estimate map whose gray value lies in [0, 10] is counted as a black background pixel, and every other pixel as a target pixel; after all pixels have been processed, the proportion of black background pixels in the picture is computed.
As shown in fig. 6, (a) is the original image, (b) the YOLOv4 detection result image, (c) the dynamic saliency estimate image, and (d) the cropping result at the corresponding positions of the 3 targets in the dynamic saliency image. After the black percentage of each single saliency target is obtained, it is compared with a set threshold, which in the invention can be set to 70%. If the black proportion in the saliency estimate of a single target is more than 70%, the target is judged to be background, its re-check fails, and the target is judged to be a false detection. If the black proportion is less than 70%, the target is judged to be a real target that needs to be detected, and it passes the re-check. Therefore, in fig. 6, two of the 3 targets pass the re-check (the worker wearing a helmet and the worker without one) and one is rejected (the helmet placed on the table); a sketch of this re-check rule follows.
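A sketch of this re-check rule, assuming 8-bit grayscale crops, is:

    import numpy as np

    def recheck(crop, black_threshold=10, reject_ratio=0.70):
        # gray values in [0, black_threshold] count as background; reject the detection
        # as false when more than `reject_ratio` of the cropped patch is background
        background = crop.astype(np.uint8) <= black_threshold
        return background.mean() <= reject_ratio

    # usage with the crops produced in step 5:
    # kept = [(label, c) for label, c in crops if recheck(c)]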
Through the re-inspection operation after the target detection, the false detection rate in the target detection can be greatly reduced, and the overall identification accuracy is improved.
In the experiments, 3703 pictures and one piece of video data were used. The pictures were divided into a training set (3203 pictures) and a test set (450 pictures). A YOLOv4 target detection model was trained on the training set, and the target detection accuracy during testing exceeded 96%.
For the video data, the YOLOv4 target detection result is obtained first during testing and is then re-checked with the saliency detection model; the experimental results show that the method detects targets effectively and that the re-check effectively screens out falsely detected targets.
When the method is used, target detection can be performed by the neural network on the video data returned by the surveillance camera, after which the original image is input into the saliency detection model and passed through the static and dynamic saliency estimation models in turn to obtain the saliency estimate of the targets. This recognition method overcomes the influence of complex backgrounds that affects conventional target detection and saliency detection, and can reliably identify workers wearing helmets. Finally, the corresponding saliency estimate maps are cropped using the box positions from the target detection to obtain a saliency estimate map for each single target, and these maps are used to re-check the targets, ultimately improving the recognition accuracy.
According to the method, a target detection model is established by using YOLOv4, then a saliency detection model is used for rechecking a YOLOv4 target detection result, and a Convolutional Neural Network (CNN) is adopted for both models, so that the method has good robustness and accuracy.
Compared with methods that perform the target detection task with YOLOv4 or similar detectors alone, the method of the invention also uses saliency detection to obtain a dynamic saliency estimate of the image, which effectively removes the influence of the background and yields a pixel-level, highly salient result for moving targets. Re-checking the target detection results against the saliency detection results greatly reduces the probability of false detection, effectively distinguishes background items that interfere with target detection, and improves the target detection accuracy.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by the present specification, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (9)

1.一种基于YOLOv4与显著性检测的安全帽佩戴识别方法,其特征在于,包括以下步骤:1. a safety helmet wearing identification method based on YOLOv4 and significance detection, is characterized in that, comprises the following steps: 步骤1:对已有的数据集进行标注,训练YOLOv4目标检测模型;Step 1: Label the existing data set and train the YOLOv4 target detection model; 利用标注软件对数据进行标注,得到记录着每张图片中目标位置大小与标签的文件后,再将这些数据分为二份,分别为训练集和测试集;Use the labeling software to label the data, and after obtaining the file recording the size and label of the target position in each picture, the data is divided into two parts, namely the training set and the test set; 采用YOLOv4网络作为目标检测模型,在训练时,基于目标检测的输出,通过迭代计算最小化检测模型的损失函数值,当达到预先确定的迭代次数后,得到训练完成的目标检测模型;The YOLOv4 network is used as the target detection model. During training, based on the output of the target detection, the loss function value of the detection model is minimized by iterative calculation. When the predetermined number of iterations is reached, the trained target detection model is obtained; 步骤2:下载并扩充显著性数据集,训练显著性检测模型;Step 2: Download and expand the saliency dataset, and train the saliency detection model; 在训练显著性检测模型时所使用的训练集来自于网络中公开的显著性检测数据集,对已有的显著性检测数据集进行数据扩充,得到可使用的视频训练数据集,并训练出有较高准确度的显著性检测模型;The training set used in training the saliency detection model comes from the saliency detection data set published in the network, and the existing saliency detection data set is expanded to obtain a usable video training data set. High-accuracy saliency detection model; 步骤3:利用训练好的目标检测模型,得到目标的识别结果与目标框体信息;Step 3: Use the trained target detection model to obtain the target recognition result and target frame information; 利用步骤1训练好的YOLOv4目标检测模型,将摄像头拍摄到的图片或者视频中取得的单个帧传入目标检测模型,得到关于目标检测结果的输出;Using the YOLOv4 target detection model trained in step 1, transfer the picture captured by the camera or a single frame obtained from the video to the target detection model to obtain the output of the target detection result; 步骤4:利用训练好的显著性检测模型,得到图像的显著性估计;Step 4: Use the trained saliency detection model to obtain the saliency estimation of the image; 用步骤2训练好的显著性模型执行显著性检测任务,将视频监控的每一帧图像输入到神经网络中,之后网络输出显著性映射;静态显著性模型是以单帧图像为输入,生成像素级显著性估计;动态显著性模型的输入包括视频中相邻两个视频帧,以及静态显著性模型输出的静态显著性图,生成最终依据时间序列的动态显著性结果;Use the saliency model trained in step 2 to perform the saliency detection task, input each frame of video surveillance image into the neural network, and then the network outputs the saliency map; the static saliency model takes a single frame of image as input, and generates pixels Level saliency estimation; the input of the dynamic saliency model includes two adjacent video frames in the video, and the static saliency map output by the static saliency model to generate the final dynamic saliency result based on time series; 步骤5:利用目标框体位置信息,对显著性估计进行裁剪;Step 5: Use the target frame position information to crop the saliency estimate; 由于原图、目标检测模型的结果图与显著性检测模型识别后的估计图的大小规格都是一致的,所以目标检测结果中的目标框体位置,与显著性估计图中的目标位置是相对应的,可以直接利用步骤3最后得到的输出在显著性估计图中标注目标位置;Since the size specifications of the original image, the result image of the target detection model and the estimated image recognized by the saliency detection model are consistent, the position of the target frame in the target detection result is the same as the target position in the saliency estimation image. 
Correspondingly, you can directly use the output obtained in step 3 to mark the target position in the saliency estimation map; 步骤6:对所有目标的单张小图片进行复检;Step 6: Recheck the single small pictures of all the targets; 利用步骤5得到的每个目标的显著性估计图,对裁剪的单个显著性图进行判断,是否存在目标。Use the saliency estimation map of each target obtained in step 5 to judge whether there is a target in the cropped single saliency map. 2.根据权利要求1所述的基于YOLOv4与显著性检测的安全帽佩戴识别方法,其特征在于,步骤1中,YOLOv4网络主要包括主干特征提取网络和加强特征提取网络;2. the safety helmet wearing identification method based on YOLOv4 and salience detection according to claim 1, is characterized in that, in step 1, YOLOv4 network mainly comprises backbone feature extraction network and strengthened feature extraction network; 主干特征提取网络采用了CSPDarkNet53架构,其输入为3通道图片,为了保证输入的一致,会把原始图片进行等比缩放;之后为了保证图片不失真,在调整短边时不会改变图片的长宽比,而是在短边上下或左右扩充灰色区域;在主干特征提取网络中,多次采用CSPnet改进的残差块来进行卷积,最终特征提取的三个结果就是后续加强特征提取网络的输入;The backbone feature extraction network adopts the CSPDarkNet53 architecture, and its input is a 3-channel image. In order to ensure the consistency of the input, the original image will be proportionally scaled; in order to ensure that the image is not distorted, the length and width of the image will not be changed when adjusting the short side. Instead, the gray area is expanded up and down or left and right on the short side; in the backbone feature extraction network, the improved residual block of CSPnet is used for convolution many times, and the three results of the final feature extraction are the input of the subsequent enhanced feature extraction network. ; 加强特征提取网络包括SPP结构和PANet结构;在SPP结构中,对CSPdarknet53的最后一个特征层进行三次卷积后,再分别利用四个不同尺度的最大池化进行处理,能够极大地增加感受野,分离出最显著的上下文特征;PANet是一种实例分割算法,其具体结构有反复提升特征的作用;The enhanced feature extraction network includes SPP structure and PANet structure; in the SPP structure, after performing three convolutions on the last feature layer of CSPdarknet53, four different scales of maximum pooling are used for processing, which can greatly increase the receptive field. Isolate the most significant context features; PANet is an instance segmentation algorithm, and its specific structure has the effect of repeatedly improving features; 由主干特征提取网络和SPP结构得到的三个有效特征层,使用多次卷积上采样和卷积下采样,有效实现特征提取,最后得到三个YOLOHead输出。The three effective feature layers obtained by the backbone feature extraction network and the SPP structure use multiple convolution upsampling and convolution downsampling to effectively implement feature extraction, and finally obtain three YOLOHead outputs. 3.根据权利要求1所述的基于YOLOv4与显著性检测的安全帽佩戴识别方法,其特征在于,步骤1中,通过以下三个部分计算损失函数值:3. the helmet wearing identification method based on YOLOv4 and saliency detection according to claim 1, is characterized in that, in step 1, calculates the loss function value by following three parts: 利用CIOU来计算的预测框体与真实框体位置的误差;与IOU对比,计算CIOU时,在考虑到两个框体间的交并面积比相同时,位置不同以及框体宽高比不同带来的误差,使结果更为准确;The error between the predicted frame and the real frame position calculated by CIOU; compared with IOU, when calculating CIOU, considering that the intersection area ratio between the two frames is the same, the position is different and the frame aspect ratio is different. errors, making the results more accurate; 目标置信度带来的误差;在正确检测到一个目标时,置信度得分越高,误差越小,反之则越大;Error caused by target confidence; when a target is correctly detected, the higher the confidence score, the smaller the error, and vice versa; 识别类别带来的误差;也就是预测的种类结果与实际结果的对比。Errors caused by identifying categories; that is, the comparison of the predicted category results with the actual results. 
4. The helmet wearing identification method based on YOLOv4 and saliency detection according to claim 1, characterized in that, in step 2, each public saliency detection data set contains thousands of images with corresponding saliency maps, which are used to train the static saliency detection network; training the dynamic saliency detection network additionally requires a data set of video frames adjacent to these images.

The pixel-level difference between adjacent frames is represented by the optical flow field V = (u, v), where u is the vertical component and v the horizontal component, and X(x, y) denotes the position of a point. The relation between adjacent frames I and I' can therefore be expressed by the following formula:
I(X) = I'(X + V)
Since the horizontal and vertical directions follow the same principle, take the vertical component u as an example. The pixels of the image are divided into foreground pixels f and background pixels b and processed separately. For the background, 10% of the background pixels are selected at random and given a motion value in the range [-d, d], where d = h/10 and h is the image height, so that part of the background jitters slightly, simulating the background noise of a real video. For the foreground, a main motion mode m is first assumed, whose value is the main direction and distance the foreground target moves between the two frames; each foreground pixel then takes a random value in the interval [m - d/10, m + d/10], producing small motion differences between pixels. In this way all foreground pixels share the same overall motion trend while the exact displacement of each pixel differs, which matches real footage. This generates, from the original image, a new image in which the foreground target has moved; combining the processed foreground and background pixels yields a multi-frame video data set with moving targets.
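The motion simulation of claim 4 could be sketched as below for the vertical flow component (the horizontal one is built the same way). The function name, the use of NumPy and any sampling detail beyond what the claim states (10% of background pixels, ranges [-d, d] and [m - d/10, m + d/10] with d = h/10) are assumptions.

```python
import numpy as np

def simulate_vertical_flow(fg_mask, m, rng=None):
    """Sketch of the vertical (u) component of the synthetic optical flow.

    fg_mask : boolean HxW array, True where the pixel belongs to the foreground.
    m       : assumed main vertical motion of the foreground between the two frames.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = fg_mask.shape
    d = h / 10.0
    u = np.zeros((h, w), dtype=np.float32)

    # Background: give roughly 10% of the background pixels a random offset
    # in [-d, d] to imitate background noise in a real video.
    bg = ~fg_mask
    noisy = bg & (rng.random((h, w)) < 0.10)
    u[noisy] = rng.uniform(-d, d, size=int(noisy.sum()))

    # Foreground: every pixel follows the main motion m, with a small
    # per-pixel variation drawn from [m - d/10, m + d/10].
    u[fg_mask] = rng.uniform(m - d / 10.0, m + d / 10.0, size=int(fg_mask.sum()))
    return u
```

Warping the original image with the resulting (u, v) field then gives the second, "moved" frame of the synthetic pair.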
5. The helmet wearing identification method based on YOLOv4 and saliency detection according to claim 4, characterized in that the overall computation of the static saliency network is:
Y = DS(FS(I; ΘF); ΘD)
where Y is the image output, I is the image input, FS denotes the features produced by the convolutional layers, DS denotes the deconvolution operation that restores the output Y to the same size as the input image I, and ΘF and ΘD are the convolution and deconvolution parameters.

Each deconvolution corresponds to a convolution in the first half of the network: after the feature matrix is expanded, it is multiplied by the transpose of the convolution kernel used in the corresponding convolution step, giving a feature map of twice the size.
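A minimal encoder-decoder sketch in the spirit of claim 5, assuming PyTorch; the layer count and channel widths are illustrative choices, since the claim fixes only the overall FS/DS structure. The convolutions play the role of FS and the mirrored transposed convolutions the role of DS, each of which doubles the spatial size until the input resolution is restored.

```python
import torch
import torch.nn as nn

class StaticSaliencyNet(nn.Module):
    """Illustrative static saliency encoder-decoder (assumes H and W divisible by 4)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(              # FS(I; ΘF): stride-2 convolutions
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(              # DS(.; ΘD): each layer doubles H and W
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, img):                        # img: N x 3 x H x W
        return self.decoder(self.encoder(img))     # saliency map: N x 1 x H x W
```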
6. The helmet wearing identification method based on YOLOv4 and saliency detection according to claim 5, characterized in that dynamic saliency estimation feeds the static saliency estimate, the original frame it was computed from, and the adjacent frame into the dynamic saliency detection network at the same time; the three are concatenated along the channel dimension to form an input of the format (A, B, C), which enters the first-layer convolution through the following formula:
F1 = W * (It, It+1, Pt) + b
where W is the convolution weight, b is the bias term, It and It+1 are two adjacent frames, and Pt is the static saliency map corresponding to It. The subsequent convolution and deconvolution operations of the dynamic saliency network follow the same structure as the static saliency network. By comparing the pixel-level optical flow of two adjacent frames, dynamic, highly salient targets are detected more reliably, improving the accuracy of saliency recognition.
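The channel concatenation and first-layer convolution of claim 6 might look as follows in PyTorch; the 3 + 3 + 1 channel layout, the 32 output channels and the ReLU activation are assumptions about details the claim leaves open.

```python
import torch
import torch.nn as nn

# First layer of the dynamic branch: two adjacent RGB frames and the static
# saliency map are concatenated along the channel axis and convolved,
# i.e. W * (It, It+1, Pt) + b.
first_conv = nn.Conv2d(in_channels=7, out_channels=32, kernel_size=3, padding=1)

def dynamic_first_layer(frame_t, frame_t1, static_sal):
    # frame_t, frame_t1: N x 3 x H x W; static_sal: N x 1 x H x W
    x = torch.cat([frame_t, frame_t1, static_sal], dim=1)   # N x 7 x H x W
    return torch.relu(first_conv(x))
```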
7. The helmet wearing identification method based on YOLOv4 and saliency detection according to claim 1, characterized in that, in step 3, YOLOv4 divides each input image into outputs at three scales, each scale corresponding to three prior (anchor) boxes, so nine prior boxes in total are used for detection.

The scale-one output has passed through the most convolution operations and is compressed the most, so it is suited to recognizing and detecting larger targets and uses the three largest prior boxes.

Scale two lies in the middle of the three output scales and is suited to medium-sized targets, using the three medium-sized prior boxes.

Scale three is the output with the fewest convolutions, so it uses the three smallest prior boxes and gives the best recognition of small targets in the image.

For target localization, YOLOv4 obtains the box position information with the following formulas:
bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw · e^tw
bh = ph · e^th
where (tx, ty, tw, th) is the model's predicted output, i.e. the learning target of the network; (cx, cy) is the coordinate offset of the grid cell, measured in cell side lengths; (pw, ph) is the side length of the preset anchor box; and (bx, by, bw, bh) are the centre coordinates, width and height of the final predicted bounding box.
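A small sketch of the box decoding described in claim 7, assuming the standard YOLOv3/v4 form in which the centre offsets pass through a sigmoid and the width and height scale the anchors exponentially; the function name is illustrative.

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode one prediction (tx, ty, tw, th) into a box (bx, by, bw, bh).
    (cx, cy) is the grid-cell offset, (pw, ph) the anchor (prior box) size."""
    bx = 1.0 / (1.0 + math.exp(-tx)) + cx   # sigmoid keeps the centre inside its cell
    by = 1.0 / (1.0 + math.exp(-ty)) + cy
    bw = pw * math.exp(tw)                  # anchors are scaled exponentially
    bh = ph * math.exp(th)
    return bx, by, bw, bh
```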
8. The helmet wearing identification method based on YOLOv4 and saliency detection according to claim 1, characterized in that, in step 4, to distinguish the background of the image from the moving targets effectively, the model is divided into a static model and a dynamic model; combined, the two capture the spatial and temporal information of the image at the same time and directly generate a pixel-level saliency map through a fully convolutional network.

The static saliency model takes a single frame as input and produces a pixel-level saliency estimate; the dynamic saliency model takes two adjacent video frames together with the static saliency map output by the static model, and generates the final, temporally informed dynamic saliency result.

9. The helmet wearing identification method based on YOLOv4 and saliency detection according to claim 1, characterized in that, in step 6, bright white regions of the saliency estimate are defined as valid targets and black regions as background. If white dominates a cropped patch, the detected target is regarded as valid and passes the recheck; if black dominates, the detected region is regarded as background and the detection is judged a false positive.
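The recheck of claims 1 and 9 can be sketched as a white-pixel ratio test on each cropped saliency patch. The saliency map is assumed to be normalised to [0, 1], and both thresholds are illustrative values, not taken from the patent.

```python
import numpy as np

def recheck(saliency_map, boxes, white_thresh=0.5, ratio_thresh=0.5):
    """Keep a detection only if bright (salient) pixels dominate its saliency crop.

    saliency_map : HxW float array in [0, 1]
    boxes        : iterable of (x1, y1, x2, y2) in pixel coordinates
    """
    kept = []
    for (x1, y1, x2, y2) in boxes:
        patch = saliency_map[y1:y2, x1:x2]
        if patch.size == 0:
            continue
        white_ratio = float((patch > white_thresh).mean())
        if white_ratio > ratio_thresh:      # mostly white -> valid target
            kept.append((x1, y1, x2, y2))
    return kept                             # detections that pass the recheck
```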
CN202110195098.0A 2021-02-22 2021-02-22 Helmet wearing identification method based on YOLOv4 and significance detection Pending CN112989958A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110195098.0A CN112989958A (en) 2021-02-22 2021-02-22 Helmet wearing identification method based on YOLOv4 and significance detection

Publications (1)

Publication Number Publication Date
CN112989958A true CN112989958A (en) 2021-06-18

Family

ID=76393783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110195098.0A Pending CN112989958A (en) 2021-02-22 2021-02-22 Helmet wearing identification method based on YOLOv4 and significance detection

Country Status (1)

Country Link
CN (1) CN112989958A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516069A (en) * 2021-07-08 2021-10-19 北京华创智芯科技有限公司 Road mark real-time detection method and device based on size robustness
CN113983737A (en) * 2021-10-18 2022-01-28 海信(山东)冰箱有限公司 Refrigerator and food material positioning method thereof
CN114022474A (en) * 2021-11-23 2022-02-08 浙江宁海抽水蓄能有限公司 Particle grading rapid detection method based on YOLO-V4

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8363939B1 (en) * 2006-10-06 2013-01-29 Hrl Laboratories, Llc Visual attention and segmentation system
CN109376676A (en) * 2018-11-01 2019-02-22 哈尔滨工业大学 Safety early warning method for construction personnel on highway engineering site based on UAV platform
CN109784183A (en) * 2018-12-17 2019-05-21 西北工业大学 Saliency object detection method based on concatenated convolutional network and light stream
CN112149459A (en) * 2019-06-27 2020-12-29 哈尔滨工业大学(深圳) Video salient object detection model and system based on cross attention mechanism
CN111027505A (en) * 2019-12-19 2020-04-17 吉林大学 Hierarchical multi-target tracking method based on significance detection
CN111292305A (en) * 2020-01-22 2020-06-16 重庆大学 Improved YOLO-V3 metal processing surface defect detection method
AU2020100371A4 (en) * 2020-03-12 2020-04-16 Jilin University Hierarchical multi-object tracking method based on saliency detection
AU2020100705A4 (en) * 2020-05-05 2020-06-18 Chang, Jiaying Miss A helmet detection method with lightweight backbone based on yolov3 network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
侯毅苇; 李林汉; 王彦: "Research on intelligent equipment target recognition using an improved YOLO network guided by infrared salient targets", Infrared Technology (红外技术), no. 07 *
魏龙生; 罗大鹏: "Salient target detection in remote sensing images based on a visual attention mechanism", Computer Engineering and Applications (计算机工程与应用), vol. 50, no. 19 *

Similar Documents

Publication Publication Date Title
CN110348319B (en) A face anti-counterfeiting method based on the fusion of face depth information and edge images
CN112861635B (en) Fire disaster and smoke real-time detection method based on deep learning
US8706663B2 (en) Detection of people in real world videos and images
US6611613B1 (en) Apparatus and method for detecting speaking person's eyes and face
US8577151B2 (en) Method, apparatus, and program for detecting object
CN102509104B (en) Confidence map-based method for distinguishing and detecting virtual object of augmented reality scene
Ma et al. Fusioncount: Efficient crowd counting via multiscale feature fusion
CN112001241B (en) Micro-expression recognition method and system based on channel attention mechanism
CN111241989A (en) Image recognition method and device and electronic equipment
CN112115775B (en) Smoke sucking behavior detection method based on computer vision under monitoring scene
CN111738054B (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
US20090245575A1 (en) Method, apparatus, and program storage medium for detecting object
CN112989958A (en) Helmet wearing identification method based on YOLOv4 and significance detection
CN103514432A (en) Method, device and computer program product for extracting facial features
CN112101195B (en) Crowd density estimation method, crowd density estimation device, computer equipment and storage medium
CN110929593A (en) A Real-time Saliency Pedestrian Detection Method Based on Detail Discrimination
CN115273234B (en) Crowd abnormal behavior detection method based on improved SSD
CN111401278A (en) Helmet identification method and device, electronic equipment and storage medium
CN105426895A (en) Prominence detection method based on Markov model
CN112200056A (en) Face living body detection method and device, electronic equipment and storage medium
CN112766145B (en) Method and device for identifying dynamic facial expressions of artificial neural network
US20090245576A1 (en) Method, apparatus, and program storage medium for detecting object
CN115937991A (en) Human body tumbling identification method and device, computer equipment and storage medium
CN112560584A (en) Face detection method and device, storage medium and terminal
CN114724190A (en) Mood recognition method based on pet posture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned (effective date of abandoning: 20240621)