
CN112989958A - Helmet wearing identification method based on YOLOv4 and significance detection - Google Patents

Helmet wearing identification method based on YOLOv4 and significance detection Download PDF

Info

Publication number
CN112989958A
CN112989958A · CN202110195098.0A
Authority
CN
China
Prior art keywords
saliency
target
detection
image
yolov4
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110195098.0A
Other languages
Chinese (zh)
Inventor
李岳阳
兰天
罗海驰
杜鹏
朱一昕
樊启高
毕恺韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute Of Technology Robot Group Wuxi Science And Technology Innovation Base Research Institute
Original Assignee
Harbin Institute Of Technology Robot Group Wuxi Science And Technology Innovation Base Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute Of Technology Robot Group Wuxi Science And Technology Innovation Base Research Institute filed Critical Harbin Institute Of Technology Robot Group Wuxi Science And Technology Innovation Base Research Institute
Priority to CN202110195098.0A priority Critical patent/CN112989958A/en
Publication of CN112989958A publication Critical patent/CN112989958A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/136Segmentation; Edge detection involving thresholding
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/30Noise filtering
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20112Image segmentation details
    • G06T2207/20132Image cropping
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract


The invention discloses a helmet wearing recognition method based on YOLOv4 and saliency detection, comprising the following steps: labeling an existing data set and training a YOLOv4 target detection model; downloading and expanding a saliency data set and training a saliency detection model; using the trained target detection model to obtain the target recognition results and target box information; using the trained saliency detection model to obtain the saliency estimate of the image; cropping the saliency estimate using the target box position information; and re-checking the single small pictures of all targets. In this way, the saliency detection method of the invention yields a dynamic saliency estimate of the image, which effectively removes the influence of the background and gives a pixel-level, highly salient result for moving targets. Re-checking the target detection results against the saliency detection results greatly reduces the probability of false detection, effectively distinguishes background items that interfere with target detection, and improves the target detection accuracy.


Description

Helmet wearing identification method based on YOLOv4 and significance detection
Technical Field
The invention relates to the field of machine vision and pattern recognition, and in particular to a helmet wearing recognition method based on YOLOv4 and saliency detection.
Background
The safety helmet is the most common and practical piece of personal protective equipment and can effectively prevent and reduce head injuries caused by external hazards. For a long time, construction workers in China have generally had limited overall training and weak safety awareness, and awareness of wearing basic protective equipment such as safety helmets has been lacking, which greatly increases operational risk and leads to occasional safety accidents. According to published network data, safety incidents caused by unsafe behavior of construction workers account for 95% of all incidents.
At present, to address the safety hazard of helmets not being worn, enterprises rely mainly on inspections by the relevant managers or on security personnel watching the surveillance video; this wastes manpower and material resources and is inefficient.
In recent years, artificial intelligence technology has matured, and successful applications of deep learning and computer vision, such as the widely known speech recognition, fingerprint recognition and face recognition, have become numerous. Such methods have the advantages of full automation, freedom from human interference and high accuracy, and can be applied in fields such as supervision and security. Once popularized, this technology will bring great change to society, free people from simple repetitive labor, and greatly improve social productivity.
Disclosure of Invention
The invention mainly solves the technical problem of providing a helmet wearing identification method based on YOLOv4 and saliency detection. The method addresses the influence of complex backgrounds that affects conventional target detection and saliency detection, and can reliably identify workers wearing helmets. Finally, the corresponding saliency estimate maps are cropped using the box positions from the target detection to obtain a saliency estimate map for each single target, and these maps are used to re-check the targets, ultimately improving the recognition accuracy.
In order to solve the above technical problem, the invention adopts the following technical scheme: a helmet wearing identification method based on YOLOv4 and saliency detection, comprising the following steps:
step 1: labeling an existing data set, and training a Yolov4 target detection model;
labeling the data with annotation software to obtain, for each picture, a file recording the target positions, sizes and labels, and then dividing the data into two parts, a training set and a test set;
adopting a YOLOv4 network as the target detection model; during training, iteratively minimizing the loss function value of the detection model based on the target detection output, and obtaining the trained target detection model after a predetermined number of iterations;
step 2: downloading and expanding a saliency data set, and training a saliency detection model;
the training set used to train the saliency detection model comes from saliency detection data sets published on the Internet; the existing saliency detection data sets are expanded to obtain a usable video training data set, and a saliency detection model with higher accuracy is trained;
step 3: obtaining the recognition result of the target and the target box information by using the trained target detection model;
using the YOLOv4 target detection model trained in step 1, feeding a picture taken by a camera, or a single frame taken from a video, into the target detection model to obtain the target detection output;
step 4: obtaining the saliency estimate of the image by using the trained saliency detection model;
performing the saliency detection task with the saliency model trained in step 2: each frame of the surveillance video is input into the neural network, which then outputs a saliency map; the static saliency model takes a single frame as input and generates a pixel-level saliency estimate; the input of the dynamic saliency model comprises two adjacent video frames and the static saliency map output by the static saliency model, and the final time-sequential dynamic saliency result is generated;
step 5: cropping the saliency estimate by using the position information of the target boxes;
because the original image, the result image of the target detection model and the estimate image produced by the saliency detection model all have the same size, the target box positions in the target detection result correspond to the target positions in the saliency estimate image, so the target positions can be marked in the saliency estimate image directly from the output obtained at the end of step 3;
step 6: rechecking single small pictures of all targets;
using the saliency estimate map of each target obtained in step 5, judging whether a target is present in the cropped single saliency map.
In a preferred embodiment of the present invention, in step 1, the YOLOv4 network mainly includes a backbone feature extraction network and an enhanced feature extraction network;
the backbone feature extraction network adopts the CSPDarkNet53 architecture and takes a 3-channel picture as input; to keep the input consistent, the original picture is scaled proportionally, and to keep the picture from being distorted, the aspect ratio is not changed when the short side is adjusted; instead, gray areas are padded above and below, or to the left and right of, the short side; in the backbone feature extraction network, residual blocks improved by CSPNet are used repeatedly for convolution, and the three final feature extraction results are the input of the subsequent enhanced feature extraction network;
the enhanced feature extraction network comprises an SPP structure and a PANet structure; in the SPP structure, the last feature layer of CSPDarknet53 is convolved three times and then processed with max pooling at four different scales, which greatly enlarges the receptive field and separates out the most salient contextual features; PANet is an instance segmentation algorithm whose structure repeatedly enhances the features;
the three effective feature layers obtained from the backbone feature extraction network and the SPP structure undergo repeated convolution with up-sampling and down-sampling to extract features effectively, finally giving three YOLO Head outputs.
In a preferred embodiment of the present invention, in step 1, the loss function value is calculated by the following three parts:
1) the error between the predicted box position and the ground-truth box position, calculated with CIoU; compared with IoU, CIoU also accounts for the errors caused by different positions and different box aspect ratios when the intersection-over-union of the two boxes is the same, making the result more accurate;
2) errors due to target confidence; when a target is correctly detected, the higher the confidence score is, the smaller the error is, otherwise, the larger the error is;
3) the error caused by class identification, i.e. the comparison of the predicted class with the actual class.
In a preferred embodiment of the present invention, each public saliency detection data set in step 2 includes thousands of pictures and corresponding saliency maps, which are used to train the static saliency detection network; training the dynamic saliency detection network additionally requires video frames adjacent to these pictures;
the pixel-level difference information between adjacent frames is represented by an optical flow field V = (u, v), where u is the vertical component and v the horizontal component, and X(x, y) denotes the position of a point; the difference relationship between adjacent frames I and I' can therefore be expressed by the following formula:
I'(X + V) = I(X), i.e. I'(x + u, y + v) = I(x, y)
taking the vertical direction u as an example (the principle in the horizontal direction is identical), the pixels of a picture are divided into foreground pixels f and background pixels b, which are processed separately; 10% of the background pixels b are given randomly initialized motion values in the range [-d, d], where d = h/10 and h is the picture height, so that some background pixels jitter slightly to simulate background noise in a real video; for the foreground pixels f, a main foreground motion mode m is first assumed, the value of m being the main motion direction and distance of the foreground target between the two frames, and the value of each pixel is then drawn randomly from the interval [m - d/10, m + d/10] to produce motion differences between pixels, so that all foreground pixels share the same overall movement trend while the exact displacement of each pixel differs, which matches reality; in this way a new picture is generated in which the foreground target has moved relative to the original picture; after the foreground and background pixels are processed, combining them generates a multi-frame video data set with a moving target.
In a preferred embodiment of the present invention, the overall operation formula of the static saliency network is:
Y = D_S(F_S(I; Θ_F); Θ_D)
where Y is the image output, I is the image input, F_S is the feature output produced by the convolutional layers, and D_S is the deconvolution operation, which ensures that the output Y has the same size as the input image I; Θ_F and Θ_D denote the parameters used during convolution and deconvolution;
each deconvolution corresponds to a convolution in the first half of the network: after expansion, the feature matrix is multiplied by the transpose of the convolution kernel used in the corresponding convolution step, producing a feature map whose size is doubled.
In a preferred embodiment of the present invention, the dynamic saliency estimate is obtained by inputting the static saliency estimate map, the original frame and the frame adjacent to it into a dynamic saliency detection network; the three are concatenated along the channel dimension to obtain an input of format (A, B, C), which enters the convolution operation of the first layer according to the following formula:
F = W ∗ [I_t, I_{t+1}, P_t] + b
where ∗ denotes convolution and [·] channel-wise concatenation, W is the convolution weight and b the bias term; I_t and I_{t+1} are two adjacent frames, and P_t is the static saliency image temporally corresponding to I_t; the convolution and deconvolution operations of the subsequent dynamic saliency network are consistent with the structure of the static saliency network; by comparing the pixel-level optical flow of two adjacent frames, dynamic highly salient targets can be detected better and the saliency recognition accuracy is improved.
In a preferred embodiment of the present invention, in step 3 YOLOv4 produces outputs at three scales for each input image; each scale corresponds to three prior boxes, so the three outputs use nine prior boxes in total for detection;
the output of the first scale has undergone the most convolution operations and the strongest compression, so it is suited to recognizing and detecting large targets and corresponds to the three largest prior boxes;
the second scale lies in the middle of the three outputs, is suited to medium-sized targets, and uses the three medium-sized prior boxes;
the third scale is the output with the fewest convolutions, so the three smallest prior boxes are used, giving a better recognition effect for small targets in the picture;
for locating the recognized targets, YOLOv4 obtains the box position information with the following formulas:
b_x = σ(t_x) + c_x,  b_y = σ(t_y) + c_y,  b_w = p_w · exp(t_w),  b_h = p_h · exp(t_h)
where (t_x, t_y, t_w, t_h) is the prediction output of the model, i.e. the network's learning target; (c_x, c_y) is the coordinate offset of the cell, in units of the cell side length; (p_w, p_h) are the preset side lengths of the anchor box; (b_x, b_y, b_w, b_h) are the centre coordinates and the width and height of the finally obtained predicted bounding box; and σ denotes the sigmoid function.
In a preferred embodiment of the present invention, in step 4, in order to effectively distinguish the background part and the moving target part in the image, the model is divided into a static model and a dynamic model, which are combined with each other to capture the space and time information of the image at the same time, and a pixel-level saliency map is directly generated through a full convolution network;
the static saliency model takes a single-frame image as input and generates pixel-level saliency estimation; the input of the dynamic significance model comprises two adjacent video frames in the video and a static significance map output by the static significance model, and a final dynamic significance result according to a time sequence is generated.
In a preferred embodiment of the present invention, in step 6 highlighted white regions in the saliency estimate are regarded as valid targets and black regions as background; if the white proportion is large, the recognized target is a valid target and passes the re-check; if the black proportion is large, the recognized target is background and the recognition result is judged to be a false detection.
The invention has the beneficial effects that: according to the method, a target detection model is established by using YOLOv4, then a saliency detection model is used for rechecking a YOLOv4 target detection result, and a Convolutional Neural Network (CNN) is adopted for both models, so that the method has good robustness and accuracy.
Compared with methods that perform the target detection task with YOLOv4 or similar detectors alone, the method of the invention also uses saliency detection to obtain a dynamic saliency estimate of the image, which effectively removes the influence of the background and yields a pixel-level, highly salient result for moving targets. Re-checking the target detection results against the saliency detection results greatly reduces the probability of false detection, effectively distinguishes background items that interfere with target detection, and improves the target detection accuracy.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without inventive efforts, wherein:
FIG. 1 is a flow chart of the detection of the present invention;
FIG. 2 is a network structure of a Yolov4 target detection model;
FIG. 3 is a static saliency model network structure used by the present invention;
FIG. 4 is a combined structure of a static saliency module and a dynamic saliency module of the present invention;
FIG. 5 is an example of the recognition result of the present invention;
FIG. 6 is a diagram illustrating an exemplary review operation performed on a false positive test picture according to the present invention;
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present invention, it should be noted that the terms "front" and "back" and the like indicate orientations and positional relationships based on orientations and positional relationships shown in the drawings or orientations and positional relationships where the products of the present invention are conventionally placed in use, and are used for convenience in describing the present invention and simplifying the description, but do not indicate or imply that the devices or elements to be referred must have a specific orientation, be constructed in a specific orientation, and be operated, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like are used merely to distinguish one description from another, and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless otherwise explicitly specified or limited, the terms "disposed" and "connected" are to be interpreted broadly, e.g., as being either fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
In the present invention, unless otherwise expressly stated or limited, the first feature may be present on or under the second feature in direct contact with the first and second feature, or may be present in the first and second feature not in direct contact but in contact with another feature between them. Also, the first feature being above, on or above the second feature includes the first feature being directly above and obliquely above the second feature, or merely means that the first feature is at a higher level than the second feature. A first feature that underlies, and underlies a second feature includes a first feature that is directly under and obliquely under a second feature, or simply means that the first feature is at a lesser level than the second feature.
The embodiment of the invention comprises the following steps:
A helmet wearing identification method based on YOLOv4 and saliency detection comprises the following steps:
step 1: labeling the existing data set, and training a Yolov4 target detection model.
Annotation software is used to label the data; such software focuses on object-detection labeling tasks, allows the box and label of a single picture to be marked quickly and conveniently, and is well suited to the target detection task. After labeling, an xml file recording the target positions, sizes and labels of each picture is obtained, which completes the preparation of the data required for training. These data are then divided into two parts, a training set and a test set.
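By way of illustration, the following minimal sketch parses one such xml annotation file and splits the data set; it assumes a Pascal VOC-style layout of the kind produced by common annotation tools (tag names such as "object" and "bndbox", the 85/15 split and the helper names are illustrative assumptions, not details taken from the invention):

    import random
    import xml.etree.ElementTree as ET
    from pathlib import Path

    def parse_annotation(xml_path):
        # Return (filename, [(label, xmin, ymin, xmax, ymax), ...]) from one annotation file.
        root = ET.parse(xml_path).getroot()
        filename = root.findtext("filename")
        boxes = []
        for obj in root.iter("object"):
            label = obj.findtext("name")                  # e.g. "hat" or "person"
            bb = obj.find("bndbox")
            boxes.append((label,
                          int(bb.findtext("xmin")), int(bb.findtext("ymin")),
                          int(bb.findtext("xmax")), int(bb.findtext("ymax"))))
        return filename, boxes

    def split_dataset(xml_dir, train_ratio=0.85, seed=0):
        # Shuffle all annotation files and split them into a training list and a test list.
        files = sorted(Path(xml_dir).glob("*.xml"))
        random.Random(seed).shuffle(files)
        cut = int(len(files) * train_ratio)
        return files[:cut], files[cut:]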
In step 1, a YOLOv4 network is used as the target detection model, as shown in fig. 2; it mainly comprises a backbone feature extraction network and an enhanced feature extraction network. The backbone feature extraction network adopts the CSPDarkNet53 architecture and takes 416 × 416 3-channel pictures as input; to keep the input consistent, the original picture is scaled proportionally so that its long side becomes 416. Then, to keep the picture from being distorted, the aspect ratio is not changed when the short side is adjusted; instead, gray areas are padded above and below, or to the left and right of, the short side so that the whole picture reaches the 416 × 416 input size. In the backbone network, residual blocks improved by CSPNet are used repeatedly for convolution; the greatest strength of a residual network is that it is easy to optimize while allowing additional depth to improve accuracy, and the three final feature extraction results are the input of the subsequent enhanced feature extraction network.
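A minimal sketch of this letterbox-style resizing is given below; the use of OpenCV arrays and the gray value 128 are illustrative assumptions rather than details fixed by the invention:

    import cv2
    import numpy as np

    def letterbox(image, target=416, pad_value=128):
        # Scale the long side to `target`, keep the aspect ratio, pad the short side gray.
        h, w = image.shape[:2]
        scale = target / max(h, w)
        new_w, new_h = int(round(w * scale)), int(round(h * scale))
        resized = cv2.resize(image, (new_w, new_h))
        canvas = np.full((target, target, 3), pad_value, dtype=np.uint8)
        top, left = (target - new_h) // 2, (target - new_w) // 2
        canvas[top:top + new_h, left:left + new_w] = resized
        return canvas, scale, (left, top)   # scale and offsets are needed to map boxes back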
The enhanced feature extraction network comprises an SPP structure and a PANet structure. In the SPP structure, the last feature layer of CSPDarknet53 is convolved three times and then processed with max pooling at four different scales, with kernel sizes of 13 × 13, 9 × 9, 5 × 5 and 1 × 1 (1 × 1 means no processing); this greatly enlarges the receptive field and separates out the most salient contextual features. PANet is an instance segmentation algorithm whose structure repeatedly enhances the features. The three effective feature layers obtained from the backbone feature extraction network and the SPP structure undergo repeated convolution with up-sampling and down-sampling to extract features effectively, finally giving three YOLO Head outputs. This multi-scale detection approach effectively improves the detection accuracy.
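The SPP idea described above may be sketched as follows; PyTorch and the stride-1, "same"-padding realisation of the 13/9/5 max-pooling kernels are assumptions for illustration, not the exact implementation of the invention:

    import torch
    import torch.nn as nn

    class SPP(nn.Module):
        def __init__(self, kernel_sizes=(13, 9, 5)):
            super().__init__()
            # stride 1 with "same" padding keeps the spatial size, so outputs can be concatenated
            self.pools = nn.ModuleList(
                nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernel_sizes)

        def forward(self, x):
            # the unpooled input plays the role of the 1x1 (no-processing) branch
            return torch.cat([p(x) for p in self.pools] + [x], dim=1)

    # e.g. a (1, 512, 13, 13) feature map becomes (1, 2048, 13, 13)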
When YOLOv4 is trained, the loss function value of the detection model is iteratively minimized based on the three YOLO Head outputs, and the trained target detection model is obtained once the predetermined number of iterations is reached. The loss function value is calculated from three parts:
1) The error between the predicted box and the ground-truth box position, calculated with CIoU. Compared with IoU, CIoU also accounts for the errors caused by different positions and different box aspect ratios when the intersection-over-union of the two boxes is the same, making the result more accurate (a sketch of the CIoU computation follows this list).
2) Target confidence. When an object is correctly detected, the higher the confidence score, the smaller the error and vice versa.
3) The error caused by class identification, i.e. the comparison of the predicted class with the actual class.
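The CIoU term mentioned in item 1) can be illustrated with the following sketch, which follows the standard CIoU definition for boxes given as (centre x, centre y, width, height) and is not code taken from the invention:

    import math

    def ciou(box_a, box_b, eps=1e-9):
        # boxes are (centre_x, centre_y, width, height)
        ax, ay, aw, ah = box_a
        bx, by, bw, bh = box_b
        ax1, ay1, ax2, ay2 = ax - aw / 2, ay - ah / 2, ax + aw / 2, ay + ah / 2
        bx1, by1, bx2, by2 = bx - bw / 2, by - bh / 2, bx + bw / 2, by + bh / 2
        # ordinary IoU
        iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
        ih = max(0.0, min(ay2, by2) - max(ay1, by1))
        inter = iw * ih
        union = aw * ah + bw * bh - inter + eps
        iou = inter / union
        # squared centre distance over squared diagonal of the smallest enclosing box
        cw, ch = max(ax2, bx2) - min(ax1, bx1), max(ay2, by2) - min(ay1, by1)
        rho2 = (ax - bx) ** 2 + (ay - by) ** 2
        c2 = cw ** 2 + ch ** 2 + eps
        # aspect-ratio consistency term
        v = 4 / math.pi ** 2 * (math.atan(bw / (bh + eps)) - math.atan(aw / (ah + eps))) ** 2
        alpha = v / (1 - iou + v + eps)
        return iou - rho2 / c2 - alpha * v   # the position loss then uses 1 - CIoU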
Step 2: downloading and expanding a saliency data set, and training a saliency detection model.
The training set used to train the saliency detection model comes from saliency detection data sets published on the Internet, including ECSSD, HKU-IS, THUR15K, PASCAL-S, DUTS and the like. These data sets contain original images together with corresponding pictures annotated with the ground-truth salient target, with backgrounds ranging from simple solid colors to complex and varied scenes. Based on these existing public data sets, a saliency detection model with higher accuracy can be trained.
Each public data set includes thousands of pictures and corresponding saliency maps, which are sufficient to train the static saliency detection network well. Training the dynamic saliency detection network additionally requires video frames adjacent to these pictures, so in the invention the existing saliency picture data sets are expanded to obtain a usable video training data set.
For different pictures, the present invention represents the pixel-level difference information between adjacent frames by an optical flow field V = (u, v), where u is the vertical component and v the horizontal component, and X(x, y) denotes the position of a point. The difference relationship between adjacent frames I and I' can therefore be expressed by the following formula:
I'(X + V) = I(X), i.e. I'(x + u, y + v) = I(x, y)
Taking the vertical direction u as an example (the principle in the horizontal direction is identical), the pixels of a picture are divided into foreground pixels f and background pixels b, which are processed separately. 10% of the background pixels b are given randomly initialized motion values in the range [-d, d], where d = h/10 and h is the picture height, so that some background pixels jitter slightly to simulate background noise in a real video. For the foreground pixels f, a main foreground motion mode m is first assumed, the value of m being the main motion direction and distance of the foreground target between the two frames; the value of each pixel is then drawn randomly from the interval [m - d/10, m + d/10] to produce motion differences between pixels, so that all foreground pixels share the same overall movement trend while the exact displacement of each pixel differs, which matches reality. In this way a new picture is generated in which the foreground target has moved relative to the original picture. After the foreground and background pixels are processed, combining them generates a multi-frame video data set with a moving target; a rough sketch of this augmentation follows.
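A rough sketch of this augmentation, under the assumptions that the image is an H × W × 3 array, that the binary saliency ground truth serves as the foreground mask, and that a remap-based warp is one acceptable realisation (not the invention's exact code), is:

    import cv2
    import numpy as np

    def synthesize_next_frame(image, mask, m=8, seed=0):
        # `image` is an H x W x 3 array, `mask` the binary saliency ground truth (foreground = 1).
        rng = np.random.default_rng(seed)
        h, w = mask.shape
        d = h // 10
        dy = np.zeros((h, w), dtype=np.float32)

        fg = mask > 0
        # foreground: common trend m plus a per-pixel variation drawn from [m - d/10, m + d/10]
        dy[fg] = rng.uniform(m - d / 10, m + d / 10, size=int(fg.sum()))
        # background: 10% of the pixels get a random jitter in [-d, d] to imitate video noise
        bg_idx = np.flatnonzero(~fg)
        chosen = rng.choice(bg_idx, size=len(bg_idx) // 10, replace=False)
        dy.ravel()[chosen] = rng.uniform(-d, d, size=len(chosen))

        # warp the image with the vertical displacement field (the horizontal direction is analogous)
        xs, ys = np.meshgrid(np.arange(w, dtype=np.float32), np.arange(h, dtype=np.float32))
        return cv2.remap(image, xs, ys - dy, interpolation=cv2.INTER_LINEAR,
                         borderMode=cv2.BORDER_REPLICATE)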
In the saliency detection model, a picture or video frame is input into a neural network, which outputs a saliency map, where lighter pixels represent objects with high saliency values and darker pixels represent the background.
As shown in fig. 3, the network structure of the static saliency model is mainly divided into two parts, and the convolution of the left half part is used for extracting picture features; and the right half part is deconvoluted, so that the size of the network output is the same as that of the input.
In the saliency detection network, the input picture is scaled to (224, 224, 3) and then passed through 3 × 3 convolutions 13 times, finally yielding a feature output of format (14, 14, 512); the corresponding deconvolutions then return the feature map to the original size. Compared with convolutions with a large kernel, using repeated convolutions with a small kernel increases the network depth without changing the receptive field, thereby improving the learning accuracy of the network. The overall operation of the static saliency network is:
Y = D_S(F_S(I; Θ_F); Θ_D)
where Y is the image output, I is the image input, F_S is the feature output produced by the convolutional layers, and D_S is the deconvolution operation, which ensures that the output Y has the same size as the input image I; Θ_F and Θ_D denote the parameters used during convolution and deconvolution.
Each deconvolution corresponds to a convolution in the first half of the network: after expansion, the feature matrix is multiplied by the transpose of the convolution kernel used in the corresponding convolution step, giving a feature map whose size is doubled, and finally an output whose height and width are 224 is obtained.
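A toy encoder-decoder sketch of this idea follows; PyTorch, the exact layer counts and the channel widths are assumptions, and the sketch only illustrates the 224 → 14 → 224 shape flow rather than the exact network of the invention:

    import torch
    import torch.nn as nn

    class StaticSaliencyNet(nn.Module):
        def __init__(self):
            super().__init__()
            chans = [3, 64, 128, 256, 512]
            enc = []
            for cin, cout in zip(chans[:-1], chans[1:]):
                enc += [nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
                        nn.MaxPool2d(2)]                   # halve the resolution each stage
            self.encoder = nn.Sequential(*enc)             # (B, 3, 224, 224) -> (B, 512, 14, 14)

            dec = []
            for cin, cout in zip(chans[:0:-1], chans[-2::-1]):
                dec += [nn.ConvTranspose2d(cin, cout, 2, stride=2), nn.ReLU(inplace=True)]
            self.decoder = nn.Sequential(*dec,             # (B, 512, 14, 14) -> (B, 3, 224, 224)
                                         nn.Conv2d(3, 1, 1), nn.Sigmoid())   # 1-channel saliency map

        def forward(self, x):
            return self.decoder(self.encoder(x))

    # smoke test: a random 224 x 224 RGB frame in, a 224 x 224 saliency map out
    print(StaticSaliencyNet()(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 1, 224, 224])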
To obtain the dynamic saliency estimate, as shown in fig. 4, the static saliency estimate map obtained above, the original frame and the frame adjacent to it are input simultaneously to the dynamic saliency detection network; the three are concatenated along the channel dimension to obtain an input of format (224, 224, 7), which enters the convolution operation of the first layer according to the following formula:
F = W ∗ [I_t, I_{t+1}, P_t] + b
where ∗ denotes convolution and [·] channel-wise concatenation, W is the convolution weight and b the bias term; I_t and I_{t+1} are two adjacent frames, and P_t is the static saliency image corresponding to I_t. The subsequent convolution and deconvolution operations of the dynamic saliency network are consistent with the structure of the static saliency network.
Through the optical flow comparison of the pixel levels of two adjacent frames, a dynamic high-significance target can be better detected, and the significance recognition accuracy is improved.
The output of the static saliency network is used as part of the input of the dynamic saliency network, and the spatio-temporal saliency result of the picture is generated directly; the dynamic and static saliency are fused and explicitly embedded into the dynamic saliency network, rather than training spatio-temporal features with a two-stream network, which reduces repeated computation and redundant network parameters. The model uses the optical-flow relation to infer the temporal information of two adjacent video frames directly, instead of the traditional approach of mainly comparing the color differences of individual pixels, thereby achieving higher computational efficiency and precision.
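How the 7-channel input of the dynamic branch can be assembled is sketched below; PyTorch tensors are an assumption, and the body of the dynamic network is taken to mirror the static sketch given earlier:

    import torch
    import torch.nn as nn

    def dynamic_input(frame_t, frame_t1, static_map):
        # concatenate two 3-channel frames and the 1-channel static saliency map -> 7 channels
        return torch.cat([frame_t, frame_t1, static_map], dim=1)   # (B, 3 + 3 + 1, 224, 224)

    first_layer = nn.Conv2d(7, 64, kernel_size=3, padding=1)        # the "W * input + b" of the formula
    x = dynamic_input(torch.randn(1, 3, 224, 224),
                      torch.randn(1, 3, 224, 224),
                      torch.randn(1, 1, 224, 224))
    print(first_layer(x).shape)   # torch.Size([1, 64, 224, 224])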
Step 3: obtaining the recognition result of the target and the target box information by using the trained target detection model.
The trained YOLOv4 target detection model detects the test data, and the output image can show three kinds of results: first, a worker correctly wearing a safety helmet, in which case a box with the label "hat" is drawn on the worker's head; second, a worker not wearing a safety helmet, in which case a box with the label "person" is drawn on the worker's head; and third, a safety helmet that is not being worn on a worker's head, which is a false-detection target; in general the trained YOLOv4 does not detect and mark such a target, and if it is marked with a "hat" label it can be screened out in the subsequent re-check operation of the invention.
During detection, YOLOv4 mainly uses a multi-scale detection method, which effectively handles the fact that targets at different distances from the camera appear at different sizes in the image.
YOLOv4 produces outputs at three scales for each input image; each scale corresponds to three prior boxes, so the three outputs use nine prior boxes in total for detection. The first scale is 13 × 13; its output has undergone the most convolution operations and the strongest compression, so it is suited to recognizing and detecting large targets and corresponds to the three largest prior boxes.
The second scale is 26 × 26; it lies in the middle of the three outputs, is suited to medium-sized targets, and likewise uses the three medium-sized prior boxes.
The third scale is 52 × 52, the output with the fewest convolutions, so the three smallest prior boxes are used, giving a better recognition effect for small targets in the picture. YOLOv4 detects the fused feature maps of the several scales independently and finally achieves good recognition accuracy for targets of different sizes; an illustrative grouping of the nine prior boxes is sketched below.
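One illustrative grouping of the nine prior boxes over the three output scales is shown below; the widths and heights are the COCO defaults commonly shipped with YOLOv4, not values specified by the invention:

    # widths/heights are the COCO defaults commonly shipped with YOLOv4, not values from the patent
    ANCHORS = [(12, 16), (19, 36), (40, 28),        # 52 x 52 head: small targets
               (36, 75), (76, 55), (72, 146),       # 26 x 26 head: medium targets
               (142, 110), (192, 243), (459, 401)]  # 13 x 13 head: large targets

    SCALES = {13: ANCHORS[6:], 26: ANCHORS[3:6], 52: ANCHORS[:3]}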
For locating the recognized targets, YOLOv4 obtains the box position information with the following formulas:
b_x = σ(t_x) + c_x,  b_y = σ(t_y) + c_y,  b_w = p_w · exp(t_w),  b_h = p_h · exp(t_h)
where (t_x, t_y, t_w, t_h) is the prediction output of the model, i.e. the network's learning target; (c_x, c_y) is the coordinate offset of the cell, in units of the cell side length; (p_w, p_h) are the preset side lengths of the anchor box; (b_x, b_y, b_w, b_h) are the centre coordinates and the width and height of the finally obtained predicted bounding box; and σ denotes the sigmoid function.
In step 3, using the trained helmet detection model obtained in step 1, a picture taken by the camera, or a single frame taken from its video, is fed into the target detection model to obtain the "hat" and "person" detection outputs, from which it can be seen directly whether the photographed workers are wearing their helmets correctly.
To facilitate the subsequent cropping in step 5, all detected targets and the position information of each target, namely left, top, right and bottom, are marked on the original image. These four values differ from the model's direct outputs (b_x, b_y, b_w, b_h): (left, top) is the pixel coordinate of the upper-left corner of the box and (right, bottom) is the pixel coordinate of the lower-right corner, and they are obtained by a simple conversion, as sketched below.
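Both coordinate steps, decoding (t_x, t_y, t_w, t_h) into a centre/size box and converting it to (left, top, right, bottom), can be sketched as follows (the helper names are illustrative):

    import math

    def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
        # the formulas above: centre from the sigmoid-squashed offsets, size from the scaled anchor
        sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
        bx = sigmoid(tx) + cx
        by = sigmoid(ty) + cy
        bw = pw * math.exp(tw)
        bh = ph * math.exp(th)
        return bx, by, bw, bh

    def to_corners(bx, by, bw, bh):
        # (centre, size) -> (left, top, right, bottom), the form used for cropping in step 5
        return bx - bw / 2, by - bh / 2, bx + bw / 2, by + bh / 2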
Step 4: obtaining the saliency estimate of the image by using the trained saliency detection model.
In step 4, the saliency detection task is performed with the trained saliency model obtained in step 2: each frame of the surveillance video is input into the neural network, which outputs a saliency map in which bright white marks highly salient regions and black marks the non-salient background.
As shown in fig. 4, in order to effectively distinguish a background portion and a moving target portion in an image, a model is divided into a static model and a dynamic model, and the static model and the dynamic model are combined with each other to capture space and time information of the image at the same time, so that a moving high-saliency target in the image is effectively identified, and a pixel-level saliency map is directly generated through a full convolution network.
The static saliency model takes a single frame image as input and generates pixel-level saliency estimation. The input of the dynamic significance model comprises two adjacent video frames in the video and a static significance map output by the static significance model, and a final dynamic significance result according to a time sequence is generated.
Compared with the traditional approach of comparing the color differences between pixels, the saliency estimate picture obtained in this way greatly improves the recognition accuracy for moving targets. By comparing adjacent frames, the background can be effectively judged to have low saliency, which greatly improves the accuracy of the recognition result.
To facilitate the subsequent step 5 of cropping the image, the output of the dynamic saliency model is scaled to the original image size.
Step 5: cropping the saliency estimate by using the position information of the target boxes.
Since the original image, the result image of the target detection model and the estimate image produced by the saliency detection model all have the same size, the box positions in the target detection result correspond to the target positions in the saliency estimate image, and the target positions can be marked in the saliency estimate image directly using the box coordinates left, top, right and bottom obtained at the end of step 3.
As shown in fig. 5, the detection results of the present invention include (a) the original input image used for testing and (b) the result image detected by the YOLOv4 target detection model: after the original image is input into the trained YOLOv4 helmet detection model, workers correctly wearing helmets are marked "hat" and workers without helmets are marked "person", and the box positions of these targets are saved. (c) is the static saliency picture and (d) the dynamic saliency estimate map. (e) shows how, using the coordinate information obtained in (b) to find the corresponding positions in the saliency detection result (d), all targets are cut out separately by a cropping operation, giving a small picture of each single target that is stored locally to facilitate the subsequent re-check; a sketch of this cropping step follows.
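A minimal sketch of this cropping step is given below, assuming the saliency map is a grayscale array already resized to the original image size and that the boxes come from the detector output in (label, left, top, right, bottom) form:

    import os
    import cv2

    def crop_targets(saliency, boxes, out_dir="crops"):
        # cut one small saliency picture per detected box, given (label, left, top, right, bottom)
        os.makedirs(out_dir, exist_ok=True)
        crops = []
        h, w = saliency.shape[:2]
        for i, (label, left, top, right, bottom) in enumerate(boxes):
            left, top = max(0, int(left)), max(0, int(top))
            right, bottom = min(w, int(right)), min(h, int(bottom))
            crop = saliency[top:bottom, left:right]
            cv2.imwrite(f"{out_dir}/{i}_{label}.png", crop)
            crops.append((label, crop))
        return crops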
Step 6: re-checking the single small pictures of all targets.
In step 6, the re-check operation uses the saliency estimate map of each target obtained in step 5. Because bright white in the saliency estimate marks a valid target and black marks the background, the re-check mainly judges whether a target is present in each cropped single saliency image. If the white proportion is large, the recognized target is a valid target and passes the re-check; if the black proportion is large, the recognized target is background and the recognition result is judged to be a false detection. Specifically, based on the experimental results, each pixel of a saliency estimate map whose gray value lies in [0, 10] is counted as a black background pixel, and every other pixel as a target pixel; after all pixels have been processed, the proportion of black background pixels in the picture is computed.
As shown in fig. 6, (a) is the original image, (b) the YOLOv4 detection result image, (c) the dynamic saliency estimate image, and (d) the cropping result at the corresponding positions of the 3 targets in the dynamic saliency image. After the black percentage of each single saliency target is obtained, it is compared with a set threshold, which in the invention can be set to 70%. If the black proportion in the saliency estimate of a single target is more than 70%, the target is judged to be background, its re-check fails, and the target is judged to be a false detection. If the black proportion is less than 70%, the target is judged to be a real target that needs to be detected, and it passes the re-check. Therefore, in fig. 6, two of the 3 targets pass the re-check (the worker wearing a helmet and the worker without one) and one is rejected (the helmet placed on the table); a sketch of this re-check rule follows.
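A sketch of this re-check rule, assuming 8-bit grayscale crops, is:

    import numpy as np

    def recheck(crop, black_threshold=10, reject_ratio=0.70):
        # gray values in [0, black_threshold] count as background; reject the detection
        # as false when more than `reject_ratio` of the cropped patch is background
        background = crop.astype(np.uint8) <= black_threshold
        return background.mean() <= reject_ratio

    # usage with the crops produced in step 5:
    # kept = [(label, c) for label, c in crops if recheck(c)]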
Through the re-inspection operation after the target detection, the false detection rate in the target detection can be greatly reduced, and the overall identification accuracy is improved.
In the experiments, 3703 pictures and one piece of video data were used. The pictures were divided into a training set (3203 pictures) and a test set (450 pictures). A YOLOv4 target detection model was trained on the training set, and the target detection accuracy during testing exceeded 96%.
For the video data, the YOLOv4 target detection result is obtained first during testing and is then re-checked with the saliency detection model; the experimental results show that the method detects targets effectively and that the re-check effectively screens out falsely detected targets.
When the method is used, target detection can be performed by the neural network on the video data returned by the surveillance camera, after which the original image is input into the saliency detection model and passed through the static and dynamic saliency estimation models in turn to obtain the saliency estimate of the targets. This recognition method overcomes the influence of complex backgrounds that affects conventional target detection and saliency detection, and can reliably identify workers wearing helmets. Finally, the corresponding saliency estimate maps are cropped using the box positions from the target detection to obtain a saliency estimate map for each single target, and these maps are used to re-check the targets, ultimately improving the recognition accuracy.
According to the method, a target detection model is established by using YOLOv4, then a saliency detection model is used for rechecking a YOLOv4 target detection result, and a Convolutional Neural Network (CNN) is adopted for both models, so that the method has good robustness and accuracy.
Compared with methods that perform the target detection task with YOLOv4 or similar detectors alone, the method of the invention also uses saliency detection to obtain a dynamic saliency estimate of the image, which effectively removes the influence of the background and yields a pixel-level, highly salient result for moving targets. Re-checking the target detection results against the saliency detection results greatly reduces the probability of false detection, effectively distinguishes background items that interfere with target detection, and improves the target detection accuracy.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by the present specification, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (9)

1.一种基于YOLOv4与显著性检测的安全帽佩戴识别方法,其特征在于,包括以下步骤:1. a safety helmet wearing identification method based on YOLOv4 and significance detection, is characterized in that, comprises the following steps: 步骤1:对已有的数据集进行标注,训练YOLOv4目标检测模型;Step 1: Label the existing data set and train the YOLOv4 target detection model; 利用标注软件对数据进行标注,得到记录着每张图片中目标位置大小与标签的文件后,再将这些数据分为二份,分别为训练集和测试集;Use the labeling software to label the data, and after obtaining the file recording the size and label of the target position in each picture, the data is divided into two parts, namely the training set and the test set; 采用YOLOv4网络作为目标检测模型,在训练时,基于目标检测的输出,通过迭代计算最小化检测模型的损失函数值,当达到预先确定的迭代次数后,得到训练完成的目标检测模型;The YOLOv4 network is used as the target detection model. During training, based on the output of the target detection, the loss function value of the detection model is minimized by iterative calculation. When the predetermined number of iterations is reached, the trained target detection model is obtained; 步骤2:下载并扩充显著性数据集,训练显著性检测模型;Step 2: Download and expand the saliency dataset, and train the saliency detection model; 在训练显著性检测模型时所使用的训练集来自于网络中公开的显著性检测数据集,对已有的显著性检测数据集进行数据扩充,得到可使用的视频训练数据集,并训练出有较高准确度的显著性检测模型;The training set used in training the saliency detection model comes from the saliency detection data set published in the network, and the existing saliency detection data set is expanded to obtain a usable video training data set. High-accuracy saliency detection model; 步骤3:利用训练好的目标检测模型,得到目标的识别结果与目标框体信息;Step 3: Use the trained target detection model to obtain the target recognition result and target frame information; 利用步骤1训练好的YOLOv4目标检测模型,将摄像头拍摄到的图片或者视频中取得的单个帧传入目标检测模型,得到关于目标检测结果的输出;Using the YOLOv4 target detection model trained in step 1, transfer the picture captured by the camera or a single frame obtained from the video to the target detection model to obtain the output of the target detection result; 步骤4:利用训练好的显著性检测模型,得到图像的显著性估计;Step 4: Use the trained saliency detection model to obtain the saliency estimation of the image; 用步骤2训练好的显著性模型执行显著性检测任务,将视频监控的每一帧图像输入到神经网络中,之后网络输出显著性映射;静态显著性模型是以单帧图像为输入,生成像素级显著性估计;动态显著性模型的输入包括视频中相邻两个视频帧,以及静态显著性模型输出的静态显著性图,生成最终依据时间序列的动态显著性结果;Use the saliency model trained in step 2 to perform the saliency detection task, input each frame of video surveillance image into the neural network, and then the network outputs the saliency map; the static saliency model takes a single frame of image as input, and generates pixels Level saliency estimation; the input of the dynamic saliency model includes two adjacent video frames in the video, and the static saliency map output by the static saliency model to generate the final dynamic saliency result based on time series; 步骤5:利用目标框体位置信息,对显著性估计进行裁剪;Step 5: Use the target frame position information to crop the saliency estimate; 由于原图、目标检测模型的结果图与显著性检测模型识别后的估计图的大小规格都是一致的,所以目标检测结果中的目标框体位置,与显著性估计图中的目标位置是相对应的,可以直接利用步骤3最后得到的输出在显著性估计图中标注目标位置;Since the size specifications of the original image, the result image of the target detection model and the estimated image recognized by the saliency detection model are consistent, the position of the target frame in the target detection result is the same as the target position in the saliency estimation image. 
Correspondingly, you can directly use the output obtained in step 3 to mark the target position in the saliency estimation map; 步骤6:对所有目标的单张小图片进行复检;Step 6: Recheck the single small pictures of all the targets; 利用步骤5得到的每个目标的显著性估计图,对裁剪的单个显著性图进行判断,是否存在目标。Use the saliency estimation map of each target obtained in step 5 to judge whether there is a target in the cropped single saliency map. 2.根据权利要求1所述的基于YOLOv4与显著性检测的安全帽佩戴识别方法,其特征在于,步骤1中,YOLOv4网络主要包括主干特征提取网络和加强特征提取网络;2. the safety helmet wearing identification method based on YOLOv4 and salience detection according to claim 1, is characterized in that, in step 1, YOLOv4 network mainly comprises backbone feature extraction network and strengthened feature extraction network; 主干特征提取网络采用了CSPDarkNet53架构,其输入为3通道图片,为了保证输入的一致,会把原始图片进行等比缩放;之后为了保证图片不失真,在调整短边时不会改变图片的长宽比,而是在短边上下或左右扩充灰色区域;在主干特征提取网络中,多次采用CSPnet改进的残差块来进行卷积,最终特征提取的三个结果就是后续加强特征提取网络的输入;The backbone feature extraction network adopts the CSPDarkNet53 architecture, and its input is a 3-channel image. In order to ensure the consistency of the input, the original image will be proportionally scaled; in order to ensure that the image is not distorted, the length and width of the image will not be changed when adjusting the short side. Instead, the gray area is expanded up and down or left and right on the short side; in the backbone feature extraction network, the improved residual block of CSPnet is used for convolution many times, and the three results of the final feature extraction are the input of the subsequent enhanced feature extraction network. ; 加强特征提取网络包括SPP结构和PANet结构;在SPP结构中,对CSPdarknet53的最后一个特征层进行三次卷积后,再分别利用四个不同尺度的最大池化进行处理,能够极大地增加感受野,分离出最显著的上下文特征;PANet是一种实例分割算法,其具体结构有反复提升特征的作用;The enhanced feature extraction network includes SPP structure and PANet structure; in the SPP structure, after performing three convolutions on the last feature layer of CSPdarknet53, four different scales of maximum pooling are used for processing, which can greatly increase the receptive field. Isolate the most significant context features; PANet is an instance segmentation algorithm, and its specific structure has the effect of repeatedly improving features; 由主干特征提取网络和SPP结构得到的三个有效特征层,使用多次卷积上采样和卷积下采样,有效实现特征提取,最后得到三个YOLOHead输出。The three effective feature layers obtained by the backbone feature extraction network and the SPP structure use multiple convolution upsampling and convolution downsampling to effectively implement feature extraction, and finally obtain three YOLOHead outputs. 3.根据权利要求1所述的基于YOLOv4与显著性检测的安全帽佩戴识别方法,其特征在于,步骤1中,通过以下三个部分计算损失函数值:3. the helmet wearing identification method based on YOLOv4 and saliency detection according to claim 1, is characterized in that, in step 1, calculates the loss function value by following three parts: 利用CIOU来计算的预测框体与真实框体位置的误差;与IOU对比,计算CIOU时,在考虑到两个框体间的交并面积比相同时,位置不同以及框体宽高比不同带来的误差,使结果更为准确;The error between the predicted frame and the real frame position calculated by CIOU; compared with IOU, when calculating CIOU, considering that the intersection area ratio between the two frames is the same, the position is different and the frame aspect ratio is different. errors, making the results more accurate; 目标置信度带来的误差;在正确检测到一个目标时,置信度得分越高,误差越小,反之则越大;Error caused by target confidence; when a target is correctly detected, the higher the confidence score, the smaller the error, and vice versa; 识别类别带来的误差;也就是预测的种类结果与实际结果的对比。Errors caused by identifying categories; that is, the comparison of the predicted category results with the actual results. 
4. The helmet wearing identification method based on YOLOv4 and saliency detection according to claim 1, characterized in that, in step 2, each public saliency detection data set contains thousands of images with corresponding saliency maps, which are used to train the static saliency detection network; training the dynamic saliency detection network additionally requires a data set of video frames adjacent to these images.

The pixel-level difference between adjacent frames is represented by the optical flow field V = (u, v), where u is the vertical component and v the horizontal component, and X(x, y) denotes the position of a point. The relation between adjacent frames I and I' can therefore be expressed by the following formula:
I(X) = I'(X + V)
Since the horizontal and vertical directions follow the same principle, take the vertical component u as an example. The pixels of the image are divided into foreground pixels f and background pixels b and processed separately. For the background, 10% of the background pixels are selected at random and given a motion value in the range [-d, d], where d = h/10 and h is the image height, so that part of the background jitters slightly, simulating the background noise of a real video. For the foreground, a main motion mode m is first assumed, whose value is the main direction and distance the foreground target moves between the two frames; each foreground pixel then takes a random value in the interval [m - d/10, m + d/10], producing small motion differences between pixels. In this way all foreground pixels share the same overall motion trend while the exact displacement of each pixel differs, which matches real footage. This generates, from the original image, a new image in which the foreground target has moved; combining the processed foreground and background pixels yields a multi-frame video data set with moving targets.
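The motion simulation of claim 4 could be sketched as below for the vertical flow component (the horizontal one is built the same way). The function name, the use of NumPy and any sampling detail beyond what the claim states (10% of background pixels, ranges [-d, d] and [m - d/10, m + d/10] with d = h/10) are assumptions.

```python
import numpy as np

def simulate_vertical_flow(fg_mask, m, rng=None):
    """Sketch of the vertical (u) component of the synthetic optical flow.

    fg_mask : boolean HxW array, True where the pixel belongs to the foreground.
    m       : assumed main vertical motion of the foreground between the two frames.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = fg_mask.shape
    d = h / 10.0
    u = np.zeros((h, w), dtype=np.float32)

    # Background: give roughly 10% of the background pixels a random offset
    # in [-d, d] to imitate background noise in a real video.
    bg = ~fg_mask
    noisy = bg & (rng.random((h, w)) < 0.10)
    u[noisy] = rng.uniform(-d, d, size=int(noisy.sum()))

    # Foreground: every pixel follows the main motion m, with a small
    # per-pixel variation drawn from [m - d/10, m + d/10].
    u[fg_mask] = rng.uniform(m - d / 10.0, m + d / 10.0, size=int(fg_mask.sum()))
    return u
```

Warping the original image with the resulting (u, v) field then gives the second, "moved" frame of the synthetic pair.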
5. The helmet wearing identification method based on YOLOv4 and saliency detection according to claim 4, characterized in that the overall computation of the static saliency network is:
Y = DS(FS(I; ΘF); ΘD)
where Y is the image output, I is the image input, FS denotes the features produced by the convolutional layers, DS denotes the deconvolution operation that restores the output Y to the same size as the input image I, and ΘF and ΘD are the convolution and deconvolution parameters.

Each deconvolution corresponds to a convolution in the first half of the network: after the feature matrix is expanded, it is multiplied by the transpose of the convolution kernel used in the corresponding convolution step, giving a feature map of twice the size.
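A minimal encoder-decoder sketch in the spirit of claim 5, assuming PyTorch; the layer count and channel widths are illustrative choices, since the claim fixes only the overall FS/DS structure. The convolutions play the role of FS and the mirrored transposed convolutions the role of DS, each of which doubles the spatial size until the input resolution is restored.

```python
import torch
import torch.nn as nn

class StaticSaliencyNet(nn.Module):
    """Illustrative static saliency encoder-decoder (assumes H and W divisible by 4)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(              # FS(I; ΘF): stride-2 convolutions
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(              # DS(.; ΘD): each layer doubles H and W
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, img):                        # img: N x 3 x H x W
        return self.decoder(self.encoder(img))     # saliency map: N x 1 x H x W
```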
6. The helmet wearing identification method based on YOLOv4 and saliency detection according to claim 5, characterized in that dynamic saliency estimation feeds the static saliency estimate, the original frame it was computed from, and the adjacent frame into the dynamic saliency detection network at the same time; the three are concatenated along the channel dimension to form an input of the format (A, B, C), which enters the first-layer convolution through the following formula:
F1 = W * (It, It+1, Pt) + b
where W is the convolution weight, b is the bias term, It and It+1 are two adjacent frames, and Pt is the static saliency map corresponding to It. The subsequent convolution and deconvolution operations of the dynamic saliency network follow the same structure as the static saliency network. By comparing the pixel-level optical flow of two adjacent frames, dynamic, highly salient targets are detected more reliably, improving the accuracy of saliency recognition.
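The channel concatenation and first-layer convolution of claim 6 might look as follows in PyTorch; the 3 + 3 + 1 channel layout, the 32 output channels and the ReLU activation are assumptions about details the claim leaves open.

```python
import torch
import torch.nn as nn

# First layer of the dynamic branch: two adjacent RGB frames and the static
# saliency map are concatenated along the channel axis and convolved,
# i.e. W * (It, It+1, Pt) + b.
first_conv = nn.Conv2d(in_channels=7, out_channels=32, kernel_size=3, padding=1)

def dynamic_first_layer(frame_t, frame_t1, static_sal):
    # frame_t, frame_t1: N x 3 x H x W; static_sal: N x 1 x H x W
    x = torch.cat([frame_t, frame_t1, static_sal], dim=1)   # N x 7 x H x W
    return torch.relu(first_conv(x))
```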
7. The helmet wearing identification method based on YOLOv4 and saliency detection according to claim 1, characterized in that, in step 3, YOLOv4 divides each input image into outputs at three scales, each scale corresponding to three prior (anchor) boxes, so nine prior boxes in total are used for detection.

The scale-one output has passed through the most convolution operations and is compressed the most, so it is suited to recognizing and detecting larger targets and uses the three largest prior boxes.

Scale two lies in the middle of the three output scales and is suited to medium-sized targets, using the three medium-sized prior boxes.

Scale three is the output with the fewest convolutions, so it uses the three smallest prior boxes and gives the best recognition of small targets in the image.

For target localization, YOLOv4 obtains the box position information with the following formulas:
bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw · e^tw
bh = ph · e^th
where (tx, ty, tw, th) is the model's predicted output, i.e. the learning target of the network; (cx, cy) is the coordinate offset of the grid cell, measured in cell side lengths; (pw, ph) is the side length of the preset anchor box; and (bx, by, bw, bh) are the centre coordinates, width and height of the final predicted bounding box.
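A small sketch of the box decoding described in claim 7, assuming the standard YOLOv3/v4 form in which the centre offsets pass through a sigmoid and the width and height scale the anchors exponentially; the function name is illustrative.

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode one prediction (tx, ty, tw, th) into a box (bx, by, bw, bh).
    (cx, cy) is the grid-cell offset, (pw, ph) the anchor (prior box) size."""
    bx = 1.0 / (1.0 + math.exp(-tx)) + cx   # sigmoid keeps the centre inside its cell
    by = 1.0 / (1.0 + math.exp(-ty)) + cy
    bw = pw * math.exp(tw)                  # anchors are scaled exponentially
    bh = ph * math.exp(th)
    return bx, by, bw, bh
```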
8. The helmet wearing identification method based on YOLOv4 and saliency detection according to claim 1, characterized in that, in step 4, to distinguish the background of the image from the moving targets effectively, the model is divided into a static model and a dynamic model; combined, the two capture the spatial and temporal information of the image at the same time and directly generate a pixel-level saliency map through a fully convolutional network.

The static saliency model takes a single frame as input and produces a pixel-level saliency estimate; the dynamic saliency model takes two adjacent video frames together with the static saliency map output by the static model, and generates the final, temporally informed dynamic saliency result.

9. The helmet wearing identification method based on YOLOv4 and saliency detection according to claim 1, characterized in that, in step 6, bright white regions of the saliency estimate are defined as valid targets and black regions as background. If white dominates a cropped patch, the detected target is regarded as valid and passes the recheck; if black dominates, the detected region is regarded as background and the detection is judged a false positive.
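The recheck of claims 1 and 9 can be sketched as a white-pixel ratio test on each cropped saliency patch. The saliency map is assumed to be normalised to [0, 1], and both thresholds are illustrative values, not taken from the patent.

```python
import numpy as np

def recheck(saliency_map, boxes, white_thresh=0.5, ratio_thresh=0.5):
    """Keep a detection only if bright (salient) pixels dominate its saliency crop.

    saliency_map : HxW float array in [0, 1]
    boxes        : iterable of (x1, y1, x2, y2) in pixel coordinates
    """
    kept = []
    for (x1, y1, x2, y2) in boxes:
        patch = saliency_map[y1:y2, x1:x2]
        if patch.size == 0:
            continue
        white_ratio = float((patch > white_thresh).mean())
        if white_ratio > ratio_thresh:      # mostly white -> valid target
            kept.append((x1, y1, x2, y2))
    return kept                             # detections that pass the recheck
```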
CN202110195098.0A 2021-02-22 2021-02-22 Helmet wearing identification method based on YOLOv4 and significance detection Pending CN112989958A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110195098.0A CN112989958A (en) 2021-02-22 2021-02-22 Helmet wearing identification method based on YOLOv4 and significance detection

Publications (1)

Publication Number Publication Date
CN112989958A true CN112989958A (en) 2021-06-18

Family

ID=76393783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110195098.0A Pending CN112989958A (en) 2021-02-22 2021-02-22 Helmet wearing identification method based on YOLOv4 and significance detection

Country Status (1)

Country Link
CN (1) CN112989958A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516069A (en) * 2021-07-08 2021-10-19 北京华创智芯科技有限公司 Road mark real-time detection method and device based on size robustness
CN113983737A (en) * 2021-10-18 2022-01-28 海信(山东)冰箱有限公司 Refrigerator and food material positioning method thereof
CN114022474A (en) * 2021-11-23 2022-02-08 浙江宁海抽水蓄能有限公司 Particle grading rapid detection method based on YOLO-V4

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8363939B1 (en) * 2006-10-06 2013-01-29 Hrl Laboratories, Llc Visual attention and segmentation system
CN109376676A (en) * 2018-11-01 2019-02-22 哈尔滨工业大学 Safety early warning method for construction personnel on highway engineering site based on UAV platform
CN109784183A (en) * 2018-12-17 2019-05-21 西北工业大学 Saliency object detection method based on concatenated convolutional network and light stream
CN112149459A (en) * 2019-06-27 2020-12-29 哈尔滨工业大学(深圳) Video salient object detection model and system based on cross attention mechanism
CN111027505A (en) * 2019-12-19 2020-04-17 吉林大学 Hierarchical multi-target tracking method based on significance detection
CN111292305A (en) * 2020-01-22 2020-06-16 重庆大学 Improved YOLO-V3 metal processing surface defect detection method
AU2020100371A4 (en) * 2020-03-12 2020-04-16 Jilin University Hierarchical multi-object tracking method based on saliency detection
AU2020100705A4 (en) * 2020-05-05 2020-06-18 Chang, Jiaying Miss A helmet detection method with lightweight backbone based on yolov3 network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
侯毅苇; 李林汉; 王彦: "Research on intelligent equipment target recognition using an improved YOLO network guided by infrared salient targets", Infrared Technology (红外技术), no. 07 *
魏龙生; 罗大鹏: "Salient target detection in remote sensing images based on a visual attention mechanism", Computer Engineering and Applications (计算机工程与应用), vol. 50, no. 19 *

Similar Documents

Publication Publication Date Title
CN110348319B (en) A face anti-counterfeiting method based on the fusion of face depth information and edge images
CN112861635B (en) Fire disaster and smoke real-time detection method based on deep learning
US8706663B2 (en) Detection of people in real world videos and images
US6611613B1 (en) Apparatus and method for detecting speaking person's eyes and face
US8577151B2 (en) Method, apparatus, and program for detecting object
CN102509104B (en) Confidence map-based method for distinguishing and detecting virtual object of augmented reality scene
Ma et al. Fusioncount: Efficient crowd counting via multiscale feature fusion
CN112001241B (en) Micro-expression recognition method and system based on channel attention mechanism
CN111241989A (en) Image recognition method and device and electronic equipment
CN112115775B (en) Smoke sucking behavior detection method based on computer vision under monitoring scene
CN111738054B (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
US20090245575A1 (en) Method, apparatus, and program storage medium for detecting object
CN112989958A (en) Helmet wearing identification method based on YOLOv4 and significance detection
CN103514432A (en) Method, device and computer program product for extracting facial features
CN112101195B (en) Crowd density estimation method, crowd density estimation device, computer equipment and storage medium
CN110929593A (en) A Real-time Saliency Pedestrian Detection Method Based on Detail Discrimination
CN115273234B (en) Crowd abnormal behavior detection method based on improved SSD
CN111401278A (en) Helmet identification method and device, electronic equipment and storage medium
CN105426895A (en) Prominence detection method based on Markov model
CN112200056A (en) Face living body detection method and device, electronic equipment and storage medium
CN112766145B (en) Method and device for identifying dynamic facial expressions of artificial neural network
US20090245576A1 (en) Method, apparatus, and program storage medium for detecting object
CN115937991A (en) Human body tumbling identification method and device, computer equipment and storage medium
CN112560584A (en) Face detection method and device, storage medium and terminal
CN114724190A (en) Mood recognition method based on pet posture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned (effective date of abandoning: 20240621)