Disclosure of Invention
In order to solve the above-mentioned problems in the prior art, that is, to improve the classification capability of a face detector without adding any additional overhead, a first aspect of the present invention provides a single-step face detector optimization system, which includes a training system and a testing system; the training system comprises a data enhancement module, a scale-aware margin module, a feature supervision module, a single-step face detector interface module and a loss function module;
the data enhancement module is configured to obtain a stitched image through copy stitching based on the detected image, randomly crop the stitched image to obtain an image block, and, after data enhancement is performed on the image block, divide anchor boxes to obtain training samples;
the single-step face detector interface module is configured to send the training samples to the single-step face detector to be trained for binary classification and bounding-box regression, and to obtain the sampled features of the training samples produced while the single-step face detector performs the binary classification of the training samples;
the scale-aware margin module is configured to obtain the scale-aware margin loss of the training samples;
the feature supervision module is configured to perform binary classification of the training samples through a feature-supervision-based classification network, based on the sampled features of the training samples;
the loss function module is configured to update the parameters of the single-step face detector based on a loss function built from L_CLS, L_LOC and L_FSM, where L_CLS is the scale-aware margin loss function used in the binary classification of the single-step face detector, L_LOC is the bounding-box regression loss function, and L_FSM is the loss function of the binary classification in the feature-supervision-based classification network;
the testing system is configured to perform, based on preset test data, a face detection task using the single-step face detector obtained by the training system to obtain a detection accuracy, and when the accuracy is lower than a preset accuracy threshold, the single-step face detector is optimized by the training system again.
In some preferred embodiments, the detected image is a rectangle, and the data enhancement module "obtains a stitched image by copy stitching based on the detected image" by:
copying the detected image 4 times, and obtaining the stitched image through 2×2 matrix stitching; the long side and the short side of the stitched image are respectively 2 times the long side and the short side of the detected image.
In some preferred embodiments, the long side and the short side of the image block in the data enhancement module are respectively A times the long side and the short side of the detected image, A ∈ [1, 2].
In some preferred embodiments, the feature supervision-based classification network comprises a ROI Align layer, four convolutional layers, a global average pooling layer, and a loss function layer.
In some preferred embodiments, the scale-aware margin loss function is constructed based on a margin-based prediction probability function
y = sigmoid(x - m)
where y is the predicted probability value, x is the predicted score, m is the margin value applied to x, α is a preset hyper-parameter, and w and h are respectively the width and height of the sample.
In some preferred embodiments, the scale-aware margin loss function is computed in a scale-aware margin network that includes a classification convolutional layer, a scale-aware margin layer, a Sigmoid function layer, and a loss function layer.
In some preferred embodiments, the loss function based on L_CLS, L_LOC and L_FSM is
L = L_CLS + L_LOC + λ·L_FSM
where λ is a preset weight.
In a second aspect of the present invention, a single-step face detector optimization method is provided, where the method includes the following steps:
Step S100: obtaining a stitched image through copy stitching based on the detected image; randomly cropping the stitched image to obtain an image block, and after data enhancement is performed on the image block, dividing anchor boxes to obtain training samples;
Step S200: performing binary classification and bounding-box regression of the training samples through a single-step face detector, and acquiring the sampled features of each training sample;
Step S300: performing binary classification of the training samples through a feature-supervision-based classification network, based on the sampled features obtained while the single-step face detector performs the binary classification of the training samples;
Step S400: training the single-step face detector based on a loss function built from L_CLS, L_LOC and L_FSM until a preset training end condition is reached; where L_CLS is the scale-aware margin loss function used in the binary classification of the single-step face detector, L_LOC is the bounding-box regression loss function, and L_FSM is the loss function of the binary classification in the feature-supervision-based classification network;
Step S500: performing, based on preset test data, a face detection task using the trained single-step face detector to obtain a detection accuracy; if the accuracy is lower than a preset accuracy threshold, jumping to step S100 to optimize the single-step face detector again, and otherwise outputting the trained single-step face detector.
In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the above-mentioned single-step face detector optimization method.
In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; wherein the program is adapted to be loaded and executed by a processor to implement the single-step face detector optimization method described above.
The invention has the beneficial effects that:
the invention utilizes the classification characteristics with more discriminative property obtained from the single-step face detector, and constructs an integral loss function through the combination of the classification network loss function of characteristic supervision, the scale perception margin loss function and the binary classification loss function of the single-step face detector to train the single-step face detector to be optimized, thereby effectively enhancing the classification capability of the high-performance single-step face detector, and the operation of the characteristic supervision classification network and the scale perception margin is not needed in the test stage, so the operation amount of the single-step face detector is not increased, and the invention can improve the classification capability of the face detector without increasing any additional cost.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
How to further improve existing high-performance face detectors, especially without adding any additional overhead, has become a challenging problem. To address this problem, the present invention provides a single-step face detector optimization system and a single-step face detector optimization method.
A single-step face detector optimization system of the present invention, as shown in fig. 1, includes a training system and a testing system; the training system comprises a data enhancement module, a scale-aware margin module, a feature supervision module, a single-step face detector interface module and a loss function module;
the data enhancement module is configured to obtain a stitched image through copy stitching based on the detected image, randomly crop the stitched image to obtain an image block, and, after data enhancement is performed on the image block, obtain training samples through anchor boxes;
the single-step face detector interface module is configured to send the training samples to the single-step face detector to be trained for binary classification and bounding-box regression, and to obtain the sampled features of the training samples produced while the single-step face detector performs the binary classification of the training samples;
the scale-aware margin module is configured to obtain the scale-aware margin loss of the training samples;
the feature supervision module is configured to perform binary classification of the training samples through a feature-supervision-based classification network, based on the sampled features of the training samples;
the loss function module is configured to update the parameters of the single-step face detector based on a loss function built from L_CLS, L_LOC and L_FSM, where L_CLS is the scale-aware margin loss function used in the binary classification of the single-step face detector, L_LOC is the bounding-box regression loss function, and L_FSM is the loss function of the binary classification in the feature-supervision-based classification network;
the testing system is configured to perform, based on preset test data, a face detection task using the single-step face detector obtained by the training system to obtain a detection accuracy, and when the accuracy is lower than a preset accuracy threshold, the single-step face detector is optimized by the training system again.
For a clearer explanation of the single-step face detector optimization system of the present invention, one embodiment of the system is discussed in detail below with reference to the accompanying drawings.
The single-step face detector backbone network to be trained in the embodiment of the present invention is constructed based on a ResNet network with a 6-level feature pyramid structure; as shown in fig. 1, the network includes 6 feature layers C2, C3, C4, C5, C6 and C7, and the 6 corresponding detection layers P2, P3, P4, P5, P6 and P7.
On each detection layer, anchor boxes of 2 scales are used: 2S and 2√2·S, where S is the down-sampling rate of that detection layer; in addition, only one width-to-height aspect ratio, 1.25, is used. With these 2 anchor boxes, the detection layers cover a size range of 8 to 362 pixels on the network input image.
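For reference, the Python sketch below enumerates the resulting anchor side lengths; it is an illustration only, and the per-layer strides 4, 8, 16, 32, 64 and 128 for P2-P7 as well as the second scale 2√2·S are inferred from the stated 8-362 pixel coverage rather than given explicitly in the text.

import math

STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32, "P6": 64, "P7": 128}  # assumed FPN strides
ASPECT_RATIO = 1.25  # the single width/height ratio used on every detection layer

def anchor_sizes(stride):
    # Two anchor scales per detection layer: 2*S and 2*sqrt(2)*S
    return [2 * stride, 2 * math.sqrt(2) * stride]

for name, s in STRIDES.items():
    print(name, [round(a, 1) for a in anchor_sizes(s)])
# P2 -> [8, 11.3], ..., P7 -> [256, 362.0]: the stated 8-362 pixel coverage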
The single-step face detector optimization system of an embodiment of the present invention, as shown in fig. 1, includes a training system and a testing system; the training system comprises a data enhancement module, a scale-aware margin module, a feature supervision module, a single-step face detector interface module and a loss function module.
1. Training system
(1) Data enhancement module
The module is configured to obtain a stitched image through copy stitching based on the detected image, randomly crop the stitched image to obtain an image block, and, after data enhancement is performed on the image block, obtain training samples through anchor boxes.
The existing data enhancement strategy adds photometric distortion to a training image, then performs a mean-value expansion operation, and then crops two patches (image blocks), one of which is randomly selected for training: one patch has the size of the short side of the image, and the other has a size determined by multiplying the short side of the image by a random number in the interval [0.5, 1.0]. Finally, the selected patch is randomly flipped and resized to 1024×1024 to obtain the final training sample. In this strategy the expansion operation yields more small faces, which significantly improves performance, especially for small faces; however, in the expanded image everything outside the placed original image contributes nothing during training, so the utilization rate of the canvas is low.
To solve the above problem, this embodiment replaces the expansion operation of the original method with a new, efficient data augmentation (EDA).
Since the largest scale factor is 2, in this embodiment the detected image is copied 4 times and a canvas is generated as the stitched image through 2×2 matrix stitching; the long side and the short side of the stitched image are respectively 2 times the long side and the short side of the detected image. The stitched image is then randomly cropped to obtain a patch whose long side and short side are respectively A times the long side and the short side of the detected image, with A ∈ [1, 2]. The cropped patch is then processed with a conventional data enhancement method.
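As an illustration only, the EDA stitch-and-crop step might be sketched as follows (NumPy-based; the helper name and the use of uniform sampling for A are assumptions, and the subsequent photometric distortion, flipping and resizing are left to the conventional pipeline).

import random
import numpy as np

def eda_stitch_and_crop(image):
    # image: H x W x C uint8 array; returns a randomly cropped patch.
    h, w = image.shape[:2]
    canvas = np.tile(image, (2, 2, 1))            # 2H x 2W canvas via 2x2 matrix stitching
    a = random.uniform(1.0, 2.0)                  # crop scale factor A in [1, 2]
    ch, cw = int(round(a * h)), int(round(a * w))
    top = random.randint(0, 2 * h - ch)
    left = random.randint(0, 2 * w - cw)
    patch = canvas[top:top + ch, left:left + cw]
    # conventional data enhancement (photometric distortion, flip, resize) follows
    return patch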
In the training phase, the preset anchor boxes need to be assigned as positive and negative samples: an anchor box is set as a positive sample if its overlap (IoU) with the ground truth is greater than 0.5; it is set as a negative sample if its IoU with the ground truth lies in the interval [0, 0.4); and if its IoU with the ground truth lies in the interval [0.4, 0.5), it is ignored during the training phase.
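A minimal sketch of this assignment rule (illustrative only; best_iou is assumed to be each anchor's highest IoU with any ground-truth face box, and an IoU of exactly 0.5 is treated as positive):

def label_anchor(best_iou):
    # best_iou: highest IoU between this anchor box and any ground-truth face box
    if best_iou >= 0.5:
        return 1       # positive sample
    if best_iou < 0.4:
        return 0       # negative (background) sample
    return -1          # IoU in [0.4, 0.5): ignored during training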
(2) Single-step face detector interface module
The module is configured to send the training samples to the single-step face detector to be trained for binary classification and bounding-box regression, and to obtain the sampled features of the training samples produced while the single-step face detector performs the binary classification of the training samples.
In recent years, convolutional neural network-based methods have dominated object detection; they are divided into one-step and two-step detection methods. The two-step detection method is time-consuming, whereas the one-step detection method is faster and more practical in many applications. This embodiment adopts a one-step detection method.
Face detection is a relatively simple binary classification task with a large number of small faces, which makes the advantage of the second stage of the two-step detection method less obvious. Since this embodiment adopts a one-step detection method, the key to enhancing its classification capability without reducing speed is how to fully utilize the second stage of the two-step method to learn more discriminative features. To solve this problem, this embodiment designs a feature-supervised classification network which, like the second stage of a two-step detector, lets the backbone network learn more discriminative features in the training stage while keeping the test time of the detector unchanged. In this way, the second stage of the two-step method is fully exploited to learn more discriminative classification features without any additional overhead at test time.
Since this method adds a feature-supervision-based classification network, a Non-Maximum Suppression (NMS) threshold needs to be set, which is 0.7 in this embodiment. In this embodiment, 512 predicted anchor boxes are selected as training samples and distributed onto suitable pyramid layers to extract the sampled features; the layer P_k on which the subsequent binary classification samples features is determined by formula (1):
where k_0 is a constant, and w and h are respectively the width and height of the training sample.
Namely:
if the training sample size is smaller than 16², it is distributed to the P2 layer;
if the training sample size is between 16² and 32², it is distributed to the P3 layer;
if the training sample size is between 32² and 64², it is distributed to the P4 layer;
if the training sample size is between 64² and 128², it is distributed to the P5 layer;
if the training sample size is between 128² and 256², it is distributed to the P6 layer;
if the training sample size is larger than 256², it is distributed to the P7 layer.
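The listed ranges can be reproduced by comparing the sample scale √(w·h) with the breakpoints 16, 32, 64, 128 and 256; the sketch below illustrates that rule and is not the exact formula (1), which is not reproduced in the text.

import math

def assign_pyramid_level(w, h):
    # Returns the index k of the pyramid layer Pk used to sample the features
    # of a training sample of width w and height h.
    scale = math.sqrt(w * h)
    for k, upper in zip((2, 3, 4, 5, 6), (16, 32, 64, 128, 256)):
        if scale < upper:
            return k
    return 7

assert assign_pyramid_level(12, 12) == 2    # area < 16^2    -> P2
assert assign_pyramid_level(40, 40) == 4    # 32^2 .. 64^2   -> P4
assert assign_pyramid_level(300, 300) == 7  # area > 256^2   -> P7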
(3) Scale-aware margin module
The module is configured to obtain the scale-aware margin loss of the training samples.
For high-performance face detection, the main remaining obstacle is classification error, i.e. the classification capability is not robust enough. To enhance the classification capability in detection, a scale-aware margin loss function is constructed from the conventional margin-based prediction probability function shown in formula (2); the difference is that the margin value m is changed from a fixed value to a quantity related to the width and height of the sample, as shown in formula (3).
y=sigmoid(x-m) (2)
where y is the predicted probability value, x is the predicted score, m is the margin value applied to x, α is a preset hyper-parameter, and w and h are respectively the width and height of the sample.
To obtain the value of the scale-aware margin loss function, a scale-aware margin network (SAM) is provided. As shown in figs. 1 and 3, the scale-aware margin network comprises a classification convolution layer, a scale-aware margin layer, a Sigmoid function layer and a loss function layer (Focal Loss layer): x − m is obtained through the scale-aware margin layer, the value y is obtained through the Sigmoid function layer, and the loss L_CLS is computed from y through the loss function layer. An image is fed into the scale-aware margin network to obtain the scale of each sample, the value m is obtained through formula (3), and the margin-based loss of the image is then obtained through formula (2).
The scale-aware margin network uses smaller margin values for larger faces and larger margin values for smaller faces to enhance classification capability. With the scale-aware margin module, faces can be better distinguished from complex backgrounds, so the classification capability for small faces is enhanced.
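As an illustration only: formula (3) for the margin m is not reproduced in the text, so the sketch below uses a hypothetical m(w, h) = α/√(w·h) that merely follows the stated behaviour (smaller margins for larger faces, larger margins for smaller faces); the function name is also an assumption.

import torch

def scale_aware_margin_prob(logits, w, h, alpha=1.0):
    # logits: predicted face scores x; w, h: width/height tensors of the matched samples.
    m = alpha / torch.sqrt(w * h)        # hypothetical scale-aware margin: shrinks as the face grows
    y = torch.sigmoid(logits - m)        # y = sigmoid(x - m), formula (2)
    return y                             # y is then fed to the Focal Loss layer to obtain L_CLS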
(4) Feature supervision module
The module is configured to perform binary classification of the training samples through the feature-supervision-based classification network, based on the sampled features of the training samples.
As shown in figs. 1 and 2, the feature-supervision-based classification network (FSM) includes an ROI Align layer, four convolution layers (a 256 × 128 × 3 × 3 convolution, a 128 × 64 × 3 × 3 convolution, a 64 × 32 × 3 × 3 convolution, and a 32 × 1 × 3 × 3 convolution), a global average pooling layer, and a loss function layer (Focal Loss layer).
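A minimal PyTorch sketch of such an FSM head is given below for illustration; the ReLU activations between the convolutions, the 7×7 ROI Align output size and the class name are assumptions not stated in the text.

import torch.nn as nn
from torchvision.ops import roi_align

class FSMHead(nn.Module):
    # Feature-supervised classification head: ROI Align -> 4 convs -> global average pool -> face logit.
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)       # global average pooling

    def forward(self, feature_map, boxes, stride):
        # boxes: list of (Ni, 4) tensors of sampled anchor boxes in image coordinates.
        rois = roi_align(feature_map, boxes, output_size=(7, 7),
                         spatial_scale=1.0 / stride, aligned=True)
        logits = self.pool(self.convs(rois)).flatten(1)   # (N, 1) face scores
        return logits          # passed to the Focal Loss layer to obtain L_FSM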
(5) Loss function module
The module is configured to update the parameters of the single-step face detector based on a loss function built from L_CLS, L_LOC and L_FSM, where L_CLS is the scale-aware margin loss function used in the binary classification of the single-step face detector, L_LOC is the bounding-box regression loss function, and L_FSM is the loss function of the binary classification in the feature-supervision-based classification network.
The loss function based on L_CLS, L_LOC and L_FSM is shown in equation (4):
L = L_CLS + L_LOC + λ·L_FSM (4)
where λ is a preset weight, set to 0.5 in this embodiment; during training it balances the binary classification loss of the single-step face detector against the binary classification loss of the feature-supervised classification network.
2. Test system
The system is configured to perform, based on preset test data, a face detection task using the single-step face detector obtained by the training system to obtain a detection accuracy; when the accuracy is lower than a preset accuracy threshold, the single-step face detector is optimized by the training system again.
The testing system does not need to run the feature-supervised classification network, so the computational cost of the single-step face detector is not increased.
In the testing stage, a confidence threshold of 0.05 is used to filter out low-scoring detections, and the 400 boxes with the highest confidence scores are kept; then Non-Maximum Suppression (NMS) is applied with a threshold of 0.4, and the 200 highest-scoring detection boxes in each image are output as the final result.
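For illustration, the test-time filtering could be sketched as follows using torchvision's NMS; the function name and the assumption that boxes and scores are per-image tensors of raw detector outputs are not part of the original text.

from torchvision.ops import nms

def postprocess(boxes, scores):
    # boxes: (N, 4) tensor, scores: (N,) tensor of raw detector outputs for one image.
    keep = scores > 0.05                                   # confidence threshold
    boxes, scores = boxes[keep], scores[keep]
    top = scores.topk(min(400, scores.numel())).indices    # keep the 400 highest-scoring boxes
    boxes, scores = boxes[top], scores[top]
    keep = nms(boxes, scores, iou_threshold=0.4)[:200]     # NMS at 0.4, then keep the top 200
    return boxes[keep], scores[keep]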
It should be noted that the single-step face detector optimization system provided in the foregoing embodiment is only illustrated by dividing the functional modules, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the modules or steps in the embodiments of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiments may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
A single-step face detector optimization method according to a second embodiment of the present invention, as shown in fig. 4, includes the following steps:
Step S100: obtaining a stitched image through copy stitching based on the detected image; randomly cropping the stitched image to obtain an image block, and after data enhancement is performed on the image block, dividing anchor boxes to obtain training samples;
Step S200: performing binary classification and bounding-box regression of the training samples through a single-step face detector, and acquiring the sampled features of each training sample;
Step S300: performing binary classification of the training samples through a feature-supervision-based classification network, based on the sampled features obtained while the single-step face detector performs the binary classification of the training samples;
Step S400: training the single-step face detector based on a loss function built from L_CLS, L_LOC and L_FSM until a preset training end condition is reached; where L_CLS is the scale-aware margin loss function used in the binary classification of the single-step face detector, L_LOC is the bounding-box regression loss function, and L_FSM is the loss function of the binary classification in the feature-supervision-based classification network;
Step S500: performing, based on preset test data, a face detection task using the trained single-step face detector to obtain a detection accuracy; if the accuracy is lower than a preset accuracy threshold, jumping to step S100 to optimize the single-step face detector again, and otherwise outputting the trained single-step face detector.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related descriptions of the method described above may refer to the corresponding process in the foregoing system embodiment, and are not described herein again.
A storage device according to a third embodiment of the present invention stores a plurality of programs, and the programs are suitable for being loaded and executed by a processor to realize the above-mentioned single-step face detector optimization method.
A processing apparatus according to a fourth embodiment of the present invention includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the single-step face detector optimization method described above.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Those of skill in the art would appreciate that the various illustrative modules and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that programs corresponding to the software modules and method steps may be stored in Random Access Memory (RAM), Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.