Disclosure of Invention
In order to solve the above-mentioned problems in the prior art, that is, to improve the classification capability of a face detector without adding any additional overhead, a first aspect of the present invention provides a single-step face detector optimization system, which includes a training system and a testing system; the training system comprises a data enhancement module, a scale-aware margin module, a feature supervision module, a single-step face detector interface module and a loss function module;
the data enhancement module is configured to obtain a stitched image through copy stitching based on the detected image, randomly crop the stitched image to obtain an image block, and, after data enhancement is performed on the image block, divide anchor boxes to obtain training samples;
the single-step face detector interface module is configured to send the training samples to the single-step face detector to be trained for binary classification and bounding-box regression, and to obtain the sampled features of the training samples produced while the single-step face detector performs the binary classification of the training samples;
the scale-aware margin module is configured to obtain the scale-aware margin loss of the training samples;
the feature supervision module is configured to perform binary classification of the training samples through a feature-supervision-based classification network, based on the sampled features of the training samples;
the loss function module is configured to update the parameters of the single-step face detector based on a loss function built from L_CLS, L_LOC and L_FSM, where L_CLS is the scale-aware margin loss function used in the binary classification of the single-step face detector, L_LOC is the bounding-box regression loss function, and L_FSM is the loss function of the binary classification in the feature-supervision-based classification network;
the testing system is configured to perform, based on preset test data, a face detection task using the single-step face detector obtained by the training system to obtain a detection accuracy, and when the accuracy is lower than a preset accuracy threshold, the single-step face detector is optimized by the training system again.
In some preferred embodiments, the detected image is a rectangle, and the data enhancement module "obtains a stitched image by copy stitching based on the detected image" by:
copying the detected image 4 times, and obtaining the stitched image through 2×2 matrix stitching; the long side and the short side of the stitched image are respectively 2 times the long side and the short side of the detected image.
In some preferred embodiments, the long side and the short side of the image block in the data enhancement module are respectively A times the long side and the short side of the detected image, A ∈ [1, 2].
In some preferred embodiments, the feature supervision-based classification network comprises a ROI Align layer, four convolutional layers, a global average pooling layer, and a loss function layer.
In some preferred embodiments, the scale-aware margin loss function is constructed based on a margin-based prediction probability function
y = sigmoid(x - m)
where y is the predicted probability value, x is the predicted score, m is the margin value applied to x, α is a preset hyper-parameter, and w and h are respectively the width and height of the sample.
In some preferred embodiments, the scale-aware margin loss function is computed in a scale-aware margin network that includes a classification convolutional layer, a scale-aware margin layer, a Sigmoid function layer, and a loss function layer.
In some preferred embodiments, the loss function based on L_CLS, L_LOC and L_FSM is
L = L_CLS + L_LOC + λ·L_FSM
where λ is a preset weight.
In a second aspect of the present invention, a single-step face detector optimization method is provided, where the method includes the following steps:
Step S100: obtaining a stitched image through copy stitching based on the detected image; randomly cropping the stitched image to obtain an image block, and after data enhancement is performed on the image block, dividing anchor boxes to obtain training samples;
Step S200: performing binary classification and bounding-box regression of the training samples through a single-step face detector, and acquiring the sampled features of each training sample;
Step S300: performing binary classification of the training samples through a feature-supervision-based classification network, based on the sampled features obtained while the single-step face detector performs the binary classification of the training samples;
Step S400: training the single-step face detector based on a loss function built from L_CLS, L_LOC and L_FSM until a preset training end condition is reached; where L_CLS is the scale-aware margin loss function used in the binary classification of the single-step face detector, L_LOC is the bounding-box regression loss function, and L_FSM is the loss function of the binary classification in the feature-supervision-based classification network;
Step S500: performing, based on preset test data, a face detection task using the trained single-step face detector to obtain a detection accuracy; if the accuracy is lower than a preset accuracy threshold, jumping to step S100 to optimize the single-step face detector again, and otherwise outputting the trained single-step face detector.
In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the above-mentioned single-step face detector optimization method.
In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; wherein the program is adapted to be loaded and executed by a processor to implement the single-step face detector optimization method described above.
The invention has the beneficial effects that:
the invention utilizes the classification characteristics with more discriminative property obtained from the single-step face detector, and constructs an integral loss function through the combination of the classification network loss function of characteristic supervision, the scale perception margin loss function and the binary classification loss function of the single-step face detector to train the single-step face detector to be optimized, thereby effectively enhancing the classification capability of the high-performance single-step face detector, and the operation of the characteristic supervision classification network and the scale perception margin is not needed in the test stage, so the operation amount of the single-step face detector is not increased, and the invention can improve the classification capability of the face detector without increasing any additional cost.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
How to further improve existing high-performance face detectors, especially without adding any additional overhead, has become a challenging problem. To address this problem, the present invention provides a single-step face detector optimization system and a single-step face detector optimization method.
A single-step face detector optimization system of the present invention, as shown in fig. 1, includes a training system and a testing system; the training system comprises a data enhancement module, a scale-aware margin module, a feature supervision module, a single-step face detector interface module and a loss function module;
the data enhancement module is configured to obtain a stitched image through copy stitching based on the detected image, randomly crop the stitched image to obtain an image block, and, after data enhancement is performed on the image block, obtain training samples through anchor boxes;
the single-step face detector interface module is configured to send the training samples to the single-step face detector to be trained for binary classification and bounding-box regression, and to obtain the sampled features of the training samples produced while the single-step face detector performs the binary classification of the training samples;
the scale-aware margin module is configured to obtain the scale-aware margin loss of the training samples;
the feature supervision module is configured to perform binary classification of the training samples through a feature-supervision-based classification network, based on the sampled features of the training samples;
the loss function module is configured to update the parameters of the single-step face detector based on a loss function built from L_CLS, L_LOC and L_FSM, where L_CLS is the scale-aware margin loss function used in the binary classification of the single-step face detector, L_LOC is the bounding-box regression loss function, and L_FSM is the loss function of the binary classification in the feature-supervision-based classification network;
the testing system is configured to perform, based on preset test data, a face detection task using the single-step face detector obtained by the training system to obtain a detection accuracy, and when the accuracy is lower than a preset accuracy threshold, the single-step face detector is optimized by the training system again.
For a clearer explanation of the single-step face detector optimization system of the present invention, one embodiment of the system is discussed in detail below with reference to the accompanying drawings.
The single-step face detector backbone network to be trained in the embodiment of the present invention is constructed based on a ResNet network with a 6-level feature pyramid structure; as shown in fig. 1, the network includes 6 feature layers C2, C3, C4, C5, C6 and C7, and the 6 corresponding detection layers P2, P3, P4, P5, P6 and P7.
On each detection layer, anchor boxes of 2 scales are used: 2S and 2√2·S, where S is the down-sampling rate of that detection layer; in addition, only one width-to-height aspect ratio, 1.25, is used. With these 2 anchor boxes, the detection layers cover a size range of 8 to 362 pixels on the network input image.
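For reference, the Python sketch below enumerates the resulting anchor side lengths; it is an illustration only, and the per-layer strides 4, 8, 16, 32, 64 and 128 for P2-P7 as well as the second scale 2√2·S are inferred from the stated 8-362 pixel coverage rather than given explicitly in the text.

import math

STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32, "P6": 64, "P7": 128}  # assumed FPN strides
ASPECT_RATIO = 1.25  # the single width/height ratio used on every detection layer

def anchor_sizes(stride):
    # Two anchor scales per detection layer: 2*S and 2*sqrt(2)*S
    return [2 * stride, 2 * math.sqrt(2) * stride]

for name, s in STRIDES.items():
    print(name, [round(a, 1) for a in anchor_sizes(s)])
# P2 -> [8, 11.3], ..., P7 -> [256, 362.0]: the stated 8-362 pixel coverage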
The single-step face detector optimization system of an embodiment of the present invention, as shown in fig. 1, includes a training system and a testing system; the training system comprises a data enhancement module, a scale-aware margin module, a feature supervision module, a single-step face detector interface module and a loss function module.
1. Training system
(1) Data enhancement module
The module is configured to obtain a stitched image through copy stitching based on the detected image, randomly crop the stitched image to obtain an image block, and, after data enhancement is performed on the image block, obtain training samples through anchor boxes.
The existing data enhancement strategy adds photometric distortion to a training image, then performs a mean-value expansion operation, and then crops two patches (image blocks), one of which is randomly selected for training: one patch has the size of the short side of the image, and the other has a size determined by multiplying the short side of the image by a random number in the interval [0.5, 1.0]. Finally, the selected patch is randomly flipped and resized to 1024×1024 to obtain the final training sample. In this strategy the expansion operation yields more small faces, which significantly improves performance, especially for small faces; however, in the expanded image everything outside the placed original image contributes nothing during training, so the utilization rate of the canvas is low.
To solve the above problem, this embodiment replaces the expansion operation of the original method with a new, efficient data augmentation (EDA).
Since the largest scale factor is 2, in this embodiment the detected image is copied 4 times and a canvas is generated as the stitched image through 2×2 matrix stitching; the long side and the short side of the stitched image are respectively 2 times the long side and the short side of the detected image. The stitched image is then randomly cropped to obtain a patch whose long side and short side are respectively A times the long side and the short side of the detected image, with A ∈ [1, 2]. The cropped patch is then processed with a conventional data enhancement method.
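As an illustration only, the EDA stitch-and-crop step might be sketched as follows (NumPy-based; the helper name and the use of uniform sampling for A are assumptions, and the subsequent photometric distortion, flipping and resizing are left to the conventional pipeline).

import random
import numpy as np

def eda_stitch_and_crop(image):
    # image: H x W x C uint8 array; returns a randomly cropped patch.
    h, w = image.shape[:2]
    canvas = np.tile(image, (2, 2, 1))            # 2H x 2W canvas via 2x2 matrix stitching
    a = random.uniform(1.0, 2.0)                  # crop scale factor A in [1, 2]
    ch, cw = int(round(a * h)), int(round(a * w))
    top = random.randint(0, 2 * h - ch)
    left = random.randint(0, 2 * w - cw)
    patch = canvas[top:top + ch, left:left + cw]
    # conventional data enhancement (photometric distortion, flip, resize) follows
    return patch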
In the training phase, the preset anchor boxes need to be assigned as positive and negative samples: an anchor box is set as a positive sample if its overlap (IoU) with the ground truth is greater than 0.5; it is set as a negative sample if its IoU with the ground truth lies in the interval [0, 0.4); and if its IoU with the ground truth lies in the interval [0.4, 0.5), it is ignored during the training phase.
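A minimal sketch of this assignment rule (illustrative only; best_iou is assumed to be each anchor's highest IoU with any ground-truth face box, and an IoU of exactly 0.5 is treated as positive):

def label_anchor(best_iou):
    # best_iou: highest IoU between this anchor box and any ground-truth face box
    if best_iou >= 0.5:
        return 1       # positive sample
    if best_iou < 0.4:
        return 0       # negative (background) sample
    return -1          # IoU in [0.4, 0.5): ignored during training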
(2) Single-step face detector interface module
The module is configured to send the training samples to the single-step face detector to be trained for binary classification and bounding-box regression, and to obtain the sampled features of the training samples produced while the single-step face detector performs the binary classification of the training samples.
In recent years, convolutional neural network-based methods have dominated object detection; they are divided into one-step and two-step detection methods. The two-step detection method is time-consuming, whereas the one-step detection method is faster and more practical in many applications. This embodiment adopts a one-step detection method.
Face detection is a relatively simple binary classification task with a large number of small faces, which makes the advantage of the second stage of the two-step detection method less obvious. Since this embodiment adopts a one-step detection method, the key to enhancing its classification capability without reducing speed is how to fully utilize the second stage of the two-step method to learn more discriminative features. To solve this problem, this embodiment designs a feature-supervised classification network which, like the second stage of a two-step detector, lets the backbone network learn more discriminative features in the training stage while keeping the test time of the detector unchanged. In this way, the second stage of the two-step method is fully exploited to learn more discriminative classification features without any additional overhead at test time.
Since this method adds a feature-supervision-based classification network, a Non-Maximum Suppression (NMS) threshold needs to be set, which is 0.7 in this embodiment. In this embodiment, 512 predicted anchor boxes are selected as training samples and distributed onto suitable pyramid layers to extract the sampled features; the layer P_k on which the subsequent binary classification samples features is determined by formula (1):
where k_0 is a constant, and w and h are respectively the width and height of the training sample.
Namely:
if the training sample size is smaller than 16², it is distributed to the P2 layer;
if the training sample size is between 16² and 32², it is distributed to the P3 layer;
if the training sample size is between 32² and 64², it is distributed to the P4 layer;
if the training sample size is between 64² and 128², it is distributed to the P5 layer;
if the training sample size is between 128² and 256², it is distributed to the P6 layer;
if the training sample size is larger than 256², it is distributed to the P7 layer.
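The listed ranges can be reproduced by comparing the sample scale √(w·h) with the breakpoints 16, 32, 64, 128 and 256; the sketch below illustrates that rule and is not the exact formula (1), which is not reproduced in the text.

import math

def assign_pyramid_level(w, h):
    # Returns the index k of the pyramid layer Pk used to sample the features
    # of a training sample of width w and height h.
    scale = math.sqrt(w * h)
    for k, upper in zip((2, 3, 4, 5, 6), (16, 32, 64, 128, 256)):
        if scale < upper:
            return k
    return 7

assert assign_pyramid_level(12, 12) == 2    # area < 16^2    -> P2
assert assign_pyramid_level(40, 40) == 4    # 32^2 .. 64^2   -> P4
assert assign_pyramid_level(300, 300) == 7  # area > 256^2   -> P7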
(3) Scale-aware margin module
The module is configured to obtain the scale-aware margin loss of the training samples.
For high-performance face detection, the main remaining obstacle is classification error, i.e. the classification capability is not robust enough. To enhance the classification capability in detection, a scale-aware margin loss function is constructed from the conventional margin-based prediction probability function shown in formula (2); the difference is that the margin value m is changed from a fixed value to a quantity related to the width and height of the sample, as shown in formula (3).
y=sigmoid(x-m) (2)
where y is the predicted probability value, x is the predicted score, m is the margin value applied to x, α is a preset hyper-parameter, and w and h are respectively the width and height of the sample.
To obtain the value of the scale-aware margin loss function, a scale-aware margin network (SAM) is provided. As shown in figs. 1 and 3, the scale-aware margin network comprises a classification convolution layer, a scale-aware margin layer, a Sigmoid function layer and a loss function layer (Focal Loss layer): x − m is obtained through the scale-aware margin layer, the value y is obtained through the Sigmoid function layer, and the loss L_CLS is computed from y through the loss function layer. An image is fed into the scale-aware margin network to obtain the scale of each sample, the value m is obtained through formula (3), and the margin-based loss of the image is then obtained through formula (2).
The scale-aware margin network uses smaller margin values for larger faces and larger margin values for smaller faces to enhance classification capability. With the scale-aware margin module, faces can be better distinguished from complex backgrounds, so the classification capability for small faces is enhanced.
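As an illustration only: formula (3) for the margin m is not reproduced in the text, so the sketch below uses a hypothetical m(w, h) = α/√(w·h) that merely follows the stated behaviour (smaller margins for larger faces, larger margins for smaller faces); the function name is also an assumption.

import torch

def scale_aware_margin_prob(logits, w, h, alpha=1.0):
    # logits: predicted face scores x; w, h: width/height tensors of the matched samples.
    m = alpha / torch.sqrt(w * h)        # hypothetical scale-aware margin: shrinks as the face grows
    y = torch.sigmoid(logits - m)        # y = sigmoid(x - m), formula (2)
    return y                             # y is then fed to the Focal Loss layer to obtain L_CLS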
(4) Feature supervision module
The module is configured to perform binary classification of the training samples through the feature-supervision-based classification network, based on the sampled features of the training samples.
As shown in figs. 1 and 2, the feature-supervision-based classification network (FSM) includes an ROI Align layer, four convolution layers (a 256 × 128 × 3 × 3 convolution, a 128 × 64 × 3 × 3 convolution, a 64 × 32 × 3 × 3 convolution, and a 32 × 1 × 3 × 3 convolution), a global average pooling layer, and a loss function layer (Focal Loss layer).
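A minimal PyTorch sketch of such an FSM head is given below for illustration; the ReLU activations between the convolutions, the 7×7 ROI Align output size and the class name are assumptions not stated in the text.

import torch.nn as nn
from torchvision.ops import roi_align

class FSMHead(nn.Module):
    # Feature-supervised classification head: ROI Align -> 4 convs -> global average pool -> face logit.
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)       # global average pooling

    def forward(self, feature_map, boxes, stride):
        # boxes: list of (Ni, 4) tensors of sampled anchor boxes in image coordinates.
        rois = roi_align(feature_map, boxes, output_size=(7, 7),
                         spatial_scale=1.0 / stride, aligned=True)
        logits = self.pool(self.convs(rois)).flatten(1)   # (N, 1) face scores
        return logits          # passed to the Focal Loss layer to obtain L_FSM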
(5) Loss function module
The module is configured to update the parameters of the single-step face detector based on a loss function built from L_CLS, L_LOC and L_FSM, where L_CLS is the scale-aware margin loss function used in the binary classification of the single-step face detector, L_LOC is the bounding-box regression loss function, and L_FSM is the loss function of the binary classification in the feature-supervision-based classification network.
The loss function based on L_CLS, L_LOC and L_FSM is shown in equation (4):
L = L_CLS + L_LOC + λ·L_FSM (4)
where λ is a preset weight, set to 0.5 in this embodiment; during training it balances the binary classification loss of the single-step face detector against the binary classification loss of the feature-supervised classification network.
2. Test system
The system is configured to perform, based on preset test data, a face detection task using the single-step face detector obtained by the training system to obtain a detection accuracy; when the accuracy is lower than a preset accuracy threshold, the single-step face detector is optimized by the training system again.
The testing system does not need to run the feature-supervised classification network, so the computational cost of the single-step face detector is not increased.
In the testing stage, a confidence threshold of 0.05 is used to filter out low-scoring detections, and the 400 boxes with the highest confidence scores are kept; then Non-Maximum Suppression (NMS) is applied with a threshold of 0.4, and the 200 highest-scoring detection boxes in each image are output as the final result.
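For illustration, the test-time filtering could be sketched as follows using torchvision's NMS; the function name and the assumption that boxes and scores are per-image tensors of raw detector outputs are not part of the original text.

from torchvision.ops import nms

def postprocess(boxes, scores):
    # boxes: (N, 4) tensor, scores: (N,) tensor of raw detector outputs for one image.
    keep = scores > 0.05                                   # confidence threshold
    boxes, scores = boxes[keep], scores[keep]
    top = scores.topk(min(400, scores.numel())).indices    # keep the 400 highest-scoring boxes
    boxes, scores = boxes[top], scores[top]
    keep = nms(boxes, scores, iou_threshold=0.4)[:200]     # NMS at 0.4, then keep the top 200
    return boxes[keep], scores[keep]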
It should be noted that the single-step face detector optimization system provided in the foregoing embodiment is only illustrated by dividing the functional modules, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the modules or steps in the embodiments of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiments may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
A single-step face detector optimization method according to a second embodiment of the present invention, as shown in fig. 4, includes the following steps:
Step S100: obtaining a stitched image through copy stitching based on the detected image; randomly cropping the stitched image to obtain an image block, and after data enhancement is performed on the image block, dividing anchor boxes to obtain training samples;
Step S200: performing binary classification and bounding-box regression of the training samples through a single-step face detector, and acquiring the sampled features of each training sample;
Step S300: performing binary classification of the training samples through a feature-supervision-based classification network, based on the sampled features obtained while the single-step face detector performs the binary classification of the training samples;
Step S400: training the single-step face detector based on a loss function built from L_CLS, L_LOC and L_FSM until a preset training end condition is reached; where L_CLS is the scale-aware margin loss function used in the binary classification of the single-step face detector, L_LOC is the bounding-box regression loss function, and L_FSM is the loss function of the binary classification in the feature-supervision-based classification network;
Step S500: performing, based on preset test data, a face detection task using the trained single-step face detector to obtain a detection accuracy; if the accuracy is lower than a preset accuracy threshold, jumping to step S100 to optimize the single-step face detector again, and otherwise outputting the trained single-step face detector.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related descriptions of the method described above may refer to the corresponding process in the foregoing system embodiment, and are not described herein again.
A storage device according to a third embodiment of the present invention stores a plurality of programs, and the programs are suitable for being loaded and executed by a processor to realize the above-mentioned single-step face detector optimization method.
A processing apparatus according to a fourth embodiment of the present invention includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the single-step face detector optimization method described above.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Those of skill in the art would appreciate that the various illustrative modules and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that programs corresponding to the software modules and method steps may be stored in Random Access Memory (RAM), Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.