CN113283319B - Method, device, medium and electronic device for evaluating face blur - Google Patents
- Publication number
- CN113283319B (application number CN202110524303.3A)
- Authority
- CN
- China
- Prior art keywords
- video frame
- face
- target video
- features
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2135—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/73—Deblurring; Sharpening
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
- G06V10/464—Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/50—Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Oral & Maxillofacial Surgery (AREA)
- General Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Image Analysis (AREA)
Abstract
The disclosure provides a face ambiguity evaluation method, a face ambiguity evaluation device, a computer readable medium and an electronic device, and relates to the technical field of image processing. The method includes: extracting a target video frame containing faces from video data, and acquiring a reference video frame corresponding to the target video frame from the video data; calculating, based on the target video frame and the reference video frame, the face pose feature, the face similarity feature and the face gradient feature corresponding to each face contained in the target video frame; fusing, for each face, the corresponding face pose feature, face similarity feature and face gradient feature to obtain a fusion feature; and determining the face ambiguity corresponding to each face based on the fusion feature corresponding to that face. By using multiple frames and their semantic information, blurring of the face region caused either by motion or by poor image quality can be taken into account at the same time, so that a more accurate face blur degree is obtained.
Description
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a method for evaluating a face ambiguity, an apparatus for evaluating a face ambiguity, a computer readable medium, and an electronic device.
Background
Image scene understanding is a basic and important task in image classification and understanding; it has attracted great research interest in recent years and is widely applied in various fields. In addition to high-level semantics, scene understanding is also needed in terms of image quality, for example in image deblurring, a task related to image quality.
Image deblurring is a complicated task. For example, due to the exposure time, the acquired image may suffer from imaging blur; referring to fig. 1, face images with different blur degrees are shown. In this case, the degree of blurring of the image needs to be determined first, so that different deblurring algorithms can be selected according to the degree of blurring of the image.
Disclosure of Invention
The present disclosure aims to provide a face ambiguity evaluation method, a face ambiguity evaluation device, a computer readable medium and an electronic device, so as to improve the accuracy of blur degree estimation at least to some extent.
According to a first aspect of the disclosure, a method for evaluating face ambiguity is provided, which includes: extracting a target video frame containing a face from video data, and acquiring a reference video frame corresponding to the target video frame from the video data; calculating face pose features, face similarity features and face gradient features corresponding to each face contained in the target video frame based on the target video frame and the reference video frame; respectively fusing the face pose features, the face similarity features and the face gradient features corresponding to each face to obtain fusion features; and determining the face ambiguity corresponding to each face based on the fusion features corresponding to each face.
According to a second aspect of the disclosure, a face ambiguity evaluation device is provided, which comprises a video frame acquisition module, a feature calculation module and an ambiguity evaluation module. The video frame acquisition module is used for extracting a target video frame containing a face from video data and acquiring a reference video frame corresponding to the target video frame from the video data; the feature calculation module is used for calculating face pose features, face similarity features and face gradient features corresponding to each face contained in the target video frame based on the target video frame and the reference video frame; and the ambiguity evaluation module is used for respectively fusing the face pose features, the face similarity features and the face gradient features corresponding to each face to obtain fusion features, and determining the face ambiguity corresponding to each face based on the fusion features corresponding to each face.
According to a third aspect of the present disclosure, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processor, implements the method described above.
According to a fourth aspect of the present disclosure, there is provided an electronic device comprising one or more processors, and a memory for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the above-described method.
According to the face ambiguity evaluation method provided by the embodiment of the disclosure, the target video frame in the video data is obtained and face detection is performed on it. When the target video frame includes at least one face, a reference video frame corresponding to the target video frame is acquired; the face pose feature, the face similarity feature and the face gradient feature corresponding to each face are calculated based on the multiple video frames; the fusion feature of each face is determined from the face pose feature, face similarity feature and face gradient feature corresponding to that face; and the face ambiguity corresponding to each face is determined according to its fusion feature. Through multiple frames and their semantic information, blurring of the face region caused either by motion or by poor image quality can be taken into account at the same time, so that a more accurate face blur degree is obtained.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort. In the drawings:
FIG. 1 shows face images of different blur levels;
FIG. 2 illustrates a schematic diagram of an exemplary system architecture to which embodiments of the present disclosure may be applied;
FIG. 3 shows a schematic diagram of an electronic device to which embodiments of the present disclosure may be applied;
FIG. 4 schematically illustrates a flow chart of a method of blur level determination by a secondary blur algorithm;
FIG. 5 schematically illustrates a flowchart of a method of face ambiguity assessment in an exemplary embodiment of the present disclosure;
FIG. 6 schematically illustrates a drawing of a reference video frame in an exemplary embodiment of the present disclosure;
FIG. 7 schematically illustrates a flowchart of a regression model training method in an exemplary embodiment of the present disclosure;
FIG. 8 schematically illustrates a flowchart of another method of face ambiguity assessment in an exemplary embodiment of the present disclosure;
Fig. 9 schematically illustrates a composition diagram of an evaluation apparatus of face blur degree in an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein, but rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the exemplary embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
Fig. 2 is a schematic diagram of a system architecture of an exemplary application environment to which a face ambiguity evaluation method and apparatus according to an embodiment of the present disclosure may be applied.
As shown in fig. 2, the system architecture 200 may include one or more of the terminal devices 201, 202, 203, a network 204, and a server 205. The network 204 is the medium used to provide communication links between the terminal devices 201, 202, 203 and the server 205. The network 204 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others. The terminal devices 201, 202, 203 may be various electronic devices having image processing functions, including but not limited to desktop computers, portable computers, smart phones, tablet computers, and the like. It should be understood that the number of terminal devices, networks and servers in fig. 2 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 205 may be a server cluster formed by a plurality of servers.
The method for evaluating the face ambiguity provided by the embodiments of the present disclosure is generally performed in the terminal devices 201, 202, 203, and accordingly, the apparatus for evaluating the face ambiguity is generally disposed in the terminal devices 201, 202, 203. However, it is easily understood by those skilled in the art that the method for evaluating the face ambiguity provided in the embodiment of the present disclosure may also be performed by the server 205, and accordingly, the apparatus for evaluating the face ambiguity may also be provided in the server 205, which is not particularly limited in the present exemplary embodiment. For example, in an exemplary embodiment, the user may collect preview video data in real time through a camera module included in the terminal devices 201, 202, 203. The video frame corresponding to the moment when the user presses the photographing key is then determined as the target video frame, and a reference video frame corresponding to the target video frame is obtained from the video data. The target video frame and its reference video frame are sent to the server 205 through the network 204; the server 205 then calculates the face pose feature, face similarity feature and face gradient feature corresponding to each face, performs feature fusion to obtain the fusion feature corresponding to each face, calculates the face ambiguity corresponding to each face according to the fusion feature, and returns the face ambiguity to the terminal devices 201, 202, 203.
The exemplary embodiments of the present disclosure provide an electronic device, which may be the terminal device 201, 202, 203 or the server 205 in fig. 2, for implementing the method of evaluating face ambiguity. The electronic device comprises at least a processor and a memory for storing executable instructions of the processor, the processor being configured to perform a method of face ambiguity assessment via execution of the executable instructions.
The configuration of the electronic device will be exemplarily described below using the mobile terminal 300 of fig. 3 as an example. It will be appreciated by those skilled in the art that, apart from the components specific to mobile purposes, the configuration of fig. 3 can also be applied to stationary devices. In other embodiments, the mobile terminal 300 may include more or fewer components than illustrated, certain components may be combined or split, or the components may be arranged differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware. The interfacing relationship between the components is shown schematically only and does not constitute a structural limitation of the mobile terminal 300. In other embodiments, the mobile terminal 300 may also employ an interface different from that of fig. 3, or a combination of interfaces.
As shown in fig. 3, the mobile terminal 300 may specifically include a processor 310, an internal memory 321, an external memory interface 322, a universal serial bus (Universal Serial Bus, USB) interface 330, a charge management module 340, a power management module 341, a battery 342, an antenna 1, an antenna 2, a mobile communication module 350, a wireless communication module 360, an audio module 370, a speaker 371, a receiver 372, a microphone 373, an earphone interface 374, a sensor module 380, a display 390, an image capturing module 391, an indicator 392, a motor 393, a key 394, a user identification module (subscriber identification module, SIM) card interface 395, and the like. Wherein the sensor module 380 may include a depth sensor 3801, a pressure sensor 3802, a gyro sensor 3803, and the like.
The processor 310 may include one or more processing units; for example, the processor 310 may include an application processor (Application Processor, AP), a modem processor, a graphics processor (Graphics Processing Unit, GPU), an image signal processor (Image Signal Processor, ISP), a controller, a video codec, a digital signal processor (Digital Signal Processor, DSP), a baseband processor and/or a neural network processor (Neural-Network Processing Unit, NPU), and the like. The different processing units may be separate devices or may be integrated in one or more processors. In an exemplary embodiment, steps such as acquiring the target video frame included in the video data and acquiring the reference video frame corresponding to the target video frame in the video data may be implemented by the processor.
The NPU is a neural network (Neural-Network, NN) computing processor. By drawing on the structure of biological neural networks, for example the transmission mode among human brain neurons, it can rapidly process input information and can also continuously learn. Applications such as intelligent recognition of the mobile terminal 300, for example image recognition, face recognition, voice recognition and text understanding, can be realized through the NPU. In an exemplary embodiment, the processes of face detection, face feature extraction, and face ambiguity determination based on the features may be performed by the NPU.
The mobile terminal 300 implements display functions through a GPU, a display screen 390, an application processor, and the like. The GPU is a microprocessor for image processing, connected to the display 390 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 310 may include one or more GPUs that execute program instructions to generate or change display information.
The mobile terminal 300 may implement a photographing function through the ISP, the camera module 391, the video codec, the GPU, the display 390, the application processor, and the like. The ISP is used for processing data fed back by the camera module 391; the camera module 391 is used for capturing still images or videos; the digital signal processor is used for processing digital signals and can process other digital signals besides digital image signals; the video codec is used for compressing or decompressing digital video, and the mobile terminal 300 may support one or more video codecs. In an exemplary embodiment, the process of acquiring video data may be implemented by the ISP, the camera module 391, the video codec, the GPU, the display 390, the application processor, and the like.
In the related art, a single conventional feature is generally adopted to perform blur judgment, and the approaches mainly fall into the following categories:
Firstly, judging by image gradient features, such as the Tenengrad gradient function, Laplacian gradient function, SMD (gray variance) function, SMD2 (gray variance product) function, Brenner gradient function, etc.;
Secondly, judging by a secondary blurring algorithm. Specifically, if an already blurred image is blurred once more, its high-frequency components change little; if a sharp original image is blurred once, its high-frequency components change greatly. For example, as shown in fig. 4, in step S410 a degraded image of the image to be evaluated may be obtained by applying Gaussian blur to the image to be evaluated, and in step S420 the sharpness value is determined by comparing how the adjacent pixel values of the original image and the degraded image change: the smaller the calculated result, the clearer the image, and vice versa. This processing mode may be referred to as a judgment method based on a secondary blurring algorithm.
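As a rough illustration of such a secondary-blur judgment, the following is a minimal sketch using OpenCV and NumPy; the Gaussian kernel size, the horizontal-only neighborhood and the normalization are simplifying assumptions rather than the exact procedure of fig. 4:

```python
import cv2
import numpy as np

def secondary_blur_score(gray: np.ndarray) -> float:
    """Smaller return value -> image is considered clearer (assumed convention)."""
    # Step S410: build a degraded image by Gaussian-blurring the image to be evaluated.
    degraded = cv2.GaussianBlur(gray, (9, 9), 0)

    # Step S420: compare how adjacent-pixel differences change between the original
    # image and the degraded image.
    diff_orig = np.abs(np.diff(gray.astype(np.float32), axis=1))
    diff_degr = np.abs(np.diff(degraded.astype(np.float32), axis=1))

    # A sharp image loses many high-frequency details when re-blurred (large change);
    # an already blurred image changes little.
    change = np.maximum(diff_orig - diff_degr, 0).sum()
    return 1.0 - change / (diff_orig.sum() + 1e-6)
```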
Thirdly, judging by the similarity of gradient structures. For example, the image I to be evaluated may be low-pass filtered to obtain a reference image Ir; gradient information of the images I and Ir is extracted to obtain gradient images G and Gr; then the N image blocks with the richest gradient information in the gradient image G, together with the corresponding first N blocks of the gradient image Gr, are found; finally the no-reference structural sharpness of the image I is calculated from the found N blocks with the richest gradient information.
Fourthly, judging by a machine learning method. For example, a Laplace filter is applied to the input image and the variance of the filtered pixel values is calculated: a high variance (and a high maximum) indicates that edges are clearly visible, i.e. the image is sharp, while a low variance indicates that the image is blurred.
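A minimal sketch of such a Laplacian-variance check is given below, assuming OpenCV; the threshold value is an illustrative assumption that, as discussed next, would in practice depend on the application scenario:

```python
import cv2

def laplacian_variance(gray) -> float:
    # High variance of the Laplacian response -> visible edges -> the image is likely sharp.
    return cv2.Laplacian(gray, cv2.CV_64F).var()

# Hypothetical usage: a manually chosen threshold separates "blurred" from "not blurred".
BLUR_THRESHOLD = 100.0  # illustrative value only
gray = cv2.imread("face.jpg", cv2.IMREAD_GRAYSCALE)
is_blurred = laplacian_variance(gray) < BLUR_THRESHOLD
```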
However, existing image blur detection mainly suffers from the defect that, because the application scenarios of blur detection are so varied, it is difficult to determine whether an image is blurred. Specifically, different thresholds may need to be set manually as blur limits to distinguish blur from non-blur according to the application scenario. In this case, both blur and non-blur require manual thresholding, and if the respective degrees of blur within the blurred range are to be defined as well, this is subject to even greater human subjectivity. Correspondingly, the selection of a deblurring algorithm becomes very difficult.
In addition, there are some methods that perform blur determination based on multi-frame images. For example, in patent application CN1992813A, a reference image with the largest sharpness is first found in the focusing stage of the electronic photographing device and compared with the photographed image to determine the difference in sharpness between them; if the difference is too large, the photographed image is blurred, otherwise, if the difference is small, the photographed image is sharp. For another example, in the patent application with publication number CN106296688A, a set of homonymous points between every two images in an image set is obtained through feature point detection so as to obtain homonymous regions; after Laplace convolution is performed on each homonymous region, its variance is calculated, and the mean of the variances of all homonymous regions in the two images is taken as the characterization quantity of the image set. The blur degree relation between every two images is reflected as a ratio, so as to obtain an ordering of the blur degrees of the images in the set.
However, of these two multi-frame judging modes, one requires the comparison image to be determined manually, and the other can only rank the relative blur degrees of the images within one image set, so an accurate evaluation result cannot be obtained.
Based on one or more of the above problems, the present exemplary embodiment provides a method for evaluating face ambiguity. The face ambiguity evaluation method may be applied to the server 205, or to one or more of the terminal devices 201, 202, 203, which is not particularly limited in the present exemplary embodiment. Referring to fig. 5, the face ambiguity evaluation method may include the following steps S510 to S530:
In step S510, a target video frame including a face is extracted from the video data, and a reference video frame corresponding to the target video frame is acquired from the video data.
Wherein the target video frame may include at least one face.
In an exemplary embodiment, when extracting a target video frame containing a face from the video data, a video frame to be detected may first be extracted from the video data, and face detection may be performed on the video frame to be detected. If it is detected that the video frame to be detected includes at least one face, the video frame to be detected can be determined as the target video frame.
In an exemplary embodiment, after the user opens the camera of the terminal device, video data may be collected in real time through the camera so as to display a preview screen on the screen of the terminal device; when the user performs shooting (presses a shooting key such as the shutter) or performs a specific operation, the video frame corresponding to that moment in the video data is determined as the video frame to be detected. For example, assuming that the video data corresponding to the preview screen during camera shooting is as shown in fig. 6, if the user presses the shutter at the 6th frame, the 6th frame is determined as the video frame to be detected.
It should be noted that, in an exemplary embodiment, the video frame to be detected may also be determined by means other than the above-described photographing or specific operation, which is not particularly limited in this disclosure. For example, any frame in the collected video data can be designated as the video frame to be detected, or a video frame meeting the user's requirement can be selected from the video data as the video frame to be detected.
In an exemplary embodiment, when the reference video frame corresponding to the target video frame is acquired from the video data, the reference video frame may be extracted from the video data based on a preset time condition and the time point of the target video frame in the video data. Specifically, the preset time condition may include the time range corresponding to the reference video frames. For example, referring to fig. 6 and taking the time point of the target video frame in the video data as a reference, the preset time condition may specify the 4 frames preceding the target video frame, i.e. the 2nd to 5th frames, as the reference video frames corresponding to the target video frame; or the 3 frames before and after the target video frame, i.e. the 3rd to 5th frames and the 7th to 9th frames; or the 5 frames following the target video frame, i.e. the 7th to 11th frames.
It should be noted that the preset time condition may be set according to specific requirements. Specifically, with the target video frame as a reference, the reference video frame can be taken forward, backward or both forward and backward in the video data, and the number of extracted reference video frames can be set as required. In addition, since the previous video frame can be generally used for predicting the subsequent video frame, when the reference video frame is acquired, the video frame before the target video frame is generally used as the reference video frame, so that the face ambiguity corresponding to the face included in the target video frame can be more accurately evaluated.
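The following sketch illustrates one possible way to pick reference frames under such a preset time condition; it assumes, for simplicity, that decoded frames are held in a list and that the condition is expressed as "take N frames before and M frames after the target frame":

```python
def select_reference_frames(frames, target_idx, num_before=4, num_after=0):
    """Return the reference frames around the target frame according to a preset time condition.

    frames     : list of decoded video frames, in temporal order
    target_idx : index of the target video frame in `frames`
    num_before : how many frames to take before the target frame
    num_after  : how many frames to take after the target frame
    """
    start = max(0, target_idx - num_before)
    before = frames[start:target_idx]
    after = frames[target_idx + 1:target_idx + 1 + num_after]
    return before + after

# e.g. with the 6th frame (index 5) as target and 4 preceding frames as references,
# this returns frames 2-5, matching the example of fig. 6.
```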
In step S520, face pose features, face similarity features, and face gradient features corresponding to the faces included in the target video frame are calculated based on the target video frame and the reference video frame.
The face pose features may include a combination of one or more features characterizing the face pose across the multiple frames; for example, the face pose features may include the variation of the face pose in each frame relative to the previous frame. The face similarity features may include features characterizing the similarity of the face across the multiple frames; for example, they may include the average similarity of the image region corresponding to the face in each frame relative to the previous frame. The face gradient features may include features characterizing the gradient of the face across the multiple frames; for example, they may include the average gradient difference of the image region corresponding to the face in each frame relative to the previous frame.
In an exemplary embodiment, after the target video frame and the reference video frames are determined, the face pose feature corresponding to each face in the target video frame may be calculated based on the target video frame and the reference video frames. Specifically, the target video frame and the reference video frames are ordered according to their sequence in the video data to obtain a video frame sequence; meanwhile, for each face, the head poses corresponding to the face in the target video frame and in the reference video frames are calculated respectively; then, in the order of the video frame sequence, the pose change amount of each video frame relative to the previous video frame is calculated; and the face pose feature corresponding to each face is determined based on the pose change amounts.
The head pose may include a combination of one or more of the following feature data characterizing the head pose in each video frame: the pitch angle (pitch), the yaw angle (yaw), the roll angle (roll), the two-dimensional coordinates of the face key points in the target video frame and the reference video frames, and so on.
When the number of reference video frames is greater than 1, that is, when multiple frames are extracted as reference video frames, the pose change amounts corresponding to the multiple frames can be directly subjected to dimension-reduction processing to obtain the face pose feature. For example, a dimension-reduction algorithm such as PCA (Principal Component Analysis) can be used to reduce the dimension of the obtained pose change amounts to obtain the face pose feature.
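As a sketch of how the pose change amounts might be stacked and reduced with PCA, assuming scikit-learn and assuming that each head pose is represented as a 4-component vector (pitch, yaw, roll, averaged key-point coordinate, as discussed further below), the following is an illustrative reading of this step rather than the exact implementation:

```python
import numpy as np
from sklearn.decomposition import PCA

def face_pose_feature(head_poses: np.ndarray) -> np.ndarray:
    """head_poses: (num_frames, 4) array, one row per frame of the video frame sequence,
    each row being [pitch, yaw, roll, averaged key-point coordinate] (assumed layout)."""
    # Pose change amount of each frame relative to the previous frame.
    pose_changes = np.diff(head_poses, axis=0)       # shape: (num_frames - 1, 4)

    # Reduce the stacked change amounts to a one-dimensional vector with PCA.
    pca = PCA(n_components=1)
    return pca.fit_transform(pose_changes).ravel()   # shape: (num_frames - 1,)
```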
In an exemplary embodiment, after the target video frame and the reference video frames are determined, the face similarity feature corresponding to each face may be calculated based on the target video frame and the reference video frames. Specifically, for each face, the target video frame and the reference video frames may be ordered according to their sequence in the video data to obtain a video frame sequence; then, in the order of the video frame sequence, the similarity value of the image area corresponding to the face in each video frame relative to the image area corresponding to the face in the previous video frame is calculated, and the face similarity feature is determined based on the similarity values.
In an exemplary embodiment, after the video frame sequence is obtained, a histogram comparison method may be used directly: first, the histogram data of the image area corresponding to the face in a video frame and of the image area corresponding to the same face in the previous video frame are obtained, and then the two histograms are compared to obtain the similarity value corresponding to the face. In addition, before the histogram comparison, the video frame and the previous video frame may be normalized. It should be noted that other image similarity measures may also be used to calculate the similarity value.
In an exemplary embodiment, when determining the face similarity feature based on the similarity values, for each face, an average value may be calculated for all the similarity values calculated according to the target video frame and the reference video frame, and the calculated average value may be determined as the face similarity feature corresponding to the face. It should be noted that, when determining the face similarity feature, other manners may be used to determine the face similarity feature besides taking an average value.
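A minimal sketch of this histogram-comparison similarity is given below, assuming OpenCV and grayscale face crops; the correlation metric and the resize-based normalization are assumptions made for illustration:

```python
import cv2
import numpy as np

def face_similarity_feature(face_regions) -> float:
    """face_regions: list of grayscale (uint8) face crops, one per frame, in sequence order."""
    sims = []
    for prev, cur in zip(face_regions, face_regions[1:]):
        # Normalize the two regions to a common size before comparison (assumed step).
        prev = cv2.resize(prev, (128, 128))
        cur = cv2.resize(cur, (128, 128))
        h_prev = cv2.calcHist([prev], [0], None, [256], [0, 256])
        h_cur = cv2.calcHist([cur], [0], None, [256], [0, 256])
        # Correlation-based histogram comparison as the similarity value.
        sims.append(cv2.compareHist(h_prev, h_cur, cv2.HISTCMP_CORREL))
    # The face similarity feature is the average of the per-frame similarity values.
    return float(np.mean(sims))
```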
In an exemplary embodiment, after the target video frame and the reference video frames are determined, the face gradient feature corresponding to each face may be calculated based on the target video frame and the reference video frames. Specifically, for each face, the target video frame and the reference video frames may be ordered according to their sequence in the video data to obtain a video frame sequence; then, in the order of the video frame sequence, the gradient difference between the image area corresponding to the face in each video frame and the image area corresponding to the face in the previous video frame is calculated, and the face gradient feature is determined based on the gradient differences.
In calculating the gradient, gradient functions such as the Tenengrad gradient function, Laplacian gradient function, SMD (gray variance) function, SMD2 (gray variance product) function and Brenner gradient function may be selected, which is not particularly limited in the present disclosure.
In an exemplary embodiment, when determining the gradient characteristics of the face based on the gradient differences, for each face, an average value may be calculated for all gradient differences calculated from the target video frame and the reference video frame, and the calculated average value may be determined as the gradient characteristics of the face corresponding to the face. It should be noted that, in determining the gradient feature of the face, other manners may be used to determine the gradient feature of the face besides taking an average value.
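A sketch of the gradient-difference feature using the Laplacian as the gradient function (one of the options listed above) is shown below; the use of the mean absolute Laplacian response per region is an illustrative assumption:

```python
import cv2
import numpy as np

def face_gradient_feature(face_regions) -> float:
    """face_regions: list of grayscale face crops, one per frame, in sequence order."""
    def grad_energy(region):
        # Laplacian gradient function; Tenengrad, SMD, SMD2, Brenner, etc. could be used instead.
        return float(np.mean(np.abs(cv2.Laplacian(region, cv2.CV_64F))))

    diffs = [abs(grad_energy(cur) - grad_energy(prev))
             for prev, cur in zip(face_regions, face_regions[1:])]
    # The face gradient feature is the average gradient difference across the sequence.
    return float(np.mean(diffs))
```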
In step S530, the face pose features, the face similarity features and the face gradient features corresponding to the faces are fused to obtain fusion features, and the face ambiguity corresponding to the faces is determined based on the fusion features corresponding to the faces.
In an exemplary embodiment, after the face pose feature, the face similarity feature and the face gradient feature are obtained, the three features may be fused by splicing them together to obtain the fusion feature. The three features may be spliced in any order, which is not particularly limited in the present disclosure.
In an exemplary embodiment, when determining the face ambiguity corresponding to each face based on the fusion feature corresponding to the face, regression prediction may be performed on the fusion feature corresponding to each face according to a preset regression algorithm, so as to obtain the face ambiguity corresponding to each face. The preset regression algorithm may include various types of regression algorithms, such as support vector machine regression, linear regression, logistic regression, and the like, which are not limited by the present disclosure.
The preset regression algorithm may include a preset regression model. Correspondingly, before regression prediction is performed on the fusion features through the preset regression algorithm, sample video frames and the reference video frames corresponding to the sample video frames can be acquired, the face ambiguity corresponding to each sample video frame is labeled, and the regression model is then trained based on the sample video frames, the corresponding reference video frames and the labeled face ambiguities. The regression model may include a support vector machine, a fully connected neural network, and the like, which is not particularly limited in the present disclosure. The sample video frames and their corresponding reference video frames are training samples collected in advance for model training, and the face ambiguity corresponding to a sample video frame is a label applied to that sample video frame in advance; supervised learning of the regression model can be achieved through these training samples and labels.
Specifically, referring to fig. 7: in step S710, the above calculation of the face pose feature, the face similarity feature and the face gradient feature is performed based on a sample video frame and its corresponding reference video frames; in step S720, the obtained face pose feature, face similarity feature and face gradient feature are fused to obtain a fusion feature; in step S730, the fusion feature is input into the regression model, and the weight parameters of the regression model are updated through back propagation using the labeled face ambiguity of the sample video frame and the loss function, thereby obtaining a trained regression model, i.e. the preset regression model.
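The training loop of fig. 7 could be sketched roughly as follows with a support vector regressor, assuming scikit-learn; `extract_fusion_feature` stands for the feature computation and fusion of steps S710-S720 and is a hypothetical helper, not a function defined by the disclosure. Note that an SVR is fitted directly rather than updated by back propagation, which would instead apply to the neural-network variant of the regression model:

```python
import numpy as np
from sklearn.svm import SVR

def train_blur_regressor(samples, labels, extract_fusion_feature):
    """samples: list of (sample_video_frame, reference_video_frames) pairs.
    labels : list of labeled face-blur degrees for the sample video frames.
    extract_fusion_feature: hypothetical helper implementing steps S710-S720.
    """
    X = np.stack([extract_fusion_feature(frame, refs) for frame, refs in samples])
    y = np.asarray(labels, dtype=np.float64)

    # Fit the preset regression model on fusion features and labeled blur degrees.
    model = SVR(kernel="rbf")
    model.fit(X, y)
    return model
```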
The following takes the case where the target video frame includes only one face and the preceding 4 frames are taken as reference video frames as an example, and describes the technical solution of the embodiment of the disclosure in detail with reference to fig. 8:
In step S801, after the 6th frame of the video data is determined as the video frame to be detected, face detection is performed on the 6th frame.
In step S803, it is determined whether a face exists in the 6th frame.
In step S805, when at least one face is included in the 6th frame, the 6th frame may be determined as the target video frame.
In step S807, after the target video frame is determined, 4 frames may be taken forward as reference video frames according to the preset time condition, i.e. the 2nd to 5th frames are taken as reference video frames.
Assume that the head pose of the user includes the pitch angle (pitch), yaw angle (yaw) and roll angle (roll), as well as the two-dimensional coordinates of the face key points in the target video frame and the reference video frames.
In step S809, the target video frame and the reference video frame are ordered to obtain a video frame sequence.
Specifically, it is assumed that the target video frame is the 6th frame of the video data, and the reference video frames are obtained by taking 4 frames forward in the video data, i.e. the 2nd to 5th frames. The resulting video frame sequence is the 2nd, 3rd, 4th, 5th and 6th frames.
In step S811, the head pose is calculated.
For the target video frame or a reference video frame, the 68 two-dimensional face key points contained in the video frame are first extracted; a rotation vector is calculated based on these two-dimensional face key points, the rotation matrix corresponding to the rotation vector is computed, and the Euler angles, i.e. the pitch angle, yaw angle and roll angle, are then obtained. Based on the above process, the head pose of each face can be obtained as shown in Equation 1:
Face = {Pitch, Yaw, Roll, Keyp}    (Equation 1)
Wherein Pitch, yaw, roll represents the pitch angle, yaw angle and roll angle of the corresponding head of the face, and Keyp represents the two-dimensional coordinates of the key points of 68 faces.
After the head poses corresponding to the target video frame and all the reference video frames are obtained, the head poses of all the video frames may be expressed, in the order of the video frame sequence, as the following Equation 2:
face = {Face_2, Face_3, Face_4, Face_5, Face_6}    (Equation 2)
Wherein Face_2, Face_3, Face_4, Face_5 and Face_6 represent the head poses in the 2nd to 6th video frames, respectively.
In step S813, the amount of change in pose between every two frames is calculated in the order of the video frame sequence.
Specifically, the pose change amounts may be calculated from the face set in Equation 2 based on the following Equation 3:
F_motion_i = Face_i(t) − Face_{i−1}(t−1),  i = 3, 4, 5, 6    (Equation 3)
Wherein F_motion_i represents the pose change amount between the head pose corresponding to the face in the i-th video frame of the video frame sequence and the head pose in the previous video frame, t represents time, and mean() represents taking an average value.
In step S815, PCA dimension reduction is performed on the pose change amounts.
It should be noted that, since there are a plurality of face key point coordinates, the average value of the face key point coordinates may be calculated during the computation to simplify the process. When 5 video frames participate in the calculation, the above formula yields 4 one-dimensional vectors, which correspondingly form a 4×4 matrix F_motion. To further simplify the calculation, the 4×4 matrix can be reduced in dimension based on PCA to obtain a one-dimensional vector, which can be expressed by the following Equation 4:
Fm_simi = PCA(F_motion)    (Equation 4)
Wherein Fm_simi represents the face pose feature after dimension reduction, F_motion represents the pose change matrix before dimension reduction, and PCA() represents the PCA dimension-reduction algorithm.
In step S817, similarity calculation is performed based on the video frame sequence determined in step S809, and face similarity characteristics are determined.
Specifically, the calculation can be performed by the following formulas 5 and 6:
hist_simi_i = Histogram(face_i, face_{i−1})    (Equation 5)
H_simi = mean(hist_simi)    (Equation 6)
Wherein hist_simi_i represents the similarity value of the image region corresponding to the face in the i-th video frame of the video frame sequence relative to the image region corresponding to the face in the previous video frame, Histogram(face_i, face_{i−1}) represents the histogram comparison between the image region corresponding to the face in the i-th video frame and that in the (i−1)-th video frame, mean() represents taking the average value, and H_simi represents the face similarity feature.
In step S819, gradient calculations are performed and face gradient features are determined based on the video frame sequence determined in step S809.
Specifically, the calculation can be performed by the following formulas 7 and 8:
grad_simi_i = Grad(face_i, face_{i−1})    (Equation 7)
G_simi = mean(grad_simi)    (Equation 8)
Wherein grad_simi_i represents the gradient difference of the image region corresponding to the face in the i-th video frame of the video frame sequence relative to the image region corresponding to the face in the previous video frame, Grad(face_i, face_{i−1}) represents the gradient difference between the image region corresponding to the face in the i-th video frame and that in the (i−1)-th video frame, mean() represents taking the average value, and G_simi represents the face gradient feature.
In step S821, feature fusion is performed on the face pose feature, the face similarity feature, and the face gradient feature.
After the above face pose features, face similarity features and face gradient features are obtained, feature stitching may be performed based on the following formula 9 to obtain fusion features corresponding to the face:
Feature = {Fm_simi, H_simi, G_simi}    (Equation 9)
Wherein Fm_simi represents the face pose feature after dimension reduction, H_simi represents the face similarity feature, and G_simi represents the face gradient feature.
In step S823, regression prediction is performed based on the fusion features of the face to obtain a face ambiguity corresponding to the face.
After the fusion feature is obtained, it is input into the pre-trained preset support vector machine model for regression prediction, so as to obtain a regression prediction value, i.e. the face ambiguity.
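Putting steps S821 and S823 together, a sketch of the fusion and prediction might look like the following; the feature helpers from the earlier sketches and the trained `model` from the training sketch are assumed to be available:

```python
import numpy as np

def predict_face_blur(model, fm_simi, h_simi, g_simi) -> float:
    """model   : preset (pre-trained) support vector machine regressor
    fm_simi : reduced face pose feature (one-dimensional vector, Equation 4)
    h_simi  : face similarity feature (scalar, Equation 6)
    g_simi  : face gradient feature (scalar, Equation 8)
    """
    # Equation 9: splice the three features into one fusion feature vector.
    feature = np.concatenate([np.atleast_1d(fm_simi),
                              np.atleast_1d(h_simi),
                              np.atleast_1d(g_simi)])
    # Regression prediction: the returned value is taken as the face blur degree.
    return float(model.predict(feature.reshape(1, -1))[0])
```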
It should be noted that the preset support vector machine model may be obtained by training a support vector machine with sample video frames, the reference video frames corresponding to the sample video frames, and the face ambiguity labels corresponding to the sample video frames. During training, the face pose feature, face similarity feature and face gradient feature are calculated based on a sample video frame and its reference video frames, the three features are spliced, and the parameters of the support vector machine are updated according to the fusion feature and the face ambiguity label corresponding to the sample video frame, so as to obtain the preset support vector machine model.
In summary, in the present exemplary embodiment, both motion-related and image-quality-related blur of the face are evaluated through multiple frames and their semantic information, which improves the accuracy of the face blur evaluation result. Conventional schemes can only distinguish blur caused either by motion or by poor imaging quality, whereas blur caused by rigid motion and blur related to image quality may exist on a face at the same time; therefore, the evaluation of the face blur degree is performed here by fusing the three features.
In addition, it is difficult for conventional features, deep learning alone, or schemes combining conventional features with deep learning to achieve sufficient accuracy; they can only identify clearly sharp or severely blurred images. Moreover, deep learning schemes are difficult to deploy on mobile terminals. Performing regression prediction on the features based on machine learning improves the accuracy of face ambiguity evaluation, is easy to deploy on a mobile terminal, provides more accurate blur semantic information for portrait photographing scenes, and assists subsequent image quality algorithms in achieving the optimal effect.
It is noted that the above-described figures are merely schematic illustrations of processes involved in a method according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
Further, referring to fig. 9, in this exemplary embodiment, a face ambiguity evaluation apparatus 900 is further provided, which includes a video frame acquisition module 910, a feature calculation module 920, and an ambiguity evaluation module 930. Wherein:
the video frame acquisition module 910 may be configured to extract a target video frame containing a face from video data, and acquire a reference video frame corresponding to the target video frame from the video data.
The feature calculation module 920 may be configured to calculate, based on the target video frame and the reference video frame, a face pose feature, a face similarity feature, and a face gradient feature corresponding to each face included in the target video frame.
The ambiguity evaluation module 930 may be configured to fuse the face pose feature, the face similarity feature, and the face gradient feature corresponding to each face to obtain fusion features, and determine the face ambiguity corresponding to each face based on the fusion features corresponding to each face.
In an exemplary embodiment, the feature calculation module 920 may be configured to calculate, for each face, a head pose in a target video frame and a reference video frame, sort the target video frame and the reference video frame based on an order of the target video frame and the reference video frame in video data, to obtain a video frame sequence, calculate, for each face, a pose change amount of each video frame relative to a previous video frame based on a head pose corresponding to each video frame, and determine a face pose feature corresponding to each face based on the pose change amount.
In an exemplary embodiment, the feature calculation module 920 may be configured to perform a dimension reduction process on the pose variation to obtain the face pose feature.
In an exemplary embodiment, the feature calculation module 920 may be configured to sort the target video frame and the reference video frame based on the order of the target video frame and the reference video frame in the video data to obtain a video frame sequence, respectively calculate, for each face in the video frame sequence, a similarity value of an image area corresponding to the face in each video frame relative to an image area corresponding to the face in a previous video frame, and determine a face similarity feature corresponding to each face based on the similarity values.
In an exemplary embodiment, the feature calculation module 920 may be configured to sort the target video frame and the reference video frame based on the order of the target video frame and the reference video frame in the video data, to obtain a video frame sequence, respectively calculate, for each face, a gradient difference of an image area corresponding to the face in each video frame relative to an image area corresponding to the face in a previous video frame, and determine a face gradient feature based on the gradient difference.
In an exemplary embodiment, the ambiguity evaluation module 930 may be configured to perform regression prediction on the fusion features corresponding to each face through a preset regression algorithm, so as to obtain a face ambiguity corresponding to each face.
In an exemplary embodiment, the ambiguity evaluation module 930 may be configured to train the regression model based on the sample video frame, the reference video frame corresponding to the sample video frame, and the face ambiguity corresponding to the sample frame, to obtain a preset regression model.
In an exemplary embodiment, the video frame acquisition module 910 may be configured to extract, from the video data, a reference video frame corresponding to the target video frame based on a preset time condition and a time point of the target video frame in the video data.
In an exemplary embodiment, the video frame acquisition module 910 may be configured to extract a video frame to be detected from the video data, perform face detection on the video frame to be detected, and, when the video frame to be detected includes at least one face, determine the video frame to be detected as the target video frame.
The specific details of each module in the above apparatus are already described in the method section, and the details that are not disclosed can be referred to the embodiment of the method section, so that they will not be described in detail.
Those skilled in the art will appreciate that the various aspects of the present disclosure may be implemented as a system, method, or program product. Accordingly, aspects of the present disclosure may be embodied in the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.) or an embodiment combining hardware and software aspects, which may be referred to herein collectively as a "circuit," "module," or "system."
Exemplary embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a program product capable of implementing the method described above in the present specification. In some possible implementations, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the disclosure as described in the "exemplary methods" section of this specification, when the program product is run on the terminal device, e.g. any one or more of the steps of fig. 5, 7 and 8 may be carried out.
It should be noted that the computer readable medium shown in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Furthermore, the program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++ as well as conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, it may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (for example, via the Internet using an Internet service provider).
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
Claims (11)
1. A method for evaluating face blur, characterized by comprising the following steps:
extracting a target video frame containing a human face from video data, and acquiring reference video frames corresponding to the target video frame from the video data, wherein the number of the reference video frames is greater than 1;
calculating, based on the target video frame and the reference video frames, face pose features, face similarity features and face gradient features corresponding to each face contained in the target video frame;
fusing, for each face, the corresponding face pose features, face similarity features and face gradient features to obtain a fusion feature, and determining the face blur degree corresponding to each face based on the fusion feature corresponding to that face;
wherein acquiring the reference video frames corresponding to the target video frame from the video data comprises:
taking the reference video frames forward, backward, or both forward and backward in the video data, using the target video frame as a reference, based on a preset time condition and the time point of the target video frame in the video data.
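As a rough illustration of how a target frame and its reference frames might be gathered, the following Python sketch reads frames before and after a chosen frame index with OpenCV. The frame index, the number of references per side, and the stride are illustrative stand-ins for the claim's "preset time condition"; none of these names or values come from the patent.

```python
import cv2

def gather_frames(video_path, target_index, refs_per_side=3, stride=2):
    """Read the target frame and reference frames taken before and after it.
    refs_per_side and stride are hypothetical parameters standing in for
    the 'preset time condition' of the claim."""
    cap = cv2.VideoCapture(video_path)
    wanted = [target_index + s * stride
              for s in range(-refs_per_side, refs_per_side + 1)]
    frames = {}
    for idx in wanted:
        if idx < 0:
            continue
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the frame index
        ok, frame = cap.read()
        if ok:
            frames[idx] = frame
    cap.release()
    target = frames.pop(target_index, None)
    references = [frames[i] for i in sorted(frames)]  # keep temporal order
    return target, references
```

Selecting by frame index is equivalent to selecting by time once the frame rate is known; a time-based variant would seek with CAP_PROP_POS_MSEC instead.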
2. The method of claim 1, wherein calculating the face pose features corresponding to each face based on the target video frame and the reference video frames comprises:
ordering the target video frame and the reference video frames according to their order in the video data to obtain a video frame sequence;
for each face, calculating the head pose in the target video frame and in each reference video frame;
in the video frame sequence, for each face, calculating the pose change amount of each video frame relative to the previous video frame based on the head pose corresponding to each video frame;
and determining the face pose features corresponding to each face based on the pose change amounts.
3. The method of claim 2, wherein determining the face pose features corresponding to each face based on the pose change amounts comprises:
performing dimension reduction on the pose change amounts to obtain the face pose features.
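A minimal sketch of claims 2 and 3, assuming the head pose (yaw, pitch, roll) of each frame has already been estimated by some external means (for example facial landmarks plus solvePnP, not shown here). The frame-to-frame pose change amounts are simple absolute differences, and PCA is used as one possible realization of the "dimension reduction" step; neither the angle representation nor the choice of PCA is fixed by the claims.

```python
import numpy as np
from sklearn.decomposition import PCA

def pose_change_amounts(head_poses):
    """head_poses: (num_frames, 3) array of (yaw, pitch, roll) per frame,
    in the order of the video frame sequence. Returns the flattened
    frame-to-frame pose change amounts."""
    poses = np.asarray(head_poses, dtype=np.float32)
    return np.abs(np.diff(poses, axis=0)).ravel()

def fit_pose_pca(training_change_vectors, n_components=4):
    """Dimension reduction fitted offline on pose-change vectors collected
    from many clips (hypothetical training data); all vectors are assumed
    to come from sequences of the same length."""
    pca = PCA(n_components=n_components)
    pca.fit(np.stack(training_change_vectors))
    return pca

def pose_feature(pca, change_vector):
    """Project one face's pose-change vector onto the reduced space."""
    return pca.transform(change_vector.reshape(1, -1))[0]
```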
4. The method of claim 1, wherein calculating the face similarity features based on the target video frame and the reference video frames comprises:
ordering the target video frame and the reference video frames according to their order in the video data to obtain a video frame sequence;
in the video frame sequence, for each face, calculating a similarity value of the image region corresponding to the face in each video frame relative to the image region corresponding to the face in the previous video frame;
and determining the face similarity features corresponding to each face based on the similarity values.
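Claim 4 does not fix a particular similarity measure. One plausible choice, sketched below with OpenCV, is the correlation between grayscale histograms of the face's image region in consecutive frames; blurred or fast-moving faces tend to produce lower correlations.

```python
import cv2
import numpy as np

def crop_similarity(prev_crop, cur_crop):
    """Histogram-correlation similarity between the same face's image
    region in two consecutive frames (an illustrative metric only)."""
    h_prev = cv2.calcHist([cv2.cvtColor(prev_crop, cv2.COLOR_BGR2GRAY)],
                          [0], None, [64], [0, 256])
    h_cur = cv2.calcHist([cv2.cvtColor(cur_crop, cv2.COLOR_BGR2GRAY)],
                         [0], None, [64], [0, 256])
    cv2.normalize(h_prev, h_prev)
    cv2.normalize(h_cur, h_cur)
    return cv2.compareHist(h_prev, h_cur, cv2.HISTCMP_CORREL)

def similarity_feature(face_crops):
    """face_crops: the same face's image region in each frame of the
    ordered sequence. Returns one similarity value per consecutive pair."""
    return np.array([crop_similarity(a, b)
                     for a, b in zip(face_crops[:-1], face_crops[1:])],
                    dtype=np.float32)
```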
5. The method of claim 1, wherein calculating the face gradient features based on the target video frame and the reference video frames comprises:
ordering the target video frame and the reference video frames according to their order in the video data to obtain a video frame sequence;
in the video frame sequence, for each face, calculating the gradient difference of the image region corresponding to the face in each video frame relative to the image region corresponding to the face in the previous video frame;
and determining the face gradient features based on the gradient differences.
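For claim 5, one plausible reading is to summarize each face region by its average gradient magnitude and take the difference between consecutive frames; sharper crops tend to have larger gradients, so the differences reflect how the apparent sharpness changes across the sequence. The Sobel operator below is an illustrative choice, not an operator named by the patent.

```python
import cv2
import numpy as np

def mean_gradient(crop):
    """Average Sobel gradient magnitude of one face crop."""
    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    return float(np.mean(cv2.magnitude(gx, gy)))

def gradient_feature(face_crops):
    """Gradient difference of each frame's face region relative to the
    previous frame's, over the ordered video frame sequence."""
    grads = np.array([mean_gradient(c) for c in face_crops], dtype=np.float32)
    return np.abs(np.diff(grads))
```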
6. The method of claim 1, wherein determining the face blur degree corresponding to each face based on the fusion feature corresponding to each face comprises:
performing regression prediction on the fusion feature corresponding to each face through a preset regression algorithm to obtain the face blur degree corresponding to each face.
7. The method of claim 6, wherein the preset regression algorithm comprises a preset regression model, and the method further comprises, before performing regression prediction on the fusion features through the preset regression algorithm:
training a regression model based on sample video frames, reference video frames corresponding to the sample video frames, and the face blur degree corresponding to each sample video frame, to obtain the preset regression model.
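Claims 6 and 7 leave both the fusion operation and the "preset regression algorithm" open. A minimal sketch, assuming fusion is a simple concatenation of the three per-face feature vectors and using scikit-learn's gradient boosting regressor purely as a stand-in model trained on hypothetical annotated samples:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fuse(pose_feat, sim_feat, grad_feat):
    """Illustrative fusion: concatenate the three per-face feature vectors.
    The concatenation order must stay the same for training and prediction."""
    return np.concatenate([np.ravel(pose_feat),
                           np.ravel(sim_feat),
                           np.ravel(grad_feat)])

def train_blur_regressor(fused_features, blur_labels):
    """Fit a regressor mapping fused features to face blur degrees
    (hypothetical training data built from sample video frames)."""
    model = GradientBoostingRegressor()
    model.fit(np.stack(fused_features),
              np.asarray(blur_labels, dtype=np.float32))
    return model

def predict_blur(model, fused_feature):
    """Regression prediction of the face blur degree for one face."""
    return float(model.predict(fused_feature.reshape(1, -1))[0])
```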
8. The method of claim 1, wherein extracting the target video frame containing a human face from the video data comprises:
extracting a video frame to be detected from the video data, and performing face detection on the video frame to be detected;
and when the video frame to be detected contains at least one human face, determining the video frame to be detected as the target video frame.
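The target-frame selection of claim 8 only needs a detector that reports whether at least one face is present in a candidate frame. The sketch below uses the Haar-cascade face detector bundled with OpenCV purely as a stand-in; any face detector would serve.

```python
import cv2

# Frontal-face Haar cascade shipped with the opencv-python package.
_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def is_target_frame(frame):
    """Return True if the candidate frame contains at least one face."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = _detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return len(faces) > 0
```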
9. A device for evaluating face blur, comprising:
a video frame acquisition module, configured to extract a target video frame containing a human face from video data and to acquire reference video frames corresponding to the target video frame from the video data, wherein the number of the reference video frames is greater than 1;
a feature calculation module, configured to calculate, based on the target video frame and the reference video frames, face pose features, face similarity features and face gradient features corresponding to each face contained in the target video frame;
a blur evaluation module, configured to fuse, for each face, the corresponding face pose features, face similarity features and face gradient features to obtain a fusion feature, and to determine the face blur degree corresponding to each face based on the fusion feature corresponding to that face;
wherein the video frame acquisition module is specifically configured to take the reference video frames forward, backward, or both forward and backward in the video data, using the target video frame as a reference, based on a preset time condition and the time point of the target video frame in the video data.
10. A computer-readable medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 8.
11. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any one of claims 1 to 8 via execution of the executable instructions.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110524303.3A CN113283319B (en) | 2021-05-13 | 2021-05-13 | Method, device, medium and electronic device for evaluating face blur |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110524303.3A CN113283319B (en) | 2021-05-13 | 2021-05-13 | Method, device, medium and electronic device for evaluating face blur |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN113283319A CN113283319A (en) | 2021-08-20 |
| CN113283319B (en) | 2025-03-07 |
Family
ID=77278900
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110524303.3A Active CN113283319B (en) | 2021-05-13 | 2021-05-13 | Method, device, medium and electronic device for evaluating face blur |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN113283319B (en) |
Families Citing this family (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114120404A (en) * | 2021-11-16 | 2022-03-01 | 北京百度网讯科技有限公司 | Face recognition method and device |
| CN114708633B (en) * | 2022-03-29 | 2025-04-29 | 稿定(厦门)科技有限公司 | Face image detection model training method, face image detection method and device |
| CN115170441B (en) * | 2022-08-30 | 2023-02-07 | 荣耀终端有限公司 | Image processing method and electronic equipment |
| CN116033259B (en) * | 2022-12-20 | 2024-07-02 | 浙江力石科技股份有限公司 | Method, device, computer equipment and storage medium for generating short video |
| CN120236302B (en) * | 2025-05-29 | 2025-09-26 | 上海市青浦区疾病预防控制中心(上海市青浦区卫生健康监督所) | Image-driven tick detection comb fused with deep learning model |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110276277A (en) * | 2019-06-03 | 2019-09-24 | 罗普特科技集团股份有限公司 | Method and apparatus for detecting facial image |
| CN110619628A (en) * | 2019-09-09 | 2019-12-27 | 博云视觉(北京)科技有限公司 | Human face image quality evaluation method |
| CN111079701A (en) * | 2019-12-30 | 2020-04-28 | 河南中原大数据研究院有限公司 | Face anti-counterfeiting method based on image quality |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110084258A (en) * | 2018-02-12 | 2019-08-02 | 成都视观天下科技有限公司 | Face preferred method, equipment and storage medium based on video human face identification |
| CN109670444B (en) * | 2018-12-18 | 2019-12-24 | 北京字节跳动网络技术有限公司 | Attitude detection model generation method, attitude detection device, attitude detection equipment and attitude detection medium |
| CN112183162A (en) * | 2019-07-04 | 2021-01-05 | 北京航天长峰科技工业集团有限公司 | A system and method for automatic face registration and recognition in surveillance scenarios |
- 2021-05-13: CN application CN202110524303.3A filed, resulting in patent CN113283319B (en), status Active
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110276277A (en) * | 2019-06-03 | 2019-09-24 | 罗普特科技集团股份有限公司 | Method and apparatus for detecting facial image |
| CN110619628A (en) * | 2019-09-09 | 2019-12-27 | 博云视觉(北京)科技有限公司 | Human face image quality evaluation method |
| CN111079701A (en) * | 2019-12-30 | 2020-04-28 | 河南中原大数据研究院有限公司 | Face anti-counterfeiting method based on image quality |
Also Published As
| Publication number | Publication date |
|---|---|
| CN113283319A (en) | 2021-08-20 |
Similar Documents
| Publication | Title |
|---|---|
| CN113283319B (en) | Method, device, medium and electronic device for evaluating face blur |
| Li et al. | PDR-Net: Perception-inspired single image dehazing network with refinement |
| US11276177B1 | Segmentation for image effects |
| CN108269254B (en) | Image quality evaluation method and device |
| CN109543714B (en) | Data feature acquisition method and device, electronic equipment and storage medium |
| EP3008696B1 (en) | Tracker assisted image capture |
| CN112906492B (en) | Video scene processing method, device, equipment and medium |
| CN112802033B (en) | Image processing method and device, computer readable storage medium and electronic equipment |
| CN110062157B (en) | Method and device for rendering image, electronic equipment and computer readable storage medium |
| CN113902636A (en) | Image deblurring method and device, computer readable medium and electronic equipment |
| CN113362260B (en) | Image optimization method and device, storage medium and electronic device |
| CN114390201A (en) | Focusing method and device thereof |
| Wei et al. | Dynamic scene deblurring and image de-raining based on generative adversarial networks and transfer learning for internet of vehicle |
| CN113920023B (en) | Image processing method and device, computer readable medium and electronic device |
| CN113610724B (en) | Image optimization method and device, storage medium and electronic device |
| CN109934045B (en) | Pedestrian detection method and device |
| CN113409331A (en) | Image processing method, image processing apparatus, terminal, and readable storage medium |
| CN112950516A (en) | Method and device for enhancing local contrast of image, storage medium and electronic equipment |
| Hu et al. | Spatiotemporal saliency detection and salient region determination for H.264 videos |
| CN116012609A (en) | Multi-target tracking method, device, electronic equipment and medium for looking around fish eyes |
| Gurrala et al. | Enhancing Safety and Security: Face Tracking and Detection in Dehazed Video Frames Using KLT and Viola-Jones Algorithms |
| CN115115848A (en) | An image processing method, device, electronic device and storage medium |
| CN113240602A (en) | Image defogging method and device, computer readable medium and electronic equipment |
| JP2021039647A (en) | Image data classification device and image data classification method |
| CN113887526B (en) | Pedestrian image recognition method and device |
Legal Events
| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |