CN114724218A - Video detection method, apparatus, device and medium
- Publication number: CN114724218A (application CN202210369060.5A)
- Authority: CN (China)
- Prior art keywords: image, video, attention, face, features
- Legal status: Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
Description
Technical Field
The present disclosure relates to the technical field of video processing, and in particular to a video detection method, apparatus, device, and medium.
Background
A fake face video is a video in which a face appearing in the video content, such as a human or animal face, has been tampered with by a deep forgery (Deepfake) algorithm.
Therefore, how to accurately detect fake face videos is a technical problem that urgently needs to be solved.
Summary of the Invention
In order to solve the above technical problems, the present disclosure provides a video detection method, apparatus, device, and medium.
In a first aspect, the present disclosure provides a video detection method, including:
acquiring an image sequence to be detected, the image sequence including at least two video frames from the same video;
for each image in the image sequence, performing nonlinear transformation processing on the facial features of the image to obtain attention features of multiple face regions corresponding to the image;
constructing, based on the attention features of the multiple face regions corresponding to each image, temporal relationship features between the multiple face regions corresponding to the image sequence; and
calculating, based on the temporal relationship features, the probability that the video is a fake face video.
In a second aspect, the present disclosure provides a video detection apparatus, including:
an image acquisition module, configured to acquire an image sequence to be detected, the image sequence including at least two video frames from the same video;
a nonlinear transformation module, configured to perform, for each image in the image sequence, nonlinear transformation processing on the facial features of the image to obtain attention features of multiple face regions corresponding to the image;
a feature construction module, configured to construct, based on the attention features of the multiple face regions corresponding to each image, temporal relationship features between the multiple face regions corresponding to the image sequence; and
a probability calculation module, configured to calculate, based on the temporal relationship features, the probability that the video is a fake face video.
In a third aspect, the present disclosure provides a video detection device, including:
a processor; and
a memory configured to store executable instructions;
wherein the processor is configured to read the executable instructions from the memory and execute them to implement the video detection method of the first aspect.
In a fourth aspect, the present disclosure provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the video detection method of the first aspect.
Compared with the prior art, the technical solutions provided by the embodiments of the present disclosure have the following advantages:
With the video detection method, apparatus, device, and medium of the embodiments of the present disclosure, after an image sequence to be detected that includes at least two video frames from the same video is acquired, nonlinear transformation processing is performed on the facial features of each image in the sequence to obtain the attention features of multiple face regions corresponding to each image. Based on these attention features, temporal relationship features between the multiple face regions corresponding to the image sequence are constructed, and the probability that the video is a fake face video is then calculated from the temporal relationship features; this probability can be used to determine whether the video is a fake face video. Because the probability is computed from the temporal relationships between multiple face regions across the image sequence, temporal inconsistencies of the face in the video can be detected, which makes the calculated probability more accurate and more generalizable, thereby improving the precision of fake face video detection.
Brief Description of the Drawings
The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent in conjunction with the accompanying drawings and with reference to the following detailed description. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that parts and elements are not necessarily drawn to scale.
FIG. 1 is a schematic flowchart of a video detection method according to an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of a method for acquiring an image sequence according to an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of a method for acquiring attention features according to an embodiment of the present disclosure;
FIG. 4 is a schematic flowchart of another method for acquiring attention features according to an embodiment of the present disclosure;
FIG. 5 is a schematic flowchart of a method for acquiring temporal relationship features according to an embodiment of the present disclosure;
FIG. 6 is a schematic flowchart of another method for acquiring temporal relationship features according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a video detection model according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a three-dimensional attention neural network model according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a temporal graph neural network model according to an embodiment of the present disclosure;
FIG. 10 is a schematic structural diagram of a video detection apparatus according to an embodiment of the present disclosure;
FIG. 11 is a schematic structural diagram of a video detection device according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the scope of protection of the present disclosure.
It should be understood that the steps described in the method embodiments of the present disclosure may be performed in different orders and/or in parallel. Furthermore, method embodiments may include additional steps and/or omit the illustrated steps. The scope of the present disclosure is not limited in this regard.
As used herein, the term "including" and its variants are open-ended, i.e., "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms are given in the description below.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different apparatuses, modules, or units, and are not used to limit the order of, or interdependence between, the functions performed by these apparatuses, modules, or units.
It should be noted that the modifiers "a/an" and "multiple" mentioned in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".
The names of the messages or information exchanged between multiple apparatuses in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
At present, detection methods for fake face videos fall mainly into two categories:
(1) Image-based detection methods, which may include traditional image forensics methods, deep learning methods, internal image pattern analysis methods, and forged-image fingerprinting methods.
Specifically, image-based detection methods detect fake face videos at the granularity of individual video frames, and usually rely on the artifacts produced by deep forgery algorithms, such as image noise and unsmooth textures, as important clues for fake face detection. Although image-based detection methods achieve good detection performance with the help of advanced convolutional neural network architectures, deep forgery algorithms are constantly improving, and image-based detection methods suffer from weak generalization; as a result, their accuracy is poor when detecting fake face videos produced by new deep forgery algorithms.
(2) Video-based detection methods, which may include methods that use the temporal consistency of a video as a detection clue.
Specifically, video-based detection methods detect fake face videos at the granularity of the whole video. Since deep forgery technology generally tampers with face videos by processing video frames one by one, discontinuities arise between the frames of forged videos; video-based detection methods therefore usually use temporal consistency as an important clue for detecting fake face videos.
Here, consistency may include multi-modal consistency and video-frame consistency.
Multi-modal consistency can be used to judge whether the lip-motion modality of the face in a video is consistent with the speech-audio modality, thereby detecting fake face videos. Although detection based on multi-modal consistency can improve the generalization performance of fake face video detection, it cannot be used for videos without audio or whose audio has been overwritten; that is, its accuracy is low when detecting such videos.
Video-frame consistency can be used to check the temporal consistency between video frames, that is, to detect whether a face video has been forged by checking whether the changes between video frames are continuous. Although detection based on video-frame consistency improves the generalization performance of fake face video detection and avoids the limitations of multi-modal-consistency methods, the applicant found the following problems when using such methods: (1) the prior art uses a two-dimensional attention mechanism to obtain local region features from face images; and (2) after obtaining the local region features, the prior art either uses them directly for classification, or only models the inter-region relationships within a single image and classifies using those relationships. Therefore, fake face video detection based on video consistency still suffers from low accuracy.
In summary, both of the above categories of detection methods suffer from low accuracy in detecting fake face videos; therefore, how to accurately detect fake face videos is a technical problem that urgently needs to be solved.
In view of the above problems, embodiments of the present disclosure provide a video detection method, apparatus, device, and medium. The video detection method is first introduced below with reference to specific embodiments.
In the embodiments of the present disclosure, the video detection method may be performed by a computing device. The computing device may include an electronic device, a server, or the like, which is not limited herein. The electronic device may include a mobile phone, a tablet computer, a desktop computer, a notebook computer, a vehicle-mounted terminal, a wearable electronic device, an all-in-one machine, a smart home device, or another device with computing capabilities, and may also be a device simulated by a virtual machine or an emulator. The server may be an independent server or a cluster of multiple servers, and may include servers built locally and servers deployed in the cloud.
FIG. 1 is a schematic flowchart of a video detection method provided by an embodiment of the present disclosure.
As shown in FIG. 1, the video detection method may include the following steps.
S110: Acquire an image sequence to be detected, the image sequence including at least two video frames from the same video.
In this embodiment of the present disclosure, the computing device may acquire an image sequence to be detected that includes at least two video frames from the same video.
The image sequence includes at least two video frames from the same video; these frames may be adjacent in the video or non-adjacent. The order of the image sequence, that is, the arrangement order of the at least two video frames in the sequence, may be determined by the playback times of the frames in the video, from earliest to latest.
Further, the video to which the image sequence belongs may be a video for which it needs to be detected whether the faces in the video content have been tampered with.
In some embodiments, the at least two video frames in the image sequence may be obtained by sampling frames from the video as a whole.
In other embodiments, the at least two video frames in the image sequence may be obtained by segmenting the video and sampling frames from each segment.
S120: For each image in the image sequence, perform nonlinear transformation processing on the facial features of the image to obtain attention features of multiple face regions corresponding to the image.
In this embodiment of the present disclosure, after acquiring the image sequence to be detected, the computing device may perform nonlinear transformation processing on the facial features of each image in the image sequence to obtain the attention features of the multiple face regions corresponding to each image.
Specifically, after acquiring the image sequence to be detected, the computing device may, for each image in the sequence, first extract the facial features of the image and then perform attention-based fusion computation on those facial features, thereby implementing the nonlinear transformation of the facial features and obtaining the attention features of the multiple face regions corresponding to the image.
The face in an image may be the face of any object, for example, a human face, an animal face, or the face of a virtual avatar, which is not limited herein.
In some embodiments, the facial features of an image may be understood as features that characterize the face in the image; for example, the facial features may include features of specific parts of the face, features of the contour of the face, and features of the facial expression, which are not limited herein.
Optionally, the facial features may be two-dimensional facial features or three-dimensional facial features, which is not limited herein.
In some embodiments, the face of each kind of object may be divided into multiple regions in advance as required. The attention features of the multiple face regions corresponding to an image may be understood as features that characterize the degree of correlation between each region of the face in the image and the face as a whole.
Optionally, the attention features may be two-dimensional attention features or three-dimensional attention features, which is not limited herein.
As an example, after acquiring the image sequence to be detected, the computing device may, for each image in the sequence, first extract the two-dimensional facial features of the image, for example using a two-dimensional facial feature extractor, and then perform fusion computation based on a two-dimensional attention mechanism, such as a two-dimensional convolution operation, on the two-dimensional facial features, thereby implementing the nonlinear transformation and obtaining the two-dimensional attention features of the multiple face regions corresponding to the image.
As another example, after acquiring the image sequence to be detected, the computing device may, for each image in the sequence, first extract the three-dimensional facial features of the image, for example using a three-dimensional facial feature extraction model, and then perform fusion computation based on a three-dimensional attention mechanism, such as a three-dimensional convolution operation (implemented, for example, by a three-dimensional attention neural network model), on the three-dimensional facial features, thereby implementing the nonlinear transformation and obtaining the three-dimensional attention features of the multiple face regions corresponding to the image.
S130: Based on the attention features of the multiple face regions corresponding to each image, construct temporal relationship features between the multiple face regions corresponding to the image sequence.
In this embodiment of the present disclosure, after obtaining the attention features of the multiple face regions in each image, the computing device may perform temporal relationship construction processing on these attention features over the image sequence as a whole to obtain the temporal relationship features between the multiple face regions corresponding to the image sequence.
The temporal relationship features between the multiple face regions corresponding to the image sequence may be understood as features that characterize how the attention features of each region of the face change along the time dimension.
As an example, if the face in an image is divided in advance into four regions, namely the eyebrow region, the eye region, the mouth region, and the nose region, the temporal relationship features between the multiple face regions corresponding to the image sequence may describe how the image content in these four regions changes across the times at which the different video frames of the image sequence occur.
In some embodiments, the computing device may construct graph structure data of the multiple face regions corresponding to each image from the attention features of those regions, and then perform temporal relationship construction processing on the graph structure data by means of graph convolution operations to obtain the temporal relationship features between the multiple face regions corresponding to the image sequence.
In other embodiments, the computing device may construct vector data of the multiple face regions corresponding to each image from the attention features of those regions, for example by concatenating, in a specified region order, the feature vectors of the face regions of each image. After obtaining the vector data, the computing device may use a temporal convolutional neural network to perform temporal relationship construction processing on the vector data of the multiple face regions of each image, obtaining the temporal relationship features between the multiple face regions corresponding to the image sequence, as sketched below.
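As a hedged illustration of this second embodiment, the following is a minimal sketch assuming PyTorch and illustrative tensor sizes (the disclosure fixes neither the exact dimensions nor the layer settings): the per-region feature vectors of each frame are concatenated and convolved along the time axis.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not taken from the disclosure):
# T frames, M face regions, D features per region.
T, M, D = 20, 8, 256

region_feats = torch.randn(1, T, M, D)   # (batch, time, regions, dim)
per_frame = region_feats.flatten(2)      # concatenate regions: (batch, time, M*D)

# Conv1d convolves along time; channels hold the concatenated region features.
temporal_cnn = nn.Sequential(
    nn.Conv1d(M * D, 512, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv1d(512, 256, kernel_size=3, padding=1),
)
timing_features = temporal_cnn(per_frame.transpose(1, 2))  # (batch, 256, T)
```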
S140: Based on the temporal relationship features, calculate the probability that the video is a fake face video.
In this embodiment of the present disclosure, after obtaining the temporal relationship features between the multiple face regions corresponding to the image sequence, the computing device may perform classification on these temporal relationship features to obtain the probability that the video to be detected is a fake face video.
Optionally, the computing device may input the temporal relationship features between the multiple face regions corresponding to the image sequence into a pre-trained classifier for detecting whether the video to which the temporal relationship features belong is a fake face video, and obtain, as the output of the classifier, the probability that the video is a fake face video.
Further, the computing device may compare this probability with a preset probability threshold. If the probability is greater than or equal to the preset probability threshold, the video to be detected to which the image sequence belongs is determined to be a fake face video; if the probability is less than the preset probability threshold, the video is determined not to be a fake face video.
It should be noted that the preset probability threshold may be a probability value set in advance as required for characterizing a video as a fake face video; for example, the preset probability threshold may be 0.5 or 0.8, which is not limited herein.
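A minimal sketch of this classification-and-threshold step, assuming a simple linear classifier head (the disclosure does not specify the classifier architecture) and using the example threshold of 0.5 quoted above:

```python
import torch
import torch.nn as nn

# Hypothetical classifier head mapping a pooled temporal-relationship
# feature to a single fake-face probability.
classifier = nn.Sequential(nn.Linear(256, 1), nn.Sigmoid())

relation_feature = torch.randn(1, 256)        # illustrative pooled feature
prob_fake = classifier(relation_feature).item()

PROB_THRESHOLD = 0.5                          # one of the example values in the text
is_fake_face_video = prob_fake >= PROB_THRESHOLD
```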
In the embodiments of the present disclosure, after an image sequence to be detected that includes at least two video frames from the same video is acquired, nonlinear transformation processing can be performed on the facial features of each image in the sequence to obtain the attention features of the multiple face regions corresponding to each image; based on these attention features, the temporal relationship features between the multiple face regions corresponding to the image sequence are constructed, and the probability that the video is a fake face video is then calculated from the temporal relationship features. This probability can be used to determine whether the video is a fake face video, and processing the temporal relationship features makes it possible to detect temporal inconsistencies of the face in the video, so that the calculated probability is more accurate and generalizes better, thereby improving the precision of fake face video detection.
The method by which the computing device acquires the image sequence to be detected is described in detail below.
In some embodiments of the present disclosure, the at least two video frames in the image sequence may be obtained by segmenting the video and sampling frames from each segment.
In these embodiments, the computing device may segment the video and sample frames to obtain the image sequence, as described in detail with reference to FIG. 2.
FIG. 2 is a schematic flowchart of a method for acquiring an image sequence provided by an embodiment of the present disclosure. As shown in FIG. 2, the image sequence acquisition method may include the following steps.
S210: Divide the video into multiple video segments.
In this embodiment of the present disclosure, the computing device may perform segmentation processing on the video to be detected so as to divide the video into multiple video segments.
Specifically, the computing device may segment the video according to a preset segmentation manner to obtain the multiple video segments.
The preset segmentation manner may be a manner set in advance as required for segmenting the video according to user needs. For example, the preset segmentation manner may be to divide the video into a fixed number of equal-length segments, to divide the video into segments of a fixed length, or to divide the video randomly into multiple segments, which is not limited herein.
In some embodiments, the computing device may use the Open Source Computer Vision Library (OpenCV) to extract the individual video frames of the video, then use a multi-task convolutional neural network (MTCNN) face detection model to crop the face image from each video frame and store the cropped face images as pictures, segment the stored face images of consecutive video frames by dividing the video into a fixed number of equal-length segments, and take the resulting groups of face images as the multiple video segments; a hedged sketch of this pipeline is given after the next paragraph.
In other embodiments, the computing device may use OpenCV to extract the individual video frames of the video, store the extracted frames as pictures, segment the stored consecutive video frames by dividing the video into a fixed number of equal-length segments, and take the resulting groups of video frames as the multiple video segments.
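The following is a hedged sketch of the frame extraction and face cropping pipeline, assuming OpenCV and the facenet-pytorch implementation of MTCNN; the disclosure names the tools but not a specific implementation, and the crop settings here are illustrative.

```python
import cv2
from facenet_pytorch import MTCNN  # one available MTCNN implementation (assumption)

mtcnn = MTCNN(image_size=224, margin=20)  # illustrative crop settings

def extract_face_frames(video_path):
    """Read every frame with OpenCV and keep the cropped face image, if any."""
    faces = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # MTCNN expects RGB input
        face = mtcnn(rgb)                             # cropped face tensor, or None
        if face is not None:
            faces.append(face)
    cap.release()
    return faces

def split_into_segments(frames, num_segments):
    """Divide the face frames into a fixed number of equal-length segments."""
    seg_len = max(1, len(frames) // num_segments)
    return [frames[i:i + seg_len] for i in range(0, seg_len * num_segments, seg_len)]
```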
S220: Extract key video frames from each video segment according to a preset frame sampling manner.
In this embodiment of the present disclosure, after obtaining the multiple video segments, the computing device may sample frames from the pictures in each video segment and take the sampled pictures as the key video frames of that segment.
A key video frame may be understood as a video frame that characterizes the video segment to which it belongs.
Specifically, the preset frame sampling manner may be a manner set in advance as required for sampling frames from the video according to user needs. For example, the preset frame sampling manner may be to extract key video frames at a fixed interval, or to randomly extract a fixed number of key video frames, which is not limited herein.
S230: Sort the key video frames in playback time order to obtain the image sequence.
In this embodiment of the present disclosure, after obtaining the key video frames of each video segment, the computing device may sort the key video frames in playback time order to obtain the image sequence; a sketch of this sampling step follows.
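A minimal sketch of the per-segment key-frame sampling, covering both sampling manners mentioned above (fixed interval and random); the function name and default values are illustrative assumptions.

```python
import random

def sample_key_frames(segments, per_segment=1, random_pick=False):
    """Pick key frames from each segment; segment order follows playback
    time, so the concatenated result is already sorted chronologically."""
    sequence = []
    for seg in segments:
        if random_pick:
            idx = sorted(random.sample(range(len(seg)), min(per_segment, len(seg))))
        else:
            step = max(1, len(seg) // per_segment)   # fixed-interval sampling
            idx = list(range(0, len(seg), step))[:per_segment]
        sequence.extend(seg[i] for i in idx)
    return sequence
```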
Thus, in this embodiment of the present disclosure, the computing device can obtain the image sequence by segment-wise frame sampling, so that the video frames in the image sequence cover every time period of the video; the detection of fake face videos can then cover the entire video, further improving the precision of fake face video detection.
In other embodiments of the present disclosure, the at least two video frames in the image sequence may be obtained by sampling frames from the video as a whole.
In some embodiments, the computing device may use the Open Source Computer Vision Library (OpenCV) to extract the individual video frames of the video, then use a multi-task convolutional neural network (MTCNN) face detection model to crop the face image from each video frame, and store the cropped face images as pictures, obtaining multiple pictures corresponding to the video.
In other embodiments, the computing device may use OpenCV to extract the individual video frames of the video and store the extracted frames as pictures, obtaining multiple pictures corresponding to the video.
In these embodiments, the computing device may sample frames from the video as a whole to obtain the image sequence. Specifically, the computing device may perform whole-video frame sampling on the multiple pictures corresponding to the video according to a preset frame sampling manner to obtain multiple key video frames, and then sort the key video frames in playback time order to obtain the image sequence.
The preset frame sampling manner may be a manner set in advance as required for sampling frames from the video according to user needs. For example, the preset frame sampling manner may be to extract key video frames at a fixed interval, or to randomly extract a fixed number of key video frames, which is not limited herein.
Thus, in these other embodiments of the present disclosure, the computing device can obtain the image sequence by whole-video frame sampling, so that the video frames in the image sequence can be flexibly selected from any position in the entire video; the coverage of fake face video detection can then be chosen according to the actual situation, further improving the precision of fake face video detection.
In still other embodiments of the present disclosure, in order to make the facial features of each key video frame easier to extract, after extracting the key video frames the computing device also performs image enhancement processing on each key video frame to enlarge the differences between different facial features in the images, obtains the processed key video frames, and then sorts the processed key video frames in playback time order to obtain the image sequence.
As an example, the data enhancement processing of an image may include at least one of the following: horizontally flipping the image, translating it by a certain distance, scaling it by a certain ratio, rotating it by a certain angle, adjusting its hue, contrast, saturation, or brightness by a certain amount, adding a certain amount of Gaussian noise, applying a certain motion blur filter, applying a certain Gaussian blur filter, applying a certain degree of JPEG compression, and converting it to grayscale.
Optionally, the computing device may select, each with a certain probability, at least one of the above data enhancement processing methods to apply to the image, as in the sketch below.
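A hedged sketch of probabilistic augmentation selection using torchvision transforms; the probabilities and magnitudes below are illustrative placeholders, not the values of Table 1.

```python
import torchvision.transforms as T

# Each augmentation fires independently with its own probability.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomApply([T.RandomRotation(degrees=10)], p=0.3),
    T.RandomApply([T.ColorJitter(brightness=0.2, contrast=0.2,
                                 saturation=0.2, hue=0.1)], p=0.3),
    T.RandomApply([T.GaussianBlur(kernel_size=5)], p=0.1),
    T.RandomGrayscale(p=0.1),
])
```

For video input one would typically sample the random parameters once and apply the same transform to every key frame of a sequence, so that the augmentation itself does not introduce spurious temporal inconsistency.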
A specific example of the image enhancement processing methods and their selection probabilities is given in Table 1.
Table 1
Thus, in this embodiment of the present disclosure, after extracting the key video frames, the computing device first performs image enhancement processing on each key video frame to enlarge the differences between different facial features in the images, and then uses the processed key video frames to form the image sequence. This makes the attention features and temporal relationship features obtained from the video frames of the image sequence more salient, which in turn makes the calculated probability more accurate and further improves the precision of fake face video detection.
The specific manner in which the computing device obtains the three-dimensional attention features of the multiple face regions corresponding to an image is described in detail below.
In some embodiments of the present disclosure, the computing device may obtain the three-dimensional attention features of the multiple face regions corresponding to an image through three-dimensional convolution operations.
FIG. 3 is a schematic flowchart of a method for acquiring attention features provided by an embodiment of the present disclosure. As shown in FIG. 3, the attention feature acquisition method may include the following steps.
S310: Perform three-dimensional facial feature extraction processing on the image to obtain the three-dimensional facial features corresponding to the image.
In this embodiment of the present disclosure, after acquiring any image of the image sequence, the computing device may perform three-dimensional facial feature extraction processing on the image to obtain the three-dimensional facial features corresponding to the image.
Optionally, after acquiring any image, the computing device may input the image data corresponding to the image into a three-dimensional facial feature extractor to obtain the feature map output by the extractor; this feature map can serve as the three-dimensional facial features corresponding to the image.
As an example, the three-dimensional facial feature extractor may be a three-dimensional residual network, such as a mixed convolution network (MC3). The MC3 may consist of five convolutional stages, the first two being three-dimensional convolutions and the third to fifth being two-dimensional convolutions. The feature map output by the MC3 may have four dimensions, namely the number of channels, length, height, and width; for example, the feature map size may be 256×20×14×14 (channels × length × height × width).
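A hedged sketch of this extraction step, assuming the MC3 variant shipped with recent torchvision (the disclosure describes an MC3-style network, not this exact implementation); the classifier head is dropped and the first three residual stages are kept, which yields 256 output channels.

```python
import torch
import torch.nn as nn
from torchvision.models.video import mc3_18  # mixed-convolution 3D ResNet

backbone = mc3_18(weights=None)
# Keep the stem and the first three residual stages (256 output channels);
# drop the final stage, the pooling layer, and the classifier head.
extractor = nn.Sequential(backbone.stem, backbone.layer1,
                          backbone.layer2, backbone.layer3)

clip = torch.randn(1, 3, 20, 112, 112)  # (batch, RGB, frames, height, width)
feats = extractor(clip)                 # (1, 256, 20, 14, 14) for this input
```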
S320: Perform a three-dimensional convolution operation on the three-dimensional facial features corresponding to the image to obtain the attention weight matrices of the multiple face regions corresponding to the image.
In this embodiment of the present disclosure, after obtaining the three-dimensional facial features corresponding to the image, the computing device may perform a three-dimensional convolution operation on these features to obtain the attention weight matrices of the multiple face regions corresponding to the image.
Optionally, S320 may specifically include: performing, based on a three-dimensional attention neural network model, a three-dimensional convolution operation on the three-dimensional facial features corresponding to the image to obtain the attention weight matrices of the multiple face regions corresponding to the image, where the three-dimensional attention neural network model is trained based on a temporal continuity loss function and a sparse attention contrastive loss function.
Specifically, after obtaining the three-dimensional facial features corresponding to the image, the computing device may input the three-dimensional facial features into the three-dimensional attention neural network model to obtain the attention weight matrices output by the model.
As an example, the three-dimensional attention neural network model may include three attention mechanism modules to implement the three-dimensional convolution operation on the three-dimensional facial features. Each attention mechanism module contains a three-dimensional convolution layer, a three-dimensional batch normalization layer, and an activation layer: the three-dimensional convolution layer slides a convolution kernel over the three-dimensional space of the input, the three-dimensional batch normalization layer performs batch normalization on five-dimensional inputs composed of mini-batches of four-dimensional data, and the activation layer applies an activation function to perform a nonlinear transformation on the features in three-dimensional space. In this way, a three-dimensional attention mechanism can be implemented with a multi-layer neural network, solving the problem that two-dimensional attention mechanisms ignore temporal information.
A specific example of the three-dimensional convolution network is given in Table 2.
Table 2
Here, the three numbers in parentheses can be understood as the parameters of the length, height, and width dimensions; Conv3D (Convolution 3D) denotes a three-dimensional convolution layer, and BN3D (Batch Normalization 3D) denotes a three-dimensional batch normalization layer.
Leaky ReLU can be understood as an activation function applied in the three-dimensional activation layer, with the formula:
$$\mathrm{LeakyReLU}(x) = \begin{cases} x, & x \ge 0 \\ \alpha x, & x < 0 \end{cases}$$
where $\alpha$ is a small positive slope.
Softmax can be understood as another activation function applied in the three-dimensional activation layer, with the formula:
$$\mathrm{Softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$
Further, after the three-dimensional facial features of an image are processed by the three-dimensional convolution network shown in Table 2, each channel of the resulting attention weights stores an attention matrix of size 14×14. A hedged sketch of such an attention module follows.
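This is one plausible reading of the module described around Table 2, whose rows were not preserved: three Conv3D + BN3D + activation blocks, with a softmax over the spatial positions of each output channel so that every channel stores a normalized 14×14 attention matrix per frame. The intermediate channel widths and the number of regions M are assumptions.

```python
import torch
import torch.nn as nn

M = 8  # number of attended face regions (illustrative)

class Attention3D(nn.Module):
    """Three Conv3D/BN3D/activation blocks producing M spatial attention maps."""
    def __init__(self, in_channels=256):
        super().__init__()
        self.blocks = nn.Sequential(
            nn.Conv3d(in_channels, 128, kernel_size=3, padding=1),
            nn.BatchNorm3d(128),
            nn.LeakyReLU(0.1),
            nn.Conv3d(128, 64, kernel_size=3, padding=1),
            nn.BatchNorm3d(64),
            nn.LeakyReLU(0.1),
            nn.Conv3d(64, M, kernel_size=3, padding=1),
            nn.BatchNorm3d(M),
        )

    def forward(self, feat):                    # feat: (B, C, T, H, W)
        logits = self.blocks(feat)              # (B, M, T, H, W)
        b, m, t, h, w = logits.shape
        # Softmax over spatial positions: each HxW map sums to 1 per frame.
        weights = logits.flatten(3).softmax(dim=-1)
        return weights.view(b, m, t, h, w)

attn = Attention3D()(torch.randn(1, 256, 20, 14, 14))  # (1, 8, 20, 14, 14)
```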
Thus, in this embodiment of the present disclosure, the computing device can process the three-dimensional facial features of an image based on three-dimensional convolution operations to obtain three-dimensional attention features that characterize the degree of correlation between each region of the face in the image and the face as a whole, so that the attention features do not ignore temporal information, thereby improving the precision of fake face video detection.
In this embodiment of the present disclosure, the three-dimensional attention neural network model can be trained with backpropagation through loss functions. The loss functions may include a temporal continuity loss function and a sparse attention contrastive loss function, which are used to tune the parameters of the model until the loss values of the temporal continuity loss function and the sparse attention contrastive loss function each fall below the corresponding loss threshold.
The temporal continuity loss function $\mathcal{L}_{tc}$ is expressed as:
$$\mathcal{L}_{tc} = \frac{1}{M(T-1)} \sum_{j=1}^{M} \sum_{i=1}^{T-1} \left\| A_{i+1}^{j} - A_{i}^{j} \right\|_2^2$$
where $T$ denotes the number of images in the image sequence, $M$ denotes the total number of face regions attended to by the three-dimensional attention neural network model, and $A_i^j$ denotes the attention weight matrix of the $j$-th face region of the $i$-th image in the image sequence.
The sparse attention contrastive loss function $\mathcal{L}_{sa}$ is expressed as:
$$\mathcal{L}_{sa} = \sum_{i=1}^{M} \sum_{j=1}^{M} \mathbb{1}(i \neq j)\, \langle a_i, a_j \rangle + \sum_{i=1}^{M} \max\bigl(0,\ \mathcal{H}(a_i) - \gamma\bigr), \qquad a_i = \bigl[\, A^i\, \mathbf{1}_{W \times 1},\ (\mathbf{1}_{1 \times H}\, A^i)^{\top} \,\bigr]$$
where $[\cdot,\cdot]$ denotes the vector merging operation, $\mathbf{1}_{W \times 1}$ denotes an all-ones vector of size $W \times 1$, $\mathbf{1}_{1 \times H}$ denotes an all-ones vector of size $1 \times H$, $\mathbb{1}(i \neq j)$ is an indicator function whose value is 1 when $i \neq j$ and 0 otherwise, $\mathcal{H}(\cdot)$ denotes the information entropy function, and $\gamma$ denotes the information entropy threshold of $\mathcal{H}(a_i)$.
Thus, in this embodiment of the present disclosure, the temporal continuity loss function keeps the face regions attended to by the three-dimensional attention neural network model stable along the time dimension, and the sparse attention contrastive loss function keeps the multiple face regions attended to by the model diverse along the spatial dimension, thereby improving the accuracy of the attention weight matrices output by the model and, in turn, the precision of fake face video detection.
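A minimal sketch of the two loss terms, following the reconstructed expressions above; since the original formulas were not preserved, this is an approximation of the patent's exact losses, and the threshold value is an illustrative assumption.

```python
import torch

def temporal_continuity_loss(attn):
    """Mean squared change of each region's attention map between adjacent
    frames; attn has shape (B, M, T, H, W)."""
    diff = attn[:, :, 1:] - attn[:, :, :-1]
    return diff.pow(2).sum(dim=(-1, -2)).mean()

def sparsity_entropy_term(attn, gamma=2.0, eps=1e-8):
    """Entropy penalty of the sparse attention contrastive loss: attention
    distributions whose entropy exceeds the threshold gamma are penalized."""
    p = attn.flatten(3).clamp_min(eps)        # (B, M, T, H*W), each map sums to ~1
    entropy = -(p * p.log()).sum(dim=-1)      # (B, M, T)
    return torch.relu(entropy - gamma).mean()
```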
S330: Based on the three-dimensional facial features corresponding to the image and the attention weight matrices of the multiple face regions corresponding to the image, generate the attention features of the multiple face regions corresponding to the image.
In this embodiment of the present disclosure, after obtaining the attention weight matrices of the multiple face regions corresponding to the image, the computing device may fuse the three-dimensional facial features corresponding to the image with the attention weight matrices of the multiple face regions to obtain the attention features of the multiple face regions corresponding to the image, as described below with reference to FIG. 4.
FIG. 4 is a schematic flowchart of another method for acquiring attention features provided by an embodiment of the present disclosure. As shown in FIG. 4, the attention feature acquisition method may include the following steps.
S410: Take the product of the three-dimensional facial features corresponding to the image and the attention weight matrices of the multiple face regions corresponding to the image as the attention feature matrix of the multiple face regions corresponding to the image.
In this embodiment of the present disclosure, after obtaining the three-dimensional facial features corresponding to any image of the image sequence and the attention weight matrices of the multiple face regions corresponding to that image, the computing device may compute the product of the three-dimensional facial features and the attention weight matrices to obtain the attention feature matrix of the multiple face regions corresponding to the image.
Here, the product of the three-dimensional facial features and the attention weight matrices specifically refers to their outer product.
As an example, the parameter dimensions of the attention feature matrix may include color value, vector dimension, length, height, and width. For example, the size of the attention feature matrix may be 256×8×20×14×14 (color value × vector dimension × length × height × width).
S420: Sum the attention feature matrix of the multiple face regions corresponding to the image over the height and width dimensions, respectively, to obtain the summed attention feature matrix.
In this embodiment of the present disclosure, after obtaining the attention feature matrix of the multiple face regions corresponding to the image, the computing device may sum the matrix over its height and width dimensions to obtain the summed attention feature matrix.
S430: Perform global average pooling over the length dimension of the summed attention feature matrix to obtain the attention features of the multiple face regions corresponding to the image.
In this embodiment of the present disclosure, after obtaining the summed attention feature matrix corresponding to the image, the computing device may perform global average pooling over the length dimension of the summed attention feature matrix to obtain the attention features of the multiple face regions corresponding to the image.
Here, global average pooling can be understood as reducing the length dimension by taking the average, thereby speeding up the computation.
Specifically, the computing device may compute, for every region of the feature map of each channel, the average value along the length dimension, obtaining the attention features of the multiple face regions corresponding to the image after global average pooling along the length dimension.
As an example, the dimension parameters of the attention features may include color value and vector dimension; for example, the size of the attention features may be 256×8 (color value × vector dimension). A hedged sketch of this fusion procedure follows.
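A minimal sketch of steps S410 to S430 using einsum, assuming the tensor shapes quoted above; the multiply-and-sum over height and width and the averaging over length are combined into two lines.

```python
import torch

feats = torch.randn(1, 256, 20, 14, 14)  # 3-D facial features (B, C, T, H, W)
attn = torch.randn(1, 8, 20, 14, 14)     # region attention weights (B, M, T, H, W)

# S410 + S420: weight the features by each region's attention map, then
# sum out the height and width dimensions.
region = torch.einsum('bcthw,bmthw->bmct', feats, attn)
# S430: global average pooling over the length (time) dimension.
region = region.mean(dim=-1)             # (B, M, C) -> here (1, 8, 256)
```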
由此,在本公开实施例中,计算设备获得图像序列中每个图像对应的三维脸部特征和对应图像的脸部多个区域的注意力权重矩阵之后,可以将图像对应的三维脸部特征与脸部多个区域的注意力权重矩阵进行融合,得到精确度更高的注意力特征,该精确度更高的注意力特征可以用于表征图像中的脸部内各个区域与脸部之间的相关程度,进而表征脸部内各个区域在视频检测中所起到作用的大小,进而得到更准确的视频检测结果。Therefore, in this embodiment of the present disclosure, after the computing device obtains the three-dimensional facial features corresponding to each image in the image sequence and the attention weight matrix of multiple face regions of the corresponding images, the three-dimensional facial features corresponding to the images can be calculated by the computing device. It is fused with the attention weight matrix of multiple areas of the face to obtain more accurate attention features, which can be used to characterize the relationship between each area in the face and the face in the image. The correlation degree of the face, and then characterize the role of each area in the face in the video detection, and then get more accurate video detection results.
In other embodiments of the present disclosure, the computing device may instead obtain the attention features of the multiple face regions corresponding to the image through a ResNet18 network model.

The ResNet18 network model contains 17 convolutional layers and 1 fully connected layer.

Specifically, after obtaining the three-dimensional face feature corresponding to the image, the computing device may input the face feature into the ResNet18 network model, perform 17 convolution operations on it, and add the operation results to obtain the attention features of the multiple face regions corresponding to the image.
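As a hedged sketch of this alternative: torchvision's standard resnet18 has exactly 17 convolutional layers and 1 fully connected layer, so it can stand in for the extractor here. Applying it per frame to 2D face crops is an assumption; the patent does not spell out how the three-dimensional face feature is fed to the network.

```python
import torch
import torchvision.models as models

# Standard ResNet18: 17 convolutional layers + 1 fully connected layer.
backbone = models.resnet18(weights=None)
backbone.fc = torch.nn.Identity()        # keep the pooled feature, drop the classifier

frames = torch.randn(20, 3, 224, 224)    # 20 face crops, one per key frame (assumed input)
with torch.no_grad():
    per_frame = backbone(frames)         # (20, 512) per-frame features
```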
The following describes in detail how the computing device uses a temporal graph convolutional neural network to obtain the temporal relationship features between the multiple face regions corresponding to the image sequence.

FIG. 5 is a schematic flowchart of a method for acquiring temporal relationship features according to an embodiment of the present disclosure. As shown in FIG. 5, the method may include the following steps.
S510. For each image, construct graph structure data of the multiple face regions corresponding to the image according to the attention features of those regions.

In this embodiment of the present disclosure, after obtaining the attention features of the multiple face regions corresponding to each image in the image sequence, the computing device may construct, from those attention features, the graph structure data of the multiple face regions corresponding to each image.

Specifically, the attention feature may take the form of a matrix. The attention feature is split along the channel dimension of the matrix into multiple attention feature vectors, which serve as the nodes of the graph structure data; an adjacency matrix is defined as the initial relationship feature between the attention feature vectors and serves as the edges of the graph structure data. Combining the nodes and edges corresponding to each image yields the graph structure data of the multiple face regions corresponding to that image.
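A minimal sketch of this construction, assuming the (256, 8) attention feature from the earlier steps; the choice of an identity matrix as the initial adjacency is an assumption for illustration:

```python
import torch

attn_feature = torch.randn(256, 8)           # (color value, vector dimension)

# Nodes V: split along the channel (second) dimension into 8 vectors,
# one 256-dimensional vector per face region.
nodes = attn_feature.t()                     # (8, 256)

# Edges E: an 8 x 8 adjacency matrix as the initial relationship feature.
edges = torch.nn.Parameter(torch.eye(8))    # learnable

graph = (nodes, edges)                       # G = <V, E>
```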
S520. Perform temporal relationship construction on the graph structure data of the multiple face regions corresponding to each image, to obtain the temporal relationship features between the multiple face regions corresponding to the image sequence.

In this embodiment of the present disclosure, after obtaining the graph structure data of the multiple face regions corresponding to each image, the computing device may perform graph convolution operations on the graph structure data to obtain the temporal relationship features between the multiple face regions corresponding to the image sequence.

Specifically, the computing device may input the graph structure data of the multiple face regions corresponding to each image into a temporal graph neural network model, which updates the relationship features of the graph structure data according to the input, yielding the temporal relationship features between the multiple face regions corresponding to the image sequence.

Here, a temporal graph neural network model can be understood as a neural network model, presented in graph form, that can characterize dynamic properties.

Thus, in this embodiment of the present disclosure, the computing device can construct temporal relationship features from the attention features of the multiple face regions corresponding to the images, obtaining the temporal relationship features between the multiple face regions corresponding to the image sequence. These features are used to detect temporal inconsistencies of the face across the image sequence, and hence across the video, further improving the accuracy of fake face video detection.
Further, in this embodiment of the present disclosure, S520 may be implemented by performing, on the graph structure data of the multiple face regions corresponding to each image, a temporal relationship construction process based on a temporal graph neural network model, and taking the resulting last hidden state graph corresponding to the image sequence as the temporal relationship features between the multiple face regions corresponding to the image sequence, as described below with reference to FIG. 6.

FIG. 6 is a schematic flowchart of another method for acquiring temporal relationship features according to an embodiment of the present disclosure. As shown in FIG. 6, the method may include the following steps.

S610. Based on the temporal graph neural network model, perform temporal relationship construction on the graph structure data of the multiple face regions corresponding to each image, to obtain the last hidden state graph corresponding to the image sequence.

In this embodiment of the present disclosure, after obtaining the graph structure data of the multiple face regions corresponding to each image in the image sequence, the computing device inputs the graph structure data of the first image, in sequence order, into the temporal graph neural network model, obtaining a hidden state graph that characterizes the temporal relationship features of that image. The hidden state graph of the first image is then input, together with the graph structure data of the second image, into the model, yielding a hidden state graph that characterizes the temporal relationship features of the first two images. Proceeding in this way, all images in the sequence are fed into the model in turn, until the graph structure data of the last image and the hidden state graph of its predecessor are input together, producing the last hidden state graph corresponding to the image sequence.
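A compact sketch of this recurrence; `temporal_gnn_step` is a placeholder for the gated graph update described with FIG. 9, and the shapes follow the assumptions above:

```python
import torch

def temporal_gnn_step(nodes, hidden):
    # Placeholder for one gated update of the hidden state graph; the real
    # update applies the graph convolutions and gates described with FIG. 9.
    return nodes[:, :128] if hidden is None else 0.5 * (nodes[:, :128] + hidden)

node_sequence = [torch.randn(8, 256) for _ in range(20)]  # one node set per image
hidden = None
for nodes_t in node_sequence:                 # fed in playback order
    hidden = temporal_gnn_step(nodes_t, hidden)
# `hidden` now stands in for the last hidden state graph of the sequence.
```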
In this embodiment of the present disclosure, the temporal graph neural network model may be trained by backpropagation through a loss function: the loss function may be a cross-entropy loss, and the model parameters are tuned until the loss value of the cross-entropy loss function falls below a corresponding loss threshold.
Here, the cross-entropy loss function can be expressed as:

L = -[ y·log φ(H) + (1 - y)·log(1 - φ(H)) ]

where φ denotes the classifier, y denotes the authenticity label of the video to be detected, and H denotes the hidden state graph obtained after the last key video frame has been input.
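A minimal sketch of this loss, assuming a linear classifier over the flattened 8×128 hidden state graph; the classifier architecture and shapes are assumptions consistent with the dimensions given later in this text:

```python
import torch
import torch.nn.functional as F

H = torch.randn(8, 128, requires_grad=True)   # last hidden state graph
y = torch.tensor([1.0])                       # authenticity label (1 = forged)

phi = torch.nn.Linear(8 * 128, 1)             # classifier on the flattened graph
logit = phi(H.reshape(1, -1))                 # (1, 1)
loss = F.binary_cross_entropy_with_logits(logit.squeeze(1), y)
loss.backward()                               # gradients for backpropagation
```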
S620. Take the last hidden state graph as the temporal relationship features between the multiple face regions corresponding to the image sequence.

In this embodiment of the present disclosure, the last hidden state graph, obtained after the computing device feeds the graph structure data of every image in the sequence into the temporal graph neural network model in turn, can characterize the temporal relationships between the same face regions across all images in the sequence. The computing device can therefore take the last hidden state graph output by the model as the temporal relationship features between the multiple face regions corresponding to the image sequence.

Thus, in this embodiment of the present disclosure, after obtaining the graph structure data of the multiple face regions corresponding to each image in the sequence, the computing device can feed each image's graph structure data into the temporal graph neural network model in sequence order and obtain the last hidden state graph corresponding to the image sequence. That hidden state graph represents the temporal relationships of the same face regions across different images, so temporal inconsistencies of the face in the image sequence, and hence in the video, can be detected, further improving the accuracy of fake face video detection.
In one embodiment of the present disclosure, the above three-dimensional face feature extractor, three-dimensional attention neural network model, fusion module, graph structure construction module, temporal graph neural network model, and classifier may together form a video detection model, and the video detection method provided by the embodiments of the present disclosure may be implemented on the basis of this model. The structure and principle of a video detection model for detecting face videos are described in detail below with reference to FIG. 7 to FIG. 9.

FIG. 7 is a schematic diagram of a video detection model provided by an embodiment of the present disclosure. As shown in FIG. 7, after extracting a face image sequence 710 from a face video, the computing device may input the face image sequence 710 into the video detection model. The three-dimensional face feature extractor of the model performs three-dimensional face feature extraction on each face image, obtaining the three-dimensional face feature 720 corresponding to each image. The three-dimensional attention neural network model then computes the attention weight matrix 730 of the multiple face regions corresponding to each face image. Next, the fusion module fuses each image's three-dimensional face feature 720 with its attention weight matrix 730, obtaining the attention feature 740 corresponding to each face image. The graph structure construction module (not shown) then builds the graph structure data 750 corresponding to each attention feature 740, and the temporal graph convolutional neural network model generates the temporal relationship features 760 of the image sequence from the graph structure data 750. Finally, based on the temporal relationship features 760, the classifier produces the detection result of whether the video to be detected is a fake face video, i.e., the probability that it is a fake face video.
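A condensed sketch of this pipeline, with every stage reduced to a placeholder standing in for the modules described in this text; all function bodies and shapes are illustrative assumptions:

```python
import torch

def extract_3d_features(image):                  # feature extractor (feature 720)
    return torch.randn(256, 20, 14, 14)

def attention_weights(feat):                     # 3D attention model (matrix 730)
    return torch.rand(8, 20, 14, 14)

def fuse(feat, w):                               # fusion module (feature 740)
    return torch.einsum('cthw,vthw->cvthw', feat, w).sum(dim=(3, 4)).mean(dim=2)

def tgn_step(nodes, hidden):                     # temporal graph model (feature 760)
    return nodes if hidden is None else 0.5 * (nodes + hidden)

def classify(hidden):                            # classifier: forgery probability
    return torch.sigmoid(hidden.mean()).item()

hidden = None
for image in range(20):                          # stand-in for the sequence 710
    feat = extract_3d_features(image)
    att = fuse(feat, attention_weights(feat))    # (256, 8)
    hidden = tgn_step(att.t(), hidden)           # (8, 256) nodes -> hidden graph
prob_fake = classify(hidden)
```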
Further, the specific structure and principle of the three-dimensional attention neural network model in FIG. 7 can be described with reference to FIG. 8.

FIG. 8 is a schematic diagram of a three-dimensional attention neural network model provided by an embodiment of the present disclosure. As shown in FIG. 8, the three-dimensional attention neural network model 800 may include three three-dimensional convolution processing modules 820, each of which may include a three-dimensional convolutional layer 821, a three-dimensional batch normalization layer 822, and a three-dimensional activation layer 823.

The first three-dimensional convolution processing module 820 processes the three-dimensional face feature 810 corresponding to any face image into a first feature matrix 830; the second module 820 processes the first feature matrix 830 into a second feature matrix 840; and the third module 820 processes the second feature matrix 840 into the attention weight matrix 850.

Further, the specific structures and principles of the graph structure construction module and the temporal graph convolutional neural network model in FIG. 7 can be described with reference to FIG. 9.
In this embodiment of the present disclosure, after the fusion module obtains the attention feature corresponding to any face image, the graph structure construction module may divide the attention feature along its second dimension (e.g., the channel dimension) into 8 attention feature vectors, each of size 1×256. Next, the graph structure construction module may define an adjacency matrix, of size 8×8, as the initial relationship feature between the attention feature vectors. Finally, using graph convolution operations, the module constructs the graph G = <V, E>, with the attention feature vectors as the nodes V and the relationship features as the edges E.

Specifically, the graph structure construction module constructs the graph as follows:

The graph structure construction module first defines a graph convolution parameter matrix Wg of size 256×384 and inputs it (910) into the first graph convolution operation unit 920, which performs the graph convolution GConv(G) = E·V·Wg on it, yielding a result matrix of size 8×384. The result matrix is input into the first vector splitting unit 930, which splits it along the second dimension (e.g., the channel dimension) into 3 latent vectors Gr, Gz, and Gh, each of size 8×128. Next, the graph structure construction module defines an initial hidden state of size 8×128, whose initial value is 0, and a graph convolution parameter matrix Wh of size 128×384 for the hidden state. The module then inputs the data to be processed 940, formed from the hidden state and the shared relationship features described above, into the second graph convolution operation unit 950, which performs the graph convolution GConv(H) = E·H·Wh to obtain the hidden state graph 960. Finally, the module inputs the hidden state graph 960 into the second vector splitting unit 970, which splits it along the second dimension (e.g., the channel dimension) into 3 latent vectors Hr, Hz, and Hh, each of size 8×128.
The temporal graph convolutional neural network model may set initial bias parameters 980 and input the hidden state graph 960, the latent vectors Gr, Gz, Gh, Hr, Hz, Hh, and the initial bias parameters 980 into the gating operation unit 990. The initial bias parameters 980 are input into the bias parameter adjustment subunit 991 and split into the bias parameter vectors br, bz, and bh. The vectors Hr, Gr, and br are input into the reset gate operation subunit 992, which computes r = sigmoid(Gr + Hr + br); the result r can be understood as the reset result 993, which is input together with the latent vector Hh into the first product operation subunit 994 to obtain the first product result 995. The vectors Hz, Gz, and bz are input into the update gate operation subunit 996, which computes z = sigmoid(Gz + Hz + bz); the result z can be understood as the update result 997, which is input together with the hidden state graph 960 into the second product operation subunit 998 to obtain the second product result 999. The update result 997 is also input into the operation unit 9910, which computes 1 - z, giving the computation result 9911. The first product result 995 and the vectors Gh and bh are input into the candidate hidden state operation subunit 9912, which computes h~ = tanh(Gh + r⊙Hh + bh); the result h~ can be understood as the candidate hidden state 9913. The candidate hidden state 9913 and the result 9911 are input into the third product operation subunit 9914 to obtain the third product result 9915, and the second product result 999 and the third product result 9915 are input into the addition operation subunit 9916, which computes H = z⊙H + (1 - z)⊙h~; the result H can be understood as the updated hidden state graph 9917. The updated hidden state graph 9917 is input into the hidden state graph replacement unit 9100, which replaces the hidden state graph 960 output by the second graph convolution operation unit 950, and is also input into the bias parameter adjustment subunit 991 to adjust the bias parameters, obtaining bias parameters that improve the model's ability to extract the temporal relationship features of the image sequence.
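A sketch of one such gated update, following the dimensions given above (8 nodes, 256-dimensional node features, 128-dimensional hidden state); the parameter shapes match the text, while the initialization and names are assumptions:

```python
import torch

E = torch.eye(8)                     # 8 x 8 adjacency matrix (edges)
V = torch.randn(8, 256)              # nodes of the current image
H = torch.zeros(8, 128)              # hidden state graph, initially 0

Wg = torch.randn(256, 384)           # graph-conv parameters for the nodes
Wh = torch.randn(128, 384)           # graph-conv parameters for the hidden state
br, bz, bh = (torch.zeros(128) for _ in range(3))   # bias parameter vectors

Gr, Gz, Gh = (E @ V @ Wg).chunk(3, dim=1)   # GConv(G) = E V Wg, three 8x128 parts
Hr, Hz, Hh = (E @ H @ Wh).chunk(3, dim=1)   # GConv(H) = E H Wh, three 8x128 parts

r = torch.sigmoid(Gr + Hr + br)             # reset gate
z = torch.sigmoid(Gz + Hz + bz)             # update gate
h_cand = torch.tanh(Gh + r * Hh + bh)       # candidate hidden state
H = z * H + (1 - z) * h_cand                # updated hidden state graph
```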
Next, the graph structure construction module may receive the attention feature corresponding to the next face image and divide it along the second dimension (e.g., the channel dimension) into 8 attention feature vectors, obtaining the attention feature vectors, i.e., the next nodes V, corresponding to the next face image; the next update of the hidden state graph is then completed based on those nodes. Once the attention features of every image in the sequence have been fed into the temporal graph neural network model, the resulting hidden state graph can represent the temporal relationship features of the image sequence as a whole.
Further, to verify the accuracy of the video detection model shown in FIG. 7 to FIG. 9 in detecting fake face videos, its fake face video detection performance can be tested on the FaceForensics++ (FF++), Celeb-DF v2, and Deepfake Detection Challenge (DFDC) datasets.

The FF++ dataset applies the four deepfake algorithms Deepfakes, Face2Face, FaceSwap, and NeuralTextures to tamper with the faces in 1000 real videos; each algorithm produces 1000 forged videos, yielding 4000 forged videos in total. FF++ is available in a high-quality (HQ) version and a low-quality (LQ) version.

The Celeb-DF v2 dataset applies an improved Deepfake algorithm to tamper with the faces in 590 real celebrity videos, yielding 5639 forged videos.

The DFDC dataset, produced by Facebook in 2019, applies deepfake processing to 1131 real recorded videos, yielding 4119 forged videos.
In these tests, the video detection model is evaluated with the area under the curve (AUC) metric in a model accuracy test and a generalization ability test. The model accuracy test evaluates the detection accuracy of the video detection model and uses the same dataset as the training set and the test set; the generalization ability test evaluates how well the video detection model adapts to unseen samples and uses different datasets as the training set and the test set.
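A minimal sketch of the AUC evaluation, assuming per-video forgery probabilities produced by the model and ground-truth authenticity labels:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1]             # 1 = forged video, 0 = real video
y_prob = [0.1, 0.4, 0.35, 0.8, 0.9]  # model-predicted forgery probabilities
print(roc_auc_score(y_true, y_prob)) # area under the ROC curve
```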
In the model accuracy test, the FF++ dataset can be used; the test results are shown in Table 3.
Table 3
Here, w/o (without) denotes an ablation in which the specified model is removed, used to verify the effectiveness of the corresponding model; for example, "video detection model w/o three-dimensional attention neural network model" denotes the video detection model with the three-dimensional attention neural network model removed. FF++ (HQ) and FF++ (LQ) denote the test results of the video detection model on the high-quality (HQ) and low-quality (LQ) versions of the FF++ dataset, respectively, and the percentage values denote the probability that the test result matches the ground truth, i.e., the detection accuracy of the corresponding detection method on the corresponding dataset. In this model accuracy test, the video detection model achieves the highest detection accuracy; on the FF++ (LQ) dataset, its detection accuracy is 18.19% higher than that of the basic MC3 method. Removing either the three-dimensional attention neural network model or the temporal graph neural network model from the video detection model reduces its detection accuracy.
In the generalization ability test, the corresponding video detection methods can be trained on the FF++ (HQ) dataset, with the FF++ (HQ), Celeb-DF v2, and DFDC datasets used as test sets; the test results are shown in Table 4.
Table 4
Here, the notation shared between Table 4 and Table 3 has the same meaning; Celeb-DF v2 and DFDC denote the test results of the video detection model on the Celeb-DF v2 and DFDC datasets, respectively. In this generalization ability test, the video detection model achieves the highest generalization performance; on the Celeb-DF v2 dataset, its detection accuracy is 11.92% higher than that of the basic MC3 method. Removing either the three-dimensional attention neural network model or the temporal graph neural network model reduces the detection accuracy of fake face video detection.

In summary, after the accuracy test and the generalization ability test of the detection accuracy of the video detection model in the embodiments of the present disclosure, the test results show that, among the various video detection models and corresponding video detection methods evaluated on multiple datasets, the video detection model provided by the embodiments of the present disclosure and the video detection method it implements achieve the highest detection accuracy. The test results therefore indicate that the video detection method provided by the embodiments of the present disclosure can improve both the accuracy and the generalization ability of fake face video detection.
FIG. 10 is a schematic structural diagram of a video detection apparatus provided by an embodiment of the present disclosure.

In this embodiment of the present disclosure, the video detection apparatus may be provided in a computing device. The computing device may include an electronic device, a server, or the like, without limitation here. The electronic device may include a device with computing capability such as a mobile phone, tablet computer, desktop computer, notebook computer, in-vehicle terminal, wearable electronic device, all-in-one machine, or smart home device, or a device simulated by a virtual machine or an emulator. The server may be a standalone server or a cluster of servers, and may include servers deployed locally and servers deployed in the cloud.
As shown in FIG. 10, the video detection apparatus 1000 includes: an image acquisition module 1010, a nonlinear transformation module 1020, a feature construction module 1030, and a probability calculation module 1040.

The image acquisition module 1010 may be configured to acquire an image sequence to be detected, the image sequence including at least two video frames from the same video.

The nonlinear transformation module 1020 may be configured to, for each image in the image sequence, perform nonlinear transformation on the facial features of the image to obtain the attention features of the multiple face regions corresponding to the image.

The feature construction module 1030 may be configured to construct, based on the attention features of the multiple face regions corresponding to each image, the temporal relationship features between the multiple face regions corresponding to the image sequence.

The probability calculation module 1040 may be configured to calculate, based on the temporal relationship features, the probability that the video is a fake face video.
In this embodiment of the present disclosure, after an image sequence to be detected containing at least two video frames from the same video is acquired, nonlinear transformation can be applied to the facial features of each image in the sequence to obtain the attention features of the multiple face regions corresponding to each image. Based on these attention features, the temporal relationship features between the multiple face regions corresponding to the image sequence are constructed, and from them the probability that the video is a fake face video is calculated; this probability can be used to determine whether the video contains a forged face. Because, in this embodiment, the probability is computed from the temporal relationship features between the corresponding face regions across the image sequence, the temporal relationships between multiple face regions are brought into the computation, allowing temporal inconsistencies of the face in the video to be detected. The resulting probability is therefore more accurate and generalizes better, improving the accuracy of fake face video detection.
In some embodiments of the present disclosure, the image acquisition module 1010 may include a video segmentation unit, a key frame extraction unit, and a key frame sorting unit; a minimal sketch of these steps follows below.

The video segmentation unit may be configured to divide the video into a plurality of video segments.

The key frame extraction unit may be configured to extract a key video frame from each video segment according to a preset frame extraction method.

The key frame sorting unit may be configured to sort the key video frames by playback time to obtain the image sequence.
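A minimal sketch of these three units combined: the frame indices of a video are split into equal segments, one key frame is taken per segment, and playback order is preserved. The segment count and the middle-frame sampling rule are assumptions; the text only requires a preset frame extraction method.

```python
def sample_key_frames(num_frames: int, num_segments: int = 20) -> list[int]:
    seg_len = num_frames // num_segments
    # Take the middle frame of each segment; indices are already in playback order.
    return [i * seg_len + seg_len // 2 for i in range(num_segments)]

print(sample_key_frames(300))   # 20 key-frame indices for a 300-frame video
```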
In some embodiments of the present disclosure, the nonlinear transformation module 1020 may include a feature extraction unit, a three-dimensional convolution operation unit, and an attention feature construction unit.

The feature extraction unit may be configured to perform three-dimensional facial feature extraction on the image to obtain the three-dimensional facial feature corresponding to the image.

The three-dimensional convolution operation unit may be configured to perform a three-dimensional convolution operation on the three-dimensional facial feature corresponding to the image to obtain the attention weight matrix of the multiple face regions corresponding to the image.

The attention feature construction unit may be configured to generate the attention features of the multiple face regions corresponding to the image based on the three-dimensional facial feature corresponding to the image and the attention weight matrix of the multiple face regions corresponding to the image.
In some embodiments of the present disclosure, the attention feature construction unit may include a feature matrix construction subunit, a summation subunit, and a pooling subunit.

The feature matrix construction subunit may be configured to take the product of the three-dimensional facial feature corresponding to the image and the attention weight matrix of the multiple face regions corresponding to the image as the attention feature matrix of the multiple face regions corresponding to the image.

The summation subunit may be configured to sum the attention feature matrix of the multiple face regions corresponding to the image over its height dimension and its width dimension, respectively, to obtain the summed attention feature matrix.

The pooling subunit may be configured to perform global average pooling on the length dimension of the summed attention feature matrix to obtain the attention features of the multiple face regions corresponding to the image.
In some embodiments of the present disclosure, the attention feature construction unit may include a weight matrix construction subunit.

The weight matrix construction subunit may be configured to perform, based on a three-dimensional attention neural network model, a three-dimensional convolution operation on the three-dimensional facial feature corresponding to the image, to obtain the attention weight matrix of the multiple face regions corresponding to the image. The three-dimensional attention neural network model includes three attention mechanism modules, each containing a three-dimensional convolutional layer, a three-dimensional batch normalization layer, and an activation layer; its loss function includes a temporal continuity loss function and a sparse attention contrastive loss function, and the model is trained based on these two loss functions.
In some embodiments of the present disclosure, the feature construction module may include a graph structure data construction unit and a temporal relationship feature construction unit.

The graph structure data construction unit may be configured to, for each image, construct the graph structure data of the multiple face regions corresponding to the image according to the attention features of those regions.

The temporal relationship feature construction unit may be configured to perform temporal relationship construction on the graph structure data of the multiple face regions corresponding to each image, to obtain the temporal relationship features between the multiple face regions corresponding to the image sequence.

In some embodiments of the present disclosure, the temporal relationship feature construction unit may include a hidden state graph construction subunit and a temporal relationship feature acquisition subunit.

The hidden state graph construction subunit may be configured to perform, based on the temporal graph neural network model, temporal relationship construction on the graph structure data of the multiple face regions corresponding to each image, to obtain the last hidden state graph corresponding to the image sequence.

The temporal relationship feature acquisition subunit may be configured to take the last hidden state graph as the temporal relationship features between the multiple face regions corresponding to the image sequence.
It should be noted that the video detection apparatus 1000 shown in FIG. 10 can perform the steps of the method embodiments shown in FIG. 1 to FIG. 5 and realize the processes and effects of those embodiments, which are not repeated here.

Embodiments of the present disclosure further provide a video detection device, which may include a processor and a memory, the memory being configured to store executable instructions. The processor may be configured to read the executable instructions from the memory and execute them to implement the video detection method of the above embodiments.
FIG. 11 shows a schematic structural diagram of a video detection device provided by an embodiment of the present disclosure. Reference is made below to FIG. 11, which shows a structure suitable for implementing the video detection device 1100 in the embodiments of the present disclosure.

In this embodiment of the present disclosure, the video detection device 1100 may be a computing device. The computing device may include an electronic device, a server, or the like, without limitation here. The electronic device may include a device with computing capability such as a mobile phone, tablet computer, desktop computer, notebook computer, in-vehicle terminal, wearable electronic device, all-in-one machine, or smart home device, or a device simulated by a virtual machine or an emulator. The server may be a standalone server or a cluster of servers, and may include servers deployed locally and servers deployed in the cloud.

The computer device provided by this embodiment of the present disclosure can execute the processing flow of the above method embodiments. As shown in FIG. 11, the computer device 1100 includes a memory 1110, a processor 1120, a computer program, and a communication interface 1130; the computer program is stored in the memory 1110 and is configured to be executed by the processor 1120 to perform the video detection method described above.

It should be noted that the video detection device 1100 shown in FIG. 11 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the present disclosure.
In addition, an embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the video detection method described in the above embodiments.
It should be noted that, herein, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.

The above descriptions are merely specific embodiments of the present disclosure, enabling those skilled in the art to understand or implement the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present disclosure. Therefore, the present disclosure is not limited to the embodiments described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.