
CN115116468A - Video generation method and device, storage medium and electronic equipment - Google Patents

Video generation method and device, storage medium and electronic equipment

Info

Publication number
CN115116468A
CN115116468A
Authority
CN
China
Prior art keywords
frame
dimension
preselected
target
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210688868.XA
Other languages
Chinese (zh)
Inventor
杨红庄
甄海洋
王超
周维
王磊
王进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rainbow Software Co ltd
Original Assignee
Rainbow Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rainbow Software Co ltd
Priority to CN202210688868.XA
Publication of CN115116468A
Priority claimed by PCT/CN2023/094868 (published as WO2023241298A1)
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/06 - Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L 21/10 - Transforming into visible information
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 - Image enhancement or restoration
    • G06T 5/50 - Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 - Detection; Localisation; Normalisation
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 - Image acquisition modality
    • G06T 2207/10016 - Video; Image sequence
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20212 - Image combination
    • G06T 2207/20221 - Image fusion; Image merging
    • G06T 2207/30 - Subject of image; Context of image processing
    • G06T 2207/30196 - Human being; Person
    • G06T 2207/30201 - Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the invention discloses a video generation method comprising the following steps: acquiring a candidate frame sequence; determining a target frame from the candidate frame sequence according to a frame selection dimension; and performing voice driving on the target frame based on a current voice signal to obtain a target video. The frame selection dimension comprises at least one of a first frame selection dimension and a second frame selection dimension. Through the scheme of this embodiment, a target frame that meets the voice-driving requirements is obtained, and the effect of subsequent voice driving is improved.

Description

A Video Generation Method and Apparatus, Storage Medium, and Electronic Device

Technical Field

This document relates to video generation technology, and in particular to a video generation method and apparatus, a storage medium, and an electronic device.

Background

Voice-driven video generation has been widely applied in many fields. In the prior art, an unfiltered single static frame is typically used as the input frame, and a video is generated from it by voice driving. However, voice driving imposes many requirements on the input frame: for example, the image must be sharp, the face centered, and the expression neutral. These requirements are difficult to satisfy with an unfiltered single static frame.

Summary of the Invention

Compared with the related art, the technical solution described in this application obtains a target frame that meets the voice-driving requirements, improving the effect of subsequent voice driving; at the same time, it solves the problem of facial details being lost due to changes in the expression coefficients during voice driving, making the generated video more vivid and natural.

To achieve the purpose of the embodiments of the present invention, an embodiment of the present invention provides a video generation method, which may include:

acquiring a candidate frame sequence;

determining a target frame from the candidate frame sequence according to a frame selection dimension;

performing voice driving on the target frame based on a current voice signal to obtain a target video;

wherein the frame selection dimension includes at least one of a first frame selection dimension and a second frame selection dimension.

In an exemplary embodiment of the present invention, determining the target frame from the candidate frame sequence according to the frame selection dimension includes:

acquiring, from the candidate frame sequence according to the frame selection dimension, preselected frames that satisfy a frame selection condition, wherein the preselected frames are one or more frames;

when the preselected frames are a single frame, taking that preselected frame as the target frame;

when the preselected frames are multiple frames, fusing the multiple preselected frames to obtain the target frame;

wherein the frame selection condition includes at least one of a first frame selection condition and a second frame selection condition.

In an exemplary embodiment of the present invention, the fusion includes at least one of a first fusion and a second fusion.

In an exemplary embodiment of the present invention, acquiring, from the candidate frame sequence according to the frame selection dimension, preselected frames that satisfy the frame selection condition includes:

calculating a first dimension value of each frame in the candidate frame sequence according to the first frame selection dimension; and acquiring, from the candidate frame sequence, first preselected frames whose first dimension values satisfy the first frame selection condition;

wherein the first preselected frames are one or more frames.

In an exemplary embodiment of the present invention, the first frame selection condition is that the first dimension value is within a first frame selection range.

In an exemplary embodiment of the present invention, when the first preselected frames are a single frame, that first preselected frame is the target frame;

when the first preselected frames are multiple frames, a first fusion is performed on the multiple first preselected frames to obtain the target frame.

In an exemplary embodiment of the present invention, acquiring, from the candidate frame sequence according to the frame selection dimension, preselected frames that satisfy the frame selection condition includes:

calculating a second dimension value of each frame in the candidate frame sequence according to the second frame selection dimension;

acquiring, from the candidate frame sequence, second preselected frames whose second dimension values satisfy the second frame selection condition;

wherein the second preselected frames are one or more frames.

In an exemplary embodiment of the present invention, when the second preselected frames are a single frame, that second preselected frame is the target frame;

when the second preselected frames are multiple frames, a second fusion is performed on the multiple second preselected frames to obtain the target frame.

In an exemplary embodiment of the present invention, the second frame selection condition is that the second dimension value or a second dimension composite value is the lowest or the highest.

In an exemplary embodiment of the present invention, acquiring, according to the frame selection dimension, preselected frames that satisfy the frame selection condition includes:

calculating a second dimension value of each frame among the first preselected frames according to the second frame selection dimension;

acquiring, from the first preselected frames, second preselected frames whose second dimension values satisfy the second frame selection condition;

wherein the second preselected frames are one or more frames.

In an exemplary embodiment of the present invention, when the second preselected frames are a single frame, that second preselected frame is the target frame;

when the second preselected frames are multiple frames, a second fusion is performed on the multiple second preselected frames to obtain the target frame.

In an exemplary embodiment of the present invention, when the first preselected frames are multiple frames, the second frame selection condition is that the second dimension value or the second dimension composite value is the lowest or the highest.

In an exemplary embodiment of the present invention, when the first preselected frames are a single frame, the second frame selection condition is that the second dimension value is within a second frame selection range.

In an exemplary embodiment of the present invention, performing voice driving on the target frame based on the current voice signal to obtain the target video includes:

generating, from the current voice signal, corresponding driving expression coefficients through a trained voice-driving model;

matching the target frame with the driving expression coefficients to generate key frames;

performing expression matching on the key frames based on the candidate frame sequence and the target frame to obtain drive frames;

wherein consecutive drive frames constitute the target video.
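The driving loop described above can be sketched as follows. The coefficient model, renderer, and their signatures are hypothetical placeholders standing in for the trained voice-driving model, not the patented implementation:

```python
import numpy as np

def voice_drive(target_frame, candidate_frames, audio_windows, coeff_model, renderer):
    """Sketch of the voice-driving loop: each audio window is mapped to
    driving expression coefficients, each coefficient set drives the target
    frame into one key frame, and consecutive drive frames form the video."""
    drive_frames = []
    for window in audio_windows:
        coeffs = coeff_model(window)            # driving expression coefficients
        key = renderer(target_frame, coeffs)    # key frame matched to the coefficients
        # Expression matching against the candidate sequence would refine
        # facial details here; this sketch keeps the key frame as the drive frame.
        drive_frames.append(key)
    return drive_frames                         # consecutive drive frames = target video

# Toy stand-ins: a "model" that returns a coefficient vector and a "renderer"
# that brightens the frame in proportion to the mean coefficient.
toy_model = lambda w: np.full(52, float(np.mean(w)))
toy_render = lambda f, c: np.clip(f + c.mean(), 0.0, 1.0)

frames = voice_drive(np.zeros((4, 4)), None,
                     [np.array([0.1]), np.array([0.2])],
                     toy_model, toy_render)
```

In a real system the renderer would be a generative face model conditioned on the expression coefficients; the loop structure, however, stays the same.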

An embodiment of the present invention further provides a video generation apparatus, which may include:

an acquisition unit, configured to acquire a candidate frame sequence;

a frame selection unit, configured to determine a target frame from the candidate frame sequence according to a frame selection dimension; and

a drive unit, configured to perform voice driving on the target frame based on a current voice signal to obtain a target video.

An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of any one of the video generation methods described above.

An embodiment of the present invention further provides an electronic device, which may include:

a processor; and

a memory for storing executable instructions of the processor;

wherein the processor is configured to execute, via the executable instructions, any one of the video generation methods described above.

Through the above solution of the embodiments of the present invention, a target frame that meets the voice-driving requirements is obtained, improving the effect of subsequent voice driving; at the same time, the problem of facial details being lost due to changes in the expression coefficients during voice driving is solved, making the generated video more vivid and natural.

Other features and advantages of the present application will be set forth in the description that follows and, in part, will become apparent from the description or be learned by practice of the present application. Other advantages of the present application may be realized and attained by the solutions described in the specification and the drawings.

Brief Description of the Drawings

The accompanying drawings are provided to aid understanding of the technical solution of the present application and constitute a part of the specification. Together with the embodiments of the present application, they serve to explain the technical solution of the present application and do not constitute a limitation on it.

FIG. 1 is a flowchart of a video generation method according to an embodiment of the present application;

FIG. 2 is a flowchart of determining a target frame from a candidate frame sequence according to an embodiment of the present application;

FIG. 3 is a flowchart of determining a target frame from a candidate frame sequence according to another embodiment of the present application;

FIG. 4a is a schematic diagram of eye feature points according to an embodiment of the present application;

FIG. 4b is a schematic diagram of mouth feature points according to an embodiment of the present application;

FIG. 5 is a flowchart of determining a target frame from a candidate frame sequence according to yet another embodiment of the present application;

FIG. 6 is a flowchart of performing voice driving on a target frame to obtain a target video according to an embodiment of the present application;

FIG. 7 is a flowchart of a video generation method in a video call according to an embodiment of the present application;

FIG. 8 is a block diagram of a video generation apparatus according to an embodiment of the present application.

Detailed Description

This application describes a number of embodiments, but the description is exemplary rather than restrictive, and it will be apparent to those of ordinary skill in the art that there can be more embodiments and implementations within the scope of the embodiments described in this application. Although many possible combinations of features are shown in the drawings and discussed in the detailed description, many other combinations of the disclosed features are possible. Unless expressly limited, any feature or element of any embodiment may be used in combination with, or substituted for, any other feature or element of any other embodiment.

This application includes and contemplates combinations with features and elements known to those of ordinary skill in the art. The embodiments, features, and elements disclosed in this application may also be combined with any conventional features or elements to form a unique inventive solution as defined by the claims. Any feature or element of any embodiment may also be combined with features or elements from other inventive solutions to form another unique inventive solution defined by the claims. It should therefore be understood that any of the features shown and/or discussed in this application may be implemented alone or in any suitable combination. Accordingly, the embodiments are not limited except in accordance with the appended claims and their equivalents. Furthermore, various modifications and changes may be made within the scope of protection of the appended claims.

Furthermore, in describing representative embodiments, the specification may have presented a method and/or process as a particular sequence of steps. However, to the extent that the method or process does not depend on the particular order of the steps described herein, it should not be limited to that particular sequence. As those of ordinary skill in the art will understand, other sequences of steps are also possible. Therefore, the particular order of steps set forth in the specification should not be construed as a limitation on the claims. In addition, claims directed to the method and/or process should not be limited to performing their steps in the order written; those skilled in the art will readily appreciate that these orders may be varied while still remaining within the spirit and scope of the embodiments of the present application.

An embodiment of the present application provides a video generation method. As shown in FIG. 1, the method includes:

S101: acquiring a candidate frame sequence.

The candidate frame sequence may include at least one of real-time video buffer frames and pre-stored frames pre-shot by the user, and contains no fewer than two candidate frames. The pre-stored frames pre-shot by the user may be frames captured by the user, according to system prompts, based on different frame selection dimensions.

Each candidate frame needs to contain face information.

S102: acquiring a target frame from the candidate frame sequence according to the frame selection dimension.

Based on the requirements of the voice-driving model for the target frame, the frame selection dimension may include at least one of a first frame selection dimension and a second frame selection dimension. The first frame selection dimension may be a picture dimension, including at least one of face position, face orientation, body posture, and lighting. The second frame selection dimension may be at least one of an image quality dimension and a facial features dimension: the image quality dimension may include blur, shadow, noise, and so on, and the facial features dimension may include at least one of an eye dimension and a mouth dimension.

The frame selection dimension may be preset or generated automatically according to the needs of the voice-driving model.

The requirements of the voice-driving model for the target frame may include at least one of picture requirements, image quality requirements, and facial features requirements. The picture requirements include at least one of the face being centered, the face facing forward, the body posture being neutral, and the lighting being moderate; the image quality requirements may include a sharp image; and the facial features requirements may include at least one of the eyes being open and the mouth being closed.

The target frame is one or more frames in the candidate frame sequence that satisfy all frame selection conditions.

Specifically, step S102 may include:

S1021: acquiring, from the candidate frame sequence according to the frame selection dimension, preselected frames that satisfy a frame selection condition, wherein the preselected frames are one or more frames.

The frame selection condition is the condition that must be satisfied to meet the requirements of the voice-driving model for the target frame.

The frame selection condition may include at least one of a first frame selection condition and a second frame selection condition.

S1022: determining whether the preselected frames are a single frame.

S1023: when the preselected frames are a single frame, taking that preselected frame as the target frame.

S1024: when the preselected frames are multiple frames, fusing the multiple preselected frames to obtain the target frame.

The above fusion includes at least one of a first fusion and a second fusion.

S103: performing voice driving on the target frame based on the current voice signal to generate the target video.

Through the method in this embodiment, a target frame that meets the voice-driving requirements can be obtained, improving the effect of subsequent voice driving.
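Steps S101 through S103 can be sketched as follows. This is a minimal illustration only: the frame scoring, range check, fusion, and voice-driving callables are hypothetical placeholders, not the patented implementations:

```python
def generate_video(candidate_frames, voice_signal, score_fn, in_range, fuse, drive):
    """S101-S103: pick target frame(s) from the candidates, then voice-drive them.

    score_fn : computes a frame's dimension value (e.g. a face position value)
    in_range : the frame selection condition on that value
    fuse     : merges multiple preselected frames into one target frame
    drive    : voice-drives the target frame into a video
    """
    preselected = [f for f in candidate_frames if in_range(score_fn(f))]  # S1021
    if not preselected:
        raise ValueError("no candidate frame satisfies the frame selection condition")
    # S1022-S1024: a single preselected frame is the target; multiple are fused.
    target = preselected[0] if len(preselected) == 1 else fuse(preselected)
    return drive(target, voice_signal)  # S103

# Toy example: frames are numbers, the "dimension value" is the number itself,
# the condition keeps values in [0.4, 0.6], fusion averages, driving repeats.
video = generate_video(
    [0.1, 0.5, 0.55, 0.9], None,
    score_fn=lambda f: f,
    in_range=lambda v: 0.4 <= v <= 0.6,
    fuse=lambda fs: sum(fs) / len(fs),
    drive=lambda t, s: [t] * 3,
)
```

The toy run preselects 0.5 and 0.55, fuses them by averaging, and "drives" the result into a three-frame video.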

An embodiment of the present application provides a method for acquiring, from a candidate frame sequence, preselected frames that satisfy a frame selection condition. As shown in FIG. 2, the method includes:

S201: calculating a first dimension value of each frame in the candidate frame sequence according to the first frame selection dimension.

In this embodiment, the first frame selection dimension is the picture dimension, which may include at least one of face position, face orientation, body posture, and lighting.

Correspondingly, the first dimension value may include at least one of a face position value, a face orientation value, a body posture value, and a lighting value.

In an exemplary embodiment, the method for calculating the face position value includes:

based on the face feature points, obtaining the center point bbox_center of the face bounding box in the candidate frame, and calculating the normalized horizontal and vertical coordinates bbox_center_u/v of bbox_center within the candidate frame; these coordinate ratios bbox_center_u/v are the face position value.
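A minimal sketch of this computation, assuming the face feature points are given as (x, y) pixel coordinates (the helper name is illustrative, not from the patent):

```python
def face_position_value(landmarks, frame_w, frame_h):
    """Face bounding box center, normalized by frame size (bbox_center_u/v)."""
    xs = [p[0] for p in landmarks]
    ys = [p[1] for p in landmarks]
    cx = (min(xs) + max(xs)) / 2.0  # bbox_center, x coordinate
    cy = (min(ys) + max(ys)) / 2.0  # bbox_center, y coordinate
    return cx / frame_w, cy / frame_h  # (bbox_center_u, bbox_center_v)

# A face spanning pixels (40, 60)-(120, 140) in a 160x200 frame
# is exactly centered, so both ratios come out to 0.5:
u, v = face_position_value([(40, 60), (120, 140), (80, 100)], 160, 200)
```

A perfectly centered face yields (0.5, 0.5); the frame selection condition then checks these ratios against the TMin/TMax thresholds described below.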

In an exemplary embodiment, the method for calculating the face orientation value includes:

based on the face feature points, obtaining the face orientation angles (roll, yaw, pitch) of the candidate frame; these face orientation angles (roll, yaw, pitch) are the face orientation value.

In an exemplary embodiment, the method for calculating the body posture value includes:

comparing the body joint points of an upright reference posture with the body joint points of the candidate frame to obtain a joint point relative relation value T_val; this relative relation value T_val is the body posture value.
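One way to realize such a relative relation value (an illustrative choice; the patent does not fix the metric) is the mean distance between corresponding normalized joint positions:

```python
import math

def body_posture_value(joints, reference_joints):
    """T_val sketch: mean Euclidean distance between corresponding joint
    points of the candidate frame and an upright reference posture.
    Both inputs are lists of (x, y) coordinates, assumed pre-normalized."""
    dists = [math.dist(a, b) for a, b in zip(joints, reference_joints)]
    return sum(dists) / len(dists)

# A posture identical to the reference yields T_val == 0.
ref = [(0.5, 0.2), (0.5, 0.5), (0.5, 0.8)]
t_val = body_posture_value(ref, ref)
```

Under this choice, a smaller T_val means a posture closer to the upright reference, which matches the T_val < ϵ condition used below.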

In an exemplary embodiment, the method for calculating the lighting value includes:

counting the proportions of pixels whose brightness is below an underexposure brightness threshold or above an overexposure brightness threshold to obtain an underexposure ratio and an overexposure ratio; this underexposure ratio and overexposure ratio are the lighting value.

The underexposure brightness threshold and the overexposure brightness threshold may be preset as needed or generated automatically by the system.

Alternatively, the proportion of dark pixels in the candidate frame may be obtained by counting the brightness distribution of its pixels; this proportion is the lighting value.
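The pixel-counting step can be sketched as follows (the thresholds 30 and 220 are illustrative placeholders; the patent leaves them configurable):

```python
def lighting_value(gray_pixels, under_thresh=30, over_thresh=220):
    """Under- and overexposure ratios from a flat list of 0-255 brightness values."""
    n = len(gray_pixels)
    under = sum(1 for p in gray_pixels if p < under_thresh) / n  # underexposure ratio
    over = sum(1 for p in gray_pixels if p > over_thresh) / n    # overexposure ratio
    return under, over

# 2 dark pixels and 1 bright pixel out of 8:
under, over = lighting_value([5, 10, 128, 130, 140, 150, 160, 250])
```

Here the example yields an underexposure ratio of 0.25 and an overexposure ratio of 0.125; both ratios are then compared against their thresholds in the lighting range check below.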

S202: acquiring, from the candidate frame sequence, first preselected frames whose first dimension values satisfy the first frame selection condition.

The first frame selection condition is that the first dimension value is within the first frame selection range.

Corresponding to the first dimension, the first frame selection range may include at least one of a face position range, a face orientation range, a body posture range, and a lighting range.

In an exemplary embodiment, the face position range is:

TMin_u/v < bbox_center_u/v < TMax_u/v;

where bbox_center_u/v is the face position value, TMin_u/v is the minimum threshold of the coordinate ratios, and TMax_u/v is the maximum threshold of the coordinate ratios. TMin_u/v and TMax_u/v may be preset as needed or generated automatically by the system.

A frame whose face position value is within the face position range satisfies the requirement that the face be centered.

In an exemplary embodiment, the face orientation range is:

roll < T_roll, yaw < T_yaw, pitch < T_pitch;

where (roll, yaw, pitch) is the face orientation value and (T_roll, T_yaw, T_pitch) are the face orientation thresholds, which may be preset as needed or generated automatically by the system.

A frame whose face orientation value is within the face orientation range satisfies the requirement that the face face forward.

In an exemplary embodiment, the body posture range is:

T_val < ϵ;

where T_val is the body posture value and ϵ is the body posture threshold. The upright reference joint points and the body posture threshold may be preset as needed or generated automatically by the system.

A frame whose body posture value is within the body posture range satisfies the requirement that the body posture be neutral.

In an exemplary embodiment, the lighting range is:

overexposure ratio ≤ overexposure threshold, underexposure ratio ≤ underexposure threshold;

where the underexposure ratio and overexposure ratio are the lighting value, and the overexposure threshold and underexposure threshold may be preset as needed or generated automatically by the system.

A frame whose lighting value is within the lighting range satisfies the requirement that the lighting be moderate.
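Putting the four range checks together, the first frame selection condition for one candidate frame can be sketched as follows (all threshold values here are illustrative placeholders, not values specified by the patent):

```python
def passes_first_condition(m, th):
    """m  : measured first dimension values of one candidate frame
       th : frame selection ranges (thresholds), all illustrative"""
    centered = (th["tmin_uv"] < m["u"] < th["tmax_uv"]
                and th["tmin_uv"] < m["v"] < th["tmax_uv"])       # face centered
    forward = (m["roll"] < th["t_roll"] and m["yaw"] < th["t_yaw"]
               and m["pitch"] < th["t_pitch"])                    # face facing forward
    neutral = m["t_val"] < th["eps"]                              # posture neutral
    moderate = m["over"] <= th["t_over"] and m["under"] <= th["t_under"]  # lighting
    return centered and forward and neutral and moderate

thresholds = {"tmin_uv": 0.3, "tmax_uv": 0.7, "t_roll": 15, "t_yaw": 20,
              "t_pitch": 15, "eps": 0.2, "t_over": 0.05, "t_under": 0.1}
frame = {"u": 0.5, "v": 0.48, "roll": 3, "yaw": 5, "pitch": 2,
         "t_val": 0.05, "over": 0.01, "under": 0.02}
ok = passes_first_condition(frame, thresholds)
```

A frame passes only when every enabled check passes; frames passing this predicate form the first preselected frames of step S202.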

The first preselected frames may be one or more frames.

If no frame in the candidate frame sequence satisfies the first frame selection condition, the user is prompted to shoot or upload images to the candidate frame sequence according to the first frame selection condition, until the candidate frame sequence contains a frame that satisfies it; that frame is the target frame.

S203: Determine whether the first preselected frame is a single frame.

S204: When the first preselected frame is a single frame, that frame is the target frame.

S205: When the first preselected frame is multiple frames, perform a first fusion on the multiple first preselected frames to obtain the target frame.

Specifically, the first fusion includes:

taking any one of the multiple first preselected frames as the reference frame and the remaining frames as matching frames;

obtaining the Harris corners of the reference frame, recorded as reference points;

computing a feature descriptor for each reference point;

obtaining a matching range in each matching frame, where the matching range may be a circle centered on the point in the matching frame corresponding to a reference point of the reference frame, with the matching distance as its radius; optionally, the matching distance is 5-15 pixels;

computing feature descriptors for the points within the matching range, and selecting the point whose descriptor is closest to that of the reference point as the matching point;

based on the reference points of the reference frame and the matching points of the matching frame, obtaining a homography matrix through projective transformation; optionally, the homography matrix has 8 degrees of freedom, so at least 4 pairs of reference and matching points suffice to determine it;

based on the homography matrix, obtaining the pixel correspondence between the reference frame and the matching frame through matrix transformation and pixel interpolation;

subtracting corresponding pixels of the reference frame and the matching frame to obtain the absolute value of the pixel difference;

comparing the absolute pixel difference with a pixel noise threshold to obtain pixel weights;

performing a weighted average of the corresponding pixels of the reference frame and the matching frame according to the pixel weights, to obtain the target frame.

Through the first fusion, the multiple first preselected frames are not only merged into a single target frame; the fused target frame also has higher spatial resolution, clearer detail, and lower noise.
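The last three fusion steps (pixel difference, noise thresholding, weighted average) can be sketched for a single aligned matching frame as follows; the noise threshold of 10 gray levels and the equal-weight rule for non-noisy pixels are illustrative assumptions, not values fixed by this application:

```python
import numpy as np

def fuse_pair(ref, matched, noise_thresh=10):
    """Fuse an already-aligned matching frame into the reference frame.

    Where the absolute pixel difference exceeds noise_thresh, the pixels
    are treated as mismatched and the reference pixel keeps full weight;
    otherwise the two frames are averaged equally. Both inputs are float
    arrays of the same shape, aligned beforehand via the homography.
    """
    diff = np.abs(ref - matched)
    w_match = np.where(diff <= noise_thresh, 0.5, 0.0)  # per-pixel weight
    w_ref = 1.0 - w_match
    return w_ref * ref + w_match * matched
```

With more than two first preselected frames, each matching frame would be fused into the reference in turn, accumulating weights per pixel.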

With the method in this embodiment, a target frame that meets the voice-driving requirements can be obtained, improving the effect of subsequent voice driving.

An embodiment of the present application provides a method for obtaining, from a candidate frame sequence, a preselected frame that satisfies a frame-selection condition. As shown in FIG. 3, the method includes:

S301: Calculate the second dimension value of each frame in the candidate frame sequence according to the second frame-selection dimension.

In this embodiment, the second frame-selection dimension is at least one of an image quality dimension or a facial features dimension; the image quality dimension may include blurriness, and the facial features dimension may include at least one of an eye dimension and a mouth dimension.

Correspondingly, the second dimension value may include at least one of a blurriness value and a facial feature dimension value (an eye dimension value, a mouth dimension value).

In an exemplary embodiment, the method for calculating the blurriness value may include:

applying Gaussian blur to each frame in the candidate frame sequence to obtain its Gaussian-blurred image;

computing horizontal and vertical gradients for each frame and for its Gaussian-blurred image;

based on these gradient values, computing the horizontal and vertical gradient differences between each frame and its Gaussian-blurred image;

summing the horizontal and vertical gradient differences to obtain the blurriness value.

With the blurriness value calculated by this method, a higher value indicates a blurrier frame and a lower value a sharper frame.

The present application does not restrict the method of calculating the blurriness value; other methods may be chosen, in which a lower value may indicate a blurrier frame and a higher value a sharper frame.
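A minimal NumPy sketch of the blur-versus-gradient idea described above; the kernel size, sigma, and the final Crete-style normalization (dividing the gradient energy lost to blurring by the total gradient energy, so that a higher score means a blurrier frame, matching the convention above) are assumptions, not details fixed by this application:

```python
import numpy as np

def blur_value(frame, ksize=9, sigma=2.0):
    """Blur score in [0, 1]: higher means blurrier (assumed normalization)."""
    f = frame.astype(np.float64)
    # separable Gaussian blur
    r = ksize // 2
    x = np.arange(-r, r + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    g = np.apply_along_axis(lambda row: np.convolve(row, k, mode="same"), 1, f)
    g = np.apply_along_axis(lambda col: np.convolve(col, k, mode="same"), 0, g)
    # horizontal / vertical absolute gradients of the frame and its blurred image
    d_h, d_v = np.abs(np.diff(f, axis=1)), np.abs(np.diff(f, axis=0))
    b_h, b_v = np.abs(np.diff(g, axis=1)), np.abs(np.diff(g, axis=0))
    # gradient energy lost by blurring (clamped at 0), summed over both directions
    lost = np.maximum(0, d_h - b_h).sum() + np.maximum(0, d_v - b_v).sum()
    total = d_h.sum() + d_v.sum()
    return 1.0 - lost / max(total, 1e-12)
```

A sharp frame loses most of its gradient energy when blurred (large `lost`), so it scores low; an already-blurry frame changes little and scores high.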

In an exemplary embodiment, as shown in FIG. 4a, the eye dimension value may be calculated as:

eye_val = 1 - len(pt42 - pt48) / len(pt39 - pt45)

where pt42, pt48, pt39, pt45 are eye feature points obtained from the face feature points, len(pt42 - pt48) is the distance between pt42 and pt48, and len(pt39 - pt45) is the distance between pt39 and pt45.

With the eye dimension value calculated by this method, a lower eye_val indicates a more widely opened eye.

The present application does not restrict the method of calculating the eye dimension value; other methods may be chosen, in which a lower value may indicate a less open eye and a higher value a more open eye.

In an exemplary embodiment, as shown in FIG. 4b, the mouth dimension value may be calculated as:

mouth_val = len(pt89 - pt93) / len(pt87 - pt91)

where pt89, pt93, pt87, pt91 are mouth feature points obtained from the face feature points, len(pt89 - pt93) is the distance between pt89 and pt93, and len(pt87 - pt91) is the distance between pt87 and pt91.

With the mouth dimension value calculated by this method, a lower mouth_val indicates a more tightly closed mouth.

The present application does not restrict the method of calculating the mouth dimension value; other methods may be chosen, in which a lower value may indicate a less closed mouth and a higher value a more closed mouth.
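Both the eye and mouth formulas above reduce to ratios of landmark distances. A minimal sketch, assuming 2-D landmark coordinates (the point indices follow the application's FIG. 4a/4b, but the coordinates used in the test are invented):

```python
import numpy as np

def dist(a, b):
    """Euclidean distance between two landmark points."""
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

def eye_val(pt42, pt48, pt39, pt45):
    # 1 - (vertical lid distance / horizontal corner distance):
    # lower value => eye more widely opened
    return 1.0 - dist(pt42, pt48) / dist(pt39, pt45)

def mouth_val(pt89, pt93, pt87, pt91):
    # vertical lip distance / horizontal corner distance:
    # lower value => mouth more tightly closed
    return dist(pt89, pt93) / dist(pt87, pt91)
```

Because both are ratios of distances on the same face, they are invariant to uniform scaling of the image.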

Corresponding to the second frame-selection dimension, the frame-selection conditions include at least one of clear image quality, open eyes, and closed mouth.

When the first preselected frame is multiple frames, the method includes:

S302: Obtain, from the candidate frame sequence, a second preselected frame whose second dimension value satisfies the second frame-selection condition.

When the first preselected frame is multiple frames, the second frame-selection condition is that the second dimension value or the second dimension composite value is the lowest or the highest.

When the second frame-selection condition is that the second dimension value is the lowest or the highest, specifically, the method includes:

S3021: Obtain the frame with the lowest or highest second dimension value in the candidate frame sequence as the second preselected frame.

In an exemplary embodiment, when the second frame-selection dimension includes blurriness, the frame with the lowest blurriness value in the candidate frame sequence is obtained as the second preselected frame.

In an exemplary embodiment, when the second frame-selection dimension includes the eye dimension, the frame with the lowest eye dimension value in the candidate frame sequence is obtained as the second preselected frame.

In an exemplary embodiment, when the second frame-selection dimension includes the mouth dimension, the frame with the lowest mouth dimension value in the candidate frame sequence is obtained as the second preselected frame.

In an exemplary embodiment, when the second frame-selection dimension includes blurriness and the eye dimension, the frame with the lowest blurriness value and the frame with the lowest eye dimension value in the candidate frame sequence are obtained as the second preselected frames; these may be the same frame or different frames.

In an exemplary embodiment, when the second frame-selection dimension includes blurriness and the mouth dimension, the frame with the lowest blurriness value and the frame with the lowest mouth dimension value in the candidate frame sequence are obtained as the second preselected frames; these may be the same frame or different frames.

In an exemplary embodiment, when the second frame-selection dimension includes the eye dimension and the mouth dimension, the frame with the lowest eye dimension value and the frame with the lowest mouth dimension value in the candidate frame sequence are obtained as the second preselected frames; these may be the same frame or different frames.

In an exemplary embodiment, when the second frame-selection dimension includes blurriness, the eye dimension, and the mouth dimension, the frame with the lowest blurriness value, the frame with the lowest eye dimension value, and the frame with the lowest mouth dimension value in the candidate frame sequence are obtained as the second preselected frames; these may be the same frame or different frames.

In some embodiments, a higher blurriness value may indicate a sharper frame, a higher eye dimension value a more widely opened eye, and a higher mouth dimension value a more tightly closed mouth; in this case, at least one of the frame with the highest blurriness value, the frame with the highest eye dimension value, and the frame with the highest mouth dimension value in the candidate frame sequence is obtained as the second preselected frame.

Because the second frame-selection dimension contains multiple selection criteria, and different evaluation metrics often have different dimensions and units, which would affect the analysis results, the present application introduces the second dimension composite value to eliminate these dimensional effects between metrics.

When the second frame-selection condition is that the second dimension composite value is the lowest or the highest, specifically, the method includes:

S3022: Calculate the second dimension composite value of each frame in the candidate frame sequence.

The second dimension composite value may be a weighted value of the second dimension values.

S3023: Obtain the frame with the lowest or highest second dimension composite value in the candidate frame sequence as the second preselected frame.

In an exemplary embodiment, when the second frame-selection dimension includes blurriness, the eye dimension, and the mouth dimension, the weighted value of the second dimension values of each frame in the candidate frame sequence is calculated to obtain the second dimension composite value, and the frame with the lowest or highest composite value in the candidate frame sequence is obtained as the second preselected frame.
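The composite-value selection of S3022/S3023 can be sketched as a weighted sum followed by an argmin (or argmax). The dimension names and weights below are invented for the example, and the per-dimension normalization motivated above is assumed to have been applied by the caller:

```python
import numpy as np

def composite_value(dim_values, weights):
    """Weighted second-dimension composite value for one frame.

    dim_values / weights: dicts keyed by dimension name (e.g. "blur",
    "eye", "mouth"); the raw values are assumed already normalized to
    a common scale.
    """
    return sum(weights[k] * dim_values[k] for k in weights)

def pick_second_preselected(frames, weights, lowest=True):
    """Return (index, score) of the frame with the lowest (or highest)
    composite value in the candidate frame sequence."""
    scores = [composite_value(f, weights) for f in frames]
    idx = int(np.argmin(scores) if lowest else np.argmax(scores))
    return idx, scores[idx]
```

Whether the minimum or maximum is taken depends on the direction convention chosen for each metric, as discussed above.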

The second preselected frame may be one frame or multiple frames.

S303: Determine whether the second preselected frame is a single frame.

S304: When the second preselected frame is a single frame, that frame is the target frame.

S305: When the second preselected frame is multiple frames, perform a second fusion on the multiple second preselected frames to obtain the target frame.

Specifically, the second fusion includes:

obtaining the face deviation value of the multiple second preselected frames based on the face feature points;

comparing the face deviation value with a fusion threshold;

when the face deviation value is smaller than the fusion threshold, obtaining the optimal fusion boundary based on the face feature points;

performing facial-feature fusion on the multiple second preselected frames according to the optimal fusion boundary, to obtain the target frame;

when the face deviation value is not smaller than the fusion threshold, obtaining the facial-feature correspondence of the multiple second preselected frames through affine transformation;

performing facial-feature fusion on the multiple second preselected frames based on this correspondence, to obtain the target frame;

where the fusion threshold can be preset as required or generated automatically by the system.

With the method in this embodiment, a target frame that meets the voice-driving requirements can be obtained, improving the effect of subsequent voice driving.

An embodiment of the present application provides a method for obtaining, from a candidate frame sequence according to the frame-selection dimensions, a preselected frame that satisfies the frame-selection conditions. As shown in FIG. 5, the method includes:

S401: Calculate the second dimension value of each frame of the first preselected frame according to the second frame-selection dimension.

The first preselected frame may be one frame or multiple frames.

S402: Obtain, from the first preselected frame, a second preselected frame whose second dimension value satisfies the second frame-selection condition.

Specifically, S402 includes:

S4021: Determine whether the first preselected frame is multiple frames.

When the first preselected frame is multiple frames, the second frame-selection condition is that the second dimension value or the second dimension composite value is the lowest or the highest.

S4022: Obtain, from the multiple first preselected frames, the frame with the lowest or highest second dimension value or second dimension composite value as the second preselected frame.

When the first preselected frame is a single frame, the second frame-selection condition is that the second dimension value is within the second frame-selection range.

S4023: Determine whether the second dimension value of the first preselected frame is within the second frame-selection range.

The second frame-selection range may include at least one of a blurriness range and a facial features range; the facial features range may include at least one of an eye range and a mouth range.

In an exemplary embodiment, the blurriness range is: blurriness value < blurriness threshold.

In an exemplary embodiment, the blurriness range is: blurriness value > blurriness threshold.

In an exemplary embodiment, the facial features range is: facial feature dimension value < facial features threshold.

In an exemplary embodiment, the facial features range is: facial feature dimension value > facial features threshold.

The facial features threshold may include at least one of an eye threshold and a mouth threshold.

The blurriness threshold and the facial features threshold can be preset as required or generated automatically by the system.

S4024: When the second dimension value of the first preselected frame is within the second frame-selection range, the first preselected frame is the second preselected frame.

In an exemplary embodiment, the frame-selection range includes the blurriness range and the facial features range; if the blurriness value of the first preselected frame is within the blurriness range and its facial feature dimension value is within the facial features range, the first preselected frame satisfies the second frame-selection condition and is the second preselected frame.
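A minimal sketch of the range check in S4023/S4024, assuming the "<" direction for all three ranges; the threshold values below are invented for illustration (the application allows either direction and system-generated thresholds):

```python
def in_second_range(blur, eye, mouth,
                    blur_thresh=0.5, eye_thresh=0.5, mouth_thresh=0.3):
    """True if a single first-preselected frame satisfies the second
    frame-selection condition, i.e. every second dimension value lies
    within its (assumed 'value < threshold') range."""
    return blur < blur_thresh and eye < eye_thresh and mouth < mouth_thresh
```

A frame failing this check proceeds to S4025, where a third preselected frame is fetched for fusion.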

S4025: When the second dimension value of the first preselected frame is not within the second frame-selection range, obtain a third preselected frame from the candidate frame sequence.

The third preselected frame is the frame with the lowest or highest second dimension value or second dimension composite value in the candidate frame sequence.

In an exemplary embodiment, the frame-selection range includes the blurriness range and the facial features range; if the blurriness value of the first preselected frame is not within the blurriness range but its facial feature dimension value is within the facial features range, the second dimension value of the first preselected frame is not within the second frame-selection range, and the frame with the lowest or highest blurriness value in the candidate frame sequence is obtained as the third preselected frame.

In an exemplary embodiment, the frame-selection range includes the blurriness range and the facial features range; if the blurriness value of the first preselected frame is not within the blurriness range and its facial feature dimension value is not within the facial features range, the second dimension value of the first preselected frame is not within the second frame-selection range, and the frame with the lowest or highest blurriness value and the frame with the lowest or highest facial feature dimension value, or the frame with the lowest or highest composite value of blurriness and facial feature dimension values, is obtained from the candidate frame sequence as the third preselected frame.

The third preselected frame may be one frame or multiple frames.

S4026: Fuse the first preselected frame with the third preselected frame to obtain a preselected fusion frame.

S4027: Determine whether the second dimension value of the preselected fusion frame is within the second frame-selection range.

If the second dimension value of the preselected fusion frame is within the second frame-selection range, the preselected fusion frame is the second preselected frame.

If the second dimension value of the preselected fusion frame is not within the second frame-selection range, the user is prompted to capture or upload an image according to the second frame-selection condition, and this image is fused with the first preselected frame as a new third preselected frame, until the second dimension value of the resulting preselected fusion frame is within the second frame-selection range, yielding the second preselected frame.

The fusion includes at least one of the first fusion and the second fusion.

S403: Determine whether the second preselected frame is a single frame.

S404: When the second preselected frame is a single frame, that frame is the target frame.

S405: When the second preselected frame is multiple frames, perform a second fusion on the multiple second preselected frames to obtain the target frame.

With the method in this embodiment, a target frame that meets the requirements of the voice-driving model can be obtained, improving the effect of subsequent voice driving.

An embodiment of the present application provides a method for voice-driving a target frame based on a current voice signal to obtain a target video. As shown in FIG. 6, the method includes:

S501: Train a voice-driving model.

Specifically, step S501 includes:

S5011: Obtain training material.

The training material needs to include voice information and its corresponding expression coefficient information.

The training material may be video material, which needs to contain voice information and image information, where the image information needs to include facial expression information.

The video material may be pre-recorded video or video crawled from the Internet.

S5012: Collect voice signal samples and their corresponding expression coefficient samples from the training material.

A voice signal sample is a time-series signal; it may be the voice signal itself or a spectral feature of the voice signal, for example, a Mel-spectrum feature.

When the training material is video material, step S5012 may specifically include:

extracting, according to the frame rate of the training material, the voice signal samples and their corresponding expression information;

obtaining the expression coefficients corresponding to the voice signal samples based on the expression information;

filtering and synthesizing the expression coefficients to obtain the expression coefficient samples.

S5013: Train the voice-driving model based on the voice signal samples and the expression coefficient samples.

Specifically, a 1D convolutional network may be trained on the voice signal samples and the expression coefficient samples; alternatively, the voice signal samples may be converted into 2D images and a 2D convolutional network trained on them; an LSTM (Long Short-Term Memory) network may be used for auxiliary training; a Transformer network may also be used for training.

The loss function may be computed directly from the expression coefficient samples, or the expression coefficient samples may be restored to meshes for loss training.

S502: Generate, from the current voice signal, the corresponding driving expression coefficients through the trained voice-driving model.

S503: Match the target frame with the consecutive driving expression coefficients to generate key frames.

Specifically, step S503 may include:

S5031: Preprocess the target frame.

The preprocessing includes foreground person segmentation, person depth estimation, and 3D face reconstruction: foreground person segmentation yields a foreground mask map, person depth estimation yields a person depth map, and 3D face reconstruction yields a 3D face model.

S5032: Obtain the face driving model according to the driving expression coefficients.

S5033: Obtain the key frame based on the target frame and the face driving model.

Specifically, step S5033 includes:

extracting the contour of the foreground region in the target frame according to the foreground mask map;

sampling the depth corresponding to the person in the target frame according to the person depth map;

performing Delaunay triangulation on the foreground region of the target frame, bounded by the foreground contour, to obtain the person 3D mesh B_s in projection space;

removing the face region from the person 3D mesh B_s to obtain the mesh B′_s;

obtaining the projection matrix P based on the 3D face reconstruction, and transforming the face deformation source mesh into projection space to obtain the 3D face model F_s;

merging the 3D face model F_s with the mesh B′_s, and linking the seam between their boundaries through triangulation, to obtain the deformation source mesh M_s;

transforming the face driving model into projection space through the projection matrix P, to obtain the face driving model F_t in projection space;

applying all vertex positions of the face driving model F_t to the corresponding vertices of the 3D face model F_s in the deformation source mesh M_s, to obtain the face mesh M_t;

letting the non-face region of the face mesh M_t be U_t = M_t / F_t, which corresponds to U_s = M_s / F_s on the deformation source mesh M_s.

Take the boundaries of F_s and F_t, denoted ∂F_s and ∂F_t respectively; the inner boundaries of U_s and U_t are then ∂F_s and ∂F_t, and their outer boundaries are the outer boundaries of M_s and M_t. (The exact boundary formulas at this point appear only as images in the published text.)

Adjust the vertex positions in U_t by optimizing a mesh-weighted Laplacian energy, so that F_t transitions smoothly and continuously into the face region; here, the corresponding vertices on the outer boundaries have identical positions and serve as fixed anchors, while the corresponding vertices on the inner boundaries have different positions and serve as moving anchors.

Compute the geodesic distance d from each vertex of U_s to the inner boundary, estimate that vertex's weight with the coefficient 1/d², and iteratively optimize to obtain the smoothed non-face region mesh U′_t; the smoothed deformation target mesh is then M′_t = U′_t ∪ F_t.

For each target pixel obtained by rendering M′_t into image space, the barycentric coordinates of that pixel on the mesh M′_t during rasterization can be obtained; applying these coordinates to M_s yields a point p′_s on the surface of M_s.

Project the point p′_s onto the preprocessed target frame to obtain the corresponding source pixel.

Obtain the coordinate key frame by inversely interpolating the offset between the target pixel coordinates and the source pixel coordinates in image space.

Based on the coordinate key frame, obtain the key frame through a least-squares image warp algorithm.

S504: Based on the to-be-selected frame sequence and the target frame, perform expression matching on the key frame to obtain a driving frame.

When the expression matching includes mouth matching, step S504 includes:

S5041: Obtain the facial expression coefficients of each frame in the to-be-selected frame sequence;

S5042: Based on the facial expression coefficients, obtain the face model corresponding to each frame in the to-be-selected frame sequence;

S5043: Calculate the mouth deviation between the face model corresponding to each frame in the to-be-selected frame sequence and the face driving model;

S5044: Take the frame corresponding to the face model with the smallest mouth deviation as the rendering frame;

S5045: Render the key frame with the rendering frame to obtain the driving frame.
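Steps S5043 and S5044 above can be sketched as follows (the mouth-vertex arrays and the mean-L2 deviation metric are assumptions for illustration; the patent does not define the deviation measure):

```python
import numpy as np

def pick_rendering_frame(candidate_mouth_verts, driven_mouth_verts):
    """Return the index of the candidate frame whose face-model mouth
    vertices deviate least (mean L2 distance) from the voice-driven
    face model's mouth vertices."""
    driven = np.asarray(driven_mouth_verts, dtype=float)
    deviations = [
        np.linalg.norm(np.asarray(c, dtype=float) - driven, axis=1).mean()
        for c in candidate_mouth_verts
    ]
    return int(np.argmin(deviations))

# Three candidate frames with increasing offsets from the driven model:
driven = np.zeros((4, 3))
candidates = [np.full((4, 3), 0.5), np.full((4, 3), 0.1), np.full((4, 3), 0.9)]
best = pick_rendering_frame(candidates, driven)
```

The selected index identifies the rendering frame used in S5045 to supply realistic mouth texture.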

In an exemplary embodiment, rendering the key frame with the rendering frame includes: extracting the structural information zgeo and the style information zstyle of the mouth in the key frame, and at the same time extracting the real style information of the mouth in the rendering frame,
Figure BDA0003698782440000181
; the real style information
Figure BDA0003698782440000182
and the structural information zgeo then yield a driving frame with realistic mouth texture and tooth structure.

When the expression matching includes eye matching, step S504 includes:

S5046: Obtain the eye opening amplitude based on the driving expression coefficients;

S5047: Input the eye opening amplitude and the target frame into a cGAN network, which outputs the eye image corresponding to that amplitude;

S5048: Match the eye image with the key frame to obtain the driving frame.

S505: Consecutive driving frames constitute the target video.

The method in this embodiment solves the problem of missing mouth details (for example, the inside of the mouth cavity and the teeth) that changes in the expression coefficients may cause, making the generated video more vivid and natural.

An embodiment of the present application provides a video generation method for use in a video call. As shown in FIG. 7, the method includes:

S601: Monitor the real-time network bandwidth of the video call;

S602: Determine whether the real-time network bandwidth is less than a network threshold;

The network threshold can be preset according to requirements or generated automatically by the system.

When the real-time network bandwidth is less than the network threshold, the video call freezes, and the video generation method includes:

S603: Obtain the to-be-selected frame sequence;

The to-be-selected frame sequence may include at least one of video frames buffered before the freeze and pre-stored frames pre-shot by the user, and contains no fewer than two to-be-selected frames.

S604: Determine the target frame from the to-be-selected frame sequence according to the frame selection dimension;

S605: Perform voice driving on the target frame based on the current voice signal to obtain the target video;

The current voice signal is the user's voice signal after the video freezes.

S606: Switch the picture of the video call to the target video;

S607: When the real-time network bandwidth is no longer less than the network threshold, switch back to the video call.
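The bandwidth-gated switching of steps S601 through S607 reduces to a simple threshold comparison, sketched below (the threshold value and function name are illustrative, not taken from the patent):

```python
def select_stream(bandwidth_kbps, threshold_kbps=300):
    """Mirror steps S601-S607: when measured bandwidth drops below the
    network threshold, switch to the voice-driven generated video;
    once bandwidth recovers, switch back to the live call."""
    if bandwidth_kbps < threshold_kbps:
        return "generated_video"   # S606: freeze detected, show target video
    return "live_call"             # S607: bandwidth sufficient, live call

# Simulated bandwidth samples over time (kbps):
states = [select_stream(b) for b in (500, 120, 80, 450)]
```

A production implementation would add hysteresis so the picture does not flap when bandwidth hovers near the threshold.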

With the method in this embodiment, the picture of the video call remains natural and smooth even when the user's network bandwidth is insufficient.

An embodiment of the present application provides a video generation apparatus 10. As shown in FIG. 8, the apparatus includes:

an acquisition unit 100, configured to acquire the to-be-selected frame sequence;

a frame selection unit 200, configured to determine the target frame from the to-be-selected frame sequence according to the frame selection dimension; and

a driving unit 300, configured to perform voice driving on the target frame based on the current voice signal to obtain the target video.

An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the video generation method described in any of the preceding embodiments are implemented.

An embodiment of the present application further provides an electronic device, including a processor and a memory, the memory being used to store executable instructions of the processor; the processor is configured to execute, by running the executable instructions, the video generation method described in any of the preceding embodiments.

The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, units, or modules, and may be electrical or in other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (17)

1. A video generation method, comprising: acquiring a to-be-selected frame sequence; determining a target frame from the to-be-selected frame sequence according to a frame selection dimension; and performing voice driving on the target frame based on a current voice signal to obtain a target video; wherein the frame selection dimension comprises at least one of a first frame selection dimension and a second frame selection dimension.

2. The video generation method according to claim 1, wherein determining the target frame from the to-be-selected frame sequence according to the frame selection dimension comprises: obtaining, according to the frame selection dimension, a preselected frame satisfying a frame selection condition from the to-be-selected frame sequence, wherein the preselected frame is one or more frames; when the preselected frame is one frame, the preselected frame is the target frame; and when the preselected frame is multiple frames, fusing the multiple preselected frames to obtain the target frame; wherein the frame selection condition comprises at least one of a first frame selection condition and a second frame selection condition.

3. The video generation method according to claim 2, wherein the fusion comprises at least one of a first fusion or a second fusion.

4. The video generation method according to claim 3, wherein obtaining, according to the frame selection dimension, a preselected frame satisfying the frame selection condition from the to-be-selected frame sequence comprises: calculating a first dimension value of each frame in the to-be-selected frame sequence according to the first frame selection dimension; and obtaining, from the to-be-selected frame sequence, a first preselected frame whose first dimension value satisfies the first frame selection condition; wherein the first preselected frame is one or more frames.

5. The video generation method according to claim 2, wherein the first frame selection condition is that the first dimension value is within a first frame selection range.

6. The video generation method according to claim 4, wherein: when the first preselected frame is one frame, the first preselected frame is the target frame; and when the first preselected frame is multiple frames, a first fusion is performed on the multiple first preselected frames to obtain the target frame.

7. The video generation method according to claim 3, wherein obtaining, according to the frame selection dimension, a preselected frame satisfying the frame selection condition from the to-be-selected frame sequence comprises: calculating a second dimension value of each frame in the to-be-selected frame sequence according to the second frame selection dimension; and obtaining, from the to-be-selected frame sequence, a second preselected frame whose second dimension value satisfies the second frame selection condition; wherein the second preselected frame is one or more frames.

8. The video generation method according to claim 7, wherein: when the second preselected frame is one frame, the second preselected frame is the target frame; and when the second preselected frame is multiple frames, a second fusion is performed on the multiple second preselected frames to obtain the target frame.

9. The video generation method according to claim 2, wherein the second frame selection condition is that the second dimension value or a second dimension comprehensive value is the lowest or the highest.

10. The video generation method according to claim 4, wherein obtaining, according to the frame selection dimension, a preselected frame satisfying the frame selection condition from the to-be-selected frame sequence comprises: calculating a second dimension value of each frame in the first preselected frame according to the second frame selection dimension; and obtaining, from the first preselected frame, a second preselected frame whose second dimension value satisfies the second frame selection condition; wherein the second preselected frame is one or more frames.

11. The video generation method according to claim 10, wherein: when the second preselected frame is one frame, the second preselected frame is the target frame; and when the second preselected frame is multiple frames, a second fusion is performed on the multiple second preselected frames to obtain the target frame.

12. The video generation method according to claim 10, wherein, when the first preselected frame is multiple frames, the second frame selection condition is that the second dimension value or the second dimension comprehensive value is the lowest or the highest.

13. The video generation method according to claim 10, wherein, when the first preselected frame is one frame, the second frame selection condition is that the second dimension value is within a second frame selection range.

14. The video generation method according to claim 1, wherein performing voice driving on the target frame based on the current voice signal to obtain the target video comprises: generating corresponding driving expression coefficients from the current voice signal through a trained voice driving model; matching the target frame with the driving expression coefficients to generate a key frame; performing expression matching on the key frame based on the to-be-selected frame sequence and the target frame to obtain a driving frame; and forming the target video from consecutive driving frames.

15. A video generation apparatus, comprising: an acquisition unit, configured to acquire a to-be-selected frame sequence; a frame selection unit, configured to determine a target frame from the to-be-selected frame sequence according to a frame selection dimension; and a driving unit, configured to perform voice driving on the target frame based on a current voice signal to obtain a target video.

16. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 14.

17. An electronic device, comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform, by executing the executable instructions, the video generation method according to any one of claims 1 to 14.
CN202210688868.XA 2022-06-16 2022-06-16 Video generation method and device, storage medium and electronic equipment Pending CN115116468A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210688868.XA CN115116468A (en) 2022-06-16 2022-06-16 Video generation method and device, storage medium and electronic equipment
PCT/CN2023/094868 WO2023241298A1 (en) 2022-06-16 2023-05-17 Video generation method and apparatus, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210688868.XA CN115116468A (en) 2022-06-16 2022-06-16 Video generation method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN115116468A true CN115116468A (en) 2022-09-27

Family

ID=83328086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210688868.XA Pending CN115116468A (en) 2022-06-16 2022-06-16 Video generation method and device, storage medium and electronic equipment

Country Status (2)

Country Link
CN (1) CN115116468A (en)
WO (1) WO2023241298A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023241298A1 (en) * 2022-06-16 2023-12-21 虹软科技股份有限公司 Video generation method and apparatus, storage medium and electronic device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110166829A (en) * 2019-05-15 2019-08-23 上海商汤智能科技有限公司 Method for processing video frequency and device, electronic equipment and storage medium
CN112967212A (en) * 2021-02-01 2021-06-15 北京字节跳动网络技术有限公司 Virtual character synthesis method, device, equipment and storage medium
CN113378806A (en) * 2021-08-16 2021-09-10 之江实验室 Audio-driven face animation generation method and system integrating emotion coding
CN113554737A (en) * 2020-12-04 2021-10-26 腾讯科技(深圳)有限公司 Target object motion driving method, device, equipment and storage medium
US20210357625A1 (en) * 2019-09-18 2021-11-18 Beijing Sensetime Technology Development Co., Ltd. Method and device for generating video, electronic equipment, and computer storage medium
CN113822136A (en) * 2021-07-22 2021-12-21 腾讯科技(深圳)有限公司 Video material image selection method, device, equipment and storage medium
CN113987269A (en) * 2021-09-30 2022-01-28 深圳追一科技有限公司 Digital human video generation method and device, electronic equipment and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109993025B (en) * 2017-12-29 2021-07-06 中移(杭州)信息技术有限公司 Method and device for extracting key frames
CN110390263A (en) * 2019-06-17 2019-10-29 宁波江丰智能科技有限公司 A kind of method of video image processing and system
CN113689538B (en) * 2020-05-18 2024-05-21 北京达佳互联信息技术有限公司 Video generation method and device, electronic equipment and storage medium
CN111833418B (en) * 2020-07-14 2024-03-29 北京百度网讯科技有限公司 Animation interaction method, device, equipment and storage medium
CN112215927B (en) * 2020-09-18 2023-06-23 腾讯科技(深圳)有限公司 Method, device, equipment and medium for synthesizing face video
CN113507627B (en) * 2021-07-08 2022-03-25 北京的卢深视科技有限公司 Video generation method and device, electronic equipment and storage medium
CN114202604B (en) * 2021-11-30 2025-07-15 长城信息股份有限公司 A method, device and storage medium for generating a target person video driven by voice
CN115116468A (en) * 2022-06-16 2022-09-27 虹软科技股份有限公司 Video generation method and device, storage medium and electronic equipment



Also Published As

Publication number Publication date
WO2023241298A1 (en) 2023-12-21

Similar Documents

Publication Publication Date Title
CN111598998B (en) Three-dimensional virtual model reconstruction method, three-dimensional virtual model reconstruction device, computer equipment and storage medium
CN109285215B (en) Human body three-dimensional model reconstruction method and device and storage medium
CN108921782B (en) Image processing method, device and storage medium
US20200226821A1 (en) Systems and Methods for Automating the Personalization of Blendshape Rigs Based on Performance Capture Data
CN107578435A (en) A method and device for image depth prediction
CN114782864B (en) Information processing method, device, computer equipment and storage medium
WO2023066120A1 (en) Image processing method and apparatus, electronic device, and storage medium
US12307616B2 (en) Techniques for re-aging faces in images and video frames
CN113989434A (en) Human body three-dimensional reconstruction method and device
CN113033442A (en) StyleGAN-based high-freedom face driving method and device
CN118822854A (en) An AI image super-resolution detail enhancement system based on film and television production
US20240355051A1 (en) Differentiable facial internals meshing model
CN111680573A (en) Face recognition method, device, electronic device and storage medium
WO2023241298A1 (en) Video generation method and apparatus, storage medium and electronic device
JP2002245455A (en) Method and device for multi-variable space processing
CN113920023A (en) Image processing method and device, computer readable medium and electronic device
EP3809372B1 (en) Method of real-time generation of 3d imaging
CN117635838A (en) Three-dimensional face reconstruction method, equipment, storage medium and device
CN116912393A (en) Face reconstruction method and device, electronic equipment and readable storage medium
CN113409207B (en) Face image definition improving method and device
CN115497029A (en) Video processing method, device and computer readable storage medium
KR101533494B1 (en) Method and apparatus for generating 3d video based on template mode
CN120431269B (en) Three-dimensional Gaussian scene reconstruction method and device, electronic equipment and storage medium
CN117974906B (en) Face modeling method, device, electronic equipment and storage medium
CN115908162B (en) A method and system for generating a virtual viewpoint based on background texture recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination