Kadam et al., 2021 - Google Patents
A Survey of Audio Synthesis and Lip-syncing for Synthetic Video Generation.Kadam et al., 2021
View PDF- Document ID
- 8736857034915895145
- Author
- Kadam A
- Rane S
- Mishra A
- Sahu S
- Singh S
- Pathak S
- Publication year
- Publication venue
- EAI Endorsed Transactions on Creative Technologies
External Links
Snippet
Abstract The fields like Media, Education and Corporations etc have started focusing on content creation. This has led to the huge demand for synthetic media generation using less data. To synthesize a high-grade artificial video, the lip must be synchronized with the audio …
- 230000002194 synthesizing 0 title description 29
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids transforming into visible information
- G10L2021/105—Synthesis of the lips movements from speech, e.g. for talking heads
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/013—Adapting to target pitch
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/088—Word spotting
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G10L15/07—Adaptation to the speaker
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/26—Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/06—Elementary speech units used in speech synthesisers; Concatenation rules
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/66—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signal, using source filter models or psychoacoustic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/205—3D [Three Dimensional] animation driven by audio data
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| KR et al. | Towards automatic face-to-face translation | |
| US20210390945A1 (en) | Text-driven video synthesis with phonetic dictionary | |
| Hu et al. | Neural dubber: Dubbing for videos according to scripts | |
| Vougioukas et al. | Video-driven speech reconstruction using generative adversarial networks | |
| Mittal et al. | Animating face using disentangled audio representations | |
| KR20210048441A (en) | Matching mouth shape and movement in digital video to alternative audio | |
| US20230386475A1 (en) | Systems and methods of text to audio conversion | |
| Kadam et al. | A Survey of Audio Synthesis and Lip-syncing for Synthetic Video Generation. | |
| Hassid et al. | More than words: In-the-wild visually-driven prosody for text-to-speech | |
| Hegde et al. | Visual speech enhancement without a real visual stream | |
| CN118571229B (en) | Voice labeling method and device for voice feature description | |
| CN117711042A (en) | Method and device for generating broadcast video of digital person based on driving text | |
| WO2024182112A1 (en) | Using text-injection to recognize speech without transcription | |
| Liu et al. | M3TTS: Multi-modal text-to-speech of multi-scale style control for dubbing | |
| Alshahrani et al. | A Systematic Literature Review: Facial Expression and Lip Movement Synchronization of an Audio Track | |
| Rafiei Oskooei et al. | Can One Model Fit All? An Exploration of Wav2Lip’s Lip-Syncing Generalizability Across Culturally Distinct Languages | |
| Hussen Abdelaziz et al. | Speaker-independent speech-driven visual speech synthesis using domain-adapted acoustic models | |
| WO2025011651A1 (en) | Digital human mouth shape generation method and device | |
| Kadam et al. | ReVoice: A neural network based voice cloning system | |
| Kolivand et al. | Realistic lip syncing for virtual character using common viseme set | |
| Sun et al. | Pre-avatar: An automatic presentation generation framework leveraging talking avatar | |
| Shelar et al. | Deepfakes for video conferencing using general adversarial networks (GANs) and multilingual voice cloning | |
| Venkataraghavan et al. | Wav2Lip Bridges Communication Gap: Automating Lip Sync and Language Translation for Indian Languages | |
| Zhang et al. | Realistic Speech‐Driven Talking Video Generation with Personalized Pose | |
| CN114283782A (en) | Speech synthesis method and apparatus, electronic device, and storage medium |