Hershey et al., 2004 - Google Patents

Audio-visual graphical models for speech processing

Document ID: 17843625553296045077
Author: Hershey J; Attias H; Jojic N; Kristjansson T
Publication year: 2004
Publication venue: 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing

Snippet

Perceiving sounds in a noisy environment is a challenging problem. Visual lip-reading can provide relevant information, but it is also challenging because lips are moving and a tracker must deal with a variety of conditions. Typically, audio-visual systems have been assembled …

Classifications

- G—PHYSICS
  - G06—COMPUTING; CALCULATING; COUNTING
    - G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
      - G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
        - G06K9/62—Methods or arrangements for recognition using electronic means
          - G06K9/6217—Design or setup of recognition systems and techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
            - G06K9/6232—Extracting features by transforming the feature space, e.g. multidimensional scaling; Mappings, e.g. subspace methods
              - G06K9/6247—Extracting features by transforming the feature space, e.g. multidimensional scaling; Mappings, e.g. subspace methods based on an approximation criterion, e.g. principal component analysis
          - G06K9/6267—Classification techniques
            - G06K9/6268—Classification techniques relating to the classification paradigm, e.g. parametric or non-parametric approaches
        - G06K9/00221—Acquiring or recognising human faces, facial parts, facial sketches, facial expressions
          - G06K9/00268—Feature extraction; Face representation
            - G06K9/00281—Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
          - G06K9/00288—Classification, e.g. identification
        - G06K9/36—Image preprocessing, i.e. processing the image information without deciding about the identity of the image
          - G06K9/46—Extraction of features or characteristics of the image
        - G06K9/00624—Recognising scenes, i.e. recognition of a whole field of perception; recognising scene-specific objects
        - G06K9/00335—Recognising movements or behaviour, e.g. recognition of gestures, dynamic facial expressions; Lip-reading
        - G06K9/00362—Recognising human body or animal bodies, e.g. vehicle occupant, pedestrian; Recognising body parts, e.g. hand
          - G06K9/00369—Recognition of whole body, e.g. static pedestrian or occupant recognition
  - G10—MUSICAL INSTRUMENTS; ACOUSTICS
    - G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
      - G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
        - G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
          - G10L21/0208—Noise filtering
      - G10L15/00—Speech recognition
        - G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
          - G10L15/063—Training
            - G10L2015/0635—Training updating or merging of old and new templates; Mean values; Weighting
              - G10L2015/0636—Threshold criteria for the updating

Similar Documents

Publication	Publication Date	Title
AU2022200439B2 (en)	2022-10-20	Multi-modal speech separation method and system
US7689413B2 (en)	2010-03-30	Speech detection and enhancement using audio/video fusion
Beal et al.	2003	A graphical model for audiovisual object tracking
Zhou et al.	2013	A compact representation of visual speech data using latent variables
US7343289B2 (en)	2008-03-11	System and method for audio/video speaker detection
Lu et al.	2014	Ensemble modeling of denoising autoencoder for speech spectrum restoration
JP2018077479A (en)	2018-05-17	Object recognition using multimodal alignment
Estellers et al.	2012	Multi-pose lipreading and audio-visual speech recognition
Gurbuz et al.	2001	Application of affine-invariant Fourier descriptors to lipreading for audio-visual speech recognition
Minotto et al.	2013	Audiovisual voice activity detection based on microphone arrays and color information
Hershey et al.	2004	Audio-visual graphical models for speech processing
Christoudias et al.	2006	Co-adaptation of audio-visual speech and gesture classifiers
Argones Rua et al.	2009	Audio-visual speech asynchrony detection using co-inertia analysis and coupled hidden Markov models
Chetty et al.	2008	Robust face-voice based speaker identity verification using multilevel fusion
Canton-Ferrer et al.	2007	Audiovisual head orientation estimation with particle filtering in multisensor scenarios
Schymura et al.	2020	A dynamic stream weight backprop Kalman filter for audiovisual speaker tracking
Gebru et al.	2017	Audio-visual tracking by density approximation in a sequential Bayesian filtering framework
Stiefelhagen et al.	2006	Audio-visual perception of a lecturer in a smart seminar room
Seymour et al.	2007	Audio-visual integration for robust speech recognition using maximum weighted stream posteriors
Chetty et al.	2007	Audio visual speaker verification based on hybrid fusion of cross modal features
Sarada et al.	2024	Audio deepfake detection and classification
Rajavel et al.	2015	Optimum integration weight for decision fusion audio–visual speech recognition
Dean et al.	2010	Dynamic visual features for audio–visual speaker verification
Chetty	2010	Robust audio visual biometric person authentication with liveness verification
Martinson et al.	2011	Learning speaker recognition models through human-robot interaction