
WO2019022722A1 - Language identification with speech and visual anthropometric features - Google Patents

Language identification with speech and visual anthropometric features

Info

Publication number
WO2019022722A1
WO2019022722A1 (PCT/US2017/043765, US2017043765W)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
language
user
processor
instructions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2017/043765
Other languages
French (fr)
Inventor
Sunil Bharitkar
David Murphy
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hewlett Packard Development Co LP
Original Assignee
Hewlett Packard Development Co LP
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hewlett Packard Development Co LP filed Critical Hewlett Packard Development Co LP
Priority to PCT/US2017/043765 priority Critical patent/WO2019022722A1/en
Publication of WO2019022722A1 publication Critical patent/WO2019022722A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/005: Language recognition
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/263: Language identification
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00: Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/08: Feature extraction
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00: Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/12: Classification; Matching
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168: Feature extraction; Face representation
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals

Definitions

  • Speech may be translated into text or another form.
  • Some speech recognition systems may require a user to read text or isolated vocabulary into the system.
  • the speech recognition system may analyze the user's specific voice, and use attributes of the user's specific voice to fine-tune the recognition of that user's speech, resulting in increased accuracy.
  • Figure 1 illustrates an example layout of a language identification with speech and visual anthropometric features apparatus
  • Figure 2 illustrates an example of language identification using speech to illustrate operation of the language identification with speech and visual anthropometric features apparatus of Figure 1 ;
  • Figure 3 illustrates an example of language identification using speech and camera input to illustrate operation of the language identification with speech and visual anthropometric features apparatus of Figure 1 ;
  • Figure 4 illustrates an example of feature based training and transfer of a neural network model to illustrate operation of the language identification with speech and visual anthropometric features apparatus of Figure 1 ;
  • Figure 5 illustrates an example of a multilayer feedforward neural network to illustrate operation of the language identification with speech and visual anthropometric features apparatus of Figure 1 ;
  • Figure 6 illustrates English emulated "common" spoken word (family) by a female, where the top subplot is the spoken word and the bottom subplot is the fundamental frequency f 0 tracked over the word for a frame size of 512 samples, to illustrate operation of the language identification with speech and visual anthropometric features apparatus of Figure 1 ;
  • Figure 7 illustrates Russian emulated "common" spoken word (sem'ya for family) by a male, where the top subplot is the spoken word and the bottom subplot is the fundamental frequency f 0 tracked over the word for a frame size of 512 samples, to illustrate operation of the language identification with speech and visual anthropometric features apparatus of Figure 1 ;
  • Figure 8 illustrates Russian emulated "common" spoken word (sem'ya for family) by a female, where the top subplot is the spoken word and the bottom subplot is the fundamental frequency f 0 tracked over the word for a frame size of 512 samples, to illustrate operation of the language identification with speech and visual anthropometric features apparatus of Figure 1 ;
  • Figure 9 illustrates processed female English tracked feature f0 to remove redundancies (e.g., zeros in the f0 value), corresponding to Figure 6, to illustrate operation of the language identification with speech and visual anthropometric features apparatus of Figure 1 ;
  • Figure 10 illustrates processed male Russian tracked feature f0 to remove redundancies (e.g., zeros in the f0 value), corresponding to Figure 7, to illustrate operation of the language identification with speech and visual anthropometric features apparatus of Figure 1 ;
  • Figure 11 illustrates processed female Russian tracked feature f 0 to remove redundancies (e.g., zeros in the f 0 value), corresponding to Figure 8, to illustrate operation of the language identification with speech and visual anthropometric features apparatus of Figure 1 ;
  • Figure 12 illustrates an example block diagram for language identification with speech and visual anthropometric features
  • Figure 13 illustrates an example flowchart of a method for language identification with speech and visual anthropometric features
  • Figure 14 illustrates a further example block diagram for language identification with speech and visual anthropometric features.
  • the terms “a” and “an” are intended to denote at least one of a particular element.
  • the term “includes” means includes but not limited to, the term “including” means including but not limited to.
  • the term “based on” means based at least in part on.
  • Apparatuses for language identification with speech and visual anthropometric features, methods for language identification with speech and visual anthropometric features, and non-transitory computer readable media having stored thereon machine readable instructions to provide language identification with speech and visual anthropometric features are disclosed.
  • the apparatuses, methods, and non-transitory computer readable media disclosed herein provide for automatic language identification during speech-based communication between people. For example, in a collaboration setting (e.g., using a smart-surface) or a conference call, languages that are being spoken may be identified at the start of a conversation so as to enable linguistically-relevant speech-synthesis via a machine-translation schema.
  • the apparatuses, methods, and non-transitory computer readable media disclosed herein provide for detection of demographic-based visual features of a speaker, for example, by a camera, to create a multimodal approach for language identification.
  • the apparatuses, methods, and non-transitory computer readable media disclosed herein provide for use cases such as speech to text translation, communication between participants speaking different languages, etc.
  • language identification techniques it is technically challenging to sense and recognize speech to determine what language is being spoken. It is also technically challenging to sense and recognize speech with a high degree of confidence, and/or to combine multiple techniques of sensing and recognizing speech.
  • the apparatuses, methods, and non-transitory computer readable media disclosed herein provide for language identification that includes, for example, word boundary detection in the speech, identification of a rhythm structure of the speech, and/or feature identification, where these aspects are used in combination to identify a language being spoken. Further, the apparatuses, methods, and non-transitory computer readable media disclosed herein provide for language identification that includes feature identification based on fundamental frequency analysis of words.
  • the apparatuses, methods, and non-transitory computer readable media disclosed herein provide a database based approach for which word boundary identification is performed on speech. Thereafter, a correlation may be performed with respect to the fundamental atoms of the words in the database that is available for various languages and various frequently used words. Based on the results (e.g., what language, and associated probability), the spoken language of the speech may be detected.
  • the features that may be extracted from speech and analyzed may include, for example, the fundamental frequency, spectral centroid, how words evolve in time, etc.
  • the database approach may also employ a neural network, where features may be stored for various languages in the database, and the neural network may be pre-trained on certain ones of the features.
  • the neural network may classify the language according to the pre-trained features that are frequently occurring based on key areas (e.g., fundamental frequency, spectral centroid, how words evolve in time, etc.), or words.
  • the apparatuses, methods, and non-transitory computer readable media disclosed herein provide a recurrent neural network based approach, where the recurrent neural network may be trained on various examples of spoken languages by different speakers. Having trained the recurrent neural network, for new speech, the recurrent neural network may identify which language is being spoken, and redirect the results to the correct automatic speech recognition (ASR) to have the speech interpreted properly.
  • the features in this regard may be extracted from the speech based on temporal signal.
  • modules may be any combination of hardware and programming to implement the functionalities of the respective modules.
  • the combinations of hardware and programming may be implemented in a number of different ways.
  • the programming for the modules may be processor executable instructions stored on a non-transitory machine-readable storage medium and the hardware for the modules may include a processing resource to execute those instructions.
  • a computing device implementing such modules may include the machine-readable storage medium storing the instructions and the processing resource to execute the instructions, or the machine-readable storage medium may be separately stored and accessible by the computing device and the processing resource.
  • some modules may be implemented in circuitry.
  • Figure 1 illustrates an example layout of a language identification with speech and visual anthropometric features apparatus (hereinafter also referred to as "apparatus 100").
  • the apparatus 100 may include a speech identification module 102 to analyze an input signal 104 to identify a speech 106 of a user 108.
  • the speech identification module 102 is to analyze the input signal 104 to identify the speech 106 of the user 108 by performing, based on a zero-crossing rate and/or a periodicity analysis, a voice activity detection.
  • a word determination module 110 is to determine, based on word boundary detection, a plurality of words 112 included in the speech 106.
  • a word categorization module 114 is to categorize the words 112, for example, into a phoneme, a vowel, and/or a consonant, to generate categorized words 116.
  • a rhythm identification module 118 is to identify a rhythm structure 120 of the speech 106.
  • the rhythm identification module 118 is to identify the rhythm structure 120 of the speech 106 by determining a vowel duration of a vowel included in the words 112, and analyzing, based on the vowel duration, a variability in a length of the vowel included in the words 112.
  • a feature extraction module 122 is to extract a feature 124 of the speech 106.
  • a language identification module 126 is to identify, based on a classification of the categorized words 112, the identified rhythm structure 120, and the extracted feature 124, a language 128 of the speech 106.
  • the feature extraction module 122 is to analyze landmark features of the user 108 to determine a racial characteristic of the user 108, and the language identification module 126 is to combine results including the language identification and the determined racial characteristic to increase a confidence associated with the identification of the language 128 of the speech 106.
  • the feature extraction module 122 is to analyze landmark features of the user 108 as received from an image captured by a camera 136.
  • the feature extraction module 122 is to classify, based on an analysis of an image of the user by a pre-trained learning model, the user according to an ethnicity
  • the language identification module 126 is to combine results including the language identification and the classified ethnicity to increase a confidence associated with the identification of the language 128 of the speech 106.
  • the feature extraction module 122 is to extract the feature 124 of the speech 106 by analyzing a fundamental frequency f0 of a word of the plurality of words 112 by a trained learning model.
  • a speech processing control module 130 is to control processing associated with the speech 106 based on the identified language 128.
  • the speech processing control module 130 is to control processing associated with the speech 106 based on the identified language by implementing, for the identified language 128, a speech to text translation, and/or translating the identified language 128 to another language specific to another user.
  • a recurrent neural network implementation module 132 is to analyze, by a trained recurrent neural network 134, the speech 106 of the user 108.
  • the recurrent neural network implementation module 132 is to extract, based on the analysis of the speech 106 by the recurrent neural network 134, features of the speech 106.
  • the language identification module 126 is to identify, based on a classification of the features by the recurrent neural network 134, a language 128 of the speech 106.
  • the speech processing control module 130 is to control processing associated with the speech 106 based on the identified language 128.
  • Figure 2 illustrates an example of language identification using speech to illustrate operation of the apparatus 100.
  • the speech identification module 102 is to analyze the input signal 104 to identify the speech 106 of the user 108 by performing, based on a zero-crossing rate and/or a periodicity analysis, a voice activity detection.
  • the speech identification module 102 is to analyze the input signal 104 to identify the speech 106 of the user 108 by performing, based on a zero-crossing rate, periodicity analysis (e.g., fundamental and formant frequencies f0, f1, f2), centroid, critical-band filter coefficients, mel-frequency cepstrum (MFC) coefficients, etc., a voice activity detection.
  • the voice activity detection may also include noise suppression.
  • the noise suppression may be performed by implementing spectral subtraction, and may be used in conjunction with the voice activity detection.
  • the noise suppression may also include a blind de-reverberation technique to reduce the influence of strong acoustical reflections that impede proper speech recognition.
  • the word determination module 110 is to determine, based on word boundary detection, the plurality of words 112 included in the speech 106.
  • the feature extraction module 122 is to extract a feature 124 of the speech 106.
  • the language identification module 126 is to identify, based on a classification of the categorized words 112, the identified rhythm structure 120, and the extracted feature 124, the language 128 of the speech 106.
  • the rhythm, features, explicit words, etc. may be classified based on a dictionary (or database) of frequently used words which has been created a priori for different languages.
  • the classification may be performed based, for example, on Bayes classifier (looking at posterior probabilities, and creating discrete probability distribution functions based on the words being identified and their frequencies).
  • the classification at block 206 may be performed based, for example, on an artificial neural network (ANN) structure.
  • the ANN may employ, for example, one neural network per language or one neural network with multiple output neurons, with one neuron per language.
  • the ANN may be based on an appropriate selection of feature vectors for frequent words corresponding to a specified language.
  • the classification at block 206 may be performed based, for example, on an unsupervised approach such as fuzzy c-means or k-means clustering based on feature vectors being clustered according to the language class.
  • the classification at block 206 may be performed based, for example, on a Hidden Markov Model.
  • the classification at block 206 may be performed based, for example, on a deep-learning neural network employing 2-d data or 3-d data.
  • 2-d data may include time and amplitude of the speech words.
  • 3-d data may include a spectrogram or red, green, blue (RGB) mapped spectrograms.
  • the classification at block 206 there may be one model for performing classification and identification, or a parallel structure employing multiple-models for classification.
  • the model(s) with respect to block 206 may be trained to account for speech rate variations as well.
  • Figure 3 illustrates an example of language identification using speech and camera input to illustrate operation of the apparatus 100.
  • a-priori landmark features from block 304 of the user 108 may be applied to pre-trained models at block 306.
  • the pre-trained models may use anthropometric data based on racial characteristics.
  • the image of the user 108 may be applied to a pre-trained deep-learning model to classify the user 108 based on ethnicity.
  • the language identification module 126 is to combine the results of the visual demographic classification with speech identification to increase classification rates in the presence of certain confounds.
  • the confounds may include residual ambient noise corrupting the speech signal, acoustical environment degrading speech quality, infrequently used words in the speech capture, rhythm mis-estimation, etc.).
  • an example of a machine learning model may include a deep neural network (DNN) such as a long short-term memory (LSTM), or related model.
  • the machine learning model may be trained to discriminate languages, and include the application of pre word or post word segmentation. Facial feature tracking may also be used as an additional modality to reinforce phoneme matching.
  • this approach may employ weighting factors to account for the results of blocks 206 and 306 (e.g. , see block 308). For example, assuming that the results of the classification at block 206 include a 55% probability, and results of the classification at block 306 include an 80% probability that the user 108 is of a particular ethnicity, then the confidence of the overall result of the classifier may be increased based on the probability determined by the results of block 306. For example, weights may be assigned to the classification results as follows:
  • the weights (W) in this regard may be assigned over a period of time or a specified duration of words (e.g., 10 words).
  • Figure 4 illustrates an example of feature based training and transfer of a neural network model to illustrate operation of the apparatus 100.
  • multiple speech signals may be emulated as being common words.
  • an English word “family” spoken by a female
  • a Russian word “sem'ya” (equivalent to family) spoken by a male as well as a female
  • the feature being used may include the fundamental frequency f0 (Hz).
  • training of a machine learning model may be performed at 402, and authentication based on a speech signal at block 404 may be performed at 406.
  • the intermediate blocks between block 400 and block 402 may include Hamming window and frame-hop, feature extraction, and feature normalization over the duration of speech.
  • the intermediate blocks between block 404 and block 406 may include Hamming window and frame-hop, feature extraction, and feature normalization over the duration of speech.
  • Figure 6 illustrates English emulated "common" spoken word (family) by a female, where the top subplot is the spoken word and the bottom subplot is the fundamental frequency f0 tracked over the word for a frame size of 512 samples, to illustrate operation of the apparatus 100.
  • Figure 7 illustrates Russian emulated "common" spoken word (sem'ya for family) by a male, where the top subplot is the spoken word and the bottom subplot is the fundamental frequency f0 tracked over the word for a frame size of 512 samples, to illustrate operation of the apparatus 100.
  • Figure 8 illustrates Russian emulated "common" spoken word (sem'ya for family) by a female, where the top subplot is the spoken word and the bottom subplot is the fundamental frequency f0 tracked over the word for a frame size of 512 samples, to illustrate operation of the apparatus 100.
  • the fundamental frequency f0 may be tracked over the duration of the word (the spoken word shown at the top subplot, whereas f0 shown in the bottom subplot).
  • Figure 9 illustrates processed female English tracked feature f0 to remove redundancies (e.g., zeros in the f0 value), corresponding to Figure 6, to illustrate operation of the apparatus 100.
  • Figure 10 illustrates processed male Russian tracked feature f0 to remove redundancies (e.g., zeros in the f0 value), corresponding to Figure 7, to illustrate operation of the apparatus 100.
  • Figure 11 illustrates processed female Russian tracked feature f0 to remove redundancies (e.g., zeros in the f0 value), corresponding to Figure 8, to illustrate operation of the apparatus 100.
  • the similarities in the female and male Russian spoken f0 with respect to the envelope may be observed, whereas clear differences in the envelope of the English spoken word f0 may be observed.
  • the envelope may then be used to train a machine learning model as shown at 402.
  • the training exemplars may include the fundamental frequency f0 over the duration of the spoken word (e.g., the boundaries of the word may be determined when the threshold of the amplitude of the signal falls below a certain value factoring any residual/ambient noise) for the various languages.
  • an output neuron may be trained to identify the most common word in that language with a categorical label for that language, or a class value of 1.
  • the neurons associated with the remaining languages in the ANN model may be trained to report an output of -1.
  • the ANN may be trained with rate changes of the commonly spoken word (for both male and female in a given language) using, for example, a vocoder.
  • the ANN model may also be trained with various accents.
  • the training algorithm may be gradient descent based, Jacobian, or Hessian matrix based.
  • J may represent the Jacobian matrix that contains first derivatives of the network errors with respect to the weights and biases, and e may represent a vector of network errors.
  • the Jacobian matrix may be determined through a backpropagation technique, as opposed to determining the Hessian matrix.
  • the size of the input layer, for example, for the ANN may be determined by the product of the number of features tracked and the duration of the feature (or the duration of the envelope of the feature being tracked).
  • An example of a machine learning model may include a DNN such as a LSTM, or related model.
  • the machine learning model may be trained to discriminate languages, and include the application of pre word or post word segmentation. Facial feature tracking may also be used as an additional modality to reinforce phoneme matching.
  • Figures 12-14 respectively illustrate an example block diagram 1200, an example flowchart of a method 1300, and a further example block diagram 1400 for language identification with speech and visual anthropometric features.
  • the block diagram 1200, the method 1300, and the block diagram 1400 may be implemented on the apparatus 100 described above with reference to Figure 1 by way of example and not limitation.
  • the block diagram 1200, the method 1300, and the block diagram 1400 may be practiced in other apparatus.
  • Figure 12 shows hardware of the apparatus 100 that may execute the instructions of the block diagram 1200.
  • the hardware may include a processor 1202, and a memory 1204 (i.e., a non-transitory computer readable medium) storing machine readable instructions that when executed by the processor cause the processor to perform the instructions of the block diagram 1200.
  • the memory 1204 may represent a non-transitory computer readable medium.
  • Figure 13 may represent a method for language identification with speech and visual anthropometric features, and the steps of the method.
  • Figure 14 may represent a non-transitory computer readable medium 1402 having stored thereon machine readable instructions to provide language identification with speech and visual anthropometric features.
  • the machine readable instructions when executed, cause a processor 1404 to perform the instructions of the block diagram 1400 also shown in Figure 14.
  • the processor 1202 of Figure 12 and/or the processor 1404 of Figure 14 may include a single or multiple processors or other hardware processing circuit, to execute the methods, functions and other processes described herein. These methods, functions and other processes may be embodied as machine readable instructions stored on a computer readable medium, which may be non-transitory (e.g., the non-transitory computer readable medium 1402 of Figure 14), such as hardware storage devices (e.g., RAM (random access memory), ROM (read only memory), EPROM (erasable, programmable ROM), EEPROM (electrically erasable, programmable ROM), hard drives, and flash memory).
  • the memory 1204 may include a RAM, where the machine readable instructions and data for a processor may reside during runtime.
  • the memory 1204 may include instructions 1206 to analyze an input signal 104 to identify a speech 106 of a user 108.
  • the processor 1202 may fetch, decode, and execute the instructions 1208 to determine, based on word boundary detection, a plurality of words 112 included in the speech 106.
  • the processor 1202 may fetch, decode, and execute the instructions 1210 to categorize the words 112, for example, into a phoneme, a vowel, and/or a consonant, to generate categorized words 116.
  • the processor 1202 may fetch, decode, and execute the instructions 1212 to identify a rhythm structure 120 of the speech 106.
  • the processor 1202 may fetch, decode, and execute the instructions 1214 to extract a feature 124 of the speech 106.
  • the processor 1202 may fetch, decode, and execute the instructions 1216 to identify, based on a classification of the categorized words 112, the identified rhythm structure 120, and the extracted feature 124, a language 128 of the speech 106.
  • the processor 1202 may fetch, decode, and execute the instructions 1218 to control processing associated with the speech 106 based on the identified language 128.
  • the method may include analyzing an input signal 104 to identify a speech 106 of a user 108.
  • the method may include analyzing, by a trained recurrent neural network 134, the speech 106 of the user 108.
  • the method may include extracting, based on the analysis of the speech 106 by the recurrent neural network 134, features of the speech 106.
  • the method may include identifying, based on a classification of the features by the recurrent neural network 134, a language 128 of the speech 106.
  • the method may include controlling processing associated with the speech 106 based on the identified language 128.
  • the non-transitory computer readable medium 1402 may include instructions 1406 to analyze an input signal 104 to identify a speech 106 of a user 108.
  • the processor 1404 may fetch, decode, and execute the instructions 1408 to determine, based on word boundary detection, a plurality of words 112 included in the speech 106.
  • the processor 1404 may fetch, decode, and execute the instructions 1410 to extract a feature 124 of the speech 106 by analyzing a fundamental frequency of a word of the plurality of words 112 by a trained learning model.
  • the processor 1404 may fetch, decode, and execute the instructions 1412 to identify, based on a classification of the extracted feature 124, a language 128 of the speech 106.
  • the processor 1404 may fetch, decode, and execute the instructions 1414 to control processing associated with the speech 106 based on the identified language 128.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

In some examples, language identification with speech and visual anthropometric features may include analyzing an input signal to identify a speech of a user, and extracting a feature of the speech. Language identification with speech and visual anthropometric features may further include identifying, based on a classification of the feature, a language of the speech, and controlling processing associated with the speech based on the identified language.

Description

LANGUAGE IDENTIFICATION WITH SPEECH AND VISUAL
ANTHROPOMETRIC FEATURES
BACKGROUND
[0001] Speech may be translated into text or another form. For example, automatic speech recognition (ASR) may be used to transcribe speech into readable text in real time. Some speech recognition systems may require a user to read text or isolated vocabulary into the system. The speech recognition system may analyze the user's specific voice, and use attributes of the user's specific voice to fine-tune the recognition of that user's speech, resulting in increased accuracy.
BRIEF DESCRIPTION OF DRAWINGS
[0002] Features of the present disclosure are illustrated by way of example and not limited in the following figure(s), in which like numerals indicate like elements, in which:
[0003] Figure 1 illustrates an example layout of a language identification with speech and visual anthropometric features apparatus;
[0004] Figure 2 illustrates an example of language identification using speech to illustrate operation of the language identification with speech and visual anthropometric features apparatus of Figure 1 ;
[0005] Figure 3 illustrates an example of language identification using speech and camera input to illustrate operation of the language identification with speech and visual anthropometric features apparatus of Figure 1 ;
[0006] Figure 4 illustrates an example of feature based training and transfer of a neural network model to illustrate operation of the language identification with speech and visual anthropometric features apparatus of Figure 1 ;
[0007] Figure 5 illustrates an example of a multilayer feedforward neural network to illustrate operation of the language identification with speech and visual anthropometric features apparatus of Figure 1 ;
[0008] Figure 6 illustrates English emulated "common" spoken word (family) by a female, where the top subplot is the spoken word and the bottom subplot is the fundamental frequency f0 tracked over the word for a frame size of 512 samples, to illustrate operation of the language identification with speech and visual anthropometric features apparatus of Figure 1 ;
[0009] Figure 7 illustrates Russian emulated "common" spoken word (sem'ya for family) by a male, where the top subplot is the spoken word and the bottom subplot is the fundamental frequency f0 tracked over the word for a frame size of 512 samples, to illustrate operation of the language identification with speech and visual anthropometric features apparatus of Figure 1 ;
[0010] Figure 8 illustrates Russian emulated "common" spoken word (sem'ya for family) by a female, where the top subplot is the spoken word and the bottom subplot is the fundamental frequency f0 tracked over the word for a frame size of 512 samples, to illustrate operation of the language identification with speech and visual anthropometric features apparatus of Figure 1 ;
[0011] Figure 9 illustrates processed female English tracked feature f0 to remove redundancies (e.g., zeros in the f0 value), corresponding to Figure 6, to illustrate operation of the language identification with speech and visual anthropometric features apparatus of Figure 1 ;
[0012] Figure 10 illustrates processed male Russian tracked feature f0 to remove redundancies (e.g., zeros in the f0 value), corresponding to Figure 7, to illustrate operation of the language identification with speech and visual anthropometric features apparatus of Figure 1 ;
[0013] Figure 11 illustrates processed female Russian tracked feature f0 to remove redundancies (e.g., zeros in the f0 value), corresponding to Figure 8, to illustrate operation of the language identification with speech and visual anthropometric features apparatus of Figure 1 ;
[0014] Figure 12 illustrates an example block diagram for language identification with speech and visual anthropometric features;
[0015] Figure 13 illustrates an example flowchart of a method for language identification with speech and visual anthropometric features; and
[0016] Figure 14 illustrates a further example block diagram for language identification with speech and visual anthropometric features.
DETAILED DESCRIPTION
[0017] For simplicity and illustrative purposes, the present disclosure is described by referring mainly to examples. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be readily apparent however, that the present disclosure may be practiced without limitation to these specific details. In other instances, some methods and structures have not been described in detail so as not to unnecessarily obscure the present disclosure.
[0018] Throughout the present disclosure, the terms "a" and "an" are intended to denote at least one of a particular element. As used herein, the term "includes" means includes but not limited to, the term "including" means including but not limited to. The term "based on" means based at least in part on.
[0019] Apparatuses for language identification with speech and visual anthropometric features, methods for language identification with speech and visual anthropometric features, and non-transitory computer readable media having stored thereon machine readable instructions to provide language identification with speech and visual anthropometric features are disclosed. The apparatuses, methods, and non-transitory computer readable media disclosed herein provide for automatic language identification during speech-based communication between people. For example, in a collaboration setting (e.g., using a smart-surface) or a conference call, languages that are being spoken may be identified at the start of a conversation so as to enable linguistically-relevant speech-synthesis via a machine-translation schema. In addition to speech, the apparatuses, methods, and non-transitory computer readable media disclosed herein provide for detection of demographic-based visual features of a speaker, for example, by a camera, to create a multimodal approach for language identification. Thus, the apparatuses, methods, and non-transitory computer readable media disclosed herein provide for use cases such as speech to text translation, communication between participants speaking different languages, etc. [0020] With respect to language identification techniques, it is technically challenging to sense and recognize speech to determine what language is being spoken. It is also technically challenging to sense and recognize speech with a high degree of confidence, and/or to combine multiple techniques of sensing and recognizing speech.
[0021] In order to address at least these technical challenges associated with language identification, the apparatuses, methods, and non-transitory computer readable media disclosed herein provide for language identification that includes, for example, word boundary detection in the speech, identification of a rhythm structure of the speech, and/or feature identification, where these aspects are used in combination to identify a language being spoken. Further, the apparatuses, methods, and non-transitory computer readable media disclosed herein provide for language identification that includes feature identification based on fundamental frequency analysis of words.
[0022] According to an example, the apparatuses, methods, and non-transitory computer readable media disclosed herein provide a database based approach for which word boundary identification is performed on speech. Thereafter, a correlation may be performed with respect to the fundamental atoms of the words in the database that is available for various languages and various frequently used words. Based on the results (e.g., what language, and associated probability), the spoken language of the speech may be detected. The features that may be extracted from speech and analyzed may include, for example, the fundamental frequency, spectral centroid, how words evolve in time, etc. The database approach may also employ a neural network, where features may be stored for various languages in the database, and the neural network may be pre-trained on certain ones of the features. Thus, when certain features are present in speech, the neural network may classify the language according to the pre-trained features that are frequently occurring based on key areas (e.g., fundamental frequency, spectral centroid, how words evolve in time, etc.), or words. [0023] According to an example, the apparatuses, methods, and non-transitory computer readable media disclosed herein provide a recurrent neural network based approach, where the recurrent neural network may be trained on various examples of spoken languages by different speakers. Having trained the recurrent neural network, for new speech, the recurrent neural network may identify which language is being spoken, and redirect the results to the correct automatic speech recognition (ASR) to have the speech interpreted properly. The features in this regard may be extracted from the speech based on temporal signal.
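To make the feature side of this database approach concrete, the sketch below computes two of the features named above, fundamental frequency and spectral centroid, on a per-frame basis. It is a minimal illustration only: the function name, frame size, thresholds, and the autocorrelation-based pitch estimate are assumptions, not details taken from the disclosure.

```python
import numpy as np

def frame_features(x, sr, frame=512, hop=256, fmin=60.0, fmax=400.0):
    """Per-frame fundamental frequency (autocorrelation peak) and spectral centroid."""
    feats = []
    for start in range(0, len(x) - frame, hop):
        w = x[start:start + frame] * np.hamming(frame)
        # Spectral centroid: magnitude-weighted mean frequency of the frame.
        mag = np.abs(np.fft.rfft(w))
        freqs = np.fft.rfftfreq(frame, d=1.0 / sr)
        centroid = float(np.sum(freqs * mag) / (np.sum(mag) + 1e-12))
        # f0 estimate: strongest autocorrelation peak within a plausible pitch range.
        ac = np.correlate(w, w, mode="full")[frame - 1:]
        lo, hi = int(sr / fmax), int(sr / fmin)
        lag = lo + int(np.argmax(ac[lo:hi]))
        f0 = sr / lag if ac[lag] > 0.3 * ac[0] else 0.0  # 0 marks an unvoiced frame
        feats.append((f0, centroid))
    return np.array(feats)
```

How the words "evolve in time" is then simply the sequence of feature rows returned for successive frames.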
[0024] For the apparatuses, methods, and non-transitory computer readable media disclosed herein, modules, as described herein, may be any combination of hardware and programming to implement the functionalities of the respective modules. In some examples described herein, the combinations of hardware and programming may be implemented in a number of different ways. For example, the programming for the modules may be processor executable instructions stored on a non-transitory machine-readable storage medium and the hardware for the modules may include a processing resource to execute those instructions. In these examples, a computing device implementing such modules may include the machine-readable storage medium storing the instructions and the processing resource to execute the instructions, or the machine-readable storage medium may be separately stored and accessible by the computing device and the processing resource. In some examples, some modules may be implemented in circuitry.
[0025] Figure 1 illustrates an example layout of a language identification with speech and visual anthropometric features apparatus (hereinafter also referred to as "apparatus 100").
[0026] Referring to Figure 1 , the apparatus 100 may include a speech identification module 102 to analyze an input signal 104 to identify a speech 106 of a user 108. For example, the speech identification module 102 is to analyze the input signal 104 to identify the speech 106 of the user 108 by performing, based on a zero-crossing rate and/or a periodicity analysis, a voice activity detection. [0027] A word determination module 110 is to determine, based on word boundary detection, a plurality of words 112 included in the speech 106.
[0028] A word categorization module 114 is to categorize the words 112, for example, into a phoneme, a vowel, and/or a consonant, to generate categorized words 116.
[0029] A rhythm identification module 118 is to identify a rhythm structure 120 of the speech 106. For example, the rhythm identification module 118 is to identify the rhythm structure 120 of the speech 106 by determining a vowel duration of a vowel included in the words 112, and analyzing, based on the vowel duration, a variability in a length of the vowel included in the words 112.
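One conventional way to quantify the vowel-length variability described above is the normalized pairwise variability index (nPVI). The short sketch below assumes vowel durations (in seconds) have already been segmented out of the words 112, and is offered only as an illustrative rhythm measure, not as the specific metric used by the rhythm identification module 118.

```python
def npvi(vowel_durations):
    """Normalized pairwise variability index over successive vowel durations."""
    pairs = zip(vowel_durations[:-1], vowel_durations[1:])
    terms = [abs(a - b) / ((a + b) / 2.0) for a, b in pairs if (a + b) > 0]
    return 100.0 * sum(terms) / len(terms) if terms else 0.0

# Stress-timed speech tends to score higher than syllable-timed speech, e.g.:
print(npvi([0.12, 0.06, 0.15, 0.05, 0.11]))
```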
[0030] A feature extraction module 122 is to extract a feature 124 of the speech 106.
[0031] A language identification module 126 is to identify, based on a classification of the categorized words 112, the identified rhythm structure 120, and the extracted feature 124, a language 128 of the speech 106.
[0032] According to an example, the feature extraction module 122 is to analyze landmark features of the user 108 to determine a racial characteristic of the user 108, and the language identification module 126 is to combine results including the language identification and the determined racial characteristic to increase a confidence associated with the identification of the language 128 of the speech 106. For example, the feature extraction module 122 is to analyze landmark features of the user 108 as received from an image captured by a camera 136.
[0033] According to an example, the feature extraction module 122 is to classify, based on an analysis of an image of the user by a pre-trained learning model, the user according to an ethnicity, and the language identification module 126 is to combine results including the language identification and the classified ethnicity to increase a confidence associated with the identification of the language 128 of the speech 106.
[0034] According to an example, the feature extraction module 122 is to extract the feature 124 of the speech 106 by analyzing a fundamental frequency f0 of a word of the plurality of words 112 by a trained learning model.
[0035] A speech processing control module 130 is to control processing associated with the speech 106 based on the identified language 128. For example, the speech processing control module 130 is to control processing associated with the speech 106 based on the identified language by implementing, for the identified language 128, a speech to text translation, and/or translating the identified language 128 to another language specific to another user.
[0036] According to an example, instead of or in combination with operation of the word determination module 110, the word categorization module 114, the rhythm identification module 118, and/or the feature extraction module 122, a recurrent neural network implementation module 132 is to analyze, by a trained recurrent neural network 134, the speech 106 of the user 108. The recurrent neural network implementation module 132 is to extract, based on the analysis of the speech 106 by the recurrent neural network 134, features of the speech 106. The language identification module 126 is to identify, based on a classification of the features by the recurrent neural network 134, a language 128 of the speech 106. Further, the speech processing control module 130 is to control processing associated with the speech 106 based on the identified language 128.
[0037] Figure 2 illustrates an example of language identification using speech to illustrate operation of the apparatus 100.
[0038] Referring to Figure 2, at block 200 the speech identification module 102 is to analyze the input signal 104 to identify the speech 106 of the user 108 by performing, based on a zero-crossing rate and/or a periodicity analysis, a voice activity detection. For example, the speech identification module 102 is to analyze the input signal 104 to identify the speech 106 of the user 108 by performing, based on a zero-crossing rate, periodicity analysis (e.g., fundamental and formant frequencies f0, f1, f2), centroid, critical-band filter coefficients, mel-frequency cepstrum (MFC) coefficients, etc., a voice activity detection. The voice activity detection may also include noise suppression. For example, the noise suppression may be performed by implementing spectral subtraction, and may be used in conjunction with the voice activity detection. The noise suppression may also include a blind de-reverberation technique to reduce the influence of strong acoustical reflections that impede proper speech recognition.
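As a rough illustration of the zero-crossing-rate portion of such a voice activity detection, the sketch below flags frames whose short-time energy is high and whose zero-crossing rate is low (typical of voiced speech). The thresholds and frame sizes are assumed values, and the spectral-subtraction and de-reverberation steps mentioned above are not shown.

```python
import numpy as np

def voice_activity(x, sr, frame=512, hop=256, energy_thresh=0.01, zcr_thresh=0.15):
    """Flag a frame as speech when its energy is high and its zero-crossing rate is low."""
    flags = []
    for start in range(0, len(x) - frame, hop):
        w = x[start:start + frame]
        energy = np.mean(w ** 2)
        zcr = np.mean(np.abs(np.diff(np.sign(w)))) / 2.0  # crossings per sample
        flags.append(bool(energy > energy_thresh and zcr < zcr_thresh))
    return np.array(flags)
```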
[0039] At block 202, the word determination module 110 is to determine, based on word boundary detection, the plurality of words 112 included in the speech 106.
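Building on frame-level speech flags such as those above, word boundaries can be approximated by grouping consecutive speech frames and closing a word whenever a sufficiently long pause is seen. This is a hedged sketch of one possible boundary detector, not the method of the word determination module 110; the minimum-gap parameter is an assumption.

```python
def word_boundaries(speech_flags, hop, sr, min_gap_frames=10):
    """Group consecutive speech frames into (start_sec, end_sec) word candidates."""
    words, start, gap = [], None, 0
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap_frames:  # the pause is long enough to end the word
                words.append((start * hop / sr, (i - gap) * hop / sr))
                start, gap = None, 0
    if start is not None:  # the signal ended while a word was still open
        words.append((start * hop / sr, len(speech_flags) * hop / sr))
    return words
```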
[0040] At block 204, the feature extraction module 122 is to extract a feature 124 of the speech 106.
[0041] At block 206, the language identification module 126 is to identify, based on a classification of the categorized words 112, the identified rhythm structure 120, and the extracted feature 124, the language 128 of the speech 106. In this regard, the rhythm, features, explicit words, etc., may be classified based on a dictionary (or database) of frequently used words which has been created a priori for different languages.
[0042] At block 206, the classification may be performed based, for example, on Bayes classifier (looking at posterior probabilities, and creating discrete probability distribution functions based on the words being identified and their frequencies).
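As an illustration of such a posterior-probability view, the sketch below scores each candidate language by naive-Bayes accumulation of per-language frequencies of the detected common words. The frequency tables and priors are placeholders for the a-priori dictionary; nothing here is taken from the disclosure beyond the general idea.

```python
import math

def bayes_language_posterior(detected_words, word_freqs, priors):
    """word_freqs: {language: {word: relative frequency}}, priors: {language: prior}."""
    log_post = {}
    for lang, freqs in word_freqs.items():
        score = math.log(priors.get(lang, 1e-9))
        for w in detected_words:
            score += math.log(freqs.get(w, 1e-6))  # smooth unseen words
        log_post[lang] = score
    m = max(log_post.values())
    unnorm = {lang: math.exp(s - m) for lang, s in log_post.items()}
    z = sum(unnorm.values())
    return {lang: v / z for lang, v in unnorm.items()}  # discrete posterior distribution
```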
[0043] Alternatively or additionally, the classification at block 206 may be performed based, for example, on an artificial neural network (ANN) structure. The ANN may employ, for example, one neural network per language or one neural network with multiple output neurons, with one neuron per language. The ANN may be based on an appropriate selection of feature vectors for frequent words corresponding to a specified language.
[0044] Alternatively or additionally, the classification at block 206 may be performed based, for example, on an unsupervised approach such as fuzzy c-means or k-means clustering based on feature vectors being clustered according to the language class.
[0045] Alternatively or additionally, the classification at block 206 may be performed based, for example, on a Hidden Markov Model.
[0046] Alternatively or additionally, the classification at block 206 may be performed based, for example, on a deep-learning neural network employing 2-d data or 3-d data. An example of 2-d data may include time and amplitude of the speech words. An example of 3-d data may include a spectrogram or red, green, blue (RGB) mapped spectrograms.
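For the 3-d case, one plausible preparation step is a log-magnitude spectrogram of each spoken word, which can then be mapped to an RGB image for the deep-learning network. The sketch below uses scipy for the short-time analysis and is illustrative only; the analysis parameters are assumptions.

```python
import numpy as np
from scipy.signal import spectrogram

def log_spectrogram(x, sr, nperseg=512, noverlap=256):
    """Time-frequency representation of a spoken word for use as 2-d/3-d network input."""
    f, t, sxx = spectrogram(x, fs=sr, nperseg=nperseg, noverlap=noverlap)
    return f, t, 10.0 * np.log10(sxx + 1e-12)  # dB scale
```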
[0047] For the classification at block 206, there may be one model for performing classification and identification, or a parallel structure employing multiple-models for classification.
[0048] The model(s) with respect to block 206 may be trained to account for speech rate variations as well.
[0049] Figure 3 illustrates an example of language identification using speech and camera input to illustrate operation of the apparatus 100.
[0050] Referring to Figure 3, due to the presence of a camera (e.g., see camera input at 300), after ambient light compensation at block 302, a-priori landmark features from block 304 of the user 108 may be applied to pre-trained models at block 306. The pre-trained models may use anthropometric data based on racial characteristics. Alternatively or additionally, the image of the user 108 may be applied to a pre-trained deep-learning model to classify the user 108 based on ethnicity. In this regard, referring to block 308, the language identification module 126 is to combine the results of the visual demographic classification with speech identification to increase classification rates in the presence of certain confounds. For example, the confounds may include residual ambient noise corrupting the speech signal, acoustical environment degrading speech quality, infrequently used words in the speech capture, rhythm mis-estimation, etc.).
[0051] For block 306, an example of a machine learning model may include a deep neural network (DNN) such as a long short-term memory (LSTM), or related model. The machine learning model may be trained to discriminate languages, and include the application of pre word or post word segmentation. Facial feature tracking may also be used as an additional modality to reinforce phoneme matching.
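A minimal sketch of such an LSTM-based discriminator is shown below in PyTorch; the framework choice, layer sizes, and the per-frame features fed to it are assumptions made for illustration, not details of the pre-trained models at block 306.

```python
import torch
import torch.nn as nn

class LanguageLSTM(nn.Module):
    """Classify a sequence of per-frame feature vectors (e.g., f0, centroid, MFC) by language."""
    def __init__(self, n_features, n_languages, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_languages)

    def forward(self, frames):            # frames: (batch, time, n_features)
        _, (h, _) = self.lstm(frames)     # h: last hidden state, (1, batch, hidden)
        return self.out(h[-1])            # per-language scores
```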
[0052] For the camera-based approach of Figure 3, this approach may employ weighting factors to account for the results of blocks 206 and 306 (e.g. , see block 308). For example, assuming that the results of the classification at block 206 include a 55% probability, and results of the classification at block 306 include an 80% probability that the user 108 is of a particular ethnicity, then the confidence of the overall result of the classifier may be increased based on the probability determined by the results of block 306. For example, weights may be assigned to the classification results as follows:
(W1 × Classifier for speech + W2 × Classifier for image) / (W1 + W2) Equation (1)
The weights (W) in this regard may be assigned over a period of time or a specified duration of words (e.g., 10 words).
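A small sketch of the weighting in Equation (1) is given below, using the 55%/80% example above. The dictionary representation of the classifier outputs, the unit weights, and the mapping of the demographic probability onto languages are simplifications for illustration only.

```python
def fuse_classifiers(speech_probs, image_probs, w_speech=1.0, w_image=1.0):
    """Weighted combination of speech and image classifier outputs per Equation (1)."""
    langs = set(speech_probs) | set(image_probs)
    fused = {lang: (w_speech * speech_probs.get(lang, 0.0)
                    + w_image * image_probs.get(lang, 0.0)) / (w_speech + w_image)
             for lang in langs}
    return max(fused, key=fused.get), fused

# Speech classifier: 55% for one language; image-based demographic classifier: 80% consistent with it.
best, scores = fuse_classifiers({"ru": 0.55, "en": 0.45}, {"ru": 0.80, "en": 0.20})
```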
[0053] Figure 4 illustrates an example of feature based training and transfer of a neural network model to illustrate operation of the apparatus 100.
[0054] Referring to Figure 4, according to an example, multiple speech signals may be emulated as being common words. For example, an English word "family" (spoken by a female), and a Russian word "sem'ya" (equivalent to family) spoken by a male as well as a female may be used. In this example, the feature being used, among the various features described herein, may include the fundamental frequency f0 (Hz).
[0055] At block 400 of Figure 4, for a speech signal, training of a machine learning model may be performed at 402, and authentication based on a speech signal at block 404 may be performed at 406. The intermediate blocks between block 400 and block 402 may include Hamming window and frame-hop, feature extraction, and feature normalization over the duration of speech. Similarly, the intermediate blocks between block 404 and block 406 may include Hamming window and frame-hop, feature extraction, and feature normalization over the duration of speech.
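These intermediate blocks can be pictured as the small pipeline below: Hamming window with frame-hop, a per-frame feature, and normalization over the duration of the speech. The particular feature plugged in and the zero-mean, unit-variance normalization are assumptions for illustration.

```python
import numpy as np

def feature_pipeline(x, sr, extract, frame=512, hop=256):
    """Hamming window + frame-hop, per-frame feature extraction (extract returns a scalar),
    then normalization over the duration of the speech (the intermediate blocks of Figure 4)."""
    window = np.hamming(frame)
    feats = [extract(x[s:s + frame] * window, sr)
             for s in range(0, len(x) - frame, hop)]
    feats = np.asarray(feats, dtype=float)
    return (feats - feats.mean()) / (feats.std() + 1e-12)  # zero-mean, unit-variance track
```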
[0056] Figure 6 illustrates English emulated "common" spoken word (family) by a female, where the top subplot is the spoken word and the bottom subplot is the fundamental frequency f0 tracked over the word for a frame size of 512 samples, to illustrate operation of the apparatus 100. Figure 7 illustrates Russian emulated "common" spoken word (sem'ya for family) by a male, where the top subplot is the spoken word and the bottom subplot is the fundamental frequency f0 tracked over the word for a frame size of 512 samples, to illustrate operation of the apparatus 100. Figure 8 illustrates Russian emulated "common" spoken word (sem'ya for family) by a female, where the top subplot is the spoken word and the bottom subplot is the fundamental frequency f0 tracked over the word for a frame size of 512 samples, to illustrate operation of the apparatus 100.
[0057] Referring to Figures 6-8, the fundamental frequency f0 may be tracked over the duration of the word (the spoken word shown at the top subplot, whereas f0 shown in the bottom subplot).
[0058] Figure 9 illustrates processed female English tracked feature f0 to remove redundancies (e.g., zeros in the f0 value), corresponding to Figure 6, to illustrate operation of the apparatus 100. Figure 10 illustrates processed male Russian tracked feature f0 to remove redundancies (e.g., zeros in the f0 value), corresponding to Figure 7, to illustrate operation of the apparatus 100. Figure 11 illustrates processed female Russian tracked feature f0 to remove redundancies (e.g., zeros in the f0 value), corresponding to Figure 8, to illustrate operation of the apparatus 100.
[0059] Referring to Figures 9-11, the similarities in the female and male Russian spoken f0 with respect to the envelope may be observed, whereas clear differences in the envelope of the English spoken word f0 may be observed.
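A sketch of that redundancy-removal step is shown below: unvoiced frames (where the tracked f0 is zero) are dropped and the remaining track is smoothed into an envelope. The moving-average smoothing is an assumed choice; the disclosure itself only specifies removing the zero-valued entries.

```python
import numpy as np

def clean_f0_track(f0_track, smooth=5):
    """Drop unvoiced frames (f0 == 0) and smooth the rest into an envelope."""
    arr = np.asarray(f0_track, dtype=float)
    voiced = arr[arr > 0]
    if len(voiced) < smooth:
        return voiced
    kernel = np.ones(smooth) / smooth
    return np.convolve(voiced, kernel, mode="valid")  # simple moving-average envelope
```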
[0060] The envelope may then be used to train a machine learning model as shown at 402. Specifically, a neural network model, as shown in Figure 5, employing a feed-forward artificial neural network (ANN) with Nh hidden layers (e.g., l = {1, . . . , Nh}, with Nl neurons in hidden layer l) may be trained on such feature vectors or their transforms (including envelopes) to generalize and classify according to the languages, by exploiting the similarities of the commonly spoken word in a given language (between males and females), and dissimilarities due to accent and differences due to the commonly used word translations. In this example, the training exemplars may include the fundamental frequency f0 over the duration of the spoken word (e.g., the boundaries of the word may be determined when the amplitude of the signal falls below a certain threshold, factoring in any residual/ambient noise) for the various languages.
[0061] For each language, an output neuron may be trained to identify the most common word in that language with a categorical label for that language, or a class value of 1. The neurons associated with the remaining languages in the ANN model may be trained to report an output of -1. The ANN may be trained with rate changes of the commonly spoken word (for both male and female speakers in a given language) using, for example, a vocoder. The ANN model may also be trained with various accents.
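A minimal sketch of such a feed-forward ANN, with one output neuron per language trained toward a value of 1 for the matching language and -1 for the remaining languages, may resemble the following. PyTorch is used purely for illustration; the layer widths, the tanh activations, and the 64-point envelope input are assumptions made for the sketch rather than features of the disclosed model.

```python
import torch
import torch.nn as nn

class LanguageANN(nn.Module):
    """Feed-forward ANN with Nh hidden layers; one output neuron per
    language, trained toward +1 for the spoken language and -1 otherwise."""
    def __init__(self, input_size, hidden_sizes, num_languages):
        super().__init__()
        layers, prev = [], input_size
        for width in hidden_sizes:                              # Nh hidden layers
            layers += [nn.Linear(prev, width), nn.Tanh()]
            prev = width
        layers += [nn.Linear(prev, num_languages), nn.Tanh()]   # outputs in [-1, 1]
        self.net = nn.Sequential(*layers)

    def forward(self, x):                                       # x: (batch, input_size)
        return self.net(x)

# Example: a 64-point f0 envelope as input, two hidden layers, three languages
model = LanguageANN(input_size=64, hidden_sizes=[32, 16], num_languages=3)
```

During training, the target vector for an exemplar may be +1 at the output neuron of the spoken language and -1 at the remaining neurons, consistent with the labeling described above.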
[0062] The training algorithm may be gradient-descent based, Jacobian based, or Hessian-matrix based. For example, the Levenberg-Marquardt approach may be used to train the model in a backpropagation mode, where the weight vectors of the neural network may be adapted to minimize the mean-square output classification error via an approximation to the Hessian matrix of errors (e.g., second-order derivatives), according to:

w_{k+1} = w_k - (J^T J + \mu I)^{-1} J^T e        Equation (2)
For Equation (2), J may represent the Jacobian matrix that contains the first derivatives of the network errors with respect to the weights and biases, e may represent a vector of network errors, and μ may represent a damping parameter that adjusts the step between gradient-descent and Gauss-Newton behavior. The Jacobian matrix may be determined through a backpropagation technique, as opposed to determining the Hessian matrix.
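For illustration, a single Levenberg-Marquardt weight update according to Equation (2) may be sketched as follows, where the Jacobian of the network errors is assumed to have been obtained through backpropagation as described above.

```python
import numpy as np

def levenberg_marquardt_step(w, jacobian, errors, mu=1e-3):
    """One Levenberg-Marquardt update: w_{k+1} = w_k - (J^T J + mu*I)^{-1} J^T e.
    jacobian: (num_errors, num_weights) first derivatives of the network errors
    errors:   (num_errors,) vector of network errors e."""
    JtJ = jacobian.T @ jacobian
    damping = mu * np.eye(JtJ.shape[0])
    # Solve the damped normal equations rather than forming an explicit inverse
    step = np.linalg.solve(JtJ + damping, jacobian.T @ errors)
    return w - step
```

In practice, the damping term μ is typically increased or decreased between iterations depending on whether the error decreased; that adaptation is omitted here for brevity.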
[0063] The size of the input layer of the ANN, for example, may be determined by the product of the number of features tracked and the duration of the feature (or the duration of the envelope of the feature being tracked).
[0064] An example of a machine learning model may include a DNN, such as an LSTM or a related model. The machine learning model may be trained to discriminate languages, and may include the application of pre-word or post-word segmentation. Facial feature tracking may also be used as an additional modality to reinforce phoneme matching.
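As a hedged sketch of such an LSTM-based alternative, the model below maps a sequence of per-frame speech features to one score per language; the feature dimension, hidden size, and use of the final hidden state are assumptions made for the illustration.

```python
import torch
import torch.nn as nn

class LanguageLSTM(nn.Module):
    """LSTM over per-frame speech features; the final hidden state is
    mapped to one score per language."""
    def __init__(self, feature_dim, hidden_dim, num_languages):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_languages)

    def forward(self, frames):                # frames: (batch, time, feature_dim)
        _, (h_last, _) = self.lstm(frames)
        return self.classifier(h_last[-1])    # (batch, num_languages)
```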
[0065] Figures 12-14 respectively illustrate an example block diagram 1200, an example flowchart of a method 1300, and a further example block diagram 1400 for language identification with speech and visual anthropometric features. The block diagram 1200, the method 1300, and the block diagram 1400 may be implemented on the apparatus 100 described above with reference to Figure 1 by way of example and not limitation. The block diagram 1200, the method 1300, and the block diagram 1400 may be practiced in other apparatus. In addition to showing the block diagram 1200, Figure 12 shows hardware of the apparatus 100 that may execute the instructions of the block diagram 1200. The hardware may include a processor 1202, and a memory 1204 (i.e., a non-transitory computer readable medium) storing machine readable instructions that when executed by the processor cause the processor to perform the instructions of the block diagram 1200. The memory 1204 may represent a non-transitory computer readable medium. Figure 13 may represent a method for language identification with speech and visual anthropometric features, and the steps of the method. Figure 14 may represent a non-transitory computer readable medium 1402 having stored thereon machine readable instructions to provide language identification with speech and visual anthropometric features. The machine readable instructions, when executed, cause a processor 1404 to perform the instructions of the block diagram 1400 also shown in Figure 14.
[0066] The processor 1202 of Figure 12 and/or the processor 1404 of Figure 14 may include a single or multiple processors or other hardware processing circuit, to execute the methods, functions and other processes described herein. These methods, functions and other processes may be embodied as machine readable instructions stored on a computer readable medium, which may be non-transitory (e.g., the non-transitory computer readable medium 1402 of Figure 14), such as hardware storage devices (e.g., RAM (random access memory), ROM (read only memory), EPROM (erasable, programmable ROM), EEPROM (electrically erasable, programmable ROM), hard drives, and flash memory). The memory 1204 may include a RAM, where the machine readable instructions and data for a processor may reside during runtime.
[0067] Referring to Figures 1-12, and particularly to the block diagram 1200 shown in Figure 12, the memory 1204 may include instructions 1206 to analyze an input signal 104 to identify a speech 106 of a user 108.
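As an illustrative sketch of such speech identification (the claims describe a voice activity detection based on a zero-crossing rate and a periodicity analysis), the frame-level detector below flags voiced speech frames; the frame size, thresholds, and pitch-lag range are assumptions made for the sketch.

```python
import numpy as np

def voice_activity(signal, fs, frame_size=512, hop=256,
                   zcr_thresh=0.15, periodicity_thresh=0.3):
    """Flag voiced speech frames using a zero-crossing rate and a simple
    autocorrelation-based periodicity measure (illustrative thresholds)."""
    lag_min = max(int(fs / 400), 1)                  # ~400 Hz upper pitch bound
    lag_max = min(int(fs / 60), frame_size - 1)      # ~60 Hz lower pitch bound
    flags = []
    for start in range(0, len(signal) - frame_size, hop):
        frame = signal[start:start + frame_size]
        # Fraction of samples at which the waveform changes sign
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
        ac = np.correlate(frame, frame, mode="full")[frame_size - 1:]
        if ac[0] > 0 and lag_max > lag_min:
            periodicity = ac[lag_min:lag_max].max() / ac[0]
        else:
            periodicity = 0.0
        flags.append(zcr < zcr_thresh and periodicity > periodicity_thresh)
    return np.asarray(flags)
```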
[0068] The processor 1202 may fetch, decode, and execute the instructions 1208 to determine, based on word boundary detection, a plurality of words 112 included in the speech 106.
[0069] The processor 1202 may fetch, decode, and execute the instructions 1210 to categorize the words 112, for example, into a phoneme, a vowel, and/or a consonant, to generate categorized words 116.
[0070] The processor 1202 may fetch, decode, and execute the instructions 1212 to identify a rhythm structure 120 of the speech 106.
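A rhythm structure may be characterized, as described further in the claims, by the duration of vowels and the variability in vowel length. Purely as a sketch, such a summary may be computed as follows, where the vowel durations are assumed to be available from the word categorization step.

```python
import numpy as np

def rhythm_structure(vowel_durations_ms):
    """Summarize the rhythm of an utterance by the variability in the
    length of its vowels (durations assumed to be given in milliseconds)."""
    d = np.asarray(vowel_durations_ms, dtype=float)
    return {
        "mean_vowel_ms": float(d.mean()),
        "std_vowel_ms": float(d.std()),
        # Normalized variability (dimensionless), e.g., for comparing languages
        "variability": float(d.std() / (d.mean() + 1e-8)),
    }
```

For example, rhythm_structure([90, 140, 75, 160]) returns the mean, standard deviation, and normalized variability of the vowel lengths.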
[0071] The processor 1202 may fetch, decode, and execute the instructions 1214 to extract a feature 124 of the speech 106.
[0072] The processor 1202 may fetch, decode, and execute the instructions 1216 to identify, based on a classification of the categorized words 116, the identified rhythm structure 120, and the extracted feature 124, a language 128 of the speech 106.
[0073] The processor 1202 may fetch, decode, and execute the instructions 1218 to control processing associated with the speech 106 based on the identified language 128.
[0074] Referring to Figures 1-11 and 13, and particularly Figure 13, for the method 1300, at block 1302, the method may include analyzing an input signal 104 to identify a speech 106 of a user 108.
[0075] At block 1304 the method may include analyzing, by a trained recurrent neural network 134, the speech 106 of the user 108.
[0076] At block 1306 the method may include extracting, based on the analysis of the speech 106 by the recurrent neural network 134, features of the speech 106.
[0077] At block 1308 the method may include identifying, based on a classification of the features by the recurrent neural network 134, a language 128 of the speech 106.
[0078] At block 1310 the method may include controlling processing associated with the speech 106 based on the identified language 128.
[0079] Referring to Figures 1-11 and 14, and particularly Figure 14, for the block diagram 1400, the non-transitory computer readable medium 1402 may include instructions 1406 to analyze an input signal 104 to identify a speech 106 of a user 108.
[0080] The processor 1404 may fetch, decode, and execute the instructions 1408 to determine, based on word boundary detection, a plurality of words 112 included in the speech 106.

[0081] The processor 1404 may fetch, decode, and execute the instructions 1410 to extract a feature 124 of the speech 106 by analyzing a fundamental frequency of a word of the plurality of words 112 by a trained learning model.
[0082] The processor 1404 may fetch, decode, and execute the instructions 1412 to identify, based on a classification of the extracted feature 124, a language 128 of the speech 106.
[0083] The processor 1404 may fetch, decode, and execute the instructions 1414 to control processing associated with the speech 106 based on the identified language 128.
[0084] What has been described and illustrated herein is an example along with some of its variations. The terms, descriptions and figures used herein are set forth by way of illustration only and are not meant as limitations. Many variations are possible within the spirit and scope of the subject matter, which is intended to be defined by the following claims - and their equivalents - in which all terms are meant in their broadest reasonable sense unless otherwise indicated.

Claims

What is claimed is:
1. An apparatus comprising:
a processor; and
a non-transitory computer readable medium storing machine readable instructions that when executed by the processor cause the processor to:
analyze an input signal to identify a speech of a user;
determine, based on word boundary detection, a plurality of words included in the speech;
categorize the words into at least one of a phoneme, a vowel, and a consonant;
identify a rhythm structure of the speech;
extract a feature of the speech;
identify, based on a classification of the categorized words, the identified rhythm structure, and the extracted feature, a language of the speech; and
control processing associated with the speech based on the identified language.
2. The apparatus according to claim 1, wherein the instructions are further to cause the processor to: analyze the input signal to identify the speech of the user by performing, based on at least one of a zero-crossing rate and a periodicity analysis, a voice activity detection.
3. The apparatus according to claim 1, wherein the instructions are further to cause the processor to: identify the rhythm structure of the speech by determining a vowel duration of a vowel included in the words, and analyzing, based on the vowel duration, a variability in a length of the vowel included in the words.
4. The apparatus according to claim 1, wherein the instructions are further to cause the processor to: control processing associated with the speech based on the identified language by at least one of implementing, for the identified language, a speech to text translation, and translating the identified language to another language specific to another user.
5. The apparatus according to claim 1, wherein the instructions are further to cause the processor to: analyze landmark features of the user to determine a racial characteristic of the user; and combine results including the language identification and the determined racial characteristic to increase a confidence associated with the identification of the language of the speech.
6. The apparatus according to claim 1, wherein the instructions are further to cause the processor to: classify, based on an analysis of an image of the user by a pre-trained learning model, the user according to an ethnicity; and combine results including the language identification and the classified ethnicity to increase a confidence associated with the identification of the language of the speech.
7. The apparatus according to claim 1, wherein the instructions are further to cause the processor to: extract the feature of the speech by analyzing a fundamental frequency of a word of the plurality of words by a trained learning model.
8. A method comprising:
analyzing, by a processor, an input signal to identify a speech of a user;
analyzing, by a trained recurrent neural network, the speech of the user;
extracting, based on the analysis of the speech by the recurrent neural network, features of the speech;
identifying, based on a classification of the features by the recurrent neural network, a language of the speech; and
controlling processing associated with the speech based on the identified language.
9. The method according to claim 8, wherein analyzing the input signal to identify the speech of the user further comprises: performing, based on at least one of a zero-crossing rate and a periodicity analysis, a voice activity detection.
10. The method according to claim 8, wherein controlling processing associated with the speech based on the identified language further comprises at least one of: implementing, for the identified language, a speech to text translation; and translating the identified language to another language specific to another user.
11. A non-transitory computer readable medium having stored thereon machine readable instructions, the machine readable instructions, when executed, cause a processor to:
analyze an input signal to identify a speech of a user;
determine, based on word boundary detection, a plurality of words included in the speech;
extract a feature of the speech by analyzing a fundamental frequency of a word of the plurality of words by a trained learning model;
identify, based on a classification of the extracted feature, a language of the speech; and
control processing associated with the speech based on the identified language.
12. The non-transitory computer readable medium according to claim 11, wherein the instructions are further to cause the processor to: categorize the words into at least one of a phoneme, a vowel, and a consonant; identify a rhythm structure of the speech; and identify, based on a classification of the categorized words, the identified rhythm structure, and the extracted feature, the language of the speech.
13. The non-transitory computer readable medium according to claim 11, wherein the instructions are further to cause the processor to: control processing associated with the speech based on the identified language by at least one of implementing, for the identified language, a speech to text translation, and translating the identified language to another language specific to another user.
14. The non-transitory computer readable medium according to claim 11, wherein the instructions are further to cause the processor to: analyze landmark features of the user to determine a racial characteristic of the user; and combine results including the language identification and the determined racial characteristic to increase a confidence associated with the identification of the language of the speech.
15. The non-transitory computer readable medium according to claim 11, wherein the instructions are further to cause the processor to: classify, based on an analysis of an image of the user by a pre-trained learning model, the user according to an ethnicity; and combine results including the language identification and the classified ethnicity to increase a confidence associated with the identification of the language of the speech.
PCT/US2017/043765 2017-07-25 2017-07-25 Language identification with speech and visual anthropometric features Ceased WO2019022722A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2017/043765 WO2019022722A1 (en) 2017-07-25 2017-07-25 Language identification with speech and visual anthropometric features


Publications (1)

Publication Number Publication Date
WO2019022722A1 true WO2019022722A1 (en) 2019-01-31

Family

ID=65040823


Country Status (1)

Country Link
WO (1) WO2019022722A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030229497A1 (en) * 2000-04-21 2003-12-11 Lessac Technology Inc. Speech recognition method
US20020086269A1 (en) * 2000-12-18 2002-07-04 Zeev Shpiro Spoken language teaching system based on language unit segmentation
US20070260461A1 (en) * 2004-03-05 2007-11-08 Lessac Technogies Inc. Prosodic Speech Text Codes and Their Use in Computerized Speech Systems
US20130226583A1 (en) * 2009-08-04 2013-08-29 Autonomy Corporation Limited Automatic spoken language identification based on phoneme sequence patterns

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110070853A (en) * 2019-04-29 2019-07-30 盐城工业职业技术学院 A kind of speech recognition conversion method and system
CN112397089A (en) * 2019-08-19 2021-02-23 中国科学院自动化研究所 Method and device for identifying identity of voice speaker, computer equipment and storage medium
CN112397089B (en) * 2019-08-19 2023-07-04 中国科学院自动化研究所 Speech sender identification method, device, computer equipment and storage medium
US11996087B2 (en) 2021-04-30 2024-05-28 Comcast Cable Communications, Llc Method and apparatus for intelligent voice recognition


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 17919608; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 17919608; Country of ref document: EP; Kind code of ref document: A1)