
CN110910904A - Method for establishing voice emotion recognition model and voice emotion recognition method - Google Patents


Info

Publication number
CN110910904A
Authority
CN
China
Prior art keywords
model
emotion recognition
audio
natural
gmm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911355782.XA
Other languages
Chinese (zh)
Inventor
王磊 (Wang Lei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Baiying Technology Co Ltd
Original Assignee
Zhejiang Baiying Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Baiying Technology Co Ltd filed Critical Zhejiang Baiying Technology Co Ltd
Priority to CN201911355782.XA
Publication of CN110910904A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for establishing a speech emotion recognition model, which comprises the following steps: extracting 1584-dimensional emotional acoustic features using openSMILE; training a UBM universal background model on speech in natural emotional states using the emotional acoustic features; generating, for each category of natural emotional state, a corresponding GMM (Gaussian mixture model) on the basis of the UBM universal background model; and using the GMM corresponding to each category of natural emotional state as the speech emotion recognition model.

Description

Method for establishing voice emotion recognition model and voice emotion recognition method
Technical Field
The invention relates to the field of voice signal processing, in particular to a method for establishing a voice emotion recognition model and a voice emotion recognition method.
Background
With the development of artificial intelligence technology, computers have become increasingly intelligent, and voice assistants such as Apple's Siri and other intelligent dialogue systems have entered people's lives. People can communicate and interact with all kinds of machines through these dialogue systems, yet almost none of them can recognize a person's emotion and give suitably intelligent feedback. Emotional information in speech is a very important behavioral signal reflecting human emotion, and recognizing the emotional information contained in speech is an important part of achieving natural human-computer interaction. Existing face recognition technology can judge and analyze human expressions, but it cannot be applied to intelligent outbound-call scenarios, and emotion recognition in the intelligent outbound-call field remains a blank area.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method for establishing a speech emotion recognition model and a speech emotion recognition method, so as to recognize the emotion carried in speech in the intelligent outbound-call field and improve the accuracy of semantic understanding.
In a first aspect, the present invention provides a method for establishing a speech emotion recognition model, the method comprising:
extracting 1584-dimensional emotional acoustic features using openSMILE;
training a UBM universal background model on speech in natural emotional states using the emotional acoustic features;
generating, for each category of natural emotional state, a corresponding GMM (Gaussian mixture model) on the basis of the UBM universal background model;
and using the GMM corresponding to each category of natural emotional state as the speech emotion recognition model.
In the above scheme, the natural emotional states include happiness, sadness, anger, and neutral.
In a second aspect, the present invention provides a speech emotion recognition method based on any one of the methods of the first aspect, comprising:
acquiring a voice file, performing VAD (voice activity detection) preprocessing on the voice file, splitting it into at least one audio segment according to its silent portions, and converting the at least one audio segment into acoustic features;
inputting the acoustic features into the speech emotion recognition model to obtain GMM supervectors;
inputting the GMM supervectors, as embeddings of the at least one audio segment, into an XGBoost model to obtain a label for each audio segment;
voting on the labels of the at least one audio segment using a Bagging algorithm to obtain a voting result for the labels;
and taking the voting result that receives the majority of votes as the final output and labeling the voice file with that emotion.
In the above aspect, the method further includes: cutting out the silent portions of the voice file in the time-frequency domain to obtain the at least one audio segment after splitting.
The invention has the following beneficial effects: the embodiments of the invention provide a method for establishing a speech emotion recognition model and a speech emotion recognition method in which a GMM-UBM-based speech emotion recognition model can be trained with only a small amount of corpus, enabling correct recognition of a user's speech emotion. The model supports real-time recognition, can be used online, and achieves millisecond-level response. Moreover, the model is cheap to maintain: recognizing additional speech emotions only requires collecting bad cases and retraining the model.
Drawings
FIG. 1 is a schematic flow chart of a method for establishing a speech emotion recognition model according to the present invention;
FIG. 2 is a schematic flow chart of a speech emotion recognition method provided by the present invention.
Detailed Description
The technical solutions of the present invention are described in further detail below with reference to specific embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention rather than all of them. All other embodiments obtained by a person skilled in the art based on the embodiments herein without creative effort shall fall within the protection scope of the present invention.
The embodiments of the invention describe in detail the method for establishing a speech emotion recognition model and the speech emotion recognition method provided by the invention.
Referring to FIG. 1, the schematic flow chart of the method for establishing a speech emotion recognition model provided by the present invention includes:
S101, extracting 1584-dimensional emotional acoustic features using openSMILE;
openSMILE (open-source Speech and Music Interpretation by Large-space Extraction) is a modular and flexible feature extractor. The emotional feature set extracted with openSMILE comprises 1582-dimensional emotional acoustic features: 34 low-level descriptors (LLDs) and their 34 corresponding deltas serve as 68 LLD contour values, and applying 21 functionals on this basis yields 1428 emotional acoustic features; in addition, applying 19 functionals to 4 pitch-based LLDs and their 4 delta coefficients yields 152 emotional acoustic features; finally, the number of pitch onsets (pseudo-syllables) and the total input duration are appended (2 features).
Specifically, the extracted low-level descriptors (LLDs) include:
fundamental frequency features, including the mean, variance, delta, smoothed contour curve, and the like; root-mean-square signal energy features and their smoothed contours; Mel-frequency cepstral coefficients (MFCCs); linear prediction coefficients (LPC); and frame-to-frame pitch jitter.
It can be understood that the 1584-dimensional emotional acoustic features extracted with openSMILE are used as training data, covering speech from a variety of environments, to train the UBM universal background model.
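As an illustration, a minimal sketch of this feature-extraction step using the Python opensmile package is shown below. The patent does not name the exact 1584-dimensional configuration, so the ComParE_2016 functional set is used here only as a stand-in, and the file name is a placeholder.

```python
import opensmile

# Sketch of the acoustic feature extraction step. ComParE_2016 is an assumed
# stand-in for the 1584-dimensional set described in the patent, and
# "utterance.wav" is a placeholder file name.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("utterance.wav")  # one row of acoustic functionals
print(features.shape)
```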
S102, training a UBM universal background model on speech in natural emotional states using the emotional acoustic features;
It can be understood that, when emotional corpus is scarce, the data cannot effectively describe the probability distribution of the speaker's emotional feature space; a UBM universal background model is therefore introduced, from which an emotion recognition model corresponding to each emotion is obtained adaptively. The training data of the UBM universal background model are the 1584-dimensional emotional acoustic features extracted with openSMILE.
In one example, the UBM universal background model is trained with the EM algorithm. The UBM is a Gaussian mixture model of mixture order M, described by the parameter set λ = {ω_i, μ_i, Σ_i}, i = 1, 2, ..., M.
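A minimal sketch of this EM training step with scikit-learn's GaussianMixture is given below, assuming the pooled emotional acoustic features have already been stacked into a matrix; the mixture order and covariance type are assumptions, not values stated in the patent.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(X_all: np.ndarray, n_components: int = 64, seed: int = 0) -> GaussianMixture:
    """EM training of the UBM on pooled natural-emotion features.
    X_all has shape (n_samples, 1584); n_components (the mixture order M) is assumed."""
    ubm = GaussianMixture(
        n_components=n_components,
        covariance_type="diag",  # diagonal covariances, typical for UBMs
        max_iter=200,
        random_state=seed,
    )
    ubm.fit(X_all)  # expectation-maximization
    return ubm
```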
S103, generating, for each category of natural emotional state, a corresponding GMM (Gaussian mixture model) on the basis of the UBM universal background model;
It will be appreciated that the categories of natural emotional states include happiness, sadness, anger, and neutral, and may further include either or both of fear and surprise.
In one example, given the UBM universal background model and the training vectors of the four natural emotional states, the parameters of the UBM are fine-tuned to determine each GMM model: the posterior probability distribution of the training vectors under the UBM is computed; sufficient statistics for the mixture weights, mean vectors, and variances are accumulated from this probability distribution and the training vectors; and finally the new sufficient statistics are used to update those of the UBM universal background model, yielding the GMM model.
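The sketch below illustrates this adaptation with means-only relevance MAP adaptation, a common simplification in GMM-UBM systems; the patent also updates weights and variances, and the relevance factor used here is an assumption.

```python
import copy
import numpy as np

def map_adapt_means(ubm, X_emotion: np.ndarray, relevance: float = 16.0):
    """Means-only MAP adaptation of the UBM to one emotion's feature vectors."""
    gmm = copy.deepcopy(ubm)
    resp = ubm.predict_proba(X_emotion)         # posterior probabilities, shape (n, M)
    n_k = resp.sum(axis=0) + 1e-10              # zeroth-order sufficient statistics
    f_k = resp.T @ X_emotion                    # first-order sufficient statistics, (M, d)
    e_k = f_k / n_k[:, None]                    # per-component data means
    alpha = (n_k / (n_k + relevance))[:, None]  # adaptation coefficients
    gmm.means_ = alpha * e_k + (1.0 - alpha) * ubm.means_
    return gmm

# One adapted GMM per natural emotional state, for example:
# emotion_gmms = {name: map_adapt_means(ubm, feats) for name, feats in emotion_corpus.items()}
```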
S104, using the GMM generated for each category of natural emotional state as the speech emotion recognition model.
Referring to FIG. 2, the schematic flow chart of the speech emotion recognition method based on the method for establishing a speech emotion recognition model provided by the present invention includes:
S201, acquiring a voice file, performing VAD preprocessing on it to obtain at least one audio segment, and converting the segments into acoustic features;
In step S201, the voice file is split into at least one audio segment according to its silent portions, and the at least one audio segment is converted into acoustic features.
In one example, the silent portions of the voice file are cut out in the time-frequency domain to obtain the audio segments after splitting.
A two-dimensional coordinate system is established with the time axis of the voice file as the X axis and its sound pressure as the Y axis. Because a speech signal fluctuates continuously, a segment of speech appears in this coordinate system as at least one audio portion together with zero or more silent portions; the silent portions in the voice file are cut out to obtain at least one audio segment, and the at least one audio segment is converted into acoustic features.
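A minimal sketch of this silence-based splitting, using librosa's energy-based split as a stand-in for the VAD preprocessing described above; the top_db threshold is an assumed value.

```python
import librosa

def split_on_silence(wav_path: str, top_db: float = 30.0):
    """Split a voice file into non-silent segments (stand-in for the VAD step)."""
    y, sr = librosa.load(wav_path, sr=None)
    intervals = librosa.effects.split(y, top_db=top_db)  # (start, end) sample indices
    segments = [y[start:end] for start, end in intervals]
    return segments, sr
```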
S202, inputting the acoustic features into the speech emotion recognition model to obtain GMM supervectors;
S203, inputting the GMM supervectors, as embeddings of the at least one audio segment, into an XGBoost model to obtain a label for each audio segment;
Specifically, the eXtreme Gradient Boosting (XGBoost) model is a machine learning model for classification and regression. Its main idea is to combine many weak learners (such as decision trees) into a strong classifier: the XGBoost model consists of multiple weak learners, an input is passed to each of them, and the individual outputs are combined to produce the final output.
In one example, the GMM supervector is input into the XGBoost model as the embedding (vector-space representation) of an audio segment to obtain the label of that segment.
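The sketch below builds a GMM supervector for each segment by MAP-adapting the UBM means to the segment's features and stacking them, then trains an XGBoost classifier on the supervectors. The supervector construction and the hyperparameters are assumptions; the patent does not spell out the exact recipe.

```python
import numpy as np
import xgboost as xgb

def gmm_supervector(ubm, segment_features: np.ndarray) -> np.ndarray:
    """Stacked adapted means as the segment embedding (assumed construction)."""
    adapted = map_adapt_means(ubm, segment_features)  # from the earlier sketch
    return adapted.means_.ravel()                     # shape (M * d,)

def train_segment_classifier(segments_features, labels, ubm):
    """segments_features: list of (n_frames, d) arrays; labels: integer emotion ids."""
    X = np.stack([gmm_supervector(ubm, seg) for seg in segments_features])
    clf = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
    clf.fit(X, np.asarray(labels))
    return clf
```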
S204, voting on the labels of the at least one audio segment using a Bagging algorithm to obtain a voting result for the labels;
In step S204, the voting result is the natural emotional state of the speech, namely any one of happiness, sadness, anger, and neutral.
S205, taking the voting result that receives the majority of votes as the final output and labeling the voice file with that emotion.
For S205, suppose for example that happiness receives 1 vote while sadness, anger, and neutral each receive 0 votes; happiness, holding the majority of the votes, is then taken as the final emotion recognition result of the voice file, and the voice file is labeled with the happy emotion.
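A minimal sketch of this majority vote over the per-segment labels:

```python
from collections import Counter

def vote_emotion(segment_labels):
    """Return the emotion label that receives the most votes across segments."""
    emotion, _ = Counter(segment_labels).most_common(1)[0]
    return emotion

# Example: vote_emotion(["happy", "neutral", "happy"]) returns "happy"
```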
Based on the above technical scheme, the method for establishing a GMM-UBM-based speech emotion recognition model and the speech emotion recognition method require only a small amount of emotional corpus to train the UBM universal background model; the GMM models are then generated adaptively from the UBM to serve as the speech emotion recognition model, and real-time speech emotion recognition is achieved on the basis of that model.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, which are merely illustrative rather than restrictive. A person skilled in the art, guided by the idea of the present invention, may make many variations in specific embodiments and application scope, and such variations fall within the protection scope of the present invention.

Claims (4)

1. A method for establishing a speech emotion recognition model is characterized by comprising the following steps:
extracting 1584-dimensional emotional acoustic features using openSMILE;
training a UBM universal background model on speech in natural emotional states using the emotional acoustic features;
generating, for each category of natural emotional state, a corresponding GMM (Gaussian mixture model) on the basis of the UBM universal background model;
and using the adaptively generated GMM model corresponding to each category of natural emotional state as the speech emotion recognition model.
2. The method of claim 1, wherein the natural emotional states include happy, sad, angry, and neutral.
3. A speech emotion recognition method based on the method for establishing a speech emotion recognition model according to any one of claims 1 to 2, characterized in that the method comprises:
acquiring a voice file, performing VAD (voice activity detection) preprocessing on the voice file, splitting it into at least one audio segment according to its silent portions, and converting the at least one audio segment into acoustic features;
inputting the acoustic features into the speech emotion recognition model to obtain GMM supervectors;
inputting the GMM supervectors, as embeddings of the at least one audio segment, into an XGBoost model to obtain a label for each audio segment;
voting on the labels of the at least one audio segment using a Bagging algorithm to obtain a voting result for the labels;
and taking the voting result that receives the majority of votes as the final output and labeling the voice file with that emotion.
4. The speech emotion recognition method of claim 2, wherein the method further comprises: cutting out the silent portions of the voice file in the time-frequency domain to obtain the at least one audio segment after splitting.
CN201911355782.XA 2019-12-25 2019-12-25 Method for establishing voice emotion recognition model and voice emotion recognition method Pending CN110910904A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911355782.XA CN110910904A (en) 2019-12-25 2019-12-25 Method for establishing voice emotion recognition model and voice emotion recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911355782.XA CN110910904A (en) 2019-12-25 2019-12-25 Method for establishing voice emotion recognition model and voice emotion recognition method

Publications (1)

Publication Number Publication Date
CN110910904A true CN110910904A (en) 2020-03-24

Family

ID=69827694

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911355782.XA Pending CN110910904A (en) 2019-12-25 2019-12-25 Method for establishing voice emotion recognition model and voice emotion recognition method

Country Status (1)

Country Link
CN (1) CN110910904A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634873A (en) * 2020-12-22 2021-04-09 上海幻维数码创意科技股份有限公司 End-to-end emotion recognition method based on Chinese speech OpenSmile and bidirectional LSTM

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100036660A1 (en) * 2004-12-03 2010-02-11 Phoenix Solutions, Inc. Emotion Detection Device and Method for Use in Distributed Systems
CN102779510A (en) * 2012-07-19 2012-11-14 东南大学 Speech Emotion Recognition Method Based on Feature Space Adaptive Projection
CN102881284A (en) * 2012-09-03 2013-01-16 江苏大学 Unspecific human voice and emotion recognition method and system
CN103440863A (en) * 2013-08-28 2013-12-11 华南理工大学 Speech emotion recognition method based on manifold
US20140257820A1 (en) * 2013-03-10 2014-09-11 Nice-Systems Ltd Method and apparatus for real time emotion detection in audio interactions
US20150206543A1 (en) * 2014-01-22 2015-07-23 Samsung Electronics Co., Ltd. Apparatus and method for emotion recognition
CN107895582A (en) * 2017-10-16 2018-04-10 中国电子科技集团公司第二十八研究所 A speaker-adaptive speech emotion recognition method for multi-source information domain
WO2018180134A1 (en) * 2017-03-28 2018-10-04 株式会社Seltech Emotion recognition device and emotion recognition program
CN109344257A (en) * 2018-10-24 2019-02-15 平安科技(深圳)有限公司 Text emotion recognition methods and device, electronic equipment, storage medium
CN110021308A (en) * 2019-05-16 2019-07-16 北京百度网讯科技有限公司 Voice mood recognition methods, device, computer equipment and storage medium
US20190295533A1 (en) * 2018-01-26 2019-09-26 Shanghai Xiaoi Robot Technology Co., Ltd. Intelligent interactive method and apparatus, computer device and computer readable storage medium
CN110532379A (en) * 2019-07-08 2019-12-03 广东工业大学 A kind of electronics information recommended method of the user comment sentiment analysis based on LSTM



Similar Documents

Publication Publication Date Title
Basak et al. Challenges and limitations in speech recognition technology: A critical review of speech signal processing algorithms, tools and systems
Schuller et al. Emotion recognition in the noise applying large acoustic feature sets
Tong et al. A comparative study of robustness of deep learning approaches for VAD
CN107972028B (en) Man-machine interaction method and device and electronic equipment
Sahoo et al. Emotion recognition from audio-visual data using rule based decision level fusion
CN107731233A (en) A kind of method for recognizing sound-groove based on RNN
CN116665669A (en) A voice interaction method and system based on artificial intelligence
Samantaray et al. A novel approach of speech emotion recognition with prosody, quality and derived features using SVM classifier for a class of North-Eastern Languages
CN116631434A (en) Video and voice synchronization method and device based on conversion system and electronic equipment
Cao et al. Speaker-independent speech emotion recognition based on random forest feature selection algorithm
Zhang et al. Interaction and Transition Model for Speech Emotion Recognition in Dialogue.
CN114120973A (en) Training method for voice corpus generation system
CN117352000A (en) Speech classification method, device, electronic equipment and computer readable medium
Sinha et al. Acoustic-phonetic feature based dialect identification in Hindi Speech
Hamidi et al. Emotion recognition from Persian speech with neural network
CN110910904A (en) Method for establishing voice emotion recognition model and voice emotion recognition method
Guha et al. Desco: Detecting emotions from smart commands
Rabiee et al. Persian accents identification using an adaptive neural network
Ashrafidoost et al. Recognizing Emotional State Changes Using Speech Processing
Nereveettil et al. Feature selection algorithm for automatic speech recognition based on fuzzy logic
Yin et al. Investigating speech features and automatic measurement of cognitive load
Gambhir et al. Investigating Activation Functions to Enhance Speaker Identification with LSTM Networks
Bhavani et al. A survey on various speech emotion recognition techniques
Zoric et al. Automated gesturing for virtual characters: speech-driven and text-driven approaches
Putra et al. Indonesian natural voice command for robotic applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2020-03-24)