
CN110634491B - Tandem feature extraction system and method for general speech tasks in speech signals

Info

Publication number: CN110634491B (application CN201911014154.5A)
Authority: CN (China)
Prior art keywords: speech, features, recognition, layer, model
Legal status: Active (granted)
Other versions: CN110634491A (Chinese)
Inventors: 贾宁, 郑纯军, 褚娜, 周慧, 孙风栋, 李绪成, 张轶
Assignee (current and original): Dalian Neusoft University of Information
Application filed by: Dalian Neusoft University of Information
Priority application: CN201911014154.5A
Publications: CN110634491A (application), CN110634491B (grant)

Classifications

    • G10L 15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 17/02: Speaker identification or verification; preprocessing operations, pattern representation or modelling, feature selection or extraction
    • G10L 17/18: Speaker identification or verification using artificial neural networks; connectionist approaches
    • G10L 25/03: Speech or voice analysis characterised by the type of extracted parameters
    • G10L 25/30: Speech or voice analysis characterised by the analysis technique, using neural networks
    • G10L 25/63: Speech or voice analysis specially adapted for estimating an emotional state


Abstract

The invention discloses a tandem feature extraction system and method for general speech tasks in speech signals. After an emotional corpus is built, speech preprocessing is performed, and the preprocessed audio files are fed as input to a speech feature extraction model. In one working form, the model obtains the speaker's basic information and salient features through a voiceprint recognition layer and feeds them, together with the audio, into a speech recognition layer to obtain speaker information and text information; keywords are extracted and sent, together with low-level descriptors, to an emotion recognition layer, which uses personalized speech and emotion features to perform emotion recognition and obtain emotion-related features. In the other working form, the model likewise obtains the speaker's basic information and salient features through the voiceprint recognition layer and feeds them, together with the speech recognition features, into the speech recognition layer; the speech recognition layer then sends the speaker information and the low-level descriptors to the emotion recognition layer, which outputs the emotional features. The invention has the advantages of a large data scale, accurate emotion expression and a high degree of integration.

Description

Tandem feature extraction system and method for general speech tasks in speech signals

Technical Field

The present invention relates to the field of signal processing and feature extraction, and in particular to a feature extraction model for speech tasks.

Background Art

Speech is the most effective, most natural and most important form of human communication. For humans and machines to communicate through speech, machines must be intelligent enough to recognize human voices. With the development of machine learning, neural networks and deep learning, the quality with which speech-related recognition tasks are completed has steadily improved, which greatly helps computers understand the content of speech. At present, speech recognition work mainly involves the following three recognition tasks:

1. Voiceprint recognition

Voiceprint recognition, also known as speaker recognition, is a form of biometric identification. It analyses a speaker's continuous speech signal to extract discrete speech features and automatically confirms the speaker by matching them against templates in a database. It focuses on the speaker rather than on what is said. Because people differ in their vocal organs, accents and speaking rhythm, speaker information can be extracted by analysing the voice, thereby identifying the person.

2. Speech recognition

Speech recognition is the technology that lets machines convert speech signals into the corresponding text or commands through a process of recognition and understanding. Applications of speech recognition technology include voice dialling, voice navigation, indoor device control, spoken document retrieval and simple dictation data entry. Combined with other natural language processing technologies such as machine translation and speech synthesis, speech recognition can be used to build more complex applications.

3. Speech emotion recognition

Traditional human-computer interaction relies mainly on the keyboard and mouse: the computer only passively receives information, cannot actively communicate with people, and no emotional communication takes place between humans and computers, so computers cannot achieve natural and harmonious human-computer interaction. Emotion recognition helps simulate the emotion-laden exchanges that take place between people and gives computers the ability to perform affective computing.

However, the three recognition tasks above suffer from many defects and shortcomings in practical application and design. For example, the voiceprint recognition, speech recognition and emotion recognition task models are not interchangeable, their input formats are not unified, there is no universally applicable solution, the integration accuracy is low, and the recognition accuracy of emotion recognition as a single task is low.

Summary of the Invention

The purpose of the present invention is to provide a tandem feature extraction system and method for general speech tasks in speech signals that offers a large data scale, accurate emotion expression and a high degree of integration.

To achieve the above purpose, the following technical solution is adopted. The system of the present invention mainly comprises an emotional corpus, a speech preprocessing model and a speech feature extraction model. The emotional corpus is a collection of real emotions constructed in natural-language form; when the database is built, a statistical method is introduced on top of the traditional greedy algorithm. After the emotional corpus is established, speech preprocessing is performed.

Speech preprocessing is carried out in two stages, spectrogram extraction and the combination of low-level descriptors with high-level statistical functions, from which the corresponding basic features are extracted. The preprocessed audio files are fed as input to the speech feature extraction model.

The speech feature extraction model takes the user's audio file as input and comprises a voiceprint recognition sub-module, a speech recognition sub-module and a speech emotion recognition sub-module. The model has two working forms. The first has a three-layer structure. The first layer is voiceprint recognition; its input is the user's audio data, and it produces the speaker's basic information and salient features, which are fed, together with the speech recognition features, into the speech recognition layer. The speech recognition layer takes the voiceprint features and the user audio data as input and produces speaker information and text information. On the basis of keyword extraction, these are sent, together with the low-level descriptors, to the emotion recognition layer, which uses personalized speech and emotion features to perform emotion recognition and obtain emotion-related features.

The second has a two-layer structure that combines series and parallel connections. The first layer is voiceprint recognition; its input is the user's audio data, and it produces the speaker's basic information and salient features, which are fed, together with the speech recognition features, into the second, speech recognition layer. The speech recognition layer takes the voiceprint features and the user audio data as input and produces speaker information and text information. In addition, the speech recognition layer can simultaneously send the speaker information, together with the low-level descriptors, to the emotion recognition layer, which outputs the emotional features in parallel; in this case the input of the emotion recognition model comes only from the voiceprint features and the speech signal.

Further, the spectrogram is the displayed image of the Fourier analysis of the speech signal. It is a three-dimensional spectrum showing how the speech spectrum changes over time: the abscissa of the spectrogram is time, the ordinate is frequency, and the value at each point is the energy of the speech data. It is obtained as follows. A speech signal x(t) is first divided into frames, giving x(m, n), where n is the frame length and m is the number of frames. A fast Fourier transform gives X(m, n), from which the periodogram Y(m, n) = X(m, n)·X(m, n)' is computed and 10·log10(Y(m, n)) is taken. Mapping m onto the time scale gives M, and mapping n onto the frequency scale gives N; the two-dimensional image formed by M, N and 10·log10(Y(m, n)) is the spectrogram. The audio file is processed in 20 ms frames with a 10 ms step, producing a spectrogram for each frame. The per-frame spectrograms are fused vertically and split horizontally into 512 dimensions; on the basis of this vertical segmentation, each column is averaged, giving a 1×512-dimensional spectrogram feature.

Further, the emotional corpus contains four emotions: happiness, anger, calm and sadness. A statistical method is introduced on top of the traditional greedy algorithm, and statistics are kept for each emotion class so that the size of each class stays within the 80%-120% range. To ensure correct centralized processing of the data, the recordings in the database are saved in wav format; the audio files have a sampling rate of 4400 Hz and 16-bit precision and are recorded in mono. The mean of the sums of the products of every annotator's per-dimension labels and the previous credibility is computed and used as the baseline value; the correlation coefficient between each annotator's labels and the baseline is then computed as the measure of that annotator's new credibility. As the number of labelled items grows, the credibility indices are adjusted immediately; once the new credibility indices are obtained, the current annotation result is recomputed, i.e. all annotators' labels are weighted by their credibilities and summed to give the final annotation result. The specific formulas are as follows.

(Equations (1)-(4), which define the credibility update and the weighted integration of the annotation results, are reproduced as images in the original publication.)

W_exp and W'_exp are the previous and the current credibility respectively, n is the number of annotators, Result_exp is the result of the current annotator, and Avg_i and Avg'_i are the annotation results computed with the previous and the current credibility respectively.

Further, when capturing the rawest acoustic features, the speech signal must be converted into a speech feature vector, i.e. low-level descriptors (LLD) are combined with high-level statistical functions (HSF, High-level Statistics Functions); all of these features can be computed directly with the OpenSmile toolbox. The following LLD features are selected: Mel-frequency cepstral coefficients MFCC 1-14, Mel-band log power 1-7, formants 1-3, fundamental frequency F0, jitter, energy features, an energy-spectrum feature (alphaRatioUV) and line spectral pair frequencies 1-8.

Further, the voiceprint recognition sub-module contains all the personal characteristic factors from the earlier stage. Its input is the spectrogram and its delta form, and the model used is a combination of a Time-Delay Neural Network (TDNN) and an i-vector model.

Further, the speech recognition sub-model first feeds the spectrogram into a convolutional neural network, then passes it through a Gated Recurrent Unit (GRU) network layer and into CTC (Connectionist Temporal Classification) to obtain the phoneme sequence; combined with the personalized voiceprint features, its similarity to the training samples is computed, giving a similarity value between 0 and 1.

Further, in the speech emotion recognition sub-module, built on the speaker recognition model or the speech recognition model, a neural network combines hand-crafted HSF features with a convolutional neural network to learn a joint multi-channel representation of speech emotion features, focusing on the specific parts of the recording that contain strong articulatory information as well as on the global information.

The present invention further provides a tandem feature extraction method for general speech tasks in speech signals. The extraction method comprises three parts: voiceprint recognition feature extraction, speech recognition feature extraction and speech emotion recognition feature extraction.

1) The voiceprint recognition features are extracted as follows:

S1-1: the input is the spectrogram and its delta form, and the model used is a combination of a Time-Delay Neural Network (TDNN) and an i-vector model;

S1-2: the TDNN features are fed into an attention module to improve the computational power of the model;

S1-3: the attention module outputs the important voiceprint features, and the voiceprint recognition model outputs partial voiceprint features of dimension 1×512; the comparison result against everyone in the corpus is obtained through Softmax.

2) The speech recognition features are extracted as follows:

The speech recognition sub-model first feeds the spectrogram into a convolutional neural network, then passes it through a Gated Recurrent Unit (GRU) network layer and into CTC (Connectionist Temporal Classification) to obtain the phoneme sequence; combined with the personalized voiceprint features, its similarity to the training samples is computed, giving a similarity value between 0 and 1.

3) The speech emotion recognition features are extracted as follows:

S3-1: the input information is the spectrogram and the LLD features; the LLD features are passed through specific HSF computations to obtain a 642-dimensional HSF feature;

S3-2: the spectrogram is fed into a CNN model and passes through three convolutional layers, three pooling layers and one fully connected layer, giving a 1×1024-dimensional feature output;

S3-3: the two outputs above are projected into the same feature space, giving 1024 + 642 = 1666-dimensional features, which are then fed into a unidirectional three-layer GRU model to extract speech emotion features, finally yielding a 1024-dimensional emotion feature.

Compared with the prior art, the present invention has the following advantages:

1. Using the common features extracted by the raw speech signal processing module and the common model of the tasks, a multi-channel network model is designed; each task can independently select several channels and cooperate to complete feature extraction, so that a single input passes through multiple paths and solves multiple tasks.

2. With a single input, the results of voiceprint recognition, speech recognition and emotion recognition are presented simultaneously, hierarchically and objectively.

3. The accuracy of voiceprint recognition, speech recognition and emotion recognition is improved.

4. Different schemes can be freely selected in each sub-model, or the default combination can be used.

5. The newly built speech emotion corpus provides a stable and reliable data source for voiceprint recognition, speech recognition and emotion recognition tasks.

6. The degree of integration of the voiceprint recognition, speech recognition and emotion recognition tasks is improved.

7. An emotional speech database with a large data scale, accurate emotion expression and high-quality recordings is designed; it has multi-age, high-level speech emotion characteristics, the recorded speech is expressive and highly recognizable, and it meets the need for rich and diverse emotions.

Brief Description of the Drawings

FIG. 1 is a flow chart of the first working form of the speech feature extraction model of the present invention.

FIG. 2 is a flow chart of the second working form of the speech feature extraction model of the present invention.

FIG. 3 is a structural diagram of the voiceprint recognition sub-model of the present invention.

FIG. 4 is a structural diagram of the TDNN model of the present invention.

FIG. 5 is a structural diagram of the speech recognition sub-model of the present invention.

FIG. 6 is a structural diagram of the GRU model of the present invention.

FIG. 7 is a structural diagram of the speech emotion recognition sub-model of the present invention.

Detailed Description of the Embodiments

The present invention is further described below with reference to the accompanying drawings.

The system of the present invention mainly comprises an emotional corpus, a speech preprocessing model and a speech feature extraction model. The emotional corpus is a collection of real emotions constructed in natural-language form; when the database is built, a statistical method is introduced on top of the traditional greedy algorithm. After the emotional corpus is established, speech preprocessing is performed.

The emotional corpus is an emotional speech database characterized by a large data scale, accurate emotion expression and high-quality recordings. It has multi-age, high-level speech emotion characteristics, and the recorded speech is expressive and highly recognizable, meeting the need for rich and diverse emotions to a certain extent. According to the collection method, emotional speech can be divided into natural speech, induced speech and acted speech. This database is a collection of real emotions based on natural speech and contains four emotions: happiness, anger, calm and sadness. The text that makes up the speech database must cover as many linguistic units as possible while the database itself must not grow too large; therefore, when building the emotional speech database, an improved greedy algorithm is introduced that combines the traditional text screening method with a statistical method.

The traditional greedy algorithm produces a globally optimal solution through locally optimal choices. When screening the content of the speech database, it selects, each time, the audio data with the best emotional expression from each category; this approach considers only text screening and ignores the balance of the data set. On the basis of this method, the present invention introduces a statistical method: statistics are kept for each emotion class so that the size of each class stays within the 80%-120% range, avoiding model errors caused by skewed data.

The corpus consists of emotionally rich utterances that are understood differently in different contexts, and the utterance styles meet the need for rich and diverse emotions to a certain extent. To ensure correct centralized processing of the data, the recordings in the database are saved in wav format; the audio files have a sampling rate of 4400 Hz and 16-bit precision and are recorded in mono.

The data are annotated with emotions from multiple perspectives. Since emotion is divided into four categories, the annotation has four dimensions; every dimension of every audio file must be annotated, with intensity divided into the four levels "none", "weak", "medium" and "strong", forming a scale of "0", "1", "2" and "3". The annotation process consists of a training phase and a formal phase. In the training phase, ten items are first selected for independent annotation; the annotators then compare their results and discuss the annotation rules, and once their views are largely consistent, large-scale formal annotation begins.

The mean of the sums of the products of every annotator's per-dimension labels and the previous credibility is computed and used as the baseline value; the correlation coefficient between each annotator's labels and the baseline is then computed as the measure of that annotator's new credibility. As the number of labelled items grows, the credibility indices are adjusted immediately; once the new credibility indices are obtained, the current annotation result is recomputed, i.e. all annotators' labels are weighted by their credibilities and summed to give the final annotation result. The specific formulas are as follows.

(Equations (1)-(4), which define the credibility update and the weighted integration of the annotation results, are reproduced as images in the original publication.)

W_exp and W'_exp are the previous and the current credibility respectively, n is the number of annotators, Result_exp is the result of the current annotator, and Avg_i and Avg'_i are the annotation results computed with the previous and the current credibility respectively.

Because the annotation data contain many dimensions whose labels are all zero, and the correlation coefficient cannot be computed when all labels are zero, a weight close to zero is used in place of the correlation coefficient in that case. By adjusting each individual annotation's contribution to the integrated result and using a weighted-average calculation, the contribution of values close to the mean is strengthened.
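
A minimal sketch of the credibility-weighted annotation scheme described above, assuming the labels for one emotion dimension are collected in an (annotators × items) NumPy array and that the near-zero fallback weight is 1e-3; all names, shapes and the exact weighting are illustrative interpretations of the text, not taken verbatim from the patent.

    import numpy as np

    def integrate_annotations(labels, prev_cred, eps=1e-3):
        # labels: (n_annotators, n_items) intensity scores in {0, 1, 2, 3}
        # prev_cred: (n_annotators,) credibility from the previous round
        weights = prev_cred / prev_cred.sum()
        baseline = weights @ labels                    # credibility-weighted baseline per item

        # New credibility: correlation of each annotator with the baseline,
        # with a near-zero weight when the correlation is undefined (all-zero labels).
        new_cred = np.full(len(labels), eps)
        for i, row in enumerate(labels):
            if row.std() > 0 and baseline.std() > 0:
                new_cred[i] = max(np.corrcoef(row, baseline)[0, 1], eps)

        # Final integrated labels: weighted average with the updated credibilities.
        final = (new_cred / new_cred.sum()) @ labels
        return final, new_cred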

Speech preprocessing is carried out in two stages, spectrogram extraction and the combination of low-level descriptors with high-level statistical functions, from which the corresponding basic features are extracted. The preprocessed audio files are fed as input to the speech feature extraction model.

The spectrogram is the displayed image of the Fourier analysis of the speech signal. It is a three-dimensional spectrum showing how the speech spectrum changes over time: the abscissa of the spectrogram is time, the ordinate is frequency, and the value at each point is the energy of the speech data. It is obtained as follows. A speech signal x(t) is first divided into frames, giving x(m, n), where n is the frame length and m is the number of frames. A fast Fourier transform gives X(m, n), from which the periodogram Y(m, n) = X(m, n)·X(m, n)' is computed and 10·log10(Y(m, n)) is taken. Mapping m onto the time scale gives M, and mapping n onto the frequency scale gives N; the two-dimensional image formed by M, N and 10·log10(Y(m, n)) is the spectrogram. The audio file is processed in 20 ms frames with a 10 ms step, producing a spectrogram for each frame. The per-frame spectrograms are fused vertically and split horizontally into 512 dimensions; on the basis of this vertical segmentation, each column is averaged, giving a 1×512-dimensional spectrogram feature.
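
A minimal sketch of the 1×512 spectrogram feature described above, assuming librosa is available for loading and framing and that no windowing or padding beyond the 20 ms/10 ms framing is required; the function names are illustrative.

    import numpy as np
    import librosa

    def spectrogram_feature(path, n_fft=1024, out_dim=512):
        y, sr = librosa.load(path, sr=None, mono=True)
        frame_len = int(0.020 * sr)                    # 20 ms frame
        hop_len = int(0.010 * sr)                      # 10 ms step
        frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop_len)
        X = np.fft.rfft(frames, n=n_fft, axis=0)       # FFT of every frame
        Y = np.abs(X) ** 2                             # periodogram Y(m, n)
        S = 10 * np.log10(Y + 1e-10)                   # 10*log10(Y(m, n))
        return S[:out_dim, :].mean(axis=1)             # average each column -> 1 x 512 feature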

The method for capturing the rawest acoustic features requires converting the speech signal into a speech feature vector, i.e. combining low-level acoustic feature descriptors (LLD) with high-level statistical functions (HSF). The present invention selects the low-level descriptors on the basis that (a) they have the potential to capture the physiological changes that emotion produces in the voice, (b) they have proven their value in previous studies and can be extracted automatically, and (c) they are theoretically meaningful. The set is intended to provide a baseline for studying speech features and to eliminate differences caused by varying models, or even by different implementations of the same parameters.

On this basis, the present invention selects the following LLD features: Mel-frequency cepstral coefficients MFCC 1-14, Mel-band log power 1-7, formants 1-3, fundamental frequency (F0), jitter, energy features, an energy-spectrum feature (alphaRatioUV) and line spectral pair frequencies 1-8.

For the HSFs, different features have different forms of expression and must be designed separately. On this basis, the present invention extracts the 14-dimensional MFCC features, the Mel-band 1-7 log power and the line spectral pair frequency 1-8 features, which are mainly static features, and applies 21 HSF functions to them, such as maximum, minimum, arithmetic mean, slope and offset; feature fusion makes the information complementary and effectively improves accuracy. At the same time, the centre frequencies and bandwidths of the first, second and third formants are computed and extracted. For the fundamental frequency and jitter of the current frame, eight statistics are recorded: maximum, minimum, mean, range, standard deviation, median, and upper and lower quartiles. For the energy features, eight functions are computed, namely the 20th, 50th and 80th percentiles, the range between the 20th and 80th percentiles, and the mean and standard deviation of the slope of the rising/falling speech signal, giving eight statistical features. For the energy-spectrum feature, three functions (mean, range and standard deviation) are computed, giving three features.

The specific LLDs and HSFs involved are shown in Table 1 and cover a total of 642 feature dimensions. All of these features can be computed with the OpenSmile toolbox.

Table 1 List of specific LLDs and number of HSFs (reproduced as an image in the original publication)

Table 2 List of the 21 common HSFs (reproduced as an image in the original publication)
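
A minimal sketch of LLD/HSF extraction, assuming the audEERING opensmile Python package is installed; the ComParE_2016 set is used here only as a stand-in, since the patent's own 642-dimensional LLD/HSF selection would be defined in a custom OpenSmile configuration rather than by these defaults.

    import opensmile

    lld_extractor = opensmile.Smile(
        feature_set=opensmile.FeatureSet.ComParE_2016,
        feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
    )
    hsf_extractor = opensmile.Smile(
        feature_set=opensmile.FeatureSet.ComParE_2016,
        feature_level=opensmile.FeatureLevel.Functionals,
    )

    lld_frames = lld_extractor.process_file("utterance.wav")   # per-frame LLDs (MFCC, F0, ...)
    hsf_vector = hsf_extractor.process_file("utterance.wav")   # one row of statistical functionals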

The speech feature extraction model is a personalized model designed for the individual. It takes the user's audio file as input, with the goal of obtaining personalized features of the acoustic signal for a specific user, and is designed as a multi-task Tandem model. The model contains three sub-models corresponding to the three tasks; the structure of these sub-models consists of multiple layers and comes in two forms, and the requirements of the task determine which form is selected. The first working form, shown in FIG. 1, has a three-layer structure. The first layer is voiceprint recognition; its input is the user's audio data, and it produces the speaker's basic information and salient features, which are fed, together with the speech recognition features, into the speech recognition layer. The speech recognition layer takes the voiceprint features and the user audio data as input and produces speaker information and text information. On the basis of keyword extraction, these are sent, together with the low-level descriptors, to the emotion recognition layer, which uses personalized speech and emotion features to perform emotion recognition and obtain emotion-related features.

The second working form, shown in FIG. 2, has a two-layer structure that combines series and parallel connections. The first layer is voiceprint recognition; its input is the user's audio data, and it produces the speaker's basic information and salient features, which are fed, together with the speech recognition features, into the second, speech recognition layer. The speech recognition layer takes the voiceprint features and the user audio data as input and produces speaker information and text information. In addition, the speech recognition layer can simultaneously send the speaker information, together with the low-level descriptors, to the emotion recognition layer, which outputs the emotional features in parallel; in this case the input of the emotion recognition model comes only from the voiceprint features and the speech signal.

The application scopes of the two forms differ slightly, with the emphasis depending on the results to be generated: the requirements of the first form centre on emotion recognition, while those of the second centre on speech recognition and emotion recognition. During feature extraction, one of the two structures can be selected according to the emphasis to complete the relevant task.

As shown in FIG. 3, since the present invention targets individual voiceprint recognition, speech recognition and emotion recognition, the voiceprint recognition model is an important link: it contains all the personal characteristic factors from the earlier stage. Its input is the spectrogram and its delta form, and the model used is a combination of a Time-Delay Neural Network (TDNN) and an i-vector model.

A TDNN is a feed-forward neural network whose computation is performed by interconnected layers composed of multiple units. As shown in FIG. 4, the input is a j-dimensional vector, D1-DN are the delay vectors, and wi are the connection weights. Suppose the delay N is 2 and the input layer dimension is J = 10; a delay of N = 2 means that the adjacent frame vectors, i.e. three frame vectors in total, are concatenated as the input of the input layer. An important idea of the time-delay neural network is weight sharing, i.e. connections at the same position have the same weight. In other words, when the delay is 2, there are three groups of weights in the network: one for the current frame and the other two for the delay frames D1 and D2. Different positions have different weights, and the same position shares its weights. Training uses the traditional back-propagation learning method.

The first few hidden layers learn from a context with a short stride. Because deeper neurons have stronger learning ability, the deep hidden layers learn from a context with a wider stride. Unlike a traditional feed-forward neural network, a TDNN can capture long-term dependencies between time-series-related feature pairs; it is precisely because of this property that the TDNN is chosen here as the method for extracting the speaker representation vector. Multi-frame training data are concatenated in different network layers for learning, yielding 1×512-dimensional long-term features between speech frames.
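
A minimal sketch of a TDNN block in PyTorch (an assumption, since the patent does not name a framework); each layer is a 1-D convolution over time whose kernel splices the current frame with its ±N neighbours and whose weights are shared across time steps, with dilation widening the context in deeper layers.

    import torch
    import torch.nn as nn

    class TDNNBlock(nn.Module):
        def __init__(self, in_dim, out_dim, context=2, dilation=1):
            super().__init__()
            # kernel_size = 2*context + 1 concatenates the current frame with its neighbours;
            # the same kernel (shared weights) is applied at every time step.
            self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=2 * context + 1,
                                  dilation=dilation, padding=context * dilation)
            self.act = nn.ReLU()

        def forward(self, x):                      # x: (batch, feature_dim, time)
            return self.act(self.conv(x))

    tdnn = nn.Sequential(
        TDNNBlock(512, 512, context=2, dilation=1),   # short context in the lower layers
        TDNNBlock(512, 512, context=2, dilation=2),
        TDNNBlock(512, 512, context=2, dilation=3),   # wider context in the deeper layers
    )
    frames = torch.randn(8, 512, 300)              # a batch of 300-frame spectrogram features
    print(tdnn(frames).shape)                      # torch.Size([8, 512, 300])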

The TDNN features are fed into an attention module, a "selection mechanism" used to allocate limited information-processing capacity. It helps analyse the target data quickly and, together with the information screening and weight setting mechanisms, improves the computational power of the model.

For each vector xi in the input sequence x, the attention weight αi can be computed according to Equation (5), where f(xi) is the scoring function.

αi = exp(f(xi)) / Σj exp(f(xj))    (5)

The output of the attention layer, attentive_x, is the weighted sum of the input sequence, as shown in Equation (6).

attentive_x = Σi αi·xi    (6)

The model of this channel thereby learns to output the important voiceprint features; the voiceprint recognition model outputs partial voiceprint features of dimension 1×512, and the comparison result against everyone in the speech library is obtained through Softmax.
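
A minimal sketch of the attention pooling of Equations (5) and (6) in PyTorch, assuming the scoring function f(xi) is a small linear layer, which is one common choice and is not specified in the patent.

    import torch
    import torch.nn as nn

    class AttentivePooling(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.score = nn.Linear(dim, 1)                 # scoring function f(x_i)

        def forward(self, x):                              # x: (batch, time, dim)
            alpha = torch.softmax(self.score(x), dim=1)    # Eq. (5): attention weights
            return (alpha * x).sum(dim=1)                  # Eq. (6): attentive_x

    pool = AttentivePooling(512)
    tdnn_out = torch.randn(8, 300, 512)                    # frame-level TDNN outputs
    embedding = pool(tdnn_out)                             # (8, 512) voiceprint embedding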

The i-vector is a low-dimensional vector, usually a few hundred dimensions. It can be regarded as a front-end feature that contains both the speaker's voiceprint characteristics and channel information while removing some irrelevant information such as background noise and channel interference. In the training of the i-vector's total variability matrix T, each utterance, i.e. each recording, is treated as a unit, and different recordings of the same speaker are treated as different speakers. Given a segment of a speaker's speech, the corresponding Gaussian mean supervector can be defined by the following formula:

M = m + Tω    (7)

where M is the Gaussian mean supervector of the given speech; m is the Gaussian mean supervector of the universal background model, which is independent of the specific speaker and channel; T is the total variability space matrix, which is low-rank; and ω is the total variability factor, whose posterior mean is the i-vector and which a priori follows the standard normal distribution. The user's voiceprint information is obtained through the i-vector, giving the voiceprint comparison result.

By assigning different weights to the training results of the two models above, the decision information for the speaker, i.e. the voiceprint ID, is obtained. Here, in addition to the speaker information, all the personal characteristic information labelled with that speaker in the corpus is also obtained, yielding personalized voiceprint features. On this basis, the models for the following two speech tasks are executed.

As shown in FIG. 5, the speech recognition sub-model first feeds the spectrogram into a convolutional neural network, then passes it through a Gated Recurrent Unit (GRU) network layer and into CTC (Connectionist Temporal Classification) to obtain the phoneme sequence; combined with the personalized voiceprint features, its similarity to the training samples is computed, giving a similarity value between 0 and 1.

A convolutional neural network (CNN) is a deep neural network formed by alternately stacking convolutional layers and pooling layers. The neural units of the current layer are connected to several feature maps of the previous layer through a set of weights, i.e. the convolution kernel, and the feature map of the current layer is obtained after the convolution operation plus a bias. Each neural unit is connected only to a local region of the previous feature map and extracts the features of that local region; all neural units together give the global features. To obtain more comprehensive information from the feature parameters, several different convolution kernels are used within the same layer, producing multiple feature maps. The mapping relationship before and after the convolutional layer is as follows.

x_j^m = f( Σ_{i∈Mj} x_i^{m-1} * k_{ij}^m + b_j^m )    (8)

where x_j^m denotes the input of the j-th feature map of the m-th convolutional layer, k_{ij}^m denotes the convolution kernel, b_j^m denotes the bias, * denotes the convolution operation, Mj denotes the set of feature maps, and f denotes the activation function.

The feature maps produced by the convolution operation are down-sampled in the pooling layer. The pooling unit computes the main information of a local region of the feature map, thereby removing redundant information and reducing the scale of computation. The CNN consists of three convolutional layers, three pooling layers and two fully connected layers, eight layers in total. The input image of the first convolutional layer is 310×310×3, where 310 is the height and width of the image and 3 denotes the three RGB channels. The image passes through 64 3×3 convolution kernels with a stride of 1, producing 64 feature maps; a ReLU activation function is then applied, and a max pooling operation yields 64 feature maps. The input of the second convolutional layer is the output feature maps of the first layer, and its computation proceeds as in the first layer; the third layer works in the same way. Next comes a fully connected layer with a total of 1024 neurons, on which Dropout is applied to prevent the model from over-fitting. The output of this layer is a 1×1024-dimensional feature.
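
A minimal sketch of the spectrogram CNN in PyTorch, under the assumption that the second and third convolutional blocks reuse 64 kernels (the text does not give their kernel counts) and that only the 1024-neuron feature layer is shown, the final classifier being omitted.

    import torch
    import torch.nn as nn

    def conv_block(in_ch, out_ch):
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )

    cnn = nn.Sequential(
        conv_block(3, 64),            # 310x310x3 -> 155x155x64
        conv_block(64, 64),           # -> 77x77x64
        conv_block(64, 64),           # -> 38x38x64
        nn.Flatten(),
        nn.LazyLinear(1024),          # fully connected layer with 1024 neurons
        nn.ReLU(),
        nn.Dropout(0.5),              # Dropout against over-fitting
    )

    spectrograms = torch.randn(4, 3, 310, 310)
    features = cnn(spectrograms)      # (4, 1024) feature per spectrogram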

The convolved features are fed into a Gated Recurrent Unit (GRU) model, which contains two gates: an update gate and a reset gate. The structure of the GRU model is shown in FIG. 6, where zt and rt denote the update gate and the reset gate respectively. The GRU model is defined by the following formulas.

zt = σ(Wz·[ht-1, xt])    (9)

rt = σ(Wr·[ht-1, xt])    (10)

h̃t = tanh(W·[rt*ht-1, xt])    (11)

ht = (1 - zt)*ht-1 + zt*h̃t    (12)

where zt and rt denote the update gate and the reset gate respectively, h̃t is the candidate (partial) hidden-layer output at step t, and ht is the full hidden-layer vector at step t.

The speech recognition features are learned by the GRU model, and the output feature is 1×1024-dimensional.
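
A minimal sketch of a single GRU cell written out in PyTorch to mirror Equations (9)-(12); biases are omitted for brevity, and in practice the stacked nn.GRU module would be used instead.

    import torch
    import torch.nn as nn

    class GRUCellSketch(nn.Module):
        def __init__(self, in_dim, hid_dim):
            super().__init__()
            self.Wz = nn.Linear(in_dim + hid_dim, hid_dim, bias=False)
            self.Wr = nn.Linear(in_dim + hid_dim, hid_dim, bias=False)
            self.W = nn.Linear(in_dim + hid_dim, hid_dim, bias=False)

        def forward(self, x_t, h_prev):
            z_t = torch.sigmoid(self.Wz(torch.cat([h_prev, x_t], dim=-1)))        # Eq. (9)
            r_t = torch.sigmoid(self.Wr(torch.cat([h_prev, x_t], dim=-1)))        # Eq. (10)
            h_tilde = torch.tanh(self.W(torch.cat([r_t * h_prev, x_t], dim=-1)))  # Eq. (11)
            return (1 - z_t) * h_prev + z_t * h_tilde                             # Eq. (12)

    cell = GRUCellSketch(in_dim=1024, hid_dim=1024)
    h = torch.zeros(4, 1024)
    for x_t in torch.randn(200, 4, 1024):    # iterate over 200 time steps
        h = cell(x_t, h)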

The preceding output features are fed into CTC, a neural-network-based temporal classification. It obtains the output sequence from the input sequence x: given the distribution p(l|x) over output sequences, the sequence with the highest probability is selected as the output, as shown in Equation (13).

l* = argmax_l p(l|x)    (13)
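
A minimal sketch of the CTC stage in PyTorch, assuming a phoneme inventory of 47 initials/finals plus a blank symbol with id 0; training uses nn.CTCLoss, and Equation (13) is approximated by greedy best-path decoding with blank and repeat removal.

    import torch
    import torch.nn as nn

    num_classes = 48                                        # 47 phonemes + CTC blank (id 0)
    log_probs = torch.randn(200, 4, num_classes).log_softmax(-1)   # (time, batch, classes)

    # Training: CTC loss against the reference phoneme sequences.
    targets = torch.randint(1, num_classes, (4, 30))
    ctc_loss = nn.CTCLoss(blank=0)(log_probs, targets,
                                   input_lengths=torch.full((4,), 200, dtype=torch.long),
                                   target_lengths=torch.full((4,), 30, dtype=torch.long))

    # Decoding: pick the most probable class per frame, then drop repeats and blanks.
    best_path = log_probs.argmax(dim=-1).T                  # (batch, time)
    decoded = []
    for seq in best_path.tolist():
        decoded.append([p for i, p in enumerate(seq)
                        if p != 0 and (i == 0 or p != seq[i - 1])])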

CTC yields the most probable sequence representing the Chinese utterance, which is then output. The sequence is compared for similarity with the correct phone sequence of the text to measure accuracy. On this basis, a similarity comparison matrix is constructed, made up of 23 initials and 24 finals; since the measurement concerns pronunciation similarity, the tones of the finals are not considered. The designed similarity comparison matrix A is shown in Table 3.

Table 3 Similarity comparison matrix (reproduced as an image in the original publication)

The degree of similarity between initials and finals is represented by a number between 0 and 100; the larger the number, the lower the similarity. After the model recognizes a phoneme, the confusion matrix is consulted: the position corresponding to the recognized initial/final and the sample initial/final is set to 1 and all other positions to 0, producing matrix B. The phoneme accuracy matrix is C = A·B, and the sum of the elements of C is the accuracy value of that phoneme; the average accuracy over all phonemes of an utterance is the accuracy score of the whole utterance. Through the speech recognition model, the text information features are finally obtained.
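
A minimal sketch of the phoneme scoring step, assuming matrix A is the 47×47 initial/final table of Table 3 (not reproduced in the text, so a random placeholder is used), that recognised and reference phones are index sequences of equal length, and that A·B reduces to selecting the table entry at the recognised/reference position; all of these are interpretive assumptions.

    import numpy as np

    num_units = 23 + 24                                          # 23 initials + 24 finals
    A = np.random.randint(0, 101, size=(num_units, num_units))   # placeholder for Table 3

    def utterance_accuracy(recognised, reference, A):
        scores = []
        for rec, ref in zip(recognised, reference):
            B = np.zeros_like(A)
            B[rec, ref] = 1                                # 1 at the recognised/reference position
            C = A * B                                      # phoneme accuracy matrix
            scores.append(C.sum())                         # accuracy value of this phoneme
        return float(np.mean(scores))                      # average over the whole utterance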

In the speech emotion recognition sub-module, built on the speaker recognition model or the speech recognition model, the speech emotion recognition model uses a neural network to combine hand-crafted HSF features with a convolutional neural network to learn a joint multi-channel representation of speech emotion features, focusing on the specific parts of the recording that contain strong articulatory information as well as on the global information, as shown in FIG. 7.

The specific process is as follows:

The input information is the spectrogram and the LLD features. The LLD features are passed through specific HSF computations to obtain a 642-dimensional HSF feature; the spectrogram is fed into the CNN model and passes through three convolutional layers, three pooling layers and one fully connected layer, giving a 1×1024-dimensional feature output. The two are projected into the same feature space, giving 1024 + 642 = 1666-dimensional features, which are then fed into a unidirectional three-layer GRU model to extract speech emotion features, finally yielding a 1024-dimensional emotion feature. This approach both keeps the dimensionality of the hand-crafted features appropriate and captures the global information of the speech.
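
A minimal sketch of the two-channel emotion feature extractor in PyTorch, assuming the spectrogram encoder is any module with a 1024-dimensional output (a stand-in is used below) and that the fused 1666-dimensional vector is presented to the three-layer unidirectional GRU as a single-step sequence; how the patent arranges the time axis at this point is not specified.

    import torch
    import torch.nn as nn

    class EmotionFeatureModel(nn.Module):
        def __init__(self, cnn, hsf_dim=642, cnn_dim=1024, out_dim=1024):
            super().__init__()
            self.cnn = cnn                                      # spectrogram channel (1 x 1024)
            self.gru = nn.GRU(hsf_dim + cnn_dim, out_dim,
                              num_layers=3, batch_first=True)   # unidirectional 3-layer GRU

        def forward(self, spectrogram, hsf):
            fused = torch.cat([self.cnn(spectrogram), hsf], dim=-1)   # (batch, 1666)
            _, h = self.gru(fused.unsqueeze(1))                 # single-step sequence
            return h[-1]                                        # (batch, 1024) emotion feature

    spectrogram_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(1024))  # stand-in encoder
    model = EmotionFeatureModel(spectrogram_encoder)
    emotion_feat = model(torch.randn(4, 3, 310, 310), torch.randn(4, 642))
    print(emotion_feat.shape)                                   # torch.Size([4, 1024])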

本发明还依据语音信号中针对通用语音任务的串联特征提取系统提供了一种特征提取方法,提取方法包括声纹识别特征提取、语音识别特征提取、语音情感识别特征提取三部分;The present invention also provides a feature extraction method according to the serial feature extraction system for general speech tasks in the speech signal, and the extraction method includes three parts: voiceprint recognition feature extraction, speech recognition feature extraction, and speech emotion recognition feature extraction;

1)提取声纹识别特征的方法为:1) The method of extracting voiceprint recognition features is:

S1-1,输入采用的是语谱图和它的delta形式,采用的模型是时延神经网络(Time-Delay Neural Network,TDNN)与i-vector模型的结合;S1-1, the input adopts the spectrogram and its delta form, and the adopted model is the combination of the Time-Delay Neural Network (TDNN) and the i-vector model;

S1-2,将TDNN的特征送入注意力attention模块中,提升模型的计算能力;S1-2, send the features of TDNN into the attention module to improve the computing power of the model;

S1-3,注意力attention模块输出重要的声纹特征,声纹识别模型输出得到部分声纹特征,所述特征为1*512维,通过Softmax获得与语料库中所有人的比对结果;S1-3, the attention module outputs important voiceprint features, and the voiceprint recognition model outputs part of the voiceprint features, the features are 1*512 dimensions, and the comparison results with everyone in the corpus are obtained through Softmax;

2)提取语音识别特征的方法为:2) The method of extracting speech recognition features is:

The speech recognition sub-model first feeds the spectrogram into a convolutional neural network, then through a Gated Recurrent Unit (GRU) network layer and into CTC (Connectionist Temporal Classification) to obtain the phoneme sequence; the personalized voiceprint features are combined, and a similarity comparison with the training samples yields a similarity value between 0 and 1.
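The sketch below shows one way this spectrogram → CNN → GRU → CTC path could look in PyTorch. The layer widths and pooling pattern are assumptions, and the final similarity comparison against the training samples is not reproduced because its formula is not given here.

```python
import torch
import torch.nn as nn

class PhonemeRecognizer(nn.Module):
    def __init__(self, n_phonemes, feat_dim=512):
        super().__init__()
        self.cnn = nn.Sequential(                        # pool frequency only, keep the time axis
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
        )
        self.gru = nn.GRU(32 * (feat_dim // 4), 256, num_layers=2, batch_first=True)
        self.out = nn.Linear(256, n_phonemes + 1)        # +1 for the CTC blank label

    def forward(self, spectrogram):
        # spectrogram: (B, 1, feat_dim, T)
        h = self.cnn(spectrogram)                        # (B, 32, feat_dim//4, T)
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)   # frames become a sequence
        h, _ = self.gru(h)
        return self.out(h).log_softmax(dim=-1)           # (B, T, n_phonemes+1)
```

For training, the log-probabilities can be passed to `nn.CTCLoss()` after transposing to the (T, B, C) layout it expects, together with the target phoneme indices and the input/target lengths.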

3) The speech emotion recognition features are extracted as follows:

S3-1, the input information is the spectrogram and the LLD features; the LLD features are passed through specific HSF computations to obtain 642-dimensional HSF features (a sketch of this functional computation follows step S3-3 below);

S3-2, the spectrogram is fed into the CNN model and, after 3 convolutional layers, 3 pooling layers and 1 fully connected layer, yields a 1*1024-dimensional feature output;

S3-3, the two outputs above are projected into the same feature space to obtain 1024+642=1666-dimensional features, which are then fed into a unidirectional 3-layer GRU model to extract the speech emotion features and finally obtain the 1024-dimensional emotion features.
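As referenced in step S3-1, the sketch below illustrates the general idea of turning frame-level LLDs into fixed-length HSF statistics. The concrete functional set that yields exactly 642 dimensions is not spelled out in this section (in practice an OpenSmile configuration can be used), so the functionals listed here are placeholders only.

```python
import numpy as np

# Placeholder statistical functionals applied to each LLD contour (assumption).
FUNCTIONALS = [np.mean, np.std, np.min, np.max, np.ptp,
               lambda x: np.percentile(x, 25), lambda x: np.percentile(x, 75)]

def lld_to_hsf(lld):
    """lld: (T, n_lld) frame-level low-level descriptors -> 1-D HSF vector."""
    return np.concatenate([[f(lld[:, i]) for f in FUNCTIONALS]
                           for i in range(lld.shape[1])])
```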

The embodiments described above merely describe preferred implementations of the present invention and do not limit its scope; without departing from the design spirit of the invention, all deformations and improvements made to the technical solution of the invention by a person of ordinary skill in the art shall fall within the protection scope determined by the claims of the present invention.

Claims (8)

1. A tandem feature extraction system for general speech tasks in a speech signal, characterized in that the system mainly comprises an emotion corpus, a speech preprocessing model and a speech feature extraction model; the emotion corpus is a collection of genuine emotions built from natural-language material, and a statistical method is introduced on top of the traditional greedy algorithm when the database is established; speech preprocessing is performed after the emotion corpus has been built;
speech preprocessing is carried out in two stages, the spectrogram and the combination of low-level descriptors with high-level statistical functions, from which the corresponding basic features are extracted; the preprocessed audio files serve as the input to the speech feature extraction model;
the speech feature extraction model takes the user's audio file as input and contains a voiceprint recognition sub-module, a speech recognition sub-module and a speech emotion recognition sub-module; the model has two working forms: in the first, a three-layer structure is used, in which layer 1 is voiceprint recognition, whose input is the user's audio data and through which the speaker's basic information and salient features are obtained; the basic information and salient features are fed, together with the speech recognition features, into the speech recognition layer, whose input is the voiceprint features and the user's audio data and from which the speaker information and text information are obtained; on the basis of keyword extraction, these are sent, together with the low-level descriptors, into the emotion recognition layer, which uses the personalized speech features and emotion features to perform emotion recognition and thereby obtain the emotion-related features;
in the second, a two-layer structure combining series and parallel connections is used, in which layer 1 is voiceprint recognition, whose input is the user's audio data and through which the speaker's basic information and salient features are obtained; these are fed, together with the speech recognition features, into the layer-2 speech recognition layer, whose input is the voiceprint features and the user's audio data and from which the speaker information and text information are obtained; in addition, the speech recognition layer can at the same time send the speaker information, together with the low-level descriptors, into the emotion recognition layer, which outputs the emotion features in parallel; at this point the input of the emotion recognition model comes only from the voiceprint features and the speech signal.
2. The tandem feature extraction system for general speech tasks in a speech signal according to claim 1, characterized in that the spectrogram is the displayed image of the Fourier analysis of the speech signal, a three-dimensional spectrum showing how the speech spectrum changes over time; its abscissa is time, its ordinate is frequency, and the value at each coordinate point is the energy of the speech data; it is obtained as follows: a speech signal x(t) is first divided into frames, becoming x(m,n), where n is the frame length and m is the number of frames; a fast Fourier transform gives X(m,n), from which the periodogram Y(m,n) = X(m,n)*X(m,n)' is obtained; 10*log10(Y(m,n)) is taken, m is mapped to a time scale to give M and n to a frequency scale to give N; the two-dimensional image formed by M, N and 10*log10(Y(m,n)) is the spectrogram; the audio file is processed in 20 ms frames with a 10 ms hop, producing a spectrogram for each frame; the per-frame spectrogram data are fused longitudinally and divided laterally into 512 dimensions, and the mean of each column is taken after the longitudinal split, yielding a 1*512-dimensional spectrogram feature.

3. The tandem feature extraction system for general speech tasks in a speech signal according to claim 1, characterized in that the emotion corpus contains four emotions: happiness, anger, calm and sadness;
a statistical method is introduced on top of the traditional greedy algorithm to collect statistics for each class of emotional information; to guarantee correct centralized processing of the data set, the recordings in the database are saved in wav format, the audio files are sampled at 4400 Hz with 16-bit precision, and recording is mono;
for every dimension, the mean of the sums of the products of each annotator's labelling result and the previous reliability is computed and used as the benchmark value; the correlation coefficient between each annotator's result and the benchmark is then computed as the indicator of that annotator's new reliability; as the number of labelled items grows, the reliability indicators are adjusted immediately, and once the new reliability indicators are obtained, the current annotation result is recomputed, i.e. the weighted sum of all annotators' results and weights gives the finalised annotation; the specific formulas are as follows:
(The formulas themselves appear in the original publication only as image references FDA0003300214960000031 through FDA0003300214960000034 and cannot be reproduced here.)

In these formulas, W_exp and its updated counterpart (rendered only as an inline figure in the original) are the previous and the current reliability respectively, n is the number of annotators, Result_exp is the annotator's result in the current pass, and Avg_i and its counterpart are the annotation results computed with the previous and the current reliability respectively.
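For illustration only, and outside the claim wording itself, the following sketch shows one plausible reading of the reliability update just described; the clipping of negative correlations and the renormalisation of the weights are assumptions made because the formulas survive only as figures.

```python
import numpy as np

def update_labels(results, prev_weights):
    """results: (n_annotators, n_items) labels for one emotion dimension;
    prev_weights: (n_annotators,) previous reliability values."""
    # benchmark: mean of the products of each annotator's results and previous reliability
    benchmark = (prev_weights[:, None] * results).sum(axis=0) / len(prev_weights)
    # new reliability: correlation of each annotator's results with the benchmark
    new_weights = np.array([np.corrcoef(r, benchmark)[0, 1] for r in results])
    new_weights = np.clip(new_weights, 0, None)          # drop negative correlations (assumption)
    new_weights = new_weights / new_weights.sum()        # renormalise to sum to 1 (assumption)
    final_labels = new_weights @ results                 # weighted consensus annotation
    return final_labels, new_weights
```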
4. The tandem feature extraction system for general speech tasks in a speech signal according to claim 1, characterized in that, when the most primitive acoustic features are captured, the speech signal is converted into a speech feature vector by combining low-level descriptors LLD with high-level statistical functions HSF; all of these features can be computed directly with the OpenSmile toolbox; the following LLD features are selected: Mel-frequency cepstral coefficients MFCC 1-14, Mel-band logarithmic power 1-7, formants 1-3, fundamental frequency F0, jitter, energy features, energy-spectrum features, and line spectral pair frequencies 1-8.

5. The tandem feature extraction system for general speech tasks in a speech signal according to claim 1, characterized in that the voiceprint recognition sub-module contains all the personal characteristic factors from the earlier stages; its input is the spectrogram and its delta form, and the model used is the combination of the time-delay neural network TDNN and the i-vector model.

6. The tandem feature extraction system for general speech tasks in a speech signal according to claim 1, characterized in that the speech recognition sub-model is used to feed the spectrogram first into a convolutional neural network, then through a Gated Recurrent Unit network layer and into CTC to obtain the phoneme sequence; the personalized voiceprint features are combined, and a similarity comparison with the training samples yields a similarity value between 0 and 1.

7. The tandem feature extraction system for general speech tasks in a speech signal according to claim 1, characterized in that the speech emotion recognition sub-module, built on the speaker recognition model or the speech recognition model, uses a neural network to construct the HSF and a convolutional neural network to learn a multi-channel joint representation of the speech emotion features, focusing on the specific parts of the recording that contain strong pronunciation information as well as the global information.

8. A feature extraction method based on the tandem feature extraction system for general speech tasks in a speech signal according to claim 1, characterized in that the extraction method comprises three parts: voiceprint recognition feature extraction, speech recognition feature extraction, and speech emotion recognition feature extraction;
1) the voiceprint recognition features are extracted as follows:
S1-1, the input is the spectrogram and its delta form, and the model used is the combination of the time-delay neural network TDNN and the i-vector model;
S1-2, the TDNN features are fed into an attention module to enhance the model's computing capability;
S1-3, the attention module outputs the important voiceprint features, the voiceprint recognition model outputs partial voiceprint features of dimension 1*512, and the comparison result against everyone in the corpus is obtained through Softmax;
2) the speech recognition features are extracted as follows:
the speech recognition sub-model first feeds the spectrogram into a convolutional neural network, then through a GRU network layer and into CTC to obtain the phoneme sequence; the personalized voiceprint features are combined, and a similarity comparison with the training samples yields a similarity value between 0 and 1;
3) the speech emotion recognition features are extracted as follows:
S3-1, the input information is the spectrogram and the LLD features; the LLD features are passed through specific HSF computations to obtain 642-dimensional HSF features;
S3-2, the spectrogram is fed into the CNN model and, after 3 convolutional layers, 3 pooling layers and 1 fully connected layer, yields a 1*1024-dimensional feature output;
S3-3, the two outputs above are projected into the same feature space to obtain 1024+642=1666-dimensional features, which are then fed into a unidirectional 3-layer GRU model to extract the speech emotion features and finally obtain the 1024-dimensional emotion features.