
CN112489622B - A multi-language continuous speech stream speech content recognition method and system - Google Patents


Info

Publication number
CN112489622B
CN112489622B (application CN201910782981.2A)
Authority
CN
China
Prior art keywords
language
segment
level
level language
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910782981.2A
Other languages
Chinese (zh)
Other versions
CN112489622A (en)
Inventor
徐及
刘丹阳
张鹏远
颜永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, Beijing Kexin Technology Co Ltd filed Critical Institute of Acoustics CAS
Priority to CN201910782981.2A priority Critical patent/CN112489622B/en
Publication of CN112489622A publication Critical patent/CN112489622A/en
Application granted granted Critical
Publication of CN112489622B publication Critical patent/CN112489622B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/005: Language recognition
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method and system for recognizing speech content in multilingual continuous voice streams. The method comprises: inputting the multilingual continuous voice stream to be recognized into a frame-level language classification model and outputting segment-level language feature vectors; inputting the segment-level language feature vectors into a segment-level language classification model and outputting the posterior probability distribution of the segment-level language states; calculating the optimal language-state path of the multilingual continuous voice stream with a Viterbi search algorithm, according to the posterior probability distribution of the segment-level language states; segmenting the multilingual continuous voice stream to be recognized along the optimal language-state path to obtain language-state intervals; and sending the segmented language-state intervals into a multilingual acoustic model and the corresponding multilingual decoders for decoding, obtaining the content recognition result of the multilingual continuous voice stream. By fusing the language classification models with the Viterbi search algorithm, the invention solves the problem of dynamically detecting and recognizing the languages of concurrent multilingual content in continuous voice streams.

Description

A multi-language continuous speech stream speech content recognition method and system

Technical Field

The present invention relates to the field of speech recognition, and in particular to a method and system for recognizing speech content in multilingual continuous speech streams.

Background

With the application of hidden Markov model and deep neural network technologies to automatic speech recognition, the field has developed at an unprecedented pace. For widely spoken languages such as Chinese and English, single-language speech recognition systems can even reach human-level performance. As economic and trade exchanges among countries accelerate the blending of economies and cultures worldwide, building a hybrid multilingual speech recognition system has become a necessary condition for detecting the content of multilingual speech streams.

A traditional multilingual speech recognition system consists of a language identification front-end connected in series with multiple parallel single-language speech recognition back-ends. The front-end generally classifies the language of an utterance at the sentence level, from the acoustic features of the whole recording. In the task of recognizing multilingual continuous speech streams, such sentence-level language classification cannot cope with streams in which several languages coexist.

Summary of the Invention

The purpose of the present invention is to solve the problem that sentence-level language classification cannot cope with the language classification task of a speech stream in which multiple languages coexist.

To achieve this, the present invention proposes a method for recognizing speech content in multilingual continuous speech streams, the method comprising:

inputting the multilingual continuous speech stream to be recognized into a frame-level language classification model and outputting segment-level language feature vectors;

inputting the segment-level language feature vectors into a segment-level language classification model and outputting the posterior probability distribution of the segment-level language states;

calculating the optimal language-state path of the multilingual continuous speech stream with the Viterbi search algorithm, according to the posterior probability distribution of the segment-level language states;

segmenting the multilingual continuous speech stream to be recognized along the optimal language-state path to obtain language-state intervals;

sending the language-state intervals into the multilingual acoustic model and the corresponding multilingual decoders for decoding, to obtain the content recognition result of the multilingual continuous speech stream; a sketch of how these steps fit together is given below.
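
The following Python sketch shows one way the five steps could be chained at inference time. It is an illustration only: frame_lid, segment_lid, viterbi_search and the per-language decoders are hypothetical stand-ins for the trained models and decoders, which the invention does not specify as code.

```python
# Illustrative end-to-end pipeline; frame_lid, segment_lid, viterbi_search
# and the per-language decoders are hypothetical stand-ins.
import numpy as np

def recognize(stream_feats, frame_lid, segment_lid, viterbi_search, decoders):
    """stream_feats: (num_frames, feat_dim) spectral features of the stream."""
    # Step 1: frame-level model -> segment-level language feature vectors.
    seg_vecs = frame_lid.extract_segment_vectors(stream_feats)
    # Step 2: segment-level model -> posterior over languages per segment.
    posteriors = segment_lid.predict(seg_vecs)        # (num_segs, num_langs)
    # Step 3: Viterbi search -> optimal language-state path.
    path = viterbi_search(posteriors)                 # one label per segment
    # Step 4: cut the stream wherever the language label changes.
    cuts = np.flatnonzero(np.diff(path)) + 1
    starts = np.r_[0, cuts]
    ends = np.r_[cuts, len(path)]
    # Step 5: decode each language-state interval with its own decoder.
    return [decoders[path[s]](stream_feats, s, e)
            for s, e in zip(starts, ends)]
```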

As an improvement of the method, the method further comprises a step of training the multilingual acoustic model, specifically:

Step 1-1) construct a multilingual acoustic model based on a multi-task learning framework, the model comprising several shared hidden layers and language-specific output layers;

Step 1-2) extract the spectral features of the multilingual continuous speech streams of the training set, based on the acoustic state labels of the multilingual speech data; feed the spectral features into the shared hidden layers for nonlinear transformation, and output the single-language data to the several language-specific output layers;

Step 1-3) compute the error-loss value of the single-language data at the language-specific output layer corresponding to the input spectral features; the error loss function is the cross entropy

F_loss,i = −Σ q_label,L · log p_model,i(x_L)  (1)

where F_loss,i is the error-loss value of the i-th language-specific output layer, p_model,i(x_L) is the output of the L-th language-specific output layer for the spectral feature x_L of the L-th language, and q_label,L is the acoustic state label corresponding to x_L; the error-loss values of all other output layers are zero;

Step 1-4) propagate the error-loss value F_loss,i backwards; the parameters of each language-specific output layer are updated with the data of the corresponding single language, computing the parameter gradient ΔΦ_i of the language-specific output layer:

ΔΦ_i = ∂F_loss,i / ∂Φ_i  (2)

where Φ_i denotes the parameters of the i-th language-specific output layer;

the parameters of the shared hidden layers are computed from the error-loss values F_loss,i returned by the several language-specific output layers, giving the gradient ΔΦ of the shared-hidden-layer parameters:

ΔΦ = Σ_{i=1}^{L} ∂F_loss,i / ∂Φ  (3)

where Φ denotes the parameters of the shared hidden layers and L is the number of languages covered by the language-specific output layers of the multilingual acoustic model;

Step 1-5) if F_loss,i is greater than a given threshold, return to step 1-2);

once F_loss,i falls below the given threshold, the trained multilingual acoustic model is obtained. A sketch of this training scheme follows.
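
A minimal sketch of steps 1-1) to 1-4) is given below; PyTorch and all layer sizes are assumptions made for illustration. Each batch contributes loss only through the output layer of its own language, as in equation (1), so the other language-specific layers receive zero gradient, while the shared hidden layers accumulate gradients from every language, matching equations (2) and (3).

```python
# Multi-task acoustic model sketch (PyTorch assumed; sizes illustrative).
import torch
import torch.nn as nn

class MultiTaskAM(nn.Module):
    """Shared hidden layers plus one language-specific output layer per language."""
    def __init__(self, feat_dim, hidden_dim, states_per_lang):
        super().__init__()
        self.shared = nn.Sequential(                  # shared hidden layers, Phi
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.heads = nn.ModuleList(                   # language-specific layers, Phi_i
            [nn.Linear(hidden_dim, n) for n in states_per_lang])

    def forward(self, x, lang_id):
        return self.heads[lang_id](self.shared(x))    # logits of that language only

model = MultiTaskAM(feat_dim=40, hidden_dim=512, states_per_lang=[3000, 2800])
opt = torch.optim.SGD(model.parameters(), lr=0.01)

# One update with a toy batch of language-0 data: the loss of eq. (1) is
# computed only at head 0, so head 1 gets zero gradient while the shared
# layers accumulate gradients from every language across batches (eqs. 2-3).
feats = torch.randn(32, 40)                    # spectral features x_L
labels = torch.randint(0, 3000, (32,))         # acoustic state labels q_label,L
loss = nn.functional.cross_entropy(model(feats, lang_id=0), labels)
opt.zero_grad()
loss.backward()
opt.step()
```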

As an improvement of the method, the method further comprises a step of training the frame-level language classification model, specifically:

Step 2-1) construct a frame-level language classification model, the model being a deep neural network;

Step 2-2) extract the frame-level spectral features of the multilingual continuous speech streams of the training set and feed them into the frame-level language classification model; accumulate long-term statistics over the output vectors of the current hidden layer, and compute the mean vector, the variance vector and the segment-level language feature vector of the current hidden layer's outputs;

The mean vector is:

μ = (1/T) Σ_{i=1}^{T} h_i  (4)

The variance vector is:

σ = sqrt( (1/T) Σ_{i=1}^{T} (h_i − μ)² )  (5)

The segment-level language feature vector is:

h_segment = Append(μ, σ)  (6)

where h_i is the output vector of the current hidden layer at time i, T is the long-term statistics period, μ is the mean vector of the long-term statistics, σ is the variance vector of the long-term statistics, and h_segment is the segment-level language feature vector, formed by concatenating the mean vector and the variance vector, so its dimension is twice that of h_i; Append(μ, σ) denotes concatenating μ and σ into a single higher-dimensional vector;

Step 2-3) feed the mean vector and variance vector into the next hidden layer and train the network against the frame-level language labels through error computation and gradient back-propagation, so that every hidden layer can output segment-level language feature vectors; this yields the trained frame-level language classification model. A sketch of the statistics component appears below.
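
A minimal sketch of the long-term statistics component of step 2-2), again assuming PyTorch: it pools T frame-level hidden vectors into the mean and standard deviation of equations (4)-(5) and concatenates them into the segment-level vector h_segment of equation (6), which the next hidden layer consumes.

```python
# Long-term statistics component sketch (PyTorch assumed).
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    """Pool T hidden vectors h_i into h_segment = Append(mu, sigma), eqs. (4)-(6)."""
    def forward(self, h):                        # h: (batch, T, hidden_dim)
        mu = h.mean(dim=1)                       # eq. (4): mean over the T frames
        sigma = h.std(dim=1, unbiased=False)     # eq. (5): per-dimension deviation
        return torch.cat([mu, sigma], dim=-1)    # eq. (6): 2 * hidden_dim vector

# A 200-frame window of 512-dim hidden activations yields a 1024-dim h_segment.
h = torch.randn(8, 200, 512)
h_segment = StatsPooling()(h)
assert h_segment.shape == (8, 1024)
```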

As an improvement of the method, the method further comprises a step of training the segment-level language classification model, specifically:

Step S2-1) construct a segment-level language classification model;

Step S2-2) extract the frame-level spectral features of the multilingual continuous speech streams of the training set, feed them into the hidden layers of the trained frame-level language classification model, and extract segment-level language feature vectors from those hidden layers;

Step S2-3) assign a segment-level language label to each segment-level language feature vector, feed the vectors into the segment-level language classification model, and train it to output the posterior probability distribution of the language state corresponding to each segment-level language label; this yields the trained segment-level language classification model, as sketched below.
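
A minimal sketch of step S2-3), assuming PyTorch and illustrative sizes: a small classifier is trained on segment-level language feature vectors against segment-level language labels, and its softmax output gives the language-state posterior that the Viterbi search later uses as p_emit.

```python
# Segment-level language classifier sketch (PyTorch assumed; sizes illustrative).
import torch
import torch.nn as nn

num_langs = 3                                     # e.g. N language categories
clf = nn.Sequential(                              # small MLP over h_segment
    nn.Linear(1024, 256), nn.ReLU(),
    nn.Linear(256, num_langs))
opt = torch.optim.Adam(clf.parameters(), lr=1e-3)

h_segment = torch.randn(64, 1024)                 # segment-level feature vectors
lang_labels = torch.randint(0, num_langs, (64,))  # segment-level language labels
loss = nn.functional.cross_entropy(clf(h_segment), lang_labels)
opt.zero_grad(); loss.backward(); opt.step()

posterior = clf(h_segment).softmax(dim=-1)        # p_emit for the Viterbi search
```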

As an improvement of the method, inputting the multilingual continuous speech stream to be recognized into the frame-level language classification model and outputting segment-level language feature vectors, then inputting the segment-level language feature vectors into the segment-level language classification model to output the posterior probability distribution of the language states, specifically comprises:

extracting the frame-level spectral features to be recognized from the multilingual continuous speech stream to be recognized;

feeding the frame-level spectral features to be recognized into the trained frame-level language classification model with a given hop size and window length, and outputting the segment-level language feature vectors h_segment;

feeding the segment-level language feature vectors h_segment into the trained segment-level language classification model, and outputting the posterior probability distribution of the language state corresponding to each segment-level language feature vector; a windowing sketch is given below.
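
The hop size and window length determine how often a language decision is emitted. The sketch below illustrates the windowing; the window and hop values and the frame_lid / segment_lid objects are hypothetical.

```python
# Sliding-window posterior extraction; window/hop sizes and the frame_lid /
# segment_lid objects are hypothetical.
import numpy as np

def segment_posteriors(frames, frame_lid, segment_lid, window=200, hop=100):
    """frames: (num_frames, feat_dim). Emits one language posterior per window."""
    posts = []
    for start in range(0, len(frames) - window + 1, hop):
        h_segment = frame_lid.extract_segment_vector(frames[start:start + window])
        posts.append(segment_lid.predict(h_segment))   # (num_langs,)
    return np.stack(posts)                             # (num_windows, num_langs)
```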

As an improvement of the method, calculating the optimal language-state path of the multilingual continuous speech stream with the Viterbi search algorithm, according to the posterior probability distribution of the language states, specifically comprises:

Step 3-1) according to the posterior probability distribution of the language states, set the self-loop probability p_loop and the jump probability p_skip of the language states for the Viterbi search, giving the language-state transition matrix A, an N×N matrix with p_loop on the diagonal and p_skip everywhere else:

A = [a_mn], a_mm = p_loop, a_mn = p_skip for m ≠ n  (7)

where p_loop denotes the self-loop probability of a language state and p_skip its jump probability; every language shares the same self-loop and jump probabilities. Language-state labels are assigned per language category, the labels of different language categories being distinct, with the Arabic numerals 1, 2, ..., N used as language-state labels; the element a_mn of the transition matrix A is the probability of moving from the language state labelled m to the language state labelled n:

a_mn = p_loop if m = n; a_mn = p_skip if m ≠ n, with m, n ∈ {1, 2, ..., N}  (8)

Step 3-2) perform the Viterbi search over the predicted language-state sequence, computing the Viterbi objective function

δ(s_(T+1)) = max over s_T of [ δ(s_T) · p_trans(s_(T+1)|s_T) · p_emit(s_(T+1)|h_segment) ]  (9)

where p_trans(s_(T+1)|s_T) denotes the transition probability from the language state s_T of the multilingual continuous speech stream at time T to the language state s_(T+1) at time T+1:

p_trans(s_(T+1)|s_T) = p_loop if s_(T+1) = s_T; p_skip if s_(T+1) ≠ s_T  (10)

the language-category labels corresponding to the language states s_T and s_(T+1) lie within the range of the annotated language-category labels, and T is the statistics period corresponding to the segment-level language feature h_segment;

p_emit(s_(T+1)|h_segment) denotes the posterior probability predicted for the segment-level language feature h_segment in the language state s_(T+1):

p_emit(s_(T+1)|h_segment) = DNN-LID_segment(h_segment)  (11)

where DNN-LID is the segment-level language classifier based on a deep neural network (DNN);

Step 3-3) take the language-state sequence with the largest objective-function value as the optimal language-state sequence, and backtrack the language states along the optimal language-state sequence to obtain the optimal language-state path. A sketch of this search follows.
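
A compact numpy sketch of steps 3-1) to 3-3): it builds the transition matrix of equation (7) from p_loop, with p_skip derived so that each row sums to one (an illustrative choice), runs the recursion of equation (9) over the per-segment posteriors, and backtracks to the optimal language-state path. Log probabilities are used for numerical stability, a standard implementation detail rather than part of the invention.

```python
# Viterbi search over language states (numpy); p_loop is illustrative and
# p_skip is derived so each row of the transition matrix sums to one.
import numpy as np

def viterbi_language_path(posteriors, p_loop=0.9):
    """posteriors: (num_segs, N) segment-level posteriors p_emit per language."""
    num_segs, n = posteriors.shape
    p_skip = (1.0 - p_loop) / max(n - 1, 1)
    trans = np.full((n, n), p_skip)            # eq. (7): off-diagonal jumps
    np.fill_diagonal(trans, p_loop)            # eq. (7): diagonal self-loops
    log_trans = np.log(trans)
    log_emit = np.log(posteriors + 1e-12)      # log domain for stability

    delta = log_emit[0].copy()                 # best score ending in each state
    back = np.zeros((num_segs, n), dtype=int)  # backpointers
    for t in range(1, num_segs):               # eq. (9): Viterbi recursion
        scores = delta[:, None] + log_trans    # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]

    path = np.empty(num_segs, dtype=int)       # step 3-3): backtracking
    path[-1] = delta.argmax()
    for t in range(num_segs - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path
```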

The present invention also proposes a system for recognizing speech content in multilingual continuous speech streams, the system comprising:

a segment-level language feature extraction module, configured to feed the multilingual continuous speech stream to be recognized into the frame-level language classification model and output segment-level language feature vectors;

a language-state posterior probability calculation module, configured to feed the segment-level language feature vectors into the segment-level language classification model and output the posterior probability distribution of the segment-level language states;

a language-state path acquisition module, configured to calculate the optimal language-state path of the multilingual speech stream with the Viterbi search algorithm, according to the posterior probability distribution of the segment-level language states;

a language-state interval segmentation module, configured to segment the multilingual continuous speech stream to be recognized along the optimal language-state path to obtain language-state intervals; and

a content recognition module for the multilingual speech stream, configured to send the segmented language-state intervals into the multilingual acoustic model and the corresponding multilingual decoders for decoding, obtaining the content recognition result of the multilingual speech stream.

The present invention also proposes a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing any of the above methods when executing the computer program.

The present invention also proposes a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform any of the above methods.

Compared with the prior art, the advantages of the present invention are:

1. By fusing the language classification models with the Viterbi search algorithm, the method and system of the present invention for recognizing speech content in multilingual continuous speech streams solve the problem of dynamically detecting the languages of concurrent multilingual content in a continuous speech stream.

2. The method of the present invention can dynamically locate language switching points in the multilingual content of a continuous speech stream and recognize the corresponding multilingual content.

Brief Description of the Drawings

Figure 1 is a schematic diagram of the method of the present invention for recognizing speech content in multilingual continuous speech streams.

Detailed Description

The present invention is described in detail below with reference to the accompanying drawing and specific embodiments.

The present invention proposes a method and system for recognizing speech content in multilingual continuous speech streams. The method includes:

Step 1) construct a multilingual acoustic model based on multi-task learning; the acoustic model unifies the acoustic modeling tasks of several languages within a single multi-task neural-network classification framework and jointly optimizes the multilingual acoustic model with the acoustic features of several languages; specifically:

Step 1-1) construct a multilingual acoustic model within a multi-task neural-network classification framework, the model consisting of multiple shared hidden layers and language-specific output layers; the parameters of the shared hidden layers are jointly optimized with the multilingual data, while each language-specific output layer is optimized with the data of its own language;

Step 1-2) during the forward pass of the model, the shared hidden layers and the language-specific output layers of the multilingual acoustic model apply nonlinear transformations to the input multilingual spectral feature vectors, and every language-specific output layer produces an output;

Step 1-3) during the computation of the error loss function for the model update, according to the acoustic state label associated with the spectral features, the error-loss value is computed only at the language-specific output layer corresponding to the language of the spectral features, while the error-loss values computed at the output layers of the other languages are zero; the corresponding loss function, the cross entropy, is:

F_loss,i = −Σ q_label,L · log p_model,i(x_L)  (1)

where F_loss,i is the error-loss value of the i-th language-specific output layer, p_model,i(x_L) is the acoustic-model output of the L-th language-specific output layer for the spectral feature x_L of the L-th language, and q_label,L is the acoustic state label corresponding to x_L;

Step 1-4) during the backward pass of the model classification error, the error-loss value F_loss,i is propagated backwards; the parameters of each language-specific output layer are trained with the data of the corresponding single language, while the parameters of the shared hidden layers are computed from the error-loss values F_loss,i returned by the several language-specific output layers;

the gradient of the language-specific output-layer parameters is:

ΔΦ_i = ∂F_loss,i / ∂Φ_i  (2)

where Φ_i denotes the parameters of the i-th language-specific output layer.

The gradient of the shared-hidden-layer parameters is:

ΔΦ = Σ_{i=1}^{L} ∂F_loss,i / ∂Φ  (3)

where Φ denotes the parameters of the shared hidden layers and L is the number of languages covered by the language-specific output layers of the multilingual acoustic model.

Step 1-5) repeat steps 1-2) through 1-4) until the model parameters converge.

Step 2) construct a frame-level language classification model incorporating long-term statistical features, based on a deep neural network, and use it to extract language feature vectors that characterize the language category. The frame-level language classification model incorporates a long-term statistics component: during the forward pass, this component computes segment-level statistics over the output vectors of the previous hidden layer, namely their mean and variance, and feeds the vector of these statistics into the next hidden layer; finally, the language classification model is updated through error computation against the frame-level language labels and gradient back-propagation.

The specific steps for training the frame-level language classification model are:

Step 2-1) construct a frame-level language classification model, the model being a deep neural network;

Step 2-2) extract the frame-level spectral features of the multilingual continuous speech streams of the training set and feed them into the frame-level language classification model as input features; accumulate long-term statistics over the output vectors of the current hidden layer, and compute the mean vector, the variance vector and the segment-level language feature vector of the current hidden layer's outputs;

The mean vector is:

μ = (1/T) Σ_{i=1}^{T} h_i  (4)

The variance vector is:

σ = sqrt( (1/T) Σ_{i=1}^{T} (h_i − μ)² )  (5)

The segment-level language feature vector is:

h_segment = Append(μ, σ)  (6)

where h_i is the output vector of the current hidden layer at time i, T is the long-term statistics period, μ is the mean vector of the long-term statistics, σ is the variance vector of the long-term statistics, and h_segment is the segment-level language feature vector, formed by concatenating the mean vector and the variance vector, so its dimension is twice that of h_i; Append(μ, σ) denotes concatenating μ and σ into a single higher-dimensional vector;

Step 2-3) feed the mean vector and variance vector into the next hidden layer and train the network against the frame-level language labels through error computation and gradient back-propagation, so that every hidden layer can output segment-level language feature vectors; this yields the trained frame-level language classification model.

Based on the trained frame-level language classification model, segment-level language feature vectors are extracted from its hidden layers; a segment-level language label is constructed for each segment-level language feature vector, and the segment-level language classification model is trained on the segment-level language feature vectors and segment-level language labels. Specifically:

Step S2-1) construct a segment-level language classification model;

Step S2-2) extract the frame-level spectral features of the multilingual continuous speech streams of the training set, feed them into the hidden layers of the trained frame-level language classification model as input features, and extract segment-level language feature vectors from those hidden layers;

Step S2-3) assign a segment-level language label to each segment-level language feature vector, feed the vectors into the segment-level language classification model, and train it to output the posterior probability distribution of the language state corresponding to each segment-level language label; this yields the trained segment-level language classification model.

Step 3) for the speech of the multilingual continuous speech stream to be recognized, extract segment-level language feature vectors with the trained frame-level language classification model, classify the segment-level language feature vectors with the segment-level language classification model, and, combined with the Viterbi search algorithm, detect language switching points in the multilingual continuous speech stream in real time; finally, according to the language detection result, segment the continuous speech stream and recognize the content of the multilingual speech stream with the multilingual acoustic model and the corresponding decoders. The specific steps are:

Step 3-1) from the spectral features of the speech of the multilingual continuous speech stream to be recognized, extract segment-level language feature vectors with the frame-level language classification model, using a given hop size and window length;

classify the segment-level language feature vectors with the segment-level language classification model to obtain the posterior probability distribution of the language state corresponding to each segment-level language feature vector;

set the self-loop probability and jump probability of the language states for the Viterbi search; raising the self-loop probability of the language states reduces language classification errors caused by the imprecise classification of the segment-level language classification model; this comprises:

based on the posterior probability distribution of the language states, setting the self-loop probability and jump probability of the language states for the Viterbi search, giving the language-state transition matrix A, an N×N matrix with p_loop on the diagonal and p_skip everywhere else:

A = [a_mn], a_mm = p_loop, a_mn = p_skip for m ≠ n  (7)

where p_loop denotes the self-loop probability of a language state and p_skip its jump probability; every language shares the same self-loop and jump probabilities. Language-state labels are assigned per language category, the labels of different language categories being distinct, with the Arabic numerals 1, 2, ..., N used as language-state labels; the element a_mn of the transition matrix A is the probability of moving from the language state labelled m to the language state labelled n:

a_mn = p_loop if m = n; a_mn = p_skip if m ≠ n, with m, n ∈ {1, 2, ..., N}  (8)

Step 3-2) compute the posterior probability p_emit(s_(T+1)|h_segment) of the predicted segment-level language state, and perform the Viterbi search over the predicted language states with the preset self-loop probability p_loop and jump probability p_skip, specifically:

compute the optimal language-state sequence of the continuous speech stream with the Viterbi objective function

δ(s_(T+1)) = max over s_T of [ δ(s_T) · p_trans(s_(T+1)|s_T) · p_emit(s_(T+1)|h_segment) ]  (9)

where p_trans(s_(T+1)|s_T) denotes the transition probability from the language state s_T of the multilingual continuous speech stream at time T to the language state s_(T+1) at time T+1:

p_trans(s_(T+1)|s_T) = p_loop if s_(T+1) = s_T; p_skip if s_(T+1) ≠ s_T  (10)

the language-category labels corresponding to the language states s_T and s_(T+1) lie within the range of the annotated language-category labels, and T is the statistics period corresponding to the segment-level language feature h_segment;

p_emit(s_(T+1)|h_segment) denotes the posterior probability predicted for the segment-level language feature h_segment in the language state s_(T+1):

p_emit(s_(T+1)|h_segment) = DNN-LID_segment(h_segment)  (11)

where DNN-LID is the segment-level language classifier based on a deep neural network (DNN);

Step 3-3) with the recursion above, the best language states can be retrieved from the segment-level language-state posteriors predicted by the segment-level language classification model together with the preset self-loop and jump probabilities; the sequence with the largest final objective-function value is the optimal language-state sequence of the multilingual continuous speech stream, and backtracking the language states along it yields the optimal language-state path.

Step 4) along the optimal language-state path, the multilingual speech stream can be segmented into language-state intervals; sending the speech of the segmented language-state intervals into the multilingual acoustic model and the corresponding multilingual decoders for decoding yields the content recognition result of the multilingual continuous speech stream, as sketched below.
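
Step 4) amounts to cutting the stream at each language change along the optimal path and routing every interval to the decoder of its language. A small sketch, in which the hop/window geometry and the decoder objects are hypothetical placeholders:

```python
# Interval segmentation and per-language decoding; hop/window geometry and
# decoder objects are hypothetical placeholders.
def split_and_decode(path, hop, window, decoders):
    """path: per-window language labels from the Viterbi search.
    decoders: dict mapping language label -> callable(start_frame, end_frame)."""
    results, run_start = [], 0
    for t in range(1, len(path) + 1):
        # Close a run at the end of the path or when the language changes.
        if t == len(path) or path[t] != path[run_start]:
            start_frame = run_start * hop        # first frame of the interval
            end_frame = (t - 1) * hop + window   # last frame of the interval
            results.append(decoders[path[run_start]](start_frame, end_frame))
            run_start = t
    return results
```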

The present invention also proposes a system for recognizing speech content in multilingual continuous speech streams, the system comprising:

a segment-level language feature extraction module, configured to feed the multilingual continuous speech stream to be recognized into the frame-level language classification model and output segment-level language feature vectors;

a language-state posterior probability calculation module, configured to feed the segment-level language feature vectors into the segment-level language classification model and output the posterior probability distribution of the segment-level language states;

a language-state path acquisition module, configured to calculate the optimal language-state path of the multilingual speech stream with the Viterbi search algorithm, according to the posterior probability distribution of the segment-level language states;

a language-state interval segmentation module, configured to segment the multilingual continuous speech stream to be recognized along the optimal language-state path to obtain language-state intervals; and

a content recognition module for the multilingual speech stream, configured to send the segmented language-state intervals into the multilingual acoustic model and the corresponding multilingual decoders for decoding, obtaining the content recognition result of the multilingual speech stream.

The present invention also proposes a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing any of the above methods when executing the computer program.

The present invention also proposes a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform any of the above methods.

The soundness and effectiveness of the speech recognition system of the present invention have been verified on a real system; the results are shown in Table 1:

Table 1

The method of the present invention jointly trains the multilingual acoustic model on Cantonese, Turkish and Vietnamese data, builds frame-level and segment-level language classification models covering the three languages, and applies the Viterbi-based method for recognizing speech content in multilingual continuous speech streams to classify the languages and recognize the speech content of continuous multilingual speech. As Table 1 shows, the method raises the language identification accuracy from 82.1% to 92.4%, verifying that the Viterbi-based method of the present invention for recognizing speech content in multilingual continuous speech streams can effectively improve language detection in continuous multilingual speech streams.

Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to the embodiments, those of ordinary skill in the art will understand that modifications or equivalent substitutions of the technical solution that do not depart from its spirit and scope shall all be covered by the scope of the claims of the present invention.

Claims (8)

1. A method of multi-lingual continuous voice stream voice content recognition, the method comprising:
inputting the multi-language continuous voice stream to be identified into a frame-level language classification model, and outputting segment-level language feature vectors;
inputting the segment level language feature vector into a segment level language classification model, and outputting posterior probability distribution of the segment level language state;
calculating an optimal language state path of the multi-language continuous voice stream based on a Viterbi search algorithm according to posterior probability distribution of the segment-level language states;
dividing the multi-language continuous voice stream to be recognized according to the optimal language state path to obtain a language state interval;
inputting the language state interval into a multi-language acoustic model and a corresponding multi-language decoder for decoding to obtain a content identification result of the multi-language continuous voice stream;
according to the posterior probability distribution of segment-level language states, based on a Viterbi search algorithm, calculating an optimal language state path of the multi-language continuous voice stream, wherein the method specifically comprises the following steps:
step 3-1) setting the self-loop probability p_loop and the jump probability p_skip of the language states of the Viterbi search according to the posterior probability distribution of the language states; the resulting language-state transition matrix A, an N×N matrix with p_loop on the diagonal and p_skip everywhere else, is:

A = [a_mn], a_mm = p_loop, a_mn = p_skip for m ≠ n  (7)

wherein the self-loop probability and the jump probability are the same for every language; language-state labels are set according to the language categories, the labels of different language categories being distinct, with the Arabic numerals 1, 2, ..., N used as language-state labels; the correspondence between the elements of the transition matrix A and the language-state labels is:

a_mn = p_loop if m = n; a_mn = p_skip if m ≠ n, with m, n ∈ {1, 2, ..., N}  (8)

step 3-2) carrying out a Viterbi search on the predicted language states, and calculating the Viterbi objective function:

δ(s_(T+1)) = max over s_T of [ δ(s_T) · p_trans(s_(T+1)|s_T) · p_emit(s_(T+1)|h_segment) ]  (9)

wherein p_trans(s_(T+1)|s_T) denotes the transition probability from the language state s_T of the multilingual continuous voice stream at time T to the language state s_(T+1) at time T+1:

p_trans(s_(T+1)|s_T) = p_loop if s_(T+1) = s_T; p_skip if s_(T+1) ≠ s_T  (10)

wherein the language-category labels corresponding to the language states s_T and s_(T+1) lie within the range of the annotated language-category labels, and T is the statistics period corresponding to the segment-level language feature h_segment;

p_emit(s_(T+1)|h_segment) denotes the posterior probability predicted for the segment-level language feature h_segment in the language state s_(T+1):

p_emit(s_(T+1)|h_segment) = DNN-LID_segment(h_segment)  (11)

wherein DNN-LID is a segment-level language classifier based on a deep neural network (DNN);

step 3-3) taking the language-state sequence with the largest objective-function value as the optimal language-state sequence, and backtracking the language states along the optimal language-state sequence to obtain the optimal language-state path.
2. The method for recognizing speech content in a multilingual continuous speech stream according to claim 1, further comprising a training step of the multilingual acoustic model, comprising the specific steps of:

step 1-1) constructing a multilingual acoustic model based on a multi-task learning neural network, the model comprising a plurality of shared hidden layers and language-specific output layers;

step 1-2) extracting spectral features of the multilingual continuous voice streams of a training set based on the acoustic state labels of the multilingual continuous voice data, and inputting the spectral features into the shared hidden layers for nonlinear transformation; outputting the data of a plurality of single languages to a plurality of language-specific output layers;

step 1-3) calculating an error-loss value from the single-language data at the language-specific output layer corresponding to the input spectral features; the error loss function F_loss,i, a cross entropy, is:

F_loss,i = −Σ q_label,L · log p_model,i(x_L)  (1)

wherein F_loss,i is the error-loss value of the i-th language-specific output layer, p_model,i(x_L) is the output of the L-th language-specific output layer for the spectral feature x_L of the L-th language, and q_label,L is the acoustic state label corresponding to x_L; the error-loss values of the other output layers are zero;

step 1-4) propagating the error-loss value F_loss,i backwards; the parameters of each language-specific output layer are updated with the data of the corresponding single language, computing the language-specific output-layer parameter gradient ΔΦ_i:

ΔΦ_i = ∂F_loss,i / ∂Φ_i  (2)

wherein Φ_i denotes the parameters of the i-th language-specific output layer;

the parameters of the shared hidden layers are updated from the error-loss values F_loss,i returned by the plurality of language-specific output layers, computing the gradient ΔΦ of the shared-hidden-layer parameters:

ΔΦ = Σ_{i=1}^{L} ∂F_loss,i / ∂Φ  (3)

wherein Φ denotes the parameters of the shared hidden layers and L is the number of languages corresponding to the language-specific output layers of the multilingual acoustic model;

step 1-5) when F_loss,i is greater than a given threshold, returning to step 1-2);

when F_loss,i is less than the given threshold, obtaining the trained multilingual acoustic model.
3. The method of claim 1, further comprising the step of training a frame-level language classification model, comprising the steps of:

step 2-1) constructing a frame-level language classification model, the frame-level language classification model being a deep neural network;

step 2-2) extracting frame-level spectral features of the multilingual continuous voice streams of the training set, inputting the frame-level spectral features into the frame-level language classification model, carrying out long-term statistics on the output vectors of the current hidden layer, and calculating a mean vector, a variance vector and a segment-level language feature vector of the output vectors of the current hidden layer;

the mean vector μ is:

μ = (1/T) Σ_{i=1}^{T} h_i  (4)

the variance vector σ is:

σ = sqrt( (1/T) Σ_{i=1}^{T} (h_i − μ)² )  (5)

the segment-level language feature vector h_segment is:

h_segment = Append(μ, σ)  (6)

wherein h_i is the output vector of the current hidden layer at time i, T is the long-term statistics period, μ is the mean vector of the long-term statistics, σ is the variance vector of the long-term statistics, and h_segment is the segment-level language feature vector, formed by concatenating the mean vector and the variance vector, its dimension being twice that of h_i, and Append(μ, σ) denotes concatenating μ and σ to form a higher-dimensional vector;

step 2-3) taking the mean vector and the variance vector as the input of the next hidden layer, training through error calculation and gradient back-propagation according to the frame-level language labels, and enabling each hidden layer to output segment-level language feature vectors, to obtain a trained frame-level language classification model.
4. The method of claim 1, further comprising the step of training a segment-level language classification model, comprising the steps of:

step S2-1) constructing a segment-level language classification model;

step S2-2) extracting frame-level spectral features of the multilingual continuous voice streams of the training set, inputting the frame-level spectral features into the hidden layers of the trained frame-level language classification model, and extracting segment-level language feature vectors from the hidden layers of the trained frame-level language classification model;

step S2-3) setting a segment-level language label for each segment-level language feature vector, inputting the segment-level language feature vectors into the segment-level language classification model, and training it to output the posterior probability distribution of the language states corresponding to the segment-level language labels, to obtain a trained segment-level language classification model.
5. The method for recognizing speech content according to claim 1, wherein the multilingual continuous speech stream to be recognized is input into the frame-level language classification model to output segment-level language feature vectors, and the segment-level language feature vectors are input into the segment-level language classification model to output the posterior probability distribution of the language states; the method specifically comprises the following steps:

extracting the frame-level spectral features to be recognized from the multilingual continuous voice stream to be recognized;

inputting the frame-level spectral features to be recognized into the trained frame-level language classification model according to a given hop size and window length, and outputting segment-level language feature vectors h_segment;

inputting the segment-level language feature vectors h_segment into the trained segment-level language classification model, and outputting the posterior probability distribution of the language states corresponding to the segment-level language feature vectors.
6. A system based on the multi-lingual continuous voice stream voice content recognition method of claim 1, the system comprising:
the segment level language feature extraction module is used for inputting the multi-language continuous voice stream to be recognized into the frame level language classification model and outputting segment level language feature vectors;
the posterior probability calculation module of the language state inputs the segment-level language feature vector into the segment-level language classification model, and outputs posterior probability distribution of the segment-level language state;
the language state path acquisition module is used for calculating an optimal language state path of the multilingual voice stream based on a Viterbi retrieval algorithm according to posterior probability distribution of the segment-level language states;
the language state interval segmentation module is used for segmenting the multi-language continuous voice stream to be identified according to the optimal language state path to obtain a language state interval; and
and the content recognition module is used for sending the segmented language state interval into a multilingual acoustic model and a corresponding multilingual decoder for decoding to obtain a content recognition result of the multilingual continuous voice stream.
7. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of claims 1-5 when executing the computer program.
8. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the method of any of claims 1-5.
CN201910782981.2A 2019-08-23 2019-08-23 A multi-language continuous speech stream speech content recognition method and system Active CN112489622B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910782981.2A CN112489622B (en) 2019-08-23 2019-08-23 A multi-language continuous speech stream speech content recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910782981.2A CN112489622B (en) 2019-08-23 2019-08-23 A multi-language continuous speech stream speech content recognition method and system

Publications (2)

Publication Number Publication Date
CN112489622A CN112489622A (en) 2021-03-12
CN112489622B true CN112489622B (en) 2024-03-19

Family

ID=74920171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910782981.2A Active CN112489622B (en) 2019-08-23 2019-08-23 A multi-language continuous speech stream speech content recognition method and system

Country Status (1)

Country Link
CN (1) CN112489622B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113870839B (en) * 2021-09-29 2022-05-03 北京中科智加科技有限公司 Language identification device of language identification model based on multitask
TWI795173B (en) * 2022-01-17 2023-03-01 中華電信股份有限公司 Multilingual speech recognition system, method and computer readable medium
CN114078468B (en) * 2022-01-19 2022-05-13 广州小鹏汽车科技有限公司 Voice multi-language recognition method, device, terminal and storage medium
CN114582329A (en) * 2022-03-03 2022-06-03 北京有竹居网络技术有限公司 Voice recognition method and device, computer readable medium and electronic equipment
CN115831094B (en) * 2022-11-08 2023-08-15 北京数美时代科技有限公司 Multilingual voice recognition method, multilingual voice recognition system, storage medium and electronic equipment
CN116453503A (en) * 2023-04-10 2023-07-18 京东科技信息技术有限公司 Multilingual voice recognition method and device
CN119811383A (en) * 2024-12-25 2025-04-11 北京云上曲率科技有限公司 Stream type audio language identification method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104036774A (en) * 2014-06-20 2014-09-10 国家计算机网络与信息安全管理中心 Method and system for recognizing Tibetan dialects
US9460711B1 (en) * 2013-04-15 2016-10-04 Google Inc. Multilingual, acoustic deep neural networks
CN108492820A (en) * 2018-03-20 2018-09-04 华南理工大学 Chinese speech recognition method based on Recognition with Recurrent Neural Network language model and deep neural network acoustic model

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7043431B2 (en) * 2001-08-31 2006-05-09 Nokia Corporation Multilingual speech recognition system using text derived recognition models
US10593321B2 (en) * 2017-12-15 2020-03-17 Mitsubishi Electric Research Laboratories, Inc. Method and apparatus for multi-lingual end-to-end speech recognition

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9460711B1 (en) * 2013-04-15 2016-10-04 Google Inc. Multilingual, acoustic deep neural networks
CN104036774A (en) * 2014-06-20 2014-09-10 国家计算机网络与信息安全管理中心 Method and system for recognizing Tibetan dialects
CN108492820A (en) * 2018-03-20 2018-09-04 华南理工大学 Chinese speech recognition method based on Recognition with Recurrent Neural Network language model and deep neural network acoustic model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Thomas Niesler et al., "Language identification and multilingual speech recognition using discriminatively trained acoustic models," Computer Science, Linguistics, 2006, pp. 1-6. *
Aanchan Mohan et al., "Multi-lingual speech recognition with low-rank multi-task deep neural networks," ICASSP 2015, Aug. 2015, pp. 4994-4998. *
Yao Haitao et al., "Research on acoustic modeling methods for multilingual speech recognition" (面向多语言的语音识别声学模型建模方法研究), 《声学技术》 (Technical Acoustics), vol. 34, no. 6, Dec. 2015, pp. 404-407. *

Also Published As

Publication number Publication date
CN112489622A (en) 2021-03-12

Similar Documents

Publication Publication Date Title
CN112489622B (en) A multi-language continuous speech stream speech content recognition method and system
CN112163426B (en) A relation extraction method based on the combination of attention mechanism and graph long and short-term memory neural network
CN112528676B (en) Document-level event argument extraction method
CN110188343B (en) Multimodal Emotion Recognition Method Based on Fusion Attention Network
CN112613273B (en) Compression method and system of multi-language BERT sequence labeling model
CN108280064B (en) Combined processing method for word segmentation, part of speech tagging, entity recognition and syntactic analysis
CN107918782B (en) Method and system for generating natural language for describing image content
CN107358948B (en) An attention model-based approach to language input relevance detection
CN110968660B (en) Information extraction method and system based on joint training model
CN110555084B (en) Distant Supervised Relation Classification Method Based on PCNN and Multilayer Attention
CN108765383A (en) Video presentation method based on depth migration study
CN111475655A (en) A power dispatching text entity linking method based on distribution network knowledge graph
CN108538285A (en) A kind of various keyword detection method based on multitask neural network
CN118114667B (en) Named Entity Recognition Model Based on Multi-task Learning and Attention Mechanism
CN110516229B (en) A domain-adaptive Chinese word segmentation method based on deep learning
CN113077785B (en) An end-to-end multilingual continuous speech stream speech content recognition method and system
CN111476024A (en) Text word segmentation method and device and model training method
CN114841151B (en) Joint Extraction Method of Entity-Relationship in Medical Text Based on Decomposition-Reorganization Strategy
CN111161724B (en) Method, system, equipment and medium for Chinese audio-visual combined speech recognition
CN110134950B (en) A Text Automatic Proofreading Method Based on Combination of Words
CN112200268A (en) An image description method based on encoder-decoder framework
CN115269822A (en) Sequence labeling acceleration method based on neural network early-quit mechanism
CN113268985B (en) Method, device and medium for remote supervision relation extraction based on relation path
CN112131879A (en) A relation extraction system, method and apparatus
CN116882398B (en) Implicit chapter relation recognition method and system based on phrase interaction

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20241016

Address after: 100190, No. 21 West Fourth Ring Road, Beijing, Haidian District

Patentee after: INSTITUTE OF ACOUSTICS, CHINESE ACADEMY OF SCIENCES

Country or region after: China

Address before: 100190, No. 21 West Fourth Ring Road, Beijing, Haidian District

Patentee before: INSTITUTE OF ACOUSTICS, CHINESE ACADEMY OF SCIENCES

Country or region before: China

Patentee before: BEIJING KEXIN TECHNOLOGY Co.,Ltd.

TR01 Transfer of patent right