
WO2017166625A1 - Acoustic model training method and apparatus for speech recognition, and electronic device - Google Patents


Info

Publication number
WO2017166625A1
WO2017166625A1 (PCT/CN2016/096672)
Authority
WO
WIPO (PCT)
Prior art keywords
acoustic
state
model
training
original
Prior art date
Application number
PCT/CN2016/096672
Other languages
French (fr)
Chinese (zh)
Inventor
张俊博
Original Assignee
乐视控股(北京)有限公司
乐视致新电子科技(天津)有限公司
Priority date
Filing date
Publication date
Application filed by 乐视控股(北京)有限公司 and 乐视致新电子科技(天津)有限公司
Publication of WO2017166625A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training

Definitions

  • the present invention relates to the field of speech recognition technologies, and in particular, to an acoustic model training method, apparatus and electronic device for speech recognition.
  • One purpose of a speech recognition system is to convert speech into text: given a speech signal, find a word sequence (composed of words or characters) that best matches the speech signal.
  • One of the most important parts of a speech recognition system is the acoustic model (Acoustic Modeling).
  • When speech recognition is performed, the speech signal is converted into acoustic features, and the acoustic model is then used to determine the acoustic states corresponding to those features.
  • Combining the acoustic states yields the text.
  • An acoustic state is the basic unit constituting the pronunciation of a word, and generally refers to a smaller unit obtained by further dividing a phoneme.
  • The acoustic state corresponding to an acoustic feature is obtained by using the state description models in the acoustic model.
  • In the acoustic model, each acoustic state corresponds to one state description model, and the state description model can be used to identify the acoustic state that best matches an acoustic feature.
  • The training process of an acoustic model is very complicated: it includes not only training the state description models, but also the extraction of acoustic features, acoustic feature transformation, decision tree generation, and training of the state definition models.
  • In some cases the original acoustic model is no longer applicable to the current application scenario. This requires retraining a new acoustic model, but retraining has high complexity. The inventors found in their study that some structures in the original acoustic model, such as the state definition model, may not need to be changed; retraining from scratch would destroy the acoustic states defined by the state definition model and thus affect the accuracy of speech recognition.
  • The problem addressed by the embodiments of the present invention is to reduce the complexity of model training without destroying the structure of the original acoustic model, while ensuring the accuracy of speech recognition.
  • The embodiments of the invention provide an acoustic model training method and apparatus for speech recognition, used to solve the prior-art technical problem of reducing the complexity of acoustic model training under the premise of ensuring the accuracy of speech recognition.
  • An embodiment of the present invention provides an acoustic model training method for speech recognition, including:
  • acquiring a training sample, wherein the training sample includes an acoustic feature and a training text corresponding to the acoustic feature;
  • acquiring an original acoustic model, and using the original acoustic model to determine an acoustic state in the original acoustic model corresponding to each training text;
  • determining, according to the acoustic state and the acoustic feature corresponding to each training text, an acoustic feature corresponding to each acoustic state;
  • retraining a state description model of each acoustic state by using the acoustic feature corresponding to the acoustic state;
  • updating the original state description model in the original acoustic model with the state description model obtained by retraining, to obtain an updated acoustic model.
  • An embodiment of the present invention provides an acoustic model training apparatus for speech recognition, including:
  • a sample acquisition module configured to acquire a training sample;
  • the training sample includes an acoustic feature and a training text corresponding to the acoustic feature;
  • a first determining module configured to acquire an original acoustic model, and use the original acoustic model to determine an acoustic state in the original acoustic model corresponding to each training text;
  • a second determining module configured to determine an acoustic feature corresponding to each acoustic state according to an acoustic state and an acoustic feature corresponding to each training text;
  • a training module configured to retrain a state description model of the acoustic state by using an acoustic feature corresponding to each acoustic state;
  • an update module configured to update the original state description model in the original acoustic model by using a state description model obtained by retraining, and obtain an updated acoustic model.
  • Embodiments of the invention further disclose an electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to: acquire training samples, the training samples including acoustic features and training text corresponding to the acoustic features; acquire an original acoustic model, and use the original acoustic model to determine an acoustic state in the original acoustic model corresponding to each training text; determine an acoustic feature corresponding to each acoustic state according to the acoustic state and the acoustic feature corresponding to each training text; retrain a state description model of each acoustic state by using the acoustic feature corresponding to the acoustic state; and update the original state description model in the original acoustic model with the state description model obtained by retraining, to obtain an updated acoustic model.
  • The present invention also discloses a non-volatile computer storage medium, wherein the storage medium stores computer-executable instructions for causing a computer to perform the method of claims 1-5.
  • The embodiments of the invention further provide a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform the method of claims 1-5.
  • The acoustic model training method and apparatus for speech recognition determine the acoustic states corresponding to the training texts in the training samples by using the original acoustic model, and can determine the acoustic features corresponding to each acoustic state according to the acoustic state and acoustic feature corresponding to each training text. The state description model of each acoustic state can therefore be retrained directly from the acoustic features corresponding to that state, and the state description model obtained by retraining is used to update the original state description model in the original acoustic model, yielding the updated acoustic model.
  • Only the state description models in the original acoustic model need to be retrained, with no need to train a new acoustic model, which not only reduces training complexity but also, because the original acoustic model is updated rather than rebuilt, preserves its structure and ensures the accuracy of speech recognition.
  • FIG. 1 is a flow chart of an embodiment of an acoustic model training method for speech recognition according to the present invention.
  • FIG. 2 is a schematic structural view of an embodiment of an acoustic model training device for speech recognition according to the present invention
  • FIG. 3 is a schematic structural diagram of hardware of an electronic device according to an embodiment of the present invention.
  • Unless explicitly specified otherwise, a connection may be a fixed, detachable or integral connection; a mechanical or electrical connection; a direct connection or an indirect connection through an intermediate medium, or internal communication between two components; and a wireless or wired connection.
  • the technical solution of the present invention is mainly applied to the training of acoustic models in the field of speech recognition.
  • the acoustic model is one of the most important parts of the speech recognition system.
  • the acoustic model is used to determine the respective acoustic states corresponding to the acoustic features, and the text is obtained by combining the various acoustic states.
  • The acoustic features are extracted from the speech signal; an acoustic feature may be, for example, an MFCC (Mel-Frequency Cepstral Coefficients) feature.
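As a concrete illustration of such features, the following is a minimal MFCC-style extraction sketch (framing and windowing, power spectrum, mel filterbank, DCT). It is not the patent's method; the frame length, hop, filter count and coefficient count are illustrative defaults.

```python
import numpy as np

def mel_filterbank_features(signal, sample_rate=16000, frame_len=400,
                            hop=160, n_mels=23, n_ceps=13):
    """Simplified MFCC-style features: frame -> power spectrum
    -> mel filterbank -> log -> DCT. Constants are illustrative."""
    # Split the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Power spectrum of each frame.
    n_fft = 512
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # Triangular mel filterbank between 0 Hz and the Nyquist frequency.
    def hz_to_mel(hz):
        return 2595.0 * np.log10(1.0 + hz / 700.0)

    def mel_to_hz(mel):
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)

    # DCT-II decorrelates the log filterbank energies; keep n_ceps coefficients.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2.0 * n_mels)))
    return log_energy @ dct.T  # shape: (n_frames, n_ceps)
```

Each row of the returned matrix is the feature vector of one speech frame, the unit on which the acoustic states operate.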
  • Acoustic models are built from modeling frameworks such as Hidden Markov Models, and a large number of training samples are needed to train these models to obtain the acoustic model.
  • The acoustic model includes a state description model for each acoustic state. The state description models are used to calculate the probability of an acoustic feature under each acoustic state and to determine the acoustic states that best match the acoustic features, so that combining the acoustic states yields the text.
  • Using the state definition model, the state sequence that a word or a phoneme in a specific context should correspond to can be obtained; using the state description models, the probability that an acoustic feature belongs to a given acoustic state can be obtained, so that the acoustic state best matching the acoustic feature can be determined.
  • When the original acoustic model is not applicable to the current application scenario, a new acoustic model must be retrained, but retraining a new acoustic model is highly complex. Moreover, the inventors found in their study that
  • the original training data used to train the original model may be unavailable for a variety of reasons (e.g., the original training data is confidential). If a new acoustic model is retrained, the recognition accuracy of the new model obtained by retraining may be lower than that of the original acoustic model.
  • Therefore, the present invention proposes a technical solution that updates the original acoustic model without destroying its structure, under the premise of ensuring the accuracy of speech recognition.
  • The training samples and the original acoustic model are acquired, the acoustic states corresponding to the training texts in the training samples are determined by using the original acoustic model, and the acoustic features corresponding to each acoustic state can then be determined according to the acoustic state and acoustic feature corresponding to each training text. The state description model of each acoustic state can therefore be retrained directly from the acoustic features corresponding to that state, and the state description model obtained by retraining is used to update the original state description model in the original acoustic model, yielding the updated acoustic model.
  • the updated acoustic model can continue to be used for speech recognition.
  • Only the state description models in the original acoustic model need to be retrained, with no need to train a new acoustic model, which reduces training complexity.
  • Because the original acoustic model is updated rather than rebuilt, its structure is not destroyed, and the accuracy of speech recognition is ensured.
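The retraining scheme summarized above can be sketched as follows. The dictionary layout of the model and the `align`/`retrain` callbacks are hypothetical stand-ins, since the patent does not prescribe concrete data structures; the point is that only the per-state description models are replaced.

```python
from collections import defaultdict

def update_acoustic_model(original_model, samples, align, retrain):
    """Retrain only the per-state description models of `original_model`,
    leaving its other structures (state definitions, decision trees,
    feature transforms) untouched.

    samples: list of (features, text) pairs, features = list of frame vectors.
    align(model, features, text) -> one state id per frame (forced alignment).
    retrain(frames) -> new state description model for those frames.
    """
    # Use the original model to get the state for every frame,
    # then group frames (segmentation data) by acoustic state.
    frames_by_state = defaultdict(list)
    for features, text in samples:
        states = align(original_model, features, text)
        for frame, state in zip(features, states):
            frames_by_state[state].append(frame)
    # Retrain one description model per observed state.
    new_desc = {s: retrain(f) for s, f in frames_by_state.items()}
    # Replace only the description models; copy everything else as-is.
    updated = dict(original_model)
    updated["state_desc"] = {**original_model["state_desc"], **new_desc}
    return updated
```

States that never occur in the new training samples keep their original description models, which matches the idea of updating rather than rebuilding the model.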
  • FIG. 1 is a flowchart of an embodiment of an acoustic model training method for speech recognition according to an embodiment of the present invention, which may include the following steps:
  • the training sample includes an acoustic feature and a training text corresponding to the acoustic feature.
  • the acoustic features in the training samples may be obtained by extracting from training speech.
  • the acoustic features and their corresponding training texts are used as training samples.
  • Because only the state description models need to be retrained, the number of training samples can be greatly reduced.
  • The training samples can be determined according to the application scenario to which the original acoustic model is no longer applicable: since the model no longer fits that scenario, it cannot accurately recognize the corresponding text and therefore needs to be retrained.
  • The state definition model of the original acoustic model may be used to determine the acoustic states corresponding to each training text.
  • Each training text generally corresponds to multiple acoustic states, that is, to a sequence of acoustic states.
  • The state definition model is used to determine which acoustic states a word or a phoneme in a specific context corresponds to.
  • the state description model is used to determine the probability of the acoustic feature in an acoustic state when the acoustic feature is given, thereby determining the sequence of acoustic states that best match the acoustic feature.
  • An acoustic state is a basic unit constituting the utterance of a text, and may refer to a unit smaller than a phoneme, obtained by further dividing a phoneme. Combining acoustic states yields phonemes, and combining phonemes yields text.
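A toy illustration of this state-phoneme-word hierarchy; the three-states-per-phoneme inventory and the one-word lexicon below are invented for the example.

```python
# Hypothetical three-state-per-phoneme inventory (illustrative only).
phone_states = {"ih": ["ih_1", "ih_2", "ih_3"],
                "t": ["t_1", "t_2", "t_3"]}
lexicon = {"it": ["ih", "t"]}  # word -> phoneme sequence

def word_to_state_sequence(word):
    """Expand a word into its acoustic-state sequence:
    word -> phonemes -> states."""
    return [s for ph in lexicon[word] for s in phone_states[ph]]
```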
  • Since the training text is composed of words, the sequence of acoustic states in the original acoustic model corresponding to each training text can be obtained.
  • Once the acoustic states corresponding to a training text are obtained, the acoustic states corresponding to its acoustic features are also obtained.
  • the acoustic features may be segmented according to an acoustic state and an acoustic feature corresponding to each training text to obtain segmentation data;
  • the segmentation data corresponding to each acoustic state is determined.
  • each acoustic state specifically corresponds to segmentation data in the acoustic features.
  • Specifically, the acoustic features may be force-aligned using the original acoustic model: a linear decoding network is constructed from the training text, and the Viterbi algorithm is used to segment the acoustic features corresponding to the training text, obtaining segmentation data so that the segmentation data corresponding to each acoustic state can be determined.
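The forced-alignment step described above can be sketched as a Viterbi pass over a linear (left-to-right) network built from the training text's state sequence. The `log_likelihood` interface is an assumption standing in for the scores produced by the state description models.

```python
import numpy as np

def force_align(log_likelihood, n_frames, states):
    """Viterbi forced alignment on a linear decoding network.
    log_likelihood(t, s) -> log p(frame t | state s) (assumed interface).
    Returns, for each frame, the index into `states` it is assigned to."""
    S = len(states)
    NEG = -np.inf
    score = np.full((n_frames, S), NEG)
    back = np.zeros((n_frames, S), dtype=int)
    score[0, 0] = log_likelihood(0, states[0])  # must start in the first state
    for t in range(1, n_frames):
        for j in range(S):
            # Linear topology: stay in state j, or advance from state j-1.
            stay = score[t - 1, j]
            adv = score[t - 1, j - 1] if j > 0 else NEG
            if adv > stay:
                score[t, j], back[t, j] = adv, j - 1
            else:
                score[t, j], back[t, j] = stay, j
            if score[t, j] > NEG:
                score[t, j] += log_likelihood(t, states[j])
    # Backtrace from the final state; the path is the frame-to-state
    # segmentation of the acoustic features.
    path = [S - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [int(i) for i in path[::-1]]
```

The contiguous runs in the returned path are exactly the per-state segmentation data used for retraining.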
  • the state description model of the acoustic state is retrained by using segmentation data corresponding to each acoustic state.
  • the state description model obtained by retraining can replace the original state description model in the original acoustic model, and the other structures of the original acoustic model are unchanged, thereby obtaining the updated acoustic model.
  • The updated acoustic model is, in effect, retrained on the training samples; since the training samples come from the target application scenario, the retrained model is applicable to speech recognition in that scenario.
  • In this way, only the state description models in the original acoustic model need to be retrained, without training a new acoustic model, which reduces training complexity; and because the original acoustic model is updated rather than rebuilt, its structure is not destroyed, and structures such as the state definition model can continue to be used, ensuring the accuracy of speech recognition.
  • In the case that the original training data is lost and the state definition model in the original acoustic model should not be changed, the state description models in the original acoustic model can be retrained, reducing training complexity while adapting the acoustic model to the current application scenario.
  • The state description models can be obtained by training deep neural networks (DNN), which can be implemented with the Back Propagation algorithm.
  • DNN: deep neural network
  • GMM: Gaussian Mixture Model
  • EM: Expectation Maximization algorithm
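For a GMM state description model, retraining from the segmentation data assigned to one acoustic state amounts to running EM. Below is a minimal one-dimensional sketch; real systems use multivariate GMMs or DNNs, and the initialization here is an illustrative choice.

```python
import numpy as np

def retrain_gmm_em(data, n_comp=2, n_iter=20):
    """Re-estimate a one-dimensional GMM state description model with EM
    from the segmentation data assigned to a single acoustic state."""
    data = np.asarray(data, dtype=float)
    # Illustrative initialisation: means spread over the data range,
    # uniform weights, shared global variance.
    means = np.linspace(data.min(), data.max(), n_comp)
    vars_ = np.full(n_comp, data.var() + 1e-3)
    weights = np.full(n_comp, 1.0 / n_comp)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        diff = data[:, None] - means[None, :]
        log_p = (np.log(weights) - 0.5 * np.log(2 * np.pi * vars_)
                 - 0.5 * diff ** 2 / vars_)
        p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        resp = p / p.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances.
        nk = resp.sum(axis=0)
        weights = nk / len(data)
        means = (resp * data[:, None]).sum(axis=0) / nk
        diff = data[:, None] - means[None, :]
        vars_ = (resp * diff ** 2).sum(axis=0) / nk + 1e-6
    return weights, means, vars_
```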
  • FIG. 2 is a schematic structural diagram of an embodiment of an acoustic model training apparatus for speech recognition according to an embodiment of the present invention, where the apparatus may include:
  • the sample obtaining module 201 is configured to acquire a training sample.
  • the training sample includes an acoustic feature and a training text corresponding to the acoustic feature.
  • the sample acquisition module may specifically acquire training speech and training text, and extract acoustic features of the training speech.
  • the acoustic features and their corresponding training texts are used as training samples.
  • Because only the state description models need to be retrained, the number of training samples can be greatly reduced.
  • The training samples can be determined according to the application scenario to which the original acoustic model is no longer applicable: since the model no longer fits that scenario, it cannot accurately recognize the corresponding text and therefore needs to be retrained.
  • the first determining module 202 is configured to acquire an original acoustic model, and use the original acoustic model to determine an acoustic state in the original acoustic model corresponding to each training text.
  • the first determining module may be configured to determine an acoustic state corresponding to each training text by using a state definition model of the original acoustic model.
  • Each training text generally corresponds to multiple acoustic states, that is, to a sequence of acoustic states.
  • The state definition model is used to determine which acoustic states a word or a phoneme in a specific context corresponds to.
  • the state description model is used to determine the probability of the acoustic feature in an acoustic state when the acoustic feature is given, thereby determining the sequence of acoustic states that best match the acoustic feature.
  • An acoustic state is a basic unit constituting the utterance of a text, and may refer to a unit smaller than a phoneme, obtained by further dividing a phoneme. Combining acoustic states yields phonemes, and combining phonemes yields text.
  • Since the training text is composed of words, the sequence of acoustic states in the original acoustic model corresponding to each training text can be obtained.
  • the second determining module 203 is configured to determine an acoustic feature corresponding to each acoustic state according to an acoustic state and an acoustic feature corresponding to each training text.
  • Once the acoustic states corresponding to a training text are obtained, the acoustic states corresponding to its acoustic features are also obtained.
  • the second determining module may be specifically configured to segment the acoustic features according to an acoustic state and an acoustic feature corresponding to each training text, obtain segmentation data, and determine each acoustic state correspondingly. Segmentation data.
  • each acoustic state specifically corresponds to segmentation data in the acoustic features.
  • Specifically, the acoustic features may be force-aligned using the original acoustic model: the second determining module constructs a linear decoding network from each training text and uses the Viterbi algorithm to segment the acoustic features corresponding to the training text, obtaining the segmentation data and thus determining the segmentation data corresponding to each acoustic state.
  • the training module 204 is configured to retrain the state description model of the acoustic state by using the acoustic features corresponding to each acoustic state.
  • the state description model of the acoustic state is retrained by using segmentation data corresponding to each acoustic state.
  • the updating module 205 is configured to update the original state description model in the original acoustic model by using a state description model obtained by retraining, and obtain an updated acoustic model.
  • the state description model obtained by retraining can replace the original state description model in the original acoustic model, and the other structures of the original acoustic model are unchanged, thereby obtaining the updated acoustic model.
  • The updated acoustic model is, in effect, retrained on the training samples; since the training samples come from the target application scenario, the retrained model is applicable to speech recognition in that scenario.
  • In this way, only the state description models in the original acoustic model need to be retrained, without training a new acoustic model, which reduces training complexity; and because the original acoustic model is updated rather than rebuilt, its structure is not destroyed, and structures such as the state definition model can continue to be used, ensuring the accuracy of speech recognition.
  • An embodiment of the present invention further discloses an electronic device including at least one processor 810 and a memory 800 communicably connected to the at least one processor 810, wherein the memory 800 stores instructions executable by the at least one processor 810, the instructions being executed by the at least one processor 810 to enable the at least one processor 810 to: acquire training samples, the training samples including acoustic features and training text corresponding to the acoustic features; acquire an original acoustic model, and use the original acoustic model to determine an acoustic state in the original acoustic model corresponding to each training text; determine an acoustic feature corresponding to each acoustic state according to the acoustic state and the acoustic feature corresponding to each training text; retrain a state description model of each acoustic state by using the acoustic feature corresponding to the acoustic state; and update the original state description model in the original acoustic model with the state description model obtained by retraining, to obtain an updated acoustic model.
  • Determining the acoustic feature corresponding to each acoustic state according to the acoustic state and the acoustic feature corresponding to each training text comprises: segmenting the acoustic features according to the acoustic state and the acoustic feature corresponding to each training text to obtain segmentation data, and determining the segmentation data corresponding to each acoustic state. Retraining the state description model of each acoustic state by using the acoustic feature corresponding to the acoustic state comprises: retraining the state description model of the acoustic state by using the segmentation data corresponding to the acoustic state.
  • Acquiring the training sample comprises: acquiring training speech and training text, and extracting acoustic features from the training speech.
  • Determining, by using the original acoustic model, an acoustic state in the original acoustic model corresponding to each training text comprises: determining, by using a state definition model in the original acoustic model, the acoustic state corresponding to each training text.
  • Segmenting the acoustic features according to the acoustic state and the acoustic feature corresponding to each training text to obtain segmentation data, and determining the segmentation data corresponding to each acoustic state, comprises: establishing a linear decoding network using each training text, segmenting the acoustic features corresponding to the training text using the Viterbi algorithm to obtain segmentation data, and determining the segmentation data corresponding to each acoustic state.
  • Embodiments of the present invention also disclose a non-volatile computer storage medium, wherein the storage medium stores computer-executable instructions that, when executed by an electronic device, enable the electronic device to: acquire training samples, the training samples including acoustic features and training text corresponding to the acoustic features; acquire an original acoustic model, and use the original acoustic model to determine an acoustic state in the original acoustic model corresponding to each training text; determine an acoustic feature corresponding to each acoustic state according to the acoustic state and the acoustic feature corresponding to each training text; retrain a state description model of each acoustic state by using the acoustic feature corresponding to the acoustic state; and update the original state description model in the original acoustic model with the state description model obtained by retraining, to obtain an updated acoustic model.
  • Determining the acoustic feature corresponding to each acoustic state according to the acoustic state and the acoustic feature corresponding to each training text comprises: segmenting the acoustic features according to the acoustic state and the acoustic feature corresponding to each training text to obtain segmentation data, and determining the segmentation data corresponding to each acoustic state. Retraining the state description model of each acoustic state by using the acoustic feature corresponding to the acoustic state comprises: retraining the state description model of the acoustic state by using the segmentation data corresponding to the acoustic state.
  • Acquiring the training samples comprises: acquiring training speech and training text, and extracting acoustic features from the training speech.
  • Determining, by using the original acoustic model, an acoustic state in the original acoustic model corresponding to each training text comprises: determining, by using a state definition model in the original acoustic model, the acoustic state corresponding to each training text.
  • Segmenting the acoustic features according to the acoustic state and the acoustic feature corresponding to each training text to obtain segmentation data, and determining the segmentation data corresponding to each acoustic state, comprises: establishing a linear decoding network using each training text, segmenting the acoustic features corresponding to the training text using the Viterbi algorithm to obtain segmentation data, and determining the segmentation data corresponding to each acoustic state.
  • The embodiments of the invention further provide a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform the method described in the above embodiments.
  • embodiments of the present invention can be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or a combination of software and hardware. Moreover, the invention can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.
  • These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Provided are an acoustic model training method and apparatus for speech recognition, and an electronic device. The method comprises: acquiring a training sample, wherein the training sample comprises an acoustic feature and a training text corresponding to the acoustic feature; acquiring an original acoustic model, and using the original acoustic model to determine an acoustic state in the original acoustic model corresponding to each training text; according to an acoustic state and an acoustic feature corresponding to each training text, determining an acoustic feature corresponding to each acoustic state; using the acoustic feature corresponding to each acoustic state to conduct retraining to obtain a state description model of the acoustic state; and using the state description model obtained by retraining to update an original state description model in the original acoustic model, thus obtaining an updated acoustic model. According to the embodiments of the present invention, the complexity of model training is reduced, and by updating the original acoustic model, the structure of the original acoustic model is not destroyed, thus ensuring the accuracy of speech recognition.

Description

Acoustic model training method, apparatus and electronic device for speech recognition
Cross Reference
This application claims priority to Chinese Patent Application No. 201610195612.X, entitled "Acoustic Model Training Method and Apparatus for Speech Recognition", filed with the Chinese Patent Office on March 30, 2016, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of speech recognition technology, and in particular to an acoustic model training method, apparatus and electronic device for speech recognition.
Background
One purpose of a speech recognition system is to convert speech into text; specifically, given a speech signal, to find the text sequence (composed of words or characters) that best matches the signal.
One of the most important components of a speech recognition system is the acoustic model. During recognition, the speech signal is converted into acoustic features, the acoustic model is used to determine the acoustic states corresponding to those features, and the text is obtained by combining the acoustic states.
An acoustic state is a basic unit of word pronunciation, usually referring to a smaller unit obtained by further subdividing a phoneme.
The acoustic state corresponding to an acoustic feature is computed using the state description models in the acoustic model. Each acoustic state corresponds to one state description model, and the state description model is used to identify the acoustic state that best matches a given acoustic feature.
In the prior art, the training process of an acoustic model is very complicated: it includes not only the training of the state description models, but also acoustic feature extraction, acoustic feature transformation, decision tree generation, training of the state definition model, and so on. In practical applications, as the application scenario changes or the life cycle evolves, the original acoustic model may no longer fit the current scenario, which would require training a new acoustic model from scratch. Retraining a whole new acoustic model is not only highly complex; the inventors also found in their research that some structures of the original acoustic model, such as the state definition model, may not need to change at all. Retraining them would destroy the acoustic states defined by the state definition model and instead harm the accuracy of speech recognition.
Summary
The problem to be solved by the embodiments of the present invention is how to reduce the complexity of model training while preserving the structure of the original acoustic model and ensuring the accuracy of speech recognition.
The embodiments of the present invention provide an acoustic model training method and apparatus for speech recognition, to solve the technical problem in the prior art of how to reduce the complexity of acoustic model training while ensuring the accuracy of speech recognition.
An embodiment of the present invention provides an acoustic model training method for speech recognition, comprising:
acquiring a training sample, wherein the training sample comprises acoustic features and training text corresponding to the acoustic features;
acquiring an original acoustic model, and using the original acoustic model to determine the acoustic states in the original acoustic model corresponding to each training text;
determining the acoustic features corresponding to each acoustic state according to the acoustic states and acoustic features corresponding to each training text;
retraining, using the acoustic features corresponding to each acoustic state, a state description model for that acoustic state; and
updating the original state description models in the original acoustic model with the retrained state description models to obtain an updated acoustic model.
An embodiment of the present invention provides an acoustic model training apparatus for speech recognition, comprising:
a sample acquisition module, configured to acquire a training sample, wherein the training sample comprises acoustic features and training text corresponding to the acoustic features;
a first determining module, configured to acquire an original acoustic model and use the original acoustic model to determine the acoustic states in the original acoustic model corresponding to each training text;
a second determining module, configured to determine the acoustic features corresponding to each acoustic state according to the acoustic states and acoustic features corresponding to each training text;
a training module, configured to retrain, using the acoustic features corresponding to each acoustic state, a state description model for that acoustic state; and
an update module, configured to update the original state description models in the original acoustic model with the retrained state description models to obtain an updated acoustic model.
An embodiment of the present invention further discloses an electronic device, comprising at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to: acquire a training sample, the training sample comprising acoustic features and training text corresponding to the acoustic features; acquire an original acoustic model, and use the original acoustic model to determine the acoustic states in the original acoustic model corresponding to each training text; determine the acoustic features corresponding to each acoustic state according to the acoustic states and acoustic features corresponding to each training text; retrain, using the acoustic features corresponding to each acoustic state, a state description model for that acoustic state; and update the original state description models in the original acoustic model with the retrained state description models to obtain an updated acoustic model.
The present invention further discloses a non-volatile computer storage medium, wherein the storage medium stores computer-executable instructions for causing a computer to perform the method of claims 1-5.
An embodiment of the present invention further provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform the method of claims 1-5.
In the acoustic model training method and apparatus for speech recognition provided by the embodiments of the present invention, the original acoustic model is used to determine the acoustic states corresponding to the training text in the training sample, and the acoustic features corresponding to each acoustic state can be determined according to the acoustic states and acoustic features corresponding to each training text. The state description model of each acoustic state can thus be retrained directly from the acoustic features corresponding to that state, and the retrained state description models are used to update the original state description models in the original acoustic model, yielding the updated acoustic model. With the embodiments of the present invention, only the state description models in the original acoustic model need to be retrained, rather than training a completely new acoustic model. This reduces training complexity, and because the original acoustic model is merely updated, its structure is not destroyed, which ensures the accuracy of speech recognition.
Brief Description of the Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of an embodiment of an acoustic model training method for speech recognition according to the present invention;
FIG. 2 is a schematic structural diagram of an embodiment of an acoustic model training apparatus for speech recognition according to the present invention;
FIG. 3 is a schematic diagram of the hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention are clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of the present invention.
In the description of the present invention, it should be noted that the orientation or positional relationships indicated by terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner" and "outer" are based on the orientations or positional relationships shown in the drawings, and are used only for convenience and simplicity of description, rather than indicating or implying that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation; they are therefore not to be construed as limiting the present invention. Moreover, the terms "first", "second" and "third" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless otherwise explicitly specified and defined, the terms "install", "connect" and "couple" are to be understood broadly: a connection may be fixed, detachable, or integral; mechanical or electrical; direct, or indirect through an intermediate medium; it may be internal communication between two elements, and it may be wireless or wired. Those of ordinary skill in the art can understand the specific meanings of the above terms in the present invention according to the specific circumstances.
The technical solution of the present invention is mainly applied to the training of acoustic models in the field of speech recognition. The acoustic model is one of the most important components of a speech recognition system: during recognition, it is used to determine the acoustic states corresponding to the acoustic features, and the text is obtained by combining those acoustic states. The acoustic features are extracted from the speech signal and may be, for example, MFCC (Mel Frequency Cepstrum Coefficient) features.
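As an illustration of this feature-extraction step, the sketch below frames a waveform and computes a simple log-spectral vector per frame. It is a simplified stand-in for a real MFCC front end (which would add a mel filterbank and a DCT step); all function and parameter names are illustrative and do not come from the patent.

```python
import numpy as np

def extract_frame_features(signal, sample_rate=16000,
                           frame_len=400, frame_shift=160, n_coeffs=13):
    """Split a waveform into overlapping windowed frames and compute a
    simple log-spectral feature vector per frame. A simplified stand-in
    for MFCC: a real front end would add a mel filterbank and a DCT."""
    n_frames = max(0, 1 + (len(signal) - frame_len) // frame_shift)
    window = np.hamming(frame_len)
    features = np.empty((n_frames, n_coeffs))
    for i in range(n_frames):
        frame = signal[i * frame_shift: i * frame_shift + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        # Log-compress and keep the first n_coeffs bins as the feature vector.
        features[i] = np.log(spectrum[:n_coeffs] + 1e-10)
    return features

# One second of a 440 Hz tone at 16 kHz yields one 13-dim vector per frame.
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feats = extract_frame_features(tone)
print(feats.shape)  # (98, 13)
```

With a 25 ms window (400 samples) and 10 ms shift (160 samples), one second of audio produces 98 frames, matching the frame-level granularity at which the acoustic states discussed below are assigned.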
An acoustic model is built with a model such as a hidden Markov model, and a large number of training samples are needed to train the model and thereby obtain the acoustic model.
The acoustic model includes state description models corresponding to multiple acoustic states. These models are used to compute the probability of an acoustic feature under each acoustic state, so that the acoustic state best matching the feature can be determined; combining the acoustic states then yields the text.
Acoustic model training includes not only the training of the state description models but also the training of the state definition model. The state definition model gives, for a word or a phoneme in a specific context, the state sequence it should correspond to; a state description model gives the probability of an acoustic feature under a particular acoustic state, so that the acoustic state best matching the feature can be determined.
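A minimal sketch of how a state description model scores an acoustic feature and how the best-matching state is chosen, assuming one diagonal Gaussian per state (real systems use GMMs or DNNs, as discussed later); the class and function names are hypothetical.

```python
import numpy as np

class GaussianStateModel:
    """A minimal state description model: one diagonal Gaussian per
    acoustic state. It answers: how likely is this feature vector
    under this acoustic state?"""
    def __init__(self, mean, var):
        self.mean = np.asarray(mean, dtype=float)
        self.var = np.asarray(var, dtype=float)

    def log_likelihood(self, feature):
        # log N(feature; mean, diag(var))
        diff = feature - self.mean
        return -0.5 * np.sum(np.log(2 * np.pi * self.var) + diff ** 2 / self.var)

def best_matching_state(feature, state_models):
    """Score a feature vector under every state's description model and
    return the acoustic state with the highest log-likelihood."""
    scores = {s: m.log_likelihood(feature) for s, m in state_models.items()}
    return max(scores, key=scores.get)

models = {
    "s1": GaussianStateModel(mean=[0.0, 0.0], var=[1.0, 1.0]),
    "s2": GaussianStateModel(mean=[5.0, 5.0], var=[1.0, 1.0]),
}
print(best_matching_state(np.array([4.8, 5.1]), models))  # s2
```

The state definition model, by contrast, only decides which state sequence a word or phoneme maps to; it performs no scoring of this kind.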
If the original acoustic model no longer fits the current application scenario, a new acoustic model needs to be retrained. However, retraining a new acoustic model is highly complex, and the inventors found in their research that in some application scenarios it is undesirable to change the state definition model; only the state description models should change. For example, if the state definition model is changed, the decoding graph used during recognition must be rebuilt, which is undesirable, and the original training data used to train the original model may be unavailable for various reasons (for example, because it is confidential). Moreover, if a new acoustic model is retrained, its recognition accuracy may even be lower than that of the original acoustic model.
Therefore, the present invention proposes a technical solution that updates the original acoustic model without destroying its structure, while ensuring the accuracy of speech recognition.
In the embodiments of the present invention, a training sample and the original acoustic model are acquired; the original acoustic model is used to determine the acoustic states corresponding to the training text in the training sample; and the acoustic features corresponding to each acoustic state can be determined according to the acoustic states and acoustic features corresponding to each training text. The state description model of each acoustic state is then retrained directly from the acoustic features corresponding to that state, and the retrained state description models are used to update the original state description models in the original acoustic model, yielding the updated acoustic model. The updated acoustic model can then continue to be used for speech recognition. With the embodiments of the present invention, only the state description models in the original acoustic model need to be retrained, rather than training a completely new acoustic model, which reduces training complexity; and because the original acoustic model is merely updated, its structure is not destroyed, which ensures the accuracy of speech recognition.
The technical solution of the present invention is described in detail below with reference to the accompanying drawings.
Embodiment 1
FIG. 1 is a flowchart of an embodiment of an acoustic model training method for speech recognition provided by an embodiment of the present invention. The method may include the following steps.
101: Acquire a training sample.
The training sample includes acoustic features and the training text corresponding to the acoustic features.
The acoustic features in the training sample may be extracted from training speech.
Specifically, training speech and the training text corresponding to the training speech are acquired, the acoustic features of the training speech are extracted, and the acoustic features together with their corresponding training text are used as the training sample.
In the embodiment of the present invention, since there is no need to retrain a new acoustic model, the number of training samples can be greatly reduced.
The training sample may be determined according to the application scenario to which the original acoustic model no longer applies: because the scenario no longer fits, the acoustic model cannot recognize the correct text for that scenario, so retraining is needed.
102: Acquire an original acoustic model, and use the original acoustic model to determine the acoustic states in the original acoustic model corresponding to each training text.
Specifically, the state definition model of the original acoustic model may be used to determine the acoustic states corresponding to each training text. Each training text corresponds to multiple acoustic states, that is, to an acoustic state sequence.
The state definition model is used to determine, for a word or a phoneme in a specific context, its corresponding acoustic state sequence.
A state description model is used to determine, given an acoustic feature, the probability of that feature under a certain acoustic state, so that the acoustic state sequence best matching the acoustic features can be determined.
An acoustic state is a basic unit of word pronunciation and may refer to a unit smaller than a phoneme, obtained by further subdividing phonemes. Combining acoustic states yields phonemes, and combining phonemes yields text.
Training text consists of words, so the acoustic state sequence in the original acoustic model corresponding to each training text can be obtained.
103: Determine the acoustic features corresponding to each acoustic state according to the acoustic states and acoustic features corresponding to each training text.
According to the correspondence between the training text and the acoustic features, the acoustic states corresponding to the training text, and hence the acoustic states corresponding to the acoustic features, can be obtained.
As a further embodiment, specifically, the acoustic features may be segmented according to the acoustic states and acoustic features corresponding to each training text, to obtain segment data;
then the segment data corresponding to each acoustic state is determined.
That is, each acoustic state corresponds to specific segment data within the acoustic features.
Specifically, the acoustic features may be forcibly aligned (forced alignment) using the original acoustic model: a linear decoding network is built from the training text, and the Viterbi algorithm is used to segment the acoustic features corresponding to the training text, obtaining segment data, so that the segment data corresponding to each acoustic state can be determined.
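The forced alignment described above can be sketched as a Viterbi search over the linear decoding network built from the training text: at each frame the path either stays in the current state or advances to the next one, so every frame is assigned to exactly one state. This is a simplified illustration under the assumption that per-frame log-likelihoods are already available from the state description models; the function name is hypothetical.

```python
import numpy as np

def forced_align(log_likes):
    """Viterbi forced alignment over a linear state sequence.

    log_likes[t][j] is the log-likelihood of frame t under the j-th state
    of the linear network built from the training text. Returns, for every
    frame, the index of the state it is aligned to."""
    T, S = log_likes.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_likes[0, 0]          # the path must start in state 0
    for t in range(1, T):
        for j in range(S):
            stay = score[t - 1, j]
            move = score[t - 1, j - 1] if j > 0 else -np.inf
            if move > stay:                # advance from the previous state
                score[t, j], back[t, j] = move + log_likes[t, j], j - 1
            else:                          # remain in the same state
                score[t, j], back[t, j] = stay + log_likes[t, j], j
    # Backtrack from the last state at the last frame.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# 5 frames, 2 states: frames 0-2 favour state 0, frames 3-4 favour state 1.
ll = np.log(np.array([[0.9, 0.1], [0.9, 0.1], [0.8, 0.2],
                      [0.2, 0.8], [0.1, 0.9]]))
print(forced_align(ll))  # [0, 0, 0, 1, 1]
```

Grouping the frames by their aligned state index yields exactly the per-state segment data that step 104 retrains on.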
104: Retrain, using the acoustic features corresponding to each acoustic state, a state description model for that acoustic state.
Specifically, the state description model of each acoustic state is retrained using the segment data corresponding to that acoustic state.
105: Update the original state description models in the original acoustic model with the retrained state description models to obtain an updated acoustic model.
The retrained state description models replace the original state description models in the original acoustic model, while the other structures of the original acoustic model remain unchanged, yielding the updated acoustic model. The updated acoustic model is thus obtained by retraining on the training samples; since the training samples come from the target application scenario, the retrained model is suited to speech recognition in that scenario.
In this embodiment, only the state description models in the original acoustic model are retrained, without training a completely new acoustic model. This reduces training complexity; and because the original acoustic model is merely updated, its structure is not destroyed, so structures such as the state definition model of the original acoustic model can continue to be used, ensuring the accuracy of speech recognition.
In the embodiments of the present invention, when the original training data has been lost or it is undesirable to change structures such as the state definition model in the original acoustic model, an acoustic model can be obtained by retraining only the state description models in the original acoustic model. This reduces training complexity while making the acoustic model fit the current application scenario.
The state description models may be obtained by training deep neural networks (DNN, Deep Neural Networks), specifically with the back propagation algorithm. Of course, other mathematical models may also be used; for example, with a Gaussian mixture model (GMM, Gaussian Mixture Model), the expectation maximization (EM, Expectation Maximization Algorithm) algorithm is used. The choice may be made according to the actual situation, and the present invention is not limited in this respect.
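A minimal sketch of the retrain-and-swap procedure of steps 104-105, assuming each state description model is a single diagonal Gaussian fitted in closed form (the one-component special case of GMM/EM training, used here as a simplified stand-in for GMM or DNN models); all names and the (mean, variance) model representation are illustrative.

```python
import numpy as np

def retrain_state_models(state_to_frames, original_models):
    """Retrain a state description model per acoustic state from the
    segment data aligned to it, then swap the retrained models into a
    copy of the original model inventory. States that received no
    aligned frames keep their original model, and no other structure
    of the original acoustic model is touched."""
    updated = dict(original_models)             # everything else unchanged
    for state, frames in state_to_frames.items():
        X = np.asarray(frames, dtype=float)
        mean = X.mean(axis=0)                   # closed-form MLE
        var = X.var(axis=0) + 1e-6              # floor avoids zero variance
        updated[state] = (mean, var)
    return updated

original = {"s1": (np.zeros(2), np.ones(2)), "s2": (np.zeros(2), np.ones(2))}
aligned = {"s1": [[0.9, 1.1], [1.1, 0.9]]}      # only s1 got new segment data
new_models = retrain_state_models(aligned, original)
print(new_models["s1"][0])  # [1. 1.]  -- s1 re-estimated from its segments
print(new_models["s2"][0])  # [0. 0.]  -- s2 kept from the original model
```

Because only the entries for the retrained states are replaced, the state definition model and the decoding graph built on it remain valid, which is the point of steps 104-105.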
Embodiment 2
FIG. 2 is a schematic structural diagram of an embodiment of an acoustic model training apparatus for speech recognition provided by an embodiment of the present invention. The apparatus may include:
a sample acquisition module 201, configured to acquire a training sample.
The training sample includes acoustic features and the training text corresponding to the acoustic features.
The sample acquisition module may specifically acquire training speech and training text, and extract the acoustic features of the training speech.
That is, training speech and the training text corresponding to the training speech are acquired, the acoustic features of the training speech are extracted, and the acoustic features together with their corresponding training text are used as the training sample.
In the embodiment of the present invention, since there is no need to retrain a new acoustic model, the number of training samples can be greatly reduced.
The training sample may be determined according to the application scenario to which the original acoustic model no longer applies: because the scenario no longer fits, the acoustic model cannot recognize the correct text for that scenario, so retraining is needed.
a first determining module 202, configured to acquire an original acoustic model and use the original acoustic model to determine the acoustic states in the original acoustic model corresponding to each training text.
Specifically, the first determining module may use the state definition model of the original acoustic model to determine the acoustic states corresponding to each training text. Each training text corresponds to multiple acoustic states, that is, to an acoustic state sequence.
The state definition model is used to determine, for a word or a phoneme in a specific context, its corresponding acoustic state sequence.
A state description model is used to determine, given an acoustic feature, the probability of that feature under a certain acoustic state, so that the acoustic state sequence best matching the acoustic features can be determined.
An acoustic state is a basic unit of word pronunciation and may refer to a unit smaller than a phoneme, obtained by further subdividing phonemes. Combining acoustic states yields phonemes, and combining phonemes yields text.
Training text consists of words, so the acoustic state sequence in the original acoustic model corresponding to each training text can be obtained.
a second determining module 203, configured to determine the acoustic features corresponding to each acoustic state according to the acoustic states and acoustic features corresponding to each training text.
According to the correspondence between the training text and the acoustic features, the acoustic states corresponding to the training text, and hence the acoustic states corresponding to the acoustic features, can be obtained.
As a further embodiment, the second determining module may be specifically configured to segment the acoustic features according to the acoustic states and acoustic features corresponding to each training text, obtain segment data, and determine the segment data corresponding to each acoustic state.
That is, each acoustic state corresponds to specific segment data within the acoustic features.
Specifically, the acoustic features may be forcibly aligned (forced alignment) using the original acoustic model: a linear decoding network is built from the training text, and the Viterbi algorithm is used to segment the acoustic features corresponding to the training text, obtaining segment data, so that the segment data corresponding to each acoustic state can be determined.
Therefore, as a further embodiment, the second determining module builds a linear decoding network from each training text, uses the Viterbi algorithm to segment the acoustic features corresponding to the training text into segment data, and determines the segment data corresponding to each acoustic state.
The training module 204 is configured to retrain, using the acoustic features corresponding to each acoustic state, a state description model of that acoustic state.
Specifically, the state description model of each acoustic state is retrained using the segment data corresponding to that acoustic state.
The updating module 205 is configured to update the original state description model in the original acoustic model with the retrained state description model, obtaining an updated acoustic model.
The retrained state description model replaces the original state description model in the original acoustic model, while the other structures of the original acoustic model remain unchanged, thereby yielding the updated acoustic model. The updated acoustic model is thus retrained from the training samples; because the training samples come from the target application scenario, the retrained acoustic model is suited to speech recognition in that scenario.
In this embodiment, only the state description model within the original acoustic model needs to be retrained, rather than an entirely new acoustic model. This reduces training complexity; and because the original acoustic model is updated without destroying its structure, components such as its state definition model can continue to be used, preserving speech recognition accuracy.
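As a non-limiting sketch of this update step, the acoustic model can be treated as a mapping from component names to components, so that only the per-state description models are swapped while the state definition model and the rest of the structure are left intact. The component names below are purely illustrative, not from any particular toolkit:

```python
def update_acoustic_model(original_model, retrained_state_models):
    """Replace only the per-state description models of an acoustic model.

    original_model: dict with illustrative keys such as "state_definition",
    "transitions", and "state_models" (state id -> description model).
    retrained_state_models: dict of state id -> retrained description model.
    Everything other than "state_models" is carried over unchanged.
    """
    updated = dict(original_model)                       # shallow copy: other parts unchanged
    state_models = dict(original_model["state_models"])  # copy before editing
    state_models.update(retrained_state_models)          # swap in the retrained models
    updated["state_models"] = state_models
    return updated
```

Because the state definition model is reused verbatim, the updated model keeps the same state inventory as the original, which is what allows the retrained model to slot into the existing recognizer.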
Embodiment 3
As shown in FIG. 3, an embodiment of the present invention further discloses an electronic device comprising at least one processor 810, and a memory 800 communicatively connected to the at least one processor 810. The memory 800 stores instructions executable by the at least one processor 810; when executed by the at least one processor 810, the instructions enable the at least one processor 810 to: acquire training samples, the training samples comprising acoustic features and the training texts corresponding to the acoustic features; acquire an original acoustic model and, using the original acoustic model, determine the acoustic state in the original acoustic model corresponding to each training text; determine, according to the acoustic state and the acoustic features corresponding to each training text, the acoustic features corresponding to each acoustic state; retrain, using the acoustic features corresponding to each acoustic state, a state description model of that acoustic state; and update the original state description model in the original acoustic model with the retrained state description model, obtaining an updated acoustic model. The electronic device further comprises an input device 830 and an output device 840 electrically connected, preferably via a bus, to the memory 800 and the processor.
In the electronic device of this embodiment, preferably, determining the acoustic features corresponding to each acoustic state according to the acoustic state and the acoustic features corresponding to each training text comprises: segmenting the acoustic features according to the acoustic state and the acoustic features corresponding to each training text to obtain segment data, and determining the segment data corresponding to each acoustic state; and retraining a state description model of each acoustic state using the acoustic features corresponding to that acoustic state comprises: retraining the state description model of the acoustic state using the segment data corresponding to that acoustic state.
In the electronic device of this embodiment, preferably, acquiring training samples comprises: acquiring training speech and training texts, and extracting the acoustic features of the training speech.
In the electronic device of this embodiment, preferably, determining, using the original acoustic model, the acoustic state in the original acoustic model corresponding to each training text comprises: determining the acoustic state corresponding to each training text using the state definition model in the original acoustic model.
In the electronic device of this embodiment, preferably, segmenting the acoustic features according to the acoustic state and the acoustic features corresponding to each training text to obtain segment data, and determining the segment data corresponding to each acoustic state, comprise: building a linear decoding network from each training text, segmenting the acoustic features corresponding to the training text using the Viterbi algorithm to obtain segment data, and determining the segment data corresponding to each acoustic state.
Embodiment 4
An embodiment of the present invention further discloses a non-volatile computer storage medium storing computer-executable instructions which, when executed by an electronic device, enable the electronic device to: acquire training samples, the training samples comprising acoustic features and the training texts corresponding to the acoustic features; acquire an original acoustic model and, using the original acoustic model, determine the acoustic state in the original acoustic model corresponding to each training text; determine, according to the acoustic state and the acoustic features corresponding to each training text, the acoustic features corresponding to each acoustic state; retrain, using the acoustic features corresponding to each acoustic state, a state description model of that acoustic state; and update the original state description model in the original acoustic model with the retrained state description model, obtaining an updated acoustic model.
In the storage medium of this embodiment, preferably, determining the acoustic features corresponding to each acoustic state according to the acoustic state and the acoustic features corresponding to each training text comprises: segmenting the acoustic features according to the acoustic state and the acoustic features corresponding to each training text to obtain segment data, and determining the segment data corresponding to each acoustic state; and retraining a state description model of each acoustic state using the acoustic features corresponding to that acoustic state comprises: retraining the state description model of the acoustic state using the segment data corresponding to that acoustic state.
In the storage medium of this embodiment, preferably, acquiring training samples comprises: acquiring training speech and training texts, and extracting the acoustic features of the training speech.
In the storage medium of this embodiment, preferably, determining, using the original acoustic model, the acoustic state in the original acoustic model corresponding to each training text comprises: determining the acoustic state corresponding to each training text using the state definition model in the original acoustic model.
In the storage medium of this embodiment, preferably, segmenting the acoustic features according to the acoustic state and the acoustic features corresponding to each training text to obtain segment data, and determining the segment data corresponding to each acoustic state, comprise: building a linear decoding network from each training text, segmenting the acoustic features corresponding to the training text using the Viterbi algorithm to obtain segment data, and determining the segment data corresponding to each acoustic state.
Embodiment 5
An embodiment of the present invention further provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method described in the embodiments above.
In the embodiments of the present invention, when the original training data have been lost and it is undesirable to alter structures such as the state definition model in the original acoustic model, an acoustic model can be obtained by retraining only the state description model within the original acoustic model. This reduces training complexity while making the acoustic model suitable for the current application scenario.
The state description model may be obtained by training a deep neural network (DNN), for example with the back-propagation algorithm; other mathematical models may also be used, such as a Gaussian mixture model (GMM) trained with the expectation-maximization (EM) algorithm. The choice may be made according to the actual situation.
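Purely as an illustration of the GMM/EM option, and assuming one-dimensional acoustic features for brevity (real systems use multivariate features and one model per state), a state description model for a single acoustic state could be retrained from that state's segment data roughly as follows; the function name and initialization scheme are hypothetical:

```python
import numpy as np

def fit_gmm_em(x, k=2, iters=50):
    """Fit a 1-D Gaussian mixture to one acoustic state's segment data with EM.

    x: (N,) array of feature values aligned to one acoustic state.
    Returns (weights, means, variances) of the k mixture components.
    """
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    # Deterministic initialization: spread component means across the data range.
    w = np.full(k, 1.0 / k)
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))
    var = np.full(k, x.var() + 1e-6)
    for _ in range(iters):
        # E-step: responsibility of each component for each frame (log domain).
        logp = (-0.5 * (x[:, None] - mu) ** 2 / var
                - 0.5 * np.log(2.0 * np.pi * var)
                + np.log(w))
        logp -= logp.max(axis=1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixture weights, means, and variances.
        nk = resp.sum(axis=0)
        w = nk / n
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return w, mu, var
```

Running this once per acoustic state, on that state's segment data, yields the retrained state description models that replace the originals in the acoustic model.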
Those skilled in the art will appreciate that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The embodiments above are merely examples given for clarity of description and do not limit the implementations. Those of ordinary skill in the art may make other variations or modifications on the basis of the above description; it is neither necessary nor possible to enumerate all implementations here. Obvious variations or modifications derived therefrom remain within the protection scope of the present invention.

Claims (13)

  1. An acoustic model training method for speech recognition, characterized by comprising:
    acquiring training samples, the training samples comprising acoustic features and training texts corresponding to the acoustic features;
    acquiring an original acoustic model and, using the original acoustic model, determining the acoustic state in the original acoustic model corresponding to each training text;
    determining, according to the acoustic state and the acoustic features corresponding to each training text, the acoustic features corresponding to each acoustic state;
    retraining, using the acoustic features corresponding to each acoustic state, a state description model of the acoustic state; and
    updating the original state description model in the original acoustic model with the retrained state description model, to obtain an updated acoustic model.
  2. The method according to claim 1, characterized in that determining the acoustic features corresponding to each acoustic state according to the acoustic state and the acoustic features corresponding to each training text comprises:
    segmenting the acoustic features according to the acoustic state and the acoustic features corresponding to each training text, to obtain segment data; and
    determining the segment data corresponding to each acoustic state;
    and in that retraining a state description model of each acoustic state using the acoustic features corresponding to the acoustic state comprises:
    retraining the state description model of the acoustic state using the segment data corresponding to the acoustic state.
  3. The method according to claim 1, characterized in that acquiring training samples comprises:
    acquiring training speech and training texts, and extracting the acoustic features of the training speech.
  4. The method according to claim 1, characterized in that determining, using the original acoustic model, the acoustic state in the original acoustic model corresponding to each training text comprises:
    determining the acoustic state corresponding to each training text using the state definition model in the original acoustic model.
  5. The method according to claim 2, characterized in that segmenting the acoustic features according to the acoustic state and the acoustic features corresponding to each training text to obtain segment data, and determining the segment data corresponding to each acoustic state, comprise:
    building a linear decoding network from each training text, segmenting the acoustic features corresponding to the training text using the Viterbi algorithm to obtain segment data, and determining the segment data corresponding to each acoustic state.
  6. An acoustic model training apparatus for speech recognition, characterized by comprising:
    a sample acquisition module configured to acquire training samples, the training samples comprising acoustic features and training texts corresponding to the acoustic features;
    a first determining module configured to acquire an original acoustic model and, using the original acoustic model, determine the acoustic state in the original acoustic model corresponding to each training text;
    a second determining module configured to determine, according to the acoustic state and the acoustic features corresponding to each training text, the acoustic features corresponding to each acoustic state;
    a training module configured to retrain, using the acoustic features corresponding to each acoustic state, a state description model of the acoustic state; and
    an updating module configured to update the original state description model in the original acoustic model with the retrained state description model, to obtain an updated acoustic model.
  7. The apparatus according to claim 6, characterized in that the second determining module is specifically configured to:
    segment the acoustic features according to the acoustic state and the acoustic features corresponding to each training text, obtain segment data, and determine the segment data corresponding to each acoustic state;
    and in that the training module is specifically configured to:
    retrain the state description model of each acoustic state using the segment data corresponding to the acoustic state.
  8. The apparatus according to claim 6, characterized in that the sample acquisition module is specifically configured to:
    acquire training speech and training texts, and extract the acoustic features of the training speech.
  9. The apparatus according to claim 6, characterized in that the first determining module is specifically configured to:
    determine the acoustic state corresponding to each training text using the state definition model in the original acoustic model.
  10. The apparatus according to claim 7, characterized in that the second determining module is specifically configured to:
    build a linear decoding network from each training text, segment the acoustic features corresponding to the training text using the Viterbi algorithm to obtain segment data, and determine the segment data corresponding to each acoustic state.
  11. An electronic device, characterized by comprising at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to:
    acquire training samples, the training samples comprising acoustic features and training texts corresponding to the acoustic features;
    acquire an original acoustic model and, using the original acoustic model, determine the acoustic state in the original acoustic model corresponding to each training text;
    determine, according to the acoustic state and the acoustic features corresponding to each training text, the acoustic features corresponding to each acoustic state;
    retrain, using the acoustic features corresponding to each acoustic state, a state description model of the acoustic state; and
    update the original state description model in the original acoustic model with the retrained state description model, to obtain an updated acoustic model.
  12. A non-volatile computer storage medium, characterized in that the storage medium stores computer-executable instructions for causing a computer to perform the method according to any one of claims 1 to 5.
  13. A computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions, characterized in that, when executed by a computer, the program instructions cause the computer to perform the method according to any one of claims 1 to 5.
PCT/CN2016/096672 2016-03-30 2016-08-25 Acoustic model training method and apparatus for speech recognition, and electronic device WO2017166625A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610195612.X 2016-03-30
CN201610195612.XA CN105845130A (en) 2016-03-30 2016-03-30 Acoustic model training method and device for speech recognition

Publications (1)

Publication Number Publication Date
WO2017166625A1 true WO2017166625A1 (en) 2017-10-05

Family

ID=56596355

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/096672 WO2017166625A1 (en) 2016-03-30 2016-08-25 Acoustic model training method and apparatus for speech recognition, and electronic device

Country Status (2)

Country Link
CN (1) CN105845130A (en)
WO (1) WO2017166625A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112466293A (en) * 2020-11-13 2021-03-09 广州视源电子科技股份有限公司 Decoding graph optimization method, decoding graph optimization device and storage medium
CN114420087A (en) * 2021-12-27 2022-04-29 北京百度网讯科技有限公司 Methods, devices, equipment, media and products for determining acoustic characteristics

Families Citing this family (5)

Publication number Priority date Publication date Assignee Title
CN105845130A (en) * 2016-03-30 2016-08-10 乐视控股(北京)有限公司 Acoustic model training method and device for speech recognition
CN109308895B (en) * 2018-12-04 2019-12-27 百度在线网络技术(北京)有限公司 Acoustic model training method, device, equipment and computer readable medium
CN110827799B (en) * 2019-11-21 2022-06-10 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for processing voice signal
CN111179916B (en) * 2019-12-31 2023-10-13 广州市百果园信息技术有限公司 Training method for re-scoring model, voice recognition method and related device
CN112489637B (en) * 2020-11-03 2024-03-26 北京百度网讯科技有限公司 Speech recognition method and device

Citations (5)

Publication number Priority date Publication date Assignee Title
US20060173673A1 (en) * 2005-02-02 2006-08-03 Samsung Electronics Co., Ltd. Speech recognition method and apparatus using lexicon group tree
CN103065626A (en) * 2012-12-20 2013-04-24 中国科学院声学研究所 Automatic grading method and automatic grading equipment for read questions in test of spoken English
CN103632667A (en) * 2013-11-25 2014-03-12 华为技术有限公司 Acoustic model optimization method and device, voice awakening method and device, as well as terminal
CN103971678A (en) * 2013-01-29 2014-08-06 腾讯科技(深圳)有限公司 Method and device for detecting keywords
CN105845130A (en) * 2016-03-30 2016-08-10 乐视控股(北京)有限公司 Acoustic model training method and device for speech recognition

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
CN105244029B (en) * 2015-08-28 2019-02-26 安徽科大讯飞医疗信息技术有限公司 Voice recognition post-processing method and system
CN105185372B (en) * 2015-10-20 2017-03-22 百度在线网络技术(北京)有限公司 Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device


Also Published As

Publication number Publication date
CN105845130A (en) 2016-08-10


Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16896395

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 16896395

Country of ref document: EP

Kind code of ref document: A1