
CN107644638B - Audio recognition method, device, terminal and computer readable storage medium - Google Patents


Info

Publication number
CN107644638B
Authority
CN
China
Prior art keywords
phoneme sequence
speech
acoustic
decoding network
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710964474.1A
Other languages
Chinese (zh)
Other versions
CN107644638A (en
Inventor
何金来
雷宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Rubu Technology Co ltd
Original Assignee
Beijing Intelligent Housekeeper Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Intelligent Housekeeper Technology Co Ltd filed Critical Beijing Intelligent Housekeeper Technology Co Ltd
Priority to CN201710964474.1A priority Critical patent/CN107644638B/en
Publication of CN107644638A publication Critical patent/CN107644638A/en
Application granted granted Critical
Publication of CN107644638B publication Critical patent/CN107644638B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a speech recognition method comprising: calculating, according to the acoustic features of collected speech, the acoustic similarity probability between the speech and the phoneme sequences in a decoding network, wherein the decoding network includes multiple groups of phoneme sequences and each group corresponds either to the content of a preset command word or to noise content; obtaining, according to the acoustic similarity probability, the matching probability between the speech and each phoneme sequence; and recognizing the speech as the content corresponding to the phoneme sequence with the highest matching probability. Correspondingly, the invention also discloses a speech recognition device, a terminal and a computer-readable storage medium. The invention avoids recognizing noise as a command word and requires no confidence calculation after speech recognition, thereby reducing the misrecognition rate.

Description

Speech recognition method, device, terminal and computer-readable storage medium
Technical field
Embodiments of the present invention relate to speech recognition technology, and in particular to a speech recognition method, device, terminal and computer-readable storage medium.
Background
In voice command word recognition, misrecognition has always been a difficult problem. The misrecognition rate of command word recognition is relatively high because prior-art methods are generally implemented by constructing a decoding network that contains multiple groups of phoneme sequences corresponding to the preset command words. For any input speech, the best-matching phoneme sequence is searched out of the decoding network, so even non-command speech is mapped to some command word, which causes misrecognition.
At present, the problem of noise being recognized as a command word is addressed by calculating a confidence for the recognition result: a confidence above a preset threshold indicates that the recognition is correct, and a confidence below the threshold indicates that no command word was recognized. Because the confidence calculation depends on many factors, and on the environment in particular, the confidence value varies over a very wide range. In noisy environments, a correct recognition result often receives a very low confidence while a wrong one receives a very high confidence, so the misrecognition rate remains high.
Summary of the invention
The present invention provides a speech recognition method, device, terminal and computer-readable storage medium for voice commands, so as to avoid recognizing noise as a command word without calculating a confidence after speech recognition, thereby reducing the misrecognition rate.
In a first aspect, an embodiment of the present invention provides a speech recognition method, comprising:
calculating, according to the acoustic features of collected speech, the acoustic similarity probability between the speech and the phoneme sequences in a decoding network, wherein the decoding network includes multiple groups of phoneme sequences and each group of phoneme sequences corresponds to the content of a preset command word or to noise content;
obtaining, according to the acoustic similarity probability, the matching probability between the speech and the phoneme sequence;
recognizing the speech as the content corresponding to the phoneme sequence with the highest matching probability.
In a second aspect, the present invention also provides a speech recognition device, comprising:
a computing module, configured to calculate, according to the acoustic features of collected speech, the acoustic similarity probability between the speech and the phoneme sequences in a decoding network, wherein the decoding network includes multiple groups of phoneme sequences and each group of phoneme sequences corresponds to the content of a preset command word or to noise content;
a matching module, configured to obtain, according to the acoustic similarity probability, the matching probability between the speech and the phoneme sequence;
a recognition module, configured to recognize the speech as the content corresponding to the phoneme sequence with the highest matching probability.
In a third aspect, the present invention also provides a terminal, the terminal comprising:
one or more processors;
a memory for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the speech recognition method provided by any embodiment of the present invention.
In a fourth aspect, the present invention also provides a computer-readable storage medium on which a computer program is stored, the program implementing the speech recognition method provided by any embodiment of the present invention when executed by a processor.
By adding a phoneme sequence corresponding to noise content to the decoding network, the present invention allows collected speech to be recognized as noise or as a command word at the moment the best-matching phoneme sequence is searched in the decoding network, with no confidence calculation on the search result afterwards. This solves the prior-art problem of a high misrecognition rate caused by confidence calculations that are affected by the environment, avoids recognizing noise as a command word, and reduces the misrecognition rate.
Brief description of the drawings
Fig. 1 is a flowchart of the speech recognition method provided by Embodiment one of the present invention;
Fig. 2 is a flowchart of the speech recognition method provided by Embodiment two of the present invention;
Fig. 3 is a structural schematic diagram of the speech recognition device provided by Embodiment three of the present invention;
Fig. 4 is a structural schematic diagram of the terminal provided by Embodiment four of the present invention.
Detailed description
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention rather than the entire structure.
Embodiment one
Fig. 1 is a flowchart of the speech recognition method provided by Embodiment one of the present invention. This embodiment is applicable to command word recognition, and the method can be executed by a speech recognition device. It specifically comprises the following steps:
Step 110: according to the acoustic features of the collected speech, calculate the acoustic similarity probability between the speech and the phoneme sequences in the decoding network.
The decoding network includes multiple groups of phoneme sequences, and each group of phoneme sequences corresponds to the content of a preset command word or to noise content. Because the embodiment of the present invention is applied to recognizing voice commands, any non-command speech interferes with command word recognition and is therefore noise; "noise" in the embodiments of the present invention thus refers to any non-command speech. Specifically, the decoding network may be formed by interconnecting multiple phoneme nodes, and the phoneme nodes connected in series within the network form a phoneme sequence. In the field of speech recognition, the acoustic similarity probability between a phoneme of the speech and a phoneme in the decoding network is generally obtained by building an acoustic model for the phoneme in the decoding network; the acoustic similarity probability is the probability output by that acoustic model when the acoustic features of the speech are given as input.
Step 120: according to the acoustic similarity probability, obtain the matching probability between the speech and the phoneme sequence.
To simplify data processing during recognition, the acoustic similarity probability can be used directly as the matching probability. In scenarios with demanding recognition requirements, such as a speech recognition method with a high recognition rate, the matching probability may contain other information in addition to the acoustic similarity probability. For example, for a decoding network constructed with a weighted finite-state transducer, the matching probability also includes the weight of the phoneme sequence. This weight can be related to the probability that the phoneme sequence occurs in practical use, that is, the language model probability. In a command word recognition scenario, for instance, some command words such as "volume up" and "shut down" occur frequently in practice while others occur rarely; when the acoustic features of two such command words are similar, the weight of the phoneme sequence for the former can be set higher than that for the latter. In addition, the weights can be adjusted according to the recognition rate observed while the speech recognition method is in use.
Step 130: recognize the speech as the content corresponding to the phoneme sequence with the highest matching probability.
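Steps 110 to 130 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the network contents, the phoneme labels, the weights and the function name `recognize` are all hypothetical, the acoustic model is assumed to have already produced a log acoustic similarity probability for each phoneme sequence, and the matching score is assumed to be that log probability plus a log-domain weight.

```python
import math

# Hypothetical decoding network: each content label maps to a phoneme
# sequence and a weight (here a log prior, an assumption of this sketch).
DECODING_NETWORK = {
    "volume up": {"phonemes": ["v", "ao", "l", "ah", "p"], "weight": math.log(0.4)},
    "shut down": {"phonemes": ["sh", "ah", "t", "d", "aw", "n"], "weight": math.log(0.4)},
    "<noise>": {"phonemes": ["nz"], "weight": math.log(0.2)},
}


def recognize(acoustic_log_probs, network=DECODING_NETWORK):
    """Return the best-matching content label and all matching scores.

    acoustic_log_probs maps each content label to the log acoustic
    similarity probability that an acoustic model (not shown) assigned
    to the utterance for that label's phoneme sequence.
    """
    # Step 120: combine the acoustic score with the sequence weight.
    scores = {
        label: acoustic_log_probs[label] + entry["weight"]
        for label, entry in network.items()
    }
    # Step 130: pick the sequence with the highest matching score.
    best = max(scores, key=scores.get)
    return best, scores
```

Because the noise entry competes in the same search as the command words, an utterance that matches the noise model best is labeled `<noise>` directly, with no separate confidence check afterwards.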
The working principle of the above steps is as follows. A phoneme sequence corresponding to noise content is added to the decoding network, so that input noise can be matched, on the basis of its acoustic features, to the noise phoneme sequence in the decoding network. The input noise is thus identified from its acoustic features, which avoids recognizing non-command speech as a command word. Compared with the prior-art approach of calculating a confidence after recognition, the scheme of this embodiment is not affected by the environment and greatly reduces the misrecognition rate.
To reduce the misrecognition rate further by raising the likelihood that noise is matched to the noise phoneme sequence in the decoding network, this embodiment provides a preferred implementation. Specifically, step 110, calculating the acoustic similarity probability between the speech and the phoneme sequences in the decoding network according to the acoustic features of the collected speech, comprises:
acquiring pre-trained acoustic models of the phoneme sequences in the decoding network, wherein the noise samples used to train the acoustic model corresponding to the noise content include multiple speech samples whose pairwise acoustic feature differences are greater than a preset threshold;
calculating, according to the acoustic features of the collected speech and using the acoustic models, the acoustic similarity probability between the speech and the phoneme sequences in the decoding network.
In this preferred implementation, the noise samples used to train the noise acoustic model include multiple speech samples whose pairwise acoustic feature differences are greater than a preset threshold; that is, the noise acoustic model is trained on many speech samples that differ greatly from one another, such as noisy ambient sound and a large number of mutually different non-command phrases. An acoustic model trained on many highly dissimilar speech samples tends toward a generic sound whose difference from each of the various sounds is minimized, so it matches all kinds of non-command speech more easily. The command word samples used to train a command word acoustic model, by contrast, are usually recordings of the command word read aloud in different accents; the acoustic features of these samples differ little, so the model gives a high acoustic similarity probability only to sounds related to that command word. The preferred implementation therefore raises the likelihood that noise is matched to the noise phoneme sequence in the decoding network and reduces the misrecognition rate.
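One way to assemble a noise training set whose samples differ pairwise by more than a threshold is a greedy filter, sketched below. The patent does not say how the acoustic feature difference is measured or how samples are chosen; the Euclidean distance over fixed-length feature vectors, the greedy strategy and the function names are assumptions made for illustration only.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_diverse_samples(features, threshold):
    """Greedily keep indices of samples that are far from every kept sample.

    A sample is kept only if its distance to each previously kept sample
    is strictly greater than `threshold`, so every pair in the result
    differs by more than the threshold.
    """
    kept = []
    for index, vector in enumerate(features):
        if all(euclidean(vector, features[k]) > threshold for k in kept):
            kept.append(index)
    return kept
```

The result depends on the order of the input, since earlier samples are preferred; that is acceptable for a sketch but a real data pipeline might choose differently.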
Further, the decoding network is constructed with a weighted finite-state transducer. Step 120, obtaining the matching probability between the speech and the phoneme sequence according to the acoustic similarity probability, then specifically comprises: calculating the sum of the acoustic similarity probability and the weight of the phoneme sequence, and using it as the matching probability between the speech and the phoneme sequence. Of course, the product of the acoustic similarity probability and the weight may also be calculated and used as the matching probability.
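The sum and the product are closely related when scores are kept as logarithms, which is a common convention in weighted finite-state transducer decoders. The notation below is an assumption introduced for illustration: $x$ is the utterance, $s$ a phoneme sequence, $P_{\mathrm{ac}}$ the acoustic similarity probability, $w$ the weight of the sequence and $M$ the matching score. The source does not state which domain its scores are in.

```latex
\[
  \log M(x, s) = \log P_{\mathrm{ac}}(x \mid s) + \log w(s)
  \quad\Longleftrightarrow\quad
  M(x, s) = P_{\mathrm{ac}}(x \mid s)\, w(s)
\]
```

Under that reading, adding a log-domain weight to a log acoustic score ranks the phoneme sequences exactly as multiplying the two probabilities would.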
Further, the decoding network also includes a phoneme sequence corresponding to silence content. Adding a phoneme sequence for silence can improve the user experience, because noise and silence can then be distinguished and different signals fed back to the user. For example, noise may be caused by the user speaking incorrectly, so a prompt asking the user to repeat can be output. Silence may be caused by the user accidentally touching the recognition device and triggering voice input, so the recognition output can be set to empty, that is, no operation is performed and the user is not disturbed.
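The different feedback for command words, noise and silence can be sketched as a small dispatch function. The labels `<noise>` and `<silence>`, the prompt text and the `execute:` prefix are hypothetical placeholders, not part of the source.

```python
NOISE = "<noise>"
SILENCE = "<silence>"


def respond(recognized_content):
    """Map a recognition result to the feedback given to the user."""
    if recognized_content == SILENCE:
        # Probably an accidental trigger: perform no operation at all.
        return None
    if recognized_content == NOISE:
        # Probably a mis-spoken command: ask the user to repeat.
        return "Sorry, please say that again."
    # Anything else is treated as a preset command word.
    return f"execute:{recognized_content}"
```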
It should be noted that calculating the acoustic similarity probability, obtaining the matching probability and then finding the phoneme sequence with the highest matching probability can be done in two ways. One way is to calculate the matching probability between every phoneme sequence and the collected speech first, and then compare the matching probabilities to find the highest one. The other way is to first find the phonemes in the decoding network that are acoustically close to the initial phoneme of the collected speech, and then, according to the acoustic similarity probability, the weights (including language model probability information) and so on, judge which of the phoneme sequences containing those phonemes has a next phoneme that best matches the next phoneme of the collected speech; the next phoneme node of that phoneme sequence is then taken as the match for the next phoneme of the collected speech. This judgment and search continues step by step, and the phoneme sequence finally obtained is the one with the highest matching probability.
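The two search strategies can be contrasted with a simplified sketch. It assumes, purely for illustration, that the utterance has already been split into one segment per phoneme and that each segment comes with a log probability per candidate phoneme; real decoders align frames to phonemes during the search. The names `FLOOR`, `exhaustive_search`, `incremental_search` and the `beam` parameter are hypothetical.

```python
FLOOR = -20.0  # log probability assumed for phonemes a segment did not score


def exhaustive_search(segment_log_probs, sequences, weights):
    """Score every phoneme sequence in full, then return the best label."""
    def score(label):
        phonemes = sequences[label]
        if len(phonemes) != len(segment_log_probs):
            return float("-inf")  # cannot align in this simplified setup
        acoustic = sum(seg.get(p, FLOOR)
                       for seg, p in zip(segment_log_probs, phonemes))
        return acoustic + weights[label]
    return max(sequences, key=score)


def incremental_search(segment_log_probs, sequences, weights, beam=2):
    """Advance one phoneme at a time, keeping only the `beam` best candidates."""
    active = {label: weights[label] for label in sequences}
    for position, seg in enumerate(segment_log_probs):
        extended = {}
        for label, partial in active.items():
            phonemes = sequences[label]
            if position < len(phonemes):
                extended[label] = partial + seg.get(phonemes[position], FLOOR)
        ranked = sorted(extended, key=extended.get, reverse=True)[:beam]
        active = {label: extended[label] for label in ranked}
    # Only sequences that consumed every segment count as complete matches.
    complete = {label: s for label, s in active.items()
                if len(sequences[label]) == len(segment_log_probs)}
    return max(complete, key=complete.get) if complete else None
```

The incremental variant does less work because weak candidates are dropped early, at the cost that an overly small `beam` can discard the sequence that would have scored best overall.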
In summary, the technical solution of this embodiment adds a phoneme sequence corresponding to noise content to the decoding network, so that collected speech is recognized as noise or as a command word at the moment the best-matching phoneme sequence is searched in the decoding network, with no confidence calculation on the search result afterwards. This solves the prior-art problem of a high misrecognition rate caused by confidence calculations that are affected by the environment, avoids recognizing noise as a command word, and reduces the misrecognition rate.
Embodiment two
Fig. 2 is a flowchart of the speech recognition method provided by Embodiment two of the present invention. This embodiment is applicable to command word recognition, and the method can be executed by a speech recognition device. On the basis of the speech recognition method of Embodiment one, this embodiment adds a step that automatically adjusts the parameters of the decoding network, so that the method can modify its parameters dynamically and keep reducing the misrecognition rate. The speech recognition method provided by this embodiment comprises:
Step 210: according to the acoustic features of the collected speech, calculate the acoustic similarity probability between the speech and the phoneme sequences in the decoding network, wherein the decoding network includes multiple groups of phoneme sequences and each group of phoneme sequences corresponds to the content of a preset command word or to noise content;
Step 220: according to the acoustic similarity probability, obtain the matching probability between the speech and the phoneme sequence;
Step 230: recognize the speech as the content corresponding to the phoneme sequence with the highest matching probability;
Step 240: if it is confirmed that the collected speech is noise and the speech was recognized as a preset command word, increase the weight of the phoneme sequence corresponding to the noise content in the decoding network.
After recognizing the speech, this embodiment can also collect confirmation information (which may be provided by the user) to confirm whether the recognition result is correct. If it is confirmed that the collected speech is noise but was recognized as a command word, the misrecognition rate is still slightly high, so the weight of the phoneme sequence corresponding to the noise content in the decoding network is increased. This raises the matching probability between the noise phoneme sequence and the collected speech, making non-command speech more likely to be recognized as noise. Further, the weight of the noise phoneme sequence may be increased only after the number of times that collected speech is confirmed to be noise yet recognized as a command word reaches a preset threshold, so that isolated recognition errors do not throw the adjustment out of balance.
Preferably, the method further comprises: if it is confirmed that the collected speech is a command word and the speech was recognized as noise, decreasing the weight of the phoneme sequence corresponding to the noise content in the decoding network.
Further, the weight of the noise phoneme sequence may be decreased only after the number of times that collected speech is confirmed to be a command word yet recognized as noise reaches a preset threshold. Reducing the misrecognition rate inevitably causes a small number of command words to be recognized as noise; the preferred implementation above improves the recognition rate for command words in that case.
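The threshold-gated weight adjustment in both directions can be sketched as a small counter-based class. The class name, the step size, the default threshold and the choice to reset a counter after each adjustment are assumptions for illustration; the source only states that the weight is raised or lowered once the number of confirmed errors reaches a preset threshold.

```python
class NoiseWeightAdjuster:
    """Adjust the noise phoneme sequence weight from confirmed results."""

    def __init__(self, weight=0.0, step=0.5, threshold=3):
        self.weight = weight          # current weight of the noise sequence
        self.step = step              # hypothetical adjustment increment
        self.threshold = threshold    # errors required before adjusting
        self.noise_as_command = 0     # noise wrongly recognized as a command
        self.command_as_noise = 0     # command wrongly recognized as noise

    def report(self, actual_is_noise, recognized_as_noise):
        """Record one confirmed result and return the (possibly new) weight."""
        if actual_is_noise and not recognized_as_noise:
            self.noise_as_command += 1
            if self.noise_as_command >= self.threshold:
                self.weight += self.step   # make noise more likely to win
                self.noise_as_command = 0
        elif not actual_is_noise and recognized_as_noise:
            self.command_as_noise += 1
            if self.command_as_noise >= self.threshold:
                self.weight -= self.step   # make commands more likely to win
                self.command_as_noise = 0
        return self.weight
```

Correct recognitions leave both counters and the weight unchanged, so only repeated, confirmed errors of the same kind move the weight.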
Further, the weight of the phoneme sequence corresponding to the noise content in the decoding network may also be adjusted according to an instruction triggered by the user, so as to reduce the misrecognition rate or improve the recognition rate.
The technical solution of this embodiment adds a phoneme sequence corresponding to noise content to the decoding network, so that collected speech is recognized as noise or as a command word at the moment the best-matching phoneme sequence is searched in the decoding network, which avoids recognizing noise as a command word and reduces the misrecognition rate. In addition, the weight of the noise phoneme sequence in the decoding network is adjusted according to the recognition results, so that the parameters are modified dynamically and the misrecognition rate keeps decreasing.
Embodiment three
Fig. 3 is a structural schematic diagram of the speech recognition device provided by Embodiment three of the present invention. The speech recognition device comprises:
a computing module 310, configured to calculate, according to the acoustic features of collected speech, the acoustic similarity probability between the speech and the phoneme sequences in a decoding network, wherein the decoding network includes multiple groups of phoneme sequences and each group of phoneme sequences corresponds to the content of a preset command word or to noise content;
a matching module 320, configured to obtain, according to the acoustic similarity probability, the matching probability between the speech and the phoneme sequence;
a recognition module 330, configured to recognize the speech as the content corresponding to the phoneme sequence with the highest matching probability.
Preferably, the decoding network is constructed with a weighted finite-state transducer. The speech recognition device further comprises:
a weight adjustment module 340, configured to increase the weight of the phoneme sequence corresponding to the noise content in the decoding network if it is confirmed that the collected speech is noise and the speech was recognized as a preset command word.
Preferably, the matching module 320 comprises:
a sum calculation unit, configured to calculate the sum of the acoustic similarity probability and the weight of the phoneme sequence as the matching probability between the speech and the phoneme sequence.
Preferably, the decoding network further includes a phoneme sequence corresponding to silence content.
Preferably, the computing module comprises:
a model acquisition unit, configured to acquire pre-trained acoustic models of the phoneme sequences in the decoding network, wherein the noise samples used to train the acoustic model corresponding to the noise content include multiple speech samples whose pairwise acoustic feature differences are greater than a preset threshold;
a model operation unit, configured to calculate, according to the acoustic features of the collected speech and using the acoustic models, the acoustic similarity probability between the speech and the phoneme sequences in the decoding network.
The speech recognition device provided by this embodiment of the present invention can execute the speech recognition method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to that method.
Embodiment four
Fig. 4 is a structural schematic diagram of a terminal provided by Embodiment four of the present invention. As shown in Fig. 4, the terminal includes a processor 410, a memory 420, an input device 430 and an output device 440. The terminal may have one or more processors 410; Fig. 4 takes one processor 410 as an example. The processor 410, memory 420, input device 430 and output device 440 in the terminal may be connected by a bus or in other ways; Fig. 4 takes a bus connection as an example.
The memory 420, as a computer-readable storage medium, can be used to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the speech recognition method in the embodiments of the present invention (for example, the computing module 310, matching module 320, recognition module 330 and weight adjustment module 340 in the speech recognition device). By running the software programs, instructions and modules stored in the memory 420, the processor 410 executes the various functional applications and data processing of the terminal, that is, implements the speech recognition method described above.
The memory 420 may mainly include a program storage area and a data storage area, wherein the program storage area can store an operating system and the application programs required by at least one function, and the data storage area can store data created through use of the terminal. In addition, the memory 420 may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some examples, the memory 420 may further include memory located remotely from the processor 410, and such remote memory can be connected to the terminal through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network and combinations thereof.
The input device 430 can be used to receive input numeric or character information and to generate key signal inputs related to the user settings and function control of the terminal. The output device 440 may include a display device such as a display screen.
Embodiment five
Embodiment five of the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a computer processor, implements a speech recognition method, the method comprising:
calculating, according to the acoustic features of collected speech, the acoustic similarity probability between the speech and the phoneme sequences in a decoding network, wherein the decoding network includes multiple groups of phoneme sequences and each group of phoneme sequences corresponds to the content of a preset command word or to noise content;
obtaining, according to the acoustic similarity probability, the matching probability between the speech and the phoneme sequence;
recognizing the speech as the content corresponding to the phoneme sequence with the highest matching probability.
Of course, in the computer-readable storage medium storing a computer program provided by this embodiment of the present invention, the program is not limited to the method operations described above and can also perform related operations in the speech recognition method provided by any embodiment of the present invention.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software together with the necessary general-purpose hardware, and of course can also be implemented by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), flash memory (FLASH), hard disk or optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments of the present invention.
It is worth noting that, in the above embodiment of the speech recognition device, the units and modules are divided only according to functional logic; the division is not limited to the one described as long as the corresponding functions can be realized. In addition, the specific names of the functional units serve only to distinguish them from one another and do not limit the protection scope of the present invention.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made without departing from the protection scope of the invention. Therefore, although the invention has been described in further detail through the above embodiments, it is not limited to them; it may include further equivalent embodiments without departing from the inventive concept, and its scope is determined by the appended claims.

Claims (8)

1. A speech recognition method, characterized by comprising:
calculating, according to acoustic features of collected speech, an acoustic similarity probability between the speech and a phoneme sequence in a decoding network, wherein the decoding network comprises multiple groups of phoneme sequences, and each group of phoneme sequences corresponds to preset command word content or to noise content;
obtaining, according to the acoustic similarity probability, a matching probability between the speech and the phoneme sequence;
recognizing the speech as the content corresponding to the phoneme sequence with the highest matching probability;
wherein calculating the acoustic similarity probability between the speech and the phoneme sequence in the decoding network according to the acoustic features of the collected speech comprises:
acquiring a pre-trained acoustic model of the phoneme sequences in the decoding network, wherein the noise samples used to train the acoustic model corresponding to the noise content comprise a plurality of speech samples whose pairwise acoustic feature differences are greater than a preset threshold, and the command word samples used to train the acoustic model corresponding to a command word comprise a plurality of speech samples of the command word read aloud in different accents;
calculating, using the acoustic model and according to the acoustic features of the collected speech, the acoustic similarity probability between the speech and the phoneme sequence in the decoding network.

2. The speech recognition method according to claim 1, characterized in that the decoding network is constructed using a weighted finite-state transducer;
obtaining the matching probability between the speech and the phoneme sequence according to the acoustic similarity probability specifically comprises:
calculating the sum of the acoustic similarity probability and the weight of the phoneme sequence as the matching probability between the speech and the phoneme sequence.

3. The speech recognition method according to claim 2, characterized by further comprising:
if it is confirmed that the collected speech is noise and the speech has been recognized as a preset command word, increasing the weight of the phoneme sequence corresponding to the noise content in the decoding network.

4. The speech recognition method according to any one of claims 1-3, characterized in that the decoding network further comprises a phoneme sequence corresponding to silence content.

5. A speech recognition device, characterized by comprising:
a calculation module, configured to calculate, according to acoustic features of collected speech, an acoustic similarity probability between the speech and a phoneme sequence in a decoding network, wherein the decoding network comprises multiple groups of phoneme sequences, and each group of phoneme sequences corresponds to preset command word content or to noise content;
a matching module, configured to obtain, according to the acoustic similarity probability, a matching probability between the speech and the phoneme sequence;
a recognition module, configured to recognize the speech as the content corresponding to the phoneme sequence with the highest matching probability;
wherein the calculation module comprises:
a model acquisition unit, configured to acquire a pre-trained acoustic model of the phoneme sequences in the decoding network, wherein the noise samples used to train the acoustic model corresponding to the noise content comprise a plurality of speech samples whose pairwise acoustic feature differences are greater than a preset threshold, and the command word samples used to train the acoustic model corresponding to a command word comprise a plurality of speech samples of the command word read aloud in different accents;
a model operation unit, configured to calculate, using the acoustic model and according to the acoustic features of the collected speech, the acoustic similarity probability between the speech and the phoneme sequence in the decoding network.

6. The speech recognition device according to claim 5, characterized in that the decoding network is constructed using a weighted finite-state transducer;
the speech recognition device further comprises:
a weight adjustment module, configured to increase the weight of the phoneme sequence corresponding to the noise content in the decoding network if it is confirmed that the collected speech is noise and the speech has been recognized as a preset command word.

7. A terminal, characterized in that the terminal comprises:
one or more processors;
a memory for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the speech recognition method according to any one of claims 1-4.

8. A computer-readable storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, the speech recognition method according to any one of claims 1-4 is implemented.
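The scoring and weight-adjustment steps recited in claims 1 to 3 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Path` structure, the `<noise>` and `<silence>` labels, the `step` size, and the numeric scores are all hypothetical, the acoustic model is not implemented (its scores are supplied as made-up log-domain numbers), and a flat list stands in for a real weighted finite-state transducer.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """One phoneme sequence in the decoding network (hypothetical structure)."""
    content: str   # command word text, or "<noise>" / "<silence>"
    weight: float  # path weight, treated here as a log-domain score


def recognize(acoustic_scores, network):
    """Return the content of the path with the highest matching score.

    acoustic_scores maps content -> acoustic similarity score, assumed to be
    a log-probability produced by some acoustic model (not implemented here).
    Matching score = acoustic score + path weight, as recited in claim 2.
    """
    best = max(network, key=lambda p: acoustic_scores[p.content] + p.weight)
    return best.content


def penalize_false_accept(network, recognized, truly_noise, step=0.5):
    """Raise the noise path weight when noise was recognized as a command (claim 3)."""
    if truly_noise and recognized not in ("<noise>", "<silence>"):
        for p in network:
            if p.content == "<noise>":
                p.weight += step


# Toy network: one command word, one noise path, one silence path (claim 4).
network = [Path("turn on the light", 0.0), Path("<noise>", 0.0), Path("<silence>", 0.0)]
# Made-up acoustic scores for a single noisy utterance.
scores = {"turn on the light": -4.0, "<noise>": -4.3, "<silence>": -9.0}

first = recognize(scores, network)   # the command path wins narrowly
penalize_false_accept(network, first, truly_noise=True)
second = recognize(scores, network)  # after the adjustment the noise path wins
```

In this toy run the same utterance is first misrecognized as the command word and, once the noise path weight has been raised, is recognized as noise. Because noise competes inside the decoding network itself, no separate confidence score is computed after recognition.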
CN201710964474.1A 2017-10-17 2017-10-17 Audio recognition method, device, terminal and computer readable storage medium Expired - Fee Related CN107644638B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710964474.1A CN107644638B (en) 2017-10-17 2017-10-17 Audio recognition method, device, terminal and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710964474.1A CN107644638B (en) 2017-10-17 2017-10-17 Audio recognition method, device, terminal and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN107644638A CN107644638A (en) 2018-01-30
CN107644638B true CN107644638B (en) 2019-01-04

Family

ID=61123547

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710964474.1A Expired - Fee Related CN107644638B (en) 2017-10-17 2017-10-17 Audio recognition method, device, terminal and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN107644638B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108831446B (en) * 2018-05-24 2019-10-18 百度在线网络技术(北京)有限公司 Method and apparatus for generating sample
CN108932943A (en) * 2018-07-12 2018-12-04 广州视源电子科技股份有限公司 Command word sound detection method, device, equipment and storage medium
CN110716767B (en) * 2018-07-13 2023-05-05 阿里巴巴集团控股有限公司 Model component calling and generating method, device and storage medium
CN109274845A (en) * 2018-08-31 2019-01-25 平安科技(深圳)有限公司 Intelligent sound pays a return visit method, apparatus, computer equipment and storage medium automatically
CN109273007B (en) * 2018-10-11 2022-05-17 西安讯飞超脑信息科技有限公司 Voice wake-up method and device
CN109243429B (en) * 2018-11-21 2021-12-10 苏州奇梦者网络科技有限公司 Voice modeling method and device
CN110415710B (en) * 2019-08-06 2022-05-31 大众问问(北京)信息科技有限公司 Parameter adjusting method, device, equipment and medium for vehicle-mounted voice interaction system
CN110570842B (en) * 2019-10-25 2020-07-10 南京云白信息科技有限公司 Speech recognition method and system based on phoneme approximation degree and pronunciation standard degree
CN110992932B (en) * 2019-12-18 2022-07-26 广东睿住智能科技有限公司 Self-learning voice control method, system and storage medium
CN111145748B (en) * 2019-12-30 2022-09-30 广州视源电子科技股份有限公司 Audio recognition confidence determining method, device, equipment and storage medium
CN111179974B (en) * 2019-12-30 2022-08-09 思必驰科技股份有限公司 Command word recognition method and device
KR20220128397A (en) 2020-01-17 2022-09-20 구글 엘엘씨 Alphanumeric Sequence Biasing for Automatic Speech Recognition
CN111798846A (en) * 2020-06-02 2020-10-20 厦门亿联网络技术股份有限公司 Voice command word recognition method and device, conference terminal and conference terminal system
CN111710337B (en) * 2020-06-16 2023-07-07 睿云联(厦门)网络通讯技术有限公司 Voice data processing method and device, computer readable medium and electronic equipment
CN114694641B (en) * 2020-12-31 2025-07-25 华为技术有限公司 Speech recognition method and electronic equipment
CN114974249B (en) * 2021-02-20 2025-09-12 上海大唐移动通信设备有限公司 Speech recognition method, device and storage medium
CN113823269A (en) * 2021-09-07 2021-12-21 广西电网有限责任公司贺州供电局 A method for automatic storage of power grid dispatch commands based on speech recognition
CN113889083B (en) * 2021-11-03 2022-12-02 广州博冠信息科技有限公司 Voice recognition method and device, storage medium and electronic equipment
CN114242062B (en) * 2021-12-29 2025-07-04 浙江大华技术股份有限公司 Method, device, storage medium and electronic device for outputting command words
CN114360544B (en) * 2022-02-16 2025-07-25 北京字跳网络技术有限公司 Speech recognition method, device, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101840699A (en) * 2010-04-30 2010-09-22 中国科学院声学研究所 Voice quality evaluation method based on pronunciation model
JP2012113087A (en) * 2010-11-24 2012-06-14 Nippon Telegr & Teleph Corp <Ntt> Voice recognition wfst creation apparatus, voice recognition device employing the same, methods thereof, program and storage medium
US20130138441A1 (en) * 2011-11-28 2013-05-30 Electronics And Telecommunications Research Institute Method and system for generating search network for voice recognition
CN103971685A (en) * 2013-01-30 2014-08-06 腾讯科技(深圳)有限公司 Method and system for recognizing voice commands
CN103985391A (en) * 2014-04-16 2014-08-13 柳超 Phonetic-level low power consumption spoken language evaluation and defect diagnosis method without standard pronunciation
CN104036774A (en) * 2014-06-20 2014-09-10 国家计算机网络与信息安全管理中心 Method and system for recognizing Tibetan dialects
CN107195296A (en) * 2016-03-15 2017-09-22 阿里巴巴集团控股有限公司 A kind of audio recognition method, device, terminal and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113140215A (en) * 2015-01-16 2021-07-20 三星电子株式会社 Method and apparatus for performing speech recognition

Also Published As

Publication number Publication date
CN107644638A (en) 2018-01-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: Room 508-598, Xitian Gezhuang Town Government Office Building, No. 8 Xitong Road, Miyun District Economic Development Zone, Beijing 101500

Patentee after: BEIJING ROOBO TECHNOLOGY Co.,Ltd.

Address before: Room 508-598, Xitian Gezhuang Town Government Office Building, No. 8 Xitong Road, Miyun District Economic Development Zone, Beijing 101500

Patentee before: BEIJING INTELLIGENT STEWARD Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20210824

Address after: 301-112, floor 3, building 2, No. 18, YANGFANGDIAN Road, Haidian District, Beijing 100038

Patentee after: Beijing Rubu Technology Co.,Ltd.

Address before: Room 508-598, Xitian Gezhuang Town Government Office Building, No. 8 Xitong Road, Miyun District Economic Development Zone, Beijing 101500

Patentee before: BEIJING ROOBO TECHNOLOGY Co.,Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190104