Speech recognition method, device, terminal and computer-readable storage medium
Technical field
The embodiments of the present invention relate to speech recognition technology, and more particularly to a speech recognition method, device, terminal and computer-readable storage medium.
Background
In voice command word recognition, misrecognition has always been a difficult problem. The false recognition rate of command word recognition is relatively high because prior-art command word recognition methods are generally implemented by constructing a decoding network that contains multiple groups of phoneme sequences corresponding to preset command words. For any input speech, the best-matching phoneme sequence is searched out of the decoding network, so even speech that is not a command word is mapped to some command word, which causes misrecognition.
The current way to keep noise from being recognized as a command word is to calculate a confidence level for the recognition result: when the confidence is greater than a preset threshold, the recognition is taken to be correct; when the confidence is less than the threshold, no command word is taken to have been recognized. Because the confidence calculation depends on many factors and is especially affected by the environment, the confidence value varies over a very wide range. In a noisy environment, a correct recognition result often has a very low confidence while a wrong recognition result has a very high confidence, so the false recognition rate remains high.
Summary of the invention
The present invention provides a voice command recognition method, device, terminal and computer-readable storage medium that avoid recognizing noise as a command word without calculating a confidence level after speech recognition, thereby reducing the false recognition rate.
In a first aspect, an embodiment of the present invention provides a speech recognition method, comprising:
According to the acoustic features of collected speech, calculating the acoustic similarity probability between the speech and the phoneme sequences in a decoding network; wherein the decoding network comprises multiple groups of phoneme sequences, and each group of phoneme sequences corresponds either to a preset command word content or to noise content;
According to the acoustic similarity probability, obtaining the matching probability between the speech and the phoneme sequence;
Recognizing the speech as the content corresponding to the phoneme sequence with the highest matching probability.
In a second aspect, the present invention also provides a speech recognition device, comprising:
A computing module, configured to calculate, according to the acoustic features of collected speech, the acoustic similarity probability between the speech and the phoneme sequences in a decoding network; wherein the decoding network comprises multiple groups of phoneme sequences, and each group of phoneme sequences corresponds either to a preset command word content or to noise content;
A matching module, configured to obtain the matching probability between the speech and the phoneme sequence according to the acoustic similarity probability;
An identification module, configured to recognize the speech as the content corresponding to the phoneme sequence with the highest matching probability.
In a third aspect, the present invention also provides a terminal, the terminal comprising:
One or more processors;
A memory, configured to store one or more programs,
Wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the speech recognition method provided by any embodiment of the present invention.
In a fourth aspect, the present invention also provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the speech recognition method provided by any embodiment of the present invention.
By adding a phoneme sequence corresponding to noise content to the decoding network, the present invention allows collected speech to be recognized as noise or as a command word at the time the best-matching phoneme sequence is searched for in the decoding network, with no need to perform a confidence calculation on the search result after the search. This solves the prior-art problem of a high false recognition rate caused by a confidence calculation that is affected by environmental factors, avoids recognizing noise as a command word, and reduces the false recognition rate.
Brief description of the drawings
Fig. 1 is a flowchart of the speech recognition method provided by Embodiment one of the present invention;
Fig. 2 is a flowchart of the speech recognition method provided by Embodiment two of the present invention;
Fig. 3 is a schematic structural diagram of the speech recognition device provided by Embodiment three of the present invention;
Fig. 4 is a schematic structural diagram of the terminal provided by Embodiment four of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and not to limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention rather than the entire structure.
Embodiment one
Fig. 1 is a flowchart of the speech recognition method provided by Embodiment one of the present invention. This embodiment is applicable to command word recognition, and the method can be executed by a speech recognition device. It specifically comprises the following steps:
Step 110: according to the acoustic features of the collected speech, calculate the acoustic similarity probability between the speech and the phoneme sequences in the decoding network.
Here, the decoding network comprises multiple groups of phoneme sequences, and each group of phoneme sequences corresponds either to a preset command word content or to noise content. Because the embodiments of the present invention are applied to voice command recognition, any non-command-word speech interferes with command word recognition and is therefore noise; accordingly, noise in the embodiments of the present invention refers to any non-command-word speech. Specifically, the decoding network can be formed by interconnecting multiple phoneme nodes, and phoneme nodes connected in series in the network form a phoneme sequence. In the field of speech recognition, the acoustic similarity probability between a phoneme of the speech and a phoneme in the decoding network is generally obtained by constructing an acoustic model of the phoneme in the decoding network: the acoustic similarity probability is the probability output by the corresponding acoustic model when the acoustic features of the speech are given as input.
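As a rough illustration of the structure described above, the following Python sketch represents a decoding network as a list of phoneme sequences, each tagged with the content it corresponds to (a command word or noise). The names PhonemeSequence and build_decoding_network, the "<noise>" label and the phoneme symbols are hypothetical and chosen only for this example; the text does not prescribe any particular data structure.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class PhonemeSequence:
    """One group of phoneme nodes connected in series (illustrative only)."""
    phonemes: Tuple[str, ...]   # phoneme nodes in order
    content: str                # what the sequence is recognized as
    is_noise: bool = False      # True for the noise-content sequence


def build_decoding_network(
    commands: Dict[str, Sequence[str]],
    noise_phonemes: Sequence[str],
) -> List[PhonemeSequence]:
    """Build a toy decoding network: one sequence per command word plus one for noise."""
    network = [PhonemeSequence(tuple(p), word) for word, p in commands.items()]
    # The extra sequence lets non-command speech match "noise" instead of
    # being forced onto the nearest command word.
    network.append(PhonemeSequence(tuple(noise_phonemes), "<noise>", is_noise=True))
    return network


# Toy phoneme symbols; a real system would use its own phoneme inventory.
network = build_decoding_network(
    {"shut down": ("sh", "ah", "t"), "start": ("s", "t", "aa")},
    noise_phonemes=("nz", "nz", "nz"),
)
```

In this sketch every sequence, including the noise sequence, takes part in the same search, which is the point of the scheme: noise is rejected during decoding rather than by a later confidence check.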
Step 120: according to the acoustic similarity probability, obtain the matching probability between the speech and the phoneme sequence.
To simplify the data processing in the recognition process, the acoustic similarity probability can be used directly as the matching probability. For scenarios with demanding recognition requirements, however, such as a speech recognition method that needs a high recognition rate, the matching probability can include other information in addition to the acoustic similarity probability. For example, for a decoding network constructed with a weighted finite-state transducer (WFST), the matching probability also includes the weight information of the phoneme sequence. This weight can be related to the probability that the phoneme sequence occurs in practical use, that is, the language model probability. For example, in a command word recognition scenario some command words, such as "volume up" and "shut down", occur with high probability in practice, while other command words occur with low probability; when the acoustic features of the two are similar, the weight of the phoneme sequence corresponding to the former can be set higher than that of the phoneme sequence corresponding to the latter. In addition, the weight information can be adjusted according to the recognition rate observed while the speech recognition method is in use.
Step 130: recognize the speech as the content corresponding to the phoneme sequence with the highest matching probability.
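The weight information described under step 120 could, for instance, be derived from how often each command word is actually used. The sketch below is one possible way to do this, assuming a simple smoothed relative frequency; the function name weights_from_usage, the smoothing parameter and the usage counts are hypothetical and are not taken from the text.

```python
from typing import Dict


def weights_from_usage(usage_counts: Dict[str, int], smoothing: float = 1.0) -> Dict[str, float]:
    """Turn usage counts into weights via additive smoothing (an assumed scheme).

    Frequently used command words get a larger weight, rarely used ones a
    smaller but non-zero weight.
    """
    total = sum(usage_counts.values()) + smoothing * len(usage_counts)
    return {word: (count + smoothing) / total for word, count in usage_counts.items()}


# Hypothetical usage log: "volume up" is said far more often than "factory reset".
weights = weights_from_usage({"volume up": 70, "shut down": 25, "factory reset": 2})
```

With these counts the frequently used "volume up" receives the largest weight, so it would win over an acoustically similar but rarely used command word.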
The working principle of the above steps is as follows. Because a phoneme sequence corresponding to noise content is added to the decoding network, input noise can be matched, on the basis of its acoustic features, to the phoneme sequence corresponding to the noise content in the decoding network. The input noise is thus recognized as noise on the basis of acoustic features, which avoids recognizing non-command-word speech as a command word. Compared with the prior-art approach of calculating a confidence level after recognition, the way this embodiment avoids recognizing noise as a command word is not affected by environmental factors, which greatly reduces the false recognition rate.
To reduce the false recognition rate by making it more likely that noise is matched to the phoneme sequence corresponding to noise content in the decoding network, this embodiment provides a preferred implementation. Specifically, step 110, calculating the acoustic similarity probability between the speech and the phoneme sequences in the decoding network according to the acoustic features of the collected speech, comprises:
Obtaining pre-trained acoustic models of the phoneme sequences in the decoding network, wherein the noise samples used to train the acoustic model corresponding to the noise content include multiple speech samples whose pairwise acoustic feature differences are greater than a preset threshold;
According to the acoustic features of the collected speech, using the acoustic models to calculate the acoustic similarity probability between the speech and the phoneme sequences in the decoding network.
In the above preferred implementation, the noise samples used to train the noise acoustic model include multiple speech samples whose pairwise acoustic feature differences are greater than a preset threshold; that is, the noise acoustic model is trained on many speech samples that differ greatly from one another, such as noisy ambient sounds and a large number of mutually different non-command-word phrases. An acoustic model trained on a large number of widely differing speech samples tends toward a natural sound whose difference from the various sounds is minimized, so the corresponding phoneme sequence matches various non-command-word speech more easily. By contrast, the command word samples used to train a command word acoustic model are usually recordings of the command word read aloud in different accents; the acoustic features of these samples differ little from one another, so the model yields a high acoustic similarity probability only for sounds related to the command word. The preferred implementation therefore increases the likelihood that noise is matched to the phoneme sequence corresponding to noise content in the decoding network, and reduces the false recognition rate.
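The requirement that noise training samples differ pairwise by more than a preset threshold can be sketched as a simple filtering step. The following is a minimal illustration only: it assumes each sample is already reduced to a fixed-length feature vector and uses Euclidean distance as the "difference", neither of which is specified in the text, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def feature_difference(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed difference measure: Euclidean distance between feature vectors."""
    return math.dist(a, b)


def select_diverse_samples(
    samples: List[Sequence[float]], threshold: float
) -> List[Sequence[float]]:
    """Greedily keep samples whose difference to every kept sample exceeds threshold."""
    selected: List[Sequence[float]] = []
    for sample in samples:
        if all(feature_difference(sample, kept) > threshold for kept in selected):
            selected.append(sample)
    return selected


# Toy 2-D "acoustic features"; near-duplicates of an earlier sample are dropped.
candidates = [(0.0, 0.0), (0.1, 0.0), (3.0, 4.0), (3.0, 4.2), (10.0, 0.0)]
noise_training_set = select_diverse_samples(candidates, threshold=1.0)
```

The greedy pass depends on the input order and does not necessarily keep the largest possible set; it only guarantees that every pair of kept samples differs by more than the threshold.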
Further, the decoding network is constructed with a weighted finite-state transducer. In that case step 120, obtaining the matching probability between the speech and the phoneme sequence according to the acoustic similarity probability, specifically comprises: calculating the sum of the acoustic similarity probability and the weight of the phoneme sequence, and using the sum as the matching probability between the speech and the phoneme sequence. Alternatively, the product of the acoustic similarity probability and the weight can be calculated and used as the matching probability.
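The two ways of combining the acoustic similarity probability with the sequence weight, followed by the selection in step 130, can be sketched as below. The function names, the "<noise>" label and all numeric values are hypothetical; the text does not say whether the quantities are plain probabilities or log-domain scores, so this sketch simply treats them as numbers.

```python
from typing import Dict, Tuple


def matching_probability(acoustic: float, weight: float, combine: str = "sum") -> float:
    """Combine acoustic similarity probability and sequence weight."""
    if combine == "sum":
        return acoustic + weight
    if combine == "product":
        return acoustic * weight
    raise ValueError(f"unknown combine mode: {combine!r}")


def recognize(candidates: Dict[str, Tuple[float, float]], combine: str = "sum") -> str:
    """Return the content whose phoneme sequence has the highest matching probability.

    candidates maps content -> (acoustic similarity probability, weight).
    """
    return max(
        candidates,
        key=lambda content: matching_probability(*candidates[content], combine),
    )


# Made-up values: acoustically the noise sequence scores highest, but the
# weights tip the decision toward the frequently used command word.
candidates = {
    "shut down": (0.40, 0.30),
    "volume up": (0.35, 0.20),
    "<noise>": (0.45, 0.10),
}
result_sum = recognize(candidates, combine="sum")
result_product = recognize(candidates, combine="product")
```

Which combination is appropriate depends on the scale of the scores; with log-domain scores, for example, a sum corresponds to a product of probabilities.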
Further, the decoding network also includes a phoneme sequence corresponding to silence content. Adding a phoneme sequence for silence content can improve the user experience, because noise and silence can then be distinguished and different signals fed back to the user. For example, noise may be caused by the user saying the wrong thing, in which case a message prompting the user to speak again can be output; silence may be caused by the user accidentally touching the recognition device and triggering voice input, in which case the recognition output can be set to empty, that is, no operation is performed and the user is not disturbed.
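The different feedback for command words, noise and silence described above might look like the following sketch. The labels "<noise>" and "<silence>", the prompt text and the "execute:" prefix are placeholders invented for this example.

```python
from typing import Optional

NOISE = "<noise>"      # hypothetical label of the noise-content sequence
SILENCE = "<silence>"  # hypothetical label of the silence-content sequence


def respond(recognized_content: str) -> Optional[str]:
    """Map a recognition result to user feedback."""
    if recognized_content == SILENCE:
        # Probably an accidental trigger: do nothing and do not disturb the user.
        return None
    if recognized_content == NOISE:
        # Probably a wrong utterance: ask the user to repeat.
        return "Please say the command again."
    # Anything else is a preset command word.
    return f"execute:{recognized_content}"
```

For example, respond("<silence>") returns None, while respond("shut down") returns a string that stands in for executing the command.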
It should be noted that calculating the acoustic similarity probability, obtaining the matching probability and then finding the phoneme sequence with the highest matching probability can be done by first calculating the matching probability between every phoneme sequence and the collected speech and then comparing the matching probabilities to obtain the phoneme sequence with the highest one. Alternatively, the phonemes in the decoding network that are acoustically close to the initial phoneme of the collected speech can be searched for first; then, according to the acoustic similarity probability, the weights (including language model probability information) and so on, it is judged which of the groups of phoneme sequences containing those close phonemes has the next phoneme that best matches the next phoneme of the collected speech, and the next phoneme node of that group is determined to match the next phoneme of the collected speech. This judgment and search is then repeated, and the phoneme sequence finally obtained is the phoneme sequence with the highest matching probability.
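The two search strategies just described can be contrasted in a toy sketch. It makes strong simplifying assumptions that are not in the text: the speech is already split into segments aligned one-to-one with phonemes, each segment comes with a table of per-phoneme acoustic similarity probabilities, and the weight is combined by multiplication. All names and numbers are hypothetical, and the stepwise variant is a greedy simplification that is not guaranteed to agree with exhaustive scoring in general.

```python
from typing import Dict, List, Optional, Tuple

Network = Dict[str, Tuple[str, ...]]    # content -> phoneme sequence
SegmentProbs = List[Dict[str, float]]   # per speech segment: phoneme -> probability


def exhaustive_search(network: Network, probs: SegmentProbs, weights: Dict[str, float]) -> str:
    """Score every phoneme sequence in full, then compare the scores."""
    def score(content: str) -> float:
        seq = network[content]
        if len(seq) != len(probs):
            return 0.0
        p = weights[content]
        for phoneme, segment in zip(seq, probs):
            p *= segment.get(phoneme, 0.0)
        return p
    return max(network, key=score)


def stepwise_search(network: Network, probs: SegmentProbs, weights: Dict[str, float]) -> Optional[str]:
    """Advance phoneme by phoneme, keeping only sequences that follow the best next phoneme."""
    alive = [c for c, seq in network.items() if len(seq) == len(probs)]
    for i, segment in enumerate(probs):
        if not alive:
            return None
        best = max(alive, key=lambda c: segment.get(network[c][i], 0.0) * weights[c])
        alive = [c for c in alive if network[c][i] == network[best][i]]
    return max(alive, key=lambda c: weights[c]) if alive else None


NETWORK = {"shut down": ("sh", "ah", "t"), "start": ("s", "t", "aa"), "<noise>": ("nz", "nz", "nz")}
WEIGHTS = {content: 1.0 for content in NETWORK}
COMMAND_SPEECH = [
    {"sh": 0.7, "s": 0.2, "nz": 0.1},
    {"ah": 0.6, "t": 0.1, "nz": 0.3},
    {"t": 0.8, "aa": 0.1, "nz": 0.1},
]
NOISE_SPEECH = [
    {"sh": 0.1, "s": 0.1, "nz": 0.8},
    {"ah": 0.2, "t": 0.1, "nz": 0.7},
    {"t": 0.05, "aa": 0.05, "nz": 0.9},
]
```

On these two inputs both strategies return the same content. The stepwise variant discards most sequences after the first segment, which is where its saving in computation comes from.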
In summary, the technical solution of this embodiment adds a phoneme sequence corresponding to noise content to the decoding network, so that collected speech is recognized as noise or as a command word at the time the best-matching phoneme sequence is searched for in the decoding network, with no need to perform a confidence calculation on the search result afterwards. This solves the prior-art problem of a high false recognition rate caused by a confidence calculation that is affected by environmental factors, avoids recognizing noise as a command word, and reduces the false recognition rate.
Embodiment two
Fig. 2 is a flowchart of the speech recognition method provided by Embodiment two of the present invention. This embodiment is applicable to command word recognition, and the method can be executed by a speech recognition device. On the basis of the speech recognition method of Embodiment one, this embodiment adds a step that automatically adjusts the parameters of the decoding network, so that the speech recognition method can modify its parameters dynamically and keep reducing the false recognition rate. The speech recognition method provided by this embodiment comprises:
Step 210: according to the acoustic features of the collected speech, calculate the acoustic similarity probability between the speech and the phoneme sequences in the decoding network; wherein the decoding network comprises multiple groups of phoneme sequences, and each group of phoneme sequences corresponds either to a preset command word content or to noise content;
Step 220: according to the acoustic similarity probability, obtain the matching probability between the speech and the phoneme sequence;
Step 230: recognize the speech as the content corresponding to the phoneme sequence with the highest matching probability;
Step 240: if it is confirmed that the collected speech is noise but the speech was recognized as a preset command word, increase the weight of the phoneme sequence corresponding to noise content in the decoding network.
In this embodiment, confirmation information (which can be provided by the user) can also be collected after the speech is recognized, to confirm whether the recognition result is correct. If it is confirmed that the collected speech is noise but was recognized as a command word, the false recognition rate is still somewhat high, so the weight of the phoneme sequence corresponding to noise content in the decoding network is increased. This raises the matching probability between the noise phoneme sequence and the collected speech, so that non-command-word speech is more likely to be recognized as noise. Further, it can be arranged that the weight of the noise phoneme sequence is increased only after the number of times collected speech is confirmed to be noise but recognized as a command word reaches a preset threshold, so that individual recognition errors do not unbalance the adjustment.
Preferably, the method further comprises: if it is confirmed that the collected speech is a command word but the speech was recognized as noise, reducing the weight of the phoneme sequence corresponding to noise content in the decoding network.
Further, it can be arranged that the weight of the noise phoneme sequence is reduced only after the number of times collected speech is confirmed to be a command word but recognized as noise reaches a preset threshold. Reducing the false recognition rate inevitably causes a small number of command words to be recognized as noise; the above preferred implementation improves the recognition rate for command words in that situation.
Further, it can also be arranged that the weight of the phoneme sequence corresponding to noise content in the decoding network is adjusted according to an instruction triggered by the user, in order to reduce the false recognition rate or improve the recognition rate.
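The weight adjustment in both directions, including the count thresholds, can be sketched as a small stateful helper. The class name NoiseWeightAdjuster, the fixed step size, the reset of the counter after each adjustment and the lower bound of zero are assumptions made for this illustration; the text only states that the weight is raised or lowered once the number of confirmed errors reaches a preset threshold.

```python
class NoiseWeightAdjuster:
    """Adjust the noise-sequence weight from confirmed recognition errors."""

    def __init__(self, noise_weight: float, step: float = 0.05, threshold: int = 3):
        self.noise_weight = noise_weight
        self.step = step              # assumed fixed adjustment step
        self.threshold = threshold    # preset number of confirmed errors
        self.noise_as_command = 0     # noise wrongly recognized as a command word
        self.command_as_noise = 0     # command word wrongly recognized as noise

    def report(self, actual_is_noise: bool, recognized_as_noise: bool) -> float:
        """Record one confirmed result and return the (possibly updated) weight."""
        if actual_is_noise and not recognized_as_noise:
            self.noise_as_command += 1
            if self.noise_as_command >= self.threshold:
                self.noise_weight += self.step   # make noise easier to match
                self.noise_as_command = 0
        elif not actual_is_noise and recognized_as_noise:
            self.command_as_noise += 1
            if self.command_as_noise >= self.threshold:
                self.noise_weight = max(0.0, self.noise_weight - self.step)
                self.command_as_noise = 0
        return self.noise_weight


adjuster = NoiseWeightAdjuster(noise_weight=0.10, step=0.05, threshold=2)
after_first_error = adjuster.report(actual_is_noise=True, recognized_as_noise=False)
after_second_error = adjuster.report(actual_is_noise=True, recognized_as_noise=False)
```

A single confirmed error leaves the weight unchanged; only the second one, which reaches the threshold, raises it. A user-triggered adjustment could simply set noise_weight directly.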
The technical solution of this embodiment adds a phoneme sequence corresponding to noise content to the decoding network, so that collected speech is recognized as noise or as a command word at the time the best-matching phoneme sequence is searched for in the decoding network, which avoids recognizing noise as a command word and reduces the false recognition rate. In addition, the weight of the noise phoneme sequence in the decoding network is adjusted according to the recognition results, so that the parameters are modified dynamically and the false recognition rate keeps decreasing.
Embodiment three
Fig. 3 is a schematic structural diagram of the speech recognition device provided by Embodiment three of the present invention. The speech recognition device comprises:
A computing module 310, configured to calculate, according to the acoustic features of collected speech, the acoustic similarity probability between the speech and the phoneme sequences in a decoding network; wherein the decoding network comprises multiple groups of phoneme sequences, and each group of phoneme sequences corresponds either to a preset command word content or to noise content;
A matching module 320, configured to obtain the matching probability between the speech and the phoneme sequence according to the acoustic similarity probability;
An identification module 330, configured to recognize the speech as the content corresponding to the phoneme sequence with the highest matching probability.
Preferably, the decoding network is constructed with a weighted finite-state transducer. The speech recognition device further comprises:
A weight adjustment module 340, configured to increase the weight of the phoneme sequence corresponding to noise content in the decoding network if it is confirmed that the collected speech is noise but the speech was recognized as a preset command word.
Preferably, the matching module 320 comprises:
A sum calculation unit, configured to calculate the sum of the acoustic similarity probability and the weight of the phoneme sequence, and to use the sum as the matching probability between the speech and the phoneme sequence.
Preferably, the decoding network also includes a phoneme sequence corresponding to silence content.
Preferably, the computing module comprises:
A model acquisition unit, configured to obtain pre-trained acoustic models of the phoneme sequences in the decoding network; wherein the noise samples used to train the acoustic model corresponding to the noise content include multiple speech samples whose pairwise acoustic feature differences are greater than a preset threshold;
A model calculation unit, configured to use the acoustic models to calculate, according to the acoustic features of the collected speech, the acoustic similarity probability between the speech and the phoneme sequences in the decoding network.
The speech recognition device provided by the embodiments of the present invention can execute the speech recognition method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the executed method.
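How the modules of the device fit together can be sketched as one class with a method per module. This is only an illustrative arrangement: the class name SpeechRecognitionDevice, the callable acoustic-model interface and the lookup-table model used in the example are hypothetical stand-ins, and a real acoustic model would compute its output from the features.

```python
from typing import Callable, Dict, Sequence

# Hypothetical interface: (acoustic features, content) -> acoustic similarity probability.
AcousticModel = Callable[[Sequence[float], str], float]


class SpeechRecognitionDevice:
    def __init__(self, acoustic_model: AcousticModel, weights: Dict[str, float]):
        self.acoustic_model = acoustic_model
        self.weights = dict(weights)   # content -> phoneme sequence weight

    def compute(self, features: Sequence[float]) -> Dict[str, float]:
        """Computing module 310: acoustic similarity probability per sequence."""
        return {c: self.acoustic_model(features, c) for c in self.weights}

    def match(self, acoustic: Dict[str, float]) -> Dict[str, float]:
        """Matching module 320 (sum calculation unit): probability plus weight."""
        return {c: p + self.weights[c] for c, p in acoustic.items()}

    def identify(self, matching: Dict[str, float]) -> str:
        """Identification module 330: content with the highest matching probability."""
        return max(matching, key=matching.get)

    def raise_noise_weight(self, step: float, noise_label: str = "<noise>") -> None:
        """Weight adjustment module 340: make the noise sequence easier to match."""
        self.weights[noise_label] += step

    def recognize(self, features: Sequence[float]) -> str:
        return self.identify(self.match(self.compute(features)))


# Stand-in acoustic model: a fixed lookup table that ignores the features.
TABLE = {"shut down": 0.40, "<noise>": 0.45}
device = SpeechRecognitionDevice(lambda feats, c: TABLE[c], {"shut down": 0.10, "<noise>": 0.00})
before = device.recognize([0.0])
device.raise_noise_weight(0.10)
after = device.recognize([0.0])
```

In this toy run the same input is first recognized as the command word and, after the noise weight has been raised, as noise, which mirrors the effect the weight adjustment module is meant to have.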
Embodiment four
Fig. 4 is a schematic structural diagram of a terminal provided by Embodiment four of the present invention. As shown in Fig. 4, the terminal includes a processor 410, a memory 420, an input device 430 and an output device 440. The terminal may have one or more processors 410; Fig. 4 takes one processor 410 as an example. The processor 410, memory 420, input device 430 and output device 440 in the terminal can be connected by a bus or in other ways; Fig. 4 takes connection by a bus as an example.
As a computer-readable storage medium, the memory 420 can be used to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the speech recognition method in the embodiments of the present invention (for example, the computing module 310, matching module 320, identification module 330 and weight adjustment module 340 in the speech recognition device). By running the software programs, instructions and modules stored in the memory 420, the processor 410 executes the various functional applications and data processing of the terminal, that is, implements the speech recognition method described above.
The memory 420 can mainly include a program storage area and a data storage area. The program storage area can store the operating system and the application programs required by at least one function; the data storage area can store data created through use of the terminal, and so on. In addition, the memory 420 may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some examples, the memory 420 may further include memory located remotely from the processor 410, and such remote memory can be connected to the terminal through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network and combinations thereof.
The input device 430 can be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the terminal. The output device 440 may include a display device such as a display screen.
Embodiment five
Embodiment five of the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a computer processor, implements a speech recognition method, the method comprising:
According to the acoustic features of collected speech, calculating the acoustic similarity probability between the speech and the phoneme sequences in a decoding network; wherein the decoding network comprises multiple groups of phoneme sequences, and each group of phoneme sequences corresponds either to a preset command word content or to noise content;
According to the acoustic similarity probability, obtaining the matching probability between the speech and the phoneme sequence;
Recognizing the speech as the content corresponding to the phoneme sequence with the highest matching probability.
Of course, in the computer-readable storage medium storing a computer program provided by the embodiments of the present invention, the program is not limited to the method operations described above, and can also perform related operations in the speech recognition method provided by any embodiment of the present invention.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus the necessary general-purpose hardware, and of course also by hardware alone, but in many cases the former is the better implementation. Based on this understanding, the part of the technical solution of the present invention that is essential, or that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk or optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments of the present invention.
It is worth noting that, in the above embodiment of the speech recognition device, the units and modules included are divided only according to functional logic; the division is not limited to the one described, as long as the corresponding functions can be realized. In addition, the specific names of the functional units are only for ease of distinguishing them from one another and are not intended to limit the protection scope of the present invention.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made by those skilled in the art without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, it is not limited to them; it may include other equivalent embodiments without departing from the inventive concept, and its scope is determined by the scope of the appended claims.