CN115294974B - A speech recognition method, device, equipment and storage medium - Google Patents
- Publication number: CN115294974B (application CN202210753629.8A)
- Authority: CN (China)
- Legal status: Active
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/02—Neural networks
- G06N3/084—Backpropagation, e.g. using gradient descent
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/063—Training
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
- G10L25/24—Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/78—Detection of presence or absence of voice signals
- G10L2015/0631—Creating reference templates; Clustering
- G10L2025/783—Detection of presence or absence of voice signals based on threshold decision
Abstract
An embodiment of the invention discloses a speech recognition method, apparatus, device and storage medium. The method includes: determining a decoding output score of the speech to be recognized and second candidate words corresponding to the speech to be recognized according to first candidate words obtained by decoding the speech to be recognized; determining the decoding features of each second candidate word and determining the decoding confidence of the speech to be recognized according to those decoding features; determining the noise confidence of each speech frame contained in the speech to be recognized and determining the noise confidence of the speech to be recognized from the frame-level values; and determining the comprehensive confidence of the speech to be recognized according to the decoding confidence, the noise confidence and the decoding output score, and determining the speech recognition result according to the comprehensive confidence. The technical scheme improves the accuracy of speech recognition without increasing cost.
Description
Technical Field
The embodiment of the invention relates to the technical field of speech processing, and in particular to a speech recognition method, apparatus, device and storage medium.
Background
Speech recognition is widely used in intelligent customer service, smart homes, vehicle-mounted assistants and other fields. Speech recognition systems are often affected by noise from the environment or the telephone channel, which easily causes recognition errors. For example, insertion errors may occur when noise does not coincide in time with a speech segment, and deletion or substitution errors may occur when a speech segment is corrupted by noise. Such errors pose a significant challenge for subsequent speech interaction.
In the prior art, a front-end noise-reduction module processes the speech to be recognized to reduce the influence of noise on its features, and a speech recognition module then recognizes the processed speech to determine the recognition result.
In the process of implementing the present invention, the inventors found at least the following technical problem in the prior art:
the front-end noise-reduction module and the speech recognition module must be adapted to each other, which increases the cost of speech recognition.
Disclosure of Invention
The invention provides a speech recognition method, apparatus, device and storage medium to reduce the cost of speech recognition.
In a first aspect, an embodiment of the present invention provides a method for voice recognition, including:
Determining a decoding output score of the voice to be recognized and a second candidate word corresponding to the voice to be recognized according to a first candidate word obtained by decoding the voice to be recognized;
determining decoding characteristics of the second candidate words, and determining decoding confidence of the voice to be recognized according to the decoding characteristics of the second candidate words;
determining the noise confidence coefficient of each voice frame contained in the voice to be recognized, and determining the noise confidence coefficient of the voice to be recognized according to the noise confidence coefficient of each voice frame;
and determining the comprehensive confidence of the voice to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the voice to be recognized, and determining a voice recognition result according to the comprehensive confidence.
In a second aspect, an embodiment of the present invention further provides a voice recognition apparatus, including:
the decoding module is used for determining a decoding output score of the voice to be recognized and a second candidate word corresponding to the voice to be recognized according to a first candidate word obtained by decoding the voice to be recognized;
the decoding confidence determining module is used for determining the decoding characteristics of each second candidate word and determining the decoding confidence of the voice to be recognized according to the decoding characteristics of each second candidate word;
The noise confidence coefficient determining module is used for determining the noise confidence coefficient of each voice frame contained in the voice to be recognized and determining the noise confidence coefficient of the voice to be recognized according to the noise confidence coefficient of each voice frame;
And the execution module is used for determining the comprehensive confidence coefficient of the voice to be recognized according to the decoding confidence coefficient, the noise confidence coefficient and the decoding output score of the voice to be recognized, and determining a voice recognition result according to the comprehensive confidence coefficient.
In a third aspect, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the speech recognition method according to any one of the first aspects when executing the program.
In a fourth aspect, embodiments of the present invention also provide a storage medium containing computer-executable instructions for performing the speech recognition method according to any one of the first aspects when executed by a computer processor.
The embodiments of the above invention have the following advantages or benefits:
An embodiment of the invention provides a speech recognition method that includes: determining a decoding output score of the speech to be recognized and second candidate words corresponding to the speech to be recognized according to first candidate words obtained by decoding the speech to be recognized; determining the decoding features of each second candidate word and determining the decoding confidence of the speech to be recognized from those features; determining the noise confidence of each speech frame contained in the speech to be recognized and determining the noise confidence of the speech to be recognized from the frame-level values; and determining the comprehensive confidence of the speech to be recognized according to the decoding confidence, the noise confidence and the decoding output score, and determining the speech recognition result according to the comprehensive confidence. In this technical scheme, the decoding output score determined from the first candidate words provides a data basis for the comprehensive confidence; the second candidate words determined from the first candidate words, together with their decoding features, yield a more accurate decoding confidence for the speech to be recognized; the frame-level noise confidences of the speech frames yield a segment-level noise confidence; and combining the segment-level decoding confidence, the segment-level noise confidence and the decoding output score yields a more accurate comprehensive confidence. The accuracy of speech recognition is thus improved without increasing cost.
Drawings
FIG. 1 is a schematic diagram of a speech recognition module according to an embodiment of the present invention;
FIG. 2 is a flowchart of a voice recognition method according to an embodiment of the present invention;
FIG. 3 is a first word diagram including a first candidate word obtained by decoding a speech to be recognized in a speech recognition method according to an embodiment of the present invention;
FIG. 4 is a flowchart of another speech recognition method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a second preset network model in another speech recognition method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a speech recognition system according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Before discussing exemplary embodiments in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart depicts operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently, or at the same time. Furthermore, the order of the operations may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figures. The processes may correspond to methods, functions, procedures, subroutines, and the like. Furthermore, embodiments of the invention and features of the embodiments may be combined with each other without conflict.
FIG. 1 is a schematic diagram of a speech recognition module according to an embodiment of the present invention. As shown in FIG. 1, a speech recognition module 100 may include a language model 110 and an acoustic model 120. During recognition, the speech recognition module 100 may use a decoding algorithm to obtain the optimal sequence through Viterbi search and generate the decoded output corresponding to the speech, that is, the word graph corresponding to the speech. Contamination of speech by noise easily causes speech recognition errors. The embodiment of the invention therefore provides a speech recognition method that improves the accuracy of speech recognition and reduces the error rate without increasing cost.
The speech recognition method according to the embodiment of the present invention is described in detail below with reference to the speech recognition module shown in FIG. 1 and the following embodiments.
Fig. 2 is a flowchart of a voice recognition method according to an embodiment of the present invention, where the embodiment of the present invention is applicable to a situation where the accuracy of voice recognition needs to be improved without increasing the cost. The method may be performed by a speech recognition device, which may be implemented in software and/or hardware. As shown in fig. 2, the method specifically includes the following steps:
Step 210, determining a decoding output score of the voice to be recognized and a second candidate word corresponding to the voice to be recognized according to a first candidate word obtained by decoding the voice to be recognized.
The speech recognition module shown in FIG. 1, comprising a language model and an acoustic model, may be used to decode the speech to be recognized and generate the corresponding first candidate words. The speech to be recognized is therefore input into the speech recognition module of FIG. 1, which decodes it to obtain the first candidate words. Owing to noise interference, the error rate of the first candidate words obtained by this decoding is high, and whether the speech to be recognized actually contains speech must be determined further.
The first candidate words corresponding to the speech to be recognized can be represented by a first word graph. The first word graph is a compressed representation of information, such as each first candidate word and its time, obtained while decoding the speech to be recognized: a node represents a state, and the value in a node's parentheses is the time corresponding to that state. Different candidate paths lead from the initial state to the final state, and the value on an edge of a candidate path is the score of the first candidate word on that edge, i.e. its posterior probability. The score of each candidate path and the time information of each first candidate word on it can be read from the first word graph. FIG. 3 shows a first word graph containing the first candidate words obtained by decoding the speech to be recognized. As shown in FIG. 3, from the initial state 0 to the final state 4, "Beijing", "background", "mobilization", "motion", "Olympic Games" and "meeting" are first candidate words; the posterior probability of "Beijing" is 0.5, of "background" 0.5, of "mobilization" 0.5, of "motion" 0.4, of "Olympic Games" 0.2 and of "meeting" 0.4. The two endpoint values of the edge on which "Winter Olympics" lies are 6 and 20, indicating that the speech content in the period 6s-20s is "Winter Olympics".
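The word graph described above can be sketched as a small data structure. This is an illustrative sketch only: the class and function names are invented, and the example edges only loosely follow the figure's values rather than the patent's actual lattice format.

```python
# Hypothetical sketch of the first word graph (lattice): nodes are decoder
# states tagged with times; edges carry a candidate word and its posterior.

class LatticeEdge:
    def __init__(self, src, dst, word, posterior, start_time, end_time):
        self.src, self.dst = src, dst            # state ids
        self.word = word                          # first candidate word
        self.posterior = posterior                # posterior probability
        self.start_time, self.end_time = start_time, end_time

def path_score(edges):
    """Score of one candidate path: product of its edge posteriors."""
    score = 1.0
    for e in edges:
        score *= e.posterior
    return score

# Two competing candidate paths from state 0 to state 4 (values illustrative).
e1 = LatticeEdge(0, 1, "Beijing", 0.5, 0, 6)
e2 = LatticeEdge(1, 4, "Winter Olympics", 0.2, 6, 20)
e3 = LatticeEdge(0, 2, "background", 0.5, 0, 6)
e4 = LatticeEdge(2, 4, "movement", 0.4, 6, 20)

best = max([[e1, e2], [e3, e4]], key=path_score)
```

Reading scores off the lattice this way gives the per-path and per-word information that the later steps consume.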
Specifically, after the first candidate words are obtained by decoding the speech to be recognized with the speech recognition module, on the one hand the decoding output score of each first candidate word can be determined and the scores sorted, the largest three decoding output scores normalized, and the result of that processing taken as the decoding output score of the speech to be recognized; on the other hand, the second candidate words corresponding to the speech to be recognized can be determined. Specifically, the second candidate words can be obtained by performing a second decoding pass on the first word graph based on minimum Bayes risk, with the edit distance as the criterion.
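The first half of this step, sorting the per-word decoding output scores and normalizing the largest three, admits a short sketch. The sum-to-one normalization used here is one plausible reading; the patent does not specify the exact normalization.

```python
def utterance_decoding_scores(word_scores, k=3):
    """Sort the per-word decoding output scores, keep the k largest, and
    normalise them to sum to 1. The normalised top-k values serve as the
    utterance-level decoding output score (interpretation, not the
    patent's exact formula)."""
    top = sorted(word_scores, reverse=True)[:k]
    total = sum(top)
    return [s / total for s in top]

scores = utterance_decoding_scores([2.0, 5.0, 1.0, 3.0])
```

With the illustrative inputs above, the three largest scores 5.0, 3.0 and 2.0 are normalised by their sum.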
In the embodiment of the invention, after the speech to be recognized is decoded by the speech recognition module to obtain the first candidate words, the decoding output score of the speech to be recognized can be determined from the first candidate words, and a second decoding pass over the first word graph containing the first candidate words yields the second candidate words. The second candidate words are used to determine the decoding confidence of the speech to be recognized and provide a data basis for it.
Step 220, determining the decoding characteristics of the second candidate words, and determining the decoding confidence of the voice to be recognized according to the decoding characteristics of the second candidate words.
Wherein the decoding features include a confidence score, a word category, a probability distribution, a word length, and a word graph depth for the second candidate word.
Specifically, after the posterior probability of each second candidate word is determined, the posterior probabilities are normalized to obtain each second candidate word's confidence score, a one-dimensional feature. After the domain of the speech to be recognized is determined, the word classes in that domain are determined; with N+1 classes in total, each second candidate word is mapped to one of the N+1 classes, and its word category is represented as an (N+1)-dimensional feature. The probability distribution of a second candidate word, a one-dimensional feature, is determined from the number of times it occurs among all second candidate words corresponding to the speech to be recognized and the total number of those second candidate words. The word length of a second candidate word, a one-dimensional feature, may be determined from the number of phonemes it contains. The word-graph depth of a second candidate word, a one-dimensional feature, is determined from the number of edges over all nodes in the word's time period in the second word graph and the length of that time period.
The decoding features of a second candidate word therefore form an (N+5)-dimensional feature vector. The (N+5)-dimensional decoding features of each second candidate word are input into a pre-trained decoding confidence model to obtain that word's decoding confidence; the arithmetic mean of the decoding confidences of all second candidate words is then taken as the decoding confidence of the speech to be recognized.
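Assembling the (N+5)-dimensional decoding feature and averaging the per-word confidences can be sketched as follows. The one-hot encoding of the word category, the feature ordering and the function names are assumptions; the pre-trained decoding confidence model itself is omitted.

```python
def decoding_feature_vector(confidence_score, word_class, n_classes,
                            probability, word_length, lattice_depth):
    """Assemble the (N+5)-dim decoding feature: 1-d confidence score +
    (N+1)-d word category (one-hot here, an assumption) + 1-d probability
    distribution + 1-d word length + 1-d word-graph depth. The layout is
    illustrative; the patent does not fix an ordering."""
    one_hot = [0.0] * (n_classes + 1)
    one_hot[word_class] = 1.0
    return [confidence_score] + one_hot + [probability, word_length, lattice_depth]

def utterance_decoding_confidence(word_confidences):
    """Arithmetic mean of the per-word decoding confidences."""
    return sum(word_confidences) / len(word_confidences)

# N = 4 domain classes -> a 9-dimensional (N+5) vector.
vec = decoding_feature_vector(0.8, word_class=2, n_classes=4,
                              probability=0.1, word_length=3, lattice_depth=2.5)
```

Each such vector would be fed to the pre-trained decoding confidence model; the utterance-level confidence is the mean of the model's per-word outputs.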
The decoding confidence of the speech to be recognized reflects the reliability of the recognition result; it generally ranges from 0 to 1, and the closer it is to 1, the more reliable the result.
In the embodiment of the invention, after the second word graph containing the second candidate words is obtained by the second decoding pass over the first word graph containing the first candidate words, the decoding features of each second candidate word can be determined from the second word graph, the decoding confidence of each second candidate word determined from those features, and the decoding confidence of the speech to be recognized determined from the per-word confidences. The decoding confidence of the speech to be recognized indicates how reliable the recognition result is and provides a data basis for determining it.
Step 230, determining the noise confidence of each voice frame included in the voice to be recognized, and determining the noise confidence of the voice to be recognized according to the noise confidence of each voice frame.
Before the noise confidence of the speech to be recognized is determined, the speech can be divided into frames. The frame length of each speech frame is 25 milliseconds and the frame shift is 10 milliseconds.
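The framing step can be sketched directly from the stated parameters. The 16 kHz sample rate is an assumption (the patent does not state one); the frame length and shift follow the 25 ms / 10 ms values above.

```python
def frame_signal(samples, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a waveform into overlapping frames: 25 ms frames with a
    10 ms shift, as described above (16 kHz rate is an assumption)."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)       # 160 samples at 16 kHz
    frames = []
    start = 0
    while start + frame_len <= len(samples):          # drop a trailing partial frame
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames

frames = frame_signal([0.0] * 16000)  # one second of silence
```

Each resulting frame would then have its MFCC features extracted for the noise confidence model in the next step.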
Specifically, the Mel-frequency cepstral coefficients (MFCC) of each speech frame may first be extracted and input into a pre-trained noise confidence model to obtain the probability p(t) that the frame contains speech; the probability that it does not is 1-p(t), which is taken as the frame's noise confidence. After the noise confidence of every speech frame contained in the speech to be recognized has been determined, the maximum, minimum, mean and variance of the frame-level noise confidences are determined, and these statistics are taken as the noise confidence of the speech to be recognized.
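Aggregating the frame-level noise confidences into the four segment-level statistics can be sketched as follows. The pre-trained noise confidence model is replaced by its outputs p(t), which are taken as given; using the population (rather than sample) variance is an assumption.

```python
def utterance_noise_confidence(speech_probs):
    """Per-frame noise confidence is 1 - p(t), where p(t) is the model's
    probability that frame t contains speech; the utterance-level noise
    confidence is the max, min, mean and variance of the frame-level
    values, as described above."""
    noise = [1.0 - p for p in speech_probs]
    mean = sum(noise) / len(noise)
    var = sum((x - mean) ** 2 for x in noise) / len(noise)  # population variance
    return {"max": max(noise), "min": min(noise), "mean": mean, "var": var}

stats = utterance_noise_confidence([0.9, 0.7, 0.5])  # hypothetical p(t) values
```

The four statistics form the segment-level noise confidence that is later fused into the comprehensive confidence.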
In the embodiment of the invention, the noise confidence of the segment level of the voice to be recognized can be determined based on the noise confidence of the frame level of each voice frame contained in the voice to be recognized.
Step 240, determining the comprehensive confidence of the voice to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the voice to be recognized, and determining a voice recognition result according to the comprehensive confidence.
Specifically, the decoding confidence, the noise confidence and the decoding output score of the speech to be recognized are taken as the recognition features of the speech and input into a pre-trained recognition model; the output is the comprehensive confidence of the speech to be recognized. The comprehensive confidence fuses the segment-level decoding confidence, the segment-level noise confidence determined from the frame-level noise confidences, and the decoding output score. Based on the comprehensive confidence it is determined whether the result is a valid recognition result, i.e. whether the speech to be recognized actually contains speech.
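The fusion step can be sketched with a stand-in for the pre-trained recognition model. The logistic combiner, its weights and the acceptance threshold are all illustrative assumptions; the patent only says the three inputs are fed to a trained model.

```python
import math

def comprehensive_confidence(decoding_conf, noise_stats, output_scores,
                             weights, bias):
    """Stand-in for the pre-trained recognition model: the segment-level
    decoding confidence, noise-confidence statistics and decoding output
    scores are concatenated into one feature vector and mapped to a
    score in (0, 1). The logistic form and weights are assumptions."""
    features = [decoding_conf] + list(noise_stats) + list(output_scores)
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def is_valid_recognition(conf, threshold=0.5):
    """Accept the result when the fused confidence clears a threshold
    (threshold value is an assumption)."""
    return conf >= threshold

# Hypothetical inputs: decoding confidence, [max, min, mean, var] noise
# statistics, and three normalised decoding output scores.
c = comprehensive_confidence(0.9, [0.2, 0.05, 0.1, 0.01],
                             [0.5, 0.3, 0.2],
                             weights=[2.0, -1.0, -0.5, -1.0, -0.2,
                                      1.0, 0.5, 0.2], bias=-0.5)
```

In the patent's scheme the weights would be learned during training rather than fixed by hand.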
In the embodiment of the invention, the probability that the speech to be recognized contains speech and the probability that it does not can be determined from the comprehensive confidence, and the recognition result is then determined from the former. This effectively mitigates insertion errors caused by noise, and since no specific optimization or retraining of the speech recognition module is needed, the method can be adapted to different speech recognition modules.
The speech recognition method described above determines the decoding output score of the speech to be recognized and the corresponding second candidate words from the first candidate words obtained by decoding; determines the decoding features of the second candidate words and, from them, the decoding confidence of the speech to be recognized; determines the noise confidence of each speech frame and, from the frame-level values, the noise confidence of the speech to be recognized; and determines the comprehensive confidence from the decoding confidence, the noise confidence and the decoding output score, from which the recognition result is determined.
In this technical scheme, the decoding output score determined from the first candidate words provides a data basis for the comprehensive confidence; the second candidate words determined from the first candidate words, together with their decoding features, yield a more accurate decoding confidence; the frame-level noise confidences yield a segment-level noise confidence; and combining the segment-level decoding confidence, the segment-level noise confidence and the decoding output score yields a more accurate comprehensive confidence. The accuracy of speech recognition is thus improved without increasing cost.
Fig. 4 is a flowchart of another voice recognition method according to an embodiment of the present invention, where the embodiment of the present invention is applicable to a situation where the accuracy of voice recognition needs to be improved without increasing the cost. The explanation of the terms of the embodiments of the present invention that are the same as or corresponding to the embodiments described above will not be repeated here. Referring to fig. 4, the voice recognition method provided by the embodiment of the invention includes:
Step 410, determining a decoding output score of the voice to be recognized according to a first candidate word obtained by decoding the voice to be recognized.
In one embodiment, step 410 may specifically include:
The voice to be recognized is decoded once by a voice recognition module formed by a language model and an acoustic model to obtain the first candidate words; the language score and the acoustic score of each first candidate word are determined; and the decoding output score of the voice to be recognized is determined according to the language score and the acoustic score of each first candidate word.
Specifically, the voice recognition module comprising the language model and the acoustic model may decode the voice to be recognized once to obtain the first candidate words corresponding to the voice to be recognized. The language score and the acoustic score of each first candidate word may then be determined, and after the two scores are fused, the decoding output score of each first candidate word is obtained. After the decoding output score of each first candidate word is determined, the scores may be sorted, normalization may be performed on the three largest decoding output scores, and the processing result is determined as the decoding output score of the voice to be recognized.
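The patent does not fix how the language and acoustic scores are fused or which normalization is applied to the three largest scores; the following sketch assumes a weighted sum of the two scores and a softmax normalization, both of which are illustrative assumptions:

```python
import numpy as np

def decoding_output_score(lang_scores, acoustic_scores, lm_weight=1.0):
    """Fuse the language and acoustic score of each first candidate word,
    then normalize the three largest fused scores (softmax is assumed here)
    and return them as the decoding output score of the utterance."""
    fused = np.asarray(lang_scores) * lm_weight + np.asarray(acoustic_scores)
    top3 = np.sort(fused)[-3:]             # three largest fused scores
    exp = np.exp(top3 - top3.max())        # numerically stable softmax
    return exp / exp.sum()
```

After normalization the three values sum to one and can be fed, together with the confidences computed later, into the fusion model of step 470.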
In the embodiment of the invention, after the voice to be recognized is decoded based on the voice recognition module to obtain the first candidate word, the decoding output score of the voice to be recognized can be determined according to the first candidate word, and a data basis is provided for determining the comprehensive confidence of the voice to be recognized.
Step 420, determining a second candidate word corresponding to the voice to be recognized according to the first candidate word obtained by decoding the voice to be recognized.
In one embodiment, step 420 may specifically include:
The first word graph is decoded a second time based on the minimum Bayesian risk, with the edit distance as the criterion, so as to obtain the second candidate words corresponding to the voice to be recognized and the posterior probability of each second candidate word.
Specifically, after the voice recognition module comprising the language model and the acoustic model decodes the voice to be recognized once to obtain the first candidate words corresponding to the voice to be recognized, the first word graph may be determined from the first candidate words. The second candidate words corresponding to the voice to be recognized are then obtained by decoding the first word graph a second time based on the minimum Bayesian risk, with the edit distance as the criterion.
The process of performing secondary decoding on the first word graph is:
1) selecting a candidate path from the initial state to the termination state of the first word graph;
2) based on the candidate path, calculating the edit distance (the minimum number of insertions, deletions and substitutions needed to turn one text into another) between the candidate path and the whole first word graph, and obtaining, through the edit distance, the posterior probabilities of all first candidate words in the time period corresponding to each first candidate word on the candidate path;
3) selecting the word with the highest probability at each moment to obtain a new word sequence, i.e., the second candidate words;
4) if the second candidate words differ from the first candidate words of the candidate path in 2), returning to execute 2); otherwise, determining that secondary decoding is finished and taking the word sequence containing the second candidate words as the secondary decoding result.
In addition, each time period of the resulting word sequence carries the posterior probabilities of all second candidate words in that time period.
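The edit distance used as the criterion in step 2) is the standard Levenshtein distance over word sequences; a minimal dynamic-programming implementation:

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions needed
    to turn word sequence `a` into word sequence `b` (Levenshtein)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                        # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                        # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]
```

In minimum-Bayes-risk decoding this distance plays the role of the loss function: paths are rescored so as to minimize the expected edit distance to the rest of the word graph rather than to maximize path probability.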
In the embodiment of the invention, after the voice to be recognized is decoded based on the voice recognition module to obtain the first candidate word, the second candidate word of the voice to be recognized can be determined according to the first candidate word, and a data basis is provided for determining the decoding confidence of the voice to be recognized.
Step 430, determining decoding characteristics of each second candidate word.
Wherein the decoding features include confidence scores, word categories, probability distributions, word lengths, and word graph depths for the second candidate word.
In one embodiment, step 430 may specifically include:
The posterior probability of each second candidate word is normalized to obtain the confidence score of the second candidate word; the word class of each second candidate word is determined according to its class information; the probability distribution of each second candidate word is determined according to its number of occurrences among all second candidate words corresponding to the voice to be recognized; the word length of each second candidate word is determined according to the number of phonemes it contains; and the word graph depth of each second candidate word is determined, in the second word graph obtained by secondary decoding of the first candidate words, according to the number of edges of all nodes in the time period corresponding to the second candidate word and the length of that time period.
Specifically, when the second candidate words are obtained by secondary decoding of the first word graph, a second word graph containing the second candidate words can be generated, and this second word graph also contains the posterior probabilities of the second candidate words. The posterior probability of each second candidate word is normalized to obtain its confidence score, which can be taken as a one-dimensional feature of the second candidate word. The word class indicates the class information of the second candidate word. First, the domain of the voice to be recognized and the classification of the words in that domain can be determined, and the word classes are sorted by word frequency; each of the first N classes is kept as its own class, and the words of all remaining classes are unified into one special class, so that the words in the domain are divided into N+1 classes. Each second candidate word can be mapped to one of the N+1 classes, so its word class can be represented as an (N+1)-dimensional one-hot feature. For example, when the words in the domain of the voice to be recognized are divided into N+1=3+1=4 classes, the word class of a second candidate word is one of: (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0) or (0, 0, 0, 1).
The probability distribution indicates the number of occurrences of the second candidate word among all second candidate words corresponding to the voice to be recognized. It can be determined from that number of occurrences and the total number of second candidate words corresponding to the voice to be recognized; this is the unigram probability of the second candidate word and is taken as a one-dimensional feature. The word length indicates the number of phonemes contained in the second candidate word and is likewise taken as a one-dimensional feature. The word graph depth indicates the average word graph depth in the time period corresponding to the second candidate word: in the second word graph containing the second candidate word, it is determined from the number of edges of all nodes in that time period and the length of the time period (i.e., the average lattice depth), and is taken as a one-dimensional feature.
Thus, the decoding feature of each second candidate word is an (N+5)-dimensional feature; as described above, when N+1=3+1=4, the decoding feature of the second candidate word is an eight-dimensional feature.
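The assembly of the (N+5)-dimensional feature can be sketched as follows; the ordering of the components within the vector is an assumption, the patent only requires that all five feature groups are present:

```python
import numpy as np

def decode_features(conf_score, class_index, unigram_prob,
                    num_phonemes, lattice_depth, num_classes=4):
    """Assemble the decoding feature of one second candidate word:
    an (N+1)-dimensional one-hot word class plus the four scalar
    features (confidence score, unigram probability, word length,
    word graph depth), giving N+5 dimensions in total."""
    one_hot = np.zeros(num_classes)        # num_classes = N + 1
    one_hot[class_index] = 1.0
    scalars = [conf_score, unigram_prob, num_phonemes, lattice_depth]
    return np.concatenate([one_hot, scalars])
```

With `num_classes=4` (N=3) this yields the eight-dimensional feature of the example above.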
In the embodiment of the invention, the N+5-dimensional decoding characteristics of each second candidate word obtained by secondary decoding can be determined, the decoding characteristics of the second candidate words are used for determining the decoding confidence coefficient of the voice to be recognized, and a data basis is provided for determining the decoding confidence coefficient of the voice to be recognized.
Step 440, determining the decoding confidence of the speech to be recognized according to the decoding characteristics of each second candidate word.
In one embodiment, step 440 may specifically include:
The decoding features of the second candidate words are respectively input into a pre-trained decoding confidence model to obtain the decoding confidence of each second candidate word, and the decoding confidence of the voice to be recognized is determined according to the decoding confidences of the second candidate words.
Before the decoding features of the second candidate words are respectively input into the pre-trained decoding confidence coefficient model, the method further comprises:
A first preset network model is constructed based on a deep neural network and a cross-entropy function. The (N+5)-dimensional decoding features of the second candidate words corresponding to labeled voice data containing noise and effective voice, together with the labeling information of the voice data, are taken as first training data; network training is performed on the first preset network model, the first loss function is calculated, and network optimization is performed based on the back-propagation algorithm until the first loss function converges, yielding the decoding confidence model.
Wherein the first loss function may be a cross entropy function.
Specifically, after the first preset network model is constructed based on a deep neural network and a cross-entropy function, the voice data containing noise and effective voice is labeled: noise is labeled 0 and effective voice is labeled 1, and the labeled voice data is determined as the first training voice set. After the first training candidate words corresponding to each training voice in the first training voice set are obtained by decoding the training voice once with the voice recognition module, the second training candidate words corresponding to each training voice are obtained by secondary decoding, and the (N+5)-dimensional decoding features of each second training candidate word are determined in the manner of step 430. Taking the (N+5)-dimensional decoding features of the second candidate words corresponding to each training voice, together with the labeling information of the training voice, as training data, network training is performed on the first preset network model, and the cross-entropy function is calculated from the sigmoid activation value output by the model, which represents the confidence score of the second candidate word, and the labeling information of the training voice. Network optimization is performed based on the back-propagation algorithm until the cross-entropy function converges, yielding the decoding confidence model.
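A minimal numpy sketch of such a network: one hidden layer stands in for the deep neural network, with a sigmoid output, binary cross entropy and plain gradient-descent backpropagation. Layer sizes and the learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DecodeConfidenceModel:
    """Sketch of the first preset network: (N+5)-dim decoding features
    in, sigmoid confidence out, trained with cross entropy."""
    def __init__(self, in_dim=8, hidden=16, lr=0.05):
        self.W1 = rng.normal(0, 0.5, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 0.5, (hidden, 1))
        self.b2 = np.zeros(1)
        self.lr = lr

    def forward(self, X):
        self.h = np.tanh(X @ self.W1 + self.b1)
        self.p = sigmoid(self.h @ self.W2 + self.b2).ravel()
        return self.p

    def train_step(self, X, y):
        p = self.forward(X)
        loss = -np.mean(y * np.log(p + 1e-9)
                        + (1 - y) * np.log(1 - p + 1e-9))
        # backpropagation of the binary cross-entropy loss
        dz2 = (p - y)[:, None] / len(y)
        dW2 = self.h.T @ dz2
        dh = dz2 @ self.W2.T * (1 - self.h ** 2)
        dW1 = X.T @ dh
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * dz2.sum(0)
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * dh.sum(0)
        return loss
```

At inference time `forward` applied to the feature of each second candidate word gives its decoding confidence, and the arithmetic mean over the utterance gives the segment-level decoding confidence described below.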
The decoding features of each second candidate word can then be respectively input into the decoding confidence model; the output is the decoding confidence of each second candidate word. The arithmetic mean of the decoding confidences of the second candidate words corresponding to the voice to be recognized is determined, and this mean is taken as the decoding confidence of the voice to be recognized.
In the embodiment of the invention, the decoding confidence coefficient of each second candidate word corresponding to the voice to be recognized can be determined based on the decoding confidence coefficient model, and then the decoding confidence coefficient of the voice to be recognized can be determined according to the decoding confidence coefficient of each second candidate word, and the decoding confidence coefficient of the voice to be recognized can indicate the reliability degree of the voice recognition result, so that a data basis is provided for determining the voice recognition result.
Step 450, determining the noise confidence of each voice frame included in the voice to be recognized.
In one embodiment, step 450 may specifically include:
The voice to be recognized is divided into frames to obtain the voice frames it contains; the mel cepstrum coefficients of each voice frame are determined; and the mel cepstrum coefficients of each voice frame are respectively input into a pre-trained noise confidence model to obtain the noise confidence of each voice frame.
Before the mel cepstrum coefficients of each voice frame are respectively input into the pre-trained noise confidence model, the method further comprises:
A second preset network model is constructed based on a gated recurrent unit (Gated Recurrent Unit, GRU). The mel cepstrum coefficients of the frame-level training voices corresponding to each training voice in a second training voice set, formed from training voices containing pure noise and pure voice, together with the labeling information of each frame-level training voice, are taken as second training data; network training is performed on the second preset network model, the second loss function is calculated, and the weights of the second preset network model are iterated based on stochastic gradient descent until the second loss function converges, yielding the noise confidence model.
Wherein the second loss function may also be a cross entropy function.
Fig. 5 is a schematic diagram of the second preset network model in another voice recognition method according to an embodiment of the present invention. As shown in fig. 5, the second preset network model includes a first fully connected layer (fully connected layer, FC), a first GRU, a second GRU, a third GRU, and a second FC.
Specifically, after the second preset network model is built based on the GRU, pure noise and pure voice are collected, and the pure noise is randomly added to the pure voice at a preset signal-to-noise ratio to obtain training voices; a preset number of training voices is determined as the second training voice set. Each training voice in the second training voice set is divided into frames with a frame length of 25 milliseconds and a frame shift of 10 milliseconds to obtain the frame-level training voices corresponding to each training voice, and each frame is labeled 1 when its phoneme is not silence and 0 otherwise. The mel cepstrum coefficients of the frame-level training voices corresponding to each training voice in the second training voice set, together with the labeling information of each frame, are then taken as second training data for network training of the second preset network model. Specifically, the mel cepstrum coefficients of L frames of training voice can be used as a training sequence and input into the second preset network model; the output corresponding to each frame is a two-dimensional vector, one dimension representing the probability that the current frame contains voice and the other the probability that it does not. Taking the labeling information of the L frames as the target sequence, network training is performed on the second preset network model and the cross-entropy function is calculated; the weights of the second preset network model are iterated based on stochastic gradient descent until the second loss function converges, yielding the noise confidence model.
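The recurrence inside one GRU cell, the building block of the network in fig. 5, can be sketched as follows; the input and hidden dimensions and the weights are illustrative, the actual model stacks three GRU layers between two fully connected layers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, params):
    """One forward step of a gated recurrent unit.
    params = (Wz, Uz, Wr, Ur, Wh, Uh): input and recurrent weights
    for the update gate, reset gate and candidate state."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(x @ Wz + h @ Uz)               # update gate
    r = sigmoid(x @ Wr + h @ Ur)               # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate state
    return (1 - z) * h + z * h_tilde           # blend old and new state
```

The gating lets the hidden state carry voice/noise context across frames, which is why a GRU is a natural fit for frame-level noise classification.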
The voice to be recognized can then be divided into frames with a frame length of 25 milliseconds and a frame shift of 10 milliseconds to obtain the voice frames it contains, and the mel cepstrum coefficients of each voice frame are determined. The mel cepstrum coefficients of a voice frame are input into the noise confidence model, and the output is the probability p(t) that the voice frame contains voice; the probability that it does not contain voice is 1-p(t), so the noise confidence of the voice frame is 1-p(t).
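The 25 ms / 10 ms framing can be sketched as follows; the 16 kHz sampling rate is an assumption, the patent does not fix it:

```python
import numpy as np

def split_frames(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a waveform into overlapping frames: 25 ms frame length
    with a 10 ms frame shift, as used for training and recognition."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)       # 160 samples at 16 kHz
    n = 1 + max(0, (len(signal) - frame_len) // shift)
    return np.stack([signal[i * shift: i * shift + frame_len]
                     for i in range(n)])
```

Each row of the result is one voice frame from which the mel cepstrum coefficients are computed; the noise confidence of that frame is then 1-p(t) as described above.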
In the embodiment of the invention, the noise confidence coefficient of each voice frame contained in the voice to be recognized can be determined based on the noise confidence coefficient model, the noise confidence coefficient of each voice frame is used for determining the noise confidence coefficient of the voice to be recognized, the noise confidence coefficient of the voice to be recognized can be further used for determining the comprehensive confidence coefficient of the voice to be recognized, and a data basis is provided for determining the voice recognition result.
Step 460, determining the noise confidence coefficient of the voice to be recognized according to the noise confidence coefficient of each voice frame.
In one embodiment, step 460 may specifically include:
The noise confidence of the voice to be recognized is determined according to the maximum noise confidence, the minimum noise confidence, the noise confidence mean and the noise confidence variance among the noise confidences of the voice frames contained in the voice to be recognized.
Specifically, after the noise confidence of each voice frame contained in the voice to be recognized is determined, the frame-level noise confidences may be sorted and their mean and variance calculated; the maximum, the minimum, the mean and the variance of these noise confidences are determined as the noise confidence of the voice to be recognized.
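Collapsing the frame-level noise confidences into the four segment-level statistics is a one-liner:

```python
import numpy as np

def segment_noise_confidence(frame_confidences):
    """Reduce the frame-level noise confidences of an utterance to the
    four segment-level statistics: max, min, mean and variance."""
    c = np.asarray(frame_confidences, dtype=float)
    return np.array([c.max(), c.min(), c.mean(), c.var()])
```

These four values form the segment-level noise confidence that is later fused with the decoding confidence and the decoding output score.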
In the embodiment of the invention, the noise confidence of the segment level of the voice to be recognized can be determined based on the noise confidence of the frame level of each voice frame contained in the voice to be recognized.
And 470, determining the comprehensive confidence of the voice to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the voice to be recognized.
In one embodiment, step 470 may specifically include:
The decoding confidence, the noise confidence and the decoding output score of the voice to be recognized are input into a pre-trained voice recognition model to obtain the comprehensive confidence of the voice to be recognized.
Before inputting the decoding confidence level, the noise confidence level and the decoding output score of the voice to be recognized into the pre-trained voice recognition model, the method further comprises the following steps:
A third preset network model is constructed based on logistic regression. The decoding confidence, noise confidence, decoding output score and labeling information of each training voice in a third training voice set, constructed from voices containing noise, are taken as third training data; network training is performed on the third preset network model, the third loss function is calculated, and network optimization is performed based on the back-propagation algorithm until the third loss function converges, yielding the voice recognition model.
The decoding confidence, the noise confidence and the decoding output score of the voice to be recognized can then be input into the voice recognition model, and the output is the comprehensive confidence of the voice to be recognized.
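Since the third model is logistic regression, the fusion reduces to a sigmoid over a weighted sum; the feature layout assumed here (one decoding confidence, four noise statistics, three normalized output scores) follows the earlier steps, and the trained weights `w` and bias `b` are placeholders:

```python
import numpy as np

def fuse_confidence(decode_conf, noise_conf, output_score, w, b):
    """Logistic-regression fusion of the segment-level decoding
    confidence, the four noise-confidence statistics and the decoding
    output scores into one comprehensive confidence in (0, 1)."""
    x = np.concatenate([[decode_conf], noise_conf, output_score])
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))
```

The resulting scalar is the probability that the voice to be recognized contains voice, which is then compared against the two thresholds of step 480.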
In the embodiment of the invention, the segment-level decoding confidence of the voice to be recognized, the noise confidence derived from the frame-level noise confidences, and the decoding output score are fused to determine the comprehensive confidence of the voice to be recognized.
And step 480, determining a voice recognition result according to the comprehensive confidence.
The comprehensive confidence comprises the probability that the voice to be recognized contains the voice and the probability that the voice to be recognized does not contain the voice.
In one embodiment, step 480 may specifically include:
If the probability that the voice to be recognized contains voice is greater than or equal to a first preset threshold, the voice recognition result is determined to be that the voice to be recognized contains voice. If the probability is greater than or equal to a second preset threshold and smaller than the first preset threshold, the voice recognition result is determined to be that the voice to be recognized does not contain voice. If the probability is smaller than the second preset threshold, a voice recognition error is determined; alternatively, the voice to be recognized is optimized to obtain optimized voice, and voice recognition is performed again based on the optimized voice.
The first preset threshold is larger than the second preset threshold, and both are smaller than 1.
Specifically, after the comprehensive confidence of the voice to be recognized is determined based on the voice recognition model, the probability that the voice to be recognized contains voice may first be compared with the second preset threshold.
On the one hand, if the probability that the voice to be recognized contains voice is greater than or equal to the second preset threshold, the voice recognition result can be determined from that probability, i.e., the decoding result of the voice recognition module can be adopted. The probability is then further compared with the first preset threshold: if it is greater than or equal to the first preset threshold, the voice recognition result is that the voice to be recognized contains voice; if it is smaller than the first preset threshold, the voice recognition result is that the voice to be recognized does not contain voice.
On the other hand, if the probability that the voice to be recognized contains voice is smaller than the second preset threshold, the voice recognition result cannot be determined from that probability, and the decoding result of the voice recognition module for the voice to be recognized is not adopted. A voice recognition error can then be determined; alternatively, the voice to be recognized can be optimized to obtain optimized voice, and voice recognition is performed again based on the optimized voice.
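The two-threshold decision rule above can be sketched as a small function; the threshold values are illustrative, the patent only requires second threshold < first threshold < 1:

```python
def decide(p_speech, first_threshold=0.8, second_threshold=0.4):
    """Two-threshold decision of step 480 on the probability that the
    utterance contains speech (threshold values are assumptions)."""
    if p_speech >= first_threshold:
        return "contains speech"        # adopt decoding result
    if p_speech >= second_threshold:
        return "no speech"              # adopt decoding result
    return "reject"                     # error, or denoise and retry
```

In the reject branch the system either reports a recognition error or applies the frame-muting optimization described next and decodes again.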
In one embodiment, optimizing the speech to be recognized to obtain an optimized speech includes:
The voice frames in the voice to be recognized whose noise confidence is larger than a preset confidence are set to silence, yielding the optimized voice.
Specifically, the noise confidence of each voice frame contained in the voice to be recognized is compared with the preset confidence; if the noise confidence of any voice frame is larger than the preset confidence, that voice frame is set to silence, thereby optimizing the voice to be recognized into the optimized voice. Setting the voice frames whose noise confidence exceeds the preset confidence to silence reduces the noise in the voice to be recognized, and continuing voice recognition on the optimized, noise-reduced voice can improve the accuracy of voice recognition.
In the embodiment of the invention, whether the voice to be recognized contains voice can be determined according to the comprehensive confidence. Alternatively, the decoding result of the voice recognition module for the voice to be recognized can be discarded and a voice recognition error determined, or the voice to be recognized can be noise-reduced and optimized to obtain optimized voice, which is then decoded by the voice recognition module to obtain the voice recognition result.
The voice recognition method comprises: determining a decoding output score of the voice to be recognized according to first candidate words obtained by decoding the voice to be recognized; determining second candidate words corresponding to the voice to be recognized according to the first candidate words; determining the decoding features of each second candidate word; determining the decoding confidence of the voice to be recognized according to the decoding features of each second candidate word; determining the noise confidence of each voice frame contained in the voice to be recognized; determining the noise confidence of the voice to be recognized according to the noise confidence of each voice frame; determining the comprehensive confidence of the voice to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the voice to be recognized; and determining the voice recognition result according to the comprehensive confidence.
According to the technical scheme, firstly, the decoding output score of the voice to be recognized can be determined according to the first candidate words obtained by decoding the voice to be recognized, providing a data basis for determining the comprehensive confidence. The second candidate words corresponding to the voice to be recognized are determined according to the first candidate words, and after the decoding features of each second candidate word are determined, a more accurate decoding confidence of the voice to be recognized is determined according to those decoding features. Secondly, a more accurate segment-level noise confidence of the voice to be recognized can be determined according to the frame-level noise confidence of each voice frame contained in the voice to be recognized. Further, a more accurate comprehensive confidence of the voice to be recognized is determined by combining the segment-level decoding confidence, the segment-level noise confidence and the determined decoding output score. On the premise of not increasing the cost, the accuracy of voice recognition is improved.
In addition, after it is determined according to the comprehensive confidence that the decoding result of the voice recognition module for the voice to be recognized is not adopted, a voice recognition error can be determined; alternatively, the voice to be recognized can be noise-reduced and optimized to obtain optimized voice, which is then decoded by the voice recognition module to obtain the voice recognition result.
Fig. 6 is a schematic diagram of a speech recognition system provided in the embodiment of the present invention. As shown in fig. 6, the speech recognition system may include a speech recognition module 100, a decoding confidence module 200, a noise confidence module 300, a result determination module 400 and a processing module 500. The speech recognition module 100 is configured to decode the speech to be recognized once to determine the first candidate words of the speech to be recognized and a first word graph containing the first candidate words. The decoding confidence module 200 is configured to determine the decoding output score of the speech to be recognized and the second candidate words corresponding to the speech to be recognized according to the first candidate words and, after determining the decoding features of each second candidate word, to determine the decoding confidence of the speech to be recognized according to those decoding features. The noise confidence module 300 is configured to determine the noise confidence of each speech frame contained in the speech to be recognized, and to determine the noise confidence of the speech to be recognized according to the noise confidence of each speech frame. The result determination module 400 is configured to determine the comprehensive confidence of the speech to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the speech to be recognized. The processing module 500 is configured to determine that the speech to be recognized contains speech when the probability that it contains speech is greater than or equal to the first preset threshold; to determine that the speech to be recognized does not contain speech when the probability is greater than or equal to the second preset threshold and smaller than the first preset threshold; and, when the probability is smaller than the second preset threshold, to determine a speech recognition error, or to optimize the speech to be recognized to obtain optimized speech and re-perform speech recognition based on the optimized speech.
The voice recognition system provided by the embodiment of the invention can execute the voice recognition method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of executing the voice recognition method.
Fig. 7 is a schematic structural diagram of a voice recognition device according to an embodiment of the present invention. The device belongs to the same inventive concept as the voice recognition method of the above embodiments; for details of the voice recognition device embodiment that are not described here, reference may be made to the embodiments of the voice recognition method.
The specific structure of the speech recognition device is shown in fig. 7, and includes:
the decoding module 710 is configured to determine, according to a first candidate word obtained by decoding a speech to be recognized, a decoding output score of the speech to be recognized and a second candidate word corresponding to the speech to be recognized;
The decoding confidence determining module 720 is configured to determine decoding features of the second candidate words, and determine decoding confidence of the speech to be recognized according to the decoding features of the second candidate words;
the noise confidence determining module 730 is configured to determine a noise confidence of each speech frame included in the speech to be recognized, and determine a noise confidence of the speech to be recognized according to the noise confidence of each speech frame;
The execution module 740 is configured to determine a comprehensive confidence level of the speech to be recognized according to the decoding confidence level, the noise confidence level, and the decoding output score of the speech to be recognized, and determine a speech recognition result according to the comprehensive confidence level.
Based on the above embodiment, the decoding module 710 is specifically configured to:
decoding the speech to be recognized once, based on a speech recognition module formed by a language model and an acoustic model, to obtain a first word graph containing the first candidate words;
determining a language score and an acoustic score of each first candidate word, and determining the decoding output score of the speech to be recognized according to the language score and the acoustic score of each first candidate word;
and performing secondary decoding on the first word graph using minimum Bayes risk with edit distance as the criterion, to obtain the second candidate words corresponding to the speech to be recognized and the posterior probability of each second candidate word.
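The minimum-Bayes-risk secondary decoding above can be sketched as follows. This is an illustrative Python sketch, not the patented implementation: the first word graph is approximated by an n-best list with posterior probabilities, and the function names are assumptions.

```python
# Illustrative sketch of minimum-Bayes-risk (MBR) rescoring with edit
# distance as the risk criterion. The patent applies this to a word graph;
# here the graph is approximated by an n-best list with posteriors.

def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def mbr_rescore(nbest):
    """nbest: list of (token_sequence, posterior) pairs.
    Returns the hypothesis minimizing expected edit distance
    to the other hypotheses, weighted by their posteriors."""
    best, best_risk = None, float("inf")
    for hyp, _ in nbest:
        risk = sum(p * edit_distance(hyp, other) for other, p in nbest)
        if risk < best_risk:
            best, best_risk = hyp, risk
    return best
```

Edit-distance MBR tends to select the hypothesis that is "central" among the candidates, which often lowers word error rate relative to picking the single highest-posterior path.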
On the basis of the foregoing embodiment, the decoding features include a confidence score, a word category, a probability distribution, a word length, and a word graph depth of the second candidate word, and accordingly, the decoding confidence determining module 720 is specifically configured to:
determining the word category of each second candidate word according to the category information of the second candidate word; determining the probability distribution of each second candidate word according to the number of times the second candidate word occurs among all second candidate words corresponding to the speech to be recognized; determining the word length of each second candidate word according to the number of phonemes it contains; and determining the word graph depth of each second candidate word according to the number of edges of all nodes within, and the length of, the time period corresponding to the second candidate word in a second word graph obtained by secondarily decoding the first candidate words;
respectively inputting the decoding characteristics of each second candidate word into a pre-trained decoding confidence coefficient model to obtain the decoding confidence coefficient of each second candidate word;
and determining the decoding confidence degree of the voice to be recognized according to the decoding confidence degree of each second candidate word.
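The feature-based confidence step can be illustrated with the sketch below. The five features follow the list in the patent, but the logistic scorer and its weights are stand-ins for the pre-trained decoding confidence model, and approximating the phoneme count by the character count is an assumption.

```python
import math

# Illustrative sketch: assemble the five decoding features named in the
# patent for each second candidate word, score each word with a stand-in
# confidence model, and average over words to obtain the utterance-level
# decoding confidence. The weights are illustrative, not trained.

def word_features(word, posterior, word_class, counts, total, depth):
    return [
        posterior,            # confidence score (posterior probability)
        float(word_class),    # word category id
        counts[word] / total, # probability distribution (relative frequency)
        len(word),            # word length (phonemes approximated by characters)
        depth,                # word-graph depth over the word's time span
    ]

def word_confidence(features, weights, bias=0.0):
    # Logistic stand-in for the pre-trained decoding confidence model.
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def utterance_decoding_confidence(all_features, weights):
    scores = [word_confidence(f, weights) for f in all_features]
    return sum(scores) / len(scores)
```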
Based on the above embodiment, the noise confidence determining module 730 is specifically configured to:
framing the voice to be recognized to obtain a voice frame contained in the voice to be recognized;
determining the Mel-frequency cepstral coefficients (MFCCs) of each speech frame, and inputting the MFCCs of each speech frame into a pre-trained noise confidence model to obtain the noise confidence of each speech frame;
and determining the noise confidence of the speech to be recognized according to the maximum, minimum, mean, and variance of the noise confidences of the speech frames contained in the speech to be recognized.
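The frame-to-utterance aggregation can be sketched as below. Computing the four statistics follows the patent; how they are subsequently combined is left to the trained model, so this sketch only returns them.

```python
def noise_confidence_stats(frame_scores):
    """Collapse per-frame noise confidences into the four statistics the
    patent names: maximum, minimum, mean, and variance. Their combination
    into a single utterance-level value is left to the trained model."""
    n = len(frame_scores)
    mean = sum(frame_scores) / n
    return {
        "max": max(frame_scores),
        "min": min(frame_scores),
        "mean": mean,
        "var": sum((s - mean) ** 2 for s in frame_scores) / n,  # population variance
    }
```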
Based on the above embodiment, the execution module 740 is specifically configured to:
Inputting the decoding confidence coefficient, the noise confidence coefficient and the decoding output score of the voice to be recognized into a pre-trained voice recognition model to obtain the comprehensive confidence coefficient of the voice to be recognized;
and determining a voice recognition result according to the comprehensive confidence coefficient.
In one embodiment, the integrated confidence level includes a probability that the speech to be recognized includes speech, and accordingly, determining a speech recognition result according to the integrated confidence level includes:
if the probability that the voice to be recognized contains the voice is larger than or equal to a first preset threshold value, determining that the voice recognition result is that the voice to be recognized contains the voice;
If the probability that the voice to be recognized contains the voice is larger than or equal to a second preset threshold value and smaller than the first preset threshold value, determining that the voice recognition result is that the voice to be recognized does not contain the voice;
if the probability that the voice to be recognized contains the voice is smaller than the second preset threshold value, determining voice recognition errors or optimizing the voice to be recognized to obtain optimized voice, and carrying out voice recognition again based on the optimized voice.
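The three-way decision above can be sketched as follows; the threshold values are illustrative defaults, as the patent leaves them configurable.

```python
def decide(prob_speech, first_threshold=0.8, second_threshold=0.4):
    """Two-threshold decision described in the patent. Note that the
    middle band (between the second and first thresholds) is classified
    as not containing speech, per the patent's description."""
    if prob_speech >= first_threshold:
        return "contains_speech"
    if prob_speech >= second_threshold:
        return "no_speech"
    return "error_or_optimize"  # optimize the audio and re-recognize
```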
Further, optimizing the speech to be recognized to obtain an optimized speech includes:
and setting the voice frame with the noise confidence coefficient larger than the preset confidence coefficient in the voice to be recognized as mute to obtain the optimized voice.
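The optimization step can be sketched as below, assuming each frame is a list of samples; the noise-confidence threshold is illustrative.

```python
def mute_noisy_frames(frames, noise_conf, threshold=0.7):
    """Replace frames whose noise confidence exceeds the threshold with
    silence (all-zero samples), per the patent's optimization step.
    frames: list of per-frame sample lists; noise_conf: one score per frame."""
    return [
        [0.0] * len(frame) if conf > threshold else frame
        for frame, conf in zip(frames, noise_conf)
    ]
```

The muted audio is then fed back through recognition, which can recover utterances whose low comprehensive confidence was driven by localized noise.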
The voice recognition device provided by the embodiment of the invention can execute the voice recognition method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of executing the voice recognition method.
It should be noted that the units and modules included in the above embodiment of the speech recognition apparatus are divided only according to functional logic, and the division is not limited thereto, as long as the corresponding functions can be implemented; in addition, the specific names of the functional units are only for convenience of distinction and are not intended to limit the protection scope of the present invention.
Fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present invention. Fig. 8 shows a block diagram of an exemplary computer device 8 suitable for use in implementing embodiments of the invention. The computer device 8 shown in fig. 8 is only an example and should not be construed as limiting the functionality and scope of use of embodiments of the invention.
As shown in fig. 8, the computer device 8 is in the form of a general purpose computer device. The components of computer device 8 may include, but are not limited to, one or more processors or processing units 16, a system memory 28, and a bus 18 that connects the various system components, including the system memory 28 and the processing units 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer device 8 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 8 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The computer device 8 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 8, commonly referred to as a "hard disk drive"). Although not shown in fig. 8, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. The system memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, system memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods of the embodiments described herein.
The computer device 8 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), one or more devices that enable a user to interact with the computer device 8, and/or any devices (e.g., network card, modem, etc.) that enable the computer device 8 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Moreover, the computer device 8 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through a network adapter 20. As shown in fig. 8, the network adapter 20 communicates with other modules of the computer device 8 via the bus 18. It should be appreciated that although not shown in FIG. 8, other hardware and/or software modules may be used in connection with computer device 8, including, but not limited to, microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes various functional applications and page display by running the programs stored in the system memory 28, for example implementing the speech recognition method provided by this embodiment, the method including:
Determining a decoding output score of the voice to be recognized and a second candidate word corresponding to the voice to be recognized according to a first candidate word obtained by decoding the voice to be recognized;
determining decoding characteristics of the second candidate words, and determining decoding confidence of the voice to be recognized according to the decoding characteristics of the second candidate words;
determining the noise confidence coefficient of each voice frame contained in the voice to be recognized, and determining the noise confidence coefficient of the voice to be recognized according to the noise confidence coefficient of each voice frame;
and determining the comprehensive confidence of the voice to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the voice to be recognized, and determining a voice recognition result according to the comprehensive confidence.
Of course, those skilled in the art will appreciate that the processor may also implement the technical solution of the speech recognition method provided in any embodiment of the present invention.
An embodiment of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a speech recognition method such as provided by the embodiment of the present invention, the method comprising:
Determining a decoding output score of the voice to be recognized and a second candidate word corresponding to the voice to be recognized according to a first candidate word obtained by decoding the voice to be recognized;
determining decoding characteristics of the second candidate words, and determining decoding confidence of the voice to be recognized according to the decoding characteristics of the second candidate words;
determining the noise confidence coefficient of each voice frame contained in the voice to be recognized, and determining the noise confidence coefficient of the voice to be recognized according to the noise confidence coefficient of each voice frame;
and determining the comprehensive confidence of the voice to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the voice to be recognized, and determining a voice recognition result according to the comprehensive confidence.
The computer storage media of embodiments of the invention may take the form of any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
It will be appreciated by those of ordinary skill in the art that the modules or steps of the invention described above may be implemented on a general-purpose computing device; they may be centralized on a single computing device or distributed over a network of computing devices; they may be implemented as program code executable by a computing device, so that they are stored in a storage device and executed by the computing device; or they may be fabricated as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
In addition, the technical scheme of the invention can acquire, store, use, process and the like the data, which accords with the relevant regulations of national laws and regulations.
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.
Claims (13)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210753629.8A CN115294974B (en) | 2022-06-28 | 2022-06-28 | A speech recognition method, device, equipment and storage medium |
| PCT/CN2023/097748 WO2024001662A1 (en) | 2022-06-28 | 2023-06-01 | Speech recognition method and apparatus, device, and storage medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210753629.8A CN115294974B (en) | 2022-06-28 | 2022-06-28 | A speech recognition method, device, equipment and storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN115294974A CN115294974A (en) | 2022-11-04 |
| CN115294974B true CN115294974B (en) | 2025-02-28 |
Family
ID=83820283
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210753629.8A Active CN115294974B (en) | 2022-06-28 | 2022-06-28 | A speech recognition method, device, equipment and storage medium |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN115294974B (en) |
| WO (1) | WO2024001662A1 (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115294974B (en) * | 2022-06-28 | 2025-02-28 | 京东科技信息技术有限公司 | A speech recognition method, device, equipment and storage medium |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101118745A (en) * | 2006-08-04 | 2008-02-06 | 中国科学院声学研究所 | A Fast Calculation Method of Confidence Degree in Speech Recognition System |
| CN101447183A (en) * | 2007-11-28 | 2009-06-03 | 中国科学院声学研究所 | Processing method of high-performance confidence level applied to speech recognition system |
Family Cites Families (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20040064315A1 (en) * | 2002-09-30 | 2004-04-01 | Deisher Michael E. | Acoustic confidence driven front-end preprocessing for speech recognition in adverse environments |
| CN103578468B (en) * | 2012-08-01 | 2017-06-27 | 联想(北京)有限公司 | The method of adjustment and electronic equipment of a kind of confidence coefficient threshold of voice recognition |
| US11138334B1 (en) * | 2018-10-17 | 2021-10-05 | Medallia, Inc. | Use of ASR confidence to improve reliability of automatic audio redaction |
| CN111341305B (en) * | 2020-03-05 | 2023-09-26 | 苏宁云计算有限公司 | Audio data labeling method, device and system |
| CN111883109B (en) * | 2020-07-01 | 2023-09-26 | 北京猎户星空科技有限公司 | Voice information processing and verification model training method, device, equipment and medium |
| CN112599128B (en) * | 2020-12-31 | 2024-06-11 | 百果园技术(新加坡)有限公司 | Voice recognition method, device, equipment and storage medium |
| CN112951219A (en) * | 2021-02-01 | 2021-06-11 | 思必驰科技股份有限公司 | Noise rejection method and device |
| CN114093358A (en) * | 2021-11-17 | 2022-02-25 | 北京地平线信息技术有限公司 | Speech recognition method and device, electronic device and storage medium |
| CN114255754B (en) * | 2021-12-27 | 2025-08-19 | 贝壳找房(北京)科技有限公司 | Speech recognition method, electronic device, program product, and storage medium |
| CN115294974B (en) * | 2022-06-28 | 2025-02-28 | 京东科技信息技术有限公司 | A speech recognition method, device, equipment and storage medium |
- 2022
  - 2022-06-28 CN CN202210753629.8A patent/CN115294974B/en active Active
- 2023
  - 2023-06-01 WO PCT/CN2023/097748 patent/WO2024001662A1/en not_active Ceased
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101118745A (en) * | 2006-08-04 | 2008-02-06 | 中国科学院声学研究所 | A Fast Calculation Method of Confidence Degree in Speech Recognition System |
| CN101447183A (en) * | 2007-11-28 | 2009-06-03 | 中国科学院声学研究所 | Processing method of high-performance confidence level applied to speech recognition system |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2024001662A1 (en) | 2024-01-04 |
| CN115294974A (en) | 2022-11-04 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12254865B2 (en) | Multi-dialect and multilingual speech recognition | |
| US12051407B2 (en) | Contextual biasing for speech recognition | |
| US11503155B2 (en) | Interactive voice-control method and apparatus, device and medium | |
| US11545142B2 (en) | Using context information with end-to-end models for speech recognition | |
| CN1975858B (en) | session control device | |
| WO2017076222A1 (en) | Speech recognition method and apparatus | |
| CN108899013B (en) | Voice search method and device and voice recognition system | |
| JP2022529691A (en) | Combination endpoint determination and automatic speech recognition | |
| KR20220004224A (en) | Context biasing for speech recognition | |
| CN113053367B (en) | Speech recognition method, speech recognition model training method and device | |
| CN109754809A (en) | Audio recognition method, device, electronic equipment and storage medium | |
| WO2021000403A1 (en) | Voice matching method for intelligent dialogue system, electronic device and computer device | |
| CN110070859B (en) | Voice recognition method and device | |
| US10152298B1 (en) | Confidence estimation based on frequency | |
| JP7659080B2 (en) | Reducing Streaming ASR Model Delay Using Self-Alignment | |
| EP3739583B1 (en) | Dialog device, dialog method, and dialog computer program | |
| US20190156832A1 (en) | Diarization Driven by the ASR Based Segmentation | |
| WO2020156342A1 (en) | Voice recognition method and device, electronic device and storage medium | |
| US20190156835A1 (en) | Diarization Driven by Meta-Information Identified in Discussion Content | |
| WO2018192186A1 (en) | Speech recognition method and apparatus | |
| US11626107B1 (en) | Natural language processing | |
| TWI818427B (en) | Method and system for correcting speaker diarisation using speaker change detection based on text | |
| CN115457938A (en) | Method, device, storage medium and electronic device for identifying wake-up words | |
| CN112767921A (en) | Voice recognition self-adaption method and system based on cache language model | |
| CN111583910B (en) | Model updating method and device, electronic equipment and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |