
CN115294974B - A speech recognition method, device, equipment and storage medium - Google Patents


Info

Publication number
CN115294974B
CN115294974B (application CN202210753629.8A)
Authority
CN
China
Prior art keywords
speech
recognized
decoding
voice
confidence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210753629.8A
Other languages
Chinese (zh)
Other versions
CN115294974A (en)
Inventor
雪巍
彭毅
范璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Technology Information Technology Co Ltd
Original Assignee
Jingdong Technology Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong Technology Information Technology Co Ltd filed Critical Jingdong Technology Information Technology Co Ltd
Priority to CN202210753629.8A priority Critical patent/CN115294974B/en
Publication of CN115294974A publication Critical patent/CN115294974A/en
Priority to PCT/CN2023/097748 priority patent/WO2024001662A1/en
Application granted granted Critical
Publication of CN115294974B publication Critical patent/CN115294974B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G10L 15/20 — Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L 15/02 — Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/063 — Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/16 — Speech classification or search using artificial neural networks
    • G10L 25/24 — Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
    • G10L 25/78 — Detection of presence or absence of voice signals
    • G10L 2015/0631 — Creating reference templates; clustering
    • G10L 2025/783 — Detection of presence or absence of voice signals based on threshold decision
    • G06N 3/084 — Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the invention disclose a speech recognition method, apparatus, device, and storage medium. The method includes: determining a decoding output score of the speech to be recognized and second candidate words corresponding to that speech, according to first candidate words obtained by decoding it; determining the decoding features of each second candidate word, and from those features the decoding confidence of the speech; determining the noise confidence of each speech frame contained in the speech, and from those the noise confidence of the speech as a whole; and determining an integrated confidence of the speech from the decoding confidence, the noise confidence, and the decoding output score, from which the speech recognition result is determined. This scheme improves the accuracy of speech recognition without increasing cost.

Description

Voice recognition method, device, equipment and storage medium
Technical Field
Embodiments of the invention relate to the field of speech processing, and in particular to a speech recognition method, apparatus, device, and storage medium.
Background
Speech recognition has been widely used in intelligent customer service, smart homes, vehicle-mounted assistants, and other fields. Speech recognition systems are often affected by noise from the environment or telephone channel, which readily causes recognition errors. For example, insertion errors may occur when noise falls outside the speech segments, while deletion or substitution errors may occur when the speech segments themselves are corrupted by noise. Such errors pose a significant challenge for subsequent speech interaction.
In the prior art, a front-end noise-reduction module processes the speech to be recognized to reduce the influence of noise on its features, and a speech recognition module then recognizes the processed speech to determine the recognition result.
In the process of implementing the present invention, the inventor finds that at least the following technical problems exist in the prior art:
The front-end noise-reduction module must be adapted to the speech recognition module, which increases the cost of speech recognition.
Disclosure of Invention
The invention provides a speech recognition method, apparatus, device, and storage medium in order to reduce the cost of speech recognition.
In a first aspect, an embodiment of the present invention provides a method for voice recognition, including:
Determining a decoding output score of the voice to be recognized and a second candidate word corresponding to the voice to be recognized according to a first candidate word obtained by decoding the voice to be recognized;
determining decoding characteristics of the second candidate words, and determining decoding confidence of the voice to be recognized according to the decoding characteristics of the second candidate words;
determining the noise confidence coefficient of each voice frame contained in the voice to be recognized, and determining the noise confidence coefficient of the voice to be recognized according to the noise confidence coefficient of each voice frame;
and determining the comprehensive confidence of the voice to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the voice to be recognized, and determining a voice recognition result according to the comprehensive confidence.
In a second aspect, an embodiment of the present invention further provides a voice recognition apparatus, including:
the decoding module is used for determining a decoding output score of the voice to be recognized and a second candidate word corresponding to the voice to be recognized according to a first candidate word obtained by decoding the voice to be recognized;
the decoding confidence determining module is used for determining the decoding characteristics of each second candidate word and determining the decoding confidence of the voice to be recognized according to the decoding characteristics of each second candidate word;
The noise confidence coefficient determining module is used for determining the noise confidence coefficient of each voice frame contained in the voice to be recognized and determining the noise confidence coefficient of the voice to be recognized according to the noise confidence coefficient of each voice frame;
And the execution module is used for determining the comprehensive confidence coefficient of the voice to be recognized according to the decoding confidence coefficient, the noise confidence coefficient and the decoding output score of the voice to be recognized, and determining a voice recognition result according to the comprehensive confidence coefficient.
In a third aspect, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the speech recognition method according to any one of the first aspects when executing the program.
In a fourth aspect, embodiments of the present invention also provide a storage medium containing computer-executable instructions for performing the speech recognition method according to any one of the first aspects when executed by a computer processor.
The embodiments of the above invention have the following advantages or benefits:
An embodiment of the invention provides a speech recognition method that: determines the decoding output score of the speech to be recognized and the second candidate words corresponding to it, according to the first candidate words obtained by decoding; determines the decoding features of the second candidate words and, from them, the decoding confidence of the speech; determines the noise confidence of each speech frame contained in the speech and, from these, the noise confidence of the speech; and determines an integrated confidence from the decoding confidence, the noise confidence, and the decoding output score, from which the recognition result is determined. In this scheme, the decoding output score determined from the first candidate words provides a data basis for the integrated confidence; the decoding features of the second candidate words yield an accurate segment-level decoding confidence; the frame-level noise confidences yield an accurate segment-level noise confidence; and fusing the segment-level decoding confidence, the segment-level noise confidence, and the decoding output score yields an accurate integrated confidence for the speech. The accuracy of speech recognition is thus improved without increasing cost.
Drawings
FIG. 1 is a schematic diagram of a speech recognition module according to an embodiment of the present invention;
FIG. 2 is a flowchart of a voice recognition method according to an embodiment of the present invention;
FIG. 3 is a first word diagram including a first candidate word obtained by decoding a speech to be recognized in a speech recognition method according to an embodiment of the present invention;
FIG. 4 is a flowchart of another speech recognition method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a second preset network model in another speech recognition method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a speech recognition system according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Before discussing exemplary embodiments in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart depicts operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently, or at the same time. Furthermore, the order of the operations may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figures. The processes may correspond to methods, functions, procedures, subroutines, and the like. Furthermore, embodiments of the invention and features of the embodiments may be combined with each other without conflict.
Fig. 1 is a schematic diagram of a speech recognition module according to an embodiment of the present invention. As shown in Fig. 1, a speech recognition module 100 may include a language model 110 and an acoustic model 120. During recognition, the module 100 may use a decoding algorithm to obtain the optimal sequence through a Viterbi search and generate the decoded output corresponding to the speech, that is, the word graph corresponding to the speech. Contamination of speech by noise readily causes recognition errors. The embodiment of the invention therefore provides a speech recognition method that improves recognition accuracy and reduces the error rate without increasing cost.
The voice recognition method according to the embodiment of the present invention will be described in detail with reference to the voice recognition module and the embodiment shown in fig. 1.
Fig. 2 is a flowchart of a voice recognition method according to an embodiment of the present invention, where the embodiment of the present invention is applicable to a situation where the accuracy of voice recognition needs to be improved without increasing the cost. The method may be performed by a speech recognition device, which may be implemented in software and/or hardware. As shown in fig. 2, the method specifically includes the following steps:
Step 210, determining a decoding output score of the voice to be recognized and a second candidate word corresponding to the voice to be recognized according to a first candidate word obtained by decoding the voice to be recognized.
The speech recognition module including the language model and the acoustic model shown in Fig. 1 may be used to decode the speech and generate its first candidate words. The speech to be recognized is therefore input into the module of Fig. 1, which decodes it to obtain the corresponding first candidate words. Due to noise interference, the error rate of the first candidate words obtained this way is high, and it must further be determined whether the speech to be recognized actually contains speech.
The first candidate words corresponding to the speech to be recognized can be represented by a first word graph, a compressed representation of the candidate words, their times, and other information produced during decoding. A node represents a state, and the value in a node's parentheses is the time corresponding to that state. Different candidate paths run from the initial state to the final state, and the value on an edge is the score of its first candidate word, i.e., that word's posterior probability. The score of each candidate path and the time information of each first candidate word on it can be read from the graph. Fig. 3 shows such a first word graph obtained by decoding speech to be recognized. For example, on the paths from initial state 0 to final state 4, "Beijing", "background", "mobilization", "motion", "Olympics", and "meeting" are first candidate words, with posterior probabilities 0.5, 0.5, 0.5, 0.4, 0.2, and 0.4 respectively; the two endpoint values of the edge for "Winter Olympics" are 6 and 20, indicating that the speech content in the 6 s–20 s period is "Winter Olympics".
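As a concrete illustration, the word-graph structure described above can be sketched as follows. The `Edge` class, the product-of-posteriors path score, and the posterior value assumed for "Winter Olympics" are all illustrative, not the patent's actual data format:

```python
class Edge:
    """One lattice edge: a candidate word with its posterior and time span."""
    def __init__(self, src, dst, word, posterior, t_start, t_end):
        self.src, self.dst = src, dst              # decoder states
        self.word = word                           # first candidate word
        self.posterior = posterior                 # posterior probability (edge score)
        self.t_start, self.t_end = t_start, t_end  # time span in seconds

def path_score(path):
    """Score a candidate path as the product of its edges' posteriors
    (one simple way to combine edge scores; the text does not fix this)."""
    score = 1.0
    for edge in path:
        score *= edge.posterior
    return score

# One candidate path from the Fig. 3 example: "Beijing" (posterior 0.5),
# then "Winter Olympics" spanning 6 s - 20 s (posterior assumed here).
path = [Edge(0, 1, "Beijing", 0.5, 0, 6),
        Edge(1, 4, "Winter Olympics", 0.4, 6, 20)]
```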
Specifically, after the first candidate words are obtained by decoding the speech with the speech recognition module, the decoding output score of each first candidate word is determined and the scores are sorted; the three largest scores are then normalized, and the result is taken as the decoding output score of the speech to be recognized. Separately, the second candidate words corresponding to the speech are determined by a second decoding pass over the first word graph, using minimum Bayes risk decoding with edit distance as the criterion.
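The top-three normalization step might look like the following sketch; sum-normalization is an assumption, since the text only says the three largest scores are "normalized":

```python
def utterance_decoding_score_features(word_scores):
    """Sort the per-word decoding output scores, keep the three largest,
    and scale them to sum to 1. The normalized triple serves as the
    decoding output score of the speech to be recognized.
    (Sum-normalization is an assumption; the text only says 'normalized'.)"""
    top3 = sorted(word_scores, reverse=True)[:3]
    total = sum(top3)
    return [s / total for s in top3]
```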
In the embodiment of the invention, after the speech recognition module decodes the speech into first candidate words, the decoding output score of the speech can be determined from them, and a second decoding pass over the first word graph containing those words yields the second candidate words, which are used to determine the decoding confidence of the speech and provide a data basis for it.
Step 220, determining the decoding characteristics of the second candidate words, and determining the decoding confidence of the voice to be recognized according to the decoding characteristics of the second candidate words.
Wherein the decoding features include a confidence score, a word category, a probability distribution, a word length, and a word graph depth for the second candidate word.
Specifically, after the posterior probability of each second candidate word is determined, the posteriors are normalized to obtain each word's confidence score, a one-dimensional feature. After the domain of the speech to be recognized is determined, the words of that domain are divided into N+1 classes; each second candidate word is mapped to one of these classes, and its word category is represented as an (N+1)-dimensional feature. The probability distribution of a second candidate word, a one-dimensional feature, is determined from the number of times the word occurs among all second candidate words for the speech and the total number of those words. The word length, also a one-dimensional feature, is determined from the number of phonemes the word contains. The word-graph depth, again one-dimensional, is determined from the number of edges over all nodes in the word's time period in the second word graph and the length of that period.
The decoding feature of each second candidate word is therefore an (N+5)-dimensional vector. Each word's feature vector is fed into a pre-trained decoding-confidence model to obtain that word's decoding confidence, and the arithmetic mean of these per-word confidences is taken as the decoding confidence of the speech to be recognized.
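Assembling the (N+5)-dimensional feature and averaging the per-word confidences could be sketched as follows; the one-hot layout of the (N+1)-dimensional word-category feature is an assumption:

```python
def word_decoding_features(conf_score, category_idx, n, prob, word_len, depth):
    """Build the (N+5)-dimensional decoding feature of one second candidate
    word: 1-d confidence score, (N+1)-d word category (one-hot layout
    assumed), 1-d probability distribution, 1-d word length, and 1-d
    word-graph depth."""
    one_hot = [0.0] * (n + 1)
    one_hot[category_idx] = 1.0
    return [conf_score] + one_hot + [prob, float(word_len), depth]

def utterance_decoding_confidence(word_confidences):
    """Decoding confidence of the speech: arithmetic mean of the
    per-word decoding confidences from the pre-trained model."""
    return sum(word_confidences) / len(word_confidences)
```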
The decoding confidence of the speech to be recognized reflects the reliability of the recognition result; it generally ranges from 0 to 1, and the closer it is to 1, the more reliable the result.
In the embodiment of the invention, after the second word graph containing the second candidate words is obtained by re-decoding the first word graph, the decoding feature of each second candidate word can be determined from it, each word's decoding confidence from its features, and the decoding confidence of the speech from the per-word confidences. This confidence indicates how reliable the recognition result is and provides a data basis for determining it.
Step 230, determining the noise confidence of each voice frame included in the voice to be recognized, and determining the noise confidence of the voice to be recognized according to the noise confidence of each voice frame.
Before determining the noise confidence of the speech to be recognized, the speech can be divided into frames, each 25 milliseconds long with a frame shift of 10 milliseconds.
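The 25 ms / 10 ms framing can be sketched as a plain sliding window over raw samples (any windowing function is omitted here):

```python
def frame_signal(samples, sample_rate, frame_ms=25, shift_ms=10):
    """Split a waveform into overlapping frames: 25 ms long, 10 ms apart."""
    frame_len = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * shift_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, step)]
```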
Specifically, the Mel-frequency cepstral coefficients (MFCCs) of each speech frame may first be extracted and fed into a pre-trained noise-confidence model to obtain the probability p(t) that the frame contains speech; the frame's noise confidence is then 1 − p(t). After the noise confidence of every frame contained in the speech is determined, the maximum, minimum, mean, and variance of these frame-level values are taken as the noise confidence of the speech to be recognized.
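The frame-to-segment aggregation could be sketched as follows; population variance is assumed where the text just says "variance":

```python
import statistics

def frame_noise_confidences(speech_probs):
    """Per-frame noise confidence: 1 - p(t), where p(t) is the pre-trained
    model's probability that frame t contains speech."""
    return [1.0 - p for p in speech_probs]

def segment_noise_confidence(noise_confs):
    """Segment-level noise confidence of the speech: the max, min, mean,
    and variance of the frame-level noise confidences."""
    return {
        "max": max(noise_confs),
        "min": min(noise_confs),
        "mean": statistics.fmean(noise_confs),
        "var": statistics.pvariance(noise_confs),
    }
```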
In the embodiment of the invention, the noise confidence of the segment level of the voice to be recognized can be determined based on the noise confidence of the frame level of each voice frame contained in the voice to be recognized.
Step 240, determining the comprehensive confidence of the voice to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the voice to be recognized, and determining a voice recognition result according to the comprehensive confidence.
Specifically, the decoding confidence, the noise confidence, and the decoding output score of the speech to be recognized are used as its recognition features and fed into a pre-trained recognition model; the output is the integrated confidence of the speech. The integrated confidence fuses the segment-level decoding confidence, the segment-level noise confidence derived from the frame-level noise confidences, and the decoding output score. Based on the integrated confidence, it is determined whether the result is a valid recognition result, i.e., whether the speech to be recognized actually contains speech.
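The fusion step could be sketched as follows, with a logistic regression standing in for the pre-trained recognition model (whose actual form the text does not specify); the weights and bias are illustrative:

```python
import math

def integrated_confidence(decoding_conf, noise_stats, decoding_scores,
                          weights, bias=0.0):
    """Concatenate the segment-level decoding confidence, the four noise
    statistics, and the normalized decoding output scores into one
    recognition-feature vector, then map it into (0, 1).
    The logistic mapping is an assumption about the model's form."""
    features = [decoding_conf] + list(noise_stats) + list(decoding_scores)
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```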
In the embodiment of the invention, the probability that the speech to be recognized contains speech (and the probability that it does not) can be determined from the integrated confidence, and the recognition result is then determined from the former. This effectively mitigates insertion errors caused by noise, and since no specific optimization or retraining of the speech recognition module is needed, the method can be adapted to different recognition modules.
The speech recognition method thus determines the decoding output score of the speech to be recognized and its second candidate words from the first candidate words obtained by decoding; determines the decoding features of the second candidate words and, from them, the decoding confidence of the speech; determines the noise confidence of each speech frame and, from these, the noise confidence of the speech; and determines the integrated confidence from the decoding confidence, the noise confidence, and the decoding output score, from which the recognition result is determined.
In this scheme, the decoding output score determined from the first candidate words provides a data basis for the integrated confidence; the second candidate words determined from the first candidate words yield, via their decoding features, an accurate decoding confidence for the speech; the frame-level noise confidences yield an accurate segment-level noise confidence; and fusing the segment-level decoding confidence, the segment-level noise confidence, and the decoding output score yields an accurate integrated confidence. The accuracy of speech recognition is thus improved without increasing cost.
Fig. 4 is a flowchart of another voice recognition method according to an embodiment of the present invention, where the embodiment of the present invention is applicable to a situation where the accuracy of voice recognition needs to be improved without increasing the cost. The explanation of the terms of the embodiments of the present invention that are the same as or corresponding to the embodiments described above will not be repeated here. Referring to fig. 4, the voice recognition method provided by the embodiment of the invention includes:
Step 410, determining a decoding output score of the voice to be recognized according to a first candidate word obtained by decoding the voice to be recognized.
In one embodiment, step 410 may specifically include:
The method comprises the steps of: obtaining first candidate words by decoding the voice to be recognized once based on a voice recognition module formed by a language model and an acoustic model; determining the language score and the acoustic score of each first candidate word; and determining the decoding output score of the voice to be recognized according to the language score and the acoustic score of each first candidate word.
Specifically, the voice recognition module including the language model and the acoustic model may be used to decode the voice to be recognized once, and this single decoding pass yields the first candidate words corresponding to the voice to be recognized. The language score and the acoustic score of each first candidate word may then be determined, and after the language score and the acoustic score are fused, the decoding output score of each first candidate word is obtained. After determining the decoding output score of each first candidate word, the decoding output scores may be sorted, normalization processing may be performed on the three largest decoding output scores, and the processing result may be determined as the decoding output score of the speech to be recognized.
In the embodiment of the invention, after the voice to be recognized is decoded based on the voice recognition module to obtain the first candidate word, the decoding output score of the voice to be recognized can be determined according to the first candidate word, and a data basis is provided for determining the comprehensive confidence of the voice to be recognized.
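The fusion and top-3 normalization described above can be sketched as follows; the linear fusion weight and the softmax normalization are illustrative assumptions, since the embodiment does not fix a particular fusion or normalization scheme:

```python
import numpy as np

def decoding_output_score(lang_scores, acoustic_scores, lm_weight=0.5):
    """Fuse the language and acoustic score of each first candidate word,
    then normalize the three largest fused scores (softmax assumed)."""
    fused = (lm_weight * np.asarray(lang_scores)
             + (1 - lm_weight) * np.asarray(acoustic_scores))
    top3 = np.sort(fused)[-3:]            # the three largest fused scores
    exp = np.exp(top3 - top3.max())       # numerically stable softmax
    return exp / exp.sum()

scores = decoding_output_score([1.0, 2.0, 0.5, 3.0], [0.8, 1.5, 0.2, 2.5])
```

The returned three normalized values sum to one and preserve the ordering of the three largest fused scores.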
Step 420, determining a second candidate word corresponding to the voice to be recognized according to the first candidate word obtained by decoding the voice to be recognized.
In one embodiment, step 420 may specifically include:
Performing secondary decoding on the first word graph based on the minimum Bayesian risk, with the edit distance as the criterion, to obtain the second candidate words corresponding to the voice to be recognized and the posterior probability of each second candidate word.
Specifically, after the voice recognition module including the language model and the acoustic model decodes the voice to be recognized once to obtain a first candidate word corresponding to the voice to be recognized, the first word graph may be determined according to the first candidate word, and a second candidate word corresponding to the voice to be recognized may be obtained based on the minimum bayesian risk by performing a second decoding on the first word graph with the edit distance as a criterion.
The process of performing secondary decoding on the first word graph is as follows: 1) select a candidate path from the initial state to the termination state in the first word graph; 2) based on the candidate path, calculate the edit distance between the candidate path and the whole first word graph (the edit distance may be the minimum number of insertions, deletions and substitutions needed to turn one text into another), and obtain, through the edit distance, the posterior probabilities of all first candidate words in the time period corresponding to each first candidate word in the candidate path; 3) select the word with the highest probability at each moment to obtain a new word sequence, namely the second candidate words; 4) if the second candidate words differ from the first candidate words corresponding to the candidate path in 2), return to executing 2); otherwise, determine that the secondary decoding is finished, and determine the word sequence containing each second candidate word as the secondary decoding result. In addition, the time period in which each second candidate word of the word sequence is located carries the posterior probabilities of all second candidate words within that time period.
In the embodiment of the invention, after the voice to be recognized is decoded based on the voice recognition module to obtain the first candidate word, the second candidate word of the voice to be recognized can be determined according to the first candidate word, and a data basis is provided for determining the decoding confidence of the voice to be recognized.
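The edit distance used as the criterion in step 2) is the standard Levenshtein distance; a minimal reference implementation:

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn sequence a into sequence b (Levenshtein distance)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # delete all of a's prefix
    for j in range(n + 1):
        d[0][j] = j                       # insert all of b's prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]
```

In the secondary decoding itself, this distance is computed between a candidate path and the word graph rather than between two plain strings, but the insertion/deletion/substitution semantics are the same.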
Step 430, determining decoding characteristics of each second candidate word.
Wherein the decoding features include confidence scores, word categories, probability distributions, word lengths, and word graph depths for the second candidate word.
In one embodiment, step 430 may specifically include:
The method comprises the steps of: performing normalization processing on the posterior probability of each second candidate word to obtain the confidence score of each second candidate word; determining the word category of each second candidate word according to the category information of the second candidate word; determining the probability distribution of each second candidate word according to the number of occurrences of the second candidate word among all second candidate words corresponding to the voice to be recognized; determining the word length of each second candidate word according to the number of phonemes contained in the second candidate word; and determining the word graph depth of each second candidate word according to the number of edges of all nodes in the time period corresponding to the second candidate word, and the length of that time period, in the second word graph obtained by performing secondary decoding on the first candidate words.
Specifically, when the second candidate words are obtained by performing secondary decoding on the first word graph, a second word graph containing the second candidate words can be generated, and the second word graph also contains the posterior probabilities of the second candidate words. The posterior probability of each second candidate word is therefore normalized to obtain its confidence score, and the confidence score can be used as a one-dimensional feature of the second candidate word. The word category indicates the category information of the second candidate word. Firstly, the domain of the voice to be recognized and the classification of words in that domain can be determined; the word categories in the domain are sorted by word frequency, each of the top N categories is kept as an individual category, and the words of all remaining categories are merged into one special category, so the words in the domain are divided into N+1 categories in total, and each second candidate word can be mapped to one of the N+1 categories. The word category of a second candidate word can therefore be represented as an (N+1)-dimensional one-hot feature. For example, when the words in the domain of the speech to be recognized are divided into N+1=3+1=4 categories, the word category of a second candidate word is one of: (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0) or (0, 0, 0, 1).
The probability distribution indicates the frequency of the second candidate word among all second candidate words corresponding to the speech to be recognized. Thus the probability distribution of a second candidate word, that is, its unigram probability, can be determined from the number of times it occurs among all second candidate words corresponding to the speech to be recognized and the total number of such second candidate words, and can be used as a one-dimensional feature of the second candidate word. The word length indicates the number of phonemes contained in the second candidate word, so the word length can be determined from that number of phonemes and used as a one-dimensional feature of the second candidate word. The word graph depth indicates the average word graph depth in the time period corresponding to the second candidate word; the word graph depth, that is, the average lattice depth, can be determined from the number of edges over all nodes in the time period corresponding to the second candidate word and the length of that time period in the second word graph containing the second candidate word, and can be used as a one-dimensional feature of the second candidate word.
Thus, the decoding feature of each second candidate word is an (N+5)-dimensional feature; as described above, when N+1=3+1=4, the decoding feature of the second candidate word is an eight-dimensional feature.
In the embodiment of the invention, the N+5-dimensional decoding characteristics of each second candidate word obtained by secondary decoding can be determined, the decoding characteristics of the second candidate words are used for determining the decoding confidence coefficient of the voice to be recognized, and a data basis is provided for determining the decoding confidence coefficient of the voice to be recognized.
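The assembly of the (N+5)-dimensional decoding feature can be sketched as follows, with N+1 = 4 word categories as in the example above; the argument names are illustrative:

```python
import numpy as np

def word_features(confidence, class_index, unigram_prob, num_phonemes,
                  edge_count, duration, n_classes=4):
    """Assemble the (N+5)-dim decoding feature of one second candidate word
    (here N+1 = n_classes = 4, giving an eight-dimensional feature)."""
    one_hot = np.zeros(n_classes)
    one_hot[class_index] = 1.0               # (N+1)-dim word-category one-hot
    lattice_depth = edge_count / duration    # average word-graph (lattice) depth
    return np.concatenate(([confidence], one_hot,
                           [unigram_prob, num_phonemes, lattice_depth]))

# confidence 0.9, category 1 of 4, unigram prob 0.05,
# 3 phonemes, 12 lattice edges over a 0.4 s time period
feat = word_features(0.9, 1, 0.05, 3, 12, 0.4)
```

The resulting vector concatenates the five feature groups in the order they are introduced above: confidence score, word category, probability distribution, word length, and word graph depth.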
Step 440, determining the decoding confidence of the speech to be recognized according to the decoding characteristics of each second candidate word.
In one embodiment, step 440 may specifically include:
The decoding features of the second candidate words are respectively input into a pre-trained decoding confidence model to obtain the decoding confidence of each second candidate word, and the decoding confidence of the voice to be recognized is determined according to the decoding confidence of each second candidate word.
Before the decoding features of the second candidate words are respectively input into the pre-trained decoding confidence coefficient model, the method further comprises:
The method comprises the steps of: constructing a first preset network model based on a deep neural network and a cross entropy function; taking the (N+5)-dimensional decoding features of the second candidate words corresponding to labeled voice data containing noise and effective voice, together with the labeling information of the voice data, as first training data; performing network training on the first preset network model and calculating the first loss function; and performing network optimization based on a back propagation algorithm until the first loss function converges, so as to obtain the decoding confidence model.
Wherein the first loss function may be a cross entropy function.
Specifically, after the first preset network model is constructed based on a deep neural network and a cross entropy function, voice data containing noise and effective voice is labeled: noise is labeled 0 and effective voice is labeled 1, and the labeled voice data containing noise and effective voice is determined as the first training voice set. After first training candidate words corresponding to each training voice are obtained by performing primary decoding on the training voices contained in the first training voice set based on the voice recognition module, second training candidate words corresponding to each training voice are further obtained by performing secondary decoding, and the (N+5)-dimensional decoding features of each second training candidate word are determined in the manner of step 430. Taking the (N+5)-dimensional decoding features of each second training candidate word corresponding to each training voice contained in the first training voice set, together with the labeling information of the training voice, as training data, network training is performed on the first preset network model, and the cross entropy function is calculated from the sigmoid activation value output by the first preset network model, which represents the confidence score of the second training candidate word, and the labeling information of the training voice; network optimization is then performed based on a back propagation algorithm until the cross entropy function converges, so as to obtain the decoding confidence model.
The decoding features of each second candidate word can then be respectively input into the decoding confidence model, and the obtained output results are the decoding confidences of the second candidate words. The arithmetic mean of the decoding confidences of the second candidate words corresponding to the voice to be recognized is computed, and this arithmetic mean is determined as the decoding confidence of the voice to be recognized.
In the embodiment of the invention, the decoding confidence coefficient of each second candidate word corresponding to the voice to be recognized can be determined based on the decoding confidence coefficient model, and then the decoding confidence coefficient of the voice to be recognized can be determined according to the decoding confidence coefficient of each second candidate word, and the decoding confidence coefficient of the voice to be recognized can indicate the reliability degree of the voice recognition result, so that a data basis is provided for determining the voice recognition result.
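The word-level scoring and utterance-level averaging can be sketched as follows; the actual decoding confidence model is a trained deep neural network, so the single linear layer with a sigmoid output used here is only a stand-in:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def utterance_decoding_confidence(features, weights, bias):
    """Score every second candidate word with a stand-in single-layer model
    whose sigmoid output is the word-level decoding confidence, then return
    the arithmetic mean as the utterance-level decoding confidence."""
    word_conf = [sigmoid(np.dot(weights, f) + bias) for f in features]
    return float(np.mean(word_conf))

feats = [np.random.rand(8) for _ in range(3)]   # three (N+5)-dim features
conf = utterance_decoding_confidence(feats, np.zeros(8), 0.0)
```

With all-zero weights the sigmoid outputs 0.5 for every word, so the utterance confidence is 0.5; a trained model would instead produce per-word confidences between 0 and 1.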
Step 450, determining the noise confidence of each voice frame included in the voice to be recognized.
In one embodiment, step 450 may specifically include:
The method comprises the steps of: performing frame division on the voice to be recognized to obtain the voice frames contained in the voice to be recognized; determining the Mel cepstrum coefficients of each voice frame; and respectively inputting the Mel cepstrum coefficients of each voice frame into a pre-trained noise confidence model to obtain the noise confidence of each voice frame.
Before the mel-frequency coefficient of each voice frame is respectively input into the pre-trained noise confidence coefficient model, the method further comprises:
Constructing a second preset network model based on a gated recurrent unit (Gate Recurrent Unit, GRU); taking the Mel cepstrum coefficients of the frame training voices corresponding to each training voice in a second training voice set formed by training voices containing pure noise and pure voice, together with the labeling information of each frame training voice, as second training data; performing network training on the second preset network model and calculating the second loss function; and iterating the weights of the second preset network model based on stochastic gradient descent until the second loss function converges, so as to obtain the noise confidence model.
Wherein the second loss function may also be a cross entropy function.
Fig. 5 is a schematic diagram of the second preset network model in another voice recognition method according to an embodiment of the present invention. As shown in fig. 5, the second preset network model includes a first fully connected layer (fully connected layer, FC), a first GRU, a second GRU, a third GRU, and a second FC.
Specifically, after the second preset network model is built based on the GRU, pure noise and pure voice are collected, and the pure noise is randomly added to the pure voice according to a preset signal-to-noise ratio to obtain the training voices; a preset number of training voices are determined as the second training voice set. Each training voice contained in the second training voice set is divided into frames with a frame length of 25 milliseconds and a frame shift of 10 milliseconds, to obtain the frame training voices corresponding to each training voice; for each frame training voice, when its phoneme is not silence, the frame is labeled 1, otherwise it is labeled 0. Further, taking the Mel cepstrum coefficients of the frame training voices corresponding to each training voice in the second training voice set, together with the labeling information of each frame training voice, as second training data, network training is performed on the second preset network model. Specifically, the Mel cepstrum coefficients of L frame training voices can be used as a training sequence and input into the second preset network model; the output result corresponding to each frame of training voice is a vector of dimension 2, where one dimension represents the probability that the current frame contains speech and the other dimension represents the probability that the current frame does not contain speech. Taking the labeling information of the L frame training voices as the target sequence, network training is performed on the second preset network model and the cross entropy function is calculated; the weights of the second preset network model are iterated based on stochastic gradient descent until the second loss function converges, so as to obtain the noise confidence model.
The voice to be recognized can then be divided into frames with a frame length of 25 milliseconds and a frame shift of 10 milliseconds, to obtain the voice frames contained in the voice to be recognized. The Mel cepstrum coefficients of each voice frame are determined and input into the noise confidence model; the obtained output result is the probability p(t) that the voice frame contains speech, so the probability that the voice frame does not contain speech is 1-p(t), and the noise confidence of the voice frame is therefore 1-p(t).
In the embodiment of the invention, the noise confidence coefficient of each voice frame contained in the voice to be recognized can be determined based on the noise confidence coefficient model, the noise confidence coefficient of each voice frame is used for determining the noise confidence coefficient of the voice to be recognized, the noise confidence coefficient of the voice to be recognized can be further used for determining the comprehensive confidence coefficient of the voice to be recognized, and a data basis is provided for determining the voice recognition result.
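The 25 ms / 10 ms framing used for both training and recognition can be sketched as follows; a 16 kHz sample rate is assumed for illustration, since the embodiment does not specify one:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a waveform into 25 ms frames with a 10 ms frame shift,
    as described for both model training and recognition."""
    frame_len = int(sample_rate * frame_ms / 1000)    # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)        # 160 samples at 16 kHz
    n = 1 + max(0, (len(signal) - frame_len) // shift)
    return np.stack([signal[i * shift: i * shift + frame_len]
                     for i in range(n)])

frames = frame_signal(np.zeros(16000))   # one second of audio at 16 kHz
```

Each frame would then be converted to Mel cepstrum coefficients before being passed to the noise confidence model, whose output p(t) gives the frame's noise confidence 1-p(t).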
Step 460, determining the noise confidence coefficient of the voice to be recognized according to the noise confidence coefficient of each voice frame.
In one embodiment, step 460 may specifically include:
The noise confidence of the voice to be recognized is determined according to the maximum noise confidence, the minimum noise confidence, the mean of the noise confidences and the variance of the noise confidences among the noise confidences of the voice frames contained in the voice to be recognized.
Specifically, after determining the noise confidence of each voice frame contained in the voice to be recognized, the noise confidences of the voice frames may be sorted and their mean and variance calculated, and the maximum noise confidence, the minimum noise confidence, the noise confidence mean and the noise confidence variance among the noise confidences of the voice frames contained in the voice to be recognized are determined as the noise confidence of the voice to be recognized.
In the embodiment of the invention, the noise confidence of the segment level of the voice to be recognized can be determined based on the noise confidence of the frame level of each voice frame contained in the voice to be recognized.
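The aggregation of frame-level noise confidences into the four segment-level statistics can be sketched as:

```python
import numpy as np

def segment_noise_confidence(frame_noise_conf):
    """Aggregate frame-level noise confidences into the four segment-level
    statistics used as the noise confidence of the speech to be recognized."""
    c = np.asarray(frame_noise_conf, dtype=float)
    return {"max": float(c.max()), "min": float(c.min()),
            "mean": float(c.mean()), "var": float(c.var())}

stats = segment_noise_confidence([0.1, 0.4, 0.9, 0.2])
```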
And 470, determining the comprehensive confidence of the voice to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the voice to be recognized.
In one embodiment, step 470 may specifically include:
The decoding confidence, the noise confidence and the decoding output score of the voice to be recognized are input into a pre-trained voice recognition model to obtain the comprehensive confidence of the voice to be recognized.
Before inputting the decoding confidence level, the noise confidence level and the decoding output score of the voice to be recognized into the pre-trained voice recognition model, the method further comprises the following steps:
Constructing a third preset network model based on logistic regression; taking the decoding confidence, the noise confidence, the decoding output score and the labeling information of each training voice in a third training voice set constructed from voices containing noise as third training data; performing network training on the third preset network model and calculating the third loss function; and performing network optimization based on a back propagation algorithm until the third loss function converges, so as to obtain the voice recognition model.
The decoding confidence, the noise confidence and the decoding output score of the voice to be recognized can then be input into the voice recognition model, and the obtained output result is the comprehensive confidence of the voice to be recognized.
In the embodiment of the invention, the segment-level decoding confidence of the voice to be recognized, the segment-level noise confidence aggregated from the frame-level noise confidences, and the decoding output score are fused to determine the comprehensive confidence of the voice to be recognized.
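The logistic-regression fusion can be sketched as follows; the weights and bias would come from the training procedure above, and the exact feature layout (one decoding confidence, four noise statistics, three output scores) is an assumption for illustration:

```python
import numpy as np

def fused_confidence(decoding_conf, noise_stats, output_scores,
                     weights, bias):
    """Logistic-regression fusion of the decoding confidence, the four
    segment-level noise statistics, and the decoding output scores into
    the comprehensive confidence (probability the utterance is speech)."""
    x = np.concatenate(([decoding_conf], noise_stats, output_scores))
    return float(1.0 / (1.0 + np.exp(-(np.dot(weights, x) + bias))))

p = fused_confidence(0.8, [0.9, 0.1, 0.4, 0.1], [0.2, 0.3, 0.5],
                     np.zeros(8), 0.0)
```

With all-zero (untrained) weights the fusion outputs 0.5; the trained weights learned from the third training data determine how much each input contributes.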
And step 480, determining a voice recognition result according to the comprehensive confidence.
The comprehensive confidence comprises the probability that the voice to be recognized contains the voice and the probability that the voice to be recognized does not contain the voice.
In one embodiment, step 480 may specifically include:
If the probability that the voice to be recognized contains voice is greater than or equal to a first preset threshold, the voice recognition result is determined to be that the voice to be recognized contains voice; if the probability that the voice to be recognized contains voice is greater than or equal to a second preset threshold and smaller than the first preset threshold, the voice recognition result is determined to be that the voice to be recognized does not contain voice; and if the probability that the voice to be recognized contains voice is smaller than the second preset threshold, a voice recognition error is determined, or the voice to be recognized is optimized to obtain optimized voice and voice recognition is performed again based on the optimized voice.
The first preset threshold value is larger than the second preset threshold value, and the first preset threshold value and the second preset threshold value are smaller than 1.
Specifically, after determining the comprehensive confidence of the voice to be recognized based on the voice recognition model, the probability that the voice to be recognized contains the voice and the second preset threshold may be compared first.
On the one hand, if the probability that the voice to be recognized contains voice is greater than or equal to the second preset threshold, the voice recognition result can be determined according to that probability, i.e. the decoding result of the voice recognition module can be adopted. The probability that the voice to be recognized contains voice is then further compared with the first preset threshold: if it is greater than or equal to the first preset threshold, the voice recognition result is determined to be that the voice to be recognized contains voice; if it is smaller than the first preset threshold, the voice recognition result is determined to be that the voice to be recognized does not contain voice.
On the other hand, if the probability that the voice to be recognized contains the voice is smaller than the second preset threshold, the voice recognition result cannot be determined according to the probability that the voice to be recognized contains the voice, the decoding result of the voice to be recognized by the voice recognition module is not adopted, and then the voice recognition error can be determined, or the voice to be recognized can be optimized to obtain the optimized voice, and the voice recognition is performed again based on the optimized voice.
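The two-threshold decision logic can be sketched as follows; the threshold values are illustrative, since the embodiment only requires that both are below 1 and that the first exceeds the second:

```python
def decide(p_speech, t_high=0.7, t_low=0.3):
    """Two-threshold decision on the probability that the utterance
    contains speech (threshold values are illustrative assumptions)."""
    if p_speech >= t_high:
        return "contains speech"   # adopt the decoding result: speech present
    if p_speech >= t_low:
        return "no speech"         # adopt the decoding result: no speech
    return "reject"                # discard the result; flag an error or
                                   # denoise and re-run recognition
```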
In one embodiment, optimizing the speech to be recognized to obtain an optimized speech includes:
The voice frames in the voice to be recognized whose noise confidence is greater than a preset confidence are set to silence, so as to obtain the optimized voice.
Specifically, the noise confidence of each voice frame contained in the voice to be recognized is compared with the preset confidence, and if the noise confidence of any voice frame is greater than the preset confidence, that voice frame is set to silence, so that the voice to be recognized is optimized to obtain the optimized voice. Setting the voice frames whose noise confidence is greater than the preset confidence to silence achieves noise reduction of the voice to be recognized; continuing voice recognition on the optimized voice obtained through noise reduction can then improve the accuracy of voice recognition.
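The silence-masking optimization can be sketched as:

```python
import numpy as np

def mute_noisy_frames(frames, noise_conf, threshold=0.8):
    """Set every frame whose noise confidence exceeds the threshold to
    silence (all-zero samples), producing the optimized speech."""
    out = frames.copy()
    out[np.asarray(noise_conf) > threshold] = 0.0
    return out

# three frames of four samples; the first and third are judged noisy
optimized = mute_noisy_frames(np.ones((3, 4)), [0.9, 0.1, 0.95])
```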
In the embodiment of the invention, the voice to be recognized can be determined to contain voice or not according to the comprehensive confidence, or the decoding result of the voice recognition module on the voice to be recognized can be omitted, so that voice recognition errors are determined, or noise reduction is carried out on the voice to be recognized to optimize the voice to be recognized so as to obtain optimized voice, and then the optimized voice is continuously decoded based on the voice recognition module so as to obtain the voice recognition result.
The voice recognition method comprises: determining a decoding output score of the voice to be recognized according to the first candidate words obtained by decoding the voice to be recognized; determining second candidate words corresponding to the voice to be recognized according to the first candidate words obtained by decoding the voice to be recognized; determining decoding characteristics of each second candidate word; determining a decoding confidence of the voice to be recognized according to the decoding characteristics of each second candidate word; determining a noise confidence of each voice frame contained in the voice to be recognized; determining a noise confidence of the voice to be recognized according to the noise confidence of each voice frame; determining a comprehensive confidence of the voice to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the voice to be recognized; and determining a voice recognition result according to the comprehensive confidence.
According to the technical scheme, firstly, the decoding output score of the voice to be recognized can be determined according to the first candidate words obtained by decoding the voice to be recognized, which provides a data basis for determining the comprehensive confidence. The second candidate words corresponding to the voice to be recognized are then determined according to the first candidate words, and after the decoding characteristics of each second candidate word are determined, a more accurate decoding confidence of the voice to be recognized is determined according to those decoding characteristics. Secondly, a more accurate segment-level noise confidence of the voice to be recognized can be determined according to the frame-level noise confidence of each voice frame contained in the voice to be recognized. Further, the comprehensive confidence of the voice to be recognized can be determined by combining the segment-level decoding confidence, the segment-level noise confidence and the decoding output score, so that a more accurate comprehensive confidence of the voice to be recognized is obtained. On the premise of not increasing the cost, the accuracy of voice recognition is thereby improved.
In addition, after determining that the decoding result of the voice to be recognized by the voice recognition module is not adopted according to the comprehensive confidence, voice recognition errors can be determined, or noise can be reduced on the voice to be recognized to optimize the voice to be recognized so as to obtain optimized voice, and then the optimized voice is continuously decoded based on the voice recognition module so as to obtain the voice recognition result.
Fig. 6 is a schematic diagram of a speech recognition system provided in the embodiment of the present invention. As shown in fig. 6, the speech recognition system may include a speech recognition module 100, a decoding confidence module 200, a noise confidence module 300, a result determination module 400, and a processing module 500. The speech recognition module 100 is configured to decode the speech to be recognized once to determine the first candidate words of the speech to be recognized and a first word graph containing the first candidate words. The decoding confidence module 200 is configured to determine the decoding output score of the speech to be recognized and the second candidate words corresponding to the speech to be recognized according to the first candidate words, and, after determining the decoding features of each second candidate word, to determine the decoding confidence of the speech to be recognized according to the decoding features of each second candidate word. The noise confidence module 300 is configured to determine the noise confidence of each speech frame contained in the speech to be recognized, and to determine the noise confidence of the speech to be recognized according to the noise confidence of each speech frame. The result determination module 400 is configured to determine the comprehensive confidence of the speech to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the speech to be recognized. The processing module 500 is configured to: when the probability that the speech to be recognized contains speech is greater than or equal to a first preset threshold, determine that the speech recognition result is that the speech to be recognized contains speech; when the probability is greater than or equal to a second preset threshold and smaller than the first preset threshold, determine that the speech recognition result is that the speech to be recognized does not contain speech; and when the probability is smaller than the second preset threshold, determine a speech recognition error, or optimize the speech to be recognized to obtain optimized speech and re-perform speech recognition based on the optimized speech.
The voice recognition system provided by the embodiment of the invention can execute the voice recognition method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of executing the voice recognition method.
Fig. 7 is a schematic structural diagram of a speech recognition device according to an embodiment of the present invention. The device belongs to the same inventive concept as the speech recognition method of the embodiments above; for details not described in the device embodiment, reference may be made to the embodiments of the speech recognition method.
The specific structure of the speech recognition device is shown in fig. 7, and includes:
the decoding module 710 is configured to determine, according to a first candidate word obtained by decoding a speech to be recognized, a decoding output score of the speech to be recognized and a second candidate word corresponding to the speech to be recognized;
The decoding confidence determining module 720 is configured to determine decoding features of the second candidate words, and determine decoding confidence of the speech to be recognized according to the decoding features of the second candidate words;
the noise confidence determining module 730 is configured to determine a noise confidence of each speech frame included in the speech to be recognized, and determine a noise confidence of the speech to be recognized according to the noise confidence of each speech frame;
The execution module 740 is configured to determine a comprehensive confidence level of the speech to be recognized according to the decoding confidence level, the noise confidence level, and the decoding output score of the speech to be recognized, and determine a speech recognition result according to the comprehensive confidence level.
Based on the above embodiment, the decoding module 710 is specifically configured to:
decoding the voice to be recognized once based on a voice recognition module formed by a language model and an acoustic model to obtain a first word graph containing the first candidate word;
Determining a language score and an acoustic score of each first candidate word, and determining the decoding output score of the voice to be recognized according to the language score and the acoustic score of each first candidate word;
and performing secondary decoding on the first word graph based on the minimum Bayesian risk by taking the editing distance as a criterion to obtain the second candidate words corresponding to the voice to be recognized and the posterior probability of each second candidate word.
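The second decoding pass described above — minimum Bayes risk decoding with the edit distance as the risk criterion — can be sketched with a simplified N-best approximation. Production systems run this criterion on the word lattice itself; the hypotheses and posterior probabilities below are invented purely for illustration:

```python
def edit_distance(a, b):
    """Token-level Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def mbr_rescore(nbest):
    """nbest: list of (hypothesis, posterior probability) pairs.
    Returns the hypothesis minimizing the expected edit distance
    to the other weighted hypotheses (N-best MBR approximation)."""
    def risk(h):
        return sum(p * edit_distance(h.split(), other.split())
                   for other, p in nbest)
    return min((h for h, _ in nbest), key=risk)
```

Selecting the hypothesis with the smallest expected edit distance is the N-best form of the lattice-level criterion the embodiment uses; the lattice version additionally yields the per-word posterior probabilities mentioned above.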
On the basis of the foregoing embodiment, the decoding features include a confidence score, a word category, a probability distribution, a word length, and a word graph depth of the second candidate word, and accordingly, the decoding confidence determining module 720 is specifically configured to:
Determining the word category of each second candidate word according to the category information of each second candidate word; determining the probability distribution of each second candidate word according to the number of occurrences of each second candidate word among all the second candidate words corresponding to the voice to be recognized; determining the word length of each second candidate word according to the number of phonemes contained in each second candidate word; and determining the word graph depth of each second candidate word, in a second word graph obtained by secondarily decoding the first candidate words, according to the number of outgoing edges of all nodes within the time period corresponding to each second candidate word and the length of that time period;
respectively inputting the decoding characteristics of each second candidate word into a pre-trained decoding confidence coefficient model to obtain the decoding confidence coefficient of each second candidate word;
and determining the decoding confidence degree of the voice to be recognized according to the decoding confidence degree of each second candidate word.
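As a rough sketch of how the decoding features described above might be assembled and aggregated — the field names, the depth normalization, and the mean aggregation are assumptions, since the patent leaves the trained decoding confidence model and the exact aggregation unspecified:

```python
def word_graph_depth(node_out_edges, duration):
    """One reading of the 'outgoing edges per time period' definition:
    total outgoing edges of all nodes in the word's time span,
    divided by the span length (an assumption)."""
    return sum(node_out_edges) / duration

def decoding_features(word, posterior, category, counts, total,
                      num_phonemes, depth):
    """Assemble the per-word feature dict that would be fed to the
    (not shown) pre-trained decoding confidence model."""
    return {
        "confidence_score": posterior,                 # normalized posterior
        "word_category": category,                     # from category info
        "probability_distribution": counts[word] / total,
        "word_length": num_phonemes,                   # phoneme count
        "word_graph_depth": depth,
    }

def utterance_decoding_confidence(word_confidences):
    """Aggregate the per-word model outputs into an utterance-level
    decoding confidence; the mean is one plausible choice."""
    return sum(word_confidences) / len(word_confidences)
```

In this sketch, each second candidate word yields one feature dict, the trained model maps each dict to a per-word confidence, and those confidences are then pooled into the utterance-level decoding confidence.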
Based on the above embodiment, the noise confidence determining module 730 is specifically configured to:
framing the voice to be recognized to obtain a voice frame contained in the voice to be recognized;
Determining the mel cepstrum coefficient of each voice frame, and respectively inputting the mel cepstrum coefficient of each voice frame into a pre-trained noise confidence coefficient model to obtain the noise confidence coefficient of each voice frame;
and determining the noise confidence coefficient of the voice to be recognized according to the maximum noise confidence coefficient, the minimum noise confidence coefficient, the noise confidence coefficient mean value and the noise confidence coefficient variance in the noise confidence coefficient of each voice frame contained in the voice to be recognized.
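The frame-level aggregation described above can be sketched as follows, assuming each frame's noise confidence has already been produced by the pre-trained frame-level model from its Mel-cepstral input:

```python
def utterance_noise_features(frame_confidences):
    """Summarize per-frame noise confidences (each assumed in [0, 1])
    into the four statistics the method uses: maximum, minimum,
    mean, and variance."""
    n = len(frame_confidences)
    mean = sum(frame_confidences) / n
    variance = sum((c - mean) ** 2 for c in frame_confidences) / n
    return {
        "max": max(frame_confidences),
        "min": min(frame_confidences),
        "mean": mean,
        "variance": variance,
    }
```

How these four statistics are combined into the single utterance-level noise confidence is not fixed by the text; a small trained model or a weighted sum are both plausible choices.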
Based on the above embodiment, the execution module 740 is specifically configured to:
Inputting the decoding confidence coefficient, the noise confidence coefficient and the decoding output score of the voice to be recognized into a pre-trained voice recognition model to obtain the comprehensive confidence coefficient of the voice to be recognized;
and determining a voice recognition result according to the comprehensive confidence coefficient.
In one embodiment, the integrated confidence level includes a probability that the speech to be recognized includes speech, and accordingly, determining a speech recognition result according to the integrated confidence level includes:
if the probability that the voice to be recognized contains the voice is larger than or equal to a first preset threshold value, determining that the voice recognition result is that the voice to be recognized contains the voice;
If the probability that the voice to be recognized contains the voice is larger than or equal to a second preset threshold value and smaller than the first preset threshold value, determining that the voice recognition result is that the voice to be recognized does not contain the voice;
if the probability that the voice to be recognized contains the voice is smaller than the second preset threshold value, determining voice recognition errors or optimizing the voice to be recognized to obtain optimized voice, and carrying out voice recognition again based on the optimized voice.
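The two-threshold decision above might look like this in outline; the numeric thresholds 0.8 and 0.3 are placeholders for the first and second preset thresholds, not values given in the patent:

```python
def determine_result(speech_probability, high=0.8, low=0.3):
    """Map the comprehensive confidence (probability that the input
    contains speech) to one of the three outcomes described in the
    embodiment. 0.8 / 0.3 are illustrative threshold values."""
    if speech_probability >= high:
        return "contains_speech"
    if speech_probability >= low:
        return "does_not_contain_speech"
    return "error_or_optimize"  # reject, or denoise and re-recognize
```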
Further, optimizing the speech to be recognized to obtain an optimized speech includes:
and setting the voice frame with the noise confidence coefficient larger than the preset confidence coefficient in the voice to be recognized as mute to obtain the optimized voice.
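The muting step might be sketched as follows, treating the utterance as a list of sample frames; the 0.5 default is a placeholder for the preset confidence:

```python
def mute_noisy_frames(frames, noise_confidences, threshold=0.5):
    """Set to silence (all-zero samples) every frame whose noise
    confidence exceeds the preset threshold; the remaining frames
    form the optimized speech."""
    return [[0.0] * len(f) if c > threshold else f
            for f, c in zip(frames, noise_confidences)]
```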
The voice recognition device provided by the embodiment of the invention can execute the voice recognition method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of executing the voice recognition method.
It should be noted that, in the embodiment of the voice recognition device, the units and modules included are only divided according to functional logic, but the division is not limited to the above, so long as the corresponding functions can be implemented; in addition, the specific names of the functional units are only for convenience of distinguishing them from each other and are not used to limit the protection scope of the present invention.
Fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present invention. Fig. 8 shows a block diagram of an exemplary computer device 8 suitable for use in implementing embodiments of the invention. The computer device 8 shown in fig. 8 is only an example and should not be construed as limiting the functionality and scope of use of embodiments of the invention.
As shown in fig. 8, the computer device 8 is in the form of a general purpose computer device. The components of computer device 8 may include, but are not limited to, one or more processors or processing units 16, a system memory 28, and a bus 18 that connects the various system components, including the system memory 28 and the processing units 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer device 8 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 8 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The computer device 8 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 8, commonly referred to as a "hard disk drive"). Although not shown in fig. 8, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. The system memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, system memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods of the embodiments described herein.
The computer device 8 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), one or more devices that enable a user to interact with the computer device 8, and/or any devices (e.g., network card, modem, etc.) that enable the computer device 8 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Moreover, the computer device 8 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through a network adapter 20. As shown in fig. 8, the network adapter 20 communicates with other modules of the computer device 8 via the bus 18. It should be appreciated that although not shown in FIG. 8, other hardware and/or software modules may be used in connection with computer device 8, including, but not limited to, microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes various functional applications and page displays by running programs stored in the system memory 28, for example, implementing the voice recognition method provided by the present embodiment, the method including:
Determining a decoding output score of the voice to be recognized and a second candidate word corresponding to the voice to be recognized according to a first candidate word obtained by decoding the voice to be recognized;
determining decoding characteristics of the second candidate words, and determining decoding confidence of the voice to be recognized according to the decoding characteristics of the second candidate words;
determining the noise confidence coefficient of each voice frame contained in the voice to be recognized, and determining the noise confidence coefficient of the voice to be recognized according to the noise confidence coefficient of each voice frame;
and determining the comprehensive confidence of the voice to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the voice to be recognized, and determining a voice recognition result according to the comprehensive confidence.
Of course, those skilled in the art will appreciate that the processor may also implement the technical solution of the speech recognition method provided in any embodiment of the present invention.
An embodiment of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a speech recognition method such as provided by the embodiment of the present invention, the method comprising:
Determining a decoding output score of the voice to be recognized and a second candidate word corresponding to the voice to be recognized according to a first candidate word obtained by decoding the voice to be recognized;
determining decoding characteristics of the second candidate words, and determining decoding confidence of the voice to be recognized according to the decoding characteristics of the second candidate words;
determining the noise confidence coefficient of each voice frame contained in the voice to be recognized, and determining the noise confidence coefficient of the voice to be recognized according to the noise confidence coefficient of each voice frame;
and determining the comprehensive confidence of the voice to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the voice to be recognized, and determining a voice recognition result according to the comprehensive confidence.
The computer storage media of embodiments of the invention may take the form of any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
It will be appreciated by those of ordinary skill in the art that the modules or steps of the invention described above may be implemented in a general purpose computing device, they may be centralized on a single computing device, or distributed over a network of computing devices, or they may alternatively be implemented in program code executable by a computer device, such that they are stored in a memory device and executed by the computing device, or they may be separately fabricated as individual integrated circuit modules, or multiple modules or steps within them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
In addition, the technical scheme of the invention can acquire, store, use, process and the like the data, which accords with the relevant regulations of national laws and regulations.
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (13)

Claims (13)

1. A speech recognition method, comprising:
determining, according to a first candidate word obtained by decoding speech to be recognized, a decoding output score of the speech to be recognized and a second candidate word corresponding to the speech to be recognized;
determining a decoding feature of each second candidate word, and determining a decoding confidence of the speech to be recognized according to the decoding feature of each second candidate word, wherein the decoding feature comprises a confidence score, a word category, a probability distribution, a word length and a word graph depth of the second candidate word;
determining a noise confidence of each speech frame contained in the speech to be recognized, and determining a noise confidence of the speech to be recognized according to the noise confidence of each speech frame;
determining a comprehensive confidence of the speech to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the speech to be recognized, and determining a speech recognition result according to the comprehensive confidence.

2. The speech recognition method according to claim 1, wherein determining the decoding output score of the speech to be recognized according to the first candidate word obtained by decoding the speech to be recognized comprises:
decoding the speech to be recognized once based on a speech recognition module composed of a language model and an acoustic model, to obtain the first candidate word;
determining a language score and an acoustic score of each first candidate word, and determining the decoding output score of the speech to be recognized according to the language score and the acoustic score of each first candidate word.

3. The speech recognition method according to claim 1, wherein determining the second candidate word corresponding to the speech to be recognized according to the first candidate word obtained by decoding the speech to be recognized comprises:
decoding the speech to be recognized once based on a speech recognition module composed of a language model and an acoustic model, to obtain a first word graph containing the first candidate word;
performing, on the first word graph, secondary decoding based on the minimum Bayes risk with the edit distance as a criterion, to obtain the second candidate word corresponding to the speech to be recognized and a posterior probability of each second candidate word.

4. The speech recognition method according to claim 3, wherein determining the decoding feature of each second candidate word comprises:
normalizing the posterior probability of each second candidate word to obtain the confidence score of each second candidate word;
determining the word category of each second candidate word according to category information of each second candidate word;
determining the probability distribution of each second candidate word according to the number of occurrences of each second candidate word among all second candidate words corresponding to the speech to be recognized;
determining the word length of each second candidate word according to the number of phonemes contained in each second candidate word;
determining, in a second word graph obtained by secondarily decoding the first candidate word, the word graph depth of each second candidate word according to the number of outgoing edges of all nodes within the time period corresponding to each second candidate word and the length of the time period.

5. The speech recognition method according to claim 1, wherein determining the decoding confidence of the speech to be recognized according to the decoding feature of each second candidate word comprises:
inputting the decoding feature of each second candidate word into a pre-trained decoding confidence model to obtain the decoding confidence of each second candidate word;
determining the decoding confidence of the speech to be recognized according to the decoding confidence of each second candidate word.

6. The speech recognition method according to claim 1, wherein determining the noise confidence of each speech frame contained in the speech to be recognized comprises:
framing the speech to be recognized to obtain the speech frames contained in the speech to be recognized;
determining a Mel-frequency cepstral coefficient of each speech frame, and inputting the Mel-frequency cepstral coefficient of each speech frame into a pre-trained noise confidence model to obtain the noise confidence of each speech frame.

7. The speech recognition method according to claim 1, wherein determining the noise confidence of the speech to be recognized according to the noise confidence of each speech frame comprises:
determining the noise confidence of the speech to be recognized according to a maximum noise confidence, a minimum noise confidence, a noise confidence mean and a noise confidence variance among the noise confidences of the speech frames contained in the speech to be recognized.

8. The speech recognition method according to claim 1, wherein determining the comprehensive confidence of the speech to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the speech to be recognized comprises:
inputting the decoding confidence, the noise confidence and the decoding output score of the speech to be recognized into a pre-trained speech recognition model to obtain the comprehensive confidence of the speech to be recognized.

9. The speech recognition method according to claim 1, wherein the comprehensive confidence comprises a probability that the speech to be recognized contains speech, and determining the speech recognition result according to the comprehensive confidence comprises:
if the probability that the speech to be recognized contains speech is greater than or equal to a first preset threshold, determining that the speech recognition result is that the speech to be recognized contains speech;
if the probability that the speech to be recognized contains speech is greater than or equal to a second preset threshold and less than the first preset threshold, determining that the speech recognition result is that the speech to be recognized does not contain speech;
if the probability that the speech to be recognized contains speech is less than the second preset threshold, determining a speech recognition error, or optimizing the speech to be recognized to obtain optimized speech and performing speech recognition again based on the optimized speech.

10. The speech recognition method according to claim 9, wherein optimizing the speech to be recognized to obtain the optimized speech comprises:
setting, to silence, the speech frames in the speech to be recognized whose noise confidence is greater than a preset confidence, to obtain the optimized speech.

11. A speech recognition apparatus, comprising:
a decoding module, configured to determine, according to a first candidate word obtained by decoding speech to be recognized, a decoding output score of the speech to be recognized and a second candidate word corresponding to the speech to be recognized;
a decoding confidence determination module, configured to determine a decoding feature of each second candidate word, and determine a decoding confidence of the speech to be recognized according to the decoding feature of each second candidate word, wherein the decoding feature comprises a confidence score, a word category, a probability distribution, a word length and a word graph depth of the second candidate word;
a noise confidence determination module, configured to determine a noise confidence of each speech frame contained in the speech to be recognized, and determine a noise confidence of the speech to be recognized according to the noise confidence of each speech frame;
an execution module, configured to determine a comprehensive confidence of the speech to be recognized according to the decoding confidence, the noise confidence and the decoding output score of the speech to be recognized, and determine a speech recognition result according to the comprehensive confidence.

12. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the speech recognition method according to any one of claims 1-10 when executing the program.

13. A storage medium containing computer-executable instructions, wherein the computer-executable instructions, when executed by a computer processor, perform the speech recognition method according to any one of claims 1-10.
CN202210753629.8A 2022-06-28 2022-06-28 A speech recognition method, device, equipment and storage medium Active CN115294974B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210753629.8A CN115294974B (en) 2022-06-28 2022-06-28 A speech recognition method, device, equipment and storage medium
PCT/CN2023/097748 WO2024001662A1 (en) 2022-06-28 2023-06-01 Speech recognition method and apparatus, device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210753629.8A CN115294974B (en) 2022-06-28 2022-06-28 A speech recognition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115294974A CN115294974A (en) 2022-11-04
CN115294974B true CN115294974B (en) 2025-02-28

Family

ID=83820283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210753629.8A Active CN115294974B (en) 2022-06-28 2022-06-28 A speech recognition method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN115294974B (en)
WO (1) WO2024001662A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294974B (en) * 2022-06-28 2025-02-28 京东科技信息技术有限公司 A speech recognition method, device, equipment and storage medium

Citations (2)

Publication number Priority date Publication date Assignee Title
CN101118745A (en) * 2006-08-04 2008-02-06 中国科学院声学研究所 A Fast Calculation Method of Confidence Degree in Speech Recognition System
CN101447183A (en) * 2007-11-28 2009-06-03 中国科学院声学研究所 Processing method of high-performance confidence level applied to speech recognition system

Family Cites Families (10)

Publication number Priority date Publication date Assignee Title
US20040064315A1 (en) * 2002-09-30 2004-04-01 Deisher Michael E. Acoustic confidence driven front-end preprocessing for speech recognition in adverse environments
CN103578468B (en) * 2012-08-01 2017-06-27 联想(北京)有限公司 The method of adjustment and electronic equipment of a kind of confidence coefficient threshold of voice recognition
US11138334B1 (en) * 2018-10-17 2021-10-05 Medallia, Inc. Use of ASR confidence to improve reliability of automatic audio redaction
CN111341305B (en) * 2020-03-05 2023-09-26 苏宁云计算有限公司 Audio data labeling method, device and system
CN111883109B (en) * 2020-07-01 2023-09-26 北京猎户星空科技有限公司 Voice information processing and verification model training method, device, equipment and medium
CN112599128B (en) * 2020-12-31 2024-06-11 百果园技术(新加坡)有限公司 Voice recognition method, device, equipment and storage medium
CN112951219A (en) * 2021-02-01 2021-06-11 思必驰科技股份有限公司 Noise rejection method and device
CN114093358A (en) * 2021-11-17 2022-02-25 北京地平线信息技术有限公司 Speech recognition method and device, electronic device and storage medium
CN114255754B (en) * 2021-12-27 2025-08-19 贝壳找房(北京)科技有限公司 Speech recognition method, electronic device, program product, and storage medium
CN115294974B (en) * 2022-06-28 2025-02-28 京东科技信息技术有限公司 A speech recognition method, device, equipment and storage medium

Also Published As

Publication number Publication date
WO2024001662A1 (en) 2024-01-04
CN115294974A (en) 2022-11-04

Similar Documents

Publication Publication Date Title
US12254865B2 (en) Multi-dialect and multilingual speech recognition
US12051407B2 (en) Contextual biasing for speech recognition
US11503155B2 (en) Interactive voice-control method and apparatus, device and medium
US11545142B2 (en) Using context information with end-to-end models for speech recognition
CN1975858B (en) session control device
WO2017076222A1 (en) Speech recognition method and apparatus
CN108899013B (en) Voice search method and device and voice recognition system
JP2022529691A (en) Combination endpoint determination and automatic speech recognition
KR20220004224A (en) Context biasing for speech recognition
CN113053367B (en) Speech recognition method, speech recognition model training method and device
CN109754809A (en) Audio recognition method, device, electronic equipment and storage medium
WO2021000403A1 (en) Voice matching method for intelligent dialogue system, electronic device and computer device
CN110070859B (en) Voice recognition method and device
US10152298B1 (en) Confidence estimation based on frequency
JP7659080B2 (en) Reducing Streaming ASR Model Delay Using Self-Alignment
EP3739583B1 (en) Dialog device, dialog method, and dialog computer program
US20190156832A1 (en) Diarization Driven by the ASR Based Segmentation
WO2020156342A1 (en) Voice recognition method and device, electronic device and storage medium
US20190156835A1 (en) Diarization Driven by Meta-Information Identified in Discussion Content
WO2018192186A1 (en) Speech recognition method and apparatus
US11626107B1 (en) Natural language processing
TWI818427B (en) Method and system for correcting speaker diarisation using speaker change detection based on text
CN115457938A (en) Method, device, storage medium and electronic device for identifying wake-up words
CN112767921A (en) Voice recognition self-adaption method and system based on cache language model
CN111583910B (en) Model updating method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant