CN114974249A - Voice recognition method, device and storage medium - Google Patents
- Publication number
- CN114974249A CN202110193727.6A
- Authority
- CN
- China
- Prior art keywords
- sequence
- vocabulary
- target
- word
- target sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Machine Translation (AREA)
Abstract
Description
Technical field
The present application relates to the field of computer technology, and in particular to a speech recognition method and apparatus.
Background
With the development of the Internet, speech recognition plays an increasingly important role. Automatic speech recognition (ASR) is a technology that enables a machine to convert speech signals into corresponding text or commands through recognition and understanding. Large vocabulary continuous speech recognition (LVCSR) in particular has developed rapidly in recent years and is widely used in many fields.
In the prior art, a speech recognition engine deployed in the cloud is typically used, and its language model is generally trained on a general-domain corpus. Because the amount of data is limited, however, such a corpus cannot cover every domain. In domain-specific speech recognition tasks, for example in medicine, architecture, or artificial intelligence, the language model covers the domain insufficiently, or many out-of-vocabulary (OOV) words appear that are not in the dictionary. As a result, ASR performance degrades and recognition accuracy drops.
Summary of the invention
Embodiments of the present application provide a speech recognition method, apparatus, and storage medium that address the inability of the prior art to recognize speech accurately and that improve speech recognition accuracy.
In a first aspect, an embodiment of the present application provides a speech recognition method, including:
an edge end obtains a cloud recognition result, where the cloud recognition result includes at least one target sequence obtained by the cloud by recognizing a target recognition object; and, based on a local reference text corresponding to the target recognition object, the at least one target sequence is corrected to obtain an edge-end recognition result.
Optionally, in a speech recognition method according to an embodiment of the present application, correcting the at least one target sequence based on the local reference text corresponding to the target recognition object to obtain the edge-end recognition result includes:
performing vocabulary replacement on one candidate sequence among the target sequences, based on the language model corresponding to the target sequence and on the local reference text, where the language model corresponding to the target sequence is trained on the local reference text;
and/or,
performing vocabulary replacement on one candidate sequence among the target sequences, based on named entity recognition (NER).
Optionally, in a speech recognition method according to an embodiment of the present application, the vocabulary matching probability of the candidate sequence is the highest among the vocabulary matching probabilities of all the target sequences;
where, for each target sequence, the vocabulary matching probability is calculated from the target sequence and the language model corresponding to it;
the vocabulary matching probability of a target sequence describes how frequently the words in that target sequence appear in the local reference text.
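The selection rule above can be sketched as follows. This is a minimal illustration, not the method as claimed: the text does not specify the form of the language model, so the sketch assumes an add-one-smoothed unigram model trained on the local reference text, and the names `train_unigram`, `vocabulary_matching_score`, and `pick_candidate` are hypothetical.

```python
import math
from collections import Counter


def train_unigram(reference_tokens):
    """Count word frequencies in the local reference text (assumed model)."""
    counts = Counter(reference_tokens)
    return counts, sum(counts.values())


def vocabulary_matching_score(sequence, counts, total):
    """Average add-one-smoothed log probability of the words in `sequence`."""
    vocab_size = len(counts) + 1  # +1 reserves probability mass for unseen words
    log_prob = sum(
        math.log((counts[word] + 1) / (total + vocab_size)) for word in sequence
    )
    return log_prob / max(len(sequence), 1)


def pick_candidate(target_sequences, counts, total):
    """Return the target sequence with the highest vocabulary matching score."""
    return max(
        target_sequences,
        key=lambda seq: vocabulary_matching_score(seq, counts, total),
    )


reference = "the stent graft was deployed in the aorta".split()
counts, total = train_unigram(reference)
targets = [["the", "stand", "craft"], ["the", "stent", "graft"]]
candidate = pick_candidate(targets, counts, total)
```

Averaging over the sequence length keeps longer hypotheses from being penalized merely for containing more words; that normalization is a design choice of this sketch.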
Optionally, in a speech recognition method according to an embodiment of the present application, performing vocabulary replacement on one candidate sequence among the target sequences, based on the language model corresponding to the target sequence and on the local reference text, includes:
obtaining, based on the language model corresponding to the target sequence, a continuous matching probability for each continuous word combination in the candidate sequence, where the continuous matching probability indicates how frequently the corresponding continuous word combination appears in the local reference text;
if the continuous matching probability of any continuous word combination in the candidate sequence is lower than a first preset threshold, replacing that combination (the first continuous word combination) with a replacement text from the local reference text;
where the phoneme matching degree between the phoneme sequence of the replacement text and the phoneme sequence of the first continuous word combination is greater than a second preset threshold, and the replacement text appears more frequently in the local reference text than the first continuous word combination does.
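The replacement steps above can be sketched as follows, under several assumptions that the text does not fix: continuous word combinations are taken to be bigrams, the continuous matching probability is the bigram's relative frequency in the local reference text, and the phoneme matching degree is the `difflib.SequenceMatcher` similarity ratio over phoneme lists. `LEXICON`, both threshold values, and the function names are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical toy pronunciation lexicon (word -> phoneme list).
LEXICON = {
    "the": ["DH", "AH"],
    "stand": ["S", "T", "AE", "N", "D"],
    "stent": ["S", "T", "EH", "N", "T"],
    "craft": ["K", "R", "AE", "F", "T"],
    "graft": ["G", "R", "AE", "F", "T"],
    "was": ["W", "AA", "Z"],
    "deployed": ["D", "IH", "P", "L", "OY", "D"],
}


def phonemes(words):
    """Concatenate the phonemes of a word combination."""
    return [p for word in words for p in LEXICON[word]]


def phoneme_match(a, b):
    """Similarity in [0, 1] between the phoneme sequences of two combinations."""
    return SequenceMatcher(None, phonemes(a), phonemes(b)).ratio()


def correct(candidate, reference, first_threshold=0.05, second_threshold=0.6):
    """Replace rare bigrams with similar-sounding, more frequent reference bigrams."""
    ref_bigrams = Counter(zip(reference, reference[1:]))
    total = sum(ref_bigrams.values())
    out = list(candidate)
    if total == 0:
        return out
    for i in range(len(out) - 1):
        bigram = (out[i], out[i + 1])
        if ref_bigrams[bigram] / total >= first_threshold:
            continue  # frequent enough in the reference text, keep it
        best, best_score = None, second_threshold
        for ref_bigram, count in ref_bigrams.items():
            score = phoneme_match(bigram, ref_bigram)
            # Must sound alike AND be more frequent than the original bigram.
            if count > ref_bigrams[bigram] and score > best_score:
                best, best_score = ref_bigram, score
        if best is not None:
            out[i:i + 2] = best
    return out


reference = "the stent graft was deployed the stent graft".split()
corrected = correct(["the", "stand", "craft", "was", "deployed"], reference)
```

With these toy inputs, the misrecognized "stand craft" is rewritten to "stent graft", because each affected bigram is absent from the reference text while a phonetically close bigram occurs there.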
Optionally, in a speech recognition method according to an embodiment of the present application, the target sequence includes a phoneme sequence and/or a word sequence;
correspondingly, the language model corresponding to the phoneme sequence includes a phoneme sequence language model, and the language model corresponding to the word sequence includes a word sequence language model.
Optionally, in a speech recognition method according to an embodiment of the present application, if the target sequence includes a phoneme sequence, performing vocabulary replacement on one candidate sequence among the target sequences, based on the language model corresponding to the target sequence and on the local reference text, includes:
performing vocabulary replacement on one candidate sequence among the phoneme sequences, based on the phoneme sequence language model, to obtain a first phoneme sequence;
performing vocabulary replacement, based on the word sequence language model, on one candidate sequence among the at least one word sequence corresponding to the first phoneme sequence.
Optionally, in a speech recognition method according to an embodiment of the present application, if the target sequence further includes a word sequence, performing vocabulary replacement on one candidate sequence among the target sequences, based on the language model corresponding to the target sequence and on the local reference text, further includes:
if the fourth matching probability of the first word sequence is greater than the fourth matching probability of the second word sequence, determining the first word sequence to be the edge-end recognition result;
if the fourth matching probability of the second word sequence is greater than the fourth matching probability of the first word sequence, determining the second word sequence to be the edge-end recognition result;
where the first word sequence is obtained by performing vocabulary replacement on one candidate sequence among the at least one word sequence corresponding to the first phoneme sequence;
the second word sequence is obtained by performing vocabulary replacement, based on the word sequence language model, on one candidate sequence among the word sequences in the target sequence;
the fourth matching probability of the first word sequence describes how well the first word sequence matches the local reference text;
the fourth matching probability of the second word sequence describes how well the second word sequence matches the local reference text.
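The comparison above can be sketched as follows. The text does not say how the fourth matching probability is computed or what happens on a tie, so this sketch assumes the score is the fraction of words that occur in the local reference text and that a tie keeps the first word sequence; `match_score` and `select_edge_result` are hypothetical names.

```python
def match_score(words, reference_vocab):
    """Assumed score: fraction of words that occur in the local reference text."""
    if not words:
        return 0.0
    return sum(word in reference_vocab for word in words) / len(words)


def select_edge_result(first_seq, second_seq, reference_vocab):
    """Pick the word sequence that matches the reference text better.

    Ties keep the first sequence (an assumption; the text leaves ties open).
    """
    first_score = match_score(first_seq, reference_vocab)
    second_score = match_score(second_seq, reference_vocab)
    return second_seq if second_score > first_score else first_seq


reference_vocab = {"stent", "graft", "aorta"}
result = select_edge_result(["stent", "graft"], ["stand", "graft"], reference_vocab)
```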
Optionally, in a speech recognition method according to an embodiment of the present application, performing vocabulary replacement on one candidate sequence among the target sequences based on named entity recognition (NER) to obtain the edge-end recognition result includes:
replacing, based on an NER vocabulary, a word in one candidate sequence among the target sequences with a first replacement word from the NER vocabulary, to obtain the edge-end recognition result;
where the phoneme matching probability between the phoneme sequence of the first replacement word and the phoneme sequence of the replaced word in the candidate sequence is greater than a third preset threshold.
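The NER-based replacement above can be sketched as follows. The phoneme matching probability is approximated here by the `difflib.SequenceMatcher` ratio over phoneme lists, which is an assumption; `NER_VOCAB`, `WORD_PHONEMES`, the threshold value, and the function name are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical NER vocabulary with phonemes (entity -> phoneme list).
NER_VOCAB = {
    "Kaldi": ["K", "AA", "L", "D", "IY"],
    "Viterbi": ["V", "IH", "T", "ER", "B", "IY"],
}

# Hypothetical phonemes for the words in the candidate sequence.
WORD_PHONEMES = {
    "caldy": ["K", "AE", "L", "D", "IY"],
    "uses": ["Y", "UW", "Z", "IH", "Z"],
    "decoding": ["D", "IH", "K", "OW", "D", "IH", "NG"],
}


def replace_with_entities(candidate, word_phonemes, ner_vocab, third_threshold=0.75):
    """Swap each word for the best-matching entity above the threshold."""
    result = []
    for word in candidate:
        best, best_score = word, third_threshold
        for entity, entity_phonemes in ner_vocab.items():
            if word not in word_phonemes:
                break  # no pronunciation known, keep the word unchanged
            score = SequenceMatcher(None, word_phonemes[word], entity_phonemes).ratio()
            if score > best_score:
                best, best_score = entity, score
        result.append(best)
    return result


fixed = replace_with_entities(["caldy", "uses", "decoding"], WORD_PHONEMES, NER_VOCAB)
```

Only "caldy" is close enough in sound to an entity ("Kaldi") to clear the threshold; the other words are left untouched.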
Optionally, in a speech recognition method according to an embodiment of the present application, the method further includes:
recognizing the local reference text based on NER to generate an NER vocabulary;
obtaining the phonemes corresponding to the NER vocabulary based on a dictionary and/or grapheme-to-phoneme (G2P) conversion.
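The two steps above can be sketched as follows. No real NER model or G2P system is invoked: `toy_ner` (capitalized tokens) and `toy_g2p` (letter spelling) are deliberately naive, hypothetical stand-ins that only show where such components would plug in, and the dictionary-first, G2P-fallback order is an assumption about how "dictionary and/or G2P" might be combined.

```python
def build_ner_vocab(reference_text, recognize_entities, lexicon, g2p):
    """Map each recognized entity to phonemes: dictionary first, G2P as fallback."""
    vocab = {}
    for entity in recognize_entities(reference_text):
        key = entity.lower()
        vocab[entity] = lexicon[key] if key in lexicon else g2p(entity)
    return vocab


def toy_ner(text):
    """Stand-in recognizer: treats capitalized tokens as entities (not real NER)."""
    entities = []
    for token in text.split():
        word = token.strip(".,;:!?")
        if word[:1].isupper() and word not in entities:
            entities.append(word)
    return entities


def toy_g2p(word):
    """Stand-in G2P: one pseudo-phoneme per letter (not a real G2P model)."""
    return list(word.upper())


lexicon = {"alice": ["AE", "L", "IH", "S"]}
ner_vocab = build_ner_vocab(
    "the speaker Alice presents Kaldi results", toy_ner, lexicon, toy_g2p
)
```

Passing the recognizer and the G2P converter in as callables keeps the sketch independent of any particular NER or G2P library.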
Optionally, in a speech recognition method according to an embodiment of the present application, the local reference text includes: professional information about the activity to which the local reference text belongs, participant information, activity background information, and activity content.
Optionally, in a speech recognition method according to an embodiment of the present application, after the edge-end recognition result is obtained, the method further includes:
saving information related to the local reference text to a server, based on the professional information about the activity to which the local reference text belongs.
In a second aspect, an embodiment of the present application provides a speech recognition method, including:
a device end obtains a cloud recognition result or an edge-end recognition result, where the cloud recognition result includes one target sequence obtained by the cloud by recognizing a target recognition object;
based on a local vocabulary of the device end, the one target sequence is corrected to obtain a device-end recognition result.
Optionally, in a speech recognition method according to an embodiment of the present application, correcting the one target sequence based on the local vocabulary of the device end to obtain the device-end recognition result includes:
replacing, with a second replacement word from the local vocabulary, the word in the target sequence that corresponds to the second replacement word, to obtain the device-end recognition result;
where the phoneme matching probability between the phoneme sequence of the word corresponding to the second replacement word and the phoneme sequence of the second replacement word is greater than a fourth preset threshold.
Optionally, in a speech recognition method according to an embodiment of the present application, the local vocabulary includes a preset list of error-prone words and/or real-time NER words.
Optionally, in a speech recognition method according to an embodiment of the present application, the method further includes: obtaining, in real time and based on NER, the real-time NER words from the display information corresponding to the target recognition object.
In a third aspect, an embodiment of the present application provides a speech recognition apparatus, including a memory, a transceiver, and a processor:
the memory is configured to store a computer program; the transceiver is configured to send and receive data under the control of the processor; and the processor is configured to read the computer program in the memory and perform the following operations:
an edge end obtains a cloud recognition result, where the cloud recognition result includes at least one target sequence obtained by the cloud by recognizing a target recognition object;
based on a local reference text corresponding to the target recognition object, the at least one target sequence is corrected to obtain an edge-end recognition result.
Optionally, in a speech recognition apparatus according to an embodiment of the present application, correcting the at least one target sequence based on the local reference text corresponding to the target recognition object to obtain the edge-end recognition result includes:
performing vocabulary replacement on one candidate sequence among the target sequences, based on the language model corresponding to the target sequence and on the local reference text, where the language model corresponding to the target sequence is trained on the local reference text;
and/or,
performing vocabulary replacement on one candidate sequence among the target sequences, based on named entity recognition (NER).
Optionally, in a speech recognition apparatus according to an embodiment of the present application, the vocabulary matching probability of the candidate sequence is the highest among the vocabulary matching probabilities of all the target sequences;
where, for each target sequence, the vocabulary matching probability is calculated from the target sequence and the language model corresponding to it;
the vocabulary matching probability of a target sequence describes how frequently the words in that target sequence appear in the local reference text.
Optionally, in a speech recognition apparatus according to an embodiment of the present application, performing vocabulary replacement on one candidate sequence among the target sequences, based on the language model corresponding to the target sequence and on the local reference text, includes:
obtaining, based on the language model corresponding to the target sequence, a continuous matching probability for each continuous word combination in the candidate sequence, where the continuous matching probability indicates how frequently the corresponding continuous word combination appears in the local reference text;
if the continuous matching probability of any continuous word combination in the candidate sequence is lower than a first preset threshold, replacing that combination (the first continuous word combination) with a replacement text from the local reference text;
where the phoneme matching degree between the phoneme sequence of the replacement text and the phoneme sequence of the first continuous word combination is greater than a second preset threshold, and the replacement text appears more frequently in the local reference text than the first continuous word combination does.
Optionally, in a speech recognition apparatus according to an embodiment of the present application, the target sequence includes:
a phoneme sequence and/or a word sequence;
correspondingly, the language model corresponding to the phoneme sequence includes a phoneme sequence language model;
the language model corresponding to the word sequence includes a word sequence language model.
Optionally, in a speech recognition apparatus according to an embodiment of the present application, if the target sequence includes a phoneme sequence, performing vocabulary replacement on one candidate sequence among the target sequences, based on the language model corresponding to the target sequence and on the local reference text, includes:
performing vocabulary replacement on one candidate sequence among the phoneme sequences, based on the phoneme sequence language model, to obtain a first phoneme sequence;
performing vocabulary replacement, based on the word sequence language model, on one candidate sequence among the at least one word sequence corresponding to the first phoneme sequence.
Optionally, in a speech recognition apparatus according to an embodiment of the present application, if the target sequence further includes a word sequence, performing vocabulary replacement on one candidate sequence among the target sequences, based on the language model corresponding to the target sequence and on the local reference text, further includes:
if the fourth matching probability of the first word sequence is greater than the fourth matching probability of the second word sequence, determining the first word sequence to be the edge-end recognition result;
if the fourth matching probability of the second word sequence is greater than the fourth matching probability of the first word sequence, determining the second word sequence to be the edge-end recognition result;
where the first word sequence is obtained by performing vocabulary replacement on one candidate sequence among the at least one word sequence corresponding to the first phoneme sequence;
the second word sequence is obtained by performing vocabulary replacement, based on the word sequence language model, on one candidate sequence among the word sequences in the target sequence;
the fourth matching probability of the first word sequence describes how well the first word sequence matches the local reference text;
the fourth matching probability of the second word sequence describes how well the second word sequence matches the local reference text.
Optionally, in a speech recognition apparatus according to an embodiment of the present application, performing vocabulary replacement on one candidate sequence among the target sequences based on named entity recognition (NER) to obtain the edge-end recognition result includes:
replacing, based on an NER vocabulary, a word in one candidate sequence among the target sequences with a first replacement word from the NER vocabulary, to obtain the edge-end recognition result;
where the phoneme matching probability between the phoneme sequence of the first replacement word and the phoneme sequence of the replaced word in the candidate sequence is greater than a third preset threshold.
Optionally, in a speech recognition apparatus according to an embodiment of the present application, the operations further include:
recognizing the local reference text based on NER to generate an NER vocabulary;
obtaining the phonemes corresponding to the NER vocabulary based on a dictionary and/or grapheme-to-phoneme (G2P) conversion.
Optionally, in a speech recognition apparatus according to an embodiment of the present application, the local reference text includes:
professional information about the activity to which the local reference text belongs, participant information, activity background information, and activity content.
Optionally, in a speech recognition apparatus according to an embodiment of the present application, after the edge-end recognition result is obtained, the operations further include:
saving information related to the local reference text to a server, based on the professional information about the activity to which the local reference text belongs.
In a fourth aspect, an embodiment of the present application further provides an apparatus, including a memory, a transceiver, and a processor:
the memory is configured to store a computer program; the transceiver is configured to send and receive data under the control of the processor; and the processor is configured to read the computer program in the memory and perform the following operations:
a device end obtains a cloud recognition result or an edge-end recognition result, where the cloud recognition result includes one target sequence obtained by the cloud by recognizing a target recognition object;
based on a local vocabulary of the device end, the one target sequence is corrected to obtain a device-end recognition result.
Optionally, in a speech recognition apparatus according to an embodiment of the present application, correcting the one target sequence based on the local vocabulary of the device end to obtain the device-end recognition result includes:
replacing, with a second replacement word from the local vocabulary, the word in the target sequence that corresponds to the second replacement word, to obtain the device-end recognition result;
where the phoneme matching probability between the phoneme sequence of the word corresponding to the second replacement word and the phoneme sequence of the second replacement word is greater than a fourth preset threshold.
Optionally, in a speech recognition apparatus according to an embodiment of the present application, the local vocabulary includes a preset list of error-prone words and/or real-time NER words.
Optionally, in a speech recognition apparatus according to an embodiment of the present application, the operations further include: obtaining, in real time and based on NER, the real-time NER words from the display information corresponding to the target recognition object.
In a fifth aspect, an embodiment of the present application further provides a speech recognition apparatus, including:
a first obtaining unit, configured for an edge end to obtain a cloud recognition result, where the cloud recognition result includes at least one target sequence obtained by the cloud by recognizing a target recognition object;
a first correcting unit, configured to correct the at least one target sequence based on a local reference text corresponding to the target recognition object, to obtain an edge-end recognition result.
In a sixth aspect, an embodiment of the present application further provides a speech recognition apparatus, including:
a second obtaining unit, configured for a device end to obtain a cloud recognition result or an edge-end recognition result, where the cloud recognition result includes one target sequence obtained by the cloud by recognizing a target recognition object;
a second correcting unit, configured to correct the one target sequence based on a local vocabulary of the device end, to obtain a device-end recognition result.
In a seventh aspect, an embodiment of the present application further provides a processor-readable storage medium storing a computer program, where the computer program is configured to cause the processor to perform the steps of the method of the first aspect described above.
In the speech recognition method, apparatus, and storage medium provided by the embodiments of the present application, the edge end obtains the cloud's recognition result for the target recognition object and corrects that result based on the local reference text corresponding to the target recognition object. This refines the cloud recognition result and improves speech recognition accuracy.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present application or of the prior art more clearly, the drawings needed for that description are briefly introduced below. The drawings described below show some embodiments of the present application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of speech recognition provided by the prior art;
FIG. 2 is a first schematic flowchart of speech recognition provided by an embodiment of the present application;
FIG. 3 is a second schematic flowchart of speech recognition provided by an embodiment of the present application;
FIG. 4 is a third schematic flowchart of speech recognition provided by an embodiment of the present application;
FIG. 5 is a first schematic structural diagram of a speech recognition apparatus provided by an embodiment of the present application;
FIG. 6 is a second schematic structural diagram of a speech recognition apparatus provided by an embodiment of the present application;
FIG. 7 is a third schematic structural diagram of a speech recognition apparatus provided by an embodiment of the present application;
FIG. 8 is a fourth schematic structural diagram of a speech recognition apparatus provided by an embodiment of the present application.
Detailed Description
The term "and/or" in the embodiments of the present application describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: A alone, both A and B, or B alone. The character "/" generally indicates an "or" relationship between the objects before and after it.
In the embodiments of the present application, the term "a plurality of" means two or more, and other quantifiers are to be understood similarly.
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the protection scope of the present application.
The embodiments of the present application provide a speech recognition method and apparatus for improving the accuracy of speech recognition.
The method and the apparatus are based on the same inventive concept. Because they solve the problem on similar principles, the implementations of the apparatus and of the method may refer to each other, and repeated content is not described again.
The technical solutions provided in the embodiments of the present application are applicable to a variety of systems, especially 5G systems. Applicable systems include, for example, global system of mobile communication (GSM) systems, code division multiple access (CDMA) systems, wideband code division multiple access (WCDMA) systems, general packet radio service (GPRS) systems, long term evolution (LTE) systems, LTE frequency division duplex (FDD) systems, LTE time division duplex (TDD) systems, long term evolution advanced (LTE-A) systems, universal mobile telecommunication system (UMTS) systems, worldwide interoperability for microwave access (WiMAX) systems, and 5G New Radio (NR) systems. Each of these systems includes terminal devices and network devices. A system may also include a core network part, such as the Evolved Packet System (EPS) or the 5G System (5GS).
The embodiments of the present invention are described in detail below with reference to the accompanying drawings.
FIG. 1 is a schematic flowchart of speech recognition provided by the prior art. As shown in FIG. 1, speech recognition is the process of converting speech into text.
Taking the auditory characteristics of the human ear into account, Mel cepstral coefficients and perceptual linear prediction coefficients have become mainstream methods of extracting speech feature vectors. Combined with their first-order and second-order differences and with normalization of the feature vectors, they achieve good results on large-vocabulary continuous speech recognition.
The acoustic model is the underlying model of a speech recognition system and its most critical part. A continuous speech signal is composed of basic speech units, which may be sentences, phrases, words, syllables, sub-syllables or phonemes. Which speech unit is chosen as the modeling unit of the acoustic model depends on the specific application (objective factors such as vocabulary size, the amount of speech data available, and the required performance). In general, the chosen modeling unit should satisfy two conditions: 1) robustness: each model has enough samples to estimate its parameters; 2) consistency: the modeling unit should be stable, with relatively small changes in its acoustic features under different conditions. In continuous speech, because of co-articulation, the realization of a modeling unit can differ considerably between contexts. To improve the accuracy of the model, the influence of context on the modeling unit must be taken into account. In acoustic model research, context-dependent modeling units (such as diphones and triphones) have therefore received increasing attention and have become the mainstream modeling units of current acoustic models.
As speech recognition technology continues to develop, the language model plays an increasingly important role in speech recognition. Because acoustic signals are dynamic, time-varying, transient and random, matching and judging acoustic patterns alone cannot achieve error-free recognition and understanding of speech. Using higher-level linguistic knowledge reduces the ambiguity of pattern matching at the acoustic level and thereby improves recognition accuracy. In addition, a large-vocabulary continuous speech recognition system must check at every moment whether a pronunciation boundary has been reached, so many different characters or words may be recognized from different speech streams. A language model is essential for resolving the ambiguity among these characters or words, because it provides contextual and semantic information between them. Language models are used not only in speech recognition systems but also in research fields such as machine translation and information retrieval. With the development of statistical language processing methods, the statistical language model has become the mainstream technique for language processing in speech recognition.
Search is the process of finding the optimal sentence in the space of possible sentences according to a given optimization criterion. In other words, the available knowledge (acoustic, phonetic, lexical, language-model, and grammatical and semantic knowledge) is used to find the optimal state sequence in the state space, where a state may be a phrase, a word, a modeling unit or an HMM state. The acoustic model, the pronunciation dictionary and the language model are tightly integrated through a finite state transducer (FST), and the search is performed on the FST.
Training the language model of a cloud speech recognition engine requires a large text corpus. This corpus consists mainly of general-domain text, with some text from other domains added, but it cannot cover the expressions of every domain. In domain-specific speech recognition tasks, for example in medicine, architecture or artificial intelligence, the language model's insufficient coverage of these domains causes a marked drop in automatic speech recognition (ASR) performance.
Therefore, for a specific recognition task, adding some text from the relevant domain to optimize the recognition result can significantly improve the recognition accuracy of that task. In addition, linking cloud computing on the Internet, edge computing and device-side computing to build a core computing capability for the Internet of Things (IoT) has become a trend. For the IoT, breakthroughs in edge computing mean that much control (computation) can be carried out on local devices without being handed over to the cloud, with processing completed at the local edge computing layer. This greatly improves processing efficiency and reduces the load on the cloud. Being closer to the user, the edge can also respond faster and meet the user's needs locally. As a key technology of the 5G era, edge computing will become an indispensable piece of infrastructure. Therefore, after the basic ASR recognition result is obtained from the cloud, the ASR recognition accuracy can be further improved at the edge end and the device end. The present application combines the cloud, the edge and the device to improve recognition accuracy.
FIG. 2 is a first schematic flowchart of the speech recognition method provided by an embodiment of the present application. As shown in FIG. 2, the method includes the following steps:
Step 201: the edge end obtains a cloud recognition result, where the cloud recognition result includes at least one target sequence obtained by the cloud by recognizing a target recognition object;
Optionally, the target recognition object includes, but is not limited to, an audio or video file that can be acquired by or sent to the cloud. The cloud recognizes the target audio or video file to obtain a cloud recognition result containing at least one target sequence, and the edge end obtains that cloud recognition result.
For example, the target recognition object may come from a live speech given by one or more speakers at a professional conference; the speech audio may be recognized sentence by sentence or paragraph by paragraph as the target recognition object.
Step 202: based on the local reference text corresponding to the target recognition object, the at least one target sequence is corrected to obtain an edge-end recognition result.
With the cloud-edge-device speech recognition method of this embodiment, the cloud recognition result is corrected, which improves the accuracy of speech recognition.
Optionally, the local reference text includes: domain-specific information on the activity to which the local reference text belongs, participant information, background information on the activity, and the activity content.
Optionally, the application scenario of the recognition object, that is, the activity to which the local reference text belongs, may be a conference, a speech, a lecture or a similar activity.
Optionally, the edge end may obtain the cloud recognition result, which includes at least one target sequence obtained by the cloud by recognizing the target recognition object, and may then correct the at least one target sequence based on the local reference text to obtain the edge-end recognition result.
Optionally, the present application combines the cloud and the edge end to improve the accuracy of speech recognition, and the local reference text does not need to be uploaded to the cloud. At the edge end, local reference text such as conference speech scripts and screen-cast slides (PPT) is used to optimize the ASR recognition result returned from the cloud. This protects the privacy of the conference materials: they are not transmitted to the cloud but are processed only at the edge end and the device end. Applying stricter network security controls inside the organization can further improve the security of the activity.
Optionally, the cloud uses a general-purpose ASR that does not need to be modified for specific application scenarios.
Optionally, at the edge end, material related to the current recognition task, that is, the local reference text, is used to optimize the recognition result.
Optionally, the embodiments of the present application can improve the recognition accuracy of speech recognition in a specific scenario. The specific scenario may be any scenario in which reference material is available and there is speech to be recognized, so that the speech can be recognized with the help of the reference material; the embodiments of the present application do not limit this.
In the speech recognition method provided by the embodiments of the present application, the edge end obtains the cloud's recognition result for the target recognition object and corrects that cloud recognition result based on the local reference text corresponding to the target recognition object. This optimizes the cloud recognition result and improves the accuracy of speech recognition.
Optionally, correcting the at least one target sequence based on the local reference text corresponding to the target recognition object to obtain the edge-end recognition result includes:
performing vocabulary replacement on one candidate sequence among the target sequences based on the language model corresponding to the target sequence and on the local reference text, where the language model corresponding to the target sequence is obtained by training on the local reference text;
and/or,
performing vocabulary replacement on one candidate sequence among the target sequences based on named entity recognition (NER).
Optionally, vocabulary replacement may be performed on one candidate sequence among the target sequences based on the language model corresponding to the target sequence and on the local reference text;
Optionally, the language model corresponding to the target sequence is obtained by training on the local reference text.
Optionally, a language model yields a probability distribution over word sequences: for a given sequence of length m, it produces a probability P(w_1,w_2,…,w_m) for the whole sequence. In other words, the aim is to find a probability distribution that can express the probability of occurrence of any sentence or sequence.
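As a worked form of the probability above: the first equality below is the chain rule, which holds for any sequence distribution. The n-gram truncation after it is a common approximation and is an assumption here, since the application does not state which family of language model is used.

```latex
P(w_1, w_2, \ldots, w_m)
  = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1})
  \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})
```

With n = 2 each factor depends only on the preceding word, so the model can be estimated from pair counts in a small local reference text.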
Optionally, the local reference text may first be used as training samples to obtain the corresponding language model.
Optionally, text content may be used for training directly, yielding a language model for text sequences (that is, word sequences), namely the word sequence language model.
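A minimal sketch of training a word sequence language model on the local reference text. It assumes a bigram model with add-one smoothing and text that is already split into words; the application does not specify the model type or the tokenizer, and the function names `train_bigram_lm` and `sequence_log_prob` are hypothetical.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams over tokenized reference sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def sequence_log_prob(tokens, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a token sequence."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + list(tokens) + ["</s>"]
    log_p = 0.0
    for prev, cur in zip(padded, padded[1:]):
        log_p += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return log_p

# Toy reference text (assumed to be pre-tokenized).
reference = [["张三", "介绍", "语音", "识别"], ["张三", "总结", "会议"]]
unigrams, bigrams = train_bigram_lm(reference)

in_domain = sequence_log_prob(["张三", "介绍", "语音", "识别"], unigrams, bigrams)
off_domain = sequence_log_prob(["章三", "介绍", "语音", "识别"], unigrams, bigrams)
print(in_domain > off_domain)  # the sequence matching the reference text scores higher
```

Smoothing keeps unseen words such as "章三" from producing a zero probability, so every cloud hypothesis can still be ranked.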
Optionally, for speech content, the text of the reference text may be converted into phonemes before training, yielding a language model for phoneme sequences, namely the phoneme sequence language model.
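The conversion step can be sketched as follows, under stated assumptions: `LEXICON` is a hypothetical hand-written pronunciation table, pinyin syllables without tones stand in for phonemes (following the application's own description of a phoneme as the pinyin of a Chinese character), and bigram counts stand in for the trained phoneme sequence language model.

```python
from collections import Counter

# Hypothetical pronunciation lexicon: word -> pinyin syllables (tones omitted).
LEXICON = {
    "张三": ["zhang", "san"],
    "介绍": ["jie", "shao"],
    "山峰": ["shan", "feng"],
}

def to_phonemes(tokens, lexicon=LEXICON):
    """Convert a tokenized sentence into a flat phoneme (syllable) sequence."""
    phonemes = []
    for token in tokens:
        if token not in lexicon:
            raise KeyError(f"no pronunciation for {token!r}")
        phonemes.extend(lexicon[token])
    return phonemes

def phoneme_bigram_counts(sentences, lexicon=LEXICON):
    """Collect phoneme bigram counts, the raw statistics of a phoneme sequence LM."""
    counts = Counter()
    for tokens in sentences:
        phonemes = to_phonemes(tokens, lexicon)
        counts.update(zip(phonemes, phonemes[1:]))
    return counts

reference = [["张三", "介绍", "山峰"]]
phonemes = to_phonemes(reference[0])
counts = phoneme_bigram_counts(reference)
print(phonemes)
print(counts[("zhang", "san")])
```

In practice the lexicon would come from a pronunciation dictionary or a grapheme-to-phoneme tool; which one is used is outside what the application describes.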
Optionally, the word sequence language model may be used to compute a probability for a word sequence among the target sequences in order to determine how well it matches the reference text, and the words or word combinations with a low matching degree are replaced. They may be replaced with characters, words or phrases from the local reference text that sound similar but have a higher probability of occurrence.
Optionally, the phoneme sequence language model may be used to compute a probability for a phoneme sequence among the target sequences in order to determine how well it matches the reference text (when the local reference text is known, its phoneme sequence is known and unique), and the phonemes with a low matching degree are replaced. They may be replaced with phonemes from the local reference text that sound similar but have a higher probability of occurrence.
Optionally, vocabulary replacement may be performed on one candidate sequence among the target sequences based on named entity recognition (NER).
Optionally, named entity recognition (NER), also called "proper name recognition", refers to identifying entities with specific meaning in text, mainly names of persons, places and organizations and other proper nouns. It usually consists of two parts: (1) identifying entity boundaries; (2) determining the entity category (person name, place name, organization name or other).
Optionally, to address the inaccurate recognition of proper nouns and of names of institutions, organizations or persons in speech recognition, vocabulary replacement may be performed on one candidate sequence among the target sequences based on NER. Specifically, inaccurately recognized names of persons, places and organizations or other proper nouns in the sequence may be replaced.
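The two parts of NER named above (boundary identification and category assignment) can be illustrated with a gazetteer lookup. This is a deliberately simple stand-in, not a trained NER model and not the application's stated method; `GAZETTEER` and `tag_entities` are hypothetical names, and the entries are invented for the example.

```python
# Hypothetical gazetteer built from the local reference text:
# tuple of entity tokens -> entity category.
GAZETTEER = {
    ("张", "三"): "PERSON",
    ("北京",): "LOCATION",
    ("语音", "实验室"): "ORGANIZATION",
}

def tag_entities(tokens, gazetteer=GAZETTEER):
    """Return (start, end, category) spans found by greedy longest-match lookup."""
    if not gazetteer:
        return []
    max_len = max(len(key) for key in gazetteer)
    spans, i = [], 0
    while i < len(tokens):
        # Try the longest possible span first so multi-token entities win.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + length])
            if key in gazetteer:
                spans.append((i, i + length, gazetteer[key]))
                i += length
                break
        else:
            i += 1  # no entity starts at this position
    return spans

tokens = ["张", "三", "在", "北京", "语音", "实验室", "发言"]
spans = tag_entities(tokens)
print(spans)
```

Once entity spans are known, a misrecognized name inside a span can be compared against the names that occur in the reference text and replaced by a similar-sounding one.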
Optionally, a phoneme may be understood as the pinyin of a Chinese character, or as the pronounced letters of a word; for example, the phonemes corresponding to the word "name" are "nam".
Optionally, the vocabulary matching probability of the candidate sequence is the highest among the vocabulary matching probabilities of all the target sequences;
where, for each target sequence, the vocabulary matching probability is computed from the language model corresponding to that target sequence and from the target sequence itself;
the vocabulary matching probability of a target sequence describes how frequently the vocabulary in the target sequence appears in the local reference text.
Optionally, the candidate sequence may be the sequence whose vocabulary matching probability is the highest among those of all the target sequences; that is, the vocabulary contained in the candidate sequence appears in the local reference text more frequently than that of any other target sequence.
Optionally, when determining the candidate sequence among the target sequences, the corresponding language model (LM) may be used to compute a probability for each recognition result obtained from the cloud, and the recognition result with the highest probability is taken as the candidate sequence. The probability here is the vocabulary matching probability. For a word sequence, it may be the frequency with which each of its words appears in the local reference text. For a phoneme sequence, it may be the frequency with which each of its phonemes appears in the phoneme sequence corresponding to the local reference text, or the frequency with which the words corresponding to its phonemes appear in the local reference text.
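Selecting the candidate sequence can be sketched as below. The score used here, the mean relative frequency of a hypothesis's words in the reference text, is one assumed reading of "vocabulary matching probability"; the application leaves the exact formula open, and `vocabulary_match_score` and `pick_candidate` are hypothetical names.

```python
from collections import Counter

def vocabulary_match_score(tokens, ref_counts, ref_total):
    """Mean relative frequency of the hypothesis words in the reference text."""
    if not tokens:
        return 0.0
    return sum(ref_counts[t] / ref_total for t in tokens) / len(tokens)

def pick_candidate(n_best, reference_tokens):
    """Return the n-best hypothesis that best matches the reference vocabulary."""
    if not reference_tokens:
        raise ValueError("reference text is empty")
    ref_counts = Counter(reference_tokens)
    ref_total = len(reference_tokens)
    return max(
        n_best,
        key=lambda seq: vocabulary_match_score(seq, ref_counts, ref_total),
    )

# Toy local reference text and three cloud hypotheses for the same audio.
reference_tokens = ["山峰", "的", "高度", "山峰", "的", "位置"]
n_best = [
    ["山风", "的", "高度"],
    ["山峰", "的", "高度"],
    ["扇风", "的", "高度"],
]
candidate = pick_candidate(n_best, reference_tokens)
print(candidate)
```

The same function works on phoneme sequences if `reference_tokens` holds the phonemes of the reference text instead of its words.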
Optionally, a group of phonemes may correspond to the single most likely word or to several words; for example, the group of phonemes "shan feng" may correspond to "山峰" (mountain peak) or to "山风" (mountain wind).
Optionally, performing vocabulary replacement on one candidate sequence among the target sequences based on the language model corresponding to the target sequence and on the local reference text includes:
obtaining, based on the language model corresponding to the target sequence, the continuous matching probability of each consecutive word combination in the candidate sequence, where a continuous matching probability indicates how frequently the corresponding consecutive word combination appears in the local reference text;
if the continuous matching probability of any consecutive word combination in the candidate sequence is lower than a first preset threshold, replacing that consecutive word combination (the first consecutive word combination) with a replacement text taken from the local reference text;
where the phoneme matching degree between the phoneme sequence of the replacement text and the phoneme sequence of the first consecutive word combination is greater than a second preset threshold, and the replacement text appears in the local reference text more frequently than the first consecutive word combination.
Optionally, the continuous matching probability of each consecutive word combination in the candidate sequence may be obtained based on the language model corresponding to the target sequence;
Optionally, a continuous matching probability indicates how frequently the corresponding consecutive word combination appears in the local reference text; the higher the continuous matching probability, the more frequently that combination appears in the local reference text.
Optionally, the target sequence may include a phoneme sequence and/or a word sequence;
Taking the case where the target sequence includes a phoneme sequence as an example, the corresponding language model may be the phoneme sequence language model, so the edge end can use the phoneme sequence language model to compute the continuous matching probability of one candidate sequence among the phoneme sequences.
Taking the case where the target sequence includes a word sequence as an example, the corresponding language model may be the word sequence language model, so the edge end can use the word sequence language model to compute the continuous matching probability of one candidate sequence among the word sequences.
Optionally, the vocabulary contained in the candidate sequence appears in the local reference text more frequently than that of any other target sequence.
Optionally, after the continuous matching probabilities of all or some of the consecutive word combinations in the candidate sequence have been computed, some of them may be found to be lower than the first preset threshold. Those consecutive word combinations can then be regarded as appearing in the local reference text very rarely or not at all. If the local reference text contains a word or run of words that sounds similar and appears there more frequently, that similar-sounding, more frequent word or phrase is taken as the replacement text and replaces the consecutive word combination whose continuous matching probability is below the first preset threshold.
Optionally, the phoneme matching degree between the phoneme sequence of the replacement text and the phoneme sequence of the first consecutive word combination is greater than the second preset threshold (that is, the two sound similar), and the replacement text appears in the local reference text more frequently than the first consecutive word combination. For example, if the replacement text is "张三" and the first consecutive word combination is "章三", the two are pronounced alike (both "zhang san"), but "张三" appears more frequently in the local reference text, so the first consecutive word combination "章三" in the candidate sequence may be replaced with "张三".
Optionally, the first preset threshold may be a preset positive number less than 1, and the second preset threshold may also be a positive number less than 1. A phoneme matching degree greater than the second preset threshold between the phoneme sequence of the replacement text and that of the first consecutive word combination means that the replacement text, found in the reference text, is pronounced similarly to the first consecutive word combination. Replacing consecutive word combinations in the candidate sequence with replacement text that sounds similar and is more probable in the reference text improves the accuracy of speech recognition.
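The two-threshold replacement rule can be sketched as follows. Several simplifications are assumptions rather than part of the application: each "consecutive word combination" is a single word, its continuous matching probability is its relative frequency in the reference text, the phoneme matching degree is `difflib.SequenceMatcher.ratio()` over pinyin syllables, and `LEXICON` and the threshold values are invented for the example.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical pronunciation lexicon: word -> pinyin syllables (tones omitted).
LEXICON = {
    "张三": ["zhang", "san"],
    "章三": ["zhang", "san"],
    "介绍": ["jie", "shao"],
    "方案": ["fang", "an"],
}

def phoneme_match(a, b, lexicon=LEXICON):
    """Similarity in [0, 1] between the syllable sequences of two words."""
    return SequenceMatcher(None, lexicon[a], lexicon[b]).ratio()

def correct(candidate, reference_tokens, first_threshold=0.05, second_threshold=0.8):
    """Replace rare words with similar-sounding, more frequent reference words."""
    counts = Counter(reference_tokens)
    total = len(reference_tokens)
    if total == 0:
        return list(candidate)
    corrected = []
    for word in candidate:
        freq = counts[word] / total
        if freq >= first_threshold or word not in LEXICON:
            corrected.append(word)  # frequent enough, or pronunciation unknown
            continue
        # Replacement text must sound similar AND be more frequent in the reference.
        options = [
            ref for ref in counts
            if ref in LEXICON
            and counts[ref] / total > freq
            and phoneme_match(word, ref) > second_threshold
        ]
        corrected.append(max(options, key=lambda w: counts[w]) if options else word)
    return corrected

reference_tokens = ["张三", "介绍", "方案", "张三"]
result = correct(["章三", "介绍", "方案"], reference_tokens)
print(result)
```

Extending the loop to multi-word combinations would mean sliding a window over the candidate and scoring each window with the language model instead of a raw frequency.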
Optionally, the target sequence includes:
a phoneme sequence, and/or, a word sequence;
Optionally, a phoneme may be understood as the pinyin of a Chinese character, or as the pronounced letters of a word; for example, the phonemes corresponding to the word "name" are "nam".
Optionally, a group of phonemes may correspond to the single most likely word or to several words; for example, the group of phonemes "shan feng" may correspond to "山峰" (mountain peak) or to "山风" (mountain wind).
Correspondingly, the language model corresponding to the phoneme sequence includes the phoneme sequence language model;
the language model corresponding to the word sequence includes the word sequence language model.
Optionally, the target sequence may be a phoneme sequence, a word sequence, or both a phoneme sequence and a word sequence.
Correspondingly, the language model corresponding to the phoneme sequence includes the phoneme sequence language model, and the language model corresponding to the word sequence includes the word sequence language model.
Optionally, the phoneme sequence language model and the word sequence language model are obtained by training on the local reference text.
Optionally, the phoneme sequence language model or the word sequence language model yields a probability distribution over phoneme sequences or word sequences: for a given sequence of length m, it produces a probability P(w_1,w_2,…,w_m) for the whole sequence. In other words, the aim is to find a probability distribution that can express the probability of occurrence of any sentence or sequence.
Specifically, the recognition result obtained from the cloud may be an n-best (top N) phoneme sequence recognition result, an n-best (top N) word sequence recognition result, or both. A phoneme sequence is a sequence of the smallest units of speech, a word sequence may be a sequence of characters or words, and N may be a natural number greater than 0.
可选地,若所述目标序列包括词序列,所述基于所述目标序列对应的语言模型和所述本地参考文本,对所述目标序列中的一个备选序列进行词汇替换,包括:Optionally, if the target sequence includes a word sequence, performing word replacement on an alternative sequence in the target sequence based on the language model corresponding to the target sequence and the local reference text, including:
基于所述词序列语言模型,对所述词序列中的一个备选序列进行词汇替换,获得第一词序列。Based on the word sequence language model, vocabulary replacement is performed on a candidate sequence in the word sequence to obtain a first word sequence.
可选地,所述词序列包括N个词序列,如语音发音为“shanfeng”包含“山峰、山风、扇风”三个词序列,将上述三个词序列通过词序列语言模型进行概率计算,若“山峰”在本地参考文本中的出现概率最高,则得到“山峰”为备选序列。Optionally, the word sequence includes N word sequences, such as the phonetic pronunciation of "shanfeng" contains three word sequences of "mountain peak, mountain wind, fan wind", and the above-mentioned three word sequences are subjected to probability calculation by the word sequence language model. , if the occurrence probability of "mountain peak" in the local reference text is the highest, then "mountain peak" is obtained as the candidate sequence.
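The selection of a candidate sequence from the n-best list can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes a unigram model with add-one smoothing trained on a tokenized local reference text, and the names `train_unigram`, `log_prob`, and `pick_candidate` are hypothetical.

```python
import math
from collections import Counter

def train_unigram(reference_tokens):
    """Count token frequencies in the (already tokenized) local reference text."""
    counts = Counter(reference_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen tokens
    return counts, total, vocab

def log_prob(sequence, model):
    """Add-one smoothed log-probability of a token sequence (assumed scoring rule)."""
    counts, total, vocab = model
    return sum(math.log((counts[t] + 1) / (total + vocab)) for t in sequence)

def pick_candidate(n_best, model):
    """Return the n-best entry with the highest probability under the model."""
    return max(n_best, key=lambda seq: log_prob(seq, model))

# Toy reference text in which "山峰" occurs most often.
reference = ["山峰", "很", "高", "山峰", "上", "有", "山风"]
model = train_unigram(reference)
n_best = [["山峰"], ["山风"], ["扇风"]]
best = pick_candidate(n_best, model)
print(best)
```

A real system would likely use a higher-order n-gram or neural language model; the unigram model is chosen here only to keep the sketch short.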
Optionally, if the target sequence includes a phoneme sequence, performing word replacement on a candidate sequence among the target sequences, based on the language model corresponding to the target sequence and the local reference text, includes:
based on the phoneme sequence language model, performing word replacement on a candidate sequence among the phoneme sequences to obtain a first phoneme sequence;
based on the word sequence language model, performing word replacement on a candidate sequence among the at least one word sequence corresponding to the first phoneme sequence.
Optionally, based on the phoneme sequence language model, word replacement is performed on a candidate sequence among the phoneme sequences to obtain the first phoneme sequence.
For example, the pronunciation in the target recognition result is "zhangsan", and the phoneme sequences obtained from the cloud may be "zhangsan", "zhangshan", and "zhangsang". The probabilities of these phoneme sequences are computed with the phoneme sequence language model; if the phonemes "zhang" and "san" each have the highest probability in the local reference text, the phoneme sequence "zhangsan" is selected as the candidate sequence. If the similar-sounding "zhuangsan" occurs in the local reference text with a higher probability than the candidate sequence "zhangsan", then "zhuangsan" is taken as the first phoneme sequence.
As another example, the pronunciation in the target recognition result is "wo de ming zi shi zhangsan", and the phoneme sequences obtained from the cloud may be "wo de ming zi shi zhangsan", "wo de ming zi shi zhangshan", and "wo de ming zi shi zhangsang". The probabilities of these phoneme sequences are computed with the phoneme sequence language model; if "wo de ming zi shi zhangsan" has the highest probability in the local reference text, it is selected as the candidate sequence. If the similar-sounding "zhuangsan" occurs in the local reference text with a higher probability than the "zhangsan" in the candidate sequence, then "zhangsan" is replaced with "zhuangsan", yielding the first phoneme sequence "wo de ming zi shi zhuangsan".
Optionally, based on the word sequence language model, word replacement is performed on a candidate sequence among the at least one word sequence corresponding to the first phoneme sequence. The first phoneme sequence may correspond to multiple word sequences; for example, the phoneme sequence "wo de ming zi shi zhangsan" may correspond to "我的名字是张三", "我的名字是章三", "我的名字是张叁", and other word sequences, all of which read "my name is Zhang San" with the name written in different characters. If "我的名字是张三" occurs most frequently in the local reference text, it is selected as the candidate sequence among the at least one word sequence corresponding to the first phoneme sequence.
Optionally, if the target sequence includes a phoneme sequence and a word sequence, performing word replacement on a candidate sequence among the target sequences, based on the language model corresponding to the target sequence and the local reference text, further includes:
if the fourth matching probability corresponding to the first word sequence is greater than the fourth matching probability corresponding to the second word sequence, determining the first word sequence as the edge-side recognition result;
if the fourth matching probability corresponding to the second word sequence is greater than the fourth matching probability corresponding to the first word sequence, determining the second word sequence as the edge-side recognition result;
where the first word sequence is obtained by performing word replacement on a candidate sequence among the at least one word sequence corresponding to the first phoneme sequence;
Optionally, the first phoneme sequence may correspond to multiple word sequences, one of which is selected as the candidate sequence; the word matching probability of the candidate sequence is the highest among the word matching probabilities of all the target sequences. For example, the words corresponding to the phoneme group "shan feng" may be "山峰" (mountain peak) or "山风" (mountain wind); if "山峰" has the highest occurrence probability in the local reference text, "山峰" is selected as the candidate sequence for the phonemes "shan feng".
Optionally, if the target recognition object is pronounced "wo de ming zi shi zhang san", the word sequences include "我的名字是张三", "我得名字是张叁", and "我的名字是璋三". Among these three word sequences, "我的名字是张三" has the highest occurrence probability in the local reference text and is selected as the candidate sequence. If the word "张三" (Zhang San), which follows "是" ("is") in the candidate sequence, has a low occurrence probability as computed by the language model, a character or word in the reference text that sounds similar and has a higher occurrence probability may be chosen to replace it. For example, if "张山" (Zhang Shan) appears many times in the reference text, "张三" is replaced with "张山", yielding the first word sequence "我的名字是张山" ("my name is Zhang Shan").
the second word sequence is obtained by performing word replacement, based on the word sequence language model, on a candidate sequence among the word sequences in the target sequence;
the fourth matching probability corresponding to the first word sequence describes the degree to which the first word sequence matches the local reference text;
the fourth matching probability corresponding to the second word sequence describes the degree to which the second word sequence matches the local reference text.
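The comparison between the first and second word sequences can be sketched as below. The source does not define how the fourth matching probability is computed, so `match_score` (the fraction of words that also occur in the local reference text) is an assumed stand-in, and both function names are hypothetical.

```python
def match_score(words, reference_vocab):
    """Assumed metric: fraction of words that also occur in the local reference text."""
    if not words:
        return 0.0
    return sum(w in reference_vocab for w in words) / len(words)

def choose_edge_result(first_seq, second_seq, reference_vocab):
    """Keep whichever word sequence matches the local reference text better."""
    first_score = match_score(first_seq, reference_vocab)
    second_score = match_score(second_seq, reference_vocab)
    # The source does not say what happens on a tie; this sketch keeps the first sequence.
    return first_seq if first_score >= second_score else second_seq

reference_vocab = {"我", "的", "名字", "是", "张山"}
first_seq = ["我", "的", "名字", "是", "张山"]   # from the phoneme-sequence path
second_seq = ["我", "的", "名字", "是", "张三"]  # from the word-sequence path
result = choose_edge_result(first_seq, second_seq, reference_vocab)
print(result)
```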
Optionally, performing word replacement on a candidate sequence among the target sequences based on named entity recognition (NER) to obtain the edge-side recognition result includes:
based on the NER vocabulary, replacing a word in a candidate sequence among the target sequences with a first replacement word from the NER vocabulary to obtain the edge-side recognition result;
where the phoneme matching probability between the phoneme sequence corresponding to the replacement word and the phoneme sequence of the word in the candidate sequence is greater than a third preset threshold.
Optionally, an NER vocabulary is generated from the local reference text, and the edge side replaces a word in a candidate sequence among the target sequences with a first replacement word from the NER vocabulary to obtain the edge-side recognition result. The first replacement word is a word whose phoneme sequence has a phoneme matching probability with the phoneme sequence of the word in the candidate sequence that is greater than the third preset threshold; that is, the first replacement word sounds similar to the corresponding word in the candidate sequence. The third preset threshold is a positive number less than 1.
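A minimal sketch of threshold-based NER replacement follows. The source does not specify how the phoneme matching probability is computed; here it is approximated with `difflib.SequenceMatcher.ratio()` over phoneme lists, which is an assumption. The function names, the phoneme inventory, and the threshold value 0.7 are likewise hypothetical.

```python
from difflib import SequenceMatcher

def phoneme_similarity(a, b):
    """Assumed stand-in for the phoneme matching probability, in the range [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def replace_with_ner(words, pronounce, ner_vocab, threshold=0.7):
    """Swap each word for the best-matching NER entry whose similarity exceeds the threshold.

    words:     candidate word sequence
    pronounce: word -> phoneme list for words in the candidate sequence
    ner_vocab: NER entry -> phoneme list
    """
    result = []
    for word in words:
        best, best_score = word, threshold
        phones = pronounce.get(word)
        if phones is not None and word not in ner_vocab:
            for entity, entity_phones in ner_vocab.items():
                score = phoneme_similarity(phones, entity_phones)
                if score > best_score:  # strictly greater than the threshold
                    best, best_score = entity, score
        result.append(best)
    return result

ner_vocab = {"张山": ["zh", "ang", "sh", "an"]}
pronounce = {"张三": ["zh", "ang", "s", "an"], "是": ["sh", "i"]}
fixed = replace_with_ner(["是", "张三"], pronounce, ner_vocab)
print(fixed)
```

In this toy run, "张三" shares three of four phonemes with the NER entry "张山" (similarity 0.75), so it is replaced, while "是" falls well below the threshold and is kept.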
Optionally, the method further includes:
recognizing the local reference text based on NER to generate an NER vocabulary;
obtaining the phonemes corresponding to the NER vocabulary based on a dictionary and/or grapheme-to-phoneme (G2P) technology.
Optionally, based on the NER vocabulary, a word in a candidate sequence among the target sequences is replaced with a first replacement word from the NER vocabulary to obtain the edge-side recognition result. For example, if a word in a candidate sequence is "open ai" and, in the NER vocabulary, the word "OpenAI" corresponding to the pronunciation of "open ai" has a higher matching probability, the word "open ai" in the candidate sequence is replaced with "OpenAI", where "OpenAI" is an artificial intelligence non-profit organization jointly founded by a number of Silicon Valley magnates.
Specifically, the dictionary is requested from the cloud and downloaded to the edge side.
Named entity recognition (NER) technology is used to recognize person names, place names, organization names, proper nouns, and the like in the text content, generating the NER vocabulary.
Optionally, the local reference text includes:
professional-domain information of the activity to which the local reference text belongs, participant information, activity background information, and activity content.
Optionally, the local reference text may include the meeting speech script, the PPT slides cast to the screen during the meeting, the names, positions, and backgrounds of the participants, background information about the meeting, and the like.
Optionally, in this application, the cloud and the edge side are combined to improve the accuracy of speech recognition, and the local reference text need not be uploaded to the cloud. The ASR recognition result returned from the cloud is optimized at the edge side using local reference text such as meeting speech scripts and screen-cast PPT slides, which protects the privacy of the meeting materials. Meeting-related materials do not need to be sent to the cloud; they are processed only at the edge side and the device side. Applying a higher level of network security control within the organization can improve the security of the activity.
Optionally, the participants' names, positions, titles, and the like are added to the NER vocabulary to improve the accuracy of NER.
Specifically, if the participants' names, positions, or similar information contain rare characters, the cloud's recognition of those rare characters may be inaccurate. This application adds the participants' names, positions, titles, and the like to the NER vocabulary, so that the cloud's recognition result for a rare character can be replaced with the rare character from the NER vocabulary, yielding a more accurate recognition result.
Optionally, if the local reference text includes materials such as meeting speech scripts and screen-cast PPT slides, text recognition technology is used to automatically recognize the text content in the images. If the local reference text is in electronic form, text recognition is not needed.
Optionally, the local reference text further includes meeting-related background information such as the participants' names and position descriptions.
At the edge side, the speech recognition result returned by the cloud is optimized using materials such as meeting speech scripts and screen-cast PPT slides.
Optionally, the pronunciation corresponding to a word in the NER vocabulary may be obtained from the downloaded dictionary, by grapheme-to-phoneme (G2P) technology, or by G2P technology based on the downloaded dictionary.
Optionally, when a word in the NER vocabulary has no corresponding pronunciation in the downloaded dictionary, G2P technology may be used, with the downloaded dictionary as the reference model, to convert the word and obtain its pronunciation.
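The dictionary-first lookup with a G2P fallback can be sketched as follows. The function names and the toy lexicon are hypothetical, and `toy_g2p` is only a placeholder that spells a word letter by letter; a real system would use a trained G2P model, which the source does not describe.

```python
def toy_g2p(word):
    """Placeholder G2P: one pseudo-phoneme per letter (illustrative assumption only)."""
    return list(word.lower())

def pronunciations_for(ner_words, lexicon, g2p=toy_g2p):
    """Look each NER word up in the downloaded dictionary; fall back to G2P when it is absent."""
    return {word: lexicon[word] if word in lexicon else g2p(word) for word in ner_words}

# Toy downloaded dictionary containing one of the two NER words.
lexicon = {"cloud": ["k", "l", "aw", "d"]}
pron = pronunciations_for(["cloud", "OpenAI"], lexicon)
print(pron)
```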
By using cloud-edge-device collaboration as in this embodiment, the computing resources of the cloud, the edge side, and the device side are fully utilized, so that fewer tasks are processed in the cloud, cloud network latency is reduced, and the response speed of speech recognition is improved.
Optionally, after the edge-side recognition result is obtained, the method further includes:
saving the related information of the local reference text to a server based on the professional-domain information of the activity to which the local reference text belongs.
Specifically, the related information of the local reference text is saved to the server, classified by the subject matter of the activity to which the reference text belongs. For example, if the subject of a meeting is artificial intelligence, artificial intelligence can serve as a category, and the related information of the local reference text of later meetings on artificial intelligence is all saved under that category on the server. In this way, the organization can keep accumulating relevant text materials and keep improving recognition accuracy in the field.
With the speech recognition method provided by the embodiments of this application, the edge side optimizes the recognition result, so meeting-related materials do not need to be sent to the cloud and are processed only at the edge side, which protects the privacy of the meeting materials; applying a higher level of network security control within the organization can improve meeting security. The application scenarios of this application are not limited to meetings and may also include speeches, lectures, and other activities. The cloud uses general-purpose speech recognition technology and does not need to be changed for specific application scenarios.
In the speech recognition method provided by the embodiments of this application, the edge side obtains the cloud's recognition result for the target recognition object and corrects it based on the local reference text corresponding to the target recognition object, thereby optimizing the cloud's recognition result and improving the accuracy of speech recognition.
FIG. 3 is the second schematic flowchart of the speech recognition method provided by the embodiments of this application. As shown in FIG. 3, the method includes the following steps:
Step 301: the device side obtains a cloud recognition result or an edge-side recognition result, where the cloud recognition result includes one target sequence obtained by the cloud recognizing the target recognition object;
Optionally, the edge-side recognition result is the recognition result obtained after the edge side performs word replacement on, that is, optimizes, the cloud recognition result.
Step 302: based on the local vocabulary of the device side, the one target sequence is corrected to obtain a device-side recognition result.
Optionally, if the device side obtains the edge-side recognition result, the device side may further optimize the speech recognition result returned by the edge side.
In the speech recognition method provided by the embodiments of this application, the edge side obtains the cloud's recognition result for the target recognition object and corrects it based on the local reference text corresponding to the target recognition object, thereby optimizing the cloud's recognition result and improving the accuracy of speech recognition.
Optionally, correcting the one target sequence based on the local vocabulary of the device side to obtain the device-side recognition result includes:
based on a second replacement word in the local vocabulary, replacing the word in the target sequence that corresponds to the second replacement word, to obtain the device-side recognition result;
Optionally, the second replacement word is a word in the local vocabulary that sounds similar to a word in the target sequence and has a higher occurrence probability.
For example, the target sequence is "我的名字是张叁" ("my name is Zhang San"), and the words in the local vocabulary that sound similar to "张叁" are "张三" and "章三". If "张三" has a higher occurrence probability than the word "张叁" in the target sequence, "张三" is the second replacement word in the local vocabulary.
Here, the phoneme matching probability between the phoneme sequence of the word corresponding to the second replacement word and the phoneme sequence of the second replacement word is greater than a fourth preset threshold.
Specifically, the fourth preset threshold is a positive number less than 1. A phoneme matching probability greater than the fourth preset threshold indicates that the word in the local vocabulary sounds similar to the corresponding word in the target sequence.
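The device-side correction combines two conditions: similar pronunciation (above the fourth preset threshold) and a higher occurrence probability. A minimal sketch under stated assumptions: phoneme similarity is again approximated with `difflib.SequenceMatcher.ratio()`, occurrence probability is represented by a hypothetical `freq` table of counts, and the function name and threshold value are illustrative.

```python
from difflib import SequenceMatcher

def device_side_fix(words, phones_of, local_vocab, freq, threshold=0.7):
    """Replace a word with a similar-sounding, more frequent entry of the local vocabulary."""
    fixed = []
    for word in words:
        choice = word
        for cand in local_vocab:
            # Skip identical entries and words without a known pronunciation.
            if cand == word or word not in phones_of or cand not in phones_of:
                continue
            similarity = SequenceMatcher(None, phones_of[word], phones_of[cand]).ratio()
            if similarity > threshold and freq.get(cand, 0) > freq.get(choice, 0):
                choice = cand
        fixed.append(choice)
    return fixed

name_phones = ["zh", "ang", "s", "an"]
phones_of = {"张叁": name_phones, "张三": name_phones, "章三": name_phones}
freq = {"张三": 5, "章三": 1}  # hypothetical occurrence counts
target = ["我", "的", "名字", "是", "张叁"]
device_result = device_side_fix(target, phones_of, ["张三", "章三"], freq)
print(device_result)
```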
Optionally, the vocabulary includes a preset error-prone word list and/or real-time NER words.
The preset error-prone word list includes words that are easily misrecognized and words that appear frequently in the current meeting but are otherwise uncommon.
Optionally, the method further includes: obtaining, in real time and based on NER technology, the real-time NER words in the display information corresponding to the target recognition object.
Optionally, the real-time NER words in the display information corresponding to the target recognition object are obtained in real time.
Specifically, the information on the screen is captured in real time based on NER technology to obtain the NER words currently on the screen. However, this application is not limited to information on a screen; the information may also come from physical text captured in real time.
Optionally, syntactic analysis technology is used to analyze the cloud or edge-side recognition result obtained by the device side, and words such as subjects and objects in it are replaced with similar-sounding words from the vocabulary, further improving recognition accuracy.
With the speech recognition method provided by the embodiments of this application, the device side optimizes the recognition result, so meeting-related materials do not need to be sent to the cloud and are processed only at the device side, which protects the privacy of the meeting materials; applying a higher level of network security control within the organization can improve meeting security. This application is not limited to meetings and may also apply to speeches, lectures, and other activities.
Optionally, the edge side obtains the cloud's recognition result for the target recognition object and corrects it based on the local reference text corresponding to the target recognition object, thereby optimizing the cloud's recognition result and improving the accuracy of speech recognition.
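FIG. 4 is the third schematic flowchart of the speech recognition method provided by the embodiments of this application. As shown in FIG. 4, the flow is as follows:
Step 401: obtain the n-best recognition result for the phoneme sequence and the n-best recognition result for the word sequence from the cloud;
Step 402: request the dictionary from the cloud and download it to the edge side;
Step 403: generate the edge-side language model (LM), generate the NER vocabulary, and obtain the pronunciations corresponding to the words in the NER vocabulary; use the LM to select among the recognition results returned from the cloud, and use the LM and the NER vocabulary to optimize the recognition result;
Step 404: construct the device-side vocabulary, obtain the NER words currently on the screen, and further optimize the recognition result.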
FIG. 5 is the first schematic structural diagram of the speech recognition apparatus provided by the embodiments of this application. As shown in FIG. 5, the speech recognition apparatus includes a memory, a transceiver, and a processor:
the memory is configured to store a computer program; the transceiver is configured to send and receive data under the control of the processor; the processor is configured to read the computer program in the memory and perform the following operations:
the edge side obtains a cloud recognition result, where the cloud recognition result includes at least one target sequence obtained by the cloud recognizing the target recognition object;
based on the local reference text corresponding to the target recognition object, the at least one target sequence is corrected to obtain an edge-side recognition result.
The transceiver 502 is configured to receive and send data under the control of the processor 503.
In FIG. 5, the bus architecture may include any number of interconnected buses and bridges, which link together various circuits including one or more processors represented by the processor 503 and memory represented by the memory 501. The bus architecture may also link together various other circuits such as peripheral devices, voltage regulators, and power management circuits, all of which are well known in the art and are therefore not described further herein. The bus interface provides an interface. The transceiver 502 may be a plurality of elements, that is, it may include a transmitter and a receiver, providing a unit for communicating with various other apparatuses over transmission media, which include wireless channels, wired channels, optical cables, and the like. For different user equipment, the user interface 504 may also be an interface capable of connecting required devices externally or internally; the connected devices include but are not limited to a keypad, a display, a speaker, a microphone, a joystick, and the like.
The processor 503 is responsible for managing the bus architecture and general processing, and the memory 501 may store data used by the processor 503 when performing operations.
Optionally, the processor 503 may be a CPU (central processing unit), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a CPLD (Complex Programmable Logic Device); the processor may also adopt a multi-core architecture.
Optionally, in one embodiment, correcting the at least one target sequence based on the local reference text corresponding to the target recognition object to obtain the edge-side recognition result includes:
based on the language model corresponding to the target sequence and the local reference text, performing word replacement on a candidate sequence among the target sequences, where the language model corresponding to the target sequence is obtained by training on the local reference text; and/or, based on named entity recognition (NER), performing word replacement on a candidate sequence among the target sequences.
Optionally, in one embodiment, the word matching probability of the candidate sequence is the highest among the word matching probabilities of all the target sequences;
where, for each target sequence, the word matching probability is computed based on the language model corresponding to the target sequence and on the target sequence itself;
the word matching probability of a target sequence describes how frequently the words in the target sequence appear in the local reference text.
Optionally, in one embodiment, performing word replacement on a candidate sequence among the target sequences, based on the language model corresponding to the target sequence and the local reference text, includes:
based on the language model corresponding to the target sequence, obtaining the continuous matching probability corresponding to each combination of consecutive words in the candidate sequence, where the continuous matching probability indicates how frequently the corresponding combination of consecutive words appears in the local reference text;
if the continuous matching probability corresponding to any combination of consecutive words in the candidate sequence is lower than a first preset threshold, replacing that first combination of consecutive words, whose continuous matching probability is lower than the first preset threshold, with replacement text from the local reference text;
where the phoneme matching degree between the phoneme sequence corresponding to the replacement text and the phoneme sequence of the first combination of consecutive words is greater than a second preset threshold, and the replacement text appears in the local reference text more frequently than the first combination of consecutive words.
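The two-threshold replacement of low-probability word combinations can be sketched as below. This is an illustration under assumptions, not the source's implementation: consecutive word combinations are taken to be bigrams, the continuous matching probability is the bigram's relative frequency in the reference text, and the phoneme matching degree is approximated with `difflib.SequenceMatcher.ratio()`. All names and threshold values are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

def bigram_frequencies(tokens):
    """Relative frequency of each adjacent word pair in the reference text."""
    pairs = list(zip(tokens, tokens[1:]))
    counts = Counter(pairs)
    return {pair: n / len(pairs) for pair, n in counts.items()}

def pair_phonemes(pair, phones_of):
    """Concatenate the phonemes of both words; unknown words act as one opaque unit."""
    return [p for word in pair for p in phones_of.get(word, [word])]

def repair(candidate, reference, phones_of, first_threshold=0.05, second_threshold=0.7):
    """Replace rare bigrams with similar-sounding, more frequent bigrams from the reference."""
    freq = bigram_frequencies(reference)
    out = list(candidate)
    for i in range(len(out) - 1):
        pair = (out[i], out[i + 1])
        pair_freq = freq.get(pair, 0.0)
        if pair_freq >= first_threshold:
            continue  # frequent enough in the reference text, keep it
        best, best_freq = None, pair_freq
        for ref_pair, ref_freq in freq.items():
            similarity = SequenceMatcher(
                None, pair_phonemes(pair, phones_of), pair_phonemes(ref_pair, phones_of)
            ).ratio()
            if similarity > second_threshold and ref_freq > best_freq:
                best, best_freq = ref_pair, ref_freq
        if best is not None:
            out[i], out[i + 1] = best
    return out

phones_of = {
    "我": ["w", "o"], "他": ["t", "a"], "是": ["sh", "i"],
    "张三": ["zh", "ang", "s", "an"], "张山": ["zh", "ang", "sh", "an"],
}
reference = ["我", "是", "张山", "他", "是", "张山"]
repaired = repair(["是", "张三"], reference, phones_of)
print(repaired)
```

Here the bigram ("是", "张三") never occurs in the reference text, while the similar-sounding ("是", "张山") occurs twice, so the latter replaces the former.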
可选地,在一个实施例中,所述目标序列包括:Optionally, in one embodiment, the target sequence includes:
音素序列,和/或,词序列;相应地,所述音素序列对应的语言模型包括音素序列语言模型;所述词序列对应的语言模型包括词序列语言模型。phoneme sequence, and/or word sequence; correspondingly, the language model corresponding to the phoneme sequence includes a phoneme sequence language model; the language model corresponding to the word sequence includes a word sequence language model.
可选地,在一个实施例中,若所述目标序列包括音素序列,所述基于所述目标序列对应的语言模型和所述本地参考文本,对所述目标序列中的一个备选序列进行词汇替换,包括:Optionally, in one embodiment, if the target sequence includes a phoneme sequence, the vocabulary is performed on an alternative sequence in the target sequence based on the language model corresponding to the target sequence and the local reference text. Replacement, including:
基于所述音素序列语言模型,对所述音素序列中的一个备选序列进行词汇替换,获得第一音素序列;Based on the phoneme sequence language model, vocabulary replacement is performed on a candidate sequence in the phoneme sequence to obtain a first phoneme sequence;
基于所述词序列语言模型,对所述第一音素序列对应的至少一个词序列中的一个备选序列,进行词汇替换。Based on the word sequence language model, vocabulary replacement is performed on a candidate sequence in the at least one word sequence corresponding to the first phoneme sequence.
Optionally, in one embodiment, if the target sequence further includes a word sequence, performing vocabulary replacement on a candidate sequence among the target sequences, based on the language model corresponding to the target sequence and the local reference text, further includes:
if the fourth matching probability of a first word sequence is greater than the fourth matching probability of a second word sequence, determining the first word sequence to be the edge recognition result;
if the fourth matching probability of the second word sequence is greater than the fourth matching probability of the first word sequence, determining the second word sequence to be the edge recognition result;
wherein the first word sequence is obtained by performing vocabulary replacement on a candidate sequence among the at least one word sequence corresponding to the first phoneme sequence;
the second word sequence is obtained by performing vocabulary replacement, based on the word sequence language model, on a candidate sequence among the word sequences in the target sequence;
the fourth matching probability of the first word sequence describes the degree to which the first word sequence matches the local reference text;
the fourth matching probability of the second word sequence describes the degree to which the second word sequence matches the local reference text.
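The selection rule above can be illustrated with a short sketch. The text does not say how the fourth matching probability is computed, so `match_score` below is a hypothetical stand-in (the fraction of words in a sequence that also occur in the local reference text). The tie case is not covered by the text either; the sketch keeps the first word sequence in that case.

```python
def match_score(words, reference_words):
    """Hypothetical stand-in for the 'fourth matching probability':
    the fraction of words in the sequence that occur in the reference text."""
    if not words:
        return 0.0
    reference = set(reference_words)
    return sum(w in reference for w in words) / len(words)


def pick_edge_result(first_seq, second_seq, reference_words):
    """Return whichever word sequence matches the local reference text better.

    Ties are not specified in the text; this sketch keeps the first sequence.
    """
    first = match_score(first_seq, reference_words)
    second = match_score(second_seq, reference_words)
    return second_seq if second > first else first_seq
```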
Optionally, in one embodiment, performing vocabulary replacement on a candidate sequence among the target sequences based on named entity recognition (NER), to obtain the edge recognition result, includes:
replacing, based on the NER vocabulary, a word in a candidate sequence among the target sequences with a first replacement word from the NER vocabulary, to obtain the edge recognition result;
wherein the phoneme matching probability between the phoneme sequence of the replacement word and the phoneme sequence of the word in the candidate sequence is greater than a third preset threshold.
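A minimal sketch of this NER-based replacement follows. The text does not define how the phoneme matching probability is computed; as an assumption, the sketch uses the similarity ratio from Python's `difflib.SequenceMatcher` over phoneme lists. The names `replace_with_ner`, `word_phonemes`, and `ner_vocab` are illustrative only.

```python
from difflib import SequenceMatcher


def phoneme_similarity(a, b):
    """Similarity in [0, 1] between two phoneme lists.
    Assumed proxy for the 'phoneme matching probability'."""
    return SequenceMatcher(None, a, b).ratio()


def replace_with_ner(words, word_phonemes, ner_vocab, threshold):
    """Replace words whose pronunciation closely matches an NER vocabulary entry.

    words:         candidate sequence (list of str)
    word_phonemes: dict mapping each recognized word to its phoneme list
    ner_vocab:     dict mapping each NER word to its phoneme list
    threshold:     plays the role of the third preset threshold
    """
    result = []
    for word in words:
        source = word_phonemes.get(word)
        best_word, best_score = word, threshold
        if source:  # words without known phonemes are left unchanged
            for entity, entity_phonemes in ner_vocab.items():
                score = phoneme_similarity(source, entity_phonemes)
                if score > best_score:  # strictly greater than the threshold
                    best_word, best_score = entity, score
        result.append(best_word)
    return result
```

When several entries exceed the threshold, the sketch picks the closest one; the text only requires that the threshold be exceeded.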
Optionally, in one embodiment, the operations further include:
recognizing the local reference text based on NER, to generate the NER vocabulary;
obtaining the phonemes corresponding to the NER vocabulary based on a dictionary and/or character-to-phoneme (G2P) conversion.
Optionally, in one embodiment, the local reference text includes: professional information, participant information, background information, and content of the activity to which the local reference text belongs.
Optionally, in one embodiment, after the edge recognition result is obtained, the operations further include: saving information related to the local reference text to a server, based on the professional information of the activity to which the local reference text belongs.
In the speech recognition apparatus provided by this embodiment of the present application, the edge side obtains the cloud's recognition result for the target recognition object and corrects that result based on the local reference text corresponding to the target recognition object. This optimizes the cloud recognition result and improves the accuracy of speech recognition.
By invoking the computer program stored in the memory, the processor executes any of the methods provided in the embodiments of the present application in accordance with the obtained executable instructions. The processor and the memory may also be arranged physically separately.
It should be noted that the above apparatus provided by the embodiment of the present invention can implement all the method steps of the above method embodiments and achieve the same technical effects; the parts and beneficial effects that are the same as in the method embodiments are not described again here.
FIG. 6 is a second schematic structural diagram of the speech recognition apparatus provided by an embodiment of the present application. As shown in FIG. 6, the speech recognition apparatus includes a memory, a transceiver, and a processor:
the memory is configured to store a computer program; the transceiver is configured to send and receive data under the control of the processor; and the processor is configured to read the computer program in the memory and perform the following operations:
obtaining, on the device side, a cloud recognition result or an edge recognition result, wherein the cloud recognition result includes one target sequence obtained by the cloud recognizing the target recognition object;
correcting the one target sequence based on a local vocabulary on the device side, to obtain a device-side recognition result.
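The transceiver 602 is configured to receive and send data under the control of the processor 603.
In FIG. 6, the bus architecture may include any number of interconnected buses and bridges, which link together various circuits of one or more processors, represented by the processor 603, and of memory, represented by the memory 601. The bus architecture may also link together various other circuits such as peripherals, voltage regulators, and power management circuits; these are well known in the art and are therefore not described further here. The bus interface provides an interface. The transceiver 602 may consist of multiple elements, that is, a transmitter and a receiver, and provides a unit for communicating with various other apparatuses over transmission media, including wireless channels, wired channels, and optical cables. For different user equipment, the user interface 604 may also be an interface capable of connecting required devices externally or internally; the connected devices include but are not limited to a keypad, a display, a speaker, a microphone, and a joystick.
The processor 603 is responsible for managing the bus architecture and general processing, and the memory 601 may store data used by the processor 603 when performing operations.
Optionally, the processor 603 may be a CPU (central processing unit), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a CPLD (Complex Programmable Logic Device); the processor may also adopt a multi-core architecture.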
Optionally, correcting the one target sequence based on the local vocabulary on the device side, to obtain the device-side recognition result, includes:
replacing the word in the target sequence that corresponds to a second replacement word in the local vocabulary with that second replacement word, to obtain the device-side recognition result;
wherein the phoneme matching probability between the phoneme sequence of the word corresponding to the second replacement word and the phoneme sequence of the second replacement word is greater than a fourth preset threshold.
Optionally, the local vocabulary includes a preset error-prone vocabulary and/or real-time NER words.
Optionally, the operations further include: obtaining, in real time and based on NER, the real-time NER words in the display information corresponding to the target recognition object.
In the speech recognition apparatus provided by this embodiment of the present application, the edge side obtains the cloud's recognition result for the target recognition object and corrects that result based on the local reference text corresponding to the target recognition object. This optimizes the cloud recognition result and improves the accuracy of speech recognition.
By invoking the computer program stored in the memory, the processor executes any of the methods provided in the embodiments of the present application in accordance with the obtained executable instructions. The processor and the memory may also be arranged physically separately.
It should be noted that the above apparatus provided by the embodiment of the present invention can implement all the method steps of the above method embodiments and achieve the same technical effects; the parts and beneficial effects that are the same as in the method embodiments are not described again here.
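FIG. 7 is a third schematic structural diagram of the speech recognition apparatus provided by an embodiment of the present application. As shown in FIG. 7, the speech recognition apparatus includes a first obtaining unit 710 and a first correction unit 720, wherein:
the first obtaining unit 710 is configured to obtain, on the edge side, a cloud recognition result, the cloud recognition result including at least one target sequence obtained by the cloud recognizing the target recognition object;
the first correction unit 720 is configured to correct the at least one target sequence based on the local reference text corresponding to the target recognition object, to obtain an edge recognition result.
Optionally, the speech recognition apparatus obtains the cloud recognition result through the first obtaining unit 710, and may then correct the at least one target sequence through the first correction unit 720, based on the local reference text corresponding to the target recognition object, to obtain the edge recognition result.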
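The division into an obtaining unit and a correction unit can be pictured as two pluggable components. The sketch below is only an illustration of that logical split: the class name `EdgeRecognizer` and its fields are hypothetical, and each field merely plays the role of one unit.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EdgeRecognizer:
    """Hypothetical illustration of the two-unit structure.

    fetch_cloud_result plays the role of the first obtaining unit 710;
    correct plays the role of the first correction unit 720.
    """
    fetch_cloud_result: Callable[[], List[List[str]]]
    correct: Callable[[List[List[str]], str], List[str]]

    def recognize(self, local_reference_text: str) -> List[str]:
        # Obtain the target sequences produced by the cloud.
        target_sequences = self.fetch_cloud_result()
        if not target_sequences:
            raise ValueError("cloud result must contain at least one target sequence")
        # Correct them against the local reference text to get the edge result.
        return self.correct(target_sequences, local_reference_text)
```

Because the split is purely logical, both callables could equally live in one process or in separate components.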
In the speech recognition apparatus provided by this embodiment of the present application, the edge side obtains the cloud's recognition result for the target recognition object and corrects that result based on the local reference text corresponding to the target recognition object. This optimizes the cloud recognition result and improves the accuracy of speech recognition.
It should be noted that the division into units in the embodiments of the present application is illustrative and is merely a division by logical function; other divisions are possible in an actual implementation. In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a processor-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Optionally, the first correction unit 720 is configured to:
perform vocabulary replacement on a candidate sequence among the target sequences based on the language model corresponding to the target sequence and the local reference text, wherein the language model corresponding to the target sequence is trained on the local reference text;
and/or,
perform vocabulary replacement on one of the candidate sequences among the target sequences based on named entity recognition (NER).
Optionally, the vocabulary matching probability of the candidate sequence is the highest among the vocabulary matching probabilities of all the target sequences;
wherein, for each target sequence, the vocabulary matching probability is calculated based on the target sequence and the language model corresponding to the target sequence;
the vocabulary matching probability of a target sequence describes the frequency with which the words in the target sequence appear in the local reference text.
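Candidate selection by vocabulary matching probability can be sketched as follows. The exact formula is not given in the text, so the sketch makes an assumption: the "language model" is a plain word-count table built from the local reference text, and the probability of a sequence is the average relative frequency of its words. The function names are illustrative.

```python
from collections import Counter


def vocabulary_match_probability(sequence, reference_counts, total):
    """Average relative frequency, in the reference text, of the words in `sequence`.
    Assumed stand-in for the 'vocabulary matching probability'."""
    if not sequence or total == 0:
        return 0.0
    # Counter returns 0 for words that never occur in the reference text.
    return sum(reference_counts[w] for w in sequence) / (total * len(sequence))


def select_candidate(target_sequences, reference_words):
    """Pick the target sequence with the highest vocabulary matching probability."""
    counts = Counter(reference_words)
    total = sum(counts.values())
    return max(
        target_sequences,
        key=lambda seq: vocabulary_match_probability(seq, counts, total),
    )
```

With `max`, the first sequence wins when two sequences score equally; the text does not state a tie-breaking rule.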
Optionally, the first correction unit 720 is configured to: obtain, based on the language model corresponding to the target sequence, the continuous matching probability of each consecutive word combination in the candidate sequence, wherein a continuous matching probability represents the frequency with which its consecutive word combination appears in the local reference text;
if the continuous matching probability of any consecutive word combination in the candidate sequence is lower than a first preset threshold, replace that first consecutive word combination with replacement text from the local reference text;
wherein the phoneme matching degree between the phoneme sequence of the replacement text and the phoneme sequence of the first consecutive word combination is greater than a second preset threshold, and the replacement text appears more frequently in the local reference text than the first consecutive word combination.
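The threshold-based replacement above can be sketched under several stated assumptions: consecutive word combinations are limited to word pairs (bigrams), the continuous matching probability is the relative bigram frequency in the local reference text, the phoneme matching degree is the `difflib.SequenceMatcher` ratio over phoneme lists, and the replacement text is also a word pair. The `phonemes` dictionary and all function names are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def bigrams(words):
    """All consecutive word pairs in a word list."""
    return list(zip(words, words[1:]))


def correct_candidate(candidate, reference_words, phonemes, min_freq, min_phoneme_match):
    """Replace rare word pairs with similar-sounding, more frequent reference pairs.

    min_freq plays the role of the first preset threshold,
    min_phoneme_match that of the second preset threshold.
    """
    ref_counts = Counter(bigrams(reference_words))
    total = sum(ref_counts.values()) or 1

    def freq(pair):
        return ref_counts[pair] / total

    def pron(pair):
        # Words without a phoneme entry fall back to the word itself as one token.
        return [p for w in pair for p in phonemes.get(w, [w])]

    result = list(candidate)
    i = 0
    while i < len(result) - 1:
        pair = (result[i], result[i + 1])
        if freq(pair) < min_freq:
            best = None
            for ref_pair in ref_counts:
                sim = SequenceMatcher(None, pron(pair), pron(ref_pair)).ratio()
                if sim > min_phoneme_match and freq(ref_pair) > freq(pair):
                    if best is None or sim > best[0]:
                        best = (sim, ref_pair)
            if best is not None:
                result[i:i + 2] = best[1]
                i += 2  # skip past the pair that was just replaced
                continue
        i += 1
    return result
```

A pair that is rare but has no sufficiently similar, more frequent counterpart in the reference text is left unchanged.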
Optionally, the target sequence includes:
a phoneme sequence and/or a word sequence;
correspondingly, the language model corresponding to the phoneme sequence includes a phoneme sequence language model;
and the language model corresponding to the word sequence includes a word sequence language model.
Optionally, if the target sequence includes a phoneme sequence, the first correction unit 720 is configured to: perform vocabulary replacement on a candidate sequence among the phoneme sequences based on the phoneme sequence language model, to obtain a first phoneme sequence;
and perform vocabulary replacement, based on the word sequence language model, on a candidate sequence among the at least one word sequence corresponding to the first phoneme sequence.
Optionally, if the target sequence further includes a word sequence, the first correction unit 720 is configured to: if the fourth matching probability of a first word sequence is greater than the fourth matching probability of a second word sequence, determine the first word sequence to be the edge recognition result;
if the fourth matching probability of the second word sequence is greater than the fourth matching probability of the first word sequence, determine the second word sequence to be the edge recognition result;
wherein the first word sequence is obtained by performing vocabulary replacement on a candidate sequence among the at least one word sequence corresponding to the first phoneme sequence;
the second word sequence is obtained by performing vocabulary replacement, based on the word sequence language model, on a candidate sequence among the word sequences in the target sequence;
the fourth matching probability of the first word sequence describes the degree to which the first word sequence matches the local reference text;
the fourth matching probability of the second word sequence describes the degree to which the second word sequence matches the local reference text.
Optionally, the first correction unit 720 is configured to: replace, based on the NER vocabulary, a word in a candidate sequence among the target sequences with a first replacement word from the NER vocabulary, to obtain the edge recognition result;
wherein the phoneme matching probability between the phoneme sequence of the replacement word and the phoneme sequence of the word in the candidate sequence is greater than a third preset threshold.
Optionally, the apparatus further includes:
a first generating unit, configured to recognize the local reference text based on NER, to generate the NER vocabulary;
a third obtaining unit, configured to obtain the phonemes corresponding to the NER vocabulary based on a dictionary and/or character-to-phoneme (G2P) conversion.
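Building the NER vocabulary and its phonemes might look like the sketch below. It assumes one plausible reading of "dictionary and/or G2P": look each word up in a pronunciation dictionary first and fall back to a G2P function when the dictionary has no entry. The NER model and the G2P model are passed in as callables, since the text does not name specific ones; `build_ner_vocab` and its parameters are hypothetical.

```python
def build_ner_vocab(reference_text, recognize_entities, lexicon, g2p):
    """Return {entity: phoneme list} for the entities found in the reference text.

    recognize_entities: callable, text -> iterable of entity strings (any NER model)
    lexicon:            dict, lowercase word -> phoneme list (pronunciation dictionary)
    g2p:                callable, lowercase word -> phoneme list (fallback)
    """
    vocab = {}
    for entity in recognize_entities(reference_text):
        entity_phonemes = []
        for word in entity.split():
            key = word.lower()
            # Prefer the dictionary; use G2P only for out-of-dictionary words.
            entity_phonemes.extend(lexicon[key] if key in lexicon else g2p(key))
        vocab[entity] = entity_phonemes
    return vocab
```

A usage example with toy stand-ins (a fixed entity list and a letter-by-letter G2P, neither of which is a real model):

```python
lexicon = {"alice": ["AE", "L", "AH", "S"]}
vocab = build_ner_vocab(
    "Alice and Zhao join the meeting",
    lambda text: ["Alice", "Zhao"],   # toy NER
    lexicon,
    lambda word: list(word.upper()),  # toy G2P
)
```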
Optionally, the local reference text includes:
professional information, participant information, background information, and content of the activity to which the local reference text belongs.
Optionally, the apparatus is further configured to:
after the edge recognition result is obtained, save information related to the local reference text to a server, based on the professional information of the activity to which the local reference text belongs.
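FIG. 8 is a fourth schematic structural diagram of the speech recognition apparatus provided by an embodiment of the present application. As shown in FIG. 8, the speech recognition apparatus includes a second obtaining unit 810 and a second correction unit 820, wherein:
the second obtaining unit 810 is configured to obtain, on the device side, a cloud recognition result or an edge recognition result, wherein the cloud recognition result includes one target sequence obtained by the cloud recognizing the target recognition object;
the second correction unit 820 is configured to correct the one target sequence based on a local vocabulary on the device side, to obtain a device-side recognition result.
Optionally, the speech recognition apparatus obtains the cloud recognition result or the edge recognition result through the second obtaining unit 810, and may then correct the one target sequence through the second correction unit 820, based on the local vocabulary on the device side, to obtain the device-side recognition result.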
In the speech recognition apparatus provided by this embodiment of the present application, the edge side obtains the cloud's recognition result for the target recognition object and corrects that result based on the local reference text corresponding to the target recognition object. This optimizes the cloud recognition result and improves the accuracy of speech recognition.
Optionally, the second correction unit 820 is configured to:
replace the word in the target sequence that corresponds to a second replacement word in the local vocabulary with that second replacement word, to obtain the device-side recognition result;
wherein the phoneme matching probability between the phoneme sequence of the word corresponding to the second replacement word and the phoneme sequence of the second replacement word is greater than a fourth preset threshold.
Optionally, the local vocabulary includes a preset error-prone vocabulary and/or real-time NER words.
Optionally, the apparatus further includes:
a fourth obtaining module, configured to obtain, in real time and based on NER, the real-time NER words in the display information corresponding to the target recognition object.
It should be noted that the division into units in the embodiments of the present application is illustrative and is merely a division by logical function; other divisions are possible in an actual implementation. In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a processor-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
It should be noted that the above apparatus provided by the embodiment of the present invention can implement all the method steps of the above method embodiments and achieve the same technical effects; the parts and beneficial effects that are the same as in the method embodiments are not described again here.
In another aspect, an embodiment of the present application further provides a processor-readable storage medium storing a computer program, the computer program being used to cause the processor to execute the methods provided by the above embodiments, including:
obtaining, on the edge side, a cloud recognition result, the cloud recognition result including at least one target sequence obtained by the cloud recognizing a target recognition object;
correcting the at least one target sequence based on the local reference text corresponding to the target recognition object, to obtain an edge recognition result;
or
obtaining, on the device side, a cloud recognition result or an edge recognition result, wherein the cloud recognition result includes one target sequence obtained by the cloud recognizing the target recognition object;
correcting the one target sequence based on a local vocabulary on the device side, to obtain a device-side recognition result.
The processor-readable storage medium may be any available medium or data storage device that the processor can access, including but not limited to magnetic storage (for example, floppy disks, hard disks, magnetic tape, and magneto-optical (MO) disks), optical storage (for example, CD, DVD, BD, and HVD), and semiconductor memory (for example, ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH), and solid-state drives (SSD)).
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage and optical storage) that contain computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in them, can be implemented by computer-executable instructions. These computer-executable instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These processor-executable instructions may also be stored in a processor-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the processor-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These processor-executable instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various changes and variations to the present application without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present application and their equivalents, the present application is also intended to include them.
Claims (20)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110193727.6A CN114974249B (en) | 2021-02-20 | 2021-02-20 | Speech recognition method, device and storage medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110193727.6A CN114974249B (en) | 2021-02-20 | 2021-02-20 | Speech recognition method, device and storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114974249A (en) | 2022-08-30 |
| CN114974249B (en) | 2025-09-12 |
Family
ID=82954730
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110193727.6A Active CN114974249B (en) | 2021-02-20 | 2021-02-20 | Speech recognition method, device and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114974249B (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115862631A (en) * | 2022-12-12 | 2023-03-28 | 厦门黑镜科技有限公司 | Subtitle generating method and device, electronic equipment and storage medium |
| CN120199247A (en) * | 2025-05-26 | 2025-06-24 | 华泽中熙(北京)科技发展有限公司 | A method and system for intelligent customer service voice interaction based on voice recognition |
Citations (15)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN1708783A (en) * | 2002-11-02 | 2005-12-14 | 皇家飞利浦电子股份有限公司 | Method and system for speech recognition |
| CN105408953A (en) * | 2013-06-28 | 2016-03-16 | 株式会社ATR-Trek | Voice recognition client device for local voice recognition |
| CN105868650A (en) * | 2016-06-03 | 2016-08-17 | 腾讯科技(深圳)有限公司 | Information processing method and system, server and terminals |
| CN105895103A (en) * | 2015-12-03 | 2016-08-24 | 乐视致新电子科技(天津)有限公司 | Speech recognition method and device |
| CN106683677A (en) * | 2015-11-06 | 2017-05-17 | 阿里巴巴集团控股有限公司 | Method and device for recognizing voice |
| CN107644638A (en) * | 2017-10-17 | 2018-01-30 | 北京智能管家科技有限公司 | Audio recognition method, device, terminal and computer-readable recording medium |
| CN107731229A (en) * | 2017-09-29 | 2018-02-23 | 百度在线网络技术(北京)有限公司 | Method and apparatus for identifying voice |
| CN108597495A (en) * | 2018-03-15 | 2018-09-28 | 维沃移动通信有限公司 | A kind of method and device of processing voice data |
| CN109065054A (en) * | 2018-08-31 | 2018-12-21 | 出门问问信息科技有限公司 | Speech recognition error correction method, device, electronic equipment and readable storage medium storing program for executing |
| CN109473093A (en) * | 2018-12-13 | 2019-03-15 | 平安科技(深圳)有限公司 | Audio recognition method, device, computer equipment and storage medium |
| CN109599115A (en) * | 2018-12-24 | 2019-04-09 | 苏州思必驰信息科技有限公司 | Minutes method and apparatus for audio collecting device and user terminal |
| CN110166522A (en) * | 2019-04-01 | 2019-08-23 | 腾讯科技(深圳)有限公司 | Server identification method, device, readable storage medium storing program for executing and computer equipment |
| CN110473523A (en) * | 2019-08-30 | 2019-11-19 | 北京大米科技有限公司 | A kind of audio recognition method, device, storage medium and terminal |
| CN110738997A (en) * | 2019-10-25 | 2020-01-31 | 百度在线网络技术(北京)有限公司 | information correction method, device, electronic equipment and storage medium |
| CN111508484A (en) * | 2019-01-31 | 2020-08-07 | 阿里巴巴集团控股有限公司 | Voice data processing method and device |
Patent Citations (15)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN1708783A (en) * | 2002-11-02 | 2005-12-14 | 皇家飞利浦电子股份有限公司 | Method and system for speech recognition |
| CN105408953A (en) * | 2013-06-28 | 2016-03-16 | 株式会社ATR-Trek | Voice recognition client device for local voice recognition |
| CN106683677A (en) * | 2015-11-06 | 2017-05-17 | 阿里巴巴集团控股有限公司 | Method and device for recognizing voice |
| CN105895103A (en) * | 2015-12-03 | 2016-08-24 | 乐视致新电子科技(天津)有限公司 | Speech recognition method and device |
| CN105868650A (en) * | 2016-06-03 | 2016-08-17 | 腾讯科技(深圳)有限公司 | Information processing method and system, server and terminals |
| CN107731229A (en) * | 2017-09-29 | 2018-02-23 | 百度在线网络技术(北京)有限公司 | Method and apparatus for identifying voice |
| CN107644638A (en) * | 2017-10-17 | 2018-01-30 | 北京智能管家科技有限公司 | Audio recognition method, device, terminal and computer-readable recording medium |
| CN108597495A (en) * | 2018-03-15 | 2018-09-28 | 维沃移动通信有限公司 | A kind of method and device of processing voice data |
| CN109065054A (en) * | 2018-08-31 | 2018-12-21 | 出门问问信息科技有限公司 | Speech recognition error correction method, device, electronic equipment and readable storage medium storing program for executing |
| CN109473093A (en) * | 2018-12-13 | 2019-03-15 | 平安科技(深圳)有限公司 | Audio recognition method, device, computer equipment and storage medium |
| CN109599115A (en) * | 2018-12-24 | 2019-04-09 | 苏州思必驰信息科技有限公司 | Minutes method and apparatus for audio collecting device and user terminal |
| CN111508484A (en) * | 2019-01-31 | 2020-08-07 | 阿里巴巴集团控股有限公司 | Voice data processing method and device |
| CN110166522A (en) * | 2019-04-01 | 2019-08-23 | 腾讯科技(深圳)有限公司 | Server identification method, device, readable storage medium and computer equipment |
| CN110473523A (en) * | 2019-08-30 | 2019-11-19 | 北京大米科技有限公司 | Audio recognition method, device, storage medium and terminal |
| CN110738997A (en) * | 2019-10-25 | 2020-01-31 | 百度在线网络技术(北京)有限公司 | Information correction method, device, electronic equipment and storage medium |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115862631A (en) * | 2022-12-12 | 2023-03-28 | 厦门黑镜科技有限公司 | Subtitle generating method and device, electronic equipment and storage medium |
| CN120199247A (en) * | 2025-05-26 | 2025-06-24 | 华泽中熙(北京)科技发展有限公司 | A method and system for intelligent customer service voice interaction based on voice recognition |
| CN120199247B (en) * | 2025-05-26 | 2025-08-12 | 华泽中熙(北京)科技发展有限公司 | Intelligent customer service voice interaction method and system based on voice recognition |
Also Published As
| Publication number | Publication date |
|---|---|
| CN114974249B (en) | 2025-09-12 |
Similar Documents
| Publication | Title |
|---|---|
| CN110675855B (en) | Voice recognition method, electronic equipment and computer readable storage medium |
| US11514891B2 (en) | Named entity recognition method, named entity recognition equipment and medium |
| US10741170B2 (en) | Speech recognition method and apparatus |
| KR101183344B1 (en) | Automatic speech recognition learning using user corrections |
| EP2700071B1 (en) | Speech recognition using multiple language models |
| CN108899013B (en) | Voice search method and device and voice recognition system |
| US9697201B2 (en) | Adapting machine translation data using damaging channel model |
| US9558741B2 (en) | Systems and methods for speech recognition |
| US20080154600A1 (en) | System, Method, Apparatus and Computer Program Product for Providing Dynamic Vocabulary Prediction for Speech Recognition |
| CN110910903B (en) | Speech emotion recognition method, device, equipment and computer readable storage medium |
| GB2557714A (en) | Determining phonetic relationships |
| CN112634866B (en) | Speech synthesis model training and speech synthesis method, device, equipment and medium |
| CN114360499A (en) | Voice recognition method, device, equipment and storage medium |
| CN114974249B (en) | Speech recognition method, device and storage medium |
| CN118098290A (en) | Reading evaluation method, device, equipment, storage medium and computer program product |
| CN110809796B (en) | Speech recognition system and method with decoupled wake phrases |
| Mirishkar et al. | CSTD-Telugu corpus: Crowd-sourced approach for large-scale speech data collection |
| CN114171023B (en) | Speech recognition method, device, computer equipment and storage medium |
| KR20200072005A (en) | Method for correcting speech recognized sentence |
| US10885914B2 (en) | Speech correction system and speech correction method |
| CN113506561B (en) | Text pinyin conversion method and device, storage medium and electronic equipment |
| CN118675529B (en) | Data processing method and system for call voice translation |
| CN115171646B (en) | Audio generation method and device |
| US20240071368A1 (en) | System and Method for Adapting Natural Language Understanding (NLU) Engines Optimized on Text to Audio Input |
| US20250225338A1 (en) | Real-time language translation systems embodied in a physical device |
Legal Events
| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |