CN115132184A - Voice interaction method, server and storage medium
- Publication number: CN115132184A
- Application number: CN202210736173.4A
- Authority: CN (China)
- Prior art keywords: syllable, syllables, combined, pronunciation, phonemes
- Legal status: Pending (assumed; not a legal conclusion)
Classifications
- G10L15/22: Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/063: Speech recognition; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L2015/027: Speech recognition; syllables being the recognition units
Description
Technical Field
The present application relates to the field of voice technology, and in particular to a voice interaction method, a server, and a storage medium.
Background
With the continuous development of the automotive industry and human-computer interaction technology, smart cars now provide users with voice interaction functions.
Voice interaction depends on speech recognition technology. The speech to be recognized is not necessarily in a single language; it may be mixed bilingual or multilingual speech, and the recognition approach differs accordingly. Taking Europe as an example, there are more than ten languages spanning multiple language families. The related art generally builds the acoustic model with context-dependent syllable modeling, but that approach is strongly scene-dependent and cannot achieve unified multilingual modeling, which hinders speech recognition and voice interaction across languages. Deploying a separate speech recognition system for each language would greatly increase costs and waste machine resources.
Summary of the Invention
To solve, at least in part, the problems in the related art, the present application provides a voice interaction method, a server, and a storage medium that enable unified multilingual modeling, facilitate speech recognition and voice interaction across languages, reduce deployment costs, and avoid wasting machine resources.
A first aspect of the present application provides a voice interaction method, including: acquiring phonemes of different languages and merging them into first syllables according to phonetic rules; identifying pronounced combined syllables from training materials of different languages, and screening second syllables out of the combined syllables according to pronunciation cohesion; merging the first syllables and the second syllables to obtain modeling syllables; generating an acoustic model from the modeling syllables; receiving a voice request, forwarded by a vehicle, that a user issued in the vehicle cabin; and recognizing the voice request with the acoustic model, generating a recognition result, and sending it to the vehicle to complete the voice interaction. The present application merges the phonemes of different languages into first syllables according to phonetic rules, screens second syllables out of the combined syllables of multilingual training materials according to pronunciation cohesion, and merges the two to obtain modeling syllables. Because these syllables are context-independent, all languages can use one modeling system, enabling unified multilingual modeling, more convenient speech recognition and voice interaction across languages, lower deployment costs, and no waste of machine resources.
Merging the phonemes of the different languages into first syllables according to phonetic rules includes: pre-merging the phonemes of the different languages according to the International Phonetic Alphabet (IPA); and merging the pre-merged phonemes into first syllables according to phonetic rules. IPA-based phoneme merging reduces the number of phonemes, lets dozens of mixed languages share one output layer, and lowers computation and latency.
Merging the pre-merged phonemes into first syllables according to phonetic rules includes: from the pre-merged phonemes, merging initial and final phonemes to obtain first syllables, and taking each remaining single initial phoneme and single final phoneme as a first syllable on its own. Merging phonemes into syllables with reference to phonetic rules makes the merged syllables better match pronunciation habits.
Identifying pronounced combined syllables from training materials of different languages includes: identifying the pronounced combined syllables from audio and/or video training materials of different languages. Audio and/or video can thus be fully exploited as multilingual training material.
Screening second syllables out of the combined syllables according to pronunciation cohesion includes: performing forced frame alignment on the combined syllables, and obtaining the average pronunciation duration of each aligned combined syllable and the average pronunciation duration over all combined syllables; and taking the ratio of a combined syllable's average pronunciation duration to the average pronunciation duration over all combined syllables as its pronunciation cohesion, and taking combined syllables whose pronunciation cohesion is below a set threshold as second syllables. Screening syllables by pronunciation cohesion makes the selection more accurate.
Before performing forced frame alignment on the combined syllables, the method further includes: screening out, from the combined syllables, those that conform to the syllable merging rules; performing forced frame alignment on the combined syllables then comprises force-aligning the combined syllables that conform to the syllable merging rules. Filtering with the syllable merging rules before forced frame alignment reduces the subsequent screening workload and improves processing efficiency.
The syllable merging rules include at least one of the following patterns: initial + initial + final; initial + initial + final + special initial; initial + final + special initial. These rules cover a variety of cases.
The average pronunciation duration of a combined syllable is determined from the ratio of its total pronunciation duration after forced frame alignment to the number of times the combined syllable occurs in the training audio. Averaging makes the parameter more precise.
Performing forced frame alignment on the combined syllables includes: force-aligning each combined syllable with the final as its core. Alignment centered on the final better matches how syllables are actually pronounced.
A second aspect of the present application provides a server, including: a phoneme processing module configured to acquire phonemes of different languages and merge them into first syllables according to phonetic rules; a training processing module configured to identify pronounced combined syllables from training materials of different languages and screen second syllables out of the combined syllables according to pronunciation cohesion; a syllable merging module configured to merge the first syllables obtained by the phoneme processing module with the second syllables obtained by the training processing module to obtain modeling syllables; a model generation module configured to generate an acoustic model from the modeling syllables; a request receiving module configured to receive a voice request, forwarded by a vehicle, that a user issued in the vehicle cabin; and a speech recognition module configured to recognize, with the acoustic model generated by the model generation module, the voice request received by the request receiving module, and to generate and send the recognition result to the vehicle to complete the voice interaction. As with the first aspect, the modeling syllables are context-independent, so all languages can use one modeling system, enabling unified multilingual modeling, more convenient speech recognition and voice interaction across languages, lower deployment costs, and no waste of machine resources.
The phoneme processing module includes: a pre-merging sub-module configured to pre-merge the phonemes of the different languages according to the IPA; and a phoneme merging sub-module configured to merge the pre-merged phonemes into first syllables according to phonetic rules. IPA-based phoneme merging reduces the number of phonemes, lets dozens of mixed languages share one output layer, and lowers computation and latency.
The training processing module includes: an alignment and statistics module configured to perform forced frame alignment on the combined syllables and obtain the average pronunciation duration of each aligned combined syllable and the average pronunciation duration over all combined syllables; and a syllable screening module configured to take the ratio of a combined syllable's average pronunciation duration to the average pronunciation duration over all combined syllables as its pronunciation cohesion, and to take combined syllables whose pronunciation cohesion is below a set threshold as second syllables. Screening syllables by pronunciation cohesion makes the selection more accurate.
A third aspect of the present application provides a server, including: a processor; and a memory storing executable code which, when executed by the processor, causes the processor to perform the method described above.
A fourth aspect of the present application provides a computer-readable storage medium storing executable code which, when executed by a processor of an electronic device, causes the processor to perform the method described above.
It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present application.
Brief Description of the Drawings
The above and other objects, features, and advantages of the present application will become more apparent from the following more detailed description of its exemplary embodiments in conjunction with the accompanying drawings, in which the same reference numerals generally denote the same components.
FIG. 1 is a schematic flowchart of a voice interaction method according to the present application;
FIG. 2 is a schematic flowchart of another voice interaction method according to the present application;
FIG. 3 is a schematic flowchart of speech modeling in a voice interaction method according to the present application;
FIG. 4 is a schematic diagram of an application framework for speech recognition using the speech modeling of the present application;
FIG. 5 is a schematic comparison of context-dependent and context-independent modeling according to the present application;
FIG. 6 is a schematic diagram of modeling units according to the present application;
FIG. 7 is a schematic structural diagram of a server according to the present application;
FIG. 8 is a schematic structural diagram of another server according to the present application;
FIG. 9 is another schematic structural diagram of a server according to the present application.
Detailed Description
Embodiments of the present application are described in more detail below with reference to the accompanying drawings. Although embodiments of the present application are shown in the drawings, the application may be implemented in various forms and should not be limited by the embodiments set forth here; rather, these embodiments are provided so that the application will be thorough and complete and will fully convey its scope to those skilled in the art.
The terminology used in this application is for describing particular embodiments only and is not intended to limit the application. The singular forms "a", "said", and "the" used in this application and the appended claims are intended to include the plural forms as well, unless the context clearly indicates otherwise. The term "and/or" as used here refers to and includes any and all possible combinations of one or more of the associated listed items.
Although the terms "first", "second", "third", and so on may be used in this application to describe various information, the information should not be limited by these terms; they serve only to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be called second information and, similarly, second information may be called first information. A feature defined as "first" or "second" may thus explicitly or implicitly include one or more of that feature. In the description of the present application, "plurality" means two or more unless expressly and specifically defined otherwise.
In the related art, acoustic models are generally built with context-dependent syllable modeling, which cannot achieve unified multilingual modeling and hinders speech recognition and voice interaction across languages. The present application provides a voice interaction method that enables unified multilingual modeling, facilitates speech recognition and voice interaction in different languages, reduces deployment costs, and avoids wasting machine resources.
The technical solutions of the present application are described in detail below with reference to the accompanying drawings.
FIG. 1 is a schematic flowchart of a voice interaction method according to the present application. The method can be applied to a server.
Referring to FIG. 1, the method includes:
S101: Acquire phonemes of different languages, and merge them into first syllables according to phonetic rules.
The phonemes of the different languages can be pre-merged according to the IPA; for example, merging the English and French phoneme inventories via the IPA leaves only 56 phonemes, so the merging reduces the phoneme count. The pre-merged phonemes can then be merged into first syllables according to phonetic rules: for example, from the pre-merged phonemes, initial and final phonemes are merged to obtain first syllables, and each remaining single initial phoneme and single final phoneme is taken as a first syllable on its own.
S102: Identify pronounced combined syllables from training materials of different languages, and screen second syllables out of the combined syllables according to pronunciation cohesion.
The pronounced combined syllables can be identified from audio and/or video training materials of different languages; audio and/or video can thus be fully exploited as multilingual training material.
The combined syllables can be force-aligned at the frame level; the average pronunciation duration of each aligned combined syllable and the average pronunciation duration over all combined syllables are determined; the ratio of a combined syllable's average pronunciation duration to the average over all combined syllables is taken as its pronunciation cohesion; and combined syllables whose cohesion is below a set threshold are taken as second syllables. The lower the cohesion, the more strongly the phonemes of the combined syllable belong together.
The syllable merging rules include at least one of the following patterns: initial + initial + final; initial + initial + final + special initial; initial + final + special initial.
The average pronunciation duration of a combined syllable is determined from the ratio of its total pronunciation duration after forced frame alignment to the number of times it occurs in the training audio.
Forced frame alignment of the combined syllables includes force-aligning each combined syllable with the final as its core.
Note that there is no required order between S101 and S102.
S103: Merge the first syllables and the second syllables to obtain modeling syllables.
Merging the first syllables and second syllables obtained in the separate steps above generates the final syllable set, i.e., the modeling syllables.
S104: Generate an acoustic model from the modeling syllables.
Once the modeling syllables are obtained, an acoustic model can be generated from them using existing related techniques.
S105: Receive a voice request, forwarded by the vehicle, that a user issued in the vehicle cabin.
The server can receive the voice request that the user issued in the vehicle cabin and that the vehicle forwarded; the request may be in French, German, or another language.
S106: Recognize the voice request with the acoustic model, generate a recognition result, and send it to the vehicle to complete the voice interaction.
After receiving the user's voice request, the server recognizes it with the acoustic model and sends the generated recognition result to the vehicle to complete the voice interaction. Any existing method of recognizing a voice request with an acoustic model may be used; the present application does not limit it.
In this solution, the phonemes of different languages are merged into first syllables according to phonetic rules, second syllables are screened out of the combined syllables of multilingual training materials according to pronunciation cohesion, and the two are merged into modeling syllables. Because these syllables are context-independent, all languages can use one modeling system, enabling unified multilingual modeling, more convenient speech recognition and voice interaction across languages, lower deployment costs, and no waste of machine resources.
FIG. 2 is a schematic flowchart of another voice interaction method according to the present application; it too can be applied to a server. In the method of FIG. 2, the multilingual training materials are audio training materials ("training audio") by way of example, but they are not limited to this and may also be video training materials ("training video").
Referring to FIG. 2, the method includes:
S201: Acquire phonemes of different languages, pre-merge them according to the IPA, and merge the pre-merged phonemes into first syllables according to phonetic rules.
From the pre-merged phonemes, initial and final phonemes can be merged to obtain first syllables, and each remaining single initial phoneme and single final phoneme is taken as a first syllable on its own.
S202: Acquire training audio in different languages, identify pronounced combined syllables from it, and screen out the combined syllables that conform to the syllable merging rules.
The syllable merging rules include at least one of the following patterns: initial + initial + final; initial + initial + final + special initial; initial + final + special initial.
Note that there is no required order between S201 and S202.
S203: Perform forced frame alignment on the combined syllables that conform to the syllable merging rules, and determine the average pronunciation duration of each aligned combined syllable and the average pronunciation duration over all combined syllables.
The combined syllables can be force-aligned with the final as the core.
The average pronunciation duration of a combined syllable is determined from the ratio of its total pronunciation duration after forced frame alignment to the number of times it occurs in the training audio.
S204: Take the ratio of a combined syllable's average pronunciation duration to the average pronunciation duration over all combined syllables as its pronunciation cohesion, and take combined syllables whose cohesion is below a set threshold as second syllables.
S205: Merge the first syllables and the second syllables to obtain modeling syllables.
For S205, see the description of S103; it is not repeated here.
S206: Generate an acoustic model from the modeling syllables.
For S206, see the description of S104; it is not repeated here.
S207: Receive a voice request, forwarded by the vehicle, that a user issued in the vehicle cabin.
The server can receive the voice request that the user issued in the vehicle cabin and that the vehicle forwarded; the request may be in French, German, or another language.
S208: Recognize the voice request with the acoustic model, generate a recognition result, and send it to the vehicle to complete the voice interaction.
After receiving the user's voice request, the server recognizes it with the acoustic model and sends the generated recognition result to the vehicle to complete the voice interaction. Any existing method of recognizing a voice request with an acoustic model may be used; the present application does not limit it.
In this solution, IPA-based phoneme merging lets dozens of mixed languages share one softmax output layer, reducing computation and latency. Generating syllables by combining phonetic rules with statistics over the training audio grows the modeling units of a single language from on the order of 40 to on the order of 500, which greatly reduces the learning difficulty, improves the recognition rate, and facilitates speech recognition and voice interaction across languages.
FIG. 3 is a schematic flowchart of speech modeling in a voice interaction method according to the present application; the method can be applied to a server. In FIG. 3, the multilingual training materials are audio training materials ("training audio") by way of example; in this application the multilingual training materials may be audio training materials, video training materials, or both used together.
Referring to FIG. 3, the method includes:
S301: Acquire phonemes of different languages.
Taking Europe as an example, there are many languages across several language families, more than ten languages in all.
This step can acquire phonemes of different languages, e.g., French, English, and German phonemes. For example, the acquired French phonemes include b r e m ..., and the acquired German phonemes include k a m ....
A phoneme (phone) is the smallest speech unit divided according to the natural attributes of speech; analyzed by the articulatory actions within a syllable, one action constitutes one phoneme. Phonemes fall into two classes, vowels and consonants. For example, the Chinese syllable 啊 (ā) has one phoneme, 爱 (ài) has two, and 代 (dài) has three. A phoneme is also the smallest unit, or smallest speech segment, making up a syllable. The symbols of the International Phonetic Alphabet correspond one-to-one with the phonemes of human languages.
A syllable is the smallest pronunciation unit in which single vowel and consonant phonemes combine; in phonetics a syllable is the basic unit of speech structure composed of one or more phonemes, whereas a phoneme is the smallest speech unit. For example, the syllable of the Chinese character 好 is h_ao3, where 3 denotes the tone. Syllables may also be written without tones.
S302: Pre-merge the phonemes of the different languages according to the IPA.
Among the different European languages, many actually belong to the same language family and share much in pronunciation. The present application takes full account of these correlations and merges the phonemes of different languages using the International Phonetic Alphabet (IPA), an international unified standard. The IPA is a notation system based on the Latin alphabet, designed by the International Phonetic Association as a standardized representation of spoken sounds. In the IPA, the more similar two languages are, the more phonemes they share: for example, English has 39 IPA phonemes and French 36, of which 19 coincide, so merging the English and French inventories via the IPA leaves only 56 phonemes, reducing the phoneme count.
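The pre-merging can be pictured as taking the union of per-language IPA inventories. The sketch below is a minimal illustration under assumed, made-up phoneme-to-IPA mappings; they are not the patent's actual tables:

```python
# A minimal sketch of IPA-based phoneme pre-merging. Each language's phonemes
# are mapped to IPA symbols and the inventories are merged by set union, so
# symbols shared across languages collapse into one unit.

# Hypothetical per-language phoneme-to-IPA mappings (assumed for illustration).
LANG_TO_IPA = {
    "en": {"b": "b", "r": "ɹ", "e": "ɛ", "m": "m", "ae": "æ"},
    "fr": {"b": "b", "r": "ʁ", "e": "ɛ", "m": "m", "u": "y"},
}

def merge_inventories(lang_to_ipa):
    """Union of all languages' IPA symbols: shared phonemes merge into one."""
    merged = set()
    for mapping in lang_to_ipa.values():
        merged.update(mapping.values())
    return merged

merged = merge_inventories(LANG_TO_IPA)
# With English (39 IPA phonemes) and French (36), 19 coincide,
# so the full merged inventory would have 39 + 36 - 19 = 56 units.
print(sorted(merged))
```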
Because phonemes of multiple languages are merged, the languages can be mixed directly in one model with a single output layer that covers all of them, so no separate language-identification step is needed. IPA-based phoneme merging lets the languages share modeling units, and in turn share part of the data, so different languages improve each other's performance.
S303: Merge the pre-merged phonemes into first syllables according to phonetic rules.
In S303, from the phonemes pre-merged via the IPA, initial and final phonemes can be merged to obtain first syllables, and each remaining single initial phoneme and single final phoneme is taken as a first syllable on its own.
The phonetic rule of the present application may be to merge an initial plus a final into one syllable, with each single initial and single final left over after merging also treated as a syllable on its own. A syllable obtained by this rule-based merging generally contains one or two phonemes. An initial is the consonant before a final; together with the final it forms a complete syllable. A consonant is the sound produced when the airflow is obstructed in various ways in the mouth. A final consisting of a single vowel is called a simple final. Some syllables have no initial at their start, and a final alone can form a syllable.
For example: salad (five phonemes: s ae l ax d) merges into three syllables: s_ae l_ax d; five phonemes are thus merged into three syllables.
For example, merging the acquired French phonemes b r e m ... and the acquired German phonemes k a m ... according to the phonetic rules yields syllables such as b_e, r_e, and k_a as first syllables. A first syllable generally contains one or two phonemes.
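A minimal sketch of this rule follows, assuming simple illustrative initial/final sets (not the patent's tables) and a greedy left-to-right initial+final pairing:

```python
# A minimal sketch of merging pre-merged phonemes into first syllables:
# each initial is paired with the following final; leftovers stand alone.

INITIALS = {"s", "l", "d", "b", "r", "k", "m"}
FINALS = {"ae", "ax", "e", "a"}

def to_first_syllables(phonemes):
    """Greedy initial+final pairing, e.g. ['s','ae','l','ax','d'] -> ['s_ae','l_ax','d']."""
    syllables, i = [], 0
    while i < len(phonemes):
        if (i + 1 < len(phonemes)
                and phonemes[i] in INITIALS and phonemes[i + 1] in FINALS):
            syllables.append(f"{phonemes[i]}_{phonemes[i + 1]}")
            i += 2
        else:  # a single initial or single final stands alone as a syllable
            syllables.append(phonemes[i])
            i += 1
    return syllables

print(to_first_syllables(["s", "ae", "l", "ax", "d"]))  # ['s_ae', 'l_ax', 'd']
print(to_first_syllables(["b", "r", "e", "m"]))         # ['b', 'r_e', 'm']
```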
The present application merges phonemes based on the IPA in a context-independent manner, converting context-dependent triphones into context-independent syllables. This yields enough pronunciation units for them to be well separated from one another and easy to learn, while keeping the number of context-independent syllables manageable; larger pronunciation units are also more robust to noise.
Context-dependent means that a modeling unit depends on the context in which it occurs: the same pronunciation symbol in different contexts is a different modeling unit. Context-independent means that the same pronunciation symbol is the same modeling unit regardless of context.
Context-dependent modeling is highly discriminative: a given modeling unit always carries the same context, so each unit is easy to learn. One drawback, however, is the number of units. Suppose there are 10 languages with 30 phonemes each: diphone modeling then needs 300 × 300 = 90,000 units and triphone modeling 300 × 300 × 300, generally far more than a system can accept. Multilingual systems therefore generally cannot use context-dependent modeling units. Another drawback of context-dependent modeling is poor transferability when the model is tightly bound to the training corpus; for example, a model trained on a music corpus is hard to transfer to navigation. Hence the context-dependent modeling of the related art is strongly scene-dependent and cannot achieve unified multilingual modeling.
FIG. 5 compares context-dependent and context-independent modeling. Taking the pronunciation symbol ae as an example: for the English word happy the phonemes include h ae p iy, and for the English word salad they include s ae l ax d. Context-dependent modeling includes the two modeling units h_ae and p_iy, whereas the context-independent modeling of the present application includes only the single modeling unit ae.
Referring further to FIG. 6, which shows the modeling units: the left box contains context-dependent phonemes, where x denotes the context; the middle box contains context-independent phonemes, where the two occurrences of ae are the same modeling unit; and the right box contains the context-independent syllables formed by merging the context-independent phonemes.
S304: Acquire training audio in different languages and identify the pronounced combined syllables from it.
This step can acquire training audio in different languages, e.g., French, English, and German training audio.
Using existing speech recognition techniques, pronounced combined syllables such as b_r_e, b_e_m, r_e_m, and r_e_b can be identified from the training audio.
Note that there is no required order between S304 and S301.
S305: Screen out, from the combined syllables, those that conform to the syllable merging rules.
After the pronounced combined syllables are identified from the training audio, each is checked against the syllable merging rules, and the conforming ones are kept according to the result.
The syllable merging rules include at least one of the following patterns: initial + initial + final; initial + initial + final + special initial; initial + final + special initial.
1) Initial + initial + final.
2) Initial + initial + final + special initial (n/m).
For example b_r_i_n and b_r_e_m, where the special initials n/m are cohesive single phonemes whose pronunciation is generally short in single-phoneme statistics.
3) Initial + final + special initial (n/m).
After combined syllables such as b_r_e, b_e_m, r_e_m, and r_e_b are identified from the training audio, filtering by the syllable merging rules keeps the conforming combined syllables b_r_e, b_e_m, and r_e_m; r_e_b ends in an ordinary initial rather than a final or special initial, so it does not conform and is excluded.
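A minimal sketch of this rule filter follows; the initial/final sets are illustrative assumptions, with n and m treated as the special initials named above:

```python
# A minimal sketch of the syllable-merging-rule filter over combined syllables.

INITIALS = {"b", "r", "k", "m", "n", "s", "l", "d"}
FINALS = {"e", "a", "i", "ae", "ax"}
SPECIAL_INITIALS = {"n", "m"}

def matches_merging_rule(combo):
    """True if a combined syllable like 'b_r_e' fits one of the three patterns."""
    parts = combo.split("_")
    kinds = ["I" if p in INITIALS else "F" if p in FINALS else "?" for p in parts]
    # a trailing n/m counts as a special initial ("S")
    if parts[-1] in SPECIAL_INITIALS and len(parts) > 1:
        kinds[-1] = "S"
    patterns = {
        ("I", "I", "F"),        # initial + initial + final
        ("I", "I", "F", "S"),   # initial + initial + final + special initial
        ("I", "F", "S"),        # initial + final + special initial
    }
    return tuple(kinds) in patterns

for combo in ["b_r_e", "b_e_m", "r_e_m", "r_e_b"]:
    print(combo, matches_merging_rule(combo))  # True, True, True, False
```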
S306: Perform forced frame alignment on the combined syllables that conform to the syllable merging rules.
From the training audio, the combined syllables that conform to the syllable merging rules are force-aligned at the frame level to obtain the forced-alignment result.
Forced frame alignment is the process of obtaining the label of each frame given the audio and its corresponding text transcription. The combined syllables in the training audio can be aligned with the final as the core.
For example: for 200 frames of audio transcribed as n i3 h ao3 (where 3 denotes the tone), forced frame alignment gives: n (frames 1-30), i3 (frames 31-100), h (frames 101-120), ao3 (frames 121-200). That is, frames 1-30 are n, frames 31-100 are i3, frames 101-120 are h, and frames 121-200 are ao3.
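Converting such an alignment into per-unit durations is straightforward; the sketch below assumes an illustrative (unit, first frame, last frame) format and a 10 ms frame shift, neither of which is fixed by the text:

```python
# A minimal sketch of turning a forced-alignment result into per-unit durations.

FRAME_SHIFT_S = 0.01  # assumed 10 ms per frame

alignment = [("n", 1, 30), ("i3", 31, 100), ("h", 101, 120), ("ao3", 121, 200)]

def unit_durations(aligned):
    """Map each aligned unit occurrence to its duration in seconds."""
    return [(unit, (last - first + 1) * FRAME_SHIFT_S)
            for unit, first, last in aligned]

print(unit_durations(alignment))
# [('n', 0.3), ('i3', 0.7), ('h', 0.2), ('ao3', 0.8)]
```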
Note that the speech modeling may be done with or without tones.
S307: Determine the average pronunciation duration of each aligned combined syllable and the average pronunciation duration over all combined syllables.
After the forced-alignment result is obtained, for the aligned combined syllables, the pronunciation durations of all combined syllables (phoneme combinations) of length 3 and 4 are collected across their different contexts, and the mean is taken as the average pronunciation duration of that combined syllable (an initial-final combination). The average pronunciation duration of a combined syllable can be determined from the ratio of its total pronunciation duration after forced frame alignment to the number of times it occurs in the training audio. The average pronunciation duration over all combined syllables is computed as well.
For example, if the three phonemes b, r, e of the combined syllable b_r_e occur consecutively 1000 times in the training audio and the total aligned duration of (b + r + e) is 500 s, the average pronunciation duration of this three-phoneme combination is N(b_r_e) = 500 / 1000 = 0.5 s.
As another example, if the phoneme combination a_b occurs 1000 times in 10,000 sentences, adding up the 1000 durations and dividing by 1000 gives the average pronunciation duration of the combination a_b.
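Accumulating these statistics over aligned occurrences can be sketched as below; the occurrence list is hypothetical data, not statistics from the patent's training audio:

```python
# A minimal sketch of computing average pronunciation durations per combined
# syllable from aligned occurrences: N(combo) = total duration / occurrence count.
from collections import defaultdict

# (combined syllable, duration of this occurrence in seconds)
occurrences = [("b_r_e", 0.48), ("b_r_e", 0.52), ("b_e_m", 0.61), ("r_e_m", 0.55)]

def average_durations(occ):
    """N(combo) = total aligned duration / number of occurrences."""
    total, count = defaultdict(float), defaultdict(int)
    for combo, dur in occ:
        total[combo] += dur
        count[combo] += 1
    return {c: total[c] / count[c] for c in total}

N = average_durations(occurrences)
print(N["b_r_e"])  # (0.48 + 0.52) / 2 = 0.5
```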
S308: Determine the pronunciation cohesion, and take combined syllables whose cohesion is below a set threshold as second syllables.
The ratio of a combined syllable's average pronunciation duration to the average pronunciation duration over all combined syllables can be taken as its pronunciation cohesion, and combined syllables whose cohesion is below a set threshold are taken as second syllables. The lower the cohesion value, the more strongly the phonemes of the combined syllable belong together.
Assuming the threshold is set to 0.5, a combined syllable whose pronunciation cohesion is below 0.5 is output as a cohesive combined syllable, i.e., a second syllable.
The determination of pronunciation cohesion in the different cases of the syllable merging rules is illustrated below. In the formulas, x, y, m denote initials, z denotes a final, s denotes any initial/final, N denotes an average pronunciation duration, p denotes pronunciation cohesion, and T denotes a pronunciation duration.
1) Initial + initial + final: the pronunciation cohesion of x, y, z, i.e., following the description below, p(x_y_z) = N(x_y_z) / N(x_*_z).
2) Initial + initial + final + special initial: the pronunciation cohesion of x, y, z, m.
3) Initial + final + special initial: the pronunciation cohesion of x, z, m, i.e., following the example below, p(x_z_m) = N(x_z_m) / N(x_z_*).
Taking the cohesion of x, y, z as an example: the denominator of the formula is the average pronunciation duration over all occurring phoneme combinations x, s, z (all such combined syllables), and the numerator is the average pronunciation duration of the phoneme combination x, y, z (the aligned combined syllable). The smaller the cohesion p, the shorter the three phonemes x, y, z last when they occur together, the more cohesive they are, and the more they should be merged.
For example:
The pronunciation cohesion of the combined syllable b_r_e is:
p(b_r_e) = N(b_r_e) / N(b_*_e)
where * stands for all initials, N denotes an average pronunciation duration, N(b_r_e) is the average pronunciation duration of the combined syllable b_r_e, and N(b_*_e) is the average pronunciation duration over all combined syllables formed by combining any initial with b and e.
If b_r_e is highly cohesive, its average pronunciation duration is generally shorter than the average pronunciation duration of b_*_e.
The pronunciation cohesion of the combined syllable r_e_m is:
p(r_e_m) = N(r_e_m) / N(r_e_*)
where * stands for all initials, N denotes an average pronunciation duration, N(r_e_m) is the average pronunciation duration of the combined syllable r_e_m, and N(r_e_*) is the average pronunciation duration over all combined syllables formed by combining r and e with any initial.
For example, among the rule-conforming combined syllables b_r_e, b_e_m, and r_e_m above, the cohesion test keeps the qualifying combined syllables b_r_e and b_e_m as second syllables. A second syllable that passes the cohesion test generally contains three or four phonemes.
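The threshold test can be sketched as follows; the average durations N are made-up illustrative values chosen so that b_r_e and b_e_m pass while r_e_m does not, mirroring the example above:

```python
# A minimal sketch of the cohesion test p(x_y_z) = N(x_y_z) / N(x_*_z),
# using hypothetical average durations (seconds), not statistics from the patent.

N = {  # average pronunciation duration per combined syllable
    "b_r_e": 0.35, "b_l_e": 1.00, "b_s_e": 1.05,   # family b_*_e
    "b_e_m": 0.35, "b_e_n": 1.25,                  # family b_e_*
    "r_e_m": 0.80, "r_e_n": 0.80,                  # family r_e_*
}

def cohesion(combo, family):
    """p = N(combo) / mean of N over every combined syllable in its family."""
    mean_family = sum(N[c] for c in family) / len(family)
    return N[combo] / mean_family

THRESHOLD = 0.5
second_syllables = []
for combo, family in [("b_r_e", ["b_r_e", "b_l_e", "b_s_e"]),
                      ("b_e_m", ["b_e_m", "b_e_n"]),
                      ("r_e_m", ["r_e_m", "r_e_n"])]:
    p = cohesion(combo, family)
    if p < THRESHOLD:
        second_syllables.append(combo)
    print(combo, round(p, 2))   # b_r_e 0.44, b_e_m 0.44, r_e_m 1.0

print(second_syllables)  # ['b_r_e', 'b_e_m'] with these made-up numbers
```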
S309: Merge the first syllables and the second syllables to obtain the modeling syllables.
Merging the first syllables and second syllables obtained in the separate steps above generates the final syllable set, i.e., the modeling syllables.
For example, merging the first syllables b_e, r_e, k_a obtained above with the second syllables b_r_e, b_e_m gives the final modeling syllables b_e, r_e, k_a, b_r_e, b_e_m, ....
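The merge itself amounts to a simple set union, as this sketch with the example values above shows:

```python
# A minimal sketch of the final merge: the modeling-syllable inventory is the
# union of the rule-derived first syllables and the cohesion-screened second
# syllables, using the example values from the text.

first_syllables = {"b_e", "r_e", "k_a"}
second_syllables = {"b_r_e", "b_e_m"}

modeling_syllables = sorted(first_syllables | second_syllables)
print(modeling_syllables)  # ['b_e', 'b_e_m', 'b_r_e', 'k_a', 'r_e']
```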
The modeling method of the present application can be context-independent, which compresses the number of modeling units: different contexts share the same modeling unit, so dozens of languages can use one modeling system, achieving unified multilingual modeling and more convenient speech recognition and voice interaction across languages. Data can be shared as well: the same modeling unit may exist in several languages, which greatly reduces the dependence on per-language data volume. By merging phonemes into syllables based on the IPA and generating modeling syllables from phonetic rules fused with training-audio statistics, the modeling units of a single language grow from on the order of 40 to on the order of 500, greatly reducing the learning difficulty and improving the recognition rate. The IPA-based syllable modeling lets dozens of mixed languages share one softmax output layer, which not only lowers computation and latency but also moves cross-language comparison inside the model, turning it into a learnable objective rather than hand-assigned per-language weights. Converting the modeling units from context-dependent triphones to context-independent syllables provides enough pronunciation units for them to be well separated and easy to learn, keeps the number of context-independent syllables manageable, and gives larger pronunciation units with stronger noise robustness.
FIG. 4 is a schematic diagram of an application framework for speech recognition using the speech modeling of the present application.
Referring to FIG. 4, the vehicle's on-board system receives the to-be-recognized voice request (query) from the user in the cabin and sends it to the server. After receiving the request forwarded by the vehicle, the server performs speech recognition on it using the acoustic model built from the modeling syllables of the present application together with the associated decoder, obtains the speech recognition result, and sends the result down to the vehicle to complete the voice interaction. The construction of the acoustic model is shown on the right of FIG. 4: the user's speech audio can be processed by neural-network layers, e.g., LSTM (long short-term memory) hidden layers, to output phoneme feature vectors; the phonemes of different languages, such as German, English, and French phonemes, are then merged based on the IPA, so that dozens of mixed languages can use a single softmax output layer and no subsequent language-identification step is needed. A more detailed construction process of the acoustic model is described in the flow of FIG. 3. Through IPA-based phoneme merging, the languages share modeling units, and in turn part of the data, so that different languages improve each other's performance.
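The framework can be pictured with the minimal PyTorch-style sketch below: LSTM hidden layers over acoustic features feed one softmax output layer whose units are the merged multilingual modeling syllables. The layer sizes, feature dimension, and syllable-inventory size are illustrative assumptions, not the patent's configuration:

```python
# A minimal sketch of the FIG. 4 framework: a shared LSTM stack with a single
# output layer covering the merged multilingual syllable inventory.
import torch
import torch.nn as nn

class MultilingualAcousticModel(nn.Module):
    def __init__(self, feat_dim=80, hidden=512, num_syllables=500):
        super().__init__()
        # shared LSTM hidden layers over e.g. filterbank frames
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        # one softmax output layer covering all languages' modeling syllables
        self.out = nn.Linear(hidden, num_syllables)

    def forward(self, feats):                    # feats: (batch, frames, feat_dim)
        h, _ = self.lstm(feats)
        return self.out(h).log_softmax(dim=-1)   # per-frame syllable posteriors

model = MultilingualAcousticModel()
frames = torch.randn(1, 200, 80)   # 200 frames of 80-dim features
posteriors = model(frames)
print(posteriors.shape)            # torch.Size([1, 200, 500])
```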
与前述应用功能实现方法相对应,本申请还提供了一种服务器。Corresponding to the foregoing application function implementation method, the present application further provides a server.
图7是本申请示出的服务器的结构示意图。FIG. 7 is a schematic structural diagram of a server shown in this application.
参见图7,本申请提供的服务器70,包括:音素处理模块71、训练处理模块72、音节合并模块73、模型生成模块74、请求接收模块75、语音识别模块76。7 , the server 70 provided by this application includes: a phoneme processing module 71 , a training processing module 72 , a syllable merging module 73 , a model generating module 74 , a request receiving module 75 , and a speech recognition module 76 .
音素处理模块71,用于获取不同语种的音素,根据发音学规则将不同语种的音素合并为第一音节。音素处理模块71可以根据万国音标规则将不同语种的音素进行预合并;根据发音学规则将进行预合并后的音素合并为第一音节。The phoneme processing module 71 is configured to acquire phonemes of different languages, and combine the phonemes of different languages into the first syllable according to the phonetic rules. The phoneme processing module 71 can pre-merge phonemes of different languages according to the rules of the Universal Phonetic Symbols; and merge the pre-merged phonemes into the first syllable according to the phonetic rules.
训练处理模块72,用于获取不同语种的训练音频,利用不同语种的训练材料识别出发音的组合音节,根据发音黏着度从组合音节中筛选出第二音节。训练处理模块72可以将组合音节进行强制帧对齐;确定对齐后的组合音节的平均发音持续时长和所有组合音节的平均发音持续时长;将组合音节的平均发音持续时长与所有组合音节的平均发音持续时长的比值作为发音黏着度,将发音黏着度小于设定阈值的组合音节作为第二音节。本申请可以利用不同语种的音频和/或视频的训练材料识别出发音的组合音节。The training processing module 72 is used for acquiring training audios of different languages, identifying the combined syllables of pronunciation by using the training materials of different languages, and selecting the second syllable from the combined syllables according to the degree of pronunciation adhesion. The training processing module 72 can perform forced frame alignment of the combined syllables; determine the average pronunciation duration of the aligned combined syllables and the average pronunciation duration of all combined syllables; The ratio of the duration is used as the pronunciation cohesion, and the combined syllable whose pronunciation cohesion is less than the set threshold is regarded as the second syllable. The present application can use the audio and/or video training materials in different languages to recognize the combined syllables of pronunciation.
The syllable merging module 73 is configured to merge the first syllables obtained by the phoneme processing module 71 with the second syllables obtained by the training processing module 72 to obtain the modeling syllables (a minimal illustrative sketch of this merge follows the module list below).
The model generation module 74 is configured to generate an acoustic model from the modeling syllables.
The request receiving module 75 is configured to receive a voice request issued by a user in the vehicle cockpit and forwarded by the vehicle.
The speech recognition module 76 is configured to recognize the voice request received by the request receiving module 75 using the acoustic model generated by the model generation module 74, and to send the generated recognition result to the vehicle to complete the voice interaction.
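As a minimal sketch of the merge in module 73: reading the merge as a deduplicating set union of the two syllable inventories is an assumption, and the example syllables are invented.

```python
# Sketch: treating the merge as a deduplicating set union is an assumption.
first_syllables = {"ba", "to", "s"}        # from phonetic-rule merging (module 71)
second_syllables = {"sta", "ble", "to"}    # from cohesion-based screening (module 72)
modeling_syllables = sorted(first_syllables | second_syllables)
print(modeling_syllables)   # ['ba', 'ble', 's', 'sta', 'to']
```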
FIG. 8 is another schematic structural diagram of a server according to the present application.
Referring to FIG. 8, the server 70 provided by the present application includes: a phoneme processing module 71, a training processing module 72, a syllable merging module 73, a model generation module 74, a request receiving module 75, and a speech recognition module 76.
The phoneme processing module 71 includes: a pre-merging sub-module 711 and a phoneme merging sub-module 712.
The pre-merging sub-module 711 is configured to pre-merge phonemes of different languages according to the IPA.
The phoneme merging sub-module 712 is configured to merge the pre-merged phonemes into first syllables according to phonetic rules. For example, the phoneme merging sub-module 712 may combine initial-consonant phonemes with final (vowel) phonemes among the pre-merged phonemes to obtain first syllables, and treat each remaining single initial phoneme or single final phoneme as a first syllable on its own.
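As an illustration of the phoneme merging sub-module 712, the following sketch greedily pairs initial phonemes with the finals that follow them; the phoneme inventories and the pairing rule are hypothetical placeholders, not the patent's actual lexicon.

```python
# Hypothetical sketch: phoneme inventories and pairing logic are illustrative.
INITIALS = {"b", "d", "g", "s", "t"}   # consonant (initial) phonemes after IPA pre-merge
FINALS = {"a", "e", "i", "o", "u"}     # vowel (final) phonemes after IPA pre-merge

def merge_to_first_syllables(phonemes):
    """Pair each initial with the final that follows it to form a syllable;
    any unpaired single initial or final becomes a first syllable on its own."""
    syllables, i = [], 0
    while i < len(phonemes):
        if (phonemes[i] in INITIALS and i + 1 < len(phonemes)
                and phonemes[i + 1] in FINALS):
            syllables.append(phonemes[i] + phonemes[i + 1])  # initial + final
            i += 2
        else:
            syllables.append(phonemes[i])  # lone initial or final stands alone
            i += 1
    return syllables

print(merge_to_first_syllables(["b", "a", "s", "t", "o"]))  # ['ba', 's', 'to']
```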
The training processing module 72 includes: an alignment and statistics module 721 and a syllable screening module 722.
The alignment and statistics module 721 is configured to force-align the combined syllables at the frame level and to obtain the average pronunciation duration of each aligned combined syllable and the average pronunciation duration over all combined syllables.
The syllable screening module 722 is configured to take the ratio of a combined syllable's average pronunciation duration to the average pronunciation duration over all combined syllables as its pronunciation cohesion, and to take the combined syllables whose pronunciation cohesion is below a set threshold as second syllables. The average pronunciation duration of a combined syllable may be determined as the ratio of its total pronunciation duration after forced frame alignment to the number of times the combined syllable occurs in the training audio.
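The duration statistics and cohesion filter described for modules 721 and 722 can be sketched compactly. The alignment-output format, the reading of the overall average as a mean of per-syllable averages, and the threshold value below are all assumptions.

```python
# Illustrative sketch: alignment output format and threshold are assumptions.
def select_second_syllables(align_stats, threshold=0.8):
    """align_stats maps a combined syllable to (total_duration, occurrence_count)
    obtained from forced frame alignment on the training audio."""
    # Per-syllable average pronunciation duration:
    # total aligned duration divided by occurrence count.
    avg = {s: total / count for s, (total, count) in align_stats.items()}
    # Average pronunciation duration over all combined syllables
    # (interpreted here as the mean of the per-syllable averages).
    overall = sum(avg.values()) / len(avg)
    # Cohesion = per-syllable average / overall average; keep tightly
    # articulated (relatively short) combinations below the threshold.
    return [s for s, a in avg.items() if a / overall < threshold]

stats = {"sta": (12.0, 40), "ble": (9.0, 25), "stra": (20.0, 30)}
print(select_second_syllables(stats))   # ['sta']
```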
The training processing module 72 may further include: a rule screening module 723.
The rule screening module 723 selects, from the combined syllables, those that conform to the syllable merging rules; the alignment and statistics module 721 then force-aligns the rule-conforming combined syllables at the frame level.
The syllable merging rules include at least one of the following patterns: initial + initial + final; initial + initial + final + special initial; initial + final + special initial.
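A minimal sketch of checking a candidate combination against these merging rules follows; the classification of each phoneme as initial, final, or special initial is assumed to come from a pronunciation lexicon and is mocked here.

```python
# Sketch: the phoneme-class patterns are taken from the rules above; the
# classification of a given phoneme sequence is a mocked assumption.
PATTERNS = [
    ("initial", "initial", "final"),
    ("initial", "initial", "final", "special_initial"),
    ("initial", "final", "special_initial"),
]

def matches_merging_rule(phoneme_classes):
    """phoneme_classes: tuple like ('initial', 'final', 'special_initial')."""
    return tuple(phoneme_classes) in PATTERNS

print(matches_merging_rule(("initial", "initial", "final")))   # True
print(matches_merging_rule(("final", "initial")))              # False
```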
The server of the present application merges the phonemes of different languages into first syllables according to phonetic rules, selects second syllables from the combined syllables of the training audio according to pronunciation cohesion, and then merges the two to obtain the modeling syllables. Because these syllables are context-independent, different languages can use the same modeling system, enabling unified multilingual modeling, facilitating speech recognition and voice interaction across languages, reducing deployment costs, and avoiding waste of machine resources.
Regarding the server of the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and will not be elaborated here.
FIG. 9 is another schematic structural diagram of the server according to the present application.
Referring to FIG. 9, the server 1000 includes a memory 1010 and a processor 1020.
The processor 1020 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 1010 may include various types of storage units, such as system memory, read-only memory (ROM), and a persistent storage device. The ROM may store static data or instructions required by the processor 1020 or by other modules of the computer. The persistent storage device may be a readable and writable storage device, that is, a non-volatile storage device that retains stored instructions and data even when the computer is powered off. In some embodiments, the persistent storage device is a mass storage device (for example, a magnetic or optical disk, or flash memory); in other embodiments, it may be a removable storage device (for example, a floppy disk or an optical drive). The system memory may be a readable and writable storage device or a volatile readable and writable storage device, such as dynamic random access memory, and may store some or all of the instructions and data that the processor needs at runtime. In addition, the memory 1010 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (for example, DRAM, SRAM, SDRAM, flash memory, or programmable read-only memory); magnetic disks and/or optical discs may also be employed. In some embodiments, the memory 1010 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (for example, DVD-ROM or dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density disc, a flash memory card (for example, an SD card, a mini SD card, or a Micro-SD card), or a magnetic floppy disk. Computer-readable storage media do not include carrier waves or transient electronic signals transmitted wirelessly or over wires.
The memory 1010 stores executable code which, when processed by the processor 1020, causes the processor 1020 to perform some or all of the methods described above.
In addition, the method according to the present application may also be implemented as a computer program or computer program product, which includes computer program code instructions for performing some or all of the steps of the above method of the present application.
Alternatively, the present application may also be implemented as a computer-readable storage medium (or a non-transitory machine-readable storage medium, or a machine-readable storage medium) on which executable code (or a computer program, or computer instruction code) is stored; when the executable code (or computer program, or computer instruction code) is executed by the processor of an electronic device (or a server, etc.), the processor is caused to perform some or all of the steps of the above method according to the present application.
The embodiments of the present application have been described above. The foregoing description is exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or improvements over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.