CN115148186A

CN115148186A - Speech synthesis method, apparatus, readable medium and electronic device

Info

Publication number: CN115148186A
Application number: CN202210764241.8A
Authority: CN
Inventors: 王奕桦; 梅晓; 马泽君
Original assignee: Beijing Youzhuju Network Technology Co Ltd
Current assignee: Beijing Youzhuju Network Technology Co Ltd
Priority date: 2022-06-29
Filing date: 2022-06-29
Publication date: 2022-10-04

Abstract

The present disclosure relates to a speech synthesis method, apparatus, readable medium, and electronic device. The method includes: determining tone annotation information of a target text to be processed, wherein the tone annotation information includes the tone sandhi type of each text unit in the target text, each text unit is composed of at least one unit text, and the tone sandhi type of a text unit indicates the pitch change trend of that text unit; determining prosodic annotation information of the target text and a phoneme sequence corresponding to the target text; and generating synthesized audio corresponding to the target text according to the tone annotation information, the prosodic annotation information, and the phoneme sequence. Thus, when speech synthesis is performed for the target text, a tone sandhi feature is introduced, so that the tone sandhi pattern can be controlled directly during synthesis, which improves the controllability of tone sandhi in speech synthesis and, in turn, the naturalness of the synthesized speech.

Description

Speech synthesis method, apparatus, readable medium and electronic device

Technical Field

The present disclosure relates to the field of computer technology, and in particular, to a speech synthesis method, apparatus, readable medium, and electronic device.

Background

Speech synthesis technology can convert arbitrary text into corresponding audio. It usually comprises two parts: one analyzes the text to obtain linguistic information, and the other generates a sound waveform based on the analysis results. In the related art, the tone sandhi feature is usually not learned, so the tones of the synthesized speech cannot be effectively controlled, and the synthesized audio is not natural enough.

Summary

This section is provided to introduce, in a simplified form, concepts that are described in detail in the Detailed Description below. This section is not intended to identify key features or essential features of the claimed technical solution, nor is it intended to limit the scope of the claimed technical solution.

In a first aspect, the present disclosure provides a speech synthesis method, the method comprising:

determining tone annotation information of a target text to be processed, wherein the tone annotation information includes the tone sandhi type of each text unit in the target text, each text unit is composed of at least one unit text, and the tone sandhi type of a text unit indicates the pitch change trend of that text unit;

determining prosodic annotation information of the target text and a phoneme sequence corresponding to the target text; and

generating synthesized audio corresponding to the target text according to the tone annotation information, the prosodic annotation information, and the phoneme sequence.

In a second aspect, the present disclosure provides a speech synthesis apparatus, the apparatus comprising:

a first determination module, configured to determine tone annotation information of a target text to be processed, wherein the tone annotation information includes the tone sandhi type of each text unit in the target text, each text unit is composed of at least one unit text, and the tone sandhi type of a text unit indicates the pitch change trend of that text unit;

a second determination module, configured to determine prosodic annotation information of the target text and a phoneme sequence corresponding to the target text; and

a generation module, configured to generate synthesized audio corresponding to the target text according to the tone annotation information, the prosodic annotation information, and the phoneme sequence.

In a third aspect, the present disclosure provides a computer-readable medium on which a computer program is stored, wherein the program, when executed by a processing apparatus, implements the steps of the method described in the first aspect of the present disclosure.

In a fourth aspect, the present disclosure provides an electronic device, comprising:

a storage apparatus on which at least one computer program is stored; and

at least one processing apparatus, configured to execute the at least one computer program in the storage apparatus to implement the steps of the method described in the first aspect of the present disclosure.

With the above technical solution, tone annotation information of a target text to be processed is determined, wherein the tone annotation information includes the tone sandhi type of each text unit in the target text, each text unit is composed of at least one unit text, and the tone sandhi type of a text unit indicates the pitch change trend of that text unit; prosodic annotation information of the target text and a phoneme sequence corresponding to the target text are determined; and synthesized audio corresponding to the target text is generated according to the tone annotation information, the prosodic annotation information, and the phoneme sequence. Thus, when speech synthesis is performed for the target text, a tone sandhi feature is introduced in addition to prosodic features, so that the tone sandhi pattern can be controlled directly during synthesis, which improves the controllability of tone sandhi in speech synthesis and, in turn, the naturalness of the synthesized speech.

Other features and advantages of the present disclosure will be described in detail in the Detailed Description that follows.

Brief Description of the Drawings

The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent with reference to the following Detailed Description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale. In the drawings:

FIG. 1 is a flowchart of a speech synthesis method provided according to an embodiment of the present disclosure;

FIG. 2 is a block diagram of a speech synthesis apparatus provided according to an embodiment of the present disclosure;

FIG. 3 is a schematic structural diagram of an electronic device suitable for implementing embodiments of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of protection of the present disclosure.

It should be understood that the steps described in the method embodiments of the present disclosure may be performed in a different order and/or in parallel. Furthermore, method embodiments may include additional steps and/or omit some of the illustrated steps. The scope of the present disclosure is not limited in this regard.

As used herein, the term "including" and its variants are open-ended, i.e., "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.

It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are used only to distinguish different apparatuses, modules, or units, and are not used to limit the order of, or the interdependence between, the functions performed by these apparatuses, modules, or units.

It should be noted that the modifiers "a" and "a plurality of" mentioned in the present disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".

The names of messages or information exchanged between apparatuses in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of these messages or information.

All actions of obtaining signals, information, or data in the present disclosure are carried out in compliance with the applicable data protection laws and policies of the country concerned and with authorization granted by the owner of the corresponding apparatus.

It can be understood that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should be informed, in an appropriate manner and in accordance with relevant laws and regulations, of the type, scope of use, and usage scenarios of the personal information involved in the present disclosure, and the user's authorization should be obtained.

For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly indicate that the requested operation will require obtaining and using the user's personal information. The user can thus decide autonomously, based on the prompt information, whether to provide personal information to the software or hardware, such as an electronic device, application, server, or storage medium, that performs the operations of the technical solution of the present disclosure.

As an optional but non-limiting implementation, the prompt information may be sent to the user, in response to receiving the user's active request, by means of a pop-up window in which the prompt information is presented as text. The pop-up window may also carry a selection control that lets the user choose to "agree" or "disagree" to provide personal information to the electronic device.

It can be understood that the above process of notifying the user and obtaining user authorization is only illustrative and does not limit the implementations of the present disclosure; other approaches that satisfy relevant laws and regulations may also be applied to the implementations of the present disclosure.

It can also be understood that the data involved in this technical solution (including but not limited to the data itself and the acquisition or use of the data) shall comply with the requirements of applicable laws, regulations, and related provisions.

FIG. 1 is a flowchart of a speech synthesis method provided according to an embodiment of the present disclosure. As shown in FIG. 1, the method provided by the present disclosure may include steps 11 to 13.

In step 11, tone annotation information of a target text to be processed is determined.

In the present disclosure, the tone annotation information includes the tone sandhi type of each text unit in the target text, wherein each text unit is composed of at least one unit text, and the tone sandhi type of a text unit indicates the pitch change trend of that text unit.

Here, a unit text is the smallest unit that makes up the target text. For example, if the target text is Chinese text, a unit text is a single character, i.e., a single syllable, and a text unit accordingly consists of at least one character.

The tone sandhi type of a text unit may be defined as follows:

If a text unit consists of one unit text, its tone sandhi type is one of the following: a first type, indicating that the pitch is the original tone of the text unit; or a second type, indicating that the pitch changes from the pitch onset of the text unit's original tone to a level tone;

If a text unit consists of two unit texts, its tone sandhi type is one of the following: a third type, indicating that the pitch falls from a high level tone to a low level tone; a fourth type, indicating that the pitch falls from a high level tone to a mid level tone; a fifth type, indicating that the pitch rises from a mid level tone to a high level tone; a sixth type, indicating that the pitch rises from a low level tone to a high level tone; a seventh type, indicating that the pitch rises from a low level tone to a low rising tone; or an eighth type, indicating that the pitch rises from a low level tone to a mid level tone;

If a text unit consists of more than two unit texts, its tone sandhi type is one of the following: the third type; the fourth type; the seventh type; the eighth type; a ninth type, indicating that the pitch rises from a mid level tone to a high level tone and then falls to a low level tone; a tenth type, indicating that the pitch rises from a mid level tone to a high level tone and then falls to a mid level tone; an eleventh type, indicating that the pitch rises from a low level tone to a high level tone and then falls to a low level tone; or a twelfth type, indicating that the pitch rises from a low level tone to a high level tone and then falls to a mid level tone.
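The constraints above, which tie the permitted tone sandhi types to the number of unit texts in a text unit, can be sketched as a small lookup. This is only an illustration: encoding the twelve types as the integers 1 to 12 and the function names are assumptions made here, not part of the disclosure.

```python
from typing import FrozenSet

# Assumed encoding: the first through twelfth types are the integers 1..12.
TYPES_FOR_ONE_UNIT_TEXT: FrozenSet[int] = frozenset({1, 2})
TYPES_FOR_TWO_UNIT_TEXTS: FrozenSet[int] = frozenset({3, 4, 5, 6, 7, 8})
TYPES_FOR_MORE_UNIT_TEXTS: FrozenSet[int] = frozenset({3, 4, 7, 8, 9, 10, 11, 12})


def allowed_tone_sandhi_types(num_unit_texts: int) -> FrozenSet[int]:
    """Return the tone sandhi types permitted for a text unit of this length."""
    if num_unit_texts < 1:
        raise ValueError("a text unit contains at least one unit text")
    if num_unit_texts == 1:
        return TYPES_FOR_ONE_UNIT_TEXT
    if num_unit_texts == 2:
        return TYPES_FOR_TWO_UNIT_TEXTS
    return TYPES_FOR_MORE_UNIT_TEXTS


def is_valid_annotation(num_unit_texts: int, sandhi_type: int) -> bool:
    """Check one (text unit length, tone sandhi type) pair against the definitions."""
    return sandhi_type in allowed_tone_sandhi_types(num_unit_texts)
```

A check of this kind could be applied to manual annotations or model outputs before they are passed on to synthesis.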

By way of example, the tone sandhi types of text units and their pitch change trends may be as shown in the following table:

Here, H denotes a high level tone, M a mid level tone, L a low level tone, and R a low rising tone, and the symbol "-" separates two adjacent unit texts. Taking a Chinese target text as an example, a unit text is a single character, i.e., a single syllable; accordingly, H-L denotes a disyllabic tone pattern whose first syllable is a high level tone and whose second syllable is a low level tone, so that the overall pitch contour moves from the high level tone to the low level tone.

In the table above, [M]{n} means n consecutive M's, i.e., n consecutive mid level tones; for example, M{2} is M-M, M{3} is M-M-M, and so on.

It should be noted that, taking the third type as an example, if a text unit consists of four unit texts, its pitch change trend is H-M-M-L, and the pitch may fall gradually from the high level tone to the low level tone; that is, although the two middle unit texts are both mid level tones, the pitch of the former may be slightly higher than that of the latter.
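The repetition notation can be expanded mechanically. The sketch below is illustrative only: the function name is hypothetical, and the compact pattern string `H-[M]{2}-L` is an assumed way of writing the four-unit third-type example H-M-M-L, since the table itself is not reproduced here.

```python
import re

# Matches "M{2}" or "[M]{2}": a tone symbol followed by a repeat count.
_REPEAT = re.compile(r"\[?([HMLR])\]?\{(\d+)\}")


def expand_pattern(pattern: str) -> str:
    """Expand repeat notation such as M{3} into M-M-M."""

    def _expand(match: "re.Match[str]") -> str:
        symbol, count = match.group(1), int(match.group(2))
        if count < 1:
            raise ValueError("repeat count must be at least 1")
        return "-".join([symbol] * count)

    return _REPEAT.sub(_expand, pattern)


print(expand_pattern("M{2}"))        # M-M
print(expand_pattern("M{3}"))        # M-M-M
print(expand_pattern("H-[M]{2}-L"))  # H-M-M-L
```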

In step 12, prosodic annotation information of the target text and a phoneme sequence corresponding to the target text are determined.

In the present disclosure, the phoneme sequence corresponding to the text to be synthesized may be obtained through a grapheme-to-phoneme (G2P) model.

For example, the G2P model may use a recurrent neural network (RNN) and a long short-term memory (LSTM) network to convert graphemes into phonemes.

Prosodic annotation information reflects prosody-related content and may include, but is not limited to, prosodic boundary information. A prosodic boundary, also called a break index (abbreviated BRK), describes how information is organized and phrased in the speech stream. Optionally, the prosodic boundary information may include, but is not limited to, sentence boundaries, intonation phrase boundaries, prosodic phrase boundaries, and prosodic word boundaries.

In step 13, synthesized audio corresponding to the target text is generated according to the tone annotation information, the prosodic annotation information, and the phoneme sequence.

To help those skilled in the art better understand the speech synthesis method provided by the present disclosure, the above steps are illustrated in detail below.

First, the tone annotation information used in the present disclosure is explained.

As described above, the tone annotation information includes the tone sandhi type of each text unit in the target text, wherein each text unit is composed of at least one unit text, and the tone sandhi type of a text unit indicates the pitch change trend of that text unit. The manner of determining the tone annotation information of the target text is described in detail below.

In one possible implementation, the tone annotation information of the target text may be determined by manual annotation. That is, a tone annotation operation for the target text may be received, and the tone annotation information of the target text may be generated according to that operation.

In other words, an annotator can annotate the tones of the target text directly, marking in the target text the tone sandhi features expected to be heard in the synthesized audio. The tone annotation operation can be carried out in two steps.

In the first step, text units are delimited based on the content of the target text, i.e., the target text is divided into multiple text units by creating boundaries.

In the second step, each text unit is annotated with its tone sandhi type.

In another possible implementation, the tone annotation information of the target text may be determined by a pre-trained tone annotation model. Accordingly, the tone annotation information of the target text may be obtained as follows:

inputting the target text into the tone annotation model to obtain an output result of the tone annotation model; and

determining the tone annotation information of the target text according to the output result.

Here, the tone annotation model is trained on second training texts carrying tone annotation information, and the output result includes boundary information for dividing the target text into multiple text units and the tone sandhi type corresponding to each text unit.

The second training text may be text extracted from real speech. For such speech, an annotator can listen to it and mark the appropriate positions in the text to obtain the tone annotation information of the second training text. Such annotation relies mainly on the annotator's auditory perception, and the annotation approach can follow the description of tone sandhi types given above.

In one possible implementation, the above output result may be used directly as the tone annotation information of the target text.

In another possible implementation, determining the tone annotation information of the target text according to the output result may include the following steps:

determining whether a correction instruction for the output result has been received; and

when a correction instruction has been received, correcting the output result according to the correction instruction, and determining the corrected output result as the tone annotation information of the target text.

Here, the correction instruction instructs a change to at least one of the boundary information and the tone sandhi types in the output result.

In other words, an annotator can also correct the model's output result through a correction instruction, which keeps annotation accuracy high while still improving the efficiency of determining the tone annotation information.
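The two alternatives above (use the model output directly, or correct it first) can be sketched as follows. The data layout is an assumption made for illustration: boundaries are encoded as the number of unit texts in each text unit, tone sandhi types are integers, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToneAnnotation:
    unit_lengths: List[int]  # unit texts per text unit (encodes the boundaries)
    sandhi_types: List[int]  # one tone sandhi type per text unit


@dataclass
class Correction:
    # Either field may be omitted; at least one is expected to be set.
    unit_lengths: Optional[List[int]] = None
    sandhi_types: Optional[List[int]] = None


def resolve_annotation(
    model_output: ToneAnnotation, correction: Optional[Correction] = None
) -> ToneAnnotation:
    """Return the model output as-is, or a corrected copy if a correction was received."""
    if correction is None:
        return model_output
    lengths = (
        correction.unit_lengths
        if correction.unit_lengths is not None
        else model_output.unit_lengths
    )
    types = (
        correction.sandhi_types
        if correction.sandhi_types is not None
        else model_output.sandhi_types
    )
    if len(lengths) != len(types):
        raise ValueError("each text unit needs exactly one tone sandhi type")
    return ToneAnnotation(list(lengths), list(types))
```

In this sketch, a correction that changes the boundaries so that the number of text units changes must also supply new types, which is why the length check is performed after merging.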

In the above manner, tone annotation information corresponding to the real speech of the second training text can be obtained. The neural network model is then trained with the second training text as its input and the corresponding tone annotation information as its target output; once training is complete, a tone annotation model capable of automatically generating tone annotation information for text is obtained. A piece of text can then be input into the tone annotation model to automatically obtain the corresponding tone annotation information, without manual annotation, which improves the efficiency of determining tone annotation information.

Returning to FIG. 1, in step 13, synthesized audio corresponding to the target text is generated according to the tone annotation information, the prosodic annotation information, and the phoneme sequence.

In one possible implementation, step 13 may include the following steps:

determining a tone sandhi label sequence corresponding to the target text according to the tone annotation information;

determining a prosodic label sequence corresponding to the target text according to the prosodic annotation information;

generating acoustic feature information corresponding to the target text by using a pre-trained speech synthesis model, according to the tone sandhi label sequence, the prosodic label sequence, and the phoneme sequence; and

performing speech synthesis on the acoustic feature information with a vocoder to generate synthesized audio corresponding to the target text.

The idea behind determining the tone sandhi labels is that the unit texts of the same text unit share the same tone sandhi label, i.e., the tone sandhi label of each unit text in a text unit is the same as that of the text unit. The tone sandhi label sequence is then obtained by following the order in which the text units appear in the target text.
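The label-sharing rule can be sketched in a few lines. The input layout (a list of text units, each paired with an integer tone sandhi type) and the function name are assumptions for illustration; aligning the resulting unit-text-level labels with phonemes is not covered here.

```python
from typing import List, Sequence, Tuple


def tone_sandhi_label_sequence(
    text_units: Sequence[Tuple[Sequence[str], int]]
) -> List[int]:
    """Give every unit text the label of the text unit it belongs to, in text order."""
    labels: List[int] = []
    for unit_texts, sandhi_type in text_units:
        labels.extend([sandhi_type] * len(unit_texts))
    return labels


# Two text units: one with two unit texts (type 3) and one with a single unit text (type 1).
units = [(["u1", "u2"], 3), (["u3"], 1)]
print(tone_sandhi_label_sequence(units))  # [3, 3, 1]
```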

As described above, the prosodic annotation information may include prosodic boundary information; accordingly, the prosodic labels may include prosodic boundary labels.

Prosodic annotation information generally marks a position in the text, for example, indicating that a certain position is a prosodic phrase boundary. To facilitate subsequent speech synthesis and ensure that the prosodic annotation information can be matched one to one with the phonemes of the text to be synthesized, phoneme-level prosodic labels can be further determined based on the prosodic annotation information.

The idea behind determining prosodic labels is that, at phoneme positions where prosodic annotation information exists, label content is generated according to the annotation, while at phoneme positions without prosodic annotation information, a designated placeholder is used instead. For example, for the phoneme sequence {A1, A2, A3, A4, A5, A6}, suppose the prosodic annotation information includes prosodic boundary information indicating a prosodic phrase boundary at A2 and an intonation phrase boundary at A5, and suppose prosodic phrase boundaries are represented by 3, intonation phrase boundaries by 4, and unmarked positions by N2; the resulting prosodic boundary label sequence is then {N2, 3, N2, N2, 4, N2}.
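The worked example above can be reproduced with a short sketch. The codes 3, 4, and N2 come from the example; the dictionary keys, the zero-based position indexing, and the function name are assumptions made for illustration.

```python
from typing import Dict, List

# Codes taken from the example: 3 = prosodic phrase, 4 = intonation phrase.
BOUNDARY_CODES = {"prosodic_phrase": "3", "intonation_phrase": "4"}
NO_BOUNDARY = "N2"  # placeholder for phoneme positions without an annotation


def prosodic_boundary_labels(
    num_phonemes: int, annotations: Dict[int, str]
) -> List[str]:
    """Build one label per phoneme; `annotations` maps zero-based positions to boundary kinds."""
    labels: List[str] = []
    for position in range(num_phonemes):
        kind = annotations.get(position)
        labels.append(BOUNDARY_CODES[kind] if kind is not None else NO_BOUNDARY)
    return labels


phonemes = ["A1", "A2", "A3", "A4", "A5", "A6"]
labels = prosodic_boundary_labels(
    len(phonemes), {1: "prosodic_phrase", 4: "intonation_phrase"}
)
print(labels)  # ['N2', '3', 'N2', 'N2', '4', 'N2']
```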

When generating the acoustic feature information corresponding to the target text, the tone sandhi label sequence, the prosodic label sequence, and the phoneme sequence may be input into the pre-trained speech synthesis model to obtain the acoustic feature information corresponding to the target text. For example, the acoustic feature information may be a Mel spectrogram, a linear spectrogram, or the like.

The above speech synthesis model may include an encoding network, an attention network, and a decoding network. The encoding network generates a text representation sequence from the concatenated vector corresponding to the tone sandhi label sequence, the prosodic label sequence, and the phoneme sequence; the attention network generates a semantic representation from the text representation sequence; and the decoding network outputs the acoustic feature information corresponding to the target text based on the semantic representation.

The input to the encoding network (encoder) of the speech synthesis model is a vector representation of the target text. It may include a first vector obtained by embedding the phoneme sequence, a second vector obtained by embedding the tone sandhi label sequence, and a third vector obtained by embedding the prosodic label sequence; these are concatenated to form the concatenated vector that serves as the input to the encoding network. The encoding network then outputs the text representation sequence (text embedding, TE) of the target text. The text representation sequence output by the encoding network passes through the attention network to produce a context vector C, which serves as the semantic representation of the target text. The semantic representation produced by the attention network enters the decoding network, which outputs the acoustic feature information corresponding to the target text.
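How the three embedded sequences might be combined into the encoder input can be sketched without any deep learning framework. Several things here are assumptions: the three sequences are taken to be aligned position by position, concatenation is done along the feature axis, and the embedding tables hold random values standing in for learned parameters. All names are hypothetical.

```python
import random
from typing import Dict, Hashable, Iterable, List, Sequence

Vector = List[float]


def make_embedding_table(
    vocab: Iterable[Hashable], dim: int, rng: random.Random
) -> Dict[Hashable, Vector]:
    """Stand-in for a learned embedding table: one random vector per symbol."""
    return {symbol: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for symbol in vocab}


def concat_embeddings(
    phonemes: Sequence[Hashable],
    sandhi_labels: Sequence[Hashable],
    prosody_labels: Sequence[Hashable],
    phoneme_table: Dict[Hashable, Vector],
    sandhi_table: Dict[Hashable, Vector],
    prosody_table: Dict[Hashable, Vector],
) -> List[Vector]:
    """Look up each sequence and concatenate the three vectors at every position."""
    if not (len(phonemes) == len(sandhi_labels) == len(prosody_labels)):
        raise ValueError("the three sequences must have the same length")
    return [
        phoneme_table[p] + sandhi_table[s] + prosody_table[b]
        for p, s, b in zip(phonemes, sandhi_labels, prosody_labels)
    ]


rng = random.Random(0)
phoneme_table = make_embedding_table(["A1", "A2", "A3"], 4, rng)
sandhi_table = make_embedding_table([1, 3], 2, rng)
prosody_table = make_embedding_table(["N2", "3"], 2, rng)

encoder_input = concat_embeddings(
    ["A1", "A2", "A3"], [3, 3, 1], ["N2", "3", "N2"],
    phoneme_table, sandhi_table, prosody_table,
)
print(len(encoder_input), len(encoder_input[0]))  # 3 8
```

Each position ends up with a vector of width 4 + 2 + 2 = 8, which is what an encoder would consume in this sketch.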

For example, the speech synthesis model is trained as follows:

obtaining first training samples, wherein each first training sample includes a training phoneme sequence, a training tone sandhi label sequence, and a training prosody label sequence corresponding to a first training text, together with training acoustic feature information corresponding to the first training text;

training the model by taking the concatenated vector corresponding to the training phoneme sequence, the training tone sandhi label sequence, and the training prosody label sequence as the input of the model, and taking the training acoustic feature information as the target output of the model, to obtain the trained speech synthesis model.

For example, the speech synthesis model may be a Tacotron model.

Each first training text has corresponding audio; the acoustic feature information of that audio is determined and used as the training acoustic feature information.

In the present disclosure, the training phoneme sequence of the first training text can be determined in a manner similar to that used to determine the phoneme sequence of the target text in step 12, and the training tone sandhi label sequence and training prosody label sequence of the first training text can be determined in a manner similar to that used above to determine the tone sandhi label sequence and prosody label sequence of the target text. These details are not repeated here.

The purpose of training the speech synthesis model is to make the audio synthesized from the model output as close as possible to the actual audio of the first training sample, that is, to make the acoustic feature information output by the model as close as possible to the training acoustic feature information. Accordingly, a loss value can be computed from the training acoustic feature information and the acoustic feature information output by the model during training, and the loss value can be used to adjust the internal parameters of the current model. The adjusted model is then used in the next training iteration, and this cycle repeats until a condition for stopping training is met, yielding the trained speech synthesis model.
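The loop of computing a loss, adjusting parameters, and repeating until a stop condition is met can be illustrated with a deliberately small stand-in. The sketch below is not Tacotron: it assumes a hypothetical linear model with a scalar "acoustic feature" target, a mean squared error loss, and plain gradient descent, and stops when the loss falls below a tolerance or a step limit is reached. All names and hyperparameters are illustrative assumptions.

```python
def predict(weights, x):
    """Toy model (assumption): a linear map from the input vector to one scalar feature."""
    return sum(w * xi for w, xi in zip(weights, x))

def train(samples, dim, lr=0.1, max_steps=1000, tol=1e-6):
    """Fit `weights` so that predictions approach the training targets.

    samples: list of (input_vector, target_feature) pairs.
    Returns the final weights and the last computed mean squared error.
    """
    weights = [0.0] * dim
    loss = float("inf")
    for _ in range(max_steps):
        grads = [0.0] * dim
        loss = 0.0
        for x, target in samples:
            err = predict(weights, x) - target   # model output vs. training target
            loss += err * err
            for j in range(dim):
                grads[j] += 2.0 * err * x[j]     # gradient of the squared error
        loss /= len(samples)
        if loss < tol:                           # stop condition met
            break
        # Adjust the internal parameters using the loss gradient.
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grads)]
    return weights, loss
```

In a real system the targets would be frames of acoustic features (for example, spectrogram frames) and the stop condition might instead be a fixed number of epochs or a validation criterion; the disclosure does not specify which.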

The trained speech synthesis model obtained through the above training steps can be used in speech synthesis scenarios. That is:

After the acoustic feature information of the target text is obtained, it can be input into a vocoder (for example, a WaveNet vocoder or a Griffin-Lim vocoder) for speech synthesis, thereby obtaining the synthesized audio corresponding to the text to be synthesized.

FIG. 2 is a block diagram of a speech synthesis apparatus according to an embodiment of the present disclosure. As shown in FIG. 2, the apparatus 20 includes:

a first determination module 21, configured to determine tone annotation information of a target text to be processed, wherein the tone annotation information includes the tone sandhi type of each text unit in the target text, each text unit consists of at least one unit text, and the tone sandhi type of a text unit indicates the pitch change trend of that text unit;

a second determination module 22, configured to determine prosody annotation information of the target text and a phoneme sequence corresponding to the target text;

a generation module 23, configured to generate synthesized audio corresponding to the target text according to the tone annotation information, the prosody annotation information, and the phoneme sequence.

Optionally, the generation module 23 includes:

a first determination submodule, configured to determine a tone sandhi label sequence corresponding to the target text according to the tone annotation information;

a second determination submodule, configured to determine a prosody label sequence corresponding to the target text according to the prosody annotation information;

a first generation submodule, configured to generate acoustic feature information corresponding to the target text by using a pre-trained speech synthesis model according to the tone sandhi label sequence, the prosody label sequence, and the phoneme sequence;

a synthesis submodule, configured to perform speech synthesis on the acoustic feature information by using a vocoder, to generate the synthesized audio corresponding to the target text.
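The way the four submodules above chain together can be sketched as a small composition of callables. This is only an illustration of the data flow: the class name, field names, and signatures are hypothetical, and the acoustic model and vocoder are left as pluggable functions rather than real implementations.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class GenerationModule:
    """Hypothetical wiring of the four submodules; each field is a pluggable callable."""
    to_sandhi_labels: Callable[[Sequence], List[int]]    # first determination submodule
    to_prosody_labels: Callable[[Sequence], List[int]]   # second determination submodule
    acoustic_model: Callable[[List[int], List[int], List[int]], List[List[float]]]
    vocoder: Callable[[List[List[float]]], List[float]]  # synthesis submodule

    def generate(self, tone_annotation, prosody_annotation, phoneme_ids):
        """Run annotation -> label sequences -> acoustic features -> audio samples."""
        sandhi = self.to_sandhi_labels(tone_annotation)
        prosody = self.to_prosody_labels(prosody_annotation)
        features = self.acoustic_model(sandhi, prosody, phoneme_ids)
        return self.vocoder(features)
```

Keeping each stage as a separate callable mirrors the module boundaries in the apparatus description and lets a stage such as the vocoder be swapped without touching the others.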

Optionally, the speech synthesis model includes an encoding network, an attention network, and a decoding network, wherein:

the encoding network is configured to generate a text representation sequence according to the concatenated vector corresponding to the tone sandhi label sequence, the prosody label sequence, and the phoneme sequence;

the attention network is configured to generate a semantic representation according to the text representation sequence;

the decoding network is configured to output the acoustic feature information corresponding to the target text according to the semantic representation.

Optionally, the speech synthesis model is obtained by the following modules:

an acquisition module, configured to acquire first training samples, wherein each first training sample includes a training phoneme sequence, a training tone sandhi label sequence, and a training prosody label sequence corresponding to a first training text, together with training acoustic feature information corresponding to the first training text;

a training module, configured to train the model by taking the concatenated vector corresponding to the training phoneme sequence, the training tone sandhi label sequence, and the training prosody label sequence as the input of the model, and taking the training acoustic feature information as the target output of the model, to obtain the trained speech synthesis model.

Optionally, the first determination module 21 includes:

a processing submodule, configured to input the target text into a tone annotation model and obtain an output result of the tone annotation model, wherein the tone annotation model is trained on second training texts carrying tone annotation information, and the output result includes boundary information for dividing the target text into a plurality of text units and the tone sandhi type corresponding to each text unit;

a third determination submodule, configured to determine the tone annotation information of the target text according to the output result.

Optionally, the third determination submodule includes:

a fourth determination submodule, configured to determine whether a correction instruction for the output result has been received, the correction instruction instructing a change to at least one of the boundary information and the tone sandhi types in the output result;

a correction submodule, configured to correct the output result according to the correction instruction and determine the corrected output result as the tone annotation information of the target text.
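The correction step can be sketched with two small data structures. The representation below is an assumption made for illustration: boundaries are stored as the exclusive end index of each text unit, tone sandhi types as integers, and a correction may override either field or both. Returning the model output unchanged when no correction instruction is received is likewise an assumption about the unstated default.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TaggerOutput:
    """Hypothetical form of the tone annotation model's output."""
    boundaries: List[int]    # exclusive end index of each text unit in the target text
    sandhi_types: List[int]  # one tone sandhi type id per text unit

@dataclass
class Correction:
    """Hypothetical correction instruction; a field left as None is not changed."""
    boundaries: Optional[List[int]] = None
    sandhi_types: Optional[List[int]] = None

def resolve_tone_annotation(output, correction=None):
    """Return the tone annotation to use, applying a correction if one was received."""
    if correction is None:
        return output  # assumption: use the model output as-is
    boundaries = output.boundaries if correction.boundaries is None else correction.boundaries
    types = output.sandhi_types if correction.sandhi_types is None else correction.sandhi_types
    if len(boundaries) != len(types):
        raise ValueError("each text unit needs exactly one tone sandhi type")
    # Build a new object so the original model output is left untouched.
    return TaggerOutput(list(boundaries), list(types))
```

For instance, a correction that only supplies `sandhi_types` keeps the model's unit boundaries and replaces the per-unit types.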

Optionally, if a text unit consists of one unit text, the tone sandhi type of the text unit is one of the following: a first type indicating that the pitch is the original tone of the text unit, or a second type indicating that the pitch changes from the pitch starting point of the original tone of the text unit to a level tone;

if a text unit consists of more than two unit texts, the tone sandhi type of the text unit is one of the following: the third type, the fourth type, the seventh type, the eighth type, a ninth type indicating that the pitch rises from a mid-level tone to a high-level tone and then falls to a low-level tone, a tenth type indicating that the pitch rises from a mid-level tone to a high-level tone and then falls to a mid-level tone, an eleventh type indicating that the pitch rises from a low-level tone to a high-level tone and then falls to a low-level tone, or a twelfth type indicating that the pitch rises from a low-level tone to a high-level tone and then falls to a mid-level tone.

Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and is not elaborated here.

Referring now to FIG. 3, a schematic structural diagram of an electronic device 600 suitable for implementing embodiments of the present disclosure is shown. Terminal devices in embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and in-vehicle terminals (for example, in-vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 3 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present disclosure.

As shown in FIG. 3, the electronic device 600 may include a processing device 601 (for example, a central processing unit or a graphics processor), which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic device 600. The processing device 601, the ROM 602, and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

Generally, the following devices can be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 607 including, for example, a liquid crystal display (LCD), speaker, and vibrator; storage devices 608 including, for example, a magnetic tape and a hard disk; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 3 shows the electronic device 600 with various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.

In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 609, installed from the storage device 608, or installed from the ROM 602. When the computer program is executed by the processing device 601, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.

It should be noted that the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable signal medium, by contrast, may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted using any suitable medium, including but not limited to electrical wire, optical cable, RF (radio frequency), or any suitable combination of the foregoing.

In some embodiments, the client and the server can communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (for example, a communication network). Examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (for example, the Internet), and peer-to-peer networks (for example, ad hoc peer-to-peer networks), as well as any currently known or future-developed network.

The above computer-readable medium may be included in the above electronic device, or it may exist separately without being assembled into the electronic device.

The above computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: determine tone annotation information of a target text to be processed, wherein the tone annotation information includes the tone sandhi type of each text unit in the target text, each text unit consists of at least one unit text, and the tone sandhi type of a text unit indicates the pitch change trend of that text unit; determine prosody annotation information of the target text and a phoneme sequence corresponding to the target text; and generate synthesized audio corresponding to the target text according to the tone annotation information, the prosody annotation information, and the phoneme sequence.

Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The modules described in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases, the name of a module does not constitute a limitation on the module itself; for example, the first acquisition module may also be described as "a module for acquiring at least two Internet Protocol addresses".

The functions described above may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that may be used include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), and complex programmable logic devices (CPLDs).

In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

According to one or more embodiments of the present disclosure, a speech synthesis method is provided, the method comprising:

According to one or more embodiments of the present disclosure, a speech synthesis method is provided, wherein generating the synthesized audio corresponding to the target text according to the tone annotation information, the prosody annotation information, and the phoneme sequence includes:

determining a tone sandhi label sequence corresponding to the target text according to the tone annotation information;

determining a prosody label sequence corresponding to the target text according to the prosody annotation information;

generating acoustic feature information corresponding to the target text by using a pre-trained speech synthesis model according to the tone sandhi label sequence, the prosody label sequence, and the phoneme sequence;

performing speech synthesis on the acoustic feature information by using a vocoder, to generate the synthesized audio corresponding to the target text.

According to one or more embodiments of the present disclosure, a speech synthesis method is provided, wherein the speech synthesis model includes an encoding network, an attention network, and a decoding network, wherein:

According to one or more embodiments of the present disclosure, a speech synthesis method is provided, wherein the speech synthesis model is obtained as follows:

acquiring first training samples, wherein each first training sample includes a training phoneme sequence, a training tone sandhi label sequence, and a training prosody label sequence corresponding to a first training text, together with training acoustic feature information corresponding to the first training text;

training the model by taking the concatenated vector corresponding to the training phoneme sequence, the training tone sandhi label sequence, and the training prosody label sequence as the input of the model, and taking the training acoustic feature information as the target output of the model, to obtain the trained speech synthesis model.

According to one or more embodiments of the present disclosure, a speech synthesis method is provided, wherein determining the tone annotation information of the target text to be processed includes:

inputting the target text into a tone annotation model and obtaining an output result of the tone annotation model, wherein the tone annotation model is trained on second training texts carrying tone annotation information, and the output result includes boundary information for dividing the target text into a plurality of text units and the tone sandhi type corresponding to each text unit;

determining the tone annotation information of the target text according to the output result.

According to one or more embodiments of the present disclosure, a speech synthesis method is provided, wherein determining the tone annotation information of the target text according to the output result includes:

determining whether a correction instruction for the output result has been received, the correction instruction instructing a change to at least one of the boundary information and the tone sandhi types in the output result;

correcting the output result according to the correction instruction, and determining the corrected output result as the tone annotation information of the target text.

According to one or more embodiments of the present disclosure, a speech synthesis method is provided, wherein, if a text unit consists of one unit text, the tone sandhi type of the text unit is one of the following: a first type indicating that the pitch is the original tone of the text unit, or a second type indicating that the pitch changes from the pitch starting point of the original tone of the text unit to a level tone;

According to one or more embodiments of the present disclosure, a speech synthesis apparatus is provided, the apparatus comprising:

According to one or more embodiments of the present disclosure, a computer-readable medium is provided, on which a computer program is stored, wherein the program, when executed by a processing device, implements the steps of the speech synthesis method provided by any embodiment of the present disclosure.

According to one or more embodiments of the present disclosure, an electronic device is provided, comprising:

at least one processing device, configured to execute the at least one computer program in the storage device, to implement the steps of the speech synthesis method provided by any embodiment of the present disclosure.

The above description is merely a description of preferred embodiments of the present disclosure and of the technical principles employed. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.

Additionally, although operations are depicted in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although the above discussion contains several specific implementation details, these should not be construed as limitations on the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological logical acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims. Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and is not elaborated here.

Claims

1. A method of speech synthesis, the method comprising:

determining tone labeling information of a target text to be processed, wherein the tone labeling information comprises a continuous tone-change type of each text unit in the target text, each text unit consists of at least one unit text, and the continuous tone-change type of a text unit indicates the pitch variation trend of that text unit;

determining prosody labeling information of the target text and a phoneme sequence corresponding to the target text;

and generating synthetic audio corresponding to the target text according to the tone labeling information, the prosody labeling information and the phoneme sequence.

2. The method of claim 1, wherein generating synthetic audio corresponding to the target text according to the tone labeling information, the prosody labeling information, and the phoneme sequence comprises:

determining a continuous tone-change label sequence corresponding to the target text according to the tone labeling information;

determining a prosody label sequence corresponding to the target text according to the prosody labeling information;

generating acoustic feature information corresponding to the target text by using a pre-trained speech synthesis model, according to the continuous tone-change label sequence, the prosody label sequence and the phoneme sequence;

and performing speech synthesis on the acoustic feature information by using a vocoder to generate the synthetic audio corresponding to the target text.

3. The method of claim 2, wherein the speech synthesis model comprises an encoding network, an attention network, and a decoding network; wherein:

the encoding network is used for generating a text representation sequence according to a concatenated vector corresponding to the continuous tone-change label sequence, the prosody label sequence and the phoneme sequence;

the attention network is used for generating a semantic representation according to the text representation sequence;

and the decoding network is used for outputting the acoustic feature information corresponding to the target text according to the semantic representation.
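As an illustration only, the concatenated input of claim 3 can be sketched in Python as follows. The sketch assumes that the three sequences have already been aligned to the same length (one label per phoneme) and uses one-hot vectors as a stand-in for whatever representation the encoding network actually consumes. The names `one_hot` and `concat_inputs` and the vocabulary sizes are hypothetical and do not come from the disclosure.

```python
from typing import List


def one_hot(index: int, size: int) -> List[float]:
    """Return a one-hot vector of length `size` with a 1.0 at `index`."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} outside vocabulary of size {size}")
    return [1.0 if i == index else 0.0 for i in range(size)]


def concat_inputs(
    phoneme_ids: List[int],
    tone_label_ids: List[int],
    prosody_label_ids: List[int],
    n_phonemes: int,
    n_tone_labels: int,
    n_prosody_labels: int,
) -> List[List[float]]:
    """Concatenate, position by position, the phoneme, continuous tone-change
    label and prosody label representations into one vector per position.

    Assumption: the three sequences are aligned and have equal length.
    """
    if not (len(phoneme_ids) == len(tone_label_ids) == len(prosody_label_ids)):
        raise ValueError("the three sequences must have the same length")
    return [
        one_hot(p, n_phonemes) + one_hot(t, n_tone_labels) + one_hot(r, n_prosody_labels)
        for p, t, r in zip(phoneme_ids, tone_label_ids, prosody_label_ids)
    ]


# Two positions, with hypothetical vocabularies of 3 phonemes,
# 2 tone-change labels and 2 prosody labels.
vectors = concat_inputs([0, 2], [1, 0], [0, 1], 3, 2, 2)
```

Each resulting vector has length 3 + 2 + 2 = 7. A trained model would more plausibly concatenate learned embeddings rather than one-hot vectors, but the per-position concatenation is the same.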

4. The method of claim 2, wherein the speech synthesis model is obtained by:

acquiring first training samples, wherein each first training sample comprises a training phoneme sequence, a training continuous tone-change label sequence and a training prosody label sequence which correspond to a first training text, and training acoustic feature information which corresponds to the first training text;

and performing model training by taking a concatenated vector corresponding to the training phoneme sequence, the training continuous tone-change label sequence and the training prosody label sequence as the input of a model and taking the training acoustic feature information as the target output of the model, to obtain the trained speech synthesis model.

5. The method according to claim 1, wherein the determining tone labeling information of the target text to be processed comprises:

inputting the target text into a tone labeling model, and obtaining an output result of the tone labeling model, wherein the tone labeling model is obtained by training based on a second training text with tone labeling information, and the output result comprises boundary information for dividing the target text into a plurality of text units and a continuous tone-change type corresponding to each text unit;

and determining the tone labeling information of the target text according to the output result.

6. The method according to claim 5, wherein the determining the tone labeling information of the target text according to the output result comprises:

determining whether a correction instruction for the output result is received, the correction instruction indicating a change to at least one of the boundary information and the continuous tone-change type in the output result;

and correcting the output result according to the correction instruction, and determining the corrected output result as the tone labeling information of the target text.
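The correction step of claim 6 can be sketched as below. This is a minimal Python illustration under stated assumptions, not the disclosed implementation: the output result is modeled as a list of text units with half-open index ranges over unit texts, a correction targets one unit, and the model output is used unchanged when no correction instruction is received. The names `TextUnit`, `Correction` and `apply_correction` are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class TextUnit:
    start: int      # index of the first unit text (inclusive)
    end: int        # index after the last unit text (exclusive)
    tone_type: int  # continuous tone-change type, e.g. 1..12


@dataclass(frozen=True)
class Correction:
    unit_index: int                      # which text unit to change
    new_end: Optional[int] = None        # move the boundary shared with the next unit
    new_tone_type: Optional[int] = None  # replace the tone-change type


def apply_correction(
    units: List[TextUnit], correction: Optional[Correction] = None
) -> List[TextUnit]:
    """Return the tone labeling information after an optional correction."""
    if correction is None:
        # Assumption: without a correction instruction the output is used as is.
        return list(units)
    corrected = list(units)
    i = correction.unit_index
    unit = corrected[i]
    if correction.new_tone_type is not None:
        unit = replace(unit, tone_type=correction.new_tone_type)
    if correction.new_end is not None:
        if i + 1 >= len(corrected):
            raise ValueError("the last unit has no following boundary to move")
        nxt = corrected[i + 1]
        if not unit.start < correction.new_end < nxt.end:
            raise ValueError("boundary would leave an empty text unit")
        unit = replace(unit, end=correction.new_end)
        corrected[i + 1] = replace(nxt, start=correction.new_end)
    corrected[i] = unit
    return corrected


model_output = [TextUnit(0, 2, 3), TextUnit(2, 5, 9)]
revised = apply_correction(model_output, Correction(0, new_end=3, new_tone_type=9))
```

Moving a boundary changes how many unit texts each unit contains, so a fuller version would probably re-check that each unit's type is still permitted for its new length.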

7. The method according to any one of claims 1 to 6, wherein:

if a text unit consists of one unit text, the continuous tone-change type of the text unit is one of the following: a first type indicating that the pitch is the original tone of the text unit, and a second type indicating that the pitch changes from the pitch start point of the original tone of the text unit to a flat tone;

if a text unit consists of two unit texts, the continuous tone-change type of the text unit is one of the following: a third type indicating a decrease in pitch from high to low, a fourth type indicating a decrease in pitch from high to medium, a fifth type indicating an increase in pitch from medium to high, a sixth type indicating an increase in pitch from low to high, a seventh type indicating an increase in pitch from low to low, and an eighth type indicating an increase in pitch from low to medium;

if a text unit consists of more than two unit texts, the continuous tone-change type of the text unit is one of the following: the third type, the fourth type, the seventh type, the eighth type, a ninth type indicating that the pitch rises from medium to high and then falls to low, a tenth type indicating that the pitch rises from medium to high and then falls to medium, an eleventh type indicating that the pitch rises from low to high and then falls to low, and a twelfth type indicating that the pitch rises from low to high and then falls to medium.
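The relationship in claim 7 between the number of unit texts in a text unit and the permitted continuous tone-change types can be written as a small lookup. The following Python sketch is illustrative only; the enum member names and the functions `allowed_types` and `is_valid` are hypothetical, and the numbering simply follows the ordinal names used in the claim.

```python
from enum import IntEnum


class ToneChangeType(IntEnum):
    ORIGINAL = 1       # first type: pitch is the unit's original tone
    START_TO_FLAT = 2  # second type: from the original tone's pitch start point to a flat tone
    HIGH_TO_LOW = 3    # third type
    HIGH_TO_MID = 4    # fourth type
    MID_TO_HIGH = 5    # fifth type
    LOW_TO_HIGH = 6    # sixth type
    LOW_TO_LOW = 7     # seventh type, worded in the claim as "low to low"
    LOW_TO_MID = 8     # eighth type
    MID_HIGH_LOW = 9   # ninth type: medium to high, then down to low
    MID_HIGH_MID = 10  # tenth type: medium to high, then down to medium
    LOW_HIGH_LOW = 11  # eleventh type: low to high, then down to low
    LOW_HIGH_MID = 12  # twelfth type: low to high, then down to medium


T = ToneChangeType
_ONE_UNIT_TEXT = frozenset({T.ORIGINAL, T.START_TO_FLAT})
_TWO_UNIT_TEXTS = frozenset({T.HIGH_TO_LOW, T.HIGH_TO_MID, T.MID_TO_HIGH,
                             T.LOW_TO_HIGH, T.LOW_TO_LOW, T.LOW_TO_MID})
_MORE_UNIT_TEXTS = frozenset({T.HIGH_TO_LOW, T.HIGH_TO_MID, T.LOW_TO_LOW, T.LOW_TO_MID,
                              T.MID_HIGH_LOW, T.MID_HIGH_MID, T.LOW_HIGH_LOW, T.LOW_HIGH_MID})


def allowed_types(num_unit_texts: int) -> frozenset:
    """Return the tone-change types permitted for a unit of the given length."""
    if num_unit_texts < 1:
        raise ValueError("a text unit consists of at least one unit text")
    if num_unit_texts == 1:
        return _ONE_UNIT_TEXT
    if num_unit_texts == 2:
        return _TWO_UNIT_TEXTS
    return _MORE_UNIT_TEXTS


def is_valid(num_unit_texts: int, tone_type: ToneChangeType) -> bool:
    """Check that a label is permitted for a unit with this many unit texts."""
    return tone_type in allowed_types(num_unit_texts)
```

A check of this kind could be applied to manually corrected labels, for example to reject the fifth type on a unit of three unit texts.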

8. A speech synthesis apparatus, comprising:

a first determining module, used for determining tone labeling information of a target text to be processed, wherein the tone labeling information comprises a continuous tone-change type of each text unit in the target text, each text unit consists of at least one unit text, and the continuous tone-change type of a text unit indicates the pitch variation trend of that text unit;

a second determining module, used for determining prosody labeling information of the target text and a phoneme sequence corresponding to the target text;

and a generating module, used for generating synthetic audio corresponding to the target text according to the tone labeling information, the prosody labeling information and the phoneme sequence.

9. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processing device, carries out the steps of the method of any one of claims 1 to 7.

10. An electronic device, comprising:

a storage device having at least one computer program stored thereon;

at least one processing device for executing the at least one computer program in the storage device to carry out the steps of the method according to any one of claims 1 to 7.