CN114255736B - Prosody annotation method and system - Google Patents
- Publication number: CN114255736B (application CN202111591322.4A)
- Authority: CN (China)
- Legal status: Active (an assumption by Google, not a legal conclusion)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Description
Technical Field
The present invention relates to the field of intelligent speech, and in particular to a prosody annotation method and system.
Background Art
To make intelligent voice interaction more human-like, prosody is usually added to the audio used to communicate with users; adjusting the prosody gives the dialogue natural cadence and expresses the emotion of the language.
Prosody-space modeling can usually be added on top of the basic modules of a TTS (Text To Speech) system. Existing work on prosody control either samples prosody from a latent space with a VAE (Variational Auto-Encoder) or uses additional parameters to control quantities related to prosodic features such as fundamental frequency and energy. Prosody modeling mostly aims at improving naturalness and diversity and mostly relies on explicit prosodic features, thereby producing speech with prosody.
In the course of implementing the present invention, the inventors found at least the following problems in the related art:
Explicit prosodic features cannot represent the prosody space very fully, and they are cumbersome to control; an implicit prosody space is even harder to control, because it is a continuous space and lacks interpretability. In addition, most related work is done at the phoneme level, which is convenient but does not match how people naturally speak, so the generated prosodic speech is not human-like enough.
Summary of the Invention
Embodiments of the present invention aim to at least solve the problem in the prior art that the generated prosodic speech is not human-like enough.
In a first aspect, an embodiment of the present invention provides a prosody annotation method, comprising:
extracting a word-level prosody representation from training speech data, and reconstructing the speech using the word-level prosody representation to obtain reconstructed speech carrying the word-level prosody representation;
performing a first clustering of the reconstructed speech according to the textual information of each word;
performing an unsupervised second clustering of the first clustering results according to prosody, to obtain an unsupervised prosody label for each word.
In a second aspect, an embodiment of the present invention provides a text-to-speech method based on prosody annotation, comprising:
inputting text into a text-to-speech model, and performing, on the phoneme sequence determined by the text-to-speech model, a first clustering based on word textual information and a second clustering based on prosody, to obtain an unsupervised prosody label for each word in the text;
performing prosody synthesis on the phoneme sequence using at least the unsupervised prosody label of each word, to obtain speech with prosody.
In a third aspect, an embodiment of the present invention provides a prosody annotation system, comprising:
a speech reconstruction program module, configured to extract a word-level prosody representation from training speech data and reconstruct the speech using the word-level prosody representation, to obtain reconstructed speech carrying the word-level prosody representation;
a clustering program module, configured to perform a first clustering of the reconstructed speech according to the textual information of each word;
a prosody annotation program module, configured to perform an unsupervised second clustering of the first clustering results according to prosody, to obtain an unsupervised prosody label for each word.
In a fourth aspect, an embodiment of the present invention provides a text-to-speech system based on prosody annotation, comprising:
an unsupervised prosody annotation program module, configured to input text into a text-to-speech model and perform, on the phoneme sequence determined by the text-to-speech model, a first clustering based on word textual information and a second clustering based on prosody, to obtain an unsupervised prosody label for each word in the text;
a speech synthesis program module, configured to perform prosody synthesis on the phoneme sequence using at least the unsupervised prosody label of each word, to obtain speech with prosody.
In a fifth aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the steps of the prosody annotation method of any embodiment of the present invention.
In a sixth aspect, an embodiment of the present invention provides a storage medium on which a computer program is stored, wherein, when the program is executed by a processor, the steps of the prosody annotation method of any embodiment of the present invention are implemented.
The beneficial effects of the embodiments of the present invention are as follows: prosody is modeled and controlled at the word level, and since words carry rich semantic information they can be well associated with the context. Moreover, diverse prosody can be annotated with interpretability, good controllability, and high scalability, achieving more human-like prosody annotation. Speech synthesized by TTS thus has richer prosody, improving the user's interaction experience.
Brief Description of the Drawings
To illustrate the technical solutions in the embodiments of the present invention or the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a prosody annotation method provided by an embodiment of the present invention;
FIG. 2 is an architecture diagram of the prosody extraction and annotation system of a prosody annotation method provided by an embodiment of the present invention;
FIG. 3 is a flowchart of a text-to-speech method based on prosody annotation provided by an embodiment of the present invention;
FIG. 4 is a structural diagram of the prosody control model in the training and inference stages of a text-to-speech method based on prosody annotation provided by an embodiment of the present invention;
FIG. 5 is a diagram of the per-leaf log-likelihood curves of a text-to-speech method based on prosody annotation provided by an embodiment of the present invention;
FIG. 6 is a diagram of the subjective evaluation of naturalness of a text-to-speech method based on prosody annotation provided by an embodiment of the present invention;
FIG. 7 is a diagram of synthesized speech with manually specified prosody of a text-to-speech method based on prosody annotation provided by an embodiment of the present invention;
FIG. 8 is a diagram of the mel-cepstral distortion between ground-truth labels and control labels of a text-to-speech method based on prosody annotation provided by an embodiment of the present invention;
FIG. 9 is a diagram of the subjective evaluation of controllability of a text-to-speech method based on prosody annotation provided by an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a prosody annotation system provided by an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a text-to-speech system based on prosody annotation provided by an embodiment of the present invention;
FIG. 12 is a schematic structural diagram of an embodiment of an electronic device for prosody annotation and for text-to-speech based on prosody annotation provided by an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
FIG. 1 is a flowchart of a prosody annotation method provided by an embodiment of the present invention, comprising the following steps:
S11: extracting a word-level prosody representation from training speech data, and reconstructing the speech using the word-level prosody representation to obtain reconstructed speech carrying the word-level prosody representation;
S12: performing a first clustering of the reconstructed speech according to the textual information of each word;
S13: performing an unsupervised second clustering of the first clustering results according to prosody, to obtain an unsupervised prosody label for each word.
In this embodiment, the word was found to be the basic unit of prosody, because the same character is pronounced with different prosody when it appears in different words; the word is therefore taken as the basic unit for prosody processing. In short, the method consists of the following stages, as shown in FIG. 2: word-level prosody embedding extraction and two-stage word-level prosody annotation. Once the word-level prosody labels are obtained, TTS training with prosody labels can be performed.
For step S11, as shown in FIG. 2(a), word-level prosody representations are extracted from the mel-spectrogram of the training data, and these representations are embedded in the TTS system for training.
Specifically, this includes: pre-building a text-to-speech model based on FastSpeech2 and a prosody extractor; inputting the training speech data into the text-to-speech model, and determining the phoneme sequence of the training speech data using the encoder of the text-to-speech model;
extracting, with the prosody extractor, the word-level prosody representation of each word from the mel-spectrogram of the training speech data, and reconstructing the phoneme sequence using the word-level prosody representation to obtain reconstructed speech carrying the word-level prosody representation.
In this embodiment, to obtain word-level prosody embeddings, a TTS (Text To Speech) model based on FastSpeech2 is first built, followed by a prosody extractor. As shown in FIG. 2(a), the prosody extractor generates one hidden vector (called the prosody embedding e) for each word from the corresponding mel-spectrogram segment. The generated prosody embeddings are then aligned with the phoneme sequence and concatenated to the encoder output. The extractor is therefore optimized to extract representations carrying useful prosodic information, including each word's prosody and phonetic content, so as to better reconstruct the output speech.
For step S12, prosody annotation has two stages. The intuition is that words with very different phonetic content, such as the long word "congratulations" and the short word "cat", are pronounced completely differently and therefore should not share the same prosody label. Hence a two-stage prosody annotation strategy is designed: words are first classified into different types according to their phonetic content using a decision tree, and then the prosody within each word type is clustered separately using a GMM (Gaussian Mixture Model).
The first clustering of the reconstructed speech according to the textual information of each word comprises: performing a decision-tree-based first clustering of the reconstructed speech according to the textual information of each word. In this embodiment, the first clustering is the first-stage decision-tree clustering.
Inspired by the HMM (Hidden Markov Model) state-tying technique in ASR (Automatic Speech Recognition), a binary decision tree is constructed for word clustering, with a set of questions Q about the words' phonetic content; all words at the root are clustered into l leaf nodes. For this purpose, 39 questions were designed from preset expert knowledge, such as "Does the word have more than 4 phonemes?", "Is the first phoneme of this word 'm' or 'n'?", and "Does this word end with a closed syllable?".
Each node in the decision tree contains a set of words whose prosody embeddings can be modeled with a Gaussian distribution, and the log-likelihood can be expressed as:

LL^(i) = Σ_{e∈ε^(i)} log N(e; μ^(i), Σ^(i))

where i is the node index and ε^(i) is the set of all prosody embeddings corresponding to the words in node i. Each non-leaf node i is associated with a question q that partitions the words in the node into its left or right child, resulting in an increase in the log-likelihood of the prosody embeddings:

Δ_q LL^(i) = LL^(i's left child under q) + LL^(i's right child under q) − LL^(i)

The initial tree contains only the root node, which is also a leaf. Then the following steps are performed recursively: for each leaf node, find the question that maximizes its log-likelihood increase, and select the leaf node j whose increase is the largest over all leaves, i.e.:

j = argmax_{i∈leaf nodes} max_q Δ_q LL^(i)

The selected node is then split with the corresponding question. This process continues until the increase in log-likelihood falls below a threshold, yielding the topology of the decision tree. In this method, as shown in FIG. 2(b), the number of leaves l is 10, and their indices are denoted by the letters a to j.
In plain terms, the steps above first fix a question set whose questions all concern the textual information of the word, such as "Does it start with a consonant?", "Does it contain the phoneme IY1?", and "Does the word have more than 4 phonemes?". A decision tree then iteratively clusters the data into 10 classes (leaf nodes) in the first clustering, using the criterion of maximizing the gain in the total likelihood of the data (the number of clusters changes as the question set is modified and is not limited here).
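Under the simplifying assumptions of diagonal-covariance Gaussians and a toy question set (the patent's 39 questions are not reproduced here), the greedy tree-growing procedure above can be sketched as:

```python
import numpy as np

def gauss_ll(embs):
    # Log-likelihood of embeddings under one diagonal Gaussian fit to them.
    if len(embs) < 2:
        return 0.0
    mu, var = embs.mean(0), embs.var(0) + 1e-6
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (embs - mu) ** 2 / var))

def grow_tree(words, embs, questions, max_leaves=10, min_gain=1.0):
    """Greedily split the (leaf, question) pair with the largest
    log-likelihood gain until max_leaves is reached or the best gain
    drops below min_gain. Returns the leaves as lists of word indices."""
    leaves = [list(range(len(words)))]  # root contains every word
    while len(leaves) < max_leaves:
        best = None
        for li, idx in enumerate(leaves):
            base = gauss_ll(embs[idx])
            for q in questions:
                left = [i for i in idx if q(words[i])]
                right = [i for i in idx if not q(words[i])]
                if not left or not right:
                    continue
                gain = gauss_ll(embs[left]) + gauss_ll(embs[right]) - base
                if best is None or gain > best[0]:
                    best = (gain, li, left, right)
        if best is None or best[0] < min_gain:
            break
        _, li, left, right = best
        leaves[li:li + 1] = [left, right]  # replace leaf by its two children
    return leaves

# Toy data: word length stands in for a "phonetic content" question.
rng = np.random.default_rng(0)
words = ["cat", "dog", "congratulations", "extraordinary"]
embs = rng.normal(size=(4, 8))
questions = [lambda w: len(w) > 4]
leaves = grow_tree(words, embs, questions, max_leaves=2, min_gain=-1e9)
```

With `min_gain` set very low, the sketch performs exactly one split, separating the two long words from the two short ones; the real system instead stops when the likelihood gain falls below its threshold, ending with l = 10 leaves.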
For step S13, the second stage uses Gaussian mixture clustering. The word-level prosody embeddings extracted by the neural network contain both the word's prosodic information and its phonetic content. However, the decision tree clusters words into l leaf nodes using questions about phonetic content only, so the prosody embeddings of the words within a leaf node are assumed to differ only in prosody while being similar in phonetic content. Clustering within a leaf node is therefore driven by prosody rather than phonetic content.
GMM-based clustering is then performed on the prosody embeddings within each leaf node i:

p(e) = Σ_{k=1}^{m} w_k^(i) N(e; μ_k^(i), Σ_k^(i))

where k is the Gaussian component index and m is the number of components. The prosody of each word is labeled with the index of the Gaussian component that maximizes the posterior probability of its prosody embedding e:

k* = argmax_k w_k^(i) N(e; μ_k^(i), Σ_k^(i))

In the steps above, μ_k^(i) denotes the mean vector of the k-th Gaussian component in leaf node i, Σ_k^(i) the covariance matrix of the k-th component in this node, and w_k^(i) the weight of the k-th component in this node. Specifically, m can be set to 5, so the Gaussian component IDs range from 0 to 4. All words in the training set are therefore labeled with one of m*l = 5*10 = 50 prosody tags; each word's prosody label is the combination of one of 10 leaf IDs and one of 5 Gaussian component IDs. As shown in FIG. 2(b), the prosody tags run from a0 to j4. The prosody extraction and labeling above are completely unsupervised, using only audio information; moreover, the labeling system is driven jointly by data and knowledge.
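Assuming a diagonal-covariance GMM has already been fitted inside a leaf (the weights, means, and variances below are illustrative values, not from the patent), the labeling rule argmax_k w_k N(e; μ_k, Σ_k) can be sketched as:

```python
import numpy as np

def prosody_tag(e, weights, means, variances, leaf_id):
    """Tag one embedding with the leaf letter plus the index of the
    Gaussian component maximizing the unnormalized posterior
    w_k * N(e; mu_k, Sigma_k), computed in the log domain."""
    log_post = [
        np.log(w) - 0.5 * np.sum(np.log(2 * np.pi * v) + (e - m) ** 2 / v)
        for w, m, v in zip(weights, means, variances)
    ]
    return f"{leaf_id}{int(np.argmax(log_post))}"

# Two illustrative components in leaf "d": one centered at 0, one at 5.
weights = [0.5, 0.5]
means = [np.zeros(4), np.full(4, 5.0)]
variances = [np.ones(4), np.ones(4)]
tag = prosody_tag(np.full(4, 5.0), weights, means, variances, "d")
print(tag)  # "d1": the embedding sits on component 1's mean
```

With l = 10 leaves and m = 5 components this yields the 50-tag inventory a0 through j4 described above.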
This embodiment shows the benefit of modeling and controlling prosody at the word level: words carry rich semantic information and can thus be well associated with the context. Moreover, diverse prosody can be annotated with interpretability, good controllability, and high scalability, achieving more human-like prosody annotation.
FIG. 3 is a flowchart of a text-to-speech method based on prosody annotation provided by an embodiment of the present invention, comprising the following steps:
S21: inputting text into a text-to-speech model, and performing, on the phoneme sequence determined by the text-to-speech model, a first clustering based on word textual information and a second clustering based on prosody, to obtain an unsupervised prosody label for each word in the text;
S22: performing prosody synthesis on the phoneme sequence using at least the unsupervised prosody label of each word, to obtain speech with prosody.
In this embodiment, since an unsupervised prosody label is obtained for each word, it can be applied to text-to-speech tasks for prosody control with prosody labels.
For step S21, a TTS model is trained with the derived word-level prosody tags, as shown in FIG. 4. In the training stage, the TTS model feeds the determined phoneme sequence into the prosody tag predictor; in the inference stage, the prosody tags can be predicted from the input text by the prosody predictor. This is the prosody annotation part described above, which yields the unsupervised prosody label of each word.
For step S22, prosody synthesis is performed on the phoneme sequence using the unsupervised prosody label of each word, yielding speech with prosody. As one implementation, for a more user-friendly design, the user may input manually specified prosody labels for prosody synthesis on the phoneme sequence, producing prosodic speech that better matches the user's expectations.
Specifically, in FIG. 4, (a) shows the training flow of this step and (b) shows the flow of the synthesis stage after training. In the synthesis stage, the model can automatically predict the prosody labels from context, or labels can be manually specified for any word and fed into the TTS model to synthesize the desired prosody. The prosody predictor predicts each word's prosody tag from the phoneme hidden states corresponding to that word (i.e., the encoder output sequence h). The prosody predictor consists of a bi-GRU (Bi-directional Gated Recurrent Unit) that converts the phoneme hidden states into one vector per word, two convolution blocks, and a softmax layer. Each convolution block consists of a 1D convolution layer, a ReLU activation, a normalization layer, and a dropout layer. The predictor is optimized with a cross-entropy loss L_PP against the prosody labels. The overall loss for model training is therefore defined as:

L = L_FastSpeech2 + α · L_PP

where L_FastSpeech2 is the text-to-speech model loss of FastSpeech2 and α is the relative weight between the two terms.
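As a shape-level sketch of this loss combination (the FastSpeech2 losses themselves are omitted and a placeholder value is used for them), the predictor's cross-entropy over the 50 prosody tags and the combined training loss can be written as:

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy of per-word prosody-tag logits vs. integer labels,
    computed stably in the log domain."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

rng = np.random.default_rng(0)
n_words, n_tags = 6, 50          # 50 = 10 leaves x 5 GMM components
logits = rng.normal(size=(n_words, n_tags))   # predictor outputs (toy)
labels = rng.integers(0, n_tags, size=n_words)  # derived prosody tags (toy)

l_pp = cross_entropy(logits, labels)   # prosody predictor loss L_PP
l_fs2 = 1.23                           # placeholder FastSpeech2 loss value
alpha = 1.0                            # the weight used in the experiments
total = l_fs2 + alpha * l_pp           # L = L_FastSpeech2 + alpha * L_PP
```

The bi-GRU and convolution blocks that produce the per-word logits are left out here; only the loss arithmetic from the formula above is shown.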
This embodiment shows that applying prosody labels with demonstrated diversity, interpretability, and good controllability to TTS gives the synthesized speech richer prosody and improves the user's interaction experience.
For the experimental validation, the method uses LJSpeech, a single-speaker dataset containing about 24 hours of recordings; 242 utterances are held out as the test set. All audio is downsampled to 16 kHz. Features are extracted with an 800-point window, a 200-point frame shift, 1024 FFT points, and 320 mel bins. Phoneme alignments are obtained with an HMM-GMM (Gaussian mixture model) ASR model trained on LibriSpeech (a speech dataset). The vocoder used in this work is MelGAN. The coefficient α is set to 1.0, and the prosody embedding is 128-dimensional.
To examine the behavior of the decision tree in prosody labeling, FIG. 5 shows, as the decision tree grows, the curve of the average number of prosody embeddings per leaf node and the overall log-likelihood curve Σ_{i∈leaf nodes} LL^(i) of the prosody embeddings over all leaf nodes. As the number of leaves increases, the average number of embeddings per leaf decreases while the overall log-likelihood of the embeddings increases. Considering performance and complexity, tree growth is stopped when the number of leaves reaches 10.
To assess the naturalness of the predicted prosody, a TTS model with a prosody predictor is trained using the derived word-level prosody tags. In the inference stage, word-level prosody can either be predicted from the input text by the prosody predictor or be specified manually. In this part, the test set is synthesized with predicted and sampled prosody. Naturalness is then evaluated with a MUSHRA test, in which 30 listeners rated each utterance between 0 and 100. The model is compared with two baselines: the plain FastSpeech2 model (Raw_FSP) and a TTS model in which phoneme-level prosody is modeled with a mixture density network (PLP_MDN). In addition, the ground-truth mel-spectrograms of the recordings are reconstructed by the vocoder and provided as GT in the listening test. The results are shown in FIG. 6. It can be observed that, thanks to word-level prosody modeling, the proposed system outperforms the other two models in naturalness; the WLP samples are, however, still slightly worse than GT.
Here GT denotes the ground-truth recordings. In the experiments, to keep listeners from being influenced by audio quality, the GT samples are actually audio re-synthesized with MelGAN from mel-spectrograms extracted from the real recordings. Raw_FSP denotes audio synthesized by the original FastSpeech2 model without prosody modeling, and PLP_MDN (phone-level prosody mixture density network) denotes audio synthesized by a model using a phoneme-level mixture density network for prosody modeling. WLP_predict (word-level prosody, predicted) denotes audio of this method whose prosody is fully predicted by the prosody predictor.
To evaluate the word-level prosody controllability of the TTS model, the ground-truth word prosody of the test set is first labeled with the proposed prosody labeling system. The test set is then synthesized 5 times, with the prosody tags of the words in leaf d manually set to d0 through d4, respectively, while the prosody tags of the other words are predicted and sampled. FIG. 7 shows an example in which the word "responsibility" between the yellow dashed lines is manually controlled with d0 through d4. It can be observed that the five prosodies of the word are all different, demonstrating the controllability of the prosody tags.
It also needs to be confirmed that the same prosody tag leads to similar prosody. Therefore, for all words in leaf d of the test set, the prosody similarity between the recordings and speech synthesized with different specified prosody tags is evaluated. In theory, when the specified prosody tag equals the ground-truth prosody tag, the word prosody in the synthesized speech should be most similar to the recording.
本方法分别从客观和主观两方面对韵律相似度进行评价。首先,对所有基准韵律标签为dt(t取0到4)的单词,计算录音与指定各韵律标签的合成语音之间的平均梅尔倒谱失真(MCD)。结果如图8所示。可以发现所有对角线值在其所在列中都是最低的,这说明在合成语音中,相同的韵律标签会产生相似的韵律,用于控制的标签也能较好地展现出此标签在训练集上具有的韵律特征。This method evaluates prosodic similarity both objectively and subjectively. First, for all words whose reference prosodic label is dt (with t ranging from 0 to 4), the average mel-cepstral distortion (MCD) between the recordings and the synthesized speech with each assigned prosodic label is computed. The results are shown in Figure 8. All diagonal values are found to be the lowest in their columns, indicating that in the synthesized speech the same prosodic label produces similar prosody, and that a label used for control faithfully reproduces the prosodic characteristics it carries in the training set.
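For reference, mel-cepstral distortion between a recording and a synthesized utterance can be computed as in the following sketch. It assumes the two mel-cepstral sequences are already time-aligned (plausible here, since the TTS model predicts durations); excluding the 0th (energy) coefficient and using the 10/ln(10)·√2 constant follow the conventional MCD definition:

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_syn):
    """Frame-wise MCD in dB between two time-aligned mel-cepstra.

    mc_ref, mc_syn: arrays of shape (T, D); the 0th cepstral
    coefficient (energy) is excluded, as is conventional.
    """
    diff = np.asarray(mc_ref)[:, 1:] - np.asarray(mc_syn)[:, 1:]
    # Standard MCD scaling constant: 10 / ln(10) * sqrt(2)
    const = 10.0 / np.log(10.0) * np.sqrt(2.0)
    return const * np.mean(np.sqrt(np.sum(diff ** 2, axis=1)))
```

Averaging this quantity over all words sharing a reference label dt gives one cell of a matrix like the one in Figure 8.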
同时,通过主观听力测试评估韵律相似度:每组中,向30名听者提供录音以及带有不同指定韵律标签的合成语音,并要求其选出词韵律与录音最相似的合成语音。选择比例以混淆矩阵的形式展示在图9中。与客观评价结果相似,指定韵律标签与基准标签相同的合成语音被选中的比例(即对角线值)在其所在列中最高,进一步验证了韵律标签的可控性。Prosodic similarity was also evaluated in a subjective listening test: for each group, 30 listeners were given the recording together with synthesized speech carrying different assigned prosodic labels, and were asked to select the synthesized utterance whose word prosody was most similar to the recording. The selection proportions are shown as a confusion matrix in Figure 9. Consistent with the objective evaluation, the proportion for synthesized speech whose prosodic label matches the reference — that is, the diagonal value — is the highest in its column, further verifying the controllability of the prosodic labels.
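A listener-selection confusion matrix like the one in Figure 9 can be tabulated from the raw choices roughly as follows. This is only a sketch: the function name and the row-wise normalization into per-reference-label proportions are assumptions about how such a matrix is built, not details from the patent:

```python
import numpy as np

def selection_confusion(ref_labels, chosen_labels, n_labels=5):
    """ref_labels[i]: reference prosody label of word i.
    chosen_labels[i]: label of the synthesized version the listener
    picked as most similar to the recording.

    Returns an (n_labels, n_labels) matrix of selection proportions,
    normalized per reference label (rows sum to 1 where data exists).
    """
    counts = np.zeros((n_labels, n_labels))
    for ref, chosen in zip(ref_labels, chosen_labels):
        counts[ref, chosen] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Avoid dividing by zero for labels that never appear as reference
    return counts / np.where(row_sums == 0, 1.0, row_sums)
```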
总的来说,本方法提出了一种新的无监督词级韵律标注方法,分两个阶段:首先根据单词的语音内容用决策树将单词分为不同的类型,然后在每种类型的单词中分别使用GMM对韵律进行聚类。此外,一个带有派生词级韵律标签的TTS系统被训练用于可控语音合成,其中韵律可以从输入文本预测,也可以手动指定。在LJSpeech上的实验表明,本方法模型比具有预测韵律的典型FastSpeech2模型获得了更好的自然度。此外,对韵律可控性的主客观评价表明,通过指定词级韵律标签可以有效地控制韵律。In summary, this method proposes a new unsupervised word-level prosody labeling approach in two stages: words are first grouped into different types according to their phonetic content using a decision tree, and the prosody is then clustered with a GMM within each word type. In addition, a TTS system with the derived word-level prosody labels is trained for controllable speech synthesis, where the prosody can be predicted from the input text or specified manually. Experiments on LJSpeech show that the proposed model achieves better naturalness than the typical FastSpeech2 model with predicted prosody. Moreover, subjective and objective evaluations of prosody controllability show that prosody can be effectively controlled by specifying word-level prosody labels.
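The second stage summarized above — clustering word-level prosody embeddings with a GMM inside each decision-tree leaf — can be sketched with a small numpy-only EM loop. The diagonal covariance, the deterministic farthest-point initialization, and the function name are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def fit_gmm_labels(embs, n_labels=5, n_iter=50):
    """Cluster the word-level prosody embeddings of one decision-tree
    leaf into n_labels classes (d0..d4) with a diagonal-covariance GMM
    fitted by EM; returns the most likely component index per word."""
    X = np.asarray(embs, dtype=float)
    n, d = X.shape

    def log_post(mu, var, pi):
        # Unnormalized log posterior of each component for each sample
        return (-0.5 * (((X[:, None, :] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(-1) + np.log(pi))

    # Deterministic farthest-point initialization of the means
    mu = [X[0]]
    for _ in range(n_labels - 1):
        dist = np.min([((X - m) ** 2).sum(1) for m in mu], axis=0)
        mu.append(X[dist.argmax()])
    mu = np.array(mu)
    var = np.tile(X.var(axis=0) + 1e-6, (n_labels, 1))
    pi = np.full(n_labels, 1.0 / n_labels)

    for _ in range(n_iter):
        logp = log_post(mu, var, pi)                      # E-step
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        nk = r.sum(axis=0) + 1e-9                         # M-step
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ X ** 2) / nk[:, None] - mu ** 2 + 1e-6
        pi = nk / n

    return log_post(mu, var, pi).argmax(axis=1)
```

Running this once per leaf with n_labels=5 yields per-word labels analogous to the d0–d4 labels used throughout the experiments above.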
如图10所示为本发明一实施例提供的一种韵律标注系统的结构示意图,该系统可执行上述任意实施例所述的韵律标注方法,并配置在终端中。FIG. 10 is a schematic diagram showing the structure of a prosody annotation system provided by an embodiment of the present invention. The system can execute the prosody annotation method described in any of the above embodiments and is configured in a terminal.
本实施例提供的一种韵律标注系统10包括:语音重构程序模块11,聚类程序模块12和韵律标注程序模块13。The prosody annotation system 10 provided in this embodiment includes: a speech reconstruction program module 11 , a clustering program module 12 and a prosody annotation program module 13 .
其中,语音重构程序模块11用于从训练语音数据中提取词级韵律表达,利用所述词级韵律表达对所述语音进行重构,得到带有词级韵律表达的重构语音;聚类程序模块12用于对所述重构语音按照词的文本信息进行第一聚类;韵律标注程序模块13用于对所述第一聚类的结果按照韵律进行无监督的第二聚类,得到每个单词的无监督韵律标注。Among them, the speech reconstruction program module 11 is used to extract word-level prosodic expressions from the training speech data, and reconstruct the speech using the word-level prosodic expressions to obtain reconstructed speech with word-level prosodic expressions; the clustering program module 12 is used to perform a first clustering of the reconstructed speech according to the text information of the words; the prosodic annotation program module 13 is used to perform an unsupervised second clustering of the results of the first clustering according to the prosody to obtain unsupervised prosodic annotations for each word.
进一步地,所述语音重构程序模块用于:Furthermore, the speech reconstruction program module is used to:
预先构建基于FastSpeech2的文本转语音模型以及韵律提取器;Pre-constructing a FastSpeech2-based text-to-speech model and a prosody extractor;
将所述训练语音数据输入至所述文本转语音模型,利用所述文本转语音模型的编码器确定所述训练语音数据的音素序列;Inputting the training speech data into the text-to-speech model, and determining a phoneme sequence of the training speech data using an encoder of the text-to-speech model;
通过所述韵律提取器从所述训练语音数据的梅尔频谱中提取每个单词的词级韵律表达,利用所述词级韵律表达对所述音素序列进行重构,得到带有词级韵律表达的重构语音。The prosody extractor extracts the word-level prosodic expression of each word from the Mel-spectrogram of the training speech data, and reconstructs the phoneme sequence using the word-level prosodic expression to obtain a reconstructed speech with the word-level prosodic expression.
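As a rough stand-in for the learned prosody extractor just described, a word-level representation can be obtained by pooling the mel-spectrogram frames that a forced aligner assigns to each word. The mean-pooling choice and the function name below are assumptions for illustration, not the trained extractor itself:

```python
import numpy as np

def word_level_prosody(mel, word_bounds):
    """mel: (T, n_mels) mel-spectrogram of one utterance.
    word_bounds: list of (start_frame, end_frame) pairs per word,
    e.g. from a forced aligner.

    Returns one vector per word by mean-pooling that word's frames.
    """
    return np.stack([mel[s:e].mean(axis=0) for s, e in word_bounds])
```

These per-word vectors are the kind of embeddings the second-stage clustering operates on.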
本发明实施例还提供了一种非易失性计算机存储介质,计算机存储介质存储有计算机可执行指令,该计算机可执行指令可执行上述任意方法实施例中的韵律标注方法;The embodiment of the present invention further provides a non-volatile computer storage medium, the computer storage medium stores computer executable instructions, and the computer executable instructions can execute the prosody tagging method in any of the above method embodiments;
作为一种实施方式,本发明的非易失性计算机存储介质存储有计算机可执行指令,计算机可执行指令设置为:As an implementation mode, the non-volatile computer storage medium of the present invention stores computer executable instructions, and the computer executable instructions are configured as follows:
从训练语音数据中提取词级韵律表达,利用所述词级韵律表达对所述语音进行重构,得到带有词级韵律表达的重构语音;Extracting word-level prosodic expressions from training speech data, and reconstructing the speech using the word-level prosodic expressions to obtain reconstructed speech with word-level prosodic expressions;
对所述重构语音按照词的文本信息进行第一聚类;Performing a first clustering on the reconstructed speech according to text information of words;
对所述第一聚类的结果按照韵律进行无监督的第二聚类,得到每个单词的无监督韵律标注。An unsupervised second clustering is performed on the results of the first clustering according to prosody to obtain an unsupervised prosodic annotation for each word.
如图11所示为本发明一实施例提供的一种基于韵律标注的文本转语音系统的结构示意图,该系统可执行上述任意实施例所述的基于韵律标注的文本转语音方法,并配置在终端中。FIG11 is a schematic diagram of the structure of a text-to-speech system based on prosody annotation provided in one embodiment of the present invention. The system can execute the text-to-speech method based on prosody annotation described in any of the above embodiments and be configured in a terminal.
本实施例提供的一种基于韵律标注的文本转语音系统20包括:无监督韵律标注程序模块21,语音合成程序模块22。The prosody-annotation-based text-to-speech system 20 provided in this embodiment includes: an unsupervised prosody annotation program module 21 and a speech synthesis program module 22 .
其中,无监督韵律标注程序模块21用于将文本输入至文本转语音模型,对所述文本转语音模型确定的音素序列进行基于词的文本信息的第一聚类和基于韵律的第二聚类,得到所述文本中每个单词的无监督韵律标注;语音合成程序模块22用于至少利用所述每个单词的无监督韵律标注对所述音素序列进行韵律合成,得到带有韵律的语音。Among them, the unsupervised prosodic annotation program module 21 is used to input the text into the text-to-speech model, perform a first clustering based on the text information of the words and a second clustering based on prosody on the phoneme sequence determined by the text-to-speech model, and obtain the unsupervised prosodic annotation of each word in the text; the speech synthesis program module 22 is used to perform prosodic synthesis on the phoneme sequence using at least the unsupervised prosodic annotation of each word to obtain speech with prosody.
进一步地,所述语音合成程序模块还用于:Furthermore, the speech synthesis program module is also used for:
利用所述每个单词的无监督韵律标注以及输入的手动指定韵律标注对所述音素序列进行韵律合成,得到带有韵律的语音。The unsupervised prosodic annotation of each word and the manually specified prosodic annotation provided as input are used to perform prosodic synthesis on the phoneme sequence to obtain speech with prosody.
本发明实施例还提供了一种非易失性计算机存储介质,计算机存储介质存储有计算机可执行指令,该计算机可执行指令可执行上述任意方法实施例中的基于韵律标注的文本转语音方法;The embodiment of the present invention further provides a non-volatile computer storage medium, the computer storage medium stores computer executable instructions, and the computer executable instructions can execute the text-to-speech method based on prosody annotation in any of the above method embodiments;
作为一种实施方式,本发明的非易失性计算机存储介质存储有计算机可执行指令,计算机可执行指令设置为:As an implementation mode, the non-volatile computer storage medium of the present invention stores computer executable instructions, and the computer executable instructions are configured as follows:
将文本输入至文本转语音模型,对所述文本转语音模型确定的音素序列进行基于词的文本信息的第一聚类和基于韵律的第二聚类,得到所述文本中每个单词的无监督韵律标注;Inputting the text into a text-to-speech model, performing a first clustering based on the text information of the words and a second clustering based on prosody on the phoneme sequence determined by the text-to-speech model, to obtain an unsupervised prosodic annotation of each word in the text;
至少利用所述每个单词的无监督韵律标注对所述音素序列进行韵律合成,得到带有韵律的语音。At least the unsupervised prosodic annotation of each word is used to perform prosodic synthesis on the phoneme sequence to obtain speech with prosody.
作为一种非易失性计算机可读存储介质,可用于存储非易失性软件程序、非易失性计算机可执行程序以及模块,如本发明实施例中的方法对应的程序指令/模块。一个或者多个程序指令存储在非易失性计算机可读存储介质中,当被处理器执行时,执行上述任意方法实施例中的韵律标注方法和基于韵律标注的文本转语音方法。As a non-volatile computer-readable storage medium, it can be used to store non-volatile software programs, non-volatile computer executable programs and modules, such as program instructions/modules corresponding to the method in the embodiment of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium, and when executed by the processor, the prosody annotation method and the text-to-speech method based on prosody annotation in any of the above method embodiments are executed.
图12是本申请另一实施例提供的韵律标注方法和基于韵律标注的文本转语音方法的电子设备的硬件结构示意图,如图12所示,该设备包括:FIG12 is a schematic diagram of the hardware structure of an electronic device for a prosody annotation method and a text-to-speech method based on prosody annotation provided in another embodiment of the present application. As shown in FIG12 , the device includes:
一个或多个处理器1210以及存储器1220,图12中以一个处理器1210为例。韵律标注方法和基于韵律标注的文本转语音方法的设备还可以包括:输入装置1230和输出装置1240。One or more processors 1210 and a memory 1220 , and FIG12 takes one processor 1210 as an example. The apparatus of the prosody annotation method and the text-to-speech method based on prosody annotation may further include: an input device 1230 and an output device 1240 .
处理器1210、存储器1220、输入装置1230和输出装置1240可以通过总线或者其他方式连接,图12中以通过总线连接为例。The processor 1210, the memory 1220, the input device 1230 and the output device 1240 may be connected via a bus or other means, and FIG12 takes the connection via a bus as an example.
存储器1220作为一种非易失性计算机可读存储介质,可用于存储非易失性软件程序、非易失性计算机可执行程序以及模块,如本申请实施例中的韵律标注方法和基于韵律标注的文本转语音方法对应的程序指令/模块。处理器1210通过运行存储在存储器1220中的非易失性软件程序、指令以及模块,从而执行服务器的各种功能应用以及数据处理,即实现上述方法实施例韵律标注方法和基于韵律标注的文本转语音方法。The memory 1220, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer executable programs and modules, such as program instructions/modules corresponding to the prosody annotation method and the text-to-speech method based on prosody annotation in the embodiments of the present application. The processor 1210 executes various functional applications and data processing of the server by running the non-volatile software programs, instructions and modules stored in the memory 1220, that is, the prosody annotation method and the text-to-speech method based on prosody annotation in the above method embodiments are implemented.
存储器1220可以包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需要的应用程序;存储数据区可存储数据等。此外,存储器1220可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他非易失性固态存储器件。在一些实施例中,存储器1220可选包括相对于处理器1210远程设置的存储器,这些远程存储器可以通过网络连接至移动装置。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。The memory 1220 may include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application required for at least one function; the data storage area may store data, etc. In addition, the memory 1220 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one disk storage device, a flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 1220 may optionally include a memory remotely arranged relative to the processor 1210, and these remote memories may be connected to the mobile device via a network. Examples of the above-mentioned network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
输入装置1230可接收输入的数字或字符信息。输出装置1240可包括显示屏等显示设备。The input device 1230 can receive input digital or character information. The output device 1240 can include a display device such as a display screen.
所述一个或者多个模块存储在所述存储器1220中,当被所述一个或者多个处理器1210执行时,执行上述任意方法实施例中的韵律标注方法和基于韵律标注的文本转语音方法。The one or more modules are stored in the memory 1220 , and when executed by the one or more processors 1210 , perform the prosody annotation method and the text-to-speech method based on prosody annotation in any of the above method embodiments.
上述产品可执行本申请实施例所提供的方法,具备执行方法相应的功能模块和有益效果。未在本实施例中详尽描述的技术细节,可参见本申请实施例所提供的方法。The above-mentioned product can execute the method provided in the embodiment of the present application, and has the functional modules and beneficial effects corresponding to the execution method. For technical details not fully described in this embodiment, please refer to the method provided in the embodiment of the present application.
非易失性计算机可读存储介质可以包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需要的应用程序;存储数据区可存储根据装置的使用所创建的数据等。此外,非易失性计算机可读存储介质可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他非易失性固态存储器件。在一些实施例中,非易失性计算机可读存储介质可选包括相对于处理器远程设置的存储器,这些远程存储器可以通过网络连接至装置。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。The non-volatile computer-readable storage medium may include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application required for at least one function; the data storage area may store data created according to the use of the device, etc. In addition, the non-volatile computer-readable storage medium may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one disk storage device, a flash memory device, or other non-volatile solid-state storage device. In some embodiments, the non-volatile computer-readable storage medium may optionally include a memory remotely arranged relative to the processor, and these remote memories may be connected to the device via a network. Examples of the above-mentioned network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
本发明实施例还提供一种电子设备,其包括:至少一个处理器,以及与所述至少一个处理器通信连接的存储器,其中,所述存储器存储有可被所述至少一个处理器执行的指令,所述指令被所述至少一个处理器执行,以使所述至少一个处理器能够执行本发明任一实施例的韵律标注方法和基于韵律标注的文本转语音方法的步骤。An embodiment of the present invention also provides an electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the steps of the prosody annotation method and the text-to-speech method based on prosody annotation of any embodiment of the present invention.
本申请实施例的电子设备以多种形式存在,包括但不限于:The electronic device of the embodiment of the present application exists in various forms, including but not limited to:
(1)移动通信设备:这类设备的特点是具备移动通信功能,并且以提供话音、数据通信为主要目标。这类终端包括:智能手机、多媒体手机、功能性手机,以及低端手机等。(1) Mobile communication equipment: This type of equipment is characterized by having mobile communication functions and its main purpose is to provide voice and data communications. This type of terminal includes: smart phones, multimedia phones, functional phones, and low-end phones.
(2)超移动个人计算机设备:这类设备属于个人计算机的范畴,有计算和处理功能,一般也具备移动上网特性。这类终端包括:PDA、MID和UMPC设备等,例如平板电脑。(2) Ultra-mobile personal computer devices: These devices fall into the category of personal computers, have computing and processing capabilities, and generally also have mobile Internet access features. These terminals include: PDAs, MIDs, and UMPC devices, such as tablet computers.
(3)便携式娱乐设备:这类设备可以显示和播放多媒体内容。该类设备包括:音频、视频播放器,掌上游戏机,电子书,以及智能玩具和便携式车载导航设备。(3) Portable entertainment devices: These devices can display and play multimedia content. They include audio and video players, handheld game consoles, e-books, smart toys, and portable car navigation devices.
(4)其他具有数据处理功能的电子装置。(4) Other electronic devices with data processing functions.
在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。In this document, relational terms such as first and second are used only to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include" and "comprise" are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises the element.
以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性的劳动的情况下,即可以理解并实施。The device embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of this embodiment. Those of ordinary skill in the art may understand and implement it without creative work.
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到各实施方式可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件。基于这样的理解,上述技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品可以存储在计算机可读存储介质中,如ROM/RAM、磁碟、光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行各个实施例或者实施例的某些部分所述的方法。Through the description of the above implementation methods, those skilled in the art can clearly understand that each implementation method can be implemented by means of software plus a necessary general hardware platform, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solution is essentially or the part that contributes to the prior art can be embodied in the form of a software product, and the computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a disk, an optical disk, etc., including a number of instructions for a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods described in each embodiment or some parts of the embodiments.
最后应说明的是:以上实施例仅用以说明本发明的技术方案,而非对其限制;尽管参照前述实施例对本发明进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本发明各实施例技术方案的精神和范围。Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, rather than to limit it. Although the present invention has been described in detail with reference to the aforementioned embodiments, those skilled in the art should understand that they can still modify the technical solutions described in the aforementioned embodiments, or make equivalent replacements for some of the technical features therein. However, these modifications or replacements do not deviate the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (6)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111591322.4A CN114255736B (en) | 2021-12-23 | 2021-12-23 | Rhythm annotation method and system |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114255736A CN114255736A (en) | 2022-03-29 |
| CN114255736B true CN114255736B (en) | 2024-08-23 |
Family
ID=80797183
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202111591322.4A Active CN114255736B (en) | 2021-12-23 | 2021-12-23 | Rhythm annotation method and system |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114255736B (en) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116129859A (en) * | 2022-11-16 | 2023-05-16 | 马上消费金融股份有限公司 | Prosody labeling method, acoustic model training method, voice synthesis method and voice synthesis device |
| CN119694288B (en) * | 2024-10-18 | 2025-09-30 | 马上消费金融股份有限公司 | Speech synthesis method, apparatus, electronic device, computer-readable storage medium, and computer program product |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105529023A (en) * | 2016-01-25 | 2016-04-27 | 百度在线网络技术(北京)有限公司 | Voice synthesis method and device |
| CN110534087A (en) * | 2019-09-04 | 2019-12-03 | 清华大学深圳研究生院 | A kind of text prosody hierarchy Structure Prediction Methods, device, equipment and storage medium |
Family Cites Families (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5384893A (en) * | 1992-09-23 | 1995-01-24 | Emerson & Stern Associates, Inc. | Method and apparatus for speech synthesis based on prosodic analysis |
| US20130311185A1 (en) * | 2011-02-15 | 2013-11-21 | Nokia Corporation | Method apparatus and computer program product for prosodic tagging |
| TWI573129B (en) * | 2013-02-05 | 2017-03-01 | 國立交通大學 | Streaming encoder, prosody information encoding device, prosody-analyzing device, and device and method for speech-synthesizing |
| US11322135B2 (en) * | 2019-09-12 | 2022-05-03 | International Business Machines Corporation | Generating acoustic sequences via neural networks using combined prosody info |
| CN112820266B (en) * | 2020-12-29 | 2023-11-14 | 中山大学 | Parallel end-to-end speech synthesis method based on skip encoder |
| CN112863482B (en) * | 2020-12-31 | 2022-09-27 | 思必驰科技股份有限公司 | Method and system for speech synthesis with prosody |
| CN113506562B (en) * | 2021-07-19 | 2022-07-19 | 武汉理工大学 | End-to-end voice synthesis method and system based on fusion of acoustic features and text emotional features |
- 2021-12-23: CN application CN202111591322.4A filed; patent CN114255736B (Active)
Also Published As
| Publication number | Publication date |
|---|---|
| CN114255736A (en) | 2022-03-29 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||