CN102063897B - A sound bank compression and usage method for embedded speech synthesis system - Google Patents
Description
Technical Field
The present invention relates generally to a sound bank compression and usage method for embedded speech synthesis systems, and in particular for terminal devices with limited storage and computing resources.
Background
The purpose of speech synthesis technology is to let machines reproduce natural human speech. Embedded devices are widely used, terminal-class embedded devices interact frequently with users, and speech is the most natural means of interaction. A typical speech synthesis system can be divided into three main functional modules: a text analysis module, a prosody generation module, and an acoustic synthesis module. Concatenative synthesis based on a large-scale corpus is widely used because the technique is simple and the synthesized speech quality is high. However, this approach requires a large sound bank; although its footprint can be reduced by clustering, coding, and compression, the sound quality suffers and flexibility is lost. In recent years, therefore, statistical parametric synthesis based on large-scale corpora has been widely studied. The basic idea is to parameterize and statistically model a large original speech corpus; at synthesis time, models are selected according to specific rules to form a model sequence, the parameter sequence of the sentence to be synthesized is computed from it, and the target speech is produced by parametric synthesis. Speech synthesized by statistical parametric modeling has a high degree of naturalness and intelligibility.
With this method, to guarantee synthesis quality the original speech corpus must cover prosodic variation as completely as possible, and the resulting model library can reach hundreds of megabytes. After model clustering, the library can be compressed to roughly ten megabytes. This scale is acceptable for the storage and computing capabilities of mid- to high-end devices such as handheld computers, but it still fails to meet practical requirements on terminal devices with limited computing and storage resources.
When training a statistical parametric speech model library, the speech features commonly used are pitch frequency, spectral coefficients, and duration, and the parametric model is the hidden semi-Markov model (HSMM). Owing to the state-transition structure of the HSMM, the model for each feature consists of a decision tree per state and probability distribution functions at the decision-tree leaf nodes. The probability density functions are commonly represented by single Gaussians. In the final model, the spectral-coefficient models occupy 80% to 90% of the total size and are the part that most needs compression. Existing methods for reducing the size of the spectral model include lowering numerical precision, controlling the clustering factor, and tying variances. With syllables as the basic unit of the synthesis system, and with the amount of training data reduced to the minimum at which the synthesized speech still sounds acceptable, a model library built with these methods still requires at least 1 megabyte of storage. Moreover, if clustering is controlled more aggressively, the naturalness and quality of the synthesized speech degrade significantly. Such a system remains too expensive for resource-constrained devices and cannot meet users' needs.
Therefore, an improved method is needed for realizing a parametric speech synthesis system that occupies fewer resources on an embedded platform.
Summary of the Invention
The technical problem addressed by the present invention is to provide a sound bank compression and usage method for an embedded Chinese speech synthesis system. It allows the speech model library to occupy very little storage and increases computation speed, while preserving good naturalness and sound quality.
To achieve this goal, a method for compressing and using parametric statistical models is provided, which reduces the space occupied by the model library while maintaining synthesis quality. The training and synthesis of the original model library use Chinese syllables as the basic unit; the compression of the model library consists of the following three steps:
A. Create an original model library based on Chinese syllables.
B. Decompose the single Gaussian distribution representing each original spectral model into three parts: energy, spectral mean, and spectral variance. Compress the spectral means and spectral variances separately by vector quantization.
C. Combine the energy values, the compressed spectral-mean codebook and indices, and the global variance into the final compressed model library.
In the above method, the creation of the original model library based on Chinese syllables consists of the following five steps:
A. Create an original speech corpus based on Chinese syllables.
B. Extract the pitch frequency, spectral parameters, and duration parameters of all syllables in the corpus, and train context-independent syllable models.
C. Train context-dependent syllable models using the context information of all syllables, and perform decision-tree-based state clustering on the models.
D. Further train the clustered model parameters.
E. Return to step C and repeat steps C and D, then output the parametric statistical models.
In the above method, the spectral model compression consists of the following six steps:
A. Split the state Gaussian distributions of the spectral model into three parts: energy, spectral mean, and spectral variance. The method takes first-order and second-order dynamic features into account.
B. Use the mean vectors of all state distributions (comprising static, first-order dynamic, and second-order dynamic features) as training samples for vector quantization codebook training.
C. After vector quantization classification, search each class for the training sample with the smallest distance to the class codeword, and store it in place of that codeword.
D. Reclassify the training samples with the new codebook.
E. Check whether the new classification is identical to the previous one. If so, spectral-mean codebook training ends; if not, return to step C and repeat steps C and D.
F. Average the variance vectors of all state distributions (comprising static, first-order dynamic, and second-order dynamic features) to obtain the global variance vector.
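Steps B through E describe a vector quantization loop whose codewords are real training samples rather than synthetic centroids. The sketch below illustrates that loop in Python; it uses a plain Euclidean distance and random initialization as stand-ins (the patent uses LBG initialization and a weighted LSP distance), and all names are illustrative:

```python
import numpy as np

def train_sample_codebook(samples, n_codes, max_iter=100, seed=0):
    """Train a VQ codebook whose codewords are real training samples.

    Loop (steps C-E in the text): classify samples against the codebook,
    replace each codeword with its class's nearest sample, and repeat
    until the classification no longer changes.
    """
    rng = np.random.default_rng(seed)
    # Initial codebook: random distinct samples (a stand-in for LBG splitting).
    codebook = samples[rng.choice(len(samples), n_codes, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Classify every sample to its nearest codeword.
        dists = np.linalg.norm(samples[:, None, :] - codebook[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # codebook is stable: classification unchanged
        labels = new_labels
        # Replace each codeword with the member sample closest to it.
        for k in range(n_codes):
            members = np.flatnonzero(labels == k)
            if members.size:
                codebook[k] = samples[members[dists[members, k].argmin()]]
    return codebook, labels
```

Because every codeword is an actual sample, no quantization artifacts are introduced into the stored means; only the choice of representative per class loses information.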
In the above method, the model recombination proceeds as follows: the state distributions of the original model are replaced by an energy value and the index of the corresponding mean-vector codeword, and the global variance vector is stored last.
This method can greatly compress a spectral model that uses syllables as primitives while preserving the sound quality and naturalness of the original model.
To better meet the speed requirements of embedded devices, the present invention also provides an embedded speech synthesis system comprising the following four modules:
A. A text analysis and prosody generation module, which analyzes the content of the text to be synthesized and produces the corresponding syllable sequence; each syllable carries prosodic information encoded as context labels in the same format used during model training.
B. A model decision module, which receives the syllable sequence with attached prosodic information, generates the corresponding model state sequence using the trained model decision trees, and obtains the duration decision results.
C. A parameter sequence generation module, which receives the model state sequence, computes the global-variance windowed matrices using the compressed spectral model, and finally computes the spectral parameter sequence and the pitch parameter sequence.
D. A speech waveform synthesis and output module, which receives the parameter sequences, generates the speech waveform data to be synthesized, and outputs it for playback or storage.
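The four modules form a straight pipeline from text to waveform. The sketch below shows only that data flow; every function body is a placeholder (the label format, state counts, and frame math are illustrative assumptions, not the patent's implementation):

```python
from dataclasses import dataclass

@dataclass
class LabeledSyllable:
    syllable: str  # e.g. a Pinyin syllable
    context: dict  # prosodic context labels, same format as in training

def text_analysis(text):
    """Module A: text -> syllable sequence with context labels (placeholder)."""
    return [LabeledSyllable(s, {"pos_in_phrase": i}) for i, s in enumerate(text.split())]

def model_decision(syllables):
    """Module B: decision trees -> state sequence plus durations (placeholder)."""
    return [{"syllable": s.syllable, "states": 10, "dur_frames": 30} for s in syllables]

def parameter_generation(states):
    """Module C: compressed model -> spectral and F0 parameter sequences (placeholder)."""
    n_frames = sum(st["dur_frames"] for st in states)
    return {"lsp": [[0.0] * 18] * n_frames, "f0": [0.0] * n_frames}

def waveform_synthesis(params):
    """Module D: parameters -> waveform samples (placeholder; 5 ms frames at 8 kHz)."""
    return [0.0] * (len(params["f0"]) * 40)

wave = waveform_synthesis(parameter_generation(model_decision(text_analysis("ni hao"))))
```

Each stage consumes only the previous stage's output, which is what lets the compressed model library sit entirely behind modules B and C.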
In the above embedded speech synthesis system, the parameter sequence generation module operates in the following five steps:
A. Compute the energy sequence and spectral coefficient sequence from the state sequence of the spectral coefficients, and compute the pitch frequency sequence from the state sequence of the pitch frequency.
B. Compute the global variance matrices from the global variance. During parameter generation, the feature parameters are generated dimension by dimension; each computation takes one dimension of the means or of the global variance.
C. From the spectral-mean codeword sequence corresponding to the received model state sequence, obtain the one-dimensional spectral-mean codeword sequence.
D. Solve for the feature parameter sequence from the global variance matrices and the state spectral-mean codeword sequence.
E. Check whether all spectral coefficients have been processed. If so, the spectral coefficient computation ends; if not, return to step C and repeat steps C and D.
An embedded speech synthesis system built according to the above method can run entirely on the embedded systems described; its space usage and computational complexity do not exceed the capabilities of the embedded device.
The present invention is further described below in conjunction with the accompanying drawings and embodiments; the detailed description of each system component with reference to the drawings better explains the steps and processes for realizing the invention.
Description of the Drawings
Figure 1: Structural block diagram of the embedded speech synthesis system based on Chinese syllables
Figure 2: Schematic diagram of the spectral model compression process
Figure 3: Schematic diagram of the spectral parameter generation process
Reference numerals: 1. text input; 2. text analysis and prosody generation; 3. model decision; 4. parameter generation; 5. waveform synthesis; 6. speech output; 7. training speech corpus; 8. HMM model training; 9. data decomposition; 10. model compression; 11. data reorganization; 12. compressed model library; 13. is the codebook stable; 14. data reclassification; 15. codebook search and replacement; 16. vector quantization; 17. state spectral means; 18. state spectral variances; 19. variance averaging; 20. end; 21. yes; 22. no; 23. start; 24. compute global variance matrices; 25. obtain the codeword sequence for the current dimension; 26. solve the one-dimensional spectral coefficient sequence; 27. have all 18 dimensions been processed; 28. end; 101. model training part; 102. speech synthesis system.
Detailed Description
As shown in Figure 1, in an embodiment of the present invention the speech synthesis system is deployed on an embedded operating system. The embedded speech synthesis system comprises a model training part (101) and a speech synthesis system (102).
The model training part (101) is used only offline, solely to generate the compressed model library (12) that the speech synthesis system needs at run time. The training speech corpus (7) contains the recorded original speech; generating the compressed model library (12) offline from the training corpus (7) comprises: HMM model training (8), data decomposition (9), model compression (10), and data reorganization (11).
In the HMM model training step (8), the recorded original training corpus is first automatically segmented into syllable units with the HTK speech recognition toolkit to obtain rough boundary information, which is then proofread manually. The syllables are then labeled with context and prosodic information, including: the current syllable and its tone, the previous syllable and its tone, the next syllable and its tone, and high-level prosodic information obtained by grammatical analysis of the text, i.e., the position and number of low-level prosodic units within high-level prosodic units. The prosodic levels considered in this example are prosodic words, prosodic phrases, and sentences. The HTS speech synthesis training toolkit is used to train HSMM models on the original training speech; the models cover pitch frequency, spectral coefficients, and duration parameters. Duration is expressed in frames, with a frame length of 5 milliseconds. Each model has 10 states, and each state's probability distribution is represented by a single Gaussian. During training, the sizes of the pitch and duration models are controlled as needed by moderately adjusting the clustering factor, yielding the original speech model library.
In data decomposition (9), the state Gaussian distributions of the spectral model are split into three parts: energy, spectral mean, and spectral variance. Every feature in this example is a combined feature consisting of its static, first-order dynamic, and second-order dynamic components. The spectral coefficients used here are 18-dimensional line spectral pair (LSP) coefficients, and each HSMM state is represented by a single Gaussian, so each original spectral model state contains one 57-dimensional mean vector and one 57-dimensional variance vector. After decomposition, each original spectral state is represented by one 3-dimensional energy vector, one 54-dimensional mean vector, and one 57-dimensional variance vector. All original state spectral means (17) form the training data for mean-vector quantization, and all original state spectral variances (18) form the training data for the global variance.
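The 57-dimensional split above (18 static LSPs plus 18 delta and 18 delta-delta coefficients, with energy and its delta and delta-delta) can be sketched as follows. The exact ordering of dimensions inside the HSMM mean vector is an assumption here; only the sizes (3 + 54 + 57) come from the text:

```python
import numpy as np

N_LSP = 18  # static LSP order used in the embodiment

def decompose_state(mean_57, var_57):
    """Split one state's 57-dim Gaussian into energy / spectral mean / variance.

    Assumed layout: three blocks of 19 dims, [18 LSPs, 1 energy] for the
    static, delta, and delta-delta parts (ordering is illustrative).
    """
    mean = np.asarray(mean_57, dtype=float).reshape(3, N_LSP + 1)
    energy = mean[:, -1].copy()            # 3-dim energy vector
    spec_mean = mean[:, :-1].reshape(-1)   # 54-dim spectral mean vector
    variance = np.asarray(var_57, dtype=float).copy()  # 57-dim variance kept whole
    return energy, spec_mean, variance
```

Collecting `spec_mean` across all states yields the mean-quantization training set, and the `variance` vectors are later averaged into the single global variance.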
Model compression (10), shown in Figure 2, consists of the following five steps:
Vector quantization (16): the mean vectors serve as the codebook training data, and an initial codebook is trained with the LBG algorithm. During training, based on the characteristics of the chosen 18-dimensional line spectral pair (LSP) coefficients, the distance between coefficients is weighted, defining a weighted LSP distance between two LSP coefficient vectors x and y.
The weighting coefficients are built from the coefficients themselves: x_n, x_{n+1}, x_{n-1} and y_n, y_{n+1}, y_{n-1} denote the n-th, (n+1)-th, and (n-1)-th static coefficients of features x and y, respectively, and Δx_n, Δy_n, Δ²x_n, Δ²y_n the corresponding first-order and second-order dynamic coefficients.
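The exact weighting formula did not survive in this copy of the text. A common heuristic for LSP distances, used below purely as a stand-in for the patent's formula, weights each dimension by the inverse spacing to its neighboring LSPs, since closely spaced LSPs mark spectral peaks and matter more perceptually. Everything in this sketch is an assumption except the idea of a per-dimension weighted distance:

```python
import numpy as np

def lsp_weights(x):
    """Inverse-spacing weights: dimensions where LSPs crowd together get
    larger weights (a common heuristic, assumed here, not the patent's)."""
    x = np.asarray(x, dtype=float)
    padded = np.concatenate(([0.0], x, [np.pi]))  # LSPs lie in (0, pi)
    left = x - padded[:-2]    # gap to the previous LSP (or to 0)
    right = padded[2:] - x    # gap to the next LSP (or to pi)
    return 1.0 / left + 1.0 / right

def weighted_lsp_distance(x, y):
    """Weighted squared distance between two static LSP vectors."""
    w = lsp_weights(x)
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sum(w * d * d))
```

Any such weighted distance slots directly into the codebook search, reclassification, and stability steps below as the single distance criterion.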
Codebook search and replacement (15): after vector quantization classification, search each class for the training sample with the smallest distance to the class codeword and store it in place of that codeword; the distance criterion is the weighted LSP distance defined above.
Data reclassification (14): reclassify the training samples with the new codebook, again using the weighted LSP distance as the classification criterion.
Codebook stability check (13): determine whether the new classification is identical to the previous one. If so, spectral-mean codebook training ends; if not, return to codebook search and replacement (15) and repeat steps (15) and (14).
Variance averaging (19): average the variance vectors of all state distributions (comprising static, first-order dynamic, and second-order dynamic features) to obtain the global variance vector.
In data reorganization (11), the state distributions of the original model are replaced by an energy value and the index of the corresponding mean-vector codeword, and the global variance vector is stored last. The remaining models are combined in order as required to form the compressed model library (12). This concludes the work of the model training part (101) of the speech synthesis system.
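After reorganization, each spectral state costs only a 3-dimensional energy vector plus one integer index, while the 54-dimensional codewords and the single 57-dimensional variance are shared across all states. A sketch of the resulting storage layout (field names and the accounting are illustrative, not the patent's on-disk format):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CompressedSpectralModel:
    codebook: List[List[float]]      # shared 54-dim spectral-mean codewords
    global_variance: List[float]     # one 57-dim variance shared by all states
    state_energy: List[List[float]]  # per-state 3-dim energy vectors
    state_index: List[int]           # per-state codeword index

    def state_mean(self, state):
        """Reconstruct a state's full 57-dim mean: codeword then energy."""
        return self.codebook[self.state_index[state]] + self.state_energy[state]

    def size_floats(self):
        """Storage cost in value slots (an index counted as one slot)."""
        return (len(self.codebook) * 54 + 57
                + len(self.state_energy) * 3 + len(self.state_index))
```

With thousands of states sharing a few hundred codewords, the per-state cost drops from 114 floats (mean plus variance) to 4 slots, which is where the bulk of the compression comes from.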
As shown in Figure 1, text input (1) receives the input text; in this embodiment the system provides an input interface supporting handwriting input or text selection and pasting.
The speech synthesis system (102) comprises text analysis and prosody generation (2), model decision (3), parameter generation (4), and waveform synthesis (5). The text analysis and prosody generation module converts the received Chinese character string into a syllable string with attached context information. The model decision module uses the trained model decision trees to determine the duration of each state of the received syllable string and obtains the model state sequences for pitch frequency and spectral coefficients. The parameter generation module computes the energy sequence and spectral coefficient sequence from the spectral-coefficient state sequence, and the pitch frequency sequence from the pitch state sequence. As shown in Figure 3, the computation of the spectral coefficient sequence consists of the following four steps:
Compute the global variance matrices (24) from the global variance. During parameter generation, the feature parameters are generated dimension by dimension; each computation takes one dimension of the means or of the global variance. Two kinds of global variance matrices are used:
WUWn=W×Un×W WUWn =W× Un ×W
和and
WUn=W×Un WU n =W×U n
其中U是第n维全局方差及其一阶和二阶动态系数,W是动态窗系数,乘法为矩阵乘法;Where U is the nth-dimensional global variance and its first-order and second-order dynamic coefficients, W is the dynamic window coefficient, and the multiplication is matrix multiplication;
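Using a first-order central-difference window as a toy example, the two matrices can be precomputed once per dimension as sketched below. The window coefficients, the use of the transpose of W, and the use of inverse variances follow common HTS-style conventions and are assumptions here; the patent's notation above is looser:

```python
import numpy as np

def delta_window_matrix(T):
    """Stack an identity (static) block and a central-difference (delta)
    block into a (2T x T) window matrix W (HTS-style layout, assumed)."""
    I = np.eye(T)
    D = np.zeros((T, T))
    for t in range(1, T - 1):
        D[t, t - 1], D[t, t + 1] = -0.5, 0.5
    return np.vstack([I, D])

def global_variance_matrices(T, var_static, var_delta):
    """Precompute WUW = W^T U W and WU = W^T U for one dimension.

    U holds the inverse global variances of the static and delta parts,
    repeated over all T frames (per-dimension scheme; inverse-variance
    convention assumed)."""
    W = delta_window_matrix(T)
    U = np.diag([1.0 / var_static] * T + [1.0 / var_delta] * T)
    return W.T @ U @ W, W.T @ U
```

Because the global variance is shared by all states, these matrices depend only on the dimension and utterance length, which is what allows them to be computed once instead of per state.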
Obtain the codeword sequence for the current dimension (25): from the spectral-mean codeword sequence corresponding to the received model state sequence, obtain the one-dimensional spectral-mean codeword sequence.
Solve the one-dimensional spectral coefficient sequence (26): solve for the feature parameter sequence from the global variance matrices and the state spectral-mean codeword sequence by solving a matrix equation in which L is the desired LSP coefficient sequence. Any linear matrix equation solver that meets the computational requirements of the embedded system may be used; this example performs LU decomposition of the coefficient matrix followed by forward and backward Gaussian elimination.
Check whether all 18 dimensions of spectral coefficients have been processed (27). If so, the spectral coefficient computation ends; if not, return to step (25) to obtain the codeword sequence for the next dimension, and repeat steps (25) and (26).
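The matrix equation itself is missing from this copy of the text; given the definitions of WUW_n and WU_n above, the standard statistical parametric generation equation WUW_n · L = WU_n · M_n is a natural reconstruction, where M_n is the per-frame mean sequence of dimension n gathered from the codewords. Under that assumption, steps (25) through (27) reduce to one linear solve per dimension; `np.linalg.solve` stands in for the LU decomposition plus forward/backward elimination named in the text:

```python
import numpy as np

def generate_dimension(WUW, WU, mean_seq):
    """Solve WUW . L = WU . M for one spectral dimension (assumed equation).

    WUW: (T x T); WU: (T x 2T); mean_seq: length-2T stacked static and
    delta means gathered from the codeword sequence for this dimension.
    """
    return np.linalg.solve(WUW, WU @ np.asarray(mean_seq, dtype=float))

def generate_all(WUW, WU, mean_seqs_by_dim):
    """Dimension-by-dimension loop over the 18 LSP dimensions (steps 25-27)."""
    return np.stack([generate_dimension(WUW, WU, m) for m in mean_seqs_by_dim])
```

Since WUW is banded (the dynamic window only couples adjacent frames), a banded solver would reduce the cost from O(T³) to O(T), which matters on the targeted terminal devices.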
This concludes the work of the parameter generation module. The global variance matrix scheme provided by the present invention increases computation speed and saves the space consumed by intermediate steps.
The waveform synthesis step may use any algorithm that fits the device's resource budget, for example the synthesis filtering method used in G.723, or the synthesis filtering of other linear-prediction-based speech decoding algorithms.
Speech output (6) plays back or stores the synthesized digital speech signal.
The present invention relates to a sound bank compression and usage method for an embedded Chinese speech synthesis system. With this method the speech model library occupies very little storage and computation is faster, while good naturalness and sound quality are preserved.
When the present invention is used on an embedded device, all audio input and output can use the interfaces provided by the device itself. The speech function can be switched on or off at any time; when it is disabled, the device's other functions are not affected in any way.
The above example is a preferred embodiment of the present invention, which can be applied to all kinds of embedded terminal devices. Based on the main concept of the invention, a person of ordinary skill in the art can derive many similar or equivalent applications. The protection of the invention is therefore defined by the scope of the claims.
Claims (3)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN2010105807907A CN102063897B (en) | 2010-12-09 | 2010-12-09 | A sound bank compression and usage method for embedded speech synthesis system |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN2013100412045A Division CN103077704A (en) | 2010-12-09 | 2010-12-09 | Voice library compression and use method for embedded voice synthesis system |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN102063897A CN102063897A (en) | 2011-05-18 |
| CN102063897B true CN102063897B (en) | 2013-07-03 |
Family
ID=43999144