JPS61259300A

JPS61259300A - Voice synthesization system

Info

Publication number: JPS61259300A
Application number: JP60102194A
Authority: JP
Inventors: 博雄北川
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1985-05-14
Filing date: 1985-05-14
Publication date: 1986-11-17

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention]

Technical Field

The present invention relates to a speech synthesis method that outputs synthesized speech from an arbitrary character string, and more particularly to a spectral parameter generation method for obtaining high-quality synthesized speech.

Prior Art

In speech synthesis methods that produce arbitrary speech, phonemes, syllables, VCV (vowel-consonant-vowel) units, or the like serve as the basic unit of synthesis. The spectral parameters and driving sound source signal of each segment are concatenated according to fixed rules and supplied to a speech synthesizer to obtain synthesized speech. The spectral parameters are typically derived from linear predictive analysis, such as LPC (linear predictive coding), PARCOR (partial autocorrelation), or LSP (line spectrum pair). For the driving sound source, the mainstream approaches use an impulse train and white noise or, to improve synthesized sound quality, the residual waveform.

Conventionally, the speech segments used as synthesis units have been obtained by cutting the required portions out of human speech uttered as single syllables, words, sentences, and so on, and converting them into spectral parameters by linear predictive analysis. However, linear predictive analysis approximates the sound source characteristics, vocal tract characteristics, and radiation characteristics of the speech production process together as a single all-pole model. The resulting parameters therefore include the sound source characteristics at the time the segment was recorded, and sound quality deteriorates when segments with different sound source characteristics are concatenated to produce synthesized speech.
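As a rough illustration of the linear predictive analysis referred to above, the following minimal sketch fits an all-pole model to one frame with the autocorrelation method and the Levinson-Durbin recursion. The function names, the coefficient convention A(z) = 1 + a1 z^-1 + ... + ap z^-p, and the sample frame are assumptions made for this example; the patent does not specify an analysis procedure. Note that the sign convention for reflection (PARCOR) coefficients varies between texts.

```python
def autocorrelation(frame, max_lag):
    """Return autocorrelation values r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations for an all-pole model 1/A(z).

    Returns (a, reflection, err): a = [1, a1, ..., ap] are the inverse-filter
    coefficients, reflection holds one reflection coefficient per stage, and
    err is the remaining prediction-error energy.
    """
    a = [1.0] + [0.0] * order
    reflection = []
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate frame (e.g. silence); stop early
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err
        reflection.append(k)
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + k * prev[m - i]
        a[m] = k
        err *= (1.0 - k * k)
    return a, reflection, err


# Usage with an arbitrary illustrative frame (not data from the patent).
frame = [1.0, 0.8, 0.3, -0.2, -0.5, -0.4, 0.0, 0.3]
coeffs, refl, energy = levinson_durbin(autocorrelation(frame, 2), 2)
print(coeffs, refl, energy)
```

Because the fitted model absorbs everything that shapes the spectrum, including the glottal source, two segments recorded with different voice qualities yield parameters that do not join cleanly, which is the drawback described above.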

Object

The present invention was made in view of the circumstances described above, and in particular aims to obtain highly natural synthesized speech for an arbitrary input character sequence.

Configuration

To achieve the above object, the present invention provides a speech synthesis method in which spectral parameters and residual signals registered in a speech segment file are read out according to the phoneme sequence of the speech to be synthesized, and the read parameters and residual signals are sequentially concatenated according to fixed rules and supplied to a speech synthesizer to output speech, characterized in that the spectral parameters are subjected to flattening processing, and the flattened parameters and the residual signal obtained from an inverse filter are registered in the speech segment file and used for synthesis. The invention is described below with reference to embodiments.

FIG. 1 is an electrical block diagram for explaining one embodiment of the present invention. In the figure, 1 is a speech input unit, 2 is an adaptive inverse filter, 3 is a spectrum analysis unit, 4 is an inverse filter, and 5 is a speech segment file. In this embodiment, when the speech segment file is created, the speech waveform is passed through an adaptive inverse filter that flattens the spectral envelope and is then analyzed, and the resulting spectral parameters are registered. The residual waveform used to generate the driving sound source signal is extracted by passing the speech waveform through the inverse filter of the synthesizer, and is registered together with the spectral parameters.

First, the speech to be registered in the speech segment file 5 is input from the speech input unit 1, the speech waveform is passed through the adaptive inverse filter 2, and the spectrum analysis unit 3 then performs spectrum analysis. The adaptive inverse filter 2 has the inverse of the glottal wave characteristics, and the overall spectral shape of a speech waveform that has passed through it is adaptively flattened.

It is already known that this filter can be realized by combining second-order and third-order inverse filters. In the spectrum analysis unit 3, the parameters required at synthesis time, for example LSP (line spectrum pair) parameters based on linear predictive analysis, are extracted at every frame period and registered as the parameter dictionary for the segments.
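The idea of adaptively flattening the spectral shape before analysis can be sketched as follows. This is a deliberately simplified stand-in, not the combination of second-order and third-order inverse filters mentioned in the text: it uses a single first-order inverse filter whose coefficient is re-estimated for every frame. The function names and the synthetic test frame are assumptions made for the example.

```python
import random


def lag1_correlation(frame):
    """Normalized lag-1 autocorrelation, used here as a crude tilt measure."""
    r0 = sum(x * x for x in frame)
    if r0 == 0.0:
        return 0.0
    r1 = sum(frame[i] * frame[i + 1] for i in range(len(frame) - 1))
    return r1 / r0


def adaptive_flatten(frame):
    """First-order adaptive inverse filter: y[n] = x[n] - mu * x[n-1].

    mu is estimated from the frame itself, so the filter follows the
    spectral tilt of each frame (assumption: one coefficient per frame).
    """
    mu = lag1_correlation(frame)
    out = [frame[0]] if frame else []
    for n in range(1, len(frame)):
        out.append(frame[n] - mu * frame[n - 1])
    return out


# Usage: a synthetic frame with a falling spectrum, built from seeded noise.
rng = random.Random(0)
frame, prev = [], 0.0
for _ in range(2000):
    prev = 0.9 * prev + rng.uniform(-1.0, 1.0)
    frame.append(prev)
flat = adaptive_flatten(frame)
print(lag1_correlation(frame), lag1_correlation(flat))
```

A lag-1 correlation close to zero after filtering indicates that the overall tilt has been removed, so a later spectral analysis of the flattened frame mainly captures the finer envelope structure.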

The residual signal is generated by supplying the parameters obtained by the above method to the inverse filter 4 of the synthesizer and passing the original input speech waveform through this filter. When the residual waveform is registered, it may be processed in any way that makes the information needed at synthesis time easy to retrieve.
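The relationship between the inverse filter and the synthesis filter can be illustrated with a short sketch. It assumes the spectral parameters have been converted to direct-form coefficients a = [1, a1, ..., ap] of A(z); the function names and the sample values are hypothetical. Passing the waveform through A(z) gives the residual, and driving 1/A(z) with that residual restores the waveform.

```python
def inverse_filter(signal, a):
    """Apply the FIR filter A(z) = 1 + a[1] z^-1 + ... to get the residual."""
    order = len(a) - 1
    residual = []
    for n in range(len(signal)):
        acc = 0.0
        for i in range(order + 1):
            if n - i >= 0:
                acc += a[i] * signal[n - i]
        residual.append(acc)
    return residual


def synthesis_filter(excitation, a):
    """Apply the all-pole filter 1/A(z), driven by the excitation."""
    order = len(a) - 1
    out = []
    for n in range(len(excitation)):
        acc = excitation[n]
        for i in range(1, order + 1):
            if n - i >= 0:
                acc -= a[i] * out[n - i]
        out.append(acc)
    return out


# Usage with illustrative values (a[0] must be 1.0).
coeffs = [1.0, -0.9, 0.2]
speech = [1.0, 0.5, -0.25, 0.75, 0.0, -1.0]
residual = inverse_filter(speech, coeffs)
rebuilt = synthesis_filter(residual, coeffs)
print(residual, rebuilt)
```

Because the residual is computed from the original waveform with the flattened parameters, whatever the flattened parameters leave out of the spectrum remains in the residual, and the pair still reproduces the recorded segment.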

FIG. 2 shows an example of a speech synthesis device that uses the speech segment file described above. In the figure, 11 is a character string input unit, 12 is a syntax analysis unit, 13 is a dictionary, 14 is a phoneme sequence generation unit, 15 is a parameter time-series generation unit, 16 is a pitch pattern generation unit, 17 is a sound source time-series generation unit, 18 is a speech segment file, 19 is a synthesis filter, 20 is a D/A conversion unit, and 21 is a speaker. The syntax analysis unit 12 analyzes the input character sequence from the character string input unit 11 for reading, accent position, intonation, and so on, and passes the analysis results to the phoneme sequence generation unit 14 and the pitch pattern generation unit 16.

The phoneme sequence generation unit 14 generates the phoneme code sequence required for synthesis. The parameter time-series generation unit 15 reads the parameters of the corresponding segments from the speech segment file 18 and connects them smoothly. The pitch pattern generation unit 16 determines the pitch period based on intonation information and the like. The sound source time-series generation unit 17 reads the residual waveforms of the corresponding segments from the speech segment file 18, modifies the pitch period and so on, and generates the required driving sound source signal. The parameter and sound source time-series data are input to the synthesis filter 19 and, via the D/A conversion unit 20, are output from the speaker 21 as synthesized speech. Because spectral distortion at the segment joints is reduced, smooth speech output can be achieved.
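One simple way to connect the parameter tracks of two segments smoothly is to bridge the joint with linearly interpolated frames. The sketch below is only an assumption about how such smoothing could be done; the text does not state the connection rule, and the function name, the number of bridge frames, and the sample values are hypothetical.

```python
def concatenate_smoothly(seg_a, seg_b, n_bridge):
    """Join two parameter tracks (lists of frames, each a list of floats).

    n_bridge extra frames are inserted at the joint, moving linearly from
    the last frame of seg_a to the first frame of seg_b.
    """
    last, first = seg_a[-1], seg_b[0]
    bridge = []
    for k in range(1, n_bridge + 1):
        w = k / (n_bridge + 1)  # interpolation weight, strictly between 0 and 1
        bridge.append([(1.0 - w) * x + w * y for x, y in zip(last, first)])
    return seg_a + bridge + seg_b


# Usage with two tiny illustrative tracks of two parameters per frame.
track = concatenate_smoothly([[0.0, 1.0]], [[1.0, 3.0]], 1)
print(track)
```

If the frames hold LSP parameters, a weighted average of two ascending parameter sets is itself ascending, which is one reason LSP parameters are convenient to interpolate. Inserting bridge frames lengthens the utterance slightly; an implementation that must keep durations fixed would instead overwrite frames near the joint.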

FIG. 3 is an electrical block diagram for explaining another embodiment of the present invention. In this embodiment, creation of the speech segment file uses a phoneme determination unit 6 that determines whether the speech is voiced or unvoiced, a switch 7 that operates according to the result of that determination, and two filters 8 and 9, one for high-frequency emphasis and one for low-frequency emphasis. The case of two filters switched according to the voiced/unvoiced determination is described below, but the invention is not limited to this configuration.

First, the phoneme determination unit 6 determines whether the speech signal input at the speech input unit 1 is voiced or unvoiced. Known methods for the voiced/unvoiced determination include one that uses the slope of the spectral envelope. The switch 7 is linked to the phoneme determination unit 6 and selects between the two filters 8 and 9. When the speech is determined to be voiced, filter 9 (+6 dB/oct) is selected, and the speech signal is spectrally analyzed after high-frequency emphasis. When it is determined to be unvoiced, filter 8 (-6 dB/oct) is selected, and the signal is spectrally analyzed after low-frequency emphasis. The operation of the spectrum analysis unit 3 and the subsequent processing flow are the same as in the embodiment shown in FIG. 1.
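The switching scheme of this embodiment can be sketched as below. Everything specific here is an assumption for illustration: the tilt measure (normalized lag-1 autocorrelation as a proxy for the spectral envelope slope), the decision threshold, the filter coefficient, and the first-order filters used to approximate the +6 dB/oct and -6 dB/oct characteristics.

```python
def spectral_tilt(frame):
    """Normalized lag-1 autocorrelation: near +1 for falling (low-pass)
    spectra, near -1 for rising (high-pass) spectra."""
    r0 = sum(x * x for x in frame)
    if r0 == 0.0:
        return 0.0
    r1 = sum(frame[i] * frame[i + 1] for i in range(len(frame) - 1))
    return r1 / r0


def is_voiced(frame, threshold=0.3):
    """Hypothetical voiced/unvoiced decision based on spectral tilt."""
    return spectral_tilt(frame) > threshold


def high_emphasis(frame, coeff=0.95):
    """y[n] = x[n] - coeff * x[n-1]; roughly a +6 dB/oct characteristic."""
    return [x - coeff * (frame[n - 1] if n > 0 else 0.0)
            for n, x in enumerate(frame)]


def low_emphasis(frame, coeff=0.95):
    """y[n] = x[n] + coeff * y[n-1]; roughly a -6 dB/oct characteristic."""
    out, prev = [], 0.0
    for x in frame:
        prev = x + coeff * prev
        out.append(prev)
    return out


def flatten_frame(frame):
    """Switch between the two filters according to the voicing decision."""
    if is_voiced(frame):
        return high_emphasis(frame)   # plays the role of filter 9
    return low_emphasis(frame)        # plays the role of filter 8


# Usage with two illustrative frames.
print(is_voiced([1.0] * 10), is_voiced([1.0, -1.0] * 5))
```

Voiced speech typically has a falling spectrum and unvoiced speech a flatter or rising one, so applying opposite emphasis to the two classes pushes both toward a flat overall shape before spectral analysis.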

FIG. 4 is an electrical block diagram for explaining another embodiment of the present invention. In this embodiment, a parameter correction unit 10 is used when the speech segment file is created. First, the speech to be registered in the speech segment file 5 is input from the speech input unit 1, and the spectrum analysis unit 3 performs spectrum analysis. The spectrum analysis unit 3 outputs the linear prediction coefficients and the slope information of the spectral envelope obtained from those coefficients. The parameter correction unit 10 receives the results from the spectrum analysis unit 3, corrects the linear prediction coefficients so as to remove the slope of the spectral envelope, converts them into the parameters required at synthesis time, and registers them in the speech segment file 5. Spectrum analysis and parameter correction based on linear predictive analysis have been described here, but the invention is not limited to this; for example, an FFT (fast Fourier transform) may be used in the spectrum analysis unit. Extraction of the residual waveform and the subsequent processing can be carried out in the same way as in the embodiment shown in FIG. 1.
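The text does not say how the linear prediction coefficients are corrected. As one hypothetical way to remove the envelope slope in the parameter domain rather than on the waveform, the sketch below corrects the autocorrelation sequence instead: if y[n] = x[n] - mu * x[n-1], then r_y(k) = (1 + mu^2) r(k) - mu (r(k-1) + r(k+1)), so a tilt-free autocorrelation can be computed directly from the analysis results, and tilt-free prediction coefficients can then be derived from it with the usual recursion. The function name and the choice mu = r(1)/r(0) are assumptions.

```python
def remove_tilt_from_autocorrelation(r):
    """Given r[0..p+1], return the autocorrelation r'[0..p] the signal would
    have after the first-order inverse filter 1 - mu z^-1, mu = r[1] / r[0].

    This works on analysis results only; the waveform is not filtered.
    """
    if r[0] == 0.0:
        return [0.0] * (len(r) - 1)
    mu = r[1] / r[0]
    gain = 1.0 + mu * mu
    corrected = []
    for k in range(len(r) - 1):
        left = r[abs(k - 1)]   # r(-1) equals r(1) by symmetry
        right = r[k + 1]
        corrected.append(gain * r[k] - mu * (left + right))
    return corrected


# Usage: autocorrelation of an idealized signal whose only structure is tilt.
print(remove_tilt_from_autocorrelation([1.0, 0.5, 0.25, 0.125]))
```

In this idealized case the corrected sequence has no correlation left at non-zero lags, which corresponds to a flat spectral envelope; for real speech the remaining correlation would reflect the formant structure without the overall slope.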

Effect

As is clear from the above description, according to the present invention the parameters of the registered speech segments no longer include sound source characteristics, so highly natural synthesized speech can be obtained when synthesizing arbitrary sentences.

[Brief Description of the Drawings]

FIG. 1 is an electrical block diagram for explaining one embodiment of the speech synthesis method according to the present invention, FIG. 2 is an electrical block diagram for explaining one embodiment of a speech synthesis device, and FIGS. 3 and 4 are electrical block diagrams for explaining other embodiments of the present invention.

1: speech input unit, 2: adaptive inverse filter, 3: spectrum analysis unit, 4: inverse filter, 5: speech segment file, 6: phoneme determination unit, 7: switch, 8, 9: filters, 10: parameter correction unit, 11: character string input unit, 12: syntax analysis unit, 13: dictionary, 14: phoneme sequence generation unit, 15: parameter time-series generation unit, 16: pitch pattern generation unit, 17: sound source time-series generation unit, 18: speech segment file, 19: synthesis filter, 20: D/A conversion unit, 21: speaker.

Claims

[Claims]

(1) A speech synthesis method in which spectral parameters and residual signals registered in a speech segment file are read out according to the phoneme sequence of the speech to be synthesized, and the read parameters and residual signals are sequentially concatenated according to fixed rules and supplied to a speech synthesizer to output speech, characterized in that the spectral parameters are subjected to flattening processing, and the flattened parameters and the residual signal obtained from an inverse filter are registered in the speech segment file and used for synthesis.

(2) The speech synthesis method according to claim (1), wherein the spectral parameters are flattened using an adaptive inverse filter for removing radiation and sound source characteristics in the speech generation process.

(3) The speech synthesis method according to claim (1), wherein the spectral parameters are flattened using two or more types of filters and a phoneme determining section for selecting them.

(4) The speech synthesis method according to claim (1), characterized in that the spectral parameters are flattened by performing spectral conversion of the speech waveform and performing parameter correction to remove its slope.