
CN115019767A - Singing voice synthesis method and device - Google Patents

Singing voice synthesis method and device

Info

Publication number
CN115019767A
CN115019767A (application CN202210846439.0A)
Authority
CN
China
Prior art keywords
syllable
vibrato
implicit
target song
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210846439.0A
Other languages
Chinese (zh)
Inventor
宋伟
张炜
张政臣
吴友政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Technology Information Technology Co Ltd
Original Assignee
Jingdong Technology Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong Technology Information Technology Co Ltd filed Critical Jingdong Technology Information Technology Co Ltd
Priority to CN202210846439.0A priority Critical patent/CN115019767A/en
Publication of CN115019767A publication Critical patent/CN115019767A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033: Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L 13/0335: Pitch control
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The invention provides a singing voice synthesis method and device, the method comprising: acquiring the sound spectrum data of a target song; extracting the phonemes, pitch and audio from the sound spectrum data; processing the phonemes with an encoder to obtain an implicit spectral representation vector; predicting acoustic parameters with a predictor based on the implicit spectral representation vector and the pitch; extracting and decomposing the pitch trajectory in the audio to obtain the vibrato feature corresponding to each syllable in the target song; predicting the probability of vibrato for each syllable based on its vibrato feature; when the probability of vibrato in a syllable meets a preset threshold, synthesizing the simulated vibrato corresponding to the syllable and marking the syllable; and inputting the sound spectrum data, the acoustic parameters and the simulated vibrato corresponding to the marked syllables into a preset sound model to generate a synthesized singing voice corresponding to the target song. By applying this method, the prosody of the singing voice is improved through the implicit spectral representation vector, and the expressiveness of the singing voice is improved by adding vibrato, so that the synthesized singing voice is more natural.

Description

Singing Voice Synthesis Method and Device

Technical Field

The present invention relates to the field of computer technology, and in particular to a singing voice synthesis method and device.

Background Art

Singing voice synthesis is a technology that synthesizes a virtual singing voice from score information such as lyrics, rhythm and pitch. Existing singing voice synthesis systems, such as the XiaoiceSing, ByteSing and DurIAN-SC singing models, usually use the notes, phonemes, note durations and pitches in the score directly for acoustic feature prediction. In addition, the prior art applies the singing model to predict the energy features of the song and adds these energy features when synthesizing the singing voice to improve its overall prosody.

While singing, singers usually use the technique of vibrato to enhance expressiveness, but the energy features added in the prior art cannot simulate the effect of a real singer producing vibrato. Moreover, these energy features are computed from high-dimensional amplitude-spectrum features, and compressing them into a one-dimensional energy feature loses a large amount of information, so the synthesized singing voice has insufficient prosody and poor expressiveness.

Summary of the Invention

In view of this, the present invention provides a singing voice synthesis method that improves the prosody of the singing voice through an implicit spectral representation vector while adding vibrato to improve the expressiveness of the singing voice, making the synthesized singing voice more natural.

The present invention also provides a singing voice synthesis device to ensure the realization and application of the above method in practice.

A singing voice synthesis method, comprising:

acquiring sound spectrum data of a target song;

extracting the phonemes, pitch and audio corresponding to the target song from the sound spectrum data;

processing the phonemes with a preset encoder to obtain an implicit spectral representation vector corresponding to the target song, where the implicit spectral representation vector characterizes the energy information of the target song;

applying a preset predictor to predict the acoustic parameters corresponding to the target song based on the implicit spectral representation vector and the pitch;

extracting the pitch trajectory from the audio and decomposing the pitch trajectory to obtain the vibrato feature corresponding to each syllable in the target song;

predicting, based on the vibrato feature corresponding to each syllable, the probability that vibrato occurs in that syllable;

when the probability of vibrato in any syllable meets a preset threshold, synthesizing the simulated vibrato corresponding to that syllable and marking the syllable;

inputting the sound spectrum data, the acoustic parameters and the simulated vibrato corresponding to each marked syllable into a preset sound model to generate a synthesized singing voice corresponding to the target song.

In the above method, optionally, processing the phonemes with the preset encoder to obtain the implicit spectral representation vector corresponding to the target song includes:

inputting the phonemes into the encoder to obtain an initial representation vector corresponding to the target song output by the encoder, where the initial representation vector is a phoneme-level representation sequence;

performing level expansion on the initial representation vector to obtain the implicit spectral representation vector corresponding to the initial representation vector, where the implicit spectral representation vector is a frame-level representation sequence.

In the above method, optionally, applying the preset predictor to predict the acoustic parameters corresponding to the target song based on the implicit spectral representation vector and the pitch includes:

inputting the implicit spectral representation vector into a first test layer of the predictor to obtain a spectral representation vector output by the first test layer, where the first test layer is a one-dimensional convolution layer;

concatenating the spectral representation vector, the implicit spectral representation vector and the pitch to obtain concatenated data;

inputting the concatenated data into a second test layer of the predictor to obtain the acoustic parameters output by the second test layer, where the second test layer is a one-dimensional convolution layer.

In the above method, optionally, decomposing the pitch trajectory to obtain the vibrato feature corresponding to each syllable in the target song includes:

applying a preset triangular filter to perform a convolution operation on the pitch trajectory to obtain a smoothed pitch trajectory;

decomposing the smoothed pitch trajectory to obtain the vibrato corresponding to each syllable;

applying a preset Hilbert transform to the vibrato corresponding to each syllable to obtain the vibrato feature corresponding to each syllable.

In the above method, optionally, synthesizing the simulated vibrato corresponding to the syllable includes:

selecting the amplitude, frequency and phase from the vibrato feature corresponding to the syllable;

applying a preset vibrato simulation algorithm to the amplitude, frequency, phase and the probability of vibrato in the syllable to generate the simulated vibrato corresponding to the syllable.

A singing voice synthesis device, comprising:

a first acquisition unit, configured to acquire sound spectrum data of a target song;

an extraction unit, configured to extract the phonemes, pitch and audio corresponding to the target song from the sound spectrum data;

a processing unit, configured to process the phonemes with a preset encoder to obtain an implicit spectral representation vector corresponding to the target song, where the implicit spectral representation vector characterizes the energy information of the target song;

a first prediction unit, configured to apply a preset predictor to predict the acoustic parameters corresponding to the target song based on the implicit spectral representation vector and the pitch;

a decomposition unit, configured to extract the pitch trajectory from the audio and decompose the pitch trajectory to obtain the vibrato feature corresponding to each syllable in the target song;

a second prediction unit, configured to predict, based on the vibrato feature corresponding to each syllable, the probability that vibrato occurs in that syllable;

a first synthesis unit, configured to synthesize the simulated vibrato corresponding to a syllable and mark the syllable when the probability of vibrato in that syllable meets a preset threshold;

a second synthesis unit, configured to input the sound spectrum data, the acoustic parameters and the simulated vibrato corresponding to each marked syllable into a preset sound model to generate a synthesized singing voice corresponding to the target song.

In the above device, optionally, the processing unit includes:

a first input subunit, configured to input the phonemes into the encoder to obtain an initial representation vector corresponding to the target song output by the encoder, where the initial representation vector is a phoneme-level representation sequence;

an expansion subunit, configured to perform level expansion on the initial representation vector to obtain the implicit spectral representation vector corresponding to the initial representation vector, where the implicit spectral representation vector is a frame-level representation sequence.

In the above device, optionally, the first prediction unit includes:

a second input subunit, configured to input the implicit spectral representation vector into a first test layer of the predictor to obtain a spectral representation vector output by the first test layer, where the first test layer is a one-dimensional convolution layer;

a concatenation subunit, configured to concatenate the spectral representation vector, the implicit spectral representation vector and the pitch to obtain concatenated data;

a third input subunit, configured to input the concatenated data into a second test layer of the predictor to obtain the acoustic parameters output by the second test layer, where the second test layer is a one-dimensional convolution layer.

In the above device, optionally, the decomposition unit includes:

an operation subunit, configured to apply a preset triangular filter to perform a convolution operation on the pitch trajectory to obtain a smoothed pitch trajectory;

a decomposition subunit, configured to decompose the smoothed pitch trajectory to obtain the vibrato corresponding to each syllable;

a transformation subunit, configured to apply a preset Hilbert transform to the vibrato corresponding to each syllable to obtain the vibrato feature corresponding to each syllable.

In the above device, optionally, the first synthesis unit includes:

a selection subunit, configured to select the amplitude, frequency and phase from the vibrato feature corresponding to the syllable;

a synthesis subunit, configured to apply a preset vibrato simulation algorithm to the amplitude, frequency, phase and the probability of vibrato in the syllable to generate the simulated vibrato corresponding to the syllable.

A storage medium comprising stored instructions, wherein when the instructions are executed, a device on which the storage medium resides is controlled to perform the above singing voice synthesis method.

An electronic device comprising a memory and one or more instructions, wherein the one or more instructions are stored in the memory and configured to be executed by one or more processors to perform the above singing voice synthesis method.

Compared with the prior art, the present invention has the following advantages:

The present invention provides a singing voice synthesis method, comprising: acquiring sound spectrum data of a target song; extracting the phonemes, pitch and audio corresponding to the target song from the sound spectrum data; processing the phonemes with a preset encoder to obtain an implicit spectral representation vector corresponding to the target song, where the implicit spectral representation vector characterizes the energy information of the target song; applying a preset predictor to predict the acoustic parameters corresponding to the target song based on the implicit spectral representation vector and the pitch; extracting the pitch trajectory from the audio and decomposing it to obtain the vibrato feature corresponding to each syllable in the target song; predicting, based on the vibrato feature of each syllable, the probability that vibrato occurs in that syllable; when the probability of vibrato in any syllable meets a preset threshold, synthesizing the simulated vibrato corresponding to that syllable and marking the syllable; and inputting the sound spectrum data, the acoustic parameters and the simulated vibrato corresponding to each marked syllable into a preset sound model to generate a synthesized singing voice corresponding to the target song. By applying the method provided by the present invention, the prosody of the singing voice is improved through the implicit spectral representation vector, and the expressiveness of the singing voice is improved by adding vibrato, making the synthesized singing voice more natural.

Brief Description of the Drawings

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative effort.

FIG. 1 is a flowchart of a singing voice synthesis method provided by an embodiment of the present invention;

FIG. 2 is an overall network structure diagram of a singing model provided by an embodiment of the present invention;

FIG. 3 is another flowchart of a singing voice synthesis method provided by an embodiment of the present invention;

FIG. 4 is a structural diagram of a singing voice synthesis device provided by an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

In this application, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. The terms "comprise", "include" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element qualified by the phrase "comprising a ..." does not preclude the presence of additional identical elements in the process, method, article or device that includes the element.

The present invention may be used in numerous general-purpose or special-purpose computing device environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor devices, distributed computing environments including any of the above devices, and the like.

An embodiment of the present invention provides a singing voice synthesis method. The method can be applied to a variety of system platforms, and its execution body can be a computer terminal or a processor of various mobile devices. The flowchart of the method is shown in FIG. 1 and specifically includes:

S101: Acquire sound spectrum data of a target song.

In this embodiment of the present invention, the sound spectrum data includes information such as phonemes, pitch and audio.

S102: Extract the phonemes, pitch and audio corresponding to the target song from the sound spectrum data.

In this embodiment of the present invention, a phoneme is the smallest unit of speech divided according to the natural attributes of speech; analyzed by the articulatory actions within a syllable, one action constitutes one phoneme. For example, the Chinese characters "开 (kai)" and "心 (xin)" each have three phonemes. Pitch refers to how high or low a sound is.

S103: Process the phonemes with a preset encoder to obtain an implicit spectral representation vector corresponding to the target song, where the implicit spectral representation vector characterizes the energy information of the target song.

In this embodiment of the present invention, the encoder is part of the acoustic model. The phonemes are input into the encoder of the acoustic model, and the encoder produces the representation vector of the phonemes, i.e., the implicit spectral representation vector.
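The optional embodiment above also describes expanding the phoneme-level representation sequence to a frame-level sequence ("level expansion"). The following is a minimal sketch of that step, assuming each phoneme comes with a known duration in frames; the function and variable names are illustrative, not from the patent:

```python
import numpy as np

def expand_to_frames(phoneme_vectors, durations):
    """Expand a phoneme-level representation sequence to a frame-level
    sequence by repeating each phoneme vector for its duration in frames."""
    return np.repeat(phoneme_vectors, durations, axis=0)

# Hypothetical example: 3 phonemes, each with a 4-dimensional representation.
phon = np.arange(12, dtype=float).reshape(3, 4)
frames = expand_to_frames(phon, durations=[2, 1, 3])
print(frames.shape)  # (6, 4): 2 + 1 + 3 frames
```

In practice the durations would come from the score or a duration model; here they are fixed for illustration.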

S104: Apply a preset predictor to predict the acoustic parameters corresponding to the target song based on the implicit spectral representation vector and the pitch.

In this embodiment of the present invention, the predictor is part of the acoustic model. The implicit spectral representation vector is processed and then combined with the pitch as input to the predictor, which outputs the acoustic parameters. The acoustic parameters include MGC (Mel-Generalized Cepstral coefficients), BAP (Band Aperiodicity) and pitch, where band aperiodicity refers to the power ratio between the speech signal and its aperiodic component.
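The predictor's data flow, as described in the optional embodiment, can be sketched as follows: a first one-dimensional convolution over the implicit spectral representation, concatenation with that representation and the pitch, then a second one-dimensional convolution producing the acoustic parameters. All sizes here (hidden width, 62 output channels, kernel size 3) are illustrative assumptions, and untrained random weights stand in for the learned layers:

```python
import numpy as np

def conv1d(x, w):
    """Minimal 'same'-padded 1-D cross-correlation over the time axis.
    x: (T, C_in), w: (C_out, C_in, K)."""
    C_out, C_in, K = w.shape
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((x.shape[0], C_out))
    for t in range(x.shape[0]):
        window = xp[t:t + K]                      # (K, C_in)
        out[t] = np.einsum('kc,ock->o', window, w)
    return out

rng = np.random.default_rng(0)
T, H = 8, 4                                       # frames, hidden size
hidden = rng.normal(size=(T, H))                  # implicit spectral repr.
pitch = rng.normal(size=(T, 1))                   # frame-level pitch

w1 = rng.normal(size=(H, H, 3))                   # first test layer (1-D conv)
spec = conv1d(hidden, w1)

concat = np.concatenate([spec, hidden, pitch], axis=1)   # splicing step
w2 = rng.normal(size=(62, concat.shape[1], 3))    # second test layer
acoustic = conv1d(concat, w2)                     # e.g. MGC + BAP + pitch dims
print(acoustic.shape)
```

The output keeps one row per frame, matching the frame-level inputs; only the channel dimension changes across the two layers.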

S105: Extract the pitch trajectory from the audio and decompose the pitch trajectory to obtain the vibrato feature corresponding to each syllable in the target song.

In this embodiment of the present invention, decomposing the pitch trajectory yields the pitch intonation in addition to the vibrato. The vibrato is obtained from the pitch trajectory, and the vibrato features corresponding to each syllable include the amplitude (extent), frequency (rate) and phase.

Further, decomposing the pitch trajectory to obtain the vibrato feature corresponding to each syllable in the target song includes:

applying a preset triangular filter to perform a convolution operation on the pitch trajectory to obtain a smoothed pitch trajectory;

decomposing the smoothed pitch trajectory to obtain the vibrato corresponding to each syllable;

applying a preset Hilbert transform to the vibrato corresponding to each syllable to obtain the vibrato feature corresponding to each syllable, where the vibrato feature includes amplitude, frequency and phase.

It should be noted that the real pitch trajectory is extracted from the audio data and then convolved with a triangular filter to obtain a smoothed pitch trajectory. The pitch trajectory is decomposed into pitch intonation and vibrato: the pitch intonation is the smoothed pitch contour, while the vibrato exhibits the singer's performance technique. The vibrato is further decomposed by the Hilbert transform into three features: amplitude (extent), frequency (rate) and phase.
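The smoothing-and-decomposition steps above can be sketched with numpy alone. The frame rate (200 frames per second), filter width (about two vibrato periods) and the synthetic 6 Hz test vibrato are illustrative assumptions; the Hilbert transform is implemented directly via the FFT rather than with a library call:

```python
import numpy as np

def triangular_smooth(pitch, width=65):
    """Smooth the pitch trajectory by convolving it with a triangular
    (Bartlett) filter; the width is an assumed preset."""
    w = np.bartlett(width)
    w /= w.sum()
    padded = np.pad(pitch, width // 2, mode='edge')
    return np.convolve(padded, w, mode='valid')

def hilbert_features(vibrato, fs):
    """FFT-based analytic signal (a numpy-only Hilbert transform) giving the
    instantaneous amplitude (extent), frequency (rate) and phase."""
    n = len(vibrato)
    spec = np.fft.fft(vibrato)
    h = np.zeros(n)
    h[0] = 1.0
    h[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        h[n // 2] = 1.0
    analytic = np.fft.ifft(spec * h)
    extent = np.abs(analytic)
    phase = np.unwrap(np.angle(analytic))
    rate = np.diff(phase) * fs / (2 * np.pi)   # instantaneous frequency, Hz
    return extent, rate, phase

# Synthetic check: a 6 Hz vibrato of extent 3 on a flat 220 Hz pitch.
fs = 200
t = np.arange(400) / fs
pitch = 220.0 + 3.0 * np.sin(2 * np.pi * 6.0 * t)
smooth = triangular_smooth(pitch)
vibrato = pitch - smooth                       # residual after smoothing
extent, rate, phase = hilbert_features(vibrato, fs)
```

On this synthetic signal the interior of `rate` sits near 6 Hz and the interior of `extent` near 3, recovering the parameters of the injected vibrato.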

S106: Based on the vibrato feature corresponding to each syllable, predict the probability that vibrato occurs in that syllable.

In this embodiment of the present invention, the probability of vibrato in each syllable is an average probability, i.e., the average likelihood of vibrato over that syllable. Vibrato is essentially absent in short syllables and appears only in longer ones; therefore, the longer the syllable, the higher its probability of vibrato.

Optionally, the vibrato features can be input into a vibrato similarity annotation network to predict the probability of vibrato in each syllable.
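As a hedged sketch of this syllable-level decision (the network itself is not specified in detail in the text), hypothetical frame-level vibrato probabilities can be averaged over each syllable's frame span and compared against a threshold:

```python
import numpy as np

def syllable_vibrato_probability(frame_probs, spans):
    """Average frame-level vibrato probabilities over each syllable span."""
    return [float(np.mean(frame_probs[s:e])) for s, e in spans]

# Hypothetical frame-level output of a vibrato-likelihood network.
frame_probs = np.array([0.1] * 10 + [0.9] * 30)
spans = [(0, 10), (10, 40)]          # a short syllable and a long one
probs = syllable_vibrato_probability(frame_probs, spans)
threshold = 0.5                      # assumed preset threshold
marked = [i for i, p in enumerate(probs) if p >= threshold]
print(marked)  # only the longer syllable is marked for vibrato
```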

S107: When the probability of vibrato in any syllable meets a preset threshold, synthesize the simulated vibrato corresponding to that syllable and mark the syllable.

Further, synthesizing the simulated vibrato corresponding to the syllable may specifically include:

selecting the amplitude, frequency and phase from the vibrato feature corresponding to the syllable;

applying a preset vibrato simulation algorithm to the amplitude, frequency, phase and the probability of vibrato in the syllable to generate the simulated vibrato corresponding to the syllable.

The calculation formula for synthesizing the simulated vibrato corresponding to each syllable is given in the original as equation images, which are not reproduced here. In the formula, p_mean is the average vibrato likelihood computed over that syllable, and the remaining predicted quantities are the extent, rate, phase and likeliness of the vibrato, respectively.
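Since the exact formula survives only as an image, the following is a hedged reconstruction from the surrounding description: a sinusoid parameterized by the predicted extent, rate and phase, scaled by the predicted likeliness and gated on the syllable's mean vibrato probability. The gating threshold and frame rate are illustrative assumptions, not values from the patent:

```python
import numpy as np

def synthesize_vibrato(extent, rate, phase, likeliness, p_mean,
                       n_frames, fs=200, threshold=0.5):
    """Hedged sketch of the vibrato simulation: a sine wave from the
    predicted extent/rate/phase, scaled by the predicted likeliness,
    emitted only when the syllable's mean probability meets the threshold."""
    if p_mean < threshold:                  # syllable not marked: no vibrato
        return np.zeros(n_frames)
    t = np.arange(n_frames) / fs
    return likeliness * extent * np.sin(2 * np.pi * rate * t + phase)

# One second of simulated vibrato for a marked syllable.
v = synthesize_vibrato(extent=3.0, rate=6.0, phase=0.0,
                       likeliness=0.9, p_mean=0.8, n_frames=200)
print(v.shape)
```

An unmarked syllable (p_mean below the threshold) simply yields a zero vibrato track, matching the conditional synthesis in S107.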

It should be noted that, in singing voice data, it is difficult to directly extract by algorithm whether vibrato exists in the pitch trajectory for use as a vibrato-likelihood label, so the present invention proposes a method of constructing simulated vibrato data and labels to obtain vibrato labels and the corresponding pitch trajectories.

First, the real pitch trajectory is extracted from the audio data and convolved with a triangular filter to obtain a smoothed pitch trajectory. Vibrato data is then constructed by a vibrato simulation algorithm (for example, randomly selecting the vibrato extent and multiplying it by a 6 Hz sine wave). A segment of the smoothed pitch trajectory is selected at random and the vibrato is added to that segment, yielding a pitch trajectory that contains vibrato at a known position, which gives the vibrato labels: the frames within the randomly selected segment contain vibrato and are labeled 1, while the frames outside the segment contain no vibrato and are labeled 0.

With the simulated pitch trajectories containing vibrato and the corresponding labels of whether vibrato is present, these constructed data can be used to train a neural network that predicts the likelihood of vibrato, i.e., the probability that vibrato occurs. After the network is trained, it can be used to annotate the training data of the pitch model: the real pitch trajectory of the pitch-model training data is extracted, and the network predicts whether vibrato exists at each frame position of that trajectory, thereby producing vibrato-likelihood labels for the pitch-model training data.
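The label-construction procedure above can be sketched as follows. The frame rate, trajectory shape, segment length and extent range are all illustrative assumptions standing in for real extracted pitch trajectories:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 200                                   # assumed frames per second

# Stand-in for a smoothed real pitch trajectory (a gentle synthetic drift).
n = 600
smooth = 220.0 + np.linspace(0.0, 5.0, n)

# Randomly choose a segment and add a 6 Hz sine-wave vibrato to it,
# with a randomly chosen extent, as the text describes.
seg_len = 160
start = int(rng.integers(0, n - seg_len))
extent = float(rng.uniform(1.0, 5.0))
t = np.arange(seg_len) / fs
vibrato = extent * np.sin(2 * np.pi * 6.0 * t)

trajectory = smooth.copy()
trajectory[start:start + seg_len] += vibrato

# Frame-level labels: 1 inside the inserted segment, 0 elsewhere.
labels = np.zeros(n, dtype=int)
labels[start:start + seg_len] = 1
print(labels.sum())
```

Pairs of `trajectory` and `labels` like this form the constructed training set for the vibrato-likelihood network described above.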

S108: Input the sound spectrum data, the acoustic parameters, and the simulated vibrato corresponding to each marked syllable into a preset sound model to generate the synthesized singing voice of the target song.

In this embodiment of the present invention, vibrato and acoustic parameters are incorporated into the singing voice synthesis process, improving the expressiveness of the synthesized singing voice and making it sound more natural.

It should be noted that the sound model is a neural network model for synthesizing singing voices. After repeated machine learning, the model simulates the human voice from the input parameters (such as the sound spectrum data, the acoustic parameters, and the simulated vibrato) to produce the synthesized singing voice of the target song.

In the method provided by this embodiment of the present invention, the sound spectrum data of the target song is acquired, and the phonemes, pitch, and audio are extracted from it. The phonemes are processed by an encoder to obtain an implicit spectral representation vector, and a predictor then processes the implicit spectral representation vector and the pitch to obtain one-dimensional acoustic parameters, which enhance the prosody of the song during synthesis. After the acoustic parameters are obtained, the vibrato feature of each syllable is derived from the audio, and the likelihood of vibrato for each syllable, that is, the probability that vibrato occurs, is predicted from the vibrato feature. When vibrato would occur on a syllable, i.e. when the probability of vibrato exceeds a threshold, a simulated vibrato is synthesized and the syllable is marked; the mark sets the vibrato-likelihood label for that syllable. Finally, the sound spectrum data, the acoustic parameters, and the simulated vibrato are input into the sound model to synthesize the singing voice of the target song.

By applying the method provided by this embodiment of the present invention, the prosody of the singing voice is improved through the implicit spectral representation vector, while the addition of vibrato improves its expressiveness, making the synthesized singing voice more natural.

Referring to Fig. 2, Fig. 2 is the overall network structure of the singing model, which comprises three parts. Part (a) is the acoustic model: its input contains the phonemes and the pitch; the acoustic model predicts the implicit spectral representation vector from the encoder output, and the predicted implicit spectral representation vector is concatenated with the representation produced by the encoder and fed into the decoder to predict the acoustic features (MGC, BAP, Pitch). Part (b) of Fig. 2 is the pitch model, whose inputs are the phonemes, the syllables, and the corresponding singer id; in the pitch model, pitch is decomposed into intonation, phase, extent, rate, and other information, and the likelihood of vibrato is also modeled, i.e. the "likeliness" in the figure. Part (c) of Fig. 2 is the vibrato-likelihood labeling network, which labels the training data with vibrato-likelihood labels. In Fig. 2, "c" denotes concatenation, "+" denotes addition, "f" is the formula that generates the simulated vibrato, "FC" is a fully connected layer, "LR" is the length regulator, "Rep" is short for representation, and "Vibrato likeliness" is the likelihood of vibrato, i.e. the probability that vibrato occurs.

In the method provided by this embodiment of the present invention, applying a preset encoder to process the phonemes to obtain the implicit spectral representation vector corresponding to the target song includes:

inputting the phonemes into the encoder to obtain the initial representation vector corresponding to the target song output by the encoder, where the initial representation vector is a phoneme-level representation sequence;

performing level expansion on the initial representation vector to obtain the implicit spectral representation vector corresponding to the initial representation vector, where the implicit spectral representation vector is a frame-level representation sequence.

It should be noted that, as shown in Fig. 2, the implicit spectral representation vector may be obtained as follows: the input of the acoustic model's encoder is the phonemes; after the encoder produces the phoneme representation vectors, its output is expanded to the frame level according to the phoneme duration information in the musical score, i.e. from a phoneme-level representation sequence to a frame-level representation sequence.
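The phoneme-to-frame expansion described above (the "LR" length regulator in Fig. 2) amounts to repeating each phoneme-level vector for its duration in frames, as read from the score. A minimal sketch:

```python
import numpy as np

def length_regulate(phoneme_reps, durations):
    """Expand a phoneme-level representation sequence to frame level by
    repeating each phoneme vector for its duration (in frames).
    phoneme_reps: (num_phonemes, dim); durations: frames per phoneme."""
    phoneme_reps = np.asarray(phoneme_reps)
    return np.repeat(phoneme_reps, durations, axis=0)
```

For example, three phoneme vectors with durations [2, 1, 3] expand to a frame-level sequence of length 6.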

Further, referring to Fig. 3, applying a preset predictor to predict the acoustic parameters corresponding to the target song based on the implicit spectral representation vector and the pitch includes:

S301: Input the implicit spectral representation vector into the first test layer of the predictor to obtain the spectral representation vector output by the first test layer.

Here, the first test layer is a one-dimensional convolutional layer.

It should be noted that the predictor contains two one-dimensional convolutional layers, with a linear transformation as the output layer, and predicts a spectral representation vector with 256 dimensions per frame.

S302: Concatenate the spectral representation vector, the implicit spectral representation vector, and the pitch to obtain concatenated data.

S303: Input the concatenated data into the second test layer of the tester to obtain the acoustic parameters output by the second test layer.

Here, the second test layer is a one-dimensional convolutional layer.

It should be noted that the predicted spectral representation, the output of the acoustic model's encoder, and the pitch embedding are concatenated and fed into the acoustic model tester for acoustic parameter prediction. The acoustic parameters used in the present invention are MGC (Mel-Generalized Cepstral), BAP (Band Aperiodicity), and Pitch; the output of the encoder is the implicit spectral representation vector. Table 1 lists the dimensions of the different acoustic-parameter features in the acoustic model:

(Table 1 is provided only as an image, Figure BDA0003753030290000101, in the original publication; it lists the feature dimensions of the acoustic parameters MGC, BAP, and Pitch in the acoustic model.)

Table 1
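The predictor data flow described in S301 to S303 (convolutional layers producing a 256-dimensional-per-frame spectral representation, which is then concatenated with the encoder output and the pitch embedding before acoustic-parameter prediction) can be sketched as follows. The weights are random and `out_dim=66` is an assumed MGC + BAP + Pitch total, since Table 1 is only available as an image; only the shapes and the data flow match the description:

```python
import numpy as np

def conv1d(x, w, b):
    """'Same'-padded 1-D convolution over time.
    x: (T, C_in); w: (k, C_in, C_out); b: (C_out,)."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty((x.shape[0], w.shape[2]))
    for t in range(x.shape[0]):
        out[t] = np.tensordot(xp[t:t + k], w, axes=([0, 1], [0, 1])) + b
    return out

def predict_acoustics(hidden_rep, pitch_emb, rep_dim=256, out_dim=66, rng=None):
    """Sketch of the two-stage predictor: conv layers yield a 256-dim
    spectral representation per frame; it is concatenated with the encoder
    output and the pitch embedding, and a final layer maps the result to
    the acoustic parameters (MGC/BAP/Pitch)."""
    rng = rng or np.random.default_rng(0)
    T, C = hidden_rep.shape
    w1 = rng.normal(0, 0.1, (3, C, rep_dim)); b1 = np.zeros(rep_dim)
    w2 = rng.normal(0, 0.1, (3, rep_dim, rep_dim)); b2 = np.zeros(rep_dim)
    spec_rep = np.tanh(conv1d(np.tanh(conv1d(hidden_rep, w1, b1)), w2, b2))
    cat = np.concatenate([spec_rep, hidden_rep, pitch_emb], axis=1)
    w3 = rng.normal(0, 0.1, (3, cat.shape[1], out_dim)); b3 = np.zeros(out_dim)
    return conv1d(cat, w3, b3)
```

In a trained system the weights would of course be learned; the sketch only illustrates how the concatenation in S302 feeds the second stage in S303.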

It should also be noted that, to let the spectral representation vector represent energy information well, the present invention adds a decoder to the acoustic model so that the predicted spectral representation vector can recover, through the decoder, the energy spectrum of the audio corresponding to the input text. Because singing data come in pairs, i.e. one text input corresponds to one audio output, the spectral representation obtained by the spectral-representation predictor can be passed through a decoder to recover the corresponding energy spectrum. This decoder is the part inside the dashed box in part (a) of Fig. 2; it is only needed during training, not during inference, where the spectral representation predicted by the spectral-representation predictor is used directly for acoustic parameter prediction.

Based on the method provided by the above embodiment, the present invention explicitly predicts vibrato features such as Phase, Extent, Rate, and Likeliness through the vibrato model and uses these features to improve the expressiveness of the singing model; it also improves the prosodic expressiveness of singing by introducing into the singing model a compressed, highly expressive latent spectrogram representation (i.e. an energy bottleneck feature).

The specific implementation processes of the above embodiments and their derivatives all fall within the protection scope of the present invention.

Corresponding to the method described in Fig. 1, an embodiment of the present invention further provides a singing voice synthesis apparatus for implementing the method in Fig. 1. The singing voice synthesis apparatus provided by this embodiment can be applied in a computer terminal or in various mobile devices; its structure is shown schematically in Fig. 4 and specifically includes:

a first acquisition unit 401, configured to acquire the sound spectrum data of a target song;

an extraction unit 402, configured to extract the phonemes, pitch, and audio corresponding to the target song from the sound spectrum data;

a processing unit 403, configured to process the phonemes with a preset encoder to obtain the implicit spectral representation vector corresponding to the target song, where the implicit spectral representation vector characterizes the energy information of the target song;

a first prediction unit 404, configured to apply a preset predictor to predict the acoustic parameters corresponding to the target song based on the implicit spectral representation vector and the pitch;

a decomposition unit 405, configured to extract the pitch trajectory from the audio and decompose it to obtain the vibrato feature corresponding to each syllable in the target song;

a second prediction unit 406, configured to predict, based on the vibrato feature of each syllable, the probability that vibrato occurs on that syllable;

a first synthesis unit 407, configured to synthesize the simulated vibrato corresponding to a syllable and mark the syllable when the probability of vibrato on that syllable meets a preset threshold;

a second synthesis unit 408, configured to input the sound spectrum data, the acoustic parameters, and the simulated vibrato corresponding to each marked syllable into a preset sound model to generate the synthesized singing voice of the target song.

In the apparatus provided by this embodiment of the present invention, the sound spectrum data of the target song is acquired, and the phonemes, pitch, and audio are extracted from it. The phonemes are processed by an encoder to obtain an implicit spectral representation vector, and a predictor then processes the implicit spectral representation vector and the pitch to obtain one-dimensional acoustic parameters, which enhance the prosody of the song during synthesis. After the acoustic parameters are obtained, the vibrato feature of each syllable is derived from the audio, and the likelihood of vibrato for each syllable, that is, the probability that vibrato occurs, is predicted from the vibrato feature. When vibrato would occur on a syllable, i.e. when the probability of vibrato exceeds a threshold, a simulated vibrato is synthesized and the syllable is marked; the mark sets the vibrato-likelihood label for that syllable. Finally, the sound spectrum data, the acoustic parameters, and the simulated vibrato are input into the sound model to synthesize the singing voice of the target song.

By applying the apparatus provided by this embodiment of the present invention, the prosody of the singing voice is improved through the implicit spectral representation vector, while the addition of vibrato improves its expressiveness, making the synthesized singing voice more natural.

In the apparatus provided by this embodiment of the present invention, the processing unit 403 includes:

a first input subunit, configured to input the phonemes into the encoder to obtain the initial representation vector corresponding to the target song output by the encoder, where the initial representation vector is a phoneme-level representation sequence;

an expansion subunit, configured to perform level expansion on the initial representation vector to obtain the implicit spectral representation vector corresponding to it, where the implicit spectral representation vector is a frame-level representation sequence.

In the apparatus provided by this embodiment of the present invention, the first prediction unit 404 includes:

a second input subunit, configured to input the implicit spectral representation vector into the first test layer of the predictor to obtain the spectral representation vector output by the first test layer, where the first test layer is a one-dimensional convolutional layer;

a concatenation subunit, configured to concatenate the spectral representation vector, the implicit spectral representation vector, and the pitch to obtain concatenated data;

a third input subunit, configured to input the concatenated data into the second test layer of the tester to obtain the acoustic parameters output by the second test layer, where the second test layer is a one-dimensional convolutional layer.

In the apparatus provided by this embodiment of the present invention, the decomposition unit 405 includes:

an operation subunit, configured to convolve the pitch trajectory with a preset triangular filter to obtain a smoothed pitch trajectory;

a decomposition subunit, configured to decompose the smoothed pitch trajectory to obtain the vibrato corresponding to each syllable;

a transformation subunit, configured to apply a preset Hilbert transform to the vibrato of each syllable to obtain the vibrato feature corresponding to that syllable.
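The transformation subunit's Hilbert-transform step can be sketched as follows: the analytic signal of the vibrato component yields its instantaneous amplitude (extent), frequency (rate), and phase. The analytic signal is computed directly with the FFT (the same quantity `scipy.signal.hilbert` returns); the frame rate is an assumed value, since the text does not specify one:

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via FFT: zero negative frequencies, double positive ones."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(X * h)

def vibrato_features(vibrato, frame_rate=200.0):
    """Instantaneous amplitude, frequency (Hz), and phase of a vibrato
    component, via the Hilbert transform."""
    z = analytic_signal(vibrato)
    amplitude = np.abs(z)
    phase = np.unwrap(np.angle(z))
    freq = np.diff(phase) * frame_rate / (2.0 * np.pi)
    return amplitude, freq, phase
```

For a 6 Hz sinusoidal vibrato of extent 10, the recovered amplitude is approximately 10 at every frame and the instantaneous frequency is approximately 6 Hz.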

In the apparatus provided by this embodiment of the present invention, the first synthesis unit 407 includes:

a selection subunit, configured to select the amplitude, frequency, and phase from the vibrato feature corresponding to the syllable;

a synthesis subunit, configured to apply a preset vibrato simulation algorithm to the amplitude, frequency, phase, and the probability of vibrato on the syllable to generate the simulated vibrato corresponding to the syllable.
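The vibrato simulation formula ("f" in Fig. 2) is not spelled out in this text. A common sketch, assuming a single sinusoid built from the selected amplitude, frequency, and phase, and gated by the predicted probability (the 0.5 threshold and the frame rate are assumptions), is:

```python
import numpy as np

def synthesize_vibrato(extent, rate_hz, phase, likeliness, n_frames,
                       frame_rate=200.0, threshold=0.5):
    """Generate a simulated vibrato trajectory: a sinusoid with the given
    extent, rate, and phase, emitted only when the predicted vibrato
    likeliness exceeds the threshold; otherwise no vibrato is added."""
    if likeliness < threshold:
        return np.zeros(n_frames)
    t = np.arange(n_frames) / frame_rate
    return extent * np.sin(2.0 * np.pi * rate_hz * t + phase)
```

The resulting trajectory would be added to the syllable's smoothed pitch contour before waveform generation.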

For the specific working processes of the units and subunits in the singing voice synthesis apparatus disclosed in the above embodiments, reference may be made to the corresponding content of the singing voice synthesis method disclosed in the above embodiments of the present invention, which is not repeated here.

An embodiment of the present invention further provides a storage medium that includes stored instructions, where, when the instructions run, the device on which the storage medium resides is controlled to perform the above singing voice synthesis method.

An embodiment of the present invention further provides an electronic device, whose structure is shown schematically in Fig. 5. It specifically includes a memory 501 and one or more instructions 502, where the one or more instructions 502 are stored in the memory 501 and are configured to be executed by one or more processors 503 to perform the following operations:

acquiring the sound spectrum data of a target song;

extracting the phonemes, pitch, and audio corresponding to the target song from the sound spectrum data;

processing the phonemes with a preset encoder to obtain the implicit spectral representation vector corresponding to the target song, where the implicit spectral representation vector characterizes the energy information of the target song;

applying a preset predictor to predict the acoustic parameters corresponding to the target song based on the implicit spectral representation vector and the pitch;

extracting the pitch trajectory from the audio and decomposing it to obtain the vibrato feature corresponding to each syllable in the target song;

predicting, based on the vibrato feature of each syllable, the probability that vibrato occurs on that syllable;

when the probability of vibrato on any syllable meets a preset threshold, synthesizing the simulated vibrato corresponding to that syllable and marking the syllable;

inputting the sound spectrum data, the acoustic parameters, and the simulated vibrato corresponding to each marked syllable into a preset sound model to generate the synthesized singing voice of the target song.

The embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, since the system and system embodiments are substantially similar to the method embodiments, their description is relatively brief, and the relevant parts of the method-embodiment descriptions may be consulted. The systems and system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.

Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two.

To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of their functions. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementations should not be considered beyond the scope of the present invention.

The above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the present invention is not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A singing voice synthesizing method, comprising:
acquiring the sound spectrum data of a target song;
extracting phonemes, pitches and audios corresponding to the target songs in the sound spectrum data;
processing the phonemes by using a preset encoder to obtain an implicit spectral characterization vector corresponding to the target song, wherein the implicit spectral characterization vector is used for characterizing energy information of the target song;
predicting acoustic parameters corresponding to the target song based on the implicit spectral characterization vector and the pitch by applying a preset predictor;
extracting a pitch track in the audio, decomposing the pitch track, and obtaining a vibrato characteristic corresponding to each syllable in the target song;
predicting the probability of vibrato of each syllable based on the vibrato characteristic corresponding to each syllable;
when the probability of vibrato of any syllable meets a preset threshold value, synthesizing the simulated vibrato corresponding to the syllable and marking the syllable;
and inputting the sound spectrum data, the acoustic parameters and the simulated vibrato corresponding to the marked syllables into a preset sound model to generate a synthetic singing voice corresponding to the target song.
2. The method of claim 1, wherein the applying a preset encoder to process the phoneme to obtain an implicit spectral feature vector corresponding to the target song comprises:
inputting the phonemes into the encoder, and obtaining initial characterization vectors corresponding to the target songs output by the encoder, wherein the initial characterization vectors are characterization sequences at a phoneme level;
and performing level expansion on the initial characterization vector to obtain an implicit spectrum characterization vector corresponding to the initial characterization vector, wherein the implicit spectrum characterization vector is a frame-level characterization sequence.
3. The method of claim 1 or 2, wherein the applying a preset predictor to predict the acoustic parameters corresponding to the target song based on the implicit spectral characterization vector and the pitch comprises:
inputting the implicit spectral feature vector into a first test layer of the predictor to obtain a spectral feature vector output by the first test layer, wherein the first test layer is a one-dimensional convolutional layer;
splicing the spectrum characterization vector, the implicit spectrum characterization vector and the pitch to obtain spliced data;
and inputting the splicing data into a second test layer of the tester to obtain the acoustic parameters output by the second test layer, wherein the second test layer is a one-dimensional convolutional layer.
4. The method of claim 1, wherein the decomposing the pitch track to obtain the vibrato characteristic for each syllable in the target song comprises:
performing convolution operation on the pitch track by applying a preset triangular filter to obtain a smooth pitch track;
decomposing the smooth pitch track to obtain the vibrato corresponding to each syllable;
and performing a transformation operation on the vibrato corresponding to each syllable by applying a preset Hilbert transformation method to obtain the vibrato characteristic corresponding to each syllable.
5. The method according to claim 1 or 4, wherein the synthesizing of the simulated vibrato corresponding to the syllable comprises:
selecting the amplitude, frequency and phase in the vibrato characteristics corresponding to the syllable;
and calculating the amplitude, the frequency, the phase and the probability of vibrato of the syllable by applying a preset vibrato simulation algorithm to generate the simulated vibrato corresponding to the syllable.
6. A singing voice synthesizing apparatus, comprising:
a first acquisition unit for acquiring the sound spectrum data of a target song;
the extraction unit is used for extracting phonemes, pitches and audios corresponding to the target songs from the sound spectrum data;
the processing unit is used for processing the phonemes by using a preset encoder to obtain an implicit spectral characterization vector corresponding to the target song, and the implicit spectral characterization vector is used for characterizing energy information of the target song;
the first prediction unit is used for applying a preset predictor to predict the acoustic parameters corresponding to the target song based on the implicit spectral characterization vector and the pitch;
the decomposition unit is used for extracting a pitch track in the audio and decomposing the pitch track to obtain the vibrato characteristic corresponding to each syllable in the target song;
a second prediction unit, configured to predict a probability of a vibrato occurring for each of the syllables based on a vibrato characteristic corresponding to each of the syllables;
the first synthesis unit is used for synthesizing the simulated vibrato corresponding to the syllable and marking the syllable when the probability of vibrato appearing on any syllable meets a preset threshold value;
and the second synthesis unit is used for inputting the sound spectrum data, the acoustic parameters and the simulated vibrato corresponding to each marked syllable into a preset sound model to generate the synthetic singing voice corresponding to the target song.
7. The apparatus of claim 6, wherein the processing unit comprises:
a first input subunit, configured to input the phoneme to the encoder, and obtain an initial characterization vector corresponding to the target song output by the encoder, where the initial characterization vector is a phoneme-level characterization sequence;
and the expansion subunit is used for performing level expansion on the initial characterization vector to obtain an implicit spectrum characterization vector corresponding to the initial characterization vector, wherein the implicit spectrum characterization vector is a frame-level characterization sequence.
8. The apparatus according to claim 6 or 7, wherein the first prediction unit comprises:
the second input subunit is configured to input the implicit spectral feature vector to a first test layer of the predictor, so as to obtain a spectral feature vector output by the first test layer, where the first test layer is a one-dimensional convolutional layer;
a splicing subunit, configured to splice the spectrum characterization vector, the implicit spectrum characterization vector, and the pitch to obtain splicing data;
and the third input subunit is used for inputting the splicing data into a second test layer of the tester to obtain the acoustic parameters output by the second test layer, and the second test layer is a one-dimensional convolutional layer.
9. The apparatus of claim 6, wherein the decomposition unit comprises:
the operation subunit is used for performing convolution operation on the pitch track by applying a preset triangular filter to obtain a smooth pitch track;
the decomposition subunit is used for decomposing the smooth pitch track to obtain the vibrato corresponding to each syllable;
and the transformation subunit is used for performing transformation operation on the vibrato corresponding to each syllable by applying a preset Hilbert transformation method to obtain the vibrato characteristic corresponding to each syllable.
10. The apparatus of claim 6 or 9, wherein the first synthesis unit comprises:
a selecting subunit, configured to select an amplitude, a frequency, and a phase of a vibrato feature corresponding to the syllable;
and the synthesis subunit is used for calculating the amplitude, the frequency, the phase and the probability of vibrato of the syllable by applying a preset vibrato simulation algorithm to generate the simulated vibrato corresponding to the syllable.
CN202210846439.0A 2022-07-19 2022-07-19 Singing voice synthesis method and device Pending CN115019767A (en)


Publications (1)

Publication Number Publication Date
CN115019767A true CN115019767A (en) 2022-09-06

Family

ID=83082010

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210846439.0A Pending CN115019767A (en) 2022-07-19 2022-07-19 Singing voice synthesis method and device

Country Status (1)

Country Link
CN (1) CN115019767A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116129938A (en) * 2023-02-13 2023-05-16 腾讯音乐娱乐科技(深圳)有限公司 Singing voice synthesizing method, singing voice synthesizing device, singing voice synthesizing equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002073064A (en) * 2000-08-28 2002-03-12 Yamaha Corp Voice processor, voice processing method and information recording medium
EP1727123A1 (en) * 2005-05-26 2006-11-29 Yamaha Corporation Sound signal processing apparatus, sound signal processing method and sound signal processing program
JP6004358B1 (en) * 2015-11-25 2016-10-05 株式会社テクノスピーチ Speech synthesis apparatus and speech synthesis method
CN108766409A (en) * 2018-05-25 2018-11-06 中国传媒大学 A kind of opera synthetic method, device and computer readable storage medium
CN110164460A (en) * 2019-04-17 2019-08-23 平安科技(深圳)有限公司 Sing synthetic method and device
WO2020140390A1 (en) * 2019-01-04 2020-07-09 平安科技(深圳)有限公司 Vibrato modeling method, device, computer apparatus and storage medium
CN111429877A (en) * 2020-03-03 2020-07-17 云知声智能科技股份有限公司 Song processing method and device
WO2020248388A1 (en) * 2019-06-11 2020-12-17 平安科技(深圳)有限公司 Method and device for training singing voice synthesis model, computer apparatus, and storage medium
CN113555001A (en) * 2021-07-23 2021-10-26 平安科技(深圳)有限公司 Singing voice synthesis method and device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YI REN ET AL.: "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech", arXiv:2006.04558v3, 22 June 2020 (2020-06-22), pages 1-11 *

Similar Documents

Publication Publication Date Title
Gold et al. Speech and audio signal processing: processing and perception of speech and music
Hayes et al. A review of differentiable digital signal processing for music and speech synthesis
WO2017190674A1 (en) Method and device for processing audio data, and computer storage medium
US8594993B2 (en) Frame mapping approach for cross-lingual voice transformation
US7977562B2 (en) Synthesized singing voice waveform generator
WO2020215666A1 (en) Speech synthesis method and apparatus, computer device, and storage medium
JP4054507B2 (en) Voice information processing method and apparatus, and storage medium
CN109949783A (en) Song synthesis method and system
CN108806665A (en) Phoneme synthesizing method and device
CN106971703A (en) A kind of song synthetic method and device based on HMM
CN108831437A (en) A kind of song generation method, device, terminal and storage medium
CN113393830B (en) Hybrid acoustic model training and lyric timestamp generation method, device and medium
CN112382270A (en) Speech synthesis method, apparatus, device and storage medium
CN112382274B (en) Audio synthesis method, device, equipment and storage medium
Ming et al. Fundamental frequency modeling using wavelets for emotional voice conversion
WO2023116243A1 (en) Data conversion method and computer storage medium
CN109817191A (en) Trill modeling method, device, computer equipment and storage medium
CN110310621A (en) Singing synthesis method, device, equipment and computer-readable storage medium
US20140236597A1 (en) System and method for supervised creation of personalized speech samples libraries in real-time for text-to-speech synthesis
CN113555001A (en) Singing voice synthesis method and device, computer equipment and storage medium
US20240347037A1 (en) Method and apparatus for synthesizing unified voice wave based on self-supervised learning
CN111477210A (en) Speech synthesis method and device
CN112382269B (en) Audio synthesis method, device, equipment and storage medium
WO2015025788A1 (en) Quantitative f0 pattern generation device and method, and model learning device and method for generating f0 pattern
CN119049445B (en) A highly expressive singing voice synthesis model training method, synthesis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination