
CN116206592A - A voice cloning method, device, equipment and storage medium


Info

Publication number: CN116206592A
Application number: CN202310066406.9A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: target, acoustic model, synthesized, adaptive, training
Inventor: 孙志宏
Original Assignee: Sipic Technology Co Ltd
Current Assignee: Sipic Technology Co Ltd
Application filed by Sipic Technology Co Ltd
Priority to CN202310066406.9A
Publication of CN116206592A
Legal status: Pending


Classifications

    • G10L (G: Physics; G10: Musical instruments; acoustics): speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L13/08 Speech synthesis; text-to-speech systems: text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme-to-phoneme translation, prosody generation or stress or intonation determination
    • G10L13/02 Methods for producing synthetic speech; speech synthesisers
    • G10L15/063 Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L19/173 Vocoder architecture: transcoding, i.e. converting between two coded representations avoiding cascaded coding-decoding
    • G10L21/01 Correction of time axis (changing voice quality, e.g. pitch or formants, characterised by the process used)
    • G10L25/24 Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
    • G10L2015/0631 Creating reference templates; clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present invention provides a voice cloning method, device, equipment and storage medium. The method includes: receiving text to be synthesized; converting the text to be synthesized into a phoneme sequence to be synthesized; feeding the phoneme sequence into an adaptive acoustic model and using it to convert the sequence into acoustic features of a target speaker, where the adaptive acoustic model is obtained by adaptively training a base acoustic model on target voice sample data of the target speaker and, during that adaptive training, the parameters of the conditional layer normalization layers in the base model's decoder are updated; and using a vocoder to synthesize the acoustic features into audio data in the target speaker's voice. The invention reduces the training time of the adaptive model during voice cloning and improves the efficiency and quality of voice cloning.

Description

Voice cloning method, device, equipment and storage medium

Technical Field

The present invention relates to the field of artificial intelligence technology, and in particular to a voice cloning method, device, equipment and storage medium.

Background

With the advance of technology, intelligent voice technology is developing rapidly. One example is voice cloning: using a smart device to synthesize specified content into audio spoken in a designated person's voice, so that, for instance, a smart speaker at home can speak in its owner's voice.

Typical voice cloning extracts the target speaker's acoustic features and voiceprint features from audio recorded in real-world conditions, trains a model on them, and finally decodes the target speaker's acoustic features with a vocoder to obtain the cloned voice. However, this training must update a large number of model parameters, which takes a long time and also degrades the quality and naturalness of the synthesized audio.

Summary of the Invention

In view of this, embodiments of the present invention provide a voice cloning method, device, equipment and storage medium, to eliminate or mitigate one or more defects of the prior art.

One aspect of the present invention provides a voice cloning method comprising the following steps:

receiving text to be synthesized;

converting the text to be synthesized into a phoneme sequence to be synthesized;

feeding the phoneme sequence to be synthesized into an adaptive acoustic model and using the adaptive acoustic model to convert it into acoustic features of a target speaker, where the adaptive acoustic model is obtained by adaptively training a base acoustic model on target voice sample data of the target speaker and, during that adaptive training, the parameters of the conditional layer normalization layers in the base model's decoder are updated;

using a vocoder to synthesize the acoustic features into target synthesized audio data of the target speaker.

In some embodiments of the present invention, the training method for the adaptive acoustic model includes:

collecting target voice sample data of the target speaker;

converting the target voice sample data into target text;

extracting a target voiceprint vector and target acoustic features from the target voice sample data;

converting the target text into a target phoneme sequence, and aligning the target voice sample data with the target text to obtain a target duration sequence;

adaptively training the base acoustic model on the target phoneme sequence, the target duration sequence, the target acoustic features and the target voiceprint vector, updating the parameters of the conditional layer normalization layers in the base model's decoder until preset requirements are met.

In some embodiments of the present invention, the method further includes:

when adaptively training the base acoustic model on the target speaker's voice sample data, denoising the target voice sample data with a neural network algorithm, and using the denoised target voice sample data for the adaptive training to obtain the adaptive acoustic model.

In some embodiments of the present invention, the method further includes:

using a noise reduction model to apply a second denoising pass to the target synthesized audio data, and obtaining and outputting the denoised target synthesized audio data.

In some embodiments of the present invention, the sampling rate of the target voice sample data is the same as the sampling rate of the training data of the base acoustic model.

In some embodiments of the present invention, the base acoustic model further includes a reference encoder, which learns the regularities of the acoustic features in audio data.

Another aspect of the present invention provides a voice cloning device comprising:

a text receiving module for receiving text to be synthesized;

a phoneme conversion module for converting the text to be synthesized into a phoneme sequence to be synthesized;

an acoustic feature prediction module for feeding the phoneme sequence to be synthesized into an adaptive acoustic model and using the adaptive acoustic model to convert it into acoustic features of a target speaker, where the adaptive acoustic model is obtained by adaptively training a base acoustic model on target voice sample data of the target speaker and, during that adaptive training, the parameters of the conditional layer normalization layers in the base model's decoder are updated;

a speech synthesis module for using a vocoder to synthesize the acoustic features into target synthesized audio data of the target speaker.

In some embodiments of the present invention, the device further includes a model training module for training the adaptive acoustic model as follows:

collecting target voice sample data of the target speaker;

converting the target voice sample data into target text;

extracting a target voiceprint vector and target acoustic features from the target voice sample data;

converting the target text into a target phoneme sequence, and aligning the target voice sample data with the target text to obtain a target duration sequence;

adaptively training the base acoustic model on the target phoneme sequence, the target duration sequence, the target acoustic features and the target voiceprint vector, updating the parameters of the conditional layer normalization layers in the base model's decoder until preset requirements are met.

Another aspect of the present invention provides voice cloning equipment comprising a processor and a memory storing computer instructions, the processor being configured to execute the computer instructions stored in the memory; when the computer instructions are executed by the processor, the equipment implements the voice cloning method described above.

Yet another aspect of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the voice cloning method described above.

With the voice cloning method, device, equipment and storage medium provided by the present invention, only a small amount of audio data from the target speaker needs to be collected to train, on top of the base acoustic model, an adaptive acoustic model for that speaker. The adaptive acoustic model can then convert specified content into the target speaker's acoustic features, and a vocoder can clone the target speaker's audio from them. By introducing conditional layer normalization into the decoder of the acoustic model, the adaptive training time is reduced without sacrificing the naturalness of the synthesized audio, which shortens adaptive model training during voice cloning, speeds up voice cloning, and preserves its audio quality.

Additional advantages, objects and features of the present invention will be set forth in part in the description that follows, will in part become apparent to those of ordinary skill in the art upon study of the following, or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and attained by the structures particularly pointed out in the description and the accompanying drawings.

Those skilled in the art will appreciate that the objects and advantages attainable with the present invention are not limited to those specifically described above, and that these and other attainable objects will be understood more clearly from the following detailed description.

Brief Description of the Drawings

The drawings described here are provided to give a further understanding of the present invention and constitute a part of this application; they do not limit the invention. The components in the drawings are not drawn to scale and merely illustrate the principles of the invention. To make it easier to show and describe some parts of the invention, corresponding parts in the drawings may be enlarged relative to other components of an exemplary device actually manufactured according to the invention. In the drawings:

Fig. 1 is a schematic flowchart of the voice cloning method provided in one embodiment of this specification;

Fig. 2 is a schematic diagram of the training process of the base acoustic model in one embodiment of this specification;

Fig. 3 is a schematic flowchart of voice cloning in another embodiment of this specification;

Fig. 4 is a schematic diagram of the module structure of one embodiment of the voice cloning device provided in this specification;

Fig. 5 is a block diagram of the hardware structure of the voice cloning server in one embodiment of this specification.

Detailed Description

To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments and the accompanying drawings. The exemplary embodiments and their descriptions are intended to explain the invention, not to limit it.

It should also be noted that, to avoid obscuring the invention with unnecessary detail, the drawings show only the structures and/or processing steps closely related to the solutions of the invention, and omit other details of little relevance.

It should be emphasized that the term "comprising/including", as used herein, refers to the presence of a feature, element, step or component, but does not exclude the presence or addition of one or more other features, elements, steps or components.

Unless otherwise specified, the term "connection" herein may refer not only to a direct connection but also to an indirect connection through an intermediary.

Embodiments of the present invention are described below with reference to the accompanying drawings, in which the same reference numerals denote the same or similar components, or the same or similar steps.

Typical voice cloning is built on a speech synthesis pipeline consisting of an acoustic model and a vocoder. First, a base acoustic model is trained on multi-speaker data. Then the target speaker's acoustic features and voiceprint features are extracted from audio recorded in the actual scene, the base acoustic model is adaptively trained on them (updating a large number of network parameters), and the adapted model infers the acoustic features corresponding to the text to be cloned. Finally, the vocoder decodes the target speaker's acoustic features to obtain the cloned voice. Voice cloning can synthesize a target speaker's voice; for example, in the dubbing industry it can quickly synthesize the voice of a target voice actor, reducing dubbing cost and time. Of course, voice cloning can also be applied to other scenarios as needed; the embodiments of this specification impose no specific limitation.

However, in real-world voice cloning scenarios, achieving a good cloning result typically imposes many requirements on the target speaker's recordings, such as the number of recordings and their signal-to-noise ratio, while adaptive training on the base model takes a long time and updates many acoustic model parameters; these limitations greatly degrade the user experience. To improve the naturalness and expressiveness of the synthesized audio, the base acoustic model is usually built on high-quality speech synthesis data and a relatively complex network structure, so in real usage the audio quality of the cloned speech and the duration of adaptive training remain long-standing challenges in this field. To shorten adaptive training, practitioners often simply reduce the adaptation time on the base acoustic model, but this directly reduces the naturalness of the cloned audio.

The embodiments of this specification provide a voice cloning method in which an adaptive acoustic model for the target speaker is trained in advance on top of a base acoustic model augmented with conditional layer normalization layers; when training the adaptive acoustic model, only the parameters of the conditional layer normalization layers in the base model's decoder need to be updated, and at cloning time the trained adaptive acoustic model synthesizes the text to be synthesized into audio in the target speaker's voice. Since the whole process updates only a small, designated set of model parameters, the specified text can be synthesized into the target speaker's audio quickly and accurately, greatly reducing the model training time of voice cloning and thus improving its efficiency without affecting the quality of the cloned audio, which improves the user experience.

Fig. 1 is a schematic flowchart of the voice cloning method provided in one embodiment of this specification. As shown in Fig. 1, in one embodiment of the method provided in this specification, the method may be applied to terminal devices such as computers, tablets, servers, smartphones and smart wearables, and may include the following steps:

Step 102: receive text to be synthesized.

In a specific implementation, voice cloning uses a smart device to synthesize specified content into audio data spoken by the target speaker. The text to be synthesized is the content to be cloned into the target speaker's voice; it may be text or a picture, or it may be speech, which is converted into the corresponding text by speech recognition. For example, if user A wants a smart device to synthesize the sentence "the weather is fine today" in user B's voice, that sentence is the text to be synthesized. The text to be synthesized may be in different languages, such as Chinese or English, depending on actual needs; the embodiments of this specification impose no specific limitation.

Step 104: convert the text to be synthesized into a phoneme sequence to be synthesized.

In a specific implementation, once the text to be synthesized is received, phoneme conversion can be performed. A phoneme sequence in the embodiments of this specification is a combination of the smallest units of speech; the text to be synthesized can be converted into the corresponding phoneme sequence by a TTS (text-to-speech) front end or similar front-end software.
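As a rough illustration of this front-end step (not part of the patent), the sketch below converts text into a phoneme sequence by greedy longest-match lookup in a pronunciation lexicon. The toy LEXICON and the matching strategy are illustrative assumptions, not the front end actually used; all code examples in this document are Python sketches.

```python
# Hypothetical grapheme-to-phoneme front end. The LEXICON entries and the
# greedy longest-match strategy are stand-ins for a real TTS front end.
LEXICON = {
    "今天": ["j", "in1", "t", "ian1"],
    "天气": ["t", "ian1", "q", "i4"],
    "很好": ["h", "en3", "h", "ao3"],
}

def text_to_phonemes(text: str) -> list[str]:
    phonemes, i = [], 0
    while i < len(text):
        # try the longest lexicon entry starting at position i
        for j in range(len(text), i, -1):
            if text[i:j] in LEXICON:
                phonemes.extend(LEXICON[text[i:j]])
                i = j
                break
        else:
            i += 1  # character not covered by the toy lexicon: skip it
    return phonemes

print(text_to_phonemes("今天天气很好"))  # ['j', 'in1', 't', 'ian1', ...]
```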

Step 106: feed the phoneme sequence to be synthesized into an adaptive acoustic model and use the adaptive acoustic model to convert it into acoustic features of the target speaker, where the adaptive acoustic model is obtained by adaptively training a base acoustic model on target voice sample data of the target speaker and, during that adaptive training, the parameters of the conditional layer normalization layers in the base model's decoder are updated.

In a specific implementation, the target speaker is the person whose voice is to be cloned; for example, if a user wants a voice cloning device to synthesize audio in user B's voice, user B is the target speaker. An adaptive acoustic model can be trained in advance on the target speaker's voice sample data; when the target speaker's audio needs to be cloned, the phoneme sequence converted from the corresponding text is fed into the adaptive acoustic model to synthesize the target speaker's audio data. As described in the embodiments above, the adaptive acoustic model is obtained by adaptive training on top of a base acoustic model. The base acoustic model is trained on a large amount of voice sample data; it can learn the voice characteristics of different speakers and then synthesize specified content with a specified speaker's acoustic features. The base acoustic model may be chosen according to actual needs: an autoregressive speech synthesis model such as Tacotron 2, TransformerTTS or Deep Voice 3, or a non-autoregressive model such as FastSpeech or FastSpeech 2; the embodiments of this specification impose no specific limitation.

In some embodiments of this specification, the base acoustic model may be FastSpeech 2 with conditional layer normalization (CLN) added. When a target speaker's voice needs to be cloned, the target speaker's audio, i.e. the target voice sample data, is collected and used to adaptively train the base acoustic model; during this adaptive training, only the parameters of the conditional layer normalization layers in the base model's decoder are updated, yielding the adaptive acoustic model. The conditional layer normalization layers bound the number of parameters updated while training the adaptive acoustic model, which reduces model training time, improves training efficiency, and also eases later deployment.
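The patent does not give the exact form of its conditional layer normalization, but a common formulation (AdaSpeech-style CLN) predicts the normalization scale and bias from a speaker conditioning vector, so only two small projections per layer need updating during adaptation. A minimal PyTorch sketch under that assumption:

```python
import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    """Layer norm whose scale and bias are predicted from a conditioning
    vector (e.g. the speaker's x-vector). During speaker adaptation only
    these two small linear projections would be updated."""
    def __init__(self, hidden_dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_scale = nn.Linear(cond_dim, hidden_dim)
        self.to_bias = nn.Linear(cond_dim, hidden_dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim); cond: (batch, cond_dim)
        scale = self.to_scale(cond).unsqueeze(1)  # (batch, 1, hidden_dim)
        bias = self.to_bias(cond).unsqueeze(1)
        return self.norm(x) * scale + bias
```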

Fig. 2 is a schematic diagram of the training process of the base acoustic model in one embodiment of this specification. As shown in Fig. 2, the base acoustic model in the embodiments of this specification may also include a reference encoder, which models the acoustic features extracted from audio data and learns their regularities, enabling the base acoustic model to capture the user's pronunciation characteristics, such as stress, pauses and speaking style, and thereby improving the prosody and emotional quality of the final synthesized speech, making the cloned voice closer to how the real person speaks.

As shown in Fig. 2, the training process of the base acoustic model may include:

collecting sample training audio data, and extracting training voiceprint vectors and training acoustic features from it;

converting the sample training audio data into training text;

converting the training text into a training phoneme sequence, and aligning the sample training audio data with the training text to obtain a training duration sequence;

setting the model parameters of the base acoustic model, where the base acoustic model includes an encoder, a duration predictor and a reference encoder, and the decoder includes conditional layer normalization layers;

feeding the training phoneme sequence into the encoder, the training duration sequence into the duration predictor, and the training acoustic features into the reference encoder;

feeding the outputs of the encoder, the duration predictor and the reference encoder, together with the voiceprint vector, into the decoder, and training the base acoustic model.

As shown in Fig. 2, the base acoustic model can be understood as an intelligent learning model that takes audio and text data as input and outputs predicted acoustic features. Sample training audio data for base model training, together with the corresponding training text, can be collected in advance from an audio sample database; features are extracted from both, and the extracted features are used to train the base acoustic model. The specific process is as follows:

Step 1: feature extraction from audio and text. First, a trained voiceprint recognition model extracts a 256-dimensional voiceprint vector (x-vector) from the audio, i.e. the sample training audio data, and acoustic features (a 320-dimensional mel spectrum) are also extracted from the audio. Second, the recorded text, i.e. the training text corresponding to the sample training audio data, is converted into the corresponding phoneme sequence. Finally, the audio and text are aligned at the phoneme level with a trained MFA (Montreal Forced Aligner) model to obtain the duration sequence.
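A sketch of the audio side of this step, assuming librosa for the mel spectrum; the STFT settings are assumed values, and the x-vector extractor is left as a placeholder for any pretrained speaker-verification model producing a 256-dimensional embedding:

```python
import librosa
import numpy as np

def extract_mel(wav_path: str, sr: int = 16000, n_mels: int = 320) -> np.ndarray:
    """Log-mel spectrogram matching the 320-dimensional feature above.
    n_fft and hop_length are assumptions, not values from the patent."""
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=2048, hop_length=256, n_mels=n_mels)
    return np.log(np.maximum(mel, 1e-5)).T  # (frames, n_mels)

# The 256-dim x-vector would come from a pretrained voiceprint model, e.g.:
# xvector = voiceprint_model(wav)  # placeholder; model not specified here
```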

Step 2: the phoneme sequence is passed through the encoder, the duration sequence through the duration predictor, and the mel spectrum through the reference encoder; finally, the outputs of these components, together with the voiceprint vector, are fed into the decoder containing the conditional layer normalization layers, which predicts the corresponding acoustic features.

Step 3: compute the MSE loss between the decoder output and the ground-truth mel spectrum, learning the distribution and regularities of the training sample data.
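A schematic training step under the assumption of a FastSpeech 2-style model exposing the inputs named above; the keyword names are illustrative, not the patent's code:

```python
import torch
import torch.nn.functional as F

def base_training_step(model, batch, optimizer) -> float:
    """One base-model update: predict a mel spectrum from phonemes,
    durations, the reference mel and the x-vector, then regress it
    against the ground-truth mel with MSE, as in Step 3."""
    optimizer.zero_grad()
    pred_mel = model(
        phonemes=batch["phonemes"],    # -> encoder
        durations=batch["durations"],  # -> duration predictor
        ref_mel=batch["mel"],          # -> reference encoder
        speaker=batch["xvector"],      # conditions the CLN decoder
    )
    loss = F.mse_loss(pred_mel, batch["mel"])
    loss.backward()
    optimizer.step()
    return loss.item()
```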

After the base acoustic model has been trained on a large amount of training data, it can be saved; when voice cloning is needed for a specified speaker, that speaker's audio data is collected and the base acoustic model is adaptively trained on it, quickly yielding an adaptive acoustic model capable of cloning the specified speaker's voice.

In some embodiments of this specification, the training method for the adaptive acoustic model includes:

collecting target voice sample data of the target speaker;

converting the target voice sample data into target text;

extracting a target voiceprint vector and target acoustic features from the target voice sample data;

converting the target text into a target phoneme sequence, and aligning the target voice sample data with the target text to obtain a target duration sequence;

adaptively training the base acoustic model on the target phoneme sequence, the target duration sequence, the target acoustic features and the target voiceprint vector, updating the parameters of the conditional layer normalization layers in the base model's decoder until preset requirements are met.

In a specific implementation, Fig. 3 is a schematic flowchart of voice cloning in another embodiment of this specification. As shown in Fig. 3, audio recorded by the target speaker, i.e. the target voice sample data, is collected. In general, unlike base model training, this does not require a large number of samples: a small amount of audio, such as 10 recordings, is enough to train an adaptive acoustic model that clones the target speaker's voice characteristics.

In addition, in some embodiments of this specification, the target voice sample data is collected at the same sampling rate as the base acoustic model's training data. The base acoustic model is trained on a large amount of training data; collecting the target speaker's audio at the same sampling rate lets the base model learn the target speaker's voice characteristics quickly and accurately and improves the audio quality of the cloned voice.

After the target speaker's voice sample data has been collected, ASR (automatic speech recognition) can be applied to convert it into target text. Features are then extracted from the target voice sample data and the target text: the target voiceprint vector and target acoustic features are extracted from the audio, the target text is converted into a target phoneme sequence, and the audio and text are aligned to obtain the target duration sequence. For the feature extraction process, refer to the feature extraction for the base acoustic model in Fig. 2 above; it is not repeated here.

As shown in Fig. 3, after feature extraction, the base acoustic model is adaptively trained on the extracted target voiceprint vector, target acoustic features, target duration sequence and target phoneme sequence; on top of the base acoustic model, only the parameters of the CLN layers in the decoder are updated, until preset requirements are met (e.g. 500 update steps), yielding the target speaker's adaptive acoustic model.
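A sketch of the adaptation loop, freezing everything except the conditional layer normalization projections; the "cln" substring filter assumes a naming convention for those modules, and the optimizer settings are not from the patent:

```python
import torch

def adapt_to_speaker(model, target_batches, steps: int = 500, lr: float = 1e-4):
    """Fine-tune only the CLN parameters on the target speaker's few
    recordings, e.g. for the ~500 steps mentioned above. Reuses the
    base_training_step sketch from Step 3."""
    for name, param in model.named_parameters():
        param.requires_grad = "cln" in name  # assumed module naming
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=lr)
    for step in range(steps):
        batch = target_batches[step % len(target_batches)]
        base_training_step(model, batch, optimizer)
    return model
```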

When adapting the acoustic model to the target speaker, updating only the conditional layer normalization layers in the decoder amounts to updating just 70k parameters, and experiments showed that about 500 adaptation steps already achieve a fast, good-quality replica of the voice. Table 1 reports the cloning time, naturalness and similarity measured on a 10-person test set (10 utterances per person) following the flow of Fig. 3, where similarity and naturalness of the cloned voice are scored out of 5. As shown in Table 1, starting from training the target speaker's adaptive acoustic model, the whole voice cloning process takes only 6.2 minutes, and the similarity and naturalness of the speech are both good.

Table 1

Time required for the voice cloning process: 6.2 minutes
Cloned speech naturalness (out of 5): 3.92
Cloned speech similarity (out of 5): 3.87

In the embodiments of this specification, starting from the base acoustic model, only a small amount of the target speaker's audio data is needed to quickly train an adaptive acoustic model that can synthesize the target speaker's acoustic features, and the adaptation updates only a small number of parameters. This greatly reduces adaptive model training time in voice cloning, speeds up model training, and makes it possible to quickly obtain adaptive acoustic models for different target speakers, laying a foundation for the widespread adoption of voice cloning.

Step 108: use a vocoder to synthesize the acoustic features into target synthesized audio data of the target speaker.

In a specific implementation, after the adaptive acoustic model has been trained on the target speaker's audio data, the target speaker's voice can be designated when using the model, and the adaptive acoustic model converts the text to be synthesized into that speaker's acoustic features. Once the text has been converted into the target speaker's acoustic features, a vocoder synthesizes those features into target synthesized audio data spoken in the target speaker's voice.
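Inference then chains the adapted acoustic model and a vocoder. The call signatures below are assumptions about a generic mel-to-waveform vocoder (e.g. a pretrained HiFi-GAN generator), not a specific library API:

```python
import torch

@torch.no_grad()
def clone_speech(acoustic_model, vocoder, phonemes, xvector):
    """Text (as phonemes) -> target-speaker mel -> waveform."""
    mel = acoustic_model(phonemes=phonemes, speaker=xvector)  # (1, T, n_mels)
    waveform = vocoder(mel.transpose(1, 2))  # many vocoders take (1, n_mels, T)
    return waveform.squeeze().cpu().numpy()
```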

For example, if user A wants a smart device at home to clone user B's voice, user B's audio data can be collected in advance and, as described in the embodiments above, used to adaptively train the base acoustic model to obtain user B's adaptive acoustic model. The trained adaptive acoustic model is then loaded onto the smart device, and user A can input the content the device should speak, e.g. "Happy Birthday". The adaptive acoustic model on the device converts "Happy Birthday" into user B's acoustic features, and the vocoder synthesizes the corresponding audio data, whose voice is user B's.

Furthermore, once the target speaker's adaptive acoustic model has been trained, any later cloning of that speaker's audio can directly use the trained model to predict acoustic features for the phoneme sequence to be synthesized, without retraining the adaptive acoustic model.

With the voice cloning method provided by the embodiments of this specification, only a small amount of the target speaker's audio data needs to be collected to train, on top of the base acoustic model, an adaptive acoustic model for the target speaker. The adaptive acoustic model can then convert specified content into the target speaker's acoustic features, and a vocoder can clone the target speaker's audio. By introducing conditional layer normalization into the decoder of the acoustic model, the adaptive training time is reduced without losing naturalness in the synthesized audio, shortening adaptive model training during voice cloning, speeding up voice cloning, and preserving its audio quality.

As shown in Fig. 3, in some embodiments of this specification, the method further includes:

when adaptively training the base acoustic model on the target speaker's voice sample data, denoising the target voice sample data with a neural network algorithm, and using the denoised target voice sample data for the adaptive training to obtain the adaptive acoustic model.

In a specific implementation, during adaptive training of the base acoustic model, the collected target voice sample data can first be denoised with a neural network algorithm such as an RNN (recurrent neural network); features are then extracted from the denoised data and used for the adaptive training to obtain the adaptive acoustic model. In real usage scenarios, the target speaker's audio contains some noise and the recordings have a low signal-to-noise ratio, and since adaptive training of the base acoustic model updates a large number of model parameters, this would degrade the quality and naturalness of the synthesized audio.

By applying neural network denoising before adaptive acoustic model training, the embodiments of this specification improve the quality of the target speaker's audio data, so that no excessive demands need to be placed on the environment and quality of the target speaker's recordings; this lowers the difficulty and requirements of sample collection while improving the quality of voice cloning.
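The patent does not specify the RNN denoiser's architecture; a minimal sketch in the same spirit, predicting a per-bin mask over the magnitude spectrogram with a GRU (all sizes are assumptions):

```python
import torch
import torch.nn as nn

class SpectralMaskDenoiser(nn.Module):
    """A GRU predicts a [0, 1] mask over noisy magnitude spectra; the
    denoised magnitude is the input scaled by the mask."""
    def __init__(self, n_bins: int = 257, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(n_bins, hidden, num_layers=2, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_bins), nn.Sigmoid())

    def forward(self, noisy_mag: torch.Tensor) -> torch.Tensor:
        # noisy_mag: (batch, frames, n_bins)
        h, _ = self.rnn(noisy_mag)
        return noisy_mag * self.mask(h)
```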

In addition, as shown in Fig. 3, in some embodiments of this specification, the method further includes:

using a noise reduction model to apply a second denoising pass to the target synthesized audio data, and obtaining and outputting the denoised target synthesized audio data.

In a specific implementation, after the acoustic features output by the adaptive acoustic model have been synthesized into audio data, a noise reduction model such as speexdsp can denoise the synthesized audio, yielding the denoised target synthesized audio data for output. speexdsp also provides additional functions such as echo suppression and noise cancellation. Experiments showed that denoising the final synthesized audio improves its sound quality.

By denoising twice, on the target speaker's audio data and on the cloned audio, with a method suited to each, the embodiments of this specification improve the sound quality of the cloned audio and thus the quality of voice cloning.

As shown in Fig. 3, the voice cloning method provided by the embodiments of this specification uses 10 utterances recorded by the target speaker in a real usage scenario and updates only the designated network layers on top of the already trained base acoustic model, shortening the model's adaptation time; in addition, two different denoising passes preserve the sound quality and naturalness of the cloned audio.

First, in real usage scenarios users are in varied environments, and some users' recordings inevitably contain noise, which to some extent affects the quality of the final synthesized audio. In the preprocessing stage of the embodiments of this specification, the main operations are sampling rate conversion, loudness normalization and denoising. The sampling rate of the user's recordings must match that of the base acoustic model's training data; loudness normalization can be handled with the audio tool sox, and denoising uses neural network based noise reduction.
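A sketch of the resampling and loudness normalization steps via sox; the target rate here is an assumed example and must match the base model's training data (modern sox inserts a rate conversion automatically when the output rate differs from the input):

```python
import subprocess

def preprocess_recording(in_wav: str, out_wav: str, target_sr: int = 16000):
    """Resample to the base model's rate and normalize loudness to -3 dBFS
    with sox's gain effect. Denoising happens separately (see the RNN
    denoiser sketch above)."""
    subprocess.run(
        ["sox", in_wav, "-r", str(target_sr), out_wav, "gain", "-n", "-3"],
        check=True,
    )
```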

Second, feature extraction. This step consists of speech-to-text and text-to-phoneme conversion, alignment with MFA, voiceprint vector extraction, and acoustic feature extraction. First, the user records 10 fluent utterances on a phone in a quiet environment, and an ASR tool transcribes the speech into the corresponding text. Second, a Chinese pronunciation dictionary converts the recognized text into the corresponding phoneme sequence. Then the alignment tool MFA aligns the phoneme sequence with the corresponding audio in time, giving the pronunciation duration of each phoneme. Next, a trained voiceprint recognition model extracts the speaker information from the audio as a 256-dimensional vector (x-vector). Finally, acoustic features are extracted from the audio, here a 320-dimensional mel spectrum.
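A sketch of turning MFA's alignment output into the per-phoneme duration sequence, assuming the `textgrid` package and MFA's usual "phones" tier; the hop size must match the mel extraction settings and is an assumption here:

```python
import textgrid  # reads the TextGrid files produced by MFA

def durations_from_alignment(tg_path: str, hop_seconds: float = 256 / 16000):
    """Convert aligned phone intervals into frame counts for the
    duration predictor."""
    tg = textgrid.TextGrid.fromFile(tg_path)
    phone_tier = next(t for t in tg.tiers if t.name == "phones")
    phonemes, durations = [], []
    for interval in phone_tier:
        phonemes.append(interval.mark or "sil")  # empty mark = silence
        frames = round((interval.maxTime - interval.minTime) / hop_seconds)
        durations.append(max(frames, 1))
    return phonemes, durations
```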

Finally, adaptive training of the model. The embodiments of this specification build on FastSpeech 2. The main changes are the addition of conditional layer normalization layers and a reference encoder: the conditional layer normalization layers control the number of model parameters updated during adaptation and also ease later deployment, while the reference encoder extracts the user's pronunciation characteristics from the user's recordings.

With the improved acoustic model, the voice cloning method provided by the embodiments of this specification needs to update only the designated parameters during adaptive learning, achieving a faster, high-quality replica of the voice and improving the speed and effect of voice cloning; neural network denoising before adaptation and a further speexdsp denoising pass after synthesis improve the sound quality of the cloned audio.

The embodiments of the above method in this specification are described in a progressive manner; for identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. For related details, refer to the corresponding parts of the method embodiments.

Based on the voice cloning method described above, one or more embodiments of this specification further provide a voice cloning device. The device may include devices (including distributed systems), software (applications), modules, components, servers, clients, etc. that use the methods described in the embodiments of this specification, combined with the necessary hardware. Based on the same innovative concept, the devices in the one or more embodiments provided by this specification are as described in the following embodiments. Since the way the device solves the problem is similar to the method, the implementation of the specific device may refer to the implementation of the foregoing method, and repeated parts are not described again. As used below, the term "unit" or "module" may be a combination of software and/or hardware that realizes a predetermined function. Although the devices described in the following embodiments are preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and contemplated.

Specifically, Fig. 4 is a schematic diagram of the module structure of one embodiment of the voice cloning device provided in this specification. As shown in Fig. 4, the device provided in this specification may include:

文本接收模块41,用于接收待合成文本;Text receiving module 41, is used for receiving text to be synthesized;

音素转换模块42,用于将所述待合成文本转换为待合成音素序列;A phoneme conversion module 42, configured to convert the text to be synthesized into a phoneme sequence to be synthesized;

声学特征预测模块43,用于将所述待合成音素序列输入自适应声学模型,利用所述自适应声学模型将所述待合成音素序列转换为目标对象的待合成声学特征;其中,所述自适应声学模型是基于所述目标对象的目标语音样本数据对基础声学模型进行自适应训练获得的,在对所述基础声学模型进行自适应训练时,对所述基础声学模型中解码器的条件归一化层的参数进行更新;An acoustic feature prediction module 43, configured to input the phoneme sequence to be synthesized into an adaptive acoustic model, and convert the phoneme sequence to be synthesized into an acoustic feature to be synthesized of a target object by using the adaptive acoustic model; The adaptive acoustic model is obtained by performing adaptive training on the basic acoustic model based on the target speech sample data of the target object. When performing adaptive training on the basic acoustic model, the conditional normalization of the decoder in the basic acoustic model The parameters of the chemical layer are updated;

a speech synthesis module 44, configured to use a vocoder to synthesize the acoustic features to be synthesized into target synthesized audio data of the target object.
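The four modules above map naturally onto a small pipeline class. The sketch below is one illustrative arrangement rather than the patent's implementation; the `g2p`, `acoustic_model`, and `vocoder` attributes are assumed interfaces standing in for modules 42 through 44.

```python
class VoiceCloningApparatus:
    """Mirrors modules 41-44: text receiving, phoneme conversion,
    acoustic feature prediction, and speech synthesis."""

    def __init__(self, g2p, acoustic_model, vocoder):
        self.g2p = g2p                        # phoneme conversion module (42)
        self.acoustic_model = acoustic_model  # acoustic feature prediction module (43)
        self.vocoder = vocoder                # speech synthesis module (44)

    def clone(self, text: str):
        # Text receiving module (41): accept the text to be synthesized.
        phonemes = self.g2p(text)
        # Predict the target speaker's acoustic features from the phonemes.
        features = self.acoustic_model(phonemes)
        # Render the features into a waveform with the vocoder.
        return self.vocoder(features)
```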

In some embodiments of this specification, the apparatus further includes a model training module configured to train the adaptive acoustic model using the following method (a sketch of the conditional normalization layer follows these steps):

collecting target speech sample data of the target object;

converting the target speech sample data into target text;

extracting a target voiceprint vector and target acoustic features from the target speech sample data;

converting the target text into a target phoneme sequence, and aligning the target speech sample data with the target text to obtain a target duration sequence;

adaptively training the basic acoustic model based on the target phoneme sequence, the target duration sequence, the target acoustic features, and the target voiceprint vector, and updating the parameters of the conditional normalization layer of the decoder in the basic acoustic model until preset requirements are met.
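The patent does not spell out the internals of the conditional normalization layer mentioned in these steps. Below is a minimal sketch of one common formulation, in which the normalization scale and bias are predicted from the speaker (voiceprint) vector, so that speaker adaptation only has to update these small projections. All class and parameter names here are illustrative assumptions, not the patent's definitions.

```python
import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    """Layer normalization whose scale (gamma) and bias (beta) are
    predicted from a speaker embedding, keeping the speaker-specific
    parameter footprint small."""

    def __init__(self, hidden_dim: int, speaker_dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.scale_proj = nn.Linear(speaker_dim, hidden_dim)  # predicts gamma
        self.bias_proj = nn.Linear(speaker_dim, hidden_dim)   # predicts beta

    def forward(self, x: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim); spk_emb: (batch, speaker_dim)
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        gamma = self.scale_proj(spk_emb).unsqueeze(1)  # (batch, 1, hidden_dim)
        beta = self.bias_proj(spk_emb).unsqueeze(1)
        return gamma * x_norm + beta
```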

The voice cloning apparatus provided by the embodiments of this specification only needs to collect a small amount of audio data from the target object to train an adaptive acoustic model for that object on top of the basic acoustic model. The adaptive acoustic model can then convert specified content into the acoustic features of the target object, and the vocoder can clone the target object's audio data from those features. During the adaptive learning of the basic acoustic model, introducing the conditional normalization layer into the decoder of the acoustic model reduces the time required for adaptive training without sacrificing the naturalness of the synthesized audio. This shortens the adaptive-model training time in the voice cloning process, improves the speed of voice cloning, and preserves the audio quality of the cloned speech.
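One way to realize "updating only the conditional normalization parameters" is to freeze everything else in the pretrained base model before adaptation. The sketch below assumes the `ConditionalLayerNorm` class from the previous sketch and assumes the decoder's submodules carry "decoder" in their registered names; both are assumptions for illustration, not details given in the patent.

```python
import torch
from torch import nn

def prepare_for_adaptation(model: nn.Module) -> list:
    """Freeze every parameter of the pretrained base acoustic model except
    those of the decoder's conditional normalization layers, so adaptive
    training touches only this small parameter subset."""
    for p in model.parameters():
        p.requires_grad = False
    trainable = []
    for name, module in model.named_modules():
        if isinstance(module, ConditionalLayerNorm) and "decoder" in name:
            for p in module.parameters():
                p.requires_grad = True
                trainable.append(p)
    return trainable

# Usage sketch: optimize only the unfrozen parameters.
# trainable = prepare_for_adaptation(base_model)
# optimizer = torch.optim.Adam(trainable, lr=1e-4)
```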

In some embodiments of this specification, a speech synthesis device is further provided, including a processor and a memory, where the memory stores computer instructions and the processor is configured to execute the computer instructions stored in the memory; when the computer instructions are executed by the processor, the device implements the voice cloning method described in the above embodiments, for example (a sketch follows these steps):

receiving text to be synthesized;

converting the text to be synthesized into a phoneme sequence to be synthesized;

inputting the phoneme sequence to be synthesized into an adaptive acoustic model, and using the adaptive acoustic model to convert the phoneme sequence to be synthesized into acoustic features to be synthesized of a target object, where the adaptive acoustic model is obtained by adaptively training a basic acoustic model based on target speech sample data of the target object, and during the adaptive training of the basic acoustic model, the parameters of the conditional normalization layer of the decoder in the basic acoustic model are updated;

using a vocoder to synthesize the acoustic features to be synthesized into target synthesized audio data of the target object.
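Putting the adaptation and inference steps together, a device of this kind might behave as sketched below. This reuses `prepare_for_adaptation` from the earlier sketch; the `adaptation_loss` method, the `g2p` front end, and the training schedule are assumptions introduced here for illustration only.

```python
import torch

def adapt_and_synthesize(base_model, vocoder, g2p, target_samples, text,
                         steps: int = 500):
    """Adapt the base model on the target speaker's samples, updating only
    the conditional normalization parameters, then run the inference steps
    listed above: text -> phonemes -> acoustic features -> waveform."""
    trainable = prepare_for_adaptation(base_model)
    optimizer = torch.optim.Adam(trainable, lr=1e-4)
    for _ in range(steps):
        # adaptation_loss is an assumed method computing the training
        # objective on the target speaker's (denoised) samples.
        loss = base_model.adaptation_loss(target_samples)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    phonemes = g2p(text)          # text -> phoneme sequence
    features = base_model(phonemes)  # phonemes -> acoustic features
    return vocoder(features)      # features -> synthesized audio
```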

It should be noted that the apparatus and device described above may also include other implementations according to the description of the method embodiments. For specific implementations, reference may be made to the descriptions of the related method embodiments, which are not repeated here one by one.

The method embodiments provided in this specification may be executed on a mobile terminal, a computer terminal, a server, or a similar computing device. Taking execution on a server as an example, FIG. 5 is a block diagram of the hardware structure of a voice cloning server in one embodiment of this specification; the computer terminal may be the voice cloning server or the voice cloning apparatus of the above embodiments. As shown in FIG. 5, the server 10 may include one or more processors 100 (only one is shown in the figure; the processor 100 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a non-volatile memory 200 for storing data, and a transmission module 300 for communication functions. Those of ordinary skill in the art will understand that the structure shown in FIG. 5 is merely illustrative and does not limit the structure of the above electronic device. For example, the server 10 may include more or fewer components than those shown in FIG. 5, may include other processing hardware such as a database, a multi-level cache, or a GPU, or may have a configuration different from that shown in FIG. 5.

The non-volatile memory 200 may be used to store software programs and modules of application software, such as the program instructions/modules corresponding to the voice cloning method in the embodiments of this specification; the processor 100 executes various functional applications and resource data updates by running the software programs and modules stored in the non-volatile memory 200. The non-volatile memory 200 may include high-speed random access memory, and may further include non-volatile memory such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the non-volatile memory 200 may further include memory located remotely from the processor 100, and such remote memory may be connected to the computer terminal through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission module 300 is used to receive or send data via a network. Specific examples of the above network may include a wireless network provided by the communication provider of the computer terminal. In one example, the transmission module 300 includes a network interface controller (NIC), which can connect to other network devices through a base station so as to communicate with the Internet. In another example, the transmission module 300 may be a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.

Corresponding to the above method, the present invention further provides an apparatus. The apparatus includes computer equipment, the computer equipment includes a processor and a memory, the memory stores computer instructions, and the processor is configured to execute the computer instructions stored in the memory; when the computer instructions are executed by the processor, the apparatus implements the steps of the method described above.

An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the aforementioned voice cloning method are implemented. The computer-readable storage medium may be a tangible storage medium, such as random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a floppy disk, a hard disk, a removable storage disk, a CD-ROM, or any other form of storage medium known in the art.

Those of ordinary skill in the art will understand that the exemplary components, systems, and methods described in conjunction with the embodiments disclosed herein can be implemented in hardware, software, or a combination of the two. Whether an implementation uses hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may use different approaches to implement the described functions for each specific application, but such implementations should not be regarded as exceeding the scope of the present invention. When implemented in hardware, the implementation may be, for example, an electronic circuit, an application-specific integrated circuit (ASIC), suitable firmware, a plug-in, or a function card. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The programs or code segments may be stored in a machine-readable medium, or transmitted over a transmission medium or communication link by a data signal carried in a carrier wave.

It should be clear that the present invention is not limited to the specific configurations and processes described above and shown in the figures. For brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and shown; those skilled in the art, having grasped the spirit of the present invention, may make various changes, modifications, and additions, or change the order of the steps.

In the present invention, features described and/or illustrated for one embodiment may be used in the same or a similar manner in one or more other embodiments, and/or may be combined with or substituted for features of other embodiments.

The above descriptions are merely preferred embodiments of the present invention and are not intended to limit the present invention; for those skilled in the art, the embodiments of the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A voice cloning method, characterized in that the method comprises: receiving text to be synthesized; converting the text to be synthesized into a phoneme sequence to be synthesized; inputting the phoneme sequence to be synthesized into an adaptive acoustic model, and using the adaptive acoustic model to convert the phoneme sequence to be synthesized into acoustic features to be synthesized of a target object, wherein the adaptive acoustic model is obtained by adaptively training a basic acoustic model based on target speech sample data of the target object, and during the adaptive training of the basic acoustic model, the parameters of the conditional normalization layer of the decoder in the basic acoustic model are updated; and using a vocoder to synthesize the acoustic features to be synthesized into target synthesized audio data of the target object.

2. The method according to claim 1, characterized in that the training method of the adaptive acoustic model comprises: collecting target speech sample data of the target object; converting the target speech sample data into target text; extracting a target voiceprint vector and target acoustic features from the target speech sample data; converting the target text into a target phoneme sequence, and aligning the target speech sample data with the target text to obtain a target duration sequence; and adaptively training the basic acoustic model based on the target phoneme sequence, the target duration sequence, the target acoustic features, and the target voiceprint vector, updating the parameters of the conditional normalization layer of the decoder in the basic acoustic model until preset requirements are met.

3. The method according to claim 1, characterized in that the method further comprises: when adaptively training the basic acoustic model based on the target speech sample data of the target object, using a neural network algorithm to denoise the target speech sample data, and using the denoised target speech sample data to adaptively train the basic acoustic model to obtain the adaptive acoustic model.

4. The method according to claim 3, characterized in that the method further comprises: using a noise reduction model to perform secondary noise reduction on the target synthesized audio data, and obtaining and outputting the denoised target synthesized audio data.

5. The method according to claim 2, characterized in that the sampling frequency of the target speech sample data is the same as the sampling frequency of the training data of the basic acoustic model.

6. The method according to claim 1, characterized in that the basic acoustic model further includes a reference encoder, and the reference encoder is used to learn the regularities of the acoustic features in audio data.

7. A voice cloning apparatus, characterized in that the apparatus comprises: a text receiving module, configured to receive text to be synthesized; a phoneme conversion module, configured to convert the text to be synthesized into a phoneme sequence to be synthesized; an acoustic feature prediction module, configured to input the phoneme sequence to be synthesized into an adaptive acoustic model and use the adaptive acoustic model to convert the phoneme sequence to be synthesized into acoustic features to be synthesized of a target object, wherein the adaptive acoustic model is obtained by adaptively training a basic acoustic model based on target speech sample data of the target object, and during the adaptive training of the basic acoustic model, the parameters of the conditional normalization layer of the decoder in the basic acoustic model are updated; and a speech synthesis module, configured to use a vocoder to synthesize the acoustic features to be synthesized into target synthesized audio data of the target object.

8. The apparatus according to claim 7, characterized in that the apparatus further comprises a model training module configured to train the adaptive acoustic model using the following method: collecting target speech sample data of the target object; converting the target speech sample data into target text; extracting a target voiceprint vector and target acoustic features from the target speech sample data; converting the target text into a target phoneme sequence, and aligning the target speech sample data with the target text to obtain a target duration sequence; and adaptively training the basic acoustic model based on the target phoneme sequence, the target duration sequence, the target acoustic features, and the target voiceprint vector, updating the parameters of the conditional normalization layer of the decoder in the basic acoustic model until preset requirements are met.

9. A voice cloning device, comprising a processor and a memory, characterized in that the memory stores computer instructions, and the processor is configured to execute the computer instructions stored in the memory; when the computer instructions are executed by the processor, the device implements the steps of the method according to any one of claims 1 to 6.

10. A computer-readable storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, the steps of the method according to any one of claims 1 to 6 are implemented.
CN202310066406.9A 2023-01-17 2023-01-17 A voice cloning method, device, equipment and storage medium Pending CN116206592A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310066406.9A CN116206592A (en) 2023-01-17 2023-01-17 A voice cloning method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310066406.9A CN116206592A (en) 2023-01-17 2023-01-17 A voice cloning method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116206592A true CN116206592A (en) 2023-06-02

Family

ID=86516645

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310066406.9A Pending CN116206592A (en) 2023-01-17 2023-01-17 A voice cloning method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116206592A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116825081A (en) * 2023-08-25 2023-09-29 摩尔线程智能科技(北京)有限责任公司 Speech synthesis method, device and storage medium based on small sample learning
CN116825081B (en) * 2023-08-25 2023-11-21 摩尔线程智能科技(北京)有限责任公司 Speech synthesis method, device and storage medium based on small sample learning

Similar Documents

Publication Title
JP6903129B2 (en) Whispering conversion methods, devices, devices and readable storage media
CN110379412B (en) Voice processing method and device, electronic equipment and computer readable storage medium
CN108597496B (en) Voice generation method and device based on generation type countermeasure network
US20230223010A1 (en) Video generation method, generation model training method and apparatus, and medium and device
CN108922543B (en) Model base establishing method, voice recognition method, device, equipment and medium
TW201935464A (en) Method and device for voiceprint recognition based on memorability bottleneck features
CN108346427A (en) Voice recognition method, device, equipment and storage medium
CN108108357B (en) Accent conversion method and device and electronic equipment
WO2023030235A1 (en) Target audio output method and system, readable storage medium, and electronic apparatus
CN113948062A (en) Data conversion method and computer storage medium
CN110176243B (en) Speech enhancement method, model training method, device and computer equipment
CN111986675A (en) Voice dialogue method, device and computer readable storage medium
CN112002307B (en) Voice recognition method and device
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
CN108198566B (en) Information processing method and device, electronic device and storage medium
CN111833878A (en) Chinese voice interaction sensorless control system and method based on Raspberry Pi edge computing
CN112599114A (en) Voice recognition method and device
US12094484B2 (en) General speech enhancement method and apparatus using multi-source auxiliary information
CN114882861A (en) Voice generation method, device, equipment, medium and product
CN115700871A (en) Model training and speech synthesis method, device, equipment and medium
CN111968622A (en) Attention mechanism-based voice recognition method, system and device
CN116206592A (en) A voice cloning method, device, equipment and storage medium
CN118486297B (en) Response method based on voice emotion recognition and intelligent voice assistant system
WO2020073839A1 (en) Voice wake-up method, apparatus and system, and electronic device
CN112116909A (en) Speech recognition method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination