
CN115294955B - Model training and speech synthesis method, device, equipment and medium - Google Patents

Model training and speech synthesis method, device, equipment and medium

Info

Publication number
CN115294955B
CN115294955B (application CN202110419495.1A)
Authority
CN
China
Prior art keywords
language, feature, voice, sample, text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110419495.1A
Other languages
Chinese (zh)
Other versions
CN115294955A (en)
Inventor
李永强
张大成
朱晓旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Orion Star Technology Co Ltd
Original Assignee
Beijing Orion Star Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Orion Star Technology Co Ltd filed Critical Beijing Orion Star Technology Co Ltd
Priority to CN202110419495.1A priority Critical patent/CN115294955B/en
Publication of CN115294955A publication Critical patent/CN115294955A/en
Application granted granted Critical
Publication of CN115294955B publication Critical patent/CN115294955B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/086: Detection of language
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a model training and speech synthesis method, device, equipment and medium. In the embodiment of the invention, after the first text feature of a first voice sample in a first language in a first sample set and the second text feature of a second voice sample in a second language are obtained, a first acoustic feature corresponding to the first text feature and a second acoustic feature of the second voice sample are determined. The first acoustic feature and the second acoustic feature are acoustic features of the same speaker, and the first language is different from the second language, so acoustic features of the same speaker in different languages are obtained. This facilitates subsequent training of an original speech synthesis model based on the first text feature with its corresponding first acoustic feature and the second text feature with its corresponding second acoustic feature, yielding a target speech synthesis model through which speech synthesis in the first language and the second language can be realized for each speaker in the first sample set.

Description

Model training and speech synthesis method, device, equipment and medium
Technical Field
The present invention relates to the field of natural language understanding technologies, and in particular, to a method, an apparatus, a device, and a medium for model training and speech synthesis.
Background
With the progress of internationalization, information content increasingly mixes multiple languages, so a speech synthesis model that supports multiple languages simultaneously is needed. In general, the languages a speech synthesis model supports are determined by the languages of the sample corpus used to train it, and collecting a large number of speech samples of the same speaker in multiple languages is difficult, so obtaining a speech synthesis model supporting multiple speakers and multiple languages is even more difficult.
Therefore, how to obtain a speech synthesis model capable of supporting multiple speakers and multiple languages is a technical problem to be solved.
Disclosure of Invention
The embodiment of the invention provides a model training and speech synthesis method, device, equipment and medium, which are used for solving the problem that a speech synthesis model supporting multiple speakers and multiple languages is difficult to obtain with existing approaches.
The embodiment of the invention provides a training method of a speech synthesis model, which comprises the following steps:
Acquiring a first text feature of a first voice sample in a first language in a first sample set and a second text feature corresponding to a second voice sample in a second language, wherein the first language is different from the second language;
determining a first acoustic feature corresponding to the first text feature and a second acoustic feature of the second voice sample, wherein the first acoustic feature and the second acoustic feature are acoustic features of the same speaker;
Training an original speech synthesis model based on the first text feature and the corresponding first acoustic feature, the second text feature and the corresponding second acoustic feature to obtain a target speech synthesis model, wherein the target speech synthesis model supports speech synthesis of a first language and a second language of each speaker in the first sample set.
The embodiment of the invention provides a voice synthesis method, which comprises the following steps:
Acquiring text characteristics of text information to be synthesized;
Acquiring target acoustic features corresponding to the text features based on the text features, the identification information of the target speakers and the identification information of target languages of the text information through a target voice synthesis model of multiple speakers and multiple languages trained in advance; wherein the target speech synthesis model supports speech synthesis of a target language of the target speaker;
and determining audio data of the target speaker uttering the text information in the target language based on the target acoustic features and a vocoder.
The embodiment of the invention provides a training device of a speech synthesis model, which comprises the following components:
an obtaining unit, configured to obtain a first text feature of a first voice sample in a first language in a first sample set and a second text feature corresponding to a second voice sample in a second language, where the first language is different from the second language;
The determining unit is used for determining a first acoustic feature corresponding to the first text feature and a second acoustic feature of the second voice sample, wherein the first acoustic feature and the second acoustic feature are acoustic features of the same speaker;
The training unit is used for training the original voice synthesis model based on the first text feature and the corresponding first acoustic feature, the second text feature and the corresponding second acoustic feature to obtain a target voice synthesis model, and the target voice synthesis model supports voice synthesis of a first language and a second language of each speaker in the first sample set.
The embodiment of the invention provides a voice synthesis device, which comprises:
the acquisition module is used for acquiring text characteristics of the text information to be synthesized;
The processing module is used for acquiring target acoustic features corresponding to the text features based on the text features, the identification information of the target speakers and the identification information of the target language of the text information through a target voice synthesis model of multiple speakers and multiple languages trained in advance; wherein the target speech synthesis model supports speech synthesis of a target language of the target speaker;
And the synthesis module is used for determining audio data of the target speaker uttering the text information in the target language based on the target acoustic features and a vocoder.
The embodiment of the invention provides electronic equipment, which at least comprises a processor and a memory, wherein the processor is used for realizing the steps of the speech synthesis model training method or the steps of the speech synthesis method when executing a computer program stored in the memory.
Embodiments of the present invention provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of a speech synthesis model training method as described above, or implements the steps of a speech synthesis method as described above.
In the embodiment of the invention, after the first text feature of a first voice sample in the first sample set and the second text feature of a second voice sample in a second language are obtained, a first acoustic feature corresponding to the first text feature and a second acoustic feature of the second voice sample are determined. The first acoustic feature and the second acoustic feature are acoustic features of the same speaker, and the first language is different from the second language, so acoustic features of the same speaker in different languages are obtained. This allows the parameter values of the parameters contained in the speech synthesis model to be adjusted accurately according to the first-language acoustic feature and the first-language acoustic feature output by the model, and the second-language acoustic feature and the second-language acoustic feature output by the model, which improves the accuracy of the target speech synthesis model, so that speech synthesis in the first language and the second language can be realized for each speaker in the first sample set through the target speech synthesis model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a training process of a speech synthesis model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a training flow of a specific speech synthesis model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a speech synthesis process according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a specific speech synthesis flow provided in an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a training device for speech synthesis model according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of another electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the attached drawings, wherein it is apparent that the embodiments described are only some, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order to reduce the difficulty of acquiring a multi-speaker multi-language speech synthesis model, the embodiment of the invention provides a speech synthesis model training and speech synthesis method, device, equipment and medium.
Example 1: fig. 1 is a schematic diagram of a training process of a speech synthesis model according to an embodiment of the present invention, where the process includes:
S101: a first text feature of a first voice sample in a first language in a first sample set and a second text feature corresponding to a second voice sample in a second language are obtained, wherein the first language is different from the second language.
The voice synthesis model training method provided by the embodiment of the invention is applied to electronic equipment, and the electronic equipment can be intelligent equipment such as a robot and the like or a server.
In order to accurately generate a target speech synthesis model supporting multiple languages, a sample set (denoted as a first sample set) for training the model needs to be acquired in advance. The first sample set should contain voice samples corresponding to at least two different languages, for example English, Chinese and Korean.
If enough voice information of the same speaker in different languages can be collected, for example 30,000 sentences or more than 30 hours, the collected voice information can be directly determined as voice samples. The original speech synthesis model can then be trained directly based on the voice samples of each language of that speaker to obtain a trained target speech synthesis model supporting multiple languages of the speaker. However, in practical application scenarios, collecting a large number of voice samples of the same speaker in multiple languages is difficult and costly, so enough voice samples for training a target speech synthesis model cannot be collected, and a target speech synthesis model supporting multiple languages of that speaker cannot be obtained in this way.
Thus, in the embodiment of the present invention, the voice samples corresponding to each language in the first sample set may come from different speakers. For example, the voice samples in the first sample set may come from speaker A, speaker B and speaker C and cover English, Chinese and Korean, where speaker A corresponds to the Chinese and English voice samples, speaker B corresponds to the English voice samples, and speaker C corresponds to the Korean voice samples; alternatively, each speaker may correspond to voice samples in a single language, for example speaker A to the Chinese voice samples, speaker B to the English voice samples, and speaker C to the Korean voice samples. The embodiment of the invention does not limit the specific implementation manner.
In the embodiment of the present invention, in order to train the target speech synthesis model to support speech synthesis in the plurality of languages (i.e., two or more languages) included in the first sample set, the first language and the second language may be any two of those languages. For example, if the languages included in the first sample set are Chinese, Korean and English, the first language and the second language may be Chinese and Korean, Korean and English, or Chinese and English.
After the first sample set for training the target speech synthesis model is obtained based on the above embodiment, a voice sample corresponding to any language (denoted as the first language) may be selected from the first sample set as the first voice sample and its text feature (denoted as the first text feature) obtained; a voice sample corresponding to any other language (denoted as the second language) may then be selected from the first sample set as the second voice sample and its text feature (denoted as the second text feature) obtained.
In some embodiments, the first language and/or the second language is a language specified among the plurality of languages included in the first sample set. For example, if the first sample set includes Chinese, Korean and English, the first language may be specified as Korean and any other language may serve as the second language; or the second language may be specified as Chinese and any other language may serve as the first language; or the first language may be specified as Chinese and the second language as English. Taking the case where both the first language and the second language are specified languages among the at least two languages included in the first sample set, after the first sample set is obtained based on the above embodiment, a voice sample corresponding to the specified first language may be taken from the first sample set as the first voice sample and its first text feature obtained, and a voice sample corresponding to the specified second language may be taken from the first sample set as the second voice sample and its second text feature obtained.
As yet another possible implementation, the first language and the second language may be determined from the languages included in the first sample set according to the application scenario. For example, for a given country or region, a large number of voice samples in the native language can be obtained, while voice samples in non-native languages are harder to collect. To train a target speech synthesis model supporting non-native languages of speakers in that country or region, the native language may be determined as the second language, and any other language in the first sample set as the first language. After the first sample set is obtained based on the above embodiment, a voice sample corresponding to the native language (denoted as the second language) of the country or region is taken from the first sample set as the second voice sample and its second text feature obtained, and a voice sample corresponding to any other language (denoted as the first language) is taken as the first voice sample and its first text feature obtained.
The text feature of each voice sample in the first sample set may be predetermined, or may be determined after the first voice sample and the second voice sample in the first sample set are acquired. In the implementation process, the method can be flexibly set according to actual requirements, and is not particularly limited.
To describe how the text features corresponding to the first and second voice samples are determined, take the first voice sample as an example: the text content corresponding to the first voice sample is obtained, either through a speech recognition model or through manual labeling. After the text content is acquired, text features are extracted from it so that the text is represented as a data structure that an algorithm can process, and these features are determined as the first text feature of the first voice sample. The text features may be extracted by text analysis algorithms such as syntactic or grammatical analysis, or determined manually.
In the embodiment of the invention, any text feature comprises: the pronunciation sequence, the part-of-speech and word segmentation information of each word included in the voice sample, the intonation of the voice sample, and the prosodic features corresponding to the pronunciation sequence. The pronunciation sequence may be determined from the language of the voice sample and a pronunciation dictionary such as that of Carnegie Mellon University (CMU). For example, if the language of the voice sample is Chinese, the pronunciation sequence may be a sequence of initials and finals; if the language is English, the pronunciation sequence may be a syllable sequence.
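For illustration only, such a text feature can be organized as a simple data structure. The following Python sketch is an assumed layout; the field names and the example values are hypothetical and are not prescribed by the embodiment.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TextFeature:
    """Assumed container for the text feature of one voice sample."""
    pronunciation_sequence: List[str]   # e.g. initials/finals for Chinese, syllables for English
    word_segments: List[str]            # word segmentation information
    pos_tags: List[str]                 # part-of-speech of each word
    intonation: str                     # sentence-level intonation
    prosody: List[int] = field(default_factory=list)  # prosodic labels aligned to the pronunciation sequence

# Hypothetical example for the Chinese greeting "你好"
example = TextFeature(
    pronunciation_sequence=["n", "i3", "h", "ao3"],
    word_segments=["你好"],
    pos_tags=["interjection"],
    intonation="declarative",
    prosody=[0, 0, 0, 1],
)
```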
S102: and determining a first acoustic feature corresponding to the first text feature and a second acoustic feature of the second voice sample, wherein the first acoustic feature and the second acoustic feature are acoustic features of the same speaker.
Because the input of the speech synthesis model is a text feature and its output is the acoustic feature corresponding to that text feature, the acoustic feature of each voice sample in the first sample set is also acquired in order to evaluate the accuracy of the model's output and, in turn, the accuracy of the model itself. The original speech synthesis model is then trained directly based on the acoustic feature and first text feature of the first voice sample and the acoustic feature and second text feature of the second voice sample to obtain the target speech synthesis model, so that the target speech synthesis model supports speech synthesis in the first language and the second language for each speaker in the first sample set.
The original speech synthesis model may be a deep learning model, such as the Tacotron model.
For convenience, the training process is explained using the first voice sample. The acoustic feature of the first voice sample, the first text feature, the object identification information of the speaker and the language identification information of the first language are input into the original speech synthesis model. Based on these inputs, the original speech synthesis model outputs an acoustic feature that represents its prediction of the acoustic feature of audio data in which the speaker of the first voice sample utters the content of the first text feature. A loss value is determined from this predicted acoustic feature and the first acoustic feature, and the original speech synthesis model is trained according to the loss value, i.e., the parameter values of the parameters in the original speech synthesis model are adjusted.
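As a minimal sketch of such a training step, the snippet below assumes a PyTorch model whose forward pass takes the text feature, speaker identifier and language identifier and returns a predicted acoustic feature; the model interface and the use of an L1 loss are assumptions, not the prescribed implementation of the embodiment.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, text_feat, speaker_id, language_id, target_acoustic):
    """One assumed training step: predict acoustic features and fit them to the target."""
    model.train()
    optimizer.zero_grad()
    # Forward pass: the model is assumed to accept (text feature, speaker id, language id).
    predicted_acoustic = model(text_feat, speaker_id, language_id)
    # Loss between predicted and reference acoustic features (L1 is an assumption).
    loss = F.l1_loss(predicted_acoustic, target_acoustic)
    loss.backward()
    optimizer.step()  # adjust the parameter values of the model
    return loss.item()
```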
It should be noted that the processing procedure of the second voice sample is the same as that of the first voice sample, and will not be described herein.
After the trained speech synthesis model is obtained, it has learned acoustic characteristics common to audio data produced by different speakers in different languages, such as contextual pronunciation rules of text. To further improve the accuracy of the speech synthesis model, transfer learning is used to fine-tune some or all of the parameters in the speech synthesis model, so as to obtain the target speech synthesis model.
Specifically, how to further train the speech synthesis model by adopting the idea of transfer learning belongs to the prior art, and the specific process is not described here in detail.
In order to facilitate training of the target speech synthesis model and improve the accuracy of the target speech synthesis model, in the embodiment of the present invention, acoustic features corresponding to the first text feature (denoted as first acoustic features) and acoustic features of the second speech sample (denoted as second acoustic features) are acquired, where the first acoustic features and the second acoustic features are acoustic features of the same speaker (i.e., the speaker of the second speech sample).
The second acoustic feature is obtained by an acoustic feature extraction algorithm or by an acoustic feature extraction model.
In one possible embodiment, if the first sample set contains voice samples in the first language uttered by the speaker of the second voice sample, at least one such first-language voice sample of that speaker may be identified in the first sample set, and its acoustic features determined as the first acoustic feature. For example, the acoustic features of such a voice sample may be obtained by an acoustic feature extraction algorithm or by an acoustic feature extraction model.
In the embodiment of the present invention, the first acoustic feature may be any one of the Mel frequency cepstral coefficient (MFCC), the Bark frequency cepstral coefficient (BFCC), the inverse Mel frequency cepstral coefficient (IMFCC), the Gammatone frequency cepstral coefficient (GFCC), the linear prediction cepstral coefficients (LPCC), and the like.
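As an illustration of extracting one of these acoustic feature types, the sketch below uses librosa to compute MFCCs from a waveform; the file path and frame settings are hypothetical, and other variants such as BFCC would require a different extractor.

```python
import librosa

# Hypothetical sample path; any mono speech recording works.
waveform, sample_rate = librosa.load("speaker_b_sample.wav", sr=16000)

# 13-dimensional MFCCs per frame (frame length and hop are librosa defaults).
mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=13)
print(mfcc.shape)  # (13, number_of_frames)
```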
In one possible embodiment, the first acoustic feature is BFCC.
Wherein the type of the second acoustic feature is the same as the type of the first acoustic feature described above.
S103: training an original speech synthesis model based on the first text feature, the corresponding first acoustic feature, the second text feature and the corresponding second acoustic feature to obtain a target speech synthesis model, wherein the target speech synthesis model supports the speech synthesis of the first language and the second language of each speaker in the first sample set.
In the related art, in order to support synthesis of multiple languages of the same speaker, an original speech synthesis model is trained separately for each language to be synthesized, based on the acoustic features of voice samples of that speaker and the corresponding text features, yielding a target speech synthesis model for that speaker and language. Such a model cannot handle multiple languages of the speaker at once: when several languages of a speaker need to be synthesized, a separate target speech synthesis model must be trained and deployed in advance for each language, which consumes a large amount of storage space and training resources.
In the embodiment of the invention, for any two languages of each speaker corresponding to the first sample set, after the first text feature (corresponding to the first language) and the second text feature (corresponding to the second language) of the speaker are obtained, the first text feature with its corresponding first acoustic feature and the second text feature with its corresponding second acoustic feature are input, separately or simultaneously, into the original speech synthesis model. Through the original speech synthesis model, the acoustic feature corresponding to the first text feature predicted by the model (denoted as the third acoustic feature), i.e., the acoustic feature of the first language as uttered by the speaker of the second voice sample, and the acoustic feature corresponding to the second text feature (denoted as the fourth acoustic feature), i.e., the acoustic feature of the second language as uttered by that speaker, can be obtained. A loss value is then determined from the third acoustic feature and the first acoustic feature corresponding to the first text feature, and from the fourth acoustic feature and the second acoustic feature corresponding to the second text feature. The original speech synthesis model is trained according to the loss value to obtain the target speech synthesis model, which supports speech synthesis in the first language and the second language for each speaker in the first sample set.
The method for determining the first language and the second language has been described in the above embodiments, and the repetition is not repeated.
Since the first sample set includes a plurality of voice samples, each voice sample is in turn determined as the first voice sample, another voice sample in a different language is taken from the first sample set as the second voice sample, and the above operations are performed on the pair. When a preset convergence condition (denoted as the first convergence condition) is satisfied, training of the target speech synthesis model is complete.
The first convergence condition may be that the loss value determined for each voice sample is smaller than a preconfigured loss threshold (denoted as the first loss threshold), that the determined loss value keeps decreasing and then levels off, that the number of iterations of training the original speech synthesis model reaches a set maximum, or the like. This may be set flexibly in practice and is not specifically limited here.
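A minimal sketch of such a stopping check is shown below, assuming the per-sample losses of one pass are collected into a list; the threshold and plateau values are illustrative only.

```python
def converged(epoch_losses, prev_mean, iteration, loss_threshold=0.01,
              plateau_delta=1e-4, max_iterations=200_000):
    """Assumed first convergence condition: small losses, a plateau, or an iteration cap."""
    mean_loss = sum(epoch_losses) / len(epoch_losses)
    below_threshold = all(loss < loss_threshold for loss in epoch_losses)
    plateaued = prev_mean is not None and abs(prev_mean - mean_loss) < plateau_delta
    return below_threshold or plateaued or iteration >= max_iterations
```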
As a possible implementation manner, when training the target speech synthesis model, the speech samples in the first sample set may be divided into a training sample and a test sample, the original speech synthesis model may be trained based on the training sample, and then the reliability of the trained speech synthesis model may be verified based on the test sample.
In the embodiment of the invention, after the first text feature of a first voice sample in the first sample set and the second text feature of a second voice sample in a second language are obtained, a first acoustic feature corresponding to the first text feature and a second acoustic feature of the second voice sample are determined. The first acoustic feature and the second acoustic feature are acoustic features of the same speaker, and the first language is different from the second language, so acoustic features of the same speaker in different languages are obtained. This allows the parameter values of the parameters contained in the speech synthesis model to be adjusted accurately according to the first-language acoustic feature and the first-language acoustic feature output by the model, and the second-language acoustic feature and the second-language acoustic feature output by the model, which improves the accuracy of the target speech synthesis model, so that speech synthesis in the first language and the second language can be realized for each speaker in the first sample set through the target speech synthesis model.
Example 2: on the basis of the above embodiment, training the original speech synthesis model based on the first text feature and the corresponding first acoustic feature, the second text feature and the corresponding second acoustic feature, further includes:
Respectively acquiring a first characteristic vector corresponding to a first data set and a second characteristic vector corresponding to a second data set through an encoding layer in an original voice synthesis model; the first data set comprises a first text feature, language identification information of a first language and object identification information of a speaker of a second voice sample, and the second data set comprises a second text feature, language identification information of a second language and object identification information;
Respectively inputting the first feature vector and the second feature vector into a decoding layer in an original speech synthesis model to obtain a third acoustic feature corresponding to the first text feature and a fourth acoustic feature corresponding to the second text feature;
and adjusting parameter values of parameters in the original speech synthesis model based on the first acoustic feature and the third acoustic feature corresponding to the first text feature and the second acoustic feature and the fourth acoustic feature corresponding to the second text feature to obtain the target speech synthesis model.
In order to ensure that the trained model can synthesize multiple languages of multiple speakers, in the embodiment of the invention different object identification information is preset for different speakers (for example, the object identification information of speaker A is "A" and that of speaker B is "B"), and different language identification information is preset for different languages (for example, the language identification information corresponding to Chinese is "0" and that corresponding to English is "1"), so that the speech synthesis model can conveniently perform speech synthesis of a given language for a given speaker. The data set consisting of the language identification information of the first language, the object identification information of the speaker of the second voice sample, the first text feature and the first acoustic feature (denoted as the first data set), and the data set consisting of the language identification information of the second language, the same object identification information, the second text feature and the second acoustic feature (denoted as the second data set), are input, separately or simultaneously, into the original speech synthesis model to train it.
The language identification information corresponding to any language may be numbers, character strings, etc., or may be in other forms, so long as the expression form capable of uniquely identifying the language can be used in the embodiment of the present invention. In the implementation process, the method can be flexibly set according to actual requirements.
The object identification information corresponding to any speaker may be a number, a character string, or the like, or may be other forms, so long as the representation form capable of uniquely identifying the speaker can be used in the embodiment of the present invention. In the implementation process, the method can be flexibly set according to actual requirements.
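As a small illustration of how the identification information and data sets above might be organized, the sketch below uses hypothetical lookup tables; the concrete identifier values and field names are assumptions and are not defined by the embodiment.

```python
# Hypothetical identifier tables; any unique value per speaker/language would do.
speaker_ids = {"speaker_a": "A", "speaker_b": "B", "speaker_c": "C"}
language_ids = {"chinese": "0", "english": "1", "korean": "2"}

def build_dataset_entry(text_feature, acoustic_feature, speaker, language):
    """Assemble one (first or second) data set entry as described above."""
    return {
        "text_feature": text_feature,
        "acoustic_feature": acoustic_feature,
        "object_id": speaker_ids[speaker],
        "language_id": language_ids[language],
    }
```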
In order to realize speech synthesis, the original speech synthesis model at least comprises an encoding layer and a decoding layer, wherein the encoding layer is connected with the decoding layer, the encoding layer is used for extracting characteristics of an input data set, and the decoding layer is used for further processing the characteristic vector extracted by the encoding layer to acquire acoustic characteristics corresponding to the input data set.
Taking the case where the first data set and the second data set are input into the original speech synthesis model simultaneously, the feature vector corresponding to the first data set (denoted as the first feature vector) and the feature vector corresponding to the second data set (denoted as the second feature vector) are obtained through the encoding layer of the original speech synthesis model. The first feature vector is a higher-dimensional, more abstract feature vector extracted from the first text feature, and the second feature vector likewise from the second text feature. The first feature vector and the second feature vector are then input to the decoding layer of the original speech synthesis model: the decoding layer processes the first feature vector to obtain the acoustic feature corresponding to the first text feature (denoted as the third acoustic feature), and processes the second feature vector to obtain the acoustic feature corresponding to the second text feature (denoted as the fourth acoustic feature).
In one possible implementation, the element values of different acoustic features of different languages acquired in practice fall in different numerical ranges, which makes predicting the element values of some elements difficult and increases the error between predicted and true values. Therefore, in order to improve model precision, a normalization function is preset, such as the min-max (minmax) algorithm or a mean-variance normalization algorithm, and normalization is performed on the first acoustic feature and the second acoustic feature respectively, so that the element value of each element in the first and second acoustic features lies in [0, 1], and the error corresponding to each predicted element value also lies in [0, 1].
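A minimal sketch of the min-max normalization mentioned above, applied per feature dimension with NumPy; the epsilon guard against division by zero is an added assumption.

```python
import numpy as np

def minmax_normalize(acoustic_feature, eps=1e-8):
    """Scale every element of an acoustic feature matrix into [0, 1] per dimension."""
    feat = np.asarray(acoustic_feature, dtype=np.float32)
    min_val = feat.min(axis=-1, keepdims=True)
    max_val = feat.max(axis=-1, keepdims=True)
    return (feat - min_val) / (max_val - min_val + eps)
```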
After the third acoustic feature is obtained based on the above embodiment, determining a sub-loss value (denoted as a first sub-loss value) according to the first acoustic feature and the third acoustic feature; when the fourth acoustic feature is acquired, a sub-loss value (denoted as a second sub-loss value) is determined based on the second acoustic feature and the fourth acoustic feature. And determining a loss value according to the first sub-loss value and the second sub-loss value, and adjusting the parameter value of the parameter in the original speech synthesis model according to the loss value.
The sum of the first sub-loss value and the second sub-loss value may be directly determined as the loss value; the loss value may be determined from the first sub-loss value and its weight and the second sub-loss value and its weight; or the sum of the two sub-loss values may be processed by a preset mathematical function and the result determined as the loss value. This may be set flexibly in practice and is not specifically limited here.
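One possible way to combine the two sub-loss values is sketched below; the L1 sub-losses and the weights are assumptions, with weights of 1 reducing to the plain sum described above.

```python
import torch.nn.functional as F

def joint_loss(first_acoustic, third_acoustic, second_acoustic, fourth_acoustic,
               w1=1.0, w2=1.0):
    """Assumed joint loss: weighted sum of the two sub-loss values."""
    first_sub_loss = F.l1_loss(third_acoustic, first_acoustic)     # first-language branch
    second_sub_loss = F.l1_loss(fourth_acoustic, second_acoustic)  # second-language branch
    return w1 * first_sub_loss + w2 * second_sub_loss
```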
Note that the types of the third acoustic feature and the fourth acoustic feature are the same as the types of the first acoustic feature in the above-described embodiment.
Example 3: based on the above embodiments, in the embodiment of the present invention, the method for obtaining acoustic features of a large number of voice samples of different languages of different speakers further includes:
In the first mode, because collecting a large number of voice samples of different languages from different speakers is very difficult, when the first sample set does not contain a voice sample in the first language uttered by the speaker of the second voice sample, the first acoustic feature can be determined from at least one first voice sample in the first sample set by a voice conversion method. For example, the first acoustic feature is obtained, based on the linguistic content features of at least one first voice sample in the first sample set, through a language conversion model that supports the first language for the speaker of the second voice sample.
In the second mode, when the first sample set does include first voice samples in the first language uttered by the speaker of the second voice sample, in order to obtain a large number of first acoustic features in the first language for that speaker, the acoustic features of those first voice samples are determined as first acoustic features, and additional first acoustic features are also determined, via the voice conversion method, from first-language voice samples in the first sample set that were not uttered by the speaker of the second voice sample.
As a possible implementation, for each speaker, a language conversion model is trained in advance according to each language that the speaker is to support. For example, a speech conversion model supporting a single language of the speaker may be trained, or a speech conversion model supporting multiple languages of the speaker may be trained.
When the first acoustic feature of a certain language of the speaker needs to be acquired later, it can be obtained through a pre-trained language conversion model supporting that language of the speaker, for example a language conversion model supporting only that language of the speaker or one supporting multiple languages of the speaker, based on the language content features of at least one collected first voice sample in that language.
In the implementation process, through a language conversion model, based on the input language content characteristics, the first acoustic characteristics of the speaker of the second voice sample corresponding to the first language can be obtained.
As another possible embodiment, a speech conversion model supporting the language of a plurality of speakers may be trained in advance for different languages.
As another possible implementation, since training too many language conversion models consumes a large amount of resources and deploying each of them is costly, in the embodiment of the present invention a single language conversion model supporting multiple languages of multiple speakers is trained in advance. The language conversion model includes at least a character vector layer, an encoding layer and a decoding layer, where the character vector layer is connected to the encoding layer and the encoding layer is connected to the decoding layer. Because the language conversion model supports multiple languages of multiple speakers, the object identification information of the speaker and the language identification information of the language are also input in order to control which speaker and language the model converts; the character vector layer converts the input object identification information and language identification information into character vectors that the model can process, which facilitates the processing of the other network layers. The encoding layer combines the language character vector from the character vector layer with the language content features and extracts a feature vector. The decoding layer combines the object character vector from the character vector layer with the feature vector from the encoding layer and obtains the acoustic feature.
In one possible implementation, after the first voice sample is acquired, the language content features of the first voice sample, the object identification information of the speaker of the second voice sample, and the language identification information of the first language are obtained. A data set containing these items (denoted as the third data set) is input into the pre-trained language conversion model. Based on the language identification information of the first language and the object identification information of the speaker of the second voice sample included in the third data set, the character vector layer of the language conversion model obtains the language character vector corresponding to the language identification information and the object character vector corresponding to the object identification information.
Further, after the language character vector and the object character vector are obtained, the character vector layer inputs the language content feature, the language character vector and the object character vector of at least one first voice sample included in the third data set to a coding layer in the language conversion model. The feature vector (denoted as a third feature vector) is obtained by the encoding layer of the language conversion model based on the language character vector and the language content feature of the at least one first speech sample included in the third data set. Wherein the third feature vector comprises higher-dimensional, more abstract features extracted from the language content features.
Further, after the third feature vector is obtained, the encoding layer inputs the third feature vector and the object character vector to a decoding layer of the language conversion model. Through the decoding layer, a first acoustic feature is acquired and output based on the object character vector and the third feature vector.
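The described character-vector/encoding/decoding arrangement can be sketched compactly in PyTorch, under the assumption that the language content features and acoustic features are fixed-dimension frame sequences; all layer types and sizes below are hypothetical choices, not the embodiment's prescribed architecture.

```python
import torch
import torch.nn as nn

class LanguageConversionModel(nn.Module):
    """Assumed structure: character vector layer, encoding layer, decoding layer."""

    def __init__(self, n_speakers=10, n_languages=3, content_dim=256,
                 embed_dim=64, hidden_dim=256, acoustic_dim=80):
        super().__init__()
        # Character vector layer: turns identification information into vectors.
        self.speaker_embedding = nn.Embedding(n_speakers, embed_dim)
        self.language_embedding = nn.Embedding(n_languages, embed_dim)
        # Encoding layer: combines the language character vector with content features.
        self.encoder = nn.GRU(content_dim + embed_dim, hidden_dim, batch_first=True)
        # Decoding layer: combines the object character vector with the encoded feature vector.
        self.decoder = nn.GRU(hidden_dim + embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, acoustic_dim)

    def forward(self, content_feats, speaker_id, language_id):
        # content_feats: (batch, frames, content_dim)
        frames = content_feats.size(1)
        lang_vec = self.language_embedding(language_id).unsqueeze(1).expand(-1, frames, -1)
        spk_vec = self.speaker_embedding(speaker_id).unsqueeze(1).expand(-1, frames, -1)
        encoded, _ = self.encoder(torch.cat([content_feats, lang_vec], dim=-1))
        decoded, _ = self.decoder(torch.cat([encoded, spk_vec], dim=-1))
        return self.out(decoded)  # predicted acoustic features, (batch, frames, acoustic_dim)
```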
Because a language conversion model supporting the first language for the speaker of the second voice sample is trained in advance, the first acoustic feature corresponding to the first text feature can be obtained from the language content features of at least one first voice sample even when the first sample set contains no first-language samples from that speaker. This effectively reduces the difficulty of acquiring voice samples of multiple languages from the same speaker and improves the accuracy of the target speech synthesis model.
In order to reduce the difficulty of collecting speech samples of multiple languages of the same speaker, in the embodiment of the present invention, the language conversion model is trained by:
Acquiring a second sample set comprising voice samples of at least two languages, each voice sample corresponding to a fourth data set; the fourth data set comprises language content characteristics of the voice sample, object identification information of a speaker of the voice sample, language identification information of the voice sample and fifth acoustic characteristics of the voice sample; a second sample set of voice samples of a speaker including a second voice sample, at least two languages including a first language and a second language;
aiming at the voice sample of each language, acquiring a sixth acoustic feature corresponding to the voice sample based on a fourth data set corresponding to the voice sample through an original voice conversion model; training the original language conversion model based on the fifth acoustic feature corresponding to the voice sample and the sixth acoustic feature corresponding to the voice sample, so that the obtained language conversion model can perform tone color conversion of any two languages of each speaker in the second sample set.
In order to train a language conversion model supporting multiple speakers and multiple languages, in an embodiment of the present invention a sample set for training the language conversion model (denoted as the second sample set) is collected in advance. The second sample set contains voice samples in at least two languages; it may be identical to the first sample set or differ from it partially or entirely, but the speakers in the second sample set at least include the speakers in the first sample set, and the languages contained in the second sample set at least include those contained in the first sample set. For example, if the first sample set includes a Chinese voice sample of speaker A and an English voice sample of speaker B, the second sample set also includes a Chinese voice sample of speaker A and an English voice sample of speaker B.
For any voice sample in the second sample set, a data set (denoted as a fourth data set) is corresponding, and the fourth data set includes a language content feature of the voice sample, object identification information of a speaker of the voice sample, language identification information of the voice sample, and an acoustic feature (denoted as a fifth acoustic feature) of the voice sample.
The fifth acoustic features may be obtained by an acoustic feature extraction algorithm or by an acoustic feature extraction model.
In a possible implementation manner, in order to reduce the error of the acoustic feature predicted by the language conversion model, a normalization function, such as a minimum maximum minmax algorithm, a mean variance normalization algorithm, and the like, is preset, and normalization processing is performed on the fifth acoustic feature, so that the element value of each element in the fifth acoustic feature can be between [0,1], and further the error corresponding to the element value of each element in the acoustic feature predicted by the language conversion model can be between [0, 1].
In the embodiment of the invention, in order to improve the robustness of the acquired language conversion model, the acquired voice information of the same speaker can be spliced, so that the number of voice samples in the second sample set is multiplied, and the accuracy and the robustness of the language conversion model are further improved.
In one possible implementation, the collected voice samples may be determined as original voice samples, and for some or all of the languages, at least two original voice samples of the same speaker are spliced to obtain spliced voice samples. Each original voice sample and each spliced voice sample is then determined as a voice sample in the second sample set.
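A minimal sketch of the splicing step, assuming each original voice sample is already loaded as a waveform array at the same sampling rate; pairing every two samples of the same speaker is an illustrative choice rather than a requirement.

```python
import itertools
import numpy as np

def splice_samples(original_waveforms):
    """Concatenate pairs of original samples from one speaker into spliced samples."""
    spliced = []
    for first, second in itertools.combinations(original_waveforms, 2):
        spliced.append(np.concatenate([first, second]))
    return spliced

# Hypothetical usage: three recordings from speaker A yield three spliced samples.
speaker_a_waves = [np.random.randn(16000), np.random.randn(24000), np.random.randn(8000)]
extra_second_set_samples = splice_samples(speaker_a_waves)
```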
In the embodiment of the invention, the collected voice sample can be collected in the working environment of the intelligent device, or can be collected from the professional voice sample recording environment, namely, the voice sample comprises voice information collected from the working environment of the intelligent device, and/or the voice sample comprises voice information collected from the professional voice sample recording environment.
For the voice samples of each language in the second sample set, the fourth data set corresponding to the voice sample is input into the original language conversion model. The original language conversion model performs the corresponding processing based on the input fourth data set and obtains the acoustic feature corresponding to the voice sample (denoted as the sixth acoustic feature).
The original language conversion model may be a deep learning model, such as the Tacotron model.
In the implementation, the character vector layer of the original language conversion model obtains, based on the language identification information and object identification information included in the fourth data set, the language character vector corresponding to the language identification information and the object character vector corresponding to the object identification information. The character vector layer then passes the language character vector to the encoding layer of the original language conversion model and the object character vector to its decoding layer. The encoding layer obtains a feature vector (denoted as the fourth feature vector) based on the language character vector and the language content features of the voice sample included in the fourth data set; the fourth feature vector contains higher-dimensional, more abstract features extracted from the language content features. The encoding layer then passes the fourth feature vector to the decoding layer, which obtains and outputs the sixth acoustic feature based on the object character vector and the fourth feature vector.
A loss value is determined according to the fifth acoustic feature and the sixth acoustic feature corresponding to the voice sample, and the original language conversion model is trained by adjusting the parameter values of its parameters, yielding a language conversion model that can perform timbre conversion between any two languages of each speaker in the second sample set.
Since the second sample set includes a plurality of voice samples, the above-described operation is performed for each voice sample. When the preset convergence condition (noted as a second convergence condition) is satisfied, the language conversion model training is completed.
The preset second convergence condition may be that the loss value corresponding to each voice sample in the second sample set is smaller than a preset loss threshold (denoted as the second loss threshold), that the determined loss value keeps decreasing and then levels off, that the number of iterations of training the original language conversion model reaches a set maximum, or the like. This may be set flexibly in practice and is not specifically limited here.
As a possible implementation manner, when the language conversion model is trained, the speech samples in the second sample set can be divided into training samples and test samples, the original language conversion model is trained based on the training samples, and then the reliability degree of the trained language conversion model is verified based on the test samples.
Wherein the type of the fifth acoustic feature and the sixth acoustic feature is the same as the type of the first acoustic feature in the above-described embodiment.
In one possible embodiment, the fifth acoustic feature and the sixth acoustic feature are BFCC.
In one possible implementation, in order to conveniently obtain the language content features of the voice samples in the second sample set for subsequent training of the language conversion model, in the embodiment of the present invention the language content features of a voice sample are obtained as follows: speech recognition processing is performed on the voice sample based on acoustic features of a preset type to acquire the language content features.

Specifically, the preset-type acoustic features of the voice sample may be processed by a pre-trained speech recognition model, and the features output by the network layer used for extracting language content features (PPG) in the speech recognition model are determined to be the language content features. In some embodiments, the network layer used for extracting the PPG may be the last network layer connected to the output layer in the speech recognition model, and the implicit features output by that network layer may be determined as the language content features.
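As a hedged illustration of taking the output of the layer just before the recognition output layer as the language content feature (PPG), the sketch below uses a toy recognition network; the architecture and dimensions are assumptions, not the pre-trained model referred to above.

```python
# Toy sketch: the PPG is taken as the output of the layer preceding the ASR output layer.
import torch
import torch.nn as nn

class TinyRecognizer(nn.Module):
    def __init__(self, mfcc_dim=39, hidden=256, n_phonemes=100):
        super().__init__()
        self.backbone = nn.GRU(mfcc_dim, hidden, batch_first=True)
        self.bottleneck = nn.Linear(hidden, hidden)   # layer whose output is used as the PPG
        self.output = nn.Linear(hidden, n_phonemes)   # recognition output layer

    def forward(self, mfcc, return_ppg=False):
        h, _ = self.backbone(mfcc)                    # mfcc: (batch, frames, mfcc_dim)
        ppg = self.bottleneck(h)
        if return_ppg:
            return ppg                                # speaker-independent language content feature
        return self.output(ppg)                       # phoneme scores used for recognition
```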
The language content features are the content features of the pronunciation content of the voice sample that are independent of the speaker.

The preset type of acoustic feature may be the same as or different from the type of the fifth acoustic feature in the above embodiment; for example, the fifth acoustic feature is BFCC and the preset type of acoustic feature is MFCC, or the fifth acoustic feature is BFCC and the preset type of acoustic feature is also BFCC, and so on.
Example 4: the following describes a training method of a speech synthesis model according to an embodiment of the present invention in detail through specific embodiments.
Fig. 2 is a schematic diagram of a training flow of a specific speech synthesis model according to an embodiment of the present invention, where the flow includes:
S201: The first server trains a language conversion model.
Specifically, the training process of training the language conversion model has been described in the above embodiments, and the repetition is not repeated.
The language conversion model is generally trained offline: the original language conversion model is trained in advance by the first server using the voice samples in the second sample set to obtain the trained language conversion model. The language conversion model trained based on the above embodiment is then stored in the electronic device for subsequent training of the speech synthesis model.
The electronic device for training the language conversion model in the embodiment of the present invention may be the same as or different from the electronic device for training the speech synthesis model in the above embodiment, and is not specifically limited herein, i.e., the first server may be the same as or different from the second server.
S202: the second server obtains a first text feature of a first voice sample in a first language in the first sample set and a second text feature corresponding to a second voice sample in a second language.
Wherein the first language is different from the second language.
S203: the second server determines a second acoustic feature of the second voice sample, and obtains a first acoustic feature corresponding to the first text feature based on the language content feature of at least one first voice sample in the first sample set through a language conversion model supporting a first language of a speaker of the second voice sample.
Wherein the first acoustic feature and the second acoustic feature are acoustic features of the same speaker.
In one possible implementation, the first acoustic feature and the second acoustic feature are BFCC.
Because the first acoustic feature acquired through the language conversion model supporting the first language of the speaker of the second voice sample, based on the language content feature of at least one first voice sample in the first sample set, is used directly in S204, the errors introduced in the prior art by first determining audio data corresponding to the first acoustic feature with a vocoder and then re-extracting acoustic features from that audio data are reduced.
S204: the second server trains the original voice synthesis model based on the first text feature, the corresponding first acoustic feature, the second text feature and the corresponding second acoustic feature to obtain a target voice synthesis model, and the target voice synthesis model supports voice synthesis of the first language and the second language of each speaker in the first sample set.
Example 5: The embodiment of the invention also provides a speech synthesis method, and Fig. 3 is a schematic diagram of a speech synthesis process provided by the embodiment of the invention, where the process includes:

S301: Acquiring the text feature of the text information to be synthesized.

S302: Acquiring, through a pre-trained multi-speaker, multi-language target speech synthesis model, the target acoustic feature corresponding to the text feature based on the text feature, the identification information of the target speaker, and the identification information of the target language of the text information; wherein the target speech synthesis model supports speech synthesis of the target language of the target speaker.
S303: audio data of the target speaker when the text information of the target language is generated is determined based on the target acoustic feature and the vocoder.
The voice synthesis method provided by the embodiment of the invention is applied to electronic equipment, and the electronic equipment can be intelligent equipment such as a robot and the like or can be a server.
The electronic device for performing the speech synthesis in the embodiment of the present invention may be the same as or different from the electronic device for performing the speech synthesis model training.
It should be noted that, the specific process of training the multi-speaker multi-language target speech synthesis model is described in the above embodiments 1-4, and the repetition is omitted.
When voice information that a target speaker needs to utter in a certain language is to be synthesized from text information in that language, i.e., when the text information needs Text To Speech (TTS) processing, the text information to be synthesized in that language is obtained and the text features of the text information are extracted.
Since the target speech synthesis model supporting multiple speakers and multiple languages has been trained in advance through the above-described embodiment, after the text feature of the text information to be synthesized is acquired, the text feature is input into the target speech synthesis model. The target speech synthesis model then performs the corresponding processing based on the text feature, the identification information of the target speaker, and the identification information of the target language of the text information, to acquire the acoustic feature corresponding to the text feature.
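A small inference helper along these lines is sketched below, assuming a synthesis model that accepts the text feature together with integer language and speaker identifiers; the signature is an assumption made for illustration and is not prescribed by the invention.

```python
# Illustrative inference call; the model signature (text feature, language ID, speaker ID)
# is an assumption and not prescribed by the invention.
import torch

def predict_acoustic(model, text_feature, language_id, speaker_id):
    model.eval()
    with torch.no_grad():
        # text_feature: (steps, text_dim); the IDs are plain integers
        return model(text_feature.unsqueeze(0),
                     torch.tensor([language_id]),
                     torch.tensor([speaker_id]))[0]
```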
In one possible implementation, since normalized acoustic features were used when training the target speech synthesis model, the acoustic features predicted by the model are also normalized, that is, the element value of each element in the acoustic features lies in [0,1]; in practical applications, however, the acoustic features of normal voice information are defined over a certain bit depth. Therefore, after the acoustic feature corresponding to the text feature is obtained based on the above embodiment, the acoustic feature needs to be inverse-normalized through a preset inverse normalization function, for example an inverse minmax algorithm or an inverse mean normalization algorithm, so that the element value of each element in the acoustic feature falls within a preset value range, i.e., within a certain bit-depth range, which helps make the subsequently obtained voice information more natural.
The inverse normalization may also be performed based on a regularized minmax file.
The specific process of inverse normalization belongs to the prior art and is not described in detail here.
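For illustration, a minimal inverse min-max normalization sketch is given below; the per-dimension minima and maxima are assumed to have been recorded when the training data were normalized.

```python
# Minimal inverse min-max normalization sketch; the stored statistics are assumptions.
import numpy as np

def denormalize_minmax(features, feat_min, feat_max):
    """Map values from [0, 1] back to the original acoustic-feature value range."""
    return features * (feat_max - feat_min) + feat_min

# usage: predicted features in [0, 1] restored to an assumed original range
pred = np.random.rand(120, 30)        # (frames, feature dimensions), model output
feat_min = np.full(30, -4.0)          # illustrative per-dimension statistics
feat_max = np.full(30, 4.0)
restored = denormalize_minmax(pred, feat_min, feat_max)
```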
The audio data of the target speaker uttering the text information in the target language is then determined based on the acquired acoustic features and a vocoder, such as a WORLD vocoder or a linear-prediction (LPC) vocoder.
Generating the audio data of the target speaker uttering the text information of the target language from the acoustic features and the vocoder belongs to the prior art and is not described here.
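Purely for illustration, the sketch below calls a WORLD vocoder through the pyworld Python binding; how the predicted acoustic features are converted into the WORLD parameters (f0, spectral envelope, aperiodicity) is left open here and is an assumption, and the placeholder values are not meaningful acoustics.

```python
# Hedged sketch of the vocoder step using pyworld (a WORLD vocoder binding).
# Deriving f0 / spectral envelope / aperiodicity from the model's acoustic features
# is assumed and not specified by the invention.
import numpy as np
import pyworld

def world_vocode(f0, spectral_envelope, aperiodicity, sample_rate=16000, frame_period=5.0):
    # WORLD reconstructs the waveform from its three parameter streams
    return pyworld.synthesize(np.ascontiguousarray(f0, dtype=np.float64),
                              np.ascontiguousarray(spectral_envelope, dtype=np.float64),
                              np.ascontiguousarray(aperiodicity, dtype=np.float64),
                              sample_rate, frame_period)

# illustrative call with placeholder parameters
frames, bins = 200, 513                    # 513 = fft_size // 2 + 1 for fft_size = 1024
f0 = np.full(frames, 120.0)                # flat 120 Hz pitch contour
sp = np.full((frames, bins), 1e-5)         # spectral envelope
ap = np.full((frames, bins), 0.5)          # aperiodicity
wav = world_vocode(f0, sp, ap)
```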
Example 6: Fig. 4 is a schematic diagram of a specific speech synthesis flow provided in an embodiment of the present invention. In this example, the electronic device for training the language conversion model, the electronic device for training the speech synthesis model, and the electronic device for performing speech synthesis are the same; the first language is English and the second language is Chinese. The flow mainly includes three parts: language conversion model training, speech synthesis model training, and speech synthesis. Each part is described below:
A first part: language conversion model training.
S401: Acquiring any English voice sample and any Chinese voice sample in the second sample set.
S402: The MFCC of the English voice sample and the MFCC of the Chinese voice sample are acquired, respectively.
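As an illustration of this step, the MFCCs could be extracted as in the sketch below; the librosa parameters and the file names are assumptions.

```python
# MFCC extraction sketch; parameter values and file names are assumptions.
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    audio, sr = librosa.load(wav_path, sr=sr)
    # returns a (frames, n_mfcc) matrix used as input to speech recognition
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc).T

mfcc_en = extract_mfcc("english_sample.wav")   # hypothetical file names
mfcc_zh = extract_mfcc("chinese_sample.wav")
```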
S403: and performing voice recognition processing on the English voice sample and the Chinese voice sample based on the MFCC of the English voice sample and the MFCC of the Chinese voice sample respectively to acquire a first language content feature corresponding to the MFCC of the English voice sample and a second language content feature corresponding to the MFCC of the Chinese voice sample.
S404: and respectively acquiring a sixth acoustic feature corresponding to a fourth data set corresponding to the English voice sample and a sixth acoustic feature corresponding to a fourth data set corresponding to the Chinese voice sample through an original language conversion model.
Wherein the fourth data set corresponding to the voice sample of any language in the second sample set includes: the method comprises the steps of determining a speech content feature of a speech sample, object identification information of a speaker of the speech sample, language identification information of the speech sample and a fifth acoustic feature of the speech sample.
And training the original language conversion model based on the fifth acoustic feature and the sixth acoustic feature corresponding to the English voice sample, and the fifth acoustic feature and the sixth acoustic feature corresponding to the Chinese voice sample to acquire a pre-trained language conversion model.
A second part: and training a speech synthesis model.
S405: a first voice sample of any English in the first sample set and a second voice sample of any Chinese in the first sample set are obtained.
S406: the MFCC and the first text feature of the first speech sample are obtained, and the second text feature of the second speech sample is obtained.
S407: and performing voice recognition processing on the first voice sample based on the MFCC of the first voice sample to acquire the third language content characteristics of the first voice sample.
S408: and acquiring a first acoustic feature corresponding to the first text feature based on the first content feature of the first voice sample through a pre-trained language conversion model supporting the first language of the speaker of the second voice sample.
The acquired first acoustic feature is an acoustic feature after normalization processing.
S409: a second acoustic feature of a second speech sample is acquired.
The second acoustic feature is also the acoustic feature after normalization processing.
S410: training the original speech synthesis model based on the first text feature, the corresponding first acoustic feature, the second text feature and the corresponding second acoustic feature to obtain a target speech synthesis model.
Wherein the target speech synthesis model supports speech synthesis in a first language and a second language for each speaker in the first sample set.
Third section: a speech synthesis processing flow based on a target speech synthesis model.
S411: and acquiring the text characteristics of the text information to be synthesized.
S412: and acquiring target acoustic features corresponding to the text features based on the text features, the identification information of the target speakers and the identification information of the target languages of the text information through a target voice synthesis model of multiple speakers and multiple languages trained in advance.
Wherein the target speech synthesis model supports speech synthesis of a target language of the target speaker.
S413: and performing inverse normalization processing on the acoustic features through a preset inverse normalization function.
S414: based on the acoustic features and the vocoder, audio data is determined when the target speaker is emitting text information in the target language.
Example 7: the embodiment of the invention provides a device for training a speech synthesis model, and fig. 5 is a schematic structural diagram of the device for training a speech synthesis model, which comprises:
An obtaining unit 51, configured to obtain a first text feature of a first voice sample in a first language in the first sample set and a second text feature corresponding to a second voice sample in a second language, where the first language is different from the second language.
The determining unit 52 is configured to determine a first acoustic feature corresponding to the first text feature and a second acoustic feature of the second speech sample, where the first acoustic feature and the second acoustic feature are acoustic features of the same speaker.
The training unit 53 is configured to train the original speech synthesis model based on the first text feature and the corresponding first acoustic feature, and the second text feature and the corresponding second acoustic feature, to obtain a target speech synthesis model, where the target speech synthesis model supports speech synthesis of the first language and the second language for each speaker in the first sample set.

In one possible embodiment, the training unit 53 is specifically configured to:
Respectively acquiring a first characteristic vector corresponding to a first data set and a second characteristic vector corresponding to a second data set through an encoding layer in an original voice synthesis model; the first data set comprises a first text feature, language identification information of a first language and object identification information of a speaker of a second voice sample, and the second data set comprises a second text feature, language identification information of a second language and object identification information; respectively inputting the first feature vector and the second feature vector into a decoding layer in an original speech synthesis model to obtain a third acoustic feature corresponding to the first text feature and a fourth acoustic feature corresponding to the second text feature; and adjusting parameter values of parameters in the original speech synthesis model based on the first acoustic feature and the third acoustic feature corresponding to the first text feature and the second acoustic feature and the fourth acoustic feature corresponding to the second text feature to obtain the target speech synthesis model.
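A non-limiting Python sketch of this encoder/decoder training step of the speech synthesis model is given below; the module shapes, the L1 loss, and the assumption that the target acoustic features are already aligned to the text-feature steps (no attention or duration modelling) are all simplifications for illustration.

```python
# Illustrative sketch of the bilingual multi-speaker synthesis training step;
# shapes, loss and the text-to-frame alignment are simplifying assumptions.
import torch
import torch.nn as nn

class SpeechSynthesisModel(nn.Module):
    def __init__(self, n_languages, n_speakers, text_dim=128, emb_dim=64,
                 enc_dim=256, acoustic_dim=30):
        super().__init__()
        self.lang_emb = nn.Embedding(n_languages, emb_dim)   # language identification information
        self.spk_emb = nn.Embedding(n_speakers, emb_dim)     # object identification information
        self.encoder = nn.GRU(text_dim + emb_dim, enc_dim, batch_first=True)
        self.decoder = nn.GRU(enc_dim + emb_dim, enc_dim, batch_first=True)
        self.proj = nn.Linear(enc_dim, acoustic_dim)

    def forward(self, text_feat, lang_id, spk_id):
        steps = text_feat.size(1)
        lang = self.lang_emb(lang_id).unsqueeze(1).expand(-1, steps, -1)
        spk = self.spk_emb(spk_id).unsqueeze(1).expand(-1, steps, -1)
        enc_out, _ = self.encoder(torch.cat([text_feat, lang], dim=-1))   # first/second feature vector
        dec_out, _ = self.decoder(torch.cat([enc_out, spk], dim=-1))
        return self.proj(dec_out)                                          # third/fourth acoustic feature

def training_step(model, optimizer, first_set, second_set):
    criterion = nn.L1Loss()
    third = model(first_set["text"], first_set["lang_id"], first_set["spk_id"])
    fourth = model(second_set["text"], second_set["lang_id"], second_set["spk_id"])
    # adjust parameters using the gap between predicted and reference acoustic features
    loss = criterion(third, first_set["first_acoustic"]) + \
           criterion(fourth, second_set["second_acoustic"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```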
In a possible embodiment, the determining unit 52 is specifically configured to obtain the first acoustic feature by:
if the first sample set does not include a voice sample of the first language uttered by the speaker of the second voice sample, acquiring the first acoustic feature based on the language content feature of at least one first voice sample in the first sample set through the language conversion model supporting the first language for the speaker of the second voice sample; and/or if the first sample set includes at least one voice sample of the first language uttered by the speaker of the second voice sample, determining the first acoustic feature according to the acoustic feature of the at least one voice sample in the first sample set.
In a possible embodiment, the determining unit 52 is specifically configured to:
Acquiring a language character vector corresponding to the first identification information and an object character vector corresponding to the object identification information based on the language identification information of the first language and the object identification information of the speaker of the second voice sample, which are included in the third data set, through a character vector layer of the language conversion model; acquiring a third feature vector based on the language character vector and the language content feature of at least one first voice sample included in the third data set through an encoding layer of the language conversion model; and acquiring the first acoustic feature based on the object character vector and the third feature vector through a decoding layer of the language conversion model.
In one possible implementation, the language conversion model is trained by:
Acquiring a second sample set comprising voice samples of at least two languages, each voice sample corresponding to a fourth data set; the fourth data set comprises language content characteristics of the voice sample, object identification information of a speaker of the voice sample, language identification information of the voice sample and fifth acoustic characteristics of the voice sample; a second sample set of voice samples of a speaker including a second voice sample, at least two languages including a first language and a second language;
for the voice sample of each language, acquiring a sixth acoustic feature corresponding to the voice sample based on the fourth data set corresponding to the voice sample through the original language conversion model; training the original language conversion model based on the fifth acoustic feature corresponding to the voice sample and the sixth acoustic feature corresponding to the voice sample, so that the obtained language conversion model can perform tone color conversion of any two languages for each speaker in the second sample set.
In one possible implementation, the linguistic content features of the speech samples are obtained by:
and carrying out voice recognition processing on the voice sample based on the acoustic characteristics of the preset type to acquire the language content characteristics.
Specifically, based on the acoustic features and the vocoder, audio data when the target speaker generates text information in the target language is determined.
Example 8: an embodiment of the present invention provides a speech synthesis apparatus based on a target speech synthesis model obtained by a speech synthesis model training method in any one of the foregoing embodiments 1 to 5, and fig. 6 is a schematic structural diagram of a speech synthesis apparatus provided in the embodiment of the present invention, where the apparatus includes:
An obtaining module 61, configured to obtain a text feature of the text information to be synthesized.
The processing module 62 is configured to obtain, through a target speech synthesis model of multiple speakers and multiple languages trained in advance, a target acoustic feature corresponding to the text feature based on the text feature, the identification information of the target speaker, and the identification information of the target language of the text information; wherein the target speech synthesis model supports speech synthesis of a target language of the target speaker.
The synthesis module 63 is configured to determine audio data when the target speaker generates text information in the target language based on the target acoustic feature and the vocoder.
Example 9: on the basis of the above embodiment, the embodiment of the present invention further provides an electronic device, as shown in fig. 7, including: the processor 71, the communication interface 72, the memory 73 and the communication bus 74, wherein the processor 71, the communication interface 72 and the memory 73 complete communication with each other through the communication bus 74;
The memory 73 has stored therein a computer program which, when executed by the processor 71, causes the processor 71 to perform the steps of:
Acquiring a first text feature of a first voice sample in a first language in a first sample set and a second text feature corresponding to a second voice sample in a second language, wherein the first language is different from the second language; determining a first acoustic feature corresponding to the first text feature and a second acoustic feature of the second voice sample, wherein the first acoustic feature and the second acoustic feature are acoustic features of the same speaker; training the original speech synthesis model based on the first text feature, the corresponding first acoustic feature, the second text feature and the corresponding second acoustic feature to obtain a target speech synthesis model, wherein the target speech synthesis model supports the speech synthesis of the first language and the second language of each speaker in the first sample set.
Since the principle of the above-mentioned electronic device for solving the problem is similar to that of the speech synthesis model training method, the implementation of the above-mentioned electronic device can be referred to embodiments 1-5 of the method, and the repetition is omitted.
Example 10: on the basis of the above embodiment, the embodiment of the present invention further provides an electronic device, as shown in fig. 8, including: the processor 81, the communication interface 82, the memory 83 and the communication bus 84, wherein the processor 81, the communication interface 82 and the memory 83 complete communication with each other through the communication bus 84;
the memory 83 has stored therein a computer program which, when executed by the processor 81, causes the processor 81 to perform the steps of:
Acquiring text characteristics of text information to be synthesized; acquiring target acoustic features corresponding to the text features based on the text features, the identification information of the target speakers and the identification information of the target language of the text information through a target voice synthesis model of multiple speakers and multiple languages trained in advance; the target voice synthesis model supports voice synthesis of target language of a target speaker; and determining audio data of the target speaker when the text information of the target language is generated based on the target acoustic feature and the vocoder.
Since the principle of the above-mentioned electronic device for solving the problem is similar to that of the speech synthesis method, the implementation of the above-mentioned electronic device can refer to embodiment 6 of the method, and the repetition is omitted.
The communication bus mentioned above for the electronic device may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in the figures, but this does not mean that there is only one bus or only one type of bus. The communication interface is used for communication between the electronic device and other devices described above. The memory may include a random access memory (RAM) or a non-volatile memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor. The processor may be a general-purpose processor, including a central processing unit, a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
Example 11: on the basis of the above embodiments, the embodiments of the present invention further provide a computer readable storage medium, in which a computer program executable by a processor is stored, which when executed on the processor causes the processor to implement the steps of:
Acquiring a first text feature of a first voice sample in a first language in a first sample set and a second text feature corresponding to a second voice sample in a second language, wherein the first language is different from the second language; determining a first acoustic feature corresponding to the first text feature and a second acoustic feature of the second voice sample, wherein the first acoustic feature and the second acoustic feature are acoustic features of the same speaker; training the original speech synthesis model based on the first text feature, the corresponding first acoustic feature, the second text feature and the corresponding second acoustic feature to obtain a target speech synthesis model, wherein the target speech synthesis model supports the speech synthesis of the first language and the second language of each speaker in the first sample set.
Since the principle of the above-mentioned computer readable storage medium for solving the problem is similar to that of the speech synthesis model training method, the implementation of the above-mentioned computer readable storage medium can be referred to in the implementation 1-5 of the method, and the repetition is omitted.
Example 12: on the basis of the above embodiments, the embodiments of the present invention further provide a computer readable storage medium, in which a computer program executable by a processor is stored, which when executed on the processor causes the processor to implement the steps of:
Acquiring text characteristics of text information to be synthesized; acquiring target acoustic features corresponding to the text features based on the text features, the identification information of the target speakers and the identification information of the target language of the text information through a target voice synthesis model of multiple speakers and multiple languages trained in advance; the target voice synthesis model supports voice synthesis of target language of a target speaker; and determining audio data of the target speaker when the text information of the target language is generated based on the target acoustic feature and the vocoder.
Since the principle of the above-mentioned computer readable storage medium for solving the problem is similar to that of the speech synthesis method, the implementation of the above-mentioned computer readable storage medium can refer to implementation 6 of the method, and the repetition is omitted.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (12)

1. A method of training a speech synthesis model, the method comprising:
Acquiring a first text feature of a first voice sample in a first language in a first sample set and a second text feature corresponding to a second voice sample in a second language, wherein the first language is different from the second language;
determining a first acoustic feature corresponding to the first text feature and a second acoustic feature of the second voice sample, wherein the first acoustic feature and the second acoustic feature are acoustic features of the same speaker;
Training an original speech synthesis model based on the first text feature and the corresponding first acoustic feature, the second text feature and the corresponding second acoustic feature to obtain a target speech synthesis model, wherein the target speech synthesis model supports speech synthesis of a first language and a second language of each speaker in the first sample set;
The first acoustic feature is acquired by:
if the first sample set does not include a voice sample of the first language uttered by the speaker of the second voice sample, acquiring the first acoustic feature based on the language content feature of at least one first voice sample in the first sample set through a language conversion model supporting the first language of the speaker of the second voice sample; and/or

if the first sample set includes at least one voice sample of the first language uttered by the speaker of the second voice sample, determining the first acoustic feature according to the acoustic feature of the at least one voice sample in the first sample set.
2. The method of claim 1, wherein the training the original speech synthesis model based on the first text feature and the corresponding first acoustic feature, the second text feature and the corresponding second acoustic feature comprises:
respectively acquiring a first characteristic vector corresponding to a first data set and a second characteristic vector corresponding to a second data set through an encoding layer in the original voice synthesis model; wherein the first data set includes the first text feature, language identification information of the first language, and object identification information of a speaker of the second voice sample, and the second data set includes the second text feature, language identification information of the second language, and the object identification information;
Respectively inputting the first feature vector and the second feature vector to a decoding layer in the original speech synthesis model to obtain a third acoustic feature corresponding to the first text feature and a fourth acoustic feature corresponding to the second text feature;
And adjusting parameter values of parameters in the original speech synthesis model based on the first acoustic feature and the third acoustic feature corresponding to the first text feature and the second acoustic feature and the fourth acoustic feature corresponding to the second text feature to obtain the target speech synthesis model.
3. The method of claim 1, wherein the obtaining the first acoustic feature based on the linguistic content features of at least one first voice sample in the first set of samples by a language translation model of a first language of a speaker supporting the second voice sample comprises:
Acquiring a language character vector corresponding to the language identification information and an object character vector corresponding to the object identification information based on the language identification information of the first language and the object identification information of the speaker of the second voice sample, which are included in a third data set, through a character vector layer of the language conversion model;
Acquiring, by an encoding layer of the language conversion model, a third feature vector based on the language character vector and the language content feature of the at least one first speech sample included in the third data set;
And acquiring the first acoustic feature based on the object character vector and the third feature vector through a decoding layer of the language conversion model.
4. The method of claim 1, wherein the language conversion model is trained by:
Acquiring a second sample set comprising voice samples of at least two languages, each voice sample corresponding to a fourth data set; the fourth data set includes language content features of the voice sample, object identification information of a speaker of the voice sample, language identification information of the voice sample, and fifth acoustic features of the voice sample; the second sample set includes voice samples of a speaker of the second voice sample, and the at least two languages include the first language and the second language;
Aiming at a voice sample of each language, acquiring a sixth acoustic feature corresponding to the voice sample based on a fourth data set corresponding to the voice sample through the original language conversion model; training the original language conversion model based on the fifth acoustic feature and the sixth acoustic feature corresponding to the voice sample, so that the obtained language conversion model can perform tone color conversion of any two languages of each speaker in the second sample set.
5. The method of claim 4, wherein the linguistic content features of the speech samples are obtained by:
and carrying out voice recognition processing on the voice sample based on the acoustic characteristics of the preset type to acquire the language content characteristics.
6. The method of claim 4, wherein the fifth acoustic feature and the sixth acoustic feature are bark frequency cepstrum coefficients BFCC.
7. The method of claim 5, wherein the predetermined type of acoustic feature is mel-frequency cepstrum coefficient MFCC.
8. A method of speech synthesis, the method comprising:
Acquiring text characteristics of text information to be synthesized;
Acquiring target acoustic features corresponding to the text features based on the text features, the identification information of target speakers and the identification information of target languages of the text information through a target voice synthesis model of multiple speakers and multiple languages trained in advance; wherein the target speech synthesis model supports speech synthesis of a target language of the target speaker; wherein the target speech synthesis model is trained based on the training method of the speech synthesis model of any one of claims 1-7;
and determining audio data when the target speaker emits the text information of the target language based on the target acoustic characteristics and the vocoder.
9. A training device for a speech synthesis model, the device comprising:
an obtaining unit, configured to obtain a first text feature of a first voice sample in a first language in a first sample set and a second text feature corresponding to a second voice sample in a second language, where the first language is different from the second language;
The determining unit is used for determining a first acoustic feature corresponding to the first text feature and a second acoustic feature of the second voice sample, wherein the first acoustic feature and the second acoustic feature are acoustic features of the same speaker;
The training unit is used for training an original voice synthesis model based on the first text feature and the corresponding first acoustic feature, the second text feature and the corresponding second acoustic feature to obtain a target voice synthesis model, and the target voice synthesis model supports voice synthesis of a first language and a second language of each speaker in the first sample set;
The determining unit is specifically configured to acquire the first acoustic feature by:
if the first sample set does not include a voice sample of the first language uttered by the speaker of the second voice sample, acquiring the first acoustic feature based on the language content feature of at least one first voice sample in the first sample set through a language conversion model supporting the first language of the speaker of the second voice sample; and/or if the first sample set includes at least one voice sample of the first language uttered by the speaker of the second voice sample, determining the first acoustic feature according to the acoustic feature of the at least one voice sample in the first sample set.
10. A speech synthesis apparatus, the apparatus comprising:
the acquisition module is used for acquiring text characteristics of the text information to be synthesized;
The processing module is used for acquiring target acoustic features corresponding to the text features based on the text features, the identification information of the target speakers and the identification information of the target language of the text information through a target voice synthesis model of multiple speakers and multiple languages trained in advance; wherein the target speech synthesis model supports speech synthesis of a target language of the target speaker; wherein the target speech synthesis model is trained based on the training method of the speech synthesis model of any one of claims 1-7;
And the synthesis module is used for determining audio data when the target speaker emits the text information of the target language based on the target acoustic characteristics and the vocoder.
11. An electronic device comprising at least a processor and a memory, the processor being adapted to implement the steps of the speech synthesis model training method according to any of claims 1-7 or the steps of the speech synthesis method according to claim 8 when executing a computer program stored in the memory.
12. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the steps of the speech synthesis model training method according to any one of claims 1-7 or implements the steps of the speech synthesis method according to claim 8.
CN202110419495.1A 2021-04-19 2021-04-19 Model training and speech synthesis method, device, equipment and medium Active CN115294955B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110419495.1A CN115294955B (en) 2021-04-19 2021-04-19 Model training and speech synthesis method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN115294955A CN115294955A (en) 2022-11-04
CN115294955B true CN115294955B (en) 2024-08-16

Family

ID=83818833

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110419495.1A Active CN115294955B (en) 2021-04-19 2021-04-19 Model training and speech synthesis method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN115294955B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111292720A (en) * 2020-02-07 2020-06-16 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9865251B2 (en) * 2015-07-21 2018-01-09 Asustek Computer Inc. Text-to-speech method and multi-lingual speech synthesizer using the method
KR102199050B1 (en) * 2018-01-11 2021-01-06 네오사피엔스 주식회사 Method and apparatus for voice translation using a multilingual text-to-speech synthesis model
US20200279553A1 (en) * 2019-02-28 2020-09-03 Microsoft Technology Licensing, Llc Linguistic style matching agent
CN112435650B (en) * 2020-11-11 2022-04-15 四川长虹电器股份有限公司 Multi-speaker and multi-language voice synthesis method and system
CN112562655A (en) * 2020-12-03 2021-03-26 北京猎户星空科技有限公司 Residual error network training and speech synthesis method, device, equipment and medium
CN118135992A (en) * 2020-12-24 2024-06-04 北京猎户星空科技有限公司 Speech synthesis model training and speech synthesis method, device, equipment and medium

Also Published As

Publication number Publication date
CN115294955A (en) 2022-11-04

Similar Documents

Publication Publication Date Title
KR102265972B1 (en) Method and apparatus for voice translation using a multilingual text-to-speech synthesis model
US11929059B2 (en) Method, device, and computer readable storage medium for text-to-speech synthesis using machine learning on basis of sequential prosody feature
EP3966804B1 (en) Multilingual speech synthesis and cross-language voice cloning
CN114038447B (en) Speech synthesis model training method, speech synthesis method, device and medium
CN109523989B (en) Speech synthesis method, speech synthesis device, storage medium, and electronic apparatus
CN113327575B (en) Speech synthesis method, device, computer equipment and storage medium
WO2017067206A1 (en) Training method for multiple personalized acoustic models, and voice synthesis method and device
CN113707125A (en) Training method and device for multi-language voice synthesis model
JPWO2018151125A1 (en) Word vectorization model learning device, word vectorization device, speech synthesizer, method and program thereof
CN113555006B (en) Voice information identification method and device, electronic equipment and storage medium
CN113327574A (en) Speech synthesis method, device, computer equipment and storage medium
CN112365878A (en) Speech synthesis method, device, equipment and computer readable storage medium
CN118135992A (en) Speech synthesis model training and speech synthesis method, device, equipment and medium
CN114694633A (en) Speech synthesis method, device, equipment and storage medium
CN115132170B (en) Language classification method, device and computer readable storage medium
WO2023245389A1 (en) Song generation method, apparatus, electronic device, and storage medium
CN115700871A (en) Model training and speech synthesis method, device, equipment and medium
CN112259084A (en) Speech recognition method, apparatus and storage medium
JP2015161927A (en) Acoustic model generation device, production method for acoustic model, and program
CN115294955B (en) Model training and speech synthesis method, device, equipment and medium
CN116913243A (en) Speech synthesis method, device, electronic equipment and readable storage medium
CN115114933A (en) Method, device, equipment and storage medium for text processing
Zgank Cross-lingual speech recognition between languages from the same language family
CN113571041A (en) Method and device for processing voice recognition text and electronic equipment
Kardava Georgian speech recognizer in famous searching systems and management of software package by voice commands in Georgian language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant