Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terms used in the description are for the purpose of describing particular embodiments only and are not intended to limit the application. The terms "comprising" and "having", and any variations thereof, in the description, the claims, and the above description of the drawings are intended to cover non-exclusive inclusions. The terms "first", "second", and the like in the description, the claims, and the above-described figures are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
To help the person skilled in the art better understand the solution of the present application, the technical solutions of the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings.
As shown in fig. 1, the system architecture 100 may include a terminal device 101, a network 102, and a server 103, where the terminal device 101 may be a notebook 1011, a tablet 1012, or a mobile phone 1013. The network 102 is the medium used to provide communication links between the terminal device 101 and the server 103. The network 102 may include various connection types, such as wired links, wireless communication links, or fiber optic cables.
A user may use the terminal device 101 to interact with the server 103 via the network 102, for example to receive or send messages. Various communication client applications, such as a web browser, a shopping application, a search application, an instant messaging tool, a mail client, and social platform software, may be installed on the terminal device 101.
The terminal device 101 may be any of various electronic devices having a display screen and supporting web browsing. In addition to the notebook 1011, the tablet 1012, or the mobile phone 1013, the terminal device 101 may be an electronic book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop portable computer, a desktop computer, or the like.
The server 103 may be a server providing various services, such as a background server providing support for pages displayed on the terminal device 101.
It should be noted that the voice conversion generating method provided by the embodiments of the present application is generally executed by the server/terminal device, and accordingly, the voice conversion generating apparatus is generally disposed in the server/terminal device.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flowchart of one embodiment of a speech conversion generating method according to the present application is shown, comprising the steps of:
step S201, a target accent sample set is acquired, and the target accent sample set is input into a pre-built initial conversion model, wherein the initial conversion model comprises a content encoder and a conversion decoder.
The target accent sample set may be formed by directly collecting voice data from users, or may be obtained from a public corpus or a data service platform that specifically provides voice data sets, which is not limited herein. The target accent sample set contains a large number of target accent samples, where the target accent refers to an accent meeting specific requirements or standards into which the original accent needs to be converted during the accent conversion process, including but not limited to standard Mandarin, a dialect, or an official-language accent of another country.
The choice of the target accent depends on the specific scenario and requirements. For example, in the medical field it may be necessary to convert a patient's accent into an accent familiar to the doctor to improve communication efficiency, and in the financial field it may be necessary to convert a customer's accent into a standard accent that the system can recognize more accurately.
In this embodiment, the electronic device (e.g., the server/terminal device shown in fig. 1) on which the voice conversion generating method operates may receive the target accent sample set through a wired connection or a wireless connection. It should be noted that the wireless connection may include, but is not limited to, a 3G/4G/5G connection, a WiFi connection, a Bluetooth connection, a WiMAX connection, a ZigBee connection, a UWB (ultra wideband) connection, and other now known or later developed wireless connections.
In this embodiment, denoising is performed on the target accent sample set to obtain a denoised target accent sample set, and the denoised target accent sample set is input into a pre-built initial conversion model. The initial conversion model includes a content encoder and a conversion decoder: the content encoder performs semantic marking on the target accent samples in a discrete semantic marking space, and the conversion decoder performs accent conversion on the semantic marks.
For example, spectral subtraction may be used for denoising: the noise power spectral density is estimated, a signal-to-noise ratio threshold of 10 dB is set, and the estimated noise power spectrum is subtracted from the power spectrum of the noisy speech to obtain the denoised speech.
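As an illustration only, the following is a minimal spectral-subtraction sketch in Python; the noise estimate taken from the first few frames, the use of the 10 dB threshold as a spectral floor, and the STFT parameters are assumptions for demonstration rather than details taken from the application.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(noisy, sr=16000, noise_frames=10, snr_floor_db=10.0):
    """Denoise speech by subtracting an estimated noise power spectrum (assumed parameters)."""
    _, _, spec = stft(noisy, fs=sr, nperseg=320, noverlap=240)            # 20 ms window, 5 ms hop
    power = np.abs(spec) ** 2
    noise_power = power[:, :noise_frames].mean(axis=1, keepdims=True)     # noise estimate from leading frames
    clean_power = np.maximum(power - noise_power, 0.0)
    floor = noise_power * 10 ** (-snr_floor_db / 10.0)                    # keep a floor relative to the noise
    clean_power = np.maximum(clean_power, floor)
    clean_spec = np.sqrt(clean_power) * np.exp(1j * np.angle(spec))       # reuse the noisy phase
    _, denoised = istft(clean_spec, fs=sr, nperseg=320, noverlap=240)
    return denoised[: len(noisy)]
```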
Step S202, extracting discrete semantic marks of each target accent sample in the target accent sample set through a content encoder to obtain a sample accent semantic mark sequence.
In this embodiment, a self-supervised learning mechanism is used to train the initial conversion model. To cope with the scarcity of parallel accent data, the target accent sample set is used to pre-train the initial conversion model, and a small amount of weakly parallel accent data is then used to fine-tune the pre-trained conversion model, so that the conversion model learns the conversion relationship from the source accent to the target accent.
In some alternative implementations, the content encoder adopts a pre-trained self-supervised speech representation model, HuBERT, which comprises a feature extractor and a plurality of cascaded encoders. The feature extractor extracts features of a target accent sample to obtain accent feature representations, and the encoders cluster and predict the accent feature representations to obtain a sample accent semantic mark sequence. The content encoder is a discrete content encoder: the accent feature representations are divided into a series of discrete semantic marks through a clustering algorithm, i.e., the accent feature representations are clustered through a K-means clustering algorithm to obtain the sample accent semantic mark sequence. In the clustering process, similar accent feature representations are aggregated into the same category by calculating the Euclidean distance between accent feature representations, and the number of cluster categories can be adjusted according to the scale of the corpus and the degree of difference between accents.
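A minimal sketch of this step, assuming the Hugging Face transformers HuBERT-base checkpoint facebook/hubert-base-ls960 stands in for the pre-trained content encoder and scikit-learn's K-means is used for clustering; the number of clusters (100) and the hidden layer used are illustrative assumptions.

```python
import torch
from sklearn.cluster import KMeans
from transformers import HubertModel

hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

@torch.no_grad()
def hubert_features(waveform_16k):
    """Frame-level continuous accent feature representations from HuBERT (16 kHz mono input)."""
    wav = torch.as_tensor(waveform_16k, dtype=torch.float32).unsqueeze(0)
    return hubert(wav).last_hidden_state.squeeze(0)            # (frames, 768)

def fit_discrete_marks(sample_set, n_clusters=100):
    """K-means (Euclidean distance) over pooled HuBERT features defines the discrete semantic marks."""
    feats = torch.cat([hubert_features(w) for w in sample_set]).numpy()
    return KMeans(n_clusters=n_clusters, n_init=10).fit(feats)

def semantic_mark_sequence(waveform_16k, kmeans):
    """Map one target accent sample to its sample accent semantic mark sequence."""
    return kmeans.predict(hubert_features(waveform_16k).numpy())
```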
Step S203, the sample accent semantic mark sequence is damaged in a mark damage manner to obtain a damaged accent semantic mark sequence.
In this embodiment, the task of the conversion model is to build a probability space of discrete semantic marks in the target accent domain, so that the semantic marks of the target accent can be generated in a closed-loop manner according to the context of preceding marks in the target accent domain. In this task, the conversion model is trained in a self-supervised manner, i.e., given a damaged accent semantic mark sequence Ŷ = {ŷ_0, …, ŷ_t}, t < T, the original sample accent semantic mark sequence Y = {y_0, …, y_T} is generated, as follows:
the marking damage modes include marking mask, marking deletion, marking filling and the like, and in the embodiment, the sample accent semantic marking sequence is damaged by adopting the marking filling mode, so that the damaged accent semantic marking sequence is obtained.
Mark filling ordinarily refers to padding sequences of inconsistent lengths to the same length, usually by adding special filling marks at the end or the beginning of a sequence, where the filling marks do not affect the prediction of the model. However, in the process of training the conversion model, filling marks can be added at any position of the sample accent semantic mark sequence, so that the model predicts the features before filling and thereby learns the contextual information and high-level features of the voice accent.
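Purely as an illustration, a toy corruption routine of the mark-filling kind might look as follows; the fill-mark id, the corruption ratio, and the use of NumPy are assumptions, not details given in the application.

```python
import numpy as np

def fill_corrupt(marks, fill_id, ratio=0.15, rng=None):
    """Insert filling marks at random positions of a sample accent semantic mark sequence.

    Returns the damaged sequence; the conversion model is later trained to recover the original."""
    rng = rng or np.random.default_rng()
    marks = list(marks)
    n_fill = max(1, int(len(marks) * ratio))
    for pos in sorted(rng.integers(0, len(marks) + 1, size=n_fill), reverse=True):
        marks.insert(pos, fill_id)          # damage: extra fill marks anywhere in the sequence
    return np.array(marks)

# damaged = fill_corrupt(mark_sequence, fill_id=100)  # e.g. one id beyond the K-means clusters
```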
Step S204, the conversion decoder is trained to reconstruct the sample accent semantic mark sequence from the damaged accent semantic mark sequence, so as to obtain a pre-trained conversion model.
Specifically, the damaged accent semantic mark sequence is input into the conversion decoder for autoregressive generation to obtain a predicted accent semantic mark sequence; a cross-entropy loss function between the predicted accent semantic mark sequence and the sample accent semantic mark sequence is calculated to obtain a first loss value; the model parameters of the content encoder and the conversion decoder are adjusted based on the first loss value; and iterative training continues until an iteration stop condition is met, so as to obtain the pre-trained conversion model.
The conversion decoder adopts an autoregressive Transformer structure for pre-training and establishes a mapping from the damaged accent semantic mark sequence to the sample accent semantic mark sequence. In a specific example, the Transformer structure includes an input layer, an encoder portion, and a decoder portion; the encoder portion is formed by stacking a plurality of encoder layers, and the decoder portion is likewise formed by stacking a plurality of decoder layers. The semantics and contextual features of the damaged accent semantic mark sequence are extracted through the multi-head attention mechanism of each encoder layer to obtain accent semantic coding features; the accent semantic coding features are input into the decoder portion, and the long-distance dependencies between the accent semantic coding features are captured through the multi-head attention and cross-attention mechanisms of each decoder layer, so as to obtain the predicted accent semantic mark sequence.
The cross-entropy loss function is used to measure the difference between the probability distribution of the model-generated predicted accent semantic mark sequence and the probability distribution of the real sample accent semantic mark sequence. A first loss value between the predicted accent semantic mark sequence and the sample accent semantic mark sequence is calculated through the cross-entropy loss function, the model parameters of the content encoder and the conversion decoder are adjusted according to the first loss value using an Adam optimizer, and iterative training continues on the adjusted model until the iteration stop condition is met, i.e., the number of iterations reaches a preset number or the first loss value no longer changes significantly, so as to obtain the pre-trained conversion model.
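For illustration, a minimal PyTorch-style pre-training step under the above description might look as follows; the conversion model's call interface, the vocabulary size, and the learning rate are placeholder assumptions rather than values given in the application.

```python
import torch
import torch.nn as nn

def pretrain_step(conversion_model, damaged_marks, original_marks, optimizer):
    """One self-supervised pre-training step: reconstruct the original marks from the damaged ones."""
    logits = conversion_model(damaged_marks, targets=original_marks)   # (batch, T, vocab), hypothetical interface
    first_loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),        # predicted accent semantic mark distribution
        original_marks.reshape(-1),                 # sample accent semantic mark sequence
    )
    optimizer.zero_grad()
    first_loss.backward()
    optimizer.step()                                # Adam update of encoder + decoder parameters
    return first_loss.item()

# optimizer = torch.optim.Adam(conversion_model.parameters(), lr=1e-4)  # hypothetical learning rate
```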
By pre-training the conversion model with the damaged accent semantic mark sequence, the conversion model can learn high-level feature representations and contextual information of the speech accent, which improves the transfer performance of the model and strengthens its understanding of complex structures; at the same time, no parallel corpus is needed, which improves the generalization ability of the model.
Step S205, a parallel voice data set containing correspondences between source accent samples and target accent samples is obtained, and the parallel voice data set is input into the conversion model for fine-tuning to obtain a fine-tuned accent conversion model.
Since some phones in the source accent need to be converted into target accent phones, the mapping between source accent phones and target accent phones needs to be learned. Specifically, the pre-trained conversion model is fine-tuned, conditioned on the semantic marks of the source accent, using a small amount of weakly parallel accent data.
Specifically, the parallel voice data set is input into the conversion model; the voice feature vector sequences of the source accent samples and the target accent samples in the parallel voice data set are respectively extracted through the content encoder of the conversion model; the voice feature vector sequences are clustered through a clustering algorithm to obtain a source accent mark sequence and a target accent mark sequence; the source accent mark sequence is input into the conversion decoder of the conversion model to obtain a corresponding predicted accent mark sequence; a cross-entropy loss function between the predicted accent mark sequence and the target accent mark sequence is calculated to obtain a second loss value; the model parameters of the conversion decoder are fine-tuned based on the second loss value; and iterative training continues until the iteration stop condition is met, so as to obtain the fine-tuned accent conversion model.
For the source accent voice samples and the target accent voice samples in the parallel voice data set, the voice feature vector sequences of the source accent voice and the target accent voice are respectively extracted through the content encoder of the pre-trained conversion model, yielding a source accent voice feature vector sequence and a target accent voice feature vector sequence. The two extracted sequences are aligned so that their lengths are consistent, which facilitates subsequent feature mapping and conversion. The source and target accent voice feature vector sequences are then respectively clustered through the K-means clustering algorithm to obtain the source accent mark sequence and the target accent mark sequence. The conversion decoder is trained on these sequences to learn the conversion relation that maps the source accent mark sequence to the target accent mark sequence, and outputs the predicted accent mark sequence obtained after the conversion decoder converts the source accent mark sequence.
A second loss value between the predicted accent mark sequence and the target accent mark sequence is calculated through the cross-entropy loss function, the model parameters of the conversion decoder are fine-tuned according to the second loss value using the Adam optimizer, and iterative training continues on the fine-tuned model until the iteration stop condition is met, so as to obtain the final fine-tuned accent conversion model.
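A hedged sketch of the fine-tuning step, reusing the hypothetical interfaces from the pre-training example above; passing only the conversion decoder's parameters to the optimizer is one plausible reading of "fine-tuning the model parameters of the conversion decoder", not a detail confirmed by the application.

```python
import torch
import torch.nn as nn

def finetune_step(conversion_model, source_marks, target_marks, decoder_optimizer):
    """One weakly-parallel fine-tuning step: map source accent marks to target accent marks."""
    logits = conversion_model(source_marks, targets=target_marks)   # predicted accent mark distribution
    second_loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), target_marks.reshape(-1))
    decoder_optimizer.zero_grad()
    second_loss.backward()
    decoder_optimizer.step()
    return second_loss.item()

# decoder_optimizer = torch.optim.Adam(conversion_model.decoder.parameters(), lr=1e-5)  # hypothetical
```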
By fine-tuning the conversion model with the parallel voice data set, the amount of parallel data required is reduced, the adaptability of the model is improved, and the characteristics of the source accent and the target accent can be captured accurately, thereby improving the accent conversion quality.
Step S206, the target accent sample set is input into a pre-built initial generation model, where the initial generation model includes a codec and a generation decoder.
In this embodiment, the accent conversion process is divided into two stages, the first stage realizes semantic labeling through the accent conversion model, and the second stage performs target accent speech synthesis through the speech generation model.
The target accent sample set is obtained from a large target accent voice corpus, and an initial generation model based on TF-Codec is pre-constructed; the initial generation model performs single-stage causal speech generation, iteratively generating the acoustic marks of TF-Codec on the basis of the converted target accent semantic marks.
In this embodiment, the initial generation model includes a codec and a generation decoder; the acoustic marks characterizing the speaker style in the target accent sample set are extracted by the codec, and the target accent speech is reconstructed by the generation decoder, which fuses the sample accent semantic mark sequence and the acoustic marks.
Step S207, the acoustic marks of each target accent sample in the target accent sample set are extracted by the codec to obtain a style prompt mark sequence.
The codec employs TF-Codec, a pre-trained causal neural speech codec. The TF-Codec encoder can decompose speech into subspaces such as content, prosody, timbre, and audio detail, thereby achieving efficient attribute decoupling.
In this embodiment, the predictive loop is removed, and efficient acoustic modeling is performed using a non-predictive 6 kbps model to obtain high-quality output.
In some optional implementations, the step of extracting, by the codec, the acoustic signature of each target accent sample in the target accent sample set, and obtaining the style-cue signature sequence includes:
Inputting the target accent sample set into a codec, wherein the codec comprises an input layer, a style encoder, a style decoder and an output layer;
Preprocessing a target accent sample set through an input layer to obtain a purified voice signal, and converting the purified voice signal according to a preset format to obtain an input voice signal;
extracting acoustic features of the input voice signals through a style encoder to obtain acoustic feature expression vectors;
inputting the acoustic feature representation vector into a style decoder for decoding to generate a corresponding style acoustic feature vector;
Post-processing is carried out on the style acoustic feature vector through the output layer, and the style acoustic feature vector after post-processing is mapped into a preset style acoustic marking space, so that a final style prompt marking sequence is obtained.
The target accent sample set is preprocessed through the input layer, where the preprocessing includes noise reduction, endpoint detection, silence-segment removal, and the like. Specifically, noise reduction is performed by spectral subtraction: the noise power spectral density is estimated, a signal-to-noise ratio threshold is set, and the estimated noise power spectrum is subtracted from the power spectrum of the noisy speech to obtain the denoised speech. An energy-based endpoint detection algorithm then sets an energy threshold, detects the start and end points of the denoised speech, and removes the silence segments according to those start and end points to obtain the purified voice signal.
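Denoising was sketched earlier; the following is a minimal energy-based endpoint detection and silence-removal sketch, with the frame size and the relative energy threshold chosen as illustrative assumptions.

```python
import numpy as np

def remove_silence(speech, frame_len=320, hop=80, rel_threshold=0.02):
    """Keep only the span between the first and last frames whose short-time energy exceeds a threshold."""
    if len(speech) <= frame_len:
        return speech
    frames = [speech[i:i + frame_len] for i in range(0, len(speech) - frame_len, hop)]
    energy = np.array([np.sum(f.astype(np.float64) ** 2) for f in frames])
    threshold = rel_threshold * energy.max()            # energy threshold relative to the loudest frame
    voiced = energy > threshold
    if not voiced.any():
        return speech
    start = np.argmax(voiced)                           # first voiced frame (start point)
    end = len(voiced) - np.argmax(voiced[::-1]) - 1     # last voiced frame (end point)
    return speech[start * hop: end * hop + frame_len]   # purified voice signal between start/end points
```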
In this embodiment, TF-Codec accepts as input an amplitude-compressed time-frequency spectrum computed from 16 kHz speech with a window length of 20 ms and a hop size of 5 ms, and this is the preset format. The purified voice signal is converted into an input voice signal that the model can process according to this preset format.
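A sketch of this format conversion, assuming a power-law amplitude compression (the exponent 0.3 is illustrative); the application only specifies 16 kHz input, a 20 ms window, and a 5 ms hop.

```python
import numpy as np
from scipy.signal import stft

def to_input_spectrum(purified, sr=16000, compress_power=0.3):
    """Convert a purified 16 kHz waveform to an amplitude-compressed time-frequency spectrum."""
    nperseg = int(0.020 * sr)              # 20 ms window -> 320 samples
    hop = int(0.005 * sr)                  # 5 ms hop    -> 80 samples
    _, _, spec = stft(purified, fs=sr, nperseg=nperseg, noverlap=nperseg - hop)
    return np.abs(spec) ** compress_power  # amplitude compression of the magnitude spectrum
```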
The input voice signal is fed into the style encoder, which causally captures the short-term and long-term temporal dependencies between input frames to obtain the acoustic feature representation vector.
In some optional implementations of this embodiment, the style encoder includes a plurality of 2D causal convolution layers, a temporal convolution network, and a gated recurrent unit (GRU) block, and the step of extracting acoustic features from the input speech signal by the style encoder to obtain the acoustic feature representation vector includes:
carrying out convolution operation on an input voice signal through a plurality of 2D causal convolution layers to obtain acoustic convolution characteristics;
inputting the acoustic convolution characteristics into a temporal convolution network to carry out causal convolution and dilated convolution to obtain acoustic time sequence characteristics;
and inputting the acoustic time sequence characteristics into a gated recurrent unit (GRU) block to calculate the hidden state, and outputting the acoustic feature representation vector.
The number of 2D causal convolution layers is selected according to the actual situation. A 2D convolution operation is performed on the input voice signal through the plurality of 2D causal convolution layers to extract the time-domain and frequency-domain features of the voice signal, so as to obtain the acoustic convolution features.
The temporal convolution network is made up of multiple convolution layers, each of which uses causal convolution and dilated convolution to extract timing features. In causal convolution, as the convolution kernel slides over the input sequence, it only sees the current position and the preceding elements, i.e., the output of the current time step depends only on past time steps. Dilated convolution expands the receptive field of the kernel by introducing a dilation rate parameter, which increases the range of contextual information that can be captured, i.e., long-distance timing dependencies, while keeping the kernel size unchanged. When processing the input acoustic convolution features, the temporal convolution network convolves them layer by layer, extracts timing features at different levels, and outputs the final acoustic timing features.
The gated recurrent unit (GRU) block comprises a reset gate and an update gate: the reset gate determines how much information of the previous hidden state needs to be forgotten, and the update gate determines how much information of the current hidden state needs to be retained. The hidden state of the acoustic timing features is computed through the GRU block, and the final hidden state is output as the acoustic feature representation vector.
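The following PyTorch sketch illustrates one way such a style encoder could be wired together (a causal 2D convolution, a dilated causal temporal convolution network, and a GRU); the channel counts, kernel sizes, and dilation rates are illustrative assumptions only, not values from the application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Conv1d):
    """1D convolution padded only on the left, so each output depends only on past inputs."""
    def forward(self, x):
        pad = (self.kernel_size[0] - 1) * self.dilation[0]
        return super().forward(F.pad(x, (pad, 0)))

class StyleEncoder(nn.Module):
    """Causal 2D conv -> dilated causal TCN -> GRU, as one possible reading of the description."""
    def __init__(self, freq_bins=161, channels=16, hidden=256):
        super().__init__()
        self.conv2d = nn.Conv2d(1, channels, kernel_size=(3, 3), padding=(0, 1))
        self.tcn = nn.Sequential(
            CausalConv1d(channels * freq_bins, hidden, kernel_size=3, dilation=1), nn.ReLU(),
            CausalConv1d(hidden, hidden, kernel_size=3, dilation=2), nn.ReLU(),
            CausalConv1d(hidden, hidden, kernel_size=3, dilation=4), nn.ReLU(),
        )
        self.gru = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, spec):                        # spec: (batch, time, freq_bins)
        x = F.pad(spec.unsqueeze(1), (0, 0, 2, 0))  # left-pad the time axis -> causal 2D convolution
        x = self.conv2d(x)                          # (batch, channels, time, freq_bins)
        x = x.permute(0, 2, 1, 3).flatten(2)        # (batch, time, channels * freq_bins)
        x = self.tcn(x.transpose(1, 2))             # dilated causal convolutions over time
        out, _ = self.gru(x.transpose(1, 2))        # hidden-state computation
        return out                                  # acoustic feature representation vectors
```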
In some alternative implementations, in the temporal convolution network, a residual connection and a layer normalization operation follow each convolution layer, which accelerates convergence, improves training stability, and mitigates the vanishing-gradient problem.
Extracting features from the input voice signal through the style encoder improves parallel processing efficiency and the ability to capture long-term dependencies, reduces resource consumption, and captures acoustic features more accurately, thereby improving accent conversion quality.
In some optional implementations of this embodiment, the style decoder includes an embedding layer, a plurality of vector quantizers, and a decoding layer, and the step of inputting the acoustic feature representation vector into the style decoder for decoding to generate the corresponding style acoustic feature vector includes:
Vector conversion is carried out on the acoustic feature representation vector through the embedding layer, so that an acoustic embedding vector is obtained;
inputting the acoustic embedded vectors into a plurality of vector quantizers for grouping quantization to obtain acoustic embedded grouping vectors with the same number as the vector quantizers;
And inputting the acoustic embedded grouping vectors into a decoding layer for attention mechanism calculation, and generating style acoustic feature vectors.
The acoustic feature representation vector is converted into an acoustic embedding vector through the embedding layer; grouped quantization is adopted, dividing the acoustic embedding vector into K groups, and each group is quantized by a vector quantizer with a 1024-codeword codebook to obtain the acoustic embedding grouping vectors. The acoustic embedding grouping vectors are input into the decoding layer, which introduces an attention mechanism and decodes them with a corresponding decoding algorithm to obtain the style acoustic feature vector representing the speaker's style. The speaker style includes pronunciation habits, intonation patterns, and the like.
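A minimal grouped vector-quantization sketch; the value of K, the embedding dimension, and the plain nearest-neighbour lookup over randomly initialized codebooks are illustrative assumptions rather than the trained TF-Codec quantizers.

```python
import torch
import torch.nn as nn

class GroupedVectorQuantizer(nn.Module):
    """Split the acoustic embedding into K groups and quantize each with a 1024-codeword codebook."""
    def __init__(self, dim=256, groups=4, codewords=1024):
        super().__init__()
        assert dim % groups == 0
        self.groups = groups
        self.codebooks = nn.Parameter(torch.randn(groups, codewords, dim // groups))

    def forward(self, emb):                                    # emb: (batch, time, dim)
        chunks = emb.chunk(self.groups, dim=-1)                # K acoustic embedding groups
        quantized, indices = [], []
        for g, chunk in enumerate(chunks):
            flat = chunk.reshape(-1, chunk.size(-1))           # (batch*time, dim/groups)
            dist = torch.cdist(flat, self.codebooks[g])        # distances to all 1024 codewords
            idx = dist.argmin(-1).view(chunk.shape[:-1])       # nearest codeword per frame
            quantized.append(self.codebooks[g][idx])
            indices.append(idx)
        return torch.cat(quantized, dim=-1), torch.stack(indices, dim=-1)
```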
Grouped quantization and decoding by the style decoder can better capture data characteristics in different numerical ranges, which improves model accuracy while also enabling efficient model compression and accelerated inference.
In this embodiment, the style acoustic feature vector is post-processed through the output layer: specifically, the style acoustic feature vector is normalized using the L2 norm, and the normalized vector is mapped into a style acoustic mark space of a preset dimension, so as to output the final style prompt mark sequence.
Extracting the style prompt mark sequence through the codec allows the style characteristics in the voice to be extracted more accurately, achieves higher-quality accent conversion, adapts more flexibly to the accent characteristics of different users, and improves the generalization capability of the model.
Step S208, the generation decoder is trained to reconstruct the corresponding target accent samples from the sample accent semantic mark sequence and the style prompt mark sequence, so as to obtain a trained voice generation model.
Specifically, the sample accent semantic mark sequence and the style prompt mark sequence are decoded by the generation decoder to obtain a predicted target accent voice; a third loss value between the predicted target accent voice and the corresponding target accent sample is calculated according to a preset loss function; the model parameters of the generation decoder are adjusted based on the third loss value; and iterative training continues until the iteration stop condition is met.
In this embodiment, the generation decoder is a single-stage causal speech generation model that adopts an autoregressive Transformer structure. In a specific example, the Transformer structure includes an input layer, an encoder portion formed by stacking a plurality of encoder layers, a decoder portion formed by stacking a plurality of decoder layers, and an output layer. The sample accent semantic mark sequence and the style prompt mark sequence are spliced through the input layer to obtain a spliced semantic mark sequence; the semantics and contextual features of the spliced semantic mark sequence are extracted through the multi-head attention mechanism of each encoder layer to obtain target accent style coding features; the target accent style coding features are input into the decoder portion, and the long-distance dependencies between the target accent style coding features are captured through the multi-head attention and cross-attention mechanisms of each decoder layer, so as to obtain the predicted target accent voice.
The reconstruction loss between the predicted target accent voice and the corresponding target accent sample is calculated using the L1 loss as the preset loss function, yielding the third loss value; the model parameters of the generation decoder are optimized and updated with the third loss value through a back-propagation algorithm to obtain an updated model; and iterative training continues until the iteration stop condition is met, i.e., the number of iterations reaches a preset number or the third loss value no longer changes significantly, so as to obtain the trained generation decoder.
Wherein the L1 loss is calculated as:

L1 = (1/n) Σ_{i=1}^{n} |y_i − y′_i|

where n represents the number of target accent samples in the target accent sample set, y_i represents the i-th target accent sample in the target accent sample set, and y′_i represents the predicted target accent speech corresponding to the i-th target accent sample in the initial reconstructed speech data set.
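As an illustration, the third-loss computation could look like the following sketch, assuming the predicted and reference speech are already aligned tensors of the same length; the generation decoder's call interface and the optimizer choice mirror the earlier examples and are assumptions here.

```python
import torch
import torch.nn as nn

def generation_step(generation_decoder, semantic_marks, style_prompt_marks, target_wave, optimizer):
    """One training step of the generation decoder with the L1 reconstruction loss."""
    predicted_wave = generation_decoder(semantic_marks, style_prompt_marks)  # hypothetical interface
    third_loss = nn.functional.l1_loss(predicted_wave, target_wave)          # mean of |y_i - y'_i|
    optimizer.zero_grad()
    third_loss.backward()                                                    # back-propagation
    optimizer.step()
    return third_loss.item()
```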
By training the generation decoder, the naturalness and fluency of the converted speech are improved, and better speech quality is achieved with lower complexity.
Step S209, integrating the accent conversion model and the voice generation model to obtain a final voice accent conversion generation model.
The trained accent conversion model and the trained voice generation model are connected and integrated, with the output of the accent conversion model connected to the codec of the voice generation model, to form a complete voice accent conversion generation model.
Step S210, source accent voice to be converted is obtained, the source accent voice to be converted is input into a voice accent conversion generation model, and target accent voice is generated.
In this embodiment, the source accent voice to be converted is obtained and input into the voice accent conversion generation model. Discrete semantic marking is performed on the source accent voice to be converted through the content encoder to obtain source semantic features; the source semantic features are input into the conversion decoder for target accent feature mapping and conversion to obtain target semantic features; the target semantic features are input into the voice generation model; a speech segment of a preset duration taken from the front of the source accent voice to be converted is extracted to obtain the target style voice; the target style voice is input into the codec for acoustic mark extraction to obtain style prompt features; the target semantic features and the style prompt features are spliced to obtain splicing features; and the splicing features are input into the generation decoder for autoregressive generation to obtain the target accent voice.
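To make the two-stage inference flow concrete, the sketch below composes the hypothetical components used in the earlier examples; the 3-second style prompt length and every interface shown are assumptions for illustration, not details fixed by the application.

```python
import torch

@torch.no_grad()
def convert_accent(source_wave, content_encoder, conversion_decoder,
                   codec, generation_decoder, sr=16000, prompt_seconds=3.0):
    """Two-stage inference: semantic-mark conversion, then target accent speech generation."""
    # Stage 1: source speech -> discrete semantic marks -> converted target semantic marks
    source_marks = content_encoder(source_wave)                   # source semantic features
    target_marks = conversion_decoder.generate(source_marks)      # target semantic features

    # Stage 2: style prompt from the leading segment of the source speech, then generation
    prompt_wave = source_wave[: int(prompt_seconds * sr)]         # target style voice segment
    style_prompt = codec.encode(prompt_wave)                      # style prompt features
    spliced = torch.cat([style_prompt, target_marks], dim=-1)     # splicing features
    return generation_decoder.generate(spliced)                   # autoregressive target accent voice
```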
It should be emphasized that, to further ensure the privacy and security of the source accent speech to be converted, the source accent speech to be converted may also be stored in a node of a blockchain.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association using cryptographic methods, where each data block contains information from a batch of network transactions and is used to verify the validity (anti-counterfeiting) of its information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
The application uses a two-stage generation framework comprising a conversion model and a generation model: the first stage performs conversion at the semantic mark level through the conversion model, and the second stage synthesizes speech based on the converted semantic marks through the generation model. Accent conversion is thereby decoupled into two processes, conversion and voice generation, and with the semantic marks serving as the conversion bridge, the voice generation stage can make full use of a large amount of target accent voice data, which significantly reduces the dependence of the conversion stage on parallel voice data.
The embodiments of the present application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by computer-readable instructions stored in a computer-readable storage medium, which, when executed, may include the steps of the embodiments of the methods described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, a read-only memory (ROM), or a random access memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include a plurality of sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a speech conversion generating apparatus, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus is particularly applicable to various electronic devices.
As shown in fig. 3, the speech conversion generating apparatus 300 according to the present embodiment includes an acquisition module 301, a semantic marking module 302, a marking damage module 303, a conversion module 304, a fine adjustment module 305, an input module 306, an acoustic marking module 307, a generation module 308, an integration module 309, and a conversion generation module 310. Wherein:
The obtaining module 301 is configured to obtain a target accent sample set, and input the target accent sample set into a pre-constructed initial conversion model, where the initial conversion model includes a content encoder and a conversion decoder;
the semantic mark module 302 is configured to extract, by using the content encoder, a discrete semantic mark of each target accent sample in the target accent sample set, to obtain a sample accent semantic mark sequence;
the marking damage module 303 is configured to damage the sample accent semantic mark sequence in a marking damage manner, so as to obtain a damaged accent semantic mark sequence;
the conversion module 304 is configured to train the conversion decoder to reconstruct the sample accent semantic tag sequence by using the damaged accent semantic tag sequence, so as to obtain a pre-trained conversion model;
The fine tuning module 305 is configured to obtain a parallel voice data set containing a corresponding relationship between a source accent sample and a target accent sample, input the parallel voice data set into the conversion model, and perform fine tuning to obtain a fine-tuned accent conversion model;
the input module 306 is configured to input the target accent sample set into a pre-constructed initial generation model, where the initial generation model includes a codec and a generation decoder;
the acoustic marking module 307 is configured to extract, by using the codec, an acoustic mark of each target accent sample in the target accent sample set, and obtain a style prompt mark sequence;
the generating module 308 is configured to train the generating decoder to reconstruct a corresponding target accent sample by using the sample accent semantic tag sequence and the style prompt tag sequence, so as to obtain a trained speech generating model;
The integration module 309 is configured to integrate the accent conversion model and the speech generation model to obtain a final speech accent conversion generation model;
The conversion generating module 310 is configured to obtain a source accent voice to be converted, input the source accent voice to be converted into the voice accent conversion generating model, and generate a target accent voice.
It should be emphasized that, to further ensure the privacy and security of the source accent speech to be converted, the source accent speech to be converted may also be stored in a node of a blockchain.
Based on the above-mentioned voice conversion generating apparatus 300, through a two-stage generation framework comprising a conversion model and a generation model, the first stage performs conversion at the semantic mark level through the conversion model, and the second stage synthesizes speech based on the converted semantic marks through the generation model. Accent conversion is decoupled into two processes, conversion and voice generation; with the semantic marks serving as the conversion bridge, a large amount of target accent voice data is fully utilized in the voice generation stage, thereby significantly reducing the dependence of the conversion stage on parallel voice data. In addition, a language pre-training technique is introduced, which further reduces the demand for parallel voice data and improves accent conversion quality; the voice style characteristics of the speaker can be retained while accurately converting to the target accent, providing an effective solution for cross-language and cross-dialect voice interaction.
In some alternative implementations of the present embodiment, the conversion module 304 is further configured to:
Inputting the damaged accent semantic tag sequence into the conversion decoder to perform autoregressive generation to obtain a predicted accent semantic tag sequence;
Calculating a cross entropy loss function between the predicted accent semantic tag sequence and the sample accent semantic tag sequence to obtain a first loss value;
And adjusting model parameters of the content encoder and the conversion decoder based on the first loss value, and continuing iterative training until the iteration stopping condition is met, so as to obtain a pre-trained conversion model.
By pre-training the conversion model with the damaged accent semantic mark sequence, the conversion model can learn high-level feature representations and contextual information of the speech accent, which improves the transfer performance of the model and strengthens its understanding of complex structures; at the same time, no parallel corpus is needed, which improves the generalization ability of the model.
In some alternative implementations of the present embodiment, the fine tuning module 305 is further configured to:
Respectively extracting voice feature vector sequences of a source accent sample and a target accent sample in the parallel voice data set by a content encoder of the conversion model;
Clustering the voice feature vector sequence through a clustering algorithm to obtain a source accent mark sequence and a target accent mark sequence;
inputting the source accent mark sequence into a conversion decoder of the conversion model to obtain a corresponding predicted accent mark sequence;
Calculating a cross entropy loss function between the predicted accent mark sequence and the target accent mark sequence to obtain a second loss value;
and fine tuning the model parameters of the conversion decoder based on the second loss value, and continuing iterative training until the iteration stop condition is met, so as to obtain a fine-tuned accent conversion model.
By fine-tuning the conversion model with the parallel voice data set, the amount of parallel data required is reduced, the adaptability of the model is improved, and the characteristics of the source accent and the target accent can be captured accurately, thereby improving the accent conversion quality.
In some alternative implementations of the present embodiment, the acoustic tagging module 307 comprises:
An input sub-module for inputting the target accent sample set into the codec, wherein the codec comprises an input layer, a style encoder, a style decoder, and an output layer;
the preprocessing sub-module is used for preprocessing the target accent sample set through the input layer to obtain a purified voice signal, and converting the purified voice signal according to a preset format to obtain an input voice signal;
the coding submodule is used for extracting acoustic features of the input voice signals through the style coder to obtain acoustic feature expression vectors;
The decoding submodule is used for inputting the acoustic feature representation vector into the style decoder for decoding to generate a corresponding style acoustic feature vector;
And the post-processing sub-module is used for carrying out post-processing on the style acoustic feature vector through the output layer, and mapping the style acoustic feature vector after the post-processing into a preset style acoustic mark space to obtain a final style prompt mark sequence.
Extracting the style prompt mark sequence through the codec allows the style characteristics in the voice to be extracted more accurately, achieves higher-quality accent conversion, adapts more flexibly to the accent characteristics of different users, and improves the generalization capability of the model.
In some alternative implementations, the style encoder includes a plurality of 2D causal convolutional layers, a temporal convolutional network, and a gated recurrent unit (GRU) block, and the encoding submodule is further configured to:
performing convolution operation on the input voice signal through the plurality of 2D causal convolution layers to obtain acoustic convolution characteristics;
inputting the acoustic convolution characteristics into the temporal convolution network to perform causal convolution and dilated convolution to obtain acoustic time sequence characteristics;
And inputting the acoustic time sequence characteristics into the gated recurrent unit (GRU) block to calculate a hidden state, and outputting acoustic characteristic representation vectors.
Extracting features from the input voice signal through the style encoder improves parallel processing efficiency and the ability to capture long-term dependencies, reduces resource consumption, and captures acoustic features more accurately, thereby improving accent conversion quality.
In some alternative implementations, the style decoder includes an embedded layer and a plurality of vector quantizers and a decoding layer, the decoding submodule further configured to:
Performing vector conversion on the acoustic feature representation vector through the embedding layer to obtain an acoustic embedding vector;
Inputting the acoustic embedded vectors into the vector quantizers for grouping quantization to obtain acoustic embedded grouping vectors with the same number as the vector quantizers;
and inputting the acoustic embedded grouping vector into the decoding layer for attention mechanism calculation, and generating a style acoustic feature vector.
Grouped quantization and decoding by the style decoder can better capture data characteristics in different numerical ranges, which improves model accuracy while also enabling efficient model compression and accelerated inference.
In some alternative implementations, the generation module 308 is further to:
decoding the sample accent semantic tag sequence and the style prompt tag sequence through the generating decoder to obtain predicted target accent voice;
calculating a third loss value between the predicted target accent voice and the corresponding target accent sample according to a preset loss function;
And adjusting model parameters of the generated decoder based on the third loss value, and continuing iterative training until an iterative stopping condition is met.
By training the generation decoder, the naturalness and fluency of the converted speech are improved, and better speech quality is achieved with lower complexity.
In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 4, fig. 4 is a basic structural block diagram of a computer device according to the present embodiment.
The computer device 4 comprises a memory 41, a processor 42, and a network interface 43 that are communicatively connected to each other via a system bus. It is noted that only a computer device 4 having the memory 41, the processor 42, and the network interface 43 is shown in the figure, but it should be understood that not all of the illustrated components are required to be implemented, and more or fewer components may be implemented instead. It will be appreciated by those skilled in the art that the computer device herein is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
The memory 41 includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disc, and the like. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the computer device 4. Of course, the memory 41 may also include both an internal storage unit of the computer device 4 and an external storage device. In this embodiment, the memory 41 is typically used to store the operating system and various application software installed on the computer device 4, such as the computer-readable instructions of the voice conversion generating method. Further, the memory 41 may be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute the computer-readable instructions stored in the memory 41 or to process data, such as the computer-readable instructions for executing the voice conversion generating method.
The network interface 43 may comprise a wireless network interface or a wired network interface, which network interface 43 is typically used for establishing a communication connection between the computer device 4 and other electronic devices.
Through a two-stage generation framework comprising a conversion model and a generation model, the first stage performs conversion at the semantic mark level through the conversion model, and the second stage synthesizes speech based on the converted semantic marks through the generation model. Accent conversion is decoupled into two processes, conversion and voice generation; with the semantic marks serving as the conversion bridge, a large amount of target accent voice data can be fully utilized in the voice generation stage, so that the dependence of the conversion stage on parallel voice data is significantly reduced. In addition, a language pre-training technique is introduced, which further reduces the requirement for parallel voice data, improves accent conversion quality, and allows the voice style characteristics of the speaker to be retained while accurately converting to the target accent.
The present application also provides another embodiment, namely, a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the speech conversion generating method as described above.
Through a two-stage generation framework comprising a conversion model and a generation model, the first stage performs conversion at the semantic mark level through the conversion model, and the second stage synthesizes speech based on the converted semantic marks through the generation model. Accent conversion is decoupled into two processes, conversion and voice generation; with the semantic marks serving as the conversion bridge, a large amount of target accent voice data can be fully utilized in the voice generation stage, so that the dependence of the conversion stage on parallel voice data is significantly reduced. In addition, a language pre-training technique is introduced, which further reduces the requirement for parallel voice data, improves accent conversion quality, and allows the voice style characteristics of the speaker to be retained while accurately converting to the target accent.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general hardware platform, or, of course, by means of hardware, although in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, a magnetic disk, or an optical disc) and comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods according to the embodiments of the present application.
It is apparent that the above-described embodiments are only some embodiments of the present application, not all of them, and the preferred embodiments of the present application are shown in the drawings, which do not limit the scope of the claims. This application may be embodied in many different forms; these embodiments are provided so that this disclosure will be thorough and complete. Although the application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the technical solutions described in the foregoing embodiments, or equivalents may be substituted for some of the features thereof. All equivalent structures made using the content of the specification and the drawings of the application, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of protection of the application.