CN108492818B - Text-to-speech conversion method and device and computer equipment - Google Patents
- Publication number: CN108492818B
- Application number: CN201810241804.9A
- Authority
- CN
- China
- Prior art keywords
- text
- frame
- linear spectrum
- converted
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Signal Processing (AREA)
- Machine Translation (AREA)
- Telephonic Communication Services (AREA)
Abstract
The application provides a text-to-speech conversion method and apparatus and a computer device. The text-to-speech conversion method comprises the following steps: acquiring the frames corresponding to a text to be converted; acquiring the vector features and text prosody features of the phone corresponding to the current frame, and acquiring the mapping features of the linear spectrum corresponding to the previous frame of the current frame; inputting the acquired vector features, text prosody features and mapping features into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame; and, after the linear spectra of the frames corresponding to the text to be converted are obtained, obtaining the speech corresponding to the text to be converted according to those linear spectra. Because the speech is obtained directly from the linear spectra of the frames corresponding to the text to be converted and no approximation error is introduced, the resulting speech has better sound quality and better overall fluency.
Description
Technical Field
The present application relates to the field of speech synthesis technologies, and in particular, to a method and an apparatus for converting text to speech, and a computer device.
Background
TTS (Text To Speech) refers to converting text into speech; it is the part of a human-machine dialogue system that allows a machine to speak from text. A phone is the smallest unit of human speech; in Chinese, each initial or final is a phone. For a machine to produce the corresponding sounds from text, an acoustic model needs to be built for each phone.
In the prior art, a vocoder is used for modeling. The speech signal is first divided into frames, and the acoustic model of each frame is built from three parts: (1) whether the frame is voiced; (2) the fundamental frequency of the frame; (3) the impulse response of the frame with respect to the fundamental frequency. However, this approach relies on approximations derived from prior knowledge of human speech, so some information is lost, and the final synthesized speech sounds mechanical and unnatural and cannot match a natural human voice.
Disclosure of Invention
The present application is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, a first objective of the present application is to provide a text-to-speech conversion method, so that the speech corresponding to a text to be converted can be obtained directly from the linear spectra of the frames corresponding to that text; because no approximation error is introduced, the resulting speech has better sound quality and better overall fluency.
A second object of the present application is to propose a text to speech conversion device.
A third object of the present application is to propose a computer device.
A fourth object of the present application is to propose a non-transitory computer-readable storage medium.
In order to achieve the above object, an embodiment of a first aspect of the present application provides a method for converting text into speech, including: acquiring the frames corresponding to a text to be converted; acquiring the vector features and text prosody features of the phone corresponding to the current frame, and acquiring the mapping features of the linear spectrum corresponding to the previous frame of the current frame; inputting the acquired vector features, text prosody features and mapping features into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame; and, after the linear spectra of the frames corresponding to the text to be converted are obtained, obtaining the speech corresponding to the text to be converted according to those linear spectra.
In the text-to-speech conversion method of the embodiment of the application, after the frames corresponding to a text to be converted are acquired, the vector features and text prosody features of the phone corresponding to the current frame and the mapping features of the linear spectrum corresponding to the previous frame of the current frame are acquired, and the acquired vector features, text prosody features and mapping features are then input into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame. After the linear spectra of the frames corresponding to the text to be converted are obtained, the speech corresponding to the text to be converted is obtained according to those linear spectra, so the speech can be obtained directly from the linear spectra of the frames corresponding to the text to be converted.
In order to achieve the above object, a second aspect of the present application provides a text-to-speech conversion apparatus, including: an acquisition module, configured to acquire the frames corresponding to a text to be converted, acquire the vector features and text prosody features of the phone corresponding to the current frame, and acquire the mapping features of the linear spectrum corresponding to the previous frame of the current frame; an input module, configured to input the vector features, text prosody features and mapping features acquired by the acquisition module into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame; and a conversion module, configured to obtain, after the linear spectra of the frames corresponding to the text to be converted are obtained, the speech corresponding to the text to be converted according to those linear spectra.
In the text-to-speech conversion apparatus of the embodiment of the application, after the acquisition module acquires the frames corresponding to a text to be converted, the vector features and text prosody features of the phone corresponding to the current frame and the mapping features of the linear spectrum corresponding to the previous frame of the current frame are acquired, and the acquired vector features, text prosody features and mapping features are then input by the input module into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame. After the linear spectra of the frames corresponding to the text to be converted are obtained, the conversion module obtains the speech corresponding to the text to be converted according to those linear spectra, so the speech can be obtained directly from the linear spectra of the frames corresponding to the text to be converted.
To achieve the above object, an embodiment of a third aspect of the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the method as described above when executing the computer program.
In order to achieve the above object, a fourth aspect of the present application provides a non-transitory computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing the method as described above.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flowchart of an embodiment of a text-to-speech conversion method of the present application;
FIG. 2 is a schematic diagram of a neural network composed of two fully-connected layers in the text-to-speech conversion method according to the present application;
FIG. 3 is a flowchart of another embodiment of a text-to-speech conversion method of the present application;
FIG. 4 is a flowchart of another embodiment of a text-to-speech conversion method of the present application;
FIG. 5 is a schematic structural diagram of a neural network model in the text-to-speech conversion method according to the present application;
FIG. 6 is a schematic structural diagram of a convolution unit in the text-to-speech conversion method according to the present application;
FIG. 7 is a flowchart of yet another embodiment of a text-to-speech conversion method of the present application;
FIG. 8 is a schematic structural diagram of an embodiment of a text-to-speech conversion apparatus according to the present application;
FIG. 9 is a schematic structural diagram of an embodiment of a computer apparatus according to the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
In conventional text-to-speech conversion schemes, speech is not generated directly from the frequency spectrum of the audio; instead, acoustic parameters such as the fundamental frequency and the spectral envelope are extracted from the audio, and speech is then generated from these indirect acoustic features. Because of the complexity of the sound spectrum, a large number of approximations are inevitably introduced to simplify the conversion process, which ultimately distorts the synthesized speech to some degree. To avoid this problem, the present application performs speech conversion directly from the linear spectra corresponding to the text to be converted.
Fig. 1 is a flowchart of an embodiment of a text-to-speech conversion method according to the present application, and as shown in fig. 1, the text-to-speech conversion method may include:
Step 101: acquire the frames corresponding to the text to be converted. Specifically, this may be: acquiring the frames corresponding to each phone of the text to be converted.
Taking the text to be converted, "Baidu", as an example, the phones of "Baidu" are b, ai, d and u; therefore, acquiring the frames corresponding to the text to be converted may be: acquiring the frames corresponding to b, ai, d and u respectively.
Step 102: acquire the vector features and text prosody features of the phone corresponding to the current frame, and acquire the mapping features of the linear spectrum corresponding to the previous frame of the current frame. The vector features of the phone corresponding to the current frame may be the one-hot features of that phone, a basic phonetic unit such as an initial or a final; the text prosody features of the phone corresponding to the current frame may be supplementary features of that phone, such as the part of speech of the word containing the phone and/or the word vector of that word.
Still taking the example in which the text to be converted is "Baidu", suppose that "b" corresponds to 5 frames and the phone corresponding to each of those 5 frames is "b"; then, for each of the 5 frames, the vector features and text prosody features of the phone corresponding to the current frame are the vector features and text prosody features corresponding to the phone "b".
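As a minimal illustrative sketch (the phone inventory, function name and use of Python/NumPy are assumptions, not part of the disclosure), such one-hot vector features could be built as follows:

```python
import numpy as np

# Illustrative (incomplete) phone inventory of initials and finals.
PHONES = ["b", "p", "m", "f", "d", "t", "n", "l", "ai", "u", "ei", "an"]

def phone_onehot(phone):
    """One-hot vector feature for a single phone, e.g. phone_onehot("b")."""
    vec = np.zeros(len(PHONES), dtype=np.float32)
    vec[PHONES.index(phone)] = 1.0
    return vec
```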
Specifically, acquiring the mapping features of the linear spectrum corresponding to the previous frame of the current frame may be: inputting the linear spectrum corresponding to the previous frame of the current frame into a neural network composed of two fully connected layers for mapping, so as to obtain the mapping features of that linear spectrum.
Fig. 2 is a schematic diagram of the neural network composed of two fully connected layers in the text-to-speech conversion method of the present application. After the linear spectrum corresponding to the previous frame is obtained, it is input into the neural network shown in fig. 2 for mapping, and the mapping features of the linear spectrum corresponding to the previous frame are obtained.
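A minimal sketch of such a two-layer fully connected mapping network, assuming a PyTorch implementation with placeholder layer sizes (the 513/256/128 dimensions are illustrative assumptions, not values from the disclosure):

```python
import torch
import torch.nn as nn

class SpectrumMappingNet(nn.Module):
    """Maps the previous frame's linear spectrum to a lower-dimensional mapping feature."""
    def __init__(self, spec_dim=513, hidden_dim=256, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(spec_dim, hidden_dim),   # first fully connected layer
            nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),   # second fully connected layer
        )

    def forward(self, prev_spectrum):
        # prev_spectrum: (batch, spec_dim) linear spectrum of the previous frame
        return self.net(prev_spectrum)         # (batch, feat_dim) mapping feature
```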
Step 103: input the obtained vector features, text prosody features and mapping features into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame.
Specifically, after the linear spectrum corresponding to the current frame is obtained, it may be input into the neural network shown in fig. 2 to obtain its mapping features, which are then concatenated with the vector features and text prosody features of the phone corresponding to the next frame to serve as the input features of the next frame.
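A hedged sketch of this feature assembly for the next frame (the helper names `phone_onehot_feat` and `prosody_feat` and all dimensions are hypothetical):

```python
import torch

def assemble_next_frame_input(phone_onehot_feat, prosody_feat, current_spectrum, mapping_net):
    """Concatenate the next frame's phone and prosody features with the mapping
    feature of the current frame's linear spectrum (which is the 'previous
    frame' from the next frame's point of view)."""
    mapping_feat = mapping_net(current_spectrum)  # (batch, feat_dim)
    return torch.cat([phone_onehot_feat, prosody_feat, mapping_feat], dim=-1)
```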
Step 104: after the linear spectra of the frames corresponding to the text to be converted are obtained, obtain the speech corresponding to the text to be converted according to those linear spectra.
Specifically, after the linear spectra of all frames corresponding to the text to be converted are obtained according to the methods described in steps 102 and 103, the speech corresponding to the text to be converted can be obtained from those linear spectra.
In the text-to-speech conversion method described above, after the frames corresponding to the text to be converted are acquired, the vector features and text prosody features of the phone corresponding to the current frame and the mapping features of the linear spectrum corresponding to the previous frame are acquired, and the acquired vector features, text prosody features and mapping features are input into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame. After the linear spectra of the frames corresponding to the text to be converted are obtained, the speech corresponding to the text to be converted is obtained from those linear spectra, so the speech can be obtained directly from the linear spectra of the frames corresponding to the text to be converted.
Fig. 3 is a flowchart of another embodiment of the text-to-speech conversion method of the present application, as shown in fig. 3, in the embodiment shown in fig. 1 of the present application, step 104 may include:
Step 301: obtain the phase spectrum of the speech corresponding to the text to be converted by iterating from the linear spectra of the frames corresponding to the text to be converted.
Step 302: generate the speech corresponding to the text to be converted through an inverse short-time Fourier transform algorithm according to the phase spectrum of the speech corresponding to the text to be converted.
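A minimal sketch of such iterative phase recovery followed by an inverse short-time Fourier transform (a Griffin-Lim-style procedure; the use of librosa, the 16 kHz sample rate and the iteration count are assumptions for illustration):

```python
import numpy as np
import librosa

def spectra_to_speech(linear_spectra, sr=16000, n_iter=50):
    """linear_spectra: (n_freq, n_frames) magnitude spectrogram predicted by the model."""
    n_fft = (linear_spectra.shape[0] - 1) * 2          # assumes n_freq = n_fft // 2 + 1
    hop = int(0.0125 * sr)                             # 12.5 ms hop
    win = int(0.025 * sr)                              # 25 ms window
    # Iteratively estimate the phase spectrum, then invert with the inverse STFT.
    angles = np.exp(2j * np.pi * np.random.rand(*linear_spectra.shape))
    for _ in range(n_iter):
        audio = librosa.istft(linear_spectra * angles, hop_length=hop, win_length=win)
        rebuilt = librosa.stft(audio, n_fft=n_fft, hop_length=hop, win_length=win)
        angles = np.exp(1j * np.angle(rebuilt))
    return librosa.istft(linear_spectra * angles, hop_length=hop, win_length=win)
```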
Fig. 4 is a flowchart of a text-to-speech conversion method according to another embodiment of the present application, and as shown in fig. 4, in the embodiment shown in fig. 1 of the present application, step 103 may include:
Step 401: reduce the dimensionality of the acquired vector features, text prosody features and mapping features through a fully connected layer, feed the result into a network composed of at least two convolution units with residual connections, and upsample the output through another fully connected layer to obtain the linear spectrum corresponding to the current frame.
That is to say, in this embodiment, the structure of the neural network model may be as shown in fig. 5, and fig. 5 is a schematic structural diagram of the neural network model in the text-to-speech conversion method of the present application. The output of the neural network model shown in fig. 5 is the linear spectrum corresponding to the current frame, i.e., the magnitude spectrum obtained by short-time Fourier transform of the audio corresponding to the current frame. To obtain this spectrum, the audio is first divided into frames of 25 ms each, with a 12.5 ms overlap between adjacent frames; a Fourier transform is then applied to each frame, and its magnitude is taken as the linear spectrum corresponding to that frame.
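For illustration, one possible way to compute such a frame-level linear spectrum from audio (the librosa usage and the 16 kHz sample rate are assumptions; the 25 ms / 12.5 ms values follow the description above):

```python
import numpy as np
import librosa

def compute_linear_spectrum(wav_path, sr=16000):
    audio, _ = librosa.load(wav_path, sr=sr)
    win = int(0.025 * sr)                      # 25 ms frame length
    hop = int(0.0125 * sr)                     # 12.5 ms hop (overlap between frames)
    stft = librosa.stft(audio, n_fft=win, hop_length=hop, win_length=win)
    return np.abs(stft)                        # magnitude spectrum, one column per frame
```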
The convolution unit comprises two one-dimensional convolution kernels. The output of one convolution kernel is normalized to a predetermined interval by an activation function; the output of the convolution unit is the element-wise product of this normalized output and the output of the other convolution kernel, which does not pass through the activation function.
The activation function may be a sigmoid activation function, and the predetermined interval may be the (0,1) interval. The structure of the convolution unit may be as shown in fig. 6, which is a schematic structural diagram of the convolution unit in the text-to-speech conversion method of the present application.
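A hedged PyTorch sketch of such a gated convolution unit (the channel count, kernel size, and placement of the residual connection around the unit are illustrative assumptions):

```python
import torch
import torch.nn as nn

class GatedConvUnit(nn.Module):
    """Two 1-D convolutions; one output is squashed to (0, 1) by a sigmoid
    and gates the other by element-wise multiplication."""
    def __init__(self, channels=256, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.gate = nn.Conv1d(channels, channels, kernel_size, padding=pad)

    def forward(self, x):
        # x: (batch, channels, frames); residual connection around the gated product
        return x + self.conv(x) * torch.sigmoid(self.gate(x))
```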
In this embodiment, a convolutional neural network model is used in place of a recurrent neural network, which improves the parallel operation efficiency of the network and greatly increases the speed of speech synthesis. In addition, the convolutional neural network model used in this embodiment can act like a filter, so it also provides good speech conversion on training data with a low signal-to-noise ratio.
Fig. 7 is a flowchart of a text-to-speech conversion method according to another embodiment of the present application, as shown in fig. 7, in the embodiment shown in fig. 1 of the present application, before step 103, the method may further include:
Step 701: acquire the frames corresponding to a training text.
Step 702: acquire the vector features and text prosody features of the phone corresponding to the current frame in the frames corresponding to the training text, and acquire the mapping features of the linear spectrum of the speech frame that precedes the current frame in the speech corresponding to the training text.
Step 703: input the obtained vector features, text prosody features and mapping features into the neural network model to be trained to obtain the linear spectrum corresponding to the current frame.
Step 704: after the linear spectra of the frames corresponding to the training text are obtained, obtain the speech corresponding to the training text according to those linear spectra.
Step 705: when the obtained speech corresponding to the training text meets a predetermined condition, obtain the trained neural network model.
The above steps 701 to 705 describe a training process of the neural network model used in the embodiment of the present application.
As can be seen from step 702, when the neural network model is trained, the mapping features of the previous frame come from the linear spectrum of the preceding frame of the real speech corresponding to the training text, whereas when the model is actually used, they come from the linear spectrum of the previous frame output by the neural network model itself.
In step 705, the obtained speech corresponding to the training text meeting the predetermined condition may be: compared with the real-person speech of the training text, the speech corresponding to the training text obtained through the neural network model to be trained has errors of tone quality and overall fluency within a predetermined range, and certainly, the predetermined range can be set by the user according to system performance and/or implementation requirements and the like during specific implementation, and the size of the predetermined range is not limited in the embodiment.
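As an illustrative sketch of how one training step of steps 701 to 705 might look in code (assuming a PyTorch implementation, teacher forcing with the ground-truth previous spectrum and an L1 loss; none of these specifics is stated in the disclosure):

```python
import torch
import torch.nn.functional as F

def train_step(model, mapping_net, optimizer, phone_feats, prosody_feats, target_spectra):
    """phone_feats, prosody_feats: (frames, dim); target_spectra: (frames, n_freq)."""
    optimizer.zero_grad()
    # Teacher forcing: the previous frame's *ground-truth* spectrum feeds the mapping net.
    prev = torch.cat([torch.zeros_like(target_spectra[:1]), target_spectra[:-1]], dim=0)
    inputs = torch.cat([phone_feats, prosody_feats, mapping_net(prev)], dim=-1)
    pred = model(inputs)                         # predicted linear spectra, (frames, n_freq)
    loss = F.l1_loss(pred, target_spectra)
    loss.backward()
    optimizer.step()
    return loss.item()
```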
The text-to-speech conversion method described above performs speech conversion directly from the linear spectra corresponding to the text to be converted. Compared with schemes that perform speech conversion from acoustic features obtained by approximate estimation, the conversion error is smaller, so the resulting speech has better sound quality and better overall fluency. In addition, a convolutional neural network model is used in place of a recurrent neural network, which improves the parallel operation efficiency of the network and greatly increases the speed of audio synthesis; moreover, the convolutional neural network model can act like a filter, so the method also provides good speech conversion on training data with a low signal-to-noise ratio.
Fig. 8 is a schematic structural diagram of an embodiment of a text-to-speech conversion apparatus according to the present application, where the text-to-speech conversion apparatus according to the present application may implement the text-to-speech conversion method according to the present application. As shown in fig. 8, the text-to-speech conversion apparatus may include: an acquisition module 81, an input module 82 and a conversion module 83;
the acquiring module 81 is configured to acquire the frames corresponding to a text to be converted, acquire the vector features and text prosody features of the phone corresponding to the current frame, and acquire the mapping features of the linear spectrum corresponding to the previous frame of the current frame. Specifically, acquiring the frames corresponding to the text to be converted may be: acquiring the frames corresponding to each phone of the text to be converted.
Taking the text to be converted, "Baidu", as an example, the phones of "Baidu" are b, ai, d and u; therefore, acquiring the frames corresponding to the text to be converted may be: acquiring the frames corresponding to b, ai, d and u respectively.
The vector features of the phone corresponding to the current frame may be the one-hot features of that phone, a basic phonetic unit such as an initial or a final; the text prosody features of the phone corresponding to the current frame may be supplementary features of that phone, such as the part of speech of the word containing the phone and/or the word vector of that word.
Still taking the example in which the text to be converted is "Baidu", suppose that "b" corresponds to 5 frames and the phone corresponding to each of those 5 frames is "b"; then, for each of the 5 frames, the vector features and text prosody features of the phone corresponding to the current frame are the vector features and text prosody features corresponding to the phone "b".
Specifically, the obtaining module 81 is specifically configured to input the linear spectrum corresponding to the previous frame of the current frame into a neural network composed of two fully connected layers for mapping, so as to obtain a mapping feature of the linear spectrum corresponding to the previous frame of the current frame.
After obtaining the linear spectrum corresponding to the previous frame, the obtaining module 81 may input the linear spectrum corresponding to the previous frame into the neural network shown in fig. 2 for mapping, so as to obtain the mapping feature of the linear spectrum corresponding to the previous frame.
And the input module 82 is configured to input the vector features, the text prosody features, and the mapping features acquired by the acquisition module 81 into a pre-trained neural network model, so as to obtain a linear spectrum corresponding to the current frame.
In this embodiment, after the input module 82 obtains the linear spectrum corresponding to the current frame, the obtaining module 81 may input it into the neural network shown in fig. 2 to obtain its mapping features, which are then concatenated with the vector features and text prosody features of the phone corresponding to the next frame to serve as the input features of the next frame.
The conversion module 83 is configured to obtain the speech corresponding to the text to be converted according to the linear spectrum of the frame corresponding to the text to be converted after obtaining the linear spectrum of the frame corresponding to the text to be converted.
Specifically, after obtaining the linear spectra of all frames corresponding to the text to be converted, the conversion module 83 may obtain the speech corresponding to the text to be converted from those linear spectra.
In this embodiment, the conversion module 83 is specifically configured to obtain a phase spectrum of a speech corresponding to the text to be converted through iteration from a linear spectrum of a frame corresponding to the text to be converted; and generating the voice corresponding to the text to be converted through an inverse short-time Fourier transform algorithm according to the phase spectrum of the voice corresponding to the text to be converted.
In this embodiment, the input module 82 is specifically configured to reduce the dimensionality of the acquired vector features, text prosody features and mapping features through a fully connected layer, input the result into a network composed of at least two convolution units with residual connections, and upsample the output through another fully connected layer to obtain the linear spectrum corresponding to the current frame.
That is, in the present embodiment, the structure of the neural network model may be as shown in fig. 5. The output of the neural network model shown in fig. 5 is the linear spectrum corresponding to the current frame, i.e., the magnitude spectrum obtained by short-time Fourier transform of the audio corresponding to the current frame. To obtain this spectrum, the audio is first divided into frames of 25 ms each, with a 12.5 ms overlap between adjacent frames; a Fourier transform is then applied to each frame, and its magnitude is taken as the linear spectrum corresponding to that frame.
The convolution unit comprises two one-dimensional convolution kernels. The output of one convolution kernel is normalized to a predetermined interval by an activation function; the output of the convolution unit is the element-wise product of this normalized output and the output of the other convolution kernel, which does not pass through the activation function.
The activation function may be a sigmoid activation function, the predetermined interval may be a (0,1) interval, and the convolution unit may have a structure as shown in fig. 6.
In this embodiment, a convolutional neural network model is used in place of a recurrent neural network, which improves the parallel operation efficiency of the network and greatly increases the speed of speech synthesis. In addition, the convolutional neural network model used in this embodiment can act like a filter, so it also provides good speech conversion on training data with a low signal-to-noise ratio.
Further, the obtaining module 81 is further configured to acquire the frames corresponding to a training text; acquire the vector features and text prosody features of the phone corresponding to the current frame in the frames corresponding to the training text; and acquire the mapping features of the linear spectrum of the speech frame that precedes the current frame in the speech corresponding to the training text;
the input module 82 is further configured to input the vector features, text prosody features, and mapping features acquired by the acquisition module 81 into a neural network model to be trained, so as to obtain a linear spectrum corresponding to the current frame;
the conversion module 83 is further configured to, after obtaining the linear spectrum of the frame corresponding to the training text, obtain the speech corresponding to the training text according to the linear spectrum of the frame corresponding to the training text;
the obtaining module 81 is further configured to obtain a trained neural network model when the obtained speech corresponding to the training text meets a predetermined condition.
When the neural network model is trained, the mapping features of the previous frame acquired by the acquiring module 81 come from the linear spectrum of the preceding frame of the speech corresponding to the training text, whereas when the neural network model is actually used, they come from the linear spectrum of the previous frame output by the neural network model.
The obtained speech corresponding to the training text meeting the predetermined condition may be: compared with the real-person speech of the training text, the speech corresponding to the training text obtained through the neural network model to be trained has errors of tone quality and overall fluency within a predetermined range, and certainly, the predetermined range can be set by the user according to system performance and/or implementation requirements and the like during specific implementation, and the size of the predetermined range is not limited in the embodiment.
In the text-to-speech conversion apparatus described above, after the frames corresponding to the text to be converted are acquired, the vector features and text prosody features of the phone corresponding to the current frame and the mapping features of the linear spectrum corresponding to the previous frame are acquired, and the acquired vector features, text prosody features and mapping features are input into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame. After the linear spectra of the frames corresponding to the text to be converted are obtained, the speech corresponding to the text to be converted is obtained from those linear spectra, so the speech can be obtained directly from the linear spectra of the frames corresponding to the text to be converted.
Fig. 9 is a schematic structural diagram of an embodiment of a computer device according to the present application, where the computer device may include a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the method for converting text into speech provided in the embodiment of the present application may be implemented.
The computer device may be a server or a terminal device, and the terminal device may be an intelligent terminal device such as a personal computer (PC) or a notebook computer; this embodiment does not limit the specific form of the computer device.
FIG. 9 illustrates a block diagram of an exemplary computer device 12 suitable for use in implementing embodiments of the present application. The computer device 12 shown in fig. 9 is only an example, and should not bring any limitation to the function and the scope of use of the embodiments of the present application.
As shown in FIG. 9, computer device 12 is in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
The system Memory 28 may include computer system readable media in the form of volatile Memory, such as Random Access Memory (RAM) 30 and/or cache Memory 32. Computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 9, and commonly referred to as a "hard drive"). Although not shown in FIG. 9, a disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact disk Read Only memory (CD-ROM), a Digital versatile disk Read Only memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally perform the functions and/or methodologies of the embodiments described herein.
The processing unit 16 executes various functional applications and data processing, such as implementing a text-to-speech conversion method provided by an embodiment of the present application, by running a program stored in the system memory 28.
The embodiment of the present application further provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, can implement the method for converting text into speech provided in the embodiment of the present application.
The non-transitory computer readable storage medium described above may take any combination of one or more computer readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM) or flash Memory, an optical fiber, a portable compact disc Read Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection portion (electronic device) having one or more wires, a portable computer cartridge (magnetic device), a Random Access Memory (RAM), a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM) or a flash Memory, an optical fiber device, and a portable Compact Disc Read Only Memory (CD-ROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or a combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit (ASIC) having appropriate combinational logic gate circuits, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), and the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.
Although embodiments of the present application have been shown and described above, it should be understood that the above embodiments are exemplary and should not be construed as limiting the present application; variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.
Claims (12)
1. A method for text to speech conversion, comprising:
acquiring a frame corresponding to a text to be converted;
acquiring vector features and text prosody features of the phone corresponding to a current frame in the frame, and acquiring mapping features of a linear spectrum corresponding to a previous frame of the current frame;
inputting the obtained vector features, text prosody features and mapping features into a pre-trained neural network model to obtain a linear spectrum corresponding to the current frame;
after the linear spectrum of the frame corresponding to the text to be converted is obtained, obtaining the voice corresponding to the text to be converted according to the linear spectrum of the frame corresponding to the text to be converted;
the obtaining of the mapping feature of the linear spectrum corresponding to the previous frame of the current frame includes:
and inputting the linear spectrum corresponding to the previous frame of the current frame into a neural network consisting of two fully-connected layers for mapping to obtain the mapping characteristic of the linear spectrum corresponding to the previous frame of the current frame.
2. The method according to claim 1, wherein the obtaining the speech corresponding to the text to be converted according to the linear spectrum of the frame corresponding to the text to be converted comprises:
obtaining a phase spectrum of the voice corresponding to the text to be converted through iteration from the linear spectrum of the frame corresponding to the text to be converted;
and generating the voice corresponding to the text to be converted through an inverse short-time Fourier transform algorithm according to the phase spectrum of the voice corresponding to the text to be converted.
3. The method of claim 1, wherein the inputting the obtained vector feature, text prosody feature and mapping feature into a pre-trained neural network model, and obtaining the linear spectrum corresponding to the current frame comprises:
and after the acquired vector features, text prosody features and mapping features are subjected to dimensionality reduction through a full connection layer, the reduced dimensionality is input into a network consisting of at least two convolution units with residual connection, and after the reduced dimensionality is subjected to upsampling through a full connection layer, a linear spectrum corresponding to the current frame is obtained.
4. The method of claim 3, wherein the convolution unit comprises two one-dimensional convolution kernels, wherein the output of one convolution kernel is normalized to a predetermined interval by an activation function; the output of the convolution unit is the product of the output result of one convolution kernel normalized to a preset interval and the corresponding element of the output result of the other convolution kernel which does not pass the activation function.
5. The method according to any one of claims 1 to 4, wherein before inputting the obtained vector feature, text prosody feature and mapping feature into a pre-trained neural network model and obtaining the linear spectrum corresponding to the current frame, the method further comprises:
acquiring a frame corresponding to a training text;
acquiring vector features and text prosody features of the phone corresponding to a current frame in the frame corresponding to the training text, and acquiring mapping features of a linear spectrum corresponding to a previous speech frame of the current frame in the speech corresponding to the training text;
inputting the obtained vector features, text prosody features and mapping features into a neural network model to be trained to obtain a linear spectrum corresponding to the current frame;
after the linear spectrum of the frame corresponding to the training text is obtained, obtaining the voice corresponding to the training text according to the linear spectrum of the frame corresponding to the training text;
and when the obtained speech corresponding to the training text meets a preset condition, obtaining a trained neural network model.
6. A text-to-speech conversion apparatus, comprising:
the acquisition module is used for acquiring a frame corresponding to a text to be converted; acquiring vector features and text prosody features of the phone corresponding to a current frame in the frame, and acquiring mapping features of a linear spectrum corresponding to a previous frame of the current frame;
the input module is used for inputting the vector characteristics, the text prosody characteristics and the mapping characteristics acquired by the acquisition module into a pre-trained neural network model to acquire a linear spectrum corresponding to the current frame;
the conversion module is used for obtaining the voice corresponding to the text to be converted according to the linear spectrum of the frame corresponding to the text to be converted after obtaining the linear spectrum of the frame corresponding to the text to be converted;
the obtaining module is specifically configured to input the linear spectrum corresponding to the previous frame of the current frame into a neural network composed of two fully-connected layers for mapping, so as to obtain a mapping feature of the linear spectrum corresponding to the previous frame of the current frame.
7. The apparatus of claim 6,
the conversion module is specifically configured to obtain a phase spectrum of a speech corresponding to the text to be converted through iteration from a linear spectrum of a frame corresponding to the text to be converted; and generating the voice corresponding to the text to be converted through an inverse short-time Fourier transform algorithm according to the phase spectrum of the voice corresponding to the text to be converted.
8. The apparatus of claim 6,
the input module is specifically configured to perform dimensionality reduction on the acquired vector features, text prosody features and mapping features through a full connection layer, input the reduced vector features, text prosody features and mapping features into a network composed of at least two convolution units with residual connection, perform upsampling through a full connection layer, and obtain a linear spectrum corresponding to the current frame.
9. The apparatus of claim 8, wherein the convolution unit comprises two one-dimensional convolution kernels, wherein an output of one convolution kernel is normalized to a predetermined interval by an activation function; the output of the convolution unit is the product of the output result of one convolution kernel normalized to a preset interval and the corresponding element of the output result of the other convolution kernel which does not pass the activation function.
10. The apparatus according to any one of claims 6 to 9,
the acquisition module is also used for acquiring a frame corresponding to the training text; acquiring vector features and text prosody features of the phone corresponding to a current frame in the frame corresponding to the training text, and acquiring mapping features of a linear spectrum corresponding to a previous speech frame of the current frame in the speech corresponding to the training text;
the input module is further configured to input the vector features, the text prosody features and the mapping features acquired by the acquisition module into a neural network model to be trained, so as to obtain a linear spectrum corresponding to the current frame;
the conversion module is further configured to, after obtaining the linear spectrum of the frame corresponding to the training text, obtain the speech corresponding to the training text according to the linear spectrum of the frame corresponding to the training text;
the obtaining module is further configured to obtain a trained neural network model when the obtained speech corresponding to the training text meets a predetermined condition.
11. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1-5 when executing the computer program.
12. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-5.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810241804.9A CN108492818B (en) | 2018-03-22 | 2018-03-22 | Text-to-speech conversion method and device and computer equipment |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN108492818A CN108492818A (en) | 2018-09-04 |
| CN108492818B (en) | 2020-10-30 |
Family
ID=63319329
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201810241804.9A Active CN108492818B (en) | 2018-03-22 | 2018-03-22 | Text-to-speech conversion method and device and computer equipment |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN108492818B (en) |
Families Citing this family (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109754778B (en) * | 2019-01-17 | 2023-05-30 | 平安科技(深圳)有限公司 | Text speech synthesis method and device and computer equipment |
| CN110164413B (en) * | 2019-05-13 | 2021-06-04 | 北京百度网讯科技有限公司 | Speech synthesis method, apparatus, computer device and storage medium |
| CN110335587B (en) * | 2019-06-14 | 2023-11-10 | 平安科技(深圳)有限公司 | Speech synthesis method, system, terminal device and readable storage medium |
| CN112289297B (en) * | 2019-07-25 | 2024-08-02 | 阿里巴巴集团控股有限公司 | Speech synthesis method, device and system |
| US11322135B2 (en) | 2019-09-12 | 2022-05-03 | International Business Machines Corporation | Generating acoustic sequences via neural networks using combined prosody info |
| CN110473516B (en) * | 2019-09-19 | 2020-11-27 | 百度在线网络技术(北京)有限公司 | Speech synthesis method, device and electronic device |
| CN111862931B (en) * | 2020-05-08 | 2024-09-24 | 北京嘀嘀无限科技发展有限公司 | Voice generation method and device |
| CN112464649A (en) * | 2020-11-25 | 2021-03-09 | 平安科技(深圳)有限公司 | Pinyin conversion method and device for polyphone, computer equipment and storage medium |
| CN113763924B (en) * | 2021-11-08 | 2022-02-15 | 北京优幕科技有限责任公司 | Acoustic deep learning model training method, and voice generation method and device |
| CN116312464A (en) * | 2023-03-23 | 2023-06-23 | 平安科技(深圳)有限公司 | Audio synthesis method, device, electronic equipment and computer readable storage medium |
Citations (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| GB2326321A (en) * | 1997-06-13 | 1998-12-16 | Motorola Inc | Speech synthesis using neural networks |
| GB2326320A (en) * | 1997-06-13 | 1998-12-16 | Motorola Inc | Text to speech synthesis using neural network |
| CA2161540C (en) * | 1994-04-28 | 2000-06-13 | Orhan Karaali | A method and apparatus for converting text into audible signals using a neural network |
| CN101996628A (en) * | 2009-08-21 | 2011-03-30 | 索尼株式会社 | Method and device for extracting prosodic features of speech signal |
| US8527276B1 (en) * | 2012-10-25 | 2013-09-03 | Google Inc. | Speech synthesis using deep neural networks |
| CN104867489A (en) * | 2015-04-27 | 2015-08-26 | 苏州大学张家港工业技术研究院 | Method and system for simulating reading and pronunciation of real person |
| CN105336321A (en) * | 2015-09-25 | 2016-02-17 | 百度在线网络技术(北京)有限公司 | Phonetic segmentation method and device for speech synthesis |
| CN105679306A (en) * | 2016-02-19 | 2016-06-15 | 上海语知义信息技术有限公司 | Method and system for predicting fundamental frequency frame in voice synthesis |
| CN107039034A (en) * | 2016-02-04 | 2017-08-11 | 科大讯飞股份有限公司 | A kind of prosody prediction method and system |
| CN107452369A (en) * | 2017-09-28 | 2017-12-08 | 百度在线网络技术(北京)有限公司 | Phonetic synthesis model generating method and device |
| CN107680580A (en) * | 2017-09-28 | 2018-02-09 | 百度在线网络技术(北京)有限公司 | Text transformation model training method and device, text conversion method and device |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8401849B2 (en) * | 2008-12-18 | 2013-03-19 | Lessac Technologies, Inc. | Methods employing phase state analysis for use in speech synthesis and recognition |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN108492818B (en) | Text-to-speech conversion method and device and computer equipment | |
| JP7701490B2 (en) | Text-to-speech synthesis in a target speaker's voice using neural networks | |
| US11842728B2 (en) | Training neural networks to predict acoustic sequences using observed prosody info | |
| CN113450765B (en) | Speech synthesis method, device, equipment and storage medium | |
| CN106887225B (en) | Acoustic feature extraction method and device based on convolutional neural network and terminal equipment | |
| CN103635960B (en) | Statistical Enhancement of Speech Output from Statistical Text to Speech Synthesis Systems | |
| Eskimez et al. | Adversarial training for speech super-resolution | |
| CN111161702A (en) | Personalized speech synthesis method and device, electronic equipment and storage medium | |
| CA3195582A1 (en) | Audio generator and methods for generating an audio signal and training an audio generator | |
| CN116994553A (en) | Training method of speech synthesis model, speech synthesis method, device and equipment | |
| US11335321B2 (en) | Building a text-to-speech system from a small amount of speech data | |
| JP7664330B2 (en) | Turn off text echo | |
| CN115497451B (en) | Voice processing method, device, electronic device and storage medium | |
| US11295762B2 (en) | Unsupervised speech decomposition | |
| Gu et al. | Waveform Modeling Using Stacked Dilated Convolutional Neural Networks for Speech Bandwidth Extension. | |
| CN113160849B (en) | Singing voice synthesizing method, singing voice synthesizing device, electronic equipment and computer readable storage medium | |
| CN110930975B (en) | Method and device for outputting information | |
| JP2022133447A (en) | Speech processing method, device, electronic device and storage medium | |
| CN114842880A (en) | Intelligent customer service voice rhythm adjustment method, device, equipment and storage medium | |
| CN114495896B (en) | A voice playing method and computer device | |
| CN114141259A (en) | Speech conversion method, apparatus, device, storage medium and program product | |
| US20110313772A1 (en) | System and method for unit selection text-to-speech using a modified viterbi approach | |
| CN113066472B (en) | Synthetic voice processing method and related device | |
| Mukherjee et al. | A Bengali speech synthesizer on Android OS | |
| CN118447820A (en) | Voice conversion method, device, equipment and medium based on style |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |