CN108492818B - Text-to-speech conversion method and device and computer equipment - Google Patents
- Publication number: CN108492818B
- Application number: CN201810241804.9A
- Authority
- CN
- China
- Prior art keywords
- text
- frame
- linear spectrum
- converted
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Signal Processing (AREA)
- Machine Translation (AREA)
- Telephonic Communication Services (AREA)
Abstract
The application provides a text-to-speech conversion method and apparatus and a computer device. The text-to-speech conversion method comprises the following steps: acquiring the frames corresponding to a text to be converted; acquiring the vector features and text prosody features of the phone corresponding to the current frame, and acquiring the mapping features of the linear spectrum corresponding to the previous frame of the current frame; inputting the acquired vector features, text prosody features and mapping features into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame; and, after the linear spectra of the frames corresponding to the text to be converted are obtained, obtaining the speech corresponding to the text to be converted according to those linear spectra. Because the speech is obtained directly from the linear spectra of the frames corresponding to the text to be converted and no approximation error is introduced, the resulting speech has better sound quality and better overall fluency.
Description
Technical Field
The present application relates to the field of speech synthesis technologies, and in particular, to a method and an apparatus for converting text to speech, and a computer device.
Background
TTS (Text To Speech) refers to converting text into speech; it is the part of a human-machine dialogue system that allows a machine to speak from text. A phone is the smallest unit of human speech; in Chinese, each initial or final is a phone. For a machine to produce the corresponding sounds from text, an acoustic model needs to be built for each phone.
In the prior art, a vocoder is used for modeling. The speech signal is first divided into frames, and the acoustic model of each frame is built from three parts: (1) whether the frame is voiced; (2) the fundamental frequency of the frame; (3) the impulse response of the frame with respect to the fundamental frequency. However, this approach relies on approximations derived from prior knowledge of human speech, so some information is lost, and the final synthesized speech sounds mechanical and unnatural and cannot match a natural human voice.
Disclosure of Invention
The present application is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, a first objective of the present application is to provide a text-to-speech conversion method, so that the speech corresponding to a text to be converted can be obtained directly from the linear spectra of the frames corresponding to that text; because no approximation error is introduced, the resulting speech has better sound quality and better overall fluency.
A second object of the present application is to propose a text to speech conversion device.
A third object of the present application is to propose a computer device.
A fourth object of the present application is to propose a non-transitory computer-readable storage medium.
In order to achieve the above object, an embodiment of a first aspect of the present application provides a method for converting text into speech, including: acquiring the frames corresponding to a text to be converted; acquiring the vector features and text prosody features of the phone corresponding to the current frame, and acquiring the mapping features of the linear spectrum corresponding to the previous frame of the current frame; inputting the acquired vector features, text prosody features and mapping features into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame; and, after the linear spectra of the frames corresponding to the text to be converted are obtained, obtaining the speech corresponding to the text to be converted according to those linear spectra.
In the text-to-speech conversion method of the embodiment of the application, after the frames corresponding to a text to be converted are acquired, the vector features and text prosody features of the phone corresponding to the current frame and the mapping features of the linear spectrum corresponding to the previous frame of the current frame are acquired, and the acquired vector features, text prosody features and mapping features are then input into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame. After the linear spectra of the frames corresponding to the text to be converted are obtained, the speech corresponding to the text to be converted is obtained according to those linear spectra, so the speech can be obtained directly from the linear spectra of the frames corresponding to the text to be converted.
In order to achieve the above object, a second aspect of the present application provides a text-to-speech conversion apparatus, including: an acquisition module, configured to acquire the frames corresponding to a text to be converted, acquire the vector features and text prosody features of the phone corresponding to the current frame, and acquire the mapping features of the linear spectrum corresponding to the previous frame of the current frame; an input module, configured to input the vector features, text prosody features and mapping features acquired by the acquisition module into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame; and a conversion module, configured to obtain, after the linear spectra of the frames corresponding to the text to be converted are obtained, the speech corresponding to the text to be converted according to those linear spectra.
In the text-to-speech conversion apparatus of the embodiment of the application, after the acquisition module acquires the frames corresponding to a text to be converted, the vector features and text prosody features of the phone corresponding to the current frame and the mapping features of the linear spectrum corresponding to the previous frame of the current frame are acquired, and the acquired vector features, text prosody features and mapping features are then input by the input module into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame. After the linear spectra of the frames corresponding to the text to be converted are obtained, the conversion module obtains the speech corresponding to the text to be converted according to those linear spectra, so the speech can be obtained directly from the linear spectra of the frames corresponding to the text to be converted.
To achieve the above object, an embodiment of a third aspect of the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the method as described above when executing the computer program.
In order to achieve the above object, a fourth aspect of the present application provides a non-transitory computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing the method as described above.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flowchart of an embodiment of a text-to-speech conversion method of the present application;
FIG. 2 is a schematic diagram of a neural network composed of two fully-connected layers in the text-to-speech conversion method according to the present application;
FIG. 3 is a flowchart of another embodiment of a text-to-speech conversion method of the present application;
FIG. 4 is a flowchart of another embodiment of a text-to-speech conversion method of the present application;
FIG. 5 is a schematic structural diagram of a neural network model in the text-to-speech conversion method according to the present application;
FIG. 6 is a schematic structural diagram of a convolution unit in the text-to-speech conversion method according to the present application;
FIG. 7 is a flowchart of yet another embodiment of a text-to-speech conversion method of the present application;
FIG. 8 is a schematic structural diagram of an embodiment of a text-to-speech conversion apparatus according to the present application;
FIG. 9 is a schematic structural diagram of an embodiment of a computer apparatus according to the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
In conventional text-to-speech conversion schemes, speech is not generated directly from the frequency spectrum of the audio; instead, acoustic parameters such as the fundamental frequency and the spectral envelope are extracted from the audio, and speech is then generated from these indirect acoustic features. Because of the complexity of the sound spectrum, a large number of approximations are inevitably introduced to simplify the conversion process, which ultimately distorts the synthesized speech to some degree. To avoid this problem, the present application performs speech conversion directly from the linear spectra corresponding to the text to be converted.
Fig. 1 is a flowchart of an embodiment of a text-to-speech conversion method according to the present application, and as shown in fig. 1, the text-to-speech conversion method may include:
Step 101: acquire the frames corresponding to the text to be converted. Specifically, this may be: acquiring the frames corresponding to each phone of the text to be converted.
Taking the text to be converted, "Baidu", as an example, the phones of "Baidu" are b, ai, d and u; therefore, acquiring the frames corresponding to the text to be converted may be: acquiring the frames corresponding to b, ai, d and u respectively.
Step 102: acquire the vector features and text prosody features of the phone corresponding to the current frame, and acquire the mapping features of the linear spectrum corresponding to the previous frame of the current frame. The vector features of the phone corresponding to the current frame may be the one-hot features of that phone, a basic phonetic unit such as an initial or a final; the text prosody features of the phone corresponding to the current frame may be supplementary features of that phone, such as the part of speech of the word containing the phone and/or the word vector of that word.
Still taking the example in which the text to be converted is "Baidu", suppose that "b" corresponds to 5 frames and the phone corresponding to each of those 5 frames is "b"; then, for each of the 5 frames, the vector features and text prosody features of the phone corresponding to the current frame are the vector features and text prosody features corresponding to the phone "b".
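As a minimal illustrative sketch (the phone inventory, function name and use of Python/NumPy are assumptions, not part of the disclosure), such one-hot vector features could be built as follows:

```python
import numpy as np

# Illustrative (incomplete) phone inventory of initials and finals.
PHONES = ["b", "p", "m", "f", "d", "t", "n", "l", "ai", "u", "ei", "an"]

def phone_onehot(phone):
    """One-hot vector feature for a single phone, e.g. phone_onehot("b")."""
    vec = np.zeros(len(PHONES), dtype=np.float32)
    vec[PHONES.index(phone)] = 1.0
    return vec
```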
Specifically, acquiring the mapping features of the linear spectrum corresponding to the previous frame of the current frame may be: inputting the linear spectrum corresponding to the previous frame of the current frame into a neural network composed of two fully connected layers for mapping, so as to obtain the mapping features of that linear spectrum.
Fig. 2 is a schematic diagram of the neural network composed of two fully connected layers in the text-to-speech conversion method of the present application. After the linear spectrum corresponding to the previous frame is obtained, it is input into the neural network shown in fig. 2 for mapping, and the mapping features of the linear spectrum corresponding to the previous frame are obtained.
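A minimal sketch of such a two-layer fully connected mapping network, assuming a PyTorch implementation with placeholder layer sizes (the 513/256/128 dimensions are illustrative assumptions, not values from the disclosure):

```python
import torch
import torch.nn as nn

class SpectrumMappingNet(nn.Module):
    """Maps the previous frame's linear spectrum to a lower-dimensional mapping feature."""
    def __init__(self, spec_dim=513, hidden_dim=256, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(spec_dim, hidden_dim),   # first fully connected layer
            nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),   # second fully connected layer
        )

    def forward(self, prev_spectrum):
        # prev_spectrum: (batch, spec_dim) linear spectrum of the previous frame
        return self.net(prev_spectrum)         # (batch, feat_dim) mapping feature
```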
Step 103: input the obtained vector features, text prosody features and mapping features into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame.
Specifically, after the linear spectrum corresponding to the current frame is obtained, it may be input into the neural network shown in fig. 2 to obtain its mapping features, which are then concatenated with the vector features and text prosody features of the phone corresponding to the next frame to serve as the input features of the next frame.
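A hedged sketch of this feature assembly for the next frame (the helper names `phone_onehot_feat` and `prosody_feat` and all dimensions are hypothetical):

```python
import torch

def assemble_next_frame_input(phone_onehot_feat, prosody_feat, current_spectrum, mapping_net):
    """Concatenate the next frame's phone and prosody features with the mapping
    feature of the current frame's linear spectrum (which is the 'previous
    frame' from the next frame's point of view)."""
    mapping_feat = mapping_net(current_spectrum)  # (batch, feat_dim)
    return torch.cat([phone_onehot_feat, prosody_feat, mapping_feat], dim=-1)
```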
Step 104: after the linear spectra of the frames corresponding to the text to be converted are obtained, obtain the speech corresponding to the text to be converted according to those linear spectra.
Specifically, after the linear spectra of all frames corresponding to the text to be converted are obtained according to the methods described in steps 102 and 103, the speech corresponding to the text to be converted can be obtained from those linear spectra.
In the text-to-speech conversion method described above, after the frames corresponding to the text to be converted are acquired, the vector features and text prosody features of the phone corresponding to the current frame and the mapping features of the linear spectrum corresponding to the previous frame are acquired, and the acquired vector features, text prosody features and mapping features are input into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame. After the linear spectra of the frames corresponding to the text to be converted are obtained, the speech corresponding to the text to be converted is obtained from those linear spectra, so the speech can be obtained directly from the linear spectra of the frames corresponding to the text to be converted.
Fig. 3 is a flowchart of another embodiment of the text-to-speech conversion method of the present application, as shown in fig. 3, in the embodiment shown in fig. 1 of the present application, step 104 may include:
Step 301: obtain the phase spectrum of the speech corresponding to the text to be converted by iterating from the linear spectra of the frames corresponding to the text to be converted.
Step 302: generate the speech corresponding to the text to be converted through an inverse short-time Fourier transform algorithm according to the phase spectrum of the speech corresponding to the text to be converted.
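A minimal sketch of such iterative phase recovery followed by an inverse short-time Fourier transform (a Griffin-Lim-style procedure; the use of librosa, the 16 kHz sample rate and the iteration count are assumptions for illustration):

```python
import numpy as np
import librosa

def spectra_to_speech(linear_spectra, sr=16000, n_iter=50):
    """linear_spectra: (n_freq, n_frames) magnitude spectrogram predicted by the model."""
    n_fft = (linear_spectra.shape[0] - 1) * 2          # assumes n_freq = n_fft // 2 + 1
    hop = int(0.0125 * sr)                             # 12.5 ms hop
    win = int(0.025 * sr)                              # 25 ms window
    # Iteratively estimate the phase spectrum, then invert with the inverse STFT.
    angles = np.exp(2j * np.pi * np.random.rand(*linear_spectra.shape))
    for _ in range(n_iter):
        audio = librosa.istft(linear_spectra * angles, hop_length=hop, win_length=win)
        rebuilt = librosa.stft(audio, n_fft=n_fft, hop_length=hop, win_length=win)
        angles = np.exp(1j * np.angle(rebuilt))
    return librosa.istft(linear_spectra * angles, hop_length=hop, win_length=win)
```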
Fig. 4 is a flowchart of a text-to-speech conversion method according to another embodiment of the present application, and as shown in fig. 4, in the embodiment shown in fig. 1 of the present application, step 103 may include:
Step 401: reduce the dimensionality of the acquired vector features, text prosody features and mapping features through a fully connected layer, feed the result into a network composed of at least two convolution units with residual connections, and upsample the output through another fully connected layer to obtain the linear spectrum corresponding to the current frame.
That is to say, in this embodiment, the structure of the neural network model may be as shown in fig. 5, and fig. 5 is a schematic structural diagram of the neural network model in the text-to-speech conversion method of the present application. The output of the neural network model shown in fig. 5 is the linear spectrum corresponding to the current frame, i.e., the magnitude spectrum obtained by short-time Fourier transform of the audio corresponding to the current frame. To obtain this spectrum, the audio is first divided into frames of 25 ms each, with a 12.5 ms overlap between adjacent frames; a Fourier transform is then applied to each frame, and its magnitude is taken as the linear spectrum corresponding to that frame.
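For illustration, one possible way to compute such a frame-level linear spectrum from audio (the librosa usage and the 16 kHz sample rate are assumptions; the 25 ms / 12.5 ms values follow the description above):

```python
import numpy as np
import librosa

def compute_linear_spectrum(wav_path, sr=16000):
    audio, _ = librosa.load(wav_path, sr=sr)
    win = int(0.025 * sr)                      # 25 ms frame length
    hop = int(0.0125 * sr)                     # 12.5 ms hop (overlap between frames)
    stft = librosa.stft(audio, n_fft=win, hop_length=hop, win_length=win)
    return np.abs(stft)                        # magnitude spectrum, one column per frame
```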
The convolution unit comprises two one-dimensional convolution kernels. The output of one convolution kernel is normalized to a predetermined interval by an activation function; the output of the convolution unit is the element-wise product of this normalized output and the output of the other convolution kernel, which does not pass through the activation function.
The activation function may be a sigmoid activation function, and the predetermined interval may be the (0,1) interval. The structure of the convolution unit may be as shown in fig. 6, which is a schematic structural diagram of the convolution unit in the text-to-speech conversion method of the present application.
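A hedged PyTorch sketch of such a gated convolution unit (the channel count, kernel size, and placement of the residual connection around the unit are illustrative assumptions):

```python
import torch
import torch.nn as nn

class GatedConvUnit(nn.Module):
    """Two 1-D convolutions; one output is squashed to (0, 1) by a sigmoid
    and gates the other by element-wise multiplication."""
    def __init__(self, channels=256, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.gate = nn.Conv1d(channels, channels, kernel_size, padding=pad)

    def forward(self, x):
        # x: (batch, channels, frames); residual connection around the gated product
        return x + self.conv(x) * torch.sigmoid(self.gate(x))
```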
In this embodiment, a convolutional neural network model is used in place of a recurrent neural network, which improves the parallel operation efficiency of the network and greatly increases the speed of speech synthesis. In addition, the convolutional neural network model used in this embodiment can act like a filter, so it also provides good speech conversion on training data with a low signal-to-noise ratio.
Fig. 7 is a flowchart of a text-to-speech conversion method according to another embodiment of the present application, as shown in fig. 7, in the embodiment shown in fig. 1 of the present application, before step 103, the method may further include:
Step 701: acquire the frames corresponding to a training text.
Step 702: acquire the vector features and text prosody features of the phone corresponding to the current frame in the frames corresponding to the training text, and acquire the mapping features of the linear spectrum of the speech frame that precedes the current frame in the speech corresponding to the training text.
Step 703: input the obtained vector features, text prosody features and mapping features into the neural network model to be trained to obtain the linear spectrum corresponding to the current frame.
Step 704: after the linear spectra of the frames corresponding to the training text are obtained, obtain the speech corresponding to the training text according to those linear spectra.
Step 705: when the obtained speech corresponding to the training text meets a predetermined condition, obtain the trained neural network model.
The above steps 701 to 705 describe a training process of the neural network model used in the embodiment of the present application.
As can be seen from step 702, when the neural network model is trained, the mapping features of the previous frame come from the linear spectrum of the preceding frame of the real speech corresponding to the training text, whereas when the model is actually used, they come from the linear spectrum of the previous frame output by the neural network model itself.
In step 705, the obtained speech corresponding to the training text meeting the predetermined condition may be: compared with the real-person speech of the training text, the speech corresponding to the training text obtained through the neural network model to be trained has errors of tone quality and overall fluency within a predetermined range, and certainly, the predetermined range can be set by the user according to system performance and/or implementation requirements and the like during specific implementation, and the size of the predetermined range is not limited in the embodiment.
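As an illustrative sketch of how one training step of steps 701 to 705 might look in code (assuming a PyTorch implementation, teacher forcing with the ground-truth previous spectrum and an L1 loss; none of these specifics is stated in the disclosure):

```python
import torch
import torch.nn.functional as F

def train_step(model, mapping_net, optimizer, phone_feats, prosody_feats, target_spectra):
    """phone_feats, prosody_feats: (frames, dim); target_spectra: (frames, n_freq)."""
    optimizer.zero_grad()
    # Teacher forcing: the previous frame's *ground-truth* spectrum feeds the mapping net.
    prev = torch.cat([torch.zeros_like(target_spectra[:1]), target_spectra[:-1]], dim=0)
    inputs = torch.cat([phone_feats, prosody_feats, mapping_net(prev)], dim=-1)
    pred = model(inputs)                         # predicted linear spectra, (frames, n_freq)
    loss = F.l1_loss(pred, target_spectra)
    loss.backward()
    optimizer.step()
    return loss.item()
```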
The text-to-speech conversion method described above performs speech conversion directly from the linear spectra corresponding to the text to be converted. Compared with schemes that perform speech conversion from acoustic features obtained by approximate estimation, the conversion error is smaller, so the resulting speech has better sound quality and better overall fluency. In addition, a convolutional neural network model is used in place of a recurrent neural network, which improves the parallel operation efficiency of the network and greatly increases the speed of audio synthesis; moreover, the convolutional neural network model can act like a filter, so the method also provides good speech conversion on training data with a low signal-to-noise ratio.
Fig. 8 is a schematic structural diagram of an embodiment of a text-to-speech conversion apparatus according to the present application, where the text-to-speech conversion apparatus according to the present application may implement the text-to-speech conversion method according to the present application. As shown in fig. 8, the text-to-speech conversion apparatus may include: an acquisition module 81, an input module 82 and a conversion module 83;
the acquiring module 81 is configured to acquire the frames corresponding to a text to be converted, acquire the vector features and text prosody features of the phone corresponding to the current frame, and acquire the mapping features of the linear spectrum corresponding to the previous frame of the current frame. Specifically, acquiring the frames corresponding to the text to be converted may be: acquiring the frames corresponding to each phone of the text to be converted.
Taking the text to be converted, "Baidu", as an example, the phones of "Baidu" are b, ai, d and u; therefore, acquiring the frames corresponding to the text to be converted may be: acquiring the frames corresponding to b, ai, d and u respectively.
The vector features of the phone corresponding to the current frame may be the one-hot features of that phone, a basic phonetic unit such as an initial or a final; the text prosody features of the phone corresponding to the current frame may be supplementary features of that phone, such as the part of speech of the word containing the phone and/or the word vector of that word.
Still taking the example in which the text to be converted is "Baidu", suppose that "b" corresponds to 5 frames and the phone corresponding to each of those 5 frames is "b"; then, for each of the 5 frames, the vector features and text prosody features of the phone corresponding to the current frame are the vector features and text prosody features corresponding to the phone "b".
Specifically, the obtaining module 81 is specifically configured to input the linear spectrum corresponding to the previous frame of the current frame into a neural network composed of two fully connected layers for mapping, so as to obtain a mapping feature of the linear spectrum corresponding to the previous frame of the current frame.
After obtaining the linear spectrum corresponding to the previous frame, the obtaining module 81 may input the linear spectrum corresponding to the previous frame into the neural network shown in fig. 2 for mapping, so as to obtain the mapping feature of the linear spectrum corresponding to the previous frame.
And the input module 82 is configured to input the vector features, the text prosody features, and the mapping features acquired by the acquisition module 81 into a pre-trained neural network model, so as to obtain a linear spectrum corresponding to the current frame.
In this embodiment, after the input module 82 obtains the linear spectrum corresponding to the current frame, the obtaining module 81 may input it into the neural network shown in fig. 2 to obtain its mapping features, which are then concatenated with the vector features and text prosody features of the phone corresponding to the next frame to serve as the input features of the next frame.
The conversion module 83 is configured to obtain the speech corresponding to the text to be converted according to the linear spectrum of the frame corresponding to the text to be converted after obtaining the linear spectrum of the frame corresponding to the text to be converted.
Specifically, after obtaining the linear spectra of all frames corresponding to the text to be converted, the conversion module 83 may obtain the speech corresponding to the text to be converted from those linear spectra.
In this embodiment, the conversion module 83 is specifically configured to obtain a phase spectrum of a speech corresponding to the text to be converted through iteration from a linear spectrum of a frame corresponding to the text to be converted; and generating the voice corresponding to the text to be converted through an inverse short-time Fourier transform algorithm according to the phase spectrum of the voice corresponding to the text to be converted.
In this embodiment, the input module 82 is specifically configured to reduce the dimensionality of the acquired vector features, text prosody features and mapping features through a fully connected layer, input the result into a network composed of at least two convolution units with residual connections, and upsample the output through another fully connected layer to obtain the linear spectrum corresponding to the current frame.
That is, in the present embodiment, the structure of the neural network model may be as shown in fig. 5. The output of the neural network model shown in fig. 5 is the linear spectrum corresponding to the current frame, i.e., the magnitude spectrum obtained by short-time Fourier transform of the audio corresponding to the current frame. To obtain this spectrum, the audio is first divided into frames of 25 ms each, with a 12.5 ms overlap between adjacent frames; a Fourier transform is then applied to each frame, and its magnitude is taken as the linear spectrum corresponding to that frame.
The convolution unit comprises two one-dimensional convolution kernels. The output of one convolution kernel is normalized to a predetermined interval by an activation function; the output of the convolution unit is the element-wise product of this normalized output and the output of the other convolution kernel, which does not pass through the activation function.
The activation function may be a sigmoid activation function, the predetermined interval may be a (0,1) interval, and the convolution unit may have a structure as shown in fig. 6.
In this embodiment, a convolutional neural network model is used in place of a recurrent neural network, which improves the parallel operation efficiency of the network and greatly increases the speed of speech synthesis. In addition, the convolutional neural network model used in this embodiment can act like a filter, so it also provides good speech conversion on training data with a low signal-to-noise ratio.
Further, the obtaining module 81 is further configured to acquire the frames corresponding to a training text; acquire the vector features and text prosody features of the phone corresponding to the current frame in the frames corresponding to the training text; and acquire the mapping features of the linear spectrum of the speech frame that precedes the current frame in the speech corresponding to the training text;
the input module 82 is further configured to input the vector features, text prosody features, and mapping features acquired by the acquisition module 81 into a neural network model to be trained, so as to obtain a linear spectrum corresponding to the current frame;
the conversion module 83 is further configured to, after obtaining the linear spectrum of the frame corresponding to the training text, obtain the speech corresponding to the training text according to the linear spectrum of the frame corresponding to the training text;
the obtaining module 81 is further configured to obtain a trained neural network model when the obtained speech corresponding to the training text meets a predetermined condition.
When the neural network model is trained, the mapping features of the previous frame acquired by the acquiring module 81 come from the linear spectrum of the preceding frame of the speech corresponding to the training text, whereas when the neural network model is actually used, they come from the linear spectrum of the previous frame output by the neural network model.
The obtained speech corresponding to the training text meeting the predetermined condition may be: compared with the real-person speech of the training text, the speech corresponding to the training text obtained through the neural network model to be trained has errors of tone quality and overall fluency within a predetermined range, and certainly, the predetermined range can be set by the user according to system performance and/or implementation requirements and the like during specific implementation, and the size of the predetermined range is not limited in the embodiment.
In the text-to-speech conversion apparatus described above, after the frames corresponding to the text to be converted are acquired, the vector features and text prosody features of the phone corresponding to the current frame and the mapping features of the linear spectrum corresponding to the previous frame are acquired, and the acquired vector features, text prosody features and mapping features are input into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame. After the linear spectra of the frames corresponding to the text to be converted are obtained, the speech corresponding to the text to be converted is obtained from those linear spectra, so the speech can be obtained directly from the linear spectra of the frames corresponding to the text to be converted.
Fig. 9 is a schematic structural diagram of an embodiment of a computer device according to the present application, where the computer device may include a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the method for converting text into speech provided in the embodiment of the present application may be implemented.
The computer device may be a server or a terminal device, and the terminal device may be an intelligent terminal device such as a personal computer (PC) or a notebook computer; this embodiment does not limit the specific form of the computer device.
FIG. 9 illustrates a block diagram of an exemplary computer device 12 suitable for use in implementing embodiments of the present application. The computer device 12 shown in fig. 9 is only an example, and should not bring any limitation to the function and the scope of use of the embodiments of the present application.
As shown in FIG. 9, computer device 12 is in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
The system Memory 28 may include computer system readable media in the form of volatile Memory, such as Random Access Memory (RAM) 30 and/or cache Memory 32. Computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 9, and commonly referred to as a "hard drive"). Although not shown in FIG. 9, a disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact disk Read Only memory (CD-ROM), a Digital versatile disk Read Only memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally perform the functions and/or methodologies of the embodiments described herein.
The processing unit 16 executes various functional applications and data processing, such as implementing a text-to-speech conversion method provided by an embodiment of the present application, by running a program stored in the system memory 28.
The embodiment of the present application further provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, can implement the method for converting text into speech provided in the embodiment of the present application.
The non-transitory computer readable storage medium described above may take any combination of one or more computer readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM) or flash Memory, an optical fiber, a portable compact disc Read Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection portion (electronic device) having one or more wires, a portable computer cartridge (magnetic device), a Random Access Memory (RAM), a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM) or a flash Memory, an optical fiber device, and a portable Compact Disc Read Only Memory (CD-ROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or a combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit (ASIC) having appropriate combinational logic gate circuits, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), and the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.
Although embodiments of the present application have been shown and described above, it should be understood that the above embodiments are exemplary and should not be construed as limiting the present application; variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.
Claims (12)
1. A method for text to speech conversion, comprising:
acquiring a frame corresponding to a text to be converted;
acquiring vector features and text prosody features of the phone corresponding to a current frame in the frame, and acquiring mapping features of a linear spectrum corresponding to a previous frame of the current frame;
inputting the obtained vector features, text prosody features and mapping features into a pre-trained neural network model to obtain a linear spectrum corresponding to the current frame;
after the linear spectrum of the frame corresponding to the text to be converted is obtained, obtaining the voice corresponding to the text to be converted according to the linear spectrum of the frame corresponding to the text to be converted;
the obtaining of the mapping feature of the linear spectrum corresponding to the previous frame of the current frame includes:
and inputting the linear spectrum corresponding to the previous frame of the current frame into a neural network consisting of two fully-connected layers for mapping to obtain the mapping characteristic of the linear spectrum corresponding to the previous frame of the current frame.
2. The method according to claim 1, wherein the obtaining the speech corresponding to the text to be converted according to the linear spectrum of the frame corresponding to the text to be converted comprises:
obtaining a phase spectrum of the voice corresponding to the text to be converted through iteration from the linear spectrum of the frame corresponding to the text to be converted;
and generating the voice corresponding to the text to be converted through an inverse short-time Fourier transform algorithm according to the phase spectrum of the voice corresponding to the text to be converted.
3. The method of claim 1, wherein the inputting the obtained vector feature, text prosody feature and mapping feature into a pre-trained neural network model, and obtaining the linear spectrum corresponding to the current frame comprises:
and after the acquired vector features, text prosody features and mapping features are subjected to dimensionality reduction through a full connection layer, the reduced dimensionality is input into a network consisting of at least two convolution units with residual connection, and after the reduced dimensionality is subjected to upsampling through a full connection layer, a linear spectrum corresponding to the current frame is obtained.
4. The method of claim 3, wherein the convolution unit comprises two one-dimensional convolution kernels, wherein the output of one convolution kernel is normalized to a predetermined interval by an activation function; the output of the convolution unit is the product of the output result of one convolution kernel normalized to a preset interval and the corresponding element of the output result of the other convolution kernel which does not pass the activation function.
5. The method according to any one of claims 1 to 4, wherein before inputting the obtained vector feature, text prosody feature and mapping feature into a pre-trained neural network model and obtaining the linear spectrum corresponding to the current frame, the method further comprises:
acquiring a frame corresponding to a training text;
acquiring vector features and text prosody features of the phone corresponding to a current frame in the frame corresponding to the training text, and acquiring mapping features of a linear spectrum corresponding to a previous speech frame of the current frame in the speech corresponding to the training text;
inputting the obtained vector features, text prosody features and mapping features into a neural network model to be trained to obtain a linear spectrum corresponding to the current frame;
after the linear spectrum of the frame corresponding to the training text is obtained, obtaining the voice corresponding to the training text according to the linear spectrum of the frame corresponding to the training text;
and when the obtained speech corresponding to the training text meets a preset condition, obtaining a trained neural network model.
6. A text-to-speech conversion apparatus, comprising:
the acquisition module is used for acquiring a frame corresponding to a text to be converted; acquiring vector features and text prosody features of the phone corresponding to a current frame in the frame, and acquiring mapping features of a linear spectrum corresponding to a previous frame of the current frame;
the input module is used for inputting the vector characteristics, the text prosody characteristics and the mapping characteristics acquired by the acquisition module into a pre-trained neural network model to acquire a linear spectrum corresponding to the current frame;
the conversion module is used for obtaining the voice corresponding to the text to be converted according to the linear spectrum of the frame corresponding to the text to be converted after obtaining the linear spectrum of the frame corresponding to the text to be converted;
the obtaining module is specifically configured to input the linear spectrum corresponding to the previous frame of the current frame into a neural network composed of two fully-connected layers for mapping, so as to obtain a mapping feature of the linear spectrum corresponding to the previous frame of the current frame.
7. The apparatus of claim 6,
the conversion module is specifically configured to obtain a phase spectrum of a speech corresponding to the text to be converted through iteration from a linear spectrum of a frame corresponding to the text to be converted; and generating the voice corresponding to the text to be converted through an inverse short-time Fourier transform algorithm according to the phase spectrum of the voice corresponding to the text to be converted.
8. The apparatus of claim 6,
the input module is specifically configured to perform dimensionality reduction on the acquired vector features, text prosody features and mapping features through a full connection layer, input the reduced vector features, text prosody features and mapping features into a network composed of at least two convolution units with residual connection, perform upsampling through a full connection layer, and obtain a linear spectrum corresponding to the current frame.
9. The apparatus of claim 8, wherein the convolution unit comprises two one-dimensional convolution kernels, wherein an output of one convolution kernel is normalized to a predetermined interval by an activation function; the output of the convolution unit is the product of the output result of one convolution kernel normalized to a preset interval and the corresponding element of the output result of the other convolution kernel which does not pass the activation function.
10. The apparatus according to any one of claims 6 to 9,
the acquisition module is also used for acquiring a frame corresponding to the training text; acquiring vector features and text prosody features of the phone corresponding to a current frame in the frame corresponding to the training text, and acquiring mapping features of a linear spectrum corresponding to a previous speech frame of the current frame in the speech corresponding to the training text;
the input module is further configured to input the vector features, the text prosody features and the mapping features acquired by the acquisition module into a neural network model to be trained, so as to obtain a linear spectrum corresponding to the current frame;
the conversion module is further configured to, after obtaining the linear spectrum of the frame corresponding to the training text, obtain the speech corresponding to the training text according to the linear spectrum of the frame corresponding to the training text;
the obtaining module is further configured to obtain a trained neural network model when the obtained speech corresponding to the training text meets a predetermined condition.
11. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1-5 when executing the computer program.
12. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-5.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810241804.9A CN108492818B (en) | 2018-03-22 | 2018-03-22 | Text-to-speech conversion method and device and computer equipment |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN108492818A CN108492818A (en) | 2018-09-04 |
| CN108492818B (en) | 2020-10-30 |
Family
ID=63319329
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201810241804.9A Active CN108492818B (en) | 2018-03-22 | 2018-03-22 | Text-to-speech conversion method and device and computer equipment |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN108492818B (en) |
Families Citing this family (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109754778B (en) * | 2019-01-17 | 2023-05-30 | 平安科技(深圳)有限公司 | Text speech synthesis method and device and computer equipment |
| CN110164413B (en) * | 2019-05-13 | 2021-06-04 | 北京百度网讯科技有限公司 | Speech synthesis method, apparatus, computer device and storage medium |
| CN110335587B (en) * | 2019-06-14 | 2023-11-10 | 平安科技(深圳)有限公司 | Speech synthesis method, system, terminal device and readable storage medium |
| CN112289297B (en) * | 2019-07-25 | 2024-08-02 | 阿里巴巴集团控股有限公司 | Speech synthesis method, device and system |
| US11322135B2 (en) | 2019-09-12 | 2022-05-03 | International Business Machines Corporation | Generating acoustic sequences via neural networks using combined prosody info |
| CN110473516B (en) * | 2019-09-19 | 2020-11-27 | 百度在线网络技术(北京)有限公司 | Speech synthesis method, device and electronic device |
| CN111862931B (en) * | 2020-05-08 | 2024-09-24 | 北京嘀嘀无限科技发展有限公司 | Voice generation method and device |
| CN112464649A (en) * | 2020-11-25 | 2021-03-09 | 平安科技(深圳)有限公司 | Pinyin conversion method and device for polyphone, computer equipment and storage medium |
| CN113763924B (en) * | 2021-11-08 | 2022-02-15 | 北京优幕科技有限责任公司 | Acoustic deep learning model training method, and voice generation method and device |
| CN116312464A (en) * | 2023-03-23 | 2023-06-23 | 平安科技(深圳)有限公司 | Audio synthesis method, device, electronic equipment and computer readable storage medium |
Citations (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| GB2326321A (en) * | 1997-06-13 | 1998-12-16 | Motorola Inc | Speech synthesis using neural networks |
| GB2326320A (en) * | 1997-06-13 | 1998-12-16 | Motorola Inc | Text to speech synthesis using neural network |
| CA2161540C (en) * | 1994-04-28 | 2000-06-13 | Orhan Karaali | A method and apparatus for converting text into audible signals using a neural network |
| CN101996628A (en) * | 2009-08-21 | 2011-03-30 | 索尼株式会社 | Method and device for extracting prosodic features of speech signal |
| US8527276B1 (en) * | 2012-10-25 | 2013-09-03 | Google Inc. | Speech synthesis using deep neural networks |
| CN104867489A (en) * | 2015-04-27 | 2015-08-26 | 苏州大学张家港工业技术研究院 | Method and system for simulating reading and pronunciation of real person |
| CN105336321A (en) * | 2015-09-25 | 2016-02-17 | 百度在线网络技术(北京)有限公司 | Phonetic segmentation method and device for speech synthesis |
| CN105679306A (en) * | 2016-02-19 | 2016-06-15 | 上海语知义信息技术有限公司 | Method and system for predicting fundamental frequency frame in voice synthesis |
| CN107039034A (en) * | 2016-02-04 | 2017-08-11 | 科大讯飞股份有限公司 | A kind of prosody prediction method and system |
| CN107452369A (en) * | 2017-09-28 | 2017-12-08 | 百度在线网络技术(北京)有限公司 | Phonetic synthesis model generating method and device |
| CN107680580A (en) * | 2017-09-28 | 2018-02-09 | 百度在线网络技术(北京)有限公司 | Text transformation model training method and device, text conversion method and device |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8401849B2 (en) * | 2008-12-18 | 2013-03-19 | Lessac Technologies, Inc. | Methods employing phase state analysis for use in speech synthesis and recognition |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN108492818B (en) | Text-to-speech conversion method and device and computer equipment | |
| JP7701490B2 (en) | Text-to-speech synthesis in a target speaker's voice using neural networks | |
| US11842728B2 (en) | Training neural networks to predict acoustic sequences using observed prosody info | |
| CN113450765B (en) | Speech synthesis method, device, equipment and storage medium | |
| CN106887225B (en) | Acoustic feature extraction method and device based on convolutional neural network and terminal equipment | |
| CN103635960B (en) | Statistical Enhancement of Speech Output from Statistical Text to Speech Synthesis Systems | |
| Eskimez et al. | Adversarial training for speech super-resolution | |
| CN111161702A (en) | Personalized speech synthesis method and device, electronic equipment and storage medium | |
| CA3195582A1 (en) | Audio generator and methods for generating an audio signal and training an audio generator | |
| CN116994553A (en) | Training method of speech synthesis model, speech synthesis method, device and equipment | |
| US11335321B2 (en) | Building a text-to-speech system from a small amount of speech data | |
| JP7664330B2 (en) | Turn off text echo | |
| CN115497451B (en) | Voice processing method, device, electronic device and storage medium | |
| US11295762B2 (en) | Unsupervised speech decomposition | |
| Gu et al. | Waveform Modeling Using Stacked Dilated Convolutional Neural Networks for Speech Bandwidth Extension. | |
| CN113160849B (en) | Singing voice synthesizing method, singing voice synthesizing device, electronic equipment and computer readable storage medium | |
| CN110930975B (en) | Method and device for outputting information | |
| JP2022133447A (en) | Speech processing method, device, electronic device and storage medium | |
| CN114842880A (en) | Intelligent customer service voice rhythm adjustment method, device, equipment and storage medium | |
| CN114495896B (en) | A voice playing method and computer device | |
| CN114141259A (en) | Speech conversion method, apparatus, device, storage medium and program product | |
| US20110313772A1 (en) | System and method for unit selection text-to-speech using a modified viterbi approach | |
| CN113066472B (en) | Synthetic voice processing method and related device | |
| Mukherjee et al. | A Bengali speech synthesizer on Android OS | |
| CN118447820A (en) | Voice conversion method, device, equipment and medium based on style |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |