CN114049874A - Method for synthesizing speech - Google Patents
- Publication number
- CN114049874A (application CN202111326772.0A, CN202111326772A)
- Authority
- CN
- China
- Prior art keywords
- template
- text
- acoustic feature
- slot
- preset
- Prior art date
- Legal status: Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Machine Translation (AREA)
- Telephonic Communication Services (AREA)
Abstract
An embodiment of the invention provides a method for synthesizing speech, belonging to the field of artificial intelligence. The method comprises the following steps: finding, in a preset template library, a matching preset template that matches the received text, wherein each preset template in the library comprises template words and slots, and the speech waveform corresponding to the template words is obtained in advance; obtaining, from the received text, the slot content corresponding to the slots included in the matching preset template; generating slot acoustic features corresponding to the slot content; generating a slot speech waveform corresponding to the slot acoustic features; and splicing the slot speech waveform with the speech waveform corresponding to the template words of the matching preset template to obtain the speech waveform corresponding to the received text, thereby synthesizing the speech. The real-time computation needed for speech synthesis is thus reduced, the response latency of the smart-device terminal is lowered, and the user experience is improved.
Description
Technical Field
Embodiments of the present invention relate to a method for synthesizing speech.
Background
With the rapid iteration of computer performance and the wide application of deep learning, intelligent speech interaction technology has developed at an unprecedented pace. Today, human-computer interfaces are transitioning from touch to speech. The purpose of text-to-speech (TTS) technology is to make computers and various smart devices speak like a person. Driven by a constant stream of new machine learning algorithms, TTS can now generate speech that closely approximates, and is at times indistinguishable from, real human speech.
Because mainstream deep learning algorithms depend on substantial computing power, smart-device manufacturers and technical service providers today deploy this component in computing centers and provide intelligent speech services in a cloud-computing mode. For TTS, however, this approach has two significant drawbacks: 1) owing to algorithm complexity, network delay, and other factors, the response latency of the smart-device terminal is large, which hurts the user experience; 2) because user requests are not evenly distributed in time (e.g., request peaks in the morning and evening, troughs in the early-morning hours), vendors must provision excess computing capacity to handle peak load, which wastes computing resources and increases cost.
Disclosure of Invention
To at least partially solve the above problems, an aspect of the embodiments of the present invention provides a method for synthesizing speech, the method comprising: finding, in a preset template library, a matching preset template that matches the received text, wherein each preset template in the library comprises template words and slots, and the speech waveform corresponding to the template words is obtained in advance; obtaining, from the received text, the slot content corresponding to the slots included in the matching preset template; generating slot acoustic features corresponding to the slot content; generating a slot speech waveform corresponding to the slot acoustic features; and splicing the slot speech waveform with the speech waveform corresponding to the template words included in the matching preset template to obtain the speech waveform corresponding to the received text, thereby synthesizing speech.
In addition, another aspect of the embodiments of the present invention provides a machine-readable storage medium storing instructions that cause a machine to execute the above method.
In addition, another aspect of the embodiments of the present invention provides a computer program product comprising a computer program/instructions which, when executed by a processor, implement the steps of the above method.
With the above technical solution, speech is synthesized using preset templates: the speech waveform corresponding to the template words of a preset template is obtained in advance, so that at synthesis time only the speech waveform for the content filled into the template's slots needs to be generated in real time. The speech waveform of the template words and the speech waveform of the slot content are spliced together to synthesize the speech, and the waveform for the entire received text need not be generated in real time. The real-time computation needed for speech synthesis is thus reduced, the response latency of the smart-device terminal is lowered, and the user experience is improved. Moreover, synthesizing speech by means of preset templates shortens the processing time of a single task, so the computing load at user request peaks can be borne without provisioning excess computing equipment, reducing wasted computing resources and cost.
Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the embodiments of the invention without limiting the embodiments of the invention. In the drawings:
FIG. 1 is a flow chart of a method for synthesizing speech provided by an aspect of an embodiment of the present invention;
FIG. 2 is a diagram of matching a preset template according to another embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a compensation signal provided by another embodiment of the present invention;
FIG. 4 is a schematic diagram of the logic of a training compensation model according to another embodiment of the present invention;
FIG. 5 is a schematic diagram of compensation logic provided in accordance with another embodiment of the present invention;
FIG. 6 is a schematic illustration of compensation for adjacent acoustic features provided by another embodiment of the present invention; and
FIG. 7 is a schematic illustration of slot acoustic feature generation and compensation of adjacent acoustic features provided by another embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating embodiments of the invention, are given by way of illustration and explanation only, and are not limiting.
An intelligent speech interaction system contains a natural language generation (NLG) module, which produces the text content responding to the user, and a TTS module, which accepts that text and converts it into speech output to the user. One mainstream NLG design is to author dialog templates, each containing one or more slots; filling the slots with different values yields different response texts. An example dialog template is: "The current time is [ ] o'clock [ ] minutes in the morning." This template contains 2 slots, and filling them with different content generates a class of response texts answering for the current time, e.g.: "The current time is 6 o'clock 58 minutes in the morning."; "The current time is 7 o'clock 45 minutes in the morning."; "The current time is 8 o'clock 20 minutes in the morning." Statistics show that structured response texts generated from dialog templates occur with very high frequency in real intelligent speech interaction systems, especially during the morning and evening request peaks. The design idea of the technical solution provided by the embodiments of the present invention is to prepare preset templates for these high-frequency dialog templates in advance, so that during real-time online synthesis only the speech corresponding to the content filled into the slots is generated. Taking the example above, "The current time is [ ] o'clock [ ] minutes in the morning." is prepared in advance as a preset template, where [ ] denotes a slot. During real-time synthesis, only the digit speech segments such as "6" and "58" are generated and spliced into the template as the final TTS output. The TTS algorithm comprises 3 functional modules: a linguistic feature generation module, an acoustic feature generation module, and a speech waveform generation module. Linguistic features are generated from the text, acoustic features from the linguistic features, and the speech waveform from the acoustic features, yielding the speech signal.
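As a concrete illustration of the dialog-template mechanism just described, the following minimal Python sketch (all names hypothetical, not part of the patent) fills slot values into a template to produce a response text:

```python
# A hypothetical dialog template with two slots, matching the example above.
TEMPLATE = "The current time is [hour] o'clock [minute] minutes in the morning."

def fill_template(template: str, values: dict) -> str:
    """Fill each [slot] in the template with its value."""
    text = template
    for slot, value in values.items():
        text = text.replace(f"[{slot}]", str(value))
    return text

print(fill_template(TEMPLATE, {"hour": 6, "minute": 58}))
# -> "The current time is 6 o'clock 58 minutes in the morning."
```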
One aspect of embodiments of the present invention provides a method for synthesizing speech. Fig. 1 is a flowchart of a method for synthesizing speech according to an embodiment of the present invention. As shown in fig. 1, the method includes the following.
In step S10, a matching preset template that matches the received text is found in the preset template library. In the library, each preset template comprises template words and slots, and the speech waveform corresponding to the template words has been obtained in advance. The received text is the text to be synthesized. In a preset template the slots are not filled with characters, and a preset template may include one or more slots. The preset template in the library that matches the received text is the matching preset template. For example, each preset template in the library has a template ID; when template information can be obtained from the NLG (natural language generation) module, the template ID of the preset template matching the received text can be obtained directly, and the matching preset template found in the library by that ID. In addition, in the embodiment of the present invention, the acoustic features and/or acoustic-feature hidden state and/or speech-waveform hidden state corresponding to the template words may also be obtained in advance and stored in the library, where the acoustic-feature hidden state is the hidden state of the acoustic feature generation network when generating the acoustic features, and the speech-waveform hidden state is the hidden state of the speech waveform generation network when generating the waveform corresponding to the acoustic features. Speech is a typical time-series signal, and when processing time-series signals with neural networks the network usually has to "remember" historical information, such as the cell state c and hidden state h in an LSTM. Hidden states are not the model parameters of the network: model parameters are normally fixed weights, while hidden states are state parameters and/or intermediate values that change over time during inference. Taking an LSTM as an example, the hidden state at time t can be written as [c_t, h_t]. A real model may contain multiple LSTM layers, each with multiple LSTM units, so in practice the hidden state is a sequence of matrices. When the acoustic feature generation network produces the acoustic features of a speech template, suppose the last frame of the template text's acoustic features is generated at time t and the first frame of the slot text's acoustic features at time t+1; then the hidden state [c_t, h_t] at the end of time t is exported and stored as part of the speech template. Similarly, the hidden state of the speech waveform generation network can be exported and stored as part of the speech template.
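To illustrate the hidden-state export just described, here is a hedged PyTorch sketch; the shapes, layer sizes, and file name are illustrative assumptions, and the patent does not prescribe a specific framework:

```python
import torch
import torch.nn as nn

# Sketch: export the hidden state [c_t, h_t] of an LSTM-based generator at
# the boundary between template text and slot text, so that inference can
# later resume from exactly this state.
lstm = nn.LSTM(input_size=80, hidden_size=256, num_layers=2, batch_first=True)

template_frames = torch.randn(1, 40, 80)  # frames of the template text (dummy data)
out, (h_t, c_t) = lstm(template_frames)   # (h_t, c_t): state after the last template frame

# Store as part of the preset template; shapes: (num_layers, batch, hidden).
torch.save({"h": h_t, "c": c_t}, "template_042_hidden_state.pt")

# Later, resume generation of the slot frames from the saved state:
state = torch.load("template_042_hidden_state.pt")
slot_frames = torch.randn(1, 8, 80)       # frames of the slot content (dummy data)
slot_out, _ = lstm(slot_frames, (state["h"], state["c"]))
```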
In step S11, the slot content corresponding to the slots included in the matching preset template is obtained from the received text. For example, the received text may be compared with the matching preset template to obtain the slot content corresponding to each slot.
In step S12, slot acoustic features corresponding to the slot content are generated, for example according to a streaming speech synthesis algorithm.
In step S13, a slot speech waveform corresponding to the slot acoustic features is generated, for example according to a streaming speech synthesis algorithm.
In step S14, the slot speech waveform and the speech waveform corresponding to the template words included in the matching preset template are spliced together to obtain the speech waveform corresponding to the received text, thereby synthesizing the speech.
With the above technical solution, speech is synthesized using preset templates: the speech waveform corresponding to the template words of a preset template is obtained in advance, so that at synthesis time only the speech waveform for the content filled into the template's slots needs to be generated in real time. The speech waveform of the template words and the speech waveform of the slot content are spliced together to synthesize the speech, and the waveform for the entire received text need not be generated in real time. The real-time computation needed for speech synthesis is thus reduced, the response latency of the smart-device terminal is lowered, and the user experience is improved. Moreover, synthesizing speech by means of preset templates shortens the processing time of a single task, so the computing load at user request peaks can be borne without provisioning excess computing equipment, reducing wasted computing resources and cost.
Optionally, in the embodiment of the present invention, creating the preset template library involves two aspects: screening the preset template texts, and making the preset templates. The template texts can be screened as follows: 1) obtain all designed dialog templates from the NLG module; 2) delete dialog templates whose text content is too sparse, e.g. templates in which, after segmenting the template text at the slots, the average segment length is under 2 words; 3) delete fixed phrases, i.e. templates with zero slots; 4) use the remaining dialog templates as the preset templates of this solution. A simple way to make a preset template is to record its audio in advance; however, that is limited to a few specific timbres and cannot be extended to more timbres, and it suits only a small number of pre-made dialog templates and cannot be extended to dialog templates and high-frequency texts added later. In the embodiment of the present invention, making a preset template includes the following steps. 1) Fill example values into the slots of the dialog template, completing it into a full, fluent text. 2) Synthesize the completed text into speech with a streaming speech synthesis algorithm. 3) Record certain intermediate values of the synthesis process, such as the acoustic features of the template text, the hidden state while predicting the acoustics (the hidden state of the acoustic feature generation network when generating the acoustic features), the hidden state of the vocoder when generating the waveform (the hidden state of the speech waveform generation network when generating the speech waveform from the acoustic features), and the speech waveform of the template text, so that the inference context (i.e. the input data, hidden states, and other variables plus the necessary computing resources needed to predict the slot speech) can be quickly restored when slot speech is generated online in real time. 4) Delete from the outputs of steps 2)-3) (i.e. the acoustic features, speech waveform, and hidden states) the parts corresponding to the slot content. 5) Record the template number, the template words, the slot description information, and the result of step 4) as a preset template. The slot description information comprises the number and order of the slots and each slot's offsets into the template's text, acoustic features, and speech waveform data. For example, in the template "The current time is [hour] o'clock [minute] minutes in the morning.", each slot has a recorded character offset in the template text (offsets 7 and 8 in the original-language template). The preset template library is built from this content.
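As an illustration of the preset-template record assembled in step 5), a hedged Python sketch follows; the field names are hypothetical and not the patent's actual storage format:

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class Slot:
    text_offset: int      # offset of the slot in the template text
    feature_offset: int   # offset in the acoustic-feature sequence
    waveform_offset: int  # offset in the speech waveform

@dataclass
class PresetTemplate:
    template_id: int
    segments: List[str]            # template words split at the slots, in order
    slots: List[Slot]              # slot description information, in order
    acoustic_features: np.ndarray  # features of the template words (slot parts removed)
    waveform: np.ndarray           # waveform of the template words (slot parts removed)
    hidden_states: dict = field(default_factory=dict)  # exported network states
```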
Optionally, in the embodiment of the present invention, the matching preset template can be found in the preset template library as follows. The method below suits the case where template information cannot be obtained from the NLG: in most voice interaction systems, information exchange between TTS and NLG is inconvenient, so the most suitable template must be matched in the preset template library from the text alone.
All characters of the received text are ordered from head to tail. For example, a received text of m characters is written c1, c2, …, cm, where ci denotes the i-th character. For every preset template in the library, the following matching operation is performed, so that all preset templates are matched against the received text and the matching preset template is determined. The template words of the preset template are segmented at the slots, giving segmented template words; these are ordered from head to tail according to their positions in the preset template. For example, if the template words of a preset template comprise n segments, they are written t1, t2, …, tn, where ti denotes the i-th segmented template word. The slots of the preset template can likewise be ordered from head to tail; e.g., n−1 slots are written s1, s2, …, s(n−1), where si denotes the i-th slot (see Table 1). The segmented template words are then looked up in the received text in template-word order, where characters of the received text that have already matched a segmented template word are not used again for subsequent lookups. A lookup means finding, in the received text, a character string identical to the segmented template word; lookups proceed in segment order, i.e. first t1 is sought, then t2, and so on. Once some characters of the received text have matched a segment, they are excluded from later lookups: e.g. if characters c1, c2, c3, c4 matched t1, then the lookup of t2, …, tn starts from c5. For a given segmented template word, the lookup must find an identical string in the received text; as in Table 1, a segment t1 of 7 characters must be found as exactly those 7 characters in the received text.
If all segmented template words are found in the received text, whether the preset template matches the received text successfully is judged against the preset matching conditions; if the match succeeds, whether the preset template is set as the optimal template is determined by the optimal-template setting rules, the optimal template being the matching preset template once the matching operation has been performed on every preset template in the library. If the matching operation has finished for all preset templates and no optimal template exists, no matching preset template was found in the library. In the received text, the characters not covered by the template words of the matching preset template are the characters corresponding to the slots. In addition, the slots and segmented template words of the matching preset template can be ordered jointly, and the slot content of each slot determined by comparing this order against the received text.
TABLE 1
Optionally, in the embodiment of the present invention, looking up the segmented template words in the received text in template-word order includes the following. The first segmented template word, i.e. the segment ranked first in the template-word order, is looked up in the received text as follows; for segments t1, t2, …, tn, t1 is the first segmented template word. If no slot precedes t1, t1 is compared against the character string of the received text that starts at the first character and has the same length as t1, the first character being the character ranked first in the character order. For example, if t1 has no leading slot and contains 7 characters, then when searching the characters c1, c2, …, cm, t1 may only be compared with the string c1–c7; c1 is the first character. If a slot precedes t1, t1 is compared against any character string of the received text of the same length whose character sequence numbers are consecutive; that is, the candidate strings of the received text are not restricted. Once t1 is found in the received text, a matching position is set for the second segmented template word t2 from the character length of t1 and the sequence number of the initial character of the string that matched t1, the matching position being the sequence number at which the lookup of the next segment starts. For example, if the string matching t1 is c1–c7 (t1 has length 7 and its matched initial character c1 has sequence number 1), then the matching position for t2 is 8, i.e. the comparison for t2 starts from c8.
The second segmented template word is then looked up in the received text starting from the matching position. When it is found, the matching position is updated from the character length of t2 and the sequence number of the initial character of the string that matched t2, analogously to how the matching position was set for t2. The remaining segmented template words of the preset template are looked up in the received text in the same way, one after another, until all segments have been found. If, at any point in this process, some segmented template word cannot be found, the preset template fails to match the received text and the sequential lookup ends. For example, if no string matching t2 is found, the lookup for this preset template terminates and the match fails.
Optionally, in the embodiment of the present invention, the preset matching conditions include the following. If all segmented template words are found in the received text, but the maximum character sequence number of the received text is greater than or equal to the last matching position corresponding to the last segmented template word, and no slot follows the last segmented template word, then the preset template does not match the received text successfully; here the last segmented template word is the segment ranked last in the template-word order, and the last matching position is the matching position computed from the character length of the last segment and the sequence number of the initial character of the string that matched it. For example, suppose segments t1, …, t5 are all found in a received text c1, …, c9 (maximum sequence number 9), t5 has no trailing slot and contains 2 characters, and the initial character of the string matching t5 has sequence number 6; then the last matching position is 8, and since the maximum sequence number 9 exceeds 8, redundant characters remain and the match is unsuccessful. If all segmented template words are found and the maximum sequence number is smaller than the last matching position, and/or a slot follows the last segmented template word, the preset template matches the received text successfully.
Optionally, in the embodiment of the present invention, the optimal-template setting rules include the following. If no optimal template has been set, the preset template (i.e. the currently successfully matched one) is set as the optimal template. If an optimal template has been set but the preset template contains more characters than the optimal template, the preset template is set as the optimal template. If an optimal template has been set, the preset template contains the same number of characters as the optimal template, but it contains fewer slots, the preset template is set as the optimal template. If the preset template contains fewer characters than the optimal template, it is not set as the optimal template; likewise, if it contains the same number of characters but at least as many slots as the optimal template, it is not set as the optimal template.
Looking up a segmented template word in the received text means matching the segment against the characters of the received text: a matching character string counts as found, otherwise it is not found. Taking segments t1, t2, …, tn and characters c1, c2, …, cm as an example, the lookup process is illustrated in FIG. 2: each segment of the t-sequence is looked up in the received text in turn; if the whole t-sequence can be found and the received text has no redundant characters, the match is considered successful, otherwise it fails. As shown in FIG. 2, in diagram A, t1 has matched c1, c2, so t2 is looked up starting from c3; diagram B shows t2 matched to c4, c5; diagram C shows tn matched to c(m−1), cm, and since the received text has no further characters, the preset template matches the received text successfully. Several preset templates may match the same received text successfully; in that case the preset template with the most characters is chosen, and on a tie in character count, the one with the fewest slots.
Specifically, the matching algorithm may include the following operations, performed on every preset template in the preset template library.
Step 1: segment the template words at the slots, writing the segments as t1, t2, …, tn and the slots as s1, s2, …, s(n−1) (fewer slots than text segments), or s1, s2, …, sn (as many slots as text segments), or s1, s2, …, s(n+1) (more slots than text segments). Sorting the pieces in the order in which they appear in the template gives one of: (1) t1, s1, t2, s2, …, s(n−1), tn; (2) t1, s1, t2, s2, …, s(n−1), tn, sn; (3) s1, t1, s2, t2, …, sn, tn; (4) s1, t1, s2, t2, …, sn, tn, s(n+1).
Step 2: look up the segmented template words in the received text c1, c2, …, cm in order (since a slot, i.e. the s-sequence, can match 0 to many characters, only the text segments, i.e. the t-sequence, are looked up).
Step 2.1: for a preset template of type (1) or (2), compare t1 with c1, c2, …, c_L1, where L1 is the length of t1; if they are identical, update the matching position search_beg to L1+1 and go to step 2.2, otherwise the preset template fails to match. For a template of type (3) or (4), search for t1 in c1, c2, …, cm; if a character string identical to t1 is found starting at c_P1 (c_P1 being its initial character), update search_beg to P1+L1 and go to step 2.2, otherwise the preset template fails to match. In the latter case, the characters from c1 to c_(P1−1) are the slot content corresponding to slot s1.
Step 2.2: if the last segmented template word tn has been matched successfully, update the matching position from the length of tn and the sequence number of the initial character of the string in c1, c2, …, cm that matched tn, and go to step 2.3. Otherwise, search for the next template text segment ti in the input text starting from search_beg; if a string identical to ti is found starting at c_Pi, update search_beg to Pi+Li, where Li is the length of ti, and repeat step 2.2; otherwise the preset template fails to match.
Step 2.3: if m ≥ search_beg and the preset template is of type (1) or (3), the received text has redundant characters relative to the preset template and the match fails; otherwise, the template matches successfully, and step 3 follows.
Step 3: if no optimal template has been set, set the current preset template as the optimal template; otherwise compare the character counts of the current preset template and the optimal template, and if the current one has more characters, set it as the optimal template; if the character counts are equal, compare the slot counts, and if the current preset template has fewer slots, set it as the optimal template. After all preset templates have been matched, if an optimal template was set it is selected; otherwise the received text matched no template in the preset template library and the match fails. A sketch of this matching procedure is given below.
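The following minimal Python sketch of steps 1-3 is illustrative only; all names (match_template, pick_best, head_slot, tail_slot) are assumptions, and the template types (1)-(4) are collapsed into two booleans saying whether the template begins/ends with a slot:

```python
from typing import List, Optional, Tuple

def match_template(text: str, segments: List[str],
                   head_slot: bool, tail_slot: bool) -> Optional[List[str]]:
    """Return the slot contents if the template matches `text`, else None."""
    search_beg = 0
    slot_contents: List[str] = []
    for k, seg in enumerate(segments):
        if k == 0 and not head_slot:
            # Type (1)/(2): t1 must match at the very beginning of the text.
            if not text.startswith(seg):
                return None
            pos = 0
        else:
            pos = text.find(seg, search_beg)   # look up ti from search_beg
            if pos < 0:
                return None
            slot_contents.append(text[search_beg:pos])  # characters consumed by the slot
        search_beg = pos + len(seg)            # update the matching position
    if search_beg < len(text):
        if not tail_slot:
            return None                        # redundant trailing characters
        slot_contents.append(text[search_beg:])
    elif tail_slot:
        slot_contents.append("")               # a slot may match 0 characters
    return slot_contents

def pick_best(candidates: List[Tuple[List[str], List[str]]]):
    """candidates: (segments, slot_contents). Prefer more template characters,
    then fewer slots, as in step 3."""
    best = None
    for segs, slots in candidates:
        key = (sum(len(s) for s in segs), -len(slots))
        if best is None or key > (sum(len(s) for s in best[0]), -len(best[1])):
            best = (segs, slots)
    return best
```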
Since the speech signal is a time-series signal, it is strongly affected by the speech before and after it. If the speech for the slot content were generated directly and spliced into the preset template, the result would sound heavily mechanical: inconsistent fundamental frequency, energy, speed, and rhythm would hurt naturalness, and phase discontinuities would cause glitch noise. In the embodiment of the present invention, to guarantee the continuity of the synthesized speech, the acoustic features of the frames of the preset template adjacent to the slot must be adjusted to a certain extent. Specifically, the acoustic features corresponding to the template words of each preset template are obtained in advance in the library, and the method further includes the following. The adjacent acoustic features among the acoustic features corresponding to the template words of the matching preset template are compensated to obtain compensated acoustic features, where the adjacent acoustic features comprise all acoustic features within a preset number of frames of the slot acoustic features, counted from the template frame nearest the slot. For example, suppose the acoustic features of the template words are i1, i2, i3, i4, …, i20 and the slot precedes acoustic feature i1; with the preset value set to 5, counting upward from i1, all features within count 5 are adjacent acoustic features, i.e. i1, i2, i3, i4, i5. Again with template features i1, …, i20, if the slot follows feature i20, counting downward from i20 gives the adjacent acoustic features i20, i19, i18, i17, i16. If the slot lies between features i11 and i12, counting downward from i11 and upward from i12 gives the adjacent acoustic features i11, i10, i9, i8, i7 and i12, i13, i14, i15, i16. It should be understood that the adjacent acoustic features can be obtained this way regardless of how many slots a preset template has. The speech waveform corresponding to the adjacent acoustic features is then regenerated from the compensated acoustic features, producing an updated adjacent-template-word speech waveform, e.g. generated with a streaming speech synthesis algorithm.
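The index arithmetic of the preceding examples can be captured in a short Python sketch; the function and parameter names are illustrative assumptions:

```python
# Indices of the "adjacent acoustic features" around a slot: up to `k` frames
# on each existing side of the slot (k = 5 in the examples above).
# `slot_before`/`slot_after` are the 0-based frame indices bordering the slot
# (None if the slot sits at the head or tail of the template).
def adjacent_indices(n_frames: int, slot_before, slot_after, k: int = 5):
    idx = []
    if slot_before is not None:  # frames to the left of the slot
        idx += list(range(max(0, slot_before - k + 1), slot_before + 1))
    if slot_after is not None:   # frames to the right of the slot
        idx += list(range(slot_after, min(n_frames, slot_after + k)))
    return idx

# Slot between frames i11 and i12 of the 1-based example (0-based 10 and 11):
print(adjacent_indices(20, slot_before=10, slot_after=11))
# -> [6, 7, 8, 9, 10, 11, 12, 13, 14, 15], i.e. i7-i11 and i12-i16
```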
The splicing then joins the slot speech waveform, the updated adjacent-template-word speech waveform, and the speech waveforms corresponding to the remaining acoustic features of the matching preset template's template words, i.e. those other than the adjacent acoustic features.
Optionally, in the embodiment of the present invention, a compensation signal may be generated, and the adjacent acoustic features may be compensated based on the compensation signal.
The compensation signal is the difference between the model-generated acoustic features and the ideal acoustic features. In the embodiment of the present invention the adjacent acoustic features are compensated, and the ideal acoustic features are the compensated ones, so the compensation signal is the difference between the adjacent acoustic features before and after compensation. The pronunciation of a speech segment is affected by its context, which contributes to the continuity and consistency of the whole utterance. For the same preset template, filling the slots with different content changes the ideal pronunciation of the template, especially for the speech frames adjacent to the slots. The root cause of this difference is that the encoder part of the acoustic-feature model uses the linguistic features of the complete sentence; when the slot content is replaced, those linguistic features change. The goal of the compensation signal is to repair the effect of this change on the pronunciation. Put more plainly: the preset template was generated under a certain condition (a specific text); when the template is used, the condition changes slightly (a small part of the text changes), so a small change must be made to the generated template to adapt it to the new condition, and the magnitude of that change is the compensation signal.
For a trained acoustic model, the acoustic features generated from scratch, i.e. without the preset template of the technical solution provided in the embodiment of the present invention, can be considered ideal acoustic features, such as acoustic features A2 and B2 shown in FIG. 3. In FIG. 3, the acoustic features of two sentences (slot content in italic underlined font) are generated separately with the trained acoustic model. Both A2 and B2 are acoustic features of the preset template adjacent to the slot, i.e. the adjacent acoustic features of the preceding embodiments. The difference between the two feature segments is a compensation signal, denoted C, where C is the compensation signal that compensates acoustic features B2 to obtain acoustic features A2. To predict C, a compensation model must be trained with acoustic features B2 and linguistic features A1 as input data, as shown in FIG. 4. In the embodiment of the present invention, the compensation model may be trained as follows. The input data are acoustic features B2 and linguistic features A1, where A1 denotes the linguistic features corresponding to the complete first sentence. The output data is the predicted compensation signal C'. The loss function is Mel-SD(A2, A'), the Mel-spectrum distortion between the target acoustic features A2 and the features A' = B2 + C' obtained by compensating B2. Adam is adopted as the optimization method. To make full use of the data resources, the input may also be acoustic features A2 and linguistic features B1, i.e. B2 is obtained by compensating A2, with the loss computed as Mel-SD(B2, B'); thus any two filled sentences of the same template yield two sets of training data.
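A hedged PyTorch sketch of this training objective follows; the network architecture is an illustrative assumption, and plain MSE stands in for the Mel-SD distortion:

```python
import torch
import torch.nn as nn

# Sketch: predict C' from (B2, A1) and minimise the distortion between
# A2 and A' = B2 + C'.
class CompensationModel(nn.Module):
    def __init__(self, feat_dim=80, ling_dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + ling_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, acoustic_b2, linguistic_a1):
        return self.net(torch.cat([acoustic_b2, linguistic_a1], dim=-1))  # C'

model = CompensationModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adam, as stated above

# One training step on dummy tensors (8 frames of adjacent features):
B2, A1, A2 = torch.randn(8, 80), torch.randn(8, 64), torch.randn(8, 80)
C_pred = model(B2, A1)
A_prime = B2 + C_pred                   # compensated features A' = B2 + C'
loss = torch.mean((A2 - A_prime) ** 2)  # stand-in for the Mel-SD loss
opt.zero_grad(); loss.backward(); opt.step()
```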
After training, the compensation model actually predicts C', which is the compensation signal used during real-time online synthesis. Before real-time synthesis, the adjacent acoustic features serve as acoustic features B2; during real-time synthesis, the linguistic features of the received text are generated as linguistic features A1, the compensation signal C' is obtained from the compensation model and added to the template acoustic features B2, giving the compensated output features A', as shown in FIG. 5. Linguistic feature generation converts text into phonetic labels (e.g., phonemes and intonation) and prosodic labels (e.g., accents, pauses). Since the slots may be filled with different content during real-time generation, the complete input text must be used to generate the complete linguistic features in order to guarantee the validity of the slot linguistic features.
Optionally, in the embodiment of the present invention, different methods may be used to generate the slot speech waveform and/or update the adjacent-template-word speech waveform depending on the position of the slot in the preset template. Speech waveform generation converts acoustic features into the output speech signal. Specifically, the following applies. In the preset template library, the speech-waveform hidden state corresponding to the template words of each preset template is obtained in advance, the speech-waveform hidden state being the hidden state of the speech waveform generation network when generating the waveform corresponding to the acoustic features. The acoustic features of the matching preset template's template words and the slot acoustic features are placed together according to the template/slot layout to obtain the overall acoustic features, whose frames are ordered from head to tail. For example, if the template features are i1, i2, i3, i4, …, i20 and the slot features are j1, j2, j3, j4 with the slot at the head of the matching preset template, the overall acoustic features are j1, j2, j3, j4, i1, i2, i3, i4, …, i20, and all frames of the overall acoustic features are then ordered from head to tail. When the slot lies at the tail or in the middle of the matching preset template, the ordering follows the corresponding layout.
When the slot lies at the tail of the matching preset template, generating the slot speech waveform and/or updating the adjacent-template-word speech waveform includes generating the speech samples corresponding to the slot acoustic features and/or the adjacent acoustic features as follows: generate the speech samples z(j+L), z(j+L+1), …, z(j+2L−1) corresponding to the (i+1)-th frame acoustic features y(i+1) with a streaming speech synthesis algorithm, taking y(i+1) and z_j, z(j+1), …, z(j+L−1) as input and s_i' as condition, where z_j, z(j+1), …, z(j+L−1) are the speech samples corresponding to the i-th frame acoustic features y_i, L is the frame length, and s_i' is the speech-waveform hidden state corresponding to the (i+1)-th frame acoustic features; then synthesize the slot speech waveform and/or the updated adjacent-template-word speech waveform from the speech samples corresponding to the slot acoustic features and/or the adjacent acoustic features. When the slot is at the tail of the preset template, the speech samples are generated frame by frame in the forward order of the acoustic features: e.g., from acoustic features y1, y2, …, y_i, …, y(N−1), y_N as input, the speech samples z1, z2, …, z_j, z(j+1), …, z(j+L), …, z_M are generated frame by frame or point by point, where z_j, z(j+1), …, z(j+L−1) correspond to frame y_i, L is the frame length, and M is the complete speech length. A set of hidden states s0', s1', s2', s3', …, s_i', …, s(N−1)', s_N' is maintained during generation, where s0' is the initial state and s_i' is the speech-waveform hidden state corresponding to the (i+1)-th frame, i.e. the hidden state of the speech waveform generation network when generating the waveform for the (i+1)-th frame acoustic features. To generate the waveform corresponding to the (i+1)-th frame, y(i+1) and z_j, z(j+1), …, z(j+L−1) are the input and s_i' the condition, producing z(j+L), z(j+L+1), …, z(j+2L−1). Specifically, in the embodiment of the present invention, the speech samples may be generated frame by frame, in order, from the adjacent acoustic features through to the end of the slot acoustic features. To preserve the natural continuity of the speech, the adjacent acoustic features are adjusted, and when the waveform is generated, the speech waveforms corresponding to the adjacent acoustic features must be regenerated; concretely, sample generation may start from the adjacent acoustic features and proceed frame by frame to the slot acoustic features.
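A hedged Python sketch of this forward, state-conditioned generation loop follows; `step` is a stand-in for one inference step of a streaming vocoder, and all other names are illustrative:

```python
import numpy as np

# Forward frame-by-frame loop: each step takes the next acoustic frame and
# the previous frame's L samples as input, conditioned on the running hidden
# state, and emits the next L samples.
def generate_forward(features, init_samples, init_state, step):
    samples, state = [np.asarray(init_samples)], init_state
    for y in features:                        # y is one acoustic-feature frame
        new_samples, state = step(y, samples[-1], state)
        samples.append(np.asarray(new_samples))
    return np.concatenate(samples[1:]), state

# Toy `step` that just emits L constant samples; a real vocoder replaces this.
def toy_step(y, prev_samples, state):
    return np.full_like(prev_samples, y.mean()), state

L = 4
wav, _ = generate_forward([np.ones(80), 2 * np.ones(80)],
                          np.zeros(L), None, toy_step)
print(wav.shape)  # (8,) -- two frames of L samples each
```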
When the slot lies at the head of the matching preset template, generating the slot speech waveform and/or updating the adjacent-template-word speech waveform includes generating the speech samples corresponding to the slot acoustic features and/or the adjacent acoustic features as follows: generate the speech samples z(j−1), z(j−2), …, z(j−L) corresponding to the (i−1)-th frame acoustic features y(i−1) with a streaming speech synthesis algorithm, taking y(i−1) and z(j+L−1), z(j+L−2), …, z_j as input and s_i'' as condition, where z(j+L−1), z(j+L−2), …, z_j are the speech samples corresponding to the i-th frame acoustic features y_i, L is the frame length, and s_i'' is the speech-waveform hidden state corresponding to the (i−1)-th frame, i.e. the hidden state of the speech waveform generation network when generating the waveform for the (i−1)-th frame acoustic features; then synthesize the slot speech waveform and/or the updated adjacent-template-word speech waveform from the generated samples. When the slot is at the head of the preset template, the speech samples are generated frame by frame in the reverse order of the acoustic features: e.g., from acoustic features y_N, y(N−1), …, y_i, …, y2, y1 as input, the samples z_M, z(M−1), …, z(j+L), z(j+L−1), …, z_j, …, z1 are generated frame by frame or point by point, where z(j+L−1), …, z(j+1), z_j correspond to frame y_i, L is the frame length, and M is the complete speech length. A set of hidden states s(N+1)'', s_N'', s(N−1)'', …, s_i'', …, s3'', s2'', s1'' is maintained during generation, where s(N+1)'' is the initial state and s_i'' is the speech-waveform hidden state corresponding to the (i−1)-th frame. To generate the waveform corresponding to the (i−1)-th frame, y(i−1) and z(j+L−1), …, z_j are the input and s_i'' the condition, producing z(j−1), z(j−2), …, z(j−L). Specifically, in the embodiment of the present invention, the speech samples may be generated frame by frame, in order, from the adjacent acoustic features through to the end of the slot acoustic features. To preserve the natural continuity of the speech, the adjacent acoustic features are adjusted, and the speech waveforms corresponding to them must be regenerated; concretely, sample generation may start from the adjacent acoustic features and proceed frame by frame to the slot acoustic features.
When the slot lies in the middle of the matching preset template, the slot speech waveform is generated and the adjacent-template-word speech waveform updated by producing a spliced speech waveform in which the two are joined together. In this case, the adjacent acoustic features are distributed on both sides of the slot acoustic features; the adjacent and slot acoustic features are placed together as spliced acoustic features, and the spliced speech waveform corresponding to them is generated, specifically as follows. First, generate the speech samples corresponding to the slot acoustic features together with the adjacent acoustic features whose sequence numbers are smaller than those of the slot acoustic features, yielding the first spliced speech samples: generate the samples z(j+L), z(j+L+1), …, z(j+2L−1) corresponding to the (i+1)-th frame acoustic features y(i+1) with a streaming speech synthesis algorithm, taking y(i+1) and z_j, z(j+1), …, z(j+L−1) as input and s_i' as condition, where z_j, z(j+1), …, z(j+L−1) are the samples corresponding to the i-th frame acoustic features y_i, L is the frame length, and s_i' is the speech-waveform hidden state corresponding to the (i+1)-th frame. For example, with template features i1, i2, i3, i4, …, i20, slot features j1, j2, j3, j4, and the slot between i11 and i12, the adjacent acoustic features comprise i11, i10, i9, i8, i7 and i12, i13, i14, i15, i16; the speech samples corresponding to i7, i8, i9, i10, i11, j1, j2, j3, j4 form the first spliced speech samples. Second, generate the speech samples corresponding to the slot acoustic features together with the adjacent acoustic features whose sequence numbers are larger than those of the slot acoustic features, yielding the second spliced speech samples: generate the samples z(j−1), z(j−2), …, z(j−L) corresponding to the (i−1)-th frame acoustic features y(i−1) with a streaming speech synthesis algorithm, taking y(i−1) and z(j+L−1), z(j+L−2), …, z_j as input and s_i'' as condition, where s_i'' is the speech-waveform hidden state corresponding to the (i−1)-th frame. In the same example, the speech samples corresponding to j1, j2, j3, j4, i12, i13, i14, i15, i16 form the second spliced speech samples.
The spliced speech samples corresponding to the slot acoustic features and the adjacent acoustic features are then generated from the first spliced speech samples, the second spliced speech samples, the weights corresponding to the first spliced speech samples, and the weights corresponding to the second spliced speech samples. Specifically, each weight is a weight sequence: the first spliced speech samples are multiplied element-wise by their weight sequence, the second spliced speech samples are multiplied element-wise by their weight sequence, and the two products are added to obtain the spliced speech samples. The two weight sequences contain the same number of elements, equal to the number of speech samples in the first (or second) spliced speech samples. The spliced speech waveform is then synthesized from the spliced speech samples. In effect, when the slot is located in the middle of the preset template, the situation decomposes into the two cases described above: the slot at the tail of the preset template and the slot at the head of the preset template. For the adjacent acoustic features that precede the slot acoustic features in the ordering, the first spliced speech samples are generated by the in-order method described above; for the adjacent acoustic features that follow the slot acoustic features, the second spliced speech samples are generated by the reverse-order method. The final spliced speech samples, in which the adjacent acoustic features and the slot acoustic features are joined, are obtained by weighted summation. For example, if the length of the first (or second) spliced speech samples is N, the weight sequence for the first spliced speech samples may fade out (its elements decreasing from 1 to 0) while the weight sequence for the second spliced speech samples fades in (its elements increasing from 0 to 1), so that each pair of weights sums to 1.
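As a sketch only: the exact weight sequences are given in the source as formulas that are not reproduced here, so the linear cross-fade below is an assumption consistent with the weighted-summation description:

```python
import numpy as np

def crossfade(first, second):
    """Weighted sum of the first and second spliced speech samples.

    A linear cross-fade is one plausible choice of weight sequences
    (an assumption, not the patent's exact formulas): the weights for
    `first` fall from 1 to 0 while those for `second` rise from 0 to 1,
    so each pair of weights sums to 1.
    """
    first = np.asarray(first, dtype=np.float32)
    second = np.asarray(second, dtype=np.float32)
    assert first.shape == second.shape, "overlap regions must align"
    n = len(first)
    w2 = np.linspace(0.0, 1.0, n, dtype=np.float32)   # fade-in weights
    w1 = 1.0 - w2                                     # fade-out weights
    return w1 * first + w2 * second
```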
Acoustic feature generation converts linguistic features into acoustic features (e.g., mel-frequency spectral coefficients). In the embodiment of the present invention, the acoustic features corresponding to the template text of each preset template are generated in advance, and both the generated acoustic features and the hidden states produced while generating them are stored for later use: the acoustic features and hidden states of the template text are required when generating the slot acoustic features. Optionally, in the embodiment of the present invention, the slot acoustic features may be generated by different methods according to the position of the slot in the preset template. In the preset template library, the acoustic features and the acoustic feature hidden states corresponding to the template text of each preset template are acquired in advance, the acoustic feature hidden states being the hidden states present when the acoustic features corresponding to the template text were generated. The acoustic features corresponding to the template text of the matched preset template and the slot acoustic features are put together, according to the positional relationship between the matched preset template and the slot, to obtain the overall acoustic features, and the acoustic features included in the overall acoustic features are ordered from head to tail; for an introduction to this ordering, reference may be made to the relevant contents described in the above embodiments.
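For illustration, a minimal sketch of this offline precomputation follows; the interface acoustic_model.generate_with_states and the cache layout are assumptions, since the embodiment only specifies that the features and hidden states be stored together:

```python
# Offline precomputation of template acoustic features and hidden states.
# `acoustic_model.generate_with_states` is a hypothetical interface that
# returns per-frame features along with the per-frame decoder states.
template_cache = {}

def precompute_template(template_id, template_text, acoustic_model):
    feats, states = acoustic_model.generate_with_states(template_text)
    template_cache[template_id] = {"feats": feats, "states": states}

def lookup_template(template_id):
    # At synthesis time, the cached features/states seed slot generation.
    return template_cache[template_id]
```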
When the slot is at the tail of the matched preset template — for example, in the sample "Today is the lunar Chou Year of the Ox, [April seventeenth]", where "Today is the lunar Chou Year of the Ox" is the template text and "April seventeenth" is the slot text — the acoustic features are generated frame by frame, in the order of their ranking, based on the streaming speech synthesis algorithm. Taking the linguistic features X of a complete piece of text as input, the acoustic features y_1, y_2, …, y_i, …, y_{N-1}, y_N are generated frame by frame from front to back along the text. A set of hidden states s_0, s_1, s_2, …, s_i, …, s_{N-1}, s_N is maintained during generation, where s_0 is the initial state, each hidden state being a hidden state of the acoustic feature generation network at the time the corresponding acoustic feature is generated. Specifically, generating the slot acoustic features includes generating as follows: the (i+1)-th frame acoustic feature y_{i+1} is generated, for the streaming speech synthesis algorithm, with Enc(X) and y_i as input, conditioned on s_i, where Enc() is a linguistic feature encoding network (e.g., the CBHG-based encoder of Tacotron or the Transformer-based encoder of FastSpeech), s_i is the acoustic feature hidden state corresponding to the i-th frame acoustic feature, i.e., the hidden state of the acoustic feature generation network when generating the i-th frame acoustic feature, y_i is the i-th frame acoustic feature, and X is the linguistic feature corresponding to the received text. In addition, in the embodiment of the present invention, the adjacent acoustic features may be compensated; when they are compensated, the acoustic features corresponding to the received text may be obtained as shown in fig. 6, where the acoustic feature compensation signal generated at the adjacent portion between the template and the slot in fig. 6 is the compensation signal used to compensate the adjacent acoustic features.
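A minimal sketch of the tail case, assuming a hypothetical one-step streaming decoder decoder_step implementing y_{i+1} = f(Enc(X), y_i; s_i), seeded by the cached last template frame and its hidden state:

```python
def generate_slot_feats_tail(enc_x, decoder_step, y_last, s_last, n_frames):
    """Continue acoustic-feature generation past the cached template frames.

    decoder_step(enc_x, y_i, s_i) -> (y_next, s_next) is a hypothetical
    one-frame step of a streaming autoregressive decoder; enc_x = Enc(X)
    is the encoded linguistic features of the full received text.
    """
    feats = []
    y, s = y_last, s_last            # last cached template frame and state
    for _ in range(n_frames):
        y, s = decoder_step(enc_x, y, s)
        feats.append(y)
    return feats
```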
When the slot is located at the head of the matched preset template — for example, in the sample "[Jiangnan] — a one-minute trial version has been sent to you; purchase in the small housekeeper app to enjoy the full song", where "Jiangnan" (south of the Yangtze River) is the slot text and the remainder is the template text — generation differs from the mainstream speech synthesis schemes in that the acoustic features are generated in reverse order: the acoustic features of the character "nan" are generated first, and then those of the character "jiang". When the slot is located at the head of the preset template, the acoustic features are generated frame by frame, in the reverse of their ordering, based on the streaming speech synthesis algorithm. Taking the linguistic features X of a complete piece of text as input, the acoustic features y_N, y_{N-1}, …, y_j, …, y_2, y_1 are generated frame by frame from back to front. A set of hidden states s_{N+1}, s_N, s_{N-1}, …, s_j, …, s_2, s_1 is maintained during generation, where s_{N+1} is the initial state, each hidden state being the hidden state present when the corresponding acoustic feature is generated. When generating the (j-1)-th frame acoustic feature, Enc(X) and y_j are taken as input and y_{j-1} is generated conditioned on s_j, where Enc() is the linguistic feature encoding network. Specifically, when the slot is located at the head of the preset template, generating the slot acoustic features includes generating as follows: the (j-1)-th frame acoustic feature y_{j-1} is generated, for the streaming speech synthesis algorithm, with Enc(X) and y_j as input, conditioned on s_j, where Enc() is a linguistic feature encoding network, s_j is the acoustic feature hidden state corresponding to the (j-1)-th frame acoustic feature, i.e., the hidden state of the acoustic feature generation network when generating the (j-1)-th frame acoustic feature, y_j is the j-th frame acoustic feature, and X is the linguistic feature corresponding to the received text. In addition, in the embodiment of the present invention, the adjacent acoustic features may be compensated (for example, by the compensation method provided in the embodiment of the present invention); the slot acoustic features and the adjacent acoustic features may be understood with reference to fig. 7.
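The head case mirrors the tail case; a minimal sketch, assuming a hypothetical reverse-order decoder step rev_decoder_step trained to generate frames back to front:

```python
def generate_slot_feats_head(enc_x, rev_decoder_step, y_first, s_first,
                             n_frames):
    """Generate slot features back to front, seeded by the first cached
    template frame y_j and its hidden state s_j.

    rev_decoder_step(enc_x, y_j, s_j) -> (y_prev, s_prev) is a hypothetical
    one-frame step of a decoder trained for reverse-order generation.
    """
    feats = []
    y, s = y_first, s_first
    for _ in range(n_frames):
        y, s = rev_decoder_step(enc_x, y, s)
        feats.append(y)
    feats.reverse()                  # return in chronological order
    return feats
```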
When the slot is located in the middle of the matched preset template, the situation differs from the head and tail cases in that the slot portion is influenced by the preset template portions on both of its sides. Specifically, in the embodiment of the present invention, the acoustic features may be generated frame by frame, in the order of their ranking, based on the streaming speech synthesis algorithm. Generating the slot acoustic features includes generating as follows: the (i+1)-th frame acoustic feature y_{i+1} is generated, for the streaming speech synthesis algorithm, with Enc(X), y_i, y_n, and Enc_d(n-i-1) as input, conditioned on s_i, where Enc() is a linguistic feature encoding network, Enc_d() is a distance encoding function, s_i is the acoustic feature hidden state used when generating the (i+1)-th frame acoustic feature, y_i is the i-th frame acoustic feature, y_n is the n-th frame acoustic feature — the adjacent acoustic feature whose sequence number is greater than those of the slot acoustic features and which is closest to them — and X is the linguistic feature corresponding to the received text. For example, the adjacent acoustic features include y_{a-l+1}, …, y_a, y_b, y_{b+1}, …, y_{b+l-1}, where l is the adjustment window length (i.e., the preset value in the embodiment of the present invention). The slot acoustic features lie between y_a and y_b, and y_b is the acoustic feature whose sequence number is greater than those of the slot acoustic features and which is closest to them in the ordering of the acoustic features.
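A minimal sketch of the middle case, assuming a hypothetical step function bridge_step that accepts the extra conditioning inputs y_n and Enc_d(n-i-1):

```python
def generate_slot_feats_middle(enc_x, dist_enc, bridge_step,
                               y_left, s_left, y_right, n_frames):
    """Generate middle-slot features that must rejoin the template audio.

    bridge_step(enc_x, y_i, y_n, d_code, s_i) -> (y_next, s_next) is a
    hypothetical one-frame decoder step conditioned on the nearest
    right-context template frame y_n and a distance code Enc_d(n-i-1),
    so each new frame 'knows' how far away the rejoin point is.
    """
    feats = []
    y, s = y_left, s_left                # last left-context frame and state
    for k in range(n_frames):
        # Distance from the frame being generated to y_right: n - i - 1.
        d_code = dist_enc(n_frames - k)
        y, s = bridge_step(enc_x, y, y_right, d_code, s)
        feats.append(y)
    return feats
```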
In summary, in the embodiment of the present invention, the computation of the TTS algorithm is reduced at the level of the algorithm structure, which reduces the response delay of the intelligent device terminal; and the task processing time is reduced by exploiting the characteristics of the service data and the algorithm strategy, which reduces the computing resources required and thus the service cost.
Yet another aspect of the embodiments of the present invention provides a machine-readable storage medium on which a program is stored, the program implementing the method described in the above embodiments when executed by a processor.
In another aspect of the embodiments of the present invention, a processor is further provided, where the processor is configured to run a program which, when executed, performs the method described in the foregoing embodiments.
In another aspect, an apparatus is further provided, where the apparatus includes a processor, a memory, and a program stored in the memory and executable on the processor, and the processor executes the program to implement the method described in the foregoing embodiments. The device herein may be a server, a PC, a PAD, a mobile phone, etc.
Yet another aspect of an embodiment of the present invention provides a computer program product including a computer program/instructions, which when executed by a processor, implement the method described in the above embodiment.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.
Claims (13)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111326772.0A CN114049874B (en) | 2021-11-10 | 2021-11-10 | Methods for synthesizing speech |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114049874A true CN114049874A (en) | 2022-02-15 |
| CN114049874B CN114049874B (en) | 2025-07-29 |
Family
ID=80208045
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202111326772.0A Active CN114049874B (en) | 2021-11-10 | 2021-11-10 | Methods for synthesizing speech |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114049874B (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114842827A (en) * | 2022-04-28 | 2022-08-02 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio synthesis method, electronic equipment and readable storage medium |
Citations (13)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2000339314A (en) * | 1999-05-25 | 2000-12-08 | Nippon Telegr & Teleph Corp <Ntt> | Automatic response method, dialogue analysis method, response sentence generation method, device thereof, and medium recording the program |
| US6438522B1 (en) * | 1998-11-30 | 2002-08-20 | Matsushita Electric Industrial Co., Ltd. | Method and apparatus for speech synthesis whereby waveform segments expressing respective syllables of a speech item are modified in accordance with rhythm, pitch and speech power patterns expressed by a prosodic template |
| US6496801B1 (en) * | 1999-11-02 | 2002-12-17 | Matsushita Electric Industrial Co., Ltd. | Speech synthesis employing concatenated prosodic and acoustic templates for phrases of multiple words |
| JP2003241795A (en) * | 2002-02-18 | 2003-08-29 | Hitachi Ltd | Information acquisition method and information acquisition system using voice input |
| CN1945691A (en) * | 2006-10-16 | 2007-04-11 | 安徽中科大讯飞信息科技有限公司 | Method for improving template sentence synthetic effect in voice synthetic system |
| CN107340991A (en) * | 2017-07-18 | 2017-11-10 | 百度在线网络技术(北京)有限公司 | Switching method, device, equipment and the storage medium of speech roles |
| CN107886948A (en) * | 2017-11-16 | 2018-04-06 | 百度在线网络技术(北京)有限公司 | Voice interactive method and device, terminal, server and readable storage medium storing program for executing |
| CN110517662A (en) * | 2019-07-12 | 2019-11-29 | 云知声智能科技股份有限公司 | A kind of method and system of Intelligent voice broadcasting |
| CN111128121A (en) * | 2019-12-20 | 2020-05-08 | 贝壳技术有限公司 | Voice information generation method and device, electronic device and storage medium |
| CN111583904A (en) * | 2020-05-13 | 2020-08-25 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
| US20210027788A1 (en) * | 2019-07-23 | 2021-01-28 | Baidu Online Network Technology (Beijing) Co., Ltd. | Conversation interaction method, apparatus and computer readable storage medium |
| US20210174781A1 (en) * | 2019-01-17 | 2021-06-10 | Ping An Technology (Shenzhen) Co., Ltd. | Text-based speech synthesis method, computer device, and non-transitory computer-readable storage medium |
| CN113345415A (en) * | 2021-06-01 | 2021-09-03 | 平安科技(深圳)有限公司 | Speech synthesis method, apparatus, device and storage medium |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11990118B2 (en) | Text-to-speech (TTS) processing | |
| CN113345415B (en) | Speech synthesis method, device, equipment and storage medium | |
| JP7464621B2 (en) | Speech synthesis method, device, and computer-readable storage medium | |
| US7567896B2 (en) | Corpus-based speech synthesis based on segment recombination | |
| EP4158619B1 (en) | Phrase-based end-to-end text-to-speech (tts) synthesis | |
| US9905220B2 (en) | Multilingual prosody generation | |
| US11763797B2 (en) | Text-to-speech (TTS) processing | |
| US10692484B1 (en) | Text-to-speech (TTS) processing | |
| JP2002530703A (en) | Speech synthesis using concatenation of speech waveforms | |
| JP2007249212A (en) | Method, computer program and processor for text speech synthesis | |
| US8626510B2 (en) | Speech synthesizing device, computer program product, and method | |
| MXPA01006594A (en) | Method and system for preselection of suitable units for concatenative speech. | |
| CN102822889B (en) | Pre-saved data compression for tts concatenation cost | |
| Fahmy et al. | A transfer learning end-to-end arabic text-to-speech (tts) deep architecture | |
| US20220189455A1 (en) | Method and system for synthesizing cross-lingual speech | |
| CN114974218A (en) | Voice conversion model training method and device and voice conversion method and device | |
| CN114049874B (en) | Methods for synthesizing speech | |
| Mei et al. | A particular character speech synthesis system based on deep learning | |
| JP5874639B2 (en) | Speech synthesis apparatus, speech synthesis method, and speech synthesis program | |
| CN119207374B (en) | Method and system for converting text into voice efficiently | |
| CN120412544B (en) | A prosody-controllable speech synthesis method and related device based on VITS | |
| CN114974208B (en) | A Chinese speech synthesis method, device, electronic device, and storage medium | |
| Louw | Text-to-speech duration models for resource-scarce languages in neural architectures | |
| EP1501075B1 (en) | Speech synthesis using concatenation of speech waveforms | |
| JP5449022B2 (en) | Speech segment database creation device, alternative speech model creation device, speech segment database creation method, alternative speech model creation method, program |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | TA01 | Transfer of patent application right | Effective date of registration: 20220426. Address after: 100085 Floor 101, 102-1, Building 35, Yard 2, Xierqi West Road, Haidian District, Beijing. Applicant after: Seashell Housing (Beijing) Technology Co.,Ltd. Address before: 101309 Room 24, 62 Farm Road, Erjie Village, Yangzhen, Shunyi District, Beijing. Applicant before: Beijing fangjianghu Technology Co.,Ltd. |
| | GR01 | Patent grant | |