CN114049874A - Method for synthesizing speech - Google Patents
- Publication number
- CN114049874A (application CN202111326772.0A, CN202111326772A)
- Authority
- CN
- China
- Prior art keywords
- template
- text
- acoustic feature
- slot
- preset
- Prior art date
- Legal status: Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Machine Translation (AREA)
- Telephonic Communication Services (AREA)
Abstract
An embodiment of the invention provides a method for synthesizing speech, belonging to the field of artificial intelligence. The method comprises the following steps: finding, in a preset template library, a matching preset template that matches the received text, wherein each preset template in the library comprises template words and slots, and the speech waveform corresponding to the template words is obtained in advance; obtaining, from the received text, the slot content corresponding to the slots included in the matching preset template; generating slot acoustic features corresponding to the slot content; generating a slot speech waveform corresponding to the slot acoustic features; and splicing the slot speech waveform with the speech waveform corresponding to the template words of the matching preset template to obtain the speech waveform corresponding to the received text, thereby synthesizing the speech. The real-time computation needed for speech synthesis is thus reduced, the response latency of the smart-device terminal is lowered, and the user experience is improved.
Description
Technical Field
Embodiments of the present invention relate to a method for synthesizing speech.
Background
With the rapid iteration of computer performance and the wide application of deep learning, intelligent speech interaction technology has developed at an unprecedented pace. Today, human-computer interfaces are transitioning from touch to speech. The purpose of text-to-speech (TTS) technology is to make computers and various smart devices speak like a person. Driven by a constant stream of new machine learning algorithms, TTS can now generate speech that closely approximates, and is at times indistinguishable from, real human speech.
Because mainstream deep learning algorithms depend on substantial computing power, smart-device manufacturers and technical service providers today deploy this component in computing centers and provide intelligent speech services in a cloud-computing mode. For TTS, however, this approach has two significant drawbacks: 1) owing to algorithm complexity, network delay, and other factors, the response latency of the smart-device terminal is large, which hurts the user experience; 2) because user requests are not evenly distributed in time (e.g., request peaks in the morning and evening, troughs in the early-morning hours), vendors must provision excess computing capacity to handle peak load, which wastes computing resources and increases cost.
Disclosure of Invention
To at least partially solve the above problems, an aspect of the embodiments of the present invention provides a method for synthesizing speech, the method comprising: finding, in a preset template library, a matching preset template that matches the received text, wherein each preset template in the library comprises template words and slots, and the speech waveform corresponding to the template words is obtained in advance; obtaining, from the received text, the slot content corresponding to the slots included in the matching preset template; generating slot acoustic features corresponding to the slot content; generating a slot speech waveform corresponding to the slot acoustic features; and splicing the slot speech waveform with the speech waveform corresponding to the template words included in the matching preset template to obtain the speech waveform corresponding to the received text, thereby synthesizing speech.
In addition, another aspect of the embodiments of the present invention provides a machine-readable storage medium storing instructions that cause a machine to execute the above method.
In addition, another aspect of the embodiments of the present invention provides a computer program product comprising a computer program/instructions which, when executed by a processor, implement the steps of the above method.
With the above technical solution, speech is synthesized using preset templates: the speech waveform corresponding to the template words of a preset template is obtained in advance, so that at synthesis time only the speech waveform for the content filled into the template's slots needs to be generated in real time. The speech waveform of the template words and the speech waveform of the slot content are spliced together to synthesize the speech, and the waveform for the entire received text need not be generated in real time. The real-time computation needed for speech synthesis is thus reduced, the response latency of the smart-device terminal is lowered, and the user experience is improved. Moreover, synthesizing speech by means of preset templates shortens the processing time of a single task, so the computing load at user request peaks can be borne without provisioning excess computing equipment, reducing wasted computing resources and cost.
Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the embodiments of the invention without limiting the embodiments of the invention. In the drawings:
FIG. 1 is a flow chart of a method for synthesizing speech provided by an aspect of an embodiment of the present invention;
FIG. 2 is a diagram of matching a preset template according to another embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a compensation signal provided by another embodiment of the present invention;
FIG. 4 is a schematic diagram of the logic of a training compensation model according to another embodiment of the present invention;
FIG. 5 is a schematic diagram of compensation logic provided in accordance with another embodiment of the present invention;
FIG. 6 is a schematic illustration of compensation for adjacent acoustic features provided by another embodiment of the present invention; and
FIG. 7 is a schematic illustration of slot acoustic feature generation and compensation of adjacent acoustic features provided by another embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating embodiments of the invention, are given by way of illustration and explanation only, and are not limiting.
An intelligent speech interaction system contains a natural language generation (NLG) module, which produces the text content responding to the user, and a TTS module, which accepts that text and converts it into speech output to the user. One mainstream NLG design is to author dialog templates, each containing one or more slots; filling the slots with different values yields different response texts. An example dialog template is: "The current time is [ ] o'clock [ ] minutes in the morning." This template contains 2 slots, and filling them with different content generates a class of response texts answering for the current time, e.g.: "The current time is 6 o'clock 58 minutes in the morning."; "The current time is 7 o'clock 45 minutes in the morning."; "The current time is 8 o'clock 20 minutes in the morning." Statistics show that structured response texts generated from dialog templates occur with very high frequency in real intelligent speech interaction systems, especially during the morning and evening request peaks. The design idea of the technical solution provided by the embodiments of the present invention is to prepare preset templates for these high-frequency dialog templates in advance, so that during real-time online synthesis only the speech corresponding to the content filled into the slots is generated. Taking the example above, "The current time is [ ] o'clock [ ] minutes in the morning." is prepared in advance as a preset template, where [ ] denotes a slot. During real-time synthesis, only the digit speech segments such as "6" and "58" are generated and spliced into the template as the final TTS output. The TTS algorithm comprises 3 functional modules: a linguistic feature generation module, an acoustic feature generation module, and a speech waveform generation module. Linguistic features are generated from the text, acoustic features from the linguistic features, and the speech waveform from the acoustic features, yielding the speech signal.
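As a concrete illustration of the dialog-template mechanism just described, the following minimal Python sketch (all names hypothetical, not part of the patent) fills slot values into a template to produce a response text:

```python
# A hypothetical dialog template with two slots, matching the example above.
TEMPLATE = "The current time is [hour] o'clock [minute] minutes in the morning."

def fill_template(template: str, values: dict) -> str:
    """Fill each [slot] in the template with its value."""
    text = template
    for slot, value in values.items():
        text = text.replace(f"[{slot}]", str(value))
    return text

print(fill_template(TEMPLATE, {"hour": 6, "minute": 58}))
# -> "The current time is 6 o'clock 58 minutes in the morning."
```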
One aspect of embodiments of the present invention provides a method for synthesizing speech. Fig. 1 is a flowchart of a method for synthesizing speech according to an embodiment of the present invention. As shown in fig. 1, the method includes the following.
In step S10, a matching preset template that matches the received text is found in the preset template library. In the library, each preset template comprises template words and slots, and the speech waveform corresponding to the template words has been obtained in advance. The received text is the text to be synthesized. In a preset template the slots are not filled with characters, and a preset template may include one or more slots. The preset template in the library that matches the received text is the matching preset template. For example, each preset template in the library has a template ID; when template information can be obtained from the NLG (natural language generation) module, the template ID of the preset template matching the received text can be obtained directly, and the matching preset template found in the library by that ID. In addition, in the embodiment of the present invention, the acoustic features and/or acoustic-feature hidden state and/or speech-waveform hidden state corresponding to the template words may also be obtained in advance and stored in the library, where the acoustic-feature hidden state is the hidden state of the acoustic feature generation network when generating the acoustic features, and the speech-waveform hidden state is the hidden state of the speech waveform generation network when generating the waveform corresponding to the acoustic features. Speech is a typical time-series signal, and when processing time-series signals with neural networks the network usually has to "remember" historical information, such as the cell state c and hidden state h in an LSTM. Hidden states are not the model parameters of the network: model parameters are normally fixed weights, while hidden states are state parameters and/or intermediate values that change over time during inference. Taking an LSTM as an example, the hidden state at time t can be written as [c_t, h_t]. A real model may contain multiple LSTM layers, each with multiple LSTM units, so in practice the hidden state is a sequence of matrices. When the acoustic feature generation network produces the acoustic features of a speech template, suppose the last frame of the template text's acoustic features is generated at time t and the first frame of the slot text's acoustic features at time t+1; then the hidden state [c_t, h_t] at the end of time t is exported and stored as part of the speech template. Similarly, the hidden state of the speech waveform generation network can be exported and stored as part of the speech template.
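To illustrate the hidden-state export just described, here is a hedged PyTorch sketch; the shapes, layer sizes, and file name are illustrative assumptions, and the patent does not prescribe a specific framework:

```python
import torch
import torch.nn as nn

# Sketch: export the hidden state [c_t, h_t] of an LSTM-based generator at
# the boundary between template text and slot text, so that inference can
# later resume from exactly this state.
lstm = nn.LSTM(input_size=80, hidden_size=256, num_layers=2, batch_first=True)

template_frames = torch.randn(1, 40, 80)  # frames of the template text (dummy data)
out, (h_t, c_t) = lstm(template_frames)   # (h_t, c_t): state after the last template frame

# Store as part of the preset template; shapes: (num_layers, batch, hidden).
torch.save({"h": h_t, "c": c_t}, "template_042_hidden_state.pt")

# Later, resume generation of the slot frames from the saved state:
state = torch.load("template_042_hidden_state.pt")
slot_frames = torch.randn(1, 8, 80)       # frames of the slot content (dummy data)
slot_out, _ = lstm(slot_frames, (state["h"], state["c"]))
```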
In step S11, the slot content corresponding to the slots included in the matching preset template is obtained from the received text. For example, the received text may be compared with the matching preset template to obtain the slot content corresponding to each slot.
In step S12, slot acoustic features corresponding to the slot content are generated, for example according to a streaming speech synthesis algorithm.
In step S13, a slot speech waveform corresponding to the slot acoustic features is generated, for example according to a streaming speech synthesis algorithm.
In step S14, the slot speech waveform and the speech waveform corresponding to the template words included in the matching preset template are spliced together to obtain the speech waveform corresponding to the received text, thereby synthesizing the speech.
With the above technical solution, speech is synthesized using preset templates: the speech waveform corresponding to the template words of a preset template is obtained in advance, so that at synthesis time only the speech waveform for the content filled into the template's slots needs to be generated in real time. The speech waveform of the template words and the speech waveform of the slot content are spliced together to synthesize the speech, and the waveform for the entire received text need not be generated in real time. The real-time computation needed for speech synthesis is thus reduced, the response latency of the smart-device terminal is lowered, and the user experience is improved. Moreover, synthesizing speech by means of preset templates shortens the processing time of a single task, so the computing load at user request peaks can be borne without provisioning excess computing equipment, reducing wasted computing resources and cost.
Optionally, in the embodiment of the present invention, creating the preset template library involves two aspects: screening the preset template texts, and making the preset templates. The template texts can be screened as follows: 1) obtain all designed dialog templates from the NLG module; 2) delete dialog templates whose text content is too sparse, e.g. templates in which, after segmenting the template text at the slots, the average segment length is under 2 words; 3) delete fixed phrases, i.e. templates with zero slots; 4) use the remaining dialog templates as the preset templates of this solution. A simple way to make a preset template is to record its audio in advance; however, that is limited to a few specific timbres and cannot be extended to more timbres, and it suits only a small number of pre-made dialog templates and cannot be extended to dialog templates and high-frequency texts added later. In the embodiment of the present invention, making a preset template includes the following steps. 1) Fill example values into the slots of the dialog template, completing it into a full, fluent text. 2) Synthesize the completed text into speech with a streaming speech synthesis algorithm. 3) Record certain intermediate values of the synthesis process, such as the acoustic features of the template text, the hidden state while predicting the acoustics (the hidden state of the acoustic feature generation network when generating the acoustic features), the hidden state of the vocoder when generating the waveform (the hidden state of the speech waveform generation network when generating the speech waveform from the acoustic features), and the speech waveform of the template text, so that the inference context (i.e. the input data, hidden states, and other variables plus the necessary computing resources needed to predict the slot speech) can be quickly restored when slot speech is generated online in real time. 4) Delete from the outputs of steps 2)-3) (i.e. the acoustic features, speech waveform, and hidden states) the parts corresponding to the slot content. 5) Record the template number, the template words, the slot description information, and the result of step 4) as a preset template. The slot description information comprises the number and order of the slots and each slot's offsets into the template's text, acoustic features, and speech waveform data. For example, in the template "The current time is [hour] o'clock [minute] minutes in the morning.", each slot has a recorded character offset in the template text (offsets 7 and 8 in the original-language template). The preset template library is built from this content.
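As an illustration of the preset-template record assembled in step 5), a hedged Python sketch follows; the field names are hypothetical and not the patent's actual storage format:

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class Slot:
    text_offset: int      # offset of the slot in the template text
    feature_offset: int   # offset in the acoustic-feature sequence
    waveform_offset: int  # offset in the speech waveform

@dataclass
class PresetTemplate:
    template_id: int
    segments: List[str]            # template words split at the slots, in order
    slots: List[Slot]              # slot description information, in order
    acoustic_features: np.ndarray  # features of the template words (slot parts removed)
    waveform: np.ndarray           # waveform of the template words (slot parts removed)
    hidden_states: dict = field(default_factory=dict)  # exported network states
```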
Optionally, in the embodiment of the present invention, the matching preset template can be found in the preset template library as follows. The method below suits the case where template information cannot be obtained from the NLG: in most voice interaction systems, information exchange between TTS and NLG is inconvenient, so the most suitable template must be matched in the preset template library from the text alone.
All characters of the received text are ordered from head to tail. For example, a received text of m characters is written c1, c2, …, cm, where ci denotes the i-th character. For every preset template in the library, the following matching operation is performed, so that all preset templates are matched against the received text and the matching preset template is determined. The template words of the preset template are segmented at the slots, giving segmented template words; these are ordered from head to tail according to their positions in the preset template. For example, if the template words of a preset template comprise n segments, they are written t1, t2, …, tn, where ti denotes the i-th segmented template word. The slots of the preset template can likewise be ordered from head to tail; e.g., n−1 slots are written s1, s2, …, s(n−1), where si denotes the i-th slot (see Table 1). The segmented template words are then looked up in the received text in template-word order, where characters of the received text that have already matched a segmented template word are not used again for subsequent lookups. A lookup means finding, in the received text, a character string identical to the segmented template word; lookups proceed in segment order, i.e. first t1 is sought, then t2, and so on. Once some characters of the received text have matched a segment, they are excluded from later lookups: e.g. if characters c1, c2, c3, c4 matched t1, then the lookup of t2, …, tn starts from c5. For a given segmented template word, the lookup must find an identical string in the received text; as in Table 1, a segment t1 of 7 characters must be found as exactly those 7 characters in the received text.
If all segmented template words are found in the received text, whether the preset template matches the received text successfully is judged against the preset matching conditions; if the match succeeds, whether the preset template is set as the optimal template is determined by the optimal-template setting rules, the optimal template being the matching preset template once the matching operation has been performed on every preset template in the library. If the matching operation has finished for all preset templates and no optimal template exists, no matching preset template was found in the library. In the received text, the characters not covered by the template words of the matching preset template are the characters corresponding to the slots. In addition, the slots and segmented template words of the matching preset template can be ordered jointly, and the slot content of each slot determined by comparing this order against the received text.
TABLE 1
Optionally, in the embodiment of the present invention, looking up the segmented template words in the received text in template-word order includes the following. The first segmented template word, i.e. the segment ranked first in the template-word order, is looked up in the received text as follows; for segments t1, t2, …, tn, t1 is the first segmented template word. If no slot precedes t1, t1 is compared against the character string of the received text that starts at the first character and has the same length as t1, the first character being the character ranked first in the character order. For example, if t1 has no leading slot and contains 7 characters, then when searching the characters c1, c2, …, cm, t1 may only be compared with the string c1–c7; c1 is the first character. If a slot precedes t1, t1 is compared against any character string of the received text of the same length whose character sequence numbers are consecutive; that is, the candidate strings of the received text are not restricted. Once t1 is found in the received text, a matching position is set for the second segmented template word t2 from the character length of t1 and the sequence number of the initial character of the string that matched t1, the matching position being the sequence number at which the lookup of the next segment starts. For example, if the string matching t1 is c1–c7 (t1 has length 7 and its matched initial character c1 has sequence number 1), then the matching position for t2 is 8, i.e. the comparison for t2 starts from c8.
The second segmented template word is then looked up in the received text starting from the matching position. When it is found, the matching position is updated from the character length of t2 and the sequence number of the initial character of the string that matched t2, analogously to how the matching position was set for t2. The remaining segmented template words of the preset template are looked up in the received text in the same way, one after another, until all segments have been found. If, at any point in this process, some segmented template word cannot be found, the preset template fails to match the received text and the sequential lookup ends. For example, if no string matching t2 is found, the lookup for this preset template terminates and the match fails.
Optionally, in the embodiment of the present invention, the preset matching conditions include the following. If all segmented template words are found in the received text, but the maximum character sequence number of the received text is greater than or equal to the last matching position corresponding to the last segmented template word, and no slot follows the last segmented template word, then the preset template does not match the received text successfully; here the last segmented template word is the segment ranked last in the template-word order, and the last matching position is the matching position computed from the character length of the last segment and the sequence number of the initial character of the string that matched it. For example, suppose segments t1, …, t5 are all found in a received text c1, …, c9 (maximum sequence number 9), t5 has no trailing slot and contains 2 characters, and the initial character of the string matching t5 has sequence number 6; then the last matching position is 8, and since the maximum sequence number 9 exceeds 8, redundant characters remain and the match is unsuccessful. If all segmented template words are found and the maximum sequence number is smaller than the last matching position, and/or a slot follows the last segmented template word, the preset template matches the received text successfully.
Optionally, in the embodiment of the present invention, the optimal-template setting rules include the following. If no optimal template has been set, the preset template (i.e. the currently successfully matched one) is set as the optimal template. If an optimal template has been set but the preset template contains more characters than the optimal template, the preset template is set as the optimal template. If an optimal template has been set, the preset template contains the same number of characters as the optimal template, but it contains fewer slots, the preset template is set as the optimal template. If the preset template contains fewer characters than the optimal template, it is not set as the optimal template; likewise, if it contains the same number of characters but at least as many slots as the optimal template, it is not set as the optimal template.
Looking up a segmented template word in the received text means matching the segment against the characters of the received text: a matching character string counts as found, otherwise it is not found. Taking segments t1, t2, …, tn and characters c1, c2, …, cm as an example, the lookup process is illustrated in FIG. 2: each segment of the t-sequence is looked up in the received text in turn; if the whole t-sequence can be found and the received text has no redundant characters, the match is considered successful, otherwise it fails. As shown in FIG. 2, in diagram A, t1 has matched c1, c2, so t2 is looked up starting from c3; diagram B shows t2 matched to c4, c5; diagram C shows tn matched to c(m−1), cm, and since the received text has no further characters, the preset template matches the received text successfully. Several preset templates may match the same received text successfully; in that case the preset template with the most characters is chosen, and on a tie in character count, the one with the fewest slots.
Specifically, the matching algorithm may include the following operations, performed on every preset template in the preset template library.
Step 1: segment the template words at the slots, writing the segments as t1, t2, …, tn and the slots as s1, s2, …, s(n−1) (fewer slots than text segments), or s1, s2, …, sn (as many slots as text segments), or s1, s2, …, s(n+1) (more slots than text segments). Sorting the pieces in the order in which they appear in the template gives one of: (1) t1, s1, t2, s2, …, s(n−1), tn; (2) t1, s1, t2, s2, …, s(n−1), tn, sn; (3) s1, t1, s2, t2, …, sn, tn; (4) s1, t1, s2, t2, …, sn, tn, s(n+1).
Step 2: look up the segmented template words in the received text c1, c2, …, cm in order (since a slot, i.e. the s-sequence, can match 0 to many characters, only the text segments, i.e. the t-sequence, are looked up).
Step 2.1: for a preset template of type (1) or (2), compare t1 with c1, c2, …, c_L1, where L1 is the length of t1; if they are identical, update the matching position search_beg to L1+1 and go to step 2.2, otherwise the preset template fails to match. For a template of type (3) or (4), search for t1 in c1, c2, …, cm; if a character string identical to t1 is found starting at c_P1 (c_P1 being its initial character), update search_beg to P1+L1 and go to step 2.2, otherwise the preset template fails to match. In the latter case, the characters from c1 to c_(P1−1) are the slot content corresponding to slot s1.
Step 2.2: if the last segmented template word tn has been matched successfully, update the matching position from the length of tn and the sequence number of the initial character of the string in c1, c2, …, cm that matched tn, and go to step 2.3. Otherwise, search for the next template text segment ti in the input text starting from search_beg; if a string identical to ti is found starting at c_Pi, update search_beg to Pi+Li, where Li is the length of ti, and repeat step 2.2; otherwise the preset template fails to match.
Step 2.3: if m ≥ search_beg and the preset template is of type (1) or (3), the received text has redundant characters relative to the preset template and the match fails; otherwise, the template matches successfully, and step 3 follows.
Step 3: if no optimal template has been set, set the current preset template as the optimal template; otherwise compare the character counts of the current preset template and the optimal template, and if the current one has more characters, set it as the optimal template; if the character counts are equal, compare the slot counts, and if the current preset template has fewer slots, set it as the optimal template. After all preset templates have been matched, if an optimal template was set it is selected; otherwise the received text matched no template in the preset template library and the match fails. A sketch of this matching procedure is given below.
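The following minimal Python sketch of steps 1-3 is illustrative only; all names (match_template, pick_best, head_slot, tail_slot) are assumptions, and the template types (1)-(4) are collapsed into two booleans saying whether the template begins/ends with a slot:

```python
from typing import List, Optional, Tuple

def match_template(text: str, segments: List[str],
                   head_slot: bool, tail_slot: bool) -> Optional[List[str]]:
    """Return the slot contents if the template matches `text`, else None."""
    search_beg = 0
    slot_contents: List[str] = []
    for k, seg in enumerate(segments):
        if k == 0 and not head_slot:
            # Type (1)/(2): t1 must match at the very beginning of the text.
            if not text.startswith(seg):
                return None
            pos = 0
        else:
            pos = text.find(seg, search_beg)   # look up ti from search_beg
            if pos < 0:
                return None
            slot_contents.append(text[search_beg:pos])  # characters consumed by the slot
        search_beg = pos + len(seg)            # update the matching position
    if search_beg < len(text):
        if not tail_slot:
            return None                        # redundant trailing characters
        slot_contents.append(text[search_beg:])
    elif tail_slot:
        slot_contents.append("")               # a slot may match 0 characters
    return slot_contents

def pick_best(candidates: List[Tuple[List[str], List[str]]]):
    """candidates: (segments, slot_contents). Prefer more template characters,
    then fewer slots, as in step 3."""
    best = None
    for segs, slots in candidates:
        key = (sum(len(s) for s in segs), -len(slots))
        if best is None or key > (sum(len(s) for s in best[0]), -len(best[1])):
            best = (segs, slots)
    return best
```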
Since the speech signal is a time-series signal, it is strongly affected by the speech before and after it. If the speech for the slot content were generated directly and spliced into the preset template, the result would sound heavily mechanical: inconsistent fundamental frequency, energy, speed, and rhythm would hurt naturalness, and phase discontinuities would cause glitch noise. In the embodiment of the present invention, to guarantee the continuity of the synthesized speech, the acoustic features of the frames of the preset template adjacent to the slot must be adjusted to a certain extent. Specifically, the acoustic features corresponding to the template words of each preset template are obtained in advance in the library, and the method further includes the following. The adjacent acoustic features among the acoustic features corresponding to the template words of the matching preset template are compensated to obtain compensated acoustic features, where the adjacent acoustic features comprise all acoustic features within a preset number of frames of the slot acoustic features, counted from the template frame nearest the slot. For example, suppose the acoustic features of the template words are i1, i2, i3, i4, …, i20 and the slot precedes acoustic feature i1; with the preset value set to 5, counting upward from i1, all features within count 5 are adjacent acoustic features, i.e. i1, i2, i3, i4, i5. Again with template features i1, …, i20, if the slot follows feature i20, counting downward from i20 gives the adjacent acoustic features i20, i19, i18, i17, i16. If the slot lies between features i11 and i12, counting downward from i11 and upward from i12 gives the adjacent acoustic features i11, i10, i9, i8, i7 and i12, i13, i14, i15, i16. It should be understood that the adjacent acoustic features can be obtained this way regardless of how many slots a preset template has. The speech waveform corresponding to the adjacent acoustic features is then regenerated from the compensated acoustic features, producing an updated adjacent-template-word speech waveform, e.g. generated with a streaming speech synthesis algorithm.
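The index arithmetic of the preceding examples can be captured in a short Python sketch; the function and parameter names are illustrative assumptions:

```python
# Indices of the "adjacent acoustic features" around a slot: up to `k` frames
# on each existing side of the slot (k = 5 in the examples above).
# `slot_before`/`slot_after` are the 0-based frame indices bordering the slot
# (None if the slot sits at the head or tail of the template).
def adjacent_indices(n_frames: int, slot_before, slot_after, k: int = 5):
    idx = []
    if slot_before is not None:  # frames to the left of the slot
        idx += list(range(max(0, slot_before - k + 1), slot_before + 1))
    if slot_after is not None:   # frames to the right of the slot
        idx += list(range(slot_after, min(n_frames, slot_after + k)))
    return idx

# Slot between frames i11 and i12 of the 1-based example (0-based 10 and 11):
print(adjacent_indices(20, slot_before=10, slot_after=11))
# -> [6, 7, 8, 9, 10, 11, 12, 13, 14, 15], i.e. i7-i11 and i12-i16
```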
The splicing then joins the slot speech waveform, the updated adjacent-template-word speech waveform, and the speech waveforms corresponding to the remaining acoustic features of the matching preset template's template words, i.e. those other than the adjacent acoustic features.
Optionally, in the embodiment of the present invention, a compensation signal may be generated, and the adjacent acoustic features may be compensated based on the compensation signal.
The compensation signal is the difference between the model-generated acoustic features and the ideal acoustic features. In the embodiment of the present invention the adjacent acoustic features are compensated, and the ideal acoustic features are the compensated ones, so the compensation signal is the difference between the adjacent acoustic features before and after compensation. The pronunciation of a speech segment is affected by its context, which contributes to the continuity and consistency of the whole utterance. For the same preset template, filling the slots with different content changes the ideal pronunciation of the template, especially for the speech frames adjacent to the slots. The root cause of this difference is that the encoder part of the acoustic-feature model uses the linguistic features of the complete sentence; when the slot content is replaced, those linguistic features change. The goal of the compensation signal is to repair the effect of this change on the pronunciation. Put more plainly: the preset template was generated under a certain condition (a specific text); when the template is used, the condition changes slightly (a small part of the text changes), so a small change must be made to the generated template to adapt it to the new condition, and the magnitude of that change is the compensation signal.
For a trained acoustic model, the acoustic features generated from scratch, i.e. without the preset template of the technical solution provided in the embodiment of the present invention, can be considered ideal acoustic features, such as acoustic features A2 and B2 shown in FIG. 3. In FIG. 3, the acoustic features of two sentences (slot content in italic underlined font) are generated separately with the trained acoustic model. Both A2 and B2 are acoustic features of the preset template adjacent to the slot, i.e. the adjacent acoustic features of the preceding embodiments. The difference between the two feature segments is a compensation signal, denoted C, where C is the compensation signal that compensates acoustic features B2 to obtain acoustic features A2. To predict C, a compensation model must be trained with acoustic features B2 and linguistic features A1 as input data, as shown in FIG. 4. In the embodiment of the present invention, the compensation model may be trained as follows. The input data are acoustic features B2 and linguistic features A1, where A1 denotes the linguistic features corresponding to the complete first sentence. The output data is the predicted compensation signal C'. The loss function is Mel-SD(A2, A'), the Mel-spectrum distortion between the target acoustic features A2 and the features A' = B2 + C' obtained by compensating B2. Adam is adopted as the optimization method. To make full use of the data resources, the input may also be acoustic features A2 and linguistic features B1, i.e. B2 is obtained by compensating A2, with the loss computed as Mel-SD(B2, B'); thus any two filled sentences of the same template yield two sets of training data.
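A hedged PyTorch sketch of this training objective follows; the network architecture is an illustrative assumption, and plain MSE stands in for the Mel-SD distortion:

```python
import torch
import torch.nn as nn

# Sketch: predict C' from (B2, A1) and minimise the distortion between
# A2 and A' = B2 + C'.
class CompensationModel(nn.Module):
    def __init__(self, feat_dim=80, ling_dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + ling_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, acoustic_b2, linguistic_a1):
        return self.net(torch.cat([acoustic_b2, linguistic_a1], dim=-1))  # C'

model = CompensationModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adam, as stated above

# One training step on dummy tensors (8 frames of adjacent features):
B2, A1, A2 = torch.randn(8, 80), torch.randn(8, 64), torch.randn(8, 80)
C_pred = model(B2, A1)
A_prime = B2 + C_pred                   # compensated features A' = B2 + C'
loss = torch.mean((A2 - A_prime) ** 2)  # stand-in for the Mel-SD loss
opt.zero_grad(); loss.backward(); opt.step()
```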
After training, the compensation model actually predicts C', which is the compensation signal used during real-time online synthesis. Before real-time synthesis, the adjacent acoustic features serve as acoustic features B2; during real-time synthesis, the linguistic features of the received text are generated as linguistic features A1, the compensation signal C' is obtained from the compensation model and added to the template acoustic features B2, giving the compensated output features A', as shown in FIG. 5. Linguistic feature generation converts text into phonetic labels (e.g., phonemes and intonation) and prosodic labels (e.g., accents, pauses). Since the slots may be filled with different content during real-time generation, the complete input text must be used to generate the complete linguistic features in order to guarantee the validity of the slot linguistic features.
Optionally, in the embodiment of the present invention, different methods may be used to generate the slot speech waveform and/or update the adjacent-template-word speech waveform depending on the position of the slot in the preset template. Speech waveform generation converts acoustic features into the output speech signal. Specifically, the following applies. In the preset template library, the speech-waveform hidden state corresponding to the template words of each preset template is obtained in advance, the speech-waveform hidden state being the hidden state of the speech waveform generation network when generating the waveform corresponding to the acoustic features. The acoustic features of the matching preset template's template words and the slot acoustic features are placed together according to the template/slot layout to obtain the overall acoustic features, whose frames are ordered from head to tail. For example, if the template features are i1, i2, i3, i4, …, i20 and the slot features are j1, j2, j3, j4 with the slot at the head of the matching preset template, the overall acoustic features are j1, j2, j3, j4, i1, i2, i3, i4, …, i20, and all frames of the overall acoustic features are then ordered from head to tail. When the slot lies at the tail or in the middle of the matching preset template, the ordering follows the corresponding layout.
When the slot lies at the tail of the matching preset template, generating the slot speech waveform and/or updating the adjacent-template-word speech waveform includes generating the speech samples corresponding to the slot acoustic features and/or the adjacent acoustic features as follows: generate the speech samples z(j+L), z(j+L+1), …, z(j+2L−1) corresponding to the (i+1)-th frame acoustic features y(i+1) with a streaming speech synthesis algorithm, taking y(i+1) and z_j, z(j+1), …, z(j+L−1) as input and s_i' as condition, where z_j, z(j+1), …, z(j+L−1) are the speech samples corresponding to the i-th frame acoustic features y_i, L is the frame length, and s_i' is the speech-waveform hidden state corresponding to the (i+1)-th frame acoustic features; then synthesize the slot speech waveform and/or the updated adjacent-template-word speech waveform from the speech samples corresponding to the slot acoustic features and/or the adjacent acoustic features. When the slot is at the tail of the preset template, the speech samples are generated frame by frame in the forward order of the acoustic features: e.g., from acoustic features y1, y2, …, y_i, …, y(N−1), y_N as input, the speech samples z1, z2, …, z_j, z(j+1), …, z(j+L), …, z_M are generated frame by frame or point by point, where z_j, z(j+1), …, z(j+L−1) correspond to frame y_i, L is the frame length, and M is the complete speech length. A set of hidden states s0', s1', s2', s3', …, s_i', …, s(N−1)', s_N' is maintained during generation, where s0' is the initial state and s_i' is the speech-waveform hidden state corresponding to the (i+1)-th frame, i.e. the hidden state of the speech waveform generation network when generating the waveform for the (i+1)-th frame acoustic features. To generate the waveform corresponding to the (i+1)-th frame, y(i+1) and z_j, z(j+1), …, z(j+L−1) are the input and s_i' the condition, producing z(j+L), z(j+L+1), …, z(j+2L−1). Specifically, in the embodiment of the present invention, the speech samples may be generated frame by frame, in order, from the adjacent acoustic features through to the end of the slot acoustic features. To preserve the natural continuity of the speech, the adjacent acoustic features are adjusted, and when the waveform is generated, the speech waveforms corresponding to the adjacent acoustic features must be regenerated; concretely, sample generation may start from the adjacent acoustic features and proceed frame by frame to the slot acoustic features.
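A hedged Python sketch of this forward, state-conditioned generation loop follows; `step` is a stand-in for one inference step of a streaming vocoder, and all other names are illustrative:

```python
import numpy as np

# Forward frame-by-frame loop: each step takes the next acoustic frame and
# the previous frame's L samples as input, conditioned on the running hidden
# state, and emits the next L samples.
def generate_forward(features, init_samples, init_state, step):
    samples, state = [np.asarray(init_samples)], init_state
    for y in features:                        # y is one acoustic-feature frame
        new_samples, state = step(y, samples[-1], state)
        samples.append(np.asarray(new_samples))
    return np.concatenate(samples[1:]), state

# Toy `step` that just emits L constant samples; a real vocoder replaces this.
def toy_step(y, prev_samples, state):
    return np.full_like(prev_samples, y.mean()), state

L = 4
wav, _ = generate_forward([np.ones(80), 2 * np.ones(80)],
                          np.zeros(L), None, toy_step)
print(wav.shape)  # (8,) -- two frames of L samples each
```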
When the slot lies at the head of the matching preset template, generating the slot speech waveform and/or updating the adjacent-template-word speech waveform includes generating the speech samples corresponding to the slot acoustic features and/or the adjacent acoustic features as follows: generate the speech samples z(j−1), z(j−2), …, z(j−L) corresponding to the (i−1)-th frame acoustic features y(i−1) with a streaming speech synthesis algorithm, taking y(i−1) and z(j+L−1), z(j+L−2), …, z_j as input and s_i'' as condition, where z(j+L−1), z(j+L−2), …, z_j are the speech samples corresponding to the i-th frame acoustic features y_i, L is the frame length, and s_i'' is the speech-waveform hidden state corresponding to the (i−1)-th frame, i.e. the hidden state of the speech waveform generation network when generating the waveform for the (i−1)-th frame acoustic features; then synthesize the slot speech waveform and/or the updated adjacent-template-word speech waveform from the generated samples. When the slot is at the head of the preset template, the speech samples are generated frame by frame in the reverse order of the acoustic features: e.g., from acoustic features y_N, y(N−1), …, y_i, …, y2, y1 as input, the samples z_M, z(M−1), …, z(j+L), z(j+L−1), …, z_j, …, z1 are generated frame by frame or point by point, where z(j+L−1), …, z(j+1), z_j correspond to frame y_i, L is the frame length, and M is the complete speech length. A set of hidden states s(N+1)'', s_N'', s(N−1)'', …, s_i'', …, s3'', s2'', s1'' is maintained during generation, where s(N+1)'' is the initial state and s_i'' is the speech-waveform hidden state corresponding to the (i−1)-th frame. To generate the waveform corresponding to the (i−1)-th frame, y(i−1) and z(j+L−1), …, z_j are the input and s_i'' the condition, producing z(j−1), z(j−2), …, z(j−L). Specifically, in the embodiment of the present invention, the speech samples may be generated frame by frame, in order, from the adjacent acoustic features through to the end of the slot acoustic features. To preserve the natural continuity of the speech, the adjacent acoustic features are adjusted, and the speech waveforms corresponding to them must be regenerated; concretely, sample generation may start from the adjacent acoustic features and proceed frame by frame to the slot acoustic features.
When the slot lies in the middle of the matching preset template, the slot speech waveform is generated and the adjacent-template-word speech waveform updated by producing a spliced speech waveform in which the two are joined together. In this case, the adjacent acoustic features are distributed on both sides of the slot acoustic features; the adjacent and slot acoustic features are placed together as spliced acoustic features, and the spliced speech waveform corresponding to them is generated, specifically as follows. First, generate the speech samples corresponding to the slot acoustic features together with the adjacent acoustic features whose sequence numbers are smaller than those of the slot acoustic features, yielding the first spliced speech samples: generate the samples z(j+L), z(j+L+1), …, z(j+2L−1) corresponding to the (i+1)-th frame acoustic features y(i+1) with a streaming speech synthesis algorithm, taking y(i+1) and z_j, z(j+1), …, z(j+L−1) as input and s_i' as condition, where z_j, z(j+1), …, z(j+L−1) are the samples corresponding to the i-th frame acoustic features y_i, L is the frame length, and s_i' is the speech-waveform hidden state corresponding to the (i+1)-th frame. For example, with template features i1, i2, i3, i4, …, i20, slot features j1, j2, j3, j4, and the slot between i11 and i12, the adjacent acoustic features comprise i11, i10, i9, i8, i7 and i12, i13, i14, i15, i16; the speech samples corresponding to i7, i8, i9, i10, i11, j1, j2, j3, j4 form the first spliced speech samples. Second, generate the speech samples corresponding to the slot acoustic features together with the adjacent acoustic features whose sequence numbers are larger than those of the slot acoustic features, yielding the second spliced speech samples: generate the samples z(j−1), z(j−2), …, z(j−L) corresponding to the (i−1)-th frame acoustic features y(i−1) with a streaming speech synthesis algorithm, taking y(i−1) and z(j+L−1), z(j+L−2), …, z_j as input and s_i'' as condition, where s_i'' is the speech-waveform hidden state corresponding to the (i−1)-th frame. In the same example, the speech samples corresponding to j1, j2, j3, j4, i12, i13, i14, i15, i16 form the second spliced speech samples.
The spliced speech samples corresponding to the slot acoustic features and the adjacent acoustic features are then generated from the first spliced speech samples, the second spliced speech samples, the weights corresponding to the first spliced speech samples, and the weights corresponding to the second spliced speech samples. Specifically, each weight is a weight sequence: the first spliced speech samples are multiplied element-wise by their weight sequence, the second spliced speech samples are multiplied element-wise by their weight sequence, and the two products are added to obtain the spliced speech samples. The two weight sequences contain the same number of elements, equal to the number of speech samples in the first (or second) spliced speech samples. The spliced speech waveform is then synthesized from the spliced speech samples. In effect, when the slot is located in the middle of the preset template, the situation decomposes into the two cases described above: the slot at the tail of the preset template and the slot at the head of the preset template. For the adjacent acoustic features that precede the slot acoustic features in the ordering, the first spliced speech samples are generated by the in-order method described above; for the adjacent acoustic features that follow the slot acoustic features, the second spliced speech samples are generated by the reverse-order method. The final spliced speech samples, in which the adjacent acoustic features and the slot acoustic features are joined, are obtained by weighted summation. For example, if the length of the first (or second) spliced speech samples is N, the weight sequence for the first spliced speech samples may fade out (its elements decreasing from 1 to 0) while the weight sequence for the second spliced speech samples fades in (its elements increasing from 0 to 1), so that each pair of weights sums to 1.
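As a sketch only: the exact weight sequences are given in the source as formulas that are not reproduced here, so the linear cross-fade below is an assumption consistent with the weighted-summation description:

```python
import numpy as np

def crossfade(first, second):
    """Weighted sum of the first and second spliced speech samples.

    A linear cross-fade is one plausible choice of weight sequences
    (an assumption, not the patent's exact formulas): the weights for
    `first` fall from 1 to 0 while those for `second` rise from 0 to 1,
    so each pair of weights sums to 1.
    """
    first = np.asarray(first, dtype=np.float32)
    second = np.asarray(second, dtype=np.float32)
    assert first.shape == second.shape, "overlap regions must align"
    n = len(first)
    w2 = np.linspace(0.0, 1.0, n, dtype=np.float32)   # fade-in weights
    w1 = 1.0 - w2                                     # fade-out weights
    return w1 * first + w2 * second
```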
Acoustic feature generation converts linguistic features into acoustic features (e.g., mel-frequency spectral coefficients). In the embodiment of the present invention, the acoustic features corresponding to the template text of each preset template are generated in advance, and both the generated acoustic features and the hidden states produced while generating them are stored for later use: the acoustic features and hidden states of the template text are required when generating the slot acoustic features. Optionally, in the embodiment of the present invention, the slot acoustic features may be generated by different methods according to the position of the slot in the preset template. In the preset template library, the acoustic features and the acoustic feature hidden states corresponding to the template text of each preset template are acquired in advance, the acoustic feature hidden states being the hidden states present when the acoustic features corresponding to the template text were generated. The acoustic features corresponding to the template text of the matched preset template and the slot acoustic features are put together, according to the positional relationship between the matched preset template and the slot, to obtain the overall acoustic features, and the acoustic features included in the overall acoustic features are ordered from head to tail; for an introduction to this ordering, reference may be made to the relevant contents described in the above embodiments.
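For illustration, a minimal sketch of this offline precomputation follows; the interface acoustic_model.generate_with_states and the cache layout are assumptions, since the embodiment only specifies that the features and hidden states be stored together:

```python
# Offline precomputation of template acoustic features and hidden states.
# `acoustic_model.generate_with_states` is a hypothetical interface that
# returns per-frame features along with the per-frame decoder states.
template_cache = {}

def precompute_template(template_id, template_text, acoustic_model):
    feats, states = acoustic_model.generate_with_states(template_text)
    template_cache[template_id] = {"feats": feats, "states": states}

def lookup_template(template_id):
    # At synthesis time, the cached features/states seed slot generation.
    return template_cache[template_id]
```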
When the slot is at the tail of the matched preset template — for example, in the sample "Today is the lunar Chou Year of the Ox, [April seventeenth]", where "Today is the lunar Chou Year of the Ox" is the template text and "April seventeenth" is the slot text — the acoustic features are generated frame by frame, in the order of their ranking, based on the streaming speech synthesis algorithm. Taking the linguistic features X of a complete piece of text as input, the acoustic features y_1, y_2, …, y_i, …, y_{N-1}, y_N are generated frame by frame from front to back along the text. A set of hidden states s_0, s_1, s_2, …, s_i, …, s_{N-1}, s_N is maintained during generation, where s_0 is the initial state, each hidden state being a hidden state of the acoustic feature generation network at the time the corresponding acoustic feature is generated. Specifically, generating the slot acoustic features includes generating as follows: the (i+1)-th frame acoustic feature y_{i+1} is generated, for the streaming speech synthesis algorithm, with Enc(X) and y_i as input, conditioned on s_i, where Enc() is a linguistic feature encoding network (e.g., the CBHG-based encoder of Tacotron or the Transformer-based encoder of FastSpeech), s_i is the acoustic feature hidden state corresponding to the i-th frame acoustic feature, i.e., the hidden state of the acoustic feature generation network when generating the i-th frame acoustic feature, y_i is the i-th frame acoustic feature, and X is the linguistic feature corresponding to the received text. In addition, in the embodiment of the present invention, the adjacent acoustic features may be compensated; when they are compensated, the acoustic features corresponding to the received text may be obtained as shown in fig. 6, where the acoustic feature compensation signal generated at the adjacent portion between the template and the slot in fig. 6 is the compensation signal used to compensate the adjacent acoustic features.
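A minimal sketch of the tail case, assuming a hypothetical one-step streaming decoder decoder_step implementing y_{i+1} = f(Enc(X), y_i; s_i), seeded by the cached last template frame and its hidden state:

```python
def generate_slot_feats_tail(enc_x, decoder_step, y_last, s_last, n_frames):
    """Continue acoustic-feature generation past the cached template frames.

    decoder_step(enc_x, y_i, s_i) -> (y_next, s_next) is a hypothetical
    one-frame step of a streaming autoregressive decoder; enc_x = Enc(X)
    is the encoded linguistic features of the full received text.
    """
    feats = []
    y, s = y_last, s_last            # last cached template frame and state
    for _ in range(n_frames):
        y, s = decoder_step(enc_x, y, s)
        feats.append(y)
    return feats
```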
When the slot is located at the head of the matched preset template — for example, in the sample "[Jiangnan] — a one-minute trial version has been sent to you; purchase in the small housekeeper app to enjoy the full song", where "Jiangnan" (south of the Yangtze River) is the slot text and the remainder is the template text — generation differs from the mainstream speech synthesis schemes in that the acoustic features are generated in reverse order: the acoustic features of the character "nan" are generated first, and then those of the character "jiang". When the slot is located at the head of the preset template, the acoustic features are generated frame by frame, in the reverse of their ordering, based on the streaming speech synthesis algorithm. Taking the linguistic features X of a complete piece of text as input, the acoustic features y_N, y_{N-1}, …, y_j, …, y_2, y_1 are generated frame by frame from back to front. A set of hidden states s_{N+1}, s_N, s_{N-1}, …, s_j, …, s_2, s_1 is maintained during generation, where s_{N+1} is the initial state, each hidden state being the hidden state present when the corresponding acoustic feature is generated. When generating the (j-1)-th frame acoustic feature, Enc(X) and y_j are taken as input and y_{j-1} is generated conditioned on s_j, where Enc() is the linguistic feature encoding network. Specifically, when the slot is located at the head of the preset template, generating the slot acoustic features includes generating as follows: the (j-1)-th frame acoustic feature y_{j-1} is generated, for the streaming speech synthesis algorithm, with Enc(X) and y_j as input, conditioned on s_j, where Enc() is a linguistic feature encoding network, s_j is the acoustic feature hidden state corresponding to the (j-1)-th frame acoustic feature, i.e., the hidden state of the acoustic feature generation network when generating the (j-1)-th frame acoustic feature, y_j is the j-th frame acoustic feature, and X is the linguistic feature corresponding to the received text. In addition, in the embodiment of the present invention, the adjacent acoustic features may be compensated (for example, by the compensation method provided in the embodiment of the present invention); the slot acoustic features and the adjacent acoustic features may be understood with reference to fig. 7.
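The head case mirrors the tail case; a minimal sketch, assuming a hypothetical reverse-order decoder step rev_decoder_step trained to generate frames back to front:

```python
def generate_slot_feats_head(enc_x, rev_decoder_step, y_first, s_first,
                             n_frames):
    """Generate slot features back to front, seeded by the first cached
    template frame y_j and its hidden state s_j.

    rev_decoder_step(enc_x, y_j, s_j) -> (y_prev, s_prev) is a hypothetical
    one-frame step of a decoder trained for reverse-order generation.
    """
    feats = []
    y, s = y_first, s_first
    for _ in range(n_frames):
        y, s = rev_decoder_step(enc_x, y, s)
        feats.append(y)
    feats.reverse()                  # return in chronological order
    return feats
```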
When the slot is located in the middle of the matched preset template, the situation differs from the head and tail cases in that the slot portion is influenced by the preset template portions on both of its sides. Specifically, in the embodiment of the present invention, the acoustic features may be generated frame by frame, in the order of their ranking, based on the streaming speech synthesis algorithm. Generating the slot acoustic features includes generating as follows: the (i+1)-th frame acoustic feature y_{i+1} is generated, for the streaming speech synthesis algorithm, with Enc(X), y_i, y_n, and Enc_d(n-i-1) as input, conditioned on s_i, where Enc() is a linguistic feature encoding network, Enc_d() is a distance encoding function, s_i is the acoustic feature hidden state used when generating the (i+1)-th frame acoustic feature, y_i is the i-th frame acoustic feature, y_n is the n-th frame acoustic feature — the adjacent acoustic feature whose sequence number is greater than those of the slot acoustic features and which is closest to them — and X is the linguistic feature corresponding to the received text. For example, the adjacent acoustic features include y_{a-l+1}, …, y_a, y_b, y_{b+1}, …, y_{b+l-1}, where l is the adjustment window length (i.e., the preset value in the embodiment of the present invention). The slot acoustic features lie between y_a and y_b, and y_b is the acoustic feature whose sequence number is greater than those of the slot acoustic features and which is closest to them in the ordering of the acoustic features.
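A minimal sketch of the middle case, assuming a hypothetical step function bridge_step that accepts the extra conditioning inputs y_n and Enc_d(n-i-1):

```python
def generate_slot_feats_middle(enc_x, dist_enc, bridge_step,
                               y_left, s_left, y_right, n_frames):
    """Generate middle-slot features that must rejoin the template audio.

    bridge_step(enc_x, y_i, y_n, d_code, s_i) -> (y_next, s_next) is a
    hypothetical one-frame decoder step conditioned on the nearest
    right-context template frame y_n and a distance code Enc_d(n-i-1),
    so each new frame 'knows' how far away the rejoin point is.
    """
    feats = []
    y, s = y_left, s_left                # last left-context frame and state
    for k in range(n_frames):
        # Distance from the frame being generated to y_right: n - i - 1.
        d_code = dist_enc(n_frames - k)
        y, s = bridge_step(enc_x, y, y_right, d_code, s)
        feats.append(y)
    return feats
```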
In summary, in the embodiment of the present invention, the computation of the TTS algorithm is reduced at the level of the algorithm structure, which reduces the response delay of the intelligent device terminal; and the task processing time is reduced by exploiting the characteristics of the service data and the algorithm strategy, which reduces the computing resources required and thus the service cost.
Yet another aspect of the embodiments of the present invention provides a machine-readable storage medium on which a program is stored, the program implementing the method described in the above embodiments when executed by a processor.
In another aspect of the embodiments of the present invention, a processor is further provided, where the processor is configured to run a program which, when executed, performs the method described in the foregoing embodiments.
In another aspect, an apparatus is further provided, where the apparatus includes a processor, a memory, and a program stored in the memory and executable on the processor, and the processor executes the program to implement the method described in the foregoing embodiments. The device herein may be a server, a PC, a PAD, a mobile phone, etc.
Yet another aspect of an embodiment of the present invention provides a computer program product including a computer program/instructions, which when executed by a processor, implement the method described in the above embodiment.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.
Claims (13)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111326772.0A CN114049874B (en) | 2021-11-10 | 2021-11-10 | Methods for synthesizing speech |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114049874A true CN114049874A (en) | 2022-02-15 |
| CN114049874B CN114049874B (en) | 2025-07-29 |
Family
ID=80208045
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202111326772.0A Active CN114049874B (en) | 2021-11-10 | 2021-11-10 | Methods for synthesizing speech |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114049874B (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114842827A (en) * | 2022-04-28 | 2022-08-02 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio synthesis method, electronic equipment and readable storage medium |
Citations (13)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2000339314A (en) * | 1999-05-25 | 2000-12-08 | Nippon Telegr & Teleph Corp <Ntt> | Automatic response method, dialogue analysis method, response sentence generation method, device thereof, and medium recording the program |
| US6438522B1 (en) * | 1998-11-30 | 2002-08-20 | Matsushita Electric Industrial Co., Ltd. | Method and apparatus for speech synthesis whereby waveform segments expressing respective syllables of a speech item are modified in accordance with rhythm, pitch and speech power patterns expressed by a prosodic template |
| US6496801B1 (en) * | 1999-11-02 | 2002-12-17 | Matsushita Electric Industrial Co., Ltd. | Speech synthesis employing concatenated prosodic and acoustic templates for phrases of multiple words |
| JP2003241795A (en) * | 2002-02-18 | 2003-08-29 | Hitachi Ltd | Information acquisition method and information acquisition system using voice input |
| CN1945691A (en) * | 2006-10-16 | 2007-04-11 | 安徽中科大讯飞信息科技有限公司 | Method for improving template sentence synthetic effect in voice synthetic system |
| CN107340991A (en) * | 2017-07-18 | 2017-11-10 | 百度在线网络技术(北京)有限公司 | Switching method, device, equipment and the storage medium of speech roles |
| CN107886948A (en) * | 2017-11-16 | 2018-04-06 | 百度在线网络技术(北京)有限公司 | Voice interactive method and device, terminal, server and readable storage medium storing program for executing |
| CN110517662A (en) * | 2019-07-12 | 2019-11-29 | 云知声智能科技股份有限公司 | A kind of method and system of Intelligent voice broadcasting |
| CN111128121A (en) * | 2019-12-20 | 2020-05-08 | 贝壳技术有限公司 | Voice information generation method and device, electronic device and storage medium |
| CN111583904A (en) * | 2020-05-13 | 2020-08-25 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
| US20210027788A1 (en) * | 2019-07-23 | 2021-01-28 | Baidu Online Network Technology (Beijing) Co., Ltd. | Conversation interaction method, apparatus and computer readable storage medium |
| US20210174781A1 (en) * | 2019-01-17 | 2021-06-10 | Ping An Technology (Shenzhen) Co., Ltd. | Text-based speech synthesis method, computer device, and non-transitory computer-readable storage medium |
| CN113345415A (en) * | 2021-06-01 | 2021-09-03 | 平安科技(深圳)有限公司 | Speech synthesis method, apparatus, device and storage medium |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11990118B2 (en) | Text-to-speech (TTS) processing | |
| CN113345415B (en) | Speech synthesis method, device, equipment and storage medium | |
| JP7464621B2 (en) | Speech synthesis method, device, and computer-readable storage medium | |
| US7567896B2 (en) | Corpus-based speech synthesis based on segment recombination | |
| EP4158619B1 (en) | Phrase-based end-to-end text-to-speech (tts) synthesis | |
| US9905220B2 (en) | Multilingual prosody generation | |
| US11763797B2 (en) | Text-to-speech (TTS) processing | |
| US10692484B1 (en) | Text-to-speech (TTS) processing | |
| JP2002530703A (en) | Speech synthesis using concatenation of speech waveforms | |
| JP2007249212A (en) | Method, computer program and processor for text speech synthesis | |
| US8626510B2 (en) | Speech synthesizing device, computer program product, and method | |
| MXPA01006594A (en) | Method and system for preselection of suitable units for concatenative speech. | |
| CN102822889B (en) | Pre-saved data compression for tts concatenation cost | |
| Fahmy et al. | A transfer learning end-to-end arabic text-to-speech (tts) deep architecture | |
| US20220189455A1 (en) | Method and system for synthesizing cross-lingual speech | |
| CN114974218A (en) | Voice conversion model training method and device and voice conversion method and device | |
| CN114049874B (en) | Methods for synthesizing speech | |
| Mei et al. | A particular character speech synthesis system based on deep learning | |
| JP5874639B2 (en) | Speech synthesis apparatus, speech synthesis method, and speech synthesis program | |
| CN119207374B (en) | Method and system for converting text into voice efficiently | |
| CN120412544B (en) | A prosody-controllable speech synthesis method and related device based on VITS | |
| CN114974208B (en) | A Chinese speech synthesis method, device, electronic device, and storage medium | |
| Louw | Text-to-speech duration models for resource-scarce languages in neural architectures | |
| EP1501075B1 (en) | Speech synthesis using concatenation of speech waveforms | |
| JP5449022B2 (en) | Speech segment database creation device, alternative speech model creation device, speech segment database creation method, alternative speech model creation method, program |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | TA01 | Transfer of patent application right | Effective date of registration: 20220426. Address after: 100085 Floor 101, 102-1, Building 35, Yard 2, Xierqi West Road, Haidian District, Beijing. Applicant after: Seashell Housing (Beijing) Technology Co.,Ltd. Address before: 101309 Room 24, 62 Farm Road, Erjie Village, Yangzhen, Shunyi District, Beijing. Applicant before: Beijing fangjianghu Technology Co.,Ltd. |
| | GR01 | Patent grant | |