
CN114049874A - Method for synthesizing speech - Google Patents

Method for synthesizing speech

Info

Publication number
CN114049874A
CN114049874A
Authority
CN
China
Prior art keywords
template
text
acoustic feature
slot
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111326772.0A
Other languages
Chinese (zh)
Other versions
CN114049874B (en)
Inventor
Tan Xingjun (谭兴军)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Seashell Housing Beijing Technology Co Ltd
Original Assignee
Beijing Fangjianghu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Fangjianghu Technology Co Ltd
Priority to CN202111326772.0A
Publication of CN114049874A
Application granted
Publication of CN114049874B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

An embodiment of the invention provides a method for synthesizing speech, in the field of artificial intelligence. The method comprises: finding, in a preset template library, a matching preset template for the received text, where each preset template in the library comprises template text and slots, and the speech waveform corresponding to the template text is obtained in advance; obtaining, from the received text, the slot content corresponding to the slots of the matching preset template; generating slot acoustic features corresponding to the slot content; generating a slot speech waveform corresponding to the slot acoustic features; and splicing the slot speech waveform with the speech waveform corresponding to the template text of the matching preset template to obtain the speech waveform corresponding to the received text, thereby synthesizing speech. This reduces the real-time computation required for speech synthesis, lowers the response delay of the smart-device terminal, and improves the user experience.

Description

Method for synthesizing speech
Technical Field
Embodiments of the present invention relate to a method for synthesizing speech.
Background
With the rapid iteration of computer performance and the wide application of deep learning, intelligent speech interaction technology has developed at an unprecedented pace. Today, human-computer interfaces are shifting from touch to speech. The purpose of text-to-speech (TTS) technology is to let computers and all kinds of smart devices speak like a person. Driven by a steady stream of new machine-learning algorithms, TTS can now generate speech that closely approximates, and is at times indistinguishable from, real human speech.
Because mainstream deep-learning algorithms depend on substantial computing power, smart-device manufacturers and technology service providers today deploy this part in computing centers and provide intelligent speech services through cloud computing. For TTS, however, this approach has two significant drawbacks: 1) owing to algorithm complexity, network delay, and other factors, the response delay of the smart-device terminal is large, which hurts the user experience; 2) because user requests are unevenly distributed over time (e.g., request peaks in the morning and evening, troughs in the early hours), vendors must provision oversaturated computing equipment to handle peak load, which wastes computing resources and increases costs.
Disclosure of Invention
To at least partially solve the above problems, one aspect of the embodiments of the present invention provides a method for synthesizing speech, the method comprising: finding, in a preset template library, a matching preset template for the received text, where each preset template in the library comprises template text and slots, and the speech waveform corresponding to the template text is obtained in advance; obtaining, from the received text, the slot content corresponding to the slots of the matching preset template; generating slot acoustic features corresponding to the slot content; generating a slot speech waveform corresponding to the slot acoustic features; and splicing the slot speech waveform with the speech waveform corresponding to the template text of the matching preset template to obtain the speech waveform corresponding to the received text, thereby synthesizing speech.
Another aspect of the embodiments of the present invention provides a machine-readable storage medium storing instructions that cause a machine to execute the above method.
A further aspect of the embodiments of the present invention provides a computer program product comprising a computer program/instructions which, when executed by a processor, implement the steps of the above method.
With the above technical solution, speech is synthesized using preset templates. The speech waveform corresponding to the template text of a preset template is obtained in advance; at synthesis time, only the speech waveform for the slot content filled into the template's slots needs to be generated in real time, and the template-text waveform and the slot-content waveform are spliced together to synthesize the speech. The waveform for the entire received text does not have to be generated in real time, so the real-time computation required for speech synthesis is reduced, the response delay of the smart-device terminal is lowered, and the user experience improves. Moreover, because synthesis relies on preset templates, the processing time of a single task is shortened, and the computational load at user request peaks can be borne without provisioning oversaturated computing equipment, reducing wasted computing resources and cost.
Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the embodiments of the invention without limiting the embodiments of the invention. In the drawings:
FIG. 1 is a flow chart of a method for synthesizing speech provided by an aspect of an embodiment of the present invention;
FIG. 2 is a schematic diagram of matching a preset template provided by another embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a compensation signal provided by another embodiment of the present invention;
FIG. 4 is a schematic diagram of the logic of a training compensation model according to another embodiment of the present invention;
FIG. 5 is a schematic diagram of compensation logic provided in accordance with another embodiment of the present invention;
FIG. 6 is a schematic illustration of compensation for adjacent acoustic features provided by another embodiment of the present invention; and
FIG. 7 is a schematic illustration of slot acoustic feature generation and compensation of adjacent acoustic features provided by another embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating embodiments of the invention, are given by way of illustration and explanation only, not limitation.
An intelligent speech interaction system contains a natural language generation (NLG) module, which produces the text content of the response to the user, and a TTS module, which accepts that text and converts it into speech output for the user. One mainstream NLG design is to build dialogue templates containing various slots; filling the slots with different values generates different response texts. For example, a dialogue template might be: "The time now is [ ] o'clock [ ] minutes in the morning." This template contains two slots, and filling them with different content generates a family of response texts answering the current time, e.g.: "The time now is 6 o'clock 58 minutes in the morning"; "The time now is 7 o'clock 45 minutes in the morning"; "The time now is 8 o'clock 20 minutes in the morning". Statistics show that the structured response texts generated from dialogue templates occur with very high frequency in real intelligent speech interaction systems, especially during the morning and evening request peaks. The design idea of the technical solution provided by the embodiments of the invention is to prepare preset templates in advance for these high-frequency dialogue templates, and to generate only the speech corresponding to the slot content when synthesizing speech online in real time. Continuing the example, "The time now is [ ] o'clock [ ] minutes in the morning." is prepared in advance as a preset template, where [ ] denotes a slot. During real-time synthesis, only the digital speech segments such as "6" and "58" are generated and spliced into the template as the final TTS output. The TTS algorithm comprises three functional modules: a linguistic feature generation module, an acoustic feature generation module, and a speech waveform generation module. Linguistic features are generated from the text, acoustic features from the linguistic features, and speech waveforms from the acoustic features, yielding the speech signal.
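As a rough illustration of this decomposition, the following Python sketch shows the template fast path: only the slot content passes through the three modules in real time, while the template-text waveforms are precomputed. The stage functions are zero-filled stand-ins, and their names, shapes, and the segment/slot layout are illustrative assumptions, not taken from this disclosure.

```python
import numpy as np

# Stand-ins for the three TTS modules (shapes and names are assumptions).
def linguistic_features(text: str) -> np.ndarray:       # text -> label vectors
    return np.zeros((len(text), 32))

def acoustic_features(ling: np.ndarray) -> np.ndarray:  # labels -> mel frames
    return np.zeros((len(ling), 80))

def waveform(mel: np.ndarray) -> np.ndarray:            # mel frames -> samples
    return np.zeros(len(mel) * 256)

def synthesize_with_template(template_waves: list[np.ndarray],
                             slot_contents: list[str]) -> np.ndarray:
    # template_waves: precomputed waveforms of the template-text segments, in
    # order, with one slot between consecutive segments. Only the slot content
    # passes through the three modules in real time.
    slot_waves = [waveform(acoustic_features(linguistic_features(c)))
                  for c in slot_contents]
    pieces = []
    for k, tw in enumerate(template_waves):
        pieces.append(tw)
        if k < len(slot_waves):
            pieces.append(slot_waves[k])                # splice slot speech in
    return np.concatenate(pieces)
```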
One aspect of embodiments of the present invention provides a method for synthesizing speech. Fig. 1 is a flowchart of a method for synthesizing speech according to an embodiment of the present invention. As shown in fig. 1, the method includes the following.
In step S10, a matching preset template for the received text is found in the preset template library. In the library, each preset template comprises template text and slots, and the speech waveform corresponding to the template text is obtained in advance. The received text is the text to be speech-synthesized. In a preset template, the slots are not filled with characters; a preset template may contain one or more slots. The preset template in the library that matches the received text is the matching preset template. For example, each preset template in the library has a template ID; when template information can be obtained from the NLG module, the template ID of the preset template matching the received text can be obtained directly, and the matching preset template can be found in the library by that ID. Furthermore, in the embodiments of the invention, the library may also store, obtained in advance, the acoustic features and/or the acoustic-feature hidden state and/or the speech-waveform hidden state corresponding to the template text, where the acoustic-feature hidden state is the hidden state of the acoustic-feature generation network when the acoustic features were generated, and the speech-waveform hidden state is the hidden state of the speech-waveform generation network when the waveform corresponding to the acoustic features was generated. Speech is a typical time-series signal, and when processing time-series signals with neural networks, the network is usually required to "remember" historical information, such as the cell state c and hidden state h in an LSTM. Hidden states are not the same as the model parameters of the network: model parameters are typically fixed weights, while hidden states are state parameters and/or intermediate values that change over time during inference. Taking LSTM as an example again, the hidden state at time t can be written as [c_t, h_t]. A real model may have multiple LSTM layers, each with multiple LSTM units, so in practice the hidden state is a sequence of matrices. Suppose that, while the acoustic-feature generation network produces the acoustic features of a speech template, the last frame of acoustic features of the template text is generated at time t and the first frame of acoustic features of the slot text is generated at time t+1; then the hidden state [c_t, h_t] at the end of time t is exported and stored as part of the speech template. Similarly, the corresponding hidden state of the speech-waveform generation network can be exported and stored as part of the speech template.
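For illustration, the following sketch (assuming a PyTorch LSTM; the network, sizes, and file name are hypothetical, not the disclosure's actual model) shows how the hidden state at the template/slot boundary could be exported when the template is made, and restored at run time so slot frames continue from the saved inference context.

```python
import torch

# Hypothetical two-layer LSTM standing in for a recurrent generation network.
lstm = torch.nn.LSTM(input_size=80, hidden_size=256, num_layers=2,
                     batch_first=True)

template_frames = torch.randn(1, 40, 80)       # stand-in for template-text input
_, (h, c) = lstm(template_frames)              # [c_t, h_t] after the last template frame
torch.save({"h": h, "c": c}, "template_0001.state")   # stored with the preset template

# Online, real-time synthesis resumes from the saved state:
state = torch.load("template_0001.state")
slot_frames = torch.randn(1, 8, 80)            # stand-in for slot-text input
slot_out, _ = lstm(slot_frames, (state["h"], state["c"]))
```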
In step S11, the slot content corresponding to the slots of the matching preset template is obtained from the received text, for example by comparing the received text with the matching preset template.
In step S12, slot acoustic features corresponding to the slot content are generated, for example according to a streaming speech synthesis algorithm.
In step S13, a slot speech waveform corresponding to the slot acoustic features is generated, for example according to a streaming speech synthesis algorithm.
In step S14, the slot speech waveform and the speech waveform corresponding to the template text of the matching preset template are spliced together to obtain the speech waveform corresponding to the received text, thereby synthesizing the speech.
With the above technical solution, speech is synthesized using preset templates. The speech waveform corresponding to the template text of a preset template is obtained in advance; at synthesis time, only the speech waveform for the slot content filled into the template's slots needs to be generated in real time, and the template-text waveform and the slot-content waveform are spliced together to synthesize the speech. The waveform for the entire received text does not have to be generated in real time, so the real-time computation required for speech synthesis is reduced, the response delay of the smart-device terminal is lowered, and the user experience improves. Moreover, because synthesis relies on preset templates, the processing time of a single task is shortened, and the computational load at user request peaks can be borne without provisioning oversaturated computing equipment, reducing wasted computing resources and cost.
Optionally, in the embodiments of the invention, creating the preset template library involves two aspects: screening the preset-template texts and making the preset templates. The screening may proceed as follows: 1) obtain all designed dialogue templates from the NLG module; 2) delete dialogue templates with sparse text content, e.g., templates in which, after the template text is segmented by slots, the average segment length is less than two words; 3) delete fixed phrases, i.e., templates with zero slots; 4) use the remaining dialogue templates as the preset templates of the present proposal. For making the preset templates, a simple method is to record the preset template in advance as audio. However, that method is limited to a few specific timbres and cannot be extended to more; it suits only a small number of premade dialogue templates and cannot be extended to dialogue templates and high-frequency texts added later. In the embodiments of the invention, making a preset template includes the following steps. 1) Fill the slots of the dialogue template with example values, completing the template into a full, fluent text. 2) Synthesize the completed text into speech using a streaming speech synthesis algorithm. 3) Record certain intermediate values of the synthesis, such as the acoustic features of the template text, the hidden state during acoustic prediction (the hidden state of the acoustic-feature generation network when the acoustic features are generated), the hidden state of the vocoder when generating the waveform (the hidden state of the speech-waveform generation network when the waveform is generated from the acoustic features), and the speech waveform of the template text, so that the inference context (i.e., the input data, hidden states, and other variables, plus the necessary computing resources, used when predicting the slot speech) can be restored quickly when slot speech is generated online in real time. 4) From the outputs of steps 2)-3) (i.e., the acoustic features, speech waveform, and hidden states), delete the parts corresponding to the slot content. 5) Record the template ID, the template text, the slot description information, and the results of step 4) as a preset template. The slot description information includes the number and order of the slots and the offset of each slot in the template's text, acoustic features, and waveform data. For example, in "The time now is [hour] o'clock [minute] minutes in the morning", the offset of [hour] is 7 and the offset of [minute] is 8 (character offsets in the original Chinese template). The preset template library is built from this content.
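One possible layout for such a preset-template record is sketched below; all field names are assumptions, since the disclosure specifies only what information is stored, not its format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PresetTemplate:
    template_id: int
    template_text: str              # e.g. "The time now is [ ] o'clock [ ] minutes in the morning."
    slot_offsets: list[int]         # offset of each slot, in order, in the text,
                                    # acoustic features, and waveform data
    acoustic_features: np.ndarray   # template-text frames, slot parts deleted
    waveform: np.ndarray            # template-text speech, slot parts deleted
    acoustic_hidden_state: object   # hidden state of the acoustic-feature network
    waveform_hidden_state: object   # hidden state of the waveform (vocoder) network
```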
Optionally, in the embodiments of the invention, the matching preset template for the received text may be found in the preset template library as follows. The following matching method suits the case where template information cannot be obtained from the NLG. In most speech interaction systems, information exchange between TTS and NLG is inconvenient, so the most appropriate template must be matched from the preset template library using the text received by TTS.
All characters of the received text are ordered from first to last. For example, if the received text contains m characters, the ordered characters are denoted c_1, c_2, …, c_m, where c_i is the i-th character. For each preset template in the library, the following matching operation is performed, so that every preset template is matched against the received text and the matching preset template is determined. The template text of the preset template is segmented by its slots, yielding segmented template texts. The segmented template texts are ordered by their positions in the preset template from first to last. For example, if the template text of a preset template comprises n segments, they are denoted t_1, t_2, …, t_n, where t_i is the i-th segmented template text. The slots of the preset template may likewise be ordered from first to last; e.g., n-1 slots are denoted s_1, s_2, …, s_{n-1}, where s_i is the i-th slot. This can be understood with reference to Table 1. The segmented template texts are looked up in the received text sequentially, in template-text order; characters of the received text that have already matched a segmented template text are no longer used in subsequent lookups. A lookup means finding, in the received text, a character string identical to the segmented template text. The segments are looked up in order: first t_1, then t_2, and so on. Once some characters of the received text have matched a segmented template text, they are not reused: for example, if characters c_1, c_2, c_3, c_4 match segment t_1, then c_1-c_4 are no longer used for looking up t_2, …, t_n, and the lookup of t_2 starts from c_5. For each segmented template text, the lookup seeks an identical character string in the received text. For example, as shown in Table 1, the segment t_1 = "The time now is … in the morning" is 7 characters long (in the original Chinese), so the string found in the received text must be exactly those 7 characters.
If all segmented template texts are found in the received text, whether the preset template successfully matches the received text is judged according to a preset matching condition. If the preset template matches successfully, whether to set it as the optimal template is decided by the optimal-template rule; after the matching operation has been performed for all preset templates in the library, the optimal template is the matching preset template. If, after all preset templates have been processed, no optimal template exists, then no matching preset template was found in the library. In the received text, the characters not covered by the template text of the matching preset template are the characters corresponding to the slots. The slots and segmented template texts of the matching preset template can also be ordered together, and the slot content of each slot determined by comparing this order against the received text.
TABLE 1
(Table 1 is reproduced only as an image in the original publication; it illustrates aligning the segmented template texts t_1, t_2, …, t_n and the slots s_1, s_2, …, s_{n-1} against the characters c_1, c_2, …, c_m of the received text.)
Optionally, in the embodiments of the invention, sequentially looking up the segmented template texts in the received text proceeds as follows. The first segmented template text, i.e., the segment ordered first in the preset template, is looked up first. (For segments t_1, t_2, …, t_n, t_1 is the first segmented template text.) If no slot precedes the first segment, the first segment is compared with the character string of the received text that starts at the first character and has the same character length as the segment, where the first character is the character ordered first in the received text. For example, if t_1 has no slot at its head and contains 7 characters, then t_1 may be compared only with the string c_1-c_7; c_1 is the first character. If a slot does precede the first segment, the first segment may be compared with any character string of the received text whose characters are consecutive and whose length equals the segment's; that is, the string against which the comparison is made is not restricted. If the first segment is found in the received text, a matching position is set for the second segmented template text, based on the character length of the first segment and the ordinal of the initial character of the string it matched in the received text; the second segmented template text is the segment following the first in template-text order, and the matching position is the ordinal of the character from which the lookup of the next segment starts. For example, for segments t_1, t_2, …, t_n, t_2 is the second segment. Continuing the example above: the string matching t_1 is c_1-c_7, the length of t_1 is 7, and the initial character c_1 of the comparison has ordinal 1, so the matching position for t_2 is 8; i.e., the comparison for t_2 starts from c_8.
The second segmented template text is looked up in the received text from the matching position onward. If it is found, the matching position is updated from the character length of the second segment and the ordinal of the initial character of the string it matched (analogously to how the matching position was set for the second segment), and the remaining segments beyond the first and second are looked up sequentially in the received text in the same way, until all segments of the preset template have been looked up. If, at any point in this process, any segment among the first, the second, or those after them cannot be found, the preset template fails to match the received text, and the sequential lookup of segments for this template ends. For example, if no string matching t_2 is found, the lookup for this preset template terminates and the template fails to match the received text.
Optionally, in the embodiments of the invention, the preset matching condition is as follows. If all segmented template texts are found in the received text, the maximum ordinal of the characters of the received text is greater than or equal to the last matching position corresponding to the last segmented template text, and no slot follows the last segment, then the preset template does not successfully match the received text. Here the last segmented template text is the segment ordered last in the preset template, and the last matching position is the matching position computed from the character length of the last segment and the ordinal of the initial character of the string it matched. For example, suppose segments t_1, …, t_5 are all found in text c_1, …, c_9, so the maximum ordinal is 9; the last segment t_5 has no trailing slot and contains 2 characters, and the initial character of the string matching t_5 has ordinal 6, so the last matching position is 8. Since the maximum ordinal 9 is greater than the last matching position 8, redundant characters exist and the match fails. Conversely, if all segments are found and the maximum ordinal is smaller than the last matching position, and/or a slot follows the last segment, the preset template successfully matches the received text.
Optionally, in the embodiments of the invention, the optimal-template rule is as follows. If no optimal template has been set yet, the currently matched preset template is set as the optimal template. Otherwise, if the preset template contains more characters than the current optimal template, it replaces the optimal template; if it contains the same number of characters but fewer slots, it also replaces the optimal template. If it contains fewer characters than the optimal template, or the same number of characters but at least as many slots, the optimal template is left unchanged.
Looking up a segmented template text in the received text means matching it against the characters of the received text: a match is found if an identical character string exists, otherwise not. Taking segments t_1, t_2, …, t_n and characters c_1, c_2, …, c_m as an example, the lookup process is illustrated in Fig. 2. Each segment of the t-sequence is looked up in the received text in turn; if the whole t-sequence can be found and the received text has no redundant characters, the match is considered successful, otherwise it fails. As shown in Fig. 2, in diagram A, t_1 has matched c_1, c_2, so the lookup of t_2 starts from c_3; diagram B shows t_2 matched to c_4, c_5; diagram C shows t_n matched to c_{m-1}, c_m, and since the received text has no further characters, the preset template matches the received text successfully. For one received text, several preset templates may match successfully; in that case the preset template with the most characters is selected, and, on equal character counts, the one with the fewest slots.
Specifically, the matching algorithm may include the following operations performed on all preset templates in the preset template library.
Step 1: Segment the template text by its slots, denoting the segments t_1, t_2, …, t_n and the slots s_1, s_2, …, s_{n-1} (fewer slots than text segments), or s_1, s_2, …, s_n (as many slots as segments), or s_1, s_2, …, s_{n+1} (more slots than segments). Sorting the segments and slots in the order they appear in the template yields one of: (1) t_1, s_1, t_2, s_2, …, s_{n-1}, t_n; (2) t_1, s_1, t_2, s_2, …, s_{n-1}, t_n, s_n; (3) s_1, t_1, s_2, t_2, …, s_n, t_n; (4) s_1, t_1, s_2, t_2, …, s_n, t_n, s_{n+1}.
Step 2: Look up the segmented template texts sequentially in the received text c_1, c_2, …, c_m (since a slot, i.e., the s-sequence, can match zero or more characters, only the text segments, i.e., the t-sequence, are looked up).
Step 2.1: For a preset template of type (1) or (2), compare t_1 with c_1, c_2, …, c_{L1}, where L_1 is the length of t_1; if they are identical, update the matching position search_beg to L_1+1 and go to step 2.2, otherwise the preset template fails to match. For a template of type (3) or (4), search for t_1 in c_1, c_2, …, c_m; if a character string identical to t_1 is found at c_{P1}, with c_{P1} its initial character, update search_beg to P_1+L_1 and go to step 2.2, otherwise the preset template fails to match. In the latter case, the characters between c_1 and c_{P1-1} correspond to the slot content of slot s_1.
Step 2.2: If the last segmented template text t_n has matched successfully, update the matching position from the length of t_n and the ordinal of the initial character of the string matching t_n in c_1, c_2, …, c_m, and go to step 2.3. Otherwise, look up the next template text segment t_i in the input text starting from search_beg; if a string identical to t_i is found at c_{Pi}, update search_beg to P_i+L_i, where L_i is the length of t_i, and repeat step 2.2; otherwise the preset template fails to match.
Step 2.3: If m is greater than or equal to search_beg and the preset template is of type (1) or (3), the received text has redundant characters relative to the preset template and the template fails to match; otherwise the template matches successfully, and step 3 follows.
Step 3: If no optimal template has been set, set the current preset template as the optimal template. Otherwise, compare the character counts of the current preset template and the optimal template; if the current one has more characters, set it as the optimal template; if the counts are equal, compare slot counts, and if the current preset template has fewer slots, set it as the optimal template. After all preset templates have been matched, select the optimal template if one has been set; otherwise the received text matched no template in the preset template library and the matching fails.
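A sketch of steps 1-3 in Python follows; the greedy sequential lookup and the best-template rule mirror the description above, but the function names and the candidate representation are illustrative assumptions.

```python
def match_template(text: str, segments: list[str],
                   head_slot: bool, tail_slot: bool):
    """Return the list of slot contents if `text` matches, else None."""
    pos = 0
    slots = []
    for k, seg in enumerate(segments):
        if k == 0 and not head_slot:
            if not text.startswith(seg):        # step 2.1, types (1)/(2)
                return None
            found = 0
        else:
            found = text.find(seg, pos)         # search from search_beg
            if found < 0:
                return None                     # a segment is missing
            slots.append(text[pos:found])       # characters between segments
        pos = found + len(seg)                  # search_beg = P_i + L_i
    if tail_slot:
        slots.append(text[pos:])                # a trailing slot may be empty
    elif pos < len(text):
        return None                             # step 2.3: redundant characters
    return slots

def best_match(text: str, templates):
    best = None
    for tpl in templates:                       # (segments, head_slot, tail_slot)
        segs, head, tail = tpl
        if match_template(text, segs, head, tail) is None:
            continue
        key = (sum(map(len, segs)), -(len(segs) - 1 + head + tail))
        if best is None or key > best[0]:       # step 3: more characters first,
            best = (key, tpl)                   # then fewer slots
    return None if best is None else best[1]
```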
Since the speech signal is a time-series signal, it is strongly influenced by the speech before and after it. If the speech for the slot content were generated directly and spliced into the preset template, the result would sound heavily mechanical: inconsistent fundamental frequency, energy, speed, and rhythm would hurt naturalness, and phase discontinuities would cause glitch noise. In the embodiments of the invention, to preserve the continuity of the synthesized speech, the acoustic features of the frames of the preset template adjacent to the slot must be adjusted to match. Specifically, in the preset template library the acoustic features corresponding to each preset template's text are obtained in advance, and the method further includes the following. The adjacent acoustic features among the acoustic features of the matching preset template's text are compensated, yielding compensated acoustic features; the adjacent acoustic features comprise, counting from the template acoustic feature nearest the slot acoustic features, all acoustic features whose ordinal is at most a preset value. For example, suppose the template text's acoustic features are i_1, i_2, i_3, …, i_20 and the preset value is 5. If the slot precedes i_1, then counting upward from i_1, the adjacent acoustic features are i_1, i_2, i_3, i_4, i_5. If the slot follows i_20, then counting in reverse from i_20, the adjacent acoustic features are i_20, i_19, i_18, i_17, i_16. If the slot lies between i_11 and i_12, then counting in reverse from i_11 and forward from i_12, the adjacent acoustic features are i_11, i_10, i_9, i_8, i_7 and i_12, i_13, i_14, i_15, i_16. The same method yields the adjacent acoustic features regardless of how many slots a preset template has. Based on the compensated acoustic features, the speech waveform corresponding to the adjacent acoustic features is regenerated, producing an updated adjacent-template-text speech waveform, e.g., according to a streaming speech synthesis algorithm.
Splicing the slot speech waveform with the speech waveform corresponding to the matching preset template's text then means splicing together the slot speech waveform, the updated adjacent-template-text speech waveform, and the speech waveforms corresponding to the remaining acoustic features (those other than the adjacent ones) of the matching preset template's text.
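A small helper illustrating the neighbour-selection rule above (indices are 0-based here, unlike the 1-based numbering in the worked examples, and the parameter names are assumptions):

```python
from typing import Optional

def adjacent_frame_indices(num_frames: int,
                           slot_after: Optional[int],
                           slot_before: Optional[int],
                           window: int = 5) -> tuple[list[int], list[int]]:
    left: range = range(0)
    right: range = range(0)
    if slot_after is not None:    # slot sits after template frame `slot_after`
        left = range(max(0, slot_after - window + 1), slot_after + 1)
    if slot_before is not None:   # slot sits before template frame `slot_before`
        right = range(slot_before, min(num_frames, slot_before + window))
    return list(left), list(right)

# E.g. with 20 template frames and the slot between frames i_11 and i_12
# (0-based indices 10 and 11): adjacent_frame_indices(20, 10, 11) returns
# ([6, 7, 8, 9, 10], [11, 12, 13, 14, 15]), i.e. i_7..i_11 and i_12..i_16.
```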
Optionally, in the embodiment of the present invention, a compensation signal may be generated, and the adjacent acoustic features may be compensated based on the compensation signal.
The compensation signal is the difference between the model-generated acoustic features and the ideal acoustic features. In the embodiments of the invention, the adjacent acoustic features are compensated and the ideal acoustic features are the compensated acoustic features; the compensation signal is therefore the difference between the adjacent acoustic features before and after compensation. The pronunciation of a speech segment is affected by its context, which contributes to the continuity and consistency of the whole utterance. For the same preset template, filling the slots with different content changes the ideal pronunciation of the template, especially in the speech frames adjacent to the slots. The root cause of this difference is that the encoder part of the model that generates acoustic features uses the linguistic features of the complete sentence, and those linguistic features change when the slot content is replaced. The goal of the compensation signal is to repair the effect of this change on the pronunciation. Put more plainly: the preset template was generated under a certain condition (a particular text); when the template is used, the condition changes slightly (a small part of the text changes), so the generated template must be changed slightly to fit the new condition more perfectly, and the magnitude of that change is the compensation signal.
For a trained acoustic model, the acoustic features generated from scratch, without the preset templates of the technical solution provided by the embodiments of the invention, can be regarded as ideal acoustic features, such as acoustic features A2 and B2 shown in Fig. 3. In Fig. 3, acoustic features of two sentences (slot content in italic underlined font) are generated separately with the trained acoustic model. Both A2 and B2 represent acoustic features of the preset template adjacent to the slot, i.e., the adjacent acoustic features of the preceding embodiments. The difference between the two feature segments is a compensation signal (denoted C), where C compensates B2 to yield A2. To predict the compensation signal C, a compensation model is trained with acoustic features B2 and linguistic features A1 as input data, as shown in Fig. 4. In an embodiment of the invention, the compensation model may be trained as follows. Input data: acoustic features B2 and linguistic features A1, where A1 denotes the linguistic features of the entire filled-in sentence A. Output data: the compensation signal C'. Loss function: Mel-SD(A2, A'), the mel-spectral distortion between the target acoustic features A2 and the features A' = B2 + C' obtained by compensating B2. The optimization method uses Adam. To exploit the data fully, the inputs may also be acoustic features A2 and linguistic features B1, i.e., A2 is compensated toward B2, with loss Mel-SD(B2, B'); thus any two filled-in sentences of the same template yield two sets of training data.
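The wiring of one training step might look as follows. This is a sketch: the tiny linear model, the feature dimensions, the per-frame alignment of the linguistic features, and the mean-squared-error stand-in for Mel-SD are all assumptions; only the input/output/loss structure follows the description above.

```python
import torch

FEAT_DIM, LING_DIM = 80, 32
comp_model = torch.nn.Linear(FEAT_DIM + LING_DIM, FEAT_DIM)
optimizer = torch.optim.Adam(comp_model.parameters(), lr=1e-4)

def train_step(feat_B2, ling_A1, feat_A2):
    # feat_B2, feat_A2: (T, FEAT_DIM) adjacent-frame features of sentences B and A;
    # ling_A1: (T, LING_DIM) linguistic features of sentence A, aligned per frame.
    c_pred = comp_model(torch.cat([feat_B2, ling_A1], dim=-1))  # predicted C'
    feat_A_hat = feat_B2 + c_pred                               # A' = B2 + C'
    loss = torch.mean((feat_A2 - feat_A_hat) ** 2)              # Mel-SD stand-in
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The symmetric example is obtained by calling train_step(feat_A2, ling_B1, feat_B2), giving two training examples per pair of filled-in sentences.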
The compensation signal actually predicted by the trained compensation model is C', i.e., the signal used when synthesizing speech online in real time. Before real-time synthesis, the adjacent acoustic features are taken as acoustic features B2; during real-time synthesis, the linguistic features of the received text are generated as linguistic features A1, the compensation signal C' is obtained through the compensation model and added to the template acoustic features B2, and the result A' is the compensated output feature, as shown in Fig. 5. Linguistic feature generation converts text into phonetic labels (e.g., phonemes and intonation) and prosodic labels (e.g., stress, pauses). Since different content can fill the slot at real-time linguistic-feature generation, the complete linguistic features must be generated from the complete input text to guarantee the validity of the slot's linguistic features.
Optionally, in the embodiments of the invention, different methods may be used to generate the slot speech waveform and/or update the adjacent-template-text speech waveform, depending on the slot's position in the preset template. Speech waveform generation converts acoustic features into an output speech signal. Specifically, the following applies. In the preset template library, the speech-waveform hidden state corresponding to each preset template's text is obtained in advance; it is the hidden state of the speech-waveform generation network when the waveform corresponding to the acoustic features was generated. The acoustic features corresponding to the matching preset template's text and the slot acoustic features are placed together according to the relationship between the template and the slot, yielding the overall acoustic features, whose constituent features are ordered from first to last. For example, if the template text's acoustic features are i_1, i_2, i_3, …, i_20, the slot acoustic features are j_1, j_2, j_3, j_4, and the slot is at the head of the matching preset template, the overall acoustic features are j_1, j_2, j_3, j_4, i_1, i_2, …, i_20, and all of them are then ordered from first to last. When the slot is at the tail or in the middle of the matching preset template, the ordering follows the content analogously.
When the slot is at the tail of the matching preset template, generating the slot speech waveform and/or updating the adjacent-template-text speech waveform includes generating the speech samples corresponding to the slot acoustic features and/or the adjacent acoustic features as follows: the speech samples z_{j+L}, z_{j+L+1}, …, z_{j+2L-1} corresponding to the (i+1)-th frame acoustic feature y_{i+1} are generated, for a streaming speech synthesis algorithm, with y_{i+1} and z_j, z_{j+1}, …, z_{j+L-1} as input and s_i' as condition, where z_j, …, z_{j+L-1} are the speech samples of the i-th frame acoustic feature y_i, L is the frame length, and s_i' is the speech-waveform hidden state corresponding to the (i+1)-th frame acoustic feature; the slot speech waveform is then synthesized, and/or the adjacent-template-text speech waveform updated, from the samples corresponding to the slot acoustic features and/or the adjacent acoustic features. When the slot is at the tail of the preset template, the speech samples are generated frame by frame in the order of the acoustic features: e.g., from acoustic features y_1, y_2, …, y_i, …, y_{N-1}, y_N as input, speech samples z_1, z_2, …, z_j, z_{j+1}, …, z_{j+L}, …, z_M are generated frame by frame or point by point, where z_j, …, z_{j+L-1} correspond to acoustic feature y_i, L is the frame length, and M is the complete speech length. A set of hidden states s_0', s_1', s_2', …, s_i', …, s_{N-1}', s_N' is maintained during generation, where s_0' is the initial state and s_i' is the speech-waveform hidden state corresponding to the (i+1)-th frame acoustic feature, i.e., the hidden state of the speech-waveform generation network when generating the waveform corresponding to the (i+1)-th frame. When generating the waveform corresponding to the (i+1)-th frame acoustic feature, y_{i+1} and z_j, …, z_{j+L-1} are the input and s_i' the condition for generating z_{j+L}, …, z_{j+2L-1}. Specifically, in the embodiments of the invention, speech samples may be generated sequentially frame by frame, starting from the adjacent acoustic features and continuing to the end of the slot acoustic features. To maintain the natural continuity of the speech, the adjacent acoustic features are adjusted, and when the waveform is generated, the waveforms corresponding to the adjacent acoustic features must be regenerated; concretely, samples may be generated frame by frame from the adjacent acoustic features through the slot acoustic features.
When the slot is at the head of the matching preset template, generating the slot speech waveform and/or updating the adjacent-template-text speech waveform includes generating the corresponding speech samples as follows: the speech samples z_{j-1}, z_{j-2}, …, z_{j-L} corresponding to the (i-1)-th frame acoustic feature y_{i-1} are generated, for a streaming speech synthesis algorithm, with y_{i-1} and z_{j+L-1}, z_{j+L-2}, …, z_j as input and s_i'' as condition, where z_{j+L-1}, …, z_j are the speech samples of the i-th frame acoustic feature y_i, L is the frame length, and s_i'' is the speech-waveform hidden state corresponding to the (i-1)-th frame acoustic feature, i.e., the hidden state of the speech-waveform generation network when generating the waveform corresponding to the (i-1)-th frame; the slot speech waveform is then synthesized, and/or the adjacent-template-text waveform updated, from those samples. When the slot is at the head of the preset template, the speech samples are generated frame by frame in the reverse order of the acoustic features. For example, from acoustic features y_N, y_{N-1}, …, y_i, …, y_2, y_1 as input, speech samples z_M, z_{M-1}, …, z_{j+L}, z_{j+L-1}, …, z_j, …, z_1 are generated frame by frame or point by point, where z_{j+L-1}, …, z_{j+1}, z_j correspond to acoustic feature y_i, L is the frame length, and M is the complete speech length. A set of hidden states s_{N+1}'', s_N'', s_{N-1}'', …, s_i'', …, s_3'', s_2'', s_1'' is maintained during generation, where s_{N+1}'' is the initial state and s_i'' is the speech-waveform hidden state corresponding to the (i-1)-th frame acoustic feature, i.e., the hidden state of the speech-waveform generation network when generating the waveform corresponding to the (i-1)-th frame. When generating the waveform corresponding to the (i-1)-th frame acoustic feature, y_{i-1} and z_{j+L-1}, …, z_j are the input and s_i'' the condition for generating z_{j-1}, z_{j-2}, …, z_{j-L}. Specifically, in the embodiments of the invention, speech samples may be generated sequentially frame by frame, starting from the adjacent acoustic features and continuing to the end of the slot acoustic features. To maintain the natural continuity of the speech, the adjacent acoustic features are adjusted, and the waveforms corresponding to them must be regenerated; concretely, samples may be generated frame by frame from the adjacent acoustic features through the slot acoustic features.
When the slot is in the middle of the matching preset template, the slot speech waveform is generated and the adjacent-template-text speech waveform updated by generating a spliced speech waveform that joins the two. In this case the adjacent acoustic features lie on both sides of the slot acoustic features; the adjacent acoustic features and slot acoustic features are placed together to form spliced acoustic features, and a spliced speech waveform corresponding to them is generated, specifically as follows. First, the speech samples corresponding to the slot acoustic features and to the adjacent acoustic features whose ordinals are smaller than the slot acoustic features' are generated, yielding the first spliced speech samples: the samples z_{j+L}, z_{j+L+1}, …, z_{j+2L-1} for the (i+1)-th frame acoustic feature y_{i+1} are generated, for a streaming speech synthesis algorithm, with y_{i+1} and z_j, z_{j+1}, …, z_{j+L-1} as input and s_i' as condition, where z_j, …, z_{j+L-1} are the samples of the i-th frame feature y_i, L is the frame length, and s_i' is the speech-waveform hidden state corresponding to the (i+1)-th frame. For example, with template features i_1, i_2, …, i_20, slot features j_1, j_2, j_3, j_4, and the slot between i_11 and i_12, the adjacent features are i_11, i_10, i_9, i_8, i_7 and i_12, i_13, i_14, i_15, i_16; the samples corresponding to i_7, i_8, i_9, i_10, i_11, j_1, j_2, j_3, j_4 are the first spliced speech samples. Second, the speech samples corresponding to the slot acoustic features and to the adjacent acoustic features whose ordinals are greater than the slot acoustic features' are generated, yielding the second spliced speech samples: the samples z_{j-1}, z_{j-2}, …, z_{j-L} for the (i-1)-th frame acoustic feature y_{i-1} are generated with y_{i-1} and z_{j+L-1}, z_{j+L-2}, …, z_j as input and s_i'' as condition, where z_{j+L-1}, …, z_j are the samples of the i-th frame feature y_i, L is the frame length, and s_i'' is the speech-waveform hidden state corresponding to the (i-1)-th frame. In the same example, the samples corresponding to j_1, j_2, j_3, j_4, i_12, i_13, i_14, i_15, i_16 are the second spliced speech samples.
Spliced speech samples corresponding to the slot acoustic features and the adjacent acoustic features are then generated from the first spliced speech samples, the second spliced speech samples, and the weights corresponding to each. Specifically, each weight is a weight sequence: the first spliced speech samples are multiplied element-wise by the elements of their weight sequence, the second spliced speech samples are multiplied element-wise by the elements of theirs, and the two products are added to obtain the spliced speech samples. The two weight sequences contain the same number of elements, equal to the number of speech samples in the first (or second) spliced speech samples. The spliced speech waveform is then synthesized from the spliced speech samples. When the slot is located in the middle of the preset template, the problem is split into two sub-cases, treated respectively as if the slot were at the tail of the preset template and as if it were at the head. Following the acoustic feature ordering, the first spliced speech samples are generated by the in-order sample generation method for the adjacent acoustic features that precede the slot acoustic features, and the second spliced speech samples are generated by the reverse-order method for the adjacent acoustic features that follow the slot acoustic features. The final spliced speech samples, in which the adjacent acoustic features and slot acoustic features are joined, are obtained by weighted summation. For example, if the length of the first or second spliced speech samples is N, the weight sequence corresponding to the first spliced speech samples and the weight sequence corresponding to the second spliced speech samples are each given as a formula image in the original publication (not reproduced in this text); together they realize a weighted cross-fade over the spliced region.
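Since the original weight formulas survive only as images, the sketch below assumes complementary linear ramps as one plausible choice of weight sequences; the actual sequences in the patent may differ.

```python
import numpy as np

def crossfade_splice(first_samples, second_samples):
    """Blend two equal-length spliced sample sequences by weighted summation.

    The complementary linear ramps below are an illustrative assumption;
    the patent's actual weight sequences are given only as formula images.
    """
    n = len(first_samples)
    assert len(second_samples) == n, "weight sequences must have equal length"
    w_first = np.linspace(1.0, 0.0, n)   # fades the forward-generated pass out
    w_second = 1.0 - w_first             # fades the reverse-generated pass in
    return w_first * first_samples + w_second * second_samples
```

With the two passes sketched earlier, a call like `crossfade_splice(*make_spliced_samples(...))` would yield the final spliced samples.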
Acoustic feature generation converts linguistic features into acoustic features (e.g., mel-spectrum coefficients). In the embodiment of the invention, the acoustic features corresponding to the template text in each preset template are generated in advance, and both the generated acoustic features and the hidden states from the generation process are stored for later use; the acoustic features and hidden states of the template text are required when generating the slot acoustic features. Optionally, in the embodiment of the invention, the slot acoustic features may be generated by different methods according to the position of the slot in the preset template. In the preset template library, the acoustic features and acoustic feature hidden states corresponding to the template text included in each preset template are acquired in advance, an acoustic feature hidden state being the hidden state in effect when the corresponding acoustic feature of the template text is generated. The acoustic features corresponding to the template text included in the matching preset template and the slot acoustic features are put together, according to the relationship between the matching preset template and the slot, to obtain the overall acoustic features, whose constituent acoustic features are ordered from head to tail; for details of this ordering, refer to the related content described in the above embodiments.
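One plausible way to organize this precomputation is sketched below; the cache layout and the `frontend`/`acoustic_model`/`vocoder` interfaces are hypothetical names introduced purely for illustration, not the patent's components.

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class TemplateCache:
    """Precomputed artifacts for one preset template (illustrative layout)."""
    acoustic_features: List[Any]  # per-frame features of the template text
    hidden_states: List[Any]      # hidden state saved at each generation step
    waveform: Any                 # template speech waveform, synthesized once

def build_template_cache(template_text, frontend, acoustic_model, vocoder):
    # frontend, acoustic_model and vocoder are assumed interfaces, not a
    # real library API; generate_with_states is presumed to return the
    # per-frame features together with the hidden state at each step.
    linguistic = frontend(template_text)
    feats, states = acoustic_model.generate_with_states(linguistic)
    return TemplateCache(feats, states, vocoder(feats))
```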
When the slot is at the tail of the matching preset template — for example, the sample "Today is lunar-calendar Ox year [April seventeenth]", where "Today is lunar-calendar Ox year" is the template text and "April seventeenth" is the slot text — the acoustic features are generated frame by frame, based on a streaming speech synthesis algorithm, in the order of the acoustic feature ordering. That is, with the linguistic features X of a complete piece of text as input, the acoustic features y_1, y_2, ..., y_i, ..., y_{N-1}, y_N are generated frame by frame from front to back along the text. A set of hidden states s_0, s_1, s_2, ..., s_i, ..., s_{N-1}, s_N is maintained during generation, where s_0 is the initial state and each hidden state is the hidden state of the acoustic feature generation network when the corresponding acoustic feature is generated. Specifically, generating the slot acoustic features includes generating according to the following: the (i+1)-th frame acoustic feature y_{i+1} is generated, based on the streaming speech synthesis algorithm, with Enc(X) and y_i as input and s_i as condition, where Enc() is a linguistic feature encoding network (e.g., the CBHG-based encoder of Tacotron or the Transformer-based encoder of FastSpeech), s_i is the acoustic feature hidden state corresponding to the i-th frame acoustic feature, i.e., s_i is the hidden state of the acoustic feature generation network when generating the i-th frame acoustic feature, y_i is the i-th frame acoustic feature, and X is the linguistic feature corresponding to the received text. In addition, in the embodiment of the present invention, the adjacent acoustic features may be compensated; when they are, the acoustic features corresponding to the received text may be obtained as shown in Fig. 6, where the acoustic feature compensation signal generated for the adjacent portion between the template and the slot is the compensation signal used to compensate the adjacent acoustic features.
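A sketch of this forward continuation follows, starting from the last cached template frame and its stored hidden state. The `step` callable with signature `(enc_x, y_prev, state) -> (y_next, new_state)` is an assumed stand-in for the streaming decoder, mirroring the recurrence y_{i+1} = f(Enc(X), y_i | s_i) described above.

```python
def generate_slot_features_tail(enc_x, template_feats, template_states,
                                n_slot_frames, step):
    """Continue feature generation past the template's last frame (tail slot).

    step: assumed streaming decoder step (enc_x, y_prev, state) ->
    (y_next, new_state), i.e. y_{i+1} from Enc(X) and y_i, conditioned on s_i.
    """
    y = template_feats[-1]       # last template frame, y_i
    state = template_states[-1]  # its stored hidden state, s_i
    slot_feats = []
    for _ in range(n_slot_frames):
        y, state = step(enc_x, y, state)
        slot_feats.append(y)
    return slot_feats
```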
When the slot is at the head of the matching preset template — for example, the sample "[Jiangnan]: a one-minute preview has been sent to you; purchase the Little Housekeeper app to enjoy the full song", where "Jiangnan" is the slot text and the remainder is the template text — the scheme differs from mainstream speech synthesis algorithms in that the acoustic features are generated in reverse order for this case: the acoustic features of the character "nan" are generated first, then those of the character "jiang". When the slot is at the head of the preset template, the acoustic features are generated frame by frame, based on a streaming speech synthesis algorithm, in the reverse of the acoustic feature ordering. That is, with the linguistic features X of a complete piece of text as input, the acoustic features y_N, y_{N-1}, ..., y_j, ..., y_2, y_1 are generated frame by frame from back to front. A set of hidden states s_{N+1}, s_N, s_{N-1}, ..., s_j, ..., s_2, s_1 is maintained during generation, where s_{N+1} is the initial state and each hidden state is the hidden state in effect when the corresponding acoustic feature is generated. When generating the (j-1)-th frame acoustic feature, Enc(X) and y_j are the input and s_j is the condition for generating y_{j-1}, where Enc() is the linguistic feature encoding network. Specifically, when the slot is at the head of the preset template, generating the slot acoustic features includes generating according to the following: the (j-1)-th frame acoustic feature y_{j-1} is generated, based on the streaming speech synthesis algorithm, with Enc(X) and y_j as input and s_j as condition, where Enc() is a linguistic feature encoding network, s_j is the acoustic feature hidden state corresponding to the (j-1)-th frame acoustic feature, i.e., s_j is the hidden state of the acoustic feature generation network when generating the (j-1)-th frame acoustic feature, y_j is the j-th frame acoustic feature, and X is the linguistic feature corresponding to the received text. In addition, in the embodiment of the present invention, the adjacent acoustic features may be compensated (for example, using the compensation method provided in the embodiment of the present invention); the slot acoustic features and the adjacent acoustic features in this case may be understood with reference to Fig. 7.
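The head-slot case is the mirror image of the tail-slot sketch: generation runs backwards from the first cached template frame. `step_rev` and its signature are again illustrative assumptions, not the patent's API.

```python
def generate_slot_features_head(enc_x, template_feats, template_states,
                                n_slot_frames, step_rev):
    """Generate slot frames backwards when the slot precedes the template.

    step_rev: assumed reverse decoder step (enc_x, y_next, state) ->
    (y_prev, new_state), i.e. y_{j-1} from Enc(X) and y_j, conditioned on s_j.
    """
    y = template_feats[0]       # first template frame, y_j
    state = template_states[0]  # its stored hidden state, s_j
    slot_feats = []
    for _ in range(n_slot_frames):
        y, state = step_rev(enc_x, y, state)
        slot_feats.append(y)
    slot_feats.reverse()        # restore natural time order
    return slot_feats
```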
When the slot is located in the middle of the matching preset template, the situation differs from the head and tail cases in that the slot portion can be influenced simultaneously by the preset template portions on both sides of it. Specifically, in the embodiment of the present invention, the acoustic features may be generated frame by frame, based on a streaming speech synthesis algorithm, in the order of the acoustic feature ordering. Generating the slot acoustic features includes generating according to the following: the (i+1)-th frame acoustic feature y_{i+1} is generated, based on the streaming speech synthesis algorithm, with Enc(X), y_i, y_n and Enc_d(n-i-1) as input and s_i as condition, where Enc() is a linguistic feature encoding network, Enc_d() is a distance encoding function, s_i is the acoustic feature hidden state corresponding to the (i+1)-th frame acoustic feature, y_i is the i-th frame acoustic feature, y_n is the n-th frame acoustic feature, namely the acoustic feature among the adjacent acoustic features whose ordinal is greater than that of the slot acoustic features and which is closest to them, and X is the linguistic feature corresponding to the received text. For example, the adjacent acoustic features include y_{a-l+1}, ..., y_a, y_b, y_{b+1}, ..., y_{b+l-1}, where l is the adjustment window length (i.e., the preset value in the embodiment of the present invention). The slot acoustic features lie between y_a and y_b, and y_b is precisely the acoustic feature whose ordinal is greater than that of the slot acoustic features and which is closest to them in the acoustic feature ordering.
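The extra conditioning on the nearest right-hand template frame and its distance can be sketched as below. `step_mid`, `enc_d`, and all parameter names are illustrative assumptions: the point is that each step also sees y_n and the distance code Enc_d(n-i-1), so the generated slot frames are pulled toward the template material on both sides.

```python
def generate_slot_features_middle(enc_x, y_before, s_before, y_after, n_after,
                                  n_slot_frames, step_mid, enc_d):
    """Fill a slot flanked by template frames on both sides (middle slot).

    step_mid: assumed decoder step taking a distance code as extra input,
        (enc_x, y_i, y_n, dist_code, state) -> (y_next, new_state).
    enc_d: assumed distance encoding function Enc_d().
    y_before/s_before: last template frame before the slot and its state;
    y_after/n_after: nearest template frame after the slot (y_n) and its
    frame index n. All names are illustrative, not the patent's API.
    """
    y, state = y_before, s_before
    i = n_after - n_slot_frames - 1  # index of the frame just before the slot
    slot_feats = []
    for _ in range(n_slot_frames):
        # distance from the frame being generated (i+1) to y_n is n-i-1
        y, state = step_mid(enc_x, y, y_after, enc_d(n_after - i - 1), state)
        slot_feats.append(y)
        i += 1
    return slot_feats
```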
In summary, the embodiment of the present invention reduces the computational load of the TTS algorithm at the level of the algorithm structure, thereby reducing the response latency of the smart-device terminal, and it shortens task processing time by exploiting the characteristics of the service data together with the algorithm strategy, thereby reducing computing resources and service cost.
Yet another aspect of the embodiments of the present invention provides a machine-readable storage medium on which a program is stored, the program implementing the method described in the above embodiments when executed by a processor.
In another aspect of the embodiments of the present invention, a processor is further provided, where the processor is configured to execute a program, where the program executes the method described in the foregoing embodiments.
In another aspect, an apparatus is further provided, where the apparatus includes a processor, a memory, and a program stored in the memory and executable on the processor, and the processor executes the program to implement the method described in the foregoing embodiments. The device herein may be a server, a PC, a PAD, a mobile phone, etc.
Yet another aspect of an embodiment of the present invention provides a computer program product including a computer program/instructions, which when executed by a processor, implement the method described in the above embodiment.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (13)

1. A method for synthesizing speech, characterized in that the method comprises:
finding, in a preset template library, a matching preset template that matches received text, wherein, in the preset template library, each preset template includes template text and a slot, and the speech waveform corresponding to the template text is acquired in advance;
obtaining, from the received text, the slot content corresponding to the slot included in the matching preset template;
generating a slot acoustic feature corresponding to the slot content;
generating a slot speech waveform corresponding to the slot acoustic feature; and
splicing the slot speech waveform together with the speech waveform corresponding to the template text included in the matching preset template, to obtain a speech waveform corresponding to the received text, thereby synthesizing speech.

2. The method according to claim 1, characterized in that finding, in the preset template library, the matching preset template that matches the received text comprises:
sorting all characters included in the received text in order from the head to the tail of the received text; and
performing the following matching operation for any one of the preset templates in the preset template library:
segmenting the template text based on the slot to obtain segmented template texts;
sorting the segmented template texts, according to their positions in the preset template, in order from the head to the tail of the preset template;
searching the received text sequentially for the segmented template texts according to the template text ordering, wherein a character in the received text that is the same as a character included in one segmented template text is not reused in subsequent searches;
when all the segmented template texts are found in the received text, judging, according to a matching preset condition, whether the preset template matches the received text successfully; and
when the preset template matches the received text successfully, determining, according to an optimal template setting rule, whether to set the preset template as an optimal template, wherein, once the matching operation has been performed for all the preset templates in the preset template library, the optimal template is the matching preset template,
wherein the optimal template setting rule comprises:
if the optimal template has not been set, setting the preset template as the optimal template;
if the optimal template has been set but the preset template includes more characters than the optimal template, setting the preset template as the optimal template; and
if the optimal template has been set and the preset template includes the same number of characters as the optimal template but fewer slots than the optimal template, setting the preset template as the optimal template.

3. The method according to claim 2, characterized in that searching the received text sequentially for the segmented template texts according to the template text ordering comprises:
searching the received text for a first segmented template text according to the following, wherein the first segmented template text is the segmented template text ranked first in the preset template according to the template text ordering:
when there is no slot at the head of the first segmented template text, comparing the first segmented template text with the character string in the received text that starts from the first character according to the character ordering and has the same character length as the first segmented template text, wherein the first character is the character ranked first in the received text according to the character ordering; and
when there is a slot at the head of the first segmented template text, comparing the first segmented template text with any character string of consecutive ordinals in the received text having the same character length as the first segmented template text;
when the first segmented template text is found in the received text, setting a matching position for a second segmented template text according to the character length of the first segmented template text and the ordinal, in the character ordering, of the starting character of the character string in the received text that matches the first segmented template text, wherein the second segmented template text is the segmented template text following the first segmented template text in the template text ordering, and the matching position is the ordinal, in the character ordering, of the starting character from which the segmented template text is searched for in the received text;
searching the received text for the second segmented template text starting from the matching position; and
when the second segmented template text is found, updating the matching position according to the character length of the second segmented template text and the ordinal, in the character ordering, of the starting character of the character string in the received text that matches the second segmented template text, and searching the received text sequentially, in the same manner as for the second segmented template text, for the segmented template texts in the preset template other than the first and second segmented template texts, until all the segmented template texts in the preset template have been searched, wherein, if any segmented template text among the first segmented template text, the second segmented template text and all segmented template texts thereafter is not found during the search, the preset template fails to match the received text, and the operation of searching the received text sequentially for the segmented template texts according to the template text ordering ends.

4. The method according to claim 2, characterized in that the matching preset condition comprises:
if all the segmented template texts are found in the received text, the maximum ordinal of the characters in the received text according to the character ordering is greater than or equal to the last matching position corresponding to a last segmented template text, and there is no slot at the tail of the last segmented template text, the preset template does not match the received text successfully, wherein the last segmented template text is the segmented template text ranked last in the preset template according to the template text ordering, and the last matching position is the matching position set according to the character length of the last segmented template text and the ordinal, in the character ordering, of the starting character of the character string in the received text that matches the last segmented template text; and
if all the segmented template texts are found in the received text and the maximum ordinal is smaller than the last matching position corresponding to the last segmented template text and/or there is a slot at the tail of the last segmented template text, the preset template matches the received text successfully.

5. The method according to any one of claims 1-4, characterized in that, in the preset template library, the acoustic features corresponding to the template text included in each preset template are acquired in advance, and the method further comprises:
compensating adjacent acoustic features among the acoustic features corresponding to the template text included in the matching preset template, to obtain compensated acoustic features, wherein the adjacent acoustic features include all acoustic features, among the acoustic features corresponding to the template text included in the matching preset template, whose ordinal counted from the acoustic feature closest to the slot acoustic feature is less than or equal to a preset value; and
regenerating, based on the compensated acoustic features, the speech waveforms corresponding to the adjacent acoustic features, to generate updated adjacent template text speech waveforms;
wherein splicing the slot speech waveform together with the speech waveform corresponding to the template text included in the matching preset template comprises splicing together the slot speech waveform, the updated adjacent template text speech waveforms, and the speech waveforms corresponding to the acoustic features remaining, after removal of the adjacent acoustic features, among the acoustic features corresponding to the template text included in the matching preset template.

6. The method according to claim 5, characterized in that the slot speech waveform and/or the updated adjacent template text speech waveforms are generated according to a streaming speech synthesis algorithm.

7. The method according to claim 6, characterized in that, in the preset template library, the speech waveform hidden state corresponding to the template text included in each preset template is acquired in advance, the speech waveform hidden state being the hidden state of the speech waveform generation network when generating the speech waveform corresponding to the acoustic feature; when the slot is located at the tail of the matching preset template, the acoustic features corresponding to the template text included in the matching preset template and the slot acoustic feature are put together, according to the relationship between the matching preset template and the slot, to obtain overall acoustic features, and the acoustic features included in the overall acoustic features are sorted in order from head to tail; and generating the slot speech waveform and/or the updated adjacent template text speech waveforms comprises:
generating the speech samples corresponding to the slot acoustic feature and/or the speech samples corresponding to the adjacent acoustic features according to the following: the speech samples z_{j+L}, z_{j+1+L}, ..., z_{j+2L-1} corresponding to the (i+1)-th frame acoustic feature y_{i+1} are generated, based on the streaming speech synthesis algorithm, with y_{i+1} and z_j, z_{j+1}, ..., z_{j+L-1} as input and s_i' as condition, wherein z_j, z_{j+1}, ..., z_{j+L-1} are the speech samples corresponding to the i-th frame acoustic feature y_i, L is the frame length, and s_i' is the speech waveform hidden state corresponding to the (i+1)-th frame acoustic feature; and
synthesizing the slot speech waveform and/or the updated adjacent template text speech waveforms according to the speech samples corresponding to the slot acoustic feature and/or the speech samples corresponding to the adjacent acoustic features.

8. The method according to claim 6, characterized in that, in the preset template library, the speech waveform hidden state corresponding to the template text included in each preset template is acquired in advance, the speech waveform hidden state being the hidden state of the speech waveform generation network when generating the speech waveform corresponding to the acoustic feature; when the slot is located at the head of the matching preset template, the acoustic features corresponding to the template text included in the matching preset template and the slot acoustic feature are put together, according to the relationship between the matching preset template and the slot, to obtain overall acoustic features, and the acoustic features included in the overall acoustic features are sorted in order from head to tail; and generating the slot speech waveform and/or the updated adjacent template text speech waveforms comprises:
generating the speech samples corresponding to the slot acoustic feature and/or the speech samples corresponding to the adjacent acoustic features according to the following: the speech samples z_{j-1}, z_{j-2}, ..., z_{j-L} corresponding to the (i-1)-th frame acoustic feature y_{i-1} are generated, based on the streaming speech synthesis algorithm, with y_{i-1} and z_{j+L-1}, z_{j+L-2}, ..., z_j as input and s_i'' as condition, wherein z_{j+L-1}, z_{j+L-2}, ..., z_j are the speech samples corresponding to the i-th frame acoustic feature y_i, L is the frame length, and s_i'' is the speech waveform hidden state corresponding to generating the (i-1)-th frame acoustic feature; and
synthesizing the slot speech waveform and/or the updated adjacent template text speech waveforms according to the speech samples corresponding to the slot acoustic feature and/or the speech samples corresponding to the adjacent acoustic features.

9. The method according to claim 6, characterized in that, in the preset template library, the speech waveform hidden state corresponding to the template text included in each preset template is acquired in advance, the speech waveform hidden state being the hidden state of the speech waveform generation network when generating the speech waveform corresponding to the acoustic feature; when the slot is located in the middle of the matching preset template, the acoustic features corresponding to the template text included in the matching preset template and the slot acoustic feature are put together, according to the relationship between the matching preset template and the slot, to obtain overall acoustic features, and the acoustic features included in the overall acoustic features are sorted in order from head to tail; generating the slot speech waveform and the updated adjacent template text speech waveforms comprises generating a spliced speech waveform in which the slot speech waveform and the updated adjacent template text speech waveforms are spliced together; and generating the spliced speech waveform comprises:
generating the speech samples corresponding to the slot acoustic feature and the speech samples corresponding to the acoustic features, among the adjacent acoustic features, whose ordinals are smaller than the ordinal of the slot acoustic feature, according to the following, to obtain first spliced speech samples: the speech samples z_{j+L}, z_{j+1+L}, ..., z_{j+2L-1} corresponding to the (i+1)-th frame acoustic feature y_{i+1} are generated, based on the streaming speech synthesis algorithm, with y_{i+1} and z_j, z_{j+1}, ..., z_{j+L-1} as input and s_i' as condition, wherein z_j, z_{j+1}, ..., z_{j+L-1} are the speech samples corresponding to the i-th frame acoustic feature y_i, L is the frame length, and s_i' is the speech waveform hidden state corresponding to generating the (i+1)-th frame acoustic feature;
generating the speech samples corresponding to the slot acoustic feature and the speech samples corresponding to the acoustic features, among the adjacent acoustic features, whose ordinals are greater than the ordinal of the slot acoustic feature, according to the following, to obtain second spliced speech samples: the speech samples z_{j-1}, z_{j-2}, ..., z_{j-L} corresponding to the (i-1)-th frame acoustic feature y_{i-1} are generated, based on the streaming speech synthesis algorithm, with y_{i-1} and z_{j+L-1}, z_{j+L-2}, ..., z_j as input and s_i'' as condition, wherein z_{j+L-1}, z_{j+L-2}, ..., z_j are the speech samples corresponding to the i-th frame acoustic feature y_i, L is the frame length, and s_i'' is the speech waveform hidden state corresponding to generating the (i-1)-th frame acoustic feature;
generating spliced speech samples corresponding to the slot acoustic feature and the adjacent acoustic features based on the first spliced speech samples, the second spliced speech samples, the weights corresponding to the first spliced speech samples, and the weights corresponding to the second spliced speech samples; and
synthesizing the spliced speech waveform according to the spliced speech samples.

10. The method according to any one of claims 1-4, characterized in that the slot acoustic feature is generated according to a streaming speech synthesis algorithm.

11. The method according to claim 10, characterized in that, in the preset template library, the acoustic features and acoustic feature hidden states corresponding to the template text included in each preset template are acquired in advance, an acoustic feature hidden state being the hidden state of the acoustic feature generation network when generating the acoustic feature; when the slot is located at the tail of the matching preset template, the acoustic features corresponding to the template text included in the matching preset template and the slot acoustic feature are put together, according to the relationship between the matching preset template and the slot, to obtain overall acoustic features, and the acoustic features included in the overall acoustic features are sorted in order from head to tail; and generating the slot acoustic feature comprises generating according to the following:
the (i+1)-th frame acoustic feature y_{i+1} is generated, based on the streaming speech synthesis algorithm, with Enc(X) and y_i as input and s_i as condition, wherein Enc() is a linguistic feature encoding network, s_i is the acoustic feature hidden state corresponding to the (i+1)-th frame acoustic feature, y_i is the i-th frame acoustic feature, and X is the linguistic feature corresponding to the received text.

12. The method according to claim 10, characterized in that, in the preset template library, the acoustic features and acoustic feature hidden states corresponding to the template text included in each preset template are acquired in advance, an acoustic feature hidden state being the hidden state of the acoustic feature generation network when generating the acoustic feature; when the slot is located at the head of the matching preset template, the acoustic features corresponding to the template text included in the matching preset template and the slot acoustic feature are put together, according to the relationship between the matching preset template and the slot, to obtain overall acoustic features, and the acoustic features included in the overall acoustic features are sorted in order from head to tail; and generating the slot acoustic feature comprises generating according to the following:
the (j-1)-th frame acoustic feature y_{j-1} is generated, based on the streaming speech synthesis algorithm, with Enc(X) and y_j as input and s_j as condition, wherein Enc() is a linguistic feature encoding network, s_j is the acoustic feature hidden state corresponding to the (j-1)-th frame acoustic feature, y_j is the j-th frame acoustic feature, and X is the linguistic feature corresponding to the received text.

13. The method according to claim 10, characterized in that, in the preset template library, the acoustic features and acoustic feature hidden states corresponding to the template text included in each preset template are acquired in advance, an acoustic feature hidden state being the hidden state of the acoustic feature generation network when generating the acoustic feature; when the slot is located in the middle of the matching preset template, the acoustic features corresponding to the template text included in the matching preset template and the slot acoustic feature are put together, according to the relationship between the matching preset template and the slot, to obtain overall acoustic features, and the acoustic features included in the overall acoustic features are sorted in order from head to tail; and generating the slot acoustic feature comprises generating according to the following:
the (i+1)-th frame acoustic feature y_{i+1} is generated, based on the streaming speech synthesis algorithm, with Enc(X), y_i, y_n and Enc_d(n-i-1) as input and s_i as condition, wherein Enc() is a linguistic feature encoding network, Enc_d() is a distance encoding function, s_i is the acoustic feature hidden state corresponding to the (i+1)-th frame acoustic feature, y_i is the i-th frame acoustic feature, y_n is the n-th frame acoustic feature and is the acoustic feature, among the adjacent acoustic features, whose ordinal is greater than the ordinal of the slot acoustic feature and which is closest to the slot acoustic feature, and X is the linguistic feature corresponding to the received text.
CN202111326772.0A 2021-11-10 2021-11-10 Methods for synthesizing speech Active CN114049874B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111326772.0A CN114049874B (en) 2021-11-10 2021-11-10 Methods for synthesizing speech

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111326772.0A CN114049874B (en) 2021-11-10 2021-11-10 Methods for synthesizing speech

Publications (2)

Publication Number Publication Date
CN114049874A true CN114049874A (en) 2022-02-15
CN114049874B CN114049874B (en) 2025-07-29

Family

ID=80208045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111326772.0A Active CN114049874B (en) 2021-11-10 2021-11-10 Methods for synthesizing speech

Country Status (1)

Country Link
CN (1) CN114049874B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842827A (en) * 2022-04-28 2022-08-02 腾讯音乐娱乐科技(深圳)有限公司 Audio synthesis method, electronic equipment and readable storage medium

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000339314A (en) * 1999-05-25 2000-12-08 Nippon Telegr & Teleph Corp <Ntt> Automatic response method, dialogue analysis method, response sentence generation method, device thereof, and medium recording the program
US6438522B1 (en) * 1998-11-30 2002-08-20 Matsushita Electric Industrial Co., Ltd. Method and apparatus for speech synthesis whereby waveform segments expressing respective syllables of a speech item are modified in accordance with rhythm, pitch and speech power patterns expressed by a prosodic template
US6496801B1 (en) * 1999-11-02 2002-12-17 Matsushita Electric Industrial Co., Ltd. Speech synthesis employing concatenated prosodic and acoustic templates for phrases of multiple words
JP2003241795A (en) * 2002-02-18 2003-08-29 Hitachi Ltd Information acquisition method and information acquisition system using voice input
CN1945691A (en) * 2006-10-16 2007-04-11 安徽中科大讯飞信息科技有限公司 Method for improving template sentence synthetic effect in voice synthetic system
CN107340991A (en) * 2017-07-18 2017-11-10 百度在线网络技术(北京)有限公司 Switching method, device, equipment and the storage medium of speech roles
CN107886948A (en) * 2017-11-16 2018-04-06 百度在线网络技术(北京)有限公司 Voice interactive method and device, terminal, server and readable storage medium storing program for executing
CN110517662A (en) * 2019-07-12 2019-11-29 云知声智能科技股份有限公司 A kind of method and system of Intelligent voice broadcasting
CN111128121A (en) * 2019-12-20 2020-05-08 贝壳技术有限公司 Voice information generation method and device, electronic device and storage medium
CN111583904A (en) * 2020-05-13 2020-08-25 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
US20210027788A1 (en) * 2019-07-23 2021-01-28 Baidu Online Network Technology (Beijing) Co., Ltd. Conversation interaction method, apparatus and computer readable storage medium
US20210174781A1 (en) * 2019-01-17 2021-06-10 Ping An Technology (Shenzhen) Co., Ltd. Text-based speech synthesis method, computer device, and non-transitory computer-readable storage medium
CN113345415A (en) * 2021-06-01 2021-09-03 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium

Also Published As

Publication number Publication date
CN114049874B (en) 2025-07-29

Similar Documents

Publication Publication Date Title
US11990118B2 (en) Text-to-speech (TTS) processing
CN113345415B (en) Speech synthesis method, device, equipment and storage medium
JP7464621B2 (en) Speech synthesis method, device, and computer-readable storage medium
US7567896B2 (en) Corpus-based speech synthesis based on segment recombination
EP4158619B1 (en) Phrase-based end-to-end text-to-speech (tts) synthesis
US9905220B2 (en) Multilingual prosody generation
US11763797B2 (en) Text-to-speech (TTS) processing
US10692484B1 (en) Text-to-speech (TTS) processing
JP2002530703A (en) Speech synthesis using concatenation of speech waveforms
JP2007249212A (en) Method, computer program and processor for text speech synthesis
US8626510B2 (en) Speech synthesizing device, computer program product, and method
MXPA01006594A (en) Method and system for preselection of suitable units for concatenative speech.
CN102822889B (en) Pre-saved data compression for tts concatenation cost
Fahmy et al. A transfer learning end-to-end arabic text-to-speech (tts) deep architecture
US20220189455A1 (en) Method and system for synthesizing cross-lingual speech
CN114974218A (en) Voice conversion model training method and device and voice conversion method and device
CN114049874B (en) Methods for synthesizing speech
Mei et al. A particular character speech synthesis system based on deep learning
JP5874639B2 (en) Speech synthesis apparatus, speech synthesis method, and speech synthesis program
CN119207374B (en) Method and system for converting text into voice efficiently
CN120412544B (en) A prosody-controllable speech synthesis method and related device based on VITS
CN114974208B (en) A Chinese speech synthesis method, device, electronic device, and storage medium
Louw Text-to-speech duration models for resource-scarce languages in neural architectures
EP1501075B1 (en) Speech synthesis using concatenation of speech waveforms
JP5449022B2 (en) Speech segment database creation device, alternative speech model creation device, speech segment database creation method, alternative speech model creation method, program

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20220426

Address after: 100085 Floor 101 102-1, No. 35 Building, No. 2 Hospital, Xierqi West Road, Haidian District, Beijing

Applicant after: Seashell Housing (Beijing) Technology Co.,Ltd.

Address before: 101309 room 24, 62 Farm Road, Erjie village, Yangzhen, Shunyi District, Beijing

Applicant before: Beijing fangjianghu Technology Co.,Ltd.

GR01 Patent grant
GR01 Patent grant