CN113178187B - A voice processing method, device, equipment, medium, and program product

Publication number: CN113178187B
Application number: CN202110455104.1A
Other versions: CN113178187A
Authority: CN (China)
Inventor: 齐建永
Assignee: Beijing Youzhuju Network Technology Co Ltd
Legal status: Active (granted)
Prior art keywords: voice, user, terminal, pronunciation, receiving end
Classifications: Telephonic Communication Services; Telephone Function
Abstract

The application provides a voice processing method in which a sending end of a voice communication system collects a user's voice, recognizes the voice to obtain a recognition result that at least includes pronunciation prompt information, and sends the recognition result to a receiving end of the voice communication system, so that the receiving end performs voice synthesis according to the recognition result and obtains voice that includes no background noise. The method thus achieves 100% elimination of background noise without damaging the voice signal, ensuring voice quality.

Description

Voice processing method, device, equipment, medium and program product
Technical Field
The present application relates to the field of voice communication technology, and in particular, to a voice processing method, apparatus, device, computer readable storage medium, and computer program product.
Background
With the rapid development of communication technology, voice communication has become a mainstream communication mode. Voice communication refers to a mode in which a user sends voice through an electronic device to communicate with a counterpart user. At present, background noise is generally present during voice communication; it interferes with the user's voice, reduces communication efficiency, and degrades the user experience.
Voice noise reduction algorithms have been proposed to reduce the interference of background noise with voice. One class of algorithms uses multiple microphones to determine spatial information about the noise, such as its amplitude and phase, and then suppresses the noise based on that information. Another class is based on deep learning: it learns from a large number of noise recordings to build noise samples, and then removes the noise from the original sound signal, thereby reducing noise and improving speech quality.
However, the above voice noise reduction algorithms still cannot achieve 100% noise cancellation. In addition, noise reduction damages the voice signal and reduces voice quality.
Disclosure of Invention
The application provides a voice processing method in which a sending end in a voice communication system sends a recognition result that contains no background noise to a receiving end, and the receiving end performs voice synthesis according to the pronunciation prompt information in the recognition result. No background noise is mixed into the synthesized voice, so 100% background-noise elimination is achieved; moreover, the method does not damage the voice signal, ensuring voice quality. The application also provides an apparatus, a device, a computer readable storage medium, and a computer program product corresponding to the method.
In a first aspect, the present application provides a voice processing method, applied to a voice communication system, where the system includes a sending end and a receiving end, the method includes:
the sending end collects voice of a user;
the sending end recognizes the voice of the user to obtain a recognition result, where the recognition result at least includes pronunciation prompt information;
the sending end sends the recognition result to the receiving end, so that the receiving end performs voice synthesis according to the recognition result.
In some possible implementations, the sending end sends the recognition result to the receiving end in the following manner:
the sending end sends the recognition result and voiceprint information of the user to the receiving end.
In some possible implementations, the pronunciation prompt information includes phonemes;
the sending end recognizes the voice of the user to obtain a recognition result, which includes the following step:
and the sending end recognizes the voice of the user through an acoustic model to obtain the phonemes corresponding to the voice.
In some possible implementations, the sending end further obtains a pronunciation time or a pronunciation interval time of the phoneme corresponding to the voice through the acoustic model, and the recognition result further includes the pronunciation time or the pronunciation interval time.
In some possible implementations, the pronunciation prompt information includes text information that has the same pronunciation as the speech.
In some possible implementations, the text information is obtained by the sending end recognizing phonemes through an acoustic model and then decoding the phonemes through a language model.
In some possible implementations, the method further includes:
encrypting the recognition result;
the sending end sends the recognition result to the receiving end, which includes the following step:
the sending end sends the encrypted recognition result to the receiving end.
In some possible implementations, the method further includes:
filtering the voice of the user according to the voiceprint information of the user.
In some possible implementations, the sending end sends the recognition result to the receiving end in the following manner:
the sending end generates a character stream according to the recognition result;
the sending end transmits the character stream to the receiving end in a streaming, sequential transmission manner.
In a second aspect, the present application provides a voice processing method, applied to a voice communication system, where the system includes a sending end and a receiving end, the method includes:
the receiving end obtains a recognition result of the voice of the user, wherein the recognition result at least comprises pronunciation prompt information;
the receiving end performs voice synthesis according to the pronunciation prompting information to obtain synthesized voice;
the receiving end plays the synthesized voice so as to realize real-time voice communication.
In some possible implementations, the method further includes:
acquiring voiceprint information of a user;
the receiving end performs voice synthesis according to the pronunciation prompt information to obtain synthesized voice, which includes the following step:
the receiving end performs voice synthesis according to the pronunciation prompt information and the voiceprint information of the user to obtain the synthesized voice.
In some possible implementations, the pronunciation prompt information includes phonemes;
the receiving end performs voice synthesis according to the pronunciation prompt information and the voiceprint information of the user to obtain synthesized voice, and the method comprises the following steps:
the receiving end synthesizes an initial voice according to the pronunciation prompt information;
the receiving end obtains the synthesized voice corresponding to the user according to the initial voice and the voiceprint information of the user.
In some possible implementations, the recognition result further includes a pronunciation time or pronunciation interval time of the phoneme;
the receiving end performs voice synthesis according to the pronunciation prompt information and the voiceprint information of the user, which includes the following step:
the receiving end performs voice synthesis according to the pronunciation time or the pronunciation interval time of the phonemes and the voiceprint information of the user.
In some possible implementations, the pronunciation prompt information includes text information that has the same pronunciation as the speech.
In some possible implementations, the text information is obtained by the sending end recognizing phonemes through an acoustic model and then decoding the phonemes through a language model.
In some possible implementations, the receiving end obtains a recognition result of a voice of the user, including:
the receiving end receives the encrypted recognition result sent by the sending end;
the receiving end decrypts the recognition result from the encrypted recognition result.
In some possible implementations, the recognition result is transmitted in a character stream;
the receiving end performs voice synthesis according to the pronunciation prompt information and the voiceprint information of the user, which includes the following step:
the receiving end performs voice synthesis according to the pronunciation prompt information and the voiceprint information of the user, following the order of the character stream.
In a third aspect, the present application provides a speech processing apparatus comprising:
The acquisition unit is used for acquiring the voice of the user;
The recognition unit is used for recognizing the voice of the user to obtain a recognition result, and the recognition result at least comprises pronunciation prompt information;
and the sending unit is used for sending the recognition result to a receiving end, so that the receiving end performs voice synthesis according to the recognition result.
In some possible implementations, the sending unit is configured to send the recognition result and voiceprint information of the user to the receiving end.
In some possible implementations, the pronunciation prompt information includes phonemes;
the recognition unit is used for recognizing the voice of the user through an acoustic model to obtain the phonemes corresponding to the voice.
In some possible implementations, the recognition unit further obtains, through the acoustic model, a pronunciation time or a pronunciation interval time of the phoneme corresponding to the speech, and the recognition result further includes the pronunciation time or the pronunciation interval time.
In some possible implementations, the pronunciation prompt information includes text information that has the same pronunciation as the speech.
In some possible implementations, the recognition unit is configured to recognize phonemes by an acoustic model and then decode the phonemes by a language model to obtain the text information.
In some possible implementations, the apparatus further includes:
an encryption unit for encrypting the recognition result;
the sending unit is used for sending the encrypted recognition result to the receiving end.
In some possible implementations, the apparatus further includes:
and the filtering unit is used for filtering the voice of the user according to the voiceprint information of the user.
In some possible implementations, the sending unit is configured to generate a character stream according to the recognition result and to transmit the character stream to the receiving end in a streaming, sequential transmission manner.
In a fourth aspect, the present application provides a speech processing apparatus comprising:
an acquisition unit for acquiring a recognition result of a voice of a user, where the recognition result at least comprises pronunciation prompt information;
The synthesis unit is used for performing voice synthesis according to the pronunciation prompt information to obtain synthesized voice;
and the playing unit is used for playing the synthesized voice so as to realize real-time voice communication.
In some possible implementations, the acquiring unit is configured to acquire voiceprint information of a user;
and the synthesis unit is used for performing voice synthesis according to the pronunciation prompt information and the voiceprint information of the user to obtain synthesized voice.
In some possible implementations, the pronunciation prompt information includes phonemes;
The synthesizing unit is used for synthesizing initial voice according to the pronunciation prompt information and obtaining synthesized voice corresponding to the user according to the initial voice and voiceprint information of the user.
In some possible implementations, the recognition result further includes a pronunciation time or a pronunciation interval time of the phonemes;
the synthesis unit is used for performing voice synthesis according to the pronunciation time or the pronunciation interval time of the phonemes and the voiceprint information of the user.
In some possible implementations, the pronunciation prompt information includes text information that has the same pronunciation as the speech.
In some possible implementations, the text information is obtained by the sending end recognizing phonemes through an acoustic model and then decoding the phonemes through a language model.
In some possible implementations, the apparatus further includes a decryption unit;
the acquisition unit is used for receiving the encrypted recognition result sent by the sending end;
the decryption unit is used for decrypting the recognition result from the encrypted recognition result.
In some possible implementations, the recognition result is transmitted in a character stream;
and the synthesis unit is used for performing voice synthesis according to the pronunciation prompt information and the voiceprint information of the user, following the order of the character stream.
In a fifth aspect, the present application provides an apparatus comprising a processor and a memory. The processor and the memory communicate with each other. The processor is configured to execute instructions stored in the memory to cause the apparatus to perform a speech processing method as in any implementation of the first or second aspect.
In a sixth aspect, the present application provides a computer readable storage medium having stored therein instructions for instructing a device to execute the speech processing method according to any implementation manner of the first aspect or the second aspect.
In a seventh aspect, the present application provides a computer program product comprising instructions which, when run on a device, cause the device to perform the speech processing method of any of the implementations of the first or second aspects described above.
Further combinations of the present application may be made to provide further implementations based on the implementations provided in the above aspects.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the embodiments are briefly described below.
Fig. 1 is a schematic diagram of a voice communication system according to an embodiment of the present application;
Fig. 2 is an interaction diagram of a voice processing method according to an embodiment of the present application;
Fig. 3 is an interaction diagram of a voice processing method according to an embodiment of the present application;
Fig. 4 is an interaction diagram of a voice processing method according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a voice processing apparatus according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of a voice processing apparatus according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a terminal according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a terminal according to an embodiment of the present application.
Detailed Description
The terms "first", "second" in embodiments of the application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature.
Some technical terms related to the embodiments of the present application will be described first.
Voice communication refers to a communication mode in which two parties communicate through voice, for example, a user sends voice through an electronic device to communicate with a counterpart user. One typical application scenario of voice communication is real-time voice communication, in which voice is transmitted over a network to achieve the effect of a face-to-face conversation. Real-time voice communication imposes strict requirements on voice transmission delay; specifically, it requires both parties to transmit voice with low delay.
Real-time voice communication may take place after the electronic device of the user 1 and the electronic device of the user 2 are connected, whereupon the user 1 and the user 2 talk by voice. For example, the user 1 and the user 2 perform real-time voice communication based on a call service provided by a telecom operator, or based on a call service provided by an internet application service provider.
During voice communication, the voice of the user collected by the electronic device is often mixed with background noise. For example, when a user communicates by voice on a busy street, the electronic device may pick up the sound of vehicles driving on the street. For another example, when a user communicates by voice in a crowd, the electronic device may pick up the voices of other users. Such background noise interferes with the user's voice, reduces communication efficiency, and degrades the user's voice communication experience.
Currently, the industry processes the voice signals collected by electronic equipment with voice noise reduction algorithms to reduce the interference of background noise with the user's voice. For example, some algorithms determine spatial information about the noise, such as its amplitude and phase, based on multiple microphones, and then suppress the noise based on that information. Other algorithms use deep learning: they learn from a large number of noise recordings to build noise samples, and then remove the noise from the original sound signal, thereby reducing noise and improving voice quality.
However, such voice noise reduction algorithms cannot eliminate the noise component 100%, and after the voice signal collected by the electronic device is processed by such an algorithm, the voice signal is damaged, which reduces voice quality and degrades the user's voice communication experience.
In view of this, the embodiment of the application provides a voice processing method, applied to a voice communication system that includes a sending end and a receiving end. Specifically, the sending end collects the voice of the user and first recognizes the voice to obtain a recognition result that includes pronunciation prompt information; it then sends the recognition result and the voiceprint information of the user to the receiving end. After receiving them, the receiving end performs voice synthesis according to the pronunciation prompt information in the recognition result and the voiceprint information of the user, thereby obtaining voice that includes no background noise.
On the one hand, because the recognition result that the sending end sends to the receiving end includes no noise component such as background noise, and the receiving end synthesizes voice from the pronunciation prompt information in that recognition result and the user's voiceprint information, the synthesized voice contains no noise component at all; in theory, noise is eliminated 100%. The method therefore reduces the interference of background noise with the user's voice, improves the user's communication efficiency, and improves the user's voice communication experience.
Further, the receiving end restores the user's voice through voice synthesis, for example by synthesizing voice according to the pronunciation prompt information in the recognition result and the voiceprint information of the user. This does not damage the voice signal, and the receiving end can restore the voice signal with high quality, so the user hears clearer voice and the voice communication experience improves.
On the other hand, the sending end transmits the recognition result, which has a small data volume, and the user's voiceprint information to the receiving end, instead of transmitting voice data with a much larger data volume. This reduces the amount of data sent from the sending end to the receiving end and lowers the requirements on the network environments of both ends, so users can carry out high-quality voice communication even in a poor network environment.
The voice processing method provided by the embodiment of the application can be applied to real-time voice communication scenarios (such as voice calls) as well as non-real-time voice communication scenarios (such as voice messages). In a real-time voice communication scenario, low-delay voice transmission between the sending end and the receiving end is required, for example when the user of the sending end makes a voice call with the user of the receiving end. In a non-real-time voice communication scenario, the two communicating parties have looser delay requirements; for example, the user of the sending end sends a voice message to the user of the receiving end. In some implementations, the user of the sending end may send a voice message through a communication application (such as an instant messaging application) installed on the sending end, and the user of the receiving end may choose when to play the voice message after receiving it.
It should be noted that, the embodiment of the present application is not limited to the application scenario of the voice processing method, and the following description will only take the application of the voice processing method to the real-time voice communication scenario as an example.
In order to make the technical solution of the present application clearer and easier to understand, the architecture of the voice communication system provided by the embodiment of the present application is described below with reference to the accompanying drawings.
As shown in fig. 1, the voice communication system 100 includes a terminal 10 and a terminal 20. Fig. 1 only schematically illustrates the terminal 10 and the terminal 20 as mobile phones; of course, the terminal 10 or the terminal 20 includes, but is not limited to, a tablet computer, a notebook computer, a personal digital assistant (PDA), a smart wearable device, or the like. The smart wearable device may be a smart watch, a smart bracelet, smart glasses, etc.
In some embodiments, one of the terminals is the sending end and the other terminal is the receiving end. For example, the terminal 10 is the sending end, and the terminal 20 is the receiving end. As shown in fig. 1, when the user 1 utters voice, the terminal 10 may collect the voice of the user 1, then recognize it to obtain a recognition result including the pronunciation prompt information of the user 1, and then transmit the recognition result and the voiceprint information of the user 1 to the terminal 20. The terminal 20 performs speech synthesis according to the pronunciation prompt information in the recognition result and the voiceprint information of the user 1 to obtain synthesized speech, and then plays the synthesized speech. In this way, the user 2 hears the voice of the user 1 free of background noise.
Correspondingly, the terminal 10 may be the receiving end and the terminal 20 the sending end. After hearing the voice of the user 1, the user 2 can reply to the user 1 by voice. When the user 2 replies, the terminal 20 may collect the voice of the user 2, recognize it to obtain a recognition result including the pronunciation prompt information of the user 2, and then send the recognition result and the voiceprint information of the user 2 to the terminal 10. The terminal 10 performs speech synthesis according to the pronunciation prompt information in the recognition result and the voiceprint information of the user 2 to obtain synthesized speech. In this way, the user 1 hears the voice of the user 2 free of background noise.
In the voice communication system 100, taking the terminal 10 as the sending end and the terminal 20 as the receiving end as an example, the terminal 10 recognizes the voice of the user and the terminal 20 synthesizes voice according to the recognition result and the voiceprint information of the user, so the terminal 10 can adopt a noise reduction scheme with low requirements. For example, the voice input device of the terminal 10 may be a single microphone; steady-state noise in the surrounding environment is then filtered out based on the spectral subtraction principle, so that the terminal 10 can accurately recognize the collected voice of the user 1 and collect the voiceprint information of the user 1.
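As an illustration of such a low-requirement, single-microphone scheme, here is a minimal spectral-subtraction sketch in Python (numpy assumed available); the averaged noise estimate and the zero floor are common textbook choices, not details prescribed by the application.

```python
import numpy as np

def spectral_subtraction(frames, noise_frames):
    """Subtract an average noise magnitude spectrum from each frame.

    frames: 2-D array (n_frames, frame_len) of windowed audio frames.
    noise_frames: frames assumed to contain only steady-state background
    noise, e.g. captured before the user starts speaking.
    """
    noise_mag = np.abs(np.fft.rfft(noise_frames, axis=1)).mean(axis=0)
    spectra = np.fft.rfft(frames, axis=1)
    mag = np.maximum(np.abs(spectra) - noise_mag, 0.0)   # floor at zero
    cleaned = mag * np.exp(1j * np.angle(spectra))       # keep noisy phase
    return np.fft.irfft(cleaned, n=frames.shape[1], axis=1)

# Example: 20 ms frames at 16 kHz
rng = np.random.default_rng(0)
noise = 0.01 * rng.standard_normal((50, 320))
speech = np.sin(2 * np.pi * 440 * np.arange(320) / 16000) + noise[:1]
print(spectral_subtraction(speech, noise).shape)  # (1, 320)
```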
It should be noted that the voice processing method provided by the embodiment of the application can be provided to users in the form of a computer program. The computer program may be a stand-alone application (APP), a functional module of an existing application, a plug-in, an applet, etc. The terminals 10 and 20 implement the voice processing method of the present application by running the above computer program.
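To make the division of labor concrete before the detailed steps, the following runnable Python sketch mirrors the flow just described; every function in it is a toy placeholder standing in for the capture, recognition, and synthesis components discussed in the embodiments below, not an API defined by the application.

```python
# Toy, runnable sketch of the described flow; all helpers are placeholders.

def collect_voice():
    return "raw audio possibly mixed with background noise"

def recognize(audio):
    # Recognition keeps only pronunciation prompt information (phonemes),
    # so no background noise ever enters the payload.
    return ["p", "u", "t", "o", "ng", "h", "u", "a"]  # e.g. for "mandarin"

def synthesize(phonemes, voiceprint):
    return f"speech[{'-'.join(phonemes)}] in voice '{voiceprint}'"

# Sending end (terminal 10): recognize, then send a small, noise-free payload.
payload = {"recognition_result": recognize(collect_voice()),
           "voiceprint": "user-1-voiceprint"}

# Receiving end (terminal 20): synthesize and play noise-free speech.
print("playing:", synthesize(payload["recognition_result"],
                             payload["voiceprint"]))
```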
In order to make the technical solution of the present application clearer and easier to understand, the voice processing method provided in the embodiment of the present application is described in detail below, taking as an example the voice communication between the user 1 on the terminal 10 side and the user 2 on the terminal 20 side.
As shown in fig. 2, which is an interaction diagram of a voice processing method, the method comprises the following steps:
s202, the terminal 10 collects the voice of the user 1.
The terminal 10 may collect the voice of the user 1 through a microphone, which may be built into the terminal 10 or externally connected to it, such as an earphone with a microphone. In some embodiments, when the user 1 utters voice, the terminal 10 collects the voice of the user 1, i.e., the content the user 1 speaks, through the microphone.
S204, the terminal 10 recognizes the voice of the user 1 to obtain a recognition result.
The recognition result includes pronunciation prompt information, which is prompt information describing the user's pronunciation. For example, the pronunciation prompt information may include phonemes, or may include text information having the same pronunciation as the user's voice. The process by which the terminal 10 recognizes the voice of the user 1 is described below for each of these two kinds of pronunciation prompt information.
The first kind of pronunciation prompt information includes phonemes.
Phonemes are the basic units of pronunciation. Phonemes can be classified into vowel phonemes and consonant phonemes according to whether the airflow is obstructed during pronunciation. Vowel phonemes are sounds formed when the airflow passes through the oral cavity and pharynx without obstruction, while consonant phonemes, called consonants for short, are sounds formed when the airflow is obstructed to some degree in the oral cavity or pharynx. It should be noted that the vowels and consonants of different languages may differ.
The pronunciation prompt information may include a plurality of phonemes. The phonemes may be independent of each other, or may combine to form at least one syllable. A syllable is a relatively natural structural unit in speech. In Chinese, a syllable corresponds to the pronunciation of a single character. For example, "mandarin" (putonghua) comprises the syllables "pǔ", "tōng" and "huà", which contain 8 phonemes: "p", "u", "t", "o", "ng", "h", "u" and "a". On this basis, when the user utters the voice "mandarin", the pronunciation prompt information may be the 8 phonemes, or the 3 syllables formed by the 8 phonemes. In English, a syllable may be a whole word or part of a word: the syllable of a monosyllabic word is the word itself, while the syllables of a polysyllabic word are parts of the word. For example, the word "water" includes the two syllables "wa" and "ter". When the user utters the voice "water", the pronunciation prompt information may include the phonemes of the word, or the syllables formed by those phonemes, such as "wa" and "ter". The above is merely an illustrative example of phonemes and syllables and does not limit the present application.
In some embodiments, the terminal 10 may recognize the voice of the user 1 through an acoustic model to obtain the phonemes corresponding to the voice of the user 1. An acoustic model is a model that converts voice into its corresponding phonemes. For example, the acoustic model may be a deep neural network-hidden Markov model (DNN-HMM).
Specifically, the trained acoustic model may be stored on the terminal 10 in advance. After the terminal 10 collects the voice of the user 1, it extracts acoustic features from the sound waveform of the voice and then inputs them into the acoustic model to obtain the phonemes corresponding to the voice of the user 1.
In some embodiments, the terminal 10 first frames the sound waveform of the voice of the user 1, thereby dividing the voice into a plurality of segments, and then extracts acoustic features from each segment. After inputting the resulting acoustic features into the acoustic model, the terminal 10 obtains a phoneme sequence. Acoustic feature extraction methods include, but are not limited to, linear predictive coding (LPC), Mel-frequency cepstral coefficient (MFCC) computation, and the like.
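A minimal sketch of this framing and feature-extraction step, assuming librosa is available for MFCC computation; the 25 ms window, 10 ms hop, and 13 coefficients are conventional values, not ones fixed by the application.

```python
import numpy as np
import librosa  # assumed available; any MFCC implementation would do

def extract_acoustic_features(audio, sr=16000):
    """Frame the waveform and compute MFCCs (25 ms windows, 10 ms hop),
    one feature vector per frame, as described above."""
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr)).T  # (frames, 13)

audio = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000).astype(np.float32)
feats = extract_acoustic_features(audio)
print(feats.shape)  # about (101, 13): one 13-dim vector per 10 ms frame
# feats would then be fed to the acoustic model (e.g. a DNN-HMM) to obtain
# the frame-level phoneme sequence.
```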
In some embodiments, the terminal 10 may also obtain the pronunciation time of the phonemes corresponding to the speech of the user 1 through an acoustic model. The pronunciation time refers to a time period of the actual pronunciation of the user 1 in a time period corresponding to the voice of the user 1. For example, the time period corresponding to the voice of the user 1 is 5 seconds, the user 1 does not pronounce in the time interval of 0 to 3 seconds, and pronounces in the time interval of 3 to 5 seconds, and the pronunciation time of the phoneme corresponding to the voice of the user 1 is 2 seconds. In practice, the pronunciation time of the phonemes corresponding to the speech of the user 1 may be much smaller than 2 seconds, for example, 10 ms, 15 ms or 20 ms. In this way, the terminal 10 can obtain the actual pronunciation time of the user 1 in the period corresponding to the voice of the user 1, and then put the pronunciation time into the recognition result, so that the terminal 20 synthesizes the voice consistent with the voice speed of the user 1 according to the recognition result including the pronunciation time.
In other embodiments, the terminal 10 may also obtain, through the acoustic model, the pronunciation interval time of the phonemes corresponding to the voice of the user 1. The pronunciation interval time refers to the interval between phonemes within the time period corresponding to the voice of the user 1. For example, if the time period corresponding to the voice of the user 1 is 3 seconds, the user 1 utters one phoneme (or a syllable composed of several phonemes) in the interval from 0 to 1 second, is silent from 1 to 2 seconds, and utters another phoneme (or syllable) from 2 to 3 seconds, then the pronunciation interval time of the phonemes corresponding to the voice of the user 1 is 1 second. In practice, the pronunciation interval time may be far less than 1 second, for example, 50 ms, 60 ms, or 100 ms. In this way, the terminal 10 obtains the pronunciation interval time of the user 1 within the period corresponding to the voice of the user 1 and puts it into the recognition result, so that the terminal 20 can synthesize voice whose sentence breaks are consistent with those of the user 1's speech.
In other implementations, the terminal 10 may put both the pronunciation time and the pronunciation interval time into the recognition result, so that the terminal 20 can synthesize voice consistent with both the speech speed and the sentence breaks of the user 1's speech.
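As one way to obtain these two timing quantities, the sketch below derives pronunciation times and pronunciation interval times from a frame-level phoneme alignment such as an acoustic model could output; the "sil" label and the 10 ms frame step are assumptions made only for illustration.

```python
def pronunciation_times(frame_phonemes, frame_ms=10):
    """Derive pronunciation time and pronunciation interval time from a
    frame-level alignment, where 'sil' marks frames without phonation.

    frame_phonemes: one label per frame, e.g. the acoustic model's output.
    Returns (phoneme, duration_ms) segments and the silent gaps between them.
    """
    segments, gaps, i = [], [], 0
    while i < len(frame_phonemes):
        j = i
        while j < len(frame_phonemes) and frame_phonemes[j] == frame_phonemes[i]:
            j += 1
        run = (frame_phonemes[i], (j - i) * frame_ms)
        (gaps if frame_phonemes[i] == "sil" else segments).append(run)
        i = j
    return segments, gaps

# "sil ... p p p u u sil ... t ..." -> pronunciation and interval times
frames = ["sil"] * 30 + ["p"] * 3 + ["u"] * 5 + ["sil"] * 10 + ["t"] * 4
segs, gaps = pronunciation_times(frames)
print(segs)  # [('p', 30), ('u', 50), ('t', 40)]  pronunciation times in ms
print(gaps)  # [('sil', 300), ('sil', 100)]       interval times in ms
```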
The first kind of pronunciation prompt information has been introduced above; the second kind is introduced below.
The second kind of pronunciation prompt information includes text information that has the same pronunciation as the user's voice.
The text information refers to written text; the text may be composed of Chinese, of English, or of another language, and in some implementations may mix several languages. Text information having the same pronunciation is text that sounds identical to the speech when read aloud, even if the written form or literal meaning differs. For example, the text information may be "I want to drink cola", and any character string pronounced identically to it qualifies as text information with the same pronunciation. For another example, the text information may be "alter", and "altar" is text information with the same pronunciation. These examples merely illustrate text information having the same pronunciation and do not limit the present application.
In some embodiments, the terminal 10 may recognize the voice of the user 1 through the acoustic model to obtain the phonemes corresponding to the voice of the user 1, and then decode the phonemes through a language model to obtain the text information of the voice of the user 1. The process by which the terminal 10 recognizes the voice of the user 1 through the acoustic model is as described above and is not repeated here.
A language model is a model that converts phonemes into text information. After obtaining the phoneme sequence corresponding to the voice of the user 1 in the above manner, the terminal 10 converts the phoneme sequence into text information through the language model. In some embodiments, the text information only needs to have the same pronunciation as the speech, so the terminal 10 need not perform accurate semantic recognition. For example, the terminal 10 does not need to convert the various phoneme sequences output by the acoustic model into multiple candidate texts through the language model and then score the candidates to determine the final text. In this way, the terminal 10 can recognize the voice of the user 1 more quickly and obtain the corresponding text information, which reduces recognition time, improves recognition efficiency, and further reduces the communication delay between the terminal 10 and the terminal 20.
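Because any same-sounding text is acceptable, decoding can reduce to a greedy per-syllable lookup instead of a full language-model search over scored candidates. The toy table below stands in for the language model and is purely illustrative.

```python
# Toy homophone lookup: since the receiving end only needs text that *sounds*
# right, the first dictionary entry per syllable suffices; no rescoring of
# candidate sentences is needed. The table is illustrative, not the patent's.
HOMOPHONE_TABLE = {
    "pu3": "普", "tong1": "通", "hua4": "话",  # any same-sounding character works
}

def decode_to_text(syllables):
    return "".join(HOMOPHONE_TABLE[s] for s in syllables)

print(decode_to_text(["pu3", "tong1", "hua4"]))  # 普通话 ("mandarin")
```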
S206, the terminal 10 transmits the identification result to the terminal 20.
In some implementations, the terminal 10 may send the recognition result together with the voiceprint information of the user 1 to the terminal 20; in other implementations, the terminal 10 may send the recognition result alone. The two implementations are described below.
In a first implementation, the terminal 10 sends the recognition result and voiceprint information of the user 1 to the terminal 20.
Voiceprint information refers to information that can reflect physiological and behavioral characteristics of a speaker. Since the vocal organs (e.g., tongue, teeth, throat, lungs, nasal cavity) of different speakers are different in size and morphology, voiceprint information of different people is also different, and thus, the identity of the speaker can be determined by comparing the voiceprint information.
In some embodiments, the user 1 may enter his or her voiceprint information on the terminal 10 in advance, for example, in a text-dependent manner. Specifically, the terminal 10 collects the voice of the user 1 while the user reads a preset text, processes the voice based on the linear predictive coding (LPC) technique to obtain the voiceprint information of the user 1, and then stores the voiceprint information locally on the terminal 10. With the voiceprint information entered in advance, the terminal 10 does not need to acquire it in real time during voice communication with the terminal 20, which reduces the computing load of the terminal 10. Further, because the terminal 10 captures the voice of the user 1 reading a preset text, the resulting voiceprint information is more accurate.
In other embodiments, the terminal 10 may also acquire the voiceprint information of the user 1 in real time. For example, after the terminal 10 collects the voice of the user 1, the terminal 10 processes the voice by using a linear predictive coding technique to obtain voiceprint information of the user 1, and then stores the voiceprint information locally in the terminal 10. In this embodiment, the terminal 10 acquires the voiceprint information of the user 1 in real time through the acquired voice of the user 1 without the user 1 reading the preset text.
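A minimal sketch of LPC-based voiceprint extraction, assuming librosa for the LPC analysis; averaging frame-wise LPC coefficients into one vector is a simplification made for illustration, and a real system would use a richer speaker representation.

```python
import numpy as np
import librosa  # assumed; librosa.lpc implements linear predictive coding

def enroll_voiceprint(audio, sr=16000, order=12):
    """Sketch of LPC-based voiceprint enrollment: average the LPC
    coefficients of short frames into a single vector for the user."""
    frame = int(0.025 * sr)
    frames = [audio[i:i + frame] for i in range(0, len(audio) - frame, frame)]
    coeffs = np.array([librosa.lpc(f, order=order) for f in frames])
    return coeffs.mean(axis=0)   # stored locally as the user's voiceprint

rng = np.random.default_rng(1)
voiceprint = enroll_voiceprint(rng.standard_normal(16000).astype(np.float32))
print(voiceprint.shape)  # (13,) = order + 1 coefficients
```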
In a second implementation, the terminal 10 sends the recognition result to the terminal 20.
Specifically, the terminal 10 directly sends the recognition result to the terminal 20, and the receiving end performs speech synthesis according to the pronunciation prompt information in the recognition result to obtain synthesized speech. The synthesized speech expresses what the user 1 wants to express but does not carry the voiceprint information of the user 1; for example, it may be speech with a preset tone color. In some examples, the terminal 20 may present the identifier of the user 1 to the user 2, so that the user 2 can determine the identity of the user 1 from the identifier and know that the synthesized speech is the speech of the user 1. The identifier of the user 1 may be the phone number, user name, account number, etc. of the user 1.
In some embodiments, the terminal 10 may send the voiceprint information of the user 1 to the terminal 20 when it communicates with the terminal 20 by voice for the first time, and the terminal 20 stores the voiceprint information after receiving it. In this way, the terminal 10 can subsequently send the recognition result alone, without retransmitting the voiceprint information of the user 1. When the user of the terminal 10 later communicates with the user of the terminal 20 by voice, the two terminals can directly perform voice processing based on the stored voiceprint information of the opposite-end user, realizing real-time voice communication.
In some examples, a mapping between the voiceprint information of the user 1 and the identifier of the user 1 is pre-stored in the terminal 20. After receiving the recognition result sent by the terminal 10, the terminal 20 determines the voiceprint information of the user 1 from the mapping according to the identifier of the user 1, and then performs speech synthesis according to the voiceprint information of the user 1 and the recognition result.
In some scenarios, the terminal 20 may store the voiceprint information of the user 1 until the end of the current voice communication. For example, if the user 1 speaks twice in succession, the terminal 10 sends the recognition result of the first utterance together with the voiceprint information of the user 1 to the terminal 20, and for the second utterance sends only the recognition result, without transmitting the voiceprint information of the user 1 again.
In other scenarios, the terminal 20 may store the voiceprint information of the user 1 until a preset time node. For example, when the user 1 and the user 2 carry out voice communication twice, the terminal 10 sends the recognition result of the voice of the user 1 together with the voiceprint information of the user 1 during the first communication, and sends only the recognition result during the second communication, without transmitting the voiceprint information of the user 1 again.
In this way, when the terminal 10 and the terminal 20 perform voice communication, the terminal 10 does not need to send the voiceprint information of the user 1 multiple times, which further reduces the amount of data transmitted to the terminal 20.
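The voiceprint caching just described can be as simple as a dictionary keyed by the sender's identifier; the sketch below is illustrative, and its field names are hypothetical.

```python
# Receiver-side voiceprint cache keyed by the sender's identifier
# (phone number, account, etc.); field names are illustrative.
voiceprint_cache = {}

def on_payload(sender_id, payload):
    if "voiceprint" in payload:                   # sent once, e.g. on first call
        voiceprint_cache[sender_id] = payload["voiceprint"]
    voiceprint = voiceprint_cache.get(sender_id)  # reused for later messages
    return payload["recognition_result"], voiceprint

on_payload("user-1", {"recognition_result": ["ni3", "hao3"],
                      "voiceprint": "vp-user-1"})
print(on_payload("user-1", {"recognition_result": ["zai4", "jian4"]}))
# (['zai4', 'jian4'], 'vp-user-1'): voiceprint not retransmitted
```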
In this embodiment, the terminal 10 sends the recognition result, which has a small data volume, and the voiceprint information of the user 1 to the terminal 20. Compared with directly transmitting voice data with a much larger data volume, the terminal 10 can complete the transmission in a short time, reducing the time required to send data to the terminal 20. Taking narrowband voice communication with relatively low voice quality as an example, the original code rate is 128 kbps, and the compressed code rate is 4.75 kbps to 12.2 kbps. In this embodiment, the pronunciation prompt information in the recognition result sent by the terminal 10 to the terminal 20 is represented by characters, whether it is text information or phonemes (or syllables composed of phonemes). Taking text information as an example, each character occupies 2 bytes, i.e., 16 bits, under the Chinese character standard GB-2312; at a normal speaking rate of 120 characters per minute, this converts to a code rate of 120 × 16 / 60 = 32 bps.
It can be seen that transmitting the recognition result from the terminal 10 to the terminal 20 reduces the amount of data transmitted by a factor of several thousand compared with directly transmitting voice data (128 kbps versus 32 bps). The data size of the voiceprint information is within about 10 KB, and the terminal 10 transmits it to the terminal 20 only once; after the terminal 20 saves the voiceprint information, the terminal 10 subsequently sends only the recognition result. Further, even if the network environment of the terminal 10 or the terminal 20 is poor (for example, the user 1 carries the terminal 10 into an area with poor network coverage such as an elevator), the terminal 10 can still transmit the small recognition result to the terminal 20 and sustain the voice communication between the user 1 and the user 2, improving the usability of real-time voice communication.
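The bandwidth estimate above can be reproduced directly from the document's own figures:

```python
# Reproducing the patent's estimate: a GB-2312 character occupies 2 bytes
# (16 bits) and a typical speaking rate is 120 characters per minute.
bits_per_char = 16
chars_per_minute = 120
text_bps = bits_per_char * chars_per_minute / 60   # rate of the character stream
voice_bps = 128_000                                # original narrowband rate
print(text_bps)              # 32.0 bps
print(voice_bps / text_bps)  # 4000.0: thousands of times less data
```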
In some implementations, the terminal 10 may generate a character stream from the recognition result and transmit the character stream to the terminal 20 in a streaming, sequential manner. For example, while sending the recognition result, each time the terminal 10 recognizes a syllable from the voice of the user 1, it encodes that syllable directly and transmits it to the terminal 20 in streaming order, rather than waiting for all syllables corresponding to the voice of the user 1 before encoding and transmitting them together. In this way, the terminal 10 recognizes the voice of the user 1, generates the character stream from the recognition result, and transmits it to the terminal 20 concurrently.
While receiving the character stream transmitted by the terminal 10, the terminal 20 performs voice synthesis according to the received characters and the voiceprint information of the user 1 as they arrive. In some embodiments, the terminal 20 may perform speech synthesis according to the pronunciation prompt information and the voiceprint information of the user 1 in the order of the character stream. In this way, the terminal 20 can synthesize voice while still receiving the character stream from the terminal 10.
In the above manner, the terminal 10 can complete the transmission of the character stream corresponding to the recognition result to the terminal 20 in a short time, thereby reducing the communication delay between the terminal 10 and the terminal 20.
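A runnable sketch of the streaming, in-order transfer, using a queue as a stand-in for the network connection between the terminals; the per-syllable UTF-8 encoding and the end-of-utterance marker are assumptions made for illustration.

```python
import queue
import threading

# The queue stands in for the network channel between terminal 10 and 20.
channel = queue.Queue()

def sending_end(syllables):
    for s in syllables:               # one syllable at a time, no batching
        channel.put(s.encode("utf-8"))
    channel.put(None)                 # end-of-utterance marker

def receiving_end():
    # Synthesis starts as soon as the first chunk arrives, before the
    # whole utterance has been transmitted.
    while (data := channel.get()) is not None:
        print("synthesize:", data.decode("utf-8"))

threading.Thread(target=sending_end,
                 args=(["pu3", "tong1", "hua4"],)).start()
receiving_end()
```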
S208, the terminal 20 performs voice synthesis according to the recognition result to obtain synthesized voice.
In some embodiments, when the terminal 20 has not stored the voiceprint information of the user 1 and the terminal 10 has not transmitted it, the terminal 20 performs speech synthesis directly according to the recognition result to obtain synthesized speech that includes no background noise. The terminal 20 can play the synthesized speech, and the user 2 can learn what the user 1 wants to express.
In other embodiments, when the voiceprint information of the user 1 is stored in the terminal 20 or the terminal 10 transmits the voiceprint information of the user 1 to the terminal 20, the terminal 20 performs speech synthesis according to the recognition result transmitted by the terminal 10 and the voiceprint information of the user 1, so as to obtain the speech of the user 1 that does not include background noise. Specifically, the terminal 20 synthesizes an initial voice according to the pronunciation prompt, which may be a standard voice such as a robot voice. The voiceprint information of user 1 is not included in the initial voice. Then, the terminal 20 obtains the synthesized voice corresponding to the user 1 according to the initial voice and the voiceprint information of the user 1. The synthesized voice includes voiceprint information of the user 1, and after the terminal 20 plays the synthesized voice, the user 2 can determine that the synthesized voice is the voice of the user 1.
In some implementations, the terminal 20 may process the pronunciation prompt information in the recognition result and the voiceprint information of the user 1 through a formant model to synthesize the voice of the user 1 without background noise.
In some embodiments, the recognition result further includes the pronunciation time of the phonemes corresponding to the voice of the user 1, acquired by the terminal 10 through the acoustic model. After receiving the recognition result from the terminal 10, the terminal 20 obtains the pronunciation time and can determine from it the speech speed of the user 1. After the terminal 20 synthesizes speech from a recognition result that includes the pronunciation time, the speech speed of the synthesized speech is consistent with that of the user 1's speech, which makes the synthesized speech less stiff and improves the voice communication experience.
In other embodiments, the recognition result further includes the pronunciation interval time of the phonemes corresponding to the voice of the user 1, acquired by the terminal 10 through the acoustic model. The terminal 20 can determine the sentence breaks of the user 1's speech from the pronunciation interval time. After performing speech synthesis from a recognition result that includes the pronunciation interval time, the terminal 20 can make the synthesized speech match the speech of the user 1.
In other embodiments, the terminal 20 may perform speech synthesis according to the recognition result including the pronunciation time and the pronunciation interval time, so that the terminal 20 can synthesize the speech consistent with the speech speed and the sentence break when the user 1 speaks.
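The following toy sketch illustrates the two stages described above: building an initial voice from phonemes and their pronunciation/interval times, then applying the user's voiceprint. One sine tone per phoneme and a single pitch factor for the voiceprint are gross simplifications of a real formant model, used only to show where each input enters.

```python
import numpy as np

SR = 16000
# Toy "formant" table: one sine frequency per phoneme. A real formant model
# would drive several time-varying formant filters; this is only a sketch.
FORMANTS = {"p": 300.0, "u": 450.0, "t": 350.0, "o": 500.0}

def synthesize(phonemes, durations_ms, gaps_ms, voiceprint_f0=1.0):
    """Stage 1 builds an initial voice from pronunciation prompt info
    (phonemes + pronunciation/interval times); stage 2 applies the user's
    voiceprint, modeled here as a simple pitch factor."""
    pieces = []
    for ph, dur, gap in zip(phonemes, durations_ms, gaps_ms):
        t = np.arange(int(SR * dur / 1000)) / SR
        pieces.append(np.sin(2 * np.pi * FORMANTS[ph] * voiceprint_f0 * t))
        pieces.append(np.zeros(int(SR * gap / 1000)))  # sentence-break pause
    return np.concatenate(pieces)

audio = synthesize(["p", "u"], durations_ms=[30, 50], gaps_ms=[0, 100],
                   voiceprint_f0=0.9)  # 0.9: toy stand-in for the voiceprint
print(audio.shape)  # (2880,) = 30 ms + 50 ms of tone + 100 ms of pause
```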
It should be noted that, the speed of speech recognition by the terminal 10 is much higher than the speed of speech encoding by the terminal 10 in a conventional manner, and correspondingly, the speed of speech synthesis by the terminal 20 is also much higher than the speed of speech decoding by the terminal 20 in a conventional manner, so that the method further reduces the communication delay between the terminal 10 and the terminal 20.
S210, the terminal 20 plays the synthesized voice.
The terminal 20 plays the synthesized voice, so the user 2 on the terminal 20 side hears synthesized voice free of any background-noise component, realizing real-time voice communication without any noise.
Based on the above description, the embodiment of the application provides a voice processing method. On the one hand, the recognition result transmitted from the terminal 10 to the terminal 20 includes no noise component such as background noise, and the terminal 20 performs speech synthesis according to this noise-free recognition result and the voiceprint information of the user 1, so the synthesized speech contains no noise component at all; in theory, 100% noise cancellation is achieved. The method therefore reduces the interference of background noise with the voice of the user 1, improves communication efficiency, and improves the voice communication experience.
Further, the terminal 20 restores the voice of the user 1 through voice synthesis, which does not damage the voice signal; the terminal 20 can restore the voice signal with high quality, so the user 2 hears clearer voice, improving the voice communication experience of the user 1 and the user 2.
On the other hand, the terminal 10 transmits the recognition result, which has a small data volume, and the voiceprint information of the user 1 to the terminal 20, instead of transmitting the much larger voice data of the user 1. This reduces the amount of data the terminal 10 transmits to the terminal 20, lowering the requirements on the network environments of the terminal 10 and the terminal 20, so that the two terminals can carry out high-quality voice communication even in a poor network environment.
In some embodiments, after hearing the synthesized voice played by the terminal 20, the user 2 may answer it; for example, the user 2 replies by voice to the user 1 on the terminal 10 side through the terminal 20. As shown in fig. 3, the embodiment of the present application further provides a voice processing method that adds the following steps to those shown in fig. 2:
S302, the terminal 20 collects the voice of the user 2.
S304, the terminal 20 recognizes the voice of the user 2 to obtain a recognition result.
S306, the terminal 20 transmits the recognition result to the terminal 10.
S308, the terminal 10 performs voice synthesis according to the recognition result to obtain synthesized voice.
S310, the terminal 10 plays the synthesized voice.
In this embodiment, the terminal 10 may serve as either the sending end or the receiving end, that is, the terminal 10 has the functions of both. Similarly, the terminal 20 may serve as either the sending end or the receiving end. The terminal 10 and the terminal 20 may take on different roles in different scenarios to enable voice communication between the user 1 on the terminal 10 side and the user 2 on the terminal 20 side.
It should be noted that, S302 to S310 are similar to S202 to S210 in the steps shown in fig. 2, and are not repeated here.
In some embodiments, the terminal may further encrypt the transmitted data, such as the recognition result and the voiceprint information, to improve the security of the voice communication between the user 1 and the user 2. For ease of understanding, the following takes the encryption of data transmitted from the terminal 10 to the terminal 20 as an example. As shown in fig. 4, the embodiment of the application further provides a voice processing method, which includes the following steps:
S402, the terminal 10 collects the voice of the user 1.
S404, the terminal 10 filters the voice of the user 1 using the voiceprint information.
In some embodiments, the terminal 10 may pre-store the voiceprint information of the user 1. After collecting the voice, the terminal 10 may reduce noise in it according to the voiceprint information of the user 1, so as to remove the voices of other speakers. Specifically, the terminal 10 compares the voiceprint information of the collected voice with the voiceprint information of the user 1. If the comparison shows that the voice belongs to the user 1, the terminal 10 performs subsequent processing such as recognition on it; if the comparison shows that the voice belongs to another user, the terminal 10 determines that it is noise and may filter it out. In this way, the terminal 10 filters out the voices of other speakers mixed into the voice of the user 1, reducing the noise in the collected voice.
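As an illustration of the comparison step only (the application does not prescribe a particular algorithm), a common realization is to extract a fixed-length voiceprint embedding from each voice segment and keep the segment only when its similarity to the enrolled embedding of the user 1 exceeds a threshold. The embedding extractor and the 0.7 threshold below are assumptions made for the sketch:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity between two voiceprint embeddings, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def filter_by_voiceprint(segments, enrolled, extract_embedding, threshold=0.7):
    """Keep only the segments whose voiceprint matches the enrolled user 1.

    `extract_embedding` is a hypothetical function mapping a voice segment to
    a fixed-length embedding; the 0.7 threshold is likewise an assumed value.
    """
    kept = []
    for segment in segments:
        if cosine_similarity(extract_embedding(segment), enrolled) >= threshold:
            kept.append(segment)   # voice of user 1: passed on to recognition
        # otherwise: treated as noise from another speaker and dropped
    return kept
```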
S406, the terminal 10 recognizes the filtered voice to obtain a recognition result.
The manner in which the terminal 10 recognizes the filtered voice is similar to the manner in which the terminal 10 directly recognizes the voice of the user 1; for details, reference may be made to S204 in the above embodiment, which is not repeated here.
Because the terminal 10 performs noise reduction on the collected voice of the user 1 in advance, the accuracy of voice recognition is improved, and the terminal 10 obtains a more accurate recognition result.
S408, the terminal 10 encrypts the recognition result.
In some embodiments, the terminal 10 may encrypt the recognition result to obtain a ciphertext. When the data transmitted to the terminal 20 includes the voiceprint information of the user 1, the terminal 10 may encrypt the voiceprint information together with the recognition result to obtain the ciphertext. For example, the terminal 10 stores a preset mapping relationship between plaintext and ciphertext, and converts the plaintext recognition result and voiceprint information into ciphertext according to this mapping relationship, which may be agreed upon in advance between the terminal 10 and the terminal 20.
In some embodiments, the terminal 10 may set separate encryption information bases for the recognition result and the voiceprint information. For example, a first encryption information base stores a first mapping relationship between first plaintext and first ciphertext, and a second encryption information base stores a second mapping relationship between second plaintext and second ciphertext. The terminal 10 encrypts the recognition result according to the first mapping relationship, converting the plaintext recognition result into ciphertext, and encrypts the voiceprint information according to the second mapping relationship, converting the plaintext voiceprint information into ciphertext. In some implementations, the terminal 10 may also add an identifier to the ciphertext of the recognition result, so that the terminal 20 can recover the plaintext recognition result from the corresponding encryption information base according to the identifier; similarly, the terminal 10 may add an identifier to the ciphertext of the voiceprint information.
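As a toy illustration of the mapping idea only (a real deployment would use a standard cipher), each encryption information base can be modelled as a pre-agreed dictionary, and the identifier can be prepended to the ciphertext. The mappings, the "R"/"V" identifiers, and the tiny alphabet below are invented for the example:

```python
# Toy model of the two pre-agreed encryption information bases. The mappings,
# the "R"/"V" identifiers, and the tiny alphabet are invented for this example;
# a production system would use a standard cipher instead.
RESULT_BASE = {"a": "q", "b": "w", "c": "e"}       # first mapping: recognition result
VOICEPRINT_BASE = {"a": "z", "b": "x", "c": "v"}   # second mapping: voiceprint info

def encrypt(plaintext: str, base: dict, identifier: str) -> str:
    """Map plaintext to ciphertext and tag it with an identifier so that the
    receiving end knows which information base to consult."""
    body = "".join(base.get(ch, ch) for ch in plaintext)
    return identifier + ":" + body

ciphertext = encrypt("abc", RESULT_BASE, "R")   # -> "R:qwe"
```

The receiving end, which holds the same pre-agreed bases, simply inverts the mapping, as sketched after S412 below.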
In other embodiments, the terminal 10 may transmit only the recognition result to the terminal 20, without repeatedly transmitting the voiceprint information of the user 1. The terminal 10 then needs to encrypt only the recognition result, which reduces the time spent on encryption and thus the delay of transmitting data to the terminal 20.
S410, the terminal 10 sends the ciphertext to the terminal 20.
Because the terminal 10 sends the ciphertext to the terminal 20, even if the ciphertext is intercepted by a lawbreaker such as a hacker, the lawbreaker cannot recover from it the data actually sent by the terminal 10 to the terminal 20, which improves the security of the data transmitted between the terminal 10 and the terminal 20.
S412, the terminal 20 decrypts the ciphertext to obtain the recognition result.
Correspondingly, the mapping relationship between plaintext and ciphertext is also stored in the terminal 20. After receiving the ciphertext sent by the terminal 10, the terminal 20 determines the plaintext corresponding to the ciphertext according to this mapping relationship, thereby obtaining the plaintext recognition result; when the data transmitted by the terminal 10 includes the voiceprint information of the user 1, the terminal 20 likewise obtains the plaintext voiceprint information, so that decryption is realized.
In some embodiments, after receiving the ciphertext transmitted by the terminal 10, the terminal 20 may also determine a corresponding encryption information base through the identifier carried by the ciphertext, and then query the corresponding encryption information base to obtain a plaintext corresponding to the ciphertext, so as to achieve decryption. For example, if the terminal 20 determines that the ciphertext is the ciphertext of the recognition result according to the identifier carried by the ciphertext, the terminal 20 queries the first mapping relationship in the first encryption information base to decrypt the ciphertext, thereby obtaining the recognition result in the plaintext form.
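Continuing the toy sketch above, the receiving side selects the information base through the identifier and applies the inverse mapping to recover the plaintext (the bases are repeated here so the snippet runs on its own):

```python
# Continues the toy sketch above: the receiving end holds the same pre-agreed
# bases and inverts them.
RESULT_BASE = {"a": "q", "b": "w", "c": "e"}
VOICEPRINT_BASE = {"a": "z", "b": "x", "c": "v"}
INVERSE_BASES = {
    "R": {v: k for k, v in RESULT_BASE.items()},
    "V": {v: k for k, v in VOICEPRINT_BASE.items()},
}

def decrypt(ciphertext: str) -> str:
    """Select the information base via the identifier, then invert the mapping."""
    identifier, body = ciphertext.split(":", 1)
    inverse = INVERSE_BASES[identifier]
    return "".join(inverse.get(ch, ch) for ch in body)

assert decrypt("R:qwe") == "abc"   # recovers the plaintext recognition result
```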
S414, the terminal 20 performs voice synthesis according to the recognition result to obtain synthesized voice.
The manner in which the terminal 20 performs the speech synthesis according to the recognition result may be specifically referred to S208 in the above embodiment, which is not described herein again.
S416, the terminal 20 plays the synthesized voice.
The terminal 20 plays the synthesized voice, so that the user 2 on the terminal 20 side hears the voice of the user 1 free of any background noise, thereby realizing noise-free real-time voice communication.
Based on the above description, the embodiment of the application provides a voice processing method. In this method, the terminal 10 encrypts the data transmitted to the terminal 20, such as the recognition result and the voiceprint information, to obtain a ciphertext, and sends the ciphertext to the terminal 20, thereby improving the security of the transmitted data. Even if a lawbreaker intercepts the ciphertext, the lawbreaker cannot decrypt it to obtain the data actually transmitted by the terminal 10 to the terminal 20. Thus, the method improves the security of real-time voice communication.
The voice processing method provided by the embodiment of the present application is described in detail above with reference to fig. 1 to 4, and the apparatus and device provided by the embodiment of the present application are described below with reference to the accompanying drawings.
Referring to the schematic structure of the speech processing apparatus shown in fig. 5, the apparatus 500 includes:
an acquisition unit 502, configured to collect the voice of a user;
a recognition unit 504, configured to recognize the voice of the user to obtain a recognition result, where the recognition result at least includes pronunciation prompt information;
a sending unit 506, configured to send the recognition result to a receiving end, so that the receiving end performs speech synthesis according to the recognition result.
In some possible implementations, the sending unit 506 is configured to send the recognition result and the voiceprint information of the user to the receiving end.
In some possible implementations, the pronunciation prompt information includes phonemes;
the recognition unit 504 is configured to recognize the voice of the user through an acoustic model to obtain the phonemes corresponding to the voice.
In some possible implementations, the recognition unit 504 further obtains, through the acoustic model, a pronunciation time or a pronunciation interval time of the phoneme corresponding to the speech, and the recognition result further includes the pronunciation time or the pronunciation interval time.
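For illustration, a recognition result carrying phonemes together with their pronunciation time and interval time might be represented as follows; the field names and units are assumptions made for the sketch, not a format defined by this application:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PhonemeEntry:
    phoneme: str                 # e.g. "n", "i", "h", "ao"
    pronunciation_time: float    # duration of the phoneme, in seconds
    interval_time: float = 0.0   # pause before the next phoneme, in seconds

@dataclass
class RecognitionResult:
    phonemes: List[PhonemeEntry] = field(default_factory=list)
    text: Optional[str] = None   # optional text with the same pronunciation

# Example: "ni hao", with per-phoneme durations and a pause between syllables.
result = RecognitionResult(
    phonemes=[
        PhonemeEntry("n", 0.08),
        PhonemeEntry("i", 0.12, interval_time=0.05),
        PhonemeEntry("h", 0.07),
        PhonemeEntry("ao", 0.15),
    ],
    text="ni hao",
)
```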
In some possible implementations, the pronunciation prompt information includes text information that has the same pronunciation as the speech.
In some possible implementations, the recognition unit 504 is configured to recognize phonemes through an acoustic model and then decode the phonemes through a language model to obtain the text information.
In some possible implementations, the apparatus further includes:
an encryption unit 508, configured to encrypt the recognition result;
the sending unit 506 is configured to send the encrypted recognition result to the receiving end.
In some possible implementations, the apparatus further includes:
a filtering unit 510, configured to filter the voice of the user according to the voiceprint information of the user.
In some possible implementations, the sending unit 506 is configured to generate a character stream according to the recognition result and transmit the character stream to the receiving end in a streaming, sequential manner.
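The streaming, sequential transmission can be pictured as a generator that serializes recognition-result entries in order, as they become available, so the receiving end can begin synthesis before the whole utterance has been recognized. The JSON-lines framing below is an assumed wire format, not one specified by the application:

```python
import json
from typing import Iterable, Iterator

def character_stream(entries: Iterable[dict]) -> Iterator[str]:
    """Serialize recognition-result entries one by one, in order, so that the
    receiving end can start synthesizing before the whole utterance is done."""
    for entry in entries:
        yield json.dumps(entry) + "\n"   # JSON-lines framing: assumed, not specified

# Stand-in for sending each chunk over the network as soon as it is ready:
for chunk in character_stream([{"phoneme": "n", "t": 0.08},
                               {"phoneme": "i", "t": 0.12}]):
    print(chunk, end="")
```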
The speech processing apparatus 500 according to the embodiment of the present application may correspondingly perform the methods described in the embodiments of the present application, and the above and other operations and/or functions of the modules/units of the speech processing apparatus 500 implement the corresponding flows of the methods in the foregoing method embodiments; for brevity, they are not repeated here.
Referring to the schematic structural diagram of the speech processing apparatus shown in fig. 6, the apparatus 600 includes:
an obtaining unit 602, configured to obtain a recognition result of the voice of a user, where the recognition result at least includes pronunciation prompt information;
a synthesis unit 604, configured to perform speech synthesis according to the pronunciation prompt information to obtain synthesized speech;
a playing unit 606, configured to play the synthesized speech, so as to realize real-time voice communication.
In some possible implementations, the obtaining unit 602 is configured to obtain the voiceprint information of the user;
the synthesis unit 604 is configured to perform speech synthesis according to the pronunciation prompt information and the voiceprint information of the user to obtain the synthesized speech.
In some possible implementations, the pronunciation prompt information includes phonemes;
the synthesis unit 604 is configured to synthesize an initial voice according to the pronunciation prompt information, and to obtain the synthesized speech corresponding to the user according to the initial voice and the voiceprint information of the user.
In some possible implementations, the recognition result further includes the pronunciation time or pronunciation interval time of the phonemes;
the synthesis unit 604 is configured to perform speech synthesis according to the phonemes, the voiceprint information of the user, and one of the pronunciation time and the pronunciation interval time of the phonemes.
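As a rough sketch of how the timing information could steer synthesis (the unit generator and the voiceprint-conditioning step are hypothetical placeholders, not components defined by this application), the phonemes can be laid out on a timeline using their pronunciation and interval times before the voiceprint is applied:

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed sample rate for the sketch

def synthesize_with_timing(phonemes, render_phoneme, apply_voiceprint, voiceprint):
    """Lay phoneme units on a timeline using their pronunciation and interval
    times, then condition the result on the user's voiceprint.

    `render_phoneme(phoneme, duration)` and `apply_voiceprint(audio, voiceprint)`
    are hypothetical stand-ins for an acoustic generator and a voice-conversion
    step; neither is a component defined by this application.
    """
    pieces = []
    for p in phonemes:  # e.g. the PhonemeEntry objects sketched earlier
        pieces.append(render_phoneme(p.phoneme, p.pronunciation_time))
        # Interval time becomes silence, preserving the user's sentence breaking.
        pieces.append(np.zeros(int(p.interval_time * SAMPLE_RATE)))
    initial_voice = np.concatenate(pieces) if pieces else np.zeros(0)
    return apply_voiceprint(initial_voice, voiceprint)  # sound like user 1
```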
In some possible implementations, the pronunciation prompt information includes text information that has the same pronunciation as the speech.
In some possible implementations, the text information is obtained by the sending end recognizing phonemes through an acoustic model and then decoding the phonemes through a language model.
In some possible implementations, the apparatus further includes a decryption unit 608;
the obtaining unit 602 is configured to receive the encrypted recognition result sent by the sending end;
the decryption unit 608 is configured to decrypt the recognition result from the encrypted recognition result.
In some possible implementations, the recognition result is transmitted in a character stream;
the synthesis unit 604 is configured to perform speech synthesis according to the pronunciation prompt information and the voiceprint information of the user, following the order of the character stream.
The speech processing apparatus 600 according to the embodiment of the present application may correspondingly perform the methods described in the embodiments of the present application, and the above and other operations and/or functions of the modules/units of the speech processing apparatus 600 implement the corresponding flows of the methods in the foregoing method embodiments; for brevity, they are not repeated here.
The embodiment of the application also provides voice processing equipment. The speech processing device may be the terminal 10, configured to implement the functions of the speech processing apparatus 500 in the embodiment shown in fig. 5. The hardware architecture of the speech processing device will be described below using the terminal 10 as an example.
Fig. 7 provides a schematic structural diagram of the terminal 10, and as shown in fig. 7, the terminal 10 includes a bus 701, a processor 702, a communication interface 703, and a memory 704. Communication between processor 702, memory 704 and communication interface 703 is via bus 701.
The bus 701 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. Buses may be divided into address buses, data buses, control buses, and so on. For ease of illustration, only one thick line is shown in fig. 7, but this does not mean that there is only one bus or only one type of bus.
The processor 702 may be any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
The communication interface 703 is used for communication with the outside. For example, the communication interface 703 may be used for communication with the terminal 20. The communication interface is used to send the recognition result to the terminal 20 so that the terminal 20 performs speech synthesis according to the recognition result.
The memory 704 may include volatile memory, such as random access memory (RAM). The memory 704 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
The memory 704 has stored therein executable code that the processor 702 executes to perform the speech processing methods of the previous embodiments.
In particular, when the embodiment shown in fig. 5 is implemented and the modules or units of the speech processing apparatus 500 described in that embodiment are implemented by software, the software or program code required to perform the functions of each module/unit in fig. 5 may be partially or entirely stored in the memory 704. The processor 702 executes the program code corresponding to each unit stored in the memory 704 to perform the voice processing method of the foregoing embodiments.
The embodiment of the application also provides voice processing equipment. The speech processing device may be the terminal 20, configured to implement the functions of the speech processing apparatus 600 in the embodiment shown in fig. 6. The hardware architecture of the speech processing device will be described below using the terminal 20 as an example.
Fig. 8 provides a schematic structural diagram of the terminal 20. As shown in fig. 8, the terminal 20 includes a bus 801, a processor 802, a communication interface 803, and a memory 804. Communication between the processor 802, the memory 804, and the communication interface 803 is via the bus 801.
The communication interface 803 is used for communication with the outside. For example, communication interface 803 may be used to communicate with terminal 10. The communication interface is used for receiving the recognition result transmitted by the terminal 10 so that the terminal 20 performs voice synthesis according to the recognition result.
The memory 804 has stored therein executable code that the processor 802 executes to perform the speech processing methods of the previous embodiments.
In particular, when the embodiment shown in fig. 6 is implemented and the modules or units of the speech processing apparatus 600 described in that embodiment are implemented by software, the software or program code required to perform the functions of each module/unit in fig. 6 may be partially or entirely stored in the memory 804. The processor 802 executes the program code corresponding to each unit stored in the memory 804 to perform the voice processing method of the foregoing embodiments.
The embodiment of the application also provides a computer readable storage medium. The computer readable storage medium may be any available medium that a computing device can store, or a data storage device, such as a data center, that contains one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk). The computer readable storage medium includes instructions that instruct a computing device to perform the voice processing method described above as applied to the voice processing apparatus 500 or the voice processing apparatus 600.
Embodiments of the present application also provide a computer program product including one or more computer instructions. When the computer instructions are loaded and executed on a computing device, the processes or functions according to the embodiments of the present application are produced in whole or in part.
The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, or data center to another website, computer, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave).
The computer program product, when executed by a computer, performs any of the voice processing methods in the embodiments of the present application. The computer program product may be a software installation package; when any of those voice processing methods needs to be used, the software installation package may be downloaded and executed on a computer.
The description of each process or structure corresponding to the drawings has its own emphasis; for any part of a process or structure that is not described in detail, reference may be made to the descriptions of the other processes or structures.

Claims (22)

1. A voice processing method, applied to a voice communication system, wherein the system comprises a sending end and a receiving end, and the method comprises:
the sending end collects voice of a user;
the sending end recognizes the voice of the user to obtain a recognition result, wherein the recognition result at least comprises pronunciation prompt information, and the pronunciation prompt information comprises phonemes, pronunciation time of the phonemes, and pronunciation interval time;
the sending end sends the recognition result to the receiving end, so that the receiving end performs voice synthesis according to the recognition result and voiceprint information of the user to obtain synthesized voice, wherein the synthesized voice is the voice of the user, the voice of the user is consistent with the speech speed and sentence breaking of the user when speaking, and the voiceprint information of the user is pre-recorded or acquired in real time by the sending end and sent to the receiving end, or is pre-stored by the receiving end.
2. The method according to claim 1, wherein the sending end sending the recognition result to the receiving end comprises:
the sending end sending the voiceprint information of the user to the receiving end, so that the receiving end performs voice synthesis according to the recognition result and the voiceprint information of the user.
3. The method according to claim 1, wherein
the sending end recognizing the voice of the user to obtain the recognition result comprises:
the sending end recognizing the voice of the user through an acoustic model to obtain the phonemes corresponding to the voice.
4. The method according to claim 3, wherein the sending end further obtains, through the acoustic model, the pronunciation time and pronunciation interval time of the phonemes corresponding to the voice.
5. The method according to claim 1, wherein the pronunciation prompt information comprises text information having the same pronunciation as the voice.
6. The method according to claim 5, wherein the text information is obtained by the sending end recognizing phonemes through an acoustic model and then decoding the phonemes through a language model.
7. The method according to any one of claims 1 to 6, further comprising:
encrypting the recognition result;
wherein the sending end sending the recognition result to the receiving end comprises:
the sending end sending the encrypted recognition result to the receiving end.
8. The method according to any one of claims 1 to 6, further comprising:
filtering the voice of the user according to the voiceprint information of the user.
9. The method according to any one of claims 1 to 6, wherein the sending end sending the recognition result to the receiving end comprises:
the sending end generating a character stream according to the recognition result;
the sending end transmitting the character stream to the receiving end in a streaming, sequential transmission manner.
10. A voice processing method, applied to a voice communication system, wherein the system comprises a sending end and a receiving end, and the method comprises:
The receiving end obtains a recognition result of the voice of the user, wherein the recognition result at least comprises pronunciation prompt information, and the pronunciation prompt information comprises phonemes, pronunciation time of the phonemes and pronunciation interval time;
the receiving end performs voice synthesis according to the pronunciation prompt information and voiceprint information of the user to obtain synthesized voice, wherein the synthesized voice is the voice of the user, and the voice of the user is consistent with the speech speed and sentence breaking of the user when speaking;
the receiving end plays the synthesized voice so as to realize real-time voice communication.
11. The method according to claim 10, wherein the method further comprises:
acquiring voiceprint information of the user.
12. The method according to claim 10, wherein
the receiving end performing voice synthesis according to the pronunciation prompt information and the voiceprint information of the user to obtain the synthesized voice comprises:
the receiving end synthesizing an initial voice according to the pronunciation prompt information; and
the receiving end obtaining the synthesized voice corresponding to the user according to the initial voice and the voiceprint information of the user.
13. The method according to claim 10, wherein
the receiving end performing voice synthesis according to the pronunciation prompt information and the voiceprint information of the user comprises:
the receiving end performing voice synthesis according to the phonemes, the pronunciation time and pronunciation interval time of the phonemes, and the voiceprint information of the user.
14. The method according to claim 10, wherein the pronunciation prompt information comprises text information having the same pronunciation as the voice.
15. The method according to claim 14, wherein the text information is obtained by the sending end recognizing phonemes through an acoustic model and then decoding the phonemes through a language model.
16. The method according to any one of claims 10 to 15, wherein the receiving end obtaining the recognition result of the voice of the user comprises:
the receiving end receiving the encrypted recognition result sent by the sending end;
the receiving end decrypting the recognition result from the encrypted recognition result.
17. The method according to any one of claims 11 to 15, wherein the recognition result is transmitted in a character stream;
wherein the receiving end performing voice synthesis according to the pronunciation prompt information and the voiceprint information of the user comprises:
the receiving end performing voice synthesis according to the pronunciation prompt information and the voiceprint information of the user, following the order of the character stream.
18. A speech processing apparatus, comprising:
an acquisition unit, configured to collect the voice of a user;
a recognition unit, configured to recognize the voice of the user to obtain a recognition result, wherein the recognition result at least comprises pronunciation prompt information, and the pronunciation prompt information comprises phonemes, pronunciation time of the phonemes, and pronunciation interval time;
a sending unit, configured to send the recognition result to a receiving end, so that the receiving end performs voice synthesis according to the recognition result and voiceprint information of the user to obtain synthesized voice, wherein the synthesized voice is the voice of the user, the voice of the user is consistent with the speech speed and sentence breaking of the user when speaking, and the voiceprint information of the user is pre-recorded or acquired in real time by the sending end and sent to the receiving end, or is pre-stored by the receiving end.
19. A speech processing apparatus, comprising:
an acquisition unit, configured to acquire a recognition result of the voice of a user, wherein the recognition result at least comprises pronunciation prompt information, and the pronunciation prompt information comprises phonemes, pronunciation time of the phonemes, and pronunciation interval time;
a synthesis unit, configured to perform voice synthesis according to the pronunciation prompt information and voiceprint information of the user to obtain synthesized voice, wherein the synthesized voice is the voice of the user, and the voice of the user is consistent with the speech speed and sentence breaking of the user when speaking;
a playing unit, configured to play the synthesized voice, so as to realize real-time voice communication.
20. An apparatus comprising a processor and a memory;
the processor is configured to execute instructions stored in the memory to cause the apparatus to perform the method of any one of claims 1 to 17.
21. A computer readable storage medium comprising instructions that instruct a device to perform the method of any one of claims 1 to 17.
22. A computer program product, characterized in that the computer program product, when run on a computer, causes the computer to perform the method of any one of claims 1 to 17.
CN202110455104.1A 2021-04-26 2021-04-26 A voice processing method, device, equipment, medium, and program product Active CN113178187B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110455104.1A CN113178187B (en) 2021-04-26 2021-04-26 A voice processing method, device, equipment, medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110455104.1A CN113178187B (en) 2021-04-26 2021-04-26 A voice processing method, device, equipment, medium, and program product

Publications (2)

Publication Number Publication Date
CN113178187A CN113178187A (en) 2021-07-27
CN113178187B true CN113178187B (en) 2025-02-07

Family

ID=76926222

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110455104.1A Active CN113178187B (en) 2021-04-26 2021-04-26 A voice processing method, device, equipment, medium, and program product

Country Status (1)

Country Link
CN (1) CN113178187B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116072143B (en) * 2023-02-07 2025-05-16 腾讯音乐娱乐科技(深圳)有限公司 Singing voice synthesizing method and related device
CN116564313A (en) * 2023-05-16 2023-08-08 联想(北京)有限公司 Virtual person control method, device, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6226361B1 (en) * 1997-04-11 2001-05-01 Nec Corporation Communication method, voice transmission apparatus and voice reception apparatus
CN111916052A (en) * 2020-07-30 2020-11-10 北京声智科技有限公司 Voice synthesis method and device

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1152974A (en) * 1997-08-08 1999-02-26 Hitachi Microcomput Syst Ltd Transmitter, receiver, and communication device
JP2000349865A (en) * 1999-06-01 2000-12-15 Matsushita Electric Works Ltd Voice communication apparatus
JP2003044091A (en) * 2001-07-31 2003-02-14 Ntt Docomo Inc Voice recognition system, portable information terminal, voice information processing device, voice information processing method, and voice information processing program
KR20040101373A (en) * 2002-03-27 2004-12-02 앨리프컴 Microphone and voice activity detection (vad) configurations for use with communication systems
CN106782498B (en) * 2016-11-30 2019-12-10 Oppo广东移动通信有限公司 Voice information playing method and device and terminal
CN110069608B (en) * 2018-07-24 2022-05-27 百度在线网络技术(北京)有限公司 Voice interaction method, device, equipment and computer storage medium
CN110246502A (en) * 2019-06-26 2019-09-17 广东小天才科技有限公司 Voice noise reduction method and device and terminal equipment
CN111147444B (en) * 2019-11-20 2021-08-06 维沃移动通信有限公司 An interactive method and electronic device
CN111246469B (en) * 2020-03-05 2020-10-16 北京花兰德科技咨询服务有限公司 Artificial intelligence secret communication system and communication method
CN111508469A (en) * 2020-04-26 2020-08-07 北京声智科技有限公司 Text-to-speech conversion method and device
CN111667834B (en) * 2020-05-21 2023-10-13 北京声智科技有限公司 Hearing-aid equipment and hearing-aid method
CN112420017A (en) * 2020-11-13 2021-02-26 北京沃东天骏信息技术有限公司 Speech synthesis method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6226361B1 (en) * 1997-04-11 2001-05-01 Nec Corporation Communication method, voice transmission apparatus and voice reception apparatus
CN111916052A (en) * 2020-07-30 2020-11-10 北京声智科技有限公司 Voice synthesis method and device

Also Published As

Publication number Publication date
CN113178187A (en) 2021-07-27

Similar Documents

Publication Publication Date Title
US10600414B1 (en) Voice control of remote device
US8081993B2 (en) Voice over short message service
CN111899719A (en) Method, apparatus, device and medium for generating audio
US5911129A (en) Audio font used for capture and rendering
Yu et al. Antifake: Using adversarial audio to prevent unauthorized speech synthesis
US12198678B2 (en) Electronic device and control method thereof
WO2021218324A1 (en) Song synthesis method, device, readable medium, and electronic apparatus
US8768701B2 (en) Prosodic mimic method and apparatus
JP2009294642A (en) Method, system and program for synthesizing speech signal
CN110149805A (en) Two-way voice translation system, two-way voice translation method and program
JPH10507536A (en) Language recognition
JP2019208138A (en) Utterance recognition device and computer program
Rekimoto WESPER: Zero-shot and realtime whisper to normal voice conversion for whisper-based speech interactions
CN113178187B (en) A voice processing method, device, equipment, medium, and program product
Gallardo Human and automatic speaker recognition over telecommunication channels
JP6599828B2 (en) Sound processing method, sound processing apparatus, and program
TWI695281B (en) Translation system, translation method, and translation device
WO2000058949A1 (en) Low data transmission rate and intelligible speech communication
Zhang et al. Commandergabble: A universal attack against asr systems leveraging fast speech
CN110298150B (en) Identity verification method and system based on voice recognition
US20250273203A1 (en) Information processing device, information processing method, and computer program
Burke Speech processing for ip networks: Media resource control protocol (MRCP)
JP2002297199A (en) Method and device for discriminating synthesized voice and voice synthesizer
CN113053364A (en) Voice recognition method and device for voice recognition
KR100553437B1 (en) Wireless communication terminal having voice message transmission function using speech synthesis and method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant