CN114333836B - Audio transformer based on AI video sound channel
Audio transformer based on AI video sound channel
- Publication number
- CN114333836B CN114333836B CN202111591729.7A CN202111591729A CN114333836B CN 114333836 B CN114333836 B CN 114333836B CN 202111591729 A CN202111591729 A CN 202111591729A CN 114333836 B CN114333836 B CN 114333836B
- Authority
- CN
- China
- Prior art keywords
- voice
- signal
- module
- word
- voice signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Machine Translation (AREA)
- Telephonic Communication Services (AREA)
Abstract
The invention relates to a voice changer based on an AI video sound channel, characterized by comprising: a voice recording module; a voice packet generation module; a voice decomposition module; a voice signal recognition module; a speech synthesis module; a voice correction module; and a sound synthesis module. Compared with the prior art, the voice changer provided by the invention converts the sound produced by an AI voice assistant into a sound close to that of the target speaker, so that the assistant's speech sounds much more like a real human voice. In addition, in the disclosed technical scheme the voice packet only stores the voice signals corresponding to the 63 pinyin elements, so little storage is required.
Description
Technical Field
The present invention relates to a voice changer based on an AI video sound channel.
Background
At present, almost all intelligent terminals have a built-in AI voice assistant, such as Xiaomi's Xiao AI or Apple's Siri. These AI voice assistants share two common features: first, they respond to voice commands given by the user and control the intelligent terminal accordingly; second, they can carry out simple interactions with the user.
When the AI voice assistant interacts with the user, or gives voice feedback on the execution of a voice command, it loads its built-in voice packet and produces sound according to the voice packet's settings. In practice, however, the user can usually only choose whether the AI voice assistant speaks with a male or a female voice, and that voice still differs noticeably from a real person's; to the ear it remains closer to a robot voice.
Disclosure of Invention
The invention aims to solve the following technical problem: at present, the kinds of sound an AI voice assistant can produce are limited, and they remain closer to a robot voice than to a real human voice.
In order to solve this technical problem, the technical scheme of the invention provides a voice changer based on an AI video sound channel, characterized by comprising:
a voice recording module, used to acquire the voice signal corresponding to each pinyin element after the target speaker reads aloud the 63 phonetic elements of Chinese pinyin;
a voice packet generation module, used to store each pinyin element obtained through the voice recording module in association with its corresponding voice signal so as to form a voice packet, and to store the voice packet in the intelligent terminal;
a voice decomposition module, used to acquire a segment of robot voice produced by the current intelligent terminal, convert the segment of robot voice into text, perform word segmentation on the text to obtain the pinyin elements that make up each word, and acquire the robot voice signal corresponding to each word;
a voice signal recognition module, used to recognize the signal parameters of a voice signal;
wherein, after the voice decomposition module obtains the robot voice signal corresponding to each word, the voice signal recognition module is called to obtain the signal parameters of the robot voice signal corresponding to each word;
a speech synthesis module, used to call the voice packet generated by the voice packet generation module and synthesize the target-speaker voice signal corresponding to each word obtained by the voice decomposition module, through the following steps:
step 1, the speech synthesis module obtains all the pinyin elements corresponding to each word given by the voice decomposition module;
step 2, according to the pinyin elements obtained in step 1, the target-speaker voice signal corresponding to each pinyin element is retrieved from the voice packet generated by the voice packet generation module;
step 3, for any word in step 1, after the target-speaker voice signals corresponding to all the pinyin elements making up the current word have been obtained, the voice signal recognition module is called to obtain the voice parameters of each voice signal, and a speech synthesis algorithm combines, on the basis of these voice parameters, all the target-speaker voice signals corresponding to the same word into a single target-speaker voice signal;
wherein, after the speech synthesis module obtains the target-speaker voice signal corresponding to each word, the voice signal recognition module is called to obtain the signal parameters of the target-speaker voice signal corresponding to each word;
a voice correction module, which obtains the signal parameters of the robot voice signal and of the target-speaker voice signal for the same word, calculates signal correction data from the difference between the two sets of parameters, and corrects the robot voice signal using the correction data;
and a sound synthesis module: after the voice correction module has corrected the robot voice signals of all the words making up the segment obtained by the voice decomposition module, the sound synthesis module combines all the corrected robot voice signals into one voice signal segment in word order and plays the segment.
Preferably, the signal parameters include the pitch period, the fundamental frequency and the formant frequencies.
Preferably, the voice signal recognition module first low-pass filters the input voice signal, then performs an autocorrelation calculation and estimates the pitch period of the input voice signal from the autocorrelation signal; the voice signal recognition module then calculates the fundamental frequency from the estimated pitch period and obtains the formant frequencies corresponding to the current fundamental frequency from the functional relation between fundamental frequency and formant frequency.
Preferably, the voice correction module forms the aforementioned signal correction data from the pitch periods, fundamental frequencies and formant frequencies of the robot voice signal and the target-speaker voice signal of the same word.
Compared with the prior art, the voice changer provided by the invention converts the sound produced by an AI voice assistant into a sound close to that of the target speaker, so that the assistant's speech sounds much more like a real human voice. In addition, in the disclosed technical scheme the voice packet only stores the voice signals corresponding to the 63 pinyin elements, so little storage is required.
Drawings
Fig. 1 illustrates the flow of the present invention.
Detailed Description
The application will be further illustrated below with reference to specific embodiments. It should be understood that these embodiments are only intended to illustrate the application, not to limit its scope. Furthermore, it should be understood that, after reading the teachings of the application, those skilled in the art may make various changes or modifications, and such equivalents likewise fall within the scope defined by the appended claims.
The invention provides a voice changer based on an AI video sound channel, which comprises the following components:
A voice recording module, used to acquire the voice signal corresponding to each pinyin element after the target speaker reads aloud the 63 phonetic elements of Chinese pinyin (the 23 initials, 24 finals and 16 whole-syllable readings).
A voice packet generation module, used to store each pinyin element obtained through the voice recording module in association with its corresponding voice signal so as to form a voice packet, and to store the voice packet in the intelligent terminal.
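The patent does not fix a storage format for the voice packet. As a rough illustration only, the packet can be modeled as a mapping from each pinyin element to the target speaker's recorded waveform; the element names, sample rate and placeholder waveforms below are assumptions, not part of the patent.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed sample rate for the target speaker's recordings

def build_voice_packet(recordings: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    """Associate each pinyin element with the target speaker's recording of it.

    `recordings` maps a pinyin element (e.g. "b", "ai") to a mono waveform
    captured by the voice recording module. Only the 63 elements are stored,
    which is why the packet needs so little space compared with a full voice.
    """
    return dict(recordings)

# Hypothetical usage with placeholder waveforms standing in for recordings:
packet = build_voice_packet({
    "b":  np.zeros(1_600),
    "ai": np.zeros(3_200),
})
```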
A voice decomposition module, used to acquire a segment of robot voice produced by the current intelligent terminal, convert the segment of robot voice into text, perform word segmentation on the text to obtain the pinyin elements that make up each word, and acquire the robot voice signal corresponding to each word.
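The patent leaves open both the speech-to-text step and the tool used to split text into pinyin elements. Assuming the text is already available, a minimal sketch of the pinyin decomposition could use the pypinyin library (our choice for illustration; the patent names no library):

```python
from pypinyin import lazy_pinyin, Style

def decompose(text: str) -> list[tuple[str, list[str]]]:
    """Split each character of `text` into its pinyin initial and final."""
    initials = lazy_pinyin(text, style=Style.INITIALS)
    finals = lazy_pinyin(text, style=Style.FINALS)
    result = []
    for char, ini, fin in zip(text, initials, finals):
        # Some syllables (e.g. "ai") have no initial, so filter empty parts.
        result.append((char, [p for p in (ini, fin) if p]))
    return result

print(decompose("你好"))  # [('你', ['n', 'i']), ('好', ['h', 'ao'])]
```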
A voice signal recognition module, used to recognize the signal parameters of a voice signal. In this embodiment the signal parameters are the pitch period, the fundamental frequency and the formant frequencies: the voice signal recognition module first low-pass filters the input voice signal, then performs an autocorrelation calculation and estimates the pitch period of the input voice signal from the autocorrelation signal. The module then calculates the fundamental frequency from the estimated pitch period and obtains the formant frequencies corresponding to the current fundamental frequency from the functional relation between fundamental frequency and formant frequency.
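The description specifies the order of operations (low-pass filter, autocorrelation, peak search) but not the filter cutoff or the pitch search range; the values in the sketch below are common defaults for voiced speech and are assumptions on our part. The formant model is likewise unspecified, so the sketch stops at the pitch period and fundamental frequency.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def estimate_pitch(signal: np.ndarray, fs: int = 16_000) -> tuple[float, float]:
    """Estimate (pitch period T0 in seconds, fundamental frequency f0 in Hz)."""
    # Low-pass filter: keep the fundamental (voiced-speech f0 is roughly
    # 60-400 Hz) and discard higher-frequency detail. 900 Hz cutoff assumed.
    b, a = butter(4, 900 / (fs / 2), btype="low")
    filtered = filtfilt(b, a, signal)

    # Autocorrelation; keep the non-negative lags only.
    ac = np.correlate(filtered, filtered, mode="full")[len(filtered) - 1:]

    # The strongest peak inside the plausible pitch-lag range gives the period.
    lo, hi = int(fs / 400), int(fs / 60)  # lags for 400 Hz down to 60 Hz
    lag = lo + int(np.argmax(ac[lo:hi]))
    t0 = lag / fs
    return t0, 1.0 / t0
```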
In the invention, after the voice decomposition module obtains the robot voice signal corresponding to each word, the voice signal recognition module is invoked to obtain the signal parameters of the robot voice signal for that word; in this embodiment these parameters are the pitch period, the fundamental frequency and the formant frequencies.
The speech synthesis module is used to call the voice packet generated by the voice packet generation module and synthesize the target-speaker voice signal corresponding to each word obtained by the voice decomposition module, through the following steps:
Step 1, the speech synthesis module obtains all the pinyin elements corresponding to each word given by the voice decomposition module.
Step 2, according to the pinyin elements obtained in step 1, the target-speaker voice signal corresponding to each pinyin element is retrieved from the voice packet generated by the voice packet generation module.
Step 3, for any word in step 1, after the target-speaker voice signals corresponding to all the pinyin elements making up the current word have been obtained, the voice signal recognition module is called to obtain the voice parameters of each voice signal (in this embodiment, the pitch period, the fundamental frequency and the formant frequencies), and a speech synthesis algorithm combines, on the basis of these voice parameters, all the target-speaker voice signals corresponding to the same word into a single target-speaker voice signal, as sketched below.
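The "speech synthesis algorithm" that joins the per-element signals of a word is not detailed in the patent. As a stand-in, a simple overlap with a short linear crossfade joins the element recordings without clicks; the fade length is an arbitrary assumption, and a real implementation would also use the extracted voice parameters, which this sketch ignores:

```python
import numpy as np

def synthesize_word(element_signals: list[np.ndarray],
                    fs: int = 16_000, fade_ms: float = 10.0) -> np.ndarray:
    """Concatenate per-element recordings of one word with a linear crossfade.

    Assumes each recording is longer than the fade region.
    """
    fade = int(fs * fade_ms / 1000)
    out = element_signals[0].astype(float)
    for nxt in element_signals[1:]:
        nxt = nxt.astype(float)
        ramp = np.linspace(0.0, 1.0, fade)
        # Blend the tail of the running signal with the head of the next one.
        out[-fade:] = out[-fade:] * (1.0 - ramp) + nxt[:fade] * ramp
        out = np.concatenate([out, nxt[fade:]])
    return out
```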
In the invention, after the speech synthesis module obtains the target-speaker voice signal corresponding to each word, the voice signal recognition module is called to obtain the signal parameters of the target-speaker voice signal for that word, namely the pitch period, the fundamental frequency and the formant frequencies.
The voice correction module obtains the signal parameters of the robot voice signal and of the target-speaker voice signal for the same word, calculates signal correction data from the difference between the two sets of parameters, and corrects the robot voice signal using the correction data.
In this embodiment, the voice correction module forms the signal correction data from the pitch periods, fundamental frequencies and formant frequencies of the robot voice signal and the target-speaker voice signal of the same word.
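How the correction data is applied is not spelled out. One plausible reading, sketched below under that assumption, is to shift the robot word's pitch toward the target speaker's by the ratio of the two fundamental frequencies. Plain resampling changes pitch and duration together and does not match formants; a production system would instead use a pitch-synchronous method such as PSOLA.

```python
import numpy as np
from scipy.signal import resample

def correct_robot_word(robot: np.ndarray, f0_robot: float,
                       f0_target: float) -> np.ndarray:
    """Shift the robot word's pitch toward the target speaker's f0."""
    ratio = f0_target / f0_robot   # the f0 difference drives the correction
    new_len = max(1, round(len(robot) / ratio))
    # Fewer samples played at the same rate -> higher pitch, and vice versa.
    return resample(robot, new_len)
```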
After the voice correction module has corrected the robot voice signals of all the words making up the segment obtained by the voice decomposition module, the sound synthesis module combines all the corrected robot voice signals into one voice signal segment in word order and plays the segment.
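Finally, a minimal sketch of the sound synthesis module's assembly-and-play step; the sounddevice playback backend and the normalization are assumptions (the patent only says the segment is played):

```python
import numpy as np
import sounddevice as sd  # assumed playback backend

def assemble_and_play(corrected_words: list[np.ndarray], fs: int = 16_000) -> None:
    """Join the corrected per-word signals in word order and play the result."""
    segment = np.concatenate(corrected_words)
    segment = segment / (np.max(np.abs(segment)) + 1e-9)  # avoid clipping
    sd.play(segment, fs)
    sd.wait()
```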
Claims (4)
1. A voice changer based on an AI video sound channel, characterized by comprising:
a voice recording module, used to acquire the voice signal corresponding to each pinyin element after the target speaker reads aloud the 63 phonetic elements of Chinese pinyin;
a voice packet generation module, used to store each pinyin element obtained through the voice recording module in association with its corresponding voice signal so as to form a voice packet, and to store the voice packet in the intelligent terminal;
a voice decomposition module, used to acquire a segment of robot voice produced by the current intelligent terminal, convert the segment of robot voice into text, perform word segmentation on the text to obtain the pinyin elements that make up each word, and acquire the robot voice signal corresponding to each word;
a voice signal recognition module, used to recognize the signal parameters of a voice signal;
wherein, after the voice decomposition module obtains the robot voice signal corresponding to each word, the voice signal recognition module is called to obtain the signal parameters of the robot voice signal corresponding to each word;
a speech synthesis module, used to call the voice packet generated by the voice packet generation module and synthesize the target-speaker voice signal corresponding to each word obtained by the voice decomposition module, through the following steps:
step 1, the speech synthesis module obtains all the pinyin elements corresponding to each word given by the voice decomposition module;
step 2, according to the pinyin elements obtained in step 1, the target-speaker voice signal corresponding to each pinyin element is retrieved from the voice packet generated by the voice packet generation module;
step 3, for any word in step 1, after the target-speaker voice signals corresponding to all the pinyin elements making up the current word have been obtained, the voice signal recognition module is called to obtain the voice parameters of each voice signal, and a speech synthesis algorithm combines, on the basis of these voice parameters, all the target-speaker voice signals corresponding to the same word into a single target-speaker voice signal;
wherein, after the speech synthesis module obtains the target-speaker voice signal corresponding to each word, the voice signal recognition module is called to obtain the signal parameters of the target-speaker voice signal corresponding to each word;
a voice correction module, which obtains the signal parameters of the robot voice signal and of the target-speaker voice signal for the same word, calculates signal correction data from the difference between the two sets of parameters, and corrects the robot voice signal using the correction data;
and a sound synthesis module: after the voice correction module has corrected the robot voice signals of all the words making up the segment obtained by the voice decomposition module, the sound synthesis module combines all the corrected robot voice signals into one voice signal segment in word order and plays the segment.
2. The voice changer based on an AI video sound channel according to claim 1, wherein the signal parameters include the pitch period, the fundamental frequency and the formant frequencies.
3. The voice changer based on an AI video sound channel according to claim 2, wherein the voice signal recognition module first low-pass filters the input voice signal, then performs an autocorrelation calculation and estimates the pitch period of the input voice signal from the autocorrelation signal; the voice signal recognition module then calculates the fundamental frequency from the estimated pitch period and obtains the formant frequencies corresponding to the current fundamental frequency from the functional relation between fundamental frequency and formant frequency.
4. The voice changer based on an AI video sound channel according to claim 3, wherein the voice correction module forms the aforementioned signal correction data from the pitch periods, fundamental frequencies and formant frequencies of the robot voice signal and the target-speaker voice signal of the same word.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111591729.7A | 2021-12-23 | 2021-12-23 | Audio transformer based on AI video sound channel |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114333836A CN114333836A (en) | 2022-04-12 |
| CN114333836B (en) | 2024-10-25 |
Family
ID=81053961
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202111591729.7A | Audio transformer based on AI video sound channel | 2021-12-23 | 2021-12-23 |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114333836B (en) |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN1567428A (en) * | 2003-06-19 | 2005-01-19 | 北京中科信利技术有限公司 | Phoneme changing method based on digital signal processing |
| CN111681637A (en) * | 2020-04-28 | 2020-09-18 | 平安科技(深圳)有限公司 | Song synthesis method, device, equipment and storage medium |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2002189490A (en) * | 2000-12-01 | 2002-07-05 | Leadtek Research Inc | Method of pinyin speech input |
| JP4469883B2 (en) * | 2007-08-17 | 2010-06-02 | 株式会社東芝 | Speech synthesis method and apparatus |
- 2021-12-23: application CN202111591729.7A filed in China; granted as CN114333836B (status: Active)
Also Published As
| Publication number | Publication date |
|---|---|
| CN114333836A (en) | 2022-04-12 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |