
CN114333836B - Audio transformer based on AI video sound channel - Google Patents

Info

Publication number
CN114333836B
CN114333836B (application CN202111591729.7A)
Authority
CN
China
Prior art keywords
voice
signal
module
word
voice signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111591729.7A
Other languages
Chinese (zh)
Other versions
CN114333836A (en)
Inventor
蔡彬
胡亚平
彭培超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
E Surfing Video Media Co Ltd
Original Assignee
E Surfing Video Media Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by E Surfing Video Media Co Ltd filed Critical E Surfing Video Media Co Ltd
Priority to CN202111591729.7A
Publication of CN114333836A
Application granted
Publication of CN114333836B
Legal status: Active


Landscapes

  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention relates to an audio transformer based on an AI video sound channel, characterized by comprising: a voice recording module; a voice packet generation module; a voice decomposition module; a voice signal recognition module; a voice synthesis module; and a voice correction module. Compared with the prior art, the audio transformer provided by the invention can convert the sound produced by an AI voice assistant into a sound close to that of a target speaker, so that the assistant's speech sounds much more like a real human voice. In addition, in the disclosed technical scheme, the voice packet stores only the voice signals corresponding to the 63 letters of Chinese pinyin, so very little storage is required.

Description

Audio transformer based on AI video sound channel
Technical Field
The present invention relates to an audio transformer based on an AI video sound channel.
Background
At present, almost all intelligent terminals have a built-in AI voice assistant, such as Xiaomi's XiaoAI or Apple's Siri. These AI voice assistants share two common features: first, they respond to voice instructions given by the user and control the intelligent terminal accordingly; second, they can carry out simple interactions with the user.
When the AI voice assistant interacts with the user, or reports the result of a voice instruction back to the user by voice, it loads its built-in voice packet and produces sound according to that packet's settings. In practice, however, the user can usually only choose whether the assistant speaks with a male or a female voice, and the assistant's voice still differs noticeably from a real person's; to the ear it remains closer to a robot voice.
Disclosure of Invention
The invention aims to solve the following technical problem: at present, the range of voices an AI voice assistant can produce is limited, and those voices sound robotic.
In order to solve this technical problem, the technical scheme of the invention provides an audio transformer based on an AI video sound channel, characterized by comprising:
a voice recording module, used to acquire the voice signal corresponding to each pinyin letter after the target speaker recites the 63 letters of Chinese pinyin;
a voice packet generation module, used to store each pinyin letter obtained through the voice recording module in association with its corresponding voice signal, forming a voice packet, and to store the voice packet in the intelligent terminal;
a voice decomposition module, used to acquire a segment of robot speech produced by the current intelligent terminal, convert that speech into text, perform word segmentation on the text to obtain the pinyin letters composing each word, and acquire the robot voice signal corresponding to each word;
a voice signal recognition module, used to recognize the signal parameters of a voice signal;
wherein, after the voice decomposition module obtains the robot voice signal corresponding to each word, the voice signal recognition module is invoked to obtain the signal parameters of that robot voice signal;
a voice synthesis module, used to call the voice packet generated by the voice packet generation module and then synthesize the target speaker voice signal corresponding to each word obtained by the voice decomposition module, through the following steps:
Step 1: the voice synthesis module obtains all pinyin letters corresponding to each word given by the voice decomposition module;
Step 2: according to the pinyin letters obtained in step 1, the target speaker voice signal corresponding to each pinyin letter is retrieved from the voice packet generated by the voice packet generation module;
Step 3: for any word in step 1, the target speaker voice signals corresponding to all pinyin letters composing the current word are obtained, the voice signal recognition module is called to obtain the voice parameters of each voice signal, and, based on those voice parameters, a voice synthesis algorithm joins all target speaker voice signals of the same word into one target speaker voice signal;
wherein, after the voice synthesis module obtains the target speaker voice signal corresponding to each word, the voice signal recognition module is called to obtain the signal parameters of that target speaker voice signal;
a voice correction module, which obtains the signal parameters of the robot voice signal and of the target speaker voice signal for the same word, calculates signal correction data from the difference between the two sets of parameters, and corrects the robot voice signal using the signal correction data;
and wherein, after the robot voice signals of all words composing the text segment obtained by the voice decomposition module have been corrected, the voice synthesis module synthesizes all corrected robot voice signals into one voice signal segment in word order and plays it.
Preferably, the signal parameters include the pitch period, the fundamental frequency, and the formant frequency.
Preferably, the voice signal recognition module low-pass filters the input voice signal, then performs an autocorrelation calculation and estimates the pitch period of the input voice signal from the autocorrelation signal; the module then calculates the fundamental frequency from the estimated pitch period, and the formant frequency corresponding to the current fundamental frequency from the functional relation between fundamental frequency and formant frequency.
Preferably, the voice correction module forms the aforementioned signal correction data from the pitch periods, fundamental frequencies, and formant frequencies of the robot voice signal and the target speaker voice signal of the same word.
Compared with the prior art, the audio transformer provided by the invention can convert the sound produced by the AI voice assistant into a sound close to that of the target speaker, so that the assistant's speech sounds much more like a real human voice. In addition, in the disclosed technical scheme, the voice packet stores only the voice signals corresponding to the 63 letters of Chinese pinyin, so very little storage is required.
Drawings
Fig. 1 illustrates the flow of the present invention.
Detailed Description
The application will be further illustrated with reference to specific examples. It should be understood that these examples are intended only to illustrate the application, not to limit its scope. Furthermore, various changes and modifications may be made by those skilled in the art after reading the teachings of the application, and such equivalents likewise fall within the scope defined by the appended claims.
The invention provides an audio transformer based on an AI video sound channel, comprising the following components.
The voice recording module is used to acquire the voice signal corresponding to each pinyin letter after the target speaker recites the 63 letters of Chinese pinyin (commonly counted as 23 initials, 24 finals, and 16 whole-syllable units).
The voice packet generation module is used to store each pinyin letter obtained through the voice recording module in association with its corresponding voice signal, forming a voice packet, and to store the voice packet in the intelligent terminal.
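The voice packet is, in effect, a lookup table from pinyin letter to recorded waveform. The sketch below shows one way such a packet could be built and stored; it is a minimal illustration, not the patent's implementation. `record_letter` is a hypothetical stand-in for microphone capture, and the on-disk layout (one WAV file per letter plus a JSON index) is our assumption.

```python
import json
import os

import numpy as np
from scipy.io import wavfile

SAMPLE_RATE = 16000

# The 63 units of Chinese pinyin: initials, finals, and whole syllables.
# Truncated here; a real packet would enumerate all 63 entries.
PINYIN_LETTERS = ["b", "p", "m", "f", "a", "o", "e", "ai", "zhi", "chi"]  # ...

def record_letter(letter: str) -> np.ndarray:
    """Hypothetical stand-in: capture the target speaker reciting one letter."""
    raise NotImplementedError("replace with actual microphone capture")

def build_voice_packet(out_dir: str) -> None:
    """Store each pinyin letter and its recorded signal in association."""
    os.makedirs(out_dir, exist_ok=True)
    index = {}
    for letter in PINYIN_LETTERS:
        signal = record_letter(letter)                  # float32 mono waveform
        path = os.path.join(out_dir, f"{letter}.wav")
        wavfile.write(path, SAMPLE_RATE, signal)        # one WAV per letter
        index[letter] = path                            # letter -> signal file
    with open(os.path.join(out_dir, "index.json"), "w", encoding="utf-8") as f:
        json.dump(index, f, ensure_ascii=False, indent=2)
```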
The voice decomposition module is used to acquire a segment of robot speech produced by the current intelligent terminal, convert that speech into text, perform word segmentation on the text to obtain the pinyin letters composing each word, and acquire the robot voice signal corresponding to each word.
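One plausible realization of the decomposition step, sketched under stated assumptions: `asr_transcribe` is a hypothetical stand-in for the unspecified speech-to-text step, while word segmentation and pinyin decomposition use the widely available `jieba` and `pypinyin` libraries. Aligning each word back to its slice of the robot waveform would require word timestamps from the ASR engine, which we leave abstract here.

```python
import jieba                         # Chinese word segmentation
from pypinyin import Style, pinyin   # Chinese text -> pinyin decomposition

def asr_transcribe(signal) -> str:
    """Hypothetical stand-in for the speech-to-text step."""
    raise NotImplementedError

def decompose(robot_signal):
    """Turn a robot utterance into (word, pinyin letters) pairs."""
    text = asr_transcribe(robot_signal)
    words = jieba.lcut(text)                     # segment text into words
    result = []
    for word in words:
        initials = pinyin(word, style=Style.INITIALS, strict=False)
        finals = pinyin(word, style=Style.FINALS, strict=False)
        letters = []
        # One initial/final pair per character; some characters have no initial.
        for ini, fin in zip(initials, finals):
            if ini[0]:
                letters.append(ini[0])
            letters.append(fin[0])
        result.append((word, letters))
    return result
```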
The voice signal recognition module is used to recognize the signal parameters of a voice signal. In this embodiment the signal parameters are the pitch period, the fundamental frequency, and the formant frequency: the module low-pass filters the input voice signal, performs an autocorrelation calculation, and estimates the pitch period of the input signal from the autocorrelation signal; it then calculates the fundamental frequency from the estimated pitch period, and the formant frequency corresponding to the current fundamental frequency from the functional relation between fundamental frequency and formant frequency.
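A minimal numpy/scipy sketch of the recognition module as just described: low-pass filter, autocorrelation, peak search for the pitch period, then f0 = fs / lag. The patent does not disclose the functional relation between fundamental frequency and formant frequency, so the linear placeholder in step 4 is purely an assumption for illustration (real formants are not a simple function of f0).

```python
import numpy as np
from scipy.signal import butter, filtfilt

def estimate_parameters(signal: np.ndarray, fs: int = 16000):
    """Estimate pitch period, fundamental frequency, and formant frequency."""
    # 1) Low-pass filter: keep the pitch range, discard higher harmonics.
    b, a = butter(4, 900 / (fs / 2), btype="low")
    x = filtfilt(b, a, signal.astype(np.float64))

    # 2) Autocorrelation of the filtered signal (non-negative lags only).
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]

    # 3) Pitch period = lag of the dominant autocorrelation peak, searched
    #    inside a plausible human pitch range (60-400 Hz).
    lo, hi = int(fs / 400), int(fs / 60)
    lag = lo + int(np.argmax(ac[lo:hi]))
    pitch_period = lag / fs            # seconds
    f0 = fs / lag                      # fundamental frequency, Hz

    # 4) Placeholder "functional relation" between f0 and formant frequency;
    #    the patent does not disclose it, so this linear form is an assumption.
    formant = 5.0 * f0
    return pitch_period, f0, formant
```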
In the invention, after the voice decomposition module obtains the robot voice signal corresponding to each word, the voice signal recognition module is invoked to obtain the signal parameters of that robot voice signal; in this embodiment these are the pitch period, the fundamental frequency, and the formant frequency.
The voice synthesis module is used to call the voice packet generated by the voice packet generation module and then synthesize the target speaker voice signal corresponding to each word obtained by the voice decomposition module, through the following steps:
Step 1: the voice synthesis module obtains all pinyin letters corresponding to each word given by the voice decomposition module;
Step 2: according to the pinyin letters obtained in step 1, the target speaker voice signal corresponding to each pinyin letter is retrieved from the voice packet generated by the voice packet generation module;
Step 3: for any word in step 1, the target speaker voice signals corresponding to all pinyin letters composing the current word are obtained, the voice signal recognition module is called to obtain the voice parameters of each voice signal (in this embodiment, the pitch period, fundamental frequency, and formant frequency), and, based on those parameters, a voice synthesis algorithm joins all target speaker voice signals of the same word into one target speaker voice signal; a minimal sketch of this step follows the list.
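The patent names "a voice synthesis algorithm" for step 3 without specifying it. As an illustration only, the sketch below joins the per-letter recordings of one word with short crossfades, the simplest concatenative realization; a real implementation would also use the recognized pitch and formant parameters to smooth the joins.

```python
import numpy as np

def synthesize_word(letter_signals, fs: int = 16000, fade_ms: float = 20.0):
    """Concatenate per-letter recordings into one word signal with crossfades.

    A stand-in for the unspecified 'voice synthesis algorithm'; assumes each
    segment is longer than the fade region.
    """
    n_fade = int(fs * fade_ms / 1000)
    out = letter_signals[0].astype(np.float64)
    ramp = np.linspace(0.0, 1.0, n_fade)
    for seg in letter_signals[1:]:
        seg = seg.astype(np.float64)
        # Overlap-add: fade out the previous tail while fading in the next head.
        out[-n_fade:] = out[-n_fade:] * (1.0 - ramp) + seg[:n_fade] * ramp
        out = np.concatenate([out, seg[n_fade:]])
    return out
```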
In the invention, after the voice synthesis module obtains the target speaker voice signal corresponding to each word, the voice signal recognition module is called to obtain the signal parameters of that target speaker voice signal, namely the pitch period, the fundamental frequency, and the formant frequency.
The voice correction module obtains the signal parameters of the robot voice signal and of the target speaker voice signal for the same word, calculates signal correction data from the difference between the two sets of parameters, and corrects the robot voice signal using the signal correction data.
In this embodiment, the voice correction module forms the signal correction data from the pitch periods, fundamental frequencies, and formant frequencies of the robot voice signal and the target speaker voice signal of the same word.
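One deliberately simple reading of the "signal correction data", offered as a sketch rather than the patent's formula: take the ratio of the target speaker's fundamental frequency to the robot's for the same word and resample the robot signal by that ratio, which shifts its pitch (and, coarsely, its formants) toward the target.

```python
import numpy as np
from scipy.signal import resample

def correct_word(robot_signal: np.ndarray, robot_f0: float,
                 target_f0: float) -> np.ndarray:
    """Shift the robot word's pitch toward the target speaker's pitch.

    Resampling scales pitch and formants together, so this is only a coarse
    approximation of the patent's parameter-difference correction.
    """
    ratio = target_f0 / robot_f0          # correction factor from the f0 gap
    n_out = int(round(len(robot_signal) / ratio))
    return resample(robot_signal, n_out)  # shorter signal at same fs = higher pitch
```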
Finally, after the robot voice signals of all words composing the text segment obtained by the voice decomposition module have been corrected, the voice synthesis module synthesizes all corrected robot voice signals into one voice signal segment in word order and plays it.
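Reassembling the corrected word signals is plain concatenation in word order. The playback backend below (`sounddevice`) is our choice for the sketch; the patent only states that the synthesized segment is played.

```python
import numpy as np
import sounddevice as sd  # one possible playback backend (our assumption)

def play_segment(corrected_words, fs: int = 16000) -> None:
    """Join corrected per-word signals in word order and play the result."""
    segment = np.concatenate([w.astype(np.float32) for w in corrected_words])
    segment /= max(1e-9, float(np.max(np.abs(segment))))  # avoid clipping
    sd.play(segment, samplerate=fs)
    sd.wait()                                             # block until done
```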

Claims (4)

1. An audio transformer based on an AI video sound channel, characterized by comprising:
a voice recording module, used to acquire the voice signal corresponding to each pinyin letter after the target speaker recites the 63 letters of Chinese pinyin;
a voice packet generation module, used to store each pinyin letter obtained through the voice recording module in association with its corresponding voice signal, forming a voice packet, and to store the voice packet in the intelligent terminal;
a voice decomposition module, used to acquire a segment of robot speech produced by the current intelligent terminal, convert that speech into text, perform word segmentation on the text to obtain the pinyin letters composing each word, and acquire the robot voice signal corresponding to each word;
a voice signal recognition module, used to recognize the signal parameters of a voice signal;
wherein, after the voice decomposition module obtains the robot voice signal corresponding to each word, the voice signal recognition module is invoked to obtain the signal parameters of that robot voice signal;
a voice synthesis module, used to call the voice packet generated by the voice packet generation module and then synthesize the target speaker voice signal corresponding to each word obtained by the voice decomposition module, through the following steps:
Step 1: the voice synthesis module obtains all pinyin letters corresponding to each word given by the voice decomposition module;
Step 2: according to the pinyin letters obtained in step 1, the target speaker voice signal corresponding to each pinyin letter is retrieved from the voice packet generated by the voice packet generation module;
Step 3: for any word in step 1, the target speaker voice signals corresponding to all pinyin letters composing the current word are obtained, the voice signal recognition module is called to obtain the voice parameters of each voice signal, and, based on those voice parameters, a voice synthesis algorithm joins all target speaker voice signals of the same word into one target speaker voice signal;
wherein, after the voice synthesis module obtains the target speaker voice signal corresponding to each word, the voice signal recognition module is called to obtain the signal parameters of that target speaker voice signal;
a voice correction module, which obtains the signal parameters of the robot voice signal and of the target speaker voice signal for the same word, calculates signal correction data from the difference between the two sets of parameters, and corrects the robot voice signal using the signal correction data;
and wherein, after the robot voice signals of all words composing the text segment obtained by the voice decomposition module have been corrected, the voice synthesis module synthesizes all corrected robot voice signals into one voice signal segment in word order and plays it.
2. The audio transformer based on an AI video sound channel of claim 1, wherein the signal parameters include the pitch period, the fundamental frequency, and the formant frequency.
3. The audio transformer based on an AI video sound channel of claim 2, wherein the voice signal recognition module first low-pass filters the input voice signal, then performs an autocorrelation calculation and estimates the pitch period of the input voice signal from the autocorrelation signal; the module then calculates the fundamental frequency from the estimated pitch period, and the formant frequency corresponding to the current fundamental frequency from the functional relation between fundamental frequency and formant frequency.
4. The audio transformer based on an AI video sound channel of claim 3, wherein the voice correction module forms the aforementioned signal correction data from the pitch periods, fundamental frequencies, and formant frequencies of the robot voice signal and the target speaker voice signal of the same word.
CN202111591729.7A, filed 2021-12-23 (priority 2021-12-23): Audio transformer based on AI video sound channel. Granted as CN114333836B. Status: Active.

Priority Applications (1)

CN202111591729.7A — priority date 2021-12-23, filing date 2021-12-23 — Audio transformer based on AI video sound channel


Publications (2)

CN114333836A — published 2022-04-12
CN114333836B — granted 2024-10-25

Family

ID=81053961

Family Applications (1)

CN202111591729.7A (Active) — priority date 2021-12-23, filing date 2021-12-23 — Audio transformer based on AI video sound channel

Country Status (1)

CN: CN114333836B

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1567428A (en) * 2003-06-19 2005-01-19 北京中科信利技术有限公司 Phoneme changing method based on digital signal processing
CN111681637A (en) * 2020-04-28 2020-09-18 平安科技(深圳)有限公司 Song synthesis method, device, equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002189490A (en) * 2000-12-01 2002-07-05 Leadtek Research Inc Method of pinyin speech input
JP4469883B2 (en) * 2007-08-17 2010-06-02 株式会社東芝 Speech synthesis method and apparatus


Also Published As

CN114333836A — 2022-04-12


Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination
GR01 — Patent grant