CN114974278A - Voice processing method, device, equipment and storage medium - Google Patents
- Publication number
- CN114974278A (application CN202210359842.0A)
- Authority
- CN
- China
- Prior art keywords
- frame
- amplitude
- speech segment
- decomposed
- superimposed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L2021/02082—Noise filtering, the noise being echo, reverberation of the speech
- G10L21/0316—Speech enhancement, e.g. noise reduction or echo cancellation, by changing the amplitude
- G10L21/0324—Details of processing therefor
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M9/00—Arrangements for interconnection not involving centralised switching
- H04M9/08—Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
Landscapes
- Engineering & Computer Science (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)
Description
Technical Field
The present invention belongs to the technical field of audio processing, and in particular relates to a speech processing method, apparatus, device, and storage medium.
Background
In a duplex call, echo is a common problem: the speaker hears his own words back from the receiver. As shown in Figure 1, paths 1-7 illustrate the propagation of A's speech. When a person at far end A says something, far end A's microphone picks up the sound and produces a speech signal, which is transmitted to near end B's loudspeaker and played out; near end B's microphone then picks up the signal just played by B's loudspeaker and transmits it back to far end A's loudspeaker. That is, far end A says something and, a moment later, hears it played back from its own loudspeaker; the same happens in reverse when near end B speaks. To prevent the speaker from hearing his own words from the receiver, the echo must be cancelled. For example, if an echo canceller is added at near end B to cancel the echo, far end A will not hear what it just said.
At present, the commonly used echo cancellation algorithms are the Least Mean Square (LMS) and Normalized Least Mean Square (NLMS) adaptive filters. These algorithms take both the microphone signal and the reference signal as input (the near-end loudspeaker signal is in fact the far-end microphone signal, and is also called the reference signal). When only the far end is speaking, the filter takes the microphone signal as its target and drives the filtered reference signal as close as possible to the microphone signal, thereby modeling the echo path.
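The NLMS idea described above can be sketched as a short time-domain adaptive filter. This is an illustrative sketch, not the patent's implementation; the function name, tap count, and step size are assumptions:

```python
import numpy as np

def nlms_echo_cancel(mic, ref, taps=16, mu=0.5, eps=1e-8):
    """Adapt a FIR filter so the filtered reference approximates the
    echo contained in the microphone signal; return the residual."""
    w = np.zeros(taps)              # echo-path estimate (filter weights)
    out = np.zeros(len(mic))
    for n in range(taps - 1, len(mic)):
        x = ref[n - taps + 1:n + 1][::-1]   # newest reference sample first
        e = mic[n] - w @ x                  # residual after echo estimate
        w += mu * e * x / (x @ x + eps)     # normalized LMS update
        out[n] = e
    return out
```

When only the far end speaks, `mic` is approximately `ref` convolved with the echo path, and the residual energy shrinks as the filter converges.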
However, due to network fluctuation and hardware conditions, the reference signal may fall out of sync with the microphone signal; that is, the interval at which the near-end microphone captures audio may drift relative to the interval at which the near-end loudspeaker plays it. This introduces a delay between the reference signal and the microphone signal during echo cancellation. If the network remains poor for a long time, the delay grows larger and larger, until the reference signal can no longer be used to cancel the echo from the microphone signal. Echo cancellation algorithms usually include a method for aligning the reference and microphone signals in time, but such alignment only works within a small delay range; when the delay is large or unstable, an accurate delay value is difficult to estimate.
Summary of the Invention
The present invention aims to solve, at least to some extent, one of the technical problems in the related art. To this end, one object of the present invention is to provide a speech processing method, apparatus, device, and storage medium.
To solve the above technical problems, embodiments of the present invention provide the following technical solutions:
A speech processing method, comprising:
decomposing a speech segment to be processed to obtain multiple frames of decomposed speech segments;
adjusting the multiple frames of decomposed speech segments based on an adjustment algorithm to obtain multiple frames of speech segments to be superimposed;
calculating an amplitude gain factor for each frame of the speech segments to be superimposed;
adjusting the amplitude of each frame of the speech segments to be superimposed based on the amplitude gain factor to determine a target amplitude;
obtaining a target speech segment based on the target amplitude.
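The five steps above can be sketched end to end as a toy pipeline. All function names and the framing scheme are illustrative assumptions, and the step-S2 adjustment is omitted for brevity:

```python
import numpy as np

def decompose(signal, frame_len):
    """S1: split the signal into equal-length frames."""
    n = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(n)]

def gain_factor(seg, original):
    """S3: one plausible gain, the ratio of the original amplitude sum
    to the superimposed segment's amplitude sum (assumed form)."""
    m = np.sum(np.abs(seg))
    return np.sum(np.abs(original)) / m if m > 0 else 1.0

def process(signal, original, frame_len=4):
    """S1-S5: decompose, scale each frame toward the original amplitude,
    and reassemble the target speech (S2 adjustment omitted)."""
    frames = decompose(signal, frame_len)
    refs = decompose(original, frame_len)
    return np.concatenate([f * gain_factor(f, r) for f, r in zip(frames, refs)])
```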
Optionally, adjusting the multiple frames of decomposed speech segments based on the adjustment algorithm to obtain the multiple frames of speech segments to be superimposed comprises:
if a decomposed speech segment has missing or redundant frames, determining a decomposed speech segment to be adjusted based on the decomposed speech segment;
adjusting the decomposed speech segment to be adjusted to obtain the speech segment to be superimposed.
Optionally, if the decomposed speech segment has missing or redundant frames, determining the decomposed speech segment to be adjusted based on the decomposed speech segment comprises:
if the decomposed speech segment has missing frames,
determining at least one frame of the decomposed speech segment to be adjusted based on the decomposed speech segment;
stretching the decomposed speech segment to be adjusted to obtain the speech segment to be superimposed.
Optionally, calculating the amplitude gain factor of each frame of the speech segment to be superimposed comprises:
obtaining the length L of each frame of the speech segment to be superimposed;
dividing each frame of the speech segment to be superimposed into L sample points based on the length, where L is a positive integer;
obtaining the amplitude value corresponding to each sample point of each frame;
obtaining the target superposition position of each sample point of each frame and the original amplitude value at each target superposition position;
calculating the amplitude gain factor of each frame of the speech segment to be superimposed based on the amplitude values and the original amplitude values.
Optionally, calculating the amplitude gain factor of each frame of the speech segment to be superimposed based on the amplitude values and the original amplitude values comprises:
summing the L amplitude values of each frame of the speech segment to be superimposed to obtain a first value M, where M ≥ 0;
summing the L original amplitude values of each frame of the speech segment to be superimposed to obtain a second value N, where N ≥ 0;
obtaining the amplitude gain factor based on the first value M and the second value N.
Optionally, obtaining the amplitude gain factor based on the first value M and the second value N comprises:
obtaining the number Q of zero values among the L amplitude values, where Q ≥ 0 and Q is an integer;
obtaining a reference coefficient λ of the amplitude gain factor based on the length L and the number Q of zero values among the L amplitude values, where λ ≥ 1;
obtaining the amplitude gain factor based on the reference coefficient λ, the first value M, and the second value N.
Optionally, the amplitude gain factor is calculated according to the following formula (the formula appears as an image in the source and is not reproduced in the text):
where βk is the k-th amplitude gain factor and k is a positive integer.
Optionally, the reference coefficient is obtained according to the following formula (likewise given as an image in the source and not reproduced in the text).
Optionally, the target speech segment is calculated according to the following formula:
Sk = Pk * βk
where Sk is the target speech segment of the k-th frame and Pk is the speech segment to be superimposed of the k-th frame.
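Because the βk and λ formulas appear only as images in the source, the following sketch uses an assumed form consistent with the stated definitions of L, M, N, and Q — βk = λ·N/M with λ = L/(L−Q) ≥ 1 — and should be read as a guess, not the claimed formula:

```python
import numpy as np

def beta_k(seg, orig_at_targets):
    """Hypothetical gain factor: beta_k = lambda * N / M, where
    lambda = L / (L - Q) >= 1 compensates for zero-valued samples.
    Assumed form; the patent's actual formula is not reproduced."""
    L = len(seg)
    M = np.sum(np.abs(seg))               # first value: amplitude sum of P_k
    N = np.sum(np.abs(orig_at_targets))   # second value: original amplitudes
    Q = int(np.count_nonzero(seg == 0))   # zero values among the L samples
    lam = L / (L - Q) if Q < L else 1.0   # reference coefficient, lam >= 1
    return lam * N / M if M > 0 else 1.0

def target_segment(seg, beta):
    """Sk = Pk * beta_k: scale the segment to the target amplitude."""
    return seg * beta
```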
An embodiment of the present invention further provides a speech processing apparatus, comprising:
a decomposition module, configured to decompose a speech segment to be processed to obtain multiple frames of decomposed speech segments;
an adjustment module, configured to adjust the multiple frames of decomposed speech segments based on an adjustment algorithm to obtain multiple frames of speech segments to be superimposed;
a calculation module, configured to calculate an amplitude gain factor for each frame of the speech segments to be superimposed;
a determination module, configured to adjust the amplitude of each frame of the speech segments to be superimposed based on the amplitude gain factor to determine a target amplitude;
an acquisition module, configured to obtain a target speech segment based on the target amplitude.
An embodiment of the present invention further provides an electronic device, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the method described above when executing the computer program.
An embodiment of the present invention further provides a computer-readable storage medium comprising a stored computer program, wherein, when the computer program runs, the device on which the computer-readable storage medium resides is controlled to execute the method described above.
Embodiments of the present invention have the following technical effects:
1) In embodiments of the present invention, the speech signal sent over the network is buffered to form a buffer queue. Based on the buffer queue, the speech signal can be delivered to the loudspeaker steadily at a constant rate, which solves the problems of excessive and unstable delay, makes it possible to estimate the delay in speech signal transmission accurately, and thereby improves the accuracy of echo cancellation.
2) The WSOLA algorithm leaves an overlapping portion between adjacent frames and applies a window to each frame, mitigating the effect of waveform discontinuity. Furthermore, rather than directly picking one frame of the decomposed speech segment for splicing, embodiments of the present invention search an interval for the decomposed speech segment closest to the ideal frame and superimpose it, thereby reducing the distortion caused by phase jumps.
3) By introducing an amplitude gain factor, the amplitude of the speech segment to be superimposed is strictly controlled. This solves both the large distortion that can appear during recovery in real-time communication when the synthesized speech amplitude is too large, and the considerable difference between the adjusted speech and the original speech that arises when frame loss caused by network delay happens to fall at the end of a speech band. Situations such as an excessively large amplitude at the tail of the speech waveform are avoided, so that the recovered audio is closer to the original.
Additional aspects and advantages of the present invention will be set forth in part in the following description; in part they will become apparent from the description, or may be learned through practice of the invention.
Brief Description of the Drawings
Figure 1 is a schematic diagram of an existing sound propagation path;
Figure 2 is a schematic structural diagram of a speech processing system provided by an embodiment of the present invention;
Figure 3 is a schematic flowchart of a speech processing method provided by an embodiment of the present invention;
Figure 4 is a schematic structural diagram of a speech processing apparatus provided by an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and should not be construed as limiting it.
First, to help those skilled in the art understand the embodiments, some terms are explained:
(1) WebRTC: Web Real-Time Communication.
(2) AEC: Acoustic Echo Cancellation.
(3) API: Application Programming Interface.
(4) OLA: Overlap-and-Add, the overlap-add algorithm.
(5) WSOLA: Waveform Similarity Overlap-Add. Its basic principle is to cut a frame out of the original speech signal and apply a window; select a second frame within one range, whose phase should be aligned with that of the first frame; search another range for a third frame that is most similar to the second frame; and then overlap-add them together.
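The waveform-similarity search at the heart of WSOLA can be sketched as follows. The normalized-cross-correlation criterion and the parameter names are illustrative assumptions:

```python
import numpy as np

def best_match_offset(signal, template, start, search_range):
    """Scan offsets in [start, start + search_range) and return the one
    whose frame is most similar to `template` (normalized correlation)."""
    L = len(template)
    t_norm = np.linalg.norm(template)
    best_off, best_score = start, -np.inf
    for off in range(start, start + search_range):
        cand = signal[off:off + L]
        if len(cand) < L:
            break
        denom = np.linalg.norm(cand) * t_norm
        score = (cand @ template) / denom if denom > 0 else -np.inf
        if score > best_score:
            best_off, best_score = off, score
    return best_off
```

For a periodic signal, the search locks onto an offset one pitch period away, which is what keeps the overlap-added frames phase-aligned.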
Second, in current real-time voice communication, human speech is composed of countless relatively short speech bands. In a real-time environment, therefore, the speech does not fully satisfy the assumptions of the standard waveform-similarity overlap-add algorithm, and a good recovery effect is hard to achieve. Moreover, because the standard algorithm concentrates on preserving the pitch frequency of the speech, it does not strictly control the amplitude during speech recovery; in some cases the synthesized speech amplitude becomes too large, so considerable distortion may appear during recovery in real-time communication, which in turn degrades the real-time voice quality.
In addition, typical human speech is not absolutely continuous: it usually consists of many very short stretches of continuous speech, which we call speech bands. Between two adjacent speech bands there is often a silent interval. Since actual human speech contains many such gaps, if the frame loss caused by network delay happens to fall at the end of a speech band, the WSOLA algorithm will fill in the waveform at the missing frames. A normal speech band then cannot end in time, so there is a considerable difference between the adjusted speech and the original speech.
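The patent does not specify how speech bands and the silent gaps between them are located; a simple energy-threshold sketch (all names and the threshold value are assumptions) illustrates the idea:

```python
import numpy as np

def speech_band_mask(x, frame=160, thresh=1e-3):
    """Mark each frame as speech (True) or silence (False) by mean energy;
    a crude way to locate the gaps between speech bands."""
    n = len(x) // frame
    return [float(np.mean(x[i * frame:(i + 1) * frame] ** 2)) > thresh
            for i in range(n)]
```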
To solve the above technical problems, the present invention provides the following technical solutions:
As shown in Figure 2, an embodiment of the present invention provides a speech processing system, comprising:
a loudspeaker, a microphone, an echo canceller, and a buffer;
wherein the buffer is arranged between the loudspeaker and the network and is connected to each of them; the buffer receives the speech signal sent by the network and forwards it to the loudspeaker;
the loudspeaker, the microphone, and the buffer are all connected to the echo canceller, so that the echo arising during speech signal transmission is cancelled by the echo canceller.
In practical application scenarios, when the network speed fluctuates, the network may deliver several frames of the speech signal in succession, or may deliver no frames at all for a period of time.
Therefore, to deliver the speech signal from the network to the loudspeaker steadily at a constant rate, embodiments of the present invention provide a buffer that caches the speech signal sent by the network and forms a buffer queue. For example, regardless of the rate at which the far end sends data to the near end, suppose the audio sampling rate at the near end is 8000 Hz (or another preset value; embodiments of the present invention take 8000 Hz as an example). The near-end loudspeaker can then play only 8000 samples per second, and the microphone likewise receives only 8000 samples per second; that is, the buffer supplies the loudspeaker with 8000 samples per second at a constant rate on behalf of the network.
Further, when the network is slow for a long time, the speech segments in the buffer queue may suffer frame loss; in that case the buffer queue is interpolated to maintain the forwarding rate of 8000 samples per second.
Further, when the network is fast for a long time, the speech segments in the buffer queue may contain redundant frames; that is, the buffer may fill up while the network keeps sending speech signals. Therefore, to maintain the forwarding rate of 8000 samples per second, embodiments of the present invention shorten some of the speech segments.
For delay estimation, in practical application scenarios a rough estimate over a large range can be combined with a precise calculation over a small range. For example, in the AEC module of WebRTC, the API obtains an echo delay estimate, in milliseconds, passed in by the caller. Because embodiments of the present invention provide a buffer and process the speech segments in the buffer queue, the problem that the reference signal cannot be used normally to cancel the microphone signal is solved; the AEC then performs precise delay estimation within a small range on the basis of this estimate.
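The buffering scheme above can be illustrated with a toy playout queue. Here underflow is padded with silence for simplicity, whereas the patent instead stretches nearby segments; the class and method names are assumptions:

```python
from collections import deque

class PlayoutBuffer:
    """Packets arrive irregularly from the network; the loudspeaker pulls
    a fixed number of samples at a constant rate (e.g. 8000 per second)."""
    def __init__(self):
        self.queue = deque()

    def push(self, samples):
        """Called on each (irregular) network arrival."""
        self.queue.extend(samples)

    def pull(self, n):
        """Called at the constant playout rate; pads with zeros on underflow."""
        return [self.queue.popleft() if self.queue else 0 for _ in range(n)]
```

Because the loudspeaker always consumes from this queue at a fixed rate, the delay between reference and microphone signals stays bounded and estimable.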
In embodiments of the present invention, the speech signal sent over the network is buffered to form a buffer queue. Based on the buffer queue, the speech signal can be delivered to the loudspeaker steadily at a constant rate, which solves the problems of excessive and unstable delay, makes it possible to estimate the delay in speech signal transmission accurately, and thereby improves the accuracy of echo cancellation.
As shown in Figure 3, an embodiment of the present invention provides a speech processing method, applied to the above system, comprising:
Step S1: decomposing a speech segment to be processed to obtain multiple frames of decomposed speech segments.
Specifically, before the speech signal in the buffer queue is processed, it is first decomposed into equal parts, yielding the different decomposed speech segments.
For example, the duration of each decomposed speech segment may be between 50 ms and 100 ms.
Step S2: adjusting the multiple frames of decomposed speech segments based on an adjustment algorithm to obtain multiple frames of speech segments to be superimposed.
Specifically, adjusting the multiple frames of decomposed speech segments based on the adjustment algorithm to obtain the multiple frames of speech segments to be superimposed comprises:
if a decomposed speech segment has missing or redundant frames, determining a decomposed speech segment to be adjusted based on the decomposed speech segment;
adjusting the decomposed speech segment based on the decomposed speech segment to be adjusted to obtain the speech segment to be superimposed.
Further, if the decomposed speech segment has missing or redundant frames, determining the decomposed speech segment to be adjusted based on the decomposed speech segment comprises:
if the decomposed speech segment has missing frames,
determining at least one frame of the decomposed speech segment to be adjusted based on the decomposed speech segment;
stretching the decomposed speech segment to be adjusted to obtain the speech segment to be superimposed.
Specifically, when frames are missing or insufficient in a decomposed speech segment, the start time of that segment is obtained; based on that start time, the several decomposed speech segments preceding it are found and determined to be the decomposed speech segments to be adjusted; these segments are then stretched so that they cover the position of the missing frames.
Here, based on the WSOLA algorithm, the several decomposed speech segments most similar to the segment with missing frames are determined, and they are stretched; the specific stretch amount can be determined by the WSOLA algorithm.
Further, if the decomposed speech segment has missing or redundant frames, determining the decomposed speech segment to be adjusted based on the decomposed speech segment comprises:
if the decomposed speech segment has redundant frames,
determining at least one frame of the decomposed speech segment to be adjusted based on the decomposed speech segment;
shortening the decomposed speech segment to be adjusted to obtain the speech segment to be superimposed.
Specifically, when redundant frames occur in a decomposed speech segment, the start time of that segment is obtained; based on that start time, the several decomposed speech segments preceding it are found and determined to be the decomposed speech segments to be adjusted; these segments are then shortened.
Here, based on the WSOLA algorithm, the several decomposed speech segments most similar to the segment with redundant frames are determined, and they are shortened; the specific shortening amount can be determined by the WSOLA algorithm.
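Both cases — stretching on frame loss and shortening on frame redundancy — amount to time-scale modification. A naive windowed overlap-add sketch shows the mechanism; WSOLA would additionally search for the most similar frame, and the frame and hop sizes here are illustrative assumptions:

```python
import numpy as np

def ola_time_scale(x, rate, frame=256, hop_out=128):
    """Read analysis frames every hop_out*rate samples, write them every
    hop_out samples: rate < 1 stretches (fills missing frames),
    rate > 1 compresses (absorbs redundant frames)."""
    hop_in = max(1, int(hop_out * rate))
    win = np.hanning(frame)
    n_frames = max(1, (len(x) - frame) // hop_in + 1)
    out = np.zeros(hop_out * (n_frames - 1) + frame)
    norm = np.zeros_like(out)
    for i in range(n_frames):
        seg = x[i * hop_in:i * hop_in + frame] * win
        out[i * hop_out:i * hop_out + frame] += seg
        norm[i * hop_out:i * hop_out + frame] += win
    return out / np.maximum(norm, 1e-8)   # normalize the window overlap
```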
In practical application scenarios, to change the transmission rate of the speech signal, the prior art stretches or compresses the obtained multiple frames of decomposed speech segments, changing the duration of each decomposed speech segment so as to change the transmission rate of the speech signal.
For example, with the OLA algorithm, after the multiple frames of decomposed speech segments are obtained, they are spliced together end to end directly; when the speech rate of a segment needs to be slowed down, each decomposed speech segment obtained from it is stretched, that is, its duration is extended.
However, superimposing multiple frames directly end to end easily produces a discontinuous waveform at the splice, i.e., pitch breaks in the speech signal. Moreover, because OLA often cannot preserve the periodic structure of the signal, there is no way during frame cropping to guarantee that every frame covers a complete period with its phase aligned, which leads to phase-jump distortion.
In embodiments of the present invention, by contrast, the WSOLA algorithm leaves an overlapping portion between adjacent frames and applies a window to each frame, mitigating the effect of waveform discontinuity.
In addition, based on the WSOLA algorithm, embodiments of the present invention do not directly pick one frame of the decomposed speech segment for direct splicing; instead, they search an interval for the decomposed speech segment closest to the ideal frame and superimpose it, thereby reducing the distortion caused by phase jumps.
步骤S3:计算获得每帧所述待叠加语音段的幅度增益因子;Step S3: calculating and obtaining the amplitude gain factor of the speech segment to be superimposed in each frame;
具体的,所述计算获得每帧所述待叠加语音段的幅度增益因子,包括:Specifically, the calculation to obtain the amplitude gain factor of the speech segment to be superimposed in each frame includes:
获取每帧所述待叠加语音段的长度L;Obtain the length L of the speech segment to be superimposed in each frame;
基于所述长度将每帧所述待叠加语音段划分为L个样点;其中,L为正整数;Based on the length, the speech segment to be superimposed in each frame is divided into L samples; wherein, L is a positive integer;
获取每帧所述样点对应的幅度值;Obtain the amplitude value corresponding to the sample point in each frame;
获取每帧所述样点的目标叠加位置以及每帧所述目标叠加位置的原始幅度值;Obtain the target superposition position of the sample points in each frame and the original amplitude value of the target superposition position in each frame;
基于所述幅度值以及原始幅度值,计算获得每帧所述待叠加语音段的所述幅度增益因子。Based on the amplitude value and the original amplitude value, the amplitude gain factor of the speech segment to be superimposed in each frame is obtained by calculation.
本发明的实施例,计算获得每个待叠加语音段的长度,例如,当获得的待叠加语音段的长度为50ms,则将每个待叠加语音段划分为50个样点;In the embodiment of the present invention, the length of each speech segment to be superimposed is calculated; for example, when the obtained length of a speech segment to be superimposed is 50 ms, each speech segment to be superimposed is divided into 50 samples;
然后获得50个样点中每个样点对应的幅度值;Then obtain the amplitude value corresponding to each sample point in the 50 sample points;
基于WSOLA算法确定每个样点的目标叠加位置,也即,对应地,获取50个目标叠加位置,并获得这50个目标叠加位置对应的原始幅度值;The target superposition position of each sample is determined based on the WSOLA algorithm, that is, correspondingly, 50 target superposition positions are obtained, and the original amplitude values corresponding to these 50 positions are obtained;
对50个幅度值以及50个原始幅度值进行计算处理,获得每个待叠加语音段的幅度增益因子。The 50 amplitude values and the 50 original amplitude values are calculated and processed to obtain the amplitude gain factor of each speech segment to be superimposed.
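The sample-collection steps above can be sketched as follows; treating a sample's amplitude value as its absolute value, and representing the target superposition positions as a start index into the output buffer, are illustrative assumptions rather than details given in the text:

```python
import numpy as np

def collect_amplitudes(segment, output_buffer, target_start):
    """Pair each of the L samples of a to-be-superimposed segment with the
    amplitude already present at its target superposition position."""
    L = len(segment)                              # segment length in samples
    amplitudes = np.abs(segment)                  # amplitude value of each sample
    original = np.abs(output_buffer[target_start:target_start + L])
    return amplitudes, original
```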
进一步地,所述基于所述幅度值以及原始幅度值,计算获得每帧所述待叠加语音段的所述幅度增益因子,包括:Further, calculating and obtaining the amplitude gain factor of the speech segment to be superimposed in each frame based on the amplitude value and the original amplitude value includes:
对每帧所述待叠加语音段的L个所述幅度值求和,获得第一值M;其中,M≥0;Summing up the L amplitude values of the speech segments to be superimposed in each frame to obtain a first value M; wherein, M≥0;
对每帧所述待叠加语音段的L个所述原始幅度值求和,获得第二值N;其中,N≥0;Summing up the L original amplitude values of the speech segments to be superimposed in each frame to obtain a second value N; wherein, N≥0;
基于所述第一值M以及第二值N,获得所述幅度增益因子。Based on the first value M and the second value N, the amplitude gain factor is obtained.
进一步地,所述基于所述第一值M以及第二值N,获得所述幅度增益因子,包括:Further, obtaining the amplitude gain factor based on the first value M and the second value N includes:
获取L个所述幅度值中零值的数量Q;其中,Q≥0,且Q为整数;Obtain the number Q of zero values in the L said amplitude values; wherein, Q≥0, and Q is an integer;
基于所述长度L以及L个所述幅度值中零值的数量Q,获得所述幅度增益因子的参考系数λ;其中,λ≥1;Based on the length L and the number Q of zeros in the L amplitude values, a reference coefficient λ of the amplitude gain factor is obtained; wherein, λ≥1;
基于所述参考系数λ、第一值M以及第二值N,获得所述幅度增益因子。Based on the reference coefficient λ, the first value M and the second value N, the amplitude gain factor is obtained.
本发明的实施例,在不考虑参考系数λ的前提下,若某个待叠加语音段将要叠加的位置的幅度值比它原来所在的位置要大,也即N大于M时,则会得到一个大于1的幅度增益因子,反之则会得到一个小于1的幅度增益因子;幅度增益因子使得被叠加的各帧待叠加语音段可以获得一个更加合理的目标幅度值,用于保证恢复出的语音波形在幅度上和原始波形拥有最大的相似性。In the embodiment of the present invention, leaving the reference coefficient λ aside, if the amplitude at the position where a speech segment is to be superimposed is larger than that at its original position, that is, when N is greater than M, an amplitude gain factor greater than 1 is obtained; otherwise, an amplitude gain factor less than 1 is obtained. The amplitude gain factor gives each superimposed frame a more reasonable target amplitude value, ensuring that the recovered speech waveform has the greatest possible amplitude similarity to the original waveform.
但是,当待叠加语音段存在的零值样点的个数较多时,则会导致计算出的幅度增益因子的值过小,进而会严重影响语音的恢复效果;为了解决这个问题,本发明的实施例增加了参考系数λ,用于平衡由于帧缺失导致的大面积样点为零值的情形,进而达到更好的恢复效果。However, when a speech segment to be superimposed contains many zero-valued samples, the calculated amplitude gain factor becomes too small, which seriously degrades the speech recovery effect; to solve this problem, the embodiment of the present invention adds a reference coefficient λ to balance the case where frame loss leaves a large proportion of zero-valued samples, thereby achieving a better recovery effect.
进一步地,所述幅度增益因子基于如下公式计算获得:Further, the amplitude gain factor is calculated and obtained based on the following formula:
其中,βk为第k个幅度增益因子;k为正整数。Wherein, βk is the k-th amplitude gain factor; k is a positive integer.
具体的,将基于同一段待处理的语音段分解而得的n帧分解语音段进行编号,则k≤n;Specifically, the n frames of decomposed speech segments obtained by decomposing the same speech segment to be processed are numbered; then k≤n.
以第k帧待叠加语音段为例,获得的βk即为第k个幅度增益因子。Taking the k-th frame of the speech segment to be superimposed as an example, the βk obtained is the k-th amplitude gain factor.
本发明的实施例,虽然基于一段待处理的语音段划分而成的多帧待叠加语音段的长度相同,但是,由于不同的待叠加语音段的幅度值之和或者对应的原始幅度值之和可能不相同,因此,基于本发明的实施例的上述算法,从同一段音频信号获取的每个待叠加语音段的幅度增益因子也可能不相同,进而实现了对不同待叠加语音段的幅度进行调节的效果。In the embodiment of the present invention, although the multi-frame speech segments to be superimposed that are divided from one speech segment to be processed have the same length, the sums of the amplitude values of different speech segments to be superimposed, or the sums of the corresponding original amplitude values, may differ; therefore, the amplitude gain factor obtained by the above algorithm for each speech segment to be superimposed from the same audio signal may also differ, thereby achieving amplitude adjustment for different speech segments to be superimposed.
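The formulas for βk and λ appear as images in the original patent and are not reproduced in this text. The sketch below is therefore only one instantiation consistent with the described behavior (βk > 1 when N > M if λ is ignored, and λ ≥ 1 growing as more of the L samples are zero); βk = λ·N/M and λ = L/(L−Q) are assumptions, not the claimed formulas:

```python
import numpy as np

def amplitude_gain_factor(amplitudes, original):
    """Illustrative gain computation; the expressions for beta and lambda are
    assumptions consistent with the surrounding text, not the patent's equations."""
    L = len(amplitudes)
    M = float(np.sum(amplitudes))          # first value M: sum of segment amplitudes
    N = float(np.sum(original))            # second value N: sum at target positions
    Q = int(np.sum(amplitudes == 0))       # number of zero-valued samples
    lam = L / (L - Q) if Q < L else 1.0    # assumed reference coefficient, lam >= 1
    return lam * N / M if M > 0 else 1.0   # assumed beta_k; > 1 when N > M (lam aside)
```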
进一步地,所述参考系数基于如下计算公式获得:Further, the reference coefficient is obtained based on the following calculation formula:
步骤S4:基于所述幅度增益因子,对每帧所述待叠加语音段的幅度进行调整,确定目标幅度;Step S4: Based on the amplitude gain factor, the amplitude of the to-be-superimposed speech segment in each frame is adjusted to determine the target amplitude;
具体的,所述目标语音段基于如下计算公式计算获得:Specifically, the target speech segment is calculated and obtained based on the following calculation formula:
Sk=Pk*βk S k =P k *β k
其中,Sk为第k帧目标语音段;Pk是第k帧待叠加语音段。Wherein, Sk is the target speech segment of the k-th frame, and Pk is the speech segment to be superimposed of the k-th frame.
本发明的实施例,在叠加过程中,对WSOLA算法进行了改进,计算并获得了幅度增益因子,用于基于幅度增益因子控制待叠加语音段的目标幅度值。In the embodiment of the present invention, the WSOLA algorithm is improved during the superposition process: an amplitude gain factor is calculated and used to control the target amplitude value of the speech segment to be superimposed.
步骤S5:基于所述目标幅度,获取目标语音段。Step S5: Obtain a target speech segment based on the target amplitude.
具体的,基于WSOLA算法对多帧目标语音段进行叠加,并输出恢复后的音频波形。Specifically, multi-frame target speech segments are superimposed based on the WSOLA algorithm, and the restored audio waveform is output.
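Steps S4 and S5 can be sketched together as follows; the scaling Sk = Pk·βk follows the formula given above, while the hop size and the Hann window are illustrative assumptions:

```python
import numpy as np

def overlap_add(segments, gains, hop):
    """Scale each to-be-superimposed segment by its gain factor (S_k = P_k * beta_k)
    and overlap-add the results at hop-sample intervals."""
    frame_len = len(segments[0])
    out = np.zeros(hop * (len(segments) - 1) + frame_len)
    window = np.hanning(frame_len)         # windowing softens splice discontinuities
    for k, (seg, beta) in enumerate(zip(segments, gains)):
        out[k * hop:k * hop + frame_len] += window * seg * beta
    return out
```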
本发明的实施例,通过增加幅度增益因子严格控制待叠加语音段的幅度,解决了由于合成语音幅度过大导致的在实时通信的恢复过程中可能出现较大失真的问题,以及由于网络延迟导致的缺帧恰好发生在一段语音段的结尾、调整后的语音和原始语音之间有相当大的差异的问题,避免了语音波形尾部幅度过大等情况,使得恢复后的音频更接近原始状态。In the embodiment of the present invention, the amplitude of the speech segment to be superimposed is strictly controlled by adding the amplitude gain factor, which solves the problem that large distortion may appear during recovery in real-time communication when the synthesized speech amplitude is too large, as well as the problem that, when frame loss caused by network delay happens to fall at the end of a speech segment, the adjusted speech differs considerably from the original speech; situations such as an excessively large amplitude at the tail of the speech waveform are avoided, making the restored audio closer to its original state.
本发明的实施例还提供一种语音处理装置400,包括:An embodiment of the present invention further provides a speech processing apparatus 400, including:
分解模块401,用于对待处理的语音段进行分解处理,获取多帧分解语音段;A decomposition module 401, configured to decompose a speech segment to be processed to obtain multi-frame decomposed speech segments;
调整模块402,用于基于调整算法对多帧所述分解语音段进行调整,获取多帧待叠加语音段;An adjustment module 402, configured to adjust the multi-frame decomposed speech segments based on an adjustment algorithm to obtain multi-frame speech segments to be superimposed;
计算模块403,用于计算获得每帧所述待叠加语音段的幅度增益因子;A calculation module 403, configured to calculate the amplitude gain factor of the speech segment to be superimposed in each frame;
确定模块404,用于基于所述幅度增益因子,对每帧所述待叠加语音段的幅度进行调整,确定目标幅度;A determination module 404, configured to adjust the amplitude of the speech segment to be superimposed in each frame based on the amplitude gain factor to determine a target amplitude;
获取模块405,用于基于所述目标幅度,获取目标语音段。An obtaining module 405, configured to obtain a target speech segment based on the target amplitude.
可选的,所述基于调整算法对多帧所述分解语音段进行调整,获得多帧待叠加语音段,包括:Optionally, the adjusting of the multi-frame decomposed speech segments based on the adjustment algorithm to obtain multi-frame speech segments to be superimposed includes:
若所述分解语音段存在帧缺失或帧冗余,则基于所述分解语音段确定待调整分解语音段;If the decomposed speech segment has frame missing or frame redundancy, determining the decomposed speech segment to be adjusted based on the decomposed speech segment;
对所述待调整分解语音段调整,获得所述待叠加语音段。The decomposed speech segment to be adjusted is adjusted to obtain the speech segment to be superimposed.
可选的,所述若所述分解语音段存在帧缺失或帧冗余,则基于所述分解语音段确定待调整分解语音段,包括:Optionally, if the decomposed speech segment has frame missing or frame redundancy, determining the decomposed speech segment to be adjusted based on the decomposed speech segment, including:
若所述分解语音段存在帧缺失;If the decomposed speech segment has frame missing;
则基于所述分解语音段确定至少一帧所述待调整分解语音段;then determining at least one frame of the decomposed speech segment to be adjusted based on the decomposed speech segment;
对所述待调整分解语音段进行拉伸,获取所述待叠加语音段。The to-be-adjusted and decomposed speech segment is stretched to obtain the to-be-superimposed speech segment.
可选的,所述计算获得每帧所述待叠加语音段的幅度增益因子,包括:Optionally, the calculating to obtain the amplitude gain factor of the speech segment to be superimposed in each frame, including:
获取每帧所述待叠加语音段的长度L;Obtain the length L of the speech segment to be superimposed in each frame;
基于所述长度将每帧所述待叠加语音段划分为L个样点;其中,L为正整数;Based on the length, the speech segment to be superimposed in each frame is divided into L samples; wherein, L is a positive integer;
获取每帧所述样点对应的幅度值;Obtain the amplitude value corresponding to the sample point in each frame;
获取每帧所述样点的目标叠加位置以及每帧所述目标叠加位置的原始幅度值;Obtain the target superposition position of the sample points in each frame and the original amplitude value of the target superposition position in each frame;
基于所述幅度值以及原始幅度值,计算获得每帧所述待叠加语音段的所述幅度增益因子。Based on the amplitude value and the original amplitude value, the amplitude gain factor of the speech segment to be superimposed in each frame is obtained by calculation.
可选的,所述基于所述幅度值以及原始幅度值,计算获得每帧所述待叠加语音段的所述幅度增益因子,包括:Optionally, calculating and obtaining the amplitude gain factor of the speech segment to be superimposed in each frame based on the amplitude value and the original amplitude value, including:
对每帧所述待叠加语音段的L个所述幅度值求和,获得第一值M;其中,M≥0;Summing up the L amplitude values of the speech segments to be superimposed in each frame to obtain a first value M; wherein, M≥0;
对每帧所述待叠加语音段的L个所述原始幅度值求和,获得第二值N;其中,N≥0;Summing up the L original amplitude values of the speech segments to be superimposed in each frame to obtain a second value N; wherein, N≥0;
基于所述第一值M以及第二值N,获得所述幅度增益因子。Based on the first value M and the second value N, the amplitude gain factor is obtained.
可选的,基于所述第一值M以及第二值N,获得所述幅度增益因子,包括:Optionally, obtaining the amplitude gain factor based on the first value M and the second value N, including:
获取L个所述幅度值中零值的数量Q;其中,Q≥0,且Q为整数;Obtain the number Q of zero values in the L said amplitude values; wherein, Q≥0, and Q is an integer;
基于所述长度L以及L个所述幅度值中零值的数量Q,获得所述幅度增益因子的参考系数λ;其中,λ≥1;Based on the length L and the number Q of zeros in the L amplitude values, a reference coefficient λ of the amplitude gain factor is obtained; wherein, λ≥1;
基于所述参考系数λ、第一值M以及第二值N,获得所述幅度增益因子。Based on the reference coefficient λ, the first value M and the second value N, the amplitude gain factor is obtained.
可选的,所述幅度增益因子基于如下公式计算获得:Optionally, the amplitude gain factor is calculated and obtained based on the following formula:
其中,βk为第k个幅度增益因子;k为正整数。Among them, β k is the kth amplitude gain factor; k is a positive integer.
可选的,所述参考系数基于如下计算公式获得:Optionally, the reference coefficient is obtained based on the following calculation formula:
可选的,所述目标语音段基于如下计算公式计算获得:Optionally, the target speech segment is calculated and obtained based on the following calculation formula:
Sk=Pk*βk S k =P k *β k
其中,Sk为第k帧所述目标语音段;Pk是第k帧所述待叠加语音段。Wherein, Sk is the target speech segment of the k-th frame, and Pk is the speech segment to be superimposed of the k-th frame.
本发明的实施例还提供一种电子设备,包括处理器、存储器以及存储在所述存储器中且被配置为由所述处理器执行的计算机程序,所述处理器执行所述计算机程序时实现如上所述的方法。An embodiment of the present invention further provides an electronic device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, where the processor, when executing the computer program, implements the method described above.
本发明的实施例还提供一种计算机可读存储介质,所述计算机可读存储介质包括存储的计算机程序,其中,在所述计算机程序运行时控制所述计算机可读存储介质所在设备执行如上所述的方法。An embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, and when the computer program runs, the device where the computer-readable storage medium is located is controlled to execute the method described above.
另外,本发明实施例的装置的其他构成及作用对本领域的技术人员来说是已知的,为减少冗余,此处不做赘述。In addition, other structures and functions of the apparatus in the embodiment of the present invention are known to those skilled in the art, and in order to reduce redundancy, details are not described here.
需要说明的是,在流程图中表示或在此以其他方式描述的逻辑和/或步骤,例如,可以被认为是用于实现逻辑功能的可执行指令的定序列表,可以具体实现在任何计算机可读介质中,以供指令执行系统、装置或设备(如基于计算机的系统、包括处理器的系统或其他可以从指令执行系统、装置或设备取指令并执行指令的系统)使用,或结合这些指令执行系统、装置或设备而使用。就本说明书而言,"计算机可读介质"可以是任何可以包含、存储、通信、传播或传输程序以供指令执行系统、装置或设备或结合这些指令执行系统、装置或设备而使用的装置。计算机可读介质的更具体的示例(非穷尽性列表)包括以下:具有一个或多个布线的电连接部(电子装置),便携式计算机盘盒(磁装置),随机存取存储器(RAM),只读存储器(ROM),可擦除可编辑只读存储器(EPROM或闪速存储器),光纤装置,以及便携式光盘只读存储器(CDROM)。另外,计算机可读介质甚至可以是可在其上打印所述程序的纸或其他合适的介质,因为可以例如通过对纸或其他介质进行光学扫描,接着进行编辑、解译或必要时以其他合适方式进行处理来以电子方式获得所述程序,然后将其存储在计算机存储器中。It should be noted that the logic and/or steps represented in the flowcharts or otherwise described herein may, for example, be considered an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch and execute instructions from an instruction execution system, apparatus, or device). For the purposes of this specification, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a fiber-optic device, and a portable compact disc read-only memory (CDROM).
In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically by, for example, optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then storing it in computer memory.
应当理解,本发明的各部分可以用硬件、软件、固件或它们的组合来实现。在上述实施方式中,多个步骤或方法可以用存储在存储器中且由合适的指令执行系统执行的软件或固件来实现。例如,如果用硬件来实现,和在另一实施方式中一样,可用本领域公知的下列技术中的任一项或它们的组合来实现:具有用于对数据信号实现逻辑功能的逻辑门电路的离散逻辑电路,具有合适的组合逻辑门电路的专用集成电路,可编程门阵列(PGA),现场可编程门阵列(FPGA)等。It should be understood that various parts of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following techniques known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and the like.
在本说明书的描述中,参考术语“一个实施例”、“一些实施例”、“示例”、“具体示例”、或“一些示例”等的描述意指结合该实施例或示例描述的具体特征、结构、材料或者特点包含于本发明的至少一个实施例或示例中。在本说明书中,对上述术语的示意性表述不一定指的是相同的实施例或示例。而且,描述的具体特征、结构、材料或者特点可以在任何的一个或多个实施例或示例中以合适的方式结合。In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
在本发明的描述中,需要理解的是,术语“中心”、“纵向”、“横向”、“长度”、“宽度”、“厚度”、“上”、“下”、“前”、“后”、“左”、“右”、“竖直”、“水平”、“顶”、“底”、“内”、“外”、“顺时针”、“逆时针”、“轴向”、“径向”、“周向”等指示的方位或位置关系为基于附图所示的方位或位置关系,仅是为了便于描述本发明和简化描述,而不是指示或暗示所指的装置或元件必须具有特定的方位、以特定的方位构造和操作,因此不能理解为对本发明的限制。In the description of the present invention, it should be understood that the orientations or positional relationships indicated by the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", "axial", "radial", "circumferential", and the like are based on the orientations or positional relationships shown in the accompanying drawings, and are used only for convenience in describing the present invention and simplifying the description, rather than indicating or implying that the indicated device or element must have a specific orientation or be constructed and operated in a specific orientation, and therefore should not be construed as limiting the present invention.
此外,术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括至少一个该特征。在本发明的描述中,“多个”的含义是至少两个,例如两个,三个等,除非另有明确具体的限定。In addition, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, a feature defined with "first" or "second" may expressly or implicitly include at least one such feature. In the description of the present invention, "multiple" means at least two, for example two, three, and so on, unless otherwise expressly and specifically defined.
在本发明中,除非另有明确的规定和限定,术语“安装”、“相连”、“连接”、“固定”等术语应做广义理解,例如,可以是固定连接,也可以是可拆卸连接,或成一体;可以是机械连接,也可以是电连接;可以是直接相连,也可以通过中间媒介间接相连,可以是两个元件内部的连通或两个元件的相互作用关系,除非另有明确的限定。对于本领域的普通技术人员而言,可以根据具体情况理解上述术语在本发明中的具体含义。In the present invention, unless otherwise expressly specified and limited, the terms "mounted", "connected", "coupled", "fixed", and the like should be understood in a broad sense; for example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct, or indirect through an intermediate medium; it may be internal communication between two elements or an interaction between two elements, unless otherwise expressly limited. For those of ordinary skill in the art, the specific meanings of the above terms in the present invention can be understood according to the specific situation.
在本发明中,除非另有明确的规定和限定,第一特征在第二特征“上”或“下”可以是第一和第二特征直接接触,或第一和第二特征通过中间媒介间接接触。而且,第一特征在第二特征“之上”、“上方”和“上面”可是第一特征在第二特征正上方或斜上方,或仅仅表示第一特征水平高度高于第二特征。第一特征在第二特征“之下”、“下方”和“下面”可以是第一特征在第二特征正下方或斜下方,或仅仅表示第一特征水平高度小于第二特征。In the present invention, unless otherwise expressly specified and limited, a first feature being "on" or "under" a second feature may mean that the first and second features are in direct contact, or that they are in indirect contact through an intermediate medium. Moreover, a first feature being "on", "above", or "over" a second feature may mean that the first feature is directly above or obliquely above the second feature, or simply that the first feature is at a higher level than the second feature. A first feature being "under", "below", or "beneath" a second feature may mean that the first feature is directly below or obliquely below the second feature, or simply that the first feature is at a lower level than the second feature.
尽管上面已经示出和描述了本发明的实施例,可以理解的是,上述实施例是示例性的,不能理解为对本发明的限制,本领域的普通技术人员在本发明的范围内可以对上述实施例进行变化、修改、替换和变型。Although the embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and should not be construed as limiting the present invention; those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.
Claims (12)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210359842.0A CN114974278A (en) | 2022-04-06 | 2022-04-06 | Voice processing method, device, equipment and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN114974278A true CN114974278A (en) | 2022-08-30 |
Family
ID=82977215
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210359842.0A Pending CN114974278A (en) | 2022-04-06 | 2022-04-06 | Voice processing method, device, equipment and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114974278A (en) |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20020019733A1 (en) * | 2000-05-30 | 2002-02-14 | Adoram Erell | System and method for enhancing the intelligibility of received speech in a noise environment |
| US20040002856A1 (en) * | 2002-03-08 | 2004-01-01 | Udaya Bhaskar | Multi-rate frequency domain interpolative speech CODEC system |
| US20050143988A1 (en) * | 2003-12-03 | 2005-06-30 | Kaori Endo | Noise reduction apparatus and noise reducing method |
| CN101894565A (en) * | 2009-05-19 | 2010-11-24 | 华为技术有限公司 | Voice signal restoration method and device |
| CN109859729A (en) * | 2019-01-21 | 2019-06-07 | 北京小唱科技有限公司 | Wave-shape amplitude control method and device are carried out to audio |
| CN111653263A (en) * | 2020-06-12 | 2020-09-11 | 百度在线网络技术(北京)有限公司 | Volume adjusting method and device, electronic equipment and storage medium |
Non-Patent Citations (1)
| Title |
|---|
| 叶锡恩, 张巧文: "基于WSOLA算法的语音时长调整研究", 科技通报, no. 05, 20 September 2005 (2005-09-20) * |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |