CN120032662B - Game team voice noise reduction method, device and medium based on dynamic threshold mechanism

Info

Publication number
CN120032662B
CN120032662B CN202510480623.1A CN202510480623A CN120032662B CN 120032662 B CN120032662 B CN 120032662B CN 202510480623 A CN202510480623 A CN 202510480623A CN 120032662 B CN120032662 B CN 120032662B
Authority
CN
China
Prior art keywords
noise
frame
time
dynamic threshold
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510480623.1A
Other languages
Chinese (zh)
Other versions
CN120032662A (en
Inventor
黄志松
冀啸天
李鹤
周义
姚茜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Luxcreo Beijing Inc
Original Assignee
Qingfeng Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingfeng Beijing Technology Co Ltd
Priority to CN202510480623.1A
Publication of CN120032662A
Application granted
Publication of CN120032662B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/40 Processing input control signals of video game devices, e.g. signals generated by the player or derived from the environment
    • A63F13/42 Processing input control signals of video game devices, e.g. signals generated by the player or derived from the environment by mapping the input signals into game commands, e.g. mapping the displacement of a stylus on a touch screen to the steering angle of a virtual vehicle
    • A63F13/424 Processing input control signals of video game devices, e.g. signals generated by the player or derived from the environment by mapping the input signals into game commands, e.g. mapping the displacement of a stylus on a touch screen to the steering angle of a virtual vehicle involving acoustic input signals, e.g. by using the results of pitch or rhythm extraction or voice recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present application provides a game team voice noise reduction method, device and medium based on a dynamic threshold mechanism. The method includes: dividing an audio signal into multiple time frames and converting the amplitude or decibel value of each time frame into an energy value; comparing the energy value with an initial threshold, marking the frame as a noise frame or low-priority frame if the energy value is lower than the initial threshold, and retaining it as a voice frame to be further processed if the energy value is higher than the initial threshold; comparing the energy value with a dynamic threshold, filtering out the corresponding time frame as a noise frame if the energy value is lower than the dynamic threshold, and retaining it as a valid voice frame if the energy value is higher than the dynamic threshold; and encoding the time frames retained after dynamic threshold filtering to generate the final game team voice signal. The application reduces missed detections and misjudgments of noise, filters noise dynamically and accurately under different environmental noise conditions, and improves user experience.

Description

Game team voice noise reduction method, device and medium based on dynamic threshold mechanism
Technical Field
The application relates to the technical field of voice noise reduction, and in particular to a game team voice noise reduction method, device and medium based on a dynamic threshold mechanism.
Background
With the rapid development of online games and electronic games, more and more players need to communicate and cooperate in real time through voice. However, in the actual use process, various noise interferences, such as mechanical noise, wind noise, etc., often exist in the environment where the player is located, which results in a decrease in the clarity of the player's voice and a decrease in the communication efficiency. In order to improve the quality of team voice chat, effective voice noise reduction needs to be achieved for different noise scenarios.
Current speech noise reduction techniques typically rely on fixed thresholds to make preliminary decisions on the audio data, or only apply basic noise suppression algorithms at the hardware level. However, when the noise level fluctuates widely, a fixed threshold or a single algorithm can hardly balance noise filtering against voice fidelity, and missed detections or misjudgments occur easily. Especially in noisy internet cafes or outdoor environments, the player's voice may be drowned out by noise or excessively filtered, resulting in a poor user experience. Therefore, how to filter noise dynamically and accurately under different environmental noise conditions is a technical problem to be solved in the prior art.
Disclosure of Invention
In view of the above, the embodiment of the application provides a method, a device and a medium for voice noise reduction of a game team based on a dynamic threshold mechanism, which are used for solving the problems that in the prior art, the phenomenon of missed judgment or misjudgment is easy to occur, noise cannot be filtered dynamically and accurately under different environmental noise conditions, and the user experience is poor.
A first aspect of the embodiments of the application provides a game team voice noise reduction method based on a dynamic threshold mechanism, comprising: collecting an audio signal containing player voice and environmental noise, and dividing the audio signal into a plurality of time frames; converting the amplitude or decibel value of each time frame to obtain a corresponding energy value; comparing the energy value with a set initial threshold, marking the time frame as a noise frame or a low-priority frame if the energy value is lower than the initial threshold, and retaining it as a voice frame to be further processed if the energy value is higher than the initial threshold; when consecutive time frames with energy values lower than the initial threshold are detected to last longer than a preset time period, performing statistics and smoothing on the energy values within the preset time period to obtain an environmental noise reference value; calculating a dynamic threshold based on the environmental noise reference value and a preset adjustment factor, and updating the dynamic threshold in real time as the environmental noise reference value changes over time; comparing the energy value with the dynamic threshold, filtering out the corresponding time frame as a noise frame if the energy value is lower than the dynamic threshold, and retaining the corresponding time frame as a valid voice frame if the energy value is higher than the dynamic threshold; and encoding the time frames retained after dynamic threshold filtering to generate the final game team voice signal.
A second aspect of the embodiments of the application provides a game team voice noise reduction device based on a dynamic threshold mechanism, comprising a collection module, a conversion module, a first comparison module, a processing module, a calculation module, a second comparison module and an encoding module. The collection module is used for collecting an audio signal containing player voice and environmental noise and dividing the audio signal into a plurality of time frames. The conversion module is used for converting the amplitude or decibel value of each time frame to obtain a corresponding energy value. The first comparison module is used for comparing the energy value with a set initial threshold, marking the time frame as a noise frame or a low-priority frame if the energy value is lower than the initial threshold, and retaining it as a voice frame to be further processed if the energy value is higher than the initial threshold. The processing module is used for performing statistics and smoothing on the energy values within a preset time period to obtain an environmental noise reference value when consecutive time frames with energy values lower than the initial threshold are detected to last longer than the preset time period. The calculation module is used for calculating a dynamic threshold based on the environmental noise reference value and a preset adjustment factor, and updating the dynamic threshold in real time as the environmental noise reference value changes over time. The second comparison module is used for comparing the energy value with the dynamic threshold, filtering out the corresponding time frame as a noise frame if the energy value is lower than the dynamic threshold, and retaining the corresponding time frame as a valid voice frame if the energy value is higher than the dynamic threshold. The encoding module is used for encoding the time frames retained after dynamic threshold filtering to generate the final game team voice signal.
In a third aspect of the embodiments of the present application, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect of the embodiments of the present application, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method.
The above at least one technical scheme adopted by the embodiment of the application can achieve the following beneficial effects:
An audio signal containing player voice and environmental noise is collected and divided into a plurality of time frames; the amplitude or decibel value of each time frame is converted to obtain a corresponding energy value; the energy value is compared with a set initial threshold, and the time frame is marked as a noise frame or a low-priority frame if the energy value is lower than the initial threshold, or retained as a voice frame to be further processed if the energy value is higher than the initial threshold; when consecutive time frames with energy values lower than the initial threshold are detected to last longer than a preset time period, statistics and smoothing are performed on the energy values within the preset time period to obtain an environmental noise reference value; a dynamic threshold is calculated based on the environmental noise reference value and a preset adjustment factor, and is updated in real time as the environmental noise reference value changes over time; the energy value is compared with the dynamic threshold, and the corresponding time frame is filtered out as a noise frame if the energy value is lower than the dynamic threshold, or retained as a valid voice frame if the energy value is higher than the dynamic threshold; and the time frames retained after dynamic threshold filtering are encoded to generate the final game team voice signal. The application can reduce missed detections and misjudgments of noise, filter noise dynamically and accurately under different environmental noise conditions, and improve user experience.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a dynamic threshold mechanism-based voice noise reduction method for a game team according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a dynamic threshold mechanism-based voice noise reduction device for a game team according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
The application aims to solve the technical problem of how to effectively reduce voice noise in a game team voice scene, and in particular how to cope with background noise that changes continuously across environments, thereby ensuring the clarity and comfort of voice communication between players.
Therefore, the application provides a game team voice noise reduction method based on a dynamic threshold mechanism. The main technical realization thought of the scheme comprises the following two points:
First, the collected sound decibel information is converted into an energy value by an algorithm, and sound signals with lower energy values (corresponding to low-level noise or invalid voice) are filtered out based on a set threshold. This operation performs a preliminary filtering of background noise and reduces redundant work in subsequent processing.
Secondly, a dynamic threshold mechanism is introduced for the problem that a fixed threshold is difficult to adapt under different environments. The change of the environmental noise is detected in real time through an algorithm, and the threshold value is dynamically adjusted, so that the proper threshold value can be automatically set in a quiet environment and a noisy environment, the accuracy and the adaptability of noise filtering are improved, and the quality of the voice of a player team is further improved.
In addition, in order to further improve the adaptability of the scheme to noise under different environmental noise conditions, a plurality of improvement technologies are introduced on the basis of the main technology realization thought, including multistage filtration, frequency domain analysis, machine learning auxiliary threshold adjustment and the like, so that a more perfect intelligent noise filtration solution is formed.
The following describes the technical scheme of the present application in detail with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a flowchart of a dynamic threshold mechanism-based voice noise reduction method for a game team according to an embodiment of the present application. As shown in fig. 1, the dynamic threshold mechanism-based game team voice noise reduction method specifically may include:
S101, collecting an audio signal containing player voice and environmental noise, and dividing the audio signal into a plurality of time frames;
S102, converting the amplitude or decibel value of each time frame to obtain a corresponding energy value;
S103, comparing the energy value with a set initial threshold, marking the time frame as a noise frame or a low-priority frame if the energy value is lower than the initial threshold, and retaining it as a voice frame to be further processed if the energy value is higher than the initial threshold;
S104, when consecutive time frames with energy values lower than the initial threshold are detected to last longer than a preset time period, performing statistics and smoothing on the energy values within the preset time period to obtain an environmental noise reference value;
S105, calculating a dynamic threshold based on the environmental noise reference value and a preset adjustment factor, and updating the dynamic threshold in real time as the environmental noise reference value changes over time;
S106, comparing the energy value with the dynamic threshold, filtering out the corresponding time frame as a noise frame if the energy value is lower than the dynamic threshold, and retaining the corresponding time frame as a valid voice frame if the energy value is higher than the dynamic threshold;
S107, encoding the time frames retained after dynamic threshold filtering to generate the final game team voice signal.
In some embodiments, capturing an audio signal containing player speech and ambient noise and dividing the audio signal into a plurality of time frames includes:
When a player uses a single microphone, directly acquiring single-channel audio data as an analog voice signal;
When a player uses a headset or externally connected multi-microphone equipment, multiple paths of audio data are obtained in parallel, and signal fusion or beam forming is carried out on the multiple paths of audio data to obtain an analog voice signal;
Converting the acquired analog voice signals into digital signals, and quantizing the digital signals to obtain digital audio conforming to a preset audio coding format;
At the hardware level or in the audio driver, using an automatic gain control or acoustic echo cancellation module to perform preliminary noise reduction or gain processing on the digital audio;
Dividing the digital audio after preliminary noise reduction or gain processing into a plurality of time frames according to a preset time window, so that initial threshold comparison and dynamic threshold updating can be performed according to the energy values of the time frames.
Specifically, in some cases, the player uses a single microphone device, such as an external USB microphone or an internal microphone. At this time, the system directly acquires an analog voice signal from a single channel and converts the analog signal into a digital signal through an analog-to-digital conversion (ADC) module. The sampling rate of the digital signal may be 16kHz or 48kHz to meet the quality requirements of real-time voice communications. During the conversion process, the system quantizes the signal to conform to PCM (Pulse Code Modulation) format or other predetermined audio encoding format for subsequent signal processing.
In another case, the player uses a headset or an external multi-microphone device, such as a high-end gaming headset or a stand-alone microphone array. The system may then obtain audio data from multiple channels in parallel and, after digital signal conversion, perform signal fusion or beamforming on the multi-channel audio data to improve the quality of the voice signal. The signal fusion may employ a delay-and-sum algorithm, which improves the signal-to-noise ratio of the player's voice by time-aligning, weighting and summing the signals of the different microphones. Alternatively, minimum variance distortionless response (MVDR) beamforming or another adaptive beamforming algorithm can be adopted, in which the microphone weights are dynamically adjusted according to the noise environment so that the voice signal from the target direction is enhanced and background noise from other directions is reduced.
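The delay-and-sum fusion described above can be sketched as follows. This is a minimal illustration and not the patent's implementation: it assumes the per-microphone delays are already known as whole numbers of samples, and the function name `delay_and_sum` and its parameters are hypothetical.

```python
def delay_and_sum(channels, delays, weights=None):
    """Align each channel by its integer sample delay, then take a weighted sum.

    channels: list of equal-rate sample lists, one per microphone.
    delays:   samples by which the target voice lags in each channel (assumed known).
    weights:  per-channel weights; defaults to a plain average.
    """
    if weights is None:
        weights = [1.0 / len(channels)] * len(channels)
    # Only the region where every shifted channel still has samples is usable.
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    out = []
    for n in range(length):
        out.append(sum(w * ch[n + d] for ch, d, w in zip(channels, delays, weights)))
    return out


# The same voice reaches the second microphone one sample later.
mic_a = [0.0, 1.0, 2.0, 3.0, 0.0]
mic_b = [0.0, 0.0, 1.0, 2.0, 3.0]
fused = delay_and_sum([mic_a, mic_b], delays=[0, 1])
```

After alignment the voice adds coherently across channels, while uncorrelated noise tends to average out, which is where the signal-to-noise gain comes from.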
After the audio signal acquisition and conversion are completed, the system performs preliminary noise reduction or gain processing on the digital audio at the hardware level or in the audio driver layer to ensure the stability and clarity of the voice signal. For example, the system may use automatic gain control (AGC) to adjust the gain of the input signal so that the voice volume stays at a suitable level. In addition, the system may apply an acoustic echo cancellation (AEC) algorithm to cancel the echo caused by loudspeaker sound being picked up again by the microphone, improving speech intelligibility.
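As a rough illustration of the gain step only, the sketch below scales one frame toward a target RMS level. It assumes samples normalized to the range -1.0 to 1.0; `apply_agc`, `target_rms` and `max_gain` are hypothetical names and values, and a real AGC in hardware or a driver would usually also smooth the gain over time.

```python
import math


def apply_agc(frame, target_rms=0.1, max_gain=10.0):
    """Scale a frame so its RMS level approaches target_rms (gain capped at max_gain)."""
    if not frame:
        return []
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    if rms == 0.0:
        return list(frame)  # pure silence: nothing to amplify
    gain = min(target_rms / rms, max_gain)
    return [x * gain for x in frame]


quiet_frame = [0.05, -0.05, 0.05, -0.05]
leveled = apply_agc(quiet_frame)  # RMS 0.05 is raised toward 0.1
```

The gain cap keeps near-silent frames, which are mostly noise, from being amplified without bound.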
Further, the system may perform noise suppression (NS) processing on the audio signal, for example by using spectral subtraction or adaptive filtering to reduce the effect of background noise on the speech signal. In addition, spatial noise suppression can be combined: based on the spatial information of the multi-microphone array, sound signals from different directions are distinguished, so that environmental noise such as keyboard and mouse clicks can be effectively suppressed.
The preprocessed digital audio data is divided into a plurality of time frames according to a preset time window (such as 20ms or 40ms). This process may be done in conjunction with the game engine or the underlying audio driver to ensure the real-time performance of game voice communication. The framed audio data serves as the input for the subsequent energy value calculation, initial threshold comparison and dynamic threshold updating, and provides the basis for further voice noise reduction and optimization.
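A minimal sketch of the framing step, assuming non-overlapping frames and that a trailing partial frame is simply dropped (the source does not say how leftovers are handled). The function name `split_into_frames` is hypothetical; the 16kHz rate and 20ms window are the example values mentioned above.

```python
def split_into_frames(samples, sample_rate=16000, frame_ms=20):
    """Cut a sample sequence into consecutive, non-overlapping fixed-length frames."""
    frame_len = sample_rate * frame_ms // 1000  # 320 samples at 16 kHz and 20 ms
    return [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


audio = [0.0] * 1000
frames = split_into_frames(audio)  # three full frames; the last 40 samples are dropped
```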
In some embodiments, converting the amplitude or decibel value of the time frame to obtain the energy value corresponding to each time frame includes:
Acquiring the amplitude of a digital audio sampling point in each time frame, sequentially squaring the amplitude of each digital audio sampling point and accumulating the obtained results to obtain a numerical value for representing time domain energy of the time frame;
Or converting the audio signal corresponding to the time frame into a decibel value, and converting the decibel value into a numerical value for representing the energy or the power of the time frame according to a preset mapping relation.
In particular, the digitized audio signal is divided into time frames of a fixed length, for example 20ms or 40ms per frame. After framing, each time frame contains a number of audio samples, each corresponding to a value representing the audio amplitude at that time.
In one approach, the time domain energy value is calculated directly from the audio amplitudes within the time frame. Specifically, the system sequentially acquires the amplitudes of all sampling points in the time frame and squares each amplitude, so that every sample contributes a non-negative amount to the overall signal energy. The squared amplitudes are then accumulated to obtain the overall energy value of the time frame. This method reflects the intensity of the audio signal within the frame well and is suitable for time domain energy analysis.
In another manner, the audio signal may be first converted into a decibel value, and the decibel value is converted into a numerical value for representing the energy or power of the time frame through a preset mapping relationship. Specifically, the system firstly performs decibel conversion on the audio signal in the time frame, so that the energy value is more in line with the auditory perception characteristic of the human ear. Then, based on a preset mapping relation, the decibel value is matched with the standard energy or power level, so that a more representative time frame energy value is obtained. The method can more intuitively represent the relative strength of the audio signal and is suitable for energy normalization processing under different equipment and environments.
Whichever way is adopted to calculate the energy value, the obtained time frame energy value can be used for noise reduction treatment such as subsequent threshold comparison, dynamic threshold adjustment and the like so as to improve the definition and stability of the voice signal.
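Both routes can be sketched in a few lines. The sum of squares follows the description above directly; for the decibel route the source only speaks of a "preset mapping relation", so the standard power conversion 10^(dB/10) is used here as an assumption. All function names are hypothetical.

```python
import math


def frame_energy(frame):
    """Time-domain energy: sum of squared sample amplitudes."""
    return sum(x * x for x in frame)


def frame_level_db(frame, eps=1e-12):
    """Mean power of a frame expressed in decibels (eps avoids log of zero)."""
    mean_power = frame_energy(frame) / len(frame)
    return 10.0 * math.log10(mean_power + eps)


def db_to_power(db, ref_power=1.0):
    """Assumed mapping from a decibel value back to linear power."""
    return ref_power * 10.0 ** (db / 10.0)


energy = frame_energy([1.0, -2.0, 3.0])   # 1 + 4 + 9
power = db_to_power(20.0)                 # 20 dB above the reference power
```

Either value can feed the threshold comparisons, as long as the thresholds are expressed on the same scale.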
In some embodiments, the energy value is compared to a set initial threshold, if the energy value is below the initial threshold, it is marked as a noise frame or a low priority frame, and if the energy value is above the initial threshold, it is retained as a speech frame to be further processed.
Specifically, the system sets a fixed initial threshold based on historical data, hardware characteristics, or factory calibration results. The initial threshold is used to judge whether the energy level of a time frame is high enough to indicate that the frame contains a valid speech signal. The threshold may be adjusted according to factors such as the microphone sensitivity and background noise level of different devices, so as to adapt to various hardware environments.
In some examples, after the system calculates the energy value for a certain time frame, the energy value is compared to a set initial threshold:
If the energy value is below the initial fixed threshold, it is considered that the time frame may contain only ambient noise or silence signals, and the frame is therefore marked as a noise frame or a low priority frame. The marked frames may be discarded directly or used as alternative data for subsequent analysis to reduce interference of the inactive signal with voice communications.
If the energy value is above the initial fixed threshold, it is considered that the time frame may contain valid speech signals, and thus the frame is retained and further processed as a candidate speech frame, e.g., dynamic thresholding, frequency domain analysis, etc., to improve the accuracy of speech recognition.
Furthermore, to adapt a fixed threshold to different use scenarios, the system may adapt the threshold for different audio devices or user environments. For example, in a noisy internet cafe environment, the initial threshold may be set slightly higher to avoid misrecognizing the background noise as a speech signal, while in a quiet home environment, the initial threshold may be set lower to ensure that weak speech signals are also captured effectively.
By means of the embodiment, the method and the device can effectively distinguish the noise frame from the voice frame based on the fixed threshold, provide a basis for subsequent dynamic threshold adjustment and advanced noise reduction processing, and ensure the definition and stability of the voice signal.
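The initial comparison amounts to a single pass over the frame energies. In this sketch the label strings and the function name `classify_frames` are hypothetical, and a frame whose energy exactly equals the threshold is treated as noise, a case the source leaves unspecified.

```python
NOISE = "noise_or_low_priority"
CANDIDATE = "speech_candidate"


def classify_frames(energies, initial_threshold):
    """Label each frame by comparing its energy with the fixed initial threshold."""
    return [CANDIDATE if e > initial_threshold else NOISE for e in energies]


labels = classify_frames([0.2, 3.5, 0.9, 7.1], initial_threshold=1.0)
```

Frames labeled as candidates go on to the dynamic threshold stage, while the others can be discarded or kept as noise statistics.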
In some embodiments, when consecutive time frames with energy values lower than the initial threshold are detected to last longer than a preset time period, performing statistics and smoothing on the energy values within the preset time period to obtain an environmental noise reference value includes:
When consecutive time frames with energy values lower than the initial threshold are detected to last longer than a preset time period, judging that the player is temporarily not speaking, taking the consecutive time frames within the preset time period as environmental noise frames, and collecting the energy values of the environmental noise frames;
Performing statistics on the energy values of the environmental noise frames, and using a smoothing algorithm to continuously accumulate or average the noise energy values over different time periods, so as to obtain an environmental noise reference value representing the environmental noise level.
Specifically, the system continuously monitors the energy value of the audio signal and determines whether the energy level of the time frame is below a preset initial threshold. When it is detected that the energy value for a number of consecutive time frames is below the initial threshold and the duration exceeds a preset time period (e.g., 500ms or 1 s), the system may assume that the player is not speaking during that time period, and that these low energy time frames consist primarily of ambient noise. These successive time frames are therefore marked as ambient noise frames and their energy values are collected for subsequent calculations.
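One way to locate such quiet stretches is to track runs of consecutive below-threshold frames. The sketch below is an assumption-laden illustration: `find_noise_runs` is a hypothetical name, frames are assumed to have a fixed duration, and a run counts once it reaches the preset duration.

```python
import math


def find_noise_runs(energies, initial_threshold, frame_ms=20, min_quiet_ms=500):
    """Return (start, end) index pairs of below-threshold runs lasting at least min_quiet_ms.

    The end index is exclusive, so energies[start:end] are the ambient noise frames.
    """
    min_frames = math.ceil(min_quiet_ms / frame_ms)
    runs = []
    start = None
    for i, e in enumerate(energies):
        if e < initial_threshold:
            if start is None:
                start = i  # a quiet run begins here
        else:
            if start is not None and i - start >= min_frames:
                runs.append((start, i))
            start = None
    # A quiet run may extend to the end of the data.
    if start is not None and len(energies) - start >= min_frames:
        runs.append((start, len(energies)))
    return runs


energies = [5.0] * 3 + [0.1] * 4 + [5.0] + [0.1] * 2
runs = find_noise_runs(energies, initial_threshold=1.0, frame_ms=20, min_quiet_ms=60)
```

In this example only the four-frame quiet stretch qualifies; the final two quiet frames are too short to be treated as ambient noise.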
Next, the system performs statistics and processing on the acquired ambient noise frame energy values to obtain a more stable ambient noise reference value. In the statistical process, the system can adopt the following method:
Sliding window calculation: the ambient noise frame energy values over a recent period of time (e.g., 1 second) are stored in a buffer, and the mean or median of these values is calculated to smooth out short-term fluctuations in the noise data.
Weighted filtering: different weights are assigned to noise data from different time periods, such as higher weights for the most recent noise frames, to improve the response to noise variations.
Exponential smoothing: the historical environmental noise value is updated recursively, so that newer data has a greater influence on the reference value, which enhances the adaptability of the system to noise level changes.
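The three smoothing options listed above can be illustrated with a small sketch. The class name `NoiseReference`, the window length, the linear weighting scheme, and the smoothing coefficient `alpha` are all assumptions made for this example; the source does not specify them.

```python
from collections import deque
from statistics import mean, median


class NoiseReference:
    """Tracks ambient-noise frame energies and exposes several smoothed estimates."""

    def __init__(self, window: int = 50, alpha: float = 0.1):
        self.buf = deque(maxlen=window)  # sliding window of recent noise energies
        self.alpha = alpha               # exponential smoothing coefficient (assumed)
        self.ema = None                  # exponentially smoothed reference value

    def update(self, noise_energy: float) -> None:
        self.buf.append(noise_energy)
        if self.ema is None:
            self.ema = noise_energy
        else:
            # Recursive update: newer data has a larger influence than older data.
            self.ema = self.alpha * noise_energy + (1.0 - self.alpha) * self.ema

    def window_mean(self) -> float:
        return mean(self.buf)

    def window_median(self) -> float:
        return median(self.buf)

    def weighted_mean(self) -> float:
        # Linearly increasing weights: the most recent frame gets the largest weight.
        weights = range(1, len(self.buf) + 1)
        return sum(w * e for w, e in zip(weights, self.buf)) / sum(weights)
```

The median is less sensitive than the mean to an occasional loud frame that slipped into the noise buffer, while the exponential form needs no buffer at all.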
By the method, the system can dynamically estimate the overall level of the environmental noise, and a more stable and accurate environmental noise reference value is obtained. The reference value can be used for subsequent dynamic threshold calculation, so that the system can adjust the voice detection standard according to the real-time noise level, thereby improving the stability and definition of game voice communication.
In addition, the embodiment is applicable to different use environments. For example, in a quiet home environment, the system detects a low ambient noise reference, and thus the subsequent speech detection threshold is automatically lowered to ensure that low volume speech is still correctly recognized. In a noisy internet bar or outdoor environment, the environmental noise reference value is higher, and the system correspondingly increases the threshold value of voice detection so as to reduce the interference of background noise.
In some embodiments, calculating the dynamic threshold based on the ambient noise reference value and a preset adjustment factor, updating the dynamic threshold in real time as the ambient noise reference value changes over time, includes:
According to the environmental noise reference value and in combination with a preset adjusting factor, calculating a dynamic threshold value for distinguishing a noise frame from a voice frame by utilizing a predefined algorithm or mapping rule;
When the change of the environmental noise reference value along with time is detected, the output of the algorithm or the mapping rule is adaptively adjusted according to the adjusting factor, and the dynamic threshold is controlled to be synchronously adjusted up or down so as to enable the updated dynamic threshold to be matched with the environmental noise level fluctuating in real time.
In particular, the system calculates an ambient noise reference value that is used to characterize the background noise level in the current environment using the methods described above. When the environment noise is high, the reference value is large, and when the environment is quiet, the reference value is small. The system combines the reference value and a preset adjusting factor to calculate a dynamic threshold value, wherein the threshold value is used for distinguishing noise frames from voice frames.
When the dynamic threshold is calculated, the system utilizes a preset mapping rule or algorithm model to enable the dynamic threshold to adapt to different environmental noise levels. For example, the system may employ a nonlinear mapping function that calculates a dynamic threshold that is adapted to the current noise level based on the ambient noise reference value. When the environmental noise level changes, the system can adjust the calculated dynamic threshold according to the setting of the adjusting factor, so that the dynamic threshold keeps smooth change, and erroneous judgment caused by too fast noise fluctuation is avoided.
When it is detected that the environmental noise reference value rises over time (e.g., the environment in which the player is located becomes noisier, such as in an internet bar, a station, etc.), the system synchronously increases the dynamic threshold to reduce false detection of low-energy noise, ensure that only a voice signal of significant strength can pass through the noise reduction system, thereby improving the voice quality and preventing the environmental noise from interfering with normal communication.
Conversely, when a decrease in the ambient noise reference value over time is detected (e.g., the player moves from a noisy environment into a quiet room, or the ambient background noise is reduced), the system automatically lowers the dynamic threshold to enhance the capture of low-volume speech, ensuring that valid speech signals are still accurately recognized even when the player speaks softly or is picked up from a distance.
Furthermore, to avoid frequent changes in the dynamic threshold due to short-term noise fluctuations, the system may set a minimum adjustment step size or a smooth adjustment strategy. For example, when the environmental noise variation amplitude is small, the system may slowly adjust the dynamic threshold to avoid excessive response of the voice detection system to brief noise fluctuations. If the environmental noise changes drastically (e.g., suddenly enters a high noise location), the system can quickly adjust the dynamic threshold to accommodate the new noise environment and improve the intelligibility of the speech signal.
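A minimal sketch of the threshold update described above is given below. It assumes a simple linear mapping (threshold = adjustment factor times noise reference), whereas the text also allows nonlinear mappings; the names `target_threshold` and `update_threshold` and all numeric defaults are hypothetical.

```python
def target_threshold(noise_ref: float, k: float = 2.0, floor: float = 1e-4) -> float:
    """Map the ambient noise reference to a threshold (assumed linear mapping)."""
    return max(floor, k * noise_ref)


def update_threshold(current: float, noise_ref: float, k: float = 2.0,
                     slow_rate: float = 0.05, fast_rate: float = 0.5,
                     jump_ratio: float = 2.0) -> float:
    """Move the dynamic threshold toward its target, slowly or quickly.

    Small changes in the noise level are followed slowly to avoid reacting to
    brief fluctuations; a drastic change (ratio >= jump_ratio) is followed fast.
    """
    target = target_threshold(noise_ref, k)
    hi, lo = max(target, current), max(min(target, current), 1e-12)
    rate = fast_rate if hi / lo >= jump_ratio else slow_rate
    return current + rate * (target - current)
```

Calling `update_threshold` once per new noise reference value yields a threshold that rises and falls with the environment while staying smooth under minor fluctuations.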
According to the embodiment, the system can adaptively adjust the dynamic threshold based on the real-time change of the environmental noise, so that stable and high-quality game voice communication experience can be provided under different environments.
In some embodiments, the energy value is compared to a dynamic threshold, and if the energy value is below the dynamic threshold, the corresponding time frame is filtered as a noise frame, and if the energy value is above the dynamic threshold, the corresponding time frame is retained as a valid speech frame.
Specifically, the system calculates the energy value of each time frame based on the method and obtains the dynamic threshold under the current environment. When the energy value of a certain time frame is below the dynamic threshold, the system considers that the frame is mainly composed of ambient noise, and therefore marks it as a noise frame or an invalid frame, and filters it to reduce noise interference. Conversely, if the energy value of the time frame is above the dynamic threshold, the system considers that the frame may contain speech signals and processes it further as a candidate speech frame.
In some examples, to improve the accuracy of the determination, the present embodiment may also determine the frame processing policy in conjunction with a fixed threshold and a dynamic threshold. For example:
First type: the energy value is lower than the fixed threshold; the frame is directly judged to be a noise frame and is filtered.
Second type: the energy value is higher than the dynamic threshold; the frame is directly judged to be a valid voice frame and is retained for subsequent encoding and transmission.
Third type: the energy value is between the fixed threshold and the dynamic threshold; the frame may contain weak voice or background noise, and the system may further perform frequency domain analysis, for example, extracting the spectral characteristics of the frame through a Fast Fourier Transform (FFT) and determining whether the frame contains the main frequency components of a voice signal (e.g., 300 Hz-3400 Hz). If the spectral characteristics of the frame conform to speech characteristics, the frame is retained; otherwise it is filtered as a noise frame.
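The three-way decision above might look like the following sketch. To stay dependency-free it uses a naive discrete Fourier transform rather than an FFT (the result is the same, only slower), and it uses the fraction of spectral energy inside 300 Hz-3400 Hz as a stand-in for "conforms to speech characteristics". The function names and the `min_ratio` value are assumptions.

```python
import cmath
import math
from typing import Sequence


def speech_band_ratio(frame: Sequence[float], sample_rate: float,
                      lo: float = 300.0, hi: float = 3400.0) -> float:
    """Fraction of non-DC spectral energy that lies in the speech band."""
    n = len(frame)
    total = band = 0.0
    for k in range(1, n // 2 + 1):  # skip the DC bin
        x = sum(s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(frame))
        p = abs(x) ** 2
        total += p
        if lo <= k * sample_rate / n <= hi:
            band += p
    return band / total if total > 0.0 else 0.0


def classify_frame(frame: Sequence[float], sample_rate: float,
                   fixed_thr: float, dynamic_thr: float,
                   min_ratio: float = 0.6) -> str:
    """Return "noise" or "speech" using the fixed and dynamic thresholds."""
    energy = sum(s * s for s in frame)
    if energy < fixed_thr:      # first type: clearly noise
        return "noise"
    if energy > dynamic_thr:    # second type: clearly speech
        return "speech"
    # Third type: ambiguous energy, so look at the spectrum (assumed criterion).
    ratio = speech_band_ratio(frame, sample_rate)
    return "speech" if ratio >= min_ratio else "noise"
```

A real system would likely use richer spectral features than a single band ratio; the sketch only shows where the frequency-domain check sits in the decision.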
In addition, to reduce the impact of short-term noise fluctuations on speech detection, the system may perform a time-continuous analysis of the time frames. For example, if the energy value of a certain time frame is below the dynamic threshold, but the energy values of a plurality of adjacent time frames before and after the certain time frame are above the dynamic threshold, the system may infer that the frame may belong to a valid voice signal based on the consistency of the voice signal, thereby preserving the frame, rather than directly filtering it.
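The time-continuity rule above can be sketched as a gap-filling pass over per-frame decisions. The function name `fill_short_gaps` and the `max_gap` value are assumptions; note that this formulation needs a small look-ahead buffer, since a gap can only be filled once the frames after it are known.

```python
from typing import List


def fill_short_gaps(flags: List[bool], max_gap: int = 2) -> List[bool]:
    """Keep short below-threshold gaps that are surrounded by speech frames.

    `flags[i]` is True when frame i exceeded the dynamic threshold. A run of
    False values no longer than `max_gap`, with True on both sides, is flipped
    to True so that brief dips inside an utterance are not cut out.
    """
    out = list(flags)
    n = len(flags)
    i = 0
    while i < n:
        if flags[i]:
            i += 1
            continue
        j = i
        while j < n and not flags[j]:  # find the end of this gap
            j += 1
        if i > 0 and j < n and (j - i) <= max_gap:
            for k in range(i, j):
                out[k] = True
        i = j
    return out
```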
By the method of the embodiment, the system can accurately distinguish the voice frame from the noise frame under different noise environments, improve the intelligibility of voice signals, reduce the interference of background noise and ensure the definition and stability of game voice communication.
In some embodiments, after comparing the energy value to the dynamic threshold, the method further comprises:
Performing frequency domain analysis on the effective voice frame, converting the time domain signal into frequency domain information, and if non-voice frequency or narrow-band noise concentrated in a specific frequency band is detected, applying filtering processing to the non-voice frequency or narrow-band noise;
setting corresponding multi-level energy thresholds in different frequency bands, and performing differentiation judgment on energy values of each frequency band according to the centralized distribution characteristics of noise types in a specific frequency band;
And comprehensively analyzing the same audio frame or a plurality of continuous audio frames according to the short-time energy and the long-time energy, and respectively identifying and filtering the instantaneous impulse noise and the continuous background noise.
Specifically, in this embodiment, after comparing the energy value with the dynamic threshold, frequency domain analysis, multi-level energy threshold setting, and short-time/long-time energy joint judgment are further performed on the effective speech frame, so as to improve the recognition and filtering capabilities of the system for different types of noise, thereby optimizing the definition and stability of the game speech communication.
After the preliminary energy value screening is completed, a portion of the time frame may still contain complex ambient noise, such as electromagnetic interference, mechanical noise, or other non-speech frequency components. To further improve speech quality, the system performs a frequency domain analysis on these valid speech frames to detect and remove certain noise components.
Specifically, the system first performs a Fast Fourier Transform (FFT) on each valid speech frame, converts the time domain signal to a frequency domain signal, and analyzes its spectral distribution. Game speech is usually concentrated in the speech frequency band of 300 Hz-3400 Hz. If non-speech frequencies with significantly concentrated energy are detected (such as 50 Hz power supply noise, keyboard tapping, or wind noise), the system can process them using the following methods:
Band-reject filter (notch filter): for stable narrowband noise (e.g., power supply noise), a band-reject filter is used to remove the specific frequency components while preserving the integrity of the speech signal as much as possible.
Adaptive filtering: if a change in noise frequency over time is detected, an adaptive filter may be used to adjust the noise suppression strategy in real time to accommodate different ambient noise patterns.
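As an illustration of the band-reject option, the sketch below implements a second-order (biquad) notch filter using the widely used Audio EQ Cookbook coefficient formulas. The source does not say which filter design is used, so the design, the function name `notch_filter`, and the quality factor `q` are assumptions.

```python
import math
from typing import List, Sequence


def notch_filter(samples: Sequence[float], sample_rate: float,
                 notch_hz: float = 50.0, q: float = 5.0) -> List[float]:
    """Suppress a narrow band around `notch_hz` with a biquad notch filter."""
    w0 = 2.0 * math.pi * notch_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    cos_w0 = math.cos(w0)
    a0 = 1.0 + alpha
    # Coefficients normalized by a0.
    b0, b1, b2 = 1.0 / a0, -2.0 * cos_w0 / a0, 1.0 / a0
    a1, a2 = -2.0 * cos_w0 / a0, (1.0 - alpha) / a0
    x1 = x2 = y1 = y2 = 0.0  # filter state: previous inputs and outputs
    out: List[float] = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out
```

A higher `q` narrows the rejected band, which protects nearby speech content but lengthens the time the filter needs to settle after the hum starts.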
Through the frequency domain conversion and filtering processing, non-voice components in the voice signal can be effectively removed, and the purity of the final voice signal is improved.
Further, different types of noise tend to have different spectral distribution characteristics, such as:
Wind noise is usually concentrated in a lower frequency band (20 Hz-300 Hz);
mechanical noise (e.g., fan noise, road noise) may be distributed over a plurality of narrower frequency bands;
Keyboard tapping typically includes higher frequency components (1000 Hz-5000 Hz);
Pop sounds may cover a wide spectral range.
In some examples, to more accurately reject these noise, the system sets different energy thresholds at different frequency bands and makes a differential determination for a particular noise type. For example:
In the low frequency region below 300 Hz, if the energy value exceeds a set low frequency threshold, wind noise may be present, and the gain of the low frequency signal may be reduced or low frequency filtering may be applied directly.
In the high frequency region of 1000 Hz-5000 Hz, if the energy value is too high and discontinuous short pulse signals appear, the signal is judged to be keyboard tapping, and noise reduction processing is performed.
In the intermediate frequency region (300 Hz-3400 Hz), if the energy value accords with the voice frequency band characteristic, the energy value can be identified as a voice signal and is reserved preferentially.
By setting different energy thresholds through the frequency division, the system can carry out finer classification processing on various noises, and the accuracy of noise filtering is improved.
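The per-band thresholding above can be sketched as follows. The band edges are taken from the text (note that the speech band and the high band overlap); the band names, the threshold values a caller passes in, and the use of a naive DFT instead of an FFT are assumptions for this example. Energy units are arbitrary, and the additional check for short discontinuous pulses mentioned for keyboard tapping is omitted here.

```python
import cmath
import math
from typing import Dict, Sequence, Tuple

# Band edges in Hz, following the ranges given in the text.
BANDS: Dict[str, Tuple[float, float]] = {
    "low": (0.0, 300.0),        # wind noise region
    "speech": (300.0, 3400.0),  # main speech band
    "high": (1000.0, 5000.0),   # keyboard tapping region
}


def band_energies(frame: Sequence[float], sample_rate: float,
                  bands: Dict[str, Tuple[float, float]] = BANDS) -> Dict[str, float]:
    """Sum spectral power per band (naive DFT, DC bin excluded)."""
    n = len(frame)
    out = {name: 0.0 for name in bands}
    for k in range(1, n // 2 + 1):
        freq = k * sample_rate / n
        x = sum(s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(frame))
        p = abs(x) ** 2 / n
        for name, (lo, hi) in bands.items():
            if lo <= freq < hi:
                out[name] += p
    return out


def exceeded_bands(frame: Sequence[float], sample_rate: float,
                   thresholds: Dict[str, float]) -> Dict[str, bool]:
    """Report, per band, whether its energy exceeds that band's threshold."""
    energies = band_energies(frame, sample_rate)
    return {name: energies[name] > thresholds[name] for name in energies}
```

A caller could, for example, attenuate the low band when only `"low"` is flagged, and give priority to retaining the frame when `"speech"` is flagged.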
Further, to further enhance the stability of the speech signal, the system combines short-term energy analysis with long-term energy analysis to identify different types of noise patterns and perform corresponding processing.
Short-time energy analysis: suitable for detecting sudden noise, such as keyboard tapping or explosion sounds. Such noise typically appears as a sharp rise in energy over a short period with a short duration. When the system detects that the energy value of a certain time frame is significantly higher than that of the preceding and following frames and does not match the duration characteristics of a normal voice signal, the system can judge that the frame possibly contains sudden noise and perform additional noise reduction processing.
Long-time energy analysis: suitable for detecting persistent noise, such as air conditioning noise, fan noise, or background human voices. The system calculates the smoothed average energy of a plurality of time frames to determine whether a certain noise is continuously present. When the energy value of a certain frequency band stays at a high level for a long time with little variation, the system can judge that continuous noise exists in that frequency band and adopt a corresponding noise reduction strategy, such as frequency band attenuation or adaptive noise suppression.
Furthermore, the system can combine the analysis results of short-time energy and long-time energy to form a more robust filtering mechanism. For example:
If the short-time energy fluctuation of a certain time frame is large, but the long-time energy is stable, burst noise is likely, and impulse noise suppression can be performed.
If the short-time energy and the long-time energy of a certain time frame are high, the background noise may be enhanced, and the dynamic threshold may be raised to reduce noise interference.
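The two checks above can be sketched with simple energy comparisons. The function names, the ratio of 4, and the relative spread limit are assumptions; a real detector would also need the duration check against normal speech that the text mentions.

```python
from statistics import mean
from typing import Sequence


def is_impulse(prev_e: float, cur_e: float, next_e: float, ratio: float = 4.0) -> bool:
    """Short-time check: the frame is far louder than both of its neighbors."""
    return cur_e > ratio * max(prev_e, next_e, 1e-12)


def is_sustained_noise(energies: Sequence[float], level: float,
                       max_rel_spread: float = 0.2) -> bool:
    """Long-time check: energy stays above `level` and varies only slightly."""
    m = mean(energies)
    spread = max(energies) - min(energies)
    return m > level and spread <= max_rel_spread * m
```

In this sketch a frame flagged by `is_impulse` would go to impulse suppression, while a window flagged by `is_sustained_noise` would be a cue to raise the dynamic threshold or attenuate the affected band.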
The frequency domain analysis, the multi-level energy threshold setting and the short-time/long-time energy joint judgment in the embodiment can be used together to realize more accurate noise suppression. For example, the system may:
firstly, carrying out dynamic threshold screening to eliminate obvious noise frames;
performing frequency domain analysis on the preliminarily screened voice frame to remove narrowband noise and specific frequency interference;
Performing refined filtration aiming at noise types of different frequency bands by using a multi-level energy threshold;
and finally determining whether a certain voice frame is reserved or not according to the combination of the short-time energy analysis and the long-time energy analysis.
Through the steps of the embodiment, the method can effectively reduce noise interference under different environments, improve the definition of game voice communication and enable players to obtain good voice interaction experience in noisy or quiet environments.
In some embodiments, the method further comprises:
Marking a noise frame and an effective voice frame based on voice and noise data acquired under different environments, and training a mapping relation between environmental noise characteristics and a threshold value by using a machine learning algorithm to obtain a corresponding machine learning model;
In actual operation, whenever a new environmental noise reference value is detected, inputting the environmental noise reference value into a machine learning model for inference, and outputting an optimized dynamic threshold value by combining the environmental noise reference value with a current dynamic threshold value calculated according to an adjustment factor;
and updating model parameters of the machine learning model in real time by utilizing the collected continuously-changed noise environment data.
Specifically, in this embodiment, in order to further improve the accuracy and environmental adaptability of the dynamic threshold, a mapping relationship between the voice signal and the noise feature is modeled by combining a machine learning model, and real-time inference and adaptive adjustment are performed by using the model in actual operation, so as to optimize the noise filtering effect and improve the stability and definition of game voice communication.
First, the system needs to build a noise analysis and dynamic threshold calculation model based on machine learning. To this end, the system collects voice and noise data in different environments, including but not limited to:
quiet home environments (low background noise);
noisy internet cafes, casino environments (high background noise);
outdoor environments such as parks or streets (complex dynamic noise);
rooms with echoes or reverberation (environments of different acoustic properties).
Further, in the data collection process, the system preprocesses the audio data and divides the audio data into time frames, and then the manual or automatic labeling system classifies the time frames, and marks the time frames as follows:
Noise frame: contains only ambient noise and no obvious speech component.
Valid voice frame: contains player voice and has a higher signal-to-noise ratio.
Suspicious frame: contains weak speech or mixed noise and requires further analysis.
Further, after labeling is completed, the system trains the mapping relation between noise characteristics and threshold values by using a machine learning algorithm, so that the model can automatically predict the optimal dynamic threshold values according to the noise characteristics. The machine learning model applicable to the present embodiment includes:
Neural networks (RNN, CNN, Transformer): suitable for complex voice and noise pattern analysis; they can extract deep features and perform intelligent noise reduction adjustment.
Gaussian Mixture Model (GMM): suitable for probabilistic modeling of noise and speech signals and for calculating the optimal classification threshold.
Support Vector Machine (SVM): suitable for small-sample learning; it can achieve threshold optimization with a small amount of training data.
After training, the system obtains a set of model parameters, and the model can predict an optimal dynamic threshold based on the environmental noise reference value so as to adapt to different noise environments.
In actual operation, whenever a new ambient noise reference value is detected, the system inputs the reference value into a trained machine learning model and performs optimization in combination with the current dynamic threshold calculation method, and the specific flow is as follows:
1) Noise level detection:
The system continuously monitors the ambient noise level, calculates an ambient noise reference value, and determines whether the current dynamic threshold needs to be adjusted.
2) Model inference:
the current environmental noise reference value is input into a machine learning model, and the system automatically calculates an optimal dynamic threshold or threshold interval suitable for the current environment.
The inference process combines historical training data, so that the system can adjust the threshold value based on past experience and adapt to various complex environments.
3) Threshold fusion:
the optimal threshold calculated by the machine learning model is fused with the current dynamic threshold calculated based on the adjustment factor.
A weighted average or adaptive adjustment algorithm can be used to ensure that the final threshold meets the current noise environment and is not misjudged due to abrupt noise change.
4) Real-time application:
The finally calculated optimal dynamic threshold is applied to the game voice processing system, and subsequent time frames are filtered, so that the system can always operate under the optimal noise suppression condition.
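The threshold fusion in step 3) could be sketched as a weighted average followed by a limit on how far the threshold may move in one update. The function name `fuse_thresholds`, the weight, and the step limit are assumptions; the model prediction is passed in as a plain number, so no particular machine learning library is implied.

```python
def fuse_thresholds(model_thr: float, rule_thr: float, prev_thr: float,
                    model_weight: float = 0.5,
                    max_step_ratio: float = 0.25) -> float:
    """Blend the model-predicted and rule-based thresholds, then limit the step.

    `model_thr` is the threshold predicted by the machine learning model,
    `rule_thr` the one computed from the adjustment factor, and `prev_thr`
    the threshold currently in use.
    """
    fused = model_weight * model_thr + (1.0 - model_weight) * rule_thr
    max_step = max_step_ratio * prev_thr  # largest allowed change per update
    return min(max(fused, prev_thr - max_step), prev_thr + max_step)
```

Limiting the step is one way to keep an abrupt noise change, or a single poor model prediction, from causing a sudden misjudgment; repeated updates still converge to the fused value.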
For example, when a player moves from a quiet room into a noisy internet cafe, the system detects a significant rise in the ambient noise level, and the machine learning model predicts a higher dynamic threshold to avoid misrecognizing the ambient noise as a valid speech signal. When the player returns to a quiet environment, the system automatically lowers the dynamic threshold to ensure that the voice is still recognized when the player speaks softly.
Further, in order to enable the system to adapt to different users and environments for a long time, an online learning mechanism is adopted to continuously optimize the performance of the machine learning model.
During the long-term operation of the system, new environmental noise samples are continuously collected and stored in a local or cloud database. These data can be used to extend the training set to enable the model to adapt to complex situations of different devices, different users, and different environments.
Through incremental training or federated learning, the model is gradually optimized without affecting the real-time experience of the user. If the dynamic threshold calculation is found to perform poorly in certain specific environments (such as special echo environments and extreme noise environments), the system can adjust the adjustment factors or model parameters to improve adaptability.
The present embodiment may allow the user to manually adjust the voice detection sensitivity and communicate this adjustment data back to the server for optimization of the model. For example, if the user turns down the dynamic threshold multiple times in an internet cafe environment, the system may learn the habit and automatically decrease the threshold in a similar environment.
The embodiment realizes intelligent voice noise reduction optimization by combining a machine learning model and a dynamic threshold calculation method, and has the following advantages compared with the traditional fixed threshold method:
The traditional method uses a fixed formula to calculate the dynamic threshold, which makes it difficult to adapt to abrupt environmental changes, whereas the machine learning model can adaptively adjust the threshold according to historical data, improving the noise reduction effect in different environments.
The machine learning model can be trained based on a large amount of noise environment data, so that the calculated dynamic threshold value is more accurate, and erroneous judgment is reduced.
Through an online learning mechanism, the system can continuously optimize model parameters, so that the model parameters can keep high recognition accuracy for a long time.
The method is suitable for various complex environments, and ensures that players can obtain clear and stable voice communication experience in different scenes such as internet bars, outdoors and home.
By the method, the game voice noise reduction system can dynamically adjust the threshold value, intelligently adapt to the environment and continuously optimize the model, so that players can enjoy high-quality voice communication under different noise environments, and communication barriers caused by noise interference or misjudgment are avoided.
According to the technical scheme of the embodiment of the application, the application has at least the following advantages:
High adaptability: the dynamic threshold changes with the environmental noise level, so a stable and appropriate noise filtering effect can be obtained in both noisy and quiet environments.
High accuracy: different types of noise can be suppressed and filtered in a more targeted manner through multistage filtering, frequency domain analysis, and an optional machine learning auxiliary mechanism.
Strong real-time performance: the dynamic adjustment of the threshold can be completed on a time scale of milliseconds to seconds, which meets the real-time communication requirements of game voice and ensures a good interactive experience for players.
Wide applicability: the scheme can be integrated with a hardware noise reduction module (AEC/AGC and the like), multi-microphone beam forming, and a cloud machine learning platform, and its applications include but are not limited to VR/AR games, mobile network voice chat, remote conference systems, and the like.
The following are examples of the apparatus of the present application that may be used to perform the method embodiments of the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the method of the present application.
Fig. 2 is a schematic structural diagram of a dynamic threshold mechanism-based voice noise reduction device for a game team according to an embodiment of the present application. As shown in fig. 2, the dynamic threshold mechanism-based game team voice noise reduction device comprises:
an acquisition module 201 for acquiring an audio signal containing player's voice and environmental noise and dividing the audio signal into a plurality of time frames;
the conversion module 202 is configured to convert the amplitude or the decibel value of the time frame to obtain an energy value corresponding to each time frame;
A first comparing module 203, configured to compare the energy value with a set initial threshold, mark a noise frame or a low priority frame if the energy value is lower than the initial threshold, and reserve a speech frame to be further processed if the energy value is higher than the initial threshold;
The processing module 204 is configured to, when detecting that a continuous time frame with an energy value lower than an initial threshold exceeds a preset time period, perform statistics and smoothing on the energy value in the preset time period to obtain an ambient noise reference value;
the calculating module 205 is configured to calculate a dynamic threshold based on the environmental noise reference value and a preset adjustment factor, and update the dynamic threshold in real time when the environmental noise reference value changes with time;
A second comparing module 206, configured to compare the energy value with a dynamic threshold, filter the corresponding time frame as a noise frame if the energy value is lower than the dynamic threshold, and reserve the corresponding time frame as an effective speech frame if the energy value is higher than the dynamic threshold;
A generating module 207, configured to encode the time frame that remains after the dynamic threshold filtering, and generate a final game team voice signal.
In some embodiments, the acquisition module 201 of fig. 2 operates as follows: when a player uses a single microphone, it directly acquires single-channel audio data as an analog voice signal; when the player uses a headset or an external multi-microphone device, it acquires multiple channels of audio data in parallel and performs signal fusion or beam forming on them to obtain an analog voice signal; it converts the acquired analog voice signal into a digital signal and quantizes the digital signal to obtain digital audio conforming to a predetermined audio coding format; it performs preliminary noise reduction or gain processing on the digital audio by using an automatic gain control or acoustic echo cancellation module at the hardware level or in the audio driver; and it divides the processed digital audio into multiple time frames according to a preset time window, so that initial threshold comparison and dynamic threshold updating can be performed according to the energy values of the time frames.
In some embodiments, the conversion module 202 of fig. 2 obtains the amplitude of the digital audio sampling points in each time frame, squares the amplitude of each digital audio sampling point in turn, and accumulates the obtained results to obtain a value for representing the time-domain energy of the time frame, or converts the audio signal corresponding to the time frame into a decibel value, and converts the decibel value into a value for representing the time-domain energy or power according to a preset mapping relationship.
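The two conversion paths described for the conversion module can be sketched as follows. The sum-of-squares energy follows the text directly; for the decibel path, the text only says "a preset mapping relationship", so the standard power relation P = P_ref * 10^(dB/10) used here, the reference power of 1.0, and the function names are assumptions.

```python
import math
from typing import Sequence


def frame_energy(samples: Sequence[float]) -> float:
    """Time-domain energy of a frame: sum of squared sample amplitudes."""
    return sum(s * s for s in samples)


def db_to_power(db: float, ref_power: float = 1.0) -> float:
    """Convert a decibel value to linear power relative to `ref_power`."""
    return ref_power * 10.0 ** (db / 10.0)


def power_to_db(power: float, floor: float = 1e-12) -> float:
    """Convert linear power to decibels, clamping to avoid log of zero."""
    return 10.0 * math.log10(max(power, floor))
```

Dividing `frame_energy` by the frame length gives mean power, which makes frames of different lengths comparable.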
In some embodiments, the processing module 204 of fig. 2 determines that the player is not speaking temporarily when detecting that the continuous time frame with the energy value lower than the initial threshold exceeds the preset time period, takes the continuous time frame in the preset time period as an ambient noise frame, collects the energy value of the ambient noise frame, counts the energy value of the ambient noise frame, and continuously accumulates or calculates the average value of the noise energy values in different time periods by using a smoothing algorithm to obtain an ambient noise reference value for representing the ambient noise level.
In some embodiments, the calculation module 205 of fig. 2 calculates a dynamic threshold for distinguishing a noise frame from a speech frame according to an ambient noise reference value and in combination with a preset adjustment factor by using a predefined algorithm or mapping rule, and when it is detected that the ambient noise reference value changes with time, adjusts the output of the algorithm or mapping rule adaptively according to the adjustment factor, and controls the dynamic threshold to be adjusted up or down synchronously so as to match the updated dynamic threshold with the ambient noise level fluctuating in real time.
In some embodiments, the analysis module 208 of fig. 2 performs frequency domain analysis on the valid voice frame, converts the time domain signal into frequency domain information, applies filtering processing to the non-voice frequency or the narrowband noise if the non-voice frequency or the narrowband noise concentrated in the specific frequency band is detected, sets corresponding multi-level energy thresholds in different frequency bands, performs differentiation judgment on energy values of each frequency band according to the concentrated distribution characteristics of the noise types in the specific frequency band, performs comprehensive analysis on the same audio frame or a plurality of continuous audio frames according to short-time energy and long-time energy, and respectively identifies and filters instantaneous impulse noise and continuous background noise.
In some embodiments, the training module 209 of fig. 2 marks noise frames and valid speech frames based on collected speech and noise data under different environments, trains the mapping relation between the environmental noise characteristics and the threshold value by using a machine learning algorithm to obtain a corresponding machine learning model, inputs the environmental noise reference value into the machine learning model to infer each time a new environmental noise reference value is detected during actual operation, and outputs an optimized dynamic threshold value in combination with a current dynamic threshold value calculated according to an adjustment factor, and updates model parameters of the machine learning model in real time by using the collected continuously-changed noise environment data.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and should not limit the implementation process of the embodiments of the present application.
Fig. 3 is a schematic structural diagram of an electronic device 3 according to an embodiment of the present application. As shown in fig. 3, the electronic device 3 of this embodiment comprises a processor 301, a memory 302, and a computer program 303 stored in the memory 302 and executable on the processor 301. When the processor 301 executes the computer program 303, the steps of the various method embodiments described above are implemented. Alternatively, when the processor 301 executes the computer program 303, the functions of the modules/units in the device embodiments described above are performed.
Illustratively, the computer program 303 may be partitioned into one or more modules/units, which are stored in the memory 302 and executed by the processor 301 to carry out the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments describe the execution of the computer program 303 in the electronic device 3.
The electronic device 3 may be an electronic device such as a desktop computer, a notebook computer, a palm computer, or a cloud server. The electronic device 3 may include, but is not limited to, a processor 301 and a memory 302. It will be appreciated by those skilled in the art that fig. 3 is merely an example of the electronic device 3 and does not constitute a limitation of the electronic device 3, and may include more or fewer components than shown, or may combine certain components, or different components, e.g., the electronic device may also include an input-output device, a network access device, a bus, etc.
The processor 301 may be a central processing unit (Central Processing Unit, CPU) or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 302 may be an internal storage unit of the electronic device 3, for example, a hard disk or a memory of the electronic device 3. The memory 302 may also be an external storage device of the electronic device 3, for example, a plug-in hard disk provided on the electronic device 3, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), or the like. Further, the memory 302 may also include both an internal storage unit and an external storage device of the electronic device 3. The memory 302 is used to store computer programs and other programs and data required by the electronic device. The memory 302 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
In the foregoing embodiments, each embodiment is described with its own emphasis. For parts that are not described or detailed in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided by the present application, it should be understood that the disclosed apparatus/computer device and method may be implemented in other manners. For example, the apparatus/computer device embodiments described above are merely illustrative: the division of modules or units is merely a logical functional division, and other divisions are possible in an actual implementation. Multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the methods of the above embodiments may also be implemented by a computer program instructing related hardware. The computer program may be stored in a computer readable storage medium and, when executed by a processor, may implement the steps of each of the method embodiments described above. The computer program may comprise computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer readable medium may include any entity or device capable of carrying computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so on.
The foregoing embodiments are merely for illustrating the technical solution of the present application, but not for limiting the same, and although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those skilled in the art that the technical solution described in the foregoing embodiments may be modified or substituted for some of the technical features thereof, and that these modifications or substitutions should not depart from the spirit and scope of the technical solution of the embodiments of the present application and should be included in the protection scope of the present application.

Claims (10)

1. A game team voice noise reduction method based on a dynamic threshold mechanism, characterized by comprising:
collecting an audio signal containing a player's voice and ambient noise, and dividing the audio signal into a plurality of time frames;
converting the amplitude or decibel value of the time frame to obtain an energy value corresponding to each time frame;
comparing the energy value with a set initial threshold; if the energy value is lower than the initial threshold, marking the frame as a noise frame or a low-priority frame; if the energy value is higher than the initial threshold, retaining the frame as a speech frame to be further processed;
when it is detected that the continuous time frames in which the energy value is lower than the initial threshold exceed a preset time period, recursively calculating the energy values of the continuous time frames using an exponential smoothing method to obtain an environmental noise reference value;
calculating a dynamic threshold based on the environmental noise reference value and a preset adjustment factor, and updating the dynamic threshold in real time when the environmental noise reference value changes over time; wherein the system sets a minimum adjustment step or a smooth adjustment strategy, the system adjusts the dynamic threshold slowly when the environmental noise changes slightly, and the system adjusts the dynamic threshold quickly when the environmental noise changes drastically;
comparing the energy value with the dynamic threshold; if the energy value is lower than the dynamic threshold, filtering the corresponding time frame as a noise frame; if the energy value is higher than the dynamic threshold, retaining the corresponding time frame as a valid speech frame; wherein, if the energy value of a time frame is between the initial threshold and the dynamic threshold, performing a fast Fourier transform to extract the spectral features of the time frame and determining whether it contains the main frequency components of a speech signal; if the spectral features of the time frame match speech features, retaining the time frame, and otherwise filtering the time frame as a noise frame;
encoding the time frames retained after filtering by the dynamic threshold to generate a final game team voice signal.

2. The method according to claim 1, characterized in that collecting an audio signal containing a player's voice and ambient noise and dividing the audio signal into a plurality of time frames comprises:
when the player uses a single microphone, directly obtaining single-channel audio data as an analog voice signal;
when the player uses a headset or an external multi-microphone device, acquiring multiple channels of audio data in parallel, and performing signal fusion or beamforming on the multiple channels of audio data to obtain an analog voice signal;
converting the collected analog voice signal into a digital signal, and quantizing the digital signal to obtain digital audio that complies with a predetermined audio coding format;
at the hardware level or in the audio driver, using an automatic gain control or acoustic echo cancellation module to perform preliminary noise reduction or gain processing on the digital audio;
dividing the digital audio after the preliminary noise reduction or gain processing into multiple time frames according to a preset time window, so as to perform the initial threshold comparison and the dynamic threshold update according to the energy values of the time frames.

3. The method according to claim 1, characterized in that converting the amplitude or decibel value of the time frame to obtain the energy value corresponding to each time frame comprises:
obtaining the amplitudes of the digital audio sampling points in each time frame, squaring the amplitude of each digital audio sampling point in turn, and accumulating the results to obtain a value representing the time domain energy of the time frame;
or, converting the audio signal corresponding to the time frame into a decibel value, and converting the decibel value into a value representing the energy or power of the time frame according to a preset mapping relationship.

4. The method according to claim 1, characterized in that, when it is detected that the continuous time frames in which the energy value is lower than the initial threshold exceed a preset time period, statistically analyzing and smoothing the energy values within the preset time period to obtain an environmental noise reference value comprises:
when it is detected that the continuous time frames in which the energy value is lower than the initial threshold exceed the preset time period, determining that the player is not currently speaking, taking the continuous time frames within the preset time period as environmental noise frames, and collecting the energy values of the environmental noise frames;
statistically analyzing the energy values of the environmental noise frames, and continuously accumulating or averaging the noise energy values in different time periods using a smoothing algorithm to obtain an environmental noise reference value representing the environmental noise level.

5. The method according to claim 1, characterized in that calculating the dynamic threshold based on the environmental noise reference value and a preset adjustment factor, and updating the dynamic threshold in real time when the environmental noise reference value changes over time, comprises:
calculating a dynamic threshold for distinguishing noise frames from speech frames according to the environmental noise reference value, in combination with the preset adjustment factor, using a predefined algorithm or mapping rule;
when it is detected that the environmental noise reference value changes over time, adaptively adjusting the output of the algorithm or mapping rule according to the adjustment factor, and controlling the dynamic threshold to increase or decrease synchronously, so that the updated dynamic threshold matches the environmental noise level as it fluctuates in real time.

6. The method according to claim 1, characterized in that, after comparing the energy value with the dynamic threshold, the method further comprises:
performing frequency domain analysis on the valid speech frame to convert the time domain signal into frequency domain information, and, if non-speech frequencies or narrowband noise concentrated in a specific frequency band are detected, applying filtering to the non-speech frequencies or narrowband noise;
setting corresponding multi-level energy thresholds in different frequency bands, and making differentiated judgments on the energy values of each frequency band according to how each noise type concentrates in specific frequency bands;
comprehensively analyzing the same audio frame or multiple consecutive audio frames according to short-time energy and long-time energy, and separately identifying and filtering instantaneous impulse noise and continuous background noise.

7. The method according to claim 1, characterized in that the method further comprises:
labeling noise frames and valid speech frames based on speech and noise data collected in different environments, and training the mapping relationship between environmental noise characteristics and thresholds using a machine learning algorithm to obtain a corresponding machine learning model;
during actual operation, whenever a new environmental noise reference value is detected, inputting the environmental noise reference value into the machine learning model for inference, and combining the result with the current dynamic threshold calculated from the adjustment factor to output an optimized dynamic threshold;
updating the model parameters of the machine learning model in real time using the collected, continuously changing noise environment data.

8. A game team voice noise reduction device based on a dynamic threshold mechanism, characterized by comprising:
an acquisition module, configured to collect an audio signal containing a player's voice and ambient noise, and divide the audio signal into a plurality of time frames;
a conversion module, configured to convert the amplitude or decibel value of the time frame to obtain an energy value corresponding to each time frame;
a first comparison module, configured to compare the energy value with a set initial threshold, mark the frame as a noise frame or a low-priority frame if the energy value is lower than the initial threshold, and retain the frame as a speech frame to be further processed if the energy value is higher than the initial threshold;
a processing module, configured to, when it is detected that the continuous time frames in which the energy value is lower than the initial threshold exceed a preset time period, recursively calculate the energy values of the continuous time frames using an exponential smoothing method to obtain an environmental noise reference value;
a calculation module, configured to calculate a dynamic threshold based on the environmental noise reference value and a preset adjustment factor, and update the dynamic threshold in real time when the environmental noise reference value changes over time; wherein the system sets a minimum adjustment step or a smooth adjustment strategy, the system adjusts the dynamic threshold slowly when the environmental noise changes slightly, and the system adjusts the dynamic threshold quickly when the environmental noise changes drastically;
a second comparison module, configured to compare the energy value with the dynamic threshold, filter the corresponding time frame as a noise frame if the energy value is lower than the dynamic threshold, and retain the corresponding time frame as a valid speech frame if the energy value is higher than the dynamic threshold; wherein, if the energy value of a time frame is between the initial threshold and the dynamic threshold, a fast Fourier transform is performed to extract the spectral features of the time frame and determine whether it contains the main frequency components of a speech signal; if the spectral features of the time frame match speech features, the time frame is retained, and otherwise the time frame is filtered as a noise frame;
a generation module, configured to encode the time frames retained after filtering by the dynamic threshold to generate a final game team voice signal.

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.

10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
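The core flow of claim 1 (frame energy, initial threshold, exponentially smoothed noise reference, dynamic threshold with slow and fast adjustment, and a spectral check for in-between frames) might be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the claimed implementation: the dynamic threshold is taken to be the adjustment factor times the noise reference, "drastic change" is detected with a hypothetical `jump_ratio`, the preset time period is expressed as a frame count, smoothing starts only once the quiet run exceeds that count, and the FFT-based speech check is left to a caller-supplied callback named `looks_like_speech`.

```python
def frame_energy(samples):
    """Time-domain energy: sum of squared sample amplitudes (as in claim 3)."""
    return sum(s * s for s in samples)


class DynamicThresholdGate:
    """Hypothetical gate that keeps or drops frames using two thresholds."""

    def __init__(self, initial_threshold, adjust_factor=2.0, alpha=0.1,
                 min_quiet_frames=5, slow_step=0.1, fast_step=0.8,
                 jump_ratio=2.0):
        self.initial_threshold = initial_threshold
        self.adjust_factor = adjust_factor        # preset adjustment factor
        self.alpha = alpha                        # exponential smoothing weight
        self.min_quiet_frames = min_quiet_frames  # stands in for the preset period
        self.slow_step = slow_step                # used for small noise changes
        self.fast_step = fast_step                # used for drastic noise changes
        self.jump_ratio = jump_ratio              # what counts as "drastic"
        self.noise_ref = None
        self.dynamic_threshold = initial_threshold
        self.quiet_run = 0

    def _update_noise(self, energy):
        # Recursive exponential smoothing of the environmental noise reference.
        if self.noise_ref is None:
            self.noise_ref = energy
        else:
            self.noise_ref = (self.alpha * energy
                              + (1.0 - self.alpha) * self.noise_ref)
        target = self.adjust_factor * self.noise_ref
        lo, hi = sorted((target, self.dynamic_threshold))
        drastic = lo <= 0 or hi / lo >= self.jump_ratio
        step = self.fast_step if drastic else self.slow_step
        # Move the threshold part of the way toward its new target.
        self.dynamic_threshold += step * (target - self.dynamic_threshold)

    def process(self, samples, looks_like_speech=lambda frame: False):
        """Return True to keep the frame, False to filter it as noise."""
        energy = frame_energy(samples)
        if energy < self.initial_threshold:
            self.quiet_run += 1
            if self.quiet_run > self.min_quiet_frames:
                self._update_noise(energy)
        else:
            self.quiet_run = 0
        low, high = sorted((self.initial_threshold, self.dynamic_threshold))
        if low <= energy < high:
            # Between the two thresholds: defer to the spectral check.
            return bool(looks_like_speech(samples))
        return energy >= high


# Example: five quiet frames pull the dynamic threshold down toward the noise.
gate = DynamicThresholdGate(initial_threshold=1.0, alpha=0.5, min_quiet_frames=2)
quiet_results = [gate.process([0.1] * 4) for _ in range(5)]
threshold_after_quiet = gate.dynamic_threshold
kept_mid = gate.process([0.5, 0.5], looks_like_speech=lambda frame: True)
kept_loud = gate.process([2.0])
```

Passing the spectral check in as a callback keeps the energy logic separate from the frequency analysis, so either part can be swapped out when experimenting with parameters.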
CN202510480623.1A 2025-04-17 2025-04-17 Game team voice noise reduction method, device and medium based on dynamic threshold mechanism Active CN120032662B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510480623.1A CN120032662B (en) 2025-04-17 2025-04-17 Game team voice noise reduction method, device and medium based on dynamic threshold mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510480623.1A CN120032662B (en) 2025-04-17 2025-04-17 Game team voice noise reduction method, device and medium based on dynamic threshold mechanism

Publications (2)

Publication Number Publication Date
CN120032662A CN120032662A (en) 2025-05-23
CN120032662B CN120032662B (en) 2025-08-22

Family

ID=95737602

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510480623.1A Active CN120032662B (en) 2025-04-17 2025-04-17 Game team voice noise reduction method, device and medium based on dynamic threshold mechanism

Country Status (1)

Country Link
CN (1) CN120032662B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107967918A (en) * 2016-10-19 2018-04-27 河南蓝信科技股份有限公司 A kind of method for strengthening voice signal clarity
CN117079661A (en) * 2023-08-21 2023-11-17 腾讯科技(深圳)有限公司 Sound source processing method and related device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103971698B (en) * 2013-01-25 2019-01-11 北京千橡网景科技发展有限公司 Method and apparatus for voice real-time noise-reducing
CN104078050A (en) * 2013-03-26 2014-10-01 杜比实验室特许公司 Device and method for audio classification and audio processing
US20150081287A1 (en) * 2013-09-13 2015-03-19 Advanced Simulation Technology, inc. ("ASTi") Adaptive noise reduction for high noise environments
CN111128214B (en) * 2019-12-19 2022-12-06 网易(杭州)网络有限公司 Audio noise reduction method and device, electronic equipment and medium
US20220415299A1 (en) * 2021-06-25 2022-12-29 Nureva, Inc. System for dynamically adjusting a soundmask signal based on realtime ambient noise parameters while maintaining echo canceller calibration performance
CN118098255A (en) * 2024-04-08 2024-05-28 深圳市长丰影像器材有限公司 Voice enhancement method based on neural network detection and related device thereof

Also Published As

Publication number Publication date
CN120032662A (en) 2025-05-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant