CN120032662B

CN120032662B - Game team voice noise reduction method, device and medium based on dynamic threshold mechanism

Info

Publication number: CN120032662B
Application number: CN202510480623.1A
Authority: CN
Inventors: 黄志松; 冀啸天; 李鹤; 周义; 姚茜
Original assignee: Qingfeng Beijing Technology Co Ltd
Current assignee: Luxcreo Beijing Inc
Priority date: 2025-04-17
Filing date: 2025-04-17
Publication date: 2025-08-22
Anticipated expiration: 2045-04-17
Also published as: CN120032662A

Abstract

The present application provides a method, device and medium for reducing the noise of game team voice based on a dynamic threshold mechanism. The method includes: dividing the audio signal into multiple time frames, converting the amplitude or decibel value of the time frame to obtain an energy value; comparing the energy value with the initial threshold, if the energy value is lower than the initial threshold, marking it as a noise frame or a low-priority frame, if the energy value is higher than the initial threshold, retaining it as a voice frame to be further processed; comparing the energy value with the dynamic threshold, if the energy value is lower than the dynamic threshold, filtering the corresponding time frame as a noise frame, if the energy value is higher than the dynamic threshold, retaining the corresponding time frame as a valid voice frame; encoding the time frame retained after filtering by the dynamic threshold to generate the final game team voice signal. The present application can reduce the phenomenon of missed or misjudgment of noise, realize dynamic and accurate filtering of noise under different environmental noise conditions, and improve user experience.

Description

Game team voice noise reduction method, device and medium based on dynamic threshold mechanism

Technical Field

The application relates to the technical field of voice noise reduction, in particular to a method, a device and a medium for voice noise reduction of a game team based on a dynamic threshold mechanism.

Background

With the rapid development of online games and electronic games, more and more players need to communicate and cooperate in real time through voice. However, in the actual use process, various noise interferences, such as mechanical noise, wind noise, etc., often exist in the environment where the player is located, which results in a decrease in the clarity of the player's voice and a decrease in the communication efficiency. In order to improve the quality of team voice chat, effective voice noise reduction needs to be achieved for different noise scenarios.

Current speech noise reduction techniques typically rely on fixed thresholds to make preliminary decisions on the audio data, or simply perform noise suppression through basic algorithms at the hardware level. However, when the noise level fluctuates widely, a fixed threshold or a single algorithm can hardly balance noise filtering against voice fidelity, and missed judgments or misjudgments occur easily. Especially in noisy internet cafes or outdoor environments, the player's voice may be submerged by noise or excessively filtered, resulting in a poor user experience. Therefore, how to filter noise dynamically and accurately under different environmental noise conditions is a technical problem that remains to be solved in the prior art.

Disclosure of Invention

In view of the above, the embodiment of the application provides a method, a device and a medium for voice noise reduction of a game team based on a dynamic threshold mechanism, which are used for solving the problems that in the prior art, the phenomenon of missed judgment or misjudgment is easy to occur, noise cannot be filtered dynamically and accurately under different environmental noise conditions, and the user experience is poor.

A first aspect of the embodiments of the application provides a game team voice noise reduction method based on a dynamic threshold mechanism, comprising: collecting an audio signal containing player voice and environmental noise, and dividing the audio signal into a plurality of time frames; converting the amplitude or decibel value of each time frame to obtain an energy value corresponding to that time frame; comparing the energy value with a set initial threshold, marking the time frame as a noise frame or a low-priority frame if the energy value is lower than the initial threshold, and retaining it as a voice frame to be further processed if the energy value is higher than the initial threshold; when consecutive time frames whose energy values are lower than the initial threshold are detected to last longer than a preset time period, performing statistics and smoothing on the energy values within the preset time period to obtain an environmental noise reference value; calculating a dynamic threshold based on the environmental noise reference value and a preset adjusting factor, and updating the dynamic threshold in real time when the environmental noise reference value changes over time; comparing the energy value with the dynamic threshold, filtering the corresponding time frame as a noise frame if the energy value is lower than the dynamic threshold, and retaining the corresponding time frame as a valid voice frame if the energy value is higher than the dynamic threshold; and encoding the time frames retained after filtering by the dynamic threshold to generate a final game team voice signal.

A second aspect of the embodiments of the application provides a game team voice noise reduction device based on a dynamic threshold mechanism, comprising a collection module, a conversion module, a first comparison module, a processing module, a calculation module, a second comparison module and an encoding module, wherein: the collection module is used for collecting an audio signal containing player voice and environmental noise and dividing the audio signal into a plurality of time frames; the conversion module is used for converting the amplitude or decibel value of each time frame to obtain an energy value corresponding to that time frame; the first comparison module is used for comparing the energy value with a set initial threshold, marking the time frame as a noise frame or a low-priority frame if the energy value is lower than the initial threshold, and retaining it as a voice frame to be further processed if the energy value is higher than the initial threshold; the processing module is used for performing statistics and smoothing on the energy values within a preset time period to obtain an environmental noise reference value when consecutive time frames whose energy values are lower than the initial threshold are detected to last longer than the preset time period; the calculation module is used for calculating a dynamic threshold based on the environmental noise reference value and a preset adjusting factor, and updating the dynamic threshold in real time when the environmental noise reference value changes over time; the second comparison module is used for comparing the energy value with the dynamic threshold, filtering the corresponding time frame as a noise frame if the energy value is lower than the dynamic threshold, and retaining the corresponding time frame as a valid voice frame if the energy value is higher than the dynamic threshold; and the encoding module is used for encoding the time frames retained after filtering by the dynamic threshold to generate a final game team voice signal.

In a third aspect of the embodiments of the present application, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.

In a fourth aspect of the embodiments of the present application, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method.

The above at least one technical scheme adopted by the embodiment of the application can achieve the following beneficial effects:

The method comprises: collecting an audio signal containing player voice and environmental noise, and dividing the audio signal into a plurality of time frames; converting the amplitude or decibel value of each time frame to obtain an energy value corresponding to that time frame; comparing the energy value with a set initial threshold, marking the time frame as a noise frame or a low-priority frame if the energy value is lower than the initial threshold, and retaining it as a voice frame to be further processed if the energy value is higher than the initial threshold; when consecutive time frames whose energy values are lower than the initial threshold are detected to last longer than a preset time period, performing statistics and smoothing on the energy values within the preset time period to obtain an environmental noise reference value; calculating a dynamic threshold based on the environmental noise reference value and a preset adjusting factor, and updating the dynamic threshold in real time when the environmental noise reference value changes over time; comparing the energy value with the dynamic threshold, filtering the corresponding time frame as a noise frame if the energy value is lower than the dynamic threshold, and retaining the corresponding time frame as a valid voice frame if the energy value is higher than the dynamic threshold; and encoding the time frames retained after filtering by the dynamic threshold to generate a final game team voice signal. The application can reduce missed judgments and misjudgments of noise, achieve dynamic and accurate noise filtering under different environmental noise conditions, and improve the user experience.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a dynamic threshold mechanism-based voice noise reduction method for a game team according to an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a dynamic threshold mechanism-based voice noise reduction device for a game team according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

The application aims to solve the technical problem of how to effectively reduce voice noise in a game team voice scenario, and in particular how to cope with background noise that changes continuously across different environments, thereby ensuring the clarity and comfort of voice communication between players.

Therefore, the application provides a game team voice noise reduction method based on a dynamic threshold mechanism. The main technical realization thought of the scheme comprises the following two points:

First, the collected sound decibel information is converted into an energy value by an algorithm, and a sound signal with a lower energy value (corresponding to a lower noise or invalid voice) is filtered based on a set threshold. The operation can be used for primarily filtering background noise, and redundancy of subsequent processing is reduced.

Secondly, a dynamic threshold mechanism is introduced for the problem that a fixed threshold is difficult to adapt under different environments. The change of the environmental noise is detected in real time through an algorithm, and the threshold value is dynamically adjusted, so that the proper threshold value can be automatically set in a quiet environment and a noisy environment, the accuracy and the adaptability of noise filtering are improved, and the quality of the voice of a player team is further improved.

In addition, in order to further improve the adaptability of the scheme to noise under different environmental noise conditions, a plurality of improvement technologies are introduced on the basis of the main technology realization thought, including multistage filtration, frequency domain analysis, machine learning auxiliary threshold adjustment and the like, so that a more perfect intelligent noise filtration solution is formed.

The following describes the technical scheme of the present application in detail with reference to the accompanying drawings and specific embodiments.

FIG. 1 is a flowchart of a dynamic threshold mechanism-based voice noise reduction method for a game team according to an embodiment of the present application. As shown in FIG. 1, the dynamic threshold mechanism-based game team voice noise reduction method may specifically include:

S101, collecting an audio signal containing player voices and environmental noise, and dividing the audio signal into a plurality of time frames;

S102, converting the amplitude value or the decibel value of the time frame to obtain an energy value corresponding to each time frame;

S103, comparing the energy value with a set initial threshold, if the energy value is lower than the initial threshold, marking the time frame as a noise frame or a low-priority frame, and if the energy value is higher than the initial threshold, retaining the time frame as a voice frame to be further processed;

S104, when the continuous time frame with the energy value lower than the initial threshold value is detected to exceed a preset time period, carrying out statistics and smoothing on the energy value in the preset time period to obtain an environmental noise reference value;

S105, calculating a dynamic threshold value based on the environmental noise reference value and a preset adjusting factor, and updating the dynamic threshold value in real time when the environmental noise reference value changes with time;

S106, comparing the energy value with a dynamic threshold, filtering the corresponding time frame as a noise frame if the energy value is lower than the dynamic threshold, and reserving the corresponding time frame as an effective voice frame if the energy value is higher than the dynamic threshold;

S107, encoding the time frames retained after the dynamic threshold filtering to generate a final game team voice signal.

In some embodiments, capturing an audio signal containing player speech and ambient noise and dividing the audio signal into a plurality of time frames includes:

When a player uses a single microphone, directly acquiring single-channel audio data as an analog voice signal;

When a player uses a headset or externally connected multi-microphone equipment, multiple paths of audio data are obtained in parallel, and signal fusion or beam forming is carried out on the multiple paths of audio data to obtain an analog voice signal;

Converting the acquired analog voice signals into digital signals, and quantizing the digital signals to obtain digital audio conforming to a preset audio coding format;

At the hardware level or in the audio driver, using an automatic gain control or acoustic echo cancellation module to perform preliminary noise reduction or gain processing on the digital audio;

Dividing the digital audio that has undergone preliminary noise reduction or gain processing into a plurality of time frames according to a preset time window, so that initial threshold comparison and dynamic threshold updating can be performed according to the energy values of the time frames.

Specifically, in some cases, the player uses a single microphone device, such as an external USB microphone or an internal microphone. At this time, the system directly acquires an analog voice signal from a single channel and converts the analog signal into a digital signal through an analog-to-digital conversion (ADC) module. The sampling rate of the digital signal may be 16kHz or 48kHz to meet the quality requirements of real-time voice communications. During the conversion process, the system quantizes the signal to conform to PCM (Pulse Code Modulation) format or other predetermined audio encoding format for subsequent signal processing.

In another case, the player uses a headset or an external multi-microphone device, such as a high-end gaming headset or a stand-alone microphone array. The system then obtains audio data from multiple channels in parallel and, after digital conversion, performs signal fusion or beamforming on the multi-channel audio data to improve the quality of the voice signal. The signal fusion may employ a delay-and-sum algorithm, which improves the signal-to-noise ratio of the player's voice by time-aligning the signals of the different microphones and summing them with weights. Alternatively, a minimum variance distortionless response (MVDR) beamformer or another adaptive beamforming algorithm can be adopted to dynamically adjust the microphone weights according to the noise environment, enhancing the voice signal from the target direction and reducing background noise from other directions.
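As an illustration of the delay-and-sum fusion mentioned above, the following is a minimal sketch rather than the implementation of the application. It assumes that the per-channel delays are already known as non-negative integer sample counts; the function name `delay_and_sum`, the delays and the weights are hypothetical.

```python
from typing import List, Sequence


def delay_and_sum(channels: Sequence[Sequence[float]],
                  delays: Sequence[int],
                  weights: Sequence[float]) -> List[float]:
    """Time-align each channel by its sample delay, then form a weighted average."""
    if not (len(channels) == len(delays) == len(weights)):
        raise ValueError("channels, delays and weights must have equal length")
    # Only the region where every shifted channel still has samples is usable.
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    total_weight = sum(weights)
    fused = []
    for n in range(length):
        acc = 0.0
        for ch, d, w in zip(channels, delays, weights):
            acc += w * ch[n + d]  # read d samples ahead to undo the arrival delay
        fused.append(acc / total_weight)
    return fused


# Hypothetical example: the second microphone hears the same pulse 2 samples later.
mic_a = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
mic_b = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
fused = delay_and_sum([mic_a, mic_b], delays=[0, 2], weights=[0.5, 0.5])
```

After alignment the pulses from both microphones add coherently, whereas uncorrelated noise on the two channels would partly average out.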

After the audio signal acquisition and conversion are completed, the system performs preliminary noise reduction or gain processing on the digital audio at the hardware level or in the audio driver layer to ensure the stability and clarity of the voice signal. For example, the system may use automatic gain control (AGC) to adjust the gain of the input signal automatically so that the voice volume stays at a suitable level. In addition, the system may apply an acoustic echo cancellation (AEC) algorithm to cancel echo interference caused by loudspeaker sound feeding back into the microphone, improving speech intelligibility.
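The gain adjustment described above can be sketched per frame as follows. This is a simplified illustration under stated assumptions: a practical AGC would normally smooth the gain across frames, and the names `apply_agc`, `target_rms` and `max_gain` as well as their default values are hypothetical.

```python
import math
from typing import List, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of a frame (0.0 for an empty frame)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def apply_agc(frame: Sequence[float], target_rms: float = 0.1,
              max_gain: float = 10.0) -> List[float]:
    """Scale a frame toward target_rms, never amplifying by more than max_gain."""
    level = rms(frame)
    if level == 0.0:
        return list(frame)  # pure silence: nothing to scale
    gain = min(target_rms / level, max_gain)
    return [s * gain for s in frame]


quiet_frame = [0.02, -0.02, 0.02, -0.02]
boosted = apply_agc(quiet_frame)
```

The gain cap keeps near-silent frames, which are mostly noise, from being amplified up to the full target level.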

Further, the system may perform noise suppression (NS) processing on the audio signal, for example by using spectral subtraction or adaptive filtering to reduce the effect of background noise on the speech signal. In addition, spatial noise suppression can be combined: based on the spatial information of the multi-microphone array, sound signals from different directions are distinguished so that environmental noise such as keyboard typing and mouse clicks can be effectively suppressed.

The preprocessed digital audio data is divided into a plurality of time frames according to a preset time window (such as 20 ms or 40 ms). This process may be done in conjunction with the game engine or the underlying audio driver to ensure the real-time performance of game voice communication. The framed audio data serves as the input for the subsequent energy value calculation, initial threshold comparison and dynamic threshold updating, and provides a basis for further voice noise reduction and optimization.
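The framing step can be sketched as follows, using the 16 kHz sampling rate and 20 ms window given as examples in the text. The function name `split_into_frames` is hypothetical, and discarding a trailing partial frame is an assumption of this sketch, not something the description specifies.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], sample_rate: int = 16000,
                      frame_ms: int = 20) -> List[List[float]]:
    """Cut samples into consecutive, non-overlapping frames of frame_ms each."""
    frame_len = sample_rate * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    # Assumption: a trailing partial frame is dropped rather than padded.
    return [list(samples[start:start + frame_len])
            for start in range(0, len(samples) - frame_len + 1, frame_len)]


one_second = [0.0] * 16000
frames = split_into_frames(one_second)  # 320 samples per 20 ms frame at 16 kHz
```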

In some embodiments, converting the amplitude or decibel value of the time frame to obtain the energy value corresponding to each time frame includes:

Acquiring the amplitude of a digital audio sampling point in each time frame, sequentially squaring the amplitude of each digital audio sampling point and accumulating the obtained results to obtain a numerical value for representing time domain energy of the time frame;

Or converting the audio signal corresponding to the time frame into a decibel value, and converting the decibel value into a numerical value for representing the energy or the power of the time frame according to a preset mapping relation.

In particular, the digitized audio signal is divided into time frames of a fixed length, for example 20ms or 40ms per frame. After framing, each time frame contains a number of audio samples, each corresponding to a value representing the audio amplitude at that time.

In one approach, the time domain energy value may be calculated directly from the audio amplitudes within the time frame. Specifically, the system sequentially acquires the amplitudes of all sampling points in the time frame and squares each amplitude (a nonlinear transformation), so that the result reflects the energy of the signal. The squared amplitudes are then accumulated to obtain the overall energy value of the time frame. This method reflects the intensity of the audio signal within the frame well and is suitable for time domain energy analysis.

In another manner, the audio signal may be first converted into a decibel value, and the decibel value is converted into a numerical value for representing the energy or power of the time frame through a preset mapping relationship. Specifically, the system firstly performs decibel conversion on the audio signal in the time frame, so that the energy value is more in line with the auditory perception characteristic of the human ear. Then, based on a preset mapping relation, the decibel value is matched with the standard energy or power level, so that a more representative time frame energy value is obtained. The method can more intuitively represent the relative strength of the audio signal and is suitable for energy normalization processing under different equipment and environments.

Whichever way is adopted to calculate the energy value, the obtained time frame energy value can be used for noise reduction treatment such as subsequent threshold comparison, dynamic threshold adjustment and the like so as to improve the definition and stability of the voice signal.
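The two energy calculations can be sketched as follows. The sum of squares follows the description directly; for the decibel route, the description only mentions a preset mapping relation, so the standard power conversion 10^(dB/10) used here is an assumption, as are the function names and the `reference_power` and `floor_db` parameters.

```python
import math
from typing import Sequence


def frame_energy(frame: Sequence[float]) -> float:
    """Time domain energy: sum of squared sample amplitudes."""
    return sum(s * s for s in frame)


def db_to_power(level_db: float, reference_power: float = 1.0) -> float:
    """Assumed mapping from a decibel level to linear power."""
    return reference_power * 10.0 ** (level_db / 10.0)


def power_to_db(power: float, reference_power: float = 1.0,
                floor_db: float = -120.0) -> float:
    """Inverse mapping, clamped at floor_db so that silence stays finite."""
    if power <= 0.0:
        return floor_db
    return max(10.0 * math.log10(power / reference_power), floor_db)


energy = frame_energy([1.0, -2.0, 3.0])  # 1 + 4 + 9
```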

In some embodiments, the energy value is compared to a set initial threshold, if the energy value is below the initial threshold, it is marked as a noise frame or a low priority frame, and if the energy value is above the initial threshold, it is retained as a speech frame to be further processed.

Specifically, the system sets a fixed initial threshold based on historical data, hardware characteristics, or factory calibration results. The initial threshold is used to measure whether the energy level of a time frame is sufficient to indicate that the frame contains a valid speech signal. The threshold may be adjusted according to factors such as the microphone sensitivity and background noise level of different devices, so as to adapt to various hardware environments.

In some examples, after the system calculates the energy value for a certain time frame, the energy value is compared to a set initial threshold:

If the energy value is below the initial fixed threshold, it is considered that the time frame may contain only ambient noise or silence signals, and the frame is therefore marked as a noise frame or a low priority frame. The marked frames may be discarded directly or used as alternative data for subsequent analysis to reduce interference of the inactive signal with voice communications.

If the energy value is above the initial fixed threshold, it is considered that the time frame may contain valid speech signals, and thus the frame is retained and further processed as a candidate speech frame, e.g., dynamic thresholding, frequency domain analysis, etc., to improve the accuracy of speech recognition.

Furthermore, to adapt a fixed threshold to different use scenarios, the system may adapt the threshold for different audio devices or user environments. For example, in a noisy internet cafe environment, the initial threshold may be set slightly higher to avoid misrecognizing the background noise as a speech signal, while in a quiet home environment, the initial threshold may be set lower to ensure that weak speech signals are also captured effectively.

By means of the embodiment, the method and the device can effectively distinguish the noise frame from the voice frame based on the fixed threshold, provide a basis for subsequent dynamic threshold adjustment and advanced noise reduction processing, and ensure the definition and stability of the voice signal.
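The initial threshold comparison can be sketched as follows. The label strings and the function name are hypothetical, and because the description does not say how a frame exactly equal to the threshold is handled, this sketch treats it as noise.

```python
from typing import List, Sequence

NOISE = "noise_or_low_priority"
CANDIDATE = "candidate_speech"


def label_by_initial_threshold(energies: Sequence[float],
                               initial_threshold: float) -> List[str]:
    """Label each frame energy against the fixed initial threshold."""
    # Assumption: an energy equal to the threshold is labeled as noise.
    return [CANDIDATE if e > initial_threshold else NOISE for e in energies]


labels = label_by_initial_threshold([0.2, 5.0, 1.0, 7.5], initial_threshold=1.0)
```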

In some embodiments, when it is detected that the continuous time frame with the energy value lower than the initial threshold exceeds the preset time period, the statistics and smoothing processing are performed on the energy value in the preset time period to obtain an environmental noise reference value, including:

When consecutive time frames whose energy values are lower than the initial threshold are detected to last longer than a preset time period, judging that the player is temporarily not speaking, taking the consecutive time frames within the preset time period as environmental noise frames, and collecting the energy values of the environmental noise frames;

Performing statistics on the energy values of the environmental noise frames, and using a smoothing algorithm to continuously accumulate or average the noise energy values over different time periods to obtain an environmental noise reference value representing the environmental noise level.

Specifically, the system continuously monitors the energy value of the audio signal and determines whether the energy level of the time frame is below a preset initial threshold. When it is detected that the energy value for a number of consecutive time frames is below the initial threshold and the duration exceeds a preset time period (e.g., 500ms or 1 s), the system may assume that the player is not speaking during that time period, and that these low energy time frames consist primarily of ambient noise. These successive time frames are therefore marked as ambient noise frames and their energy values are collected for subsequent calculations.

Next, the system performs statistics and processing on the acquired ambient noise frame energy values to obtain a more stable ambient noise reference value. In the statistical process, the system can adopt the following method:

Sliding window calculation: the ambient noise frame energy values over a recent period of time (e.g., 1 second) are stored in a buffer, and the mean or median of these values is calculated to smooth out short-term fluctuations in the noise data.

Weighted filtering: different weights are assigned to noise data from different time periods, for example higher weights for the most recent noise frames, to improve the response to noise variations.

Exponential smoothing: the historical environmental noise value is updated recursively so that newer data has a greater influence on the reference value, which enhances the adaptability of the system to changes in noise level.

By the method, the system can dynamically estimate the overall level of the environmental noise, and a more stable and accurate environmental noise reference value is obtained. The reference value can be used for subsequent dynamic threshold calculation, so that the system can adjust the voice detection standard according to the real-time noise level, thereby improving the stability and definition of game voice communication.
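The quiet-run detection and smoothing described above can be sketched as follows. This sketch seeds the reference with the mean of the first qualifying quiet run and then applies exponential smoothing; this combination, the class name `NoiseReferenceEstimator` and the parameter values (25 frames of 20 ms for a 500 ms period, `alpha` of 0.1) are assumptions for illustration.

```python
from collections import deque
from typing import Deque, Optional


class NoiseReferenceEstimator:
    """Track an ambient noise reference from sustained runs of low-energy frames."""

    def __init__(self, initial_threshold: float, min_quiet_frames: int = 25,
                 alpha: float = 0.1) -> None:
        self.initial_threshold = initial_threshold
        self.min_quiet_frames = min_quiet_frames  # e.g. 25 x 20 ms = 500 ms
        self.alpha = alpha                        # weight given to the newest frame
        self.quiet_run: Deque[float] = deque(maxlen=min_quiet_frames)
        self.reference: Optional[float] = None

    def update(self, energy: float) -> Optional[float]:
        """Feed one frame energy; return the current reference (None until known)."""
        if energy >= self.initial_threshold:
            self.quiet_run.clear()                # the run of quiet frames is broken
            return self.reference
        self.quiet_run.append(energy)
        if len(self.quiet_run) < self.min_quiet_frames:
            return self.reference                 # quiet, but not yet long enough
        if self.reference is None:
            # First estimate: mean over the buffered quiet run.
            self.reference = sum(self.quiet_run) / len(self.quiet_run)
        else:
            # Exponential smoothing: newer frames weigh more than older ones.
            self.reference += self.alpha * (energy - self.reference)
        return self.reference


est = NoiseReferenceEstimator(initial_threshold=1.0, min_quiet_frames=3, alpha=0.5)
history = [est.update(e) for e in (0.2, 0.2, 0.2, 0.4, 5.0, 0.1)]
```

In this hypothetical run the reference first appears on the third quiet frame, moves halfway toward the next quiet frame, and stays unchanged both when a loud frame arrives and while the following quiet run is still too short.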

In addition, the embodiment is applicable to different use environments. For example, in a quiet home environment, the system detects a low ambient noise reference value, so the subsequent speech detection threshold is automatically lowered to ensure that low-volume speech is still correctly recognized. In a noisy internet cafe or outdoor environment, the environmental noise reference value is higher, and the system correspondingly raises the speech detection threshold to reduce the interference of background noise.

In some embodiments, calculating the dynamic threshold based on the ambient noise reference value and a preset adjustment factor, updating the dynamic threshold in real time as the ambient noise reference value changes over time, includes:

According to the environmental noise reference value and in combination with a preset adjusting factor, calculating a dynamic threshold value for distinguishing a noise frame from a voice frame by utilizing a predefined algorithm or mapping rule;

When the change of the environmental noise reference value along with time is detected, the output of the algorithm or the mapping rule is adaptively adjusted according to the adjusting factor, and the dynamic threshold is controlled to be synchronously adjusted up or down so as to enable the updated dynamic threshold to be matched with the environmental noise level fluctuating in real time.

In particular, the system calculates an ambient noise reference value that is used to characterize the background noise level in the current environment using the methods described above. When the environment noise is high, the reference value is large, and when the environment is quiet, the reference value is small. The system combines the reference value and a preset adjusting factor to calculate a dynamic threshold value, wherein the threshold value is used for distinguishing noise frames from voice frames.

When the dynamic threshold is calculated, the system utilizes a preset mapping rule or algorithm model to enable the dynamic threshold to adapt to different environmental noise levels. For example, the system may employ a nonlinear mapping function that calculates a dynamic threshold that is adapted to the current noise level based on the ambient noise reference value. When the environmental noise level changes, the system can adjust the calculated dynamic threshold according to the setting of the adjusting factor, so that the dynamic threshold keeps smooth change, and erroneous judgment caused by too fast noise fluctuation is avoided.

When it is detected that the environmental noise reference value rises over time (e.g., the player's environment becomes noisier, such as in an internet cafe or a station), the system synchronously raises the dynamic threshold to reduce false detection of low-energy noise and to ensure that only voice signals of sufficient strength pass through the noise reduction system, thereby improving voice quality and preventing environmental noise from interfering with normal communication.

Conversely, when a decrease in the environmental noise reference value over time is detected (e.g., the player moves from a noisy environment into a quiet room, or the background noise subsides), the system automatically lowers the dynamic threshold to enhance the capture of low-volume speech, ensuring that valid speech signals are still accurately recognized even when the player speaks softly or is picked up from a distance.

Furthermore, to avoid frequent changes in the dynamic threshold due to short-term noise fluctuations, the system may set a minimum adjustment step size or a smooth adjustment strategy. For example, when the environmental noise variation amplitude is small, the system may slowly adjust the dynamic threshold to avoid excessive response of the voice detection system to brief noise fluctuations. If the environmental noise changes drastically (e.g., suddenly enters a high noise location), the system can quickly adjust the dynamic threshold to accommodate the new noise environment and improve the intelligibility of the speech signal.

According to the embodiment, the system can adaptively adjust the dynamic threshold based on the real-time change of the environmental noise, so that stable and high-quality game voice communication experience can be provided under different environments.
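By way of a non-limiting sketch, the nonlinear mapping and the slow/fast adjustment strategy described above could be arranged as follows. The class name `DynamicThreshold`, the power-law mapping, and every numeric default are assumptions made for illustration; the embodiment does not prescribe a particular function or rate.

```python
from dataclasses import dataclass


@dataclass
class DynamicThreshold:
    """Hypothetical helper: maps a noise reference to a smoothed threshold."""
    adjust_factor: float = 2.0   # assumed multiplier over the noise reference
    exponent: float = 0.9        # assumed nonlinearity of the mapping
    slow_rate: float = 0.05      # smoothing weight for small noise changes
    fast_rate: float = 0.5       # smoothing weight for drastic noise changes
    jump_ratio: float = 2.0      # target/current ratio treated as "drastic"
    value: float = 0.0           # current dynamic threshold (0 = not yet set)

    def target(self, noise_ref: float) -> float:
        # Assumed nonlinear mapping: factor * noise_ref ** exponent.
        return self.adjust_factor * max(noise_ref, 0.0) ** self.exponent

    def update(self, noise_ref: float) -> float:
        target = self.target(noise_ref)
        if self.value <= 0.0:
            # First call: adopt the mapped value directly.
            self.value = target
            return self.value
        larger = max(target, self.value)
        smaller = max(min(target, self.value), 1e-12)
        # Small changes are followed slowly, drastic changes quickly.
        rate = self.fast_rate if larger / smaller >= self.jump_ratio else self.slow_rate
        self.value += rate * (target - self.value)
        return self.value
```

With these assumed defaults, a slight rise in the noise reference moves the threshold only a little per update, while a tenfold rise moves it halfway to the new target in a single update.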

In some embodiments, the energy value is compared to a dynamic threshold, and if the energy value is below the dynamic threshold, the corresponding time frame is filtered as a noise frame, and if the energy value is above the dynamic threshold, the corresponding time frame is retained as a valid speech frame.

Specifically, the system calculates the energy value of each time frame using the method described above and obtains the dynamic threshold for the current environment. When the energy value of a time frame is below the dynamic threshold, the system considers the frame to consist mainly of ambient noise, marks it as a noise frame or an invalid frame, and filters it out to reduce noise interference. Conversely, if the energy value of the time frame is above the dynamic threshold, the system considers that the frame may contain a speech signal and processes it further as a candidate speech frame.

In some examples, to improve the accuracy of the determination, the present embodiment may also determine the frame processing policy in conjunction with a fixed threshold and a dynamic threshold. For example:

In the first case, the energy value is below the fixed threshold; the frame is directly judged to be a noise frame and is filtered out.

In the second case, the energy value is above the dynamic threshold; the frame is directly judged to be a valid voice frame and is retained for subsequent encoding and transmission.

In the third case, the energy value lies between the fixed threshold and the dynamic threshold, so the frame may contain weak voice or background noise. The system may then perform frequency domain analysis, for example by extracting the spectral characteristics of the frame through a Fast Fourier Transform (FFT) and determining whether the frame contains the main frequency components of a voice signal (e.g., 300 Hz-3400 Hz). If the spectral characteristics of the frame conform to speech characteristics, the frame is retained; otherwise it is filtered as a noise frame.
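The three cases above can be sketched as follows, as an illustration only. The function names, the use of mean squared amplitude as the energy value, the `min_ratio` criterion for "conforms to speech characteristics", and the plain DFT used in place of an optimized FFT are all assumptions; the sketch also assumes the fixed threshold does not exceed the dynamic threshold.

```python
import cmath
import math


def frame_energy(frame):
    """Mean squared amplitude of one frame (an assumed energy measure)."""
    return sum(s * s for s in frame) / len(frame)


def speech_band_ratio(frame, sample_rate, lo=300.0, hi=3400.0):
    """Fraction of non-DC spectral power that falls inside [lo, hi] Hz."""
    n = len(frame)
    total = in_band = 0.0
    for k in range(1, n // 2 + 1):
        # Direct DFT of bin k; an FFT would give the same values faster.
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        power = abs(acc) ** 2
        total += power
        if lo <= k * sample_rate / n <= hi:
            in_band += power
    return in_band / total if total > 0.0 else 0.0


def classify_frame(frame, sample_rate, fixed_thr, dynamic_thr, min_ratio=0.6):
    """Return "noise" or "speech" following the three cases in the text."""
    energy = frame_energy(frame)
    if energy < fixed_thr:
        return "noise"    # case 1: below the fixed threshold
    if energy > dynamic_thr:
        return "speech"   # case 2: above the dynamic threshold
    # Case 3: in between, so decide from the spectrum (min_ratio is assumed).
    ratio = speech_band_ratio(frame, sample_rate)
    return "speech" if ratio >= min_ratio else "noise"
```

Only frames that fall between the two thresholds pay the cost of the spectral check, which keeps the per-frame work low in the common cases.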

In addition, to reduce the impact of short-term noise fluctuations on speech detection, the system may perform a time-continuous analysis of the time frames. For example, if the energy value of a certain time frame is below the dynamic threshold, but the energy values of a plurality of adjacent time frames before and after the certain time frame are above the dynamic threshold, the system may infer that the frame may belong to a valid voice signal based on the consistency of the voice signal, thereby preserving the frame, rather than directly filtering it.
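As a hedged illustration of the time-continuity analysis, the sketch below keeps a below-threshold frame when enough of its neighbours on both sides are above the threshold. The function name `keep_by_continuity` and the `context` and `min_each_side` parameters are hypothetical; the embodiment does not fix how many adjacent frames are examined.

```python
def keep_by_continuity(above, context=2, min_each_side=2):
    """Return per-frame keep flags after a simple continuity check.

    above: list of bools, True where frame energy exceeds the dynamic threshold.
    A False frame is kept when at least `min_each_side` of the `context`
    frames before it AND after it are True.
    """
    kept = list(above)
    for i, flag in enumerate(above):
        if flag:
            continue
        # Decisions are based on the original flags, not on rescued frames.
        before = sum(above[max(0, i - context):i])
        after = sum(above[i + 1:i + 1 + context])
        if before >= min_each_side and after >= min_each_side:
            kept[i] = True
    return kept
```

Because the check looks at following frames, a real-time system using this approach would have to buffer `context` frames, which adds a corresponding delay.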

By the method of the embodiment, the system can accurately distinguish the voice frame from the noise frame under different noise environments, improve the intelligibility of voice signals, reduce the interference of background noise and ensure the definition and stability of game voice communication.

In some embodiments, after comparing the energy value to the dynamic threshold, the method further comprises:

performing frequency domain analysis on the effective voice frame, converting the time domain signal into frequency domain information, and if non-voice frequencies or narrowband noise concentrated in a specific frequency band are detected, applying filtering processing to the non-voice frequencies or narrowband noise;

setting corresponding multi-level energy thresholds in different frequency bands, and performing differentiated judgment on the energy values of each frequency band according to the characteristic that each noise type is concentrated in a specific frequency band;

and comprehensively analyzing the same audio frame or a plurality of continuous audio frames according to the short-time energy and the long-time energy, and respectively identifying and filtering instantaneous impulse noise and continuous background noise.

Specifically, in this embodiment, after comparing the energy value with the dynamic threshold, frequency domain analysis, multi-level energy threshold setting, and short-time/long-time energy joint judgment are further performed on the effective speech frame, so as to improve the recognition and filtering capabilities of the system for different types of noise, thereby optimizing the definition and stability of the game speech communication.

After the preliminary energy value screening is completed, a portion of the time frame may still contain complex ambient noise, such as electromagnetic interference, mechanical noise, or other non-speech frequency components. To further improve speech quality, the system performs a frequency domain analysis on these valid speech frames to detect and remove certain noise components.

Specifically, the system first performs a Fast Fourier Transform (FFT) on each valid speech frame, converting the time domain signal to a frequency domain signal, and analyzes its spectral distribution. Game speech is usually concentrated in the speech frequency band of 300 Hz-3400 Hz. If non-speech frequencies with significantly concentrated energy are detected (such as 50 Hz power supply noise, keyboard typing sounds, or wind noise), the system can process them using the following methods:

Band-reject filter (notch filter): for stable narrowband noise (e.g., power supply noise), a notch filter removes the specific frequency components while preserving the integrity of the speech signal as far as possible.

Adaptive filtering: if the noise frequency is detected to change over time, an adaptive filter may be used to adjust the noise suppression strategy in real time to accommodate different ambient noise patterns.

Through the frequency domain conversion and filtering processing, non-voice components in the voice signal can be effectively removed, and the purity of the final voice signal is improved.
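As one possible illustration of the notch filter, the sketch below uses a second-order (biquad) notch in the widely used "Audio EQ Cookbook" form. The choice of a biquad, the quality factor `q`, and the function names are assumptions; the embodiment does not specify a filter design.

```python
import math


def notch_coefficients(freq_hz, sample_rate, q=10.0):
    """Normalized biquad notch coefficients centered at freq_hz."""
    w0 = 2.0 * math.pi * freq_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0)   # feed-forward taps
    a = (-2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)   # feedback taps
    return b, a


def apply_biquad(samples, b, a):
    """Run a direct-form I biquad over a list of samples."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out
```

For example, `notch_coefficients(50.0, 16000)` targets 50 Hz power supply noise at an assumed 16 kHz sampling rate; with the assumed `q`, the rejected band is only a few hertz wide, so the 300 Hz-3400 Hz speech band is left essentially untouched.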

Further, different types of noise tend to have different spectral distribution characteristics, such as:

Wind noise is usually concentrated in a lower frequency band (20 Hz-300 Hz);

mechanical noise (e.g., fan noise, road noise) may be distributed over a plurality of narrower frequency bands;

Keyboard typing sounds typically include higher-frequency components (1000 Hz-5000 Hz);

Pops may cover a wide spectral range.

In some examples, to more accurately reject these noise, the system sets different energy thresholds at different frequency bands and makes a differential determination for a particular noise type. For example:

In the low-frequency region below 300 Hz, if the energy value exceeds a set low-frequency threshold, wind noise may be present, and the gain of the low-frequency signal may be reduced or low-frequency filtering may be applied directly.

In the high-frequency region of 1000 Hz-5000 Hz, if the energy value is too high and discontinuous short pulse signals appear, the frame is judged to contain keyboard typing sounds and noise reduction is applied.

In the mid-frequency region (300 Hz-3400 Hz), if the energy value conforms to the characteristics of the voice frequency band, the signal can be identified as a voice signal and is retained preferentially.

By setting different energy thresholds through the frequency division, the system can carry out finer classification processing on various noises, and the accuracy of noise filtering is improved.
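A minimal sketch of the per-band comparison follows. The band edges are taken from the text (with 20 Hz assumed as the lower edge of the low band), while the names `BANDS`, `band_energies` and `exceeded_bands`, the normalization of the power, and the plain DFT are assumptions. The sketch only reports which bands exceed their thresholds; the additional short-pulse check for keyboard sounds is not shown.

```python
import cmath
import math

# Band edges in Hz; the mid and high bands overlap, as in the text.
BANDS = {"low": (20.0, 300.0), "mid": (300.0, 3400.0), "high": (1000.0, 5000.0)}


def band_energies(frame, sample_rate, bands=BANDS):
    """Sum normalized spectral power per band using a direct DFT."""
    n = len(frame)
    energies = {name: 0.0 for name in bands}
    for k in range(1, n // 2 + 1):
        freq = k * sample_rate / n
        wanted = [name for name, (lo, hi) in bands.items() if lo <= freq < hi]
        if not wanted:
            continue  # skip bins that belong to no band
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        power = abs(acc) ** 2 / (n * n)
        for name in wanted:
            energies[name] += power
    return energies


def exceeded_bands(energies, thresholds):
    """Return the set of band names whose energy exceeds its own threshold."""
    return {name for name, value in energies.items() if value > thresholds[name]}
```

A caller could treat an exceeded low band as a wind noise candidate and an exceeded mid band as speech-like, while an exceeded high band would still need the pulse-shape check described above before being treated as keyboard noise.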

Further, to enhance the stability of the speech signal, the system combines short-time energy analysis with long-time energy analysis to identify different types of noise patterns and process them accordingly.

Short-time energy analysis is suitable for detecting sudden noise, such as keyboard typing sounds and pops. Such noise typically appears as a sharp rise in energy over a short period, with a short duration. When the system detects that the energy value of a time frame is significantly higher than that of the preceding and following frames and does not match the duration characteristics of a normal voice signal, the system can judge that the frame may contain sudden noise and apply additional noise reduction.

Long-time energy analysis is suitable for detecting persistent noise, such as air conditioning noise, fan noise or background voices. The system calculates the smoothed average energy of a plurality of time frames to determine whether a certain noise is continuously present. When the energy value of a frequency band stays at a high level for a long time with little variation, the system can judge that continuous noise exists in that band and adopt a corresponding noise reduction strategy, such as band attenuation or adaptive noise suppression.

Furthermore, the system can combine the analysis results of short-time energy and long-time energy to form a more robust filtering mechanism. For example:

If the short-time energy fluctuation of a certain time frame is large, but the long-time energy is stable, burst noise is likely, and impulse noise suppression can be performed.

If the short-time energy and the long-time energy of a certain time frame are high, the background noise may be enhanced, and the dynamic threshold may be raised to reduce noise interference.
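The two combination rules above might be sketched as follows. The class name `ShortLongEnergy`, the exponential moving average used as the long-time energy, and the `burst_ratio` and `high_level` parameters are assumptions introduced only for illustration.

```python
class ShortLongEnergy:
    """Hypothetical tracker that compares frame energy with a long-time average."""

    def __init__(self, alpha=0.05, burst_ratio=5.0, high_level=0.01):
        self.alpha = alpha              # smoothing weight of the long-time average
        self.burst_ratio = burst_ratio  # short/long ratio treated as a burst
        self.high_level = high_level    # energy level regarded as "high"
        self.long_energy = None         # long-time energy, unset until first frame

    def step(self, short_energy):
        """Return "impulse", "raise_threshold" or "normal" for one frame."""
        if self.long_energy is None:
            self.long_energy = short_energy
            return "normal"
        prev_long = self.long_energy
        # Update the long-time average with the new frame.
        self.long_energy = (1.0 - self.alpha) * prev_long + self.alpha * short_energy
        if short_energy > self.burst_ratio * prev_long:
            return "impulse"            # sudden jump over a stable background
        if short_energy > self.high_level and prev_long > self.high_level:
            return "raise_threshold"    # both short and long energy are high
        return "normal"
```

An "impulse" result would trigger impulse noise suppression for that frame, whereas "raise_threshold" would feed back into the dynamic threshold calculation.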

The frequency domain analysis, the multi-level energy threshold setting and the short-time/long-time energy joint judgment in the embodiment can be used together to realize more accurate noise suppression. For example, the system may:

firstly, carrying out dynamic threshold screening to eliminate obvious noise frames;

performing frequency domain analysis on the preliminarily screened voice frame to remove narrowband noise and specific frequency interference;

Performing refined filtration aiming at noise types of different frequency bands by using a multi-level energy threshold;

and finally determining whether a certain voice frame is reserved or not according to the combination of the short-time energy analysis and the long-time energy analysis.

Through the steps of the embodiment, the method can effectively reduce noise interference under different environments, improve the definition of game voice communication and enable players to obtain good voice interaction experience in noisy or quiet environments.

In some embodiments, the method further comprises:

Marking a noise frame and an effective voice frame based on voice and noise data acquired under different environments, and training a mapping relation between environmental noise characteristics and a threshold value by using a machine learning algorithm to obtain a corresponding machine learning model;

In actual operation, whenever a new environmental noise reference value is detected, inputting the environmental noise reference value into a machine learning model for inference, and outputting an optimized dynamic threshold value by combining the environmental noise reference value with a current dynamic threshold value calculated according to an adjustment factor;

and updating model parameters of the machine learning model in real time by utilizing the collected continuously-changed noise environment data.

Specifically, in this embodiment, in order to further improve the accuracy and environmental adaptability of the dynamic threshold, a mapping relationship between the voice signal and the noise feature is modeled by combining a machine learning model, and real-time inference and adaptive adjustment are performed by using the model in actual operation, so as to optimize the noise filtering effect and improve the stability and definition of game voice communication.

First, the system needs to build a noise analysis and dynamic threshold calculation model based on machine learning. To this end, the system collects voice and noise data in different environments, including but not limited to:

quiet home environments (low background noise);

noisy internet cafes, casino environments (high background noise);

outdoor environments such as parks or streets (complex dynamic noise);

rooms with echoes or reverberation (environments of different acoustic properties).

Further, during data collection, the system preprocesses the audio data and divides it into time frames. The time frames are then classified manually or by an automatic labeling system and marked as one of the following:

Noise frames: contain only ambient noise and no obvious speech component.

Valid voice frames: contain the player's voice and have a relatively high signal-to-noise ratio.

Suspicious frames: contain weak speech or mixed noise and require further analysis.

Further, after labeling is completed, the system trains the mapping relation between noise characteristics and threshold values by using a machine learning algorithm, so that the model can automatically predict the optimal dynamic threshold values according to the noise characteristics. The machine learning model applicable to the present embodiment includes:

Neural networks (RNN, CNN, Transformer): suitable for complex voice and noise pattern analysis; they can extract deep features and perform intelligent noise reduction adjustment.

Gaussian Mixture Model (GMM): suitable for probabilistic modeling of noise and speech signals and for calculating an optimal classification threshold.

Support Vector Machine (SVM): suitable for small-sample learning; it can achieve threshold optimization with a small amount of training data.

After training, the system obtains a set of model parameters, and the model can predict an optimal dynamic threshold based on the environmental noise reference value so as to adapt to different noise environments.

In actual operation, whenever a new ambient noise reference value is detected, the system inputs the reference value into a trained machine learning model and performs optimization in combination with the current dynamic threshold calculation method, and the specific flow is as follows:

1) Noise level detection:

The system continuously monitors the ambient noise level, calculates an ambient noise reference value, and determines whether the current dynamic threshold needs to be adjusted.

2) Model inference:

the current environmental noise reference value is input into a machine learning model, and the system automatically calculates an optimal dynamic threshold or threshold interval suitable for the current environment.

The inference process combines historical training data, so that the system can adjust the threshold value based on past experience and adapt to various complex environments.

3) Threshold fusion:

the optimal threshold calculated by the machine learning model is fused with the current dynamic threshold calculated based on the adjustment factor.

A weighted average or adaptive adjustment algorithm can be used to ensure that the final threshold meets the current noise environment and is not misjudged due to abrupt noise change.

4) Real-time application:

The finally calculated optimal dynamic threshold is applied to the game voice processing system, and subsequent time frames are filtered, so that the system can always operate under the optimal noise suppression condition.
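As a non-limiting sketch of the threshold fusion in step 3), a weighted average with an optional limit on the change per update could look like this. The function name `fuse_thresholds`, the `model_weight` parameter, and the `max_step` limit are assumptions; the embodiment only states that a weighted average or an adaptive adjustment algorithm can be used.

```python
def fuse_thresholds(model_thr, factor_thr, model_weight=0.5,
                    previous=None, max_step=None):
    """Blend the model-predicted and factor-based thresholds.

    model_thr:  threshold predicted by the machine learning model
    factor_thr: threshold calculated from the adjustment factor
    previous / max_step: if both are given, the result may differ from the
    previously applied threshold by at most max_step (an assumed safeguard
    against abrupt noise changes).
    """
    if not 0.0 <= model_weight <= 1.0:
        raise ValueError("model_weight must be within [0, 1]")
    fused = model_weight * model_thr + (1.0 - model_weight) * factor_thr
    if previous is not None and max_step is not None:
        fused = min(max(fused, previous - max_step), previous + max_step)
    return fused
```

For instance, with equal weights, a model threshold of 0.02 and a factor-based threshold of 0.04 fuse to about 0.03; if the previously applied threshold was 0.01 and `max_step` is 0.005, the applied value is limited to about 0.015.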

For example, when a player moves from a quiet room into a noisy internet cafe, the system detects a significant rise in the ambient noise level, and the machine learning model predicts a higher dynamic threshold to avoid misrecognizing ambient noise as a valid speech signal. When the player returns to a quiet environment, the system automatically lowers the dynamic threshold so that the voice is still recognized when the player speaks softly.

Further, in order to enable the system to adapt to different users and environments for a long time, an online learning mechanism is adopted to continuously optimize the performance of the machine learning model.

During the long-term operation of the system, new environmental noise samples are continuously collected and stored in a local or cloud database. These data can be used to extend the training set to enable the model to adapt to complex situations of different devices, different users, and different environments.

Through incremental training or federated learning, the model is gradually optimized without affecting the user's real-time experience. If the dynamic threshold calculation is found to perform poorly in certain environments (such as special echo environments or extreme noise environments), the system can adjust the adjustment factors or model parameters to improve adaptability.

The present embodiment may allow the user to manually adjust the voice detection sensitivity and communicate this adjustment data back to the server for optimization of the model. For example, if the user turns down the dynamic threshold multiple times in an internet cafe environment, the system may learn the habit and automatically decrease the threshold in a similar environment.
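The online learning mechanism can be illustrated with a deliberately simple stand-in: a linear model updated by one stochastic gradient step per new sample. The linear form, the class name `OnlineThresholdModel`, the learning rate, and the assumption that noise reference values are scaled to roughly the range 0 to 1 are all illustrative choices; the embodiment leaves the model type and the incremental training procedure open.

```python
class OnlineThresholdModel:
    """Hypothetical linear stand-in: threshold = weight * noise_ref + bias."""

    def __init__(self, weight=1.0, bias=0.0, learning_rate=0.1):
        self.weight = weight
        self.bias = bias
        self.learning_rate = learning_rate

    def predict(self, noise_ref):
        return self.weight * noise_ref + self.bias

    def update(self, noise_ref, target_threshold):
        """One gradient step on squared error for a single new sample.

        target_threshold could come from labeled data or from a threshold
        the user settled on manually (both are assumptions here).
        """
        error = self.predict(noise_ref) - target_threshold
        self.weight -= self.learning_rate * error * noise_ref
        self.bias -= self.learning_rate * error
        return error
```

Each call to `update` costs only a few arithmetic operations, so such a step could run alongside real-time processing; a neural network, GMM or SVM would need its own incremental training method.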

The embodiment realizes intelligent voice noise reduction optimization by combining a machine learning model and a dynamic threshold calculation method, and has the following advantages compared with the traditional fixed threshold method:

The traditional method calculates the threshold with a fixed formula and therefore adapts poorly to abrupt environmental changes, whereas the machine learning model can adaptively adjust the threshold according to historical data, improving the noise reduction effect in different environments.

The machine learning model can be trained based on a large amount of noise environment data, so that the calculated dynamic threshold value is more accurate, and erroneous judgment is reduced.

Through an online learning mechanism, the system can continuously optimize model parameters, so that the model parameters can keep high recognition accuracy for a long time.

The method is suitable for various complex environments, and ensures that players can obtain clear and stable voice communication experience in different scenes such as internet bars, outdoors and home.

By the method, the game voice noise reduction system can dynamically adjust the threshold value, intelligently adapt to the environment and continuously optimize the model, so that players can enjoy high-quality voice communication under different noise environments, and communication barriers caused by noise interference or misjudgment are avoided.

According to the technical scheme of the embodiment of the application, the application has at least the following advantages:

High adaptability: the dynamic threshold changes with the environmental noise level, so a stable and appropriate noise filtering effect can be obtained in both noisy and quiet environments.

High accuracy: through multistage filtering, frequency domain analysis and an optional machine-learning-assisted mechanism, different types of noise can be suppressed and filtered in a more targeted manner.

Strong real-time performance: the dynamic adjustment of the threshold can be completed on a time scale of milliseconds to seconds, which meets the real-time communication requirement of game voice and ensures a good interactive experience for players.

Wide applicability: the scheme can be integrated with hardware noise reduction modules (AEC, AGC and the like), multi-microphone beamforming and cloud machine learning platforms, and its range of application includes but is not limited to VR/AR games, mobile network voice chat, remote conference systems and the like.

The following are examples of the apparatus of the present application that may be used to perform the method embodiments of the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the method of the present application.

Fig. 2 is a schematic structural diagram of a dynamic threshold mechanism-based voice noise reduction device for a game team according to an embodiment of the present application. As shown in fig. 2, the dynamic threshold mechanism-based game team voice noise reduction device comprises:

an acquisition module 201 for acquiring an audio signal containing player's voice and environmental noise and dividing the audio signal into a plurality of time frames;

the conversion module 202 is configured to convert the amplitude or the decibel value of the time frame to obtain an energy value corresponding to each time frame;

A first comparing module 203, configured to compare the energy value with a set initial threshold, mark a noise frame or a low priority frame if the energy value is lower than the initial threshold, and reserve a speech frame to be further processed if the energy value is higher than the initial threshold;

The processing module 204 is configured to, when detecting that a continuous time frame with an energy value lower than an initial threshold exceeds a preset time period, perform statistics and smoothing on the energy value in the preset time period to obtain an ambient noise reference value;

the calculating module 205 is configured to calculate a dynamic threshold based on the environmental noise reference value and a preset adjustment factor, and update the dynamic threshold in real time when the environmental noise reference value changes with time;

A second comparing module 206, configured to compare the energy value with a dynamic threshold, filter the corresponding time frame as a noise frame if the energy value is lower than the dynamic threshold, and reserve the corresponding time frame as an effective speech frame if the energy value is higher than the dynamic threshold;

A generating module 207, configured to encode the time frame that remains after the dynamic threshold filtering, and generate a final game team voice signal.

In some embodiments, the acquisition module 201 of fig. 2 directly acquires single-channel audio data as an analog voice signal when a player uses a single microphone, acquires multiple channels of audio data in parallel when the player uses a headset or an external multi-microphone device, performs signal fusion or beam forming on the multiple channels of audio data to obtain an analog voice signal, converts the acquired analog voice signal into a digital signal, and quantizes the digital signal to obtain digital audio conforming to a predetermined audio coding format, performs preliminary noise reduction or gain processing on the digital audio by using an automatic gain control or acoustic echo cancellation module in a hardware level or an audio driver, and divides the preliminary noise reduction or gain processed digital audio into multiple time frames according to a preset time window so as to perform initial threshold comparison and dynamic threshold updating according to energy values of the time frames.

In some embodiments, the conversion module 202 of fig. 2 obtains the amplitude of the digital audio sampling points in each time frame, squares the amplitude of each digital audio sampling point in turn, and accumulates the obtained results to obtain a value for representing the time-domain energy of the time frame, or converts the audio signal corresponding to the time frame into a decibel value, and converts the decibel value into a value for representing the time-domain energy or power according to a preset mapping relationship.
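The two conversions performed by the conversion module can be sketched as follows. The function names and the reference energy are hypothetical, and the mapping energy = ref × 10^(dB/10) is the conventional power mapping, used here as an assumed example of the "preset mapping relationship".

```python
import math


def energy_from_samples(frame):
    """Sum of squared sample amplitudes for one time frame."""
    return sum(s * s for s in frame)


def energy_from_db(db_value, ref_energy=1.0):
    """Map a decibel value to energy, assuming energy = ref * 10^(dB/10)."""
    return ref_energy * 10.0 ** (db_value / 10.0)


def db_from_energy(energy, ref_energy=1.0, floor=1e-12):
    """Inverse mapping; `floor` avoids taking the logarithm of zero."""
    return 10.0 * math.log10(max(energy, floor) / ref_energy)
```

For example, `energy_from_samples([1, -2, 3])` gives 14, and a level of 10 dB relative to the reference corresponds to an energy of 10 under this assumed mapping.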

In some embodiments, the processing module 204 of fig. 2 determines that the player is not speaking temporarily when detecting that the continuous time frame with the energy value lower than the initial threshold exceeds the preset time period, takes the continuous time frame in the preset time period as an ambient noise frame, collects the energy value of the ambient noise frame, counts the energy value of the ambient noise frame, and continuously accumulates or calculates the average value of the noise energy values in different time periods by using a smoothing algorithm to obtain an ambient noise reference value for representing the ambient noise level.
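A minimal sketch of how the processing module might derive the ambient noise reference value is given below. The class name `NoiseReference`, the frame-count form of the preset time period, and the exponential smoothing weight `alpha` are assumptions; the embodiment allows any accumulation or averaging scheme.

```python
class NoiseReference:
    """Hypothetical estimator of the ambient noise reference value."""

    def __init__(self, initial_thr, min_quiet_frames=25, alpha=0.1):
        self.initial_thr = initial_thr            # initial (fixed) energy threshold
        self.min_quiet_frames = min_quiet_frames  # preset period, counted in frames
        self.alpha = alpha                        # smoothing weight for new estimates
        self.run = []                             # energies of the current quiet run
        self.value = None                         # current reference, None until known

    def push(self, energy):
        """Feed one frame energy and return the current reference value."""
        if energy >= self.initial_thr:
            # The player is probably speaking: discard the unfinished quiet run.
            self.run.clear()
            return self.value
        self.run.append(energy)
        if len(self.run) >= self.min_quiet_frames:
            mean = sum(self.run) / len(self.run)
            if self.value is None:
                self.value = mean
            else:
                # Smooth across successive quiet periods.
                self.value = (1.0 - self.alpha) * self.value + self.alpha * mean
            self.run.clear()
        return self.value
```

With an assumed 20 ms frame, the default of 25 frames corresponds to a quiet period of 0.5 s before the reference is updated.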

In some embodiments, the calculation module 205 of fig. 2 calculates a dynamic threshold for distinguishing a noise frame from a speech frame according to an ambient noise reference value and in combination with a preset adjustment factor by using a predefined algorithm or mapping rule, and when it is detected that the ambient noise reference value changes with time, adjusts the output of the algorithm or mapping rule adaptively according to the adjustment factor, and controls the dynamic threshold to be adjusted up or down synchronously so as to match the updated dynamic threshold with the ambient noise level fluctuating in real time.

In some embodiments, the analysis module 208 of fig. 2 performs frequency domain analysis on the valid voice frame, converts the time domain signal into frequency domain information, applies filtering processing to the non-voice frequency or the narrowband noise if the non-voice frequency or the narrowband noise concentrated in the specific frequency band is detected, sets corresponding multi-level energy thresholds in different frequency bands, performs differentiation judgment on energy values of each frequency band according to the concentrated distribution characteristics of the noise types in the specific frequency band, performs comprehensive analysis on the same audio frame or a plurality of continuous audio frames according to short-time energy and long-time energy, and respectively identifies and filters instantaneous impulse noise and continuous background noise.

In some embodiments, the training module 209 of fig. 2 marks noise frames and valid speech frames based on collected speech and noise data under different environments, trains the mapping relation between the environmental noise characteristics and the threshold value by using a machine learning algorithm to obtain a corresponding machine learning model, inputs the environmental noise reference value into the machine learning model to infer each time a new environmental noise reference value is detected during actual operation, and outputs an optimized dynamic threshold value in combination with a current dynamic threshold value calculated according to an adjustment factor, and updates model parameters of the machine learning model in real time by using the collected continuously-changed noise environment data.

It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and the sequence numbers should not limit the implementation of the embodiments of the present application in any way.

Fig. 3 is a schematic structural diagram of an electronic device 3 according to an embodiment of the present application. As shown in fig. 3, the electronic device 3 of this embodiment comprises a processor 301, a memory 302 and a computer program 303 stored in the memory 302 and executable on the processor 301. The steps of the various method embodiments described above are implemented when the processor 301 executes the computer program 303. Alternatively, the processor 301, when executing the computer program 303, performs the functions of the modules/units in the above-described device embodiments.

Illustratively, the computer program 303 may be partitioned into one or more modules/units, which are stored in the memory 302 and executed by the processor 301 to complete the present application. One or more of the modules/units may be a series of computer program instruction segments capable of performing a specific function for describing the execution of the computer program 303 in the electronic device 3.

The electronic device 3 may be an electronic device such as a desktop computer, a notebook computer, a palm computer, or a cloud server. The electronic device 3 may include, but is not limited to, a processor 301 and a memory 302. It will be appreciated by those skilled in the art that fig. 3 is merely an example of the electronic device 3 and does not constitute a limitation of the electronic device 3, and may include more or fewer components than shown, or may combine certain components, or different components, e.g., the electronic device may also include an input-output device, a network access device, a bus, etc.

The processor 301 may be a Central Processing Unit (CPU) or another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory 302 may be an internal storage unit of the electronic device 3, for example, a hard disk or a memory of the electronic device 3. The memory 302 may also be an external storage device of the electronic device 3, for example, a plug-in hard disk provided on the electronic device 3, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, or the like. Further, the memory 302 may also include both an internal storage unit and an external storage device of the electronic device 3. The memory 302 is used to store computer programs and other programs and data required by the electronic device. The memory 302 may also be used to temporarily store data that has been output or is to be output.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.

In the foregoing embodiments, each embodiment is described with its own emphasis. For parts that are not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of other embodiments.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the embodiments provided by the present application, it should be understood that the disclosed apparatus/computer device and method may be implemented in other manners. For example, the apparatus/computer device embodiments described above are merely illustrative: the division of modules or elements is merely a logical functional division, and other divisions are possible in actual implementation; for instance, multiple elements or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices, or units, and may be in electrical, mechanical, or other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the methods of the above embodiments may be implemented by a computer program instructing related hardware; the computer program may be stored in a computer-readable storage medium and, when executed by a processor, may implement the steps of each of the method embodiments described above. The computer program may comprise computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer program code, such as a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium.

The foregoing embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be substituted; such modifications or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application and shall be included in the protection scope of the present application.

Claims

1. A method for reducing game team voice noise based on a dynamic threshold mechanism, comprising:

Collecting an audio signal containing a player's voice and ambient noise, and dividing the audio signal into a plurality of time frames;

Converting the amplitude or decibel value of the time frame to obtain an energy value corresponding to each time frame;

Comparing the energy value with a set initial threshold, if the energy value is lower than the initial threshold, marking it as a noise frame or a low priority frame; if the energy value is higher than the initial threshold, retaining it as a speech frame to be further processed;

When it is detected that the continuous time frames in which the energy value is lower than the initial threshold value exceed a preset time period, recursively calculating the energy values of the continuous time frames using an exponential smoothing method to obtain an environmental noise reference value;

Calculating a dynamic threshold based on the ambient noise reference value and a preset adjustment factor, and updating the dynamic threshold in real time when the ambient noise reference value changes over time; wherein the system sets a minimum adjustment step or a smooth adjustment strategy, and when the ambient noise changes slightly, the system adjusts the dynamic threshold slowly, and if the ambient noise changes drastically, the system adjusts the dynamic threshold quickly;

Comparing the energy value with the dynamic threshold, if the energy value is lower than the dynamic threshold, filtering the corresponding time frame as a noise frame, and if the energy value is higher than the dynamic threshold, retaining the corresponding time frame as a valid speech frame; wherein, if the energy value of a time frame is between the initial threshold and the dynamic threshold, performing a fast Fourier transform to extract the spectrum characteristics of the time frame, and determining whether it contains the main frequency components of the speech signal; if the spectrum characteristics of the time frame meet the speech characteristics, retaining the time frame, otherwise filtering the time frame as a noise frame;

Encoding the time frames retained after filtering by the dynamic threshold to generate a final game team voice signal.
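As a non-limiting illustration of the two-stage comparison in claim 1, the following Python sketch classifies a frame by its energy and falls back to a spectral check when the energy lies between the two thresholds. It is a sketch under stated assumptions, not the claimed implementation: the function names, the 300 to 3400 Hz speech band, and the `min_speech_ratio` value are hypothetical, and a naive discrete Fourier transform stands in for the fast Fourier transform to keep the example dependency-free.

```python
import cmath
import math
from typing import Sequence


def band_energy_ratio(frame: Sequence[float], sample_rate: int,
                      low_hz: float = 300.0, high_hz: float = 3400.0) -> float:
    """Fraction of non-DC spectral power that falls inside [low_hz, high_hz].

    Uses a naive O(n^2) DFT; a real system would use an FFT routine instead.
    The band limits are assumed values for a typical speech band.
    """
    n = len(frame)
    total = 0.0
    in_band = 0.0
    for k in range(1, n // 2 + 1):  # skip the DC bin
        coeff = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power = abs(coeff) ** 2
        freq = k * sample_rate / n
        total += power
        if low_hz <= freq <= high_hz:
            in_band += power
    return in_band / total if total > 0 else 0.0


def classify_frame(energy: float, frame: Sequence[float], sample_rate: int,
                   initial_threshold: float, dynamic_threshold: float,
                   min_speech_ratio: float = 0.6) -> str:
    """Return "speech" or "noise" for one frame.

    Assumption: frames whose energy lies between the two thresholds are the
    ambiguous ones and are decided by the spectral check.
    """
    lower, upper = sorted((initial_threshold, dynamic_threshold))
    if energy < lower:
        return "noise"
    if energy > upper:
        return "speech"
    ratio = band_energy_ratio(frame, sample_rate, )
    return "speech" if ratio >= min_speech_ratio else "noise"
```

In this reading, a frame carrying a 1000 Hz tone with borderline energy is retained, while a 100 Hz hum of the same energy is filtered.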

2. The method according to claim 1, wherein collecting an audio signal containing a player's voice and ambient noise and dividing the audio signal into a plurality of time frames comprises:

When the player uses a single microphone, the single-channel audio data is directly obtained as an analog voice signal;

When players use headsets or external multi-microphone devices, multiple channels of audio data are acquired in parallel and signal fusion or beamforming is performed on the multiple channels of audio data to obtain analog voice signals;

Converting the collected analog voice signal into a digital signal and quantizing the digital signal to obtain digital audio that complies with a predetermined audio coding format;

At the hardware level or in the audio driver, using an automatic gain control or acoustic echo cancellation module to perform preliminary noise reduction or gain processing on the digital audio;

The digital audio after the preliminary noise reduction or gain processing is divided into multiple time frames according to a preset time window, so as to perform initial threshold comparison and dynamic threshold update according to the energy values of the time frames.
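As a non-limiting illustration of the framing step at the end of claim 2, the following Python sketch divides digital audio into time frames using a preset time window. The function name, the 20 ms default window, the non-overlapping layout, and the decision to drop a trailing partial frame are assumptions made for illustration only.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], sample_rate: int,
                      frame_ms: float = 20.0) -> List[List[float]]:
    """Split samples into consecutive, non-overlapping frames of frame_ms milliseconds.

    A trailing partial frame, if any, is dropped (an assumed policy).
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    return [list(samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, frame_len)]
```

For example, 50 samples at a 1000 Hz sampling rate with a 20 ms window yield two frames of 20 samples each, and the last 10 samples are discarded.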

3. The method according to claim 1, wherein converting the amplitude or decibel value of the time frame to obtain the energy value corresponding to each time frame comprises:

Obtaining the amplitude of the digital audio sampling point in each time frame, squaring the amplitude of each digital audio sampling point in turn and accumulating the obtained results to obtain a value used to represent the time domain energy of the time frame;

Alternatively, the audio signal corresponding to the time frame is converted into a decibel value, and the decibel value is converted into a numerical value for representing the energy or power of the time frame according to a preset mapping relationship.
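As a non-limiting illustration of the two alternatives in claim 3, the following Python sketch computes the time-domain energy of a frame as the sum of squared sample amplitudes, and converts a decibel value into a power value. The claim leaves the "preset mapping relationship" open; the standard relation power = reference × 10^(dB/10) is assumed here, and the function names are hypothetical.

```python
from typing import Sequence


def frame_energy(frame: Sequence[float]) -> float:
    """Time-domain energy: square each sample amplitude and accumulate."""
    return sum(sample * sample for sample in frame)


def db_to_power(db_value: float, reference_power: float = 1.0) -> float:
    """Map a decibel value to linear power, assuming the standard 10*log10 relation."""
    return reference_power * 10 ** (db_value / 10)
```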

4. The method according to claim 1, wherein when it is detected that the continuous time frames in which the energy value is lower than the initial threshold value exceed a preset time period, the energy values within the preset time period are statistically analyzed and smoothed to obtain an environmental noise reference value, comprising:

When it is detected that the continuous time frames in which the energy value is lower than the initial threshold value exceed the preset time period, it is determined that the player has not spoken, the continuous time frames within the preset time period are regarded as environmental noise frames, and the energy values of the environmental noise frames are collected;

The energy values of the environmental noise frames are statistically analyzed, and the noise energy values in different time periods are continuously accumulated or averaged using a smoothing algorithm to obtain an environmental noise reference value for characterizing the environmental noise level.
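As a non-limiting illustration of how the environmental noise reference value might be obtained, the following Python sketch counts consecutive frames below the initial threshold and, once the run exceeds a preset length, folds their energies into a running estimate by exponential smoothing, N_t = α·E_t + (1 − α)·N_{t−1}. The class name, the smoothing coefficient `alpha`, and the expression of the preset time period as a frame count are assumptions.

```python
from typing import List, Optional


class NoiseFloorTracker:
    """Tracks an environmental noise reference value from quiet frames."""

    def __init__(self, initial_threshold: float, min_silent_frames: int,
                 alpha: float = 0.1) -> None:
        self.initial_threshold = initial_threshold
        self.min_silent_frames = min_silent_frames  # preset period, in frames
        self.alpha = alpha                          # assumed smoothing coefficient
        self.noise_ref: Optional[float] = None
        self._run_len = 0                # length of the current quiet run
        self._pending: List[float] = []  # quiet energies not yet folded in

    def update(self, energy: float) -> Optional[float]:
        """Feed one frame energy; return the current noise reference (or None)."""
        if energy >= self.initial_threshold:
            # A louder frame ends the quiet run; pending frames are discarded.
            self._run_len = 0
            self._pending = []
            return self.noise_ref
        self._run_len += 1
        self._pending.append(energy)
        if self._run_len > self.min_silent_frames:
            # The player is taken to be silent: smooth in the quiet frames.
            for e in self._pending:
                if self.noise_ref is None:
                    self.noise_ref = e
                else:
                    self.noise_ref = self.alpha * e + (1 - self.alpha) * self.noise_ref
            self._pending = []
        return self.noise_ref
```

Short quiet gaps between words never reach the preset run length, so they do not disturb the reference value in this sketch.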

5. The method according to claim 1, wherein the calculating of the dynamic threshold based on the ambient noise reference value and a preset adjustment factor, and updating the dynamic threshold in real time when the ambient noise reference value changes over time, comprises:

Calculating a dynamic threshold for distinguishing noise frames from speech frames based on the environmental noise reference value and in combination with a preset adjustment factor using a predefined algorithm or mapping rule;

When it is monitored that the ambient noise reference value changes over time, the output of the algorithm or mapping rule is adaptively adjusted according to the adjustment factor, and the dynamic threshold is controlled to be synchronously increased or decreased so that the updated dynamic threshold matches the real-time fluctuating ambient noise level.
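As a non-limiting illustration of the dynamic threshold update, the following Python sketch takes the product of the noise reference value and the adjustment factor as the target threshold, and moves the current threshold toward that target slowly for small changes and quickly for drastic ones. The multiplicative mapping, the two rates, and the `jump_ratio` boundary between "slight" and "drastic" are all assumed values; the claims do not fix them.

```python
def update_dynamic_threshold(current: float, noise_ref: float,
                             factor: float = 2.0,
                             slow_rate: float = 0.05,
                             fast_rate: float = 0.5,
                             jump_ratio: float = 0.5) -> float:
    """Return the updated dynamic threshold.

    target = noise_ref * factor (assumed mapping rule). The threshold moves a
    fraction of the way toward the target: slow_rate when the relative change
    is small, fast_rate when it exceeds jump_ratio.
    """
    target = noise_ref * factor
    if current <= 0:
        return target  # no usable previous threshold: adopt the target directly
    relative_change = abs(target - current) / current
    rate = fast_rate if relative_change > jump_ratio else slow_rate
    return current + rate * (target - current)
```

With the defaults, a threshold of 1.0 and a new target of 1.1 yields 1.005, whereas a new target of 4.0 yields 2.5.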

6. The method according to claim 1, characterized in that after comparing the energy value with the dynamic threshold, the method further comprises:

Performing frequency domain analysis on the valid speech frame to convert the time domain signal into frequency domain information, and if non-speech frequencies or narrowband noise concentrated in a specific frequency band are detected, applying filtering processing to the non-speech frequencies or narrowband noise;

Setting corresponding multi-level energy thresholds in different frequency bands, and making differentiated judgments on the energy values of each frequency band based on the concentrated distribution characteristics of noise types in specific frequency bands;

Comprehensively analyzing the same audio frame or multiple consecutive audio frames based on short-term energy and long-term energy, and identifying and filtering instantaneous impulse noise and continuous background noise separately.
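As a non-limiting illustration of the short-term versus long-term energy analysis in claim 6, the following Python sketch compares each frame's energy with the mean energy of the preceding frames and flags sudden spikes as impulse noise. The function name, the window length, and the `impulse_ratio` multiplier are hypothetical, and the band-wise thresholds of the same claim are not covered here.

```python
from typing import List, Sequence


def label_noise_kinds(energies: Sequence[float], long_window: int = 10,
                      impulse_ratio: float = 4.0) -> List[str]:
    """Label each frame "impulse" or "steady".

    Short-term energy is the frame's own energy; long-term energy is the mean
    over up to long_window preceding frames. A frame far above the long-term
    level is treated as an instantaneous impulse.
    """
    labels: List[str] = []
    for i, energy in enumerate(energies):
        history = energies[max(0, i - long_window):i]
        if not history:
            labels.append("steady")  # nothing to compare against yet
            continue
        long_term = sum(history) / len(history)
        is_impulse = long_term > 0 and energy > impulse_ratio * long_term
        labels.append("impulse" if is_impulse else "steady")
    return labels
```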

7. The method according to claim 1, further comprising:

Based on the speech and noise data collected in different environments, the noise frames and valid speech frames are annotated, and the mapping relationship between environmental noise characteristics and thresholds is trained using a machine learning algorithm to obtain the corresponding machine learning model;

During actual operation, whenever a new ambient noise reference value is detected, the ambient noise reference value is input into the machine learning model for inference, and combined with the current dynamic threshold value calculated based on the adjustment factor to output an optimized dynamic threshold value;

The model parameters of the machine learning model are updated in real time using the collected continuously changing noise environment data.

8. A game team voice noise reduction device based on a dynamic threshold mechanism, characterized by comprising:

an acquisition module, configured to acquire an audio signal containing player voice and ambient noise, and divide the audio signal into multiple time frames;

a conversion module, configured to convert the amplitude or decibel value of the time frame to obtain an energy value corresponding to each time frame;

a first comparison module, configured to compare the energy value with a set initial threshold, and if the energy value is lower than the initial threshold, mark the frame as a noise frame or a low-priority frame; and if the energy value is higher than the initial threshold, retain the frame as a speech frame to be further processed;

a processing module, configured to, when detecting that the continuous time frames in which the energy value is lower than the initial threshold value exceed a preset time period, recursively calculate the energy values of the continuous time frames using an exponential smoothing method to obtain an environmental noise reference value;

a calculation module for calculating a dynamic threshold based on the ambient noise reference value and a preset adjustment factor, and updating the dynamic threshold in real time as the ambient noise reference value changes over time; wherein the system sets a minimum adjustment step size or a smooth adjustment strategy, and when the ambient noise changes slightly, the system adjusts the dynamic threshold slowly; if the ambient noise changes drastically, the system adjusts the dynamic threshold quickly;

a second comparison module, configured to compare the energy value with the dynamic threshold; if the energy value is lower than the dynamic threshold, the corresponding time frame is filtered as a noise frame; if the energy value is higher than the dynamic threshold, the corresponding time frame is retained as a valid speech frame; wherein, if the energy value of a time frame is between the initial threshold and the dynamic threshold, a fast Fourier transform is performed to extract the spectral features of the time frame, and determine whether it contains the main frequency components of the speech signal; if the spectral features of the time frame meet the speech features, the time frame is retained; otherwise, the time frame is filtered as a noise frame;

a generation module, configured to encode the time frames retained after filtering by the dynamic threshold to generate a final game team voice signal.

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium storing a computer program, wherein the computer program implements the steps of the method according to any one of claims 1 to 7 when executed by a processor.