Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The embodiments of this application provide a high-quality audio data processing method and system. The entities executing the method and system include the mechanical equipment that carries the system, a data processing platform, cloud server nodes, network upload devices, and the like, where the data processing platform comprises at least one of an audio/image management system, an information management system, and a cloud data management system.
Referring to fig. 1 to 4, the present invention provides a high-quality audio data processing method, which includes the following steps:
Step S1, identifying an audio signal to be processed, carrying out multi-time window segmentation processing and adaptive noise filtering on the audio signal to be processed to obtain a noise filtering optimized audio signal;
Step S2, performing multi-scale frequency spectrum decomposition on the noise filtering optimized audio signal, capturing transient audio details, and identifying transient audio detail information of each scale;
Step S3, calculating an audio frequency range according to the transient audio detail information of each scale, and carrying out dynamic full-band range adjustment to obtain transient information full-band optimized audio;
Step S4, carrying out environmental acoustic effect mining on the transient information full-band optimized audio, and carrying out current audio scene inference to obtain a real-time audio propagation scene;
Step S5, carrying out stereo field sound effect enhancement on the transient information full-band optimized audio according to the real-time audio propagation scene, so as to generate a stereo field enhanced audio signal;
Step S6, performing propagation delay optimization on the stereo field enhanced audio signal, and performing global distortion prediction optimization, so as to generate global adaptive distortion optimized audio.
Recognizing the audio signal ensures that the correct type of signal is processed, whether it is background music, speech, or another type of audio. By classifying the audio signal, different processing strategies can be adopted for different signal types. Multi-time window segmentation divides the audio signal into many short segments that are processed independently, which avoids the computational redundancy caused by overly long time-domain signals and allows noise to be identified and filtered more accurately within a local range. Adaptive noise filtering automatically adjusts the filter parameters according to the type and intensity of the noise, achieving more effective noise removal; compared with traditional static noise suppression, adaptive filtering copes better with complex, dynamically changing noise environments while preserving the original information of the signal to the greatest extent.

Multi-scale spectrum analysis decomposes the audio signal into different frequency bands and time scales. This not only extracts features in different frequency ranges but also captures transient audio details effectively, especially fine changes in the high-frequency part. The spectrum decomposition at each scale helps the system distinguish audio information at different frequencies and time periods accurately, ensuring processing precision. Transient audio refers to abrupt parts of the signal, such as drum hits or the attack of an instrument. Accurately capturing the transient portions ensures that these key details are preserved, making the audio more vivid and expressive; traditional audio processing often neglects transients, so adding this step significantly improves the dynamics and detail of the audio.

After the transients are captured, the frequency range of each transient is calculated to help determine which frequency regions need to be emphasized or attenuated. Accurate frequency analysis identifies and optimizes the spectral distribution of the audio, ensuring that every part of the spectrum is adjusted appropriately. Dynamic full-band adjustment is then performed based on the frequency range of the transients: the low-, mid-, and high-frequency parts of the signal are adjusted automatically according to the key information in each band, optimizing the sound quality of every band. This refined frequency adjustment improves the transparency and layering of the audio and allows every transient detail to be optimized across the full frequency range.

By mining the environmental acoustic effects in the audio (such as reverberation and reflections), the propagation environment of the audio signal can be understood and its influence on sound quality extracted. Extracting the environmental acoustic effect helps to understand the reflection paths and propagation characteristics of the sound and provides important information for the subsequent spatial sound enhancement.
Real-time scene inference is performed based on the environmental acoustic effect to understand the spatial propagation scene of the current audio. The sound of the same audio can differ greatly between scenes; for example, sound in a large space is relatively hollow and reverberant, while sound in a small space is more concentrated and clear. Inferring the audio scene allows the subsequent processing to make more intelligent adjustments. According to the real-time audio propagation scene, a spatial sound effect algorithm is then used to enhance the stereo field of the audio. This not only provides a richer stereo effect but also makes the audio sound more realistic and vivid by optimizing its spatial perception; with headphones or a surround sound system in particular, stereo field enhancement greatly improves the sense of immersion. Enhancing the stereo field improves the spatial impression of the audio: by simulating the spatial distribution of the sound, the localization of sound sources and the width of the sound field are strengthened, enriching the expressive layers of the audio.

During propagation, delay can displace different parts of the audio signal in time and degrade listening accuracy. Optimizing the propagation delay time-aligns the audio so that all signals stay synchronized during propagation and delay-induced distortion is eliminated. During long-term propagation, distortion may also occur because of equipment or environmental factors; a distortion prediction model monitors and identifies potentially distorted bands in real time and applies adaptive gain optimization. This distortion compensation ensures that the final audio has no obvious artifacts or unnatural effects, improving the precision and fidelity of the sound quality.
In the embodiment of the present invention, referring to fig. 1, a flow chart of steps of a method for processing high-quality audio data according to the present invention is shown, where in this example, the steps of the method for processing high-quality audio data include:
Step S1, identifying an audio signal to be processed, carrying out multi-time window segmentation processing and adaptive noise filtering on the audio signal to be processed to obtain a noise filtering optimized audio signal;
In this embodiment, a high-quality recording device is used to collect the audio signal to be processed. Selecting a suitable sampling rate (e.g., 44.1 kHz or 48 kHz) ensures the clarity and detail of the audio signal, and the recording environment is kept as quiet as possible to reduce the influence of background noise on subsequent processing. The collected audio is converted to a format suitable for processing, typically a lossless format such as WAV or FLAC, so that audio quality is not degraded by compression during processing. The audio signal to be processed is then identified using audio analysis software or libraries (e.g., librosa or Praat). The identified content includes the duration of the audio, its frequency range, and audio events (e.g., speech, music, noise). This step can be implemented through feature extraction, for example by computing mel-frequency cepstral coefficients (MFCCs) and the zero-crossing rate (ZCR). The identification results are recorded in a database and visualized to show the characteristics of the audio signal, providing information support so that subsequent processing can be targeted.

The audio signal is then segmented with multiple time windows, choosing an appropriate window length and overlap (e.g., 50% or 75%). This divides the long audio signal into many short-time segments, which is convenient for subsequent analysis and processing; the amount of window overlap can be determined experimentally to balance time-domain and frequency-domain resolution. A short-time Fourier transform (STFT) is applied to each segmented audio segment to calculate its spectrum; the STFT provides time-frequency information about the signal so that the subsequent noise filtering can be more accurate. An appropriate adaptive noise filtering algorithm is then selected, such as an adaptive filter based on the LMS or NLMS algorithm. The adaptive filter dynamically adjusts its parameters according to the characteristics of the input signal, eliminating noise in real time; the initial noise estimate can be obtained by analyzing the frequency characteristics of the segmented signal. The adaptive noise filter is applied to each short-time audio segment to analyze and remove background noise in real time, and the filtering effect is evaluated by comparing the signal-to-noise ratio (SNR) before and after filtering. A threshold (e.g., 10 dB) may be set to decide whether the filtering was successful. The filtered short-time segments are finally re-synthesized into a continuous noise filtering optimized audio signal, which is saved in a lossless format (e.g., WAV) to preserve audio quality; the filtering parameters and effects are recorded for subsequent analysis and optimization.
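As a minimal illustration of the overlapping windowing and adaptive filtering described above, the following Python sketch frames a signal into overlapping windows and applies an NLMS adaptive noise canceller; it assumes a separate noise-reference channel is available, and the window length, filter order, and step size are illustrative values rather than parameters prescribed by this embodiment.

import numpy as np

def frame_signal(x, win_len, hop):
    """Split a 1-D signal into overlapping time windows (e.g., 50% overlap)."""
    n_frames = 1 + max(0, (len(x) - win_len) // hop)
    return np.stack([x[i * hop:i * hop + win_len] for i in range(n_frames)])

def nlms_denoise(d, x_ref, order=32, mu=0.1, eps=1e-8):
    """NLMS adaptive noise canceller (assumed noise-reference setup).

    d     : noisy observation (clean signal + noise)
    x_ref : noise reference correlated with the noise in d
    Returns the error signal e, which approximates the clean signal.
    """
    w = np.zeros(order)
    e = np.zeros_like(d, dtype=float)
    for n in range(order, len(d)):
        x_vec = x_ref[n - order:n][::-1]
        y = w @ x_vec                                   # filter output (noise estimate)
        e[n] = d[n] - y                                 # cleaned sample
        w += mu * e[n] * x_vec / (eps + x_vec @ x_vec)  # normalized LMS update
    return e

def snr_db(clean, processed):
    """SNR comparison before/after filtering (requires a clean reference)."""
    noise = processed - clean
    return 10 * np.log10(np.sum(clean ** 2) / (np.sum(noise ** 2) + 1e-12))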
Step S2, performing multi-scale frequency spectrum decomposition on the noise filtering optimized audio signal, capturing transient audio details, and identifying transient audio detail information of each scale;
In this embodiment, the wavelet transform is used as the method of multi-scale spectral decomposition. The wavelet transform has good resolution in both the time and frequency domains and is well suited to capturing the transient characteristics of an audio signal. A suitable wavelet basis (e.g., a Daubechies wavelet) and decomposition depth (e.g., 4 or 5 levels) are selected to ensure that the details of the audio signal are adequately captured. The noise filtering optimized audio signal is normalized so that its amplitude lies within a reasonable range (e.g., -1 to 1), which improves the stability and effectiveness of the subsequent spectral decomposition. The wavelet transform is applied to the processed audio signal, decomposing it layer by layer into low-frequency components (approximation coefficients) and high-frequency components (detail coefficients) at different scales. Setting a suitable number of decomposition layers, for example 4, ensures that multi-level features of the audio signal are extracted. For each decomposition layer, the corresponding frequency range and time information are recorded; the low-frequency component of each layer typically contains the main features of the signal, while the high-frequency component contains transient and detail information. Analyzing these components reveals how the audio signal varies at different scales. Transient audio detail information is then extracted from the wavelet coefficients of each layer: for the high-frequency detail coefficients, small fluctuations are filtered out by setting a threshold (e.g., 0.1) so that attention focuses on clear transient changes, and an energy-detection method may be used to compute the energy of the detail coefficients and identify transient portions. The captured transients are analyzed further, including their duration, amplitude variation, and frequency characteristics; the start time, end time, and frequency distribution of each transient in the spectrum may be calculated to describe the transient audio details fully. The identified transient detail information is recorded in a database, and visual charts are generated showing the distribution of transient detail over time and frequency, providing data support for subsequent audio analysis and processing.
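The multi-scale decomposition and detail-coefficient thresholding described above could be sketched as follows, assuming the PyWavelets (pywt) package is used; the wavelet (db4), decomposition level (4), and threshold (0.1) mirror the example values in the text, and the per-scale summary returned here is only one possible way of recording transient detail information.

import numpy as np
import pywt

def wavelet_transient_details(audio, wavelet="db4", level=4, thresh=0.1):
    """Multi-scale wavelet decomposition; detail coefficients above a threshold
    are treated as candidate transient detail."""
    audio = audio / (np.max(np.abs(audio)) + 1e-12)      # normalize to roughly [-1, 1]
    coeffs = pywt.wavedec(audio, wavelet, level=level)   # [cA_level, cD_level, ..., cD1]
    approx, details = coeffs[0], coeffs[1:]
    transient_info = []
    for scale, d in enumerate(details, start=1):
        mask = np.abs(d) > thresh                        # suppress small fluctuations
        transient_info.append({
            "scale": scale,                              # 1 = coarsest detail band
            "n_transient_coeffs": int(mask.sum()),
            "detail_energy": float(np.sum(d[mask] ** 2)),
        })
    return approx, details, transient_info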
Step S3, calculating an audio frequency range according to the transient audio detail information of each scale, and carrying out dynamic full-band range adjustment to obtain transient information full-band optimized audio;
In this embodiment, the transient characteristics of each scale are summarized from the transient audio detail information extracted in step S2, including the frequency range, duration, and amplitude variation of each transient; the frequency range corresponding to each transient signal can be determined by calculating the high-frequency detail coefficients. For each transient signal, its main frequency components are calculated with a spectral analysis method. A suitable frequency resolution (e.g., 0.1 kHz) is set so that the frequency range of each transient can be identified accurately, and the upper and lower limits of the range are determined from the energy distribution of its spectrum. The parameters for dynamic full-band adjustment are then set according to the analyzed transient detail information; these parameters include gain adjustment and the enhancement and attenuation ratios of each band, and a dynamic gain control (DGC) method may be employed to adjust the gain values in real time based on the characteristics of the transient signals. The audio signal is divided into several frequency bands (e.g., low: 20-250 Hz, mid: 250-2000 Hz, high: 2000-20000 Hz) so that each band can be processed specifically; for the frequency ranges occupied by transient information, corresponding gain adjustment strategies are formulated to improve the audio quality of those bands.

During audio signal processing, the configured gain adjustment strategy is applied in real time: the gain of each band is adjusted dynamically to optimize the performance of the transient audio signal in that band. For example, for fast-changing transient signals, increasing the gain of the high band may be considered to enhance sharpness. Transient signals are processed with band-pass filters so that only audio information within the relevant frequency range is retained; appropriate cut-off frequencies are set so that the main components of the transient signal are not impaired, and the low and high bands may be attenuated moderately to avoid unnecessary noise interference. The dynamically adjusted band signals are re-synthesized to generate the transient information full-band optimized audio signal; during synthesis, the balance between bands is maintained so that no band is overly emphasized or masked. The generated audio is evaluated for quality: indicators such as the signal-to-noise ratio (SNR) and the uniformity of the spectrum provide an objective assessment, and a reasonable criterion (e.g., an SNR improvement of at least 5 dB) is set to judge the optimization effect. Spectrograms and waveform plots of the audio before and after optimization are generated and compared to analyze the effect; the visualizations should clearly show the change of the audio characteristics in each band so that the results of the optimization are easy to understand.
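A minimal sketch of the band splitting and per-band dynamic gain described above is given below; the 20-250 Hz, 250-2000 Hz, and 2000-20000 Hz bands follow the example in the text, while the filter order and the gain values in the usage comment are illustrative assumptions.

import numpy as np
from scipy.signal import butter, sosfiltfilt

BANDS = {"low": (20, 250), "mid": (250, 2000), "high": (2000, 20000)}

def split_bands(audio, sr):
    """Split the signal into the low/mid/high bands used in the text."""
    out = {}
    for name, (lo, hi) in BANDS.items():
        hi = min(hi, sr / 2 * 0.99)                  # keep band edge below Nyquist
        sos = butter(4, [lo, hi], btype="band", fs=sr, output="sos")
        out[name] = sosfiltfilt(sos, audio)
    return out

def dynamic_full_band_adjust(audio, sr, gains_db):
    """Recombine the bands with per-band gains (e.g., boost 'high' for sharp transients)."""
    bands = split_bands(audio, sr)
    mixed = sum(10 ** (gains_db.get(name, 0.0) / 20) * sig
                for name, sig in bands.items())
    return mixed / (np.max(np.abs(mixed)) + 1e-12)   # rescale to avoid clipping

# Illustrative usage: emphasize high-frequency transients by ~3 dB, tame the low end slightly
# optimized = dynamic_full_band_adjust(audio, 48000, {"high": 3.0, "low": -1.5})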
Step S4, carrying out environmental acoustic effect mining on the transient information full-band optimized audio, and carrying out current audio scene inference to obtain a real-time audio propagation scene;
In this embodiment, features are extracted from the transient information full-band optimized audio to analyze the environmental acoustic effect. Common features include spectral features (e.g., mel-frequency cepstral coefficients (MFCCs), short-time energy, and frequency distribution) and time-domain features (e.g., zero-crossing rate and transient amplitude). This can be implemented with a signal processing tool such as librosa. A suitable environmental acoustic model is then selected, such as a geometric acoustic model or a statistical acoustic model: geometric models describe sound propagation in the environment through the propagation paths and reflection characteristics of sound waves, while statistical models simulate acoustic effects through statistical properties, and the choice of model affects the accuracy of the subsequent analysis. The extracted audio features are analyzed to compute acoustic characteristics such as reverberation time (RT60), clarity (C80), and the Speech Transmission Index (STI); the reverberation time can be estimated by analyzing the decay of the signal, and clarity is computed from the distribution of signal energy across frequency bands. The geometry and surface materials of the environment are assessed by analyzing the attenuation and reflection characteristics of the audio signal at different frequencies. More accurate acoustic modeling can be performed with sound field simulation software (e.g., EASE or Odeon) to simulate how sound waves reflect and diffract off surfaces of different materials (e.g., concrete or wood). The key acoustic effect locations in the environment, such as reflecting surfaces, sound source positions, and receiving points, are marked according to the computed acoustic characteristics, and the distribution of acoustic effects is shown with visualization tools to help understand how the audio signal propagates in the environment. A real-time audio propagation scene model is then constructed from the extracted acoustic characteristics and environmental acoustic effects; it should include the layout of sound sources and receiving points, environmental characteristics, and acoustic propagation paths. An environment model can be generated with three-dimensional modeling software (e.g., SketchUp) to visualize the audio propagation scene. The real-time optimized audio signal is combined with the established acoustic environment model in the sound field simulation software to perform dynamic acoustic simulation: the positions of the sound source and receiving point are updated in real time, and the propagation of the audio signal under different conditions is observed, revealing the specific effects of the environment on the audio signal. The performance of the real-time audio propagation scene is evaluated from the simulation results; if audio quality degrades under certain conditions, the sound source position or environmental parameters (e.g., adding sound-absorbing material) can be adjusted to optimize the propagation effect, ensuring that the scene model remains dynamic and adaptable to different audio scenes.
The analyzed environmental acoustic properties, acoustic effects and real-time audio propagation scene model are recorded in a database for subsequent query and analysis. Meanwhile, the audio signal and the sound field data generated in the simulation process are saved, and a basis is provided for subsequent research. And generating a visual chart and a 3D model, and displaying the environmental acoustic effect and the audio transmission scene. The change in acoustic characteristics is intuitively exhibited by the waveform, spectrum, and sound field distribution diagram of the audio signal. This process can help understand the propagation characteristics of audio in complex environments, providing visual support for subsequent audio processing and applications.
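Assuming a measured or simulated impulse response of the environment is available (for example from the sound field simulation mentioned above), the reverberation time could be estimated along the following lines using Schroeder backward integration; the -5 dB to -25 dB fitting range (a T20 fit extrapolated to 60 dB) is an assumption for illustration, not a requirement of this embodiment.

import numpy as np

def rt60_from_impulse_response(ir, sr, decay_range=(-5.0, -25.0)):
    """Estimate RT60 from a room impulse response via Schroeder backward integration."""
    energy = ir.astype(float) ** 2
    edc = np.cumsum(energy[::-1])[::-1]               # energy decay curve
    edc_db = 10 * np.log10(edc / (edc[0] + 1e-12) + 1e-12)
    hi, lo = decay_range
    t = np.arange(len(ir)) / sr
    mask = (edc_db <= hi) & (edc_db >= lo)            # fit only the chosen decay segment
    if mask.sum() < 2:
        return float("nan")
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)   # decay slope in dB per second
    return -60.0 / slope                              # time to decay by 60 dB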
Step S5, carrying out stereo field sound effect enhancement on the transient information full-band optimized audio according to the real-time audio propagation scene, so as to generate a stereo field enhanced audio signal;
In this embodiment, the real-time audio propagation scene is analyzed to understand the position of the sound source and the layout of the receiving points. The parameters that can enhance the stereo field are determined by analyzing the distance, angle, and reflection characteristics between the sound source and the listener; this step may be carried out with sound field simulation software to ensure a comprehensive understanding of the environment. The goal of the stereo field enhancement is set according to the characteristics of the audio content (such as music, dialogue, or ambient sound): for music, the spatial perception and clarity of the stereo field are emphasized, while for dialogue, the clarity and localization of speech are prioritized. Specific target parameters are set, for example improving the spatial impression by 10% or raising speech clarity by 5 dB. The transient information full-band optimized audio is preprocessed so that the signal amplitude stays within a reasonable range (e.g., -1 to 1) and distortion is avoided in the subsequent processing; dynamic range compression (DRC) may be employed to balance the dynamic range of the audio signal so that its different parts are equalized.

According to the information of the real-time audio propagation scene, the audio signal is distributed to different positions in the stereo field with a virtual sound source localization technique: by setting the direction and distance of each virtual source (for example through the gain difference between the left and right channels), a more natural sense of space is achieved, and an HRTF (head-related transfer function) model can be used to simulate how the human ear perceives sound direction. Reverberation and delay effects are applied to enhance the stereo field experience: setting an appropriate reverberation time (e.g., 1 second) and pre-delay (e.g., 20 ms) gives the sound more spatial layering, and a suitable reverberation type (e.g., room or hall reverberation) is selected and adjusted according to the audio content. During processing, the quality and effect of the audio signal are monitored in real time, and the enhancement effect is evaluated by computing the signal-to-noise ratio (SNR) and spectral characteristics; if the audio in a particular band becomes too prominent or distorted, the gain and effect parameters can be adjusted dynamically to keep the audio natural and clear. The audio signals after virtual source localization and sound field enhancement are synthesized into the final stereo field enhanced audio signal; during synthesis, the left and right channels are balanced to avoid an uneven distribution of the audio. Final processing such as normalization and compression is applied to the synthesized signal so that its amplitude suits the playback device; a target output level (e.g., -3 dBFS) may be set to ensure that no distortion occurs during playback.
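As a rough, hedged stand-in for the HRTF-based virtual source localization described above, the sketch below places a mono source in the stereo field using constant-power gains plus a small inter-channel delay; the maximum delay value and the mapping from azimuth to gain are illustrative simplifications rather than part of the described method.

import numpy as np

def pan_virtual_source(mono, sr, azimuth_deg, max_itd_ms=0.6):
    """Constant-power panning with a crude inter-channel delay (not a full HRTF model)."""
    az = np.radians(np.clip(azimuth_deg, -90, 90))
    theta = (az + np.pi / 2) / 2                     # map [-90, 90] degrees to [0, pi/2]
    gain_l, gain_r = np.cos(theta), np.sin(theta)    # constant-power channel gains
    itd = int(abs(np.sin(az)) * max_itd_ms / 1000 * sr)  # delay applied to the far channel
    left, right = gain_l * mono, gain_r * mono
    pad = np.zeros(itd)
    if azimuth_deg > 0:                              # source on the right: delay the left channel
        left = np.concatenate([pad, left])[:len(mono)]
    elif azimuth_deg < 0:                            # source on the left: delay the right channel
        right = np.concatenate([pad, right])[:len(mono)]
    return np.stack([left, right], axis=0)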
Step S6, performing propagation delay optimization on the stereo field enhanced audio signal, and performing global distortion prediction optimization, so as to generate global adaptive distortion optimized audio.
In this embodiment, propagation delay analysis is performed on the stereo field enhanced audio signal. The time required for sound waves to propagate in the environment is obtained by computing the propagation delays of the signal at different frequencies, and a cross-correlation function can be used to measure the delay between signals and determine the time difference between the left and right channels. Appropriate delay parameters are then set according to the analysis results: for a stereo field, the delay is generally kept within 20 milliseconds to preserve the naturalness and spatial perception of the audio, so a reasonable delay range (e.g., 0-20 ms) is set for the subsequent optimization. For each channel, delay compensation is applied to the audio signal according to the set delay parameters. This can be implemented with a digital delay line to keep the left and right channels synchronized; by adjusting the time offset of each channel, the phase relationship of the audio signals stays consistent and phase interference is avoided. During delay compensation, the phase and amplitude of the signal are monitored in real time, and the effect of the compensation is evaluated by computing the phase difference and signal energy. If the delay of a particular frequency band is found to be too large, the delay parameters can be adjusted dynamically to maintain the clarity and spatial impression of the signal.

An appropriate global distortion prediction model is then selected; common choices include machine-learning regression models (e.g., support vector regression, SVR) or deep learning models (e.g., LSTM networks). These models analyze the distortion characteristics, predict the distortion level, and provide a basis for the subsequent optimization. Distortion-related features, such as spectral features, dynamic range, and transient features, are extracted from the stereo field enhanced audio signal, and a training dataset containing audio samples with known distortion levels is prepared, ensuring the diversity and representativeness of the feature vectors. The selected distortion prediction model is trained with the labeled dataset; cross-validation is used during training to evaluate accuracy and tune the model parameters so that the model generalizes well when predicting distortion levels. The stereo field enhanced audio signal is input to the trained model to generate audio distortion predictions, and the gain and filtering parameters for distortion compensation are set according to the prediction results to achieve global adaptive distortion optimization; a reasonable target distortion level is set (e.g., reducing total harmonic distortion, THD, to below 1%). Finally, the distortion-optimized audio signals are synthesized into the global adaptive distortion optimized audio signal. During synthesis, the balance between frequency bands is maintained so that no band becomes too prominent or distorted, and a dynamic equalizer may be used to fine-tune the bands.
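The cross-correlation measurement of the left/right delay might look like the following sketch; the plus/minus 20 ms search range follows the text, while the alignment strategy (delaying the leading channel by the measured lag) is one possible compensation choice among several.

import numpy as np
from scipy.signal import correlate

def interchannel_delay_ms(left, right, sr, max_lag_ms=20.0):
    """Cross-correlation delay estimate; positive result means the left channel lags the right."""
    max_lag = int(max_lag_ms / 1000 * sr)
    corr = correlate(left, right, mode="full")
    lags = np.arange(-len(right) + 1, len(left))
    keep = np.abs(lags) <= max_lag                   # restrict to the expected delay range
    best = lags[keep][np.argmax(corr[keep])]
    return best / sr * 1000.0

def align_channels(left, right, delay_ms, sr):
    """Delay the leading channel so both channels line up again."""
    d = int(round(abs(delay_ms) / 1000 * sr))
    if delay_ms > 0:                                 # left lags: delay the right channel
        right = np.concatenate([np.zeros(d), right])[:len(right)]
    elif delay_ms < 0:                               # right lags: delay the left channel
        left = np.concatenate([np.zeros(d), left])[:len(left)]
    return left, right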
In this embodiment, referring to fig. 2, a detailed implementation step flow chart of the step S1 is shown, and in this embodiment, the detailed implementation step of the step S1 includes:
Step S11, identifying an audio signal to be processed, and performing multi-time window segmentation processing on the audio signal to be processed to obtain audio signals of a plurality of time windows;
Step S12, carrying out section-by-section audio noise identification on the audio signals of the plurality of time windows, and marking the audio noise points of each time window;
Step S13, calculating the power spectral density of the audio noise points of each time window, and extracting the power spectral density of each noise point;
Step S14, performing noise classification according to the power spectral density of each noise point, so as to generate type characteristics of each noise point;
Step S15, performing adaptive noise filtering based on the type characteristics of each noise point to obtain the noise filtering optimized audio signal.
In this embodiment, suitable audio signal processing tools and libraries (e.g., librosa or pydub) are selected to identify the audio signal to be processed. The audio file is loaded, and its sampling rate, duration, and channel information are read for subsequent processing. The audio signal is then divided into a plurality of time windows: the duration of each window is set (e.g., 100 milliseconds) and an appropriate overlap rate (e.g., 50%) is chosen to ensure the continuity and accuracy of the analysis, and smoothing can be achieved with the window function used in a short-time Fourier transform (STFT). By traversing the sample data of the audio signal, the audio within each time window is extracted according to the set window size and overlap rate, and each window is stored as the basis for subsequent noise identification and analysis. The time windows obtained by segmentation are recorded in a data structure (such as a list or an array) for subsequent processing; each window contains the corresponding audio data and its timestamp, to be used in noise identification.

A suitable audio noise recognition algorithm is selected; noise points can typically be identified with a machine-learning classifier (e.g., a support vector machine or random forest) or a deep learning model such as a CNN. The audio signal of each time window is feature-extracted, typically using mel-frequency cepstral coefficients (MFCCs), audio energy, and zero-crossing rate; these features effectively reflect the characteristics of the audio signal and facilitate subsequent noise identification. The extracted features are input to the selected noise recognition model for inference, and the model outputs whether noise is present in each time window and the specific location of the noise (e.g., its start and end times). Based on the model output, the audio noise points of each time window are marked and recorded in a database; each noise point record contains its location within the time window and the corresponding noise signature for subsequent analysis.

The power spectral density is calculated with an appropriate method, typically the Welch method or the Fourier transform (FFT); the Welch method is generally preferred because of its better smoothing effect. For the audio noise points of each time window, the audio data within the window is extracted, ensuring it is long enough for effective spectral analysis and zero-padding it if necessary. The audio of each noise point is windowed using the Welch method and the power spectral density is computed in segments: the power spectrum of each segment is calculated with a chosen window width (e.g., 256 points) and overlap rate (e.g., 50%) and then averaged to obtain a smoothed power spectral density. The computed power spectral density of each noise point is recorded in a database, and visual charts are generated to show the spectral characteristics of each noise point; the visualization makes the frequency components of each noise point easy to analyze and provides a basis for subsequent classification. Finally, a suitable noise classification model is selected, typically a machine-learning classifier (e.g., KNN, a decision tree, or a deep learning model).
Classification is performed according to noise characteristics such as the power spectral density. The power spectral density features of each noise point are standardized so that the feature values fall within a similar range, improving classifier performance; Z-score normalization or min-max normalization may be used. The normalized power spectral density features are input to the selected noise classification model for inference, and the model outputs the class of each noise point, such as "machine sound," "human sound," or "ambient noise." The classification result of each noise point is recorded in a database and analyzed statistically; the distribution of the different noise types informs the subsequent processing and the optimization of the noise filtering strategy. An appropriate adaptive noise filter is then selected, typically a least-mean-squares (LMS) adaptive filter or an adaptive filter array, which filters according to the real-time noise characteristics. Different filtering parameters are set according to the type characteristics of each noise point; for example, for high-frequency noise (e.g., machine sound) a high-pass filter may be selected to reduce low-frequency components, while for low-frequency noise (e.g., ambient noise) a low-pass filter may be selected. The corresponding adaptive filter is applied to the audio signal of each time window, and the filter parameters are adjusted in real time according to the noise type characteristics to optimize the filtering effect. The noise-filtered audio signals are recorded in a database, and visual charts of the audio waveforms and spectra before and after filtering are generated to evaluate the filtering effect; the comparison makes the effectiveness of the filtering easy to analyze and provides a basis for subsequent processing.
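A possible sketch of the Welch power spectral density calculation for one marked noise segment, using the 256-point window and 50% overlap mentioned above, is shown below; the derived descriptors (spectral centroid and low-frequency energy share) are merely examples of features one could feed to a KNN or decision-tree noise classifier.

import numpy as np
from scipy.signal import welch

def noise_point_psd(audio, sr, start_s, end_s, nperseg=256):
    """Welch PSD of a marked noise segment (256-point segments, 50% overlap)."""
    seg = audio[int(start_s * sr):int(end_s * sr)]
    if len(seg) < nperseg:                           # zero-pad very short segments
        seg = np.pad(seg, (0, nperseg - len(seg)))
    freqs, psd = welch(seg, fs=sr, nperseg=nperseg, noverlap=nperseg // 2)
    return freqs, psd

def psd_features(freqs, psd):
    """Compact PSD descriptors as illustrative classifier inputs."""
    total = np.sum(psd) + 1e-12
    centroid = np.sum(freqs * psd) / total           # spectral centroid in Hz
    low_share = np.sum(psd[freqs < 500]) / total     # share of energy below 500 Hz
    return np.array([centroid, low_share])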
In this embodiment, referring to fig. 3, a detailed implementation step flow chart of the step S2 is shown, and in this embodiment, the detailed implementation step of the step S2 includes:
Step S21, performing multi-scale spectrum decomposition on the noise filtering optimized audio signal to obtain audio frequency spectrum data of different scales;
Step S22, carrying out frequency component distribution calculation on the audio frequency spectrum data of different scales to generate audio frequency component distribution characteristics of each scale;
Step S23, carrying out dynamic time-frequency resolution adjustment according to the audio frequency component distribution characteristics of each scale so as to generate the time-frequency resolution of each scale;
Step S24, capturing transient audio details of the audio frequency spectrum data of different scales based on the time-frequency resolution of each scale, and identifying the transient audio detail information of each scale.
In this embodiment, a suitable multi-scale spectral decomposition method is selected; commonly used methods include the wavelet transform and the short-time Fourier transform (STFT). Because of its localization in both time and frequency, the wavelet transform is generally better suited to processing non-stationary signals. The basis functions for the wavelet transform (e.g., Daubechies or Morlet wavelets) are determined; the choice of basis function affects the decomposition and is usually made from the signal characteristics, for example a Daubechies wavelet is suitable for signals with strong high-frequency components. The wavelet transform is applied to the optimized audio signal with a chosen number of decomposition layers (e.g., 5) to obtain audio spectrum data at different scales; each layer corresponds to a different frequency range, so the multiple frequency components of the signal are captured effectively. The audio spectrum data of each scale is recorded in a database, and visual charts are generated to show the spectral characteristics of the different scales, so that signal changes in different frequency ranges can be observed for subsequent analysis.

A suitable method for calculating the frequency component distribution is selected, typically a spectral statistical analysis such as computing the spectral energy distribution or spectral density for each scale. The audio spectrum data of each scale is analyzed and the energy value of each frequency component is calculated as E(f) = sum over t of |X(t, f)|^2, where E(f) is the energy value at frequency f and X(t, f) is the value of the spectrum at time t and frequency f. The frequency component distribution characteristics of each scale are generated from the calculated spectral energy; the frequency components of each scale can be normalized so that their total energy equals 1, making the scales comparable. The distribution characteristics of each scale are recorded in a database, and histograms or heat maps are generated with a visualization tool to show the frequency distributions of the different scales, which supports quantitative analysis of how each frequency component behaves at different scales.

An appropriate dynamic time-frequency resolution adjustment model is selected; the resolution is typically adapted to the signal characteristics (e.g., transient behavior and frequency content) using an adaptive window function or a dynamic window-width policy. The adjustment parameters are set from the frequency component distribution characteristics of each scale: for example, a wider window is used for low-frequency content to improve frequency resolution, while a narrower window is used for high-frequency, rapidly changing content to improve time resolution. The audio spectrum data of each scale is then adjusted dynamically with the set resolution parameters, which can be realized by tuning the parameters of the short-time Fourier transform or wavelet transform. The adjusted time-frequency resolutions are recorded in a database, and the resolution effect at different scales is evaluated; comparing resolutions across scales shows how the signal is represented in the time-frequency domain and provides a basis for subsequent processing.
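A compact sketch of the frequency component energy distribution E(f) introduced above, computed from an STFT, could look as follows; the 1024-point window is an assumed value, and the normalization makes the distribution sum to 1 as described.

import numpy as np
from scipy.signal import stft

def frequency_energy_distribution(x, sr, nperseg=1024):
    """E(f) = sum over t of |X(t, f)|^2, normalized so the distribution sums to 1."""
    freqs, _, X = stft(x, fs=sr, nperseg=nperseg)
    energy = np.sum(np.abs(X) ** 2, axis=1)          # sum the squared magnitude over time frames
    return freqs, energy / (np.sum(energy) + 1e-12)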
A suitable transient audio detail capture model is selected, typically using a transient detection algorithm (e.g., energy threshold detection, transient analysis, etc.) to identify transient features in the audio signal. Transient audio detail capture is performed on the audio spectral data for each scale. The signal is analyzed according to the adjusted time-frequency resolution to identify portions of significant energy variation in a short time, which typically contain transient features. Transient characteristics, such as transient start time, duration, amplitude, etc., in each scale audio signal are extracted. By analyzing these features, important details in the audio signal, such as percussion sounds, momentary changing sound effects, etc., can be identified. Recording the extracted transient audio detail information in a database, and displaying the distribution condition of transient characteristics through a visualization tool. This will facilitate subsequent analysis and processing, providing a basis for enhancement of the audio signal.
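A simple energy-threshold transient detector in the spirit of the capture step above might look like this; the frame size, hop size, and threshold factor are illustrative assumptions rather than values fixed by this embodiment.

import numpy as np

def detect_transients(x, sr, frame=512, hop=256, k=3.0):
    """Flag frames whose short-time energy rise exceeds k times the median rise."""
    n_frames = 1 + max(0, (len(x) - frame) // hop)
    energy = np.array([np.sum(x[i * hop:i * hop + frame] ** 2) for i in range(n_frames)])
    flux = np.maximum(np.diff(energy, prepend=energy[0]), 0.0)   # positive energy jumps only
    thresh = k * (np.median(flux) + 1e-12)
    onsets = np.flatnonzero(flux > thresh)
    return onsets * hop / sr, energy                 # onset times in seconds, frame energies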
In this embodiment, referring to fig. 4, a flowchart of a detailed implementation step of the step S3 is shown, where in this embodiment, the detailed implementation step of the step S3 includes:
Step S31, performing quantitative analysis of transient-detail audio fineness on the transient audio detail information of each scale to generate an audio detail fineness quantization value;
Step S32, carrying out scale-by-scale information clarity evaluation according to the transient audio detail information of each scale to obtain the information clarity of each scale;
Step S33, quantitatively analyzing the auditory effect according to the information clarity and the audio detail fineness quantization value of each scale, so as to generate the transient-detail auditory effect of each scale;
Step S34, calculating the audio frequency ranges of the audio frequency spectrum data of different scales, and extracting the audio frequency range of each scale;
Step S35, carrying out dynamic full-band range adjustment on the audio frequency range of each scale based on the transient-detail auditory effect of each scale so as to obtain transient information full-band optimized audio.
In this embodiment, a suitable method for quantitatively analyzing transient detail fineness is selected, generally a combination of a psychoacoustic model and subjective evaluation; the relevant quantization parameters can be set with reference to studies of auditory perception. Relevant features such as the sharpness, duration, richness of frequency components, and attack time of each transient are extracted from the transient audio detail information of each scale; these features generally reflect how fine the audio details are. Based on the extracted features, a weighted average method or a linear regression model converts them into a fineness quantization value. Weight parameters are set (e.g., transient sharpness 0.5, frequency richness 0.3, attack time 0.2), and the fineness quantization value of each scale is calculated. The calculated audio detail fineness values are recorded in a database and analyzed statistically; comparing the values across scales allows the fineness of the audio at different scales to be evaluated and provides a basis for subsequent processing.

Suitable audio clarity evaluation methods are then selected; commonly used indicators include the signal-to-noise ratio (SNR) and the ratio of short-time energy to short-time average energy, which reflect the clarity of the audio signal. The transient audio signal of each scale is analyzed to extract its short-time energy and noise components: the energy spectrum of each transient may be calculated with a short-time Fourier transform, and the background noise level determined. The clarity value of each scale is then computed from the extracted short-time energy and the background noise level as SNR = 10 * log10(signal energy / noise energy). The clarity values of each scale are recorded in a database, and a visual chart is generated to show the clarity distribution across scales, helping to evaluate the performance of the audio at each scale.

A suitable auditory effect analysis model is selected, typically a multivariate linear regression model that takes the clarity and fineness quantization values as inputs and predicts the auditory effect. The clarity value and fineness value of each scale are organized into a dataset containing multidimensional features, ensuring its completeness and accuracy. The organized data are fed into the selected model for training and prediction, and the transient-detail auditory effect value of each scale is calculated by the model; an output range (e.g., 0 to 100) may be set to make the auditory effect easy to quantize. The auditory effect values are recorded in a database and analyzed statistically; comparing the values across scales evaluates the overall performance of the audio signal and provides a basis for subsequent optimization.

A suitable frequency range calculation method is selected; the frequency range can generally be extracted by computing the frequency boundaries (lowest and highest frequencies) of the spectral data. The audio spectrum data of each scale is analyzed and its frequency distribution extracted; the maximum and minimum frequency components of each scale may be calculated to determine the frequency range.
From the spectral data, a frequency range for each scale is calculated. For example, a threshold (e.g., -3 dB points) may be set to determine the effective frequency range so as to exclude unimportant frequency components. The frequency ranges for each scale are recorded in a database and a visual chart is generated showing the frequency ranges for the different scales. Through visualization, the frequency characteristics of the audio signals of all scales can be observed in a visual way, and basis is provided for subsequent processing. A suitable dynamic full band adjustment model is selected, typically using a parametric equalizer or dynamic range compressor, to adjust the gain of the audio signal in different frequency ranges. And setting gain parameters of the frequency band according to the transient detail hearing effect value of each scale. For example, the gain may be increased for better frequency bands, and decreased for less effective frequency bands. And carrying out dynamic full-band range adjustment on the audio frequency spectrum data of each scale. Using Digital Signal Processing (DSP) techniques, the gain of each frequency range is adjusted according to the set parameters to optimize the transient information. Recording the adjusted audio signals in a database, and generating a visual chart to show the audio waveform and the spectral characteristics before and after adjustment. By comparison, the adjusting effect can be intuitively analyzed, and a basis is provided for subsequent optimization.
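The per-scale clarity (SNR) formula and the weighted fineness quantization described earlier in this embodiment could be sketched as follows; the weights (0.5 / 0.3 / 0.2) come from the example in the text, while the attack-time normalization is an added assumption used only to bring the features onto comparable scales.

import numpy as np

def scale_clarity_db(signal_energy, noise_energy):
    """Clarity estimate for one scale: SNR = 10 * log10(signal energy / noise energy)."""
    return 10 * np.log10((signal_energy + 1e-12) / (noise_energy + 1e-12))

def fineness_value(sharpness, freq_richness, attack_time_ms,
                   weights=(0.5, 0.3, 0.2), max_attack_ms=50.0):
    """Weighted fineness quantization with the illustrative weights from the text.

    sharpness and freq_richness are assumed to be pre-normalized to [0, 1];
    a shorter attack is treated as 'finer', so attack time is inverted and clipped.
    """
    attack_score = 1.0 - min(attack_time_ms, max_attack_ms) / max_attack_ms
    w1, w2, w3 = weights
    return w1 * sharpness + w2 * freq_richness + w3 * attack_score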
In this embodiment, step S4 includes the following steps:
Step S41, performing environmental sound source identification on the transient information full-band optimized audio, and separating the environmental sound source characteristics;
Step S42, performing environmental space size analysis on the environmental sound source characteristics to generate the environmental space size;
Step S43, estimating the reflection characteristics of the environmental sound source according to the environmental sound source characteristics, so as to obtain the environmental reflection material;
Step S44, carrying out sound reflection path evolution on the environmental sound source characteristics to generate an environmental sound reflection path;
Step S45, mining the environmental acoustic effect according to the environmental space size, the environmental reflection material, and the environmental sound reflection path, so as to generate environmental acoustic effect parameters;
Step S46, carrying out current audio scene inference on the environmental acoustic effect parameters to obtain a real-time audio propagation scene.
In this embodiment, a suitable sound source recognition algorithm is selected, and a model based on deep learning (such as convolutional neural network CNN or recurrent neural network RNN) is generally used. The models can effectively process the time domain and frequency domain characteristics of the audio signals to perform sound source identification. Features such as mel-frequency cepstral coefficients (MFCCs), short-time fourier transform (STFT) features, and spectrograms are extracted from the full-band optimized audio signal. By extracting these features, the ability of the model to identify different sound sources can be enhanced. The sound source recognition model is trained using the prepared audio data set (ambient sound source containing the markers). During training, cross-validation techniques are used to optimize the model parameters. After training, inputting the optimized audio signals into the model for reasoning, identifying the environmental sound source and extracting the characteristics of the environmental sound source. And recording the identified environmental sound source characteristics in a database, and generating a visual chart to display the frequency spectrum characteristics of different sound sources. This process facilitates subsequent sound source analysis and processing, ensuring efficient separation of ambient sound sources. Selecting an appropriate spatial size analysis method, an estimation method based on acoustic measurements, such as a room acoustic model or a reflection model, may generally be used. From the environmental sound source characteristics extracted in step S41, the frequency distribution and time delay characteristics thereof are analyzed. By analyzing the reflected sound, the distance between the sound source and the reflection surface can be deduced. The size of the ambient space is calculated using a known speed of sound (e.g., 343 m/s) and the time delay between the sound source to the reflecting surface. And comprehensively calculating the whole environmental space according to the reflection characteristics of different sound sources. And recording the calculated environmental space size in a database, and generating a visual chart display space structure. This will help in subsequent analysis of the acoustic effects, ensuring the accuracy of the environmental model. The selection of a suitable ambient sound source reflection characteristic estimation method typically uses acoustic characteristic analysis methods (such as impulse response measurements) to obtain reflection characteristics. Reflected components in the ambient sound source signature, such as intensity, delay and frequency response of reverberation, are analyzed. These characteristics can reflect the nature of the environmental surface, such as hardness, smoothness, etc. And according to the reflection characteristics of the sound source, the reflection materials of the environment are deduced by combining the existing acoustic databases (such as acoustic characteristic tables of different materials). For example, if the reflection intensity is high and clear, it may be a hard material (e.g., concrete, glass) and if the reflection is blurred, it may be a soft material (e.g., carpet, wall). Recording the deduced environmental reflection materials in a database, and generating statistical data of the environmental reflection characteristics. This will help in subsequent analysis of acoustic effects, ensuring accuracy of the reflective material. 
A suitable sound reflection path model is selected, typically based on geometric acoustics or wave acoustics, to simulate the propagation and reflection paths of sound in the environment. The sound source and reflection material characteristics extracted in steps S41 and S43 provide the basic data for simulating the sound reflection path; the accuracy of these data determines the reliability of the path estimation. The acoustic model is used to simulate the sound reflection path and compute the path of sound waves from the sound source to the receiving point: the specific reflection paths are derived from the environmental space size and reflection materials, and the angle and distance of each reflection are recorded. The generated environmental sound reflection paths are recorded in a database, and visual charts of the sound propagation paths are generated, supporting subsequent acoustic effect analysis and model optimization.

A suitable acoustic effect mining model is then selected; a reverberation time (RT60) calculation model, an acoustic energy distribution model, or similar can be used to evaluate the characteristics of the acoustic environment. Information such as the environmental space size, reflection materials, and reflection paths is integrated into a dataset of multidimensional features, which is used to calculate the acoustic effect parameters. Environmental acoustic effect parameters such as reverberation time, clarity, and sound pressure level are computed from the selected model; the reverberation time can be estimated with the Sabine formula RT60 = 0.161 * V / A, where V is the room volume and A is the total equivalent sound absorption area (the sum of the surface areas weighted by their absorption coefficients). The calculated acoustic effect parameters are recorded in a database and analyzed statistically; comparing the parameters under different environmental conditions evaluates the propagation characteristics of the audio signal and provides a basis for the subsequent audio scene inference.

A suitable audio scene inference model is selected, generally based on Bayesian inference or machine learning combined with the acoustic effect parameters. The environmental acoustic effect parameters obtained in step S45 are fed into the inference model, ensuring the completeness and accuracy of the data so that the scene inference is reliable. The input data is provided to the inference model to infer the real-time audio propagation scene; from the input acoustic features and environmental information, the model infers characteristics of the current audio scene, such as whether it is indoor or outdoor and open or enclosed. The inferred audio propagation scene is recorded in a database, and visual charts of the inference results are generated, supporting subsequent audio processing and optimization and ensuring that the propagation of the audio signal meets the requirements.
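A minimal illustration of the Sabine estimate above, together with a purely illustrative threshold-based stand-in for the scene inference model, is shown below; the thresholds and the example room values are assumptions for demonstration, not part of the described method.

def sabine_rt60(volume_m3, absorption_areas_m2):
    """Sabine estimate RT60 = 0.161 * V / A, with A the total equivalent absorption area."""
    A = sum(absorption_areas_m2)
    return 0.161 * volume_m3 / max(A, 1e-6)

def rough_scene_label(rt60_s):
    """Illustrative threshold heuristic standing in for the scene-inference model."""
    if rt60_s < 0.4:
        return "small / heavily damped space"
    if rt60_s < 1.0:
        return "medium room"
    return "large or reverberant space"

# Example: a 200 m^3 room with 60 m^2 of surfaces at an average absorption coefficient of 0.25
# print(sabine_rt60(200, [60 * 0.25]))   # about 2.1 s, i.e., a reverberant space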
In this embodiment, the specific steps of step S5 are as follows:
Step S51, carrying out audio spatial propagation evolution on the environmental acoustic effect parameters to generate a spatial propagation acoustic effect evolution rule;
Step S52, performing spatial propagation simulation processing on the real-time audio propagation scene according to the spatial propagation acoustic effect evolution rule, and collecting audio spatial propagation simulation data;
Step S53, performing deep stereo field feature iterative learning on the audio spatial propagation simulation data to generate deep stereo field features;
Step S54, carrying out stereo field sound effect enhancement on the transient information full-band optimized audio according to the deep stereo field features, thereby generating a stereo field enhanced audio signal.
In this embodiment, a suitable spatial propagation model is selected, typically based on geometric acoustics or wave-acoustic simulation; these models can accurately simulate how sound propagates in different environments. The acoustic effect parameters extracted in step S45 (such as reverberation time, clarity, and sound pressure level) are integrated into the spatial propagation model as inputs affecting the propagation characteristics of the sound. An evolution simulation of the spatially propagating acoustic effects is performed with the selected model: the propagation paths and variation of sound in space are computed from the environmental characteristics (such as space size, reflection materials, and sound source position). Through repeated simulations, the change of the acoustic effects over time and space is observed, and the evolution rules of the key parameters are recorded. The generated spatial propagation acoustic effect evolution rules are recorded in a database, and visual charts of the acoustic effect changes under different environmental conditions are generated, providing an important reference for subsequent audio processing.

A suitable audio spatial propagation simulation method is selected; acoustic simulation software (e.g., EASE or Odeon) is typically used for more accurate simulation. The current real-time audio signal is collected, and the simulation parameters are set according to the acoustic effect evolution rules from step S51, including the sound source position, receiving point position, and ambient reflection characteristics. The real-time audio signal is input to the acoustic simulation software for spatial propagation simulation; the software generates audio propagation simulation data from the set environmental parameters and acoustic effects and records the behavior of the sound waves in space. The collected audio spatial propagation simulation data is recorded in a database, and visual charts of the sound propagation process are generated, which helps subsequent analysis and optimization of the audio effect and confirms the accuracy of the simulation results.

A suitable deep learning model is selected, typically a convolutional neural network (CNN) or a recurrent neural network (RNN), to learn features of the audio signal. The audio spatial propagation simulation data from step S52 is consolidated into a dataset matching the model's input format, ensuring that it contains rich samples to improve the model's generalization ability. The selected model is trained, extracting the stereo field features of the spatial propagation data through iterative learning; data augmentation techniques (e.g., time stretching and frequency perturbation) can be used to improve robustness. The generated deep stereo field features are recorded in a database, the learning effect of the model is evaluated, and the learned features can be visualized to analyze their potential contribution to audio enhancement. Finally, suitable sound effect enhancement techniques are selected, typically deep-learning-based processing methods (e.g., generative adversarial networks, GANs) or conventional spatial audio techniques (e.g., HRTFs).
For step S54, an appropriate sound enhancement technique is selected; this typically involves deep-learning-based sound processing methods (e.g., generative adversarial networks, GANs) or conventional spatial sound processing techniques (e.g., head-related transfer functions, HRTFs). The parameters for the sound effect enhancement are set according to the deep stereo field features extracted in step S53, including gain adjustment, frequency modulation and phase-difference adjustment, so as to enhance the presentation of the stereo field. The transient information full-band optimized audio signal is then combined with the set enhancement parameters and the sound effect enhancement is executed: by adjusting the spatial properties of the audio signal, an audio signal with a richer stereoscopic impression is generated. The generated stereo field enhanced audio signals are recorded in a database, and a comparative visual chart is generated to show the audio waveform and spectral characteristics before and after enhancement; the comparison allows the enhancement effect to be analysed intuitively and provides a basis for subsequent optimization.
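As a hedged illustration of the kind of gain and phase-difference adjustment mentioned for step S54, the following sketch performs a simple mid/side stereo widening with an optional interchannel sample offset. It stands in for, and is not, the GAN- or HRTF-based processing described above; the width and offset values are assumptions.

```python
# Illustrative mid/side stereo widening with a crude interchannel time offset.
import numpy as np


def widen_stereo(left: np.ndarray, right: np.ndarray,
                 width: float = 1.5, itd_samples: int = 0):
    """Scale the side signal by `width` and optionally delay the right
    channel by `itd_samples` to exaggerate the perceived stereo field."""
    mid = 0.5 * (left + right)
    side = 0.5 * (left - right) * width
    out_l, out_r = mid + side, mid - side
    if itd_samples > 0:                        # crude phase/time offset
        out_r = np.concatenate([np.zeros(itd_samples), out_r])[:len(out_l)]
    peak = max(np.max(np.abs(out_l)), np.max(np.abs(out_r)), 1e-9)
    return out_l / peak, out_r / peak          # normalise to avoid clipping


if __name__ == "__main__":
    fs = 48_000
    t = np.arange(fs) / fs
    l = np.sin(2 * np.pi * 440 * t)
    r = 0.8 * np.sin(2 * np.pi * 440 * t + 0.3)
    wl, wr = widen_stereo(l, r, width=1.8, itd_samples=12)
    print(wl.shape, wr.shape)
```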
In this embodiment, the specific steps of step S6 are as follows:
Step S61, monitoring real-time propagation parameters of an audio signal to be processed to generate real-time propagation monitoring parameters of the audio;
Step S62, carrying out propagation path delay calculation on the audio real-time propagation monitoring parameters, and extracting propagation path delay parameters;
Step S63, carrying out audio delay compensation calculation on the propagation path delay parameters so as to obtain audio delay compensation parameters;
Step S64, propagation delay optimization is carried out on the stereo field enhanced audio signal according to the audio delay compensation parameter, so as to generate a delay optimized audio signal;
Step S65, global distortion prediction optimization is performed on the delay optimized audio signal, so that global adaptive distortion optimized audio is generated.
In this embodiment, the audio propagation parameters to be monitored are first determined, including sound pressure level (SPL), spectral characteristics, time delay, phase difference and frequency response; these parameters allow the propagation characteristics of the audio signal to be evaluated comprehensively. An audio signal processing module is configured to analyse the real-time audio signal using a short-time Fourier transform (STFT) or a fast Fourier transform (FFT), with the length and overlap of each data window chosen appropriately (for example, a window length of 1024 points and an overlap of 50%) to balance frequency resolution against time resolution. The monitored real-time propagation parameters are recorded in a database and a real-time visualization chart is generated, facilitating monitoring and analysis of changes in the audio signal; this provides important data support for the subsequent steps and ensures that the propagation characteristics of the audio signal are evaluated effectively.

A suitable delay calculation model is then selected; the phase-difference method and delay estimation methods are commonly used, as they can effectively analyse the propagation delays of the audio signal along different paths. The corresponding timestamps and sound pressure level data are extracted from the audio real-time propagation monitoring parameters acquired in step S61, and the integrity of the data is ensured for accurate delay calculation. The time delay in the real-time monitoring parameters is then analysed with the selected method: for example, with the phase-difference method, the phase differences of the same sound source signal as received by different microphone arrays are compared, and the propagation path delay is calculated as Delay = Δφ / (2πf), where Δφ is the phase difference and f is the signal frequency. The calculated propagation path delay parameters are recorded in a database, and a visual chart is generated to show the delay characteristics of the different sound sources; this provides the basis for the subsequent delay compensation calculation and ensures its accuracy.

An appropriate delay compensation model is then selected, typically based on linear interpolation or adaptive filters; such models can dynamically adjust the playback time of the audio signal according to the delay parameters. The audio delay compensation parameters are set according to the propagation path delay parameters extracted in step S62; for example, a reference delay (e.g., 10 ms) may be set and then adjusted on the basis of the actually measured delay. The selected compensation model is applied to delay-compensate the real-time audio signal: by shifting the playback time of the audio signal forward or backward, the synchrony of the signals is ensured (for example, a path measured to be 15 ms out of alignment has its playback time shifted by 15 ms so that all paths stay synchronized). The calculated audio delay compensation parameters are recorded in a database, and a visual chart is generated to compare the audio waveforms before and after delay compensation; this ensures that the playback quality of the audio signal is optimized and provides support for the subsequent steps.

A suitable propagation delay optimization model is then selected, typically a delay filter based on digital signal processing (DSP) techniques, which can effectively handle the propagation delay of an audio signal.
According to the audio delay compensation parameters obtained in step S63, the parameters for propagation delay optimization are set, including the delay time and gain adjustment, so as to ensure the synchrony of the audio signal. The stereo field enhanced audio signal is input into the optimization model and the propagation delay optimization is executed: by adjusting the time parameters of the audio signal, the optimal propagation effect of the signal is achieved. The delay-optimized audio signals are recorded in a database, and a visual chart is generated to show the audio waveform and spectral characteristics before and after optimization, which helps analyse the effect of the delay optimization and provides a basis for subsequent processing.

An appropriate global distortion prediction optimization model is then selected, typically a machine-learning-based regression model or a deep learning model (e.g., an LSTM or a CNN) for distortion prediction. Feature data such as spectral features, time-domain features and distortion metric values are extracted from the delay-optimized audio signal obtained in step S64; the diversity and richness of the data are ensured so as to improve the prediction accuracy of the model. The selected distortion prediction model is trained, and the delay-optimized audio signal is input into the model for prediction; the model outputs the optimal distortion compensation parameters on the basis of the input features, so as to optimize the audio quality. The audio signals after global adaptive distortion optimization are recorded in a database, and a visual chart is generated to show the audio waveform and spectral characteristics before and after the distortion optimization; by comparison, the effect of the distortion optimization can be analysed intuitively, providing a basis for subsequent application and analysis.
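As a minimal illustration of the delay handling in steps S62 to S64, the following sketch estimates the propagation delay of one capture relative to a reference from the phase difference at the dominant frequency (Delay = Δφ / (2πf)) and then realigns the late signal by an integer-sample shift. The single-tone test signal, sample rate and helper names are assumptions for illustration only.

```python
# Hedged sketch: phase-difference delay estimate plus simple sample-shift
# compensation.  Only valid for delays shorter than one signal period.
import numpy as np


def phase_delay(ref: np.ndarray, sig: np.ndarray, fs: float) -> float:
    """Delay of `sig` relative to `ref` (seconds) via the phase difference
    at the dominant FFT bin."""
    spec_ref, spec_sig = np.fft.rfft(ref), np.fft.rfft(sig)
    k = np.argmax(np.abs(spec_ref[1:])) + 1            # dominant bin (skip DC)
    f = k * fs / len(ref)
    dphi = np.angle(spec_sig[k]) - np.angle(spec_ref[k])
    dphi = (dphi + np.pi) % (2 * np.pi) - np.pi        # wrap to [-pi, pi)
    return -dphi / (2 * np.pi * f)                     # positive = sig lags ref


def compensate(sig: np.ndarray, delay_s: float, fs: float) -> np.ndarray:
    """Shift `sig` earlier by `delay_s` seconds (integer-sample compensation)."""
    n = int(round(delay_s * fs))
    if n <= 0:
        return sig
    return np.concatenate([sig[n:], np.zeros(n)])


if __name__ == "__main__":
    fs, f0, true_delay = 48_000, 440.0, 0.0005           # 0.5 ms lag
    t = np.arange(int(0.1 * fs)) / fs
    ref = np.sin(2 * np.pi * f0 * t)
    sig = np.sin(2 * np.pi * f0 * (t - true_delay))
    d = phase_delay(ref, sig, fs)
    print(f"estimated delay: {d * 1000:.3f} ms")          # ~0.5 ms
    aligned = compensate(sig, d, fs)
```

A fractional-delay filter or the cross-correlation method could replace the integer shift when sub-sample accuracy is required; the phase-difference form above is only unambiguous for delays within one period of the analysed frequency.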
In this embodiment, the specific steps of step S65 are as follows:
Detecting full-band audio frequency mutation of the delay optimized audio signal, and marking frequency mutation nodes;
carrying out mutation point distribution analysis on the frequency mutation nodes to generate audio mutation point distribution characteristics;
Performing audio distortion prediction according to the audio mutation point distribution characteristics to generate audio distortion prediction data;
And performing global adaptive distortion optimization on the delay-optimized audio signal according to the audio distortion prediction data, thereby generating global adaptive distortion-optimized audio.
In this embodiment, a suitable frequency mutation detection algorithm is first selected; common methods include spectrum analysis based on the short-time Fourier transform (STFT) and mutation detection based on energy variation, which can effectively identify points at which the frequency changes significantly within a short time. The delay-optimized audio signal is preprocessed, including normalization and denoising, to improve the accuracy of the mutation detection; high-frequency noise is removed with a low-pass filter so that the signal remains smooth. The STFT is used to compute the spectrum of the processed audio signal, with an appropriate window size (e.g., 1024 points) and overlap (e.g., 50%) set to balance frequency resolution against time resolution; the spectrum of each time frame is stored for subsequent analysis. Frequency mutation nodes are identified by analysing the rate of change of the spectrum (e.g., sudden changes in spectral energy or frequency components); a threshold (e.g., a rate of change exceeding a given percentage) is set to mark each abrupt point, and its specific position (timestamp) in the audio signal is recorded. The marked frequency mutation nodes are stored in a database and a visual chart is generated showing the spectrum of the audio signal and the positions of the mutation points; this provides the basic data for the subsequent analysis and ensures the reliability of the mutation detection.

From the frequency mutation node data extracted in step S66, the distribution characteristics of the mutation points are calculated; common features include the number of mutation points, the time interval between mutation points and the amplitude variation at each mutation. A statistical analysis of the mutation points is carried out to produce a histogram or distribution diagram showing how the mutation points are distributed over time, from which their concentration and frequency distribution can be analysed and their regularity identified. The analysis parameters, such as the minimum amplitude variation of a mutation point (e.g., 5 dB) and the time window (e.g., 100 ms), are set to filter out small mutations so that attention is focused on the important frequency variations. The resulting audio mutation point distribution characteristics are recorded in a database and a visual chart showing the statistical distribution of the mutation points is generated; this helps in understanding the varying nature of the audio signal and provides data support for the subsequent distortion prediction.

A suitable audio distortion prediction model is then selected, typically a machine-learning-based regression model (e.g., support vector regression, SVR) or a deep learning model (e.g., an LSTM network), to predict audio distortion values. From the mutation point distribution characteristics extracted in step S67, feature vectors are constructed including the number of mutation points, the mutation time intervals and the amplitude variation, ensuring their diversity and representativeness. The selected distortion prediction model is trained with a labelled training data set (including audio samples of known distortion levels); after training, the audio mutation point distribution characteristics are input into the model to generate the audio distortion prediction data.
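A minimal sketch of the mutation detection described above, using an STFT (1024-point window, 50% overlap) and a spectral-flux measure with a simple threshold; the test signal and the threshold ratio are illustrative assumptions.

```python
# Hedged sketch: spectral-flux based detection of abrupt frequency changes.
import numpy as np
from scipy.signal import stft


def mutation_nodes(x: np.ndarray, fs: float, win: int = 1024,
                   thresh_ratio: float = 3.0):
    """Return timestamps (s) where the spectral flux exceeds
    `thresh_ratio` times its median (abrupt spectral change)."""
    f, t, Z = stft(x, fs=fs, nperseg=win, noverlap=win // 2)
    mag = np.abs(Z)
    flux = np.sqrt(np.sum(np.diff(mag, axis=1) ** 2, axis=0))  # frame-to-frame change
    thresh = thresh_ratio * (np.median(flux) + 1e-12)
    return t[1:][flux > thresh]


if __name__ == "__main__":
    fs = 16_000
    t = np.arange(fs) / fs                                 # 1 s test signal
    x = np.sin(2 * np.pi * 300 * t)
    x[fs // 2:] = np.sin(2 * np.pi * 2500 * t[fs // 2:])   # abrupt jump at 0.5 s
    print(mutation_nodes(x, fs))   # expect a node near the 0.5 s jump
```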
The generated audio distortion prediction data are recorded in a database and analysed statistically; by comparing the distortion prediction values of different samples, the prediction accuracy of the model is evaluated, which provides a basis for the subsequent audio optimization. A suitable global adaptive distortion optimization model is then selected; nonlinear distortion compensation techniques or adaptive filters may be used to dynamically adjust the audio signal on the basis of the distortion prediction data. The optimization parameters are set according to the audio distortion prediction data obtained in step S68, including the distortion compensation gain, the filter parameters and the dynamic adjustment strategy, so as to reduce the distortion effectively. The delay-optimized audio signal is input into the adaptive distortion optimization model and the global adaptive distortion optimization is executed: the frequency response and gain of the audio signal are adjusted in real time, reducing the degree of distortion and improving the overall quality of the audio. The audio signals after global adaptive distortion optimization are recorded in a database, and a visual chart is generated showing the audio waveforms and spectral characteristics before and after optimization; the effect of the distortion optimization is evaluated by comparative analysis, providing a basis for subsequent application.
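As a hedged illustration of the distortion prediction step, the following sketch trains a support vector regression (SVR) model on mutation-point features (count, mean interval, mean amplitude change). In a real system the training set would consist of labelled audio samples of known distortion levels; the synthetic data and the assumed feature-to-distortion relationship here are placeholders only.

```python
# Hedged sketch: SVR mapping mutation-point features to a distortion score.
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic feature vectors: [num_mutations, mean_interval_s, mean_delta_db]
X = np.column_stack([
    rng.integers(0, 50, 200),
    rng.uniform(0.01, 1.0, 200),
    rng.uniform(0.0, 20.0, 200),
]).astype(float)

# Assumed relationship for demonstration only: more / larger mutations -> more distortion.
y = 0.02 * X[:, 0] + 0.05 * X[:, 2] + rng.normal(0, 0.05, 200)

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
model.fit(X, y)

new_clip = np.array([[12, 0.2, 8.5]])           # features of one hypothetical audio clip
print("predicted distortion score:", model.predict(new_clip)[0])
```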
In the present embodiment, there is provided a high-sound-quality audio data processing system for executing the high-sound-quality audio data processing method as described above (a schematic sketch of how these modules chain together is given after the module list below), the system including:
the noise filtering module is used for identifying the audio signal to be processed, carrying out multi-time window segmentation processing and self-adaptive noise filtering on the audio signal to be processed, and obtaining a noise filtering optimized audio signal;
the transient detail capturing module is used for carrying out multi-scale frequency spectrum decomposition on the noise filtering optimization audio signal, capturing transient audio details and identifying transient audio detail information of each scale;
The full-frequency-range adjusting module is used for calculating the audio frequency range according to the transient audio detail information of each scale and adjusting the dynamic full frequency range so as to obtain transient information full-frequency-range optimized audio;
The acoustic effect mining module is used for carrying out environmental acoustic effect mining on the transient information full-band optimized audio and carrying out current audio scene inference so as to obtain a real-time audio propagation scene;
the stereo field enhancement module is used for carrying out stereo field sound effect enhancement on transient information full-frequency band optimized audio according to the real-time audio propagation scene so as to generate a stereo field enhanced audio signal;
And the distortion optimization module is used for carrying out propagation delay optimization on the stereo field enhanced audio signal and carrying out global distortion prediction optimization so as to generate global adaptive distortion optimized audio.
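Purely as a schematic of how the six modules listed above could be chained, the following sketch wires six placeholder stages into one pipeline; none of the stage bodies implement the processing described in this embodiment, and all class and method names are hypothetical.

```python
# Schematic module chaining only; every stage is a trivial placeholder.
import numpy as np


class HighQualityAudioPipeline:
    def noise_filtering(self, x):          return x - np.mean(x)            # placeholder
    def transient_capture(self, x):        return x                          # placeholder
    def full_band_adjustment(self, x):     return x / (np.max(np.abs(x)) + 1e-9)
    def acoustic_scene_inference(self, x): return {"scene": "unknown"}       # placeholder
    def stereo_field_enhancement(self, x, scene): return np.stack([x, x])    # mono -> stereo
    def distortion_optimization(self, xs): return xs                         # placeholder

    def process(self, x):
        x = self.noise_filtering(x)
        x = self.transient_capture(x)
        x = self.full_band_adjustment(x)
        scene = self.acoustic_scene_inference(x)
        xs = self.stereo_field_enhancement(x, scene)
        return self.distortion_optimization(xs)


if __name__ == "__main__":
    fs = 16_000
    mono = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
    stereo_out = HighQualityAudioPipeline().process(mono)
    print(stereo_out.shape)                # (2, 16000)
```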
The invention effectively removes background noise through adaptive noise filtering and preserves the clarity of the original audio signal, providing a cleaner input for subsequent processing and helping to improve the overall quality of the audio. Multi-time-window processing allows changes in the audio signal over different time periods to be analysed in finer detail and customized processing to be applied to different signal characteristics, avoiding the limitations of a single-time-window approach. Multi-scale spectral decomposition allows audio details in different frequency intervals to be analysed comprehensively, ensuring high-fidelity capture and reproduction of transient audio and preventing dynamic details from being ignored or distorted. Capturing transient audio details allows every minute change in the audio to be preserved, providing a more realistic and natural sound quality, especially for audio that requires a quick response (e.g., instrument performance or speech). By dynamically adjusting the frequency bands of the audio signal, the details of each band are optimized, excessive high-frequency suppression and low-frequency distortion are avoided, and a more balanced and natural sound quality is obtained. Transient information optimization prevents the transient parts of the audio from being over-smoothed, ensuring that rapidly changing details (such as drum hits or piano attacks) are accurately preserved and improving the clarity and realism of the audio. Mining the acoustic effects allows the environmental background of the audio propagation to be restored accurately, which is particularly important for recording and playback and provides data support for the spatial and positional impression of the audio. Scene inference allows the audio signal to be simulated according to the characteristics of the environment, enhancing the immersion and auditory realism of the audio, which is particularly suitable for applications such as virtual reality and game audio. Stereo field sound effect enhancement markedly improves the spatial impression of the audio signal: the sound is no longer merely emitted from two earphones or loudspeakers but simulates a rich three-dimensional acoustic space, greatly improving the listening experience; the enhanced audio gains directionality, depth and layering, so that in applications such as virtual reality and cinema sound the listener perceives a more natural and realistic effect. Delay optimization ensures signal synchrony in real-time audio applications, avoiding the loss of synchronization caused by propagation delay and improving the interactive audio experience. Global distortion prediction optimizes every frequency band of the audio, preventing the degradation of sound quality caused by environmental interference or propagation problems and ensuring high-fidelity audio output.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
The foregoing is merely a specific embodiment of the invention to enable a person skilled in the art to understand or practice the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.