Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the application. All other embodiments that a person skilled in the art can obtain from these embodiments without inventive effort fall within the scope of protection of the present application.
In view of the problems described in the background art, and in order to effectively improve the accuracy and reliability of transmission verification for digital audio signals, embodiments of the present application provide a digital audio signal transmission verification method and system based on dynamic feedback enhancement, which are described and illustrated below by way of example with reference to the accompanying drawings and specific embodiments.
Referring to FIG. 1 and FIG. 2, FIG. 1 is a flow chart of a digital audio signal transmission verification method based on dynamic feedback enhancement according to an embodiment of the present application, and FIG. 2 is a schematic diagram of the architecture of an audio transmission system. As shown in FIG. 2, the audio transmission system may include at least two digital audio terminals that transmit audio to each other, and a digital audio signal transmission verification system that performs transmission verification on the digital audio signals transmitted by the at least two digital audio terminals. In this embodiment, the method may be implemented by the digital audio signal transmission verification system. The digital audio terminal may be a device with data transmission, analysis and processing capabilities, such as a computer terminal, a mobile phone or a tablet computer, and the digital audio signal transmission verification system may be a server, a server cluster, a computer device or the like, which is not limited in this embodiment. The method includes steps S10 to S50, which are described in detail below.
S10, acquiring real-time transmission waveform data of a target audio signal in a transmission channel, wherein the real-time transmission waveform data comprises a time domain waveform sequence and a frequency domain spectrum sequence.
As an example, a digital audio transmission system is transmitting 60 seconds of music audio that contains the sounds of various instruments such as a piano and a violin. During transmission, an acquisition device in the system samples the waveform of the target audio signal in the transmission channel at a sampling frequency of 1000 samples per second. The acquired time domain waveform sequence is a two-dimensional array, in which the first dimension represents the sampling time points and the second dimension represents the audio signal amplitude at each time point. Suppose the amplitude data acquired during the 1st second is [0.1, 0.2, -0.3, ..., 0.5]; this forms the part of the time domain waveform sequence that falls within that second. The frequency domain spectrum sequence can be obtained by performing a Fourier transform on the time domain waveform. For example, the frequency range of 20 Hz-20 kHz is divided into 1000 frequency bands, and the energy values corresponding to the frequency bands constitute the frequency domain spectrum sequence; at a certain moment, the energy value of the 100 Hz-101 Hz band may be 0.05, the energy value of the 101 Hz-102 Hz band may be 0.03, and so on, which is not limited in this embodiment.
S20, carrying out dynamic time-frequency characteristic extraction processing on the real-time transmission waveform data to obtain a multi-dimensional time-frequency characteristic set. The multi-dimensional time-frequency characteristic set comprises a time domain waveform distortion characteristic, a frequency domain energy attenuation characteristic and a phase shift correlation characteristic.
In this embodiment, the time domain waveform distortion feature in the multi-dimensional time-frequency feature set is obtained by observing the time domain waveform sequence, taking the above-mentioned music audio as an example. Assume that the normal waveform amplitude should fluctuate between -0.8 and 0.8 over the period of 10-15 seconds, but that some of the actually acquired amplitudes fall outside this range, reaching an outlier of 1.2. By analyzing information such as the waveform fluctuation, peak values and valley values, a feature vector describing the degree of waveform distortion is calculated. For example, the proportion of sampling points that deviate from the normal range, the difference between the abnormal amplitude and the normal amplitude range, and other indexes together form a 5-dimensional time domain waveform distortion feature vector, such as [0.1, 0.2, 0.3, 0.4, 0.5].
For the frequency domain energy attenuation feature, the frequency domain spectrum sequence is observed. For example, over the period of 20-25 seconds, the energy values of certain frequency bands drop significantly compared with normal: the energy value of the 1000 Hz-1200 Hz band should be stable at about 0.1, but is reduced to 0.05 at this time. By comparing the energy value changes across different time periods and different frequency bands, indexes such as the energy attenuation proportion and the attenuated frequency band range are calculated to form a 4-dimensional frequency domain energy attenuation feature vector, such as [0.05, 0.1, 0.08, 0.12].
Furthermore, for the phase shift correlation feature, the phase relationship between different frequency components in the audio signal is also important. For example, the phase differences between different frequency components in the normal audio signal and in the currently acquired signal are compared over the period of 30-35 seconds. Assume that the phase difference between the 100 Hz frequency component and the 200 Hz frequency component is 0.2 radians under normal conditions, but becomes 0.5 radians in the currently acquired signal. By analyzing the phase difference changes between a plurality of frequency components, indexes describing the degree of phase shift are calculated to form a 3-dimensional phase shift correlation feature vector, such as [0.3, 0.4, 0.5]. The three feature vectors are spliced together to obtain the multi-dimensional time-frequency feature set, such as [0.1, 0.2, 0.3, 0.4, 0.5, 0.05, 0.1, 0.08, 0.12, 0.3, 0.4, 0.5].
S30, carrying out dynamic distortion feedback analysis on the multi-dimensional time-frequency characteristic set based on a preset feedback analysis model to generate a dynamic distortion feedback coefficient set, wherein the dynamic distortion feedback coefficient set is used for quantifying the distortion fluctuation state of a transmission channel.
In one possible example, the preset feedback analysis model is assumed to be a neural network model trained on a large amount of data. The obtained multi-dimensional time-frequency feature set is input into the model, and the input features are analyzed through the neuron calculations and weights in the model. For example, the first hidden layer in the model performs a weighted calculation on the input feature vector according to a preset weight matrix. Assuming that the weight matrix is a 12×10 matrix, the 12 elements of the multi-dimensional time-frequency feature set are multiplied element by element with each column of the weight matrix and summed to obtain 10 new values, and these 10 values are passed to the next layer after being processed by the activation function. After the multi-layer calculation, the model outputs a dynamic distortion feedback coefficient set. For example, the output dynamic distortion feedback coefficient set is a 3-dimensional vector, [0.6, 0.7, 0.8], in which the first coefficient 0.6 represents the quantized distortion level in terms of time domain waveform distortion, the second coefficient 0.7 represents the quantized distortion level in terms of frequency domain energy attenuation, and the third coefficient 0.8 represents the quantized distortion level in terms of phase shift correlation. Taken together, these coefficients quantify the distortion fluctuation state of the transmission channel at the current moment.
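The weighted calculation of the first hidden layer can be sketched as follows. This is only a minimal illustration: the constant weight values, the zero biases, the sigmoid activation and the function name `dense_layer` are assumptions made for the example, since the text does not specify them.

```python
import math

def dense_layer(features, weights, biases):
    """Weighted sum of the inputs per output neuron, followed by a sigmoid.

    `weights` has one row per input feature and one column per output
    neuron (12 x 10 in the example), so output j is computed from column j.
    """
    n_out = len(weights[0])
    outputs = []
    for j in range(n_out):
        z = sum(x * row[j] for x, row in zip(features, weights)) + biases[j]
        outputs.append(1.0 / (1.0 + math.exp(-z)))  # assumed activation
    return outputs

# 12-dimensional feature set from the example in the text
features = [0.1, 0.2, 0.3, 0.4, 0.5, 0.05, 0.1, 0.08, 0.12, 0.3, 0.4, 0.5]
# Hypothetical weights and biases; a trained model would supply real values
weights = [[0.1] * 10 for _ in range(12)]
biases = [0.0] * 10

hidden = dense_layer(features, weights, biases)
print(len(hidden))  # number of values passed to the next layer
```

Further layers would be chained in the same way until the final layer produces the 3-dimensional coefficient set.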
S40, training the feedback analysis model according to the dynamic distortion feedback coefficient set, and generating a dynamic feedback enhancement model, wherein the dynamic feedback enhancement model is used for identifying potential distortion information in a transmission waveform.
Taking the currently obtained dynamic distortion feedback coefficient set [0.6, 0.7, 0.8] as an example, the set is compared with the corresponding correct label (assume that the correct label, representing the distortion coefficient set in the normal transmission state, is [0.2, 0.3, 0.4]). The error between the two is calculated, for example as the Euclidean distance, which for these two vectors is √(0.4² + 0.4² + 0.4²) ≈ 0.69. According to the error value, the weights of the feedback analysis model are adjusted using a back propagation algorithm. During the adjustment, the error is propagated backward layer by layer, starting from the output layer and moving through the hidden layers toward the input layer. For example, at the output layer, the weights of the output neurons are adjusted according to the error so that the dynamic distortion feedback coefficient set output next time is closer to the correct label. Training is repeated by continuously inputting new multi-dimensional time-frequency feature sets and the corresponding dynamic distortion feedback coefficient sets. When the error reaches a preset small value (such as 0.1), training is continued on a historical transmission waveform data set so as to expand the distortion types and distortion probabilities that the feedback analysis model can predict. The dynamic feedback enhancement model is thereby obtained, and it can identify potential distortion information in the transmission waveform more accurately.
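The error calculation and the stopping check described above can be sketched as follows. The function name and the variable names are illustrative assumptions; the back propagation update itself is not shown.

```python
import math

def euclidean_error(predicted, target):
    """Euclidean distance between the model output and the correct label."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)))

predicted = [0.6, 0.7, 0.8]   # dynamic distortion feedback coefficient set
target = [0.2, 0.3, 0.4]      # correct label for the normal transmission state
error_threshold = 0.1         # preset small value from the text

error = euclidean_error(predicted, target)
# Training on the current data continues while the error is above the threshold
needs_more_training = error > error_threshold
print(error, needs_more_training)
```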
S50, invoking the dynamic feedback enhancement model to carry out transmission verification processing on the real-time transmission waveform data, and generating a signal integrity verification result. The signal integrity verification result comprises a repairable distortion marker and an irreversible distortion alarm identifier.
In this embodiment, the acquired real-time transmission waveform data is input into the dynamic feedback enhancement model for analysis, and the distortion condition of each part is determined. For example, for the period of 40-45 seconds, the model finds that some indexes in the time domain waveform distortion feature vector exceed the acceptable range but can be repaired by an algorithm (such as filtering the waveform). In this case, the model generates a repairable distortion marker, indicating that the waveform within these 5 seconds has repairable distortion. For the period of 50-55 seconds, the model finds that the energy value in the frequency domain energy attenuation feature vector has dropped too far and that the phase difference in the phase shift correlation feature vector is too large, both beyond the repairable range. In this case, the model generates an irreversible distortion alarm identifier, indicating that the audio signal within these 5 seconds has irreversible distortion that may affect the audio quality. Finally, the analysis results of all the time periods are combined to generate a complete signal integrity verification result for evaluating the transmission condition of the digital audio signal in the transmission channel.
In a possible embodiment, step S20 may be implemented by the following steps S201 to S204, which are described in detail below.
S201, segment windowing is conducted on the time domain waveform sequence, time domain waveform segment data are obtained, and time domain waveform distortion characteristics are extracted based on the time domain waveform segment data. The time domain waveform distortion characteristics comprise waveform amplitude fluctuation rate, zero crossing offset and segmentation energy difference.
The description continues with the earlier scenario in which the digital audio transmission system transmits 60 seconds of music audio. As an example, the acquired time domain waveform sequence contains 1000 audio signal amplitude samples per second. The 60-second sequence is divided into 60 segments of 1 second each, and each segment is weighted with a Hanning window whose window length is 100 points (0.1 seconds). For instance, the 1000 points of the 10th segment are processed in this way to obtain time domain waveform segment data.
Waveform amplitude fluctuation rate: in the 10th segment, the maximum amplitude is 0.8 and the minimum amplitude is -0.6, so the amplitude range is 1.4. The sum of the amplitude differences between adjacent sampling points is 50 (999 differences in total), giving an average of 50/999 ≈ 0.05, which measures the intensity of the amplitude change.
Zero-crossing offset: the waveform of the 10th segment would normally cross zero 50 times but actually crosses 55 times. The sum of the offset distances is 20, so the average offset per crossing is 20/55 ≈ 0.36, which reflects the zero-crossing abnormality.
Segment energy difference degree: the energy of the 10th segment, accumulated as the sum of the squared amplitudes, is 100, and is compared with that of the adjacent 9th segment (energy 80) and 11th segment (energy 120). The energy difference from the 9th segment is 20, a ratio of 20/80 = 0.25, and the energy difference from the 11th segment is also 20, a ratio of 20/120 ≈ 0.17. The segment energy difference degree is determined from these ratios.
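The per-segment quantities used in this example can be sketched as follows. The helper names are hypothetical, and dividing the energy difference by the neighbouring segment's energy follows the arithmetic of the example rather than a formula stated elsewhere in the text.

```python
def mean_adjacent_difference(samples):
    """Average absolute amplitude difference between neighbouring samples."""
    diffs = [abs(b - a) for a, b in zip(samples, samples[1:])]
    return sum(diffs) / len(diffs)

def segment_energy(samples):
    """Energy of a segment: the sum of the squared amplitudes."""
    return sum(s * s for s in samples)

def energy_difference_ratio(current_energy, neighbour_energy):
    """Energy difference relative to the neighbouring segment's energy."""
    return abs(current_energy - neighbour_energy) / neighbour_energy

# Energies from the example: 10th segment 100, 9th segment 80, 11th segment 120
ratio_prev = energy_difference_ratio(100, 80)
ratio_next = energy_difference_ratio(100, 120)
print(ratio_prev, round(ratio_next, 2))
```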
In step S201, the segment windowing process is performed on the time domain waveform sequence to obtain time domain waveform segment data, which may include steps S2011-S2013 described below.
S2011, dividing the time domain waveform sequence into continuous time segments according to a preset window length and an overlapping rate, wherein the length of an overlapping area between the starting position of each time segment and the previous segment is the window length multiplied by the overlapping rate.
As an example, in the foregoing music audio transmission scenario, the acquired time domain waveform sequence is audio signal amplitude data sampled 1000 times per second, that is, 60 × 1000 = 60000 sampling points in total over 60 seconds. The preset window length is 100 sampling points (corresponding to a duration of 100 ÷ 1000 = 0.1 seconds), and the overlap rate is set to 0.5.
The division starts from the initial position of the time domain waveform sequence, so the first time segment runs from the 1st sampling point to the 100th sampling point. The second time segment does not start at the 101st sampling point but at sampling point (100 - 100 × 0.5 + 1) = 51, and ends at sampling point (51 + 100 - 1) = 150. There is therefore an overlap of 100 × 0.5 = 50 sampling points between the second time segment and the first time segment. Continuing in this way, the third time segment starts at sampling point (150 - 100 × 0.5 + 1) = 101 and ends at sampling point (101 + 100 - 1) = 200, again overlapping the second time segment by 50 sampling points. In the same manner, the entire time domain waveform sequence of 60000 sampling points is divided into a plurality of consecutive time segments.
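The overlapping division can be sketched as follows. The function name is hypothetical, the returned bounds are 1-based and inclusive to match the numbering in the text, and dropping any trailing samples that do not fill a whole window is an assumption.

```python
def segment_bounds(total_samples, window_length, overlap_rate):
    """Return (start, end) sample numbers, 1-based and inclusive, per segment."""
    overlap = int(round(window_length * overlap_rate))
    hop = window_length - overlap  # distance between consecutive segment starts
    if hop <= 0:
        raise ValueError("overlap rate must leave a positive hop size")
    bounds = []
    start = 1
    while start + window_length - 1 <= total_samples:
        bounds.append((start, start + window_length - 1))
        start += hop
    return bounds

bounds = segment_bounds(total_samples=60000, window_length=100, overlap_rate=0.5)
print(bounds[:3])  # first, second and third time segments
```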
S2012, applying a Hanning window function to each time segment to carry out weighted smoothing treatment, and inhibiting the frequency spectrum leakage effect at the segment boundary.
Taking the second time segment (from the 51st sampling point to the 150th sampling point) as an example, the Hanning window function is w(n) = 0.5 - 0.5 × cos(2πn/(N-1)), where n is the index of the sampling point within the time segment (from 0 to 99, because the time segment has 100 sampling points) and N is the window length, 100.
For the 0th sampling point in the time segment (corresponding to the 51st sampling point in the actual sequence), the weighting value calculated from the Hanning window function is w(0) = 0.5 - 0.5 × cos(2π × 0/(100-1)) = 0.5 - 0.5 × 1 = 0. For the 1st sampling point (corresponding to the 52nd sampling point in the actual sequence), w(1) = 0.5 - 0.5 × cos(2π × 1/(100-1)); since cos(2π/99) ≈ 0.998, w(1) ≈ 0.5 - 0.5 × 0.998 = 0.0010. A weighting value is calculated in this way for each sampling point in the time segment in turn, and the amplitude of each sampling point is multiplied by the corresponding weighting value. For example, if the original amplitude of the 52nd sampling point is 0.3, the weighted amplitude is 0.3 × 0.0010 = 0.0003. All 100 sampling points of the time segment are weighted in this way, thereby suppressing the spectrum leakage effect at the segment boundaries.
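The window weighting can be sketched as follows, using the symmetric Hanning definition with N - 1 in the denominator. The function names and the short list of amplitudes are illustrative assumptions.

```python
import math

def hann_window(length):
    """Symmetric Hanning window: w(n) = 0.5 - 0.5 * cos(2*pi*n / (N - 1))."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def apply_window(samples, window):
    """Multiply every sample amplitude by its window weight."""
    return [s * w for s, w in zip(samples, window)]

window = hann_window(100)
# Hypothetical amplitudes for the first few sampling points of a time segment
weighted = apply_window([0.2, 0.3, 0.25], window)
print(window[0], window[1])
```

The weights are zero at both ends of the window and rise smoothly toward the middle, which is what suppresses the discontinuity at the segment boundaries.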
As an example, step S2012 can be implemented by the following steps S20121-S20123, which are specifically described below.
S20121, detecting edge amplitude abrupt change points of the weighted time slices, wherein the edge amplitude abrupt change points are defined as sampling points with amplitude change rates exceeding a dynamic amplitude threshold value in a head-tail preset percentage interval of the time slices.
As an example, assume that one of the weighted time slices is selected, which contains 100 sample points (corresponding to a duration of 0.1 seconds). The preset head-tail percentage interval is 10%, namely the first 10 sampling points and the last 10 sampling points belong to the edge area.
For the beginning edge region, the amplitude change rate is calculated starting from the 1st sampling point. The weighted amplitude of the 1st sampling point is 0.05, and the weighted amplitude of the 2nd sampling point is 0.08, so the amplitude change rate is (0.08 - 0.05)/0.05 = 0.6. The dynamic amplitude threshold is determined dynamically from a large amount of historical data and the characteristics of the current audio signal, and is assumed to be 0.5 in this scenario. Since 0.6 is greater than 0.5, the 2nd sampling point is considered an edge amplitude abrupt change point. Next, the amplitude change rate from the 2nd sampling point to the 3rd sampling point is calculated: the weighted amplitude of the 3rd sampling point is 0.1, so the change rate is (0.1 - 0.08)/0.08 = 0.25, which is smaller than the threshold of 0.5, and the 3rd sampling point is not an abrupt change point. In this way, the amplitude change rate is calculated one by one for the first 10 sampling points and compared with the dynamic amplitude threshold.
For the ending edge region, the calculation starts from the 91st sampling point. The weighted amplitude of the 91st sampling point is 0.12 and that of the 92nd sampling point is 0.09, so the amplitude change rate is (0.09 - 0.12)/0.12 = -0.25. Assuming that the dynamic amplitude threshold for a downward trend is also 0.5, the absolute value of -0.25 is less than 0.5, and the 92nd sampling point is not an abrupt change point. The amplitude change rates of the subsequent sampling points are calculated in the same manner until all of the last 10 sampling points have been checked. In this example, only the 2nd sampling point at the beginning of the time segment is finally determined to be an edge amplitude abrupt change point.
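The edge detection can be sketched as follows. The function name is hypothetical, the threshold is passed in as a fixed value rather than derived dynamically, and skipping a point whose predecessor has zero amplitude is an assumption, since the change rate is undefined there.

```python
def detect_edge_jumps(amplitudes, edge_fraction=0.1, threshold=0.5):
    """Return 1-based sample numbers whose amplitude change rate relative to
    the previous sample exceeds the threshold, checking only the head and
    tail edge regions of the segment."""
    n = len(amplitudes)
    edge = int(round(n * edge_fraction))
    # Indices (0-based) of the "current" sample in each compared pair
    candidates = list(range(1, edge)) + list(range(n - edge + 1, n))
    jumps = []
    for i in candidates:
        previous = amplitudes[i - 1]
        if previous == 0:
            continue  # assumption: change rate undefined, so skip the point
        rate = (amplitudes[i] - previous) / previous
        if abs(rate) > threshold:
            jumps.append(i + 1)
    return jumps

# Weighted amplitudes mirroring the example: samples 1-3 and samples 91-92
segment = [0.05, 0.08, 0.1] + [0.1] * 87 + [0.12, 0.09] + [0.09] * 8
jumps = detect_edge_jumps(segment)
print(jumps)
```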
S20122, performing local linear interpolation processing on the detected edge amplitude abrupt change points to generate a smooth transition modified amplitude sequence.
As an example, local linear interpolation is performed on the detected edge amplitude abrupt change point, namely the 2nd sampling point at the beginning of the time segment. Two reference points are selected near the abrupt change point; assume that the 1st sampling point (weighted amplitude 0.05) and the 3rd sampling point (weighted amplitude 0.1) are selected. Linear interpolation calculates the reasonable amplitude to which the abrupt change point should be corrected from the positions and amplitudes of these two reference points.
Between the 1st sampling point and the 3rd sampling point, the sampling point number goes from 1 to 3 and the amplitude goes from 0.05 to 0.1, with the 2nd sampling point in the middle. Linear interpolation scales the corrected amplitude proportionally: over a change of 2 in the sampling point number, the amplitude changes by 0.1 - 0.05 = 0.05, so the amplitude change per sampling point is 0.05 ÷ 2 = 0.025. The 2nd sampling point is one step beyond the 1st sampling point, so its corrected amplitude is 0.05 + 0.025 = 0.075. The amplitude of the 2nd sampling point is corrected to 0.075, and all detected edge amplitude abrupt change points are processed in this way to produce a smoothly transitioning corrected amplitude sequence.
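The interpolation step can be sketched as follows. The function name is hypothetical; the two reference points are passed in explicitly, as in the example.

```python
def interpolate_point(x0, y0, x1, y1, x):
    """Linearly interpolate the amplitude at position x between two
    reference points (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Reference points: 1st sample (0.05) and 3rd sample (0.1); correct the 2nd
corrected = interpolate_point(1, 0.05, 3, 0.1, 2)
print(round(corrected, 3))
```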
S20123, splicing the corrected amplitude sequence and the non-edge area amplitude of the original weighted time segment to generate time domain waveform segmentation data after edge suppression.
In this embodiment, the non-edge region of the above-mentioned time segment consists of the middle 80 sampling points, that is, all sampling points except the first 10 and the last 10. The corrected amplitude sequence of the first 10 sampling points (in which the 2nd sampling point has been corrected to 0.075) is spliced in order with the original weighted amplitudes of the middle 80 sampling points and the weighted amplitudes of the last 10 sampling points, which need no correction if the ending edge region has no abrupt change points.
For example, the first 10 sample points are arranged for corrected amplitude, followed by the middle 80 sample points, and finally the last 10 sample points, thereby generating edge-suppressed time-domain waveform segment data. After the processing, the abrupt change condition of the time domain waveform segmentation data at the edge is suppressed, and the subsequent analysis and processing of the time domain waveform are more facilitated.
S2013, arranging the weighted time slices in time sequence to generate a time domain waveform segmentation data set, wherein each time domain waveform segmentation data set comprises an amplitude sequence weighted by a window function and a corresponding time stamp sequence.
In this embodiment, after all the divided time slices are weighted, the first weighted time slice is placed in front of the second weighted time slice in time sequence, and so on.
For the first time domain waveform segment data, the amplitude sequence weighted by the window function consists of the 100 sampling point amplitudes after Hanning window weighting, and the corresponding time stamp sequence starts from 0 seconds and advances at intervals of 1/1000 second: 0 seconds, 0.001 seconds, 0.002 seconds, and so on. For the second time domain waveform segment data, the weighted amplitude sequence consists of the 100 weighted sampling point amplitudes of the second time segment, and the corresponding time stamp sequence starts from 0.05 seconds (the first segment is 0.1 seconds long and the two segments overlap by 0.05 seconds) and continues as 0.05 seconds, 0.051 seconds, and so on. All the time domain waveform segment data are arranged in sequence in this way to obtain the time domain waveform segment data set.
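The time stamp sequence of each segment can be sketched as follows. The function name is hypothetical, and the hop of 50 sampling points follows from the window length of 100 and the overlap rate of 0.5 used in the example.

```python
def segment_timestamps(segment_number, window_length, hop, sample_rate):
    """Time stamps in seconds for every sample of a segment (1-based number)."""
    first_sample = (segment_number - 1) * hop  # 0-based index in the sequence
    return [(first_sample + n) / sample_rate for n in range(window_length)]

first = segment_timestamps(1, window_length=100, hop=50, sample_rate=1000)
second = segment_timestamps(2, window_length=100, hop=50, sample_rate=1000)
print(first[:3], second[:2])
```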
In step S201, extracting the time domain waveform distortion feature based on the time domain waveform segment data may include steps S2014-S2017, which are described in detail below.
S2014, calculating the amplitude difference ratio of each time domain waveform segment data to the adjacent segment data to obtain a segment amplitude fluctuation sequence, and carrying out standard deviation calculation on the segment amplitude fluctuation sequence to obtain the waveform amplitude fluctuation rate.
As an example, also taking the above-described music audio transmission scenario as an example, the time-domain waveform sequence is divided into a plurality of time-domain waveform segment data having a duration of 0.1 seconds (100 sampling points).
Taking the 5th time domain waveform segment data as an example, its amplitude sequence is [a_1, a_2, ..., a_100]. The amplitude sequence of the previous adjacent segment data (the 4th) is [b_1, b_2, ..., b_100], and the amplitude sequence of the next adjacent segment data (the 6th) is [c_1, c_2, ..., c_100].
The amplitude difference ratio to the previous adjacent segment data is calculated first. For each sampling point position i (i from 1 to 100), the difference ratio is d_i = |a_i - b_i| / (|a_i| + |b_i|), with d_i = 0 if |a_i| + |b_i| = 0. For example, at sampling point 1, a_1 = 0.2 and b_1 = 0.15, so d_1 = |0.2 - 0.15| / (|0.2| + |0.15|) = 0.05/0.35 ≈ 0.143. After the difference ratios of the 100 sampling points are calculated, their average is taken as the amplitude difference ratio D_1 to the previous adjacent segment data.
Similarly, the amplitude difference ratio to the next adjacent segment data is calculated. For each sampling point position i, the difference ratio is e_i = |a_i - c_i| / (|a_i| + |c_i|), with e_i = 0 if |a_i| + |c_i| = 0. After the difference ratios of the 100 sampling points are calculated, their average is taken as the amplitude difference ratio D_2 to the next adjacent segment data.
D_1 and D_2 are combined into a single value that serves as the amplitude difference ratio of the time domain waveform segment data to its adjacent segment data. This calculation is carried out for all the time domain waveform segment data to obtain the segment amplitude fluctuation sequence.
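The per-point difference ratio and its average can be sketched as follows. The function name is hypothetical, and combining the two ratios by taking their mean is an assumption, since the text only says that they are combined into one value.

```python
def amplitude_difference_ratio(current, neighbour):
    """Mean of |a_i - b_i| / (|a_i| + |b_i|), using 0 where the denominator is 0."""
    ratios = []
    for a, b in zip(current, neighbour):
        denominator = abs(a) + abs(b)
        ratios.append(abs(a - b) / denominator if denominator else 0.0)
    return sum(ratios) / len(ratios)

# Single-point check with the values from the text: a_1 = 0.2, b_1 = 0.15
d_1 = amplitude_difference_ratio([0.2], [0.15])

def combined_ratio(current, previous, following):
    """Assumed combination: the mean of the ratios to both neighbours."""
    return (amplitude_difference_ratio(current, previous)
            + amplitude_difference_ratio(current, following)) / 2

print(round(d_1, 3))
```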
Next, the standard deviation of the segment amplitude fluctuation sequence is calculated. Let the segment amplitude fluctuation sequence be [x_1, x_2, ..., x_n], where n is the number of pieces of segment data. The average value μ = (x_1 + x_2 + ... + x_n)/n is calculated first. For example, if the sequence is [0.1, 0.2, 0.15, 0.25], then μ = (0.1 + 0.2 + 0.15 + 0.25)/4 = 0.175.
The square of the difference between each value and the average, (x_1 - μ)², (x_2 - μ)², ..., (x_n - μ)², is then calculated. For example, (0.1 - 0.175)² = 0.005625, (0.2 - 0.175)² = 0.000625, (0.15 - 0.175)² = 0.000625, and (0.25 - 0.175)² = 0.005625.
The average of these squared values, that is, the variance σ² = [(x_1 - μ)² + (x_2 - μ)² + ... + (x_n - μ)²]/n, is then calculated. In the above example, σ² = (0.005625 + 0.000625 + 0.000625 + 0.005625)/4 = 0.003125.
Finally, the standard deviation σ = √(σ²) is taken as the waveform amplitude fluctuation rate. In the above example, σ = √0.003125 ≈ 0.0559.
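The standard deviation calculation can be sketched as follows, using the population form (division by n) that the steps above describe. The function name is illustrative.

```python
import math

def population_std(values):
    """Return (variance, standard deviation), dividing by n as in the text."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return variance, math.sqrt(variance)

fluctuation_sequence = [0.1, 0.2, 0.15, 0.25]  # example sequence from the text
variance, fluctuation_rate = population_std(fluctuation_sequence)
print(round(variance, 6), round(fluctuation_rate, 4))
```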
S2015, detecting the zero-crossing point position of each time-domain waveform segment data, and calculating the zero-crossing offset based on the zero-crossing point position difference of the adjacent segments.
For example, in the 7 th time domain waveform segment data, zero crossing points occur at the sampling points 20, 45, 70 upon detection. The zero-crossing points of the former adjacent segment data (6 th) occur at the sampling points 18, 42, 68 and the zero-crossing points of the latter adjacent segment data (8 th) occur at the sampling points 22, 48, 72.
The zero-crossing point position difference from the previous adjacent segment data is calculated. For the first zero-crossing point (sampling point 20) of the 7 th segment data and the first zero-crossing point (sampling point 18) of the 6 th segment data, the position difference is |20-18|=2. For the second zero-crossing point, the position difference is |45-42|=3, and the third zero-crossing point position difference is |70-68|=2. And adding the difference values to obtain a total difference of 2+3+2=7 between the zero crossing point positions of the adjacent segment data.
Similarly, a zero-crossing point position difference from the next adjacent segment data is calculated. For the first zero-crossing point, the position difference is |22-20|=2, the second zero-crossing point position difference is |48-45|=3, and the third zero-crossing point position difference is |72-70|=2. The total difference is 2+3+2=7.
The total zero-crossing point position differences relative to the previous and the next adjacent segment data are added together and divided by the total number of zero-crossing comparisons (6) to obtain the zero-crossing offset, that is, (7 + 7)/6 ≈ 2.33.
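The zero-crossing detection and offset calculation can be sketched as follows. The function names are hypothetical; detecting a crossing by a sign change between consecutive samples and pairing crossings by their order of occurrence are assumptions that match the example.

```python
def zero_crossings(samples):
    """1-based positions at which the sign changes relative to the previous sample."""
    return [i + 1 for i in range(1, len(samples))
            if samples[i - 1] * samples[i] < 0]

def zero_crossing_offset(current, previous, following):
    """Average position difference between matching zero crossings of the
    current segment and those of its two neighbouring segments."""
    diffs = [abs(c - p) for c, p in zip(current, previous)]
    diffs += [abs(c - f) for c, f in zip(current, following)]
    return sum(diffs) / len(diffs)

# Zero-crossing positions from the example (6th, 7th and 8th segments)
offset = zero_crossing_offset([20, 45, 70], [18, 42, 68], [22, 48, 72])
print(round(offset, 2))
```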
S2016, extracting an energy distribution histogram of each time domain waveform segment data, and calculating the Bhattacharyya distance between the current segment histogram and the historical segment histograms to obtain the segment energy difference degree.
In this embodiment, the energy is calculated by squaring the amplitude of each sampling point and summing the squared values. For example, if the amplitude sequence of the 9th time domain waveform segment data is [a_1, a_2, ..., a_100], the energy is E = a_1² + a_2² + ... + a_100². The energy range is divided into 10 intervals, and the number of energy values falling in each interval is counted to form the energy distribution histogram. Let the energy distribution histogram of the 9th segment data be H_9 = [h_{9,1}, h_{9,2}, ..., h_{9,10}], where h_{9,i} is the number of energy values in the i-th interval.
The historical segment histograms are selected from the energy distribution histograms of earlier segment data, for example the 3rd and the 6th segments, H_3 = [h_{3,1}, h_{3,2}, ..., h_{3,10}] and H_6 = [h_{6,1}, h_{6,2}, ..., h_{6,10}]. To calculate the Bhattacharyya distance between the current and historical segment histograms, the mid-value m_i of each interval is calculated first. For H_9 and H_3, the coefficients k_{1,i} = √(h_{9,i} × h_{3,i}) are calculated to give k_1 = [k_{1,1}, k_{1,2}, ..., k_{1,10}], and the Bhattacharyya coefficient is B_1 = Σ k_{1,i} (i from 1 to 10). Similarly, the Bhattacharyya coefficient B_2 between H_9 and H_6 is calculated. The segment energy difference degree may then be obtained by averaging B_1 and B_2.
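The histogram comparison can be sketched as follows. Note that this sketch uses the standard Bhattacharyya definitions, which normalize each histogram to sum to 1 and take the distance as the negative logarithm of the coefficient, whereas the text works with raw counts and averages the coefficients directly. The function names and the example histograms are assumptions.

```python
import math

def bhattacharyya_coefficient(hist_a, hist_b):
    """Sum of sqrt(p_i * q_i) over bins, after normalizing both histograms."""
    total_a, total_b = sum(hist_a), sum(hist_b)
    if total_a == 0 or total_b == 0:
        return 0.0
    return sum(math.sqrt((a / total_a) * (b / total_b))
               for a, b in zip(hist_a, hist_b))

def bhattacharyya_distance(hist_a, hist_b):
    """Standard distance -ln(BC); infinite when the histograms do not overlap."""
    bc = bhattacharyya_coefficient(hist_a, hist_b)
    return math.inf if bc <= 0.0 else max(0.0, -math.log(bc))

# Hypothetical 10-bin histograms for the current and two historical segments
h_current = [5, 10, 20, 30, 15, 10, 5, 3, 1, 1]
h_hist_a = [4, 12, 18, 28, 18, 10, 5, 3, 1, 1]
h_hist_b = [1, 1, 3, 5, 10, 15, 30, 20, 10, 5]

distances = [bhattacharyya_distance(h_current, h) for h in (h_hist_a, h_hist_b)]
energy_difference = sum(distances) / len(distances)
print(energy_difference)
```

A larger value indicates that the current segment's energy distribution differs more from the historical segments.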
S2017, carrying out normalized splicing on the waveform amplitude fluctuation rate, the zero-crossing offset and the segmentation energy difference degree to generate the time domain waveform distortion characteristic.
As an example, assume that the waveform amplitude fluctuation rate has a value range of [0, 0.5] and a current value of 0.3, the zero-crossing offset has a value range of [0, 5] and a current value of 2, and the segment energy difference degree has a value range of [0, 1] and a current value of 0.6.
The normalized value of the waveform amplitude fluctuation rate is 0.3/0.5 = 0.6, the normalized value of the zero-crossing offset is 2/5 = 0.4, and the normalized value of the segment energy difference degree is 0.6/1 = 0.6.
The three normalized values are sequentially spliced together, e.g., [0.6, 0.4, 0.6], to obtain the time domain waveform distortion feature.
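The normalization and splicing can be sketched as follows, assuming min-max normalization over the stated value ranges (which reduces to division by the upper bound when the lower bound is 0). The function name is illustrative.

```python
def normalize(value, lower, upper):
    """Min-max normalization of a value to the range [0, 1]."""
    return (value - lower) / (upper - lower)

# (current value, lower bound, upper bound) for each feature in the example
raw_features = [
    (0.3, 0.0, 0.5),  # waveform amplitude fluctuation rate
    (2.0, 0.0, 5.0),  # zero-crossing offset
    (0.6, 0.0, 1.0),  # segment energy difference degree
]
distortion_feature = [normalize(v, lo, hi) for v, lo, hi in raw_features]
print([round(x, 2) for x in distortion_feature])
```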
S202, performing multi-scale filtering processing on the frequency domain spectrum sequence to obtain frequency domain energy distribution data, and extracting frequency domain energy attenuation characteristics based on the frequency domain energy distribution data, wherein the frequency domain energy attenuation characteristics comprise fundamental frequency energy attenuation slope, harmonic energy diffusivity and inter-frequency band energy transfer coefficients.
In one example, the frequency domain spectral sequence may be energy value data divided into 1000 frequency bands from 20Hz-20 kHz. The multi-scale filtering uses filters with different cut-off frequencies, such as three low-pass filters, the cut-off frequencies are 100Hz, 1000Hz and 5000Hz, and the frequency domain spectrum sequence sequentially passes through the filtering. Taking a 100Hz filter as an example, attenuating the energy values of the frequency bands larger than 100Hz to obtain partial frequency domain energy distribution data, and then passing through the 1000Hz and 5000Hz filters to obtain different scale frequency domain energy distribution data.
An example of extracting frequency domain attenuation characteristics based on the above data is as follows:
Frequency attenuation slope: in the 100 Hz-1000 Hz band, the energy value at the starting frequency of 100 Hz is 0.1 and the energy value at the ending frequency of 1000 Hz is 0.05. The frequency changes by 900 Hz and the energy changes by 0.05, so the slope is 0.05/900≈0.000056.
Harmonic energy diffuseness: taking 500 Hz as the center frequency with an energy value of 0.2, the energy value at the harmonic frequency of 1000 Hz is 0.05 and the energy value at 1500 Hz is 0.03. The total harmonic energy is 0.08, so the diffuseness is 0.08/0.2=0.4.
Inter-band energy transfer coefficient: the 1000 Hz-2000 Hz and 2000 Hz-3000 Hz bands are observed. The initial energy of the 1000 Hz-2000 Hz band is 0.3, the energy of the 2000 Hz-3000 Hz band changes to 0.23, and the energy transfer amount is 0.05, so the transfer coefficient is 0.05/0.3≈0.17.
Step S202 may include the following steps S2021 to S2025, which are described in detail below.
S2021, dividing the frequency domain spectrum sequence into a plurality of sub-bands, wherein each sub-band corresponds to a preset bandwidth range.
As an example, in the above-mentioned music audio transmission scenario, the frequency range covered by the frequency domain spectrum sequence is 20Hz-20kHz, and is divided into energy value data corresponding to 1000 frequency bands in total. If the frequency domain spectrum sequence is divided into a plurality of sub-bands, the bandwidth range of each sub-band is preset to be 100Hz. Starting from 20Hz, the first sub-band has a frequency range of 20Hz-120Hz, the second sub-band has a frequency range of 120Hz-220Hz, the third sub-band has a frequency range of 220Hz-320Hz, and so on, until the last sub-band has a frequency range of 19920Hz-20020Hz (this division may cover the entire frequency range since the total range is 20Hz-20 kHz). Thus, the frequency domain spectrum sequence is successfully divided into a plurality of sub-bands, and each sub-band has a corresponding preset bandwidth range.
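The sub-band division in this example can be sketched as follows. The helper name is hypothetical; the sketch simply steps through the range in fixed 100 Hz increments, so the last sub-band extends slightly past 20 kHz, as in the example.

```python
def split_into_subbands(start_hz, end_hz, bandwidth_hz):
    """Return (low, high) frequency pairs covering [start_hz, end_hz)."""
    if bandwidth_hz <= 0:
        raise ValueError("bandwidth must be positive")
    subbands = []
    low = start_hz
    while low < end_hz:
        subbands.append((low, low + bandwidth_hz))
        low += bandwidth_hz
    return subbands

subbands = split_into_subbands(20, 20_000, 100)
first, second, last = subbands[0], subbands[1], subbands[-1]
```

With these parameters the division yields 200 sub-bands, and the sub-band at index 5 (the 6th) covers 520 Hz-620 Hz.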
S2022, performing adaptive threshold filtering processing on each sub-band, reserving spectral components exceeding a dynamic energy threshold, and generating filtered sub-band energy data;
Taking the second sub-band (120 Hz-220 Hz) as an example, the dynamic energy threshold is determined dynamically from the distribution of the sub-band energy values, historical data and the like. Suppose the dynamic energy threshold of this sub-band is calculated to be 0.03 (for example, from the mean and standard deviation of the energy values and the energy fluctuation range of the historical frequency band). The sub-band contains several frequency bands (with the frequency domain spectrum divided into 1000 frequency bands over 20 Hz-20 kHz, each frequency band is about 20 Hz wide, so a 100 Hz sub-band contains about 5 of them), and the energy value of each frequency band is traversed. If the energy value at 130 Hz is 0.05, which is greater than the threshold of 0.03, the component is retained; if the energy value at 160 Hz is 0.02, which is less than the threshold, it is filtered out. All frequency band energy values of the sub-band are judged and processed in this way, and the retained spectral component energy values form the filtered sub-band energy data. The same operation is performed on all sub-bands to generate the filtered energy data of each sub-band.
Step S2022 may include the following steps S20221 to S20224, which are described in detail below.
S20221, counting the energy average value of the current sub-band in a historical time window, and multiplying the energy average value by a preset attenuation factor to serve as an initial dynamic energy threshold.
Continuing the scenario above, in which the digital audio transmission system transmits music audio with a duration of 60 seconds and the frequency domain spectrum sequence has been divided into a plurality of sub-bands, assume as an example that the 6th sub-band, with a frequency range of 520 Hz-620 Hz, is currently being analyzed. The historical time window is set to the last 10 seconds, and the energy values of the sub-band during these 10 seconds (assuming one recording per second) are 0.04, 0.05, 0.03, 0.06, 0.04, 0.05, 0.03, 0.04, 0.05 and 0.04 respectively. To calculate the energy mean value, all the energy values are first added: 0.04+0.05+0.03+0.06+0.04+0.05+0.03+0.04+0.05+0.04=0.43. Dividing by the number of recordings, 10, gives an energy mean value of 0.43/10=0.043. The preset attenuation factor is 0.8, and multiplying the energy mean value by the attenuation factor gives an initial dynamic energy threshold of 0.043×0.8=0.0344.
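The calculation of the initial dynamic energy threshold can be sketched as follows, using the ten recorded values of this example. The function name is illustrative.

```python
def initial_dynamic_threshold(history, attenuation_factor):
    """Mean energy over the historical window, scaled by the attenuation factor."""
    if not history:
        raise ValueError("historical window must not be empty")
    mean_energy = sum(history) / len(history)
    return mean_energy * attenuation_factor

# One energy value per second over the last 10 seconds (values from the example).
history = [0.04, 0.05, 0.03, 0.06, 0.04, 0.05, 0.03, 0.04, 0.05, 0.04]
mean_energy = sum(history) / len(history)
threshold = initial_dynamic_threshold(history, attenuation_factor=0.8)
```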
S20222, monitoring the instantaneous energy value of the current sub-band in real time, and if the instantaneous energy value is lower than the initial dynamic energy threshold, decrementing the dynamic energy threshold according to a preset step length until the lowest energy threshold is reached.
For example, in real-time monitoring of the 6th sub-band, assume that the instantaneous energy value monitored at a certain moment is 0.03, which is lower than the initial dynamic energy threshold of 0.0344. The preset step size is 0.001, and the lowest energy threshold is 0.02. Since the instantaneous energy value is below the initial dynamic energy threshold, the dynamic energy threshold begins to be decremented by the step size. After the first decrement, the dynamic energy threshold becomes 0.0344-0.001=0.0334. Monitoring continues, and if the next instantaneous energy value is still lower than the dynamic energy threshold at that time, the threshold is decremented again to 0.0334-0.001=0.0324, and so on until the dynamic energy threshold reaches the lowest energy threshold of 0.02.
S20223, if the instantaneous energy value continuously exceeds the adjusted dynamic energy threshold, the dynamic energy threshold is increased by a preset step until the initial dynamic energy threshold is restored.
In this embodiment, in the above monitoring process for the 6th sub-band, when the dynamic energy threshold has decreased to 0.02, assume that the next 3 consecutive instantaneous energy values are 0.025, 0.028 and 0.03 respectively, all of which exceed the current dynamic energy threshold of 0.02. The preset step size is still 0.001.
Because the instantaneous energy value continuously exceeds the dynamic energy threshold, the dynamic energy threshold begins to be incremented by the step size. The dynamic energy threshold becomes 0.02+0.001=0.021 after the first increment, 0.021+0.001=0.022 after the second increment and 0.022+0.001=0.023 after the third increment, and so on until the dynamic energy threshold is restored to the initial dynamic energy threshold of 0.0344.
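The decrement and increment behavior of steps S20222 and S20223 can be sketched as a single per-sample update rule. This is a simplified illustration under stated assumptions: the function name is hypothetical, and a single sample above the threshold triggers an increment, whereas the text requires the value to exceed the threshold continuously (which could be modeled with a counter of consecutive samples).

```python
def update_threshold(threshold, instantaneous, initial, minimum, step):
    """Apply one monitoring step to the dynamic energy threshold.

    Decrements toward `minimum` when the instantaneous energy is below the
    threshold, increments toward `initial` when it is above, and clamps the
    result to the interval [minimum, initial].
    """
    if instantaneous < threshold:
        return max(minimum, threshold - step)
    if instantaneous > threshold:
        return min(initial, threshold + step)
    return threshold

INITIAL, MINIMUM, STEP = 0.0344, 0.02, 0.001

# Decrement phase: an instantaneous value of 0.03 is below the threshold 0.0344.
after_first_decrement = update_threshold(INITIAL, 0.03, INITIAL, MINIMUM, STEP)

# Increment phase: starting from the lowest threshold with three higher samples.
threshold = MINIMUM
trace = []
for sample in (0.025, 0.028, 0.03):
    threshold = update_threshold(threshold, sample, INITIAL, MINIMUM, STEP)
    trace.append(threshold)
```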
S20224, marking all spectral components in the current sub-band whose instantaneous energy values exceed the latest dynamic energy threshold as effective components, and generating the filtered sub-band energy data.
As a possible implementation manner, after the adjustment of the dynamic energy threshold, it is assumed that the latest dynamic energy threshold is 0.03. Within the 6 th sub-band, there are a plurality of instantaneous energy values corresponding to the frequency band, for example, the instantaneous energy value corresponding to the frequency band 540Hz is 0.035, and the instantaneous energy value corresponding to the frequency band 580Hz is 0.025. Since 0.035 exceeds the latest dynamic energy threshold of 0.03, the spectral component corresponding to the frequency band of 540Hz is marked as an effective component, whereas 0.025 does not exceed the latest dynamic energy threshold of 0.03 and is not marked as an effective component. And judging and marking all frequency bands in the 6 th sub-band, and arranging energy values corresponding to all frequency spectrum components marked as effective components together to generate filtered sub-band energy data. All sub-bands are processed as per the steps S20221-S20224 described above to obtain filtered sub-band energy data for each sub-band.
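The marking of effective components can be sketched as a simple filter over the frequency bands of one sub-band. The dictionary layout (frequency in Hz mapped to instantaneous energy) and the function name are assumptions made for illustration.

```python
def filter_subband(energy_by_frequency, threshold):
    """Keep only components whose instantaneous energy exceeds the threshold."""
    return {
        freq: energy
        for freq, energy in energy_by_frequency.items()
        if energy > threshold
    }

# Two frequency bands of the 6th sub-band, with the values from the example.
subband_6 = {540: 0.035, 580: 0.025}
filtered_subband_6 = filter_subband(subband_6, threshold=0.03)
```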
S2023, calculating an energy attenuation gradient for each filtered sub-band energy data, the energy attenuation gradient representing a rate of decline of the current sub-band energy over time.
In this embodiment, taking the filtered sub-band energy data of the third sub-band (220 Hz-320 Hz) as an example, assume that the energy values of the sub-band at different time points (at 1-second intervals) are 0.1 at the 10th second, 0.08 at the 11th second, 0.06 at the 12th second and 0.04 at the 13th second. When calculating the energy attenuation gradient, the energy difference between adjacent time points is calculated first: the energy difference between the 10th and 11th seconds is 0.1-0.08=0.02, between the 11th and 12th seconds is 0.08-0.06=0.02, and between the 12th and 13th seconds is 0.06-0.04=0.02. The time intervals are then calculated, which are all 1 second here. The energy attenuation gradient is the ratio of the energy difference to the time interval; in this example, the average energy attenuation gradient is (0.02+0.02+0.02)/3=0.02 (averaging the energy change over several time intervals gives a more stable gradient value). The energy attenuation gradient is calculated in the same manner for the filtered sub-band energy data of all sub-bands.
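The averaged energy attenuation gradient of this example can be sketched as follows. The function name is illustrative; a positive result indicates that the energy is declining over time.

```python
def energy_decay_gradient(times, energies):
    """Average rate of energy decline between adjacent time points."""
    if len(times) != len(energies) or len(times) < 2:
        raise ValueError("need at least two matching time/energy samples")
    rates = [
        (energies[i] - energies[i + 1]) / (times[i + 1] - times[i])
        for i in range(len(times) - 1)
    ]
    return sum(rates) / len(rates)

# Third sub-band (220 Hz-320 Hz), sampled at the 10th to 13th seconds.
gradient = energy_decay_gradient([10, 11, 12, 13], [0.1, 0.08, 0.06, 0.04])
```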
S2024, performing cross-band correlation analysis on the energy attenuation gradients of the sub-bands, and determining harmonic energy diffuseness, wherein the harmonic energy diffuseness is used for quantifying energy transfer amplitude between adjacent sub-bands.
As a possible implementation, consider the correlation between the energy attenuation gradients of the 4th sub-band (320 Hz-420 Hz) and the 5th sub-band (420 Hz-520 Hz). Let the energy attenuation gradient of the 4th sub-band be 0.03 and the energy attenuation gradient of the 5th sub-band be 0.02. At the same time, the change of the energy values of the two sub-bands in a certain time period is observed. Suppose that from the 15th to the 16th second the energy value of the 4th sub-band decreases from 0.09 to 0.06, a decrease of 0.03, while the energy value of the 5th sub-band increases from 0.07 to 0.08, an increase of 0.01. The energy transfer amplitude is calculated by analyzing the relationship between the energy value changes of adjacent sub-bands over a plurality of time periods. For example, the proportion of the energy lost by the 4th sub-band that is transferred to the 5th sub-band is calculated as transfer ratio=0.01/0.03≈0.33. The harmonic energy diffuseness is determined by performing the same analysis between all adjacent sub-bands. For example, the final harmonic energy diffuseness may be determined by combining the energy transfer ratios between adjacent sub-bands, such as by averaging.
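The transfer ratio between two adjacent sub-bands, and the averaging over all adjacent pairs, can be sketched as follows. The function names are hypothetical, and returning 0 when the source sub-band loses no energy is an assumption not stated in the text.

```python
def transfer_ratio(energy_lost_by_source, energy_gained_by_neighbor):
    """Share of the energy lost by one sub-band that appears in its neighbor."""
    if energy_lost_by_source <= 0:
        return 0.0  # assumption: no loss means no transfer
    return energy_gained_by_neighbor / energy_lost_by_source

def harmonic_energy_diffuseness(ratios):
    """Average of the transfer ratios of all adjacent sub-band pairs."""
    if not ratios:
        raise ValueError("at least one transfer ratio is required")
    return sum(ratios) / len(ratios)

# 4th sub-band falls from 0.09 to 0.06; 5th sub-band rises from 0.07 to 0.08.
ratio_4_to_5 = transfer_ratio(0.09 - 0.06, 0.08 - 0.07)
```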
S2025, generating the frequency domain energy distribution data based on the energy attenuation gradients and harmonic energy diffusances of all sub-bands.
For example, the energy attenuation gradient values of each subband are arranged in a sequence in the order of the subbands, assuming [ g 1, g2, …, gn ], n being the number of subbands. At the same time, the relevant data of the harmonic energy diffuseness (such as the energy transfer proportion between adjacent sub-bands, etc.) is also subjected to similar sorting.
Finally, the energy attenuation gradient data and the harmonic energy diffusivity data are combined to form a data structure capable of comprehensively reflecting the frequency domain energy distribution condition, for example, the energy attenuation gradient data and the harmonic energy diffusivity data can be spliced into a multi-dimensional vector or matrix and other forms, so that the frequency domain energy distribution data are generated and used for subsequent analysis and processing of frequency domain characteristics.
S203, carrying out joint phase alignment processing on the time domain waveform sequence and the frequency domain spectrum sequence to generate phase offset correlation characteristics, wherein the phase offset correlation characteristics comprise time-frequency phase difference degree, phase mutation point density and cross-frequency band phase synchronization rate.
In this embodiment, continuing to take a music audio transmission scene as an example, the time domain waveform sequence and the frequency domain spectrum sequence are aligned in a joint manner, and the matching is achieved by finding time domain waveform feature points (such as peak points) and frequency domain corresponding frequency component features.
Time-frequency phase difference degree: at a certain moment (such as the 30th second), the amplitude change of the time domain waveform is compared with the energy change of the corresponding frequency component in the frequency domain. Assuming that the time domain amplitude change rate is 0.1 (the ratio of the amplitude change between adjacent sampling points to the time interval) and the frequency domain energy change rate is 0.05 (the ratio of the energy change between adjacent frequency points to the frequency interval), the difference between the two, 0.05, is taken as the time-frequency phase difference degree measurement value.
Phase mutation point density: the mutation points (points whose amplitude or energy differs greatly from that of adjacent points) of the time domain waveform sequence and the frequency domain spectrum sequence during audio transmission are counted. For example, if there are 10 time domain mutation points and 8 frequency domain mutation points within the 10th to 20th seconds, the period is 10 seconds and the total number of sampling points is 10×1000=10000, so the phase mutation point density is (10+8)/10000=0.0018.
Cross-band phase synchronization rate: the synchronization of different frequency bands, such as the 100 Hz-1000 Hz and 2000 Hz-3000 Hz bands, is observed, and the degree of synchronization of their energy changes is measured at a certain moment (such as the 40th second). Assuming that the energy of the 100 Hz-1000 Hz band rises by 0.05 and the energy of the 2000 Hz-3000 Hz band rises by 0.03, the synchronization rate may be measured by, for example, the ratio of the energy changes (0.03/0.05=0.6).
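The three phase offset correlation features can be sketched with the numbers used above. The function names are illustrative, and two choices are assumptions rather than part of the text: the synchronization rate is taken as the smaller energy change divided by the larger one, and two unchanged bands are treated as fully synchronized.

```python
def time_frequency_phase_difference(time_change_rate, freq_change_rate):
    """Absolute difference between the time and frequency domain change rates."""
    return abs(time_change_rate - freq_change_rate)

def mutation_point_density(time_points, freq_points, duration_s, samples_per_s):
    """Mutation points per sampling point over the observed period."""
    total_samples = duration_s * samples_per_s
    if total_samples <= 0:
        raise ValueError("the observed period must contain samples")
    return (time_points + freq_points) / total_samples

def cross_band_sync_rate(change_a, change_b):
    """Ratio of the smaller energy change to the larger one (assumed form)."""
    smaller, larger = sorted((abs(change_a), abs(change_b)))
    if larger == 0:
        return 1.0  # assumption: two unchanged bands count as synchronized
    return smaller / larger

difference = time_frequency_phase_difference(0.1, 0.05)
density = mutation_point_density(10, 8, duration_s=10, samples_per_s=1000)
sync_rate = cross_band_sync_rate(0.05, 0.03)
phase_offset_feature = [difference, density, sync_rate]
```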
S204, carrying out feature fusion on the time domain waveform distortion feature, the frequency domain energy attenuation feature and the phase shift correlation feature to generate the multi-dimensional time-frequency feature set.
As a possible implementation manner, the time domain waveform distortion characteristics (such as the vector formed by the waveform amplitude fluctuation rate, the zero-crossing offset and the segment energy difference), the frequency domain attenuation characteristics (such as the vector formed by the frequency domain attenuation slope, the harmonic energy diffusion degree and the inter-band energy transfer coefficient) and the phase offset correlation characteristics (such as the vector formed by the time-frequency phase difference degree, the phase mutation point density and the cross-band phase synchronization rate) which are extracted before can be spliced. For example, the three 3-dimensional vectors are spliced into a 9-dimensional vector, [ waveform amplitude fluctuation value, zero-crossing offset value, segment energy difference value, frequency attenuation slope value, harmonic energy spread value, inter-band energy transfer coefficient value, time-frequency difference value, mutation point density value, cross-band synchronization value ], thereby generating a multi-dimensional time-frequency feature set.
In a possible embodiment, step S30 may be implemented by the following steps S301 to S304, which are described in detail below.
S301, inputting the time domain waveform distortion characteristics to a time domain distortion analysis layer of a feedback analysis model, and outputting a time domain distortion score, wherein the time domain distortion score is used for quantifying the severity of time domain waveform distortion.
As an example, assume that in a music audio transmission scene, the acquired time domain waveform distortion characteristics include the waveform amplitude fluctuation rate, the zero-crossing offset and the segment energy difference. Suppose the feature vector of a segment at a certain moment is [0.3, 0.2, 0.4], whose elements correspond to these three features in order.
The vector is input to the time domain distortion analysis layer of the feedback analysis model, which has preset rules and computational logic. For example, according to historical data, a waveform amplitude fluctuation rate of 0-0.2 indicates a low degree of distortion, 0.2-0.5 medium, and greater than 0.5 high; the value of 0.3 in this example is medium. A zero-crossing offset of 0-0.1 is low, 0.1-0.3 medium, and greater than 0.3 high; the value of 0.2 in this example is medium. A segment energy difference of 0-0.3 is low, 0.3-0.6 medium, and greater than 0.6 high; the value of 0.4 in this example is medium.
The time domain distortion analysis layer calculates the time domain distortion score to be 0.3×0.4+0.2×0.3+0.4×0.3=0.3 through a comprehensive evaluation algorithm (such as weighted average of three indexes, waveform amplitude fluctuation rate weight of 0.4, zero-crossing offset weight of 0.3, and segmentation energy difference weight of 0.3), and quantifies the time domain waveform distortion severity.
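The grading rules and the weighted average of this example can be sketched as follows. The function names are hypothetical, and assigning a value that falls exactly on a boundary to the lower grade is an assumption, since the ranges given above share their endpoints.

```python
def grade(value, low_upper, medium_upper):
    """Classify a feature value as low, medium or high distortion."""
    if value <= low_upper:  # assumption: boundary values take the lower grade
        return "low"
    if value <= medium_upper:
        return "medium"
    return "high"

def weighted_score(features, weights):
    """Weighted average of the feature values (weights are expected to sum to 1)."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(f * w for f, w in zip(features, weights))

# Order: amplitude fluctuation rate, zero-crossing offset, segment energy difference.
features = [0.3, 0.2, 0.4]
grade_bounds = [(0.2, 0.5), (0.1, 0.3), (0.3, 0.6)]
weights = [0.4, 0.3, 0.3]

grades = [grade(f, low, med) for f, (low, med) in zip(features, grade_bounds)]
time_domain_distortion_score = weighted_score(features, weights)
```

The same `weighted_score` helper applies to any three-element feature vector with its own weights, so the other analysis layers could reuse it.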
S302, inputting the frequency domain energy attenuation characteristics to a frequency domain attenuation analysis layer of a feedback analysis model, and outputting frequency domain attenuation scores, wherein the frequency domain attenuation scores are used for quantifying the cumulative effect of the frequency domain energy loss.
The obtained frequency domain energy attenuation characteristics comprise the fundamental frequency energy attenuation slope, the harmonic energy diffuseness and the inter-band energy transfer coefficient. For example, suppose the feature vector of a segment at a certain moment is [0.05, 0.6, 0.2], whose elements correspond to these three features in order. The vector is input to the frequency domain attenuation analysis layer of the feedback analysis model for evaluation. For example, a fundamental frequency energy attenuation slope below 0.03 indicates a low cumulative effect, 0.03-0.08 medium, and greater than 0.08 high; the value of 0.05 in this example is medium. A harmonic energy diffuseness below 0.4 is low, 0.4-0.8 medium, and greater than 0.8 high; the value of 0.6 in this example is medium. An inter-band energy transfer coefficient below 0.1 is low, 0.1-0.3 medium, and greater than 0.3 high; the value of 0.2 in this example is medium.
The frequency domain attenuation analysis layer calculates the frequency domain attenuation score through a comprehensive evaluation algorithm (such as a weighted average of the three indexes, with a fundamental frequency energy attenuation slope weight of 0.3, a harmonic energy diffuseness weight of 0.4 and an inter-band energy transfer coefficient weight of 0.3) as 0.05×0.3+0.6×0.4+0.2×0.3=0.315 (the subsequent examples use an approximate value of 0.33), which quantifies the cumulative effect of the frequency domain energy loss.
S303, inputting the phase shift correlation characteristic to a phase synchronization analysis layer of a feedback analysis model, and outputting a phase out-of-step score, wherein the phase out-of-step score is used for quantifying the damage degree of a phase synchronization state.
In this embodiment, the obtained phase offset correlation characteristic includes a time-frequency phase difference, a phase mutation point density, and a cross-band phase synchronization rate. If a certain segment of feature vector is [0.4, 0.1, 0.7] at a certain moment, the feature vectors respectively correspond to the feature values.
The vector is input to the phase synchronization analysis layer of the feedback analysis model to evaluate each feature. For example, a time-frequency phase difference degree of 0-0.3 indicates a low degree of damage, 0.3-0.6 medium, and greater than 0.6 high; the value of 0.4 in this example is medium. A phase mutation point density of 0-0.05 is low, 0.05-0.15 medium, and greater than 0.15 high; the value of 0.1 in this example is medium. A cross-band phase synchronization rate of 0.8-1 indicates a low degree of damage, 0.6-0.8 medium, and less than 0.6 high; the value of 0.7 in this example is medium.
The phase synchronization analysis layer calculates the phase out-of-step score to be 0.4×0.4+0.1×0.3+0.7×0.3=0.4 through a comprehensive evaluation algorithm (such as weighted average of three indexes, namely, time-frequency phase difference degree weight of 0.4, phase mutation point density weight of 0.3 and cross-frequency band phase synchronization rate weight of 0.3), and quantifies the phase synchronization destruction degree.
S304, constructing a dynamic weight distribution strategy based on the time domain distortion scores, the frequency domain attenuation scores and the phase out-of-step scores, and generating a dynamic distortion feedback coefficient set, wherein the dynamic weight distribution strategy adjusts the contribution weight of each score according to a real-time transmission environment.
As a possible implementation, assume that the current real-time transmission environment is stable with little noise interference, and that the system, after analyzing the environmental factors, determines a time domain distortion scoring weight of 0.3, a frequency domain attenuation scoring weight of 0.3 and a phase out-of-step scoring weight of 0.4. The previously calculated time domain distortion score of 0.3, frequency domain attenuation score of 0.33 and phase out-of-step score of 0.4 are then combined according to this weight distribution strategy. The first coefficient in the dynamic distortion feedback coefficient set (corresponding to time domain distortion) is 0.3×0.3=0.09, the second coefficient (corresponding to frequency domain attenuation) is 0.33×0.3=0.099, and the third coefficient (corresponding to phase out-of-step) is 0.4×0.4=0.16. The three coefficients are combined into the vector [0.09, 0.099, 0.16], which forms the dynamic distortion feedback coefficient set. The coefficients in the set are obtained after the contribution weight of each score is adjusted according to the real-time transmission environment, and they quantify the distortion fluctuation state of the transmission channel in different aspects at the current moment.
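The construction of the dynamic distortion feedback coefficient set can be sketched as an element-wise product of scores and weights, using the values of this example. The function name is illustrative.

```python
def feedback_coefficients(scores, weights):
    """Scale each score by its contribution weight, keeping one value per score."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return [score * weight for score, weight in zip(scores, weights)]

# Order: time domain distortion, frequency domain attenuation, phase out-of-step.
scores = [0.3, 0.33, 0.4]
weights = [0.3, 0.3, 0.4]
coefficient_set = feedback_coefficients(scores, weights)
```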
In a possible embodiment, step S304 may be implemented by the following steps S3041 to S3044, which are described in detail below.
S3041, detecting the environmental noise level of a current transmission channel, and determining a first weight of a time domain distortion score, a second weight of a frequency domain attenuation score and a third weight of a phase out-of-step score according to the environmental noise level.
As an example, the ambient noise level of the current transmission channel may be detected by a dedicated noise detection device. It is assumed that the detected ambient noise level is represented by a value, and the detected ambient noise level is 30 (the value is obtained by comprehensively calculating various factors such as noise intensity in a certain time). For example, a rule for determining weights according to the ambient noise level may be set in advance. When the environmental noise level is between 0 and 20, the first weight of the time domain distortion score is set to be 0.3, the second weight of the frequency domain attenuation score is set to be 0.3, the third weight of the phase step-out score is set to be 0.4, when the environmental noise level is between 21 and 40, the first weight is set to be 0.35, the second weight is set to be 0.25, the third weight is set to be 0.4, and when the environmental noise level is greater than 40, the first weight is set to be 0.4, the second weight is set to be 0.2, and the third weight is set to be 0.4. Since the current ambient noise level is 30 and is in the 21-40 interval, the first weight of the time domain distortion score is determined to be 0.35, the second weight of the frequency domain attenuation score is determined to be 0.25, and the third weight of the phase out-of-step score is determined to be 0.4.
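The preset rule that maps the ambient noise level to the three weights can be sketched as a simple lookup. The function name is hypothetical, and treating levels between 20 and 21 as part of the middle interval is an assumption, since the example only lists the intervals 0-20, 21-40 and greater than 40.

```python
def weights_for_noise_level(noise_level):
    """Return (first, second, third) weights for the given ambient noise level."""
    if noise_level < 0:
        raise ValueError("noise level must be non-negative")
    if noise_level <= 20:
        return (0.3, 0.3, 0.4)
    if noise_level <= 40:  # assumption: levels between 20 and 21 fall here
        return (0.35, 0.25, 0.4)
    return (0.4, 0.2, 0.4)

first_weight, second_weight, third_weight = weights_for_noise_level(30)
```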
S3042, increasing the weight of the time domain distortion score and decreasing the weight of the phase out-of-step score when the ambient noise level exceeds a first threshold.
The first threshold is set to 40, and the current environmental noise level is 30, and the first threshold is not exceeded, so the weight is not adjusted temporarily. If at some point the detected ambient noise level rises to 45, the first threshold is exceeded. At this time, according to the adjustment rule preset by the system, the weight of the time domain distortion score is increased and the weight of the phase out-of-step score is reduced. The adjustment rule is assumed to be such that when the ambient noise level exceeds a first threshold, the time domain distortion scoring weight increases by 0.05 and the phase out-of-step scoring weight decreases by 0.05. Then the first weight of the adjusted time domain distortion score becomes 0.4 (0.35 + 0.05), the third weight of the phase out-of-sync score becomes 0.35 (0.4-0.05), and the second weight of the frequency domain attenuation score remains 0.25.
S3043, when a periodic interference signal exists in the transmission channel, increasing the weight of the frequency domain attenuation score and dynamically adjusting the weight of the time domain distortion score.
In this embodiment, suppose a signal analysis algorithm detects a periodic interference signal in the transmission channel, for example one that recurs periodically at a frequency of about 500 Hz. The weight adjustment strategy may be preset; for example, when a periodic interference signal is detected, the frequency domain attenuation scoring weight is increased by 0.1, and the time domain distortion scoring weight is dynamically adjusted according to the interference signal intensity. Assume that the interference signal intensity is moderate, so the time domain distortion scoring weight is reduced by 0.05 according to the preset rule.
The second weight of the frequency domain attenuation score, originally 0.25, becomes 0.35 after the increase; the first weight of the time domain distortion score, 0.4 after the adjustment in step S3042, becomes 0.35 after the reduction; and the third weight of the phase out-of-step score remains 0.35.
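The two weight adjustments of steps S3042 and S3043 can be sketched together as follows. The function name and parameter names are hypothetical, the default deltas are the values assumed in the examples, and the sketch performs no renormalization because the text does not describe one.

```python
def adjust_weights(weights, noise_level, periodic_interference,
                   first_threshold=40, noise_delta=0.05,
                   freq_delta=0.1, time_delta=0.05):
    """Adjust (first, second, third) weights for noise and periodic interference."""
    first, second, third = weights
    if noise_level > first_threshold:
        # S3042: emphasize time domain distortion, de-emphasize phase out-of-step.
        first += noise_delta
        third -= noise_delta
    if periodic_interference:
        # S3043: emphasize frequency domain attenuation; the time domain delta
        # corresponds to a moderate interference intensity in the example.
        second += freq_delta
        first -= time_delta
    return (first, second, third)

after_noise = adjust_weights((0.35, 0.25, 0.4), noise_level=45,
                             periodic_interference=False)
after_both = adjust_weights((0.35, 0.25, 0.4), noise_level=45,
                            periodic_interference=True)
```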
S3044, carrying out weighted summation on the time domain distortion score, the frequency domain attenuation score and the phase out-of-step score based on the adjusted first weight, the adjusted second weight and the adjusted third weight to generate the dynamic distortion feedback coefficient set.
As a possible implementation manner, it is assumed that after the adjustment in the previous step, the first weight of the time domain distortion score is 0.35, the second weight of the frequency domain attenuation score is 0.35, and the third weight of the phase step-out score is 0.3. The time domain distortion score is known to be 0.3, the frequency domain decay score is known to be 0.33, and the phase out-of-step score is known to be 0.4. The weighted sum calculation is performed with a first coefficient (corresponding to time domain distortion) of 0.3×0.35=0.105, a second coefficient (corresponding to frequency domain attenuation) of 0.33×0.35= 0.1155, and a third coefficient (corresponding to phase step out) of 0.4×0.3=0.12 in the dynamic distortion feedback coefficient set. The three coefficients are combined into a vector [0.105, 0.1155, 0.12], generating a set of dynamic distortion feedback coefficients. The set is obtained by weighting and summing the scores after the weight is adjusted according to different conditions of the current transmission channel, and the distortion fluctuation state of the transmission channel at the current moment can be reflected more accurately.
In one possible embodiment, step S40 may be implemented by the following steps S401 to S405, which are described in detail below.
S401, verifying the distortion identification accuracy of the feedback analysis model on the real-time transmission waveform data according to the dynamic distortion feedback coefficient set and the labeling distortion feedback coefficient set, and further acquiring a historical transmission waveform data set if the distortion identification accuracy reaches a verification threshold, wherein the historical transmission waveform data set comprises a historical time-frequency characteristic set carrying distortion labeling data, and the distortion labeling data comprises a distortion type and distortion probability.
As one example, in an exemplary music audio transmission system, a large amount of historical transmission waveform data is accumulated over a long period of time, covering different audio content and transmission environments. Such as selecting 1000 pieces of waveform data from previously transmitted music audio as a historical transmission waveform data set.
Each section of waveform data is analyzed and labeled in detail. First, the time-frequency features are extracted to obtain the historical time-frequency feature set. Taking one section of data as an example, a method similar to S201-S204 is used to extract the time domain waveform distortion characteristics (such as a waveform amplitude fluctuation rate of 0.25, a zero-crossing offset of 0.15 and a segment energy difference of 0.3), the frequency domain energy attenuation characteristics (such as a fundamental frequency energy attenuation slope of 0.04, a harmonic energy diffuseness of 0.5 and an inter-band energy transfer coefficient of 0.18) and the phase offset correlation characteristics (such as a time-frequency phase difference degree of 0.3, a phase mutation point density of 0.08 and a cross-band phase synchronization rate of 0.6), which are combined into a historical time-frequency feature set.
Meanwhile, the distortion type and the distortion probability corresponding to each feature set are labeled. For example, the distortion type may be "slight phase out-of-step and partial frequency domain energy attenuation" with a distortion probability of "60%", as judged by expert analysis or according to a preset standard. Processing all 1000 sections of data in this way forms a complete historical transmission waveform data set, which contains rich labeling information and lays a foundation for subsequent training.
S402, inputting the historical time-frequency characteristic set into the feedback analysis model, and outputting predicted distortion data, wherein the predicted distortion data comprises a predicted distortion type and a predicted distortion probability.
After the historical transmission waveform data set is obtained, the historical time-frequency characteristic set it contains carries feature information in multiple dimensions, such as time domain waveform distortion features, frequency domain energy attenuation features and phase offset correlation features, which were extracted and combined into feature sets in the preceding steps. For example, a historical time-frequency characteristic set may be a multi-dimensional vector composed of the waveform amplitude fluctuation rate, zero crossing offset and segmentation energy difference from the time domain waveform distortion features; the fundamental frequency energy attenuation slope, harmonic energy diffusion and inter-band energy transfer coefficient from the frequency domain energy attenuation features; and the time-frequency phase difference, phase mutation point density and cross-band phase synchronization rate from the phase offset correlation features.
The multi-dimensional historical time-frequency characteristic set is input into the feedback analysis model. The feedback analysis model has been preliminarily configured and trained, and its internal structure comprises a plurality of analysis layers, such as a time domain distortion analysis layer, a frequency domain attenuation analysis layer and a phase synchronization analysis layer. Within the feedback analysis model, each analysis layer performs targeted analysis on the corresponding part of the input historical time-frequency characteristic set.
The time domain distortion analysis layer analyzes time domain waveform distortion characteristics in the historical time-frequency characteristic set to obtain a comprehensive evaluation result of time domain waveform distortion.
The frequency domain attenuation analysis layer performs a similar analysis on the frequency domain energy attenuation features to obtain a comprehensive evaluation result of the frequency domain energy loss.
The phase synchronization analysis layer analyzes the phase shift correlation characteristics to further obtain a comprehensive evaluation result of phase synchronization damage.
After processing by each analysis layer, the comprehensive evaluation results are integrated to output the predicted distortion data. The predicted distortion data includes a predicted distortion type and a predicted distortion probability. The predicted distortion type may be a combination of distortion conditions, such as slight phase desynchronization together with partial frequency domain energy attenuation. The predicted distortion probability is a specific value representing the probability that the corresponding distortion type occurs in the audio signal, for example 70%.
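The layered analysis described above can be sketched as follows, under strong simplifying assumptions: each analysis layer is reduced to a single weighted sum followed by a sigmoid, the integration step is a small linear head with a softmax over three distortion types, and all parameter values are arbitrary placeholders. The function names and the parameter layout are hypothetical; the embodiment does not specify the internal form of the analysis layers.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def analysis_layer(features, weights, bias):
    """Reduce one feature group to a single evaluation score in (0, 1)."""
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)


def predict(feature_set, params):
    """Return (type probabilities, distortion probability) for nine features."""
    # Each layer sees only its own three features (assumed ordering:
    # time domain, frequency domain, phase).
    time_score = analysis_layer(feature_set[0:3], *params["time"])
    freq_score = analysis_layer(feature_set[3:6], *params["freq"])
    phase_score = analysis_layer(feature_set[6:9], *params["phase"])
    scores = [time_score, freq_score, phase_score]
    # Integration: one row of weights per distortion type.
    logits = [sum(w * s for w, s in zip(row, scores)) for row in params["type_head"]]
    type_probs = softmax(logits)
    prob_w, prob_b = params["prob_head"]
    distortion_prob = sigmoid(sum(w * s for w, s in zip(prob_w, scores)) + prob_b)
    return type_probs, distortion_prob


# Placeholder parameters, chosen arbitrarily for illustration.
params = {
    "time": ([1.0, 1.0, 1.0], -0.5),
    "freq": ([1.0, 1.0, 1.0], -0.5),
    "phase": ([1.0, 1.0, -1.0], 0.0),
    "type_head": [[0.2, 0.2, 1.0], [0.2, 1.0, 0.2], [1.0, 0.2, 0.2]],
    "prob_head": ([1.0, 1.0, 1.0], -1.5),
}
type_probs, distortion_prob = predict(
    [0.25, 0.15, 0.3, 0.04, 0.5, 0.18, 0.3, 0.08, 0.6], params
)
```

Splitting the input by feature group keeps each layer's evaluation interpretable on its own before the results are integrated.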
Step S403, calculating a mean square error loss between the predicted distortion data and the distortion labeling data.
The distortion labeling data is information labeled after each section of waveform data is analyzed in detail when the historical transmission waveform data set is constructed, and comprises distortion types and distortion probabilities. In order to accurately calculate the error between the predicted distortion data and the distortion annotation data, the distortion type and the distortion probability need to be processed respectively.
The distortion type is categorical information and therefore needs to be quantized. Different distortion types may be converted into vector representations by one-hot encoding. For example, assuming there are three distortion types, namely slight phase desynchronization, partial frequency domain energy attenuation, and severe time domain waveform distortion, they are represented by the vectors [1, 0, 0], [0, 1, 0], and [0, 0, 1], respectively. After the predicted distortion type and the labeled distortion type are converted into this vector form, they can be compared numerically.
The distortion probability is already a numerical value, so the predicted distortion probability is compared directly with the labeled distortion probability.
Next, the mean square error loss is calculated. First, the square of the difference between each pair of corresponding elements of the predicted distortion type vector and the labeled distortion type vector is calculated. For example, if the predicted distortion type vector is [0.8, 0.1, 0.1] and the labeled distortion type vector is [1, 0, 0], the squared element differences are (0.8-1)^2 = 0.04, (0.1-0)^2 = 0.01, and (0.1-0)^2 = 0.01, respectively.
Then, the square of the difference between the predicted distortion probability and the distortion labeling probability is calculated.
All the squared values obtained are then added together, and the sum is divided by the total number of elements to obtain the mean square error loss. The mean square error loss reflects the overall error between the predicted distortion data and the distortion labeling data: the smaller the loss, the closer the prediction of the model is to the real situation.
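The calculation above can be illustrated with a short Python sketch. It reuses the example type vector [0.8, 0.1, 0.1], and, as an assumption made only for this illustration, pairs it with the example predicted probability of 70% and labeled probability of 60% mentioned earlier. The type labels and function names are hypothetical.

```python
# Illustrative labels for the three example distortion types.
DISTORTION_TYPES = [
    "slight phase desynchronization",
    "partial frequency domain energy attenuation",
    "severe time domain waveform distortion",
]


def one_hot(distortion_type):
    """Encode a distortion type as a one-hot vector over DISTORTION_TYPES."""
    return [1.0 if t == distortion_type else 0.0 for t in DISTORTION_TYPES]


def mse_loss(pred_type_vec, pred_prob, label_type_vec, label_prob):
    """Mean of the squared differences over the type vector and the probability."""
    squared = [(p - y) ** 2 for p, y in zip(pred_type_vec, label_type_vec)]
    squared.append((pred_prob - label_prob) ** 2)
    return sum(squared) / len(squared)


label_vec = one_hot("slight phase desynchronization")  # [1.0, 0.0, 0.0]
loss = mse_loss([0.8, 0.1, 0.1], 0.70, label_vec, 0.60)
# Squared terms: 0.04, 0.01, 0.01 and 0.01; their mean is 0.07 / 4 = 0.0175.
```

Here the total number of elements is four: three type-vector elements plus the probability.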
Step S404, carrying out iterative optimization on parameters of the initial feedback analysis model based on the mean square error loss to generate a dynamic feedback enhancement model.
After the mean square error loss is obtained, the parameters of the initial feedback analysis model are adjusted according to the loss value so as to improve the prediction accuracy of the model. This can be done with the backpropagation algorithm.
The backpropagation algorithm propagates the error backward from the output layer toward the input layer and reduces the mean square error loss by adjusting the weights and biases of the neurons in the model. The specific steps are as follows:
First, starting from the output layer, the gradient at the output layer neurons is calculated from the mean square error loss. The gradient represents the rate of change of the loss function with respect to the weights and biases of the output layer neurons. By the chain rule, the gradient of the loss function is then propagated backward layer by layer to the preceding hidden layers and the input layer.
In each layer, the weights and biases of the neurons of that layer are updated according to the magnitude and direction of the gradient. The weight update formula is typically: new weight = old weight - learning rate × gradient. The learning rate is a preset parameter that controls the step size of each update. If the learning rate is too high, the model may overshoot the optimal solution; if it is too low, the model converges slowly.
The bias is updated in a similar manner to the weight, and is also adjusted according to the gradient.
After one round of backpropagation and parameter updating is completed, the historical time-frequency characteristic set is input into the updated model again, and new predicted distortion data and a new mean square error loss are calculated. This process is repeated, and the iterative optimization continues until the mean square error loss falls below a preset threshold or the maximum number of iterations is reached.
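The iterative procedure can be sketched with a deliberately small model. The fragment below is an illustration only: it assumes a single sigmoid neuron that predicts the distortion probability, so backpropagation reduces to one application of the chain rule. The training samples, the learning rate, the loss target and the iteration limit are placeholder values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(samples, labels, learning_rate=0.5, loss_target=1e-3, max_iters=5000):
    """Gradient descent on the mean square error of a single sigmoid neuron."""
    n = len(samples)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    history = []
    for _ in range(max_iters):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        loss = 0.0
        for x, y in zip(samples, labels):
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = pred - y
            loss += err ** 2
            # Chain rule: d(err^2)/dz = 2 * err * pred * (1 - pred)
            dz = 2.0 * err * pred * (1.0 - pred)
            for i, xi in enumerate(x):
                grad_w[i] += dz * xi
            grad_b += dz
        loss /= n
        history.append(loss)
        if loss <= loss_target:  # stop once the loss is small enough
            break
        # new weight = old weight - learning rate * gradient (averaged over samples)
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias, history


# Two made-up feature vectors with labeled distortion probabilities.
samples = [[0.25, 0.15, 0.3], [0.05, 0.02, 0.1]]
labels = [0.6, 0.1]
weights, bias, history = train(samples, labels)
```

The `history` list records the loss at each iteration, which makes it easy to check that the loss is decreasing and which of the two stopping conditions ended the loop.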
After multiple iterative optimization, when the mean square error loss meets the preset condition, the feedback analysis model becomes a dynamic feedback enhancement model. The dynamic feedback enhancement model is trained and optimized by a large amount of historical data, and compared with the step S130, the dynamic feedback enhancement model is additionally provided with functional branches, so that potential distortion information in a transmission waveform can be more accurately identified. That is, the dynamic feedback enhancement model has stronger adaptability and accuracy, which can more accurately predict the distortion type and the distortion probability from the input real-time transmission waveform data.
In one possible embodiment, step S50 may be implemented by steps S501-S505 described below.
S501, dividing the real-time transmission waveform data into a plurality of verification time windows, wherein each verification time window comprises a time domain waveform sequence and a frequency domain spectrum sequence with preset duration.
As an example, the real-time transmission waveform data is partitioned for ease of analysis and verification, with the duration of each verification time window preset to 1 second. For example, when the real-time waveform data of the music audio being transmitted is divided starting from the 10th second, the first verification time window contains the time domain waveform sequence and the frequency domain spectrum sequence for the 1-second duration of the 10th second. The time domain waveform sequence is the amplitude data of the audio signal acquired within that 1 second at a sampling frequency of 1000 samples per second, that is, the amplitude information of 1000 sampling points. The frequency domain spectrum sequence is obtained by applying a Fourier transform to the 1-second time domain waveform and consists of the energy values of 1000 frequency bands within the frequency range of 20 Hz-20 kHz. Subsequent verification time windows are divided in the same way: the second verification time window contains the data of the 11th second, the third contains the data of the 12th second, and so on.
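The window division can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the Fourier transform is written as a naive discrete Fourier transform so that only the Python standard library is needed, and the spectrum is returned as raw bin magnitudes rather than grouped into the frequency bands of the example.

```python
import cmath


def split_into_windows(samples, sample_rate, start_second=0, window_seconds=1):
    """Cut the waveform into consecutive, non-overlapping verification windows."""
    window_len = sample_rate * window_seconds
    start = start_second * sample_rate
    return [
        samples[offset:offset + window_len]
        for offset in range(start, len(samples) - window_len + 1, window_len)
    ]


def magnitude_spectrum(window):
    """Naive O(n^2) DFT; returns the magnitudes of bins 0 .. n/2."""
    n = len(window)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t, x in enumerate(window))
        spectrum.append(abs(acc))
    return spectrum


# 12 seconds of (silent) signal at 1000 samples per second, divided from t = 10 s.
signal = [0.0] * (12 * 1000)
windows = split_into_windows(signal, sample_rate=1000, start_second=10)
# Each entry of `windows` is one time domain waveform sequence of 1000 points;
# magnitude_spectrum(windows[0]) would give the matching spectrum sequence.
```

In practice a fast Fourier transform routine would replace the naive loop; the quadratic version is used here only to keep the sketch self-contained.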
S502, executing dynamic time-frequency characteristic extraction processing for each verification time window to generate a window-level time-frequency characteristic set.
Taking the first verification time window (the 10th second) as an example, dynamic time-frequency feature extraction is performed as follows. For the time domain waveform sequence, the amplitude variation over the 1000 sampling points is calculated, giving a waveform amplitude fluctuation rate of 0.3; the zero crossings of the waveform are counted, with 50 zero crossings expected and 55 actually observed and an average offset of 0.2, giving a zero crossing offset of 0.2; and the 1-second waveform is divided into 10 segments of 100 points each and the energy difference between the segments is calculated, giving a segmentation energy difference of 0.4. For the frequency domain spectrum sequence, the fundamental frequency energy is observed to fall from 0.1 to 0.08 within the 1 second, giving a fundamental frequency energy attenuation slope of 0.02; the harmonic energy distribution is analyzed, giving a harmonic energy diffusion of 0.5; and the energy transfer between adjacent frequency bands is observed, giving an inter-band energy transfer coefficient of 0.15. For the phase offset correlation features, the time-frequency phase relationship is compared, giving a time-frequency phase difference of 0.35; 10 phase mutation points are counted within the 1 second, giving a phase mutation point density of 0.01; and the phase synchronization among the frequency bands is analyzed, giving a cross-band phase synchronization rate of 0.6. The above features are combined into a window-level time-frequency feature set, such as [0.3, 0.2, 0.4, 0.02, 0.5, 0.15, 0.35, 0.01, 0.6]. A corresponding window-level time-frequency feature set is generated for each verification time window in the same way.
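Parts of this extraction can be sketched in Python. The helper names and their exact definitions are assumptions made for illustration: the attenuation slope is taken as the energy drop per second, and the phase mutation point density as the number of mutation points divided by the number of sampling points, which reproduces the values 0.02 and 0.01 of the example. The remaining feature values are copied from the example rather than computed.

```python
def zero_crossing_count(waveform):
    """Count sign changes between consecutive samples (zero counts as non-negative)."""
    return sum(1 for a, b in zip(waveform, waveform[1:]) if (a < 0) != (b < 0))


def segment_energies(waveform, segment_len=100):
    """Energy (sum of squares) of each consecutive segment of the window."""
    return [sum(x * x for x in waveform[i:i + segment_len])
            for i in range(0, len(waveform), segment_len)]


def attenuation_slope(start_energy, end_energy, duration_s=1.0):
    """Assumed definition: energy drop per second over the window."""
    return (start_energy - end_energy) / duration_s


def mutation_point_density(num_points, num_samples):
    """Assumed definition: mutation points per sampling point."""
    return num_points / num_samples


slope = round(attenuation_slope(0.1, 0.08), 6)       # energy falls from 0.1 to 0.08
density = round(mutation_point_density(10, 1000), 6)  # 10 points in 1000 samples

# Window-level feature set; values other than slope and density are taken
# directly from the example above.
window_features = [0.3, 0.2, 0.4, slope, 0.5, 0.15, 0.35, density, 0.6]
```

`zero_crossing_count` and `segment_energies` supply the raw quantities from which the zero crossing offset and the segmentation energy difference would be derived; how those two features are normalized is not reproduced here.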
S503, invoking a dynamic feedback enhancement model to perform distortion matching analysis on the window level time-frequency characteristic set, and outputting window level distortion probability and distortion type identification.
In this embodiment, the window-level time-frequency feature set [0.3, 0.2, 0.4, 0.02, 0.5, 0.15, 0.35, 0.01, 0.6] generated for the first verification time window may be input into the dynamic feedback enhancement model, which analyzes and matches the input feature set according to the knowledge acquired through training. For example, the dynamic feedback enhancement model may compare the above features against the patterns learned from a large amount of historical distortion data. As a result of this computation, the output window-level distortion probability is 0.6 and the distortion type identifier is 'slight phase desynchronization and partial frequency domain energy attenuation, repairable'. That is, according to the analysis, the audio signal has a 60% distortion probability in this verification time window, and the distortion is a repairable case of slight phase desynchronization and partial frequency domain energy attenuation. Each window-level time-frequency feature set is processed in this way to obtain the corresponding window-level distortion probability and distortion type identifier.
S504, if the window level distortion probability exceeds the second threshold and the distortion type is identified as a repairable type, a repairable distortion mark is generated and a real-time compensation signal injection operation is triggered.
As a possible implementation, assume that the second threshold is set to 0.5. The window-level distortion probability output for the first verification time window is 0.6, which exceeds the second threshold of 0.5, and the distortion type is identified as a repairable type.
In this case, a repairable distortion mark is generated; for example, the log file or a related data record explicitly notes that repairable distortion exists in the verification time window of the 10th second. At the same time, the real-time compensation signal injection operation is triggered. The compensation signal to be injected is calculated according to the type and characteristics of the distortion: for example, the phase value to be adjusted is calculated for the phase desynchronization problem, and the energy value to be supplemented is calculated for the frequency domain energy attenuation problem. The corresponding compensation signal is then generated and injected into the transmission channel in an attempt to repair the distortion and maintain the quality of the audio signal.
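The decision in step S504 can be sketched as a small function. The threshold value follows the example above; the function name, the dictionary keys and the mark string are hypothetical, and the calculation of the compensation signal itself is not shown.

```python
SECOND_THRESHOLD = 0.5  # example value from this embodiment


def check_repairable(window_second, distortion_prob, repairable,
                     threshold=SECOND_THRESHOLD):
    """Return a repairable distortion mark if S504's two conditions hold, else None."""
    if distortion_prob > threshold and repairable:
        return {
            "window_second": window_second,
            "mark": "repairable_distortion",
            "inject_compensation": True,  # caller triggers compensation injection
        }
    return None


# First verification time window of the example: probability 0.6, repairable type.
mark = check_repairable(10, 0.6, repairable=True)
```

Returning a record rather than acting directly keeps the marking (for example, writing to a log) separate from the injection operation that the caller triggers.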
S505, if the window-level distortion probability continuously exceeds a third threshold value and the distortion type identifier is an irreversible type, generating an irreversible distortion alarm identifier and triggering a transmission channel switching instruction.
Assume that the third threshold is set to 0.7. Suppose that, for three consecutive verification time windows starting from the 20th second (the 20th, 21st and 22nd seconds), the window-level distortion probabilities are 0.75, 0.8 and 0.78, respectively, each exceeding the third threshold of 0.7, and the distortion type identifier is 'severe time domain waveform distortion and unrecoverable frequency domain energy loss, irreversible'. In this case, the irreversible distortion alarm identifier is generated, and the irreversible distortion beginning at the 20th second may be highlighted through the human-computer interaction interface or a prominent prompt. At the same time, a transmission channel switching instruction is triggered: a standby transmission channel is started immediately and the audio signal is switched to it for transmission. This avoids the serious degradation of audio quality or the transmission interruption that irreversible distortion of the current transmission channel would otherwise cause, and helps keep the audio signal transmission continuous and of high quality.
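The condition "continuously exceeds" can be sketched as a run-length check over successive windows. The requirement of three consecutive windows is taken from the example above and is an assumption rather than a fixed rule of the embodiment; the function name and the tuple layout are hypothetical.

```python
THIRD_THRESHOLD = 0.7      # example value from this embodiment
REQUIRED_CONSECUTIVE = 3   # assumption, based on the three-window example


def find_irreversible_alarm(windows, threshold=THIRD_THRESHOLD,
                            required=REQUIRED_CONSECUTIVE):
    """windows: iterable of (second, distortion_prob, is_irreversible).

    Returns the second at which the first qualifying run starts, or None.
    """
    run_start = None
    run_len = 0
    for second, prob, irreversible in windows:
        if prob > threshold and irreversible:
            if run_len == 0:
                run_start = second
            run_len += 1
            if run_len >= required:
                return run_start  # caller raises the alarm and switches channel
        else:
            run_len = 0  # the run is broken; start counting again
    return None


alarm_start = find_irreversible_alarm([
    (19, 0.40, False),
    (20, 0.75, True),
    (21, 0.80, True),
    (22, 0.78, True),
])
```

For the example sequence the run begins at the 20th second, which is the point the alarm identifier would report before the switch to the standby channel.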
Fig. 3 is a schematic diagram of a digital audio signal transmission verification system based on dynamic feedback enhancement according to an embodiment of the present application. The digital audio signal transmission verification system comprises a processor, a machine-readable storage medium, an input/output device and the like, wherein the machine-readable storage medium is connected with the processor, the machine-readable storage medium is used for storing programs, instructions or codes, and the processor is used for executing the programs, the instructions or the codes in the machine-readable storage medium so as to realize the digital audio signal transmission verification method.
In summary, in the digital audio signal transmission verification method and system based on dynamic feedback enhancement provided by the embodiments of the present application, real-time transmission waveform data including a time domain waveform sequence and a frequency domain spectrum sequence is collected, and multi-dimensional time-frequency feature extraction is performed to obtain time domain waveform distortion features, frequency domain energy attenuation features and phase offset correlation features. Compared with methods that focus on a single-dimensional feature, this captures the complex changes of an audio signal during transmission more comprehensively and accurately. A dynamic distortion feedback coefficient set is then generated based on a preset feedback analysis model to quantify the distortion fluctuation state of the transmission channel, which overcomes the limitation of static analysis and adapts dynamically to changes in the characteristics of the transmission channel. Finally, a dynamic feedback enhancement model is obtained through training; by virtue of its dynamic feedback mechanism, it identifies potential distortion information in a transmission waveform more sensitively and accurately than a fixed model. In addition, the finally generated signal integrity verification result includes the repairable distortion mark and the irreversible distortion alarm mark, which provide clear guidance for subsequent processing, improve the accuracy and effectiveness of audio signal transmission verification, and thereby improve the transmission quality of digital audio signals.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments. The embodiments, implementation modes and related technical features of the application may be combined with or substituted for one another where no conflict arises. The foregoing is only a preferred embodiment of the present application and is not intended to limit the present application in any way; any simple modification, equivalent variation or modification made to the above embodiment according to the technical substance of the present application still falls within the scope of the technical solution of the present application.