CN119049482A - Scene audio decoding method and electronic equipment - Google Patents
- Publication number: CN119049482A
- Application number: CN202310610533.0A
- Authority: CN (China)
- Prior art keywords: audio signal, signal, order, energy, reconstructed
- Legal status: Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/008—Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Signal Processing (AREA)
- Acoustics & Sound (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Stereophonic System (AREA)
Abstract
The application provides a scene audio decoding method and an electronic device. The decoding method includes: receiving a first code stream; decoding the first code stream to obtain a first reconstructed signal, attribute information of a target virtual speaker, and a higher-order energy gain encoding result; generating a virtual speaker signal corresponding to the target virtual speaker based on the attribute information of the target virtual speaker and the first audio signal; reconstructing, based on the attribute information of the target virtual speaker and the virtual speaker signal, a first reconstructed scene audio signal; determining a dispersion factor of the first reconstructed scene audio signal according to the first code stream; determining an attenuation factor according to the frequency band index of the reconstructed signal in the first reconstructed scene audio signal and/or the order of the first reconstructed scene audio signal; and adjusting the first reconstructed scene audio signal according to the higher-order energy gain encoding result, the attenuation factor, and the dispersion factor to obtain a reconstructed scene audio signal. Compared with the prior art, the application achieves a lower code rate at the same quality and can improve the decoding quality of the audio signal.
Description
Technical Field
Embodiments of the application relate to the field of audio encoding and decoding, and in particular to a scene audio decoding method and an electronic device.
Background
Three-dimensional audio technology acquires, processes, transmits, renders, and plays back sound events and three-dimensional sound field information from the real world by means of computers, signal processing, and the like. Three-dimensional audio gives sound a strong sense of space, envelopment, and immersion, creating for the listener the striking auditory experience of being personally on the scene. Among such technologies, higher-order ambisonics (HOA) is independent of the speaker layout during the recording, encoding, and playback stages, and HOA-format data can be rotated during playback, giving it high flexibility for three-dimensional audio playback; it has therefore attracted broad attention and study.
For an N-order HOA signal, the corresponding number of channels is (N+1)². As the HOA order increases, the HOA signal carries information recording more detailed sound scenes, but its data volume also increases, and the large amount of data makes transmission and storage difficult, so the HOA signal needs to be encoded and decoded. However, the prior art reconstructs the HOA signal with low accuracy.
Disclosure of Invention
The application provides scene audio encoding and decoding methods and an electronic device.
In a first aspect, an embodiment of the present application provides a scene audio encoding method. The method includes: obtaining a scene audio signal to be encoded, where the scene audio signal includes audio signals of C1 channels and C1 is a positive integer; obtaining attribute information of a target virtual speaker corresponding to the scene audio signal; obtaining a higher-order energy gain of the scene audio signal; encoding the higher-order energy gain to obtain a higher-order energy gain encoding result; and encoding a first audio signal in the scene audio signal, the attribute information of the target virtual speaker, and the higher-order energy gain encoding result to obtain a first code stream, where the first audio signal is an audio signal of K channels in the scene audio signal and K is a positive integer less than or equal to C1.
In a possible manner, the scene audio signal is an N1-order higher-order ambisonics (HOA) signal; the N1-order HOA signal includes a second audio signal, where the second audio signal is the audio signal in the N1-order HOA signal other than the first audio signal, and C1 is equal to (N1+1)²; and obtaining the higher-order energy gain of the scene audio signal includes obtaining the higher-order energy gain according to characteristic information of the second audio signal and characteristic information of the first audio signal.
Illustratively, "the N1-order HOA signal comprises the second audio signal" may be understood to mean that the N1-order HOA signal comprises only the second audio signal, or to mean that it comprises the second audio signal together with other audio signals.
For example, the first audio signal may be referred to as the low-order portion of the scene audio signal and the second audio signal as the high-order portion. That is, the low-order portion of the scene audio signal and part of its high-order portion may be encoded.
It should be appreciated that when only the first audio signal of the N1-order HOA signal is encoded, the number of encoded channels is smaller and the corresponding code rate is lower than when the second audio signal is encoded as well.
In a possible manner, obtaining the higher-order energy gain according to the characteristic information of the second audio signal and the characteristic information of the first audio signal includes obtaining the energy gain of the first audio signal and the energy gain of the second audio signal, and obtaining the higher-order energy gain from the two energy gains.
In a possible manner, obtaining the higher-order energy gain from the energy gain of the first audio signal and the energy gain of the second audio signal includes obtaining the higher-order energy gain Gain’(i,b) by:
Gain’(i,b)=10*log10(E(i,b)/E(1,b));
where log10 denotes the base-10 logarithm, * denotes multiplication, E(1,b) is the channel energy of the b-th frequency band of the first audio signal, E(i,b) is the energy of the i-th channel in the b-th frequency band of the second audio signal, i is the index of the i-th channel of the second audio signal, and b is the frequency band index of the second audio signal.
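As an illustration, here is a minimal sketch of this per-band gain computation. It assumes frequency-domain band energies obtained by summing squared coefficient magnitudes, and a single reference channel supplying E(1,b); both are assumptions, since the passage does not fix how the energies are computed.

```python
import numpy as np

def higher_order_energy_gain(first_sig, second_sig, band_edges, eps=1e-12):
    """Per-band higher-order energy gain Gain'(i, b) in dB.

    first_sig:  (num_bins,) frequency-domain coefficients of the reference
                channel of the first audio signal (assumption: one channel).
    second_sig: (num_channels, num_bins) coefficients of the second
                (higher-order) audio signal.
    band_edges: list of (start, stop) bin ranges defining the bands.
    """
    gains = np.zeros((second_sig.shape[0], len(band_edges)))
    for b, (lo, hi) in enumerate(band_edges):
        # E(1, b): channel energy of the b-th band of the first audio signal.
        e_first = np.sum(np.abs(first_sig[lo:hi]) ** 2) + eps
        for i in range(second_sig.shape[0]):
            # E(i, b): energy of the i-th channel in the b-th band.
            e_second = np.sum(np.abs(second_sig[i, lo:hi]) ** 2) + eps
            # Gain'(i, b) = 10 * log10(E(i, b) / E(1, b))
            gains[i, b] = 10.0 * np.log10(e_second / e_first)
    return gains
```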
In a possible manner, the encoding the higher-order energy gain to obtain a higher-order energy gain encoding result includes quantizing the higher-order energy gain to obtain a quantized higher-order energy gain, and entropy encoding the quantized higher-order energy gain to obtain the higher-order energy gain encoding result.
It should be noted that the position of the target virtual speaker matches the position of the sound source in the scene audio signal. A virtual speaker signal corresponding to the target virtual speaker can be generated from the attribute information of the target virtual speaker and the first audio signal in the scene audio signal, and the scene audio signal can be reconstructed from the virtual speaker signal and the higher-order energy gain encoding result. Therefore, the encoding end encodes the first audio signal in the scene audio signal, the attribute information of the target virtual speaker, and the higher-order energy gain encoding result together and sends them to the decoding end, and the decoding end can reconstruct the scene audio signal based on the decoded first reconstructed signal (that is, the reconstructed signal of the first audio signal in the scene audio signal), the attribute information of the target virtual speaker, and the higher-order energy gain encoding result.
Compared with other scene-audio reconstruction methods in the prior art, a scene audio signal reconstructed based on virtual speaker signals has higher audio quality; therefore, when K equals C1, the reconstructed scene audio signal has higher audio quality at the same code rate.
When K is smaller than C1, the number of channels encoded by this method is smaller than in the prior art, and the data size of the attribute information of the target virtual speaker is far smaller than that of one channel of audio signal; therefore, the coding rate is lower at the same quality.
In addition, whereas the prior art converts the scene audio signal into a virtual speaker signal and a residual signal before encoding, the encoding end here directly encodes the first audio signal in the scene audio signal, with no need to compute virtual speaker signals or residual signals, so the encoding complexity of the encoding end is lower.
The scene audio signals in the embodiments of the application are, for example, signals that describe a sound field. A scene audio signal may include an HOA signal (which may be a three-dimensional HOA signal or a two-dimensional HOA signal, also called a planar HOA signal) or a three-dimensional audio signal, where a three-dimensional audio signal refers to a scene audio signal other than an HOA signal.
In one possible approach, K may be equal to C1 when N1 is equal to 1, and K may be less than C1 when N1 is greater than 1. It should be appreciated that K may also be less than C1 when N1 is equal to 1.
By way of example, encoding the first audio signal in the scene audio signal and the attribute information of the target virtual speaker may include operations such as down-mixing, transformation, quantization, and entropy encoding, which the present application does not limit.
For example, the first code stream may include encoded data of the first audio signal in the scene audio signal and encoded data of the attribute information of the target virtual speaker.
In one possible manner, the target virtual speaker may be selected from a plurality of candidate virtual speakers based on the scene audio signal, and the attribute information of the target virtual speaker may then be determined. Illustratively, the virtual speakers (including the candidate virtual speakers and the target virtual speaker) are virtual rather than physically existing speakers.
For example, the plurality of candidate virtual speakers may be uniformly distributed on the sphere, and the number of target virtual speakers may be one or more.
In one possible manner, a preset target virtual speaker may be acquired, and then attribute information of the target virtual speaker may be determined.
It should be understood that the application is not limited in the manner in which the target virtual speaker is determined.
In a second aspect, an embodiment of the present application provides a scene audio decoding method. The method includes: receiving a first code stream; decoding the first code stream to obtain a first reconstructed signal, attribute information of a target virtual speaker, and a higher-order energy gain encoding result, where the first reconstructed signal is a reconstructed signal of a first audio signal in the scene audio signal, the scene audio signal includes audio signals of C1 channels, the first audio signal is an audio signal of K channels in the scene audio signal, C1 is a positive integer, and K is a positive integer less than or equal to C1; generating a virtual speaker signal corresponding to the target virtual speaker based on the attribute information of the target virtual speaker and the first audio signal; reconstructing, based on the attribute information of the target virtual speaker and the virtual speaker signal, a first reconstructed scene audio signal, where the first reconstructed scene audio signal includes audio signals of C2 channels and C2 is a positive integer; determining a dispersion factor of the first reconstructed scene audio signal according to the first code stream; determining an attenuation factor according to the frequency band index of the reconstructed signal in the first reconstructed scene audio signal and/or the order of the first reconstructed scene audio signal; and adjusting the first reconstructed scene audio signal according to the higher-order energy gain encoding result, the attenuation factor, and the dispersion factor to obtain a reconstructed scene audio signal.
In a possible manner, the scene audio signal is an N1-order HOA signal; the N1-order HOA signal includes a second audio signal, where the second audio signal is the audio signal in the N1-order HOA signal other than the first audio signal, and C1 is equal to (N1+1)²; and/or the first reconstructed scene audio signal is an N2-order HOA signal; the N2-order HOA signal includes a third audio signal, where the third audio signal is the reconstructed signal of the channels of the N2-order HOA signal corresponding to the channels of the second audio signal, and C2 is equal to (N2+1)².
In a possible manner, adjusting the first reconstructed scene audio signal according to the higher-order energy gain encoding result and the attenuation factor includes: entropy decoding the higher-order energy gain encoding result to obtain an entropy-decoded higher-order energy gain; dequantizing the entropy-decoded higher-order energy gain to obtain a higher-order energy gain; obtaining the higher-order energy of the second audio signal according to the channel energy of the first audio signal and the higher-order energy gain; obtaining a decoded energy scale factor according to the channel energy of the third audio signal and the higher-order energy of the second audio signal; determining a dispersion factor of the first reconstructed scene audio signal according to the first code stream; linearly weighting the attenuation factor according to the dispersion factor to obtain a weighted attenuation factor; and adjusting the third audio signal in the N2-order HOA signal according to the weighted attenuation factor and the decoded energy scale factor to obtain an adjusted third audio signal. In the above scheme, the dispersion factor may also be used to linearly weight the attenuation factor, and the weighted attenuation factor is used to adjust the third audio signal. The weighting balances the proportions of the diffuse component and the directional component in the attenuation factor; the dispersion factor measures the energy proportion of the non-directional component in the HOA signal to be encoded and adjusts the per-channel energy of the reconstructed HOA signal, so that the energy-adjusted reconstructed HOA signal is closer in energy to the HOA signal to be encoded.
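As a sketch of the energy steps above: inverting Gain’(i,b) = 10*log10(E(i,b)/E(1,b)) gives the higher-order energy, and the decoded energy scale factor is then derived from it and the third-signal energy. The amplitude-domain ratio used for g(i,b) below is an assumption; this passage does not spell out that formula.

```python
import numpy as np

def decoded_energy_scale_factor(e_first, gain_dec_db, e_third, eps=1e-12):
    """Higher-order energy and decoded energy scale factor (sketch).

    e_first:     E(1, b), per-band channel energy of the first audio
                 signal, shape (bands,).
    gain_dec_db: dequantized higher-order energy gain Gain_dec(i, b) in
                 dB, shape (channels, bands).
    e_third:     per-band channel energy of the third audio signal,
                 shape (channels, bands).
    """
    # Higher-order energy of the second audio signal, by inverting
    # Gain'(i, b) = 10 * log10(E(i, b) / E(1, b)).
    e_high = e_first * np.power(10.0, gain_dec_db / 10.0)
    # Decoded energy scale factor from the third-signal energy and the
    # higher-order energy; the amplitude-domain ratio is an assumption.
    g = np.sqrt(e_high / (e_third + eps))
    return e_high, g
```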
In a possible manner, linearly weighting the attenuation factor according to the dispersion factor to obtain the weighted attenuation factor includes obtaining the weighted attenuation factor gd(i,b) by at least one of the following (a sketch follows this list):
gd(i,b)=w×dispersion(b)+(1-w)×g’(i,b), where w is a preset adjustment ratio threshold, × denotes multiplication, dispersion(b) denotes the dispersion factor of the b-th frequency band of the third audio signal, and g’(i,b) denotes the attenuation factor of the b-th frequency band of the third audio signal;
or, when the attenuation factor is a full-band signal, gd(i,b)=w×mean(dispersion)+(1-w)×g’(i,b), where mean(dispersion) is the average of the dispersion factors over the frequency bands of the third audio signal;
or gd(i,b)=w×dispersion(b)+(1-w)×g’(i,b)+offset(i,b), where offset(i,b) denotes a bias constant for the b-th frequency band on the i-th channel of the third audio signal;
or gd(i,b)=w×dispersion(b)+(1-w)×g’(i,b)+direction(i,b), where direction(i,b) denotes a direction parameter for the b-th frequency band on the i-th channel of the third audio signal.
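As a concrete illustration, a minimal sketch of the first two variants above (the offset and direction variants differ only by an additive per-channel term):

```python
import numpy as np

def weighted_attenuation_factor(g_att, dispersion, w, full_band=False):
    """gd(i, b): linear weighting of the attenuation factor.

    g_att:      attenuation factors g'(i, b), shape (channels, bands).
    dispersion: dispersion factors dispersion(b), shape (bands,).
    w:          preset adjustment ratio threshold in [0, 1].
    """
    # The full-band variant uses the mean dispersion over all bands.
    disp = np.mean(dispersion) if full_band else dispersion
    # gd(i, b) = w * dispersion(b) + (1 - w) * g'(i, b)
    return w * disp + (1.0 - w) * g_att
```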
In a possible manner, adjusting the third audio signal in the N2-order HOA signal according to the weighted attenuation factor and the decoded energy scale factor to obtain the adjusted third audio signal includes obtaining the adjusted third audio signal X’(i,b) by:
X’(i,b)=X(i)×g(i,b)×gd(i,b);
where gd(i,b) denotes the weighted attenuation factor, g(i,b) denotes the decoded energy scale factor, and X(i) denotes the third audio signal.
In the above scheme, since the dispersion factor measures the energy proportion of the non-directional component in the HOA signal to be encoded and the decoded energy scale factor measures the energy proportion of the reconstructed HOA signal, using both to adjust the per-channel energy of the reconstructed HOA signal brings the energy-adjusted reconstruction closer to the energy of the HOA signal to be encoded.
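A short sketch of this adjustment, assuming the per-band gains are applied to all frequency bins of the corresponding band (the band-to-bin layout is an assumption):

```python
import numpy as np

def adjust_third_signal(x_third, g_scale, gd, band_edges):
    """Apply X'(i, b) = X(i) * g(i, b) * gd(i, b) per channel and band.

    x_third:    (channels, num_bins) frequency-domain third audio signal.
    g_scale:    decoded energy scale factors g(i, b), (channels, bands).
    gd:         weighted attenuation factors gd(i, b), (channels, bands).
    band_edges: list of (start, stop) bin ranges for each band.
    """
    x_adj = np.array(x_third, copy=True)
    for b, (lo, hi) in enumerate(band_edges):
        # One combined gain per channel for this band.
        x_adj[:, lo:hi] *= (g_scale[:, b] * gd[:, b])[:, None]
    return x_adj
```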
In a possible manner, determining the attenuation factor according to the frequency band index of the reconstructed signal in the first reconstructed scene audio signal and/or the order of the first reconstructed scene audio signal includes obtaining the attenuation factor according to the frequency band index where the third audio signal is located and/or the order of the N2-order HOA signal. In the above scheme, the decoding end may obtain the attenuation factor from the frequency band index where the third audio signal is located, from the order of the N2-order HOA signal (specifically, the Ambisonic order), or from both together, in which case the attenuation factor may be called a dual attenuation factor.
In a possible manner, obtaining the attenuation factor according to the frequency band index where the third audio signal is located and/or the order of the N2-order HOA signal includes obtaining the attenuation factor g’(i,b) as follows:
where i is the index of the i-th channel of the third audio signal, b is the frequency band index of the third audio signal, α denotes the number of target virtual speakers, β denotes the number of mapping channels of the N2-order HOA signal, β is equal to (M+2)², M is the order of the N2-order HOA signal, γ denotes the order corresponding to channel index i of the third audio signal, and × denotes multiplication.
In the above scheme, the attenuation factor g’(i,b) can be computed accurately from these parameters, and by adjusting them the attenuation factor varies with three factors: the number of target virtual speakers, the number of mapping channels of the HOA signal, and the HOA order. Using the attenuation factor together with the adjusted decoded higher-order energy gain to adjust the third audio signal therefore improves the quality of the reconstructed audio signal.
In a possible manner, the method further includes: when b < d, updating α to α2, α2 = 0.375 × α, where d is a preset first threshold; and when b > d, updating α to α3, α3 = 0.5 × α.
In a possible manner, the method further includes updating α to α4, α4 = α + α × (1 - 1.25 × b)/(bands + 2), where bands denotes the number of frequency bands of the third audio signal.
In the above scheme, when the frequency band index b is smaller than the first threshold, that is, when the b-th band is a low band, the attenuation coefficient is set to a small value, for example 0.375; when b is larger than the threshold, that is, when the b-th band is a high band, the attenuation coefficient is set to a larger value, for example 0.5. The values 0.375 and 0.5 are merely possible example implementations: 0.375 could be replaced by 0.38 or 0.37, and 0.5 by 0.55 or 0.6; the attenuation coefficient is determined by the specific application scenario and is not limited here. With this scheme, the value of α can be flexibly adjusted according to the value of the band index b, so that the attenuation factor, and hence the attenuation effect, becomes more pronounced for higher bands, making the reconstructed audio signal better match the auditory characteristics of the human ear.
In a possible manner, the method further includes: when b ≤ d, updating β to β2, β2 = (1 + w) × β, where d is a preset first threshold and w is a preset adjustment ratio threshold.
In the above scheme, when b is less than or equal to the first threshold, the parameter can likewise be adjusted flexibly according to the value of the band index b, making the attenuation effect more pronounced for higher bands so that the reconstructed audio signal better matches the auditory characteristics of the human ear.
In a possible manner, the method further includes updating w to w2, w2 = w + α × 0.05. In the above scheme, the value of w is updated according to α, establishing a relationship between the weight w in the attenuation factor and the parameter α; as α increases, the attenuation effect of the attenuation factor becomes more pronounced for higher bands, so the reconstructed audio signal better matches the auditory characteristics of the human ear. A combined sketch of these parameter updates follows.
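The α, β, and w updates above are presented as independent possible manners; purely for illustration, a sketch collecting them in one helper (the order in which they are applied is an assumption):

```python
def update_attenuation_parameters(alpha, beta, w, b, d, bands):
    """Parameter updates for the attenuation factor (sketch).

    alpha: number of target virtual speakers (used as a tunable value).
    beta:  number of mapping channels of the N2-order HOA signal.
    w:     preset adjustment ratio threshold.
    b:     frequency band index; d: preset first threshold.
    bands: number of frequency bands of the third audio signal.
    """
    # alpha2 = 0.375 * alpha for low bands, alpha3 = 0.5 * alpha for
    # high bands (0.375 and 0.5 are example values from the text).
    if b < d:
        alpha = 0.375 * alpha
    elif b > d:
        alpha = 0.5 * alpha
    # Alternative update:
    # alpha = alpha + alpha * (1 - 1.25 * b) / (bands + 2)

    # beta2 = (1 + w) * beta when b <= d.
    if b <= d:
        beta = (1.0 + w) * beta
    # w2 = w + alpha * 0.05.
    w = w + alpha * 0.05
    return alpha, beta, w
```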
In one possible approach, the value of γ is 1 when i is 0, 1, 2, or 3; 2 when i is 4, 5, 6, 7, or 8; and 3 when i is 9, 10, 11, 12, 13, 14, or 15. The attenuation factor is determined according to the value of γ; γ is a piecewise function of the HOA order of the channel, and its value increases as the HOA order of channel i increases without exceeding the maximum HOA order, which improves the quality of the reconstructed audio signal when the attenuation factor is used to adjust the third audio signal.
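A direct transcription of this piecewise mapping; the closed-form fallback for channels beyond i = 15 is an extrapolation of the pattern, not something the text states:

```python
import math

def gamma_for_channel(i):
    """Order gamma corresponding to channel index i of the third signal."""
    if 0 <= i <= 3:
        return 1
    if 4 <= i <= 8:
        return 2
    if 9 <= i <= 15:
        return 3
    # Extrapolated continuation of the pattern (assumption).
    return max(1, math.floor(math.sqrt(i)))
```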
In a possible manner, determining the dispersion factor of the first reconstructed scene audio signal according to the first code stream includes decoding the dispersion factor from the first code stream, where the first code stream includes the dispersion factor. In this scheme, the encoding end calculates the dispersion from the HOA signal to be encoded, quantizes it, and encodes it into the first code stream; the decoding end decodes the first code stream, obtains the dispersion by dequantization, and then uses it to adjust the gain of the uncoded channels in the reconstructed HOA signal.
In a possible manner, determining the dispersion factor of the first reconstructed scene audio signal according to the first code stream includes obtaining the dispersion factor from the first reconstructed signal decoded from the first code stream. In this scheme, the dispersion is calculated at the decoding end from the low-order HOA signal in the decoded transport channels and then used for gain adjustment of the uncoded channels in the reconstructed HOA signal.
In a possible manner, obtaining the dispersion factor from the first reconstructed signal decoded from the first code stream includes: obtaining the sound field intensity of each frequency band in the first reconstructed signal; obtaining the energy of each frequency band; and determining the dispersion factor from the sound field intensity of each frequency band and the energy of each frequency band.
In a possible manner, obtaining the sound field intensity of each frequency band in the first reconstructed signal includes obtaining the sound field intensity I(b) of the b-th frequency band by:
I(b)=Re{W*(b)×[X(b),Y(b),Z(b)]};
where Re denotes taking the real part, * denotes complex conjugation, and W(b), X(b), Y(b), and Z(b) denote the four channel signals of the b-th frequency band in the first reconstructed signal.
Determining the dispersion factor from the sound field intensity of each frequency band and the energy of each frequency band includes obtaining the dispersion factor dispersion(b) as follows:
dispersion(b)=1-||E{I(b)}||/E{||M(b)||};
where E denotes the expectation operation, ||E{I(b)}|| denotes the two-norm of E{I(b)}, ||M(b)|| denotes the two-norm of M(b), and M(b) denotes the energy of the b-th frequency band.
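A sketch under a standard first-order reading of these two formulas, with the expectation E{·} taken over frames; the exact normalization of the band energy M(b) is an assumption:

```python
import numpy as np

def dispersion_factors(w_ch, x_ch, y_ch, z_ch, band_edges, eps=1e-12):
    """Per-band dispersion factor from the four channels W, X, Y, Z.

    Each channel is (frames, num_bins) frequency-domain data, so the
    expectation E{.} can be taken over frames.
    """
    disp = np.zeros(len(band_edges))
    for b, (lo, hi) in enumerate(band_edges):
        w = w_ch[:, lo:hi]
        v = np.stack([x_ch[:, lo:hi], y_ch[:, lo:hi], z_ch[:, lo:hi]])
        # I(b) = Re{ W*(b) x [X(b), Y(b), Z(b)] }, averaged over bins.
        intensity = np.real(np.conj(w)[None, :, :] * v).mean(axis=2)
        mean_intensity = intensity.mean(axis=1)  # E{I(b)}, a 3-vector
        # M(b): band energy; the 1/2 (|W|^2 + |X|^2 + |Y|^2 + |Z|^2)
        # normalization is an assumption.
        m = 0.5 * (np.abs(w) ** 2 + (np.abs(v) ** 2).sum(axis=0)).mean()
        # dispersion(b) = 1 - ||E{I(b)}|| / E{||M(b)||}
        disp[b] = 1.0 - np.linalg.norm(mean_intensity) / (m + eps)
    return disp
```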
In a possible manner, after the third audio signal in the N2-order HOA signal is adjusted according to the higher-order energy gain encoding result and the attenuation factor, the method further includes: obtaining the channel energy of a fourth audio signal corresponding to the adjusted third audio signal, where the third audio signal includes the audio signal of the current frame and the fourth audio signal includes the audio signal of the frame preceding the current frame; and adjusting the adjusted third audio signal again according to the channel energy of the fourth audio signal. In this scheme, the decoding end may use the previous frame of the third audio signal to readjust the adjusted third audio signal of the current frame, improving the quality of the reconstructed audio signal.
In a possible manner, readjusting the adjusted third audio signal according to the channel energy of the fourth audio signal includes: obtaining the channel energy average of the fourth audio signal and the channel energy of the adjusted third audio signal; obtaining an energy average threshold from the channel energy average of the fourth audio signal and the channel energy of the adjusted third audio signal; performing a weighted average of the channel energy average of the fourth audio signal and the channel energy of the adjusted third audio signal according to the energy average threshold to obtain a target energy; obtaining an energy smoothing factor from the target energy and the channel energy of the adjusted third audio signal; and adjusting the third audio signal according to the energy smoothing factor. In this scheme, readjusting the adjusted third audio signal with the energy smoothing factor q(i,b) further improves the decoding quality of the third audio signal.
In a possible manner, obtaining the energy average threshold according to the channel energy average of the fourth audio signal and the channel energy of the adjusted third audio signal includes obtaining the energy average threshold k as follows:
where E_mean(i,b) denotes the channel energy average of the fourth audio signal and E’_dec(i,b) denotes the energy of the adjusted third audio signal.
In a possible manner, obtaining the energy smoothing factor according to the target energy and the channel energy of the adjusted third audio signal includes obtaining the energy smoothing factor q(i,b) as follows:
q(i,b)=sqrt(E_target(i,b))/sqrt(E’_dec(i,b));
where E_target(i,b) denotes the target energy and E’_dec(i,b) denotes the energy of the adjusted third audio signal.
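A sketch of this smoothing step. The formula for the energy average threshold k did not survive extraction above, so k is taken as an input here; forming E_target as a k-weighted average of the two energies is inferred from the description.

```python
import numpy as np

def energy_smooth(x_adj, e_mean_prev, e_dec, k, band_edges):
    """Frame-to-frame energy smoothing of the adjusted third signal.

    x_adj:       adjusted third audio signal, (channels, num_bins).
    e_mean_prev: E_mean(i, b), channel energy average of the fourth
                 (previous-frame) audio signal, (channels, bands).
    e_dec:       E'_dec(i, b), energy of the adjusted third signal.
    k:           energy average threshold (taken as given here).
    """
    # Target energy as a k-weighted average (inferred from the text).
    e_target = k * e_mean_prev + (1.0 - k) * e_dec
    # q(i, b) = sqrt(E_target(i, b)) / sqrt(E'_dec(i, b))
    q = np.sqrt(e_target) / np.sqrt(np.maximum(e_dec, 1e-12))
    out = np.array(x_adj, copy=True)
    for b, (lo, hi) in enumerate(band_edges):
        out[:, lo:hi] *= q[:, b][:, None]
    return out
```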
In a possible manner, adjusting the decoded higher-order energy gain of the third audio signal according to the decoded energy scale factor to obtain the adjusted decoded higher-order energy gain includes obtaining the adjusted decoded higher-order energy gain Gain_dec’(i,b) by:
Gain_dec’(i,b)=w*min(g(i,b),Gain_dec(i,b))+(1-w)*g(i,b);
where g(i,b) denotes the decoded energy scale factor, Gain_dec(i,b) denotes the decoded higher-order energy gain of the third audio signal, w is a preset adjustment ratio threshold, min denotes taking the minimum, and * denotes multiplication.
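This adjustment is a one-liner; a sketch:

```python
import numpy as np

def adjust_decoded_gain(g_scale, gain_dec, w):
    """Gain_dec'(i, b) = w * min(g(i, b), Gain_dec(i, b)) + (1 - w) * g(i, b).

    g_scale:  decoded energy scale factors g(i, b), (channels, bands).
    gain_dec: decoded higher-order energy gains Gain_dec(i, b).
    w:        preset adjustment ratio threshold in [0, 1].
    """
    # The min() caps the adjusted gain so it cannot exceed the scale
    # factor derived from the reconstructed signal's own energy.
    return w * np.minimum(g_scale, gain_dec) + (1.0 - w) * g_scale
```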
Compared with other methods for reconstructing scene audio signals in the prior art, a scene audio signal reconstructed based on virtual speaker signals has higher audio quality; therefore, when K equals C1, the reconstructed scene audio signal has higher audio quality at the same code rate.
When K is smaller than C1, the number of channels encoded by this method when encoding the scene audio signal is smaller than in the prior art, and the data size of the attribute information of the target virtual speaker is far smaller than that of one channel of audio signal, so the reconstructed scene audio signal has higher audio quality at the same code rate.
In addition, the virtual speaker signal and residual information transmitted by prior-art coding are converted from the original audio signal (that is, the scene audio signal to be encoded) rather than being the original audio signal itself, which introduces errors. The present application encodes part of the original audio signal (that is, the audio signals of K channels in the scene audio signal to be encoded), avoiding such errors; this improves the audio quality of the decoded reconstructed scene audio signal, avoids fluctuations in its reconstruction quality, and provides high stability.
Furthermore, since the prior art encodes and transmits virtual speaker signals, whose data volume is large, the number of target virtual speakers the prior art can select is limited by bandwidth. The present application encodes and transmits the attribute information of the virtual speakers, whose data volume is far smaller than that of the virtual speaker signals, so the number of selectable target virtual speakers is much less constrained by bandwidth. The more target virtual speakers are selected, the higher the quality of the scene audio signal reconstructed based on their virtual speaker signals. Therefore, at the same code rate, the present application can select more target virtual speakers, so the decoded reconstructed scene audio signal has higher quality.
In addition, since the encoding end and the decoding end of the present application require no residual or superposition operations, their combined complexity is lower than that of the prior-art encoding and decoding ends.
It should be understood that when the encoding end applies lossy compression to the first audio signal in the scene audio signal, the first reconstructed signal obtained by the decoding end differs from the first audio signal encoded by the encoding end; when the encoding end applies lossless compression to the first audio signal, the two are identical.
Likewise, when the encoding end applies lossy compression to the attribute information of the target virtual speaker, the attribute information obtained by the decoding end differs from the attribute information encoded by the encoding end; with lossless compression, the two are identical. In terms of naming, the application does not distinguish the attribute information encoded by the encoding end from the attribute information decoded by the decoding end.
The second aspect and any implementation manner thereof correspond to the first aspect and any implementation manner thereof, respectively. For the technical effects of the second aspect and any implementation manner thereof, refer to the technical effects of the first aspect and the corresponding implementation manner; details are not repeated here.
In a third aspect, an embodiment of the present application provides a code stream generation method, where the code stream may be generated according to the first aspect or any implementation manner thereof.
The third aspect and any implementation manner thereof correspond to the first aspect and any implementation manner thereof, respectively. For the technical effects of the third aspect and any implementation manner thereof, refer to the technical effects of the first aspect and the corresponding implementation manner; details are not repeated here.
In a fourth aspect, an embodiment of the present application provides a scene audio coding apparatus, including:
the acquisition module is used for acquiring a scene audio signal to be encoded, wherein the scene audio signal comprises audio signals of C1 channels, and C1 is a positive integer;
the acquisition module is further used for acquiring attribute information of a target virtual speaker corresponding to the scene audio signal;
the acquisition module is further used for acquiring the higher-order energy gain of the scene audio signal;
the coding module is used for coding the high-order energy gain to obtain a high-order energy gain coding result;
The encoding module is further configured to encode a first audio signal in the scene audio signal, the attribute information of the target virtual speaker, and the higher-order energy gain encoding result to obtain a first code stream, where the first audio signal is an audio signal of K channels in the scene audio signal, and K is a positive integer less than or equal to C1.
The scene audio encoding apparatus of the fourth aspect may perform the steps of the first aspect or any implementation manner thereof; details are not repeated here.
The fourth aspect and any implementation manner thereof correspond to the first aspect and any implementation manner thereof, respectively. For the technical effects of the fourth aspect and any implementation manner thereof, refer to the technical effects of the first aspect and the corresponding implementation manner; details are not repeated here.
In a fifth aspect, an embodiment of the present application provides a scene audio decoding apparatus, including:
The decoding module is used for decoding the first code stream to obtain a first reconstruction signal, attribute information of a target virtual speaker and a high-order energy gain coding result, wherein the first reconstruction signal is a reconstruction signal of a first audio signal in a scene audio signal, the scene audio signal comprises audio signals of C1 channels, the first audio signal is an audio signal of K channels in the scene audio signal, C1 is a positive integer, and K is a positive integer smaller than or equal to C1;
The virtual speaker signal generation module is used for generating a virtual speaker signal corresponding to the target virtual speaker based on the attribute information of the target virtual speaker and the first audio signal;
The scene audio signal reconstruction module is used for reconstructing based on the attribute information of the target virtual speaker and the virtual speaker signal to obtain a first reconstructed scene audio signal, wherein the first reconstructed scene audio signal comprises audio signals of C2 channels, and C2 is a positive integer;
The attenuation factor determining module is used for determining an attenuation factor according to the frequency band index of the reconstructed signal in the first reconstructed scene audio signal and/or the order of the first reconstructed scene audio signal;
and the scene audio signal adjusting module is used for adjusting the first reconstructed scene audio signal according to the higher-order energy gain encoding result and the attenuation factor to obtain a reconstructed scene audio signal.
The scene audio decoding apparatus of the fifth aspect may perform the steps of the second aspect or any implementation manner thereof; details are not repeated here.
The fifth aspect and any implementation manner thereof correspond to the second aspect and any implementation manner thereof, respectively. For the technical effects of the fifth aspect and any implementation manner thereof, refer to the technical effects of the second aspect and the corresponding implementation manner; details are not repeated here.
In a sixth aspect, an embodiment of the application provides an electronic device comprising a memory and a processor, the memory being coupled to the processor, the memory storing program instructions that, when executed by the processor, cause the electronic device to perform the scene audio coding method of the first aspect or any possible implementation of the first aspect.
The sixth aspect and any implementation manner thereof correspond to the first aspect and any implementation manner thereof, respectively. For the technical effects of the sixth aspect and any implementation manner thereof, refer to the technical effects of the first aspect and the corresponding implementation manner; details are not repeated here.
In a seventh aspect, an embodiment of the application provides an electronic device comprising a memory and a processor, the memory being coupled to the processor, the memory storing program instructions that, when executed by the processor, cause the electronic device to perform the method of scene audio decoding of the second aspect or any possible implementation of the second aspect.
The seventh aspect and any implementation manner thereof correspond to the second aspect and any implementation manner thereof, respectively. For the technical effects of the seventh aspect and any implementation manner thereof, refer to the technical effects of the second aspect and the corresponding implementation manner; details are not repeated here.
In an eighth aspect, embodiments of the present application provide a chip comprising one or more interface circuits and one or more processors, the interface circuits being configured to receive signals from a memory of an electronic device and to send signals to the processors, the signals comprising computer instructions stored in the memory, which when executed by the processors, cause the electronic device to perform the method of scene audio coding in the first aspect or any possible implementation of the first aspect.
The eighth aspect and any implementation manner thereof correspond to the first aspect and any implementation manner thereof, respectively. For the technical effects of the eighth aspect and any implementation manner thereof, refer to the technical effects of the first aspect and the corresponding implementation manner; details are not repeated here.
In a ninth aspect, embodiments of the present application provide a chip comprising one or more interface circuits and one or more processors, the interface circuits being configured to receive signals from a memory of an electronic device and to send signals to the processors, the signals comprising computer instructions stored in the memory, which when executed by the processors, cause the electronic device to perform the method of scene audio decoding in the second aspect or any possible implementation of the second aspect.
The ninth aspect and any implementation manner thereof correspond to the second aspect and any implementation manner thereof, respectively. For the technical effects of the ninth aspect and any implementation manner thereof, refer to the technical effects of the second aspect and the corresponding implementation manner; details are not repeated here.
In a tenth aspect, embodiments of the present application provide a computer readable storage medium storing a computer program which, when run on a computer or processor, causes the computer or processor to perform the method of scene audio coding in the first aspect or any possible implementation manner of the first aspect.
The tenth aspect and any implementation manner thereof correspond to the first aspect and any implementation manner thereof, respectively. For the technical effects of the tenth aspect and any implementation manner thereof, refer to the technical effects of the first aspect and the corresponding implementation manner; details are not repeated here.
In an eleventh aspect, embodiments of the present application provide a computer readable storage medium storing a computer program, which when run on a computer or processor causes the computer or processor to perform the method of decoding scene audio in the second aspect or any possible implementation manner of the second aspect.
The eleventh aspect and any implementation manner thereof correspond to the second aspect and any implementation manner thereof, respectively. For the technical effects of the eleventh aspect and any implementation manner thereof, refer to the technical effects of the second aspect and the corresponding implementation manner; details are not repeated here.
In a twelfth aspect, embodiments of the present application provide a computer program product comprising a software program which, when executed by a computer or processor, causes the computer or processor to perform the method of scene audio coding in the first aspect or any of the possible implementations of the first aspect.
The twelfth aspect and any implementation manner thereof correspond to the first aspect and any implementation manner thereof, respectively. For the technical effects of the twelfth aspect and any implementation manner thereof, refer to the technical effects of the first aspect and the corresponding implementation manner; details are not repeated here.
In a thirteenth aspect, embodiments of the present application provide a computer program product comprising a software program which, when executed by a computer or processor, causes the computer or processor to perform the method of scene audio decoding of the second aspect or any possible implementation of the second aspect.
The thirteenth aspect and any implementation manner thereof correspond to the second aspect and any implementation manner thereof, respectively. For the technical effects of the thirteenth aspect and any implementation manner thereof, refer to the technical effects of the second aspect and the corresponding implementation manner; details are not repeated here.
In a fourteenth aspect, an embodiment of the present application provides an apparatus for storing a code stream. The apparatus includes a receiver and at least one storage medium; the receiver is configured to receive the code stream, and the at least one storage medium is configured to store the code stream, where the code stream is generated according to the first aspect or any implementation manner thereof.
The fourteenth aspect and any implementation manner thereof correspond to the first aspect and any implementation manner thereof, respectively. For the technical effects of the fourteenth aspect and any implementation manner thereof, refer to the technical effects of the first aspect and the corresponding implementation manner; details are not repeated here.
In a fifteenth aspect, an embodiment of the present application provides an apparatus for transmitting a code stream. The apparatus includes a transmitter and at least one storage medium; the at least one storage medium is configured to store the code stream, where the code stream is generated according to the first aspect or any implementation manner thereof, and the transmitter is configured to obtain the code stream from the storage medium and send it to an end-side device over a transmission medium.
The fifteenth aspect and any implementation manner thereof correspond to the first aspect and any implementation manner thereof, respectively. For the technical effects of the fifteenth aspect and any implementation manner thereof, refer to the technical effects of the first aspect and the corresponding implementation manner; details are not repeated here.
In a sixteenth aspect, an embodiment of the present application provides a system for distributing a code stream. The system includes at least one storage medium configured to store at least one code stream, where the at least one code stream is generated according to the first aspect or any implementation manner thereof, and a streaming media device configured to obtain a target code stream from the at least one storage medium and send it to an end-side device, where the streaming media device includes a content server or a content distribution server.
The sixteenth aspect and any implementation manner thereof correspond to the first aspect and any implementation manner thereof, respectively. For the technical effects of the sixteenth aspect and any implementation manner thereof, refer to the technical effects of the first aspect and the corresponding implementation manner; details are not repeated here.
Drawings
FIG. 1a is a schematic diagram of an exemplary application scenario;
FIG. 1b is a schematic diagram of an exemplary application scenario;
FIG. 2a is a schematic diagram of an exemplary encoding process;
FIG. 2b is a schematic diagram of an exemplary candidate virtual speaker distribution;
FIG. 3 is a schematic diagram of an exemplary decoding process;
FIG. 4 is a schematic diagram of an exemplary encoding process;
FIG. 5a is a schematic diagram of an exemplary decoding process;
FIG. 5b is a schematic diagram of another exemplary decoding process;
FIG. 6a is a schematic diagram of an exemplary structure of an encoding end;
FIG. 6b is a schematic diagram of an exemplary structure of a decoding end;
FIG. 7 is a schematic diagram of an exemplary structure of a scene audio encoding apparatus;
FIG. 8 is a schematic diagram of an exemplary structure of a scene audio decoding apparatus;
FIG. 9 is a schematic diagram of an exemplary structure of an apparatus.
Detailed Description
The following clearly and completely describes the technical solutions in the embodiments of the present application with reference to the accompanying drawings. The described embodiments are obviously some, rather than all, of the embodiments of the application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the application without creative effort fall within the protection scope of the application.
The term "and/or" merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may mean: A alone, both A and B, or B alone.
The terms "first", "second", and the like in the description and claims of the embodiments of the application are used to distinguish different objects, not to describe a particular order of the objects. For example, a first target object and a second target object are used to distinguish different target objects, not to describe a particular order of them.
In the embodiments of the application, words such as "exemplary" or "for example" are used to indicate an example, instance, or illustration. Any embodiment or design described as "exemplary" or "for example" should not be construed as preferred or more advantageous than other embodiments or designs. Rather, such words are intended to present related concepts in a concrete manner.
In the description of the embodiments of the present application, unless otherwise indicated, the meaning of "a plurality" means two or more. For example, a plurality of processing units refers to two or more processing units, and a plurality of systems refers to two or more systems.
For clarity and conciseness in the description of the embodiments below, a brief description of the related art will be given first.
Sound is a continuous wave generated by the vibration of an object. An object that vibrates and emits sound waves is called a sound source. As sound waves propagate through a medium (such as air, a solid, or a liquid), the auditory organs of humans or animals perceive the sound.
The characteristics of sound waves include pitch, intensity, and timbre. Pitch represents how high or low a sound is. Sound intensity represents the magnitude of a sound and may also be called loudness or volume; its unit is the decibel (dB). Timbre is also known as tone quality.
The frequency of a sound wave determines the pitch: the higher the frequency, the higher the pitch. The number of times an object vibrates per second is called the frequency, measured in hertz (Hz). The human ear can recognize sounds with frequencies from 20 Hz to 20000 Hz.
The amplitude of a sound wave determines the sound intensity: the larger the amplitude, the greater the intensity; and the closer to the sound source, the greater the intensity.
The waveform of a sound wave determines the timbre. Waveforms include square waves, sawtooth waves, sine waves, pulse waves, and the like.
According to the characteristics of sound waves, sounds can be classified into regular sounds and irregular sounds. An irregular sound is a sound emitted by an irregularly vibrating source, for example noise that disturbs people's work, study, or rest. A regular sound is a sound emitted by a regularly vibrating source; regular sounds include speech and musical tones. Represented electrically, a regular sound is an analog signal that varies continuously in the time-frequency domain; such an analog signal may be called an audio signal. An audio signal is an information carrier that carries speech, music, and sound effects.
Since human hearing can discern the spatial position distribution of sound sources, a listener hearing a sound in space perceives not only its pitch, intensity, and timbre but also its direction.
As attention to, and quality requirements for, the auditory experience increase, three-dimensional audio technology has arisen to enhance the sense of depth, presence, and spatial perception of sound. The listener thus not only perceives sound sources in front, behind, to the left, and to the right, but also senses that the surrounding space is enveloped by the spatial sound field produced by those sources and that the sound spreads all around, creating an "immersive" effect as if the listener were in a theatre, a concert hall, or another venue.
The scene audio signals in the embodiments of the application may refer to signals used to describe a sound field. A scene audio signal may include an HOA signal (a three-dimensional HOA signal or a two-dimensional HOA signal, the latter also called a planar HOA signal) or a three-dimensional audio signal, where a three-dimensional audio signal refers to a scene audio signal other than an HOA signal. The HOA signal is described below as an example.
It is known that sound waves propagate in an ideal medium with wave number k = w/c and angular frequency w = 2πf, where f is the sound wave frequency and c is the speed of sound. The sound pressure p satisfies formula (1), where ∇² is the Laplace operator:
∇²p + k²p = 0 (1)
Assume the spatial system outside the human ear is a sphere with the listener at its center. Sound transmitted from outside the sphere has a projection on the sphere, and sound from outside the sphere is filtered out; the sound sources are assumed to be distributed on the sphere, and the sound field generated by the original sources is fitted by the sound field generated by the sources on the sphere. In other words, three-dimensional audio technology is a method of fitting the sound field. Specifically, formula (1) is solved in the spherical coordinate system; in the passive spherical region, its solution is the following equation (2).
p(r, θ, φ, k) = Σ_{m=0}^{∞} Σ_{0≤n≤m, σ=±1} 4π · j^m · j_m(kr) · Y_{m,n}^σ(θ_s, φ_s) · s · Y_{m,n}^σ(θ, φ) (2)
Where r denotes the sphere radius, θ denotes horizontal angle information (or azimuth angle information), φ denotes pitch angle information (or elevation angle information), k denotes the wave number, s denotes the amplitude of an ideal plane wave, and m denotes the order sequence number of the HOA signal. j_m(kr) denotes the spherical Bessel function, also called the radial basis function, where the first j denotes the imaginary unit; j_m(kr) does not change with angle. Y_{m,n}^σ(θ, φ) denotes the spherical harmonic in the (θ, φ) direction, and Y_{m,n}^σ(θ_s, φ_s) denotes the spherical harmonic in the sound source direction. The HOA signal satisfies formula (3):
B_{m,n}^σ = s · Y_{m,n}^σ(θ_s, φ_s) (3)
Substituting formula (3) into formula (2), formula (2) may be rewritten as formula (4):
p(r, θ, φ, k) = Σ_{m=0}^{N} Σ_{0≤n≤m, σ=±1} 4π · j^m · j_m(kr) · B_{m,n}^σ · Y_{m,n}^σ(θ, φ) (4)
Wherein m is truncated to the N-th term, i.e., m = N, and B_{m,n}^σ serves as an approximate description of the sound field; at this point, B_{m,n}^σ may be referred to as the HOA coefficients (which may be used to represent the N-order HOA signal). The sound field refers to the area of the medium where sound waves are present. N is an integer greater than or equal to 1.
A scene audio signal is an information carrier that carries the spatial position information of the sound sources in a sound field and describes the sound field of a listener in space. Formula (4) shows that the sound field can be expanded on the sphere according to spherical harmonics, i.e., the sound field can be decomposed into a superposition of a plurality of plane waves. Thus, the sound field described by the HOA signal can be expressed as a superposition of plane waves and reconstructed from the HOA coefficients.
The HOA signal to be encoded according to an embodiment of the present application may refer to an N1-order HOA signal, which may be represented by HOA coefficients or Ambisonic (stereo reverberation) coefficients, N1 being an integer greater than or equal to 1 (when N1 is equal to 1, the 1-order HOA signal may be referred to as an FOA (First Order Ambisonic) signal). The N1-order HOA signal comprises audio signals of (N1+1)² channels.
Fig. 1a is a schematic diagram of an exemplary application scenario. Shown in fig. 1a is a codec scene of a scene audio signal.
Referring to fig. 1a, an exemplary first electronic device may include a first audio acquisition module, a first scene audio encoding module, a first channel encoding module, a first channel decoding module, a first scene audio decoding module, and a first audio playback module. It should be understood that the first electronic device may include more or fewer modules than shown in fig. 1a, as the application is not limited in this regard.
Referring to fig. 1a, the second electronic device may include, as an example, a second audio acquisition module, a second scene audio encoding module, a second channel encoding module, a second channel decoding module, a second scene audio decoding module, and a second audio playback module. It should be understood that the second electronic device may include more or fewer modules than shown in fig. 1a, as the application is not limited in this regard.
The process in which the first electronic device encodes and transmits the scene audio signal to the second electronic device, and the second electronic device decodes and plays it back, may be as follows: the first audio acquisition module performs audio acquisition and outputs the scene audio signal to the first scene audio encoding module. The first scene audio encoding module then encodes the scene audio signal and outputs a code stream to the first channel encoding module. The first channel encoding module then performs channel encoding on the code stream and transmits the channel-encoded code stream to the second electronic device through a wireless or wired network communication device. The second channel decoding module of the second electronic device then performs channel decoding on the received data to obtain the code stream and outputs it to the second scene audio decoding module. The second scene audio decoding module decodes the code stream to obtain a reconstructed scene audio signal, which is then output to the second audio playback module for audio playback.
It should be noted that the second audio playback module may perform post-processing on the reconstructed scene audio signal (such as audio rendering (e.g., converting a reconstructed scene audio signal containing audio signals of (N1+1)² channels into an audio signal with the same number of channels as the speakers of the second electronic device), loudness normalization, user interaction, audio format conversion, or denoising) to convert it into an audio signal suitable for playing by the speakers of the second electronic device.
It should be understood that the process in which the second electronic device encodes and transmits the scene audio signal to the first electronic device, and the first electronic device decodes and plays it back, is similar to the process described above and is not repeated here.
By way of example, the first electronic device and the second electronic device may each include, but are not limited to, a personal computer, a computer workstation, a smart phone, a tablet, a server, a smart camera, a smart car device, a cellular phone of another type, a media consumption device, a wearable device, a set-top box, a game console, and the like.
The present application is particularly applicable to VR (Virtual Reality)/AR (Augmented Reality) scenes, for example. In one possible approach, the first electronic device is a server and the second electronic device is a VR/AR device. In one possible approach, the second electronic device is a server and the first electronic device is a VR/AR device.
The first scene audio coding module and the second scene audio coding module may be, for example, scene audio encoders. The first and second scene audio decoding modules may be scene audio decoders.
For example, when a scene audio signal is encoded by a first electronic device, the second electronic device reconstructs the scene audio signal, the first electronic device may be referred to as an encoding side, and the second electronic device may be referred to as a decoding side. When the scene audio signal is encoded by the second electronic device, the first electronic device reconstructs the scene audio signal, the second electronic device may be referred to as an encoding side, and the first electronic device may be referred to as a decoding side.
Fig. 1b is a schematic view of an exemplary application scenario. Shown in fig. 1b is a transcoded scene of a scene audio signal.
Referring to fig. 1b (1), an exemplary wireless or core network device may include a channel decoding module, other audio decoding modules, a scene audio encoding module, and a channel encoding module. Wherein a wireless or core network device may be used for audio transcoding.
The specific application scenario of fig. 1b (1) may be as follows: the first electronic device is provided only with other audio encoding modules and no scene audio encoding module, while the second electronic device is provided only with a scene audio decoding module and no other audio decoding modules. To enable the second electronic device to decode and play back the scene audio signal encoded by the first electronic device using the other audio encoding modules, a wireless or core network device may be used for transcoding.
Specifically, the first electronic device encodes the scene audio signal using the other audio encoding modules to obtain a first code stream, performs channel encoding, and sends the result to the wireless or core network device. The channel decoding module of the wireless or core network device then performs channel decoding and outputs the channel-decoded first code stream to the other audio decoding modules. The other audio decoding modules decode the first code stream to obtain a scene audio signal and output it to the scene audio encoding module. The scene audio encoding module then encodes the scene audio signal to obtain a second code stream and outputs it to the channel encoding module, which performs channel encoding on the second code stream and sends it to the second electronic device. In this way, the second electronic device can perform channel decoding to obtain the second code stream and call the scene audio decoding module to decode it, obtaining a reconstructed scene audio signal that can then be played back.
Referring to fig. 1b (2), an exemplary wireless or core network device may include a channel decoding module, a scene audio decoding module, other audio encoding modules, and a channel encoding module. Wherein a wireless or core network device may be used for audio transcoding.
The specific application scenario of fig. 1b (2) may be as follows: the first electronic device is provided only with a scene audio encoding module and no other audio encoding modules, while the second electronic device is provided only with other audio decoding modules and no scene audio decoding module. To enable the second electronic device to decode and play back the scene audio signal encoded by the first electronic device using the scene audio encoding module, a wireless or core network device may be used for transcoding.
Specifically, the first electronic device encodes the scene audio signal using the scene audio encoding module to obtain a first code stream, performs channel encoding, and sends the result to the wireless or core network device. The channel decoding module of the wireless or core network device then performs channel decoding and outputs the channel-decoded first code stream to the scene audio decoding module. The scene audio decoding module decodes the first code stream to obtain a scene audio signal and outputs it to the other audio encoding modules. The other audio encoding modules then encode the scene audio signal to obtain a second code stream and output it to the channel encoding module, which performs channel encoding on the second code stream and sends it to the second electronic device. In this way, the second electronic device can perform channel decoding to obtain the second code stream and call the other audio decoding modules to decode it, obtaining a reconstructed scene audio signal that can then be played back.
The following describes a codec process of a scene audio signal.
Fig. 2a is a schematic diagram of an exemplary encoding process.
S201, obtaining a scene audio signal to be encoded, wherein the scene audio signal comprises audio signals of C1 channels, and C1 is a positive integer.
Illustratively, when the scene audio signal is an HOA signal, the HOA signal may be an N1-order HOA signal, i.e., the expansion in formula (4) above truncated at N = N1.
Illustratively, the N1-order HOA signal may include audio signals of C1 channels, C1 = (N1+1)². For example, when N1 = 3, the N1-order HOA signal includes audio signals of 16 channels; when N1 = 4, it includes audio signals of 25 channels.
S202, acquiring attribute information of a target virtual speaker corresponding to a scene audio signal.
Illustratively, a target virtual speaker is selected from a plurality of candidate virtual speakers based on the scene audio signal, and attribute information of the target virtual speaker is acquired.
S203, obtaining the higher-order energy gain of the scene audio signal.
Illustratively, the feature information of the HOA signal is obtained from the HOA signal to be encoded, and the high-order energy gain is obtained through the feature information of the HOA signal, where the high-order energy gain may be used to indicate the energy gain of the high-order channel signal of the scene audio signal.
The scene audio signal includes audio signals of C1 channels, the first audio signal is an audio signal of K channels in the scene audio signal, K is a positive integer less than or equal to C1, and the value of K is not limited.
Illustratively, the scene audio signal is an N1-order HOA signal, the N1-order HOA signal includes a first audio signal and a second audio signal, the second audio signal is the audio signals other than the first audio signal in the N1-order HOA signal, and C1 is equal to (N1+1)².
In one possible approach, N1 = 3 and K = 10 are assumed. The N1-order HOA signal includes the audio signals of the 1st to 16th channels, the first audio signal is the audio signals of the 1st to 10th channels in the N1-order HOA signal, and the second audio signal is the audio signals of the 11th to 16th channels in the N1-order HOA signal.
Illustratively, N1 = 3 and K = 9. The N1-order HOA signal includes the audio signals of the 1st to 16th channels, the first audio signal is the audio signals of the 1st to 9th channels in the N1-order HOA signal, and the second audio signal is the audio signals of the 10th to 16th channels in the N1-order HOA signal.
Illustratively, N1 = 3 and K = 8. The N1-order HOA signal includes the audio signals of the 1st to 16th channels, the first audio signal is the audio signals of the 1st to 6th, 8th, and 9th channels in the N1-order HOA signal, and the second audio signal is the audio signals of the 7th and 10th to 16th channels in the N1-order HOA signal.
In one possible implementation, obtaining a higher-order energy gain of a scene audio signal includes:
and acquiring the high-order energy gain according to the characteristic information of the second audio signal and the characteristic information of the first audio signal.
Illustratively, the scene audio signal includes the first audio signal and the second audio signal, and feature information of the second audio signal and feature information of the first audio signal are acquired respectively. The feature information corresponding to the scene audio signal includes, but is not limited to, gain information and diffusion information. The higher-order energy gain of the scene audio signal may be obtained from the feature information of the second audio signal and the feature information of the first audio signal.
For example, gain information Gain (i, b) of a second audio signal in the scene audio signal may be calculated with reference to the following formula:
Gain(i,b)=E(i,b)/E(1,b)
Where i is the number of the i-th channel included in the second audio signal (the number may also be referred to as the channel number), b is the frequency band number of the second audio signal, E(i, b) is the energy of the i-th channel in the b-th frequency band of the second audio signal, and E(1, b) is the channel energy of the b-th frequency band of the first audio signal; for example, this channel of the first audio signal may specifically be the 1st channel of the N1-order HOA signal.
The following steps may be performed on a whole frame signal or on a subframe, and in the full band or in a sub-band.
Illustratively, after Gain (i, b) is calculated, gain' (i, b) is calculated as follows:
Gain’(i,b)=10*log10(Gain(i,b))。
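To make the band-energy bookkeeping concrete, the following is a minimal Python sketch of this encoder-side computation of E(i, b) and Gain'(i, b). The array layout, the band_edges convention, and the eps guard against log(0) are assumptions for illustration, not details specified by the application.

```python
import numpy as np

def band_energies(spec, band_edges):
    """Per-band channel energies E(i, b) from one frequency-domain frame.

    spec: complex array of shape (channels, bins); band_edges: bin boundaries.
    Returns an array of shape (channels, bands).
    """
    return np.stack(
        [(np.abs(spec[:, lo:hi]) ** 2).sum(axis=1)
         for lo, hi in zip(band_edges[:-1], band_edges[1:])],
        axis=1)

def higher_order_gain_db(spec, band_edges, second_signal_channels):
    """Gain'(i, b) = 10 * log10(E(i, b) / E(1, b)); channel 0 is taken as W."""
    e = band_energies(spec, band_edges)
    eps = 1e-12  # guard against division by zero / log(0)
    return 10.0 * np.log10((e[second_signal_channels] + eps) / (e[0] + eps))
```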
s204, coding the high-order energy gain to obtain a high-order energy gain coding result.
After the coding end obtains the high-order energy gain of the scene audio signal, the high-order energy gain can be coded to generate a high-order energy gain coding result. The function of the high-order energy gain is to adjust the high-order channel energy at the decoding end, so that the encoding and decoding quality of the HOA signal is higher.
S205, encoding attribute information of a first audio signal and a target virtual speaker in the scene audio signal and a high-order energy gain encoding result to obtain a first code stream, wherein the first audio signal is an audio signal of K channels in the scene audio signal, and K is a positive integer less than or equal to C1.
It should be noted that the virtual speakers are virtual rather than real speakers.
For example, based on the above, the scene audio signal may be expressed using a superposition of a plurality of plane waves, and thus a target virtual speaker for simulating a sound source in the scene audio signal may be determined, so that the scene audio signal is reconstructed using a virtual speaker signal corresponding to the target virtual speaker in a subsequent decoding process.
In one possible approach, a plurality of candidate virtual speakers may be provided on the sphere at different locations, and then a target virtual speaker may be selected from the plurality of candidate virtual speakers that is located to match the location of the sound source in the scene audio signal.
Fig. 2b is a schematic diagram of an exemplary candidate virtual speaker distribution. In fig. 2b, a plurality of candidate virtual speakers may be uniformly distributed on a sphere, with a point on the sphere representing a candidate virtual speaker.
It should be noted that the number and distribution of the candidate virtual speakers are not limited in the present application, and may be set as required, and specifically described later.
The target virtual speaker whose position corresponds to the sound source position in the scene audio signal may be selected from the plurality of candidate virtual speakers based on the scene audio signal, wherein the number of the target virtual speakers may be one or more, and the application is not limited thereto.
In one possible approach, the target virtual speaker may be preset.
It should be understood that the application is not limited in the manner in which the target virtual speaker is determined.
By way of example, in one possible manner, the scene audio signal may be reconstructed from the virtual speaker signal during decoding. However, directly transmitting the virtual speaker signal of the target virtual speaker would increase the code rate, and the virtual speaker signal can be generated from the attribute information of the target virtual speaker and the scene audio signal of some or all channels. Therefore, the attribute information of the target virtual speaker is acquired, the audio signals of K channels in the scene audio signal are taken as the first audio signal, and the first audio signal, the attribute information of the target virtual speaker, and the higher-order energy gain encoding result are encoded to obtain the first code stream.
For example, operations such as down-mixing, transformation, quantization, entropy coding and the like may be performed on the attribute information of the first audio signal and the target virtual speaker to obtain a first code stream, and in addition, a higher-order energy gain coding result may be written into the first code stream. That is, the first code stream may include encoded data of the first audio signal in the scene audio signal, encoded data of attribute information of the target virtual speaker, and a higher-order energy gain encoding result.
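As a reading aid only, the described payload of the first code stream could be modeled as follows; the container name and field names are hypothetical, and the actual bitstream syntax is not specified in this passage.

```python
from dataclasses import dataclass

@dataclass
class FirstBitstream:
    """Hypothetical container mirroring the described first-code-stream payload."""
    first_audio_signal_data: bytes   # encoded K-channel first audio signal
    speaker_attribute_data: bytes    # encoded attribute info of the target virtual speaker(s)
    higher_order_gain_data: bytes    # higher-order energy gain encoding result
```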
Compared with other prior-art methods for reconstructing scene audio signals, reconstruction based on virtual speaker signals yields higher audio quality. Therefore, when K is equal to C1, the scene audio signal reconstructed by the present application has higher audio quality at the same code rate.
When K is smaller than C1, the number of channels encoded by the present application is smaller than the number of channels encoded in the prior art, and the data size of the attribute information of the target virtual speaker is also far smaller than the data size of the audio signal of one channel; therefore, the coding rate is lower for the same quality.
In addition, the prior art converts the scene audio signal into a virtual speaker signal and a residual signal before encoding, whereas the encoding end of the present application directly encodes the audio signals of some channels of the scene audio signal without calculating virtual speaker signals or residual signals, so the encoding complexity of the encoding end is lower.
Fig. 3 is a schematic diagram of an exemplary decoding process. Fig. 3 is a decoding process corresponding to the encoding process of fig. 2a.
S301, a first code stream is received.
S302, decoding the first code stream to obtain a first reconstruction signal and attribute information of a target virtual speaker.
Illustratively, encoded data of a first audio signal in a scene audio signal contained in a first bitstream may be decoded to obtain a first reconstructed signal, that is, the first reconstructed signal is a reconstructed signal of the first audio signal. And decoding the encoded data of the attribute information of the target virtual speaker contained in the first code stream, thereby obtaining the attribute information of the target virtual speaker.
It should be understood that, when the encoding end performs lossy compression on the first audio signal in the scene audio signal, the first reconstructed signal obtained by decoding by the decoding end and the first audio signal encoded by the encoding end have differences. When the encoding end performs lossless compression on the first audio signal, the first reconstruction signal obtained by decoding by the decoding end is identical to the first audio signal encoded by the encoding end.
It should be understood that, when the encoding end performs lossy compression on the attribute information of the target virtual speaker, the attribute information obtained by decoding by the decoding end and the attribute information encoded by the encoding end have a difference. When the encoding end performs lossless compression on the attribute information of the virtual speaker, the attribute information obtained by decoding by the decoding end is the same as the attribute information encoded by the encoding end. The application does not distinguish between the attribute information encoded by the encoding end and the attribute information decoded by the decoding end from the name.
S303, generating a virtual speaker signal corresponding to the target virtual speaker based on the attribute information of the target virtual speaker and the first reconstruction signal.
S304, reconstructing based on the attribute information of the target virtual speaker and the virtual speaker signal to obtain a first reconstructed scene audio signal. The first reconstructed scene audio signal comprises audio signals of C2 channels, C2 being a positive integer.
The scene audio signal may be reconstructed based on the virtual speaker signal, and further the virtual speaker signal corresponding to the target virtual speaker may be generated based on the attribute information of the target virtual speaker and the first reconstructed signal. One target virtual speaker corresponds to one virtual speaker signal, and the virtual speaker signal is a plane wave. Then, reconstructing is performed based on the attribute information of the target virtual speaker and the virtual speaker signal, and generating a first reconstructed scene audio signal.
Illustratively, when the scene audio signal is an HOA signal, the reconstructed first reconstructed scene audio signal may also be an HOA signal, which may be an N2-order HOA signal, where N2 is a positive integer. Illustratively, the N2-order HOA signal may comprise audio signals of C2 channels, C2 = (N2+1)².
Illustratively, the order N2 of the first reconstructed scene audio signal may be greater than or equal to the order N1 of the scene audio signal in the embodiment of fig. 2a; correspondingly, the number of channels C2 of the audio signals included in the first reconstructed scene audio signal may be greater than or equal to the number of channels C1 of the audio signals included in the scene audio signal in the embodiment of fig. 2a.
Illustratively, the scene audio signal is an N1-order HOA signal, the N1-order HOA signal comprises a second audio signal, the second audio signal is the audio signals of the N1-order HOA signal other than the first audio signal, and C1 is equal to (N1+1)²; and/or,
the first reconstructed scene audio signal is an N2-order HOA signal, the N2-order HOA signal comprises a third audio signal, the third audio signal is the reconstructed signal of the N2-order HOA signal corresponding to each channel of the second audio signal, and C2 is equal to (N2+1)².
In one possible way, the first reconstructed scene audio signal may be directly used as the final decoding result.
S305, determining a dispersion factor (dispersion) of the first reconstruction scene audio signal according to the first code stream.
Since the reconstructed HOA signal is calculated from the virtual speaker signal, it contains only the sound source components with definite azimuth (also called directional components) of the HOA signal to be encoded, and lacks the environment components (also called non-directional components). Therefore, when the energy of each channel of the reconstructed HOA signal is adjusted through audio gain, only the energy of the directional components approaches that of the HOA signal to be encoded, and the adjustment of the non-directional components is unstable.
The dispersion can be used as a parameter describing the degree of dispersion of the HOA signal, and is applied to HOA sound field reconstruction and sound field playback. In the embodiment of the application, the dispersion of the HOA signal measures the energy proportion of the non-directional components in the HOA signal to be encoded, and the energy of each channel of the reconstructed HOA signal is adjusted accordingly, so that the energy of the energy-adjusted reconstructed HOA signal is closer to that of the HOA signal to be encoded. Using the dispersion as a parameter for energy adjustment also exploits its ability to describe the sound source: non-directional components are adaptively compensated during energy adjustment while directional components are not. The dispersion thus compensates for the shortcomings of gain adjustment based on the signal energy ratio, without introducing additional error.
S306, determining an attenuation factor according to the frequency band sequence number of the reconstruction signal in the first reconstruction scene audio signal and/or the order of the first reconstruction scene audio signal.
For example, the decoding side may obtain the attenuation factor according to the frequency band sequence number of the reconstructed signal in the first reconstructed scene audio signal; or according to the order of the first reconstructed scene audio signal, which may specifically be the order of the N2-order HOA signal, for example the Ambisonic order; or according to both the frequency band sequence number and the order of the first reconstructed scene audio signal, in which case it may be referred to as a dual attenuation factor. The attenuation factor attenuates with at least one of the frequency band sequence number of the reconstructed signal and the order of the first reconstructed scene audio signal, and may be used to adjust the first reconstructed scene audio signal so that the quality of the reconstructed scene audio signal is higher.
S307, the first reconstructed scene audio signal is adjusted according to the high-order energy gain coding result, the attenuation factor and the dispersion factor, so as to obtain a reconstructed scene audio signal.
The decoding end obtains a high-order energy gain coding result from the first code stream, and energy adjustment is carried out on the first reconstructed scene audio signal by using the high-order energy gain coding result, the attenuation factor and the dispersion factor. The decoding end adjusts the high-order channel energy of the first reconstructed scene audio signal by using the high-order energy gain coding result, so that the decoding quality of the scene audio signal is higher.
Compared with other prior-art methods for reconstructing scene audio signals, reconstruction based on virtual speaker signals yields higher audio quality; therefore, when K is equal to C1, the reconstructed scene audio signal has higher audio quality at the same code rate.
When K is smaller than C1, the number of channels encoded by the present application is smaller than in the prior art, and the data size of the attribute information of the target virtual speaker is far smaller than that of the audio signal of one channel; therefore, at the same code rate, the audio quality of the reconstructed scene audio signal is higher.
In addition, the virtual speaker signal and residual information transmitted in prior-art coding are converted from the original audio signal (i.e., the scene audio signal to be encoded) and are not the original audio signal itself, which introduces errors. The present application encodes part of the original audio signal (i.e., the audio signals of K channels in the scene audio signal to be encoded), which avoids introducing such errors, improves the audio quality of the decoded reconstructed scene audio signal, and avoids fluctuation of the reconstruction quality, giving high stability.
Furthermore, since the prior art encodes and transmits virtual speaker signals, whose data amount is large, the number of target virtual speakers selected in the prior art is limited by bandwidth. The present application encodes and transmits the attribute information of the virtual speakers, whose data amount is far smaller than that of the virtual speaker signals, so the number of target virtual speakers that can be selected is much less limited by bandwidth. The more target virtual speakers are selected, the higher the quality of the scene audio signal reconstructed from their virtual speaker signals. Therefore, compared with the prior art, at the same code rate, the present application can select more target virtual speakers, so the quality of the decoded reconstructed scene audio signal is higher.
In addition, compared with the prior art, the encoding end and the decoding end of the present application need no residual and superposition operations, so their combined complexity is lower. Because the first code stream sent by the encoding end includes the encoding result of the higher-order energy gain, the higher-order energy gain can be used to adjust the energy of the higher-order channels at the decoding end, so the encoding and decoding quality of the scene audio signal is higher.
The following describes the encoding process of the higher-order energy gain in the encoding process and the adjusting process of the audio signal by the higher-order energy gain in the decoding process.
Fig. 4 is a schematic diagram of an exemplary encoding process.
S401, acquiring a scene audio signal to be encoded, wherein the scene audio signal comprises audio signals of C1 channels, and C1 is a positive integer.
For example, S401 may refer to the description of S201 above, and will not be described herein.
S402, acquiring attribute information of a target virtual speaker corresponding to the scene audio signal.
In one possible approach, attribute information of the target virtual speaker is generated based on the position information of the target virtual speaker. In one possible manner, the position information (including pitch angle information and horizontal angle information) of the target virtual speaker may be used as the attribute information of the target virtual speaker. In one possible manner, a position index (including a pitch angle index (which may be used to uniquely identify pitch angle information) and a horizontal angle index (which may be used to uniquely identify horizontal angle information)) corresponding to the position information of the target virtual speaker is set as the attribute information of the target virtual speaker.
In one possible approach, a virtual speaker index (e.g., virtual speaker identification) of the target virtual speaker may be used as the attribute information of the target virtual speaker. Wherein the virtual speaker indexes are in one-to-one correspondence with the position information.
In one possible manner, the virtual speaker coefficients of the target virtual speaker may be used as the attribute information of the target virtual speaker. For example, C2 virtual speaker coefficients of the target virtual speaker may be determined and used as the attribute information of the target virtual speaker, where the C2 virtual speaker coefficients correspond one-to-one with the audio signals of the C2 channels included in the first reconstructed scene audio signal.
The data amount of the virtual speaker coefficients is much larger than that of the position information, the index of the position information, or the virtual speaker index, and which of these is used as the attribute information of the target virtual speaker may be determined according to the bandwidth. For example, when the bandwidth is large, the virtual speaker coefficients can be used as the attribute information, so that the decoding end does not need to calculate them, saving decoding-end computation. When the bandwidth is small, any one of the position information, the index of the position information, and the virtual speaker index can be used as the attribute information, saving code rate. It should be understood that which information is used as the attribute information of the target virtual speaker may also be set in advance, and the present application is not limited thereto.
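A sketch of this bandwidth-driven choice, assuming a single hypothetical rate threshold (the application does not specify one):

```python
from enum import Enum, auto

class AttrKind(Enum):
    POSITION = auto()        # pitch angle + horizontal angle
    POSITION_INDEX = auto()  # pitch angle index + horizontal angle index
    SPEAKER_INDEX = auto()   # virtual speaker identification
    COEFFICIENTS = auto()    # C2 virtual speaker coefficients

def choose_attr_kind(available_bitrate_kbps, coeff_threshold_kbps=128.0):
    """Pick a representation by bandwidth: coefficients when the rate is plentiful
    (saves decoder computation), a compact index otherwise (saves code rate)."""
    if available_bitrate_kbps >= coeff_threshold_kbps:
        return AttrKind.COEFFICIENTS
    return AttrKind.SPEAKER_INDEX
```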
S403, acquiring the energy gain of the first audio signal and the energy gain of the second audio signal.
The feature information corresponding to the scene audio signal includes gain information, the scene audio signal includes a first audio signal and a second audio signal, and an energy gain E (1, b) of the first audio signal and an energy gain E (i, b) of the second audio signal are calculated, respectively.
S404, obtaining a higher-order energy gain according to the energy gain of the first audio signal and the energy gain of the second audio signal.
Illustratively, obtaining the higher order energy gain from the energy gain of the first audio signal and the energy gain of the second audio signal comprises:
The higher-order energy Gain' (i, b) is obtained by:
Gain’(i,b)=10*log10(E(i,b)/E(1,b));
Where log10 denotes the base-10 logarithm, * denotes multiplication, E(1, b) is the channel energy of the first audio signal, E(i, b) is the channel energy of the second audio signal, i is the number of the i-th channel of the second audio signal, and b is the frequency band number of the second audio signal.
The characteristic information of the second audio signal may be, for example, the higher-order energy gain of the N1-order HOA signal, specifically the energy ratio of each channel of the second audio signal to the W channel (the 1st channel of the N1-order HOA signal, which may specifically be a channel of the first audio signal).
For example, the feature information of the second audio signal may be acquired with reference to the following steps:
And performing time-frequency transformation on the N1-order HOA signal, and transforming the time domain N1-order HOA signal to obtain a frequency domain N1-order HOA signal.
The W channel energy E (1, b) and the channel energy E (i, b) of the second audio signal are calculated, wherein i is the channel number of the second audio signal.
The calculation of the higher order energy Gain' (i, b) may use the following formula:
Gain(i,b)=E(i,b)/E(1,b);
Gain’(i,b)=10*log10(Gain(i,b))。
S405, quantizing the high-order energy gain to obtain the quantized high-order energy gain.
S406, entropy coding is carried out on the quantized high-order energy gain so as to obtain a coding result of the high-order energy gain.
The method comprises the steps of obtaining characteristic information of a second audio signal in a scene audio signal, obtaining high-order energy gain of the scene audio signal through the characteristic information of the second audio signal, and sequentially quantizing and entropy coding the high-order energy gain.
Illustratively, scalar quantization may be employed to quantize the higher order energy gain.
Entropy encoding is performed on the quantized higher-order energy gain. The entropy encoding method is not limited.
Illustratively, the higher-order energy gain is differentially encoded, and then the number of entropy-coded bits is estimated. If the estimated number of bits is less than that of fixed-length coding, the higher-order energy gain is variable-length encoded, for example Huffman encoded; otherwise, it is fixed-length encoded.
After the high-order energy gain coding result is obtained, the coding result is written into the code stream.
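A minimal sketch of S405/S406 as described: scalar quantization, differential coding, and a bit-count comparison to choose between variable-length and fixed-length coding. The quantization step, the fixed bit width, and the variable-length bit estimate are placeholders; a real implementation would use the codec's actual Huffman tables.

```python
import numpy as np

def encode_higher_order_gain(gain_db, step=0.5, fixed_bits=6):
    """Scalar-quantize Gain'(i, b), difference-code it, and pick the cheaper
    of variable-length and fixed-length coding by an estimated bit count."""
    q = np.round(gain_db / step).astype(np.int32)     # scalar quantization
    flat = q.ravel()
    diff = np.diff(flat, prepend=flat[:1])            # differential coding
    # Crude unary-style estimate standing in for a real Huffman bit count.
    est_vl_bits = int(np.sum(2 * np.abs(diff) + 1))
    use_vl = est_vl_bits < q.size * fixed_bits
    return q, diff, ("variable_length" if use_vl else "fixed_length")
```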
S407, encoding the attribute information of the first audio signal and the target virtual speaker in the scene audio signal and the high-order energy gain encoding result to obtain a first code stream.
It should be appreciated that the number of channels of the audio signal included in the first audio signal may be determined according to requirements and bandwidth, and the present application is not limited in this respect.
For example, S407 may refer to the description of S205 above, and will not be described herein.
In the embodiment of the application, the encoding end can calculate the energy ratio of the higher-order channels to the W channel to obtain the higher-order energy gain coding result, and then select Huffman coding or direct coding according to the bit-number estimate of the inter-subframe difference result. The first code stream sent by the encoding end thus includes the higher-order energy gain coding result, so the higher-order energy gain can be used to adjust the energy of the higher-order channels at the decoding end, making the encoding and decoding quality of the scene audio signal higher.
Fig. 5a is a schematic diagram of an exemplary decoding process. Fig. 5a is a decoding process corresponding to the encoding process of fig. 4.
S501, a first code stream is received.
S502, decoding the first code stream to obtain attribute information of the first reconstruction signal and the target virtual speaker and a high-order energy gain coding result.
S503, generating a virtual speaker signal corresponding to a target virtual speaker based on attribute information of the target virtual speaker and a first audio signal;
S504, reconstructing based on attribute information of a target virtual speaker and the virtual speaker signal to obtain a first reconstructed scene audio signal, wherein the first reconstructed scene audio signal comprises audio signals of C2 channels, and C2 is a positive integer.
For example, S501 to S504 may refer to descriptions of S301 to S304, which are not described herein.
For example, S306 and S307 may refer to the descriptions of S505 to S508 below.
S505, performing entropy decoding on the high-order energy gain coding result to obtain the entropy decoded high-order energy gain.
S506, performing inverse quantization on the entropy decoded high-order energy gain to obtain the high-order energy gain.
Illustratively, the high-order energy gain encoding result is read from the first code stream. Entropy decoding is performed on the higher-order energy gain encoding result. The entropy decoding method is the inverse process of the entropy coding at the coding end.
Illustratively, if the encoding end employs fixed-length encoding, the decoding end uses the corresponding fixed-length decoding; if the encoding end employs variable-length encoding, the decoding end uses the corresponding variable-length decoding, such as Huffman decoding.
Inverse quantization is performed on the entropy decoding result; the inverse quantization method is the inverse process of the quantization method of the encoding end.
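The corresponding decoder-side inverse quantization, assuming the same (hypothetical) step size as the encoder sketch above:

```python
import numpy as np

def decode_higher_order_gain(q, step=0.5):
    """Inverse quantization of the entropy-decoded gain indices; `step` must
    match the assumed encoder quantization step."""
    return q.astype(np.float64) * step  # reconstructed Gain'(i, b) in dB
```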
And S507, adjusting the high-order energy gain according to the characteristic information of the second audio signal and the characteristic information of the first audio signal to obtain the adjusted decoded high-order energy gain.
After the decoding end performs signal reconstruction to obtain the first reconstructed scene audio signal, the first audio signal and the third audio signal are determined from it, where the third audio signal is the reconstructed signal corresponding to each channel of the second audio signal in the N2-order HOA signal. The characteristic information of the second audio signal is determined according to the characteristic information of the third audio signal. Finally, the higher-order energy gain is adjusted according to the characteristic information of the second audio signal and the characteristic information of the first audio signal to obtain the adjusted decoded higher-order energy gain. Adjusting the higher-order energy gain makes the higher-order channel energy more uniform and smooth, and the quality of the reconstructed audio signal better.
Illustratively, S507 adjusts the higher-order energy gain according to the characteristic information of the second audio signal and the characteristic information of the first audio signal, including:
s5071, obtaining the higher energy of the second audio signal according to the channel energy and the higher energy gain of the first audio signal;
The decoding end obtains a high-order energy gain coding result from the first code stream, entropy decodes the high-order energy gain coding result, and dequantizes the high-order energy gain coding result to obtain the high-order energy gain. And estimating the energy of the second audio signal according to the channel energy and the higher-order energy gain of the first audio signal so as to determine the higher-order energy of the second audio signal.
Illustratively, the first reconstructed scene audio signal is an N2-order HOA signal, the N2-order HOA signal is subjected to time-frequency transformation, and the time-domain N2-order HOA signal is transformed to obtain a frequency-domain N2-order HOA signal.
The higher-order energy E_Ref(i, b) of the second audio signal is calculated; the following formula may be used:
E_Ref(i,b)=E_dec(1,b)*10^(Gain’(i,b)/10)
Where E_dec(1, b) is the channel energy of the b-th frequency band of the first audio signal in the N2-order HOA signal, i is the channel number corresponding to the second audio signal, Gain'(i, b) is the higher-order energy gain, and b is the frequency band number of the first audio signal.
S5072, obtaining a decoding energy scale factor according to the channel energy of the third audio signal and the higher-order energy of the second audio signal.
Specifically, the third audio signal is a reconstructed signal corresponding to each channel of the second audio signal in the N2-order HOA signal, and the decoded energy scale factor is obtained by performing energy scale calculation on the third audio signal and the second audio signal.
Illustratively, the decoding energy scaling factor g (i, b) is calculated, using the following formula:
g(i,b)=sqrt(E_Ref(i,b))/sqrt(E_dec(i,b))
Where sqrt() is the square-root operation, E_dec(i, b) is the channel energy of the b-th frequency band of the third audio signal, i is the channel number corresponding to the third audio signal, and E_Ref(i, b) is the higher-order energy of the b-th frequency band of the second audio signal.
S5073, obtaining a decoded higher order energy gain of the third audio signal according to the channel energy of the third audio signal and the channel energy of the first audio signal.
And taking the channel energy of the first audio signal as a reference, and performing gain calculation on the channel energy of the third audio signal to obtain a decoded higher-order energy gain of the third audio signal.
Illustratively, the decoded higher-order energy Gain gain_dec (i, b) is calculated, and the following formula can be employed:
Gain_dec(i,b)=E_dec(i,b)/E_dec(1,b)
Where E_dec(1, b) is the channel energy of the b-th frequency band of the first audio signal in the N2-order HOA signal, and E_dec(i, b) is the channel energy of the b-th frequency band of the i-th channel of the third audio signal.
And S5074, adjusting the decoded higher-order energy gain of the third audio signal according to the decoded energy scale factor to obtain an adjusted decoded higher-order energy gain.
Specifically, in order to make the energy of the higher-order channel more uniform and smooth, the decoded higher-order energy gain of the third audio signal is adjusted by using the decoded energy scaling factor, and the adjusted decoded higher-order energy gain is determined. After the decoding energy scale factor is used for adjustment, the energy of the higher-order channel is more uniform and smooth, and the quality of the reconstructed audio signal is better.
Illustratively, adjusting the decoded higher order energy gain of the third audio signal according to the decoded energy scale factor to obtain an adjusted decoded higher order energy gain comprises:
The adjusted decoded higher order energy Gain gain_dec' (i, b) is obtained by:
Gain_dec’(i,b)=w*min(g(i,b),Gain_dec(i,b))+(1-w)*g(i,b);
Where g(i, b) represents the decoded energy scale factor, Gain_dec(i, b) represents the decoded higher-order energy gain of the b-th frequency band of the third audio signal, w is a preset adjustment scale threshold, min represents the minimum-value operation, and * represents multiplication.
For example, min(a, b) is the minimum of a and b, and w is the adjustment scale threshold, which takes a preset value, for example w = 0.25.
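Putting S5071 to S5074 together, the following Python sketch follows the formulas exactly as written above; the array shapes, channel indexing, and eps guards are assumptions for illustration.

```python
import numpy as np

def adjusted_decoded_gain(gain_db, e_dec, hi_channels, w=0.25):
    """Implements S5071-S5074 as stated in the text.

    gain_db:     decoded Gain'(i, b), shape (len(hi_channels), bands)
    e_dec:       E_dec(i, b), per-band channel energies of the N2-order HOA
                 signal, shape (channels, bands); row 0 is the W channel
    hi_channels: channel numbers of the third audio signal
    """
    eps = 1e-12
    e_w = e_dec[0]                                    # E_dec(1, b)
    e_ref = e_w * 10.0 ** (gain_db / 10.0)            # S5071: E_Ref(i, b)
    e_hi = e_dec[hi_channels]
    g = np.sqrt(e_ref + eps) / np.sqrt(e_hi + eps)    # S5072: g(i, b)
    gain_dec = e_hi / (e_w + eps)                     # S5073: Gain_dec(i, b)
    # S5074: Gain_dec'(i, b) = w*min(g, Gain_dec) + (1 - w)*g
    return w * np.minimum(g, gain_dec) + (1.0 - w) * g
```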
And S508, adjusting the third audio signal in the N2-order HOA signal according to the adjusted decoded high-order energy gain to obtain an adjusted third audio signal.
The decoding end obtains a high-order energy gain coding result from the first code stream, and energy adjustment is carried out on a third audio signal in the N2-order HOA signal by using the high-order energy gain coding result. The decoding end adjusts the high-order channel energy of the third audio signal by using the high-order energy gain coding result, so that the decoding quality of the third audio signal is higher.
The third audio signal is a channel audio signal corresponding to each channel of the second audio signal in the N2-order HOA signal.
For example, the third audio signal may be adjusted based on the feature information corresponding to the second audio signal in the N1-order HOA signal, so as to improve the quality of the N2-order HOA signal.
Illustratively, S508 adjusts a third audio signal in the N2-order HOA signal according to the adjusted decoded higher order energy gain, comprising:
s5081, obtaining an attenuation factor according to the frequency band sequence number of the third audio signal and/or the order of the N2-order HOA signal.
For example, the decoding side may obtain the attenuation factor according to the frequency band sequence number where the third audio signal is located, or the decoding side may obtain the attenuation factor according to the order of the N2-order HOA signal, where the order of the N2-order HOA signal may specifically be the Ambisonic order, or the decoding side may obtain the attenuation factor according to the frequency band sequence number and the order of the N2-order HOA signal, where the attenuation factor may be referred to as a dual attenuation factor.
S5082, adjusting the third audio signal according to the adjusted decoded higher-order energy gain and attenuation factor to obtain an adjusted third audio signal, wherein the adjusted third audio signal belongs to the reconstructed scene audio signal.
After the adjusted decoded higher-order energy gain is obtained, the gain of the third audio signal of the current frame may be weighted, where the gain attenuates with the frequency band sequence number of the third audio signal and/or the order of the N2-order HOA signal. The attenuation factor may first be obtained according to the frequency band sequence number of the third audio signal and/or the order of the N2-order HOA signal; for example, the attenuation factor can attenuate with both the frequency band and the Ambisonic order. The adjusted decoded higher-order energy gain and the obtained attenuation factor then act on the higher-order channels of the third audio signal reconstructed for the current frame, making the higher-order channel energy more uniform and smooth and improving the quality of the reconstructed audio signal.
Illustratively, the third audio signal is adjusted using the adjusted decoded higher-order energy gain Gain_dec'(i, b) and the attenuation factor g'(i, b).
For example, adjustments may be made with reference to the following formulas:
X’(i,b)=X(i,b)*Gain_dec’(i,b)*g’(i,b);
wherein X (i, b) is the third audio signal before adjustment and X' (i, b) is the third audio signal after adjustment.
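A sketch of this per-band application, under the same assumed array layout as the earlier sketches (spec is a NumPy array of frequency-domain bins per channel):

```python
def apply_adjustment(spec, hi_channels, band_edges, gain_dec_adj, g_att):
    """X'(i, b) = X(i, b) * Gain_dec'(i, b) * g'(i, b), applied to every bin
    of band b in each higher-order channel."""
    out = spec.copy()
    for ci, ch in enumerate(hi_channels):
        for b, (lo, hi) in enumerate(zip(band_edges[:-1], band_edges[1:])):
            out[ch, lo:hi] *= gain_dec_adj[ci, b] * g_att[ci, b]
    return out
```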
In a possible manner, S5081 obtains an attenuation factor according to a frequency band number where the third audio signal is located and an order of the HOA signal of the N2 order, including:
the attenuation factor g' (i, b) is obtained as follows:
Wherein i is the number of the i-th channel of the third audio signal, b is the frequency band number of the third audio signal, α represents the number of target virtual speakers, β represents the number of mapping channels of the N2-order HOA signal, β is equal to (M+2)², M is the order of the N2-order HOA signal, γ represents the order corresponding to the channel number i of the third audio signal, and * represents multiplication.
Illustratively, b is the frequency band number of the third audio signal, which may also be referred to as the subband number, b = 0, 1, 2, ...
In this calculation of the attenuation factor g'(i, b), b is the frequency band sequence number of the third audio signal, α represents the number of target virtual speakers, β represents the number of mapping channels of the N2-order HOA signal and is equal to (M+2)², M is the order of the N2-order HOA signal, and γ represents the order corresponding to channel number i of the third audio signal. The attenuation factor can be calculated accurately from these parameters, and by adjusting them it varies with three factors: the number of speakers, the number of HOA mapping channels, and the HOA order. This improves the quality of the reconstructed audio signal when the attenuation factor and the adjusted decoded higher-order energy gain are used to adjust the third audio signal.
In a possible manner, the method for decoding a scene audio signal provided by the embodiment of the application further includes:
when b ≤ d, α is updated to α2, α2 = 0.375 × α, where d is a preset first threshold;
when b > d, α is updated to α3, α3 = 0.5 × α.
For example, d is a preset first threshold whose value is not limited. When b ≤ d, the value of α may be reduced to α2, with α2 = 0.375 × α; when b > d, the value of α may be reduced to α3, with α3 = 0.5 × α. When frequency band b is not greater than the first threshold (i.e., the b-th frequency band represents a low frequency band), the attenuation coefficient is set to a small value, for example 0.375; when frequency band b is greater than the threshold (i.e., the b-th frequency band represents a high frequency band), the attenuation coefficient is set to a large value, for example 0.5. The values 0.375 and 0.5 are just one possible example: 0.375 may be replaced by 0.38 or 0.37, and 0.5 by 0.55 or 0.6, the attenuation coefficient being determined by the application scenario, which is not limited here. With this scheme, the value of α can be flexibly adjusted according to the value of frequency band b, so that the attenuation effect of the attenuation factor is more significant for higher frequency bands, and the reconstructed audio signal better conforms to the auditory characteristics of the human ear.
Illustratively, α represents the number of target virtual speakers and takes a value of 0, 1, 2, 3, ... α may be further adjusted such that α2 = 0.375 × α when subband b is less than or equal to the first threshold, and α3 = 0.5 × α otherwise.
In a possible manner, the method for decoding a scene audio signal provided by the embodiment of the application further includes:
updating the α to α4, α4=α+α× (1-1.25×b)/(bands+2);
Where bands represents the number of frequency bands of the third audio signal, i.e., the number of sub-bands; for example, bands may be configured as 12. This is by way of illustration only and does not limit the embodiments of the application.
In the embodiment of the application, the value of α is updated to α4, α4 = α + α × (1 − 1.25 × b)/(bands + 2). By extending the value of α in this way, the value of the attenuation factor is related to the b-th frequency band, achieving different attenuation efficiencies for the low and high frequency bands and improving the quality of the reconstructed audio signal when the attenuation factor is used to adjust the third audio signal.
In a possible manner, the method for decoding a scene audio signal provided by the embodiment of the application further includes:
when b is less than or equal to d, updating beta into beta 2, wherein beta 2= (1+w) multiplied by beta, d is a preset first threshold value, and w is a preset regulation proportion threshold value.
Illustratively, β represents the number of mapping channels of the HOA signal and takes the value (M+2)², where M is the HOA signal order. When subband b is less than or equal to the first threshold, β2 = (1 + w) × β. In this case, the value of β can be flexibly adjusted according to the value of frequency band b, so that the attenuation effect of the attenuation factor is more significant for higher frequency bands, and the reconstructed audio signal better conforms to the auditory characteristics of the human ear.
In a possible manner, the method for decoding a scene audio signal provided by the embodiment of the application further comprises the step of updating w to w2, wherein w2=w+α×0.05.
The value of w is obtained as follows: the initial value of w is 0, the values of α are traversed in sequence, and the final adjusted value of w is obtained by accumulation. For example, w is updated to w2 = w + α×0.05. Updating w according to α establishes the relationship between the weight w in the attenuation factor and the parameter α: w increases as α increases, so the attenuation effect of the attenuation factor is more pronounced for higher frequency bands, and the reconstructed audio signal better matches the auditory characteristics of the human ear.
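A hedged sketch of the β and w updates from the two possible manners above; all names are illustrative, and driving the accumulation by iterating over a sequence of α values is an assumption about how the traversal is organized.

```python
def updated_beta(beta: float, b: int, first_threshold: int, w: float) -> float:
    """beta2 = (1 + w) * beta when band b is at or below the first threshold d."""
    if b <= first_threshold:
        return (1.0 + w) * beta
    return beta


def updated_w(alphas) -> float:
    """w starts at 0; traversing the alpha values in sequence accumulates
    w2 = w + alpha * 0.05 to obtain the final adjusted value of w."""
    w = 0.0
    for alpha in alphas:
        w += alpha * 0.05
    return w
```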
In one possible way, when i has a value of 0, 1,2, or 3, γ has a value of 1;
when the value of i is 4, 5, 6, 7 or 8, the value of gamma is 2;
when the value of i is 9, 10, 11, 12, 13, 14 or 15, the value of gamma is 3;
where i is the channel index in the third audio signal.
Illustratively, γ represents the order of the HOA signal in which the ith channel is located, and γ and i satisfy Table 1 below:
table 1 is a correspondence table of γ and i:
| i | γ |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 2 |
| 5 | 2 |
| 6 | 2 |
| 7 | 2 |
| 8 | 2 |
| 9 | 3 |
| 10 | 3 |
| 11 | 3 |
| 12 | 3 |
| 13 | 3 |
| 14 | 3 |
| 15 | 3 |
Determining the attenuation factor according to the value of γ makes it a piecewise function of the HOA order of the channel: γ increases as the order of channel i increases but does not exceed the maximum HOA order. Using such an attenuation factor to adjust the third audio signal improves the quality of the reconstructed audio signal.
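One closed form consistent with Table 1 (an observation, not a formula given in the text) is γ = max(1, floor(sqrt(i))), which matches the ACN channel-ordering relation between channel index and ambisonic order:

```python
import math

def gamma_for_channel(i: int) -> int:
    """Order index gamma for channel i of the third audio signal.
    floor(sqrt(i)) is the ambisonic order of channel i in ACN ordering;
    Table 1 clamps the lowest value to 1."""
    return max(1, math.isqrt(i))

# Sanity check against Table 1 (i = 0..15):
assert [gamma_for_channel(i) for i in range(16)] == \
       [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3]
```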
Illustratively, after S5082, in which the third audio signal in the N2-order HOA signal is adjusted according to the adjusted decoded higher-order energy gain and the attenuation factor, the method further comprises:
s5083, obtaining channel energy of a fourth audio signal corresponding to the adjusted third audio signal, wherein the third audio signal comprises an audio signal of a current frame, and the fourth audio signal comprises an audio signal of a frame before the current frame;
s5084, the adjusted third audio signal is adjusted again according to the channel energy of the fourth audio signal.
The decoding end may further re-adjust the adjusted third audio signal of the current frame by using a frame preceding it, so as to improve the quality of the reconstructed audio signal. The third audio signal comprises an audio signal of the current frame, and the fourth audio signal comprises an audio signal of a frame before the current frame; the preceding frame may be adjacent to the current frame or not. The channel energy of the fourth audio signal can then be used to adjust the third audio signal. For example, the decoding end linearly weights the sub-bands of the higher-order channels of the current frame with those of the previous 2 frames of the third audio signal, thereby obtaining energy-smoothed higher-order channels for the current frame.
Illustratively, S5084 adjusts the third audio signal according to the channel energy of the fourth audio signal, including:
S50841, obtaining a channel energy average value of the fourth audio signal and the adjusted channel energy of the third audio signal;
Wherein the channel energy average of the fourth audio signal may be an average of all channel energies of the fourth audio signal.
S50842, acquiring an energy average threshold according to the channel energy average value of the fourth audio signal and the channel energy of the third audio signal.
The energy average threshold is a threshold calculated from the channel energies of the third audio signal and the fourth audio signal, respectively.
Illustratively, obtaining the energy average threshold from the channel energy average value of the fourth audio signal and the channel energy of the third audio signal comprises:
The energy average threshold k is calculated from the channel energy average E_mean(i) of the fourth audio signal and the energy E'_dec(i) of the adjusted third audio signal.
S50843, carrying out weighted average calculation on the channel energy average value of the fourth audio signal and the adjusted channel energy of the third audio signal according to the energy average threshold value so as to obtain target energy;
The target energy E_target (i, b) is calculated, and the following formula can be used:
E_target(i,b)=k*E_mean(i,b)+(1-k)*E’_dec(i,b);
where E_mean(i, b) is the average energy of the previous frame and E'_dec(i, b) is the energy of the adjusted third audio signal.
S50844, obtaining an energy smoothing factor according to the target energy and the adjusted channel energy of the third audio signal;
the energy smoothing factor may be used for adjustment of the third audio signal such that the decoding quality of the third audio signal is higher.
Illustratively, the obtaining an energy smoothing factor according to the target energy and the channel energy of the adjusted third audio signal comprises:
The energy smoothing factor q (i, b) is obtained as follows:
q(i,b)=sqrt(E_target(i,b))/sqrt(E’_dec(i,b));
where E_target(i, b) represents the target energy and E'_dec(i, b) represents the energy of the adjusted third audio signal.
S50845, adjusting the third audio signal according to the energy smoothing factor.
The decoding quality of the third audio signal is further improved by readjusting the adjusted third audio signal using the energy smoothing factor q (i, b).
By way of example, the third audio signal may be re-adjusted with reference to the following formula:
X''(i, b) = X'(i, b) * q(i, b);
After the re-adjusted third audio signal is obtained, the average energy of the previous frame may also be updated, for example with the energy of the adjusted third audio signal.
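A minimal sketch of S50841 to S50845, assuming per-channel, per-band NumPy arrays. Since the formula for the threshold k is not reproduced in the text above, the sketch takes k as an input; names are illustrative.

```python
import numpy as np

def smooth_energy(X_adj, E_mean, E_dec_adj, k):
    """Inter-frame energy smoothing (S50841-S50845).

    X_adj     : adjusted third audio signal X'(i, b), shape (channels, bands)
    E_mean    : previous-frame channel energy average E_mean(i, b)
    E_dec_adj : energy of the adjusted third audio signal E'_dec(i, b)
    k         : energy average threshold (its formula is not reproduced
                above, so it is supplied by the caller)
    """
    # S50843: target energy E_target = k * E_mean + (1 - k) * E'_dec
    E_target = k * E_mean + (1.0 - k) * E_dec_adj
    # S50844: energy smoothing factor q = sqrt(E_target) / sqrt(E'_dec)
    q = np.sqrt(E_target) / np.sqrt(E_dec_adj)
    # S50845: re-adjusted signal X'' = X' * q
    X_smoothed = X_adj * q
    # The previous-frame energy average may then be updated with the
    # energy of the adjusted third audio signal, for use on the next frame.
    E_mean_next = E_dec_adj
    return X_smoothed, E_mean_next
```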
Fig. 5b is a schematic diagram of an exemplary decoding process, corresponding to the encoding process of Fig. 4.
S501, a first code stream is received.
S502, decoding the first code stream to obtain a first reconstructed signal, attribute information of the target virtual speaker, and a higher-order energy gain coding result.
S503, generating a virtual speaker signal corresponding to a target virtual speaker based on attribute information of the target virtual speaker and a first audio signal;
S504, reconstructing based on attribute information of the target virtual speaker and the virtual speaker signal to obtain a first reconstructed scene audio signal, where the first reconstructed scene audio signal comprises audio signals of C2 channels, and C2 is a positive integer.
For example, S501 to S504 may refer to descriptions of S301 to S304, which are not described herein.
For example, for S306, reference may be made to the descriptions of S505 to S508 below.
S505, performing entropy decoding on the high-order energy gain coding result to obtain the entropy decoded high-order energy gain.
S506, performing inverse quantization on the entropy decoded high-order energy gain to obtain the high-order energy gain.
Illustratively, the high-order energy gain encoding result is read from the first code stream. Entropy decoding is performed on the higher-order energy gain encoding result. The entropy decoding method is the inverse process of the entropy coding at the coding end.
Illustratively, if the encoding end employs fixed-length coding, the decoding end uses the corresponding fixed-length decoding; if the encoding end employs variable-length coding, the decoding end uses the corresponding variable-length decoding, such as Huffman decoding.
Inverse quantization is then performed on the entropy decoding result; the inverse quantization method is the inverse process of the quantization method at the encoding end.
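The quantizer used at the encoding end is not specified above, so the following sketch only illustrates the inverse of a uniform scalar quantizer; the step size is a hypothetical parameter.

```python
def dequantize_gain(index: int, step_size: float = 0.5) -> float:
    """Map an entropy-decoded quantization index back to a gain value,
    assuming a uniform scalar quantizer (step_size is illustrative)."""
    return index * step_size
```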
S509, determining a dispersion factor (dispersion) of the first reconstructed scene audio signal according to the first code stream.
Since the reconstructed HOA signal is calculated from the virtual loudspeaker signal, it contains only the sound source components with definite azimuth (also called directional components) of the HOA signal to be encoded, and lacks the environment components (also called non-directional components). Therefore, adjusting the energy of each channel of the reconstructed HOA signal through the audio gain only brings the energy of the directional components closer to that of the HOA signal to be encoded, while the adjustment of the non-directional components remains unstable.
The dispersion can be used as a parameter describing the degree of dispersion of the HOA signal, and is applied to HOA sound field reconstruction and sound field playback. In the embodiment of the application, the dispersion of the HOA signal is used to measure the energy proportion of the non-directional components in the HOA signal to be encoded, and the energy of each channel of the reconstructed HOA signal is adjusted accordingly, so that the energy of the energy-adjusted reconstructed HOA signal is closer to that of the HOA signal to be encoded. When the dispersion is used as a parameter for the energy adjustment of the reconstructed HOA signal, its ability to describe the sound source can also be exploited: the non-directional components are adaptively compensated during the adjustment while the directional components are not. The dispersion thus compensates for the shortcomings of a gain-adjustment method based only on the signal energy ratio, without introducing extra error.
The embodiment of the application uses the dispersion degree to adjust the energy gain, so that the energy of the reconstructed HOA signal is more accurate.
In a possible implementation manner, S509 determines a dispersion factor of the first reconstructed scene audio signal according to the first bitstream, including:
and decoding the first code stream to obtain the dispersion factor, wherein the first code stream comprises the dispersion factor.
Specifically, the encoding end calculates the dispersion from the HOA signal to be encoded and quantizes and encodes it into the first code stream; the decoding end decodes the first code stream, obtains the dispersion through inverse quantization, and then uses it to adjust the gain of the un-encoded channels in the reconstructed HOA signal.
In a possible implementation manner, S509 determines a dispersion factor of the first reconstructed scene audio signal according to the first bitstream, including:
And acquiring the dispersion factor according to the first reconstruction signal obtained by decoding the first code stream.
There are various methods for calculating the dispersion; here it is calculated at the decoding end from the lower-order HOA signal in the decoded transmission channels, and then used to adjust the gain of the un-encoded channels in the reconstructed HOA signal.
In a possible implementation manner, S509 obtains the dispersion factor according to the first reconstructed signal decoded from the first code stream, and includes:
S5091, acquiring a sound field intensity of each frequency band in the first reconstructed signal;
s5092, obtaining energy of each frequency band;
s5093, determining the dispersion factor according to the sound field intensity of each frequency band and the energy of each frequency band.
Illustratively, S5091, acquiring the sound field intensity of each frequency band in the first reconstructed signal, includes: obtaining the sound field intensity I(b) of the b-th frequency band from the four channel signals of that band, where Re denotes the real-part operation, and W(b), X(b), Y(b), and Z(b) denote the four channel signals of the b-th frequency band in the first reconstructed signal.
Illustratively, the energy of each frequency band in S5092 may be obtained by computing, for each frequency band, the sum of the squared real and imaginary parts.
S5093, determining the dispersion factor according to the sound field intensity and the energy of each frequency band, includes: obtaining the dispersion factor dispersion(b) from the expectation of the sound field intensity and the band energy, where E{·} denotes the expectation operation, ‖E{I(b)}‖ denotes the two-norm of E{I(b)}, and M(b) denotes the energy of the b-th frequency band.
It should be understood that the above calculation of the dispersion in S5091 to S5093 is only an example implementation, and is not limited to the embodiment of the present application.
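The exact expressions for I(b) and dispersion(b) are not reproduced in the text above. The following sketch therefore assumes the standard DirAC-style definitions consistent with the quantities described (intensity from the real part of conj(W) times the X/Y/Z channels; dispersion as one minus the norm of the mean intensity over the mean band energy, up to normalization constants), with expectations approximated by averaging over frames:

```python
import numpy as np

def dispersion_per_band(W, X, Y, Z):
    """Hedged sketch of S5091-S5093: DirAC-style dispersion per band.

    W, X, Y, Z: complex spectra of the four first-order channels, shape
    (n_frames, n_bands). Expectations E{...} are approximated by
    averaging over frames.
    """
    # Sound-field intensity vector I(b) = Re{ conj(W) * [X, Y, Z] }
    I = np.stack([np.real(np.conj(W) * X),
                  np.real(np.conj(W) * Y),
                  np.real(np.conj(W) * Z)], axis=-1)   # (frames, bands, 3)
    # Band energy M(b): sum of squared real and imaginary parts per channel
    M = np.abs(W)**2 + np.abs(X)**2 + np.abs(Y)**2 + np.abs(Z)**2
    # dispersion(b) = 1 - ||E{I(b)}|| / E{M(b)} (up to normalization constants)
    num = np.linalg.norm(I.mean(axis=0), axis=-1)      # ||E{I}|| per band
    den = np.maximum(M.mean(axis=0), 1e-12)            # E{M} per band
    return 1.0 - num / den
```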
S510, determining an attenuation factor according to the frequency band sequence number of the reconstructed signal in the first reconstructed scene audio signal and/or the order of the first reconstructed scene audio signal. The attenuation factor is then linearly weighted by the dispersion factor to obtain a weighted attenuation factor.
Specifically, the dispersion factor may also be used to linearly weight the attenuation factor that is used to adjust the third audio signal. The weighting balances the ratio between the dispersion component and the directional component in the attenuation factor. Since the dispersion factor measures the energy proportion of the non-directional components in the HOA signal to be encoded, adjusting the energy of each channel of the reconstructed HOA signal with it makes the energy-adjusted reconstructed HOA signal closer in energy to the HOA signal to be encoded.
The specific implementation of linear weighting of the attenuation factors by the dispersion factors is not limited, and is illustrated as follows:
in one possible implementation, the linearly weighting the attenuation factor according to the dispersion factor to obtain a weighted attenuation factor includes:
the weighted attenuation factor gd(i, b) is obtained by at least one of the following:
gd(i, b) = w × dispersion(b) + (1 − w) × g'(i, b), where w is a preset adjustment ratio threshold, × denotes multiplication, dispersion(b) denotes the dispersion factor of the b-th frequency band of the third audio signal, and g'(i, b) denotes the attenuation factor of the b-th frequency band of the third audio signal;
or
when the attenuation factor is a full-band signal, gd(i, b) = w × mean(dispersion) + (1 − w) × g'(i, b), where mean(dispersion) is the average of the dispersion factors over a plurality of frequency bands of the third audio signal, g'(i, b) denotes the attenuation factor of the b-th frequency band of the third audio signal, and w is a preset adjustment ratio threshold;
or
gd(i, b) = w × dispersion(b) + (1 − w) × g'(i, b) + offset(i, b), where offset(i, b) denotes a bias constant of the b-th frequency band on the i-th channel of the third audio signal, g'(i, b) denotes the attenuation factor of the b-th frequency band of the third audio signal, w is a preset adjustment ratio threshold, and dispersion(b) denotes the dispersion factor of the b-th frequency band of the third audio signal;
or
gd(i, b) = w × dispersion(b) + (1 − w) × g'(i, b) + direction(i, b), where direction(i, b) denotes a direction parameter of the b-th frequency band on the i-th channel of the third audio signal, g'(i, b) denotes the attenuation factor of the b-th frequency band of the third audio signal, w is a preset adjustment ratio threshold, and dispersion(b) denotes the dispersion factor of the b-th frequency band of the third audio signal.
It should be understood that the above example of the weighted calculation of the attenuation factor is only an example implementation, and is not limited to the embodiment of the present application.
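A hedged sketch of the weighting variants listed above; the text presents them as alternatives, so the optional offset and direction terms below would normally be used one at a time. Argument names are illustrative.

```python
import numpy as np

def weighted_attenuation(g_prime, dispersion, w,
                         full_band_signal=False, offset=None, direction=None):
    """Linear weighting of the attenuation factor by the dispersion factor.

    g_prime    : attenuation factor g'(i, b), shape (n_channels, n_bands)
    dispersion : dispersion factor per band, shape (n_bands,)
    w          : preset adjustment ratio threshold
    offset / direction : optional per-(i, b) terms from the third and
    fourth variants; the text lists the variants as alternatives.
    """
    if full_band_signal:
        disp = np.mean(dispersion)        # mean(dispersion) over bands
    else:
        disp = dispersion                 # broadcasts over channels
    gd = w * disp + (1.0 - w) * g_prime   # base variant
    if offset is not None:
        gd = gd + offset                  # + offset(i, b)
    if direction is not None:
        gd = gd + direction               # + direction(i, b)
    return gd
```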
S511, obtaining the higher-order energy of the second audio signal according to the channel energy of the first audio signal and the higher-order energy gain, and obtaining the decoding energy scale factor according to the channel energy of the third audio signal and the higher-order energy of the second audio signal.
Specifically, the third audio signal is the reconstructed signal corresponding to each channel of the second audio signal in the N2-order HOA signal, and the decoding energy scale factor is obtained by an energy-ratio calculation between the third audio signal and the second audio signal.
Illustratively, the decoding energy scaling factor g (i, b) is calculated, using the following formula:
g(i,b)=sqrt(E_Ref(i,b))/sqrt(E_dec(i,b));
where sqrt() is the square-root operation, E_dec(i, b) is the channel energy of the b-th frequency band of the third audio signal, i is the channel number corresponding to the third audio signal, and E_Ref(i, b) is the higher-order energy of the b-th frequency band of the second audio signal.
And S512, adjusting the third audio signal in the N2-order HOA signal according to the weighted attenuation factor and the decoding energy scale factor to obtain an adjusted third audio signal.
The decoding end adjusts a third audio signal in the N2-order HOA signal according to the weighted attenuation factor and the decoding energy scale factor. The decoding end adjusts the energy of the higher-order channel of the third audio signal by using the weighted attenuation factor and the decoding energy scaling factor, so that the decoding quality of the third audio signal is higher.
In a possible implementation manner, the third audio signal in the N2-order HOA signal is adjusted according to the weighted attenuation factor and the decoding energy scale factor to obtain the adjusted third audio signal as follows:
The adjusted third audio signal X' (i, b) is obtained by:
X’(i,b)=X(i)×g(i,b)×gd(i,b);
Where gd (i, b) denotes a weighted attenuation factor, g (i, b) denotes a decoding energy scale factor, and X (i) denotes a third audio signal.
In the above scheme, the dispersion factor measures the energy proportion of the non-directional components in the HOA signal to be encoded, and the decoding energy scale factor measures the energy ratio of the reconstructed HOA signal; using both to adjust the energy of each channel of the reconstructed HOA signal makes the energy-adjusted reconstructed HOA signal closer to the energy of the HOA signal to be encoded.
It should be understood that the foregoing example of the adjustment calculation of the third audio signal is merely an example implementation, and is not limited to the embodiment of the present application.
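Putting S511 and S512 together, a minimal NumPy sketch of the decoding energy scale factor and the final adjustment; names are illustrative, and the small epsilon guarding the division is an added safeguard not present in the formulas above.

```python
import numpy as np

def adjust_third_signal(X, E_ref, E_dec, gd):
    """Sketch of S511-S512: per-channel, per-band energy adjustment.

    X     : third audio signal, shape (n_channels, n_bands)
    E_ref : higher-order energy of the second audio signal, E_Ref(i, b)
    E_dec : channel energy of the third audio signal, E_dec(i, b)
    gd    : weighted attenuation factor gd(i, b)
    """
    # Decoding energy scale factor: g(i, b) = sqrt(E_Ref) / sqrt(E_dec)
    g = np.sqrt(E_ref) / np.sqrt(np.maximum(E_dec, 1e-12))
    # Adjusted signal: X'(i, b) = X(i, b) * g(i, b) * gd(i, b)
    return X * g * gd
```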
For example, the input signal at the encoding end is a 3rd-order HOA signal comprising audio signals of 16 channels; the first audio signal is the audio signals of the 1st to 5th, 7th, and 9th to 10th channels, and the second audio signal is the audio signals of the 6th, 8th, and 11th to 16th channels. The code stream produced by the encoding end has the following implementation manners: 1. the code stream obtained by encoding does not include a higher-order energy gain coding result; 2. the code stream obtained by encoding includes a higher-order energy gain coding result. Correspondingly, the scene audio decoding method performed by the decoding end has three implementation manners: 1. when the received code stream does not include a higher-order energy gain coding result, the decoding end reconstructs the scene audio signal from the code stream; 2. when the received code stream includes a higher-order energy gain coding result, the decoding end reconstructs the scene audio signal from the code stream and adjusts the reconstructed scene audio signal according to the higher-order energy gain coding result to obtain the reconstructed scene audio signal; 3. in the embodiment of the application, when the received code stream includes a higher-order energy gain coding result, the decoding end reconstructs the scene audio signal from the code stream and adjusts it according to the higher-order energy gain coding result and the attenuation factor to obtain the reconstructed scene audio signal.
An analysis of the signal quality of the reconstructed scene audio signal shows that the decoded HOA signal obtained without the higher-order energy gain coding result has poor quality; the decoded HOA signal obtained with the higher-order energy gain coding result, but without adjusting the reconstructed scene audio signal by the attenuation factor, has moderate quality; and the decoded HOA signal obtained with the higher-order energy gain coding result and with the attenuation-factor adjustment has the best quality.
According to the above analysis, the decoding end in the embodiment of the application can adjust the reconstructed scene audio signal according to the higher-order energy gain coding result and the attenuation factor, so that the higher-order channel energy of the reconstructed scene audio signal is more uniform and smooth, and the quality of the reconstructed scene audio signal is better. For example, the attenuation factor can attenuate with both the frequency band and the ambisonic order of the reconstructed scene audio signal, which effectively improves the encoding and decoding quality of the HOA signal.
Fig. 6a is a schematic diagram of an exemplary encoding end.
Referring to Fig. 6a, an exemplary encoding end may include a configuration unit, a virtual speaker generation unit, a target speaker generation unit, and a core encoder. It should be understood that Fig. 6a is only an example of the present application; the encoding end may include more or fewer modules than shown in Fig. 6a, which is not described herein again.
Illustratively, the configuration unit may be configured to determine configuration information of the candidate virtual speaker.
The virtual speaker generating unit may be configured to generate a plurality of candidate virtual speakers according to configuration information of the candidate virtual speakers and determine virtual speaker coefficients corresponding to the candidate virtual speakers.
The target speaker generating unit may be configured to select a target virtual speaker from a plurality of candidate virtual speakers according to the scene audio signal and the plurality of sets of virtual speaker coefficients, and determine attribute information of the target virtual speaker.
The core encoder may be used for obtaining a higher-order energy gain of the scene audio signal, obtaining a higher-order energy gain coding result, and encoding the first audio signal in the scene audio signal, the attribute information of the target virtual speaker, and the higher-order energy gain coding result.
The above-described scene audio coding modules of fig. 1a and 1b may include, for example, the configuration unit, the virtual speaker generation unit, the target speaker generation unit, the core encoder of fig. 6a, or only the core encoder.
Fig. 6b is a schematic diagram illustrating the structure of a decoding end.
Referring to Fig. 6b, an exemplary decoding end may include a core decoder, a virtual speaker coefficient generation unit, a virtual speaker signal generation unit, a reconstruction unit, and a signal adjustment unit. It should be understood that Fig. 6b is only an example of the present application; the decoding end may include more or fewer modules than shown in Fig. 6b, which is not described herein again.
The core decoder may be used to decode the first code stream to obtain the first reconstructed signal, the attribute information of the target virtual speaker, and the higher-order energy gain encoding result.
The virtual speaker coefficient generation unit may be configured to determine the virtual speaker coefficient based on attribute information of the target virtual speaker.
The virtual speaker signal generation unit may be configured to generate the virtual speaker signal based on the first reconstructed signal and the virtual speaker coefficient.
The reconstruction unit may be used for reconstructing based on the virtual speaker signal and the attribute information to obtain the first reconstructed scene audio signal.
The signal adjustment unit may be configured to determine an attenuation factor according to a frequency band sequence number of a reconstructed signal in the first reconstructed scene audio signal and/or an order of the first reconstructed scene audio signal, and adjust the first reconstructed scene audio signal according to a high-order energy gain encoding result and the attenuation factor to obtain a reconstructed scene audio signal.
The above-described scene audio decoding modules of fig. 1a and 1b may include, for example, the core decoder, the virtual speaker coefficient generation unit, the virtual speaker signal generation unit, the reconstruction unit, and the signal adjustment unit of fig. 6b, or include only the core decoder.
Fig. 7 is a schematic diagram of a structure of an exemplary scene audio encoding apparatus. The scene audio coding device in fig. 7 can be used to perform the coding method of the foregoing embodiment, so the advantages achieved by the device can be referred to the advantages of the corresponding method provided above, and will not be described herein. Wherein, the scene audio coding device may include:
the acquisition module is used for acquiring a scene audio signal to be encoded, wherein the scene audio signal comprises audio signals of C1 channels, and C1 is a positive integer;
the acquisition module is further used for acquiring attribute information of a target virtual speaker corresponding to the scene audio signal;
the acquisition module is further used for acquiring the higher-order energy gain of the scene audio signal;
the coding module is used for coding the high-order energy gain to obtain a high-order energy gain coding result;
The encoding module is further configured to encode a first audio signal in the scene audio signal, attribute information of the target virtual speaker, and the higher-order energy gain coding result to obtain a first code stream, where the first audio signal is an audio signal of K channels in the scene audio signal, and K is a positive integer less than or equal to C1.
Fig. 8 is a schematic diagram of a structure of an exemplary scene audio decoding apparatus. The scene audio decoding apparatus in fig. 8 may be used to perform the decoding method of the foregoing embodiment, so the advantages achieved by the apparatus may refer to the advantages of the corresponding method provided above, and will not be described herein. Wherein, the scene audio decoding apparatus may include:
a code stream receiving module 801, configured to receive a first code stream;
The decoding module 802 is configured to decode the first code stream to obtain a first reconstructed signal, attribute information of a target virtual speaker, and a higher-order energy gain encoding result, where the first reconstructed signal is a reconstructed signal of a first audio signal in a scene audio signal, the scene audio signal includes audio signals of C1 channels, the first audio signal is an audio signal of K channels in the scene audio signal, C1 is a positive integer, and K is a positive integer less than or equal to C1;
a virtual speaker signal generating module 803, configured to generate a virtual speaker signal corresponding to the target virtual speaker based on the attribute information and the first audio signal;
a scene audio signal reconstruction module 804, configured to reconstruct based on attribute information of the target virtual speaker and the virtual speaker signal to obtain a first reconstructed scene audio signal, where the first reconstructed scene audio signal includes audio signals of C2 channels, and C2 is a positive integer;
A dispersion factor determining module 805 configured to determine a dispersion factor of the first reconstructed scene audio signal according to the first code stream;
an attenuation factor determining module 806, configured to determine an attenuation factor according to a frequency band sequence number of a reconstructed signal in the first reconstructed scene audio signal and/or an order of the first reconstructed scene audio signal;
The scene audio signal adjustment module 807 is configured to adjust the first reconstructed scene audio signal according to the higher-order energy gain coding result and the attenuation factor, so as to obtain a reconstructed scene audio signal.
In one example, Fig. 9 is a schematic block diagram of an apparatus 900 according to an embodiment of the present application. The apparatus 900 may include a processor 901 and transceiver/transceiver pins 902, and optionally a memory 903.
The various components of apparatus 900 are coupled together by a bus 904, wherein bus 904 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are referred to in the figures as bus 904.
Optionally, the memory 903 may be used to store instructions in the foregoing method embodiments. The processor 901 is operable to execute instructions in the memory 903 and control the receive pin to receive signals and the transmit pin to transmit signals.
The apparatus 900 may be an electronic device or a chip of an electronic device in the above-described method embodiments.
For all relevant details of the steps in the above method embodiments, reference may be made to the functional descriptions of the corresponding functional modules; details are not repeated here.
The present embodiment also provides a chip comprising one or more interface circuits and one or more processors. The interface circuits are configured to receive signals from a memory of an electronic device and to send signals to the processors, the signals comprising computer instructions stored in the memory; when the processors execute the computer instructions, the electronic device performs the method of the above embodiments. The interface circuit may be, for example, the transceiver 902 in Fig. 9.
The present embodiment also provides a computer-readable storage medium having stored therein computer instructions that, when executed on an electronic device, cause the electronic device to perform the above-described related method steps to implement the scene audio codec method in the above-described embodiments.
The present embodiment also provides a computer program product which, when run on a computer, causes the computer to perform the above-described related steps to implement the scene audio codec method in the above-described embodiments.
The embodiment also provides a device for storing the code stream, which comprises a receiver and at least one storage medium, wherein the receiver is used for receiving the code stream, the at least one storage medium is used for storing the code stream, and the code stream is generated according to the scene audio coding method in the embodiment.
The embodiment of the application provides a device for transmitting a code stream, which comprises a transmitter and at least one storage medium, wherein the at least one storage medium is used for storing the code stream, the code stream is generated according to the scene audio coding method in the embodiment, and the transmitter is used for acquiring the code stream from the storage medium and transmitting the code stream to end-side equipment.
The embodiment of the application provides a system for distributing code streams, which comprises at least one storage medium, streaming media equipment and end-side equipment, wherein the storage medium is used for storing at least one code stream, the at least one code stream is generated according to the scene audio coding method in the embodiment, the streaming media equipment is used for acquiring a target code stream from the at least one storage medium and sending the target code stream to the end-side equipment, and the streaming media equipment comprises a content server or a content distribution server.
In addition, the embodiment of the application also provides a device which can be a chip, a component or a module, and the device can comprise a processor and a memory which are connected, wherein the memory is used for storing computer execution instructions, and when the device runs, the processor can execute the computer execution instructions stored in the memory so that the chip can execute the scene audio coding and decoding method in the method embodiments.
The electronic device, the computer readable storage medium, the computer program product or the chip provided in this embodiment are used to execute the corresponding method provided above, so that the beneficial effects thereof can be referred to the beneficial effects in the corresponding method provided above, and will not be described herein.
It will be appreciated by those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules to perform all or part of the functions described above.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another apparatus, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and the parts shown as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
Any of the various embodiments of the application, and any features within the same embodiment, may be freely combined. Any such combination is within the scope of the application.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium. Based on such an understanding, the technical solutions of the embodiments of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or various other media capable of storing program code.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are to be protected by the present application.
The steps of a method or algorithm described in connection with the present disclosure may be embodied in hardware, or in software instructions executed by a processor. The software instructions may consist of corresponding software modules, which may be stored in a random access memory (RAM), a flash memory, a read-only memory (ROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer-readable storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.