WO2019105575A1 - Determination of spatial audio parameter encoding and associated decoding - Google Patents
- Publication number
- WO2019105575A1 (application PCT/EP2017/081265)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- resolution
- spatial audio
- parameter
- frequency
- time
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/16—Vocoder architecture
- G10L19/18—Vocoders using multiple modes
- G10L19/24—Variable rate codecs, e.g. for generating different qualities using a scalable representation such as hierarchical encoding or layered encoding
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/008—Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S3/00—Systems employing more than two channels, e.g. quadraphonic
- H04S3/02—Systems employing more than two channels, e.g. quadraphonic of the matrix type, i.e. in which input signals are combined algebraically, e.g. after having been phase shifted with respect to each other
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2420/00—Techniques used stereophonic systems covered by H04S but not provided for in its groups
- H04S2420/03—Application of parametric coding in stereophonic audio systems
Definitions
- the present application relates to apparatus and methods for sound-field related parameter encoding.
- Parametric spatial audio processing is a field of audio signal processing where the spatial aspect of the sound is described using a set of parameters.
- parameters such as directions of the sound in frequency bands, and the ratios between the directional and non-directional parts of the captured sound in frequency bands.
- These parameters are known to describe well the perceptual spatial properties of the captured sound at the position of the microphone array.
- These parameters can be utilized accordingly in the synthesis of spatial sound: binaurally for headphones, for loudspeakers, or for other formats such as Ambisonics. They can also be used to estimate the sound at positions within the environment captured by the microphone array.
- the directions and direct-to-total energy ratios in frequency bands are thus a parameterization that is particularly effective for spatial audio capture.
- a parameter set consisting of a direction parameter in frequency bands and an energy ratio parameter in frequency bands (indicating the directionality of the sound) can be also utilized as the spatial metadata for an audio codec.
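The parameter set described above (a direction and an energy ratio per frequency band) can be pictured as a small record per band. The following is a minimal sketch only; the class name `BandParameters`, its field names, and the axis convention in `direction_vector` are assumptions made for illustration and are not taken from the application.

```python
import math
from dataclasses import dataclass


@dataclass
class BandParameters:
    """Hypothetical per-band spatial metadata: a direction plus an energy ratio."""
    azimuth_deg: float      # horizontal angle of the direction parameter
    elevation_deg: float    # vertical angle of the direction parameter
    direct_to_total: float  # energy ratio in [0, 1]; 1.0 means fully directional

    def direction_vector(self):
        """Unit vector for the direction (assumed convention: x front, y left, z up)."""
        az = math.radians(self.azimuth_deg)
        el = math.radians(self.elevation_deg)
        return (math.cos(el) * math.cos(az),
                math.cos(el) * math.sin(az),
                math.sin(el))

    def split_energy(self, band_energy):
        """Split a band energy into (directional, non-directional) parts."""
        direct = self.direct_to_total * band_energy
        return direct, band_energy - direct


# Example: a band arriving from the left with 75% directional energy.
params = BandParameters(azimuth_deg=90.0, elevation_deg=0.0, direct_to_total=0.75)
direct, ambient = params.split_energy(4.0)
```

A renderer could use the directional part for panning towards `direction_vector()` and the remainder for a decorrelated ambient signal, although the rendering itself is outside what this sketch covers.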
- these parameters can be estimated from microphone-array captured audio signals, and for example a stereo signal can be generated from the microphone array signals to be conveyed with the spatial metadata.
- the stereo signal could be encoded, for example, with an AAC encoder or multiple instances of an EVS mono encoder.
- a corresponding decoder(s) can decode the audio signals into PCM signals, and, e.g., a synthesis processing or a renderer can process the sound in frequency bands (using the spatial metadata) to obtain the spatial output, for example a binaural output.
- the aforementioned solution is particularly suitable for encoding captured spatial sound from microphone arrays (e.g., in mobile phones, VR cameras, stand-alone microphone arrays).
- a further input for the encoder is also multi-channel loudspeaker input, such as 5.1 or 7.1 channel surround inputs.
- an apparatus for spatial audio signal encoding comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: determine, for two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction; and process the at least one resolution spatial audio parameter to be output and/or stored.
- the apparatus caused to determine, for two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction may be caused to determine at least one of: at least one spatial audio parameter comprising a direction parameter with an elevation and an azimuth component; at least one energy ratio parameter associated with the direction parameter; at least one diffuseness parameter associated with the direction parameter; and at least one coherence parameter associated with the direction parameter.
- the apparatus caused to determine, for the two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction may be caused to: determine at least one first time-frequency resolution from a range of frequency resolutions at least one first resolution spatial audio parameter for providing spatial audio reproduction; determine at least one second time-frequency resolution from a range of frequency resolutions at least one second resolution spatial audio parameter for providing spatial audio reproduction; and select one of the first resolution spatial audio parameter and second resolution spatial audio parameter as the at least one time-frequency resolution from a range of frequency resolutions.
- the apparatus may be further caused to: window the two or more audio signals to generate at least one frame of the two or more audio signals; filter the frame of the two or more audio signals to generate at least two sub-band representations of the at least one frame of the two or more audio signals, wherein the apparatus caused to determine, for two or more audio signals, and for a first time-frequency resolution at least one first resolution spatial audio parameter for providing spatial audio reproduction may be caused to analyse the at least two sub-band representations of the at least one frame of the two or more audio signals using the first time-frequency resolution, and wherein the apparatus caused to determine, for the two or more audio signals, and for a second time-frequency resolution at least one second resolution spatial audio parameter for providing spatial audio reproduction may be further caused to analyse the at least two sub-band representations of the at least one frame of the two or more audio signals using the second time-frequency resolution.
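Analysing the same frame at two time-frequency resolutions can be sketched as grouping one grid of sub-band values in two different ways. The sketch below assumes the windowing and filtering have already produced per-channel energies indexed as `[time_slot][sub_band]`. The function names and the simple left/right level ratio used as a stand-in parameter are illustrative assumptions; the application does not specify this particular analysis.

```python
def group_bounds(length, groups):
    """Split range(length) into `groups` contiguous (start, end) index ranges."""
    return [(g * length // groups, (g + 1) * length // groups)
            for g in range(groups)]


def analyse(left, right, n_time_groups, n_band_groups):
    """Return one stand-in parameter per (time group, band group) tile.

    left, right: per-channel energies indexed [time_slot][sub_band].
    The parameter is the left share of the tile energy, in [0, 1].
    """
    n_slots, n_bands = len(left), len(left[0])
    result = []
    for t0, t1 in group_bounds(n_slots, n_time_groups):
        row = []
        for b0, b1 in group_bounds(n_bands, n_band_groups):
            e_left = sum(left[t][b] for t in range(t0, t1) for b in range(b0, b1))
            e_right = sum(right[t][b] for t in range(t0, t1) for b in range(b0, b1))
            total = e_left + e_right
            row.append(e_left / total if total > 0 else 0.5)
        result.append(row)
    return result


# One frame of 4 time slots x 4 sub-bands for two channels.
left = [[3.0] * 4 for _ in range(4)]
right = [[1.0] * 4 for _ in range(4)]

# First resolution: fine in frequency (1 time group x 4 band groups).
high_frequency = analyse(left, right, n_time_groups=1, n_band_groups=4)
# Second resolution: fine in time (4 time groups x 1 band group).
high_time = analyse(left, right, n_time_groups=4, n_band_groups=1)
```

Both calls consume the same sub-band representation and produce the same number of parameters, so only the layout of the tiles changes between the two resolutions.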
- the apparatus caused to select at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may be further caused to select the at least one of the at least one first resolution spatial audio parameter and the at least one second resolution spatial audio parameter to be output and/or stored on a frame-by-frame basis.
- the apparatus may be further caused to generate a data structure for the at least one resolution spatial audio parameter, wherein the data structure may comprise at least one of: a switch field configured to indicate a selection of the at least one resolution from the range of frequency resolutions; a time-frequency description field configured to indicate the ordering of the selection of the at least one resolution from the range of frequency resolutions; and at least one parameter field configured to provide the at least one resolution spatial audio parameter.
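The three fields named above (a switch field, a time-frequency description field, and parameter fields) could be laid out as in the following sketch. The one-byte field widths, the class name `MetadataFrame`, and the time-major ordering are hypothetical choices made for illustration; the application does not fix them here.

```python
import struct
from dataclasses import dataclass
from typing import List


@dataclass
class MetadataFrame:
    """Hypothetical metadata frame for one audio frame."""
    switch: int            # switch field: index of the selected resolution
    time_groups: int       # time-frequency description: number of time groups
    band_groups: int       # time-frequency description: number of band groups
    parameters: List[int]  # parameter fields: one quantised value (0-255) per tile

    def pack(self) -> bytes:
        """Serialise as a 3-byte header followed by one byte per parameter."""
        if len(self.parameters) != self.time_groups * self.band_groups:
            raise ValueError("parameter count does not match the described layout")
        header = struct.pack("BBB", self.switch, self.time_groups, self.band_groups)
        return header + bytes(self.parameters)

    @classmethod
    def unpack(cls, data: bytes) -> "MetadataFrame":
        """Parse a frame produced by pack()."""
        switch, time_groups, band_groups = struct.unpack_from("BBB", data, 0)
        count = time_groups * band_groups
        return cls(switch, time_groups, band_groups, list(data[3:3 + count]))


# A frame that selected resolution 1 (4 time groups x 1 band group).
frame = MetadataFrame(switch=1, time_groups=4, band_groups=1,
                      parameters=[10, 20, 30, 40])
restored = MetadataFrame.unpack(frame.pack())
```

Because the description field tells the reader how many parameter fields follow and in which order, a decoder in this sketch can parse frames whose resolution changes from one frame to the next.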
- the apparatus caused to select at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may be further caused to generate an embedded data structure for the selected one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter, wherein the embedded data structure may comprise at least one embedded field configured to indicate the option selection of the at least one of the at least one first resolution spatial audio parameter and the at least one second resolution spatial audio parameter to be output and/or stored.
- the apparatus may be further caused to determine at least one suitability measure.
- the apparatus caused to determine at least one suitability measure may be caused to determine the at least one suitability measure based on at least one of: analysis of the two or more audio signals; analysis of a downmix based on the two or more audio signals; analysis of the at least one resolution spatial audio parameter for providing spatial audio reproduction more audio signals for the at least one time-frequency resolution; and visual analysis of an environment generating the at least two or more audio signals.
- the apparatus caused to select at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may be further caused to select the at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter based on the at least one suitability measure.
- the apparatus may be caused to determine the at least one time-frequency resolution based on the at least one suitability measure.
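A suitability measure driving the frame-by-frame choice of resolution might look like the sketch below. It uses one of the options listed above (analysis of the audio signals) and assumes a simple normalised temporal flux as the measure. The measure, the threshold of 0.5, and the labels `"high_time"` and `"high_frequency"` are hypothetical and chosen only to illustrate the selection step.

```python
def temporal_flux(slot_energies):
    """Mean absolute change between consecutive time slots, relative to mean energy."""
    if len(slot_energies) < 2:
        return 0.0
    mean = sum(slot_energies) / len(slot_energies)
    if mean == 0:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(slot_energies, slot_energies[1:])]
    return (sum(diffs) / len(diffs)) / mean


def select_resolution(slot_energies, threshold=0.5):
    """Pick a resolution label for one frame from its suitability measure."""
    if temporal_flux(slot_energies) > threshold:
        return "high_time"       # rapidly changing frame: favour time resolution
    return "high_frequency"      # stationary frame: favour frequency resolution


# Frame-by-frame selection over a short sequence of frames.
frames = [
    [1.0, 1.0, 1.0, 1.0],   # stationary
    [0.1, 4.0, 0.1, 0.1],   # transient
]
choices = [select_resolution(frame) for frame in frames]
```

The same pattern would apply to the other listed inputs, such as a measure computed from the downmix or from the analysed parameters themselves.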
- the apparatus may be further caused to downmix the at least two audio signals based on a selected one of the first time-frequency resolution and the second time-frequency resolution.
- the apparatus may be further caused to encode the at least two audio signals based on a selected one of the first time-frequency resolution and the second time-frequency resolution.
- the apparatus may be further caused to encode the selected at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter.
- a method for spatial audio signal encoding comprising: determining, for two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction; and processing the at least one resolution spatial audio parameter to be output and/or stored.
- Determining, for two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction may comprise at least one of: determining at least one spatial audio parameter comprising a direction parameter with an elevation and an azimuth component; determining at least one energy ratio parameter associated with the direction parameter; determining at least one diffuseness parameter associated with the direction parameter; and determining at least one coherence parameter associated with the direction parameter.
- Determining, for the two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction may comprise: determining at least one first time-frequency resolution from a range of frequency resolutions at least one first resolution spatial audio parameter for providing spatial audio reproduction; determining at least one second time-frequency resolution from a range of frequency resolutions at least one second resolution spatial audio parameter for providing spatial audio reproduction; and selecting one of the first resolution spatial audio parameter and second resolution spatial audio parameter as the at least one time-frequency resolution from a range of frequency resolutions.
- the method may further comprise: windowing the two or more audio signals to generate at least one frame of the two or more audio signals; filtering the frame of the two or more audio signals to generate at least two sub-band representations of the at least one frame of the two or more audio signals, wherein determining, for two or more audio signals, and for a first time-frequency resolution at least one first resolution spatial audio parameter for providing spatial audio reproduction may further comprise analysing the at least two sub-band representations of the at least one frame of the two or more audio signals using the first time-frequency resolution, and wherein determining, for the two or more audio signals, and for a second time-frequency resolution at least one second resolution spatial audio parameter for providing spatial audio reproduction may further comprise analysing the at least two sub-band representations of the at least one frame of the two or more audio signals using the second time-frequency resolution.
- Selecting at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may further comprise selecting the at least one of the at least one first resolution spatial audio parameter and the at least one second resolution spatial audio parameter to be output and/or stored on a frame-by-frame basis.
- the method may further comprise generating a data structure for the at least one resolution spatial audio parameter, wherein the data structure may comprise at least one of: a switch field configured to indicate a selection of the at least one resolution from the range of frequency resolutions; a time-frequency description field configured to indicate the ordering of the selection of the at least one resolution from the range of frequency resolutions; and at least one parameter field configured to provide the at least one resolution spatial audio parameter.
- Selecting at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may further comprise generating an embedded data structure for the selected one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter, wherein the embedded data structure may comprise at least one embedded field configured to indicate the option selection of the at least one of the at least one first resolution spatial audio parameter and the at least one second resolution spatial audio parameter to be output and/or stored.
- the method may further comprise determining at least one suitability measure.
- Determining at least one suitability measure may further comprise determining the at least one suitability measure based on at least one of: analysis of the two or more audio signals; analysing a downmix based on the two or more audio signals; analysing the at least one resolution spatial audio parameter for providing spatial audio reproduction more audio signals for the at least one time-frequency resolution; and visual analysing an environment generating the at least two or more audio signals.
- Selecting at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may further comprise selecting the at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter based on the at least one suitability measure.
- the method may comprise determining the at least one time-frequency resolution based on the at least one suitability measure.
- the method may further comprise downmixing the at least two audio signals based on a selected one of the first time-frequency resolution and the second time-frequency resolution.
- the method may further comprise encoding the at least two audio signals based on a selected one of the first time-frequency resolution and the second time-frequency resolution.
- the method may further comprise encoding the selected at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter.
- an apparatus for spatial audio signal encoding comprising: means for determining, for two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction; and means for processing the at least one resolution spatial audio parameter to be output and/or stored.
- the means for determining, for two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction may comprise at least one of: means for determining at least one spatial audio parameter comprising a direction parameter with an elevation and an azimuth component; means for determining at least one energy ratio parameter associated with the direction parameter; means for determining at least one diffuseness parameter associated with the direction parameter; and means for determining at least one coherence parameter associated with the direction parameter.
- the means for determining, for the two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction may comprise: means for determining at least one first time-frequency resolution from a range of frequency resolutions at least one first resolution spatial audio parameter for providing spatial audio reproduction; means for determining at least one second time-frequency resolution from a range of frequency resolutions at least one second resolution spatial audio parameter for providing spatial audio reproduction; and means for selecting one of the first resolution spatial audio parameter and second resolution spatial audio parameter as the at least one time-frequency resolution from a range of frequency resolutions.
- the apparatus may further comprise: means for windowing the two or more audio signals to generate at least one frame of the two or more audio signals; means for filtering the frame of the two or more audio signals to generate at least two sub-band representations of the at least one frame of the two or more audio signals, wherein the means for determining, for two or more audio signals, and for a first time-frequency resolution at least one first resolution spatial audio parameter for providing spatial audio reproduction may further comprise means for analysing the at least two sub-band representations of the at least one frame of the two or more audio signals using the first time-frequency resolution, and the means for determining, for the two or more audio signals, and for a second time-frequency resolution at least one second resolution spatial audio parameter for providing spatial audio reproduction may further comprise means for analysing the at least two sub-band representations of the at least one frame of the two or more audio signals using the second time-frequency resolution.
- the means for selecting at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may further comprise means for selecting the at least one of the at least one first resolution spatial audio parameter and the at least one second resolution spatial audio parameter to be output and/or stored on a frame-by-frame basis.
- the apparatus may further comprise means for generating a data structure for the at least one resolution spatial audio parameter, wherein the data structure may comprise at least one of: a switch field configured to indicate a selection of the at least one resolution from the range of frequency resolutions; a time-frequency description field configured to indicate the ordering of the selection of the at least one resolution from the range of frequency resolutions; and at least one parameter field configured to provide the at least one resolution spatial audio parameter.
- the means for selecting at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may further comprise means for generating an embedded data structure for the selected one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter, wherein the embedded data structure may comprise at least one embedded field configured to indicate the option selection of the at least one of the at least one first resolution spatial audio parameter and the at least one second resolution spatial audio parameter to be output and/or stored.
- the apparatus may further comprise means for determining at least one suitability measure.
- the means for determining at least one suitability measure may further comprise means for determining the at least one suitability measure based on at least one of: analysis of the two or more audio signals; analysis of a downmix based on the two or more audio signals; analysis of the at least one resolution spatial audio parameter for providing spatial audio reproduction more audio signals for the at least one time-frequency resolution; and visual analysis of an environment generating the at least two or more audio signals.
- the means for selecting at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may further comprise means for selecting the at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter based on the at least one suitability measure.
- the apparatus may comprise means for determining the at least one time-frequency resolution based on the at least one suitability measure.
- the apparatus may further comprise means for downmixing the at least two audio signals based on a selected one of the first time-frequency resolution and the second time-frequency resolution.
- the apparatus may further comprise means for encoding the at least two audio signals based on a selected one of the first time-frequency resolution and the second time-frequency resolution.
- the apparatus may further comprise means for encoding the selected at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter.
- An apparatus comprising means for performing the actions of the method as described above.
- An apparatus configured to perform the actions of the method as described above.
- a computer program comprising program instructions for causing a computer to perform the method as described above.
- a computer program product stored on a medium may cause an apparatus to perform the method as described herein.
- An electronic device may comprise apparatus as described herein.
- a chipset may comprise apparatus as described herein.
- Embodiments of the present application aim to address problems associated with the state of the art.
- Figure 1 shows schematically a system of apparatus suitable for implementing some embodiments
- Figure 2 shows schematically the multi-resolution analysis processor as shown in Figure 1 in further detail according to some embodiments;
- Figures 3a and 3b show schematically example metadata structure according to some embodiments
- Figure 4 shows schematically example multi-resolution modes according to some embodiments
- Figure 5 shows schematically an example embedded multi-resolution mode format according to some embodiments
- Figure 6 shows schematically a further example embedded multi-resolution mode format according to some embodiments
- Figure 7 shows a flow diagram of the operation of the system as shown in Figure 1 according to some embodiments
- Figure 8 shows a flow diagram of the operation of the multi-resolution analysis processor as shown in Figure 2 according to some embodiments
- Figure 9 shows a flow diagram of the operation of analysing the audio signal for one of the multi-resolution analysis operations as shown in Figure 8 according to some embodiments.
- Figure 10 shows schematically an example device suitable for implementing the apparatus shown.
- an immersive system is one in which the encoding and decoding attempt to retain the characteristics of the audio scene (as captured by microphones or synthesized otherwise) and aim to produce an immersive effect when presented to the listener.
- the input format may be any suitable input format, such as multi-channel loudspeaker, ambisonic (FOA/HOA) etc. It is understood that in some embodiments the channel location is based on a location of the microphone or is a virtual location or direction.
- the output of the example system is a binaural headphone presentation.
- a rendered audio signal output is passed to the listener as a pair of audio signals for a suitable headphone/earphone/headset signal.
- the output may be any suitable rendering, such as a multi-channel loudspeaker arrangement. For example, the multi-channel loudspeaker signals may be generalised to be two or more playback audio signals.
- Current speech and audio codecs and in particular immersive audio codecs support a multitude of operating points ranging from a low bit rate operation to transparency.
- An example of such a codec is the 3GPP IVAS codec, for which the standardization process began in 3GPP TSG-SA4 in October 2017. The completion of the standard is currently expected by the end of 2019.
- the IVAS codec is an extension of the 3GPP EVS codec and intended for new immersive voice and audio services over 4G/5G.
- Such immersive services include, e.g., stereo / binaural telephony, multichannel teleconferencing and immersive voice and audio for virtual reality (VR).
- This multi-purpose audio codec is expected to handle the encoding, decoding and rendering of speech, music and generic audio.
- one of interest is a parametric immersive audio format.
- spatial metadata parameters such as direction and direct-to-total energy ratio (or diffuseness-ratio, absolute energies, or any suitable expression indicating the directionality/non-directionality of the sound at the given time-frequency interval) parameters in frequency bands are particularly suitable for expressing the perceptual properties of natural and synthetic sound fields: natural sound fields such as those captured by microphones, and synthetic sound scenes such as 5.1 loudspeaker mixes.
- the spatial metadata parameters such as direction(s), energy ratio(s), diffuseness and coherence can be used to express the features of the sound field accurately.
- One concept which is being investigated is the use of multi-resolution parametric immersive codecs. These are codecs which have more than a single time-frequency resolution. For example, in some such embodiments it is possible to have analysis and processing with a high frequency resolution, or with a high temporal resolution, or to have systems combining these, for example via a switched system.
- These multi-resolution formats may for example be selected from or supported by internal processors and quantizers of an audio codec based on the input formats of the audio signals.
- the codec may treat a parametric immersive audio format separately or it may pass it (or at least the waveform part) through a similar processing path as it uses for other waveform-based formats.
- These other formats may include at least one of ambisonics (FOA/HOA), multi-channel (e.g., 2.0, 4.0, 5.1, 7.1, 7.1+4H, 22.2, and so on), and object-based audio.
- the audio codec may support independent streams with directional metadata and/or individual streams that may have a dependency metadata in addition to a directional metadata.
- Audio formats may also be combined. For example, we can have 'FOA + audio objects' or 'parametric immersive audio + individual streams' or any other combination that makes sense from the capture, content creation, or rendering point of view.
- At least some of the formats may be immersive or spatially analysed inside the audio codec. This analysis may, as discussed above, determine parameters such as directions of sound sources (expressed, e.g., as a direction on a sphere or alternatively an azimuth and elevation parameter per time-frequency tile). This processing may then be followed by an immersive downmix (in other words a downmix of the input audio signals suitable for encoding and producing a suitable immersive effect when later decoded and rendered to the listener based on the determined metadata) and metadata extraction.
- a parametric representation for enabling efficient encoding and transmission of the immersive scene may be created inside the encoder prior to waveform coding.
- the processing inside the codec substantially corresponds to a processing outside the codec. The concept is thus in some embodiments to support both high frequency resolution and high time resolution approaches for immersive audio coding.
- the concept may thus be characterized by apparatus and methods which implement an immersive metadata format switching functionality that allows for changing time/frequency (T/F) resolution of the parameters on a frame-by-frame basis and use of different strategies in different subband/subframe ranges.
- the immersive metadata of a parametric immersive audio format is defined in a way that it allows for different immersive capture analysis and processing approaches related to frequency and time resolution.
- the system 100 is shown with an ‘analysis’ part 121 and a ‘synthesis’ part 131.
- The ‘analysis’ part 121 is the part from receiving the multi-channel loudspeaker signals up to an encoding of the metadata and downmix signal and the ‘synthesis’ part 131 is the part from a decoding of the encoded metadata and downmix signal to the presentation of the re-generated signal (for example in multi-channel loudspeaker form).
- the input to the system 100 and the ‘analysis’ part 121 is the multi-channel signals 102.
- a microphone channel signal input is described, however any suitable input (or synthetic multi-channel) format may be implemented in other embodiments.
- the multi-channel signals are passed to a downmixer 103 and to an analysis processor 105.
- the downmixer 103 is configured to receive the multi-channel signals and downmix the signals to a determined number of channels and output the downmix signals 104.
- the downmixer 103 may be configured to generate a 2 audio channel downmix of the multi-channel signals.
- the determined number of channels may be any suitable number of channels.
- the downmixer 103 is optional and the multi-channel signals are passed unprocessed to an encoder 107 in the same manner as the downmix signals are in this example.
- the downmixer 103 is configured to generate downmix signals 104 on a frame by frame basis. As such in some embodiments the downmixer 103 is configured to receive windowed and filtered audio signals provided by the multi-resolution analysis processor 105 rather than directly via the input.
- the multi-resolution analysis processor 105 is also configured to receive the multi-channel signals and analyse the signals to produce metadata 106 associated with the multi-channel signals and thus associated with the downmix signals 104.
- the analysis processor 105 may be configured to generate the metadata which may comprise, for each time-frequency analysis interval, a direction parameter 108, an energy ratio parameter 110, a coherence parameter 112, and a diffuseness parameter 114.
- the direction, energy ratio and diffuseness parameters may in some embodiments be considered to be spatial audio parameters.
- the spatial audio parameters comprise parameters which aim to characterize the sound-field created by the multi-channel signals (or two or more playback audio signals in general).
- the coherence parameters may be considered to be signal relationship audio parameters which aim to characterize the relationship between the multi-channel signals.
- the multi-resolution analysis processor is configured to generate multiple-resolution analysis of the audio signals. For example in some embodiments the multi-resolution analysis processor generates a first time-frequency resolution metadata parameter set and a second time-frequency resolution metadata parameter set. In some embodiments the multi-resolution analysis processor is configured to select one of the generated sets and pass this to the metadata encoder.
- the parameters generated may differ from frequency band to frequency band.
- For example, in band X all of the parameters are generated and transmitted, whereas in band Y only one of the parameters is generated and transmitted, and furthermore in band Z no parameters are generated or transmitted.
- a practical example of this may be that for some frequency bands such as the highest band some of the parameters are not required for perceptual reasons.
- the downmix signals 104 and the metadata 106 may be passed to an encoder 107.
- the encoder 107 may comprise a core coder 109 which is configured to receive the downmix (or otherwise) signals 104 and generate a suitable encoding of these audio signals.
- the encoder 107 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs.
- the encoding may be implemented using any suitable scheme.
- the encoder 107 may furthermore comprise a metadata encoder or quantizer 109 which is configured to receive the metadata and output an encoded or compressed form of the information.
- the core encoder 109 can be achieved using various tools. For example, audio coding tools that are part of an IVAS codec (such as EVS tools) can be used. If the immersive downmix signal is a mono signal, a single-channel element (SCE) encoding can be utilized. This can be, in some embodiments, an encoding corresponding to the EVS standard. If the immersive downmix signal is a stereo signal (linear or binaural), a channel-pair element (CPE) encoding can be utilized. For example, dedicated stereo modes in IVAS can be utilized. If the immersive downmix signal is beyond a stereo representation, it is possible to use, e.g., various combinations of SCE and CPE encodings, or alternatively or in addition, a multichannel encoding. In some embodiments the encoder 107 may further interleave, multiplex to a single data stream or embed the metadata within the encoded downmix signals before transmission or storage, shown in Figure 1 by the dashed line. The multiplexing may be implemented using any suitable scheme.
- the received or retrieved data may be received by a decoder/demultiplexer 133.
- the decoder/demultiplexer 133 may demultiplex the encoded streams and pass the audio encoded stream to a downmix extractor 135 which is configured to decode the audio signals to obtain the downmix signals.
- the decoder/demultiplexer 133 may comprise a metadata extractor 137 which is configured to receive the encoded metadata and generate metadata.
- the decoder/demultiplexer 133 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs.
- the decoded metadata and downmix audio signals may be passed to a synthesis processor 139.
- the system 100 ‘synthesis’ part 131 further shows a synthesis processor 139 configured to receive the downmix and the metadata and re-create, in any suitable format, a synthesized spatial audio in the form of multi-channel signals 110 (these may be multichannel loudspeaker format or in some embodiments any suitable output format such as binaural or Ambisonics signals, depending on the use case) based on the downmix signals and the metadata.
- First the system (analysis part) is configured to receive multi-channel audio signals as shown in Figure 7 by step 701.
- the system (analysis part) is configured to analyse signals over multiple resolutions to generate metadata such as direction parameters, energy ratio parameters, diffuseness parameters and coherence parameters. One of the multiple resolutions may then be selected for output. The generation of multiple resolution metadata and the selection of one of them is shown in Figure 7 by step 703.
- the signal analysis which generates the metadata is performed based on a control signal or information or a decision from a controller or other part or function.
- the control signal may be configured to control the analyser to perform only one resolution analysis which is different from frame to frame (in other words pre-analysis select the analysis resolution rather than post-analysis select the resolution).
- the control signal for the analyser is determined based on analysis of the audio signal characteristics, for example analysis of the input audio signals, and based on this only one resolution parameter set per frame is created.
- the core codec that encodes the downmix may be configured to switch between short and long windows and this can be used as an indication to which resolution is used.
- the resolution is determined from the audio characteristics of the core coded downmix signal.
- the resolution does not need to be signalled in the metadata because it can be determined from the core coded downmix signal.
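As a minimal sketch of how a decoder might infer the metadata resolution from the core codec's window decision rather than from explicit signalling: the mapping below is an assumption for illustration (long window to the 20ms/24 sub-band example resolution, short window to the 5ms/6 sub-band example resolution), and the names `TFResolution`, `WINDOW_TO_RESOLUTION` and `resolution_from_core_window` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TFResolution:
    update_ms: float   # temporal resolution of the metadata
    num_subbands: int  # frequency resolution of the metadata


# Hypothetical mapping (an assumption, not taken from the source):
# a long core-codec window implies the high frequency resolution mode,
# a short window implies the high time resolution mode.
WINDOW_TO_RESOLUTION = {
    "long": TFResolution(update_ms=20.0, num_subbands=24),
    "short": TFResolution(update_ms=5.0, num_subbands=6),
}


def resolution_from_core_window(window_type: str) -> TFResolution:
    """Derive the metadata resolution from the decoded core window type."""
    try:
        return WINDOW_TO_RESOLUTION[window_type]
    except KeyError:
        raise ValueError(f"unknown core codec window type: {window_type!r}")
```

Because both encoder and decoder see the same core window decision, no extra metadata bit is needed in such a scheme.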
- the system (analysis part) is configured to generate a downmix of the multi-channel signals based on the selected resolution as shown in Figure 7 by step 705.
- the system is then configured to encode for storage/transmission the downmix signal and metadata as shown in Figure 7 by step 707.
- the system may store/transmit the encoded downmix and metadata as shown in Figure 7 by step 709.
- the system may retrieve/receive the encoded downmix and metadata as shown in Figure 7 by step 711.
- the system is configured to extract the downmix and metadata from encoded downmix and metadata parameters, for example demultiplex and decode the encoded downmix and metadata parameters, as shown in Figure 7 by step 713.
- the system (synthesis part) is configured to synthesize an output multi-channel audio signal based on extracted downmix of multi-channel audio signals and metadata as shown in Figure 7 by step 715.
- the multi-resolution analysis processor 105 comprises a windower 201.
- the windower 201 is configured to receive the input audio signals and generate a series of analysis periods or intervals of audio signal sample lengths. These can be passed to a filter bank 203. The windower may thus generate a series of frames from which multi-resolution sub-frames may be extracted.
- the analysis processor 105 furthermore may comprise a filter bank 203.
- the filter bank 203 in some embodiments is configured to perform an initial time-frequency domain transform of the windowed (multi-channel) audio signals 102 and apply a suitable time to frequency domain transform such as a Short Time Fourier Transform (STFT) in order to convert the input time domain signals into suitable time-frequency signals.
- These time-frequency signals may then be filtered according to any suitable band or sub-band configuration and passed to a series of multi-resolution immersive signal analyser parts.
- time-frequency signals may be represented in the time-frequency domain representation by
- n can be considered as a time index with a lower sampling rate than that of the original time-domain signals.
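The windowing and time-to-frequency transform described above can be illustrated with a minimal sketch. It uses a naive DFT and a periodic Hann window purely for readability; the function name `stft`, the frame length and the hop size are assumptions for illustration, and a practical implementation would use an FFT. Each returned frame corresponds to one value of the time index n, which advances once per hop and therefore runs at a lower rate than the input samples.

```python
import cmath
import math


def stft(signal, frame_len=8, hop=4):
    """Return a list of frames, each a list of complex bins 0..frame_len/2."""
    # Periodic Hann window (an illustrative choice of analysis window).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len)
              for i in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + i] * window[i] for i in range(frame_len)]
        # Naive DFT of the windowed segment, non-negative frequencies only.
        bins = [
            sum(seg[t] * cmath.exp(-2j * math.pi * b * t / frame_len)
                for t in range(frame_len))
            for b in range(frame_len // 2 + 1)
        ]
        frames.append(bins)
    return frames


# Usage: a cosine centred on bin 2 of an 8-point transform.
x = [math.cos(2 * math.pi * 2 * t / 8) for t in range(16)]
tf = stft(x)
```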
- the widths of the sub-bands can approximate any suitable distribution, for example the equivalent rectangular bandwidth (ERB) scale or the Bark scale.
- the sub-band size may be chosen directly based on the filter bank used in the audio codec.
- a filter bank may thus result in the use of a sub-band bandwidth that has, e.g., a minimum size of 400 Hz.
- although the STFT is used in the filter bank 203 in this example, any suitable implementation may be used.
- the filter bank 203 sub-band widths can be selected based on perceptual properties of human hearing.
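One way to obtain perceptually motivated sub-band widths is to space the band edges uniformly on the ERB-rate scale. The sketch below uses the widely used Glasberg and Moore ERB-rate approximation; the function names, the 24-band count and the 24 kHz upper limit are assumptions for illustration and are not taken from the source.

```python
import math


def hz_to_erb_rate(f_hz):
    """ERB-rate (number of ERBs below f_hz), Glasberg and Moore approximation."""
    return 21.4 * math.log10(1.0 + 0.00437 * f_hz)


def erb_rate_to_hz(erb_rate):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb_rate / 21.4) - 1.0) / 0.00437


def erb_band_edges(num_bands, f_max_hz):
    """Band edges in Hz, uniformly spaced on the ERB-rate scale from 0 to f_max_hz."""
    e_max = hz_to_erb_rate(f_max_hz)
    return [erb_rate_to_hz(e_max * i / num_bands) for i in range(num_bands + 1)]


# Usage: 24 sub-bands up to 24 kHz (illustrative values only).
edges = erb_band_edges(24, 24000.0)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
```

The resulting bands are narrow at low frequencies and progressively wider at high frequencies, which is the general behaviour the ERB and Bark scales share.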
- the multi-resolution analysis processor 105 comprises a series of different resolution immersive signal analysers. These are represented in Figure 2 by a 1st immersive signal analyser 205₁ and a 2nd immersive signal analyser 205₂.
- the immersive signal analysers may be configured to determine for a defined time and frequency (T/F) resolution a series of ‘immersive’ parameters for describing the audio signals.
- the immersive signal analyser comprises a direction analyser configured to receive the time-frequency signals and based on these signals estimate direction parameters on a frequency band-by-band (or groups of bands) basis.
- the direction parameters may be determined based on any audio based ‘direction’ determination.
- the direction analyser may be configured to estimate the direction with two or more signal inputs. This represents the simplest configuration to estimate a ‘direction’; more complex processing may be performed with even more signals.
- the direction analyser may thus be configured to provide an azimuth and elevation for each frequency band and temporal resolution, denoted as azimuth φ(k,n) and elevation θ(k,n).
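Since a direction may be expressed either as an azimuth and elevation pair or as a direction on a sphere, the following sketch converts one tile's azimuth and elevation into a unit vector. The axis convention (x to the front, y to the left, z up, angles in degrees) and the function name are assumptions for illustration; the source does not fix a convention.

```python
import math


def direction_to_unit_vector(azimuth_deg, elevation_deg):
    """Map azimuth/elevation of one time-frequency tile to (x, y, z) on the unit sphere.

    Assumed convention: azimuth 0 is straight ahead (+x), positive azimuth
    turns towards +y (left), positive elevation points towards +z (up).
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        math.cos(el) * math.cos(az),
        math.cos(el) * math.sin(az),
        math.sin(el),
    )
```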
- the direction parameter 108 may also be used to perform further analysis of the signal.
- the analyser is configured to determine an energy ratio parameter.
- the energy ratio may be considered to be a determination of the energy of the audio signal which can be considered to arrive from a direction.
- the direct-to-total energy ratio r(k,n) can be estimated, e.g., using a stability measure of the directional estimate, or using any correlation measure, or any other suitable method to obtain a ratio parameter.
- the signal analyser 205 may be configured to produce further signal parameters such as the two parameters: coherence and diffuseness, both analysed in time-frequency domain.
- the immersive signal analysis can consist of at least two parts and differ at least in frequency and time resolution. However, they may further differ in terms of parameters being analysed.
- the 1st immersive signal analyser 205₁ may be configured to generate analysis using a 20-ms update rate (in other words with a temporal resolution of 20ms) and with 24 sub-bands (for example where the bands are all the same size each band will have a frequency resolution of X/24 where X is the frequency range of the analysed audio signals, though it is understood that the bands may differ in size in some embodiments), and the 2nd immersive signal analyser 205₂ may be configured to generate analysis using a 5-ms update rate with 6 sub-bands. It is to be appreciated that these are examples only and other example resolutions are possible for both high frequency and high time resolution approaches. For example, significantly higher time resolution such as a 1.25-ms update rate may be utilized in some implementations.
- although each of the multi-resolution parameters is described as being generated in parallel, it would be understood that in some embodiments the analysis operations are performed in series or in a hybrid of series and parallel.
- the multi-resolution analysis processor 105 comprises a switch 207.
- the switch 207 is configured to receive the analysis parameters from each of the immersive signal analysers 205 and output one (or more) of these to the immersive metadata extractor 209.
- the switch 207 is configured to operate based on at least one suitability measure. In some embodiments this decision is performed differently for a subset of sub-frames or sub-bands. Thus at least one switching decision may be performed for each frame (or windowed period).
- the at least one suitability measure can be any suitable measure.
- the measure or analysis may be at least one of: the input audio signals, the parameters generated by the analysis of the audio signals, and analysis of the downmix audio signals.
- the determination may be made based on detecting that the input audio signals comprise a voice (for example by using a suitable voice activity determination) and that furthermore there is more than one voice source (for example based on a detection of at least two talkers in a scene). In such a situation where there are two simultaneous talkers, a faster time domain update cycle can provide better overall performance than a more accurate spectral resolution.
- other switching suitability measures may be used in various implementations of the invention. For example, if an analysis of the audio scene determines that the audio scene has a relatively flat spectrum and thus noisy ambience, shorter windows may produce better results. Also, where it is determined that there are impulsive energy fluctuations between sub-frames and thus the audio signals comprise transients, a faster update rate can be selected. On the other hand, where it is determined that there are stable tonal sounds (like classical musical instruments) a longer sub-frame window size and thus higher spectral accuracy may be selected.
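The suitability measures described above can be sketched as a simple per-frame decision. This is only an illustration under stated assumptions: the thresholds, the use of spectral flatness (geometric mean over arithmetic mean of the power spectrum) as the "flat spectrum" measure, and the names `spectral_flatness` and `choose_resolution` are all hypothetical and not taken from the source.

```python
import math


def spectral_flatness(power_spectrum):
    """Geometric mean divided by arithmetic mean; near 1 for noise-like spectra."""
    eps = 1e-12
    n = len(power_spectrum)
    log_mean = sum(math.log(p + eps) for p in power_spectrum) / n
    arithmetic_mean = sum(power_spectrum) / n
    return math.exp(log_mean) / (arithmetic_mean + eps)


def choose_resolution(subframe_energies, flatness,
                      transient_ratio=4.0, flatness_threshold=0.5):
    """Return 'high_time' or 'high_freq' for the current frame.

    transient_ratio and flatness_threshold are illustrative tuning values.
    """
    eps = 1e-12
    # Impulsive energy fluctuation between sub-frames suggests a transient.
    if max(subframe_energies) / (min(subframe_energies) + eps) >= transient_ratio:
        return "high_time"
    # A relatively flat spectrum suggests noisy ambience: prefer shorter windows.
    if flatness >= flatness_threshold:
        return "high_time"
    # Otherwise assume stable tonal content: prefer higher spectral accuracy.
    return "high_freq"
```

A voice-activity or talker-count test, as mentioned for the two-talker case, could be added as a further condition that also returns the high time resolution mode.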
- control of the post-analysis selection of the resolution from multiple resolutions or as described earlier a ‘pre-analysis’ selection of the resolution may relate to analysis of non-audio scene factors.
- the selection is based on visual tracking and segmentation of a scene, where the number of and types of sound sources are determined and used to, at least in part, decide the analysis resolution selection.
- the output of the switch 207 may be passed to the immersive metadata extractor 209.
- the multi-resolution analysis processor 105 comprises an immersive metadata extractor 209.
- the immersive metadata extractor 209 may be configured to receive the output of the switch 207 and pass it to a metadata compressor/encoder 111. Furthermore the immersive metadata extractor 209 may be configured to output the resolution of the extracted metadata to the core coder 109 and downmixer 103 such that the time/frequency resolution of the metadata may be matched in the downmixer 103 and/or core coder 109.
- the multi-resolution analysis processor 105 comprises a metadata compressor/encoder 111.
- the metadata compressor/encoder 111 is configured to receive the extracted metadata and generate a suitable data format to output to enable the extraction of the metadata at a suitable decoder/receiver. This for example requires the ability to generate suitable control information which enables the receiver/decoder to be able to determine the information content in the encoded format.
- the immersive audio metadata may be compressed depending on the bit rate restriction of the current coding mode. Thus in some embodiments some immersive fidelity will typically be lost, as is usually the case with lossy compression. However, the compression can utilize various perceptual techniques to minimize the impact.
- Figure 3a shows a table structure where a switching is supported between the at least two modes with different time/frequency resolutions.
- the table shows two columns, a metadata field 301 and a metadata value field(s) 303.
- the table includes a switch field 302. For two modes this switch field may be supported by a single bit (‘Switch’ field).
- the two modes may be static for a given transmission (e.g., streaming or communications call), i.e., they may be known (e.g., only two resolutions are allowed by a specific implementation) or the resolutions may be communicated out of band.
- the data format shown in Figure 3a furthermore shows the various parameters and their values grouped by parameter.
- the 1st to nth sub-band values of the first parameter are shown in the first part 311
- the 1st to nth sub-band values of the second parameter are shown in the second part 313
- the 1st to nth sub-band values of the Yth parameter are shown in the Yth part 315.
- the metadata structure and size can in some embodiments remain constant, for example when we utilize the time/frequency (T/F) modes of the example analyser shown in Figure 2 (which implemented 20ms/24 sub-band and 5ms/6 sub-band resolutions).
- the first mode may provide 1x24 sub-bands and thus 1x24 sub-band values per parameter.
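The reason the metadata size can stay constant across these example modes is simple arithmetic: the number of values per parameter in a frame is the number of sub-frames times the number of sub-bands. A minimal sketch, assuming a 20 ms frame and a hypothetical helper name `values_per_parameter`:

```python
FRAME_MS = 20  # frame length assumed from the 20 ms example mode


def values_per_parameter(update_ms, num_subbands, frame_ms=FRAME_MS):
    """Number of values one parameter needs per frame at a given T/F resolution."""
    if frame_ms % update_ms != 0:
        raise ValueError("update interval must divide the frame length")
    num_subframes = frame_ms // update_ms
    return num_subframes * num_subbands


# 20ms/24 sub-bands -> 1 x 24 values, 5ms/6 sub-bands -> 4 x 6 values.
high_freq_mode = values_per_parameter(20, 24)
high_time_mode = values_per_parameter(5, 6)
```

Both modes come to 24 values per parameter, so a single ‘Switch’ bit is enough to tell the decoder how to interpret a fixed-size block.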
- Figure 3b shows a table structure similar to that shown in Figure 3a but with an additional field, a T/F description field 305 and associated value(s) 306.
- This field may describe the T/F resolution in the current frame and/or the T/F resolutions overall supported by the encoder/decoder or used in the current transmission. This may take for example the values of the example, where the switching bit or ‘Switch’ field value is used to index the ‘T/F description’ field.
- the ‘T/F description’ field in Figure 3b may furthermore include information on how the parameter values are ordered.
- the structure as shown in Figure 3b may be adapted.
- the orderings of ‘Parameter 1’ may be as follows: sub-band 0, 1, 2, ..., 23 and 0, 1, 2, 3, 4, 5 for sub-frame 0, followed similarly by sub-frames 1, 2, 3, respectively, giving the same internal order as for the format shown in Figure 3a.
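The ordering described above, with the ‘Switch’ value indexing a T/F description, can be sketched as follows. The table contents mirror the two example modes; the names `TF_DESCRIPTION` and `value_order` are hypothetical, and a real ‘T/F description’ field could carry more information than is assumed here.

```python
# Hypothetical T/F description table indexed by the 'Switch' field value:
# (number of sub-frames per frame, number of sub-bands per sub-frame).
TF_DESCRIPTION = [
    (1, 24),  # switch = 0: 20ms/24 sub-bands
    (4, 6),   # switch = 1: 5ms/6 sub-bands
]


def value_order(switch_bit):
    """List (sub_frame, sub_band) pairs in the order the values are stored."""
    num_subframes, num_subbands = TF_DESCRIPTION[switch_bit]
    # Sub-bands run fastest; sub-frames follow one after another.
    return [(sf, sb)
            for sf in range(num_subframes)
            for sb in range(num_subbands)]
```

For example, with the switch set to 1 the seventh stored value (index 6) belongs to sub-frame 1, sub-band 0.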
- this additional field may allow the decoder to optimize the memory allocation or other aspects for the synthesis.
- the receiver may receive information about the possible configurations for analysis resolution within a specific transmission from this information. This information may allow for embodiments with faster adaptation.
- the T/F description field and associated value may be used to convey information on a switching in a subset of the sub-frames or sub-bands.
- the ‘Switch’ field may here have more values that correspond to mixtures of at least two switched modes within the current frame. For example, a third analysis mode may be a 10ms 12 sub-band mode (in other words each 20ms window is divided into two 10ms sub-frames and each sub-frame has 12 sub-bands: 10ms/12sb).
- the structure of the data format information may also be provided in different fields, as the metadata structures shown are only examples.
- In Figure 4 an example arrangement 401 of different T/F modes within a frame is shown.
- the bottom line 404 summarizes the time resolution (T) in terms of ms, and the number of frequency sub-bands (F) for each T is indicated 402 over each column.
- This example as shown in Figure 4 shows six different modes of operation: a first mode utilizes 1 update cycle of 20 ms 403, the second mode utilizes 2 update cycles of 10 ms each 405₁ and 405₂, the third mode utilizes 4 update cycles of 5 ms 407₁ to 407₄, and the remaining three modes utilize 3 update cycles where one of the cycles is 10 ms and the other two 5 ms each 409₁ to 409₃, 411₁ to 411₃, and 413₁ to 413₃.
- the first mode 403 has a single column of 24 sub-bands (0...23) representing one parameter value per sub-band for each of the 24 sub-bands for the 20ms frame length.
- the second mode has two columns of 12 sub-bands (0..11, 0..11) representing one parameter value per sub-band for each of the 12 sub-bands for each of the 10ms sub-frames.
- the third mode has four columns of 6 sub-bands (0..5, 0..5, 0..5, 0..5) representing one parameter value per sub-band for each of the 6 sub-bands for each of the 5ms sub-frames.
- the fourth mode has three columns, one of 12 sub-bands and two of 6 sub-bands (0..11, 0..5, 0..5) representing one parameter value per sub-band for each of the 12 sub-bands in the 10ms sub-frame and the 6 sub-bands for each of the two 5ms sub-frames.
- the fifth mode has three columns, two of 6 sub-bands and one of 12 sub-bands (0..5, 0..5, 0..11) representing one parameter value per sub-band for each of the 6 sub-bands for each of the two 5ms sub-frames and the 12 sub-bands in the 10ms sub-frame.
- the sixth mode has three columns, one of 12 sub-bands and two of 6 sub-bands (0..5, 0..11, 0..5) representing one parameter value per sub-band for each of the 6 sub-bands in the first 5ms sub-frame, the 12 sub-bands in the 10ms sub-frame and the 6 sub-bands in the last 5ms sub-frame.
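The six arrangements described for Figure 4 can be written down as data and checked for consistency: every mode should cover the full 20 ms frame and carry the same number of parameter values. The dictionary layout and the name `summarize` below are illustrative assumptions; the durations and sub-band counts are taken from the description above.

```python
# Each mode is a sequence of update cycles: (duration in ms, number of sub-bands).
MODES = {
    1: [(20, 24)],
    2: [(10, 12), (10, 12)],
    3: [(5, 6), (5, 6), (5, 6), (5, 6)],
    4: [(10, 12), (5, 6), (5, 6)],
    5: [(5, 6), (5, 6), (10, 12)],
    6: [(5, 6), (10, 12), (5, 6)],
}


def summarize(mode):
    """Return (total duration in ms, total parameter values) for a mode."""
    cycles = MODES[mode]
    total_ms = sum(duration for duration, _ in cycles)
    total_values = sum(subbands for _, subbands in cycles)
    return total_ms, total_values


summary = {mode: summarize(mode) for mode in MODES}
```

Every mode sums to 20 ms and 24 values per parameter, so mixing 10 ms and 5 ms sub-frames within a frame does not change the size of the parameter block in this example.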
- the metadata compressor/encoder may furthermore define an embedded structure of T/F resolution switching for the immersive metadata. In other words it may define a data structure wherein a different mode may be selected for each of the sub-bands.
- the embedded structure could be always active. In other words the value ‘Embedded bit 0’ would not be used.
- The “fixed” mode structure may be, e.g., similar to the data structure shown earlier with respect to Figure 3a.
- the embedded mode differs from the structures shown previously in that it comprises an embedded switching bit before each sub-band or, in some embodiments, e.g., each group of sub-bands. The positions of the embedded bits are not fixed due to the different size of the sub-bands (different T/F resolutions).
- Figure 5 shows the order of the parameters in this first example.
- the data structure 501 shows an initial row 502 indicating the data structure is an embedded data structure.
- the data structure 501 on rows 2 to 5 defines the fixed mode operation wherein the mode switch identifies or selects the first mode (mode_0) or the second mode (mode_1) and the order of the sub-bands for each of the modes is then defined in the fixed structure 504. It is understood that the rows 4 and 5 of Figure 5 may not be the same size. Although the example shown herein has two modes and thus is indicated by a single bit, in some embodiments there may be more than two modes, which are indicated by a flag value more than one bit in length.
- the data structure 501 on rows 8 onwards then defines the embedded mode operation wherein the embedded bits switch or identify between the first mode (mode_0) or the second mode (mode_1) for each bifurcation or splitting.
- rows 8 and 12 show the identification 506 of the first element as being either the first mode first sub-band (mode_0 Subband_1) where the first embedded bit is 0 or the second mode first sub-band (mode_1 Subband_1) where the first embedded bit is 1.
- rows 10 and 11 follow the selection of the first mode for the first value with the identification 508 of the following elements as being either the first mode second sub-band (mode_0 Subband_2) onwards where the second embedded bit is 0 or the second mode second sub-band (mode_1 Subband_2) onwards where the second embedded bit is 1.
- there may be further embedded bit values for selecting further sub-bands. For example there may be a further set of embedded bit values, embedded bit 3, which select whether mode_0 or mode_1 Sub-band 3 is selected and so on.
- rows 14 and 15 follow the selection of the second mode for the first value with the identification 510 of the following elements as being either the first mode second sub-band (mode_0 Subband_2) onwards where the second embedded bit is 0 or the second mode second sub-band (mode_1 Subband_2) onwards where the second embedded bit is 1. Similarly there may also be further embedded bit values for selecting further mode sub-bands.
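A minimal sketch of reading such an embedded structure is given below. It assumes, purely for illustration, that the stream is a flat list in which each embedded bit is immediately followed by the values of the sub-band it selects, and that a mode_0 sub-band carries 1 value (one 20 ms update) while a mode_1 sub-band carries 4 values (four 5 ms updates). These sizes, and the names `VALUES_PER_SUBBAND` and `parse_embedded`, are assumptions and are not specified by the source.

```python
# Assumed number of values following each embedded bit, per selected mode.
VALUES_PER_SUBBAND = {0: 1, 1: 4}


def parse_embedded(stream, num_subbands):
    """Parse [bit, values..., bit, values..., ...] into (mode, values) pairs.

    Returns the parsed sub-bands and the number of stream elements consumed.
    """
    pos = 0
    parsed = []
    for _ in range(num_subbands):
        mode = stream[pos]           # embedded switching bit
        pos += 1
        count = VALUES_PER_SUBBAND[mode]
        parsed.append((mode, stream[pos:pos + count]))
        pos += count                 # next bit position depends on this mode
    return parsed, pos


# Usage: first sub-band in mode_0, second sub-band in mode_1.
example_stream = [0, 10, 1, 20, 21, 22, 23]
subbands, consumed = parse_embedded(example_stream, 2)
```

The sketch shows why the embedded bit positions are not fixed in this arrangement: the location of each bit depends on the modes chosen for all preceding sub-bands.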
- the data structure is defined such that the positions of the embedded bits are known, and the size of the sub-bands is taken into account. This is shown for example with respect to the data structure shown in Figure 6.
- Figure 6 thus for example shows a data structure 601 with an initial row 602 indicating the data structure is an embedded data structure.
- the data structure 601 on rows 2 to 5 defines the fixed mode operation wherein the mode switch identifies or selects the first mode (mode_0) or the second mode (mode_1) and the order of the sub-bands for each of the modes is then defined in the fixed structure 604. It is understood that the rows 4 and 5 of Figure 6 may not be the same size.
- the data structure 601 on rows 6 onwards then defines the embedded mode operation wherein the embedded bits switch or identify between the first mode (mode_0) or the second mode (mode_1) for each bifurcation or splitting.
- rows 8 and 12 show the identification 606 of the first element as being either the first mode first sub-band (mode_0 Sb_1) where the first embedded bit is 0 or the second mode first sub-band to fourth sub-band (mode_1 Sb_1 to mode_1 Sb_4) where the first embedded bit is 1.
- rows 10 and 11 follow the selection of the first mode for the first embedded bit value with the identification 608 of the following elements as being either the first mode second sub-band (mode_0 Sb_2) where the second embedded bit is 0 or the second mode fifth sub-band (mode_1 Sb_5) onwards where the second embedded bit is 1.
- rows 14 and 15 follow the selection of the second mode for the first value with the identification 610 of the following elements as being either the first mode second sub-band (mode_0 Sb_2) onwards where the second embedded bit is 0 or the second mode fifth sub-band (mode_1 Sb_5) onwards where the second embedded bit is 1.
- the first operation is one of windowing the (time domain) multichannel audio signals (to generate frames from which the sub-frames can be generated) as shown in Figure 8 by step 801.
- the final operation being one of outputting the determined parameters and generating a suitable metadata structure such as described herein and as shown in Figure 8 by step 809.
- the immersive signal analyser is configured to determine the analysis sub-frame window and sub-bands to be analysed as shown in Figure 9 by step 901.
- the first operation is one of performing a direction analysis to determine directions and energy ratio parameters for each resolution time and frequency as shown in Figure 9 by step 903.
- the analysis is configured to determine coherence parameters and diffuseness parameters (and optionally to modify energy ratios based on the determined coherence parameters) for each resolution time and frequency as shown in Figure 9 by step 905.
- the device may be any suitable electronics device or apparatus.
- the device 1400 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.
- the device 1400 comprises at least one processor or central processing unit 1407.
- the processor 1407 can be configured to execute various program codes such as the methods such as described herein.
- the device 1400 comprises a memory 1411.
- the at least one processor 1407 is coupled to the memory 1411.
- the memory 1411 can be any suitable storage means.
- the memory 1411 comprises a program code section for storing program codes implementable upon the processor 1407.
- the memory 1411 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1407 whenever needed via the memory-processor coupling.
- the device 1400 comprises a user interface 1405.
- the user interface 1405 can be coupled in some embodiments to the processor 1407.
- the processor 1407 can control the operation of the user interface 1405 and receive inputs from the user interface 1405.
- the user interface 1405 can enable a user to input commands to the device 1400, for example via a keypad.
- the user interface 1405 can enable the user to obtain information from the device 1400.
- the user interface 1405 may comprise a display configured to display information from the device 1400 to the user.
- the user interface 1405 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1400 and further displaying information to the user of the device 1400.
- the user interface 1405 may be the user interface for communicating with the position determiner as described herein.
- the device 1400 comprises an input/output port 1409.
- the input/output port 1409 in some embodiments comprises a transceiver.
- the transceiver in such embodiments can be coupled to the processor 1407 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network.
- the transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
- the transceiver can communicate with further apparatus by any suitable known communications protocol.
- the transceiver or transceiver means can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).
- the transceiver input/output port 1409 may be configured to receive the signals and in some embodiments determine the parameters as described herein by using the processor 1407 executing suitable code. Furthermore the device may generate a suitable downmix signal and parameter output to be transmitted to the synthesis device.
- the device 1400 may be employed as at least part of the synthesis device.
- the input/output port 1409 may be configured to receive the downmix signals and in some embodiments the parameters determined at the capture device or processing device as described herein, and generate a suitable audio signal format output by using the processor 1407 executing suitable code.
- the input/output port 1409 may be coupled to any suitable audio output for example to a multichannel speaker system and/or headphones or similar.
- the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof.
- some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto.
- While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
- the embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware.
- any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions.
- the software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
- the memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory.
- the data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
- Embodiments of the inventions may be practiced in various components such as integrated circuit modules.
- the design of integrated circuits is by and large a highly automated process.
- Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
- Programs such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules.
- the resultant design in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
Abstract
An apparatus for spatial audio signal encoding, the apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: determine, for two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction; and process the at least one resolution spatial audio parameter to be output and/or stored.
Description
DETERMINATION OF SPATIAL AUDIO PARAMETER ENCODING AND ASSOCIATED DECODING
Field
The present application relates to apparatus and methods for sound-field related parameter encoding.
Background
Parametric spatial audio processing is a field of audio signal processing where the spatial aspect of the sound is described using a set of parameters. For example, in parametric spatial audio capture from microphone arrays, it is a typical and an effective choice to estimate from the microphone array signals a set of parameters such as directions of the sound in frequency bands, and the ratios between the directional and non-directional parts of the captured sound in frequency bands. These parameters are known to well describe the perceptual spatial properties of the captured sound at the position of the microphone array. These parameters can be utilized in synthesis of the spatial sound accordingly, for headphones binaurally, for loudspeakers, or to other formats, such as Ambisonics and these parameters can be used to estimate sound in positions within an environment captured by the microphone array.
The directions and direct-to-total energy ratios in frequency bands are thus a parameterization that is particularly effective for spatial audio capture.
A parameter set consisting of a direction parameter in frequency bands and an energy ratio parameter in frequency bands (indicating the directionality of the sound) can be also utilized as the spatial metadata for an audio codec. For example, these parameters can be estimated from microphone-array captured audio signals, and for example a stereo signal can be generated from the microphone array signals to be conveyed with the spatial metadata. The stereo signal could be encoded, for example, with an AAC encoder or multiple instances of an EVS mono encoder. A corresponding decoder(s) can decode the audio signals into PCM signals, and, e.g., a synthesis processing or a renderer can process the sound in frequency bands (using the spatial metadata) to obtain the spatial output, for example a binaural output.
The aforementioned solution is particularly suitable for encoding captured spatial sound from microphone arrays (e.g., in mobile phones, VR cameras, stand-alone microphone arrays). However, it may be desirable for such an encoder to have also other input types than microphone-array captured signals, for example, loudspeaker signals, audio object signals, or Ambisonic signals.
Analysing first-order Ambisonics (FOA) inputs for spatial metadata extraction has been thoroughly documented in scientific literature related to Directional Audio Coding (DirAC) and Harmonic planewave expansion (Harpex). This is since there exist microphone arrays directly providing a FOA signal (more accurately: its variant, the B-format signal), and analysing such an input has thus been a point of study in the field.
A further input for the encoder is also multi-channel loudspeaker input, such as 5.1 or 7.1 channel surround inputs.
Summary
According to a first aspect there is provided an apparatus for spatial audio signal encoding, the apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: determine, for two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction; and process the at least one resolution spatial audio parameter to be output and/or stored.
The apparatus caused to determine, for two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction may be caused to determine at least one of: at least one spatial audio parameter comprising a direction parameter with an elevation and an azimuth component; at least one energy ratio parameter associated with the direction parameter; at least one diffuseness parameter associated with the direction parameter; and at least one coherence parameter associated with the direction parameter.
The apparatus caused to determine, for the two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction may be caused to: determine at least one first time-frequency resolution from a range of frequency resolutions at least one first resolution spatial audio parameter for providing spatial audio reproduction; determine at least one second time-frequency resolution from a range of frequency resolutions at least one second resolution spatial audio parameter for providing spatial audio reproduction; and select one of the first resolution spatial audio parameter and second resolution spatial audio parameter as the at least one time-frequency resolution from a range of frequency resolutions.
The apparatus may be further caused to: window the two or more audio signals to generate at least one frame of the two or more audio signals; filter the frame of the two or more audio signals to generate at least two sub-band representations of the at least one frame of the two or more audio signals, wherein the apparatus caused to determine, for two or more audio signals, and for a first time-frequency resolution at least one first resolution spatial audio parameter for providing spatial audio reproduction may be caused to analyse the at least two sub-band representations of the at least one frame of the two or more audio signals using the first time-frequency resolution, and wherein the apparatus caused to determine, for the two or more audio signals, and for a second time-frequency resolution at least one second resolution spatial audio parameter for providing spatial audio reproduction may be further caused to analyse the at least two sub-band representations of the at least one frame of the two or more audio signals using the second time-frequency resolution.
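As a minimal sketch of analysing the same frame at two time-frequency resolutions, the following assumes the frame has already been windowed and filtered into a fine grid of sub-band energies. The filterbank is not shown, and the function name `regroup`, the averaging rule and the 4 x 4 grid size are hypothetical choices made for illustration only.

```python
from typing import List

Grid = List[List[float]]  # grid[t][b]: energy in time slot t, sub-band b


def regroup(grid: Grid, n_time: int, n_band: int) -> Grid:
    """Average a fine time/sub-band grid down to n_time x n_band tiles."""
    slots, bands = len(grid), len(grid[0])
    if slots % n_time or bands % n_band:
        raise ValueError("resolution must divide the fine grid evenly")
    ts, bs = slots // n_time, bands // n_band  # fine cells per coarse tile
    out = []
    for t in range(n_time):
        row = []
        for b in range(n_band):
            cells = [grid[i][j]
                     for i in range(t * ts, (t + 1) * ts)
                     for j in range(b * bs, (b + 1) * bs)]
            row.append(sum(cells) / len(cells))
        out.append(row)
    return out


# Hypothetical frame: 4 time slots x 4 sub-bands of energies.
fine = [[float(4 * t + b) for b in range(4)] for t in range(4)]

# First resolution: one time slot, all four bands kept (high frequency resolution).
high_frequency_resolution = regroup(fine, n_time=1, n_band=4)

# Second resolution: four time slots, one wide band (high temporal resolution).
high_temporal_resolution = regroup(fine, n_time=4, n_band=1)
```

Both calls produce the same number of tiles per frame (four), so in this sketch the two resolutions trade frequency detail against temporal detail without changing the amount of data per frame.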
The apparatus caused to select at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may be further caused to select the at least one of the at least one first resolution spatial audio parameter and the at least one second resolution spatial audio parameter to be output and/or stored on a frame-by-frame basis.
The apparatus may be further caused to generate a data structure for the at least one resolution spatial audio parameter, wherein the data structure may comprise at least one of: a switch field configured to indicate a selection of the at least one resolution from the range of frequency resolutions; a time-frequency description field configured to indicate the ordering of the selection of the at least one resolution from the range of frequency resolutions; and at least one parameter field configured to provide the at least one resolution spatial audio parameter.
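One possible reading of such a data structure is sketched below. The class and field names, the integer encoding of the switch field, and the interpretation of the time-frequency description as a pair (time slots, bands) are all assumptions made for illustration; the text does not fix a concrete layout.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BandParameters:
    """Spatial parameters for one time-frequency tile (hypothetical layout)."""
    azimuth_deg: float
    elevation_deg: float
    energy_ratio: float  # direct-to-total energy ratio, expected in 0..1


@dataclass
class SpatialMetadataFrame:
    switch: int                # assumed: index of the selected resolution
    tf_description: List[int]  # assumed: [number of time slots, number of bands]
    parameters: List[BandParameters] = field(default_factory=list)

    def is_consistent(self) -> bool:
        """Check that there is exactly one parameter set per described tile."""
        n_time, n_band = self.tf_description
        return len(self.parameters) == n_time * n_band


# Hypothetical frame using resolution 1: four time slots, one band.
frame = SpatialMetadataFrame(
    switch=1,
    tf_description=[4, 1],
    parameters=[BandParameters(30.0, 0.0, 0.5) for _ in range(4)],
)
```

A decoder following this sketch could read the switch field first and then use the time-frequency description to know how many parameter fields to expect.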
The apparatus caused to select at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may be further caused to generate an embedded data structure for the selected one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter, wherein the embedded data structure may comprise at least one embedded field configured to indicate the option selection of the at least one of the at least one first resolution spatial audio parameter and the at least one second resolution spatial audio parameter to be output and/or stored.
The apparatus may be further caused to determine at least one suitability measure.
The apparatus caused to determine at least one suitability measure may be caused to determine the at least one suitability measure based on at least one of: analysis of the two or more audio signals; analysis of a downmix based on the two or more audio signals; analysis of the at least one resolution spatial audio parameter for providing spatial audio reproduction more audio signals for the at least one time-frequency resolution; and visual analysis of an environment generating the at least two or more audio signals.
The apparatus caused to select at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may be further caused to select the at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter based on the at least one suitability measure.
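The selection based on a suitability measure could, for instance, be sketched as follows. The resolution names, the numeric suitability scores and the rule of picking the highest score are assumptions for illustration; the text leaves the form of the measure open.

```python
from typing import Dict, List, Sequence, Tuple


def select_resolution(
    candidates: Dict[str, Sequence[float]],
    suitability: Dict[str, float],
) -> Tuple[str, Sequence[float]]:
    """Return the name and parameters of the candidate with the highest score."""
    best = max(candidates, key=lambda name: suitability[name])
    return best, candidates[best]


def select_per_frame(
    frames: List[Tuple[Dict[str, Sequence[float]], Dict[str, float]]],
) -> List[str]:
    """Apply the selection independently to each frame (frame-by-frame basis)."""
    return [select_resolution(cands, scores)[0] for cands, scores in frames]


# Hypothetical parameters for one frame at two resolutions.
candidates = {"high_frequency": [0.1, 0.2, 0.3, 0.4], "high_temporal": [0.3]}
suitability = {"high_frequency": 0.4, "high_temporal": 0.9}
chosen_name, chosen_params = select_resolution(candidates, suitability)
```

In this sketch the selected name plays the role of the switch indication, and only the chosen parameters would be passed on to be output and/or stored.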
The apparatus may be caused to determine the at least one time-frequency resolution based on the at least one suitability measure.
The apparatus may be further caused to downmix the at least two audio signals based on a selected one of the first time-frequency resolution and the second time-frequency resolution.
The apparatus may be further caused to encode the at least two audio signals based on a selected one of the first time-frequency resolution and the second time-frequency resolution.
The apparatus may be further caused to encode the selected at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter.
According to a second aspect there is provided a method for spatial audio signal encoding, the method comprising: determining, for two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction; and processing the at least one resolution spatial audio parameter to be output and/or stored.
Determining, for two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction may comprise at least one of: determining at least one spatial audio parameter comprising a direction parameter with an elevation and an azimuth component; determining at least one energy ratio parameter associated with the direction parameter; determining at least one diffuseness parameter associated with the direction parameter; and determining at least one coherence parameter associated with the direction parameter.
Determining, for the two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction may comprise: determining at least one first time-frequency resolution from a range of frequency resolutions at least one first resolution spatial audio parameter for providing spatial audio reproduction; determining at least one second time-frequency resolution from a range of frequency resolutions at least one second resolution spatial audio parameter for providing spatial audio reproduction; and selecting one of the first resolution spatial audio parameter and second resolution spatial audio parameter as the at least one time-frequency resolution from a range of frequency resolutions.
The method may further comprise: windowing the two or more audio signals to generate at least one frame of the two or more audio signals; filtering the frame of the two or more audio signals to generate at least two sub-band representations of the at least one frame of the two or more audio signals, wherein determining, for two or more audio signals, and for a first time-frequency resolution at least one first resolution spatial audio parameter for providing spatial audio reproduction may further comprise analysing the at least two sub-band representations of the at least one frame of the two or more audio signals using the first time-frequency resolution, and wherein determining, for the two or more audio signals, and for a second time-frequency resolution at least one second resolution spatial audio parameter for providing spatial audio reproduction may further comprise analysing the at least two sub-band representations of the at least one frame of the two or more audio signals using the second time-frequency resolution.
Selecting at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may further comprise selecting the at least one of the at least one first resolution spatial audio parameter and the at least one second resolution spatial audio parameter to be output and/or stored on a frame-by-frame basis.
The method may further comprise generating a data structure for the at least one resolution spatial audio parameter, wherein the data structure may comprise at least one of: a switch field configured to indicate a selection of the at least one resolution from the range of frequency resolutions; a time-frequency description field configured to indicate the ordering of the selection of the at least one resolution from the range of frequency resolutions; and at least one parameter field configured to provide the at least one resolution spatial audio parameter.
Selecting at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may further comprise generating an embedded data structure for the selected one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter, wherein the embedded data structure may comprise at least one embedded field configured to indicate the option selection of the at least one of the at least one first resolution spatial audio parameter and the at least one second resolution spatial audio parameter to be output and/or stored.
The method may further comprise determining at least one suitability measure.
Determining at least one suitability measure may further comprise determining the at least one suitability measure based on at least one of: analysing the two or more audio signals; analysing a downmix based on the two or more audio signals; analysing the at least one resolution spatial audio parameter for providing spatial audio reproduction more audio signals for the at least one time-frequency resolution; and visually analysing an environment generating the at least two or more audio signals.
Selecting at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may further comprise selecting the at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter based on the at least one suitability measure.
The method may comprise determining the at least one time-frequency resolution based on the at least one suitability measure.
The method may further comprise downmixing the at least two audio signals based on a selected one of the first time-frequency resolution and the second time-frequency resolution.
The method may further comprise encoding the at least two audio signals based on a selected one of the first time-frequency resolution and the second time-frequency resolution.
The method may further comprise encoding the selected at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter.
According to a third aspect there is provided an apparatus for spatial audio signal encoding, the apparatus comprising: means for determining, for two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction; and means for processing the at least one resolution spatial audio parameter to be output and/or stored.
The means for determining, for two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction may comprise at least one of: means for determining at least one spatial audio parameter comprising a direction parameter with an elevation and an azimuth component; means for determining at least one energy ratio parameter associated with the direction parameter; means for determining at least one diffuseness parameter associated with the direction parameter; and means for determining at least one coherence parameter associated with the direction parameter.
The means for determining, for the two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction may comprise: means for determining at least one first time-frequency resolution from a range of frequency resolutions at least one first resolution spatial audio parameter for providing spatial audio reproduction; means for determining at least one second time-frequency resolution from a range of frequency resolutions at least one second resolution spatial audio parameter for providing spatial audio reproduction; and means for selecting one of the first resolution spatial audio parameter and second resolution spatial audio parameter as the at least one time-frequency resolution from a range of frequency resolutions.
The apparatus may further comprise: means for windowing the two or more audio signals to generate at least one frame of the two or more audio signals; means for filtering the frame of the two or more audio signals to generate at least two sub-band representations of the at least one frame of the two or more audio signals, wherein the means for determining, for two or more audio signals, and for a first time-frequency resolution at least one first resolution spatial audio parameter for providing spatial audio reproduction may further comprise means for analysing the at least two sub-band representations of the at least one frame of the two or more audio signals using the first time-frequency resolution, and the means for determining, for the two or more audio signals, and for a second time-frequency resolution at least one second resolution spatial audio parameter for providing spatial audio reproduction may further comprise means for analysing the at least two sub-band representations of the at least one frame of the two or more audio signals using the second time-frequency resolution.
The means for selecting at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may further comprise means for selecting the at least one of the at least one first resolution spatial audio parameter and the at least one second resolution spatial audio parameter to be output and/or stored on a frame-by-frame basis.
The apparatus may further comprise means for generating a data structure for the at least one resolution spatial audio parameter, wherein the data structure may comprise at least one of: a switch field configured to indicate a selection of the at least one resolution from the range of frequency resolutions; a time-frequency description field configured to indicate the ordering of the selection of the at least one resolution from the range of frequency resolutions; and at least one parameter field configured to provide the at least one resolution spatial audio parameter.
The means for selecting at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may further comprise means for generating an embedded data structure for the selected one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter, wherein the embedded data structure may comprise at least one embedded field configured to indicate the option selection of the at least one of the at least one first resolution spatial audio parameter and the at least one second resolution spatial audio parameter to be output and/or stored.
The apparatus may further comprise means for determining at least one suitability measure.
The means for determining at least one suitability measure may further comprise means for determining the at least one suitability measure based on at least one of: analysis of the two or more audio signals; analysis of a downmix based on the two or more audio signals; analysis of the at least one resolution spatial audio parameter for providing spatial audio reproduction more audio signals for the at least one time-frequency resolution; and visual analysis of an environment generating the at least two or more audio signals.
The means for selecting at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may further comprise means for selecting the at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter based on the at least one suitability measure.
The apparatus may comprise means for determining the at least one time-frequency resolution based on the at least one suitability measure.
The apparatus may further comprise means for downmixing the at least two audio signals based on a selected one of the first time-frequency resolution and the second time-frequency resolution.
The apparatus may further comprise means for encoding the at least two audio signals based on a selected one of the first time-frequency resolution and the second time-frequency resolution.
The apparatus may further comprise means for encoding the selected at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter.
An apparatus comprising means for performing the actions of the method as described above.
An apparatus configured to perform the actions of the method as described above.
A computer program comprising program instructions for causing a computer to perform the method as described above.
A computer program product stored on a medium may cause an apparatus to perform the method as described herein.
An electronic device may comprise apparatus as described herein.
A chipset may comprise apparatus as described herein.
Embodiments of the present application aim to address problems associated with the state of the art.
Summary of the Figures
For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:
Figure 1 shows schematically a system of apparatus suitable for implementing some embodiments;
Figure 2 shows schematically the multi-resolution analysis processor as shown in Figure 1 in further detail according to some embodiments;
Figures 3a and 3b show schematically example metadata structure according to some embodiments;
Figure 4 shows schematically example multi-resolution modes according to some embodiments;
Figure 5 shows schematically an example embedded multi-resolution mode format according to some embodiments;
Figure 6 shows schematically a further example embedded multi-resolution mode format according to some embodiments;
Figure 7 shows a flow diagram of the operation of the system as shown in Figure 1 according to some embodiments;
Figure 8 shows a flow diagram of the operation of the multi-resolution analysis processor as shown in Figure 2 according to some embodiments;
Figure 9 shows a flow diagram of the operation of analysing the audio signal for one of the multi-resolution analysis operations as shown in Figure 8 according to some embodiments; and
Figure 10 shows schematically an example device suitable for implementing the apparatus shown.
Embodiments of the Application
The following describes in further detail suitable apparatus and possible mechanisms for the provision of effective multi-resolution spatial analysis derived metadata parameters for immersive input format audio signals. In the following discussions an immersive system is one in which the encoding and decoding attempt to retain the characteristics of the audio scene (as captured by microphones or synthesized otherwise) and aim to produce an immersive effect when presented to the listener. As such the following examples are discussed with respect to a multi-channel microphone captured audio signal(s) implementation. However as discussed above the input format may be any suitable input format, such as multi-channel loudspeaker, ambisonic (FOA/HOA) etc. It is understood that in some embodiments the channel location is based on a location of the microphone or is a virtual location or direction. Furthermore the output of the example system is a binaural headphone presentation. In other words a rendered audio signal output is passed to the listener as a pair of audio signals for a suitable headphone/earphone/headset signal. However the output may be any suitable rendering, for example a multi-channel loudspeaker arrangement. Furthermore the multi-channel loudspeaker signals may be generalised to be two or more playback audio signals.
Current speech and audio codecs and in particular immersive audio codecs support a multitude of operating points ranging from a low bit rate operation to transparency. An example of such a codec is the 3GPP IVAS codec for which the standardization process has begun in 3GPP TSG-SA4 in October 2017. The completion of the standard is currently expected by end of 2019. The IVAS codec is an extension of the 3GPP EVS codec and intended for new immersive voice and audio services over 4G/5G. Such immersive services include, e.g., stereo / binaural telephony, multichannel teleconferencing and immersive voice and audio for virtual reality (VR). This multi-purpose audio codec is expected to handle the encoding, decoding and rendering of speech, music and generic audio. It is expected to support channel-based audio and scene-based audio inputs including spatial information about the sound field and sound sources. It is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions. Naturally high compression ratio is expected for all scenarios with sufficient bit rate scalability for channel and quality adaptation.
As part of the IVAS encoder and supported input audio formats one of interest is a parametric immersive audio format.
As discussed previously spatial metadata parameters such as direction and direct-to-total energy ratio (or diffuseness-ratio, absolute energies, or any suitable expression indicating the directionality/non-directionality of the sound at the given time-frequency interval) parameters in frequency bands are particularly suitable for expressing the perceptual properties of natural and synthetic sound fields, that is, natural sound fields such as microphone-captured sound fields and synthetic sound scenes such as 5.1 loudspeaker mixes. The spatial metadata parameters such as direction(s), energy ratio(s), diffuseness and coherence can be used to express the features of the sound field accurately.
One concept which is being investigated is the use of multi-resolution parametric immersive codecs. These are codecs which have more than a single time-frequency resolution. For example, in some such embodiments it is possible to have analysis and processing with a high frequency resolution, or one with a high temporal resolution, or we can have systems combining these such as via a switched system.
These multi-resolution formats may for example be selected from or supported by internal processors and quantizers of an audio codec based on the input formats of the audio signals. For example, the codec may treat a parametric immersive audio format separately or it may pass it (or at least the waveform part) through a similar processing path as it uses for other waveform-based formats. These other formats may include at least one of ambisonics (FOA/HOA), multi-channel (e.g., 2.0, 4.0, 5.1, 7.1, 7.1+4H, 22.2, and so on), and object-based audio. In addition to or instead of traditional object-based audio (see, e.g., MPEG-H standard), the audio codec may support independent streams with directional metadata and/or individual streams that may have a dependency metadata in addition to a directional metadata. Audio formats may also be combined. For example, we can have ‘HOA + audio objects’ or ‘parametric immersive audio + individual streams’ or any other combination that makes sense from the capture, content creation, or rendering point of view.
At least some of the formats may be immersive or spatially analysed inside the audio codec. This analysis may, as discussed above, determine parameters such as directions of sound sources (expressed, e.g., as a direction on a sphere or alternatively an azimuth and elevation parameter per time-frequency tile). This processing may then be followed by an immersive downmix (in other words a downmix of the input audio signals suitable for encoding and producing a suitable immersive effect when later decoded and rendered to the listener based on the determined metadata) and metadata extraction. In other words, a parametric representation for enabling efficient encoding and transmission of the immersive scene may be created inside the encoder prior to waveform coding. In some embodiments, the processing inside the codec substantially corresponds to a processing outside the codec.
The concept is thus in some embodiments to support high frequency resolution and high time resolution approaches for immersive audio coding. In other words, such embodiments utilize resolution switching also inside the audio codec.
Depending on coding model and audio signal content, different perceptual resolution can be justified in different use cases. High temporal resolution allows better voice quality when there are strong simultaneous sound sources from different directions (e.g. overlapping talkers), or the signal spectrum is noisy and overall the ambience is ambivalent. Higher frequency resolution on the other hand allows better quality for continuous signals (e.g. tonal musical instruments) and cases where the sound source directions remain stable for longer periods of time (e.g. single speaker talking).
The concept may thus be characterized by apparatus and methods which implement an immersive metadata format switching functionality that allows for changing time/frequency (T/F) resolution of the parameters on a frame-by-frame basis and use of different strategies in different subband/subframe ranges.
This can be achieved, as described in further detail herein, for example by using an embedded structure format as described hereafter. In some embodiments, even concurrent information delivery with at least two different simultaneous time/frequency resolutions can be supported.
This can be useful, for example in encoders that utilize different encoding strategies and metadata compressions at different bit rates. Also, when implemented, such embodiments may have advantages in that the immersive metadata of a parametric immersive audio format is defined in a way that it allows for different immersive capture analysis and processing approaches related to frequency and time resolution.
With respect to Figure 1 an example apparatus and system for implementing embodiments of the application are shown. The system 100 is shown with an ‘analysis’ part 121 and a ‘synthesis’ part 131. The ‘analysis’ part 121 is the part from receiving the multi-channel loudspeaker signals up to an encoding of the metadata and downmix signal and the ‘synthesis’ part 131 is the part from a decoding of the encoded metadata and downmix signal to the presentation of the re-generated signal (for example in multi-channel loudspeaker form).
The input to the system 100 and the ‘analysis’ part 121 is the multi-channel signals 102. In the following examples a microphone channel signal input is described, however any suitable input (or synthetic multi-channel) format may be implemented in other embodiments.
The multi-channel signals are passed to a downmixer 103 and to an analysis processor 105.
In some embodiments the downmixer 103 is configured to receive the multi-channel signals and downmix the signals to a determined number of channels and output the downmix signals 104. For example the downmixer 103 may be configured to generate a 2 audio channel downmix of the multi-channel signals. The determined number of channels may be any suitable number of channels. In some embodiments the downmixer 103 is optional and the multi-channel signals are passed unprocessed to an encoder 107 in the same manner as the downmix signals are in this example. In some embodiments the downmixer 103 is configured to generate downmix signals 104 on a frame by frame basis. As such in some embodiments the downmixer 103 is configured to receive windowed and filtered audio signals provided by the multi-resolution analysis processor 105 rather than directly via the input.
In some embodiments the multi-resolution analysis processor 105 is also configured to receive the multi-channel signals and analyse the signals to produce metadata 106 associated with the multi-channel signals and thus associated with the downmix signals 104. The analysis processor 105 may be configured to generate the metadata which may comprise, for each time-frequency analysis interval, a direction parameter 108, an energy ratio parameter 110, a coherence parameter 112, and a diffuseness parameter 114. The direction, energy ratio and diffuseness parameters may in some embodiments be considered to be spatial audio parameters. In other words the spatial audio parameters comprise parameters which aim to characterize the sound-field created by the multi-channel signals (or two or more playback audio signals in general). The coherence parameters may be considered to be signal relationship audio parameters which aim to characterize the relationship between the multi-channel signals.
In some embodiments the multi-resolution analysis processor is configured to generate multiple-resolution analysis of the audio signals. For example in some embodiments the multi-resolution analysis processor generates a first time-frequency resolution metadata parameter set and a second time-frequency resolution metadata parameter set. In some embodiments the multi-resolution analysis processor is configured to select one of the generated sets and pass this to the metadata encoder.
In some embodiments the parameters generated may differ from frequency band to frequency band. Thus for example in band X all of the parameters are generated and transmitted, whereas in band Y only one of the parameters is generated and transmitted, and furthermore in band Z no parameters are generated or transmitted. A practical example of this may be that for some frequency bands such as the highest band some of the parameters are not required for perceptual reasons. The downmix signals 104 and the metadata 106 may be passed to an encoder 107.
The encoder 107 may comprise a core coder 109 which is configured to receive the downmix (or otherwise) signals 104 and generate a suitable encoding of these audio signals. The encoder 107 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs. The encoding may be implemented using any suitable scheme. The encoder 107 may furthermore comprise a metadata encoder or quantizer 109 which is configured to receive the metadata and output an encoded or compressed form of the information.
The core coder 109 can be implemented using various tools. For example, the audio coding tools of an IVAS codec (such as EVS tools) can be used. If the immersive downmix signal is a mono signal, a single-channel element (SCE) encoding can be utilized. This can be, in some embodiments, an encoding corresponding to the EVS standard. If the immersive downmix signal is a stereo signal (linear or binaural), a channel-pair element (CPE) encoding can be utilized. For example, dedicated stereo modes in IVAS can be utilized. If the immersive downmix signal is beyond a stereo representation, it is possible to use, e.g., various combinations of SCE and CPE encodings, or alternatively or in addition, a multichannel encoding.
In some embodiments the encoder 107 may further interleave, multiplex to a single data stream or embed the metadata within encoded downmix signals before transmission or storage, shown in Figure 1 by the dashed line. The multiplexing may be implemented using any suitable scheme.
In the decoder side, the received or retrieved data (stream) may be received by a decoder/demultiplexer 133. The decoder/demultiplexer 133 may demultiplex the encoded streams and pass the audio encoded stream to a downmix extractor 135 which is configured to decode the audio signals to obtain the downmix signals. Similarly the decoder/demultiplexer 133 may comprise a metadata extractor 137 which is configured to receive the encoded metadata and generate metadata. The decoder/demultiplexer 133 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs.
The decoded metadata and downmix audio signals may be passed to a synthesis processor 139.
The ‘synthesis’ part 131 of the system 100 further shows a synthesis processor 139 configured to receive the downmix and the metadata and re-create, in any suitable format, a synthesized spatial audio in the form of multi-channel signals 110 (these may be multichannel loudspeaker format or in some embodiments any suitable output format such as binaural or Ambisonics signals, depending on the use case) based on the downmix signals and the metadata.
With respect to Figure 7 an example flow diagram of the overview shown in Figure 1 is shown.
First the system (analysis part) is configured to receive multi-channel audio signals as shown in Figure 7 by step 701.
The system (analysis part) is configured to analyse signals over multiple resolutions to generate metadata such as direction parameters, energy ratio parameters, diffuseness parameters and coherence parameters. Then one of the multiple resolutions may be selected to be output. The generation of multiple resolution metadata and the selection of one of them is shown in Figure 7 by step 703.
In some embodiments the signal analysis which generates the metadata is performed based on a control signal or information or a decision from a controller or other part or function. Thus for example in some embodiments the control signal may be configured to control the analyser to perform only one resolution analysis, which may differ from frame to frame (in other words the analysis resolution is selected pre-analysis rather than post-analysis).
Furthermore in some embodiments the control signal for the analyser is determined based on analysis of the audio signal characteristics, for example analysis of the input audio signals, and based on this only one resolution parameter set per frame is created.
In some embodiments the core codec that encodes the downmix may be configured to switch between short and long windows and this can be used as an indication to which resolution is used.
In some embodiments the resolution is determined from the audio characteristics of the core coded downmix signal.
In the latter two cases the resolution does not need to be signalled in the metadata because it can be determined from the core coded downmix signal.
Then the system (analysis part) is configured to generate a downmix of the multi-channel signals based on the selected resolution as shown in Figure 7 by step 705.
The system is then configured to encode for storage/transmission the downmix signal and metadata as shown in Figure 7 by step 707.
After this the system may store/transmit the encoded downmix and metadata as shown in Figure 7 by step 709.
The system may retrieve/receive the encoded downmix and metadata as shown in Figure 7 by step 711.
Then the system is configured to extract the downmix and metadata from encoded downmix and metadata parameters, for example demultiplex and decode the encoded downmix and metadata parameters, as shown in Figure 7 by step 713.
The system (synthesis part) is configured to synthesize an output multi-channel audio signal based on extracted downmix of multi-channel audio signals and metadata as shown in Figure 7 by step 715.
With respect to Figure 2 an example multi-resolution analysis processor 105 (as shown in Figure 1) according to some embodiments is described in further detail. The multi-resolution analysis processor 105 in some embodiments comprises a windower 201.
The windower 201 is configured to receive the input audio signals and generate a series of analysis periods or intervals of audio signal sample lengths. These can be passed to a filter bank 203. The windower may thus generate a series of frames from which multi-resolution sub-frames may be extracted.
The analysis processor 105 furthermore may comprise a filter bank 203. The filter bank 203 in some embodiments is configured to perform an initial time-frequency domain transform of the windowed (multi-channel) audio signals 102, applying a suitable time to frequency domain transform such as a Short Time Fourier Transform (STFT) in order to convert the input time domain signals into suitable time-frequency signals. These time-frequency signals may then be filtered according to any suitable band or sub-band configuration and passed to a series of multi-resolution immersive signal analyser parts.
Thus for example the time-frequency signals may be represented in the time-frequency domain representation by
$S_i(b,n)$,
where $b$ is the frequency bin index, $n$ is the frame index and $i$ is the channel index. In another expression, $n$ can be considered as a time index with a lower sampling rate than that of the original time-domain signals. These frequency bins can be grouped into sub-bands that group one or more of the bins into a band index $k = 0, \ldots, K-1$. Each sub-band $k$ has a lowest bin $b_{k,\mathrm{low}}$ and a highest bin $b_{k,\mathrm{high}}$, and the sub-band contains all bins from $b_{k,\mathrm{low}}$ to $b_{k,\mathrm{high}}$. The widths of the sub-bands can approximate any suitable distribution, for example the Equivalent rectangular bandwidth (ERB) scale or the Bark scale. In some examples, particularly for the internal processing within an audio codec, the sub-band size may be chosen directly based on the filter bank used in the audio codec. Such a filter bank may thus result in the use of a sub-band bandwidth that has, e.g., a minimum size of 400 Hz.
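The grouping of bins into sub-bands can be illustrated with a minimal Python sketch. The helper names `make_subbands` and `band_energy`, the uniform band widths and the toy 240-bin spectrum are assumptions made purely for illustration; an ERB or Bark layout would only change how the band edges are chosen.

```python
def make_subbands(num_bins, num_bands):
    """Split bins 0..num_bins-1 into num_bands contiguous sub-bands.

    Returns one inclusive (b_low, b_high) pair per band index k.
    Uniform widths are an illustrative assumption, not a requirement.
    """
    if not 0 < num_bands <= num_bins:
        raise ValueError("need 0 < num_bands <= num_bins")
    # Integer edges guarantee every band holds at least one bin.
    edges = [(k * num_bins) // num_bands for k in range(num_bands + 1)]
    return [(edges[k], edges[k + 1] - 1) for k in range(num_bands)]


def band_energy(spectrum, band):
    """Sum |S_i(b, n)|^2 over all bins b_low..b_high for one channel and frame."""
    b_low, b_high = band
    return sum(abs(s) ** 2 for s in spectrum[b_low:b_high + 1])


bands = make_subbands(num_bins=240, num_bands=24)
frame = [complex(1.0, 0.0)] * 240   # flat toy spectrum for one channel
energies = [band_energy(frame, band) for band in bands]
```

With these toy values every sub-band spans ten bins, so each band energy is the sum of ten unit-magnitude bins.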
Although the STFT is used in the filter bank 203 any suitable implementation may be used. For example a Discrete Fourier transform (DFT) or Quadrature Mirror filter bank (QMF). Furthermore in some embodiments the filter bank 203 sub-band widths can be selected based on perceptual properties of human hearing.
In some embodiments the multi-resolution analysis processor 105 comprises a series of different resolution immersive signal analysers. These are represented in Figure 2 by a 1st immersive signal analyser 205₁ and a 2nd immersive signal analyser 205₂. The immersive signal analysers may be configured to determine for a defined time and frequency (T/F) resolution a series of ‘immersive’ parameters for describing the audio signals. For example in some embodiments the immersive signal analyser comprises a direction analyser configured to receive the time-frequency signals and based on these signals estimate direction parameters on a frequency band-by-band (or groups of bands) basis. The direction parameters may be determined based on any audio based ‘direction’ determination.
For example in some embodiments the direction analyser may be configured to estimate the direction with two or more signal inputs. This represents the simplest configuration to estimate a ‘direction’; more complex processing may be performed with even more signals.
The direction analyser may thus be configured to provide an azimuth and elevation for each frequency band and temporal resolution, denoted as azimuth $\phi(k,n)$ and elevation $\theta(k,n)$. The direction parameter 108 may also be used to perform further analysis of the signal.
For example in some embodiments further to the direction parameter the analyser is configured to determine an energy ratio parameter. The energy ratio may be considered to be a determination of the energy of the audio signal which can be considered to arrive from a direction. The direct-to-total energy ratio r(k,n) can be estimated, e.g., using a stability measure of the directional estimate, or using any correlation measure, or any other suitable method to obtain a ratio parameter.
Furthermore the signal analyser 205 may be configured to produce further signal parameters such as the two parameters: coherence and diffuseness, both analysed in time-frequency domain.
As such the immersive signal analysis can consist of at least two parts and differ at least in frequency and time resolution. However, they may further differ in terms of parameters being analysed.
For example in some embodiments the 1st immersive signal analyser 205₁ may be configured to generate analysis using a 20-ms update rate (in other words with a temporal resolution of 20ms) and with 24 sub-bands (for example where the bands are all the same size each band will have a frequency resolution of X/24 where X is the frequency range of the analysed audio signals, though it is understood that the bands may differ in size in some embodiments), and the 2nd immersive signal analyser 205₂ may be configured to generate analysis using a 5-ms update rate with 6 sub-bands. It is to be appreciated that these are examples only and other example resolutions are possible for both high frequency and high time resolution approaches. For example, significantly higher time resolution such as a 1.25-ms update rate may be utilized in some implementations.
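The trade-off between the two example resolutions can be made concrete with a short Python sketch. The function name `describe_resolution`, the 20 ms frame length default and the 24 kHz analysed range X are assumptions chosen for illustration, and equal-width bands are assumed as in the X/24 remark above.

```python
from fractions import Fraction


def describe_resolution(update_ms, num_bands, frame_ms=20, range_hz=24000):
    """Return the update count per frame and the width of one equal-size band.

    frame_ms and range_hz are illustrative defaults, not values fixed by
    the description.
    """
    # Exact rational arithmetic avoids float issues with e.g. 1.25 ms.
    updates = Fraction(str(frame_ms)) / Fraction(str(update_ms))
    if updates.denominator != 1:
        raise ValueError("frame length must be a whole number of updates")
    return {
        "updates_per_frame": int(updates),
        "band_width_hz": range_hz / num_bands,
    }


high_frequency = describe_resolution(update_ms=20, num_bands=24)
high_time = describe_resolution(update_ms=5, num_bands=6)
```

Under these assumptions the 20ms/24 sub-band analyser updates once per frame with 1 kHz bands, whereas the 5ms/6 sub-band analyser updates four times per frame with 4 kHz bands.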
Although in the example described herein each of the multi-resolution parameters are generated in parallel it would be understood that in some embodiments the analysis operations are performed in series or in a hybrid of series and parallel.
These parameters from each of the analyser parts may be passed to a switch 207.
In some embodiments the multi-resolution analysis processor 105 comprises a switch 207. The switch 207 is configured to receive the analysis parameters from each of the immersive signal analysers 205 and output one (or more) of these to the immersive metadata extractor 209.
In some embodiments the switch 207 is configured to operate based on at least one suitability measure. In some embodiments this decision is performed differently for a subset of sub-frames or sub-bands. Thus at least one switching decision may be performed for each frame (or windowed period).
The at least one suitability measure can be any suitable measure. In some embodiments the measure or analysis may be based on at least one of: the input audio signals, the parameters generated by the analysis of the audio signals, and analysis of the downmix audio signals. For example in some embodiments the determination may be made based on detecting that the input audio signals comprise a voice (for example by using a suitable voice activity determination) and that furthermore there is more than one voice source (for example based on a detection of at least two talkers in a scene). In such a situation where there are two simultaneous talkers, a faster time domain update cycle can provide better overall performance than a more accurate spectral resolution.
However, it is noted that other and/or more than one switching suitability measure may be used in various implementations of the invention. For example, if an analysis of the audio scene determines that the audio scene has a relatively flat spectrum and thus noisy ambience, shorter windows may produce better results. Also, where it is determined that there are impulsive energy fluctuations between sub-frames and thus the audio signals comprise transients, a faster update rate can be selected. On the other hand, where it is determined that there are stable tonal sounds (like classical musical instruments) a longer sub-frame window size and thus higher spectral accuracy may be selected.
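A per-frame switching decision of the kind described above could be sketched in Python as follows. The feature names, the `select_resolution` function and the flatness threshold of 0.5 are hypothetical; the description leaves the suitability measure open, so this is one possible rule set rather than the method of any embodiment.

```python
from dataclasses import dataclass

HIGH_FREQUENCY = "20ms/24sb"   # longer window, finer spectral resolution
HIGH_TIME = "5ms/6sb"          # faster update rate, coarser spectrum


@dataclass
class FrameFeatures:
    active_talkers: int        # e.g. from voice activity / talker detection
    spectral_flatness: float   # 0.0 = tonal, 1.0 = noise-like
    has_transient: bool        # impulsive energy change between sub-frames


def select_resolution(features, flatness_threshold=0.5):
    """Pick one resolution for the current frame (threshold is an assumption)."""
    if features.active_talkers >= 2:
        return HIGH_TIME       # simultaneous talkers favour fast updates
    if features.has_transient:
        return HIGH_TIME       # transients favour shorter windows
    if features.spectral_flatness > flatness_threshold:
        return HIGH_TIME       # flat, noisy ambience favours shorter windows
    return HIGH_FREQUENCY      # stable tonal content favours longer windows


choice = select_resolution(FrameFeatures(2, 0.1, False))
```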
In some embodiments the control of the post-analysis selection of the resolution from multiple resolutions or as described earlier a ‘pre-analysis’ selection of the resolution may relate to analysis of non-audio scene factors. For example in some embodiments the selection is based on visual tracking and segmentation of a scene, where the number of and types of sound sources are determined and used to, at least in part, decide the analysis resolution selection.
We note that a similar switching may be performed, e.g., in a capture device’s microphone processing resulting in the creation of the switched metadata.
The output of the switch 207 may be passed to the immersive metadata extractor 209.
In some embodiments the multi-resolution analysis processor 105 comprises an immersive metadata extractor 209. The immersive metadata extractor 209 may be configured to receive the output of the switch 207 and pass it to a metadata compressor/encoder 111. Furthermore the immersive metadata extractor 209 may be configured to output the resolution of the extracted metadata to the core coder 109 and downmixer 103 such that the time/frequency resolution of the metadata may be matched in the downmixer 103 and/or core coder 109.
In some embodiments the multi-resolution analysis processor 105 comprises a metadata compressor/encoder 111. The metadata compressor/encoder 111 is configured to receive the extracted metadata and generate a suitable data format to output to enable the extraction of the metadata at a suitable decoder/receiver. This for example requires the ability to generate suitable control information which enables the receiver/decoder to determine the information content in the encoded format.
The immersive audio metadata may be compressed depending on the bit rate restriction of the current coding mode. Thus in some embodiments some immersive fidelity will typically be lost, as is usually the case with lossy compression. However, the compression can utilize various perceptual techniques to minimize the impact.
A first example of a suitable data format for transmitting/encoding the metadata may be shown with respect to Figure 3a. Figure 3a for example shows a table structure where a switching is supported between the at least two modes with different time/frequency resolutions. The table shows two columns, a metadata field 301 and a metadata value field(s) 303.
Furthermore the data format shown in Figure 3a shows a switch field 302 with associated value 304. For two modes this switch field may be supported by a single bit (‘Switch’ field). The two modes may be static for a given transmission (e.g., streaming or communications call), i.e., they may be known (e.g., only two resolutions are allowed by a specific implementation) or the resolutions may be communicated out of band.
The data format shown in Figure 3a furthermore shows the various parameters and their values grouped by parameter. Thus for parameter 1 the 1st to nth sub-band values are shown in the first part 311, for parameter 2 the 1st to nth sub-band values are shown in the second part 313 and for parameter Y the 1st to nth sub-band values are shown in the Yth part 315.
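As a rough illustration of the Figure 3a layout, the following Python sketch lists the fields of one frame in order: the ‘Switch’ field first, then the sub-band values grouped parameter by parameter. The function name `build_fixed_frame`, the row labels and the toy values are assumptions; no bit widths or quantization are modelled.

```python
def build_fixed_frame(switch, parameters):
    """Return (metadata field, value) rows in the grouped-by-parameter order.

    `parameters` maps a parameter name to its list of sub-band values;
    dict insertion order gives the parameter order.
    """
    if switch not in (0, 1):
        raise ValueError("a single-bit Switch field supports two modes only")
    rows = [("Switch", switch)]
    for name, values in parameters.items():
        for index, value in enumerate(values, start=1):
            rows.append((f"{name} / sub-band {index}", value))
    return rows


# Toy frame with two parameters and three sub-bands each (values invented).
parameters = {
    "Parameter 1": [30.0, 45.0, 60.0],
    "Parameter 2": [0.9, 0.5, 0.2],
}
frame = build_fixed_frame(switch=0, parameters=parameters)
```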
The metadata structure and size can in some embodiments remain constant, for example when we utilize the time/frequency (T/F) modes of the example analyser shown in Figure 2, which implemented a 20ms/24 sub-band and a 5ms/6 sub-band resolution.
Thus in this example the first mode may provide 1x24 sub-bands and thus 1x24 sub-band values per parameter. The second mode provides 4x6 sub-bands and thus 4x6 sub-band values per parameter. This gives a constant number of values (N = 24) and the second mode can thus name the parameter values continuously and utilize the same structure. It is to be understood that in other embodiments, there may however be a different number of sub-bands used. Thus, e.g., the number of parameter fields may differ in some embodiments.
A further example of a suitable data format for transmitting/encoding the metadata may be shown with respect to Figure 3b. Figure 3b shows a table structure similar to that shown in Figure 3a but with an additional field, a T/F description field 305 and associated value(s) 306. This field may describe the T/F resolution in the current frame and/or the T/F resolutions overall supported by the encoder/decoder or used in the current transmission. This may take for example the values of the example, where the switching bit or ‘Switch’ field value is used to index the ‘T/F description’ field. The ‘T/F description’ field in Figure 3b may furthermore include information on how the parameter values are ordered. Thus, the structure as shown in Figure 3b may be adapted. For example, for the 20ms/24sb mode and for the 5ms/6sb mode, the orderings of ‘Parameter 1’ may be as follows: sub-band 0, 1, 2, ..., 23 and 0, 1, 2, 3, 4, 5 for sub-frame 0, followed similarly by sub-frames 1, 2, 3, respectively, giving the same internal order as for the format shown in Figure 3a.
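The ordering described for ‘Parameter 1’ can be sketched in Python as below. The `TF_DESCRIPTION` table and its key names are hypothetical stand-ins for the ‘T/F description’ field, indexed by the ‘Switch’ value; only the two example modes are included.

```python
# Hypothetical contents of a 'T/F description' field, indexed by 'Switch'.
TF_DESCRIPTION = {
    0: {"update_ms": 20, "sub_frames": 1, "sub_bands": 24},  # 20ms/24sb
    1: {"update_ms": 5, "sub_frames": 4, "sub_bands": 6},    # 5ms/6sb
}


def value_order(switch):
    """List (sub_frame, sub_band) pairs in the order the values are written.

    All sub-bands of sub-frame 0 come first, then those of sub-frame 1,
    and so on.
    """
    mode = TF_DESCRIPTION[switch]
    return [
        (sub_frame, sub_band)
        for sub_frame in range(mode["sub_frames"])
        for sub_band in range(mode["sub_bands"])
    ]


order_20ms = value_order(0)
order_5ms = value_order(1)
```

Both orderings contain 24 entries, which is why the same structure can carry either mode in this example.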
In some embodiments this additional field may allow the decoder to optimize the memory allocation or other aspects for the synthesis. For example, the receiver may receive information about the possible configurations for analysis resolution within a specific transmission from this information. This information may allow for embodiments with faster adaptation.
In some embodiments the T/F description field and associated value may be used to convey information on a switching in a subset of the sub-frames or sub-bands. In other words, the ‘Switch’ field may here have more values that correspond to mixtures of at least two switched modes within the current frame. For example a third analysis mode may be a 10ms, 12 sub-band mode (10ms/12sb), in other words each 20ms window is divided into two 10ms sub-frames and each sub-frame has 12 sub-bands.
Furthermore it is to be understood that the structure of the data format information may also be provided in different fields, as the metadata structures described are only examples.
With respect to Figure 4 an example arrangement 401 of different T/F modes within a frame is shown. The bottom line 404 summarizes the time resolution (T) in terms of ms, and the number of frequency sub-bands (F) for each T is indicated 402 over each column. This example as shown in Figure 4 shows six different modes of operation: a first mode utilizes 1 update cycle of 20 ms 403, the second mode utilizes 2 update cycles of 10 ms each 405₁ and 405₂, the third mode utilizes 4 update cycles of 5 ms 407₁ to 407₄, and the remaining three modes utilize 3 update cycles where one of the cycles is 10 ms and the other two 5 ms each 409₁ to 409₃, 411₁ to 411₃, and 413₁ to 413₃.
Thus the first mode 403 has a single column of 24 sub-bands (0...23) representing one parameter value per sub-band for each of the 24 sub-bands for the 20ms frame length.
The second mode has two columns of 12 sub-bands (0..11, 0..11) representing one parameter value per sub-band for each of the 12 sub-bands for each of the 10ms sub-frames.
The third mode has four columns of 6 sub-bands (0..5, 0..5, 0..5, 0..5) representing one parameter value per sub-band for each of the 6 sub-bands for each of the 5ms sub-frames.
The fourth mode has three columns, one of 12 sub-bands and two of 6 sub-bands (0..11, 0..5, 0..5) representing one parameter value per sub-band for each of the 12 sub-bands in the 10ms sub-frame and the 6 sub-bands for each of the two 5ms sub-frames.
The fifth mode has three columns, one of 12 sub-bands and two of 6 sub-bands (0..5, 0..5, 0..11) representing one parameter value per sub-band for each of the 6 sub-bands for each of the two 5ms sub-frames and the 12 sub-bands in the 10ms sub-frame.
The sixth mode has three columns, one of 12 sub-bands and two of 6 sub-bands (0..5, 0..11, 0..5) representing one parameter value per sub-band for each of the 6 sub-bands in the first 5ms sub-frame, the 12 sub-bands in the 10ms sub-frame and the 6 sub-bands in the last 5ms sub-frame.
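The six modes of Figure 4 can be written down as a small table and checked for consistency. In this Python sketch the `MODES` dictionary and the `check_mode` helper are illustrative names; each column is an (update length in ms, number of sub-bands) pair taken from the mode descriptions above.

```python
FRAME_MS = 20

# Each mode is a list of columns: (sub-frame length in ms, sub-bands).
MODES = {
    1: [(20, 24)],
    2: [(10, 12), (10, 12)],
    3: [(5, 6), (5, 6), (5, 6), (5, 6)],
    4: [(10, 12), (5, 6), (5, 6)],
    5: [(5, 6), (5, 6), (10, 12)],
    6: [(5, 6), (10, 12), (5, 6)],
}


def check_mode(columns):
    """Return (total duration in ms, parameter values per frame) for a mode."""
    duration = sum(ms for ms, _ in columns)
    values = sum(sub_bands for _, sub_bands in columns)
    return duration, values


summary = {mode: check_mode(columns) for mode, columns in MODES.items()}
```

Every mode covers the full 20 ms frame and carries 24 values per parameter, so the amount of metadata per parameter stays constant across the six arrangements.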
As well as defining a structure related to the format of the modes and furthermore the structure of the metadata within each of the modes, in some embodiments the metadata compressor/encoder may furthermore define an embedded structure of T/F resolution switching for the immersive metadata. In other words it may define a data structure wherein a different mode may be selected for each of the sub-bands.
An example of the logic of a first example implementation is shown with respect to Figure 5. In this example the metadata structure is embedded (for example where the ‘Embedded bit 0’ is set to 1) or “fixed” (for example where the ‘Embedded bit 0’ is set to 0).
In some embodiments the embedded structure could be always active. In other words the value ‘Embedded bit 0’ would not be used. The “fixed” mode structure may be, e.g., similar to the data structure shown earlier with respect to Figure 3a. The embedded mode, on the other hand, differs from the structures shown previously in that it comprises an embedded switching bit before each sub-band or, in some embodiments, e.g., each group of sub-bands. The positions of the embedded bits are not fixed due to the different sizes of the sub-bands (different T/F resolutions).
Figure 5 shows the order of the parameters in this first example.
Thus the data structure 501 shows an initial row 502 indicating the data structure is an embedded data structure.
Then the data structure 501 on rows 2 to 5 defines the fixed mode operation wherein the mode switch identifies or selects the first mode (mode_0) or the second mode (mode_1 ) and the order of the sub-bands for each of the modes is then defined in the fixed structure 504. It is understood that the rows 4 and 5 of Figure 5 may not be the same size. Although the example shown herein has two modes and thus is indicated by a single bit, in some embodiments there may be more than two modes and which are indicated by a flag value more than one bit in length.
The data structure 501 on rows 8 onwards then defines the embedded mode operation wherein the embedded bits switch or identify between the first mode (mode_0) or the second mode (mode_1) for each bifurcation or splitting. Thus for example rows 8 and 12 show the identification 506 of the first element as being either the first mode first sub-band (mode_0 Subband_1) where the first embedded bit is 0 or the second mode first sub-band (mode_1 Subband_1) where the first embedded bit is 1.
Then based on this selection/identification the other sub-bands are selected based on the embedded bit 2 value. Thus for example rows 10 and 11 follow the selection of the first mode for the first value with the identification 508 of the following elements as being either the first mode second sub-band (mode_0 Subband_2) onwards where the second embedded bit is 0 or the second mode second sub-band (mode_1 Subband_2) onwards where the second embedded bit is 1. Although not shown in Figure 5 there may be further embedded bit values for selecting further sub-bands. For example there may be a further set of embedded bit values, embedded bit 3, which select whether mode_0 or mode_1 Sub-band 3 is selected and so on.
Also rows 14 and 15 follow the selection of the second mode for the first value with the identification 510 of the following elements as being either the first mode second sub-band (mode_0 Subband_2) onwards where the second embedded bit is 0 or the second mode second sub-band (mode_1 Subband_2) onwards where the second embedded bit is 1. Similarly there may also be further embedded bit values for selecting further mode sub-bands.
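The embedded layout described for Figure 5 can be sketched as a writer and reader over a list of bits. Everything named here is an assumption made for illustration: the functions `write_metadata` and `read_metadata`, the payload sizes, the use of exactly one switching bit per sub-band, and the simplification that both modes have the same number of sub-bands. The sketch is not the codec's actual bitstream syntax.

```python
def write_metadata(embedded, payloads, choices=None, fixed_mode=0):
    """Serialise one frame of metadata to a list of bits.

    payloads[mode][band] holds the already-quantised bits of that sub-band
    in that mode; the two modes may use different numbers of bits.
    """
    bits = [1 if embedded else 0]            # 'Embedded bit 0'
    if not embedded:
        bits.append(fixed_mode)              # one mode switch for the whole frame
        for band_bits in payloads[fixed_mode]:
            bits.extend(band_bits)
        return bits
    for band, mode in enumerate(choices):
        bits.append(mode)                    # embedded switching bit for this sub-band
        bits.extend(payloads[mode][band])
    return bits


def read_metadata(bits, sizes):
    """Parse the bit list; sizes[mode][band] is the payload length in bits."""
    embedded, pos = bits[0], 1
    n_bands = len(sizes[0])
    if not embedded:
        mode, pos = bits[pos], pos + 1
    result = []
    for band in range(n_bands):
        if embedded:
            # The position of this switching bit depends on the sizes read so far.
            mode, pos = bits[pos], pos + 1
        n = sizes[mode][band]
        result.append((mode, bits[pos:pos + n]))
        pos += n
    return result


# Hypothetical payloads: mode_0 uses 2 bits per sub-band, mode_1 uses 4 bits.
payloads = [[[0, 1], [1, 1]], [[1, 0, 1, 0], [0, 0, 0, 1]]]
sizes = [[2, 2], [4, 4]]
stream = write_metadata(True, payloads, choices=[1, 0])
decoded = read_metadata(stream, sizes)
```

Because the first sub-band is sent in mode_1 (4 bits), the second switching bit sits at index 6 of `stream`; had mode_0 been chosen it would sit at index 4. This is the sense in which the positions of the embedded bits are not fixed.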
In some embodiments the data structure is defined such that the positions of the embedded bits are known, and the size of the sub-bands is taken into account. This is shown for example with respect to the data structure shown in Figure 6.
Figure 6 thus shows, for example, a data structure 601 with an initial row 602 indicating the data structure is an embedded data structure.
Then the data structure 601 on rows 2 to 5 defines the fixed mode operation wherein the mode switch identifies or selects the first mode (mode_0) or the second mode (mode_1) and the order of the sub-bands for each of the modes is then defined in the fixed structure 604. It is understood that the rows 4 and 5 of Figure 6 may not be the same size.
The data structure 601 on rows 6 onwards then defines the embedded mode operation wherein the embedded bits switch or identify between the first mode (mode_0) or the second mode (mode_1) for each bifurcation or splitting. Thus for example rows 8 and 12 show the identification 606 of the first element as being either the first mode first sub-band (mode_0 Sb_1) where the first embedded bit is 0 or the second mode first sub-band to fourth sub-band (mode_1 Sb_1 to mode_1 Sb_4) where the first embedded bit is 1.
Then based on this selection/identification further sub-bands are selected based on the embedded bit 2 value. Thus for example rows 10 and 11 follow the selection of the first mode for the first embedded bit value with the identification 608 of the following elements as being either the first mode second sub-band (mode_0 Sb_2) where the second embedded bit is 0 or the second mode fifth sub-band (mode_1 Sb_5) onwards where the second embedded bit is 1. Also rows 14 and 15 follow the selection of the second mode for the first value with the identification 610 of the following elements as being either the first mode second sub-band (mode_0 Sb_2) onwards where the second embedded bit is 0 or the second mode fifth sub-band (mode_1 Sb_5) onwards where the second embedded bit is 1.
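The grouping described for Figure 6 can be sketched as a mapping from each coarse (mode_0) sub-band to the run of fine (mode_1) sub-bands that replaces it. The description only states that mode_0 Sb_1 pairs with mode_1 Sb_1 to Sb_4 and that mode_0 Sb_2 pairs with mode_1 Sb_5 onwards; the end of the second run at Sb_8, and the names `GROUPS` and `element_order`, are assumptions for illustration.

```python
# Assumed grouping: (mode_0 sub-band, list of mode_1 sub-bands covering it).
# The second run ending at Sb_8 is a made-up value for this sketch.
GROUPS = [
    ("Sb_1", ["Sb_1", "Sb_2", "Sb_3", "Sb_4"]),
    ("Sb_2", ["Sb_5", "Sb_6", "Sb_7", "Sb_8"]),
]


def element_order(embedded_bits):
    """List the transmitted elements for one embedded bit per group."""
    order = []
    for bit, (coarse, fine) in zip(embedded_bits, GROUPS):
        if bit == 0:
            order.append(f"mode_0 {coarse}")
        else:
            order.extend(f"mode_1 {name}" for name in fine)
    return order


print(element_order([1, 0]))
```

With the embedded bits set to 1 then 0, the first group is sent as four mode_1 sub-bands and the second as a single mode_0 sub-band.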
With respect to Figure 8 a flow diagram summarising the operations of the analysis processor 105 and metadata encoder 111 is shown.
The first operation is one of windowing the (time domain) multichannel audio signals (to generate frames from which the sub-frames can be generated) as shown in Figure 8 by step 801.
This is followed by filtering (for example applying a time domain to frequency domain transform and separating the frequency band components into sub-bands) to generate suitable sub-band signals for analysis, as shown in Figure 8 by step 803.
Then a series of parallel immersive signal analyses at various resolutions is performed, as shown in Figure 8 by steps 805i to 805o.
Then a selection of the resolution signal analysis is made as shown in Figure 8 by step 807.
The final operation is one of outputting the determined parameters and generating a suitable metadata structure such as described herein, as shown in Figure 8 by step 809.
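The flow of Figure 8 can be sketched as a small orchestration function with pluggable stages. The function names, the rectangular non-overlapping window, the identity filter, the two toy analysers and the `spread` suitability measure are all stand-ins assumed for illustration; they are not the analysis described for the actual encoder.

```python
def window(signals, frame_len):
    """Step 801: split each channel into consecutive frames (rectangular window)."""
    n_frames = min(len(ch) for ch in signals) // frame_len
    return [[ch[i * frame_len:(i + 1) * frame_len] for ch in signals]
            for i in range(n_frames)]


def encode(signals, frame_len, filter_bank, analysers, suitability):
    """Run steps 801 to 809 and return one metadata record per frame."""
    records = []
    for frame in window(signals, frame_len):                   # step 801
        sub_bands = filter_bank(frame)                         # step 803
        candidates = {name: analyse(sub_bands)                 # steps 805
                      for name, analyse in analysers.items()}
        chosen = max(candidates,                               # step 807
                     key=lambda name: suitability(candidates[name]))
        records.append({"mode": chosen,                        # step 809
                        "parameters": candidates[chosen]})
    return records


def identity_filter(frame):
    """Toy stand-in for the filter bank: passes the frame through unchanged."""
    return frame


def coarse(frame):
    """Toy analysis: one value per channel for the whole frame."""
    return [sum(abs(s) for s in ch) / len(ch) for ch in frame]


def fine(frame):
    """Toy analysis: two values per channel, one per half frame."""
    out = []
    for ch in frame:
        half = len(ch) // 2
        out.append(sum(abs(s) for s in ch[:half]) / half)
        out.append(sum(abs(s) for s in ch[half:]) / (len(ch) - half))
    return out


def spread(params):
    """Toy suitability measure: favour the analysis whose values vary most."""
    return max(params) - min(params)


signals = [[0, 0, 4, 4, 1, 1, 1, 1], [0, 0, 4, 4, 1, 1, 1, 1]]
records = encode(signals, 4, identity_filter,
                 {"coarse": coarse, "fine": fine}, spread)
print([r["mode"] for r in records])
```

In this toy input the first frame changes level halfway through, so the finer time resolution is selected; the second frame is stationary and the coarser one is kept.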
An example operation performed by each of the immersive signal analysers is shown in Figure 9.
In some embodiments the immersive signal analyser is configured to determine the analysis sub-frame window and sub-bands to be analysed as shown in Figure 9 by step 901.
Having determined the sub-band and sub-frame resolution, in some embodiments the first operation is one of performing a direction analysis to determine direction and energy ratio parameters for each time and frequency resolution, as shown in Figure 9 by step 903.
Having determined the direction and energy ratio parameters, in some embodiments the analyser is configured to determine coherence parameters and diffuseness parameters (and optionally to modify the energy ratios based on the determined coherence parameters) for each time and frequency resolution, as shown in Figure 9 by step 905.
Having generated the parameters for each time and frequency resolution these may then be output as shown in Figure 9 by step 907.
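The per-tile analysis of steps 901, 903 and 907 can be sketched as follows. This passage does not specify the direction estimator, so the sketch substitutes a DirAC-style intensity-vector estimate on real-valued first-order Ambisonic input (W, X, Y, Z), assuming a convention in which a source in unit direction u with signal s is encoded as W = s, X = s·u_x, Y = s·u_y, Z = s·u_z. The function names and the tile dictionary are hypothetical, and the coherence and diffuseness parameters of step 905 are omitted.

```python
import math


def analyse_tile(w, x, y, z):
    """Direction (degrees) and energy ratio for one time-frequency tile."""
    n = len(w)
    # Time-averaged intensity-like vector, pointing towards the source
    # under the encoding convention assumed above.
    ix = sum(a * b for a, b in zip(w, x)) / n
    iy = sum(a * b for a, b in zip(w, y)) / n
    iz = sum(a * b for a, b in zip(w, z)) / n
    energy = 0.5 * sum(a * a + b * b + c * c + d * d
                       for a, b, c, d in zip(w, x, y, z)) / n
    azimuth = math.degrees(math.atan2(iy, ix))
    elevation = math.degrees(math.atan2(iz, math.hypot(ix, iy)))
    norm = math.sqrt(ix * ix + iy * iy + iz * iz)
    ratio = norm / energy if energy > 0 else 0.0
    return azimuth, elevation, ratio


def analyse(tiles):
    """Apply the tile analysis to every (sub_frame, sub_band) of one resolution."""
    return {key: analyse_tile(*channels) for key, channels in tiles.items()}


# A single source at azimuth 90 degrees, elevation 0, in one tile.
s = [1.0, -1.0, 2.0, -2.0]
zeros = [0.0] * len(s)
params = analyse({(0, 0): (s, zeros, s, zeros)})
```

For a single plane-wave source the ratio comes out as 1 under this convention, while uncorrelated content on the X, Y and Z channels drives it towards 0.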
With respect to Figure 10 an example electronic device which may be used as the analysis or synthesis device is shown. The device may be any suitable electronics device or apparatus. For example in some embodiments the device 1400 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.
In some embodiments the device 1400 comprises at least one processor or central processing unit 1407. The processor 1407 can be configured to execute various program codes such as the methods described herein.
In some embodiments the device 1400 comprises a memory 1411. In some embodiments the at least one processor 1407 is coupled to the memory 1411. The memory 1411 can be any suitable storage means. In some embodiments the memory 1411 comprises a program code section for storing program codes implementable upon the processor 1407. Furthermore in some embodiments the memory 1411 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1407 whenever needed via the memory-processor coupling.
In some embodiments the device 1400 comprises a user interface 1405. The user interface 1405 can be coupled in some embodiments to the processor 1407. In some embodiments the processor 1407 can control the operation of the user interface 1405 and receive inputs from the user interface 1405. In some embodiments the user interface 1405 can enable a user to input commands to the device 1400, for example via a keypad. In some embodiments the user interface 1405 can enable the user to obtain information from the device 1400. For example the user interface 1405 may comprise a display configured to display information from the device 1400 to the user. The user interface 1405 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1400 and further displaying information to the user of the device 1400. In some embodiments the user interface 1405 may be the user interface for communicating with the position determiner as described herein.
In some embodiments the device 1400 comprises an input/output port 1409. The input/output port 1409 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1407 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
The transceiver can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver or transceiver means can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).
The transceiver input/output port 1409 may be configured to receive the signals and in some embodiments determine the parameters as described herein by using the processor 1407 executing suitable code. Furthermore the device may generate a suitable downmix signal and parameter output to be transmitted to the synthesis device.
In some embodiments the device 1400 may be employed as at least part of the synthesis device. As such the input/output port 1409 may be configured to receive the downmix signals and in some embodiments the parameters determined at the capture device or processing device as described herein, and generate a suitable audio signal format output by using the processor 1407 executing suitable code. The input/output port 1409 may be coupled to any suitable audio output for example to a multichannel speaker system and/or headphones or similar.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.
Claims
1. An apparatus for spatial audio signal encoding, the apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to:
determine, for two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction; and
process the at least one resolution spatial audio parameter to be output and/or stored.
2. The apparatus as claimed in claim 1, wherein the apparatus caused to determine, for two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction is caused to determine at least one of:
at least one spatial audio parameter comprising a direction parameter with an elevation and an azimuth component;
at least one energy ratio parameter associated with the direction parameter;
at least one diffuseness parameter associated with the direction parameter; and
at least one coherence parameter associated with the direction parameter.
3. The apparatus as claimed in any of claims 1 and 2, wherein the apparatus caused to determine, for the two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction is caused to:
determine at least one first time-frequency resolution from a range of frequency resolutions at least one first resolution spatial audio parameter for providing spatial audio reproduction;
determine at least one second time-frequency resolution from a range of frequency resolutions at least one second resolution spatial audio parameter for providing spatial audio reproduction; and
select one of the first resolution spatial audio parameter and second resolution spatial audio parameter as the at least one time-frequency resolution from a range of frequency resolutions.
4. The apparatus as claimed in any of claims 1 to 3, wherein the apparatus is further caused to:
window the two or more audio signals to generate at least one frame of the two or more audio signals;
filter the frame of the two or more audio signals to generate at least two sub-band representations of the at least one frame of the two or more audio signals,
wherein the apparatus caused to determine, for two or more audio signals, and for a first time-frequency resolution at least one first resolution spatial audio parameter for providing spatial audio reproduction is caused to analyse the at least two sub-band representations of the at least one frame of the two or more audio signals using the first time-frequency resolution, and
wherein the apparatus caused to determine, for the two or more audio signals, and for a second time-frequency resolution at least one second resolution spatial audio parameter for providing spatial audio reproduction is further caused to analyse the at least two sub-band representations of the at least one frame of the two or more audio signals using the second time-frequency resolution.
5. The apparatus as claimed in claim 4 when dependent on claim 3, wherein the apparatus caused to select at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored is further caused to select the at least one of the at least one first resolution spatial audio parameter and the at least one second resolution spatial audio parameter to be output and/or stored on a frame-by-frame basis.
6. The apparatus as claimed in any of claims 1 to 5, wherein the apparatus is further caused to generate a data structure for the at least one resolution spatial audio parameter, wherein the data structure comprises at least one of:
a switch field configured to indicate a selection of the at least one resolution from the range of frequency resolutions;
a time-frequency description field configured to indicate the ordering of the selection of the at least one resolution from the range of frequency resolutions; and
at least one parameter field configured to provide the at least one resolution spatial audio parameter.
7. The apparatus as claimed in any of claims 4 to 6 when dependent on claim 3, wherein the apparatus caused to select at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored is further caused to generate an embedded data structure for the selected one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter, wherein the embedded data structure comprises at least one of at least one an embedded field configured to indicate the option selection of the at least one of the at least one first resolution spatial audio parameter and the at least one second resolution spatial audio parameter to be output and/or stored.
8. The apparatus as claimed in any of claims 1 to 7, wherein the apparatus is further caused to determine at least one suitability measure.
9. The apparatus as claimed in claim 8, wherein the apparatus caused to determine at least one suitability measure is caused to determine the at least one suitability measure based on at least one of:
analysis of the two or more audio signals;
analysis of a downmix based on the two or more audio signals;
analysis of the at least one resolution spatial audio parameter for providing spatial audio reproduction more audio signals for the at least one time-frequency resolution; and
visual analysis of an environment generating the at least two or more audio signals.
10. The apparatus as claimed in any of claims 8 and 9, when further dependent on claim 3, wherein the apparatus caused to select at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored is further caused to select the at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter based on the at least one suitability measure.
11. The apparatus as claimed in any of claims 8 and 9, wherein the apparatus is caused to determine the at least one time-frequency resolution based on the at least one suitability measure.
12. The apparatus as claimed in any of claims 1 to 11, wherein the apparatus is further caused to downmix the at least two audio signals based on a selected one of the first time-frequency resolution and the second time-frequency resolution.
13. The apparatus as claimed in any of claims 1 to 12, wherein the apparatus is further caused to encode the at least two audio signals based on a selected one of the first time-frequency resolution and the second time-frequency resolution.
14. The apparatus as claimed in any of claims 1 to 13, wherein the apparatus is further caused to encode the selected at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter.
15. A method for spatial audio signal encoding, the method comprising:
determining, for two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction; and
processing the at least one resolution spatial audio parameter to be output and/or stored.
16. The method as claimed in claim 15, wherein determining, for two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction comprises at least one of:
determining at least one spatial audio parameter comprising a direction parameter with an elevation and an azimuth component;
determining at least one energy ratio parameter associated with the direction parameter;
determining at least one diffuseness parameter associated with the direction parameter; and
determining at least one coherence parameter associated with the direction parameter.
17. The method as claimed in any of claims 15 and 16, wherein determining, for the two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction comprises:
determining at least one first time-frequency resolution from a range of frequency resolutions at least one first resolution spatial audio parameter for providing spatial audio reproduction;
determining at least one second time-frequency resolution from a range of frequency resolutions at least one second resolution spatial audio parameter for providing spatial audio reproduction; and
selecting one of the first resolution spatial audio parameter and second resolution spatial audio parameter as the at least one time-frequency resolution from a range of frequency resolutions.
18. The method as claimed in any of claims 15 to 17, further comprising:
windowing the two or more audio signals to generate at least one frame of the two or more audio signals;
filtering the frame of the two or more audio signals to generate at least two sub-band representations of the at least one frame of the two or more audio signals,
wherein determining, for two or more audio signals, and for a first time-frequency resolution at least one first resolution spatial audio parameter for providing spatial audio reproduction further comprises analysing the at least two sub-band representations of the at least one frame of the two or more audio signals using the first time-frequency resolution, and
wherein determining, for the two or more audio signals, and for a second time-frequency resolution at least one second resolution spatial audio parameter for providing spatial audio reproduction further comprises analysing the at least two sub-band representations of the at least one frame of the two or more audio signals using the second time-frequency resolution.
19. The method as claimed in claim 18 when dependent on claim 17, wherein selecting at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored further comprises selecting the at least one of the at least one first resolution spatial audio parameter and the at least one second resolution spatial audio parameter to be output and/or stored on a frame-by-frame basis.
20. The method as claimed in any of claims 15 to 19, further comprising generating a data structure for the at least one resolution spatial audio parameter, wherein the data structure comprises at least one of:
a switch field configured to indicate a selection of the at least one resolution from the range of frequency resolutions;
a time-frequency description field configured to indicate the ordering of the selection of the at least one resolution from the range of frequency resolutions; and
at least one parameter field configured to provide the at least one resolution spatial audio parameter.
21. The method as claimed in any of claims 18 to 20 when dependent on claim 17, wherein selecting at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored further comprises generating an embedded data structure for the selected one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter, wherein the embedded data structure comprises at least one of at least one an embedded field configured to indicate the option selection of the at least one of the at least one first resolution spatial audio parameter and the at least one second resolution spatial audio parameter to be output and/or stored.
22. The method as claimed in any of claims 15 to 21, further comprising determining at least one suitability measure.
23. The method as claimed in claim 22, wherein determining at least one suitability measure further comprises determining the at least one suitability measure based on at least one of:
analysing the two or more audio signals;
analysing a downmix based on the two or more audio signals;
analysing the at least one resolution spatial audio parameter for providing spatial audio reproduction more audio signals for the at least one time-frequency resolution; and
visual analysing an environment generating the at least two or more audio signals.
24. The method as claimed in any of claims 22 and 23, when further dependent on claim 17, wherein selecting at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored further comprises selecting the at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter based on the at least one suitability measure.
25. The method as claimed in any of claims 22 and 23, comprising determining the at least one time-frequency resolution based on the at least one suitability measure.
26. The method as claimed in any of claims 15 to 25, further comprising downmixing the at least two audio signals based on a selected one of the first time-frequency resolution and the second time-frequency resolution.
27. The method as claimed in any of claims 15 to 26, further comprising encoding the at least two audio signals based on a selected one of the first time-frequency resolution and the second time-frequency resolution.
28. The method as claimed in any of claims 15 to 27, further comprising encoding the selected at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter.
29. An apparatus for spatial audio signal encoding, the apparatus comprising:
means for determining, for two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction; and
means for processing the at least one resolution spatial audio parameter to be output and/or stored.
30. The apparatus as claimed in claim 29, wherein the means for determining, for two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction comprises at least one of:
means for determining at least one spatial audio parameter comprising a direction parameter with an elevation and an azimuth component;
means for determining at least one energy ratio parameter associated with the direction parameter;
means for determining at least one diffuseness parameter associated with the direction parameter; and
means for determining at least one coherence parameter associated with the direction parameter.
31. The apparatus as claimed in any of claims 29 and 30, wherein the means for determining, for the two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction comprises:
means for determining at least one first time-frequency resolution from a range of frequency resolutions at least one first resolution spatial audio parameter for providing spatial audio reproduction;
means for determining at least one second time-frequency resolution from a range of frequency resolutions at least one second resolution spatial audio parameter for providing spatial audio reproduction; and
means for selecting one of the first resolution spatial audio parameter and second resolution spatial audio parameter as the at least one time-frequency resolution from a range of frequency resolutions.
32. The apparatus as claimed in any of claims 29 to 31, further comprising:
means for windowing the two or more audio signals to generate at least one frame of the two or more audio signals;
means for filtering the frame of the two or more audio signals to generate at least two sub-band representations of the at least one frame of the two or more audio signals,
wherein the means for determining, for two or more audio signals, and for a first time-frequency resolution at least one first resolution spatial audio parameter for providing spatial audio reproduction further comprises means for analysing the at least two sub-band representations of the at least one frame of the two or more audio signals using the first time-frequency resolution, and
wherein the means for determining, for the two or more audio signals, and for a second time-frequency resolution at least one second resolution spatial audio parameter for providing spatial audio reproduction further comprises means for analysing the at least two sub-band representations of the at least one frame of the two or more audio signals using the second time-frequency resolution.
33. The apparatus as claimed in claim 32 when dependent on claim 31, wherein the means for selecting at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored further comprises means for selecting the at least one of the at least one first resolution spatial audio parameter and the at least one second resolution spatial audio parameter to be output and/or stored on a frame-by-frame basis.
34. The apparatus as claimed in any of claims 29 to 33, further comprising means for generating a data structure for the at least one resolution spatial audio parameter, wherein the data structure comprises at least one of:
a switch field configured to indicate a selection of the at least one resolution from the range of frequency resolutions;
a time-frequency description field configured to indicate the ordering of the selection of the at least one resolution from the range of frequency resolutions; and at least one parameter field configured to provide the at least one resolution spatial audio parameter.
35. The apparatus as claimed in any of claims 29 to 34 when dependent on claim 31, wherein the means for selecting at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored further comprises means for generating an embedded data structure for the selected one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter, wherein the embedded data structure comprises at least one embedded field configured to indicate the option selection of the at least one of the at least one first resolution spatial audio parameter and the at least one second resolution spatial audio parameter to be output and/or stored.
36. The apparatus as claimed in any of claims 29 to 35, further comprising means for determining at least one suitability measure.
37. The apparatus as claimed in claim 36, wherein the means for determining at least one suitability measure further comprises means for determining the at least one suitability measure based on at least one of:
analysis of the two or more audio signals;
analysis of a downmix based on the two or more audio signals;
analysis of the at least one resolution spatial audio parameter for providing spatial audio reproduction of the two or more audio signals for the at least one time-frequency resolution; and
visual analysis of an environment generating the two or more audio signals.
38. The apparatus as claimed in any of claims 36 and 37, when further dependent on claim 31, wherein the means for selecting at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored further comprises means for selecting the at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter based on the at least one suitability measure.
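For illustration only, and forming no part of the claims, selection based on a suitability measure might be sketched as follows. The measure used here (the largest energy jump between adjacent quarters of a frame, as a crude transient indicator), the function names and the threshold value are assumptions for this example; the claims leave the measure open.

```python
def suitability(frame):
    """Hypothetical measure: largest energy ratio between adjacent frame quarters.

    Values near 1 suggest a stationary frame; large values suggest a transient.
    """
    quarter = len(frame) // 4
    energies = [
        sum(x * x for x in frame[i * quarter:(i + 1) * quarter]) + 1e-12
        for i in range(4)
    ]
    return max(energies[i + 1] / energies[i] for i in range(3))


def select_resolution(frame, first_params, second_params, threshold=4.0):
    """Return (switch, params) for one frame.

    Assumes the first resolution is finer in frequency and the second is
    finer in time, so transient frames pick the second resolution.
    """
    if suitability(frame) > threshold:
        return 1, second_params
    return 0, first_params


steady = [1.0] * 16
onset = [0.0] * 8 + [1.0] * 8
steady_choice = select_resolution(steady, "first", "second")
onset_choice = select_resolution(onset, "first", "second")
```

Calling `select_resolution` once per frame gives a frame-by-frame selection, and the returned switch value is the kind of information a switch field could carry.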
39. The apparatus as claimed in any of claims 36 and 37, comprising means for determining the at least one time-frequency resolution based on the at least one suitability measure.
40. The apparatus as claimed in any of claims 29 to 39, further comprising means for downmixing the at least two audio signals based on a selected one of the first time-frequency resolution and the second time-frequency resolution.
41. The apparatus as claimed in any of claims 29 to 40, further comprising means for encoding the at least two audio signals based on a selected one of the first time-frequency resolution and the second time-frequency resolution.
42. The apparatus as claimed in any of claims 29 to 41, further comprising means for encoding the selected at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/EP2017/081265 WO2019105575A1 (en) | 2017-12-01 | 2017-12-01 | Determination of spatial audio parameter encoding and associated decoding |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/EP2017/081265 WO2019105575A1 (en) | 2017-12-01 | 2017-12-01 | Determination of spatial audio parameter encoding and associated decoding |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019105575A1 (en) | 2019-06-06 |
Family ID: 60543561
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2017/081265 Ceased WO2019105575A1 (en) | 2017-12-01 | 2017-12-01 | Determination of spatial audio parameter encoding and associated decoding |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2019105575A1 (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021053266A3 (en) * | 2019-09-17 | 2021-04-22 | Nokia Technologies Oy | Spatial audio parameter encoding and associated decoding |
WO2021155460A1 (en) * | 2020-02-03 | 2021-08-12 | Voiceage Corporation | Switching between stereo coding modes in a multichannel sound codec |
CN114503610A (en) * | 2019-10-10 | 2022-05-13 | 诺基亚技术有限公司 | Enhanced Directed Signaling for Immersive Communication |
EP4085661A4 (en) * | 2020-02-28 | 2023-01-25 | Nokia Technologies Oy | AUDIO REPRESENTATION AND ASSOCIATED RENDERING |
WO2023066456A1 (en) * | 2021-10-18 | 2023-04-27 | Nokia Technologies Oy | Metadata generation within spatial audio |
US11765536B2 (en) | 2018-11-13 | 2023-09-19 | Dolby Laboratories Licensing Corporation | Representing spatial audio by means of an audio signal and associated metadata |
WO2024199802A1 (en) | 2023-03-24 | 2024-10-03 | Nokia Technologies Oy | Coding of frame-level out-of-sync metadata |
WO2024199874A1 (en) | 2023-03-31 | 2024-10-03 | Nokia Technologies Oy | Spatial metadata direction harmonization |
WO2024199873A1 (en) | 2023-03-24 | 2024-10-03 | Nokia Technologies Oy | Decoding of frame-level out-of-sync metadata |
WO2024245695A1 (en) * | 2023-06-01 | 2024-12-05 | Nokia Technologies Oy | Apparatus, methods and computer program for selecting a mode for an input format of an audio stream |
US12167219B2 (en) | 2018-11-13 | 2024-12-10 | Dolby Laboratories Licensing Corporation | Audio processing in immersive audio services |
US12283281B2 (en) | 2019-10-30 | 2025-04-22 | Dolby Laboratories Licensing Corporation | Bitrate distribution in immersive voice and audio services |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2384028A2 (en) * | 2008-07-31 | 2011-11-02 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Signal generation for binaural signals |
US20150213806A1 (en) * | 2012-10-05 | 2015-07-30 | Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. | Encoder, decoder and methods for backward compatible multi-resolution spatial-audio-object-coding |
US20160064006A1 (en) * | 2013-05-13 | 2016-03-03 | Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. | Audio object separation from mixture signal using object-specific time/frequency resolutions |
2017-12-01: WO application PCT/EP2017/081265, published as WO2019105575A1 (en), status: not active (Ceased)
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US12156012B2 (en) | 2018-11-13 | 2024-11-26 | Dolby International Ab | Representing spatial audio by means of an audio signal and associated metadata |
US11765536B2 (en) | 2018-11-13 | 2023-09-19 | Dolby Laboratories Licensing Corporation | Representing spatial audio by means of an audio signal and associated metadata |
US12167219B2 (en) | 2018-11-13 | 2024-12-10 | Dolby Laboratories Licensing Corporation | Audio processing in immersive audio services |
WO2021053266A3 (en) * | 2019-09-17 | 2021-04-22 | Nokia Technologies Oy | Spatial audio parameter encoding and associated decoding |
US12165658B2 (en) | 2019-09-17 | 2024-12-10 | Nokia Technologies Oy | Spatial audio parameter encoding and associated decoding |
CN114503610A (en) * | 2019-10-10 | 2022-05-13 | 诺基亚技术有限公司 | Enhanced Directed Signaling for Immersive Communication |
US12283281B2 (en) | 2019-10-30 | 2025-04-22 | Dolby Laboratories Licensing Corporation | Bitrate distribution in immersive voice and audio services |
WO2021155460A1 (en) * | 2020-02-03 | 2021-08-12 | Voiceage Corporation | Switching between stereo coding modes in a multichannel sound codec |
US12205598B2 (en) | 2020-02-03 | 2025-01-21 | Voiceage Corporation | Switching between stereo coding modes in a multichannel sound codec |
JP2023516303A (en) * | 2020-02-28 | 2023-04-19 | ノキア テクノロジーズ オサケユイチア | Audio representation and related rendering |
US12167220B2 (en) | 2020-02-28 | 2024-12-10 | Nokia Technologies Oy | Audio representation and associated rendering |
EP4085661A4 (en) * | 2020-02-28 | 2023-01-25 | Nokia Technologies Oy | AUDIO REPRESENTATION AND ASSOCIATED RENDERING |
WO2023066456A1 (en) * | 2021-10-18 | 2023-04-27 | Nokia Technologies Oy | Metadata generation within spatial audio |
WO2024199873A1 (en) | 2023-03-24 | 2024-10-03 | Nokia Technologies Oy | Decoding of frame-level out-of-sync metadata |
WO2024199802A1 (en) | 2023-03-24 | 2024-10-03 | Nokia Technologies Oy | Coding of frame-level out-of-sync metadata |
WO2024199874A1 (en) | 2023-03-31 | 2024-10-03 | Nokia Technologies Oy | Spatial metadata direction harmonization |
WO2024245695A1 (en) * | 2023-06-01 | 2024-12-05 | Nokia Technologies Oy | Apparatus, methods and computer program for selecting a mode for an input format of an audio stream |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019105575A1 (en) | Determination of spatial audio parameter encoding and associated decoding | |
JP5081838B2 (en) | Audio encoding and decoding | |
US8817992B2 (en) | Multichannel audio coder and decoder | |
EP3707706B1 (en) | Determination of spatial audio parameter encoding and associated decoding | |
JP7689196B2 (en) | Combining spatial audio streams | |
JP2022548038A (en) | Determining Spatial Audio Parameter Encoding and Related Decoding | |
CN113678199B (en) | Determination of the importance of spatial audio parameters and associated coding | |
EP3732678A1 (en) | Determination of spatial audio parameter encoding and associated decoding | |
WO2020016479A1 (en) | Sparse quantization of spatial audio parameters | |
EP3776545B1 (en) | Quantization of spatial audio parameters | |
WO2023031498A1 (en) | Silence descriptor using spatial parameters | |
JPWO2006022190A1 (en) | Audio encoder | |
WO2019106221A1 (en) | Processing of spatial audio parameters | |
JP6235725B2 (en) | Multi-channel audio signal classifier | |
WO2022038307A1 (en) | Discontinuous transmission operation for spatial audio parameters | |
WO2020201619A1 (en) | Spatial audio representation and associated rendering | |
WO2024175320A1 (en) | Priority values for parametric spatial audio encoding | |
CA3208666A1 (en) | Transforming spatial audio parameters | |
JP2022517992A (en) | High resolution audio coding |
Legal Events
Code | Title | Description |
---|---|---|
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 17808080; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | EP: PCT application non-entry in European phase | Ref document number: 17808080; Country of ref document: EP; Kind code of ref document: A1 |