HK1188317A - Spatial audio encoding and reproduction of diffuse sound - Google Patents
Description
Cross-Reference to Related Applications
This application claims priority from U.S. provisional application No. 61/380,975, filed on 8/9/2010.
Technical Field
The present invention relates generally to high fidelity audio reproduction, and more particularly to the generation, transmission, recording, and reproduction of digital audio, particularly encoded or compressed multi-channel audio signals.
Background
Digital audio recording, transmission, and reproduction have employed several media, such as standard-definition DVDs, high-definition optical media (e.g., "Blu-ray" discs), or magnetic storage (hard disks), to record or transmit audio and/or video information to a listener. More transient transmission channels, such as wireless, microwave, fiber optic, or wired networks, are also used to transmit and receive digital audio. The increased bandwidth available for audio and video transmission has led to the widespread adoption of various multi-channel, compressed audio formats. One such popular format (widely available under the trademark "DTS" surround sound) is described in U.S. patents 5,974,380; 5,978,762; and 6,487,535, assigned to DTS, Inc.
Most of the audio content distributed to consumers for home viewing corresponds to cinema features released in theaters. Such soundtracks are typically mixed with a view toward cinema presentation in a sizable theater environment. They typically assume that a listener (sitting in a theater) may be close to some speakers but far from others. Dialog is usually confined largely to the front center channel. Left/right and surround imaging are constrained both by the assumed seating arrangement and by the size of the theater. In short, a cinema soundtrack is a mix best suited for reproduction in a large cinema.
Home listeners, on the other hand, typically sit in a small room with higher quality surround sound speakers configured to permit more convincing spatial audio imaging. Home theaters are small, with short reverberation times. While different mixes could be published for home listeners and for movie theater listeners, this is rarely done (perhaps for economic reasons). For legacy content it is generally not possible, because the original multi-track "stems" (original, unmixed sound files) may not be available (or because rights are difficult to obtain). The sound engineer mixing with a view toward both large and small rooms must therefore compromise. Introducing reverberation or diffuse sound into the soundtrack is particularly problematic because of differences in the reverberation characteristics of the various playback spaces.
This situation produces a less than optimal sound experience for home theater listeners, even for listeners who have invested in expensive surround sound systems.
Baumgarte et al., in U.S. patent 7,583,805, propose a system for stereo and multi-channel synthesis of audio signals based on inter-channel coherence cues for parametric coding. Their system generates diffuse sound derived from a transmitted combined (summed) signal. Their system is apparently intended for low bit rate applications such as teleconferencing. The patent discloses the use of time-to-frequency domain conversion techniques, filters, and reverberation to generate a simulated diffuse signal represented in the frequency domain. The disclosed technique does not give the mixing engineer artistic control, and is suitable for synthesizing only a limited range of simulated reverberation signals based on the inter-channel coherence measured during recording. The disclosed "diffuse" signal is based on analytical measurements of the audio signal, rather than on the kind of "diffusion" or "decorrelation" that the human ear would naturally resolve. The reverberation technique disclosed in the Baumgarte patent also places considerable demands on computing power, and is therefore inefficient in an economically feasible implementation.
Disclosure of Invention
According to the present invention, there are provided embodiments for conditioning multi-channel audio by encoding, transmitting, or recording a "dry" audio track or "stem" in a synchronized relationship with time-variable metadata controlled by the content creator and representing a desired degree and quality of diffusion. The audio track is compressed and transmitted together with synchronized metadata representing the diffusion and, preferably, also the mixing and delay parameters. The separation of the audio stems from the diffusion metadata facilitates customization of playback at the receiver, taking into account the characteristics of the local playback environment.
In a first aspect of the invention, a method for conditioning an encoded digital audio signal, the audio signal representing sound, is provided. The method comprises receiving encoded metadata parametrically representing a desired presentation of said audio signal data in a listening environment. The metadata includes at least one parameter that can be decoded to configure a perceptually diffuse audio effect in at least one audio channel. The method includes processing the digital audio signal with the perceptually diffuse audio effect configured in response to the parameter to produce a processed digital audio signal.
In another embodiment, a method for conditioning a digital audio input signal for transmission or recording is provided. The method includes compressing the digital audio input signal to produce an encoded digital audio signal. The method continues by generating a set of metadata in response to user input, the set of metadata representing user-selectable diffuse features to be applied to at least one channel of the digital audio signal to produce a desired playback signal. The method concludes with multiplexing the encoded digital audio signal and the set of metadata in a synchronized relationship to produce a combined encoded signal.
In an alternative embodiment, a method for encoding and reproducing a digitized audio signal for reproduction is provided. The method includes encoding the digitized audio signal to produce an encoded audio signal. The method continues by encoding a set of time-variable presentation parameters responsive to user input and in a synchronized relationship with the encoded audio signal. The rendering parameters represent a user selection of a variable perceptual diffusion effect.
In a second aspect of the invention, a data storage medium recorded with digitally represented audio data is provided. The recorded data storage medium includes compressed audio data representing a multi-channel audio signal formatted as data frames; and a set of user-selected, time-variable presentation parameters formatted to convey a synchronized relationship with the compressed audio data. The rendering parameters represent a user selection of a time-variable diffusion effect to be applied to modify the multi-channel audio signal when played.
In another embodiment, a configurable audio diffusion processor for conditioning a digital audio signal is provided, comprising a parameter decoding module configured to receive rendering parameters in a synchronized relationship with the digital audio signal. In a preferred embodiment of the diffusion processor, the configurable reverberator module is configured to receive the digital audio signal and to respond to controls from the parameter decoding module. The reverberator module may be dynamically reconfigurable to change a time decay constant in response to control from the parameter decoding module.
In a third aspect of the invention, a method of receiving an encoded audio signal and generating a replica decoded audio signal is provided. The encoded audio signal comprises audio data representing a multi-channel audio signal, and a set of user-selected, time-variable rendering parameters formatted to convey a synchronization relationship with said audio data. The method comprises receiving said encoded audio signal and said rendering parameters. The method continues by decoding the encoded audio signal to produce a replica audio signal. The method includes configuring an audio diffusion processor in response to the rendering parameters. The method concludes with processing the replica audio signal with the audio diffusion processor to produce a perceptually diffuse replica audio signal.
In another embodiment, a method of reproducing multi-channel audio sounds from a multi-channel digital audio signal is provided. The method comprises reproducing a first channel of the multi-channel audio signal in a perceptually diffuse manner. The method concludes by reproducing at least one further channel in a perceptually direct manner. The first channel may be conditioned by digital signal processing, prior to reproduction, using a perceptually diffuse effect. The first channel may be conditioned by introducing varying, frequency-dependent delays in a sufficiently complex manner to produce the psychoacoustic effect of diffusing the apparent sound source.
These and other features and advantages of the present invention will become apparent to those skilled in the art upon reading the following detailed description of the preferred embodiments, which proceeds with reference to the accompanying drawings, wherein:
Drawings
FIG. 1 is a system-level schematic diagram of the encoder aspect of the present invention with functional blocks symbolically represented by blocks ("block diagrams");
FIG. 2 is a system-level schematic diagram of the decoder aspect of the present invention with functional modules symbolically represented;
FIG. 3 is a representation of a data format suitable for packaging audio, controls, and metadata for use with the present invention;
FIG. 4 is a schematic diagram of an audio diffusion processor used in the present invention with symbolically represented functional blocks;
FIG. 5 is a schematic diagram of an embodiment of the diffusion engine of FIG. 4 with functional modules symbolically represented;
FIG. 5B is a schematic diagram of an alternative embodiment of the diffusion engine of FIG. 4 with functional modules symbolically represented;
FIG. 5C is an exemplary acoustic wave plot of interaural phase difference (in radians) versus frequency (up to 400 Hz) obtained by a 5-channel utility diffuser at a listener's ear in a conventional horizontal speaker layout;
FIG. 6 is a schematic diagram of the reverberator module included in FIG. 5 with the functional modules symbolically represented;
FIG. 7 is a schematic diagram of an all-pass filter with functional blocks symbolically represented suitable for implementing the sub-modules of the reverberator module in FIG. 6;
FIG. 8 is a schematic diagram of a feedback comb filter with functional blocks symbolically represented suitable for implementing the sub-modules of the reverberator block in FIG. 6;
FIG. 9 is a delay pattern as a function of normalized frequency, given as a simplified example comparing two of the reverberators of FIG. 5 (with different specific parameters);
FIG. 10 is a schematic diagram of a playback environment engine for a playback environment suitable for use with the decoder aspect of the present invention;
FIG. 11 is a diagram with certain components symbolically represented, depicting a "virtual microphone array" useful for computing gain and delay matrices for the diffusion engine of FIG. 5;
FIG. 12 is a schematic illustration of the mixing engine of the environment engine of FIG. 4 with functional modules symbolically represented;
FIG. 13 is a process flow diagram of a method in accordance with an encoder aspect of the present invention;
FIG. 14 is a process flow diagram of a method in accordance with a decoder aspect of the present invention.
Detailed Description
Introduction:
The present invention relates to the processing of audio signals, that is, signals representing physical sound. These signals are represented by digital electronic signals. In the discussion that follows, analog waveforms may be shown or discussed to illustrate concepts; it should be understood, however, that typical embodiments of the present invention will operate in the context of a time series of digital bytes or words that constitute a discrete approximation of an analog signal or (ultimately) a physical sound. The discrete, digital signal corresponds to a digital representation of a periodically sampled audio waveform. As is known in the art, the waveform must be sampled at a rate at least sufficient to satisfy the Nyquist sampling theorem for the frequencies of interest. For example, in a typical embodiment a sampling rate of approximately 44,100 samples/second may be used. A higher, oversampled rate such as 96 kHz may alternatively be used. The quantization scheme and bit resolution should be selected to meet the requirements of a particular application, in accordance with known principles. The techniques and apparatus of the invention will typically be applied interdependently across several channels. For example, they may be used in the context of a "surround" audio system (having more than two channels).
As used herein, a "digital audio signal" or "audio signal" does not merely describe a mathematical abstraction, but rather represents information embodied or carried by a physical medium capable of being detected by a machine or device. This term includes recorded or transmitted signals and should be understood to include transmission by any form of encoding, including Pulse Code Modulation (PCM), but not limited to PCM. The output or input, or indeed the intermediate audio signal, may be encoded or compressed by any of a variety of known methods, including MPEG, ATRAC, AC3, or proprietary methods of DTS, inc, as described in us patent 5,974,380; 5,978,762, respectively; and 6,487,535. Some modification of the calculations may be required to accommodate this particular compression or encoding method, as will be apparent to those skilled in the art.
In this specification, the word "engine" is frequently used: for example, we refer to "production engine", "environmental engine", and "hybrid engine". This term refers to any programmable or otherwise configured set of electronic logic and/or arithmetic signal processing modules that are programmed or configured to perform the particular functions described. For example, an "environment engine" is a programmable microprocessor controlled by program modules to perform the functions attributed to the "environment engine" in one embodiment of the invention. Alternatively, a Field Programmable Gate Array (FPGA), a programmable Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), or other equivalent circuitry may be used in the implementation of any of the "engines" or sub-processes without departing from the scope of the invention.
Those skilled in the art will also recognize that suitable embodiments of the present invention may require only one microprocessor (although parallel processing with multiple processors will improve performance). Accordingly, the various modules illustrated in the figures and discussed herein may be understood to represent a number of processes or a series of acts when considered in the context of a processor-based implementation. It is known in the art of digital signal processing to perform mixing, filtering, and other operations by operating on strings of audio data in succession. Accordingly, those skilled in the art will recognize how to implement the various modules by programming in a symbolic language such as C or C++, which may then be implemented on a particular processor platform.
The system and method of the present invention permit producers and sound engineers to create a single mix to be played in movie theaters and in homes. In addition, the method can be used to produce a backwards-compatible cinema mix in a standard format such as the DTS 5.1 "Digital Surround" format (referenced above). The system of the present invention distinguishes between sounds that the Human Auditory System (HAS) will detect as direct, that is, arriving from a direction corresponding to the perceived sound source, and those that are diffuse, that is, "surrounding" or "enveloping" the listener. It is important to understand that sound can be created that is diffuse only on, for example, one side of or one direction from the listener. In such a case, the difference between direct and diffuse sound is the difference between being able to localize the direction of a source and perceiving the sound as arriving from a substantial region of space.
In the case of the human auditory system, direct sound is sound arriving at both ears with some Interaural Time Delay (ITD) and Interaural Level Difference (ILD), both of which are functions of frequency, and with the ITD and ILD representing a consistent direction over a range of frequencies in multiple critical bands (as described in "The Psychology of Hearing" by Brian C. J. Moore). In contrast, diffuse signals will have "confused" ITDs and ILDs, with little agreement across frequency or time in the ITDs and ILDs, corresponding, for example, to the perception of reverberation that is ambient rather than arriving from a single direction. As used in the context of the present invention, "diffuse sound" refers to sound that has been treated or influenced by acoustic interaction so that at least one, and most preferably both, of the following conditions occur: 1) the leading edge of the waveform (at low frequencies) and the envelope of the waveform (at high frequencies) do not reach the ear at the same time at the various frequencies; and 2) the Interaural Time Difference (ITD) between the two ears varies significantly with frequency. A "diffuse signal" or "perceptually diffuse signal" in the context of the present invention refers to a (usually multi-channel) audio signal that has been electronically or digitally processed to produce the effect of diffuse sound when reproduced for a listener.
In perceptually diffuse sound, the temporal variations in arrival time and ITD exhibit complex and irregular variations with frequency sufficient to produce the psychoacoustic effect of a diffuse sound source.
According to the invention, the diffuse signal is preferably generated using the simple reverberation method described below, preferably in combination with the mixing process described below. There are other ways of producing diffuse sound, whether by signal processing alone or by a combination of signal processing and the differing times of arrival at the two ears produced by a multi-radiator speaker system (e.g., a "diffuse speaker" or a set of speakers).
The concept of "diffusion" as used herein will not be confused with chemical diffusion, decorrelation methods that do not produce the psychoacoustic effects enumerated above, or any other unrelated use of the word "diffusion" that occurs in other technologies and sciences.
As used herein, "transmission" or "transmission over a channel" means any method of transmission, storage, or recording of data for playback that may occur at different times or locations, including but not limited to electronic transmission, optical transmission, satellite relay, wired or wireless communication, transmission over a data network such as the Internet or a LAN or WAN, recording on a persistent medium such as magnetic, optical, or other forms of media (including DVD, "Blu-ray" discs, etc.). In this regard, records for transmission, archiving, or intermediate storage may be considered examples of transmission over a channel.
As used herein, "synchronized" or "in a synchronized relationship" means any method of structuring data or signals such that a temporal relationship is maintained or implied between the signals or sub-signals. More specifically, a synchronization relationship between audio data and metadata means any method that maintains or implies a defined time synchronization between metadata and audio data (both of which are time-varying or variable signals). Some exemplary methods of synchronization include Time Domain Multiplexing (TDMA), interleaving, frequency domain multiplexing, time stamped data packets, multiple indexed synchronizable data sub-streams, synchronous or asynchronous protocols, IP or PPP protocols, protocols defined by the blu-ray disc protocol or DVD standard, MP3, or other defined formats.
As used herein, "receiving" or "receiver" shall mean any method of receiving, reading, decoding, or retrieving data from a transmitted signal or from a storage medium.
As used herein, "demultiplexer" or "unpacker" means a device or method, e.g., an executable computer program module, that can be used to unpack, demultiplex, or separate an audio signal from other encoded metadata such as presentation parameters. It has to be borne in mind that the data structure may comprise other header data and metadata representing presentation parameters in addition to the audio signal data and metadata used in the present invention.
As used herein, "presentation parameters" represent a set of parameters that convey, either symbolically or by summary, the manner in which a recorded or transmitted sound plan is modified, both upon receipt and prior to playback. The term specifically includes a user-selected set of parameters representing the magnitude and quality of one or more time-variable reverberation effects to be applied at the receiver to modify the multi-channel audio signal in a playback situation. In a preferred embodiment, the term also includes other parameters, as an example, a set of mixing coefficients that control the mixing of a set of multiple audio channels. As used herein, a "receiver" or "receiver/decoder" refers broadly to any device capable of receiving, decoding, or reproducing a digital audio signal transmitted or recorded in any manner. It is not limited in any limited sense, for example, an audio-video receiver.
System overview:
FIG. 1 shows an overview, at the system level, of a system for encoding, transmitting, and reproducing audio in accordance with the present invention. A subject sound 102 is emitted in an acoustic environment 104 and converted to a digital audio signal by a multi-channel microphone arrangement 106. It will be appreciated that some arrangement of microphones, analog-to-digital converters, amplifiers, and encoding devices may be used in known configurations to produce digitized audio. Alternatively, or in addition to live audio, analog or digitally recorded audio data ("tracks") may provide the input audio data, as represented by recording device 107.
In a preferred mode of use of the invention, the audio source to be manipulated (live or recorded) should be captured in a substantially "dry" form: in other words, in a relatively non-reverberant environment, or as direct sound without significant echoes. A captured audio source is generally referred to as a "stem." Sometimes, using the described engine, it is acceptable to mix some direct stems with other signals recorded "live" in locations that provide a good sense of space. However, this is unusual, particularly because of problems that arise when such sounds are presented in cinema-sized halls. The use of substantially dry stems enables engineers to add the desired diffuse or reverberant effects in the form of metadata while preserving the dry character of the audio source tracks for use in a reverberant cinema (where, absent mixer control, some of the reverberation would come from the cinema building itself).
The metadata generation engine 108 receives an audio signal input (derived from a live or recorded source, representing sound) and processes the audio signal under the control of a mixing engineer 110. The engineer 110 also interacts with the metadata generation engine 108 through an input device 109 connected to the metadata generation engine 108. Through user input, the engineer can instruct the creation of metadata representing artistic user selections in a synchronized relationship with the audio signal. For example, the mixing engineer 110 selects to match the direct/diffuse audio features (represented by the metadata) to the synchronized movie scene changes via the input device 109.
"metadata" in this context should be understood to mean, for example, extracted, parameterized, or summarized by a series of coded or quantized parameters. For example, the metadata includes a representation of the reverberation parameters from which the reverberator may be configured in the receiver/decoder. The metadata may also include other data such as mixing coefficients and inter-channel delay parameters. The metadata generated by the production engine 108 will vary over time in increments or "frames" of time as the frame metadata relates to a particular time interval of corresponding audio data.
The time-varying stream of audio data is encoded or compressed by the multi-channel encoding device 112 to produce encoded audio data in a synchronized relationship with the corresponding metadata relating to the same time. Preferably, both the metadata and the encoded audio signal data are multiplexed into a combined data format by the multi-channel multiplexer 114. The audio data may be encoded using any known method of multi-channel audio compression; however, in certain embodiments, the encoding methods described in U.S. patents 5,974,380; 5,978,762; and 6,487,535 (DTS 5.1 audio) are preferred. Other extensions and improvements, such as lossless or scalable coding, may also be used to encode the audio data. The multiplexer should maintain the synchronized relationship between the metadata and the corresponding audio data, whether by framing syntax or by adding some other synchronization data.
The production engine 108 differs from previous encoders in that it generates, based on user input, a time-varying stream of encoded metadata representing a dynamic audio environment. The method of accomplishing this is described in more detail below with reference to FIG. 13. Preferably, the metadata so generated is multiplexed or packed into a combined bit format or "frame" and inserted in a predefined "ancillary data" field of the data frame, allowing backward compatibility. Alternatively, the metadata may be transmitted separately, with some means provided to synchronize it with the primary audio data transport stream.
To permit auditioning during production, the production engine 108 interfaces with a listening decoder 116, which demultiplexes and decodes the combined audio stream and metadata to reproduce a listening signal over speakers 120. The listening speakers 120 should preferably be arranged in a standardized, known layout (such as ITU-R BS.775 (1993) for a five-channel system). The use of a standardized or consistent layout facilitates mixing; playback can then be customized for the actual listening environment based on a comparison between the actual environment and the standardized or known listening environment. The listening system (116 and 120) allows the engineer to perceive the effects of the metadata and encoded audio as they will be perceived by listeners (as described below with reference to the receiver/decoder). Based on this auditory feedback, the engineer is able to make more accurate selections to reproduce the desired psychoacoustic effect. Furthermore, the mixing artist will also be able to switch between "cinema" and "home theater" settings, so that both can be controlled simultaneously.
The listening decoder 116 is substantially identical to the receiver/decoder, as described in more detail below with reference to FIG. 2.
After encoding, the audio data stream is transmitted over a communication channel 130 or (equivalently) recorded on some medium (e.g., an optical disc such as a DVD or "Blu-ray" disc). It should be understood that for purposes of this disclosure, recording may be considered a special case of transmission. It should also be understood that the data may be further encoded in various layers for transmission or recording, for example, by adding Cyclic Redundancy Checks (CRC) or other error correction, by adding further formatting and synchronization information, physical channel coding, and so forth. These conventional aspects of transmission do not interfere with the operation of the present invention.
Referring next to FIG. 2: after transmission, the audio data and metadata (together, "the bitstream") are received, and the metadata is separated out in a demultiplexer 232 (e.g., by simply demultiplexing or unpacking data frames having a predetermined format). The encoded audio data is decoded by the audio decoder 236, by means complementary to those used by the audio encoder 112, and sent to the data input of the environment engine 240. The metadata is unpacked by the metadata decoder/unpacker 238 and sent to the control input of the environment engine 240. The environment engine 240 receives, conditions, and remixes the audio data in a manner controlled by the received metadata (which is received in a dynamic, time-varying manner and is updated from time to time). The modified or "rendered" audio signals are then output from the environment engine and (directly or ultimately) reproduced by speakers 244 in a listening environment 246.
It will be appreciated that multiple channels may be controlled in such a system, either collectively or individually, depending on the desired artistic effect.
A more detailed description of the system of the present invention is provided below, describing more particularly the structure and function of the components or sub-modules referenced in the more generalized, system-level discussion above. The components or sub-modules of the encoder side are described first, followed by a description of the components or sub-modules of the receiver/decoder side.
A metadata generation engine:
according to the encoding aspect of the present invention, the digital audio data is manipulated by the metadata generation engine 108 prior to transmission or storage.
The metadata generation engine 108 may be implemented as a dedicated workstation or on a general purpose computer programmed to process audio and metadata in accordance with the present invention.
The metadata generation engine 108 of the present invention encodes metadata sufficient to control the subsequent synthesis (in a controlled mix) of diffuse and direct sound; to further control the reverberation time of individual stems or mixes; to further control the density of simulated acoustic reflections to be synthesized; to further control the count, length, and gain of the feedback comb filters and the count, length, and gain of the all-pass filters in the environment engine (described below); and to further control the perceived direction and distance of the signal. It is contemplated that a relatively small data rate (e.g., a few kilobits/second) will be required for the encoded metadata.
In a preferred embodiment, the metadata also includes mixing coefficients and a set of delays sufficient to characterize and control the mapping from N input channels to M output channels, where N and M are not necessarily equal and either may be the larger.
TABLE 1
Table 1 shows exemplary metadata generated in accordance with the present invention. The field A1 represents a "direct rendering" flag: this is a code that specifies, for each channel, the option of reproducing the channel without introducing synthetic diffusion (e.g., a channel recorded with inherent reverberation). This flag is controlled by the mixing engineer to designate, as a user selection, tracks that are not to be processed with the diffusion effect at the receiver. For example, in an actual mixing situation an engineer may encounter a channel (track or "stem") that was not recorded "dry" (that is, not recorded free of reverberation or diffusion). For such a stem, this fact needs to be flagged so that the environment engine can render the channel without introducing additional diffusion or reverberation. According to the invention, any input channel (stem), whether direct or diffuse, can be flagged for direct rendering. This feature greatly enhances the flexibility of the system. Thus, the system of the present invention allows a distinction between direct and diffuse input channels (as well as a distinction between direct and diffuse output channels, as discussed below).
The field denoted "X" is reserved for the excitation code associated with the previously developed standardized reverberation group. The corresponding normalized set of reverb is stored in the decoder/playback device and can be retrieved from memory by querying, as discussed below in conjunction with the diffusion engine.
The field "T60" represents or indicates a reverberation decay parameter. In the current art, the symbol "T60" is often used to represent the time required for the amount of reverberant sound in an environment to drop to a volume 60 decibels below that of direct sound. This notation is used accordingly in this description, but it should be understood that other measures of reverberation decay time may be used instead. Preferably, the parameters should relate to the decay time constant (as in the exponent of the decay exponential function) so that the decay can easily be synthesized in a form similar to the following equation:
exp(-kt)    (Equation 1)
where k is the decay time constant. More than one T60 parameter may be transmitted, corresponding to multiple channels, multiple stems, multiple output channels, or the perceived geometry of the synthetic listening space.
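By way of illustration (and assuming only the simple exponential model of Equation 1), the decay constant k can be derived from a T60 value by requiring that the envelope fall by 60 dB, i.e., by a factor of 1000 in amplitude, at t = T60. A minimal sketch in Python:

```python
import math

def decay_constant_from_t60(t60_seconds: float) -> float:
    """Decay constant k of Equation 1, exp(-k*t), such that the envelope has
    fallen by 60 dB (a factor of 1000 in amplitude) at t = t60_seconds."""
    # 20*log10(exp(-k*T60)) = -60  =>  exp(-k*T60) = 10**-3  =>  k = 3*ln(10)/T60
    return 3.0 * math.log(10.0) / t60_seconds

k = decay_constant_from_t60(4.0)   # ~1.73 per second, for a concert-hall-like T60 of 4 seconds
```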
The parameters A3-An represent (for each respective channel) a density value (e.g., a value corresponding to a delay length or number of samples of delay) that directly controls how dense the simulated reflections applied by the diffusion engine to that audio channel will be. Smaller density values produce less complex diffusion, as discussed in more detail below in connection with the diffusion engine. Although "lower density" is generally not appropriate in a musical setting, it is quite realistic when, for example, a movie character is moving through a duct or a room with hard (metal, concrete, stone) walls, or in other cases where the reverberation should have a very "fluttery" character.
The parameters B1-Bn represent "reverberation setting" values that fully specify the configuration of the reverberation modules in the environment engine (discussed below). In one embodiment, these values represent the count, stage lengths, and gains of one or more feedback comb filters, and the count, length, and gain of the Schroeder all-pass filters in the reverberation engine (discussed in detail below). In addition to, or as an alternative to, transmitting the parameters themselves, the environment engine may have a database of pre-selected reverberation values organized into profiles. In such a case, the production engine transmits metadata that symbolically identifies or selects a profile from among the stored profiles. Stored profiles provide less flexibility but greater compression, since only a symbolic code need be sent as metadata.
In addition to the metadata related to reverberation, the production engine should generate and transmit further metadata to control the mixing engine in the decoder. Referring again to Table 1, this further set of parameters preferably comprises: parameters representing the location of the sound source (relative to an assumed listener and an intended synthetic "room" or "space") or a microphone location; a set of distance parameters D1-DN used by the decoder to control the direct/diffuse mixture in the reproduced channels; a set of delay values L1-LN used to control the relative times at which audio arrives from the decoder at the different output channels; and a set of gain values G1-Gn used by the decoder to control the amplitude of the audio in the different output channels. The gain values may be specified separately for the direct and diffuse channels of the audio mix or, in the simple case, collectively.
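As an illustration only (the field names below are hypothetical and mirror the parameters just described rather than the actual Table 1 syntax), one frame of such metadata might be represented at the decoder roughly as follows:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FrameMetadata:
    """Hypothetical container for one frame of presentation metadata (cf. Table 1)."""
    direct_flags: List[bool]       # A1: per-channel flag, render without added diffusion
    reverb_set_code: int           # X: code selecting a standardized reverberation set
    t60: List[float]               # T60: reverberation decay parameter(s), in seconds
    densities: List[int]           # A3-An: per-channel reflection-density values
    reverb_settings: List[int]     # B1-Bn: comb/all-pass counts, lengths, gains (or a profile code)
    positions: List[Tuple[float, float, float]]  # source (or microphone) locations
    distances: List[float]         # D1-DN: controls the direct/diffuse mixture per channel
    delays: List[float]            # L1-LN: per-output-channel delay values
    gains: List[float]             # G1-Gn: per-output-channel gain values
```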
The mixing metadata specified above is conveniently represented as a series of matrices, as can be appreciated in view of the inputs and outputs of the overall system of the present invention. The system of the present invention, at its most general, maps a plurality of N input channels to M output channels, where N and M are not necessarily equal and either may be the larger. It can readily be seen that an N x M dimensional matrix G is sufficient to specify a complete set of gain values for mapping from the N inputs to the M output channels. Similar N x M matrices may conveniently be used to fully specify the input-to-output delay and diffusion parameters. Alternatively, a system of codes may be used to concisely represent the more frequently used mixing matrices. The matrices can then easily be recovered in the decoder by reference to a stored codebook in which each code is associated with a corresponding matrix.
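As a sketch of the matrix view described above (variable names are assumptions, and delays are taken in whole samples for simplicity), an N x M gain matrix and an N x M delay matrix can be applied as follows:

```python
import numpy as np

def mix_channels(inputs: np.ndarray, gains: np.ndarray, delays: np.ndarray) -> np.ndarray:
    """inputs: (N, num_samples) array of decoded channels; gains and delays: (N, M) matrices.
    Returns an (M, num_samples) array mixing the N inputs into M outputs."""
    n, num_samples = inputs.shape
    m = gains.shape[1]
    out = np.zeros((m, num_samples))
    for i in range(n):            # input channel
        for j in range(m):        # output channel
            d = int(delays[i, j])
            if d < num_samples:
                out[j, d:] += gains[i, j] * inputs[i, :num_samples - d]
    return out
```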
FIG. 3 shows a generalized data format suitable for transmitting audio data and metadata multiplexed in the time domain. Specifically, this example format is an extension of the format disclosed in U.S. patent 5,974,380, assigned to DTS, Inc. An example data frame is shown generally at 300. Preferably, frame header data 302 is carried near the beginning of the data frame, followed by audio data formatted into a plurality of audio subframes 304, 306, 308, and 310. One or more flags in the header 302, or in the optional data field 312, may be used to indicate the presence and length of a metadata extension 314, which may advantageously be included at or near the end of the data frame. Other data formats may be used; preferably, backward compatibility is maintained so that legacy material can be played on a decoder in accordance with the present invention. Legacy decoders are programmed to ignore the metadata in the extension field.
In accordance with the present invention, compressed audio and encoded metadata are multiplexed or otherwise synchronized and then recorded on a machine-readable medium or transmitted over a communication channel to a receiver/decoder.
Using a metadata generation engine:
From the user's perspective, the manner of using the metadata generation engine appears simple and is similar to known engineering practice. Preferably, the metadata generation engine displays a representation of the synthetic audio environment ("space") on a Graphical User Interface (GUI). The GUI may be programmed to display, in a symbolic manner, the location, size, and diffusion of the various stems or sound sources together with the listener's location (e.g., at the center) and some graphical representation of room size and shape. The mixing engineer selects the time interval of the recorded stem on which to operate by using a mouse or keyboard input device 109 and by reference to the GUI. For example, the engineer may select a time interval from a time index. The engineer then enters input to change the synthetic acoustic environment of the stem over the selected time interval. Based on the input, the metadata generation engine computes the appropriate metadata, formats it, and passes it from time to time to the multiplexer 114 for combination with the corresponding audio data. Preferably, a set of standardized presets corresponding to frequently encountered acoustic environments is selectable from the GUI. Parameters corresponding to a selected preset are then retrieved from a pre-stored lookup table to generate the metadata. In addition to the standardized presets, manual controls are preferably provided that a skilled engineer can use to generate custom acoustic simulations.
The user's selection of reverberation parameters is facilitated by the use of a listening system, as described above with reference to FIG. 1. In this way, the reverberation parameters may be selected, based on the acoustic feedback from the listening system 116 and 120, to produce the desired effect.
The receiver/decoder:
according to a decoder aspect, the present invention includes methods and apparatus for receiving, processing, conditioning and playing digital audio signals. As discussed above, the decoder/playback device system includes a demultiplexer 232, an audio decoder 236, a metadata decoder/unpacker 238, an environment engine 240, speakers or other output channels 244, a listening environment 246, and preferably a playback environment engine.
The functional blocks of the decoder/playback device are shown in more detail in FIG. 4. The environment engine 240 includes a diffusion engine 402 in series with a mixing engine 404. Each is described in more detail below. It should be borne in mind that the environment engine 240 operates in a multidimensional manner, mapping N inputs to M outputs, where N and M are integers (not necessarily equal, and either may be the larger).
The metadata decoder/unpacker 238 receives the encoded, transmitted, or recorded data as input in a multiplexed format and separates it into metadata and audio signal data for output. The audio signal data is routed to the decoder 236 (as input 236IN); the metadata is separated into its various fields and output as control data to the control inputs of the environment engine 240. The reverberation parameters are sent to the diffusion engine 402; the mixing and delay parameters are sent to the mixing engine 416.
The decoder 236 receives the encoded audio signal data and decodes it by methods and apparatus complementary to those used to encode the data. The decoded audio is organized into the appropriate channels and output to the environment engine 240. The output of the decoder 236 may be represented in any form that permits mixing and filtering operations. For example, linear PCM may suitably be used, with bit depth sufficient for the particular application.
The diffusion engine 402 receives the N-channel digital audio input from the decoder 236 in a form that permits mixing and filtering operations. It is presently preferred that the engine 402 according to the present invention operate on a time-domain representation, which allows the use of digital filters. According to the present invention, an Infinite Impulse Response (IIR) topology is strongly preferred, since IIR filters have a dispersion that more accurately simulates real physical acoustic systems (a low-pass, positive phase-dispersion characteristic).
A diffusion engine:
the diffusion engine 402 receives the (N-channel) signal input signal at signal input 408; the decoded and demultiplexed metadata is received by control input 406. The engine 402 adjusts the input signal 408 in a manner controlled by and responsive to the metadata to add reverberation and delay to produce direct and diffuse audio data (in multiple processed channels). According to the invention, the diffusion engine produces intermediate processed channels 410, including at least one "diffuse" channel 412. The plurality of processed channels 410, including both the direct channel 414 and the diffuse channel 412, are then mixed in a mixing engine 416 under control of the mixing metadata received from the metadata decoder/unpacker 238 to produce a mixed digital audio output 420. In particular, the mixed digital audio output 420 provides mixed direct and diffuse audio for a plurality of M channels and is mixed under control of the received metadata. In certain novel embodiments, the M channels of output may include one or more dedicated "diffuse" channels adapted for reproduction by a dedicated "diffuse" speaker.
Referring now to FIG. 5, an embodiment of the diffusion engine 402 is shown in more detail. For clarity only one audio channel is shown; it will be appreciated that in a multi-channel audio system a plurality of such channels will be used in parallel. Accordingly, for an N-channel system (capable of processing N stems in parallel), the channel path of FIG. 5 will be essentially replicated N times. The diffusion engine 402 may be described as a configurable, modified Schroeder-Moorer reverberator. Unlike a conventional Schroeder-Moorer reverberator, the reverberator of the present invention omits the FIR "early reflection" stage and adds an IIR filter in the feedback path. The IIR filter in the feedback path produces dispersion in the feedback and creates a T60 that varies as a function of frequency. This feature produces a perceptually diffuse effect.
Input audio channel data at input node 502 is pre-filtered by a pre-filter 504, and the d.c. component is removed by a d.c. blocking stage 506. The pre-filter 504 is a 5-tap FIR low-pass filter that removes high-frequency energy not found in natural reverberation. The DC blocking stage 506 is an IIR high-pass filter that removes energy at and below 15 Hz. The DC blocking stage 506 is necessary unless an input without a DC component can be guaranteed. The output of the DC blocking stage 506 is fed through a reverberation module ("reverberation set" 508). The output of each channel is then scaled by multiplying by an appropriate "diffuse gain" in scaling module 520. The diffuse gain is calculated based on the direct/diffuse parameters received as metadata accompanying the input data (see Table 1 and the related discussion above). Each diffuse signal channel is then summed (in summing module 522) with the corresponding direct component (fed forward from input 502 and scaled by direct gain module 524) to produce output channel 526.
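A sketch of the FIG. 5 per-channel signal flow (the filter implementations are placeholders passed in as callables; concrete forms are described in the sections that follow):

```python
def diffuse_channel(x, prefilter, dc_block, reverb_set, diffuse_gain, direct_gain):
    """One channel of the diffusion engine of FIG. 5.
    prefilter: 5-tap FIR low-pass (504); dc_block: IIR high-pass removing <= 15 Hz (506);
    reverb_set: reverberation set (508); gains per the direct/diffuse metadata (520, 524)."""
    diffuse = reverb_set(dc_block(prefilter(x)))
    return diffuse_gain * diffuse + direct_gain * x   # summed at 522 to form output 526
```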
In an alternative embodiment, the diffusion engine is configured to apply the diffuse gain and delay and the direct gain and delay before applying the diffusion effect. Referring now to FIG. 5B, further details of this alternative embodiment of the diffusion engine 402 can be seen. For clarity only one audio channel is shown; it will be appreciated that in a multi-channel audio system a plurality of such channels will be used in parallel. Accordingly, for an N-channel system (capable of processing N stems in parallel), the audio channel path of FIG. 5B will be substantially replicated N times. This diffusion engine can be described as a configurable utility diffuser that applies a specific diffusion effect, as well as a degree of diffusion and direct gain and delay, on a per-channel basis.
The audio input signal 408 is input to the diffusion engine, and the appropriate direct gain and delay is applied for each channel accordingly. The appropriate diffuse gain and delay is then applied to the audio input signal on a per-channel basis. The audio input signal 408 is then processed through a bank of utility diffusers [UD1-UD3] (described further below), which apply a diffusion density or effect to the audio output signal on a per-channel basis. The density or effect of the diffusion may be determined by one or more metadata parameters.
For each audio input channel 408, a different set of delay and gain contributions is defined for each output channel. A contribution is defined as a direct gain and delay and a diffuse gain and delay.
Subsequently, the combined contributions from all audio input channels are processed by the bank of utility diffusers so as to apply a different diffusion effect to each input channel. In particular, the contributions define the direct and diffuse gains and delays of each input/output channel connection.
Once processed, the diffuse and direct signals 412, 414 are output to a mixing engine 416.
A reverberation module:
each reverberation module includes a set of reverberations (508- & 514). According to the present invention, each single set of reverberation (508) is preferably implemented 514, as shown in fig. 6. Although multiple channels are processed substantially in parallel, only one channel is shown for clarity. Input audio channel data at input node 602 is processed by one or more Schroeder all-pass filters 604 in series. Two such filters 604 and 606 are shown in series, as in the preferred embodiment, two such filters are used. The filtered signal is then split into a plurality of parallel branches. Each branch is filtered by a feedback comb filter 608 to 620, the filtered outputs of the comb filters being combined at a summing node 622. The T60 metadata decoded by metadata decoder/unpacker 238 is used to calculate the gain of feedback comb filter 608-620. More details about the calculation method will be given below.
Preferably, the lengths (stages, Z^-n) of the feedback comb filters 608-620 and the number of samples of delay in the Schroeder all-pass filters 604 and 606 are selected from a set of prime numbers, for the following reason: for output diffusion it is advantageous to ensure that the loops never coincide in time (which would reinforce the signal at such coincident times). The use of prime sample-delay values eliminates such coincidence and reinforcement. In the preferred embodiment, using seven sets of all-pass delays and seven independent sets of comb delays, up to 49 decorrelated reverberator combinations are provided, derivable from default parameters (stored at the decoder).
In a preferred embodiment, the all-pass filters 604 and 606 use carefully selected delays drawn from the prime numbers, specifically delays in each audio channel such that the delays in 604 and 606 total 120 sample periods. (There are several pairs of primes available that total 120.) Different prime pairs are preferably used in different audio signal channels to produce dissimilarity in the ITDs of the reproduced audio signals. Each of the feedback comb filters 608-620 uses a delay of 900 sample intervals or more, most preferably in the range from 900 to 3000 sample periods. The use of so many different prime numbers results in a very complex characteristic of delay as a function of frequency, as described more fully below. The complex frequency-versus-delay characteristic produces perceptually diffuse sound by introducing frequency-dependent delays into the reproduced signal. In this manner, for the reproduced sound, the leading edge of the audio waveform at low frequencies and the envelope at high frequencies do not reach the ear simultaneously at the various frequencies.
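A small sketch of how candidate delay values satisfying these constraints might be enumerated (the specific choices made from these candidates are a design decision, not mandated by the text):

```python
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    return all(n % p for p in range(2, int(n ** 0.5) + 1))

# Prime pairs that sum to 120 samples, candidates for the two series all-pass delays
allpass_pairs = [(a, 120 - a) for a in range(2, 61) if is_prime(a) and is_prime(120 - a)]
# e.g. (7, 113), (13, 107), (17, 103), (19, 101), (23, 97), ...

# Prime delays in the 900-3000 sample range, candidates for the seven feedback comb filters
comb_delay_candidates = [n for n in range(900, 3001) if is_prime(n)]   # includes 1777
```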
Creating diffuse sound fields
In a diffuse sound field, it is not possible to discern the direction from which the sound came.
In general, the typical example of a diffuse sound field is sound reverberating in a room. A diffuse sensation may also be encountered in non-reverberant sound fields (e.g., applause, rain, wind noise, or when surrounded by a large swarm of humming insects).
A mono recording can capture the sensation of reverberation (i.e., the sensation of a prolonged sound decay time). Reproducing the perception of diffusion of a reverberant sound field, however, would require processing such a mono recording with a utility diffuser or, more generally, using an electroacoustic reproduction designed to impart diffusion to the reproduced sound.
Diffuse sound reproduction in a home theater can be achieved in a number of ways. One way is to actually build a speaker or speaker array that produces a diffuse sensation. When that is not feasible, it is also possible to produce a device, similar to a sound bar, that provides a diffuse radiation pattern. Finally, when none of these is available and the content must be rendered over a standard multi-channel speaker playback system, a utility diffuser may be used to create interference between the direct paths that disturbs the coherence of any single arrival to such an extent that a diffuse sensation is experienced.
A utility diffuser is an audio processing module intended to create a perception of spatial sound diffusion over loudspeakers or headphones. This can be achieved using various audio processing algorithms that typically decorrelate, or break the coherence between, the loudspeaker channel signals.
One method of implementing a utility diffuser is to configure it to output multiple uncorrelated/incoherent channels from a single input channel, or from multiple correlated channels, using algorithms originally designed for multi-channel artificial reverberation (as shown in FIG. 6 and the accompanying text). Such an algorithm can be modified to obtain a utility diffuser that does not produce an audible reverberation effect.
A second method of implementing a utility diffuser uses an algorithm originally designed to simulate a spatially extended sound source (rather than a point source) from a mono audio signal. Such an algorithm can be modified to simulate enveloping sound (without creating the sensation of reverberation).
A utility diffuser can be implemented simply by using a set of short-decay reverberators (T60 = 0.5 seconds or less), each applied to one of the speaker output channels (as shown in FIG. 5B). In a preferred embodiment, such a utility diffuser is designed to ensure that the time delay within one module, and the differential time delay between modules, varies with frequency in a complex manner, resulting in dispersion of the phase arriving at the listener at low frequencies and modification of the signal envelope at high frequencies. Such a diffuser is not a typical reverberator, because it has a T60 that is substantially constant over frequency and would not, in and of itself, be used for actual "reverberant" sounds.
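A minimal sketch of this arrangement, assuming a bank of independently parameterized short-decay reverberators (each a callable with a T60 of 0.5 s or less), one per speaker output channel:

```python
def utility_diffuser(x, channel_reverbs):
    """Apply a different short-decay reverberator (T60 <= 0.5 s) to each output channel,
    so that each channel reaches the listener with a different, frequency-dependent delay.
    channel_reverbs: one callable per speaker output channel."""
    return [reverb(x) for reverb in channel_reverbs]
```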
As an example, FIG. 5C plots the interaural phase difference created by such a utility diffuser. The vertical scale is radians, and the horizontal scale is the portion of the frequency range from 0 Hz to about 400 Hz. The horizontal scale is expanded so that the detail is visible. Note that the measure is in radians, not in samples or time. This figure clearly shows how the interaural time differences are severely confused. Although the delay versus frequency at a single ear is not shown, it is similar in nature, though less complex.
Alternative methods for implementing a utility diffuser include frequency-domain artificial reverberation, as further described in "Parametric Multichannel Audio Coding" by Faller, C., published in IEEE Trans. on Audio, Speech, and Language Processing, vol. 14, no. 1, Jan. 2006; or the use of all-pass filters implemented in the time domain or in the frequency domain, as further described in "The Decorrelation of Audio Signals and Its Impact on Spatial Imagery," published in Computer Music Journal, vol. 19, no. 4, Winter 1995, and "Audio Signal Decorrelation Based on a Critical Band Approach," published in the 117th AES Convention, Oct. 2004, by Kendall, G.
In the case where diffusion is to be imparted to one or more dry channels, it is quite appropriate to use the same engine as both a utility diffuser and a typical reverberation system; with simple modifications to create the T60 and frequency profile required by the content creator, it is entirely possible to provide both utility diffusion and actual, perceptible reverberation. A modified Schroeder-Moorer reverberator, such as that shown in FIG. 6, may provide either strictly utility diffusion or audible reverberation, as the content creator desires. When using such a system, the delays of the individual reverberators may advantageously be chosen to be relatively prime. (This is easily achieved by using sample delays similar to those in the feedback comb filters, but drawn from a set of relatively prime numbers, with different pairs of prime numbers summed in the Schroeder sections, or the same total delay in a one-branch all-pass filter.) Utility diffusion can also be achieved with a multi-channel recursive reverberation algorithm, as further described in Jot, J.-M. and Chaigne, A., "Digital delay networks for designing artificial reverberators" (published in the 90th AES Convention, Feb. 1991).
An all-pass filter:
referring now to fig. 7, an all-pass filter suitable for implementing either or both of the Schroeder all-pass filters 604 and 606 of fig. 6 is shown. The input signal at input node 702 is summed with a feedback signal (described below) at summing node 704. The output from 704 branches at branch node 708 into a forward branch 710 and a delay branch 712. In delay branch 712, the signal is delayed by one sample delay 714. As discussed above, in the preferred embodiment, the delays are preferably chosen so that the delays of 604 and 606 total 120 sample periods. (delay times are based on the 44.1kHz sampling rate-other intervals may also be selected to scale to other sampling rates while maintaining the same psychoacoustic effect.) in the forward branch 712, the forward signal is summed with the multiplied delays in summing node 720 to produce a filtered output at 722. The delayed signals in the finger nodes 708 are also multiplied in the feedback path by a feedback gain module 724 to provide a feedback signal to the input summing node 704 (previously described). In a typical filter design, the gain forward and gain reverse would be set to the same value, except that one must have the opposite sign to the other.
A feedback comb filter:
Figure 8 shows a suitable design that can be used for each of the feedback comb filters (608-620 in fig. 6).
The input signal at 802 is summed with a feedback signal (described below) at summing node 803, and the sum is delayed by a sample delay block 804. The delayed output of 804 is output at node 806. In the feedback path, the output at 806 is filtered by a filter 808 and multiplied by a feedback gain factor in a gain module 810. In a preferred embodiment, this filter should be an IIR filter as discussed below. The output of gain block or amplifier 810 (at node 812) is used as a feedback signal and summed with the input signal at 803, as previously described.
Several variables control the feedback comb filter of fig. 8: a) the length of the sample delay 804; b) a gain parameter g, such that 0 < g < 1 (shown as gain 810); and c) the coefficients of the IIR filter that can selectively attenuate different frequencies (filter 808 in fig. 8). In the comb filter according to the invention, one or, preferably, more of these variables are controlled in response to the decoded metadata. In a typical embodiment, filter 808 should be a low-pass filter, because natural reverberation tends to emphasize lower frequencies: air and many physical reflectors (e.g., walls, openings, etc.) generally act as low-pass filters. In general, filter 808 is appropriately selected (at the metadata engine 108 in fig. 1) together with a particular gain setting to simulate a T60 and frequency profile appropriate for the scene. In many cases default coefficients may be used. For unusual settings or special effects, the mixing engineer may specify other filter values. Additionally, the mixing engineer may create new filters, using standard filter design techniques, to mimic the T60 behaviour of almost any T60 profile. These may be specified as sets of first- or second-order IIR coefficient sections.
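A minimal sketch of one such feedback comb filter follows. It assumes a one-pole low-pass IIR as the damping filter 808; the damping coefficient, the default delay, the default gain, and the function name are illustrative assumptions and do not reproduce values from the specification.

```python
import numpy as np

def feedback_comb(x, delay_samples=1777, g=0.93, damping=0.2):
    """Feedback comb filter (fig. 8): sample delay 804, damping filter 808
    (here an assumed one-pole low-pass), and feedback gain 810."""
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    buf = np.zeros(delay_samples)        # sample delay block 804
    lp = 0.0                             # state of the one-pole IIR (filter 808)
    idx = 0
    for n, xn in enumerate(x):
        delayed = buf[idx]               # delayed output (node 806)
        y[n] = delayed
        lp = (1.0 - damping) * delayed + damping * lp   # filter 808 in the feedback path
        buf[idx] = xn + g * lp           # summing node 803: input plus scaled feedback (810/812)
        idx = (idx + 1) % delay_samples
    return y
```

Raising the damping value makes high frequencies decay faster than low frequencies, which is the kind of frequency-dependent T60 behaviour the text attributes to filter 808.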
Determination of reverberator variables:
the reverberation group (508 and 514 in fig. 5) can be defined in terms of the parameter "T60" (received as metadata and decoded by the metadata decoder/unpacker 238). The term "T60" is used in the art to denote the time, in seconds, for reverberation to decay by 60 decibels (dB). For example, in a concert hall it may take as long as 4 seconds for the reverberant reflections to decay by 60 dB; such a hall may be described as having a T60 of "4.0". As used herein, the reverberation decay parameter or T60 represents a generalized measure of decay time in a general exponential decay model. It is not necessarily limited to a measurement of the time to decay by 60 dB; other decay times may equivalently specify the decay characteristics of sound, so long as the encoder and decoder use the parameter in a consistent, complementary manner.
To control the "T60" of the reverberator, the metadata decoder computes the appropriate set of feedback comb filter gain values, which are then output to the reverberator to set the filter gain values. The closer the gain value is to 1.0, the longer the reverberation will last; at a gain equal to 1.0, the reverberation will never decrease, and above a gain of 1.0, the reverberation will continuously increase (producing a sound of the "feedback screaming" type). According to a particularly novel embodiment of the present invention, the gain value of each of the feedback comb filters is calculated using equation 2:
gain = 10^(-3 × sample_delay / (fs × T60)) (formula 2)
Where the sampling rate of the audio is given by "fs" and sample_delay is the time delay imposed by a particular comb filter (expressed as a number of samples at the known sampling rate fs). For example, if we have a feedback comb filter with a sample_delay length of 1777, input audio with a sample rate of 44,100 samples per second, and a required T60 of 4.0 seconds, then one can calculate:
gain = 10^(-3 × 1777 / (44100 × 4.0)) ≈ 0.933 (formula 3)
In a modification of the Schroeder-Moorer reverberator, the present invention includes seven parallel feedback comb filters, as shown in fig. 6 above, each with a gain calculated as described above so that all seven have a consistent T60 decay time. Because the sample delay lengths are coprime, the parallel comb filters remain orthogonal when summed, and their mixture creates a complex, diffuse sensation in the human auditory system.
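Formulas 2 and 3 can be checked numerically. The sketch below computes the feedback gain for a common T60 across a set of pairwise-coprime comb delays; the particular delay values shown (all primes) are assumed examples for illustration, not the delay set of the specification.

```python
def comb_gain_for_t60(sample_delay, fs, t60):
    """Feedback gain so the comb loop decays 60 dB in t60 seconds (formula 2)."""
    return 10.0 ** (-3.0 * sample_delay / (fs * t60))

fs = 44100
t60 = 4.0
# assumed example set of prime (hence pairwise coprime) delays
delays = [1117, 1201, 1297, 1361, 1427, 1559, 1777]
gains = {d: comb_gain_for_t60(d, fs, t60) for d in delays}
print(round(gains[1777], 3))   # 0.933, matching the worked example of formula 3
```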
To give the reverberator a consistent sound, the same filter 808 may suitably be used in each of the feedback comb filters. According to the invention, it is strongly preferred to use an "infinite impulse response" (IIR) filter for this purpose. The default IIR filter is designed to give a low-pass effect similar to the natural low-pass effect of air. Other default filters may provide other effects, such as "wood", "hard surface", and "very soft" reflection characteristics, altering T60 (whose overall value is specified as described above) at different frequencies so as to create a very different perception of the environment.
In a particularly novel embodiment of the present invention, IIR filter 808 is variable under control of the received metadata. By varying the characteristics of the IIR filter, the invention achieves control of the frequency-dependent T60 response, so that some frequencies decay faster than others. Note that the mixing engineer (using the metadata engine 108) may specify other parameters for filter 808 to produce unusual effects where artistically appropriate, but these are all processed within the same IIR filter topology. The number of comb filters is also a parameter controlled by the transmitted metadata; in acoustically challenging scenarios the number of combs can be reduced (under control of the mixing engineer) to provide a more "tube-like" or "flutter-echo" sound quality.
In a preferred embodiment, the number of Schroeder all-pass filters is also variable under control of the transmitted metadata: a given embodiment may have zero, one, two, or more. (Only two are shown in the figure for clarity.) They introduce additional simulated reflections and alter the phase of the audio signal in unpredictable ways. In addition, the Schroeder sections can provide unusual sound effects when needed.
In a preferred embodiment of the present invention, received metadata (pre-generated by the metadata generation engine 108 under user control) controls the sound of this reverberator by changing the number of Schroeder all-pass filters, by changing the number of feedback comb filters, and by changing the parameters within those filters. Increasing the number of comb filters and all-pass filters increases the density of reflections in the reverberation. Default values of 7 comb filters and 2 all-pass filters per channel have been determined experimentally to provide natural-sounding reverberation suitable for simulating a concert hall. When simulating a very simple reverberant environment, such as the inside of a sewer pipe, it is appropriate to reduce the number of comb filters. Accordingly, a metadata field "density" (as discussed previously) is provided to specify how many comb filters should be used.
The complete set of settings for the reverberator defines a "reverb_set". A reverb_set is specifically defined by: the number of all-pass filters, together with the sample_delay and gain values for each; and the number of feedback comb filters, the sample_delay value for each, and a specified set of IIR filter coefficients used as filter 808 within each feedback comb filter.
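As a concrete illustration, a reverb_set could be represented in software by a record of this kind. The field names, default gains, and example values below are hypothetical and do not reproduce the bitstream syntax or default values of the specification; the two all-pass delays are merely chosen to total 120 samples, and the comb gains would in practice be derived from T60 via formula 2.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AllpassParams:
    sample_delay: int        # delay length in samples at the reference sampling rate
    gain: float              # all-pass gain magnitude

@dataclass
class CombParams:
    sample_delay: int        # coprime delay length in samples
    gain: float              # feedback gain (derived from T60, formula 2)
    iir_coeffs: List[float]  # coefficients of damping filter 808

@dataclass
class ReverbSet:
    allpasses: List[AllpassParams] = field(default_factory=list)
    combs: List[CombParams] = field(default_factory=list)

# hypothetical example: 2 all-pass sections and 7 combs, as in the default configuration
example = ReverbSet(
    allpasses=[AllpassParams(47, 0.7), AllpassParams(73, 0.7)],
    combs=[CombParams(d, 0.933, [0.8, 0.2])
           for d in (1117, 1201, 1297, 1361, 1427, 1559, 1777)],
)
```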
In addition to unpacking custom reverberation sets, in the preferred embodiment the metadata decoder/unpacker module 238 stores a plurality of predefined reverb_sets with different values but similar average sample_delay values. The metadata decoder selects from the stored reverb_sets in response to a code received in a metadata field of the transmitted audio bitstream, as discussed above.
The combination of the all-pass filters (604, 606) and the several comb filters (608-620) produces a very complex delay-versus-frequency characteristic in each channel; furthermore, the use of different delay groups in different channels results in a very complex relationship in which the delays vary a) across frequencies within a channel, and b) between channels at the same or different frequencies. When output to a multi-channel speaker system ("surround sound system"), this can (when indicated by metadata) create frequency-dependent delays, so that the leading edge of the audio waveform (or of its envelope, for high frequencies) does not reach the ear simultaneously at all frequencies. Furthermore, because the right and left ears preferably receive sound from different speaker channels in a surround sound arrangement, the complex variations created by the present invention cause the leading edge of the envelope (for high frequencies) or of the low-frequency waveform to reach the ears with interaural time delays that vary between frequency bands. When such a signal is reproduced, these conditions produce a "perceptually diffuse" audio signal, ultimately creating a perceptually diffuse sound.
FIG. 9 shows simplified delay-versus-frequency output characteristics from two different reverberator modules programmed with different sets of delays for both the all-pass filters and the reverberation set. The delay is given in sample periods, and the frequency is normalized to the Nyquist frequency. Only a small portion of the audio spectrum, and only two channels, are shown. It can be seen that curves 902 and 904 vary with frequency in a complex manner. The inventors have found that this variation produces a convincing perception of diffusion in a surround system (e.g., extending to 7 channels).
As depicted in the (simplified) graph of fig. 9, the method and apparatus of the present invention produce a complex and irregular relationship between delay and frequency, with multiple peaks, valleys, and bends. Such a characteristic is desirable for a perceptually diffuse effect. Thus, according to a preferred embodiment of the present invention, the frequency-dependent delay (whether within one channel or between multiple channels) is of a complex and irregular nature, complex and irregular enough to cause the psychoacoustic effect of a diffuse sound source. This should not be confused with the simple and predictable phase and frequency variations produced by simple, conventional filters such as low-pass and band-pass filters. The delay-versus-frequency characteristic of the present invention results from multiple poles distributed across the audio spectrum.
Simulating distance by mixing direct and diffuse intermediate signals:
essentially, only diffuse sound can be heard if the ear is far away from the audio source. Some direct and some diffuse can be heard as the ear gets closer to the audio source. If the ear is very close to the audio source, only direct audio can be heard. The sound reproduction system may simulate the distance to the audio source by changing the mix between direct and diffuse audio.
To model distance, the environment engine simply has to "know" (receive) metadata representing the desired direct/diffuse ratio. More precisely, in the receiver of the invention, the received metadata represent the desired direct/diffuse ratio as a parameter called "diffuseness". This parameter is preferably preset by the mixing engineer, as described above with reference to the production engine 108. If no diffuseness is specified but use of the diffusion engine is specified, the default diffuseness value may suitably be set to 0.5, which represents the critical distance (the distance at which the listener hears equal amounts of direct and diffuse sound).
In one suitable parametric representation, the "diffuseness" parameter d is metadata that varies within a predefined range such that 0 ≤ d ≤ 1. By definition, a diffuseness value of 0.0 is purely direct, with no diffuse component at all; a diffuseness value of 1.0 is completely diffuse, with no direct component; in between, the mix may be performed using "direct_gain" and "diffuse_gain" values calculated according to the following formula:
(formula 4)
Accordingly, based on the received "diffuseness" metadata parameter, the present invention mixes the diffuse and direct components for each stem according to formula 4, to produce the perceptual effect of the desired distance from the sound source.
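Formula 4 itself is not reproduced above. As one plausible reading consistent with the surrounding text (equal direct and diffuse contributions at d = 0.5, the critical-distance default), the sketch below uses an equal-power crossfade; the square-root gain law and the function names are assumptions for illustration, not necessarily the gain law of the specification.

```python
import math

def distance_gains(diffuseness):
    """Hypothetical equal-power split between direct and diffuse paths:
    d = 0 -> all direct, d = 1 -> all diffuse, d = 0.5 -> equal energy."""
    d = min(max(diffuseness, 0.0), 1.0)
    direct_gain = math.sqrt(1.0 - d)
    diffuse_gain = math.sqrt(d)
    return direct_gain, diffuse_gain

def mix_stem(direct, diffuse, diffuseness):
    """Mix a stem's direct and diffuse versions according to the assumed gain law."""
    gd, gf = distance_gains(diffuseness)
    return [gd * a + gf * b for a, b in zip(direct, diffuse)]
```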
The playback environment engine:
in a preferred and particularly novel embodiment of the present invention, the mixing engine communicates with the "playback environment" engine (424 in FIG. 4) and receives from that module a set of parameters roughly specifying certain characteristics of the local playback environment. As noted above, the audio signal is pre-recorded and encoded in "dry" form (without significant ambience or reverberation). To best reproduce diffuse and direct audio in a particular local environment, the mixing engine responds to the transmitted metadata together with the set of local parameters to adapt the mix for local playback.
The playback environment engine 424 measures specific characteristics of the local playback environment, extracts a set of parameters, and passes these parameters to the local playback presentation module. The playback environment engine 424 then calculates modifications to the gain coefficient matrix and a set of M output compensation delays that should be applied to the audio signal and the diffuse signal to produce an output signal.
As shown in fig. 10, the playback environment engine 424 extracts quantitative measurements of the local acoustic environment 1004. Among the variables estimated or extracted are: room size, room volume, local reverberation time, number of speakers, and speaker layout and geometry. The local environment may be measured or estimated using a number of methods; the simplest is direct user input through a keypad or terminal-like device 1010. A microphone 1012 may also be used to provide signal feedback to the playback environment engine 424, allowing room measurement and calibration by known methods.
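For illustration only, the parameters listed above might be gathered into a record such as the following; the field names, units, and example values are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PlaybackEnvironment:
    """Hypothetical parameter set a playback environment engine might report."""
    room_volume_m3: float                                  # estimated room volume
    local_t60_s: float                                     # measured local reverberation time
    speaker_positions: List[Tuple[float, float, float]]    # (x, y, z) in metres

    @property
    def num_speakers(self) -> int:
        return len(self.speaker_positions)

# assumed example: a small living room with a 5-speaker layout
env = PlaybackEnvironment(
    room_volume_m3=45.0,
    local_t60_s=0.35,
    speaker_positions=[(0.0, 2.0, 0.0), (1.5, 1.8, 0.0), (-1.5, 1.8, 0.0),
                       (2.0, -1.0, 0.0), (-2.0, -1.0, 0.0)],
)
```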
In a preferred, particularly novel embodiment of the present invention, the playback environment module and the metadata decoding engine provide control inputs to the mixing engine. The mixing engine, responsive to those control inputs, mixes the controllably delayed audio channels, including the intermediate, composite diffuse channel, to produce output audio channels adapted to the local playback environment.
Based on the data from the playback environment module, the environment engine 240 uses the direction and distance data for each input, and the direction and distance data for each output, to determine how to mix the inputs into the outputs. The distance and direction of each input stem are included in the received metadata (see table 1); the distance and direction of each output are provided by the playback environment engine, which measures, assumes, or otherwise determines the speaker locations in the listening environment.
Various rendering models can be used by the environment engine 240. One suitable implementation of the environment engine uses a simulated "virtual microphone array" as the rendering model, as shown in fig. 11. The simulation assumes a cluster of hypothetical microphones (shown generally at 1102) positioned around a listening center 1104 of the playback environment, one microphone per output device, each microphone oriented with its tail at the center of the environment and its head pointing toward the corresponding output device (speaker 1106); preferably, the microphone pickups are assumed to be equidistant from the center of the environment.
A virtual microphone model is used to compute a (dynamically changing) matrix that produces the required volume and delay at each hypothetical microphone from each real speaker (located in the real playback environment). Clearly, knowing the gain from any speaker to a particular microphone is sufficient to calculate, for each speaker of known location, the output level required to achieve the required gain at that microphone. Similarly, knowing the speaker positions is sufficient to define any delays needed to match the signal arrival times to the model (by assuming the speed of sound in air). Thus, the purpose of the rendering model is to define a set of output channel gains and delays that will reproduce the set of microphone signals that the assumed microphones at the defined listening position would have produced. Preferably, the same or similar listening positions and virtual microphones are used in the production engine, as discussed above, to define the desired mix.
In the "virtual microphone" presentation model, a set of coefficients Cn is used to model the directivity of the virtual microphone 1102. The gain of each input with respect to each virtual microphone can be calculated by using the formula shown below. Some gains may be very close to zero ("negligible" gain), in which case the input to the virtual microphone may be ignored. For each input-output bin having a non-negligible gain, the rendering model instructs the blending engine to blend from the input-output bin using the calculated gain; if the gain is negligible, no mixing needs to be performed for the bipartite. (the mixing engine is given instructions in the form of "mixop", which are discussed fully in the mixing engine section below.) the microphone gain factors of the virtual microphones may be the same or different for all virtual microphones, if the calculated gain is negligible. The coefficients may be provided by any convenient means. For example, a "playback environment" system may provide them through direct or similar measures. Alternatively, the data may be input by the user or pre-stored. For standardized speaker configurations such as 5.1 and 7.1, the coefficients will be embedded based on the standardized microphone/speaker settings.
The following formula may be used to calculate the gain of the audio source (stem) relative to a hypothetical "virtual" microphone in the virtual microphone rendering model:
(formula 5)
The matrices c_ij, p_ij, and k_ij characterize the directional gain characteristics of the assumed microphones. These can be measured from real microphones or assumed from a model, and they can be simplified using simplifying assumptions. The subscript s identifies the audio stem; the subscript m identifies the virtual microphone. The variable θ represents the horizontal angle of the subscripted object (s for the audio stem and m for the virtual microphone); a corresponding variable represents the vertical angle of the subscripted object.
The delay of a given stem relative to a particular virtual microphone can be found from the following equation:
(formula 6)
(formula 7)
(formula 8)
(formula 9)
(formula 10)
(formula 11)
(formula 12)
(formula 13)
where the virtual microphones are assumed to lie on an assumed ring, and the variable radius_m represents that radius, specified in milliseconds of sound travel (assuming air at room temperature and pressure as the medium). With appropriate conversions, all angles and distances can be measured or calculated in different coordinate systems, based on actual or approximate positions in the playback environment. For example, simple trigonometric relationships may be used to calculate the angles from speaker positions given in Cartesian coordinates (x, y, z), as known in the art.
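Because formulas 5 through 13 are not reproduced above, the sketch below substitutes a deliberately simplified stand-in: a first-order (cardioid-style) directivity for each virtual microphone, and a plane-wave delay relative to a ring of microphones whose radius is given directly in milliseconds. It only illustrates the general shape of a virtual-microphone rendering model; the actual formulas of the specification involve the c, p, and k coefficient matrices and both horizontal and vertical angles, and all names and values here are assumptions.

```python
import math

def virtual_mic_gain(stem_azimuth_deg, mic_azimuth_deg, pattern=0.5):
    """Simplified stand-in for the gain formula: first-order directivity
    g = (1 - pattern) + pattern * cos(angle between stem direction and mic axis).
    pattern = 0 is omnidirectional, 1 is figure-of-eight, 0.5 is cardioid."""
    angle = math.radians(stem_azimuth_deg - mic_azimuth_deg)
    return (1.0 - pattern) + pattern * math.cos(angle)

def virtual_mic_delay_ms(stem_azimuth_deg, mic_azimuth_deg, radius_ms):
    """Simplified stand-in for the delay formulas: relative arrival time at a mic
    on a ring of radius 'radius_ms' (in ms of sound travel), assuming a far-field
    (plane-wave) source; offset so the earliest arrival has zero delay."""
    angle = math.radians(stem_azimuth_deg - mic_azimuth_deg)
    return radius_ms * (1.0 - math.cos(angle))

# hypothetical 5-speaker layout at 0, +/-30 and +/-110 degrees, stem at 45 degrees
mics = [0.0, 30.0, -30.0, 110.0, -110.0]
gains = [virtual_mic_gain(45.0, m) for m in mics]
delays = [virtual_mic_delay_ms(45.0, m, radius_ms=5.0) for m in mics]
```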
A given particular audio environment will provide certain parameters specifying how the diffusion engine is configured for that environment. Preferably, these parameters are measured or estimated by the playback environment engine, but they may alternatively be input by a user or preprogrammed based on reasonable assumptions. If any of these parameters are omitted, default diffusion engine parameters may suitably be used; for example, if only T60 is specified, all other parameters are set to their default values. If two or more input channels require reverberation to be applied by the diffusion engine, they are mixed together and the result of the mix passes through the diffusion engine. The diffuse output of the diffusion engine can then be treated as another available input to the mixing engine, and mixops can be generated from the output of the diffusion engine. Note that the diffusion engine may support multiple channels, and both inputs and outputs may be directed to, or taken from, specific channels within the diffusion engine.
The mixing engine:
the mixing engine 416 receives as control inputs a set of mixing coefficients, and preferably a set of delays, from the metadata decoder/unpacker 238. As signal input, it receives the intermediate signal paths 410 from the diffusion engine 402. According to the invention, this input includes at least one intermediate diffuse channel 412. In a particularly novel embodiment, the mixing engine also receives input from the playback environment engine 424, which can be used to modify the mix according to the characteristics of the local playback environment.
As discussed above (with reference to the production engine 108), the mixing metadata specified above is conveniently represented as a series of matrices, as can be appreciated in view of the inputs and outputs of the overall system of the present invention. The system of the present invention, most generally, maps N input channels to M output channels, where N and M are not necessarily equal and either may be the larger. It can readily be seen that an N×M matrix G is sufficient to specify a complete set of gain values for mapping N inputs to M output channels. A similar N×M matrix may conveniently be used to fully specify the input-to-output delays and diffusion parameters. Alternatively, a system of codes may be used to concisely represent the more frequently used mixing matrices; the matrices can then easily be recovered by reference to a stored codebook in which each code is associated with a corresponding matrix.
Accordingly, to mix N inputs into M outputs, for each sample time it is sufficient to multiply the row vector of N input samples by the ith column of the gain matrix (for i = 1 to M). Similar operations may be used to specify the delays to be applied (as an N-to-M mapping), and the direct/diffuse mix for each N-to-M output channel mapping. Other representations may be used, including simpler scalar and vector representations (at the expense of flexibility).
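A minimal sketch of this matrix mixing, gains only and delays omitted for brevity, is shown below; the array shapes, names, and example values are illustrative assumptions.

```python
import numpy as np

def mix_sample(inputs, gain_matrix):
    """Map one sample frame of N inputs to M outputs using the NxM gain
    matrix G described above: out_j = sum_i in_i * G[i, j]."""
    inputs = np.asarray(inputs, dtype=float)   # shape (N,)
    return inputs @ gain_matrix                # shape (M,) for G of shape (N, M)

# hypothetical 2-input, 3-output mix
G = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5]])
print(mix_sample([0.2, -0.1], G))              # [0.2, -0.1, 0.05]
```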
Unlike a conventional mixer, the mixing engine according to the invention includes at least one (preferably more than one) input stem specifically identified for perceptually diffuse treatment; more specifically, the environment engine is configurable under control of the metadata so that the mixing engine can receive a perceptually diffuse channel as an input. The perceptually diffuse input channels may be: a) generated by processing one or more audio channels with a perceptually relevant reverberator according to the invention, or b) recorded in a naturally reverberant acoustic environment and identified as such stems by corresponding metadata.
Accordingly, as shown in FIG. 12, the mixing engine 416 receives N' channels of audio input, comprising the intermediate audio signals 1202 (N channels) plus one or more diffuse channels 1204 generated by the environment engine. The mixing engine 416 mixes the N' audio input channels 1202 and 1204 by multiplication and summation, under control of a set of mixing control coefficients (decoded from the received metadata), to produce a set of M output channels (1210 and 1212) for playback in the local environment. In one embodiment, a dedicated diffuse output 1212 is distinguished for reproduction by a dedicated diffuse-radiator speaker. The multiple audio channels are then converted to analog signals and amplified by amplifier 1214; the amplified signals drive the speaker array 244.
The particular mixing coefficients vary over time in response to metadata received from time to time by the metadata decoder/unpacker 238. In a preferred embodiment, the specific mix also changes in response to information about the local playback environment; preferably, the local playback information is provided by the playback environment module 424, as described above.
In a preferred, novel embodiment, the mixing engine also applies, to each input-output pair, a specified delay decoded from the received metadata, preferably also depending on local characteristics of the playback environment. Preferably, the received metadata include a delay matrix to be applied by the mixing engine to each input channel/output channel pair (and then modified by the receiver based on the local playback environment).
This operation may also be described in terms of a set of parameters (MIX operation instructions) denoted "mixops". Based on the control data received from the decoded metadata (via data path 1216), together with further parameters received from the playback environment engine, the mixing engine calculates delay and gain coefficients (collectively, "mixops") based on a rendering model of the playback environment (represented as block 1220).
Preferably, the mixing engine will use "mixops" to specify the mixes to be performed. Suitably, for each particular input mixed into each particular output, a corresponding single mixop (preferably including both gain and delay fields) is generated; thus, a single input may generate one mixop for each output channel. In general, N×M mixops are sufficient to map N inputs to M output channels. For example, a 7-channel input played over 7 output channels potentially generates up to 49 gain mixops for the direct channels alone; more are needed in the 7-channel embodiment of the present invention to account for the diffuse channel received from the diffusion engine 402. Each mixop specifies an input channel, an output channel, a delay, and a gain; optionally, a mixop may also specify an output filter to be applied. In a preferred embodiment, the system allows certain channels to be identified (by metadata) as "direct rendering" channels. If such a channel also has its dispersion_flag set (in the metadata), it does not pass through the diffusion engine but is instead fed to the diffuse input of the mixing engine.
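The mixop concept can be illustrated with a hypothetical record type. The field set (input channel, output channel, gain, delay, optional filter) follows the text above, but the names and the application loop are assumptions for illustration, not the encoding used by the specification.

```python
from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class MixOp:
    in_ch: int                      # input channel index (direct or diffuse)
    out_ch: int                     # output channel index
    gain: float                     # gain for this input/output pair
    delay: int                      # delay in samples for this pair
    filt: Optional[object] = None   # optional output filter (unused in this sketch)

def apply_mixops(inputs: np.ndarray, mixops: List[MixOp], n_out: int) -> np.ndarray:
    """inputs: (n_in, n_samples) block.  Returns an (n_out, n_samples) block.
    Assumes every delay is shorter than the block length."""
    n_samples = inputs.shape[1]
    out = np.zeros((n_out, n_samples))
    for op in mixops:
        # delayed, gain-scaled copy of the input added into the output channel
        out[op.out_ch, op.delay:] += op.gain * inputs[op.in_ch, :n_samples - op.delay]
    return out
```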
In a typical system, certain outputs may be processed separately as low-frequency effects (LFE) channels. Outputs labeled LFE are processed by methods that are not the subject of the present invention; the LFE signals may be handled in separate, dedicated channels (bypassing the diffusion engine and the mixing engine).
An advantage of the invention is the separation of direct and diffuse audio at encoding, followed by synthesis of the diffuse effect at decoding and playback. This separation of direct audio from room effects allows more faithful playback in a variety of playback environments, particularly when the playback environment is not known a priori by the mixing engineer. For example, if the playback environment is a small, acoustically dry studio, a diffuse effect may be added to simulate a large theater when the scene requires it.
This advantage of the invention is well illustrated by a specific example: in a well-known popular film about Mozart, an opera scene is set in the Vienna opera house. If such a scene is transmitted by the method of the present invention, the music is recorded "dry", as more or less direct stems (in multiple channels). Metadata may then be added in the metadata engine 108 by the mixing engineer to call for synthetic diffusion on playback. In response, at the decoder, if the playback theater is a small room such as a home living room, appropriate artificial reverberation is added. If, on the other hand, the playback theater is an auditorium, the metadata decoder will indicate, based on the local playback environment, that less artificial reverberation should be added (to avoid excessive reverberation and the resulting muddiness).
Conventional audio transmission schemes do not permit an equivalent adaptation to local playback, because the room impulse response of a real room cannot practically be removed by deconvolution. While some systems attempt to compensate for the local frequency response, such systems do not truly remove the reverberation present in the transmitted audio signal. In contrast, the present invention transmits direct audio in coordinated combination with metadata that facilitates synthesis of appropriate diffuse effects on playback in a variety of playback environments.
Direct and diffuse outputs and speakers:
in a preferred embodiment of the present invention, the audio output (243 in fig. 2) comprises a number of audio channels, differing in number from the number of audio input channels (stems). In a preferred, particularly novel embodiment of the inventive decoder, the dedicated diffuse output should preferably be routed to a suitable speaker dedicated to reproducing diffuse sound. A combined direct/diffuse speaker with separate direct and diffuse input channels may advantageously be used, such as the system described in US patent application 11/847,096, published as US 2009/0060236 A1. Alternatively, using the reverberation method described above, a diffuse sensation can be created through the interaction of 5 or 7 channels of a direct audio presentation, using the intentional inter-channel interference in the listening room created by the reverberation/diffusion system specified above.
Specific embodiments of the method of the invention
In a more specific, practical embodiment of the present invention, the environment engine 240, the metadata decoder/unpacker 238, and even the audio decoder 236 may be implemented on one or more general-purpose microprocessors, or in conjunction with dedicated, programmable, integrated DSP systems. Such systems are most often described from a procedural perspective, from which it is readily appreciated that the modules and signal paths illustrated in figs. 1-12 correspond to processes performed by a microprocessor under the control of software modules (in particular, software modules including the instructions necessary to perform all of the audio processing functions described herein). For example, the feedback comb filter is easily implemented by a programmable microprocessor in combination with sufficient random access memory to store intermediate results, as is known in the art. All of the modules, engines, and components described herein (other than the mixing engineer) may similarly be implemented by suitably programmed computers. Various data representations may be used, including floating-point or fixed-point arithmetic.
Referring now to fig. 13, a process diagram of the receiving and decoding method is shown generally. The method begins at step 1310 by receiving an audio signal carrying a plurality of metadata parameters. In step 1320, the audio signal is demultiplexed to unpack the encoded metadata from the audio signal, which is separated into the specified audio channels. The metadata include a plurality of rendering parameters, mixing coefficients, and a set of delays, all further defined in table 1 above. Table 1 provides exemplary metadata parameters but is not intended to limit the scope of the present invention; it will be appreciated by those skilled in the art that other metadata parameters defining the diffuse character of the audio signal may be carried in the bitstream in accordance with the present invention.
The method continues in step 1330 by processing the metadata parameters to determine which audio channel(s) should be filtered to include a spatially diffuse effect. The appropriate audio channels are processed through the reverberation group to include the intended spatially diffuse effect (the reverberation group is discussed in the "reverberation module" section above). The method continues in step 1340 by receiving playback parameters defining the local acoustic environment. Each local acoustic environment is unique, and each may affect the spatially diffuse character of the audio signal in a different way. Taking the characteristics of the local acoustic environment into account, and compensating for any deviation from the spatial diffusion planned at the encoder that would otherwise occur naturally when the audio signal is played in that environment, facilitates faithful playback of the audio signal.
The method continues in step 1350 by mixing the filtered audio channels based on the metadata parameters and the playback parameters. It should be understood that a generalized mix includes a weighted contribution from each of the N inputs to each of the M outputs, where N and M are the numbers of inputs and outputs, respectively. The mixing operation is suitably controlled by a set of "mixops", as described above. Preferably, a set of delays (based on the received metadata) is also introduced as part of the mixing step (also as described above). In step 1360, the audio channels are output for playback through one or more speakers.
Referring next to fig. 14, the encoding method aspect of the present invention is shown generally. In step 1410, a digital audio signal is received (which may originate from captured live sound, from a transmitted digital signal, or from playback of a recorded file). The signal is compressed or encoded (step 1416). In synchronized relationship with the audio, the mixing engineer ("user") enters control selections on an input device (step 1420); the input determines or selects the desired diffusion effects and multi-channel mix. The encoding engine generates or computes metadata appropriate to the desired effects and mix (step 1430). The audio is decoded and processed by the receiver/decoder according to the decoding method of the present invention (step 1440, described above); the decoded audio includes the selected diffusion and mixing effects. The decoded audio is played to the mixing engineer through a listening system so that he or she can verify the desired diffusion and mixing effects (listening step 1450). If the source audio is from a pre-recorded source, the engineer has the option to repeat the process until the desired effect is achieved. Finally, the compressed audio is transmitted in synchronized relationship with metadata representing the diffuse and (preferably) mixing characteristics (step 1460). In the preferred embodiment, this step includes multiplexing the metadata with the compressed (multi-channel) audio stream in a combined data format for transmission or recording on a machine-readable medium.
In another aspect, the invention includes a machine-readable recordable medium recorded with a signal encoded by the method described above. In a system aspect, the invention also includes a combined system of encoding, transmitting (or recording), and receiving/decoding according to the methods and apparatus described above.
It will be apparent that variations of the processor architecture may be used. For example: multiple processors may be used in a parallel or serial configuration. A dedicated "DSP" (digital signal processor) or digital filter device may be used as the filter. Multiple channels of audio can be processed together, either by multiplexing the signals or by running parallel processors. The inputs and outputs may be formatted in various ways, including parallel, serial, interleaved, or encoded.
While various illustrative embodiments of the invention have been shown and described, many other variations and alternative embodiments will occur to those skilled in the art. Such variations and alternative embodiments are contemplated and may be made without departing from the spirit and scope of the invention as defined in the appended claims.
Claims (8)
1. A method for conditioning an encoded digital audio signal, said audio signal representing sound, the method comprising the steps of:
receiving encoded metadata parametrically representing a desired presentation of the audio signal data in a listening environment;
the metadata comprises at least one parameter that can be decoded to configure a perceptually diffuse audio effect in at least one audio channel;
processing the digital audio signal with the perceptually diffuse audio effect configured in response to the parameter to produce a processed digital audio signal.
2. The method of claim 1, wherein the step of processing the digital audio signal comprises decorrelating at least two audio channels with at least one practical diffuser.
3. The method of claim 2, wherein the practical diffuser includes at least one short-decay reverberator.
4. A method as recited in claim 3, wherein the short-decay reverberator is configured such that its measure of decay over time (T60) is equal to 0.5 seconds or less.
5. A method as recited in claim 4, wherein the short-decay reverberator is configured such that T60 is substantially constant across frequencies.
6. The method of claim 3, wherein the step of processing the digital audio signal comprises generating a processed audio signal having components in at least two output channels; and
wherein the at least two output channels comprise at least one direct sound channel and at least one diffuse sound channel;
deriving the diffuse sound channel from the audio signal by processing the audio signal with a frequency-domain artificial reverberation filter.
7. The method of claim 2, wherein the step of processing the digital audio signal further comprises: filtering the audio signal with an all-pass filter in the time domain or the frequency domain.
8. The method of claim 7, wherein the step of processing the digital audio signal further comprises decoding the metadata to obtain at least a second parameter indicative of a desired diffusion density; and
wherein the diffuse sound channel is configured to approximate the diffusion density in response to the second parameter.
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US61/380,975 | 2010-09-08 | | |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1188317A true HK1188317A (en) | 2014-04-25 |
| HK1188317B HK1188317B (en) | 2017-10-27 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN103270508B (en) | Spatial audio coding and reproduction to diffusion sound | |
| JP5467105B2 (en) | Apparatus and method for generating an audio output signal using object-based metadata | |
| EP2805326B1 (en) | Spatial audio rendering and encoding | |
| JP5688030B2 (en) | Method and apparatus for encoding and optimal reproduction of a three-dimensional sound field | |
| CN102348158B (en) | Apparatus for determining a spatial output multi-channel audio signal | |
| KR101366291B1 (en) | Method and apparatus for decoding a signal | |
| HK1188317A (en) | Spatial audio encoding and reproduction of diffuse sound | |
| HK1188317B (en) | Spatial audio encoding and reproduction of diffuse sound | |
| HK1187698B (en) | Spatial audio encoding and reproduction of diffuse sound | |
| HK1187698A (en) | Spatial audio encoding and reproduction of diffuse sound | |
| HK1140351A (en) | Apparatus and method for generating audio output signals using object based metadata | |
| HK1155884B (en) | Apparatus and method for generating audio output signals using object based metadata |