WO2018132385A1

WO2018132385A1 - Audio zooming in natural audio video content service

Info

Publication number: WO2018132385A1
Application number: PCT/US2018/012992
Authority: WO
Inventors: Pasi Sakari OJALA
Original assignee: PCMS Holdings Inc
Current assignee: PCMS Holdings Inc
Priority date: 2017-01-12
Filing date: 2018-01-09
Publication date: 2018-07-19
Anticipated expiration: 2019-07-12

Abstract

Systems and methods related to zooming an immersive audio presentation in connection with video content. In one embodiment, there is a method comprising accessing, at a server, a primary audio and video stream. The method also includes preparing, at the server, a custom video stream to enhance a spatial region of the primary audio and video stream. The method also includes preparing, at the server, a custom audio stream corresponding to the custom video stream. The audio stream may be processed by classifying audio as either inside or outside the spatial region. Audio inside the region may be focused, and audio outside the region diffused and/or decorrelated. The processed audio stream may be paired with an enhanced video stream and delivered to a client device.

Description

AUDIO ZOOMING IN NATURAL AUDIO VIDEO CONTENT SERVICE

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] The present application is a non-provisional filing of, and claims benefit under 35 U.S.C. §119(c) from, U.S. Provisional Patent Application Serial No. 62/445,641 , filed January 12, 2017, entitled "Audio Zooming in Natural Audio Video Content Service", which is hereby incorporated by reference in its entirety.

BACKGROUND

[0002] There are number of live streaming and video on demand (VOD) services that provide variety of audio visual content. People can share live video stream as well as edited content globally to other service users and their followers on YouTube, Twitter, Facebook, etc. Each application user is connected to the service and may share their video content individually.

[0003] Camera captured streaming and VOD services offer several functionalities on managing the video stream. The content owner may, for example, produce new content options by zooming in on a given portion of or location within the video stream, or by following a particular visual object in detail. Basically, a content editor may select a region of the video stream and zoom in. The resulting new video stream may then be added as a new adaptation block within a media presentation description (MPD) content manifestation file of the streaming or download service.

[0004] Live media streams accompanied by additional sensor data provide rich contextual information regarding the recording environment. It may be beneficial to have third party interfaces for augmenting new content and focusing the existing stream and helping the content creator to find focus points and points of interest in the presentation.

SUMMARY

[0005] Described herein are systems and methods generally related to streaming of media content. More specifically, embodiments of the herein disclosed systems and methods relate to zooming an immersive audio presentation in connection with video content.

[0006] Systems and methods described herein provide mechanisms for "zooming" an immersive audio image of a video presentation. Exemplary embodiments enable a user to concentrate on a selected area and/or objects in the video presentation. For example, where the visual content has augmented material that is meant to capture a user's attention in a certain direction within the presentation, users may be provided with the ability to control the experience from an audio perspective.

[0007] In one embodiment, a method, includes accessing, at a server, a primary audio and video stream; preparing, at the server, a custom audio stream that enhances audio content associated with an identified spatial region of the primary video stream by classifying portions of audio content of the primary audio stream as either inside or outside the identified spatial region via spatial rendering of the primary audio stream, including determining a plurality coherence parameters identifying a diffuseness associated with the one or more sound sources in the primary audio stream, and determining a plurality of directional parameters associated with the primary audio stream, and filtering the primary audio stream using the plurality of coherence parameters and the plurality of directional parameters.

[0008] In another embodiment, a method includes accessing, at a server, a first media stream comprising a first audio stream and a first video stream, determining a first spatial region of the first video stream, and determining a first zoomed audio region of the first audio stream associated with the first spatial region of the first video stream. The method includes generating a focused audio stream based at least in part on processing of the first zoomed audio region of the first audio stream via spatial rendering of the first audio stream in time and frequency, including determining a plurality of coherence values as a measure of diffuseness associated with one or more sound sources in the first audio stream, and determining a plurality of directional values associated with the first audio stream, and generating a custom media stream based on the focused audio stream and a first custom video stream based at least in part on the first video stream. The method further includes streaming the custom media stream from the server to a first receiving client.

[0009] In one embodiment of the method the first media stream is received at the server from one or more of a first live content capture client and a video on demand server, the first media stream including client video data corresponding to a user-selected region of interest and audio data corresponding to the selected region of interest.

[0010] In some embodiments, there is a system comprising a processor and a non-transitory storage medium storing instructions operative, when executed on the processor, to perform functions including those set forth above, and others.

BRIEF DESCRIPTION OF THE DRAWINGS

[0011] A more detailed understanding may be had from the following description, presented by way of example in conjunction with the accompanying drawings, wherein:

[0012] FIG. 1 illustrates an overview of one embodiment of a system architecture for a streaming media server as disclosed herein. [0013] FIG. 2 illustrates an overall process flow of one embodiment of content editing at a live-streaming server, as disclosed herein.

[0014] FIG. 3 illustrates a block diagram of one embodiment of an analysis filter bank.

[0015] FIG. 4 illustrates a process flow of one embodiment of audio zooming.

[0016] FIG. 5 illustrates a block diagram of one embodiment of audio image filtering in sub band domain based on the zoom information.

[0017] FIG. 6 illustrates one embodiment of classification of time-frequency slots based on the content.

[0018] FIG. 7 illustrates a process flow of one embodiment of audio zooming.

[0019] FIG. 8A illustrates a block diagram of one embodiment of a synthesis filter bank.

[0020] FIG. 8B illustrates a block diagram for one embodiment of combined analysis, directional filtering, and synthesis.

[0021] FIG. 9 illustrates a sequence diagram for an exemplary embodiment for live operation of audio zooming with user target selection.

[0022] FIG. 10 illustrates an exemplary embodiment of selection of a zoomed region of a video stream.

[0023] FIG. 11 illustrates an exemplary embodiment of a zoomed video stream added in an media presentation description (MPD).

[0024] FIG. 12 illustrates an exemplary embodiment of how a user may hear the zoomed audio image and targets.

[0025] FIG. 13 illustrates an exemplary wireless transmit/receive unit (WTRU) that may be employed as a server or user client in some embodiments.

[0026] FIG. 14 illustrates an exemplary network entity that may be employed in some embodiments.

DETAILED DESCRIPTION

[0027] A detailed description of illustrative embodiments will now be provided with reference to the various Figures. Although this description provides detailed examples of possible implementations, it should be noted that the provided details are intended to be by way of example and in no way limit the scope of the application.

[0028] Note that various hardware elements of one or more of the described embodiments are referred to as "modules" that carry out (i.e., perform, execute, and the like) various functions that are described herein in connection with the respective modules. As used herein, a module includes hardware (e.g., one or more processors, one or more microprocessors, one or more microcontrollers, one or more microchips, one or more application-specific integrated circuits (ASICs), one or more field programmable gate arrays (FPGAs), one or more memory devices) deemed suitable by those of skill in the relevant art for a given implementation. Each described module may also include instructions executable for carrying out the one or more functions described as being carried out by the respective module, and it is noted that those instructions could take the form of or include hardware (i.e., hardwired) instructions, firmware instructions, software instructions, and/or the like, and may be stored in any suitable non-transitory computer-readable medium or media, such as commonly referred to as RAM, ROM, etc.

[0029] In some embodiments, a streaming server enables an audio/visual presentation zooming functionality for camera captured natural audio/visual live-streaming or video-on-demand (VOD) content. A content editor, director, or an automatic contextual tool may access the content through a third party interface to create a zoomed audio/video stream to focus on a particular area or object of the stream. The zooming operation of the audio image may provide a direction or area of interest in the presentation, around which the viewer is expected or desired to concentrate.

[0030] Live camera-captured natural video streams typically have a very vivid audio image with a plurality of individual sources and ambient sounds. A "zoomed" region or a direction of interest in a presentation does not necessarily have a single distinct sound source that can be traced and emphasized. It should be noted that an object based media content approach does not generally work with live natural content. That is, the number of sound sources and their location in the audio image of live natural content is varying, and the existing sound sources may move in and out of the audio image. Conventional management and manipulation of individual distinct sound sources, "audio objects", in predetermined and fixed locations does not generally work for live natural content. Camera capture for a natural content stream may have an undetermined number of sources and ambient sounds without any particular location cues. Therefore, tracing sound sources and moving their locations artificially is unreliable and may cause annoying audio effects.

[0031] As disclosed more fully below, in some embodiments of the systems and methods herein, there may be an external content editor or automatic contextual tool, such as a third party, which controls a media presentation including a zooming function, and where the third party may, in some embodiments, augment new content to the stream in a selected direction of interest.

[0032] Horizontal and vertical angles of the zoomed or selected area of the media presentation may be extracted and applied to select the corresponding area in an immersive audio image. Alternatively, in some embodiments, the zooming area selection may be conducted on the audio image only. In some cases, in addition, the content creation service or user may track target objects within the captured audio/visual image and determine the zoomed region based on contextual cues. Directional spatial filtering of the audio signal

(stereo or multi-channel audio) may be applied to focus the audio image in the given direction selected from the visual content, or determined based on target location information. Multi-channel (two or more channels) audio parameters in the time-frequency domain may be classified in the given direction of interest and outside the zoomed area. Parameters within the zoomed area in the direction of interest may be focused by increasing the coherence and sharpening the directional information to produce a clear audio image. Parameters outside the zoomed area may be "blurred" to fade away possible audio components with distinct directional cues, when reverb and decorrelation (random level and phase shift) is introduced. This may make the image outside a zoomed area more ambient and diffuse. The new tuned audio/visual stream is identified in a manifest file such as a media presentation description (MPD), in some cases together with augmented content.

[0033] In some embodiments, using the herein disclosed systems and methods, a third-party may tune a media presentation in a streaming service by selecting a certain detail in the audio/visual stream, and thereby drive a viewer's interest towards a given direction. The selection may be done manually by a content editor/director, or based on an automatic contextual tool.

[0034] In some embodiments, using the herein disclosed systems and methods, a zoomed target may be traced automatically using contextual information of the captured content. For example, when a content capturing tool is collecting location information of all targets in the audio/visual image, the presentation may concentrate and zoom in on a desired target.

[0035] In some embodiments, using the herein disclosed systems and methods, tuning of existing content may be accompanied with augmentation of third-party content in a presentation based on the selected area and contextual information existing in the content.

[0036] In some embodiments, using the herein disclosed systems and methods, a zoomed area that is selected from the video stream may sound sharp with (dry) directional cues, while the rest of the audio image (e.g., corresponding objects not seen in the zoomed video) may become more ambient without clear audible location information (e.g., direction of arrival), thus automatically driving the viewer's attention in the desired direction and creating an artificial "cocktail party effect".

[0037] In some embodiments, using the herein disclosed systems and methods, a content server may create a new version of a video stream containing only the zoomed area. In addition, the server may compose a new stereo or multi-channel audio signal that is focused to the zoomed area. Thus, the content identified in an MPD or other manifest file may include both the full and zoomed presentations.

Audio image processing at a content server.

[0038] In some embodiments, a streaming media server tunes natural camera-captured content from live streaming clients or VOD services and creates an additional media presentation for the same or other streaming clients. A content editor/director, or an automatic contextual editor tool, in some embodiment, processes the received natural media stream to meet the expected requirements. By the audio image processing, consumers receive improved and focused media streams. Alternatively, providers (or clients) may simply edit an ongoing live stream. [0039] In some embodiments, the server collects audio/visual content and creates a new user experience by editing an alternative stream, for example, by zooming the visual content, thereby creating more options for viewers.

[0040] An overview of one embodiment of a system architecture for content tuning and mixing is shown in FIG. 1. A media content server 10 (or streaming media server) may receive media streams from recording live-streaming clients 20 or from VOD servers 30 and may prepare the content for live-streaming. The server may access the content, and create new content, for example by zooming the audio/visual image and/or augmenting new components to the stream. The composed media including the original stream as well as the tuned, and in some instances augmented, content may be collected in segments and bundled into an MPD file. Receiving clients 40 may then apply, for example, MPEG DASH protocol 50 to stream the content from the server and render the presentation for the user.

[0041] One function of the media server 10 may be to bundle the new content streams as additional content identified in the MPD file. Thus, a receiving client may have an option to stream the original content or the tuned and/or augmented content. In this case, the server creates a segment representing a zoomed version of the audio/visual content and adds the segment information to the MPD. Receivers 40 will then have an option to retrieve and view either the original or the tuned content.

[0042] The media content server 10 in FIG. 1 may conduct the video stream zooming according to predetermined instructions from an external party. A content editor/director 60 may interact with the content stream through a third-party API. The third-party may have access to the content itself, as well as all possible contextual information from accompanied sensor data.

[0043] A zooming operation may be steered by a human director or it may be based on contextual analysis of the content stream itself. The server 10 may, for example, have a special task to trace a certain object in the audio/visual stream.

[0044] An overall process flow of one embodiment of the content editing is depicted in FIG. 2. The streaming client may capture the audio/visual content and may also extract contextual information about the objects appearing in the content 210. For example, a capturing device may also retrieve the location information of targets. The content may also be obtained from a VOD service. In both cases, the content may be a camera and/or microphone captured natural audio/visual stream, and collected context information about the environment. The audio/visual content and accompanying contextual metadata may be forwarded to the live-streaming server at step 220 for distribution to streaming clients. At this point, the server may have the potential to improve user experiences and tune the content. The metadata regarding the target context may be applied by the streaming service.

[0045] As shown in step 230, an external party 240, such as a content editor/director 250 or even a contextual content analysis tool, may analyze the content and select an area or direction in the presentation that is cropped from the stream. The selection of the area and direction of interest, e.g., the zoomed area 260, may be performed based on the visual and/or audio content. The context information, such as a target location relative to the camera, may also drive the zoomed area selection.

[0046] The zooming action instructions, such as vertical and horizontal angles of the selected area, may be returned to the server through the third-party API, in some embodiments. The server may then conduct the zooming operation 270 for both visual and audio content. That is, the content editor/director or contextual tool chooses the view coordinates after which the server does the zooming operation.

[0047] In one embodiment, the server then segments and identifies new zoomed content in an MPD file, together with the original content in step 280. Thus, the live-streaming (or VOD) content may be available for streaming at step 290, for example with the MPEG DASH protocol.

[0048] The zooming operation provides information about the view coordinates, as well as the horizontal and vertical angles of the selected area in the view. This information is used for audio/visual signal processing. Audio signal processing may be conducted in the time-frequency domain to extract information regarding both time evolving and frequency spectrum related phenomena. The input for the spatial filtering is the stereo or multi-channel audio signal captured with a microphone array, or the like. The processing may define the zooming operation of the presentation, and thus the filtering input may comprise the view angles (horizontal and vertical) of an identified spatial region (zoomed region).

[0049] Analysis in time-frequency domain. FIG. 3 illustrates one embodiment of time- frequency domain analysis of a multi-channel signal. In the embodiment illustrated in FIG. 3, two input channels 310 and 320 (stereo sound) are filtered with low-pass (H₀(^z)) 311 , 321 , respectively, and high-pass (Hi (z)) filters 312, 322, respectively, and then down sampled by two, 313, 314, and 323 and 324, respectively, to maintain as constant the overall number of samples. As one of skill in the art will appreciate, the filter bank can be scaled according to the number of input channels. In one embodiment, a filter bank may include several stages of band splitting filters to achieve sufficient frequency resolution. The filter configuration may also differ to achieve a non-uniform band split. When the resulting band limited signals are segmented, the result is a time- frequency domain parameterization of the original time series signal. The resulting signal components limited by frequency and time slots are considered as the time-frequency parameterization of the original signal.

[0050] In some embodiments, an aim of the analysis filtering bank of FIG. 3 is to classify the incoming "audio image" into a first area within the zoomed area and a second area outside of the zoomed area. The selection is conducted by analyzing, as shown in block 330, the spatial location of sound sources and the presence of ambient sounds without any clear direction of arrival against the zooming area coordinates. The classified parameters in the time-frequency domain 340 may then be further filtered with spatial filtering to enable the zooming effect. [0051] Alternatively, the sub band domain processing 330 may be performed in the Discrete Fourier Transform (DFT) domain using a Short Term Fourier Transform (STFT) method. In some cases, such processing may be preferable as the complex domain transform domain parameters may be easier to analyze and manipulate regarding the level and phase shift.

[0052] There are several methods available for estimating the direction of arrival of a sound source. For example, different beam forming methods search the level and time difference of different channels in each time-frequency slot. In addition, parametric methods, such as a binaural cue coding (BCC) method, may efficiently be used to extract location (direction of arrival) cues of the sound source. The BCC method uses signal coherence parameterization, which is used for classifying sound sources based on their diffuseness.

[0053] Conventional BCC analysis comprises computation of inter-channel level difference (ILD), inter- channel time difference (ITD), and inter-channel coherence (ICC) parameters estimated within each transform domain time-frequency slot, i.e., in each frequency band of each input frame. In cases where the multi-channel audio signal comprises of a stereo signal or a binaural signal on two channels, the BCC cues are determined between decomposed left and right channels.

[0054] The ICC parameter may be utilized for capturing the ambient components that are not correlated with the "dry" sound components represented by phase and magnitude parameters. Thus, the coherence cue represents the diffuseness of the signal component. High coherence values indicate that the sound source is point like with an accurate direction of arrival, whereas low coherence represents a diffuse sound that does not have any clear direction of arrival. For example, reverberation and reflected signals coming from many different directions typically have low coherence.

Audio image zooming.

[0055] An exemplary audio zooming process may include spatial filtering of the time-frequency domain parameters based on the zooming details. One embodiment of an overall process is illustrated in FIG. 4. The zooming operation is conducted after the signal is decomposed in the time-frequency domain 410. The process includes classification of the parameters based on their location relative to the zoomed area in the audio image 420. The processing of the parameters in spatial filtering depends on the classification. Parameters classified within the zoomed area are focused in process step 430 by reducing their diffuseness through reducing level and time difference variations in the time-frequency domain. Outside the zoomed area, diffuseness is increased by adding random variations and reverberation to decorrelate parameters outside the zoomed area. Such decorrelation makes the sound source appear more ambient.

[0056] FIG. 5 illustrates one embodiment of a filtering operation 510 of sub band domain time-frequency parameters and the details for the zoomed area. The content editor/director or contextual editing tool may provide the control information for the zooming. In the first phase, the zooming information drives the classification of the sub band domain audio parameterization. In the second phase, the spatial filtering conducts the audio zooming effect. The output of the process 450 is the focused audio within, and ambient audio image outside the identified spatial region (zoomed region).

[0057] Audio image classification. Audio image classification is conducted by estimating the direction of arrival cues and diffuseness in each time-frequency defined area (slot). When the parameterization in a given slot indicates a sound source with a direction of arrival in the identified spatial region (zoomed region), the slot is classified as zoomed content. All other slots may be classified as out-of-focus parameterization. Thus, input x(z) 520 and zoomed area details 530 enable sub band filtering to produce y(z) 540.

[0058] FIG. 6 illustrates an embodiment of audio classification in the time-frequency domain. The output of the filter bank is presented in the time-frequency domain. FIG. 6 indicates parameters limited by the frequency 610 and time 630 axes to identify "slots" 620 shaded as solid, hashed or null.

[0059] The direction of arrival analysis using, for example, the BCC method of the parameters classifies the slots as within the zoomed region (solid slots) or as out-of-focus parametrizations (hashed slots) that contain directional components outside the zoomed region or ambient sound (null slots) without any particular directional cues.

[0060] The classified content is then processed to enable the audio zooming functionality. The audio zooming may serve to focus a user's attention on the zoomed region of the audio image, and reduce the emphasis of content outside the zoomed area.

[0061] Audio zooming process. In various embodiments, the herein disclosed systems and methods connect the audio processing to the video zooming operation. When the visual content is zoomed into a certain area, the "audio image" is also focused in that area. The time-frequency domain classification discussed above operates to decompose the signal into the zoomed region, the outside area, and ambient sounds. The audio zooming then focuses the user experience on the identified spatial region (zoomed region) of the image.

[0062] In an exemplary process, the time and level differences and their variation in each time- frequency parameter are analyzed. Determining the variation of phase and level difference, e.g., in a discrete Fourier transform (DFT) domain is then performed. The information reveals the diffuseness of the audio image within and outside the identified spatial region (zoomed region).

[0063] In one exemplary process, the parameters classified within the identified spatial region (zoomed area) are focused by reducing the time and level differences. A variance of the parameters can also be reduced by averaging the values. This may be performed in the DFT domain when the amplitude of the parameters related to the level differences and the phase difference of complex value parameters refer to time difference. Manipulating these values reduces the variance and diffuseness. Effectively, the operation moves the sounds towards the center of the zoomed region, and makes them more coherent and focused. [0064] The parameters classified outside the identified spatial region (zoomed area) are decorrelated, such as by adding a random component to the time and level differences. For example, in one embodiment, adding a random component to the level and phase differences of DFT domain parameters decorrelates the area outside the identified spatial region (zoomed area).

[0065] The server may also add a reverberation effect by filtering the parameters with an appropriate transform domain filter. When the signal is, for example, in the DFT domain, the reverberation filtering may be conducted by multiplying with an appropriate filter, which is also transformed in the DFT domain instead of a convolution operation in a time domain. In one embodiment, a decorrelation module may increase the reverberation effect, as well as random time and level differences, the further away the audio components were classified from the identified spatial region (zoomed region).

[0066] In one embodiment, outside the identified spatial region (zoomed region), random time and level differences and reverberation may be applied to time-frequency slots that have "dry" components (with distinct direction of arrival cues). There is generally no need to manipulate ambient sounds that already lack a location cue/directional parameter.

[0067] As a result of the processing, sounds outside the identified spatial region (zoomed region) may be made more diffuse so as to lack any particular direction of arrival. The sounds outside the identified spatial region (zoomed region) thus become more ambient.

[0068] One embodiment of the zooming process for the audio image zooming is illustrated in FIG. 7. As shown, classification of the audio image parameters in the time-frequency domain 710 results in spatial filtering split into two branches, inside the zoomed region 720, and outside the zoomed region 730, such as based on the classification of slots as inside the identified spatial region (zoomed region) or outside the identified spatial region (zoomed region). Thus, inside the identified spatial region (zoomed region) decorrelation is reduced 240. A next step is to make the zoomed area more focused and coherent, and sounds outside of the identified spatial region are decorrelated to generate ambient sound. Thus, the filtering operations are different for the different branches: the audio inside the identified spatial region is focused by reducing the a level and phase shift and averaging out the variation 750. As a result, focused and "dry" sounds (sounds with directional information) become more centered in the identified spatial region (zoomed region) 760.

[0069] Outside the identified spatial region, audio content is diffused by increasing decorrelation 770. One method of increasing decorrelation adding a random component, such as to level and phase shift parameters, thereby increasing random variations, and adding reverberation in the time-frequency domain 780. Ambient sounds in the audio content are already diffuse and do not need to have phase shift processing such as adding randomization or reverberation. After the filtering operations, the branches may be transformed back to the time domain 794. As appreciated by one of skill in the art, "random" variations incudes pseudo-random variations, including predetermined pseudo-random variations.

Zoomed audio presentation.

[0070] The filtered sub-band domain signal may be transformed back to the time domain using a synthesis filter bank. FIG. 8A illustrates one embodiment of sampled filter bank structure which reconstructs the decomposition of the analysis filter bank signal from FIG. 3 back to the time domain. Thus, sub band domain filtering 810 is up sampled by 2 in blocks 812, 814 for output yi(z) 830 and 822, 824 for output y2(z) 840. As shown, inverse function G(z) 832, 834 for output yi (z) 830 and 843, 844 for outputy₂(z) 840 combine to produce the two time domain audio signals for a stereo-type signal. As will be appreciated, FIG. 8A can be scaled for multi-channel audio signals.

[0071] FIG. 8B illustrates a block diagram for one embodiment of combined analysis, directional filtering, and synthesis, using the modules discussed in relation to FIGS. 3, 5, and 8A.

[0072] The processed audio stream is segmented and the access details can be added to the MPD file to complete the media presentation with the zoomed content. Information identifying the associated video portion is also included in the MPD.

[0073] As a result of the discussed processes, the streaming media server, such as the server shown in FIG. 1 , is able to provide live or VOD content in both original and zoomed versions. A receiving user may select between the two using the details in the MPD file.

[0074] In some embodiments, more than one zoomed region may exist for given video content. In such instances, each zoomed region may be processed as discussed above, and then the plurality of zoomed regions may be included in the MPD file.

[0075] Audio classification functionality. In an exemplary embodiment, the audio classification falls into two classes: components having directional components within the identified spatial areas (zoomed area) (e.g., direction of interest), and the rest. However, the classification may be further separated. For example, components that are not in the zoomed area may be classified as either ambient sounds or components having directional cues. In such a case, the "blurring" and decorrelation operation may be conducted only on the components having directional cues, or "dry" areas.

[0076] User control. In some additional embodiments of the disclosed systems and methods, a receiving user may have control of the presentation and may steer a zooming target, and thus control the content tuning. In such cases the receiving user may, for example, use head tracking and a post filtering type of audio image control to "zoom in" on the presentation. In such a case, the focused audio image of this solution is driving the user attention towards the zoomed area and emphasizes the target in the video stream. [0077] User control may further be included in the streaming server as additional contextual information. There is valuable information about the user interests. Hence, the data is available e.g. for third party content augmentation.

[0078] Target selection. The receiving user may, in some embodiments, select a desired target on the screen. When the contextual information, such as location, is provided for the objects appearing in the captured content (and in the area covered by the recording device), the user may tap the object on the screen after which the consumer application may determine the selected object or the area on the screen and return the information to the streaming server. The server may lock on the target and create a new (or additional) zoomed stream that is focused on the selected object or area.

[0079] One embodiment of such target selection is illustrated in relation to FIG. 9, which depicts a sequence diagram for live operation of audio zooming with user target selection. More particularly,

[0080] In some embodiments, camera captured streaming and VOD services may enable new functionalities of managing a video stream by zooming in on a particular location or following a particular visual object in the video stream.

[0081] In one embodiment, a media server may create an alternative presentation for a live video stream. The content may be available at the server for streaming, such as using MPEG DASH protocol. For example, in one embodiment, the video content may be constructed in a MPD manifestation file that contains details for streaming the content. After a portion of interest or target is selected, the server may create a zoomed version of the video stream, such as by following the particular area or section of the image or the particular target in the stream.

[0082] Audio visual zooming. In an exemplary embodiment, a content director or automatic contextual editor tool may crop a part of a video presentation and compose a new video stream. FIGS. 10-12 illustrate particular stages in the process of the exemplary embodiment.

[0083] As shown in FIG. 10, a user or content editor may make a target selection for a given video stream. As illustrated in FIG. 10, the target zoomed region may be the area within the box of the larger video feed. In some instances, the selection may be performed from the video stream only. In the exemplary embodiment, the content editor may follow an object in the visual stream and maintain the zoomed area. In some cases, automatic tools may be used. Alternatively, in additional embodiments, the captured (or VOD) content may have contextual metadata associated with the absolute positions of objects appearing in the video. From this, by combining the context with the information about the camera location objects may be pinpointed in the visual stream, and video zooming to a given object may be performed automatically. For example, each individual player in an NFL game may be instrumented with location tracking, and by adding this information as metadata a content stream covering the stadium may enable the streaming server to trace any object (e.g., player) on the stream. [0084] Once a zoomed area is selected, the server may create a new visual adaptation set in the MPD file based on the zoomed region. An exemplary resulting video stream is illustrated in FIG. 11 , for the selected region based on the original video stream of FIG. 10.

[0085] The server may also create a new audio stream that matches the zoomed video presentation, such that the audio experience reflects the new presentation. For example, a viewer may be guided to the zoomed area or target with the help of the processed audio.

[0086] As disclosed herein above, an immersive audio image may be focused within the zoomed region while the remaining area is processed to sound more like ambient or background noise. Possible sound sources outside the zoomed area are still present in the audio image but the viewer (listener) is not able to trace their location as in the full presentation (e.g., point sound sources are diffused). FIG. 12 illustrates an exemplary effect in the audio image, e.g., how a user hears the audio image. In the visual representation of FIG. 12, treating optical focus as comparable to the aural focus, only the area of the zoomed area is in focus while the remainder is "blurred." Generally, distinct audio sources are not distracting to a viewer when they are not easily recognizable. As such, the viewer's focus is also drawn to the zoomed area by the audio image.

[0087] Artificial "cocktail party effect." Alternatively, in some embodiments, the server may apply only the audio zooming and possibly augment artificial visual cues to emphasize a certain area in the visual stream. In such a case, the target may by highlighted with an augmented frame, such as in FIG. 10. The corresponding audio stream may be zoomed to drive the user's attention to the selected target, e.g., the audio image may appear as in FIG. 12. The audio zooming may introduce an "artificial cocktail party effect" to the audio presentation. The user is influenced to concentrate on the target when the rest of the image is "blurred," without clear details (e.g., clear alternative point sound sources) that the user may follow.

[0088] Exemplary embodiments disclosed herein are implemented using one or more wired and/or wireless network nodes, such as a wireless transmit/receive unit (WTRU) or other network entity.

[0089] FIG. 13 is a system diagram of an exemplary WTRU 3102, which may be employed as a server or user device in embodiments described herein. As shown in FIG. 13, the WTRU 3102 may include a processor 3118, a communication interface 3119 including a transceiver 3120, a transmit/receive element 3122, a speaker/microphone 3124, a keypad 3126, a display/touchpad 3128, a non-removable memory 3130, a removable memory 3132, a power source 3134, a global positioning system (GPS) chipset 3136, and sensors 3138. It will be appreciated that the WTRU 3102 may include any sub-combination of the foregoing elements while remaining consistent with an embodiment.

[0090] The processor 3118 may be a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific

Integrated Circuits (ASICs), Field Programmable Gate Array (FPGAs) circuits, any other type of integrated circuit (IC), a state machine, and the like. The processor 3118 may perform signal coding, data processing, power control, input/output processing, and/or any other functionality that enables the WTRU 3102 to operate in a wireless environment. The processor 3118 may be coupled to the transceiver 3120, which may be coupled to the transmit/receive element 3122. While FIG. 13 depicts the processor 3118 and the transceiver 3120 as separate components, it will be appreciated that the processor 3118 and the transceiver 3120 may be integrated together in an electronic package or chip.

[0091] The transmit/receive element 3122 may be configured to transmit signals to, or receive signals from, a base station over the air interface 3116. For example, in one embodiment, the transmit/receive element 3122 may be an antenna configured to transmit and/or receive RF signals. In another embodiment, the transmit/receive element 3122 may be an emitter/detector configured to transmit and/or receive IR, UV, or visible light signals, as examples. In yet another embodiment, the transmit/receive element 3122 may be configured to transmit and receive both RF and light signals. It will be appreciated that the transmit/receive element 3122 may be configured to transmit and/or receive any combination of wireless signals.

[0092] In addition, although the transmit/receive element 3122 is depicted in FIG. 13 as a single element, the WTRU 102 may include any number of transmit/receive elements 3122. More specifically, the WTRU 3102 may employ MIMO technology. Thus, in one embodiment, the WTRU 3102 may include two or more transmit/receive elements 3122 (e.g., multiple antennas) for transmitting and receiving wireless signals over the air interface 3116.

[0093] The transceiver 3120 may be configured to modulate the signals that are to be transmitted by the transmit/receive element 3122 and to demodulate the signals that are received by the transmit/receive element 3122. As noted above, the WTRU 3102 may have multi-mode capabilities. Thus, the transceiver 3120 may include multiple transceivers for enabling the WTRU 3102 to communicate via multiple RATs, such as UTRA and IEEE 802.11, as examples.

[0094] The processor 3118 of the WTRU 3102 may be coupled to, and may receive user input data from, the speaker/microphone 3124, the keypad 3126, and/or the display/touchpad 3128 (e.g., a liquid crystal display (LCD) display unit or organic light-emitting diode (OLED) display unit). The processor 3118 may also output user data to the speaker/microphone 3124, the keypad 3126, and/or the display/touchpad 3128. In addition, the processor 3118 may access information from, and store data in, any type of suitable memory, such as the non-removable memory 3130 and/or the removable memory 3132. The non-removable memory

3130 may include random-access memory (RAM), read-only memory (ROM), a hard disk, or any other type of memory storage device. The removable memory 3132 may include a subscriber identity module (SIM) card, a memory stick, a secure digital (SD) memory card, and the like. In other embodiments, the processor

3118 may access information from, and store data in, memory that is not physically located on the WTRU

3102, such as on a server or a home computer (not shown). [0095] The processor 3118 may receive power from the power source 3134, and may be configured to distribute and/or control the power to the other components in the WTRU 3102. The power source 3134 may be any suitable device for powering the WTRU 3102. As examples, the power source 3134 may include one or more dry cell batteries (e.g., nickel-cadmium (NiCd), nickel-zinc (NiZn), nickel metal hydride (NiMH), lithium-ion (Li-ion), and the like), solar cells, fuel cells, and the like.

[0096] The processor 3118 may also be coupled to the GPS chipset 3136, which may be configured to provide location information (e.g., longitude and latitude) regarding the current location of the WTRU 3102. In addition to, or in lieu of, the information from the GPS chipset 3136, the WTRU 3102 may receive location information over the air interface 3116 from a base station and/or determine its location based on the timing of the signals being received from two or more nearby base stations. It will be appreciated that the WTRU 3102 may acquire location information by way of any suitable location-determination method while remaining consistent with an embodiment.

[0097] The processor 3118 may further be coupled to other peripherals 3138, which may include one or more software and/or hardware modules that provide additional features, functionality and/or wired or wireless connectivity. For example, the peripherals 3138 may include sensors such as an accelerometer, an e-compass, a satellite transceiver, a digital camera (for photographs or video), a universal serial bus (USB) port, a vibration device, a television transceiver, a hands free headset, a Bluetooth® module, a frequency modulated (FM) radio unit, a digital music player, a media player, a video game player module, an Internet browser, and the like.

[0098] FIG. 14 depicts an exemplary network entity 4190 that may be used in embodiments of the present disclosure. As depicted in FIG. 14, network entity 4190 includes a communication interface 4192, a processor 4194, and non-transitory data storage 4196, all of which are communicatively linked by a bus, network, or other communication path 4198.

[0099] Communication interface 4192 may include one or more wired communication interfaces and/or one or more wireless-communication interfaces. With respect to wired communication, communication interface 4192 may include one or more interfaces such as Ethernet interfaces, as an example. With respect to wireless communication, communication interface 4192 may include components such as one or more antennae, one or more transceivers/chipsets designed and configured for one or more types of wireless (e.g.,

LTE) communication, and/or any other components deemed suitable by those of skill in the relevant art. And further with respect to wireless communication, communication interface 4192 may be equipped at a scale and with a configuration appropriate for acting on the network side— as opposed to the client side— of wireless communications (e.g., LTE communications, Wi-Fi communications, and the like). Thus, communication interface 4192 may include the appropriate equipment and circuitry (perhaps including multiple transceivers) for serving multiple mobile stations, UEs, or other access terminals in a coverage area. [0100] Processor 4194 may include one or more processors of any type deemed suitable by those of skill in the relevant art, some examples including a general-purpose microprocessor and a dedicated DSP.

[0101] Data storage 4196 may take the form of any non-transitory computer-readable medium or combination of such media, some examples including flash memory, read-only memory (ROM), and random- access memory (RAM) to name but a few, as any one or more types of non-transitory data storage deemed suitable by those of skill in the relevant art could be used. As depicted in FIG. 14, data storage 4196 contains program instructions 4197 executable by processor 4194 for carrying out various combinations of the various network-entity functions described herein.

[0102] Although features and elements are described above in particular combinations, one of ordinary skill in the art will appreciate that each feature or element can be used alone or in any combination with the other features and elements. In addition, the methods described herein may be implemented in a computer program, software, or firmware incorporated in a computer-readable medium for execution by a computer or processor. Examples of computer-readable storage media include, but are not limited to, a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks, and digital versatile disks (DVDs). A processor in association with software may be used to implement a radio frequency transceiver for use in a WTRU, UE, terminal, base station, RNC, or any host computer.

Claims

CLAIMS What is Claimed:

1. A method, comprising:

accessing, at a server, a primary audio and video stream;

preparing, at the server, a custom audio stream that enhances audio content associated with an identified spatial region of the primary video stream by:

classifying portions of audio content of the primary audio stream as either inside or outside the identified spatial region via spatial rendering of the primary audio stream, including:

determining a plurality coherence parameters identifying a diffuseness associated with the one or more sound sources in the primary audio stream; and determining a plurality of directional parameters associated with the primary audio stream; and

filtering the primary audio stream using the plurality of coherence parameters and the plurality of directional parameters.

2. The method of claim 1 wherein classifying portions of audio content of the primary audio stream as either inside or outside the identified spatial region via spatial rendering of the primary audio stream further comprises:

analyzing audio content of the primary audio stream to determine the plurality of directional parameters of the primary audio stream and the plurality of coherence parameters; and

extracting the identified spatial region using the plurality of coherence parameters by identifying higher value coherence parameters with lower diffuseness of the one or more sound sources, the lower diffuseness identifying sound sources within the identified spatial region.

3. The method of claim 1 , wherein the preparing, at the server, the custom audio stream that enhances audio content associated with the identified spatial region of the primary video stream further comprises:

adjusting the audio content outside the identified spatial region by one or more of:

introducing random phase shifts to the audio content outside the identified spatial region; introducing random level shifts to the audio content outside the identified spatial region; and

adding reverberation to the audio content outside the identified spatial region.

4. The method of claim 1 , wherein the preparing, at the server, the custom audio stream that enhances audio content associated with the identified spatial region of the primary video stream further comprises reducing decorrelation of one or more of the plurality of coherence and directional parameters identified as audio content inside the identified spatial region by at least one of:

reducing phase shifts to the audio content classified as inside the spatial region;

reducing level shifts to the audio content classified as inside the spatial region; and averaging out variations of the audio content classified as inside the spatial region.

5. The method of claim 1 , wherein preparing the custom video stream further comprises receiving a description of objects of interest comprising:

collecting object tracking sensor data for a plurality of objects in the primary audio and video stream;

computing a location of one or more objects of interest in a frame of the primary audio and video stream by fusing camera location and sensor data; and

determining the identified spatial region based on the computed location.

6. The method of claim 1 , wherein the audio content associated with the identified spatial region is one or more of selected at the server, received from a third party, or received at the server from a remote streaming client.

7. The method of claim 1 , further comprising streaming the custom video stream and custom audio stream to a remote streaming client responsive to a request from the remote streaming client.

8. A method comprising:

accessing, at a server, a first media stream comprising a first audio stream and a first video stream;

determining a first spatial region of the first video stream;

determining a first zoomed audio region of the first audio stream associated with the first spatial region of the first video stream;

generating a focused audio stream based at least in part on processing of the first zoomed audio region of the first audio stream via spatial rendering of the first audio stream in time and frequency, including:

determining a plurality of coherence parameters as a measure of diffuseness associated with one or more sound sources in the first audio stream; and determining a plurality of directional parameters associated with the first audio stream; and

generating a custom media stream based on the focused audio stream and a first custom video stream based at least in part on the first video stream; and

streaming the custom media stream from the server to a first receiving client.

9. The method of claim 8, wherein the first media stream is received at the server from one or more of a first live content capture client and a video on demand server, the first media stream including client video data corresponding to a user-selected region of interest and audio data corresponding to the selected region of interest.

10. The method of claim 8, wherein determining the first spatial region comprises receiving an indication of the first spatial region from an editing client, wherein the editing client is one or more of a content editor and an end user.

11. The method of claim 8, wherein generating the focused audio stream based at least in part on processing of the first zoomed audio region of the first audio stream via spatial rendering of the first audio stream further comprises:

processing the audio content outside the first zoomed audio region to reduce the directionality; and

processing the audio content inside the first zoomed audio region to increase directionality.

12. The method of claim 11 , wherein processing the audio content outside the first zoomed audio region to reduce the directionality includes increasing decorrelation by one or more of:

introducing random phase shifts to the audio content outside the first zoomed audio region; introducing random level shifts to the audio content outside the first zoomed audio region; and

adding reverberation to the audio content outside the first zoomed audio region.

13. The method of claim 11 , wherein processing the audio content inside the first zoomed audio region to increase directionality comprises reducing decorrelation by one or more of:

reducing phase shifts to parameters of said audio portions;

reducing level shifts to the audio portions; and

averaging out variations of the audio portions.

The method of claim 8, wherein generating the focused audio stream further comprises: performing the spatial rendering of the first audio stream using an analysis filter bank to parameterize the audio stream into a plurality defined areas; and

identifying each of the plurality of defined areas according to the plurality of coherence values as the measure of diffuseness of and the plurality of directional values; and

focusing the audio stream by altering one or more of the plurality of time-frequency defined areas as a function of the measure of diffuseness and the plurality of directional values.

15. A system comprising a processor and a non-transitory storage medium storing instructions operative, when executed on the processor, to perform functions including:

accessing, at a server, a primary audio and video stream;

determining a measure of diffuseness associated with one or more sources of sound of the primary audio stream; and

determining a plurality of directional parameters associated with the primary audio stream; and

processing the primary audio stream using the measure of diffuseness and the plurality of directional parameters.