KR20250069593A

KR20250069593A - Head-tracked split rendering and head-related transfer function personalization

Info

Publication number: KR20250069593A
Application number: KR1020257011661A
Authority: KR
Inventors: Stefan Bruhn; Rishabh Tyagi
Original assignee: Dolby Laboratories Licensing Corporation; Dolby International AB
Priority date: 2022-09-12
Filing date: 2023-09-11
Publication date: 2025-05-19
Also published as: TW202429915A; WO2024059505A1; EP4588255A1; JP2025531871A; CN119895899A

Abstract

Systems, methods, and computer program products for direction-of-arrival (DOA) based head-tracked split rendering and head-related transfer function (HRTF) personalization are described. Head-tracked audio rendering is split between two devices. A first device receives a main bitstream representation of encoded audio. A second device tracks head pose information. The first device decodes the main bitstream using a main decoder and encodes the decoded bitstream into pre-rendered binaural signals and post-rendering metadata. The second device decodes the pre-rendered binaural signals and the post-renderer metadata from an intermediate bitstream and provides the decoded pre-rendered binaural signals and the post-renderer metadata to a lightweight renderer. The lightweight renderer renders the pre-rendered binaural signals into binaural audio based on the post-renderer metadata, the head pose information, a generic HRTF, and a personalized HRTF.

Description

Head-tracked split rendering and head-related transfer function personalization

Cross-reference to related applications

This application claims the benefit of U.S. Provisional Application No. 63/405,538, filed September 12, 2022, and U.S. Provisional Application No. 63/422,331, filed November 3, 2022, each of which is hereby incorporated by reference in its entirety.

Field of the invention

This disclosure relates to audio processing. In particular, this disclosure relates to audio rendering.

Extended reality (XR) (AR/MR/VR) will increasingly rely on very power-constrained end devices. Augmented reality (AR) glasses are a prime example. To be as lightweight as possible, they cannot be equipped with heavy batteries. Therefore, to enable reasonable operating times, the processors contained within them can only perform numerical operations of very constrained complexity. Immersive audio, on the other hand, is an essential media component of XR services. Such services can typically support adjusting the presented immersive audio/visual scene in response to 3DoF or 6DoF user (head) movements. Performing the corresponding immersive audio renditions at high quality typically requires high numerical complexity.

One potential solution to this problem is to perform the rendering not on the device itself, but on some entity in the mobile/wireless network to which the end device is connected, or on a powerful mobile user equipment (UE) to which the end device is tethered. In such a case, the end device would, for example, only receive audio that has already been binaurally rendered. The 3DoF/6DoF head pose information (head tracking metadata) would have to be transmitted to the rendering entity (network entity/UE). The problem with this approach is the transmission latency between the end device and the network entity/UE, which can be on the order of 100 ms or more. Performing the rendering on the network entity/UE would therefore mean relying on outdated head tracking metadata, so that the binauralized audio played by the end device would not match the actual head pose of the head/end device. This latency is referred to as motion-to-sound latency. If this latency is too large, end users will perceive it as a degradation in quality.

For the video component of immersive media rendering, this problem is being addressed by a split rendering approach, in which an approximate part of the video scene is rendered by the network entity/UE and the final video scene adjustment is performed on the end device. For audio, this area is currently less studied.

In various audio services, for example immersive voice and audio services (IVAS), it is desirable to track the user's head movements during audio rendering and adjust the audio accordingly, so as to provide an immersive audio experience to the user. This requires immersive audio decoding and binaural rendering using a set of head-related transfer functions (HRTFs), where the choice of a particular HRTF may depend on the properties of the immersive audio signal and the user's head movements (or head pose). Depending on the immersive audio format, decoding and head-tracked binaural rendering can be computationally complex operations. For example, scene-based audio (e.g., higher-order Ambisonics), channel-based audio (e.g., with a 7.1.4 channel layout), or object-based audio with many objects may each rely on a very large number of constituent audio components, which makes decoding and rendering computationally complex. Decoding the bitstream and binaurally rendering it in response to the user's head movements therefore requires a large amount of computation. This computational complexity requires power and generates heat, which can be a problem for small, portable devices such as AR glasses.

An object of the present invention is to solve the problems described herein and to provide split rendering in which head-pose-specific processing can be performed on a second device.

According to some implementations, this and other objects are achieved by a method according to claim 1 or claim 14. According to other implementations, this and other objects are achieved by a user-held device according to claim 22.

Techniques for direction-of-arrival (DOA) based head-tracked split rendering and head-related transfer function (HRTF) personalization are described. Head-tracked audio decoding and binaural rendering can be split between two or more devices. In some examples, a first device can coordinate the split decoding and rendering operations with a second device. The first device, e.g., a smartphone, receives a main bitstream representation of encoded audio. The first device decodes the main bitstream and renders it into pre-rendered binaural signals using a main decoder and a binaural renderer, and encodes the pre-rendered binaural signals together with post-rendering metadata that includes information about the HRTFs associated with the binaural rendering. The first device provides the pre-rendered binaural signals and the post-renderer metadata to the second device as a multiplexed intermediate bitstream. The second device, e.g., headphones, AR glasses, or earbuds, tracks current head pose information. The second device decodes the pre-rendered binaural signals and the post-renderer metadata from the intermediate bitstream, and provides the decoded pre-rendered binaural signals and the post-renderer metadata to a lightweight renderer. The lightweight renderer renders the pre-rendered binaural signals into binaural audio based on the post-renderer metadata, the current head pose information, a generic HRTF, and optionally a personalized HRTF.

The post-rendering metadata includes at least an indication of the pre-rendering HRTF used for the binaural pre-rendering. The pre-rendering HRTF is associated with the direction of arrival (DOA) (typically two angles) of the dominant directional component of the audio content, relative to an assumed head pose. The indication of the pre-rendering HRTF can be the DOA or some kind of index that allows the user-held device to identify the correct HRTF.
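As a rough illustration of how a user-held device might resolve such an indication, the following is a minimal sketch, not the format defined by this disclosure. It assumes a small hypothetical table of generic HRTFs, each reduced to one complex gain per ear for a single frequency band, and an indication that carries either a table index or a DOA given as (azimuth, elevation) in degrees. The names `HrtfIndication`, `GENERIC_HRTF_TABLE`, and `resolve_hrtf` are invented for this example, and the gain values are placeholders rather than measured data.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical generic HRTF table for one frequency band:
# ((azimuth_deg, elevation_deg), (left_gain, right_gain)).
# The gains are placeholder values, not measured HRTF data.
GENERIC_HRTF_TABLE = [
    ((0.0, 0.0), (1.0 + 0.0j, 1.0 + 0.0j)),
    ((90.0, 0.0), (1.4 + 0.2j, 0.5 - 0.1j)),
    ((180.0, 0.0), (0.8 + 0.0j, 0.8 + 0.0j)),
    ((-90.0, 0.0), (0.5 - 0.1j, 1.4 + 0.2j)),
]


@dataclass
class HrtfIndication:
    """Assumed metadata field: either a DOA or a table index is set."""
    doa: Optional[Tuple[float, float]] = None  # (azimuth, elevation) in degrees
    index: Optional[int] = None


def angular_distance(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Great-circle angle in radians between two (azimuth, elevation) directions."""
    az1, el1 = math.radians(a[0]), math.radians(a[1])
    az2, el2 = math.radians(b[0]), math.radians(b[1])
    cos_angle = (math.sin(el1) * math.sin(el2)
                 + math.cos(el1) * math.cos(el2) * math.cos(az1 - az2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cos_angle)))


def resolve_hrtf(indication: HrtfIndication) -> Tuple[complex, complex]:
    """Return the (left, right) gains identified by the indication."""
    if indication.index is not None:
        return GENERIC_HRTF_TABLE[indication.index][1]
    if indication.doa is None:
        raise ValueError("indication carries neither an index nor a DOA")
    # Pick the table entry whose direction is closest to the signalled DOA.
    nearest = min(GENERIC_HRTF_TABLE,
                  key=lambda entry: angular_distance(entry[0], indication.doa))
    return nearest[1]
```

Signalling an index keeps the metadata compact but requires both devices to share the same table; signalling the DOA lets the receiving device pick from whatever HRTF set it holds.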

In some implementations, the indication of the pre-rendering HRTF also includes one or more parameters that can be personalized.

The rendering may include applying, to the binaural audio signal, an HRTF compensation operation configured to compensate for the effect of the pre-rendering HRTF, so as to compute a compensated stereo audio signal, and applying a post-rendering HRTF to the compensated stereo signal to compute a binaural output signal. These steps may be performed as one single operation. The HRTF compensation operation may include an inverse of the pre-rendering HRTF, obtained, for example, by accessing a lookup table. Other ways of compensating for the pre-rendering HRTF are also possible.
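The compensation and post-rendering steps can be sketched for a single time/frequency tile as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: each HRTF is reduced to one complex gain per ear within the tile, the channels are processed independently (no linear combination), and the inverse is regularized with a small constant `eps`, which is a choice made for this example. All function names are hypothetical.

```python
from typing import Tuple

Gains = Tuple[complex, complex]  # (left, right) complex gains for one tile


def regularized_inverse(h: complex, eps: float = 1e-6) -> complex:
    """Approximate 1/h; eps (an assumption) keeps near-zero gains from blowing up."""
    return h.conjugate() / (abs(h) ** 2 + eps)


def post_render_tile(l1: complex, r1: complex,
                     pre_hrtf: Gains, post_hrtf: Gains,
                     eps: float = 1e-6) -> Gains:
    """Two-step form: compensate the pre-rendering HRTF, then apply the post-rendering HRTF."""
    # Step 1: HRTF compensation yields the compensated stereo signal.
    comp_l = l1 * regularized_inverse(pre_hrtf[0], eps)
    comp_r = r1 * regularized_inverse(pre_hrtf[1], eps)
    # Step 2: the post-rendering HRTF yields the binaural output.
    return post_hrtf[0] * comp_l, post_hrtf[1] * comp_r


def combined_gains(pre_hrtf: Gains, post_hrtf: Gains, eps: float = 1e-6) -> Gains:
    """Single-operation form: one gain per ear that does both steps at once."""
    return (post_hrtf[0] * regularized_inverse(pre_hrtf[0], eps),
            post_hrtf[1] * regularized_inverse(pre_hrtf[1], eps))
```

Because both steps are per-tile multiplications in this simplified model, they collapse into one gain per ear, which is one way to read the remark that the steps may be performed as a single operation.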

Binauralization is described herein as being performed using head-related transfer functions (HRTFs), but it can equally well be performed using binaural room impulse responses (BRIRs). It should also be noted that all HRTF processing must be performed for each time frame and each frequency band, often referred to as time/frequency tiles.

In some applications, the assumed head pose is also included in the metadata. In other implementations, the user-held device is configured to transmit the current head pose to the main device. Optionally, the second device encodes at least a portion of the head pose information into a head pose bitstream and provides this bitstream to the first device. The first device decodes the head pose bitstream to obtain the head pose information, and then applies the head pose information in the main decoder and pre-renderer. The main decoder/pre-renderer decodes and pre-renders the main bitstream based on the received head pose information (also referred to as the assumed head pose) and a generic HRTF. In this case, the user-held device can estimate the assumed head pose based on the expected transmission delay.
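One simple way to estimate the head pose at a later time from the expected transmission delay is linear extrapolation of recent tracker samples. The sketch below illustrates that idea only; the disclosure does not specify an estimation method in this passage. It assumes a yaw-only pose in degrees and a known expected delay in seconds, and the names `YawSample`, `wrap_deg`, and `predict_yaw` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class YawSample:
    """One head tracker reading (yaw only, an assumption for brevity)."""
    time_s: float
    yaw_deg: float


def wrap_deg(angle: float) -> float:
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def predict_yaw(prev: YawSample, curr: YawSample, expected_delay_s: float) -> float:
    """Extrapolate yaw linearly over the expected transmission delay."""
    dt = curr.time_s - prev.time_s
    if dt <= 0.0:
        # No usable rate estimate; fall back to the latest reading.
        return wrap_deg(curr.yaw_deg)
    # Use the wrapped difference so a crossing at +/-180 degrees is handled.
    rate_deg_per_s = wrap_deg(curr.yaw_deg - prev.yaw_deg) / dt
    return wrap_deg(curr.yaw_deg + rate_deg_per_s * expected_delay_s)
```

For example, two readings of 10 and 12 degrees taken 20 ms apart imply a rate of about 100 degrees per second, so with an expected delay of 100 ms the extrapolated yaw is about 22 degrees.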

Information about the assumed head pose is transmitted to the second device together with the other information, unless the second device derives the assumed head pose from prior knowledge. This prior knowledge may be based on (head pose) information previously transmitted to the first device or on an assumed head pose agreed in advance between the two devices.

Additionally, the present disclosure relates to a further inventive concept including techniques for DOA-based head-tracked split rendering using a suitable prototype signal and optional diffuse signals. The first device decodes the main bitstream using a main decoder, and renders the decoded bitstream as a dominant directional component, referred to as a prototype signal, zero or more diffuse signals, and post-rendering metadata. The first device then encodes the prototype signal, the zero or more diffuse signals (or parameters representing them), and the post-renderer metadata, and provides them as a multiplexed intermediate bitstream to the second device. The second device decodes the prototype signal, the zero or more diffuse signals, and the post-renderer metadata from the intermediate bitstream, and provides them to a lightweight renderer. The lightweight renderer renders the prototype signal and the zero or more diffuse signals into binaural audio based on the post-renderer metadata, information related to the head pose, a generic HRTF, and optionally a personalized HRTF.

The techniques described herein can achieve various technical advantages over conventional rendering techniques. Splitting the processing between two devices reduces processing on the wearable device and thereby extends its battery life. The wearable device performs lightweight rendering based on the user's current head pose instead of relying solely on a binaural rendition by a powerful rendering device that only has access to delayed/outdated head pose information, thereby reducing the motion-to-sound latency that results from the potential use of outdated head pose information during rendering. The allocation of processing between the devices can be flexible, for example by adjusting the amount of head pose information transmitted from the second device to the first device, from none at all to all of it, which allows matching to a variety of wearable devices with different processing capabilities. Other advantages, features, and benefits beyond those explicitly described above will be apparent from the detailed description below and the associated drawings.

This Summary is provided to introduce selected concepts in a simplified form. It is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. For example, the term "techniques" may refer to system(s), method(s), computer-readable instructions, module(s), algorithms, hardware logic, and/or operation(s), as permitted by the context described above and throughout this document.

FIG. 1 is a block diagram of an exemplary system implementing head-tracked split rendering.
FIG. 2 is a flowchart illustrating processing in a first or main device.
FIG. 3 is a flowchart illustrating processing in a second or user-held device.
FIG. 4 illustrates an exemplary technique for DOA-based split rendering using pre-rendered binaural signals.
FIG. 5 illustrates an exemplary technique for HRTF personalization.
FIG. 6 illustrates an exemplary technique for DOA-based split rendering using prototype signals.

In the following detailed description, reference is made to the accompanying drawings, which form a part of this specification and in which specific exemplary configurations in which the concepts may be practiced are shown by way of illustration. These configurations are described in sufficient detail to enable those skilled in the art to practice the techniques disclosed herein, and it is to be understood that other configurations may be utilized and other changes may be made without departing from the spirit or scope of the concepts presented. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the concepts presented is defined only by the appended claims.

The systems and methods disclosed in this application may be implemented in software, firmware, hardware, or a combination thereof. In a hardware implementation, the division of tasks does not necessarily correspond to a division into physical units; rather, a single physical component may have multiple functions, and a single task may be performed by several cooperating physical components.

The computer hardware may be, for example, a server computer, a client computer, a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a smartphone, a web appliance, a network router, a switch or bridge, or any machine capable of executing instructions (sequential or otherwise) that specify actions to be taken by that computer hardware. Furthermore, the present disclosure relates to any collection of computer hardware that individually or jointly executes instructions to perform any one or more of the concepts discussed herein.

Some or all of the components may be implemented by one or more processors that accept computer-readable (also called machine-readable) code containing a set of instructions that, when executed by one or more of the processors, carry out at least one of the methods described herein. Any processor capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken is included. Thus, one example is a typical processing system (e.g., computer hardware) that includes one or more processors. Each processor may include one or more of a CPU, a graphics processing unit, and a programmable DSP unit. The processing system may further include a memory subsystem including a hard drive, an SSD, RAM, and/or ROM. A bus subsystem may be included for communication between the components. The software may reside in the memory subsystem and/or within the processor during its execution by the computer system.

The one or more processors may operate as a standalone device or may be connected, e.g., networked, to other processor(s). Such a network may be built on a variety of different network protocols and may be the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), or any combination thereof.

The software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those skilled in the art, the term computer storage media includes both volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, various forms of physical (non-transitory) storage media, such as ROM, PROM, EPROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. Further, it is well known to those skilled in the art that communication media (transitory) typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and include any information delivery media.

The present disclosure assumes that an immersive audio codec, such as the IVAS codec, is used in some extended reality (XR) application. The main decoding and pre-rendering can be performed by a first device (user equipment (UE)) of an assumed 5G system, or by an edge or other network node. The second device includes a post-decoder and a (lightweight) post-renderer. The overall operations can thus be divided among multiple devices. The first device (the main device) can be a mobile device, such as a laptop, tablet, or smartphone, or a stationary device, such as a workstation or a server. The first device can also be a combination of several processing devices. The second device can be a user-held (e.g., worn) device, such as a pair of augmented reality (AR) glasses.

One fundamental assumption and inventive insight applied to split rendering is that, for each time-frequency tile, the audio consists of one dominant directional component and a diffuse (omnidirectional) component. The directional component is assumed to be a prototype signal arriving from a particular direction of arrival (DOA), while the diffuse component is a decorrelated version of that prototype signal. This concept has proven to be very powerful in spatial audio coding approaches such as DirAC or metadata-assisted spatial audio (MASA) coding.

Based on at least these assumptions, various exemplary implementations may include the following steps:

1. The pre-renderer binauralizes the decoded immersive audio using a set of generic HRTFs (or BRIRs), given a head pose (P'). The head pose (P') may have been transmitted from a lightweight device equipped with a head tracker, or it may be a preset value that does not necessarily correspond to any actual head pose of the user but is a reasonable default, such as a forward-facing head pose. During the binaural pre-rendering operation, the HRTFs can be applied using an HRTF specifically selected for each time/frequency tile. The HRTF is selected based on the direction of arrival (DOA) of the dominant component of the immersive audio content, relative to the assumed head pose.

2. The first or main device encodes and transmits the binauralized audio channels, an indication of the HRTF used and/or the DOA angles, and the assumed head pose (P').

3. The post-renderer of the second device aims to adjust the received left and right binaural signals to the current head pose (P) (if the current head pose (P) deviates from P').

4. Given the HRTF or DOA angles and the head pose applied by the pre-renderer, the second device computes a left HRTF-compensated signal and a right HRTF-compensated signal, essentially by inverse HRTF filtering of the left and right audio channels and, optionally, linearly combining them.

5. The HRTF-compensated signals are filtered by the second device using the correct HRTF corresponding to the correct head pose.

6. Potential errors in the diffuse component are mitigated by a suitable choice of the weights of the linear combination mentioned above.
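The geometry behind steps 3 to 5 can be illustrated with a short sketch. It shows how a post-renderer might re-reference a DOA signalled relative to the assumed head pose (P') so that it is relative to the current head pose (P), which in turn determines the HRTF to apply in step 5. This is a minimal sketch under simplifying assumptions that are not taken from the disclosure: the head pose is yaw-only, azimuth and yaw are measured in the same rotational sense in degrees, and the function names are hypothetical. A full 3DoF or 6DoF implementation would use rotations rather than a single angle.

```python
def wrap_deg(angle: float) -> float:
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def doa_relative_to_current_pose(doa_rel_assumed_deg: float,
                                 assumed_yaw_deg: float,
                                 current_yaw_deg: float) -> float:
    """Convert an azimuth given relative to P' into an azimuth relative to P.

    World azimuth = doa_rel_assumed + assumed_yaw, and the azimuth seen from
    the current pose is the world azimuth minus the current yaw.
    """
    return wrap_deg(doa_rel_assumed_deg + assumed_yaw_deg - current_yaw_deg)


def needs_adjustment(assumed_yaw_deg: float, current_yaw_deg: float,
                     tolerance_deg: float = 0.5) -> bool:
    """True if P deviates from P' by more than a (hypothetical) tolerance."""
    return abs(wrap_deg(current_yaw_deg - assumed_yaw_deg)) > tolerance_deg
```

For instance, a component pre-rendered at 30 degrees relative to a forward-facing assumed pose appears at 10 degrees once the listener has turned 20 degrees toward it, so the post-renderer would compensate the HRTF for 30 degrees and apply the HRTF for 10 degrees.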

The main idea can also be applied to HRTF personalization, in which the pre-renderer uses generic HRTFs while the post-renderer compensates for these generic HRTFs and subsequently applies personalized HRTFs.

In the following, exemplary implementations of the novel concepts described herein are illustrated with reference to FIGS. 1 to 3.

FIG. 1 is a block diagram of an exemplary system implementing head-tracked split rendering, arranged in accordance with various aspects of the present invention. The exemplary system includes a first device (10) and a second device (20). The first device (10) may also be referred to as the main device, while the second device (20) may also be referred to as a mobile or user-held device.

In FIG. 1, the first or main device (10) includes a decoder/renderer (11), an optional head pose decoder (12), encoders (13 and 14), and a multiplexer (15). The decoder/renderer (11), e.g., an IVAS decoder, receives a main bitstream (b₁) containing encoded immersive audio content (step (S1)), decodes the immersive audio content (step (S2)), and performs binaural rendering of the decoded audio content using HRTFs associated with a direction of arrival (DOA) relative to an assumed head pose (P') of the user (step (S3)). The HRTFs used are typically taken from a set of generic HRTFs (for different directions of arrival (DOAs)). This processing is typically too computationally complex to be performed on a mobile or user-held (lightweight) device. The assumed head pose (P') can be a suitable default head pose, or it can be an actual user head pose received from the user-held device, which can optionally be decoded by the head pose decoder (12) (step (S21)). Such a decoded user head pose may represent a recent head pose rather than the user's current head pose. The renderer (11) outputs binaural signals (L₁, R₁) as well as post-rendering metadata (M). The post-rendering metadata (M) includes an indication of the HRTF used, expressed, for example, as the direction of arrival (DOA) of the dominant directional component of the immersive audio content relative to the assumed head pose (P'), or as an index of the HRTF used. The post-rendering metadata (M) may also include an indication of the head pose (P') associated with the binaural rendering. The encoders (13, 14) are arranged to encode the binaural signals (L₁, R₁) and the post-rendering metadata (M) into encoded signals (b₁₁ and b₁₂), respectively (step (S4)). The multiplexer (15) is arranged to multiplex or combine the encoded binaural signals (b₁₁) and the encoded metadata (b₁₂) into an intermediate bitstream (b₂) (step (S5)), and this intermediate bitstream (b₂) is transmitted to the second device (20) (step (S6)).

The second device (20), which may be a user-held device, comprises a demuxer (21), decoders (22 and 23), an encoder (25), a renderer (26), and a head tracker (24). The user-held device (20) receives the intermediate bitstream (b₂) (step (S11)), and the demuxer (21) separates the intermediate bitstream (b₂) into encoded signals (b₂₁ and b₂₂), which are received by the corresponding decoders (22 and 23). The decoders (22 and 23) in turn decode the encoded signals (b₂₁ and b₂₂) (step (S12)) to obtain decoded binaural signals (L₂, R₂) and decoded metadata (M'). As mentioned, the metadata (M') includes an indication of the HRTF used, given for example by an index or by a direction of arrival relative to a forward-facing head pose.
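By way of a non-limiting illustration, the multiplexing of step (S5) and the demultiplexing in the second device can be sketched as follows. The length-prefixed framing, and the function names `mux` and `demux`, are assumptions made for this sketch only; the description does not specify a frame layout for the intermediate bitstream (b₂).

```python
import struct
from typing import Tuple

# Hypothetical frame header: two big-endian 32-bit payload lengths.
_HEADER = ">II"


def mux(b11: bytes, b12: bytes) -> bytes:
    """Combine encoded binaural audio (b11) and encoded metadata (b12)."""
    return struct.pack(_HEADER, len(b11), len(b12)) + b11 + b12


def demux(b2: bytes) -> Tuple[bytes, bytes]:
    """Split an intermediate frame back into (audio, metadata) payloads."""
    n_audio, n_meta = struct.unpack_from(_HEADER, b2, 0)
    start = struct.calcsize(_HEADER)
    audio = b2[start:start + n_audio]
    meta = b2[start + n_audio:start + n_audio + n_meta]
    if len(audio) != n_audio or len(meta) != n_meta:
        raise ValueError("truncated intermediate frame")
    return audio, meta
```

A round trip through `mux` and `demux` returns the two payloads unchanged, which is the property the decoders (22 and 23) rely on.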

The head tracker (24), which may be included in or connected to the user-held device (20), detects the current head pose (P) of the user's head (step (S13)). The encoder (25) may optionally be used to encode the detected head pose (P) as b_P (step (S131)) and to transmit the encoded detected head pose (b_P) to the main device (10).

The metadata (M') may also include the assumed head pose (P') used in the renderer (11). Alternatively, in implementations where the detected head pose (P) is transmitted to the main device (10), the user-held device may estimate the assumed head pose based on an expected transmission delay. In principle, the assumed head pose can be taken to be the head pose that was detected at the earlier point in time corresponding to the expected transmission delay.
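As a minimal sketch of this estimation, the user-held device could keep a short history of detected poses and look up the pose that was current one expected delay ago. The history format (integer millisecond timestamps paired with a yaw angle) and the function name `estimate_assumed_yaw` are assumptions of this example, not part of the description.

```python
from bisect import bisect_right


def estimate_assumed_yaw(history, now_ms, expected_delay_ms):
    """Return the yaw detected at or just before (now_ms - expected_delay_ms).

    history is a list of (timestamp_ms, yaw_deg) tuples sorted by timestamp.
    If the target time precedes the whole history, the oldest entry is used.
    """
    if not history:
        raise ValueError("no pose history available")
    target = now_ms - expected_delay_ms
    timestamps = [t for t, _ in history]
    index = max(bisect_right(timestamps, target) - 1, 0)
    return history[index][1]
```

Only yaw is tracked here to keep the sketch short; a full head pose would be looked up in the same way.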

Finally, the renderer (26) receives the decoded binaural audio signals (L₂, R₂), the DOA or the HRTF used, the assumed head pose (P') and the current head pose (P), and computes the output binaural signals (L_out, R_out). This processing includes identifying a post-rendering HRTF corresponding to the detected current head pose (P) (step (S14)), computing a compensated stereo audio signal by applying to the binaural audio signal an HRTF compensation operation configured to compensate for the effect of the pre-rendering HRTF (step (S15)), and finally applying the identified post-rendering HRTF (step (S16)). For this processing, which is discussed below, the renderer (26) is provided with HRTF data, typically a generic HRTF set (covering various DOAs). The renderer (26) may also be provided with a personalized HRTF set.

FIG. 2 is a flowchart illustrating the processing in the first device (or main device), including steps (S1 to S6) described above and the optional step (S21).

The process includes receiving a bitstream in step (S1) "Receive bitstream", and decoding the bitstream by a decoder in step (S2) "Decode" to obtain decoded immersive audio content. In step (S21) "Decode pose", the process may include the optional step of receiving and decoding an indication of a current user head pose from the second, user-held device, and determining an assumed user head pose based on the current user head pose. In step (S3) "Pre-render", the process includes binauralizing the immersive audio content by a pre-renderer to generate a pre-rendered binaural signal, the binauralization using a pre-rendering HRTF from an HRTF set and the assumed head pose of the user. In step (S4) "Encode", the process includes encoding the pre-rendered binaural signal and encoding post-rendering metadata, the metadata indicating the pre-rendering HRTF. In step (S5) "Combine", the process includes combining, in a multiplexer, the encoded binaural audio signal and the encoded post-rendering metadata to form a bitstream comprising the binaural audio representation. In step (S6) "Transmit", the process includes transmitting the bitstream to the second device (or user-held device).

FIG. 3 is a flowchart illustrating the processing in the second device (or user-held device), including steps (S11 to S16) described above and the optional step (S131).

The process includes, in step (S11) "Receive bitstream", receiving from the first device (or main device) a bitstream comprising a representation of a binaural pre-rendering of immersive audio content. The binaural pre-rendering was obtained relative to an assumed head pose (P'). In step (S12) "Decode", the process includes decoding the bitstream to obtain a binaural audio signal and associated post-rendering metadata. The metadata indicates the pre-rendering HRTF used in the binaural pre-rendering, where the pre-rendering HRTF is associated with the assumed head pose (P'). In step (S13) "Detect current pose", the process includes obtaining user head pose information indicating the current head pose (P). In step (S131) "Encode pose", the process may include the optional step of encoding the detected head pose (P) using an encoder and transmitting an indication of the current head pose (P) to the main device. As mentioned above, the main device can then use the current pose received from the second, user-held device as the assumed pose.

In this case, the second, user-held device can therefore estimate the assumed head pose (P') based on the expected transmission delay (and the previously transmitted current poses). In step (S14) "Identify post-rendering HRTF", the process includes identifying a post-rendering HRTF based on the metadata, the assumed head pose (P') and the current head pose (P). In step (S15) "Compute compensated audio", the process includes computing a compensated stereo audio signal by applying to the binaural audio signal an HRTF compensation operation configured to compensate for the effect of the pre-rendering HRTF.

Various exemplary HRTF compensation operations suitable for computing the compensated stereo audio signal are described herein. Such operations may include any number of methods that suitably counteract and adjust for the various effects resulting from the pre-rendering HRTF operation. In some examples, the HRTF compensation may comprise the inverse mathematical operation of the pre-rendering HRTF. In some other examples, the HRTF compensation may be implemented using a lookup-table type of operation, in which the lightweight device, in order to reduce memory requirements, may rely on a less dense HRTF set than the one used in the pre-rendering device. Accordingly, the set of inverse HRTFs available to the lightweight device may comprise suitable approximations of the inverses of the HRTFs available in the HRTF set of the pre-rendering device. In yet other examples, the HRTF compensation may be implemented using numerical approximation methods, which may include linear interpolation, non-linear interpolation, or a combination thereof. In yet other examples, the HRTF compensation may be implemented using a best-fit type of approximation. Combinations of the various methods are equally applicable and are considered to be within the scope of the present disclosure.
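Two of these options can be sketched for a single time-frequency tile, in which an HRTF reduces to one complex gain. The regularization constant `eps`, the table layout (azimuth in degrees mapped to a complex inverse gain) and both function names are assumptions of this sketch; the description leaves the concrete compensation method open.

```python
def regularized_inverse(h, eps=1e-3):
    """Approximate 1/h while bounding the gain when |h| is very small."""
    return h.conjugate() / (abs(h) ** 2 + eps)


def nearest_inverse(inverse_table, azimuth_deg):
    """Look up the stored inverse HRTF whose azimuth is closest on the circle.

    inverse_table maps azimuth in degrees to a complex inverse gain and may be
    sparser than the HRTF set used by the pre-rendering device.
    """
    def circular_distance(a, b):
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)

    key = min(inverse_table, key=lambda a: circular_distance(a, azimuth_deg))
    return inverse_table[key]
```

The nearest-neighbour lookup trades accuracy for memory; interpolating between the two closest table entries would be a natural refinement.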

Returning to FIG. 3, in step (S16) "Apply post-rendering HRTF", the process includes computing a binaural output signal by applying the post-rendering HRTF to the compensated stereo signal. Steps (S15 and S16) may be performed as a single operation.

The processes illustrated by FIGS. 2 and 3 comprise a collection of blocks representing a sequence of operations or steps that may be implemented in hardware, software, or a combination thereof. In the context of software, the blocks may represent computer-executable instructions that, when executed by one or more processors, perform the recited operations. In general, computer-executable instructions may include routines, programs, objects, components, data structures, and the like that perform or implement functions. The order in which the operations are described is not intended to be construed as a limitation, and any number of the described blocks may be combined in any order, separated into additional blocks, and/or operated in parallel to implement the processes.

Approach in the pre-renderer (11):

The basic assumption is that, per time-frequency tile, the audio consists of one dominant directional component and a diffuse (omnidirectional) component. The directional component is assumed to be a prototype signal ($S$) arriving from a specific DOA with azimuth and elevation angles ($\varphi_r, \theta_r$) expressed in some room coordinate system. The diffuse component is a decorrelated version of the prototype signal ($S$).

Pre-renderer synthesis is performed here by convolving the directional component (per time-frequency tile) with the HRTF corresponding to the DOA and adding the diffuse component. Both components are added using respective weights $g_{dir}$ and $g_{diff}$:

$L' = g_{dir}\, H_L(\varphi', \theta')\, S + g_{diff}\, D_L(S)$, $R' = g_{dir}\, H_R(\varphi', \theta')\, S + g_{diff}\, D_R(S)$, where $S$ is the prototype signal, $H_L$ and $H_R$ are the left-ear and right-ear HRTFs, and $D_L$ and $D_R$ are decorrelators.

Here, $(\varphi', \theta')$ are the azimuth and elevation angles of the directional component relative to the head pose ($P'$) assumed in the pre-renderer (11). If the head pose (P') is expressed in the same room coordinate system as the DOA angles $(\varphi_r, \theta_r)$, then, at least under certain additional limiting assumptions, $\varphi' = \varphi_r - \varphi_{P'}$ and $\theta' = \theta_r - \theta_{P'}$, where $(\varphi_{P'}, \theta_{P'})$ are the corresponding angles of P'.
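The synthesis for one time-frequency tile can be sketched as below, where convolution with an HRTF reduces to multiplication by a complex gain. The function `toy_hrtf` is a purely hypothetical stand-in (an azimuth-dependent level and phase difference, with positive azimuth taken to mean a source to the listener's left), not measured HRTF data, and the decorrelator outputs are passed in as given values rather than computed.

```python
import cmath
import math


def toy_hrtf(azimuth_deg, ear):
    """Hypothetical single-tile HRTF; ear is +1 (left) or -1 (right)."""
    s = math.sin(math.radians(azimuth_deg))
    gain = 1.0 + 0.3 * ear * s           # the near ear is louder
    phase = 0.5 * math.pi * ear * s      # the near ear leads in phase
    return gain * cmath.exp(1j * phase)


def pre_render_tile(s, azimuth_deg, g_dir, g_diff, d_left, d_right):
    """Return the pre-rendered (left, right) values for one tile.

    s is the complex prototype-signal value, azimuth_deg the DOA azimuth
    relative to the assumed head pose, and d_left / d_right stand in for the
    outputs of the left and right decorrelators.
    """
    left = g_dir * toy_hrtf(azimuth_deg, +1) * s + g_diff * d_left
    right = g_dir * toy_hrtf(azimuth_deg, -1) * s + g_diff * d_right
    return left, right
```

For a frontal source (azimuth 0) the toy HRTF is 1 for both ears, so the directional part of both channels equals the weighted prototype signal.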

Approach in the post-renderer:

The post-renderer aims to adjust the received left and right binaural signals ($L'$ and $R'$) to the current head pose (P) when the current head pose (P) deviates from P'.

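The key insight is that the correct output signals would be

$L = g_{dir}\, H_L(\varphi, \theta)\, S + g_{diff}\, D_L(S)$, $R = g_{dir}\, H_R(\varphi, \theta)\, S + g_{diff}\, D_R(S)$,

where $(\varphi, \theta)$ are the azimuth and elevation angles of the directional component relative to the current head pose (P).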

The signal ($S$) and the decorrelator signals ($D_L(S)$ and $D_R(S)$) are not available. Instead, $L$ and $R$ are approximated with a parametric approach using the available signals ($L'$ and $R'$). Assuming that the HRTF applied by the pre-renderer is known, a left HRTF-compensated signal $\hat{S}_L$ and a right HRTF-compensated signal $\hat{S}_R$ can be computed. One possibility for obtaining these signals is to derive them as a weighted combination of the HRTF-compensated left channel signal and the HRTF-compensated right channel signal, where $w_L$ and $w_R$ are suitable weighting factors or operators.

The HRTF-compensated left channel signal and the HRTF-compensated right channel signal are, for the left and right channel signals respectively, $H_L^{-1}(\varphi', \theta')\, L'$ and $H_R^{-1}(\varphi', \theta')\, R'$.

The left and right HRTF-compensated signals are then

$\hat{S}_L = w_L\, H_L^{-1}(\varphi', \theta')\, L' + (1 - w_L)\, H_R^{-1}(\varphi', \theta')\, R'$ and

$\hat{S}_R = w_R\, H_R^{-1}(\varphi', \theta')\, R' + (1 - w_R)\, H_L^{-1}(\varphi', \theta')\, L'$.

In one simple example, $w_L = w_R = 1$ (no weighting), and therefore

$\hat{S}_L = H_L^{-1}(\varphi', \theta')\, L'$ and

$\hat{S}_R = H_R^{-1}(\varphi', \theta')\, R'$.

Using these HRTF-compensated signals $\hat{S}_L$ and $\hat{S}_R$, the left and right output signals of the post-renderer are obtained as follows:

$L_{out} = H_L(\varphi, \theta)\, \hat{S}_L$ and $R_{out} = H_R(\varphi, \theta)\, \hat{S}_R$.

This approach yields the correct directional component in the output signals relative to the current head pose, i.e., $g_{dir}\, H_L(\varphi, \theta)\, S$ and $g_{dir}\, H_R(\varphi, \theta)\, S$.
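The claim about the directional component can be checked numerically for a single tile. The sketch below assumes a hypothetical toy HRTF (`hrtf`, not measured data), takes the compensation to be an exact division by the pre-rendering HRTF, and assumes that the two channel weights in each compensated signal sum to one (`w` and `1 - w`).

```python
import cmath
import math


def hrtf(azimuth_deg, ear):
    """Hypothetical single-tile HRTF; ear is +1 (left) or -1 (right)."""
    s = math.sin(math.radians(azimuth_deg))
    return (1.0 + 0.3 * ear * s) * cmath.exp(0.5j * math.pi * ear * s)


def post_render_tile(l_pre, r_pre, az_assumed, az_current, w_l=1.0, w_r=1.0):
    """Compensate the pre-rendering HRTF, then apply the post-rendering HRTF."""
    comp_l = l_pre / hrtf(az_assumed, +1)    # HRTF-compensated left channel
    comp_r = r_pre / hrtf(az_assumed, -1)    # HRTF-compensated right channel
    s_l = w_l * comp_l + (1.0 - w_l) * comp_r
    s_r = w_r * comp_r + (1.0 - w_r) * comp_l
    return hrtf(az_current, +1) * s_l, hrtf(az_current, -1) * s_r
```

For a purely directional input, both compensated channels reduce to the prototype signal, so the output equals the post-rendering HRTF applied to that signal regardless of the weights chosen.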

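However, an error arises in the diffuse component, which can be quantified as follows:

$E_L = g_{diff} \left[ H_L(\varphi, \theta) \left( w_L\, H_L^{-1}(\varphi', \theta')\, D_L(S) + (1 - w_L)\, H_R^{-1}(\varphi', \theta')\, D_R(S) \right) - D_L(S) \right]$ and

$E_R = g_{diff} \left[ H_R(\varphi, \theta) \left( w_R\, H_R^{-1}(\varphi', \theta')\, D_R(S) + (1 - w_R)\, H_L^{-1}(\varphi', \theta')\, D_L(S) \right) - D_R(S) \right]$.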

Assuming that an HRTF can be decomposed into delay and gain/shape operations, the associated delay change of the decorrelated diffuse component may be perceptually insignificant, whereas the gain/shape change may cause timbral deviations or coloration effects. Given the HRTF sets involved, this error can be mitigated by a suitable choice of the weights $w_L$, $w_R$. In a more general form, $w_L$, $w_R$ may be linear or non-linear operators, such as (frequency-selective) filter operators or gain limiters that prevent output samples from exceeding a predetermined numeric range.

The benefit of selecting the weights ($w_L$, $w_R$) adaptively is illustrated by an example in which head pose changes are limited to yaw, i.e., only rotation about the z-axis occurs, so that the elevation angle of the directional component is the same for the assumed head pose and the current head pose ($\theta = \theta'$).

First, consider the case of almost no yaw rotation, so that $\varphi$ is nearly equal to $\varphi'$. In this case, the weights are preferably chosen as $w_L = w_R = 1$, which means that the received left and right binaural signals ($L'$ and $R'$) are output almost unmodified, i.e., $L_{out} \approx L'$ and $R_{out} \approx R'$. Thus, for (sufficiently) small yaw deviations (e.g., less than 20 degrees), setting the weights to $w_L = w_R = 1$ is a good choice.

Second, consider a yaw change of 180 degrees, so that $\varphi - \varphi'$ is nearly equal to 180 degrees. Now the weights are preferably chosen as $w_L = w_R = 0$, which means that the received left and right binaural signals ($L'$ and $R'$) are (virtually) swapped as part of the adjustment processing. This results in the following diffuse components for the left and right output channels:

$L_{out,diff} = g_{diff}\, H_L(\varphi' + 180^\circ, \theta')\, H_R^{-1}(\varphi', \theta')\, D_R(S)$, and

$R_{out,diff} = g_{diff}\, H_R(\varphi' + 180^\circ, \theta')\, H_L^{-1}(\varphi', \theta')\, D_L(S)$.

Considering that the right-ear HRTF can be approximated by the left-ear HRTF taken with a 180-degree azimuth offset, and likewise that the left-ear HRTF can be approximated by the right-ear HRTF taken with a 180-degree azimuth offset, it can be seen that the terms $H_L(\varphi' + 180^\circ, \theta')\, H_R^{-1}(\varphi', \theta')$ and $H_R(\varphi' + 180^\circ, \theta')\, H_L^{-1}(\varphi', \theta')$ in the above equations can be approximated by 1. This gives the following approximations of the diffuse components for the left and right output channels:

$L_{out,diff} \approx g_{diff}\, D_R(S)$, and

$R_{out,diff} \approx g_{diff}\, D_L(S)$.

It can thus be concluded that setting the weights to $w_L = w_R = 0$ is a good choice when the yaw difference between the current head pose and the assumed head pose is close to 180 degrees.
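The 180-degree approximation can be illustrated with a hypothetical toy HRTF that depends only on the sine of the azimuth. In such a model the left-ear HRTF at a 180-degree azimuth offset coincides with the right-ear HRTF by construction, so the product terms equal 1 up to rounding; for measured HRTFs this holds only approximately. The model and the function names are assumptions of this sketch.

```python
import cmath
import math


def hrtf(azimuth_deg, ear):
    """Hypothetical single-tile HRTF; ear is +1 (left) or -1 (right)."""
    s = math.sin(math.radians(azimuth_deg))
    return (1.0 + 0.3 * ear * s) * cmath.exp(0.5j * math.pi * ear * s)


def swap_term(azimuth_assumed_deg, ear):
    """One ear's HRTF at a 180-degree offset divided by the other ear's HRTF."""
    rotated = hrtf(azimuth_assumed_deg + 180.0, ear)
    return rotated / hrtf(azimuth_assumed_deg, -ear)
```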

The third case considered is a yaw rotation of 90 degrees, so that $\varphi - \varphi'$ equals 90 degrees. Here it can be argued that there is no reason to give preference to either of the available signals $L'$ or $R'$ when constructing the HRTF-compensated signals, since doing so could lead to a solution with asymmetric behavior between the left and right channels. Symmetric behavior is achieved by choosing $w_L = w_R = 0.5$ for yaw rotations close to 90 degrees.

This discussion leads to a preferred solution in which the weights ($w_L$, $w_R$) are selected adaptively in response to the determined yaw rotation ($\Delta\varphi = \varphi - \varphi'$), i.e., the difference between the azimuth angles of the directional component relative to the assumed head pose ($P'$) and the current head pose ($P$), with values near 1 for small yaw rotations, near 0.5 around 90 degrees, and near 0 around 180 degrees.
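One smooth mapping that passes through the three cases discussed above (weight 1 at 0 degrees, 0.5 at 90 degrees, 0 at 180 degrees) is a raised cosine of the yaw difference. This particular function is an assumption chosen for illustration; it is not asserted to be the formula of the preferred solution, and a piecewise-linear mapping would satisfy the same three points.

```python
import math


def yaw_weight(az_assumed_deg, az_current_deg):
    """Raised-cosine weight for w_L = w_R as a function of the yaw difference."""
    delta = math.radians(az_current_deg - az_assumed_deg)
    return 0.5 * (1.0 + math.cos(delta))
```

Because the cosine is even and periodic, the weight depends only on the magnitude of the yaw difference and wraps correctly at 360 degrees.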

Note that similar embodiments can be formulated based on the roll angles of the assumed head pose and the current head pose.

FIG. 4 illustrates an exemplary technique of DOA-based split rendering using pre-rendered binaural signals. It shows how an acoustic wavefront is assumed to arrive at the listener's head (30) from a DOA with azimuth angle ($\alpha'$ and $\alpha$, respectively). The pre-renderer only has access to the head pose ($P'$), with angle ($\alpha'$). Binaural synthesis is therefore performed using the HRTF corresponding to the head pose ($P'$) with angle ($\alpha'$). One main effect is that the interaural time difference (ITD) of the wavefront between the left ear and the right ear is $\Delta'_{LR}$. There are also a corresponding interaural level difference (ILD) and spectral differences. The applied HRTF mimics these effects by imposing (imprinting) the appropriate ITD, ILD and spectrum. The figure further shows the assumed situation in the post-renderer, which knows the actual head pose ($P$) with angle ($\alpha$). It shows how the assumed wavefront arrives from a different DOA relative to the actual head pose (P), at angle $\alpha$ instead of $\alpha'$, which in turn results in a different ITD ($\Delta_{LR}$). Notably, there are also a different ILD and spectrum corresponding to the actual head pose.

One main concept of the present disclosure, visualized in FIG. 4, is thus to change the ITD from $\Delta'_{LR}$ to $\Delta_{LR}$ by first compensating $\Delta'_{LR}$ and then applying $\Delta_{LR}$. Similarly, the ILD and spectrum are modified by compensating those corresponding to the head pose ($P'$) and applying the ILD and spectrum of the HRTF corresponding to the actual head pose ($P$).

The above leads to the following simplified approach (see also FIG. 4):

Assumptions:

The back-front axis A of the listener (30) defines the x-axis of a right-handed coordinate system. Furthermore, in many relevant cases, users mainly move their head about the yaw axis (z-axis), and most immersive audio content has sound sources close to the horizontal plane. The elevation angle of the DOA is therefore relatively close to 0 degrees (e.g., confined to the interval [-20, 20] degrees).

Under these assumptions, and according to a simplified formulation, only the azimuth component of the DOA gives rise to a significant ITD. The azimuth component additionally affects gain and spectral shape.

The limited elevation component of the DOA (pitch, roll) affects gain and spectral shape but not the ITD.

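The HRTF filters can be decomposed into delay and gain/shape operations:

$H_L(\alpha) = G_L(\alpha)\, e^{-j\omega\tau_L(\alpha)}$, $H_R(\alpha) = G_R(\alpha)\, e^{-j\omega\tau_R(\alpha)}$, where $G_L$, $G_R$ denote the gain/shape operations and $\tau_L$, $\tau_R$ the delays.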

The listener's head pose ($P'$) is assumed during pre-rendering. The pre-renderer renders according to an azimuth component (α') of the DOA that deviates from the actual azimuth (α) at playback time.

The post-renderer renders according to the azimuth component (α) of the DOA corresponding to the listener's head pose ($P$) at playback time.

The interaural time differences (ITD) of the signal after post-renderer adjustment and of the pre-rendered signal are calculated as follows:

$\Delta_{LR} = \frac{d}{c} \sin(\alpha)$,

$\Delta'_{LR} = \frac{d}{c} \sin(\alpha')$,

where $d$ is the interaural distance and $c$ is the speed of sound.

As a result, the post-renderer must adjust the ITD of the directional component in a given time-frequency tile from $\Delta'_{LR}$ to $\Delta_{LR}$.
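The required adjustment can be sketched numerically. The example assumes a simple far-field model in which the ITD equals (d / c) times the sine of the azimuth; this model, the default interaural distance of 0.18 m, the speed of sound of 343 m/s and the function names are all assumptions of this sketch.

```python
import math


def itd_seconds(azimuth_deg, d=0.18, c=343.0):
    """ITD under the assumed far-field model (d / c) * sin(azimuth)."""
    return (d / c) * math.sin(math.radians(azimuth_deg))


def itd_adjustment(az_assumed_deg, az_current_deg, d=0.18, c=343.0):
    """Delay change the post-renderer applies: ITD(current) - ITD(assumed)."""
    return itd_seconds(az_current_deg, d, c) - itd_seconds(az_assumed_deg, d, c)
```

When the assumed and current azimuths coincide, the adjustment is zero and the pre-rendered ITD is left untouched.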

In addition to the ITD adjustment, the post-renderer also adjusts the interaural level difference and the spectral shape, taking into account the actual head pose compared with the head pose assumed by the pre-renderer.

It is worth noting that a similar formulation is possible without assuming a limited elevation component of the DOA. In that case too, the post-renderer operations can be decomposed into ITD adjustment and adjustment of the interaural level difference and spectral shape. In that case, however, the amount of ITD adjustment required depends on both the azimuth and the elevation angles of the DOA assumed during pre-rendering and valid during post-rendering.

FIG. 5 illustrates an exemplary technique of HRTF personalization. It shows how an assumed wavefront arriving at the listener's head (30) from a DOA angle ($\alpha$) results in different ITDs depending on the size of the listener's head. Pre-rendering with generic HRTFs may assume listener head dimensions with a generic interaural distance ($d_{gen}$). This results in a generic ITD corresponding to $d_{gen}$, together with the corresponding ILD and spectral shapes of the left and right audio signals. Personalized HRTFs are based on (more) accurate listener head dimensions. They therefore result in a more accurate, personalized ITD ($\Delta_{LR,pers}$) and more accurate corresponding ILD and spectral shapes of the left and right audio signals. The general idea of HRTF personalization is that the post-renderer compensates for the generic HRTF and imposes the effect of the personalized HRTF. The overall concept is very similar to the head pose correction in the post-renderer described above. The two concepts are therefore compatible with each other and can easily be combined.
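The effect of head size on the ITD can be sketched with the same kind of far-field model, here assumed to be (d / c) times the sine of the azimuth. The model, the example distances and the function names are assumptions of this sketch; personalized HRTFs would in practice also change ILD and spectral shape, which this example does not cover.

```python
import math


def itd_seconds(azimuth_deg, d, c=343.0):
    """ITD under the assumed far-field model (d / c) * sin(azimuth)."""
    return (d / c) * math.sin(math.radians(azimuth_deg))


def personalization_itd_shift(azimuth_deg, d_generic, d_personal, c=343.0):
    """ITD change when replacing the generic head size by the listener's own."""
    generic = itd_seconds(azimuth_deg, d_generic, c)
    personal = itd_seconds(azimuth_deg, d_personal, c)
    return personal - generic
```

For a listener with a smaller head than the generic model, the shift is negative for sources on the positive-azimuth side, i.e., the post-renderer shortens the ITD.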

The same description as given above for the main device (10) with the pre-renderer (11) applies. However, the pre-renderer (11) relies on a generic HRTF set (with left-ear and right-ear HRTFs $H_{L,gen}$, $H_{R,gen}$) for rendering, while the post-renderer (26) in the user-held device (20) performs its adjustments using a personalized HRTF set ($H_{L,pers}$, $H_{R,pers}$), the post-renderer being aware of the generic HRTFs used by the pre-renderer.

The post-renderer (26) aims to adjust the received left and right binaural signals ($L'$ and $R'$) to the current head pose (P), when the current head pose (P) deviates from P', and to the personalized HRTF set ($H_{L,pers}$, $H_{R,pers}$).

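The correct output signals would be

$L = g_{dir}\, H_{L,pers}(\varphi, \theta)\, S + g_{diff}\, D_L(S)$, $R = g_{dir}\, H_{R,pers}(\varphi, \theta)\, S + g_{diff}\, D_R(S)$.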

The signal ($S$) and the decorrelator signals ($D_L(S)$ and $D_R(S)$) are not available. Instead, $L$ and $R$ are approximated with a parametric approach using the available signals ($L'$ and $R'$). Assuming that the HRTF applied by the pre-renderer is known, the left and right HRTF-compensated signals are computed as linear combinations of the HRTF-compensated left channel signal and the HRTF-compensated right channel signal:

$\hat{S}_L = w_L\, H_{L,gen}^{-1}(\varphi', \theta')\, L' + (1 - w_L)\, H_{R,gen}^{-1}(\varphi', \theta')\, R'$ and

$\hat{S}_R = w_R\, H_{R,gen}^{-1}(\varphi', \theta')\, R' + (1 - w_R)\, H_{L,gen}^{-1}(\varphi', \theta')\, L'$.

The output signals are then $L_{out} = H_{L,pers}(\varphi, \theta)\, \hat{S}_L$ and $R_{out} = H_{R,pers}(\varphi, \theta)\, \hat{S}_R$.

This approach yields the correct directional component in the output signals relative to the actual head pose and the personalized HRTF, i.e., $g_{dir}\, H_{L,pers}(\varphi, \theta)\, S$ and $g_{dir}\, H_{R,pers}(\varphi, \theta)\, S$.

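However, once again an error arises in the diffuse component, which can be quantified as follows:

$E_L = g_{diff} \left[ H_{L,pers}(\varphi, \theta) \left( w_L\, H_{L,gen}^{-1}(\varphi', \theta')\, D_L(S) + (1 - w_L)\, H_{R,gen}^{-1}(\varphi', \theta')\, D_R(S) \right) - D_L(S) \right]$ and

$E_R = g_{diff} \left[ H_{R,pers}(\varphi, \theta) \left( w_R\, H_{R,gen}^{-1}(\varphi', \theta')\, D_R(S) + (1 - w_R)\, H_{L,gen}^{-1}(\varphi', \theta')\, D_L(S) \right) - D_R(S) \right]$.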

Assuming that an HRTF can be decomposed into delay and gain/shape operations, the associated delay change of the decorrelated diffuse component may be perceptually insignificant, whereas the gain/shape change may cause timbral deviations or coloration effects. Given the HRTF sets involved, this error can be mitigated by a suitable choice of the weights $w_L$, $w_R$. The embodiments that select the weights adaptively in response to the deviation in yaw and/or roll between the current head pose and the assumed head pose remain fully applicable.

Specific aspects of embodiments

사후 렌더러는 도착 방향(DOA) 정보를 수신한다. 이 DOA 정보는 가정된 머리 자세(P')를 기준으로 한 몰입형 오디오 콘텐츠의 지배적인 방향 성분의 방위각과 고도각(DOA 각도)()으로 표현될 수 있다. DOA가 시간-주파수 타일별로 결정된다는 점에 유의해야 한다. 사용된 HRTF의 인덱스는 사후 렌더러에 DOA 정보를 제공하는 다른 형태이다.The post-renderer receives direction of arrival (DOA) information. This DOA information is the azimuth and elevation angles (DOA angles) of the dominant directional component of the immersive audio content relative to the assumed head pose (P'). ) can be expressed as DOA. It should be noted that DOA is determined per time-frequency tile. The index of the HRTF used is another form of providing DOA information to the post-renderer.

In addition, the post-renderer needs to know the head pose (P') assumed in the pre-renderer. The corresponding information can be transmitted to the post-renderer, i.e., in the metadata. Alternatively, one can rely on the fact that P' corresponds to the actual head pose at a previous time instant, as transmitted from the post-renderer to the pre-renderer. Provided that the transmission delay from the post-renderer to the pre-renderer is known in advance or can be estimated, this makes transmitting P' to the post-renderer unnecessary. One way to estimate this transmission delay is a round-trip delay measurement from the post-renderer to the pre-renderer and back to the post-renderer, for example using timestamps.
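The timestamp-based estimate described above might look roughly like the following Python sketch. It rests on assumptions not stated in the source: the post-renderer keeps a history of the poses it has sent together with their send times, the uplink and downlink delays are treated as symmetric when a one-way figure is needed, and every name (`round_trip_delay`, `one_way_delay`, `infer_assumed_pose`, `Pose`) is hypothetical.

```python
import bisect
from typing import List, Tuple

# Hypothetical pose representation: (yaw, pitch, roll) in degrees.
Pose = Tuple[float, float, float]


def round_trip_delay(t_sent: float, t_echo_received: float) -> float:
    """Round-trip time measured on the post-renderer clock, in seconds."""
    return t_echo_received - t_sent


def one_way_delay(rtt: float, remote_processing: float = 0.0) -> float:
    """Uplink delay estimate, assuming symmetric links (an assumption of this sketch)."""
    return max(0.0, (rtt - remote_processing) / 2.0)


def infer_assumed_pose(
    send_times: List[float],
    poses: List[Pose],
    frame_arrival: float,
    rtt: float,
) -> Pose:
    """Guess P' for a frame: the latest pose sent at or before (arrival - rtt).

    send_times must be sorted ascending and aligned with poses.
    """
    target = frame_arrival - rtt
    idx = bisect.bisect_right(send_times, target) - 1
    if idx < 0:
        idx = 0  # frame predates the history; fall back to the oldest pose
    return poses[idx]
```

With such an estimate, the metadata would not need to carry P' for every frame, at the cost of an occasional mismatch when the actual delay drifts from the measured one.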

The four parameters involved are mathematically interrelated. This interdependence can potentially be exploited, for example to find a suitable choice of two of these parameters that avoids the use of a decorrelator in the post-renderer. The advantage of such an approach is reduced post-renderer complexity.

If a decorrelator is to be used in the post-renderer, a suitable decorrelator input signal is one in which the directional component is compensated/removed:

[equations not recoverable from the extracted text]

Preferably, the pre-rendered binaural channel signals are transmitted in the complex-valued quadrature mirror filterbank (CQMF)/frequency domain. This avoids a forward time-to-CQMF/frequency-domain transform in the post-renderer, which is advantageous in terms of complexity and delay.

A notable difference between the present approach and prior techniques is that the present approach relies on compensating for the HRTF filter operation of the pre-renderer and then applying the HRTF that ideally would have been used. In contrast, the alternative technique relies on transforming the binaural output channels using a linear transform whose coefficients are obtained by a least-mean-squares (LMS) approach and interpolation.

An example implementation of DOA-based split rendering using a prototype signal includes the following steps:

1. The pre-renderer or decoder generates a prototype signal (S). Some example approaches to generating S are:

a. Obtain an Ambisonics W or omnidirectional channel representation from the decoder output using any known technique and use it as S.

b. Obtain a representation of the dominant eigensignal from the decoder output and use it as S.

c. Pre-render the decoded immersive audio using a generic HRTF (or BRIR) set and generate S = aL + bR, where L and R are the left and right channels of the pre-rendered binaural (bin) signal, and a and b are complex-valued or real-valued gain factors per time-frequency tile that may be computed dynamically or statically predetermined, e.g., a = 0.5 and b = 0.5.

2. The main device transmits the coded prototype signal (S), the assumed head pose (P') and/or the assumed DOA angles (or equivalent information), and diffuseness parameters.

3. The post-renderer decodes the prototype signal bits and generates S' (S' should be identical to S if the codec used to code S is lossless and has zero delay).

4. The post-renderer aims to generate left and right binaural signals with respect to the current head pose (P), if P deviates from P'.

5. The post-renderer adjusts the DOA angles sent by the main device based on the difference between P and P'. Using S', the HRTFs available at the post-renderer, and the adjusted DOA angles, the post-renderer generates the directional component of the post-rendered binaural signal. The diffuseness parameters are used together with a decorrelated version of S' to fill in the diffuse energy of the post-rendered binaural signal.
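Step 1c and the DOA adjustment in the final step can be sketched as follows. This Python fragment is a minimal illustration under stated assumptions: only yaw is handled (a full three-degree-of-freedom adjustment would need rotation matrices or quaternions), azimuth and yaw are both measured in degrees and increase in the same rotational direction, and the function names are hypothetical.

```python
from typing import List, Sequence


def prototype_signal(
    left: Sequence[complex],
    right: Sequence[complex],
    a: complex = 0.5,
    b: complex = 0.5,
) -> List[complex]:
    """Form S = a*L + b*R tile by tile from the pre-rendered binaural channels."""
    return [a * l + b * r for l, r in zip(left, right)]


def wrap_degrees(angle: float) -> float:
    """Wrap an angle to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def adjust_azimuth(azimuth_assumed: float, yaw_assumed: float, yaw_current: float) -> float:
    """Re-express a DOA azimuth given relative to P' relative to the current pose P.

    Yaw-only simplification: if the head has turned further by some angle,
    the source appears rotated by the same angle in the opposite direction.
    """
    return wrap_degrees(azimuth_assumed - (yaw_current - yaw_assumed))
```

For example, a source at 30 degrees relative to an assumed yaw of 0 degrees appears at 20 degrees once the head has turned to a yaw of 10 degrees.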

Approach in the post-renderer:

The post-renderer aims to adjust the received DOA with respect to the current head pose (P), if P deviates from P', and then to apply the adjusted DOA together with the prototype signal (S') and an HRTF set, which may be personalized or generic. The post-renderer generates the head-tracked binaural signals as follows:

[equations not recoverable from the extracted text]

Of the quantities in these expressions, one is the diffuseness parameter transmitted by the main device and another is a directional gain, which can be computed using the DOA and spherical harmonics.
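Because the rendering equations themselves are not reproduced in this text, the following Python sketch only illustrates the general shape of such a synthesis for one time-frequency tile: a directional part (HRTF applied to S') plus a diffuse part (decorrelated S'). The energy-preserving square-root weighting, the clamping of the diffuseness value to [0, 1], and all names (`mix_weights`, `render_tile`) are assumptions of this sketch and are not taken from the source. The decorrelated signals are passed in as inputs; how they are produced is outside the scope of the sketch.

```python
import math
from typing import Tuple


def mix_weights(diffuseness: float, directional_gain: float = 1.0) -> Tuple[float, float]:
    """Return (direct_weight, diffuse_weight) for a diffuseness value in [0, 1].

    Assumed energy split: direct^2 + diffuse^2 == 1 when directional_gain == 1.
    """
    psi = min(1.0, max(0.0, diffuseness))
    return math.sqrt(1.0 - psi) * directional_gain, math.sqrt(psi)


def render_tile(
    s: complex,
    d_left: complex,
    d_right: complex,
    h_left: complex,
    h_right: complex,
    diffuseness: float,
    directional_gain: float = 1.0,
) -> Tuple[complex, complex]:
    """Render one tile of the prototype signal to binaural output.

    s               : decoded prototype signal S' for this tile
    d_left, d_right : decorrelated versions of S' (computed elsewhere)
    h_left, h_right : HRTF gains for the head-pose-adjusted DOA
    """
    direct, diffuse = mix_weights(diffuseness, directional_gain)
    out_left = direct * h_left * s + diffuse * d_left
    out_right = direct * h_right * s + diffuse * d_right
    return out_left, out_right
```

With a diffuseness of 0 the output reduces to the HRTF-filtered prototype signal, and with a diffuseness of 1 it reduces to the decorrelated signals alone.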

Advantages of this approach:

- A low-bitrate mode can be achieved by coding and transmitting only the S channel to the post-renderer.

- No HRTF compensation needs to take place, and the error in the diffuseness compensation can be reduced to zero.

Disadvantages of this approach:

- The pre-rendered binaural audio signal is not immediately available at the user handheld device, which matters in the potential case where the user handheld device merely outputs a decoded binaural audio signal without any further processing or post-rendering operation.

FIG. 6 illustrates an example technique for DOA-based split rendering using a prototype signal.

In FIG. 6, the main device (110) includes a decoder/renderer (111), a head pose decoder (112), encoders (113, 114), and a multiplexer (115). The decoder/renderer (111), e.g., an IVAS decoder, receives the main bitstream (b₁) and performs rendering synthesis of a prototype signal (S) having a direction of arrival (DOA) relative to the assumed head pose (P') of the user. The assumed head pose (P') may be a suitable default head pose, or it may be an actual user head pose received from the user handheld device, optionally decoded by the head pose decoder (112). Such a decoded user head pose represents a recent head pose rather than the user's current head pose. The renderer (111) outputs the prototype signal (S) and metadata (M) that includes at least the DOA of the prototype signal. The encoders (113, 114) encode the prototype signal (S) and the metadata (M), and the multiplexer (115) multiplexes the encoded prototype signal (b₁₁) and the encoded metadata (b₁₂) into one intermediate bitstream (b₂).

The user handheld device (120) includes a demultiplexer (121), decoders (122, 123), a head tracker (124), an encoder (125), and a post-renderer (126). The demultiplexer (121) receives the intermediate bitstream and separates it into two encoded signals (b₂₁ and b₂₂), and the two decoders (122, 123) decode these signals to obtain the decoded prototype signal (S') and the decoded metadata (M'), e.g., the DOA and (optionally) the assumed head pose (P') used in the renderer (111). The head tracker (124), which may be included in or connected to the user handheld device (120), detects the current head pose (P) of the user's head. The encoder (125) encodes the detected head pose (P) and transmits it to the main device (110). Finally, the post-renderer (126) receives the decoded prototype signal (S'), the DOA, the assumed head pose (P'), and the current head pose (P), and computes the output binaural signals (L_out, R_out). For this processing, which will be discussed below, the post-renderer (126) is provided with HRTF data, typically a generic HRTF set (for various directions of arrival). The post-renderer (126) may also be provided with a personalized HRTF set.
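The multiplexer (115) and demultiplexer (121) roles can be pictured with a toy framing scheme. The layout below, two big-endian 32-bit length fields followed by the two payloads, is purely a hypothetical illustration in Python; the actual intermediate bitstream format is not described in this text, and the names `mux` and `demux` are assumptions.

```python
import struct
from typing import Tuple

_HEADER = ">II"  # two unsigned 32-bit big-endian lengths (hypothetical layout)
_HEADER_SIZE = struct.calcsize(_HEADER)


def mux(signal_bits: bytes, metadata_bits: bytes) -> bytes:
    """Combine the encoded prototype signal and encoded metadata into one frame."""
    header = struct.pack(_HEADER, len(signal_bits), len(metadata_bits))
    return header + signal_bits + metadata_bits


def demux(frame: bytes) -> Tuple[bytes, bytes]:
    """Split a frame produced by mux() back into its two encoded payloads."""
    if len(frame) < _HEADER_SIZE:
        raise ValueError("frame shorter than header")
    n_signal, n_meta = struct.unpack_from(_HEADER, frame, 0)
    if len(frame) != _HEADER_SIZE + n_signal + n_meta:
        raise ValueError("frame length does not match header")
    start = _HEADER_SIZE
    return frame[start:start + n_signal], frame[start + n_signal:]
```

A round trip such as `demux(mux(signal, meta))` returns the two payloads unchanged, which is the only property this sketch is meant to show.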

Example use cases that can benefit from split rendering of immersive audio

The techniques described herein can be implemented in a variety of use cases. It is assumed that basic audio processing/audio signal augmentation and the subsequent pre-rendering are performed on a capable device or network node, whereas post-rendering is performed on a lightweight end device such as AR glasses.

Some examples are given below.

1. AR/MR involving audio

Audio zoom/magnifier: like a magnifying glass, but for sound. The user can zoom in on sounds of interest.

Overlaying sound on real-world objects: real-world objects/items are associated with sounds. Useful for, but not limited to, assistive systems for the visually impaired.

Conversation enhancement/smart ambient noise reduction: helps people affected by the cocktail-party problem by raising the active voice above the ambient noise.

Mood sound ambience: similar to mood lighting. Sounds are associated with the real-world environment, items, and personal preferences.

2. Use case characteristics

These use cases typically rely on audio/visual capture, some scene analysis, and the generation of augmented sound signals. In some scenarios, immersive sound from a network node or from the far end of a communication may additionally be overlaid.

The use cases typically rely on head-tracked audio/visual rendering.

3. Additional non-AR/MR use cases

Immersive voice communication (two-party calls, conferencing) and immersive content streaming with AR glasses as the end device are likely IVAS use cases.

Some of these may rely on head-tracked audio rendering; others may not.

Some use cases may involve one-to-many immersive distribution of head-tracked audio.

Aspects of the systems described herein may be implemented in a suitable computer-based sound processing network environment for processing digital or digitized audio files. Portions of the adaptive audio system may include one or more networks comprising any desired number of individual machines, including one or more routers (not shown) that buffer and route the data transmitted among the computers. Such a network may be built on various different network protocols, and may be the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), or any combination thereof.

One or more of the components, blocks, processes, or other functional components may be implemented through a computer program that controls execution of a processor-based computing device of the system. It should also be noted that the various functions disclosed herein may be described using any number of combinations of hardware, firmware, and/or data and/or instructions embodied in various machine-readable or computer-readable media, in terms of their behavioral, register transfer, logic component, and/or other characteristics. Computer-readable media in which such formatted data and/or instructions may be embodied include, but are not limited to, physical (non-transitory), non-volatile storage media in various forms, such as optical, magnetic, or semiconductor storage media.

While one or more implementations have been described by way of example and in terms of specific embodiments, it is to be understood that the one or more implementations are not limited to the disclosed embodiments. On the contrary, the description is intended to cover various modifications and similar arrangements, as would be apparent to those skilled in the art. Therefore, the scope of the appended claims should be accorded the broadest interpretation so as to encompass all such modifications and similar arrangements.

Further details and embodiments of the present invention can be understood from the following list of enumerated exemplary embodiments (EEEs):

EEE1. A method of processing audio, comprising:

receiving, by a first device, a main bitstream representation of encoded audio;

obtaining, by a second device, user head pose information;

determining, by the first device from the main bitstream, downmixed signals comprising at least one channel and metadata;

providing, by the first device to the second device, the downmixed signals and the metadata; and

rendering, by a lightweight renderer of the second device, the downmixed signals into output binaural audio based on the metadata and the user head pose information.

EEE2. The method of EEE1, wherein the downmixed signals comprise pre-rendered binaural signals.

EEE3. The method of EEE2, wherein determining the pre-rendered binaural signals and rendering metadata comprises:

decoding, by a main renderer of the first device, the main bitstream representation to generate decoded audio; and

binauralizing, by a pre-renderer of the first device, the decoded audio to generate the pre-rendered binaural signals and the rendering metadata,

wherein the pre-renderer performs the binauralization using at least one of:

a generic head-related transfer function (HRTF) or binaural room impulse response (BRIR), or

user head pose information,

and wherein the user head pose information is obtained from at least one of:

a head tracker of the second device,

a storage device storing a preset value, or

an assumed direction-of-arrival (DOA) angle.

EEE4. The method of EEE3, wherein the metadata comprises at least one of:

an indication of the HRTF or BRIR used by the pre-renderer,

an assumed user head pose used by the pre-renderer, or

an assumed DOA angle used by the pre-renderer.

EEE5. The method of EEE4, wherein rendering the pre-rendered binaural signals into the output binaural audio comprises adjusting, by the lightweight renderer, the left and right channels of the pre-rendered binaural signals with respect to a current user head pose obtained via the head tracker rather than the assumed user head pose used by the pre-renderer.

EEE6. The method of any one of EEE2 to EEE5, wherein rendering the pre-rendered binaural signals comprises:

inverse HRTF filtering the left and right channels of the pre-rendered binaural signals according to the HRTF or the assumed DOA angle used by the pre-renderer; and

linearly combining the inverse HRTF filtered signals.

EEE7. The method of EEE6, wherein the inverse HRTF filtering comprises correcting the HRTF used in the pre-renderer using a current user head pose obtained via a head tracker of the second device.

EEE8. The method of EEE6 or EEE7, wherein linearly combining the inverse HRTF filtered signals comprises mitigating an error in the diffuse component by selecting the weights of the linear combination.

EEE9. The method of any one of EEE2 to EEE8, comprising applying HRTF personalization, wherein the pre-renderer applies a generic HRTF, and the lightweight renderer compensates for the generic HRTF and subsequently applies a personalized HRTF.

EEE10. The method of EEE1, wherein the downmixed signals comprise a prototype signal.

EEE11. The method of EEE10, wherein the prototype signal comprises a single channel.

EEE12. The method of EEE10 or EEE11, wherein computing the prototype signal comprises:

decoding, by a main decoder of the first device, the main bitstream representation to generate decoded audio; and

applying gains to the decoded audio and adding the gain-applied decoded audio to the decoded audio.

EEE13. The method of any one of EEE10 to EEE12, comprising:

computing the prototype signal and DOA angles based on an assumed head pose (P') and diffuseness parameters;

transmitting the assumed head pose (P'), the DOA angles relative to the assumed head pose (P'), the diffuseness parameters, and the prototype signal to a post-renderer device;

adjusting, at the post-renderer device, the DOA angles based on an actual head pose (P);

computing directional components using the prototype signal, an HRTF set, and the adjusted DOA angles;

computing diffuse components using the diffuseness parameters and a decorrelated version of the prototype signal; and

adding the directional components and the diffuse components to generate a post-rendered binaural output.

EEE14. The method of any one of EEE1 to EEE13, wherein the first device comprises a smartphone, the second device comprises a wearable audio, visual, or AR device, and the main bitstream comprises an Immersive Voice and Audio Services (IVAS) bitstream.

EEE15. A system comprising one or more processors configured to perform the operations of any one of EEE1 to EEE14.

EEE16. A computer program product configured to cause one or more processors to perform the operations of any one of EEE1 to EEE14.

Claims

A method for processing audio on a user portable processing device, comprising:
receiving, from a main device, a bitstream comprising a representation of a binaural pre-rendering of immersive audio content, wherein the binaural pre-rendering is obtained based on an assumed head pose (P');
decoding the bitstream to obtain a binaural audio signal and associated post-rendering metadata, wherein the metadata is indicative of a pre-rendering HRTF used in the binaural pre-rendering, and the pre-rendering HRTF is associated with the assumed head pose (P');
obtaining user head pose information representing a current head pose (P);
identifying a post-rendering HRTF based on the metadata, the assumed head pose (P'), and the current head pose (P);
computing a compensated stereo audio signal by applying, to the binaural audio signal, an HRTF compensation operation configured to compensate for the effect of the pre-rendering HRTF; and
computing a binaural output signal by applying the post-rendering HRTF to the compensated stereo audio signal.

A method according to claim 1, wherein the step of calculating a compensated stereo audio signal and the step of calculating a binaural output signal are performed in one single operation.

A method according to claim 1 or 2, wherein the HRTF compensation operation includes an inverse of the pre-rendering HRTF.

A method according to claim 3, wherein the inverse of the pre-rendering HRTF is obtained by a look-up in a table containing an approximation of the inverse of the pre-rendering HRTF.

A method according to claim 3 or 4, wherein computing the compensated stereo audio signal comprises:
forming an inverse-filtered left channel by applying an inverse left HRTF to the left channel of the binaural audio signal, forming an inverse-filtered right channel by applying an inverse right HRTF to the right channel of the binaural audio signal, and combining the inverse-filtered left channel and the inverse-filtered right channel to form a left channel and a right channel, respectively, of the compensated stereo audio signal.

A method according to claim 5, wherein the inverse-filtered left channel and the inverse-filtered right channel are linearly combined, and the weights of the linear combination are selected to mitigate errors in the diffuse components of the binaural output signal.

A method according to claim 6, wherein the weights are preferably adaptive, responsive to the difference between the assumed head pose (P') and the current head pose (P).

A method according to any one of claims 1 to 7, wherein the post-rendering HRTF is personalized for a user of the user's portable processing device.

A method according to any one of claims 1 to 8, wherein the post-rendering metadata further comprises an indication of the assumed head pose (P').

A method according to any one of claims 1 to 9, wherein the method further comprises the step of transmitting an indication of the current head pose (P) to the main device and estimating the assumed head pose (P') based on an expected transmission delay.

A method according to any one of claims 1 to 10, wherein the user-held processing device comprises a wearable audio, visual or AR device.

A method according to any one of claims 1 to 11, wherein the bitstream comprises an immersive audio and video services (IVAS) bitstream.

A method according to any one of claims 1 to 12, wherein the metadata comprises a direction of arrival (DOA) associated with a dominant directional component of the immersive audio content, wherein the DOA represents the pre-rendered HRTF.

A method of processing audio, comprising:
receiving a bitstream;
decoding the bitstream, by a decoder, to obtain decoded immersive audio content;
binauralizing the immersive audio content, by a pre-renderer, to generate a pre-rendered binaural signal, wherein the binauralization uses a pre-rendering HRTF from an HRTF set and an assumed head pose of the user;
encoding the pre-rendered binaural signal;
encoding post-rendering metadata, the metadata being indicative of the pre-rendering HRTF;
combining, in a multiplexer, the encoded binaural audio signal and the encoded post-rendering metadata to form a bitstream comprising a binaural audio representation; and
transmitting the bitstream to a user portable device.

A method according to claim 14, wherein the metadata includes a direction of arrival associated with a dominant directional component of the immersive audio content.

A method according to claim 14 or 15, wherein the metadata further includes the assumed head pose.

A method according to any one of claims 14 to 16, further comprising:
receiving an indication of a current user head pose from the user portable device; and
determining the assumed head pose based on the current user head pose.

A method according to any one of claims 14 to 17, wherein the method is performed on a smartphone.

A system comprising one or more processors configured to perform the method of any one of claims 1 to 18.

A computer program product configured to cause one or more processors to perform the operations of any one of claims 1 to 18.

A user portable processing device, comprising:
a decoder configured to decode a bitstream comprising a representation of a binaural pre-rendering of immersive audio content and to obtain a binaural audio signal and associated post-rendering metadata, wherein the metadata is indicative of a pre-rendering HRTF used in the binaural pre-rendering, and the pre-rendering HRTF is associated with an assumed head pose (P');
a head tracker for obtaining user head pose information representing a current head pose (P); and
a renderer configured to:
identify a post-rendering HRTF based on the metadata, the assumed head pose (P'), and the current head pose (P);
compute a compensated stereo audio signal by applying, to the binaural audio signal, an HRTF compensation operation configured to compensate for the effect of the pre-rendering HRTF; and
compute a binaural output signal by applying the post-rendering HRTF to the compensated stereo audio signal.

A device according to claim 21, wherein the HRTF compensation operation comprises an inverse of the pre-rendering HRTF.

A device according to claim 22, wherein the device is configured to obtain the inverse of the pre-rendering HRTF by a look-up in a table containing an approximation of the inverse of the pre-rendering HRTF.

A device according to claim 22 or 23, wherein the renderer is configured to compute the compensated stereo audio signal by:
applying an inverse left HRTF to the left channel of the binaural audio signal to form an inverse-filtered left channel, applying an inverse right HRTF to the right channel of the binaural audio signal to form an inverse-filtered right channel, and combining the inverse-filtered left channel and the inverse-filtered right channel to form a left channel and a right channel, respectively, of the compensated stereo audio signal.

A device according to claim 24, wherein the inverse-filtered left channel and the inverse-filtered right channel are linearly combined, and the weights of the linear combination are selected to mitigate errors in the diffuse components of the binaural output signal.

A device according to claim 25, wherein the weights are preferably adaptive, responsive to the difference between the assumed head pose (P') and the current head pose (P).

A device according to any one of claims 21 to 26, wherein the post-rendering HRTF is personalized for a user of the user portable processing device.

A device according to any one of claims 21 to 27, wherein the device is integrated into a wearable audio, visual or AR device.