KR102770762B1

KR102770762B1 - Audio encoding/decoding apparatus and method using vector quantized residual error feature

Info

Publication number: KR102770762B1
Application number: KR1020210100809A
Authority: KR
Inventors: 장인선; 백승권; 성종모; 이태진; 임우택; 신종원; 천영주; 한상욱; 황수중
Original assignee: 한국전자통신연구원; 광주과학기술원
Priority date: 2021-07-30
Filing date: 2021-07-30
Publication date: 2025-02-24
Anticipated expiration: 2041-07-30
Also published as: US11804230B2; US20230039546A1; KR20230018838A

Abstract

An audio encoding/decoding apparatus and method using vector quantized residual error features are disclosed. The audio encoding method may include: encoding an original signal to output a bitstream of a main codec; decoding the bitstream of the main codec; determining a residual error feature vector from a feature vector of the decoded signal and a feature vector of the original signal; and encoding the residual error feature vector to output a bitstream of additional information.

Description

{AUDIO ENCODING/DECODING APPARATUS AND METHOD USING VECTOR QUANTIZED RESIDUAL ERROR FEATURE}

The present invention relates to an apparatus and method capable of improving coded sound quality by compressing vector quantized residual error features with a neural network and using them as additional information.

When audio coding operates at a low bit rate, coding artifacts such as pre-echo and quantization noise may occur and degrade audio quality. Various pre- and post-processing techniques are being developed to improve sound quality by removing these coding artifacts.

The audio coding method of Ghido, Florin, et al., "Coding of fine granular audio signals using High Resolution Envelope Processing (HREP)," 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017, improves sound quality by using gain values of the high-frequency components of a signal as additional information. It detects a transient signal that may cause pre-echo and flattens its envelope, and the decoding stage uses the transmitted additional information to restore the flattened components to the original components.

Conventional techniques for enhancing coded sound quality with additional information restrict that information to items such as the presence or absence of speech, the presence or absence of a transient signal, or the gain values of the signal in the time-frequency domain, which limits the achievable improvement in sound quality.

Therefore, there is a need for a method that can improve the sound quality of audio coding without limiting the extent of the improvement.

The present invention may provide an audio encoding apparatus that encodes residual error features using a neural network, vector quantizes them, and transmits them as additional information, and an audio decoding apparatus and method that post-process the received additional information using a neural network, thereby providing backward compatibility with an existing codec and improving the sound quality of an audio signal decoded with the existing codec.

In addition, the present invention may provide an apparatus and method that perform end-to-end deep learning by jointly training the deep learning model that encodes the residual error feature vector in the additional information encoder, the deep learning model that restores the residual error feature vector in the additional information decoder, and the deep learning model that estimates the feature vector of the original signal in the post-processor, so that the errors of the additional information encoder, the additional information decoder, and the post-processor, which use these deep learning models, do not accumulate sequentially.

An audio encoding method according to an embodiment of the present invention may include: encoding an original signal to output a bitstream of a main codec; decoding the bitstream of the main codec; determining a residual error feature vector from a feature vector of the decoded signal and a feature vector of the original signal; and encoding the residual error feature vector to output a bitstream of additional information.

In the audio encoding method according to an embodiment of the present invention, the outputting of the additional information bitstream may include: mapping the residual error feature vector to a latent space; encoding the residual error feature vector mapped to the latent space by assigning it to a code vector for vector quantization; and quantizing the encoded residual error feature vector to output the additional information bitstream.

In the audio encoding method according to an embodiment of the present invention, the additional information encoder that encodes the residual error feature vector may be trained according to a loss function determined from a loss due to the encoding of the additional information encoder, a loss due to the vector quantization of the additional information decoder that decodes the additional information bitstream, and a difference between the feature vector of the original signal and a feature vector of the original signal estimated from the bitstream of the main codec and the bitstream of the additional information.

The audio encoding method according to an embodiment of the present invention may further include training the additional information encoder that encodes the residual error feature vector together with the additional information decoder that decodes the additional information bitstream and the post-processor that estimates the feature vector of the original signal based on the bitstream of the main codec and the bitstream of the additional information.

In the audio encoding method according to an embodiment of the present invention, the training may train the additional information encoder, the additional information decoder, and the post-processor using a loss function based on a mean squared error (MSE) function and the loss function of a Vector Quantized Variational AutoEncoder (VQ-VAE).

The audio encoding method according to an embodiment of the present invention may further include: extracting the feature vector of the decoded signal from acoustic features included in the decoded signal; and extracting the feature vector of the original signal from acoustic features included in the original signal.

An audio decoding method according to an embodiment of the present invention may include: receiving a bitstream of a main codec and a bitstream of additional information; decoding the bitstream of the main codec; extracting a feature vector of the decoded signal from acoustic features included in the decoded signal; decoding the bitstream of the additional information to restore a residual error feature vector; and estimating a feature vector of an original signal from the feature vector of the decoded signal and the residual error feature vector.

In the audio decoding method according to an embodiment of the present invention, the estimating of the feature vector of the original signal may estimate the feature vector of the original signal by combining the feature vector of the decoded signal and the residual error feature vector.

The audio decoding method according to an embodiment of the present invention may further include converting the estimated feature vector of the original signal into a time-domain representation and outputting it.

The audio decoding method according to an embodiment of the present invention may further include training the additional information decoder that decodes the additional information bitstream and the post-processor that estimates the feature vector of the original signal based on the bitstream of the main codec and the bitstream of the additional information, together with the additional information encoder that encodes the residual error feature vector in the encoding apparatus.

An audio encoding apparatus according to an embodiment of the present invention may include: a main codec encoder that encodes an original signal and outputs a bitstream of a main codec; a main codec decoder that decodes the bitstream of the main codec; and an additional information encoder that determines a residual error feature vector from a feature vector of the decoded signal and a feature vector of the original signal, and encodes the residual error feature vector to output a bitstream of additional information.

The additional information encoder of the audio encoding apparatus according to an embodiment of the present invention may map the residual error feature vector to a latent space, encode the residual error feature vector mapped to the latent space by assigning it to a code vector for vector quantization, and quantize the encoded residual error feature vector to output the additional information bitstream.

The additional information encoder of the audio encoding apparatus according to an embodiment of the present invention may be trained according to a loss function determined from a loss due to the encoding of the additional information encoder, a loss due to the vector quantization of the additional information decoder, and a difference between the feature vector of the original signal and a feature vector of the original signal estimated from the bitstream of the main codec and the bitstream of the additional information.

The additional information encoder of the audio encoding apparatus according to an embodiment of the present invention may estimate the feature vector of the original signal based on the additional information decoder that decodes the additional information bitstream, the bitstream of the main codec, and the bitstream of the additional information.

The additional information encoder of the audio encoding apparatus according to an embodiment of the present invention may be trained together with the additional information decoder and the post-processor using a loss function based on a mean squared error (MSE) function and the loss function of a Vector Quantized Variational AutoEncoder (VQ-VAE).

An audio decoding apparatus according to an embodiment of the present invention may include: a main codec decoder that receives a bitstream of a main codec and decodes the bitstream of the main codec; a feature extractor that extracts a feature vector of the decoded signal from acoustic features included in the decoded signal; an additional information decoder that receives a bitstream of additional information and decodes it to restore a residual error feature vector; and a post-processor that estimates a feature vector of an original signal from the feature vector of the decoded signal and the residual error feature vector.

The post-processor of the audio decoding apparatus according to an embodiment of the present invention may estimate the feature vector of the original signal by combining the feature vector of the decoded signal and the residual error feature vector.

The post-processor of the audio decoding apparatus according to an embodiment of the present invention may convert the estimated feature vector of the original signal into a time-domain representation and output it.

The additional information decoder and the post-processor of the audio decoding apparatus according to an embodiment of the present invention may be trained together with the additional information encoder that, in the encoding apparatus, encodes the residual error feature vector and outputs the bitstream of the additional information.

According to an embodiment of the present invention, the audio encoding apparatus encodes residual error features using a neural network, vector quantizes them, and transmits them as additional information, and the audio decoding apparatus post-processes the received additional information using a neural network, thereby providing backward compatibility with an existing codec and improving the sound quality of an audio signal decoded with the existing codec.

In addition, according to an embodiment of the present invention, end-to-end deep learning is performed by jointly training the deep learning model that encodes the residual error feature vector, the deep learning model that restores the residual error feature vector, and the deep learning model that estimates the feature vector of the original signal, so that the errors of the additional information encoder, the additional information decoder, and the post-processor, which use these deep learning models, do not accumulate sequentially.

Furthermore, by performing end-to-end deep learning that jointly trains the deep learning models, the present invention can effectively train the code vectors that quantize the compressed latent vector and thereby extract additional information for improving sound quality during audio encoding.

FIG. 1 is a diagram showing an audio encoding apparatus and an audio decoding apparatus according to an embodiment of the present invention.
FIG. 2 is an example of the operation of an audio encoding apparatus and an audio decoding apparatus according to an embodiment of the present invention.
FIG. 3 is an example of a performance evaluation of the output of an audio decoding apparatus according to an embodiment of the present invention.
FIG. 4 is an example of a sound quality evaluation of the output of an audio decoding apparatus according to an embodiment of the present invention.
FIG. 5 is an example of a spectrogram of a signal output by an audio decoding apparatus according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating an audio encoding method according to an embodiment of the present invention.
FIG. 7 is a flowchart illustrating an audio decoding method according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. An audio encoding method and an audio decoding method according to an embodiment of the present invention may be performed by an audio encoding apparatus (110) and an audio decoding apparatus (120).

FIG. 1 is a diagram showing an audio encoding apparatus and an audio decoding apparatus according to an embodiment of the present invention.

As shown in FIG. 1, the audio encoding apparatus (110) may include a main codec encoder (111), a main codec decoder (112), a feature extractor (113), a feature extractor (114), and an additional information encoder (115). The main codec encoder (111), the main codec decoder (112), the feature extractor (113), the feature extractor (114), and the additional information encoder (115) may be separate processors, or may be individual modules included in a single processor.

The main codec encoder (111) may encode the original signal and output a bitstream of the main codec. For example, the main codec may be a legacy codec such as High-Efficiency Advanced Audio Coding (HE-AAC).

The main codec decoder (112) may decode the bitstream of the main codec and output the decoded signal.

The feature extractor (113) may extract a feature vector X_d of the decoded signal from acoustic features included in the decoded signal.

The feature extractor (114) may extract a feature vector X_o of the original signal from acoustic features included in the original signal. For example, the feature extractor (113) and the feature extractor (114) may extract the feature vector X_d and the feature vector X_o using at least one of various types of acoustic features, such as log power spectra (LPS).

The additional information encoder (115) may determine a residual error feature vector X_r from the feature vector X_d of the decoded signal and the feature vector X_o of the original signal. The additional information encoder (115) may then encode the residual error feature vector X_r and output a bitstream of the additional information. Here, the residual error feature vector X_r may satisfy Equation 1.

[Equation 1]
X_r = X_o - X_d
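The feature extraction and residual computation described above can be illustrated with a minimal Python sketch. It assumes LPS features from a 512-point FFT (which yields the 257-dimensional vectors mentioned in this description) and takes the residual as the difference X_o - X_d, as described for FIG. 2. The window, hop size, `eps` floor, function names, and the noisy copy standing in for the main codec output are all illustrative assumptions, not details from the patent.

```python
import numpy as np

def log_power_spectra(signal, n_fft=512, hop=256, eps=1e-10):
    """Return LPS features with shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)
    # eps (an assumed floor) keeps the logarithm finite for silent bins.
    return np.log(np.abs(spectrum) ** 2 + eps)

def residual_error_features(original, decoded):
    """X_r = X_o - X_d, computed frame by frame in the LPS domain."""
    x_o = log_power_spectra(original)
    x_d = log_power_spectra(decoded)
    return x_o - x_d

rng = np.random.default_rng(0)
original = rng.standard_normal(16000)
# Stand-in for the main codec's decoded output (assumption: additive noise).
decoded = original + 0.05 * rng.standard_normal(16000)
x_r = residual_error_features(original, decoded)
print(x_r.shape)
```

Each row of `x_r` is one 257-dimensional residual error feature vector that would be passed to the additional information encoder.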

Here, the additional information encoder (115) may map the residual error feature vector X_r to a latent space whose dimension is smaller than that of the residual error feature vector. For example, the dimension of the residual error feature vector X_r input to the additional information encoder (115) may be 257, and the dimension of the bitstream output from the additional information encoder (115) may be 32. The latent space may be a space that represents the latent information inherent in observed data.

Next, the additional information encoder (115) may encode the residual error feature vector mapped to the latent space by assigning it to a code vector for vector quantization. The additional information encoder (115) may then quantize the encoded residual error feature vector and output the additional information bitstream.

Here, the additional information encoder (115) may be trained according to a loss function determined from the loss due to the encoding and vector quantization of the additional information encoder (115) and the difference between the feature vector of the original signal and the feature vector of the original signal that the post-processor (124) estimates from the bitstream of the main codec and the bitstream of the additional information. The additional information encoder (115) may be trained together with the additional information decoder (123) and the post-processor (124). For example, the additional information encoder (115), the additional information decoder (123), and the post-processor (124) may perform joint training according to the loss function L expressed in Equation 2.

The loss function L in Equation 2 combines three terms: the loss due to encoding and vector quantization, used to optimize the additional information encoder (115); the loss due to decoding and vector quantization of the additional information decoder (123); and the difference between the feature vector X_o of the original signal and the feature vector of the original signal estimated by the post-processor (124). The loss function may be optimized using various methods, such as the mean squared error (MSE).

In summary, the additional information encoder (115) may be trained together with the additional information decoder (123) and the post-processor (124) using a loss function based on a mean squared error (MSE) function and the loss function of a Vector Quantized Variational AutoEncoder (VQ-VAE). For example, the additional information encoder (115), the additional information decoder (123), and the post-processor (124) may be trained using a loss function that combines the MSE function and the VQ-VAE loss terms, as in Equation 2.
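As a rough illustration of how an MSE term and VQ-VAE loss terms can be combined, the following Python sketch evaluates such a loss numerically. The exact form of Equation 2 is not reproduced here; the sketch assumes the commonly used VQ-VAE formulation (a codebook term plus a commitment term weighted by `beta`), and the names `joint_loss`, `z_e`, `e_q`, and the default `beta=0.25` are hypothetical choices for illustration.

```python
import numpy as np

def joint_loss(x_o, x_o_est, z_e, e_q, beta=0.25):
    """Illustrative joint loss: MSE reconstruction + VQ-VAE terms.

    x_o     : feature vectors of the original signal
    x_o_est : feature vectors estimated by the post-processor
    z_e     : latent vectors produced by the encoder
    e_q     : code vectors assigned to those latents
    beta    : commitment weight (assumed value)
    """
    # Reconstruction term: how far the estimate is from the original features.
    mse = np.mean((x_o - x_o_est) ** 2)
    # Codebook term: in training, the gradient is stopped on z_e,
    # so this term only moves the code vectors toward the latents.
    codebook = np.mean((z_e - e_q) ** 2)
    # Commitment term: in training, the gradient is stopped on e_q,
    # so this term only moves the encoder output toward the code vectors.
    commitment = beta * np.mean((z_e - e_q) ** 2)
    return mse + codebook + commitment

x_o = np.zeros((2, 257))
x_o_est = np.ones((2, 257))
z_e = np.zeros((2, 32))
e_q = 2.0 * np.ones((2, 32))
loss = joint_loss(x_o, x_o_est, z_e, e_q)
print(loss)
```

In a forward pass the codebook and commitment terms share the same value up to `beta`; they differ only in where gradients flow, which a deep learning framework would express with a stop-gradient operation.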

As shown in FIG. 1, the audio decoding apparatus (120) may include a main codec decoder (121), a feature extractor (122), an additional information decoder (123), and a post-processor (124). The main codec decoder (121), the feature extractor (122), the additional information decoder (123), and the post-processor (124) may be separate processors, or may be individual modules included in a single processor.

The main codec decoder (121) may receive the bitstream of the main codec from the main codec encoder (111) of the audio encoding apparatus (110). The main codec decoder (121) may then decode the received bitstream of the main codec and output the decoded signal. The main codec decoder (121) may operate in the same manner as the main codec decoder (112) of the audio encoding apparatus (110).

The feature extractor (122) may extract the feature vector X_d of the decoded signal from acoustic features included in the signal decoded by the main codec decoder (121). The feature extractor (122) may operate in the same manner as the feature extractor (113) of the audio encoding apparatus (110).

The additional information decoder (123) may receive the bitstream of the additional information from the additional information encoder (115) of the audio encoding apparatus (110). The additional information decoder (123) may then decode the received bitstream of the additional information to restore the residual error feature vector.

The post-processor (124) may estimate the feature vector of the original signal from the feature vector X_d of the decoded signal and the residual error feature vector restored by the additional information decoder (123). The post-processor (124) may then convert the estimated feature vector of the original signal into a time-domain representation and output it. Here, the post-processor (124) may estimate the feature vector of the original signal by combining the feature vector X_d and the restored residual error feature vector.
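The combination and time-domain conversion can be sketched for a single frame as follows. This is only an illustration under stated assumptions: the post-processor in this description is a trained neural network, whereas the sketch uses plain addition (the inverse of taking the residual as X_o - X_d), LPS features from a 512-point FFT, and the phase of the decoded signal for the inverse transform. All function names are hypothetical.

```python
import numpy as np

N_FFT = 512  # assumed frame length; gives 257 frequency bins

def lps_and_phase(frame):
    """Log power spectrum and phase of one time-domain frame."""
    spec = np.fft.rfft(frame, n=N_FFT)
    return np.log(np.abs(spec) ** 2), np.angle(spec)

def estimate_original_lps(x_d, x_r_hat):
    """Additive combination of decoded features and restored residual
    (an assumption standing in for the trained post-processor)."""
    return x_d + x_r_hat

def lps_to_frame(lps, phase):
    """Convert an LPS frame back to the time domain with a given phase."""
    magnitude = np.exp(0.5 * lps)  # undo log(|S|^2)
    return np.fft.irfft(magnitude * np.exp(1j * phase), n=N_FFT)

rng = np.random.default_rng(0)
frame_o = rng.standard_normal(N_FFT)                   # original frame
frame_d = frame_o + 0.05 * rng.standard_normal(N_FFT)  # stand-in decoded frame
x_o, phase_o = lps_and_phase(frame_o)
x_d, phase_d = lps_and_phase(frame_d)
x_r_hat = x_o - x_d  # pretend the residual was restored without loss
x_o_est = estimate_original_lps(x_d, x_r_hat)
restored = lps_to_frame(x_o_est, phase_d)  # reuse the decoded signal's phase
```

A full decoder would apply this per frame and overlap-add the results; that step is omitted to keep the sketch short.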

The audio encoding apparatus (110) according to the present invention encodes residual error features using a neural network, vector quantizes them, and transmits them as additional information, and the audio decoding apparatus (120) post-processes the received additional information using a neural network, thereby providing backward compatibility with an existing codec and improving the sound quality of an audio signal decoded with the existing codec.

FIG. 2 is an example of the operation of an audio encoding apparatus and an audio decoding apparatus according to an embodiment of the present invention.

As shown in FIG. 2, the original signal may be input to the main codec encoder (111) and the feature extractor (114).

The main codec encoder (111) may encode the original signal and transmit the result to the main codec decoder (112) of the audio encoding apparatus (110) and the main codec decoder (121) of the audio decoding apparatus (120).

The main codec decoder (112) and the main codec decoder (121) may each decode the received bitstream and output the decoded signal.

The feature extractor (113) may extract the feature vector X_d of the decoded signal from acoustic features included in the decoded signal. The feature extractor (114) may extract the feature vector X_o of the original signal from acoustic features included in the original signal.

Here, the additional information encoder (115) may determine the residual error feature vector X_r, which is the difference between the feature vector X_o of the original signal and the feature vector X_d of the decoded signal. The additional information encoder (115) may then encode the residual error feature vector X_r to output a bitstream of additional information. For example, the neural network that the additional information encoder (115) uses to encode X_r may be a deep learning model having the structure (210). The output code vector of the additional information encoder (115) may be assigned to a representative code vector of the VQ codebook (220). The representative code vector may be the code vector, among the vectors in the VQ codebook (220), that is closest in distance to the encoder output. The distance between vectors may be computed using, for example, the Euclidean distance.
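As a non-limiting illustration, the residual computation and the nearest-code-vector assignment described above can be sketched as follows. The sign convention X_r = X_o - X_d, the function names, the toy values, and the identity mapping standing in for the neural network of structure (210) are all assumptions made for this sketch and are not taken from the specification.

```python
import numpy as np

def residual_feature(x_o: np.ndarray, x_d: np.ndarray) -> np.ndarray:
    """Residual error feature vector, assuming the convention X_r = X_o - X_d."""
    return x_o - x_d

def nearest_code_index(z: np.ndarray, codebook: np.ndarray) -> int:
    """Index of the codebook row closest to z in Euclidean distance."""
    distances = np.linalg.norm(codebook - z, axis=1)
    return int(np.argmin(distances))

# Toy values (hypothetical): 2-dimensional vectors and a 3-entry codebook.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
x_o = np.array([1.5, 0.5])    # feature vector of the original signal
x_d = np.array([0.6, -0.4])   # feature vector of the decoded signal

x_r = residual_feature(x_o, x_d)
z = x_r                       # identity stand-in for the encoder network (210)
index = nearest_code_index(z, codebook)
```

In this toy case the residual is closest to the second codebook entry, so only its index needs to be transmitted.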

Next, the additional information encoder (115) may quantize the encoded residual error feature vector to output the additional information bitstream. The additional information bitstream may include the codebook index (code vector index) of the additional information. The additional information encoder (115) may transmit the codebook (220) and the additional information bitstream to the additional information decoder (123).
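One possible way to carry codebook indices in a bitstream is fixed-length binary coding, sketched below. The specification does not state the bitstream format, so the fixed-length layout, the function names `pack_indices` and `unpack_indices`, and the codebook size of 8 are hypothetical.

```python
def bits_per_index(codebook_size: int) -> int:
    """Smallest number of bits that can address every codebook entry."""
    return max(1, (codebook_size - 1).bit_length())

def pack_indices(indices, codebook_size: int) -> str:
    """Write each index as a fixed-length binary field (string of '0'/'1')."""
    width = bits_per_index(codebook_size)
    return "".join(format(i, f"0{width}b") for i in indices)

def unpack_indices(bits: str, codebook_size: int):
    """Inverse of pack_indices: split into fixed-length fields and parse them."""
    width = bits_per_index(codebook_size)
    return [int(bits[k:k + width], 2) for k in range(0, len(bits), width)]

# Hypothetical example: three frames, codebook with 8 entries (3 bits per index).
side_info_bits = pack_indices([1, 5, 0], codebook_size=8)
recovered = unpack_indices(side_info_bits, codebook_size=8)
```

With this layout the side-information bit rate is the number of bits per index multiplied by the number of indices sent per second.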

The additional information decoder (123) may decode the bitstream of additional information received from the additional information encoder (115) of the audio encoding device (110) to restore the residual error feature vector. For example, the neural network that the additional information decoder (123) uses to decode the residual error feature vector X_r may be a deep learning model having the structure (230). Here, the additional information decoder (123) may restore the residual error feature vector using the code vectors of the codebook (220).
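The restoration described above can be pictured as a codebook lookup followed by a decoder network. In the sketch below, the function name `restore_residual`, the toy codebook, and the fixed linear map that stands in for the deep learning model of structure (230) are assumptions for illustration only.

```python
import numpy as np

def restore_residual(indices, codebook: np.ndarray, decoder_net) -> np.ndarray:
    """Look up code vectors by index, then run the decoder network on them."""
    code_vectors = codebook[np.asarray(indices)]   # shape: (frames, latent_dim)
    return decoder_net(code_vectors)

# Toy codebook with 3 entries of dimension 2 (hypothetical values).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])

# Stand-in for the network of structure (230): a fixed linear map from the
# 2-dimensional latent space to a 3-dimensional feature space.
W = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5]])
decoder_net = lambda z: z @ W

x_r_hat = restore_residual([1, 2], codebook, decoder_net)
```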

The feature extractor (122) may extract the feature vector X_d of the decoded signal from the acoustic features contained in the signal decoded by the main codec decoder (121).

The concatenation operator (201) may concatenate the feature vector X_d with the residual error feature vector restored by the additional information decoder (123) and input the result to the post-processing processor (124). The post-processing processor (124) may then use a deep learning module having the structure (240) to estimate the feature vector of the original signal from the concatenated input, and may output the estimated feature vector. The waveform restorer (202) may convert the estimated feature vector of the original signal into a time-domain representation and output it.
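The concatenation step can be illustrated with a short sketch. The vector values are hypothetical, and `toy_post_processor`, which simply adds the two halves of its input, is only a stand-in for the trained deep learning module of structure (240).

```python
import numpy as np

def toy_post_processor(v: np.ndarray) -> np.ndarray:
    """Stand-in for structure (240): add the first and second halves of v."""
    half = v.shape[-1] // 2
    return v[..., :half] + v[..., half:]

x_d = np.array([0.6, -0.4, 0.1])      # feature vector of the decoded signal
x_r_hat = np.array([0.9, 0.8, -0.2])  # restored residual error feature vector

post_input = np.concatenate([x_d, x_r_hat])   # input to the post-processor
x_o_hat = toy_post_processor(post_input)      # estimated original features
```

Concatenation doubles the input dimension and leaves it to the network to decide how the decoded features and the residual are combined, rather than fixing the combination to a plain sum as the stand-in does.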

The audio encoding device (110) and the audio decoding device (120) may perform end-to-end deep learning by jointly training the deep learning model that encodes the residual error feature vector in the additional information encoder (115), the deep learning model that restores the residual error feature vector in the additional information decoder (123), and the deep learning model that estimates the feature vector of the original signal in the post-processing processor (124). Joint training may prevent the errors of the additional information encoder (115), the additional information decoder (123), and the post-processing processor (124), each of which uses a deep learning model, from accumulating sequentially.

In addition, by performing end-to-end deep learning that jointly trains the deep learning model having the structure (210), the deep learning model having the structure (230), and the deep learning module having the structure (240), the audio encoding device (110) and the audio decoding device (120) may effectively train the code vectors that quantize the compressed latent vector, and may thereby extract additional information for improving sound quality during audio encoding. Specifically, the audio encoding device (110) and the audio decoding device (120) may optimize the additional information encoder (115), the additional information decoder (123), the codebook (220), and the post-processing processor (124) by training the models having the structures (210), (230), and (240) to minimize the loss function of Equation 2.
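For orientation only, the sketch below evaluates a loss of the commonly used VQ-VAE form: a mean squared error term plus a codebook term and a weighted commitment term. It is not a reproduction of Equation 2; the function name `joint_loss`, the weight `beta = 0.25`, and the toy values are assumptions. NumPy has no automatic differentiation, so the stop-gradient operators that distinguish the codebook and commitment terms during training are noted in comments only.

```python
import numpy as np

def joint_loss(x_o, x_o_hat, z_e, z_q, beta=0.25):
    """Scalar value of an MSE + VQ-VAE style loss (illustrative form only).

    x_o     : feature vector of the original signal
    x_o_hat : feature vector estimated by the post-processor
    z_e     : encoder output in the latent space
    z_q     : code vector assigned to z_e
    """
    reconstruction = np.mean((x_o - x_o_hat) ** 2)
    # In an autograd framework: ||sg[z_e] - z_q||^2, which updates the codebook.
    codebook_term = np.mean((z_e - z_q) ** 2)
    # In an autograd framework: ||z_e - sg[z_q]||^2, which updates the encoder.
    commitment_term = np.mean((z_e - z_q) ** 2)
    return reconstruction + codebook_term + beta * commitment_term

loss = joint_loss(
    x_o=np.array([1.0, 2.0]),
    x_o_hat=np.array([1.0, 4.0]),
    z_e=np.array([0.0, 0.0]),
    z_q=np.array([1.0, 1.0]),
)
```

Because all three terms are minimized together, the encoder, the codebook, and the post-processor are updated from the same objective, which is the sense in which the training is joint.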

FIG. 3 is an example of a performance evaluation of the output of an audio decoding device according to an embodiment of the present invention.

The upper table of FIG. 3 may show the performance of an audio decoding device using the NeroAAC codec, one of the MPEG-4 high-efficiency advanced audio coding (HE-AAC) v1 codecs (NeroAAC); of an audio decoding device that adds a post-processor to the NeroAAC codec (+PP only); and of the audio decoding device (120) using the NeroAAC codec as the main codec (Prop. (+0.6 kbps)). The tables of FIG. 3 are examples of performance measured using ITU-T Recommendation P.862.2 wideband perceptual evaluation of speech quality (PESQ), a standardized speech quality assessment tool.

Likewise, the lower table of FIG. 3 may show the performance of an audio decoding device using the QAAC codec (QAAC), of an audio decoding device that adds a post-processor to the QAAC codec (+PP only), and of the audio decoding device (120) using the QAAC codec as the main codec (Prop. (+0.6 kbps)).

As shown in FIG. 3, the audio encoding device (110) and the audio decoding device (120) according to an embodiment of the present invention may achieve a higher average PESQ score than a method that applies only a post-processing module to a main codec operating at a higher bit rate, even though the additional bit rate used is only about 0.6 kbps.

FIG. 4 is an example of a sound quality evaluation of the output of an audio decoding device according to an embodiment of the present invention.

Graph (410) may show two improvements relative to a signal decoded by an audio decoding device using the NeroAAC codec operated at 16 kbps: the quality improvement of a signal decoded by an audio decoding device in which a post-processor is added to the NeroAAC codec (+PP only), and the quality improvement of a signal decoded by the audio decoding device (120) using the NeroAAC codec as the main codec (Prop. (+0.6 kbps)).

Similarly, graph (420) may show two improvements relative to a signal decoded by an audio decoding device using the QAAC codec operated at 16 kbps: the quality improvement of a signal decoded by an audio decoding device in which a post-processor is added to the QAAC codec (+PP only), and the quality improvement of a signal decoded by the audio decoding device (120) using the QAAC codec as the main codec (Prop. (+0.6 kbps)).

Here, graph (410) and graph (420) may be results measured with the MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) test, one of the listening test methods used to evaluate the quality of codec output signals.

Graph (410) and graph (420) show that, compared with a signal decoded by an audio decoding device that applies only post-processing to the main codec, the signal decoded by the audio decoding device (120) according to an embodiment of the present invention improves by 9.73 points for NeroAAC and by 7.93 points for QAAC.

FIG. 5 is an example of spectrograms of signals output by an audio decoding device according to an embodiment of the present invention.

Spectrogram (510) of FIG. 5 may show the original signal (a), a signal decoded by the audio decoding device (120) using the NeroAAC codec as the main codec (b), a signal decoded by an audio decoding device that adds a post-processor to the NeroAAC codec (c), and a signal decoded by an existing audio decoding device using the NeroAAC codec (d).

Likewise, spectrogram (520) of FIG. 5 may show the original signal (a), a signal decoded by the audio decoding device (120) using the QAAC codec as the main codec (b), a signal decoded by an audio decoding device that adds a post-processor to the QAAC codec (c), and a signal decoded by an existing audio decoding device using the QAAC codec (d).

Spectrogram (510) and spectrogram (520) show that the high-frequency band, which is poorly restored in (c), is restored well in (b).

FIG. 6 is a flowchart illustrating an audio encoding method according to an embodiment of the present invention.

In step (610), the main codec encoder (111) may encode the original signal and output a bitstream of the main codec. The main codec encoder (111) may transmit the bitstream of the main codec to the audio decoding device (120).

In step (620), the main codec decoder (112) may decode the bitstream of the main codec output in step (610) and output a decoded signal.

In step (630), the feature extractor (113) may extract the feature vector X_d of the decoded signal from the acoustic features contained in the signal decoded in step (620).

In step (640), the feature extractor (114) may extract the feature vector X_o of the original signal from the acoustic features contained in the original signal.

In step (650), the additional information encoder (115) may determine the residual error feature vector X_r from the feature vector X_d of the decoded signal and the feature vector X_o of the original signal.

In step (660), the additional information encoder (115) may encode the residual error feature vector X_r to output a bitstream of additional information. To do so, the additional information encoder (115) may first map the residual error feature vector X_r to a latent space. Next, it may encode the residual error feature vector mapped to the latent space by assigning it to a code vector for vector quantization. It may then quantize the encoded residual error feature vector to output the additional information bitstream.
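Steps (610) to (660) can be read as the following pipeline. This is a minimal sketch under stated assumptions: the function `encode_audio`, its parameters, and every stand-in passed to it (rounding as the main codec, identity feature extraction, identity latent mapping, the toy codebook) are hypothetical placeholders for components that the specification realizes with an actual codec and trained networks.

```python
import numpy as np

def encode_audio(x, main_encode, main_decode, extract, side_encode, codebook):
    """Return the main codec bitstream and the side-information code index."""
    main_bits = main_encode(x)            # step 610: encode the original signal
    x_dec = main_decode(main_bits)        # step 620: decode the main bitstream
    x_d = extract(x_dec)                  # step 630: features of decoded signal
    x_o = extract(x)                      # step 640: features of original signal
    x_r = x_o - x_d                       # step 650: residual (assumed X_o - X_d)
    z = side_encode(x_r)                  # step 660: map to the latent space
    # Step 660, continued: assign the nearest code vector (Euclidean distance).
    index = int(np.argmin(np.linalg.norm(codebook - z, axis=1)))
    return main_bits, index

# Toy stand-ins (all hypothetical).
codebook = np.array([[0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])
main_bits, index = encode_audio(
    x=np.array([0.4, 1.6]),
    main_encode=np.round,                 # coarse quantizer as a "main codec"
    main_decode=lambda b: b,
    extract=lambda s: s,                  # identity feature extraction
    side_encode=lambda r: r,              # identity latent mapping
    codebook=codebook,
)
```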

FIG. 7 is a flowchart illustrating an audio decoding method according to an embodiment of the present invention.

In step (710), the main codec decoder (121) may receive the bitstream of the main codec from the main codec encoder (111) of the audio encoding device (110). The main codec decoder (121) may then decode the received bitstream of the main codec and output a decoded signal.

In step (720), the feature extractor (122) may extract the feature vector X_d of the decoded signal from the acoustic features contained in the signal decoded in step (710).

In step (730), the additional information decoder (123) may receive the bitstream of additional information from the additional information encoder (115) of the audio encoding device (110). The additional information decoder (123) may then decode the received bitstream of additional information to restore the residual error feature vector.

In step (740), the post-processing processor (124) may estimate the feature vector of the original signal from the feature vector X_d of the decoded signal and the residual error feature vector restored by the additional information decoder (123). Here, the post-processing processor (124) may estimate the feature vector of the original signal by combining the feature vector X_d with the restored residual error feature vector.

In step (750), the post-processing processor (124) may convert the estimated feature vector of the original signal into a time-domain representation and output it.
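Steps (710) to (750) can be sketched in the same style. The function `decode_audio`, its parameters, and all stand-ins (identity main decoder, identity feature extraction, identity side decoder, a post-processor that adds the two halves of its input, identity waveform restoration) are hypothetical and only show how the data flows between the components.

```python
import numpy as np

def decode_audio(main_bits, index, main_decode, extract, side_decode,
                 post_process, to_waveform, codebook):
    """Return the time-domain output reconstructed from both bitstreams."""
    x_dec = main_decode(main_bits)                   # step 710: main decoding
    x_d = extract(x_dec)                             # step 720: feature vector X_d
    x_r_hat = side_decode(codebook[index])           # step 730: restore residual
    x_o_hat = post_process(np.concatenate([x_d, x_r_hat]))  # step 740: estimate
    return to_waveform(x_o_hat)                      # step 750: time domain

def add_halves(v):
    """Stand-in post-processor: add the first and second halves of v."""
    half = v.shape[-1] // 2
    return v[:half] + v[half:]

codebook = np.array([[0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])
output = decode_audio(
    main_bits=np.array([0.0, 2.0]),
    index=1,
    main_decode=lambda b: b,
    extract=lambda s: s,
    side_decode=lambda c: c,
    post_process=add_halves,
    to_waveform=lambda f: f,
    codebook=codebook,
)
```

A decoder that ignores the side-information index would still produce the main codec output from `main_bits` alone, which reflects the backward compatibility noted in the description.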

The audio encoding device (110) of the present invention encodes residual error features using a neural network, vector-quantizes them, and transmits them as additional information. The audio decoding device (120) post-processes the received additional information using a neural network. This provides backward compatibility with existing codecs and can improve the sound quality of audio signals decoded by an existing codec.

In addition, the present invention performs end-to-end deep learning by jointly training the deep learning model that encodes the residual error feature vector in the additional information encoder (115), the deep learning model that restores the residual error feature vector in the additional information decoder (123), and the deep learning model that estimates the feature vector of the original signal in the post-processing processor (124). This may prevent the errors of the additional information encoder (115), the additional information decoder (123), and the post-processing processor (124), each of which uses a deep learning model, from accumulating sequentially.

Meanwhile, the method according to the present invention may be written as a computer-executable program and implemented on various recording media such as magnetic storage media, optical reading media, and digital storage media.

Implementations of the various techniques described herein may be realized in digital electronic circuitry, or in computer hardware, firmware, software, or combinations thereof. Implementations may be realized as a computer program product, that is, a computer program tangibly embodied in an information carrier, for example a machine-readable storage device (computer-readable medium) or a propagated signal, for processing by, or for controlling the operation of, a data processing apparatus such as a programmable processor, a computer, or multiple computers. A computer program, such as the computer program(s) described above, may be written in any form of programming language, including compiled or interpreted languages, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may be deployed to be processed on one computer or on multiple computers at one site, or distributed across multiple sites and interconnected by a communication network.

Processors suitable for processing a computer program include, by way of example, both general-purpose and special-purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory, a random access memory, or both. Elements of a computer may include at least one processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer may also include, or be coupled to receive data from, transmit data to, or both, one or more mass storage devices for storing data, for example magnetic disks, magneto-optical disks, or optical disks. Information carriers suitable for embodying computer program instructions and data include, by way of example, semiconductor memory devices; magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM (Compact Disk Read Only Memory) and DVD (Digital Video Disk); magneto-optical media such as floptical disks; ROM (Read Only Memory); RAM (Random Access Memory); flash memory; EPROM (Erasable Programmable ROM); and EEPROM (Electrically Erasable Programmable ROM). The processor and the memory may be supplemented by, or incorporated in, special-purpose logic circuitry.

In addition, a computer-readable medium may be any available medium that can be accessed by a computer, and may include both computer storage media and transmission media.

While this specification contains details of many specific implementations, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features described in this specification in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments, separately or in any suitable subcombination. Furthermore, although features may be described as acting in certain combinations and may even be initially claimed as such, one or more features from a claimed combination may in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.

Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various device components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and devices can generally be integrated together in a single software product or packaged into multiple software products.

Meanwhile, the embodiments of the present invention disclosed in this specification and the drawings merely present specific examples to aid understanding and are not intended to limit the scope of the present invention. It will be apparent to those of ordinary skill in the art to which the present invention pertains that, in addition to the embodiments disclosed herein, other modifications based on the technical idea of the present invention may be implemented.

110: Audio encoding device
111: Main codec encoder
112: Main codec decoder
113: Feature extractor
114: Feature extractor
115: Additional information encoder
120: Audio decoding device
121: Main codec decoder
122: Feature extractor
123: Additional information decoder
124: Post-processing processor

Claims

An audio signal encoding method comprising:
encoding an original signal to output a bitstream of a main codec;
decoding the bitstream of the main codec;
determining a residual error feature vector from a feature vector of the decoded signal and a feature vector of the original signal;
encoding the residual error feature vector to output a bitstream of additional information; and
performing end-to-end deep learning by jointly training an additional information encoder that encodes the residual error feature vector, an additional information decoder that decodes the bitstream of additional information, and a post-processing processor that estimates the feature vector of the original signal based on the bitstream of the main codec and the bitstream of additional information.

The audio signal encoding method of claim 1, wherein outputting the bitstream of additional information comprises:
mapping the residual error feature vector to a latent space;
encoding the residual error feature vector mapped to the latent space by assigning it to a code vector for vector quantization; and
quantizing the encoded residual error feature vector to output the bitstream of additional information.

The audio signal encoding method of claim 1, wherein the additional information encoder that encodes the residual error feature vector is trained according to a loss function determined from a loss due to encoding by the additional information encoder, a loss due to vector quantization by the additional information decoder that decodes the bitstream of additional information, and a difference between the feature vector of the original signal and the feature vector of the original signal estimated from the bitstream of the main codec and the bitstream of additional information.

(Deleted)

The audio signal encoding method of claim 1, wherein the training comprises training the additional information encoder, the additional information decoder, and the post-processing processor using a loss function based on a mean squared error (MSE) function and a loss function of a Vector Quantized Variational AutoEncoder (VQ-VAE).

The audio signal encoding method of claim 1, further comprising:
extracting the feature vector of the decoded signal from acoustic features included in the decoded signal; and
extracting the feature vector of the original signal from acoustic features included in the original signal.

An audio signal decoding method comprising:
receiving a bitstream of a main codec and a bitstream of additional information;
decoding the bitstream of the main codec;
extracting a feature vector of the decoded signal from acoustic features included in the decoded signal;
decoding the bitstream of additional information to restore a residual error feature vector;
estimating a feature vector of an original signal from the feature vector of the decoded signal and the residual error feature vector; and
performing end-to-end deep learning by jointly training an additional information decoder that decodes the bitstream of additional information, a post-processing processor that estimates the feature vector of the original signal based on the bitstream of the main codec and the bitstream of additional information, and an additional information encoder that encodes the residual error feature vector in an encoding device.

The audio signal decoding method of claim 7, wherein estimating the feature vector of the original signal comprises estimating the feature vector of the original signal by combining the feature vector of the decoded signal and the residual error feature vector.

The audio signal decoding method of claim 7, further comprising:
converting the estimated feature vector of the original signal into a time-domain representation and outputting it.

(Deleted)

A computer-readable recording medium having recorded thereon a program for executing the method of any one of claims 1 to 3 and claims 5 to 9.

An audio signal encoding device comprising:
a main codec encoder that encodes an original signal to output a bitstream of a main codec;
a main codec decoder that decodes the bitstream of the main codec; and
an additional information encoder that determines a residual error feature vector from a feature vector of the decoded signal and a feature vector of the original signal, and encodes the residual error feature vector to output a bitstream of additional information,
wherein the audio signal encoding device performs end-to-end deep learning by jointly training the additional information encoder that encodes the residual error feature vector, an additional information decoder that decodes the bitstream of additional information, and a post-processing processor that estimates the feature vector of the original signal based on the bitstream of the main codec and the bitstream of additional information.

The audio signal encoding device of claim 12, wherein the additional information encoder maps the residual error feature vector to a latent space, encodes the residual error feature vector mapped to the latent space by assigning it to a code vector for vector quantization, and quantizes the encoded residual error feature vector to output the bitstream of additional information.

The audio signal encoding device of claim 12, wherein the additional information encoder is trained according to a loss function determined from a loss due to encoding by the additional information encoder, a loss due to vector quantization by the additional information decoder, and a difference between the feature vector of the original signal and the feature vector of the original signal estimated from the bitstream of the main codec and the bitstream of additional information.

(Deleted)

The audio signal encoding device of claim 12, wherein the additional information encoder is trained together with the additional information decoder and the post-processing processor using a loss function based on a mean squared error (MSE) function and a loss function of a Vector Quantized Variational AutoEncoder (VQ-VAE).