Yet Another Generative Model For Room Impulse Response Estimation

Sungho Lee¹, Hyeong-Seok Choi⁴, and Kyogu Lee¹,²,³,⁴
¹Department of Intelligence and Information, ²IPAI, ³AI Institute, Seoul National University, ⁴Supertone, Inc.

Abstract

Recent neural room impulse response (RIR) estimators typically comprise an encoder for reference audio analysis and a generator for RIR synthesis. In particular, the performance of the generator directly influences the overall estimation quality. In this context, we explore an alternative generator architecture for improved performance. We first train an autoencoder with residual quantization to learn a discrete latent token space, where each token represents a small time-frequency patch of the RIR. Then, we cast the RIR estimation problem as a reference-conditioned autoregressive token generation task, employing transformer variants that operate across the frequency, time, and quantization depth axes. This way, we address both the standard blind estimation task and the additional acoustic matching problem, which aims to find an RIR that matches the source signal to the target signal's reverberation characteristics. Experimental results show that our system is preferable to the baselines across various evaluation metrics.

Audio Samples

Autoencoding
Analysis-Synthesis
Blind Estimation
Acoustic Matching (Same Speaker)
Acoustic Matching (Different Speaker)

Figures

The Proposed Framework. Left: discrete representation learning with RQ-VAE. We first transform an input RIR into time-frequency features. Then, we train an autoencoder to reconstruct these features while applying residual quantization at the bottleneck. Right: token generation with axial transformers. Since each RIR is now represented as discrete tokens, we can model the tokens autoregressively. Specifically, we use multiple transformers that operate along the frequency, time, and depth axes. To estimate the RIR tokens from reference audio, we employ a reference encoder and condition the transformers on its output.
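
As a rough illustration of the residual quantization step mentioned in the caption, the following minimal sketch quantizes a single latent vector with a stack of codebooks, one per depth level. It is not the authors' implementation: the latent dimension, the number of depth levels, the codebook size, the randomly drawn codebooks, and the zero entry added to each codebook are all assumptions made for this example, whereas an RQ-VAE learns its codebooks jointly with the encoder and decoder.

```python
import random

def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

def residual_quantize(vector, codebooks):
    """Quantize `vector` with one codebook per depth level.

    Returns the token index chosen at each depth and the reconstruction,
    which is the sum of the selected code vectors.
    """
    residual = list(vector)
    reconstruction = [0.0] * len(vector)
    tokens = []
    for codebook in codebooks:
        index = nearest_code(residual, codebook)
        code = codebook[index]
        tokens.append(index)
        # Accumulate the chosen code and pass what is left to the next depth.
        reconstruction = [r + c for r, c in zip(reconstruction, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return tokens, reconstruction

def sq_error(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical sizes: 4-dimensional latent, 3 depth levels, 8 codes per level.
rng = random.Random(0)
DIM, DEPTH, CODES = 4, 3, 8
codebooks = []
for level in range(DEPTH):
    scale = 0.5 ** level  # assumption: deeper codebooks cover finer residuals
    book = [[rng.gauss(0.0, scale) for _ in range(DIM)] for _ in range(CODES)]
    book[0] = [0.0] * DIM  # assumption: a zero code keeps the error from increasing
    codebooks.append(book)

latent = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
tokens, reconstruction = residual_quantize(latent, codebooks)
print("tokens per depth:", tokens)
print("squared error:", sq_error(latent, reconstruction))
```

In this sketch, one latent vector stands in for one time-frequency patch, so each patch yields `DEPTH` tokens. This is why the token grid gains a depth axis in addition to the frequency and time axes.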

Discrete Representation Learning via RQ-VAE
Token Generation with Axial Transformers
Further Details on Experiments & Evaluations