Differentiable Artificial Reverberation

Sungho Lee, Hyeong-Seok Choi, and Kyogu Lee
Music and Audio Research Group, Seoul National University, Seoul, Republic of Korea

Abstract

Artificial reverberation (AR) models play a central role in various audio applications. Therefore, estimating the AR model parameters (ARPs) of a reference reverberation is a crucial task. Although a few recent deep-learning-based approaches have shown promising performance, their non-end-to-end training scheme prevents them from fully exploiting the potential of deep neural networks. This motivates the introduction of differentiable artificial reverberation (DAR) models, which allow loss gradients to be back-propagated end-to-end. However, implementing the AR models with their difference equations "as is" in a deep learning framework severely bottlenecks the training speed on a parallel processor such as a GPU, because their infinite impulse response (IIR) components must be computed recursively, one sample at a time. We tackle this problem by replacing the IIR filters with finite impulse response (FIR) approximations obtained with the frequency-sampling method. Using this technique, we implement three DAR models: differentiable Filtered Velvet Noise (FVN), Advanced Filtered Velvet Noise (AFVN), and Delay Network (DN). For each AR model, we train its ARP estimation networks for the analysis-synthesis (RIR-to-ARP) and blind estimation (reverberant-speech-to-ARP) tasks in an end-to-end manner with its DAR model counterpart. Experimental results show that the proposed method achieves consistent performance improvements over the non-end-to-end approaches in both objective metrics and subjective listening tests.
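
The frequency-sampling idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the one-pole filter H(z) = 1 / (1 - a z^-1), the function names, and the naive inverse DFT are all assumptions chosen for brevity, and a practical version would use an FFT inside a deep learning framework. The sketch samples the IIR frequency response at N points on the unit circle and inverts it, which yields a length-N FIR filter equal to the time-aliased impulse response. The approximation is therefore accurate only when the true response has decayed well within N samples.

```python
import cmath
import math


def one_pole_frequency_response(a, num_points):
    """Sample H(z) = 1 / (1 - a z^-1) at num_points equally spaced
    points on the unit circle (hypothetical example filter)."""
    response = []
    for k in range(num_points):
        z_inv = cmath.exp(-2j * math.pi * k / num_points)
        response.append(1.0 / (1.0 - a * z_inv))
    return response


def inverse_dft(spectrum):
    """Naive O(N^2) inverse DFT that returns the real part of each sample.
    An FFT would replace this in practice."""
    num_points = len(spectrum)
    signal = []
    for n in range(num_points):
        acc = 0j
        for k, value in enumerate(spectrum):
            acc += value * cmath.exp(2j * math.pi * k * n / num_points)
        signal.append((acc / num_points).real)
    return signal


def frequency_sampled_fir(a, num_points):
    """Length-num_points FIR approximation of the one-pole IIR filter."""
    return inverse_dft(one_pole_frequency_response(a, num_points))


def true_impulse_response(a, length):
    """First `length` samples of the exact IIR impulse response a^n."""
    return [a ** n for n in range(length)]


def max_abs_error(a, num_points):
    """Largest sample-wise gap between the FIR approximation and the truth."""
    fir = frequency_sampled_fir(a, num_points)
    ref = true_impulse_response(a, num_points)
    return max(abs(x - y) for x, y in zip(fir, ref))


if __name__ == "__main__":
    # A long FIR captures the decay; a short one suffers from time aliasing.
    print("N = 256:", max_abs_error(0.9, 256))
    print("N = 16: ", max_abs_error(0.9, 16))
```

Because every frequency bin is computed independently, this formulation has no sample-by-sample recursion, which is the property that makes it suitable for parallel hardware.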

Audio Samples

Filtered Velvet Noise (FVN)
Advanced Filtered Velvet Noise (AFVN)
Delay Network (DN)
DNN Decoder Baseline (DNN)

Figure

The proposed ARP estimation framework. With the DAR models, loss gradients ∂L/∂P1, ···, ∂L/∂Pn can be back-propagated through the DAR models. Hence, the ARP estimation network can be trained in an end-to-end manner. We can train the network to perform analysis-synthesis (RIR h as input), blind estimation (reverberant speech h*x as input), or both. The network has an AR-model-agnostic encoder, so using a different DAR model only requires changing the tiny projection layers Proj1, ···, Projn. Each DAR model generates an FIR approximation of its original AR model's IIR. After training, the estimated ARPs can be plugged into the AR model, which is highly efficient and controllable in real time on a CPU.
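
The split between a shared encoder and per-parameter projection layers described in the caption above can be sketched as follows. This is an illustration under stated assumptions, not the authors' architecture: the latent vector stands in for the encoder output, the parameter names and sizes ("gains", "decays", "feedback") are hypothetical, and the linear-plus-sigmoid head is one plausible way to map a latent vector to bounded parameters.

```python
import math
import random


def make_linear(in_dim, out_dim, rng):
    """Create a small randomly initialized linear layer (weights, bias)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def apply_linear(layer, x):
    """Compute W x + b for a layer produced by make_linear."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def sigmoid(value):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def build_projection_heads(latent_dim, parameter_sizes, rng):
    """One projection head per ARP group; only these depend on the AR model."""
    return {name: make_linear(latent_dim, size, rng)
            for name, size in parameter_sizes.items()}


def project(heads, latent):
    """Map the shared latent vector to one bounded vector per ARP group."""
    return {name: [sigmoid(v) for v in apply_linear(layer, latent)]
            for name, layer in heads.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    latent = [rng.uniform(-1.0, 1.0) for _ in range(8)]  # stand-in encoder output

    # Hypothetical parameter layouts for two different AR models.
    model_a_heads = build_projection_heads(8, {"gains": 4, "decays": 4}, rng)
    model_b_heads = build_projection_heads(8, {"feedback": 6}, rng)

    # The same latent vector feeds both sets of heads.
    print(project(model_a_heads, latent))
    print(project(model_b_heads, latent))
```

In this sketch, switching AR models means building a different set of heads while the latent vector, and hence the encoder that would produce it, is reused unchanged.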