Blind Estimation of Audio Processing Graph

Sungho Lee¹, Jaehyun Park¹, Seungryeol Paik¹, and Kyogu Lee¹,²,³
¹Department of Intelligence and Information, ²Interdisciplinary Program in Artificial Intelligence, ³Artificial Intelligence Institute, Seoul National University

Abstract

Musicians and audio engineers sculpt and transform their sounds by connecting multiple processors, forming an audio processing graph. However, most deep-learning methods overlook this real-world practice and assume fixed graph settings. To bridge this gap, we develop a system that reconstructs the entire graph from a given reference audio. We first generate a realistic graph-reference pair dataset and train a simple blind estimation system composed of a convolutional reference encoder and a transformer-based graph decoder. We apply our framework to singing voice effects and drum mixing estimation tasks; evaluation results show that our method can reconstruct complex signal routings, including multi-band processing and sidechaining.
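
As a concrete illustration of the graph-reference pair generation, here is a minimal sketch: sample a random processor graph, then render a dry source through it to obtain the paired reference. The processor set, parameter ranges, and serial-chain topology below are illustrative stand-ins, not the paper's actual configuration, which uses a richer processor vocabulary and general graph topologies (including multi-band splits and sidechains).

```python
import numpy as np

# Hypothetical two-processor vocabulary; the real dataset uses a much
# larger set (EQ, compressor, reverb, etc.) and general DAG topologies.
def gain(x, db):
    return x * 10.0 ** (db / 20.0)

def onepole_lowpass(x, coeff):
    y = np.zeros_like(x)
    y[0] = (1.0 - coeff) * x[0]
    for n in range(1, len(x)):
        y[n] = coeff * y[n - 1] + (1.0 - coeff) * x[n]
    return y

PROCESSORS = {"gain": gain, "lowpass": onepole_lowpass}

def sample_random_graph(rng, num_nodes=3):
    """Sample a serial chain as a stand-in for a general processing graph."""
    graph = []
    for _ in range(num_nodes):
        ptype = str(rng.choice(list(PROCESSORS)))
        params = ({"db": float(rng.uniform(-12.0, 12.0))} if ptype == "gain"
                  else {"coeff": float(rng.uniform(0.5, 0.99))})
        graph.append((ptype, params))
    return graph

def render(graph, x):
    """Render the dry source through every processor in the chain."""
    for ptype, params in graph:
        x = PROCESSORS[ptype](x, **params)
    return x

rng = np.random.default_rng(0)
dry = rng.standard_normal(16000)   # stand-in for a dry recording
graph = sample_random_graph(rng)   # ground-truth graph label
reference = render(graph, dry)     # paired reference audio
```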

Audio Samples

Singing Voice Effect Estimation
on seen speakers: 1-10 11-20 21-30 31-40
on unseen speakers: 1-10 11-20 21-30 31-40

Drum Mixing Estimation
on seen kits: 1-10 11-20 21-30 31-40
on unseen kits: 1-10 11-20 21-30 31-40

Figures

[Figure: blind-estimation (overview of the proposed system)]

The proposed blind estimation system. A reference encoder first encodes the stereo reference audio into a latent vector. A prototype decoder then autoregressively decodes the target graph's categorical variables (node types and edge types), and a parameter estimator fills in the remaining parameter attributes of the nodes (processors) and edges (connections). Evaluation results show that this two-stage decoding (prototype decoding followed by parameter estimation) outperforms a simple single-stage approach.
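
Below is a minimal PyTorch sketch of this two-stage decoding, assuming an encoder-only transformer with a causal mask for stage 1. All sizes and vocabularies are illustrative, and only node types are decoded here, whereas the actual system also decodes edge types and uses the tokenized graph input described next.

```python
import torch
import torch.nn as nn

# All sizes and vocabularies below are assumptions for illustration.
DIM, NUM_NODE_TYPES, MAX_NODES = 256, 16, 8

def causal_mask(sz):
    # Disallow attention to future positions (autoregressive decoding).
    return torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)

class PrototypeDecoder(nn.Module):
    """Stage 1: autoregressively decode categorical node types."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_NODE_TYPES + 1, DIM)  # +1: start token
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, NUM_NODE_TYPES)

    def forward(self, latent, tokens):
        x = torch.cat([latent, self.embed(tokens)], dim=1)
        return self.head(self.body(x, mask=causal_mask(x.size(1))))

class ParameterEstimator(nn.Module):
    """Stage 2: regress continuous parameters for each decoded node."""
    def __init__(self, num_params=4):
        super().__init__()
        self.embed = nn.Embedding(NUM_NODE_TYPES, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, num_params)

    def forward(self, latent, node_types):
        x = torch.cat([latent, self.embed(node_types)], dim=1)
        return self.head(self.body(x))[:, latent.size(1):]

latent = torch.randn(1, 1, DIM)              # from the reference encoder
decoder, estimator = PrototypeDecoder(), ParameterEstimator()

# Stage 1: greedy autoregressive decoding of the prototype graph.
tokens = torch.full((1, 1), NUM_NODE_TYPES)  # start-of-graph token id
for _ in range(MAX_NODES):
    logits = decoder(latent, tokens)
    nxt = logits[:, -1:].argmax(dim=-1)      # next node type
    tokens = torch.cat([tokens, nxt], dim=1)

# Stage 2: fill in the parameters of the decoded prototype.
params = estimator(latent, tokens[:, 1:])    # one parameter vector per node
```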


[Figure: tokengt-framework (prototype decoder and parameter estimator)]

The proposed prototype decoder and parameter estimator. These two modules are realized as a single transformer encoder model that takes node tokens and edge tokens as input (the tokenized graph transformer, TokenGT, approach). We concatenate i) the latent token from the reference encoder, ii) a task token that differentiates the two tasks, iii) an optional start-of-graph token for autoregressive graph decoding, and iv) the prototype graph tokens. For prototype decoding, we apply a causal attention mask.
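
To make the input assembly concrete, here is a short sketch of the token sequence and causal mask, following the ordering in the figure; the model width and the random-tensor stand-ins for learned embeddings are assumptions.

```python
import torch

DIM = 256  # assumed model width

latent = torch.randn(1, 1, DIM)          # i) latent token (reference encoder)
task_embed = torch.nn.Embedding(2, DIM)  # ii) singing voice vs. drum mixing
sog = torch.randn(1, 1, DIM)             # iii) start-of-graph token
graph_tokens = torch.randn(1, 10, DIM)   # iv) node + edge (prototype) tokens

task = task_embed(torch.tensor([[1]]))   # select the drum mixing task
seq = torch.cat([latent, task, sog, graph_tokens], dim=1)

# Causal attention mask: each position attends only to itself and earlier
# positions, so the encoder-only transformer decodes autoregressively.
L = seq.size(1)
causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
```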