grafx.processors.core

class FIRConvolution(mode='causal', flashfftconv=True, max_input_len=131072)

Bases: Module

A FIR convolution backend, which can use either native FFT-based convolution or FlashFFTConv [FKNRe23]. Allows for causal and zero-phase convolution modes.

For an input \(\mathbf{U}\in\mathbb{R}^{B\times C_{\mathrm{in}} \times L_{\mathrm{in}}}\) and a filter \(\mathbf{H}\in\mathbb{R}^{B\times C_{\mathrm{filter}} \times L_{\mathrm{filter}}}\), the operation is defined as a standard convolution. However, the output length equals the input length \(L_{\mathrm{in}}\), and the number of output channels is determined by broadcasting.
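As an illustration of the output-shape convention, the sketch below implements the causal case with plain Python lists instead of tensors. It is not the library's FFT-based implementation: the names `causal_fir_conv` and `broadcast_channels` are hypothetical, and reusing a single-channel operand for every output channel is an assumption about how the broadcasting behaves.

```python
def causal_fir_conv(u, h):
    """Causal convolution of signal u with FIR h, keeping only the first len(u) samples."""
    y = [0.0] * len(u)
    for n in range(len(u)):
        # Only taps that look at the present or past input samples contribute.
        for m in range(min(n + 1, len(h))):
            y[n] += h[m] * u[n - m]
    return y


def broadcast_channels(signals, firs):
    """Convolve per channel; a single-channel operand is reused for every output channel (assumed)."""
    c_in, c_filter = len(signals), len(firs)
    assert c_in == c_filter or 1 in (c_in, c_filter), "channels must match or be 1"
    c_out = max(c_in, c_filter)
    return [
        causal_fir_conv(signals[c % c_in], firs[c % c_filter])
        for c in range(c_out)
    ]


signals = [[1.0, 2.0, 3.0, 4.0]]   # C_in = 1, L_in = 4
firs = [[1.0, 0.5], [0.0, 1.0]]    # C_filter = 2, L_filter = 2
outputs = broadcast_channels(signals, firs)
```

Here `outputs` has two channels of four samples each, matching \(C_\mathrm{out} = \max(C_\mathrm{in}, C_\mathrm{filter})\) and the input length.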

Parameters:

mode (str, optional) – The convolution mode, either "causal" or "zerophase" (default: "causal").
flashfftconv (bool, optional) – An option to use FlashFFTConv as a backend (default: True).
max_input_len (int, optional) – When flashfftconv is set to True, the maximum input length must also be given (default: 2**17).

forward(input_signals, fir)

Performs the convolution operation.

Parameters:

input_signals (FloatTensor, \(B \times C_\mathrm{in} \times L_\mathrm{in}\)) – A batch of input audio signals.
fir (FloatTensor, \(B \times C_\mathrm{filter} \times L_\mathrm{filter}\)) – A batch of FIR filters.

Returns:

A batch of convolved signals of shape \(B \times C_\mathrm{out} \times L_\mathrm{in}\) where \(C_\mathrm{out} = \max (C_\mathrm{in}, C_\mathrm{filter})\).

Return type:

FloatTensor

class TriangularFilterBank(num_frequency_bins, num_filters=50, scale='bark_traunmuller', f_min=40, f_max=None, sr=44100, low_half_triangle=True)

Bases: Module

Creates a triangular filterbank for the given frequency range. Code adapted from torchaudio and Diff-MST.

We provide both analysis and synthesis modes. In the synthesis mode, we expand the input energy \(\mathbf{E}_\mathrm{fb} \in \mathbb{R}^{B \times F_{\mathrm{fb}}}\), where \(F_\mathrm{fb}\) is the number of filters, to the linear FFT scale \(\mathbf{E} \in \mathbb{R}^{B \times F}\).

\[ \mathbf{E} = \mathbf{E}_\mathrm{fb} \mathbf{W}_\mathrm{fb}^\top \]

\(\smash{\mathbf{W}_\mathrm{fb} \in \mathbb{R}^{F \times F_{\mathrm{fb}}}}\) is the standard triangular filterbank matrix. The analysis mode downsamples the frequency axis by multiplying the energy by the normalized filterbank matrix (the weights of each filter sum to 1, so the operation acts as an adaptive weighted average pooling).
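The two modes can be sketched on a toy matrix as follows. This is a plain-Python illustration under stated assumptions, not the library's code: the \(5 \times 2\) matrix `W` is made up by hand, the helper names `synthesis` and `analysis` are hypothetical, and a single energy vector stands in for the batch.

```python
# Toy filterbank matrix W with F = 5 frequency bins (rows) and F_fb = 2 filters (columns).
W = [
    [0.5, 0.0],
    [1.0, 0.0],
    [0.5, 0.5],
    [0.0, 1.0],
    [0.0, 0.5],
]


def synthesis(energy_fb, W):
    """Expand F_fb band energies to F linear bins: E[f] = sum_j E_fb[j] * W[f][j]."""
    return [sum(e * w for e, w in zip(energy_fb, row)) for row in W]


def analysis(energy, W):
    """Pool F linear bins into F_fb bands, with each filter normalized to sum to 1."""
    num_filters = len(W[0])
    pooled = []
    for j in range(num_filters):
        column = [row[j] for row in W]
        total = sum(column)
        # Weighted average of the bins covered by filter j.
        pooled.append(sum(e * w / total for e, w in zip(energy, column)))
    return pooled


expanded = synthesis([1.0, 3.0], W)
pooled = analysis([2.0, 2.0, 2.0, 2.0, 2.0], W)
```

Because the analysis weights sum to 1 per filter, a flat input spectrum is pooled to the same flat value in every band.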

Parameters:

num_frequency_bins (int) – Number of frequency bins from linear FFT.
num_filters (int, optional) – Number of the filterbank filters (default: 50).
scale (str, optional) – Frequency scale to use: "bark_traunmuller", "bark_schroeder", "bark_wang", "mel_htk", "mel_slaney", "linear", "log" (default: "bark_traunmuller").
f_min (float, optional) – Minimum frequency (default: 40).
f_max (float, optional) – Maximum frequency (default: None).
sr (int, optional) – Sampling rate of the signal (default: 44100).
low_half_triangle (bool, optional) – Attach the remaining low-frequency part (default: True).

forward(energy, mode='synthesis')

Apply the filterbank to the energy tensor.

Parameters:

energy (FloatTensor, \(B \times F\)) – A batch of energy tensors.
mode (str, optional) – Mode of operation: "analysis" or "synthesis" (default: "synthesis").

Returns:

The energy tensor after applying the filterbank.

Return type:

FloatTensor

static compute_matrix(num_frequency_bins, num_filters, scale, f_min, f_max, sr, low_half_triangle): Compute the triangular filterbank matrix \(\smash{\mathbf{W}_\mathrm{fb} \in \mathbb{R}^{F \times F_{\mathrm{fb}}}}\).

class IIRFilter(order=2, backend='fsm', flashfftconv=True, fsm_fir_len=4000, fsm_max_input_len=131072, fsm_regularization=False)

Bases: Module

A serial stack of second-order filters (biquads) with the given coefficients.

The transfer function of the \(K\) stacked biquads \(H(z)\) is given as [Smi07b]

\[ H(z) = \prod_{i=1}^K H_i(z) = \prod_{i=1}^K \frac{ b_{i, 0} + b_{i, 1} z^{-1} + b_{i, 2} z^{-2}}{a_{i, 0} + a_{i, 1} z^{-1} + a_{i, 2} z^{-2}}. \]

We provide three backends for the filtering. The first one, "lfilter", is the time-domain method that computes the difference equation exactly. It uses torchaudio.functional.lfilter, which uses the direct form I implementation (the bar denotes coefficients normalized by \(a_{i, 0}\)) [YMC+24].

\[\begin{split} x_i[n] &= \bar{b}_{i, 0} s[n] + \bar{b}_{i, 1} s[n-1] + \bar{b}_{i, 2} s[n-2], \\ y_i[n] &= x_i[n] - \bar{a}_{i, 1} y_i[n-1] - \bar{a}_{i, 2} y_i[n-2]. \end{split}\]
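The two-step recursion above can be written out directly. The following is a minimal plain-Python sketch, assuming zero initial conditions and one list of three coefficients per biquad; the names `biquad_df1` and `biquad_cascade` are hypothetical and this is not how torchaudio implements the filter internally.

```python
def biquad_df1(s, b, a):
    """Filter signal s with one biquad using the two-step direct form I recursion."""
    # Normalize all coefficients by a_0 (the barred coefficients in the equations).
    b0, b1, b2 = (c / a[0] for c in b)
    a1, a2 = a[1] / a[0], a[2] / a[0]
    x, y = [], []
    for n in range(len(s)):
        s1 = s[n - 1] if n >= 1 else 0.0
        s2 = s[n - 2] if n >= 2 else 0.0
        # Feedforward part.
        x.append(b0 * s[n] + b1 * s1 + b2 * s2)
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        # Feedback part.
        y.append(x[n] - a1 * y1 - a2 * y2)
    return y


def biquad_cascade(s, Bs, As):
    """Apply K biquads in series; the output of one stage feeds the next."""
    for b, a in zip(Bs, As):
        s = biquad_df1(s, b, a)
    return s


impulse = [1.0, 0.0, 0.0, 0.0]
response = biquad_cascade(impulse, Bs=[[2.0, 0.0, 0.0]], As=[[2.0, -1.0, 0.0]])
```

After normalization by \(a_{i,0} = 2\), this example is a one-pole filter with \(\bar{a}_{i,1} = -0.5\), so the impulse response halves at every sample.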

The second one, "fsm", is the frequency-sampling method (FSM) that approximates the filter with a finite impulse response (FIR) by sampling the discrete-time Fourier transform (DTFT) of the filter \(H(e^{j\omega})\) at a finite number of points \(N\) uniformly [KPE20, RGM70].

\[ H_N[k] = \prod_{i=1}^K (H_i)_N[k] = \prod_{i=1}^K \frac{b_{i, 0} + b_{i, 1} z_N^{-k} + b_{i, 2} z_N^{-2k}}{a_{i, 0} + a_{i, 1} z_N^{-k} + a_{i, 2} z_N^{-2k}}. \]

Here, \(z_N = \exp(j\cdot 2\pi/N)\) so that \(z_N^k\) is the \(k\)-th \(N\)-point discrete Fourier transform (DFT) bin. The FIR filter \(h_N[n]\) is then obtained by taking the inverse DFT of the sampled DTFT \(H_N[k]\), and the final output signal is computed by convolving the input signal with the FIR filter, \(y[n] = h_N[n] * s[n]\). The "fsm" backend is faster than "lfilter" but only approximates the true filter. The approximation error is known as time-domain aliasing; the frequency-sampled FIR is given as follows [Smi07a].

\[ h_N[n] = \sum_{m=0}^\infty h[n+mN], \]

where \(h[n]\) is the true infinite impulse response (IIR). Since the impulse response of a stable filter decays, increasing the number of samples \(N\) reduces the error.
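The aliasing relation can be checked numerically. The sketch below is a plain-Python illustration, not the library's implementation: it uses a naive \(O(N^2)\) inverse DFT instead of an FFT, truncates the infinite sum after a fixed number of periods, and all function names and coefficient values are hypothetical.

```python
import cmath
import math


def biquad_recursion(s, b, a):
    """Exact time-domain biquad (difference equation with a_0 normalization)."""
    b0, b1, b2 = (c / a[0] for c in b)
    a1, a2 = a[1] / a[0], a[2] / a[0]
    y = []
    for n in range(len(s)):
        acc = b0 * s[n]
        if n >= 1:
            acc += b1 * s[n - 1] - a1 * y[n - 1]
        if n >= 2:
            acc += b2 * s[n - 2] - a2 * y[n - 2]
        y.append(acc)
    return y


def fsm_fir(Bs, As, N):
    """Sample the cascade response at the N DFT bins and return its inverse DFT."""
    H = []
    for k in range(N):
        z_inv = cmath.exp(-2j * math.pi * k / N)  # z_N^{-k}
        Hk = 1.0 + 0j
        for b, a in zip(Bs, As):
            Hk *= (b[0] + b[1] * z_inv + b[2] * z_inv**2) / (
                a[0] + a[1] * z_inv + a[2] * z_inv**2
            )
        H.append(Hk)
    # Naive inverse DFT; the imaginary part is numerical noise for real coefficients.
    return [
        sum(H[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)).real / N
        for n in range(N)
    ]


def aliased_impulse_response(Bs, As, N, num_periods=40):
    """Approximate sum_m h[n + mN] by truncating the sum after num_periods terms."""
    h = [1.0] + [0.0] * (N * num_periods - 1)
    for b, a in zip(Bs, As):
        h = biquad_recursion(h, b, a)
    aliased = [sum(h[n + m * N] for m in range(num_periods)) for n in range(N)]
    return aliased, h


# Two stable biquads (pole radii of about 0.85 and 0.3).
Bs = [[0.2, 0.3, 0.1], [1.0, -0.5, 0.0]]
As = [[1.0, -1.2, 0.72], [1.0, -0.3, 0.0]]

alias_gaps, errors = {}, {}
for N in (16, 64):
    h_fsm = fsm_fir(Bs, As, N)
    h_alias, h_true = aliased_impulse_response(Bs, As, N)
    # Gap between the FSM FIR and the time-aliased true response (should be tiny).
    alias_gaps[N] = max(abs(f - g) for f, g in zip(h_fsm, h_alias))
    # Deviation of the FSM FIR from the first N samples of the true response.
    errors[N] = max(abs(f - t) for f, t in zip(h_fsm, h_true))
```

In this setup, `alias_gaps` stays at the level of floating-point noise for both lengths, while `errors` shrinks as \(N\) grows from 16 to 64.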

The third one, "ssm", is based on the diagonalisation of the state-space model (SSM) of the biquad filter, so it only works for second-order filters. The idea is based on Ben Hayes’s derivation of the associative scan for parallel IIR filter computation and was implemented by Chin-Yun Yu. The direct form II implementation of the biquad filter can be written in state-space form [Smi07b] as

\[\begin{split} x_i[n+1] &= A_i x_i[n] + B_i s[n], \\ y_i[n] &= C_i x_i[n] + \bar{b}_{i, 0} s[n], \end{split}\]

where the state-space matrices are given as

\[\begin{split} A_i = \begin{bmatrix}-\bar{a}_{i, 1} & -\bar{a}_{i, 2} \\ 1 & 0 \end{bmatrix}, \quad B_i &= \begin{bmatrix}1 \\ 0 \end{bmatrix}, \quad C_i = \begin{bmatrix}\bar{b}_{i, 1} - \bar{b}_{i, 0} \bar{a}_{i, 1} & \bar{b}_{i, 2} - \bar{b}_{i, 0} \bar{a}_{i, 2} \end{bmatrix}. \end{split}\]

If the poles of the filter are distinct, the transition matrix \(A_i\) can be decomposed as \(A_i = V_i \Lambda_i V_i^{-1}\), where \(\Lambda_i\) is either a diagonal matrix with real poles on the diagonal or a scaled rotation matrix, which can be represented by one of the complex conjugate poles. Using this decomposition, the filter can be implemented as first-order recursive filters on the projected signal \(V_i^{-1} B_i s[n]\), where we leverage Parallel Scan [MC18] to speed up the computation on the GPU. Finally, the output is projected back to the original basis using \(V_i\).
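The decomposition can be verified against the plain difference equation. The sketch below is a plain-Python illustration under several assumptions: the coefficients are already normalized (\(a_{i,0} = 1\)), the poles are distinct, a complex diagonal \(\Lambda_i\) is used instead of a real rotation matrix, and the two first-order recursions run in a sequential loop where a parallel scan would be used on the GPU. The function names are hypothetical.

```python
import cmath


def biquad_direct(s, b, a):
    """Reference output from the difference equation (a[0] assumed to be 1)."""
    y = []
    for n in range(len(s)):
        acc = b[0] * s[n]
        if n >= 1:
            acc += b[1] * s[n - 1] - a[1] * y[n - 1]
        if n >= 2:
            acc += b[2] * s[n - 2] - a[2] * y[n - 2]
        y.append(acc)
    return y


def biquad_ssm_diagonal(s, b, a):
    """Biquad via the diagonalized state-space model with zero initial state."""
    b0, b1, b2 = b
    _, a1, a2 = a
    # Poles are the eigenvalues of A = [[-a1, -a2], [1, 0]].
    disc = cmath.sqrt(a1 * a1 - 4.0 * a2)
    p1, p2 = (-a1 + disc) / 2.0, (-a1 - disc) / 2.0
    # Eigenvectors are [p, 1], so V = [[p1, p2], [1, 1]] and V^{-1} B = [1, -1] / (p1 - p2).
    det = p1 - p2
    g1, g2 = 1.0 / det, -1.0 / det
    # Output row C projected onto the eigenbasis: C V.
    c0, c1 = b1 - b0 * a1, b2 - b0 * a2
    d1, d2 = c0 * p1 + c1, c0 * p2 + c1
    q1 = q2 = 0j
    y = []
    for x in s:
        # y[n] = C V q[n] + b0 s[n]; the imaginary part cancels for real coefficients.
        y.append((d1 * q1 + d2 * q2).real + b0 * x)
        # Two independent first-order recursions q[n+1] = p q[n] + g s[n].
        q1 = p1 * q1 + g1 * x
        q2 = p2 * q2 + g2 * x
    return y


s = [1.0, 0.5, -0.25, 0.0, 2.0, -1.0, 0.0, 0.0]
b = [0.2, 0.3, 0.1]

a_complex = [1.0, -1.2, 0.72]   # complex-conjugate poles 0.6 +/- 0.6j
a_real = [1.0, -0.9, 0.2]       # real poles 0.5 and 0.4

y_direct_complex = biquad_direct(s, b, a_complex)
y_ssm_complex = biquad_ssm_diagonal(s, b, a_complex)
y_direct_real = biquad_direct(s, b, a_real)
y_ssm_real = biquad_ssm_diagonal(s, b, a_real)
```

Each first-order recursion is linear in its state, which is what makes it expressible as an associative scan.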

We recommend using the "ssm" backend over the "lfilter" backend in general, not only because it runs several times faster on the GPU but also because it is more numerically stable.

Parameters:

num_filters (int, optional) – Number of biquads to use (default: 1).
normalized (bool, optional) – If set to True, the filter coefficients are assumed to be normalized by \(a_{i, 0}\), making the number of learnable parameters \(5\) per biquad instead of \(6\) (default: False).
backend (str, optional) – The backend to use for the filtering, which can either be the frequency-sampling method "fsm" or exact time-domain filters, "lfilter" or "ssm" (default: "fsm").
fsm_fir_len (int, optional) – The length of FIR approximation when backend == "fsm" (default: 4000).

forward(input_signal, Bs, As)

Apply the IIR filter with the given coefficients to the input signal.

Parameters:

input_signal (FloatTensor, \(B \times C_\mathrm{in} \times L\)) – A batch of input audio signals.
Bs (FloatTensor, \(B \times C_\mathrm{filter} \times K \times 3\)) – A batch of biquad coefficients, \(b_{i, 0}, b_{i, 1}, b_{i, 2}\), stacked in the last dimension.
As (FloatTensor, \(B \times C_\mathrm{filter} \times K \times 3\)) – A batch of biquad coefficients, \(a_{i, 0}, a_{i, 1}, a_{i, 2}\), stacked in the last dimension.

class SurrogateDelay(N, straight_through=True, radii_loss=True, normalize_gradients=True)

Bases: Module

A surrogate FIR processor for a learnable delay line.

A single delay can be represented as a FIR filter \( h[n] = \delta[n-d] \) where \(0\leq d < N\) is a delay length we want to optimize and \(\delta[n]\) denotes a unit impulse. We exploit the fact that each delay corresponds to a complex sinusoid in the frequency domain. Such a sinusoid’s angular frequency \(z \in \mathbb{C}\) can be optimized with gradient descent if we allow it to be inside the unit disk, i.e., \(|z| \leq 1\) [HSF23]. We first start with an unconstrained complex parameter \(\tilde{z} \in \mathbb{C}\) and restrict it to be inside the unit disk (in the same way as restricting the poles [NSW21]) with the following activation function.

\[ z = \tilde{z} \cdot \frac{\tanh( | \tilde{z} | )}{ | \tilde{z} | + \epsilon}. \]

Then, we compute a damped sinusoid with the normalized frequency \(z\) and use its inverse FFT as a surrogate of the delay.

\[ \tilde{h}[n] = \frac{1}{N} \sum_{k=0}^{N-1} z^k z_N^{kn}, \]

where \(z_{N} = \exp(j\cdot 2\pi/N)\). Clearly, it is not a sparse delay line unless \(z\) is an integer power of \(z_N\) (on the unit circle with an exact angle). Instead, it becomes a time-aliased and low-passed sinc kernel. We can use this soft delay as is, or we can use straight-through estimation (STE) [BLC13] so that the forward pass uses the hard delays \(h[n]\) and the backward pass uses the soft delays \(\smash{\tilde{h}[n]}\).
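The activation and the surrogate FIR can be evaluated directly from the two formulas. The sketch below is a plain-Python illustration, not the library's implementation: it evaluates the sum as written (so the result is complex-valued in general), and the rule used in `hard_delay_fir` to pick the hard delay from the angle of \(z\) is an assumption. All function names are hypothetical.

```python
import cmath
import math


def restrict_to_unit_disk(z_tilde, eps=1e-7):
    """Map an unconstrained complex number into the unit disk with the tanh activation."""
    r = abs(z_tilde)
    return z_tilde * math.tanh(r) / (r + eps)


def soft_delay_fir(z, N):
    """Evaluate (1/N) * sum_k z^k * z_N^{kn} for n = 0..N-1."""
    z_N = cmath.exp(2j * math.pi / N)
    return [sum(z**k * z_N ** (k * n) for k in range(N)) / N for n in range(N)]


def hard_delay_fir(z, N):
    """One-hot FIR at the delay whose bin angle is closest to the angle of z (assumed rule)."""
    d = round(-cmath.phase(z) * N / (2 * math.pi)) % N
    return [1.0 if n == d else 0.0 for n in range(N)]


N, d = 8, 3
z_exact = cmath.exp(-2j * math.pi * d / N)        # an integer power of z_N, namely z_N^{-d}
soft_exact = soft_delay_fir(z_exact, N)

z_inside = restrict_to_unit_disk(1.5 * z_exact)   # same angle, radius below 1
soft_inside = soft_delay_fir(z_inside, N)
peak_index = max(range(N), key=lambda n: abs(soft_inside[n]))
hard = hard_delay_fir(z_inside, N)
```

With `z_exact` on the unit circle the surrogate collapses to a unit impulse at \(n = d\); with `z_inside` the energy spreads over all taps but the largest magnitude remains at the same position.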

\[ \frac{\partial L}{\partial z^*} \leftarrow \sum_{n=0}^{N-1} \frac{\partial L}{\partial h[n]} \frac{\partial \tilde{h}[n]}{\partial z^*} \]

For stable and faster convergence, we provide two additional options. The first one is to normalize the gradients of the complex conjugate to have a unit norm.

\[ \frac{\partial L}{\partial z^*} \leftarrow \frac{\partial L}{\partial z^*}/ \left|\frac{\partial L}{\partial z^*} \right|. \]

The second one is to use the radii loss \(L_{\mathrm{radii}} = (1 - | z | )^2\) to encourage the complex angular frequency \(z\) to be near the unit circle, making the delays “sharp.” We empirically found this regularization to be helpful, especially when we use the STE, as it alleviates the discrepancy between the hard and soft delays while still retaining the benefits of the soft FIR.
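Both options are simple elementwise operations. The sketch below shows them on single complex numbers in plain Python; the function names are hypothetical, the small `eps` guarding the division is an added assumption, and in the library these would act on tensors inside the autograd graph.

```python
def radii_loss(z):
    """Penalty (1 - |z|)^2 that vanishes when z lies on the unit circle."""
    return (1.0 - abs(z)) ** 2


def normalize_gradient(grad, eps=1e-12):
    """Rescale a complex gradient to unit magnitude, keeping its direction."""
    return grad / (abs(grad) + eps)


loss_inside = radii_loss(0.5 + 0.0j)       # radius 0.5, penalized
loss_on_circle = radii_loss(0.6 + 0.8j)    # radius 1, no penalty
unit_grad = normalize_gradient(3.0 - 4.0j)
```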

Parameters:

N (int) – The length of the surrogate FIR, which is also the largest delay length plus one.
straight_through (bool, optional) – Use hard delays for the forward passes and surrogate soft delays for the backward passes with straight-through estimation (default: True).
normalize_gradients (bool, optional) – Normalize the complex conjugate gradients to unit norm (default: True).
radii_loss (bool, optional) – Use the radii loss to encourage the delays to be close to the unit circle (default: True).

forward(z)

Computes the surrogate delay FIRs from the complex angular frequencies.

Parameters:

z (ComplexTensor, any shape) – The unnormalized complex angular frequencies.

Returns:

A batch of FIRs, either hard delays (when using straight-through estimation) or soft surrogate delays. The returned tensor has an additional dimension (last) for the FIR taps.

Return type:

FloatTensor or Tuple[FloatTensor, FloatTensor]

class TruncatedOnePoleIIRFilter(iir_len=16384, **backend_kwargs)

Bases: Module

A one-pole IIR filter with a truncated impulse response.

The true one-pole IIR filter is defined as a recursive filter with a coefficient \(\alpha\). Here, for speed, we calculate its truncated impulse response analytically and convolve it with the input signal.

\[ y[n] \approx u[n] * (1-\alpha)\alpha^n. \]

The length of the truncated FIR, \(N\), is given as an argument iir_len.
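The effect of the truncation can be seen by comparing against the recursion. The sketch below is a plain-Python illustration: it assumes the recursion \(y[n] = \alpha y[n-1] + (1-\alpha) u[n]\), which is the one whose impulse response is \((1-\alpha)\alpha^n\), and it takes \(\alpha\) directly instead of deriving it from z_alpha. The function names are hypothetical.

```python
def one_pole_exact(u, alpha):
    """Recursive one-pole smoother y[n] = alpha * y[n-1] + (1 - alpha) * u[n]."""
    y, prev = [], 0.0
    for sample in u:
        prev = alpha * prev + (1.0 - alpha) * sample
        y.append(prev)
    return y


def one_pole_truncated(u, alpha, iir_len):
    """Convolve u with the first iir_len taps of the impulse response (1 - alpha) * alpha^n."""
    fir = [(1.0 - alpha) * alpha**n for n in range(iir_len)]
    y = []
    for n in range(len(u)):
        taps = min(n + 1, iir_len)
        y.append(sum(fir[m] * u[n - m] for m in range(taps)))
    return y


u = [1.0] * 64      # unit step input
alpha = 0.5
exact = one_pole_exact(u, alpha)
approx = one_pole_truncated(u, alpha, iir_len=16)
max_error = max(abs(e - a) for e, a in zip(exact, approx))
```

For this step input the two outputs agree over the first `iir_len` samples, and afterwards the gap is bounded by the discarded tail \(\alpha^N\).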

forward(input_signals, z_alpha)

Processes input audio with the processor and given coefficients.

Parameters:

input_signals (FloatTensor, \(B \times L\)) – A batch of input audio signals.
z_alpha (FloatTensor, \(B \times 1\)) – A batch of one-pole coefficients.

Returns:

A batch of smoothed signals of shape \(B \times L\).

Return type:

FloatTensor

class Ballistics

Bases: Module

A ballistics processor that smooths the input signal with a recursive filter.

An input signal \(u[n]\) is smoothed recursively, with different coefficients for the “attack” and “release” phases.

\[\begin{split} y[n] = \begin{cases} \alpha_\mathrm{R} y[n-1]+(1-\alpha_\mathrm{R}) u[n] & u[n] < y[n-1], \\ \alpha_\mathrm{A} y[n-1]+(1-\alpha_\mathrm{A}) u[n] & u[n] \geq y[n-1]. \\ \end{cases} \end{split}\]

We calculate the coefficients from the inputs with the sigmoid function, i.e., \(\alpha_\mathrm{A} = \sigma(z_{\mathrm{A}})\) and \(\alpha_\mathrm{R} = \sigma(z_{\mathrm{R}})\). We use diffcomp for the optimized forward and backward computation [YMC+24].
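The recursion and the sigmoid parameterization can be sketched as a sequential loop. This is a plain-Python reference under stated assumptions, not the optimized implementation: the initial state `y_init` is assumed to be zero, the two coefficients are passed as separate arguments instead of a stacked tensor, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def ballistics(u, z_attack, z_release, y_init=0.0):
    """Smooth u with separate attack and release coefficients."""
    alpha_a, alpha_r = sigmoid(z_attack), sigmoid(z_release)
    y, prev = [], y_init
    for sample in u:
        # Release when the input falls below the previous output, attack otherwise.
        alpha = alpha_r if sample < prev else alpha_a
        prev = alpha * prev + (1.0 - alpha) * sample
        y.append(prev)
    return y


# sigmoid(0) = 0.5 for the attack and sigmoid(log 3) = 0.75 for the release.
smoothed = ballistics([1.0, 1.0, 0.0, 0.0], z_attack=0.0, z_release=math.log(3.0))
```

The output rises quickly during the first two samples (attack) and decays more slowly once the input drops to zero (release).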

forward(input_signals, z_alpha)

Processes input audio with the processor and given coefficients.

Parameters:

input_signals (FloatTensor, \(B \times L\)) – A batch of input audio signals.
z_alpha (FloatTensor, \(B \times 2\)) – A batch of attack and release coefficients stacked in the last dimension.

Returns:

A batch of smoothed signals of shape \(B \times L\).

Return type:

FloatTensor