Using AFESim as an audio transform

Rockpool contains a simulation of the Audio Front End (AFE) of XyloAudio 3, which is used as a pre-processing step to convert audio signals to spike trains. The resulting spike trains can be used in the following scenarios:

  • As a training sample to train an SNN model in Rockpool

  • As a test sample to test a model on the XyloAudio 3 SNN core for debugging purposes. This is done by bypassing the microphone and AFE in the HDK. See the related tutorial for more information.

In this tutorial, we will refer to the AFE simulator in XyloAudio 3 as AFESim3 and will go through an example of how to configure and use AFESim3 as an audio transform for a train or test pipeline.

There are two main modes in the AFESim3 module in Rockpool:

  • AFESimExternal

    • This mode of AFESim3 is independent of the microphone type. It bypasses the microphone path and passes external audio (a 14-bit quantized signal) into the filter bank and divisive normalization module.

  • AFESimPDM

    • In this mode, audio samples are passed through a preprocessing chain composed of a PDM microphone model, the filter bank and the divisive normalization module.

Using AFESimExternal is recommended for developing applications, while AFESimPDM is more suitable for advanced debugging tasks.


As illustrated in the diagram below, AFESimExternal receives input audio as an array, resamples and quantizes it to 14-bit format, and passes it to the filter bank (which covers 16 frequency bands between 50 Hz and 17 kHz).
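To make the 14-bit quantization step concrete, here is a minimal sketch in plain NumPy. It assumes floating-point audio in the range [-1, 1] and a simple round-and-clip scheme; the function name quantize_to_14_bit is hypothetical, and the quantizer inside AFESimExternal may differ in scaling and rounding details.

```python
import numpy as np


def quantize_to_14_bit(audio):
    """Map float audio in [-1, 1] to signed 14-bit integer codes.

    Illustrative assumption only: this is not the Rockpool implementation.
    """
    num_bits = 14
    max_code = 2 ** (num_bits - 1) - 1  # largest positive code, 8191
    min_code = -(2 ** (num_bits - 1))   # most negative code, -8192

    # Clip out-of-range samples, then scale and round to the integer grid
    clipped = np.clip(np.asarray(audio, dtype=float), -1.0, 1.0)
    codes = np.round(clipped * max_code)
    return np.clip(codes, min_code, max_code).astype(np.int16)


# A short sine wave as stand-in audio
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
quantized = quantize_to_14_bit(0.5 * np.sin(2 * np.pi * 440.0 * t))
print(quantized.min(), quantized.max())
```

When using AFESimExternal you do not need to perform this step yourself; the sketch only illustrates the numeric range the filter bank operates on.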

Depending on the mode selected for spike_gen_mode, fixed or adaptive thresholds will be applied to filter output channels to generate a spike train. spike_gen_mode is by default set to 'divisive_norm', and changing it and related parameters (low_pass_averaging_window, rate_scale_factor, dn_EPS) is not recommended.

The Divisive Normalization (DN) module regulates the noise sensitivity of different frequency bands of the filter bank by applying adaptive thresholds. If the average power of a filter in a specific time window is less than \(\epsilon\), that filter’s threshold will be adapted to generate fewer spikes. The user can deactivate Divisive Normalization only for debugging purposes by choosing spike_gen_mode = 'threshold' and passing fixed_threshold_vec.
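The effect of the lower bound \(\epsilon\) can be illustrated with a toy model. The sketch below is not the DivisiveNormalization module from Rockpool: it is a simplified integrate-and-fire encoder, written only for illustration, whose threshold follows a moving average of the input magnitude and is never allowed to fall below eps. The function name and all parameter values are assumptions.

```python
import numpy as np


def toy_adaptive_threshold_spikes(power, window, eps):
    """Toy spike encoder with an adaptive threshold floored at eps.

    Illustrative only; the real DN module differs in its details.
    """
    power = np.abs(np.asarray(power, dtype=float))

    # Causal moving average of the input over the last `window` samples
    kernel = np.ones(window) / window
    average = np.convolve(power, kernel)[: len(power)]

    # The threshold follows the average but never drops below eps
    threshold = np.maximum(average, eps)

    spikes = np.zeros(len(power), dtype=int)
    accumulator = 0.0
    for t in range(len(power)):
        accumulator += power[t]
        if accumulator >= threshold[t]:
            spikes[t] = 1
            accumulator -= threshold[t]
    return spikes


loud = toy_adaptive_threshold_spikes(np.full(1000, 100.0), window=10, eps=32)
moderate = toy_adaptive_threshold_spikes(np.full(1000, 50.0), window=10, eps=32)
quiet = toy_adaptive_threshold_spikes(np.full(1000, 1.0), window=10, eps=32)
print(loud.sum(), moderate.sum(), quiet.sum())
```

In this toy model, inputs well above eps produce similar spike counts regardless of their absolute level, while an input below eps produces far fewer spikes. This mirrors the qualitative behaviour described above.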

The spike train is rasterized with a given dt, which should match the time step used in your SNN model.
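Rasterization here means counting spikes in consecutive time bins of width dt. The following NumPy sketch shows the idea, assuming spikes are given as event times with a channel index; the function rasterize and its arguments are hypothetical and are not part of the Rockpool API.

```python
import numpy as np


def rasterize(spike_times, channels, duration, dt, num_channels=16):
    """Count spikes per channel in consecutive bins of width dt."""
    num_steps = int(np.ceil(duration / dt))
    raster = np.zeros((num_steps, num_channels), dtype=int)

    # Bin index of each spike; spikes at the very end fall into the last bin
    steps = np.minimum((np.asarray(spike_times) / dt).astype(int), num_steps - 1)

    # np.add.at accumulates correctly when several spikes share a bin
    np.add.at(raster, (steps, np.asarray(channels)), 1)
    return raster


raster = rasterize(
    spike_times=[0.1, 0.2, 0.6], channels=[0, 0, 3], duration=1.0, dt=0.25
)
print(raster.shape)
```

A larger dt yields fewer time steps with higher counts per bin, which is why dt needs to agree with the time step of the SNN that consumes the raster.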

import warnings
from IPython.display import Image


The following transform converts audio samples (audio: np.ndarray) to spike trains using AFESimExternal:

import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 300

from typing import Union, Optional, Tuple
from rockpool.devices.xylo.syns65302 import AFESimExternal
from dataclasses import dataclass

@dataclass
class AFESim3_external:
    """
    Args:
        fs (float): sampling frequency of the audio samples.
        spike_gen_mode (str): The spike generation mode of the AFE. There are two ways to generate spikes, "divisive_norm" and "threshold".
        fixed_threshold (Optional[Union[int, Tuple[int]]]): Used only when `spike_gen_mode = "threshold"`. Passed to AFESim as `fixed_threshold_vec`.
        dn_EPS (Optional[Union[int, Tuple[int]]]): Used only when `spike_gen_mode = "divisive_norm"`. Lower bound on the spike generation threshold.
            This parameter controls the noise level: if the average power in a channel is less than EPS, the spike rate of that channel is reduced during spike generation.
        rate_scale_factor (Optional[int]): Target `rate_scale_factor` for the `DivisiveNormalization` module. Defaults to 63.
        low_pass_averaging_window (Optional[float]): Target `low_pass_averaging_window` for the `DivisiveNormalization` module. Defaults to 84e-3.
        dt (float): simulation time step; this must match the dt of the SNN model.
    """

    fs: float = 16000
    # Default follows the AFESim default described above
    spike_gen_mode: str = "divisive_norm"
    fixed_threshold: Optional[Union[int, Tuple[int]]] = None
    dn_EPS: Optional[Union[int, Tuple[int]]] = 32
    rate_scale_factor: Optional[int] = 63
    low_pass_averaging_window: float = 0.084
    dt: float = 0.01

    def __post_init__(self) -> None:
        if self.spike_gen_mode == "threshold":
            # Fixed thresholds: one value per filter bank channel
            self.fixed_threshold_vec = [self.fixed_threshold for i in range(16)]
            self.dn_inits = {'spike_gen_mode': self.spike_gen_mode, 'fixed_threshold_vec': self.fixed_threshold}
        else:
            # Divisive normalization: adaptive thresholds
            self.fixed_threshold_vec = None
            self.dn_inits = {'spike_gen_mode': self.spike_gen_mode, 'fixed_threshold_vec': self.fixed_threshold, 'dn_EPS': self.dn_EPS, 'rate_scale_factor': self.rate_scale_factor,
                             'low_pass_averaging_window': self.low_pass_averaging_window}

        self.afesim3 = AFESimExternal.from_specification(**self.dn_inits, dt=self.dt)

    def __call__(self, audio: np.ndarray) -> np.ndarray:
        # AFESim expects a tuple of (audio samples, sampling frequency)
        out, _, _ = self.afesim3((audio, self.fs))
        return out


The diagram below illustrates the difference between AFESimPDM and AFESimExternal. AFESimPDM internally includes a simulation of a digital microphone, composed of a sigma-delta modulator and a polyphase low-pass filter that converts the PDM signal to 14-bit quantized data.

This module can be used when debugging the PDM modules on XyloAudio 3.


The following transform converts audio samples (audio: np.ndarray) to spike trains using AFESimPDM:


from rockpool.devices.xylo.syns65302 import AFESimPDM

@dataclass
class AFESimPDM_transform_output:
    """
    Args:
        spike_gen_mode (str): The spike generation mode of the AFE. There are two ways to generate spikes, "divisive_norm" and "threshold".
        fixed_threshold (Optional[Union[int, Tuple[int]]]): Used only when `spike_gen_mode = "threshold"`. Passed to AFESim as `fixed_threshold_vec`.
        dn_EPS (Optional[Union[int, Tuple[int]]]): Used only when `spike_gen_mode = "divisive_norm"`. Lower bound on the spike generation threshold.
            This parameter controls the noise level: if the average power in a channel is less than EPS, the spike rate of that channel is reduced during spike generation.
        rate_scale_factor (Optional[int]): Target `rate_scale_factor` for the `DivisiveNormalization` module. Defaults to 63.
        low_pass_averaging_window (Optional[float]): Target `low_pass_averaging_window` for the `DivisiveNormalization` module. Defaults to 84e-3.
        dt (float): simulation time step; this must match the dt of the SNN model.
        fs (float): sampling frequency of the audio samples.
    """

    # Default follows the AFESim default described above
    spike_gen_mode: str = "divisive_norm"
    fixed_threshold: Optional[Union[int, Tuple[int]]] = None
    dn_EPS: Optional[Union[int, Tuple[int]]] = 32
    rate_scale_factor: Optional[int] = 63
    low_pass_averaging_window: float = 0.084
    dt: float = 0.01
    fs: float = 16000

    def __post_init__(self) -> None:
        if self.spike_gen_mode == "threshold":
            # Fixed thresholds: one value per filter bank channel
            self.fixed_threshold_vec = [self.fixed_threshold for i in range(16)]
            self.dn_inits = {'spike_gen_mode': self.spike_gen_mode, 'fixed_threshold_vec': self.fixed_threshold}
        else:
            # Divisive normalization: adaptive thresholds
            self.fixed_threshold_vec = None
            self.dn_inits = {'spike_gen_mode': self.spike_gen_mode, 'fixed_threshold_vec': self.fixed_threshold, 'dn_EPS': self.dn_EPS, 'rate_scale_factor': self.rate_scale_factor,
                             'low_pass_averaging_window': self.low_pass_averaging_window}

        self.afesimPDM = AFESimPDM.from_specification(**self.dn_inits, dt=self.dt)

    def __call__(self, audio: np.ndarray) -> np.ndarray:
        # AFESim expects a tuple of (audio samples, sampling frequency)
        out, _, _ = self.afesimPDM((audio, self.fs))
        return out

Applying AFESim3 transform

We apply a test audio sample (a 1-second baby cry) to both AFESim3 transforms introduced above to generate our pre-recorded data.

!pip install --quiet librosa
import librosa
audio_path = 'audio_sample/sample_4__basic_length=1_db0=-0.wav'

# Load the audio at its native sampling rate (sr=None disables resampling)
test_sample, sr = librosa.load(audio_path, sr=None)

# Pass the file's sampling rate to the transforms so the audio is interpreted correctly
afe_ext = AFESim3_external(fs=sr, spike_gen_mode='divisive_norm', dt=0.009994)
out_external = afe_ext(test_sample)

afe_pdm = AFESimPDM_transform_output(fs=sr, spike_gen_mode='divisive_norm', dt=0.009994)
out_pdm = afe_pdm(test_sample)

# Plot the rasterized output of both transforms side by side
plt.subplot(121)
plt.imshow(out_external.T, aspect='auto')
plt.colorbar()
plt.title('Test audio transformed by AFESimExternal')
plt.xlabel('Time step (dt)')
plt.ylabel('Output channel')

plt.subplot(122)
plt.imshow(out_pdm.T, aspect='auto')
plt.colorbar()
plt.title('Test audio transformed by AFESimPDM')
plt.xlabel('Time step (dt)')
plt.ylabel('Output channel');

# - You can now save the transformed data
# np.save('AFESimExternalSample', out_external.T)

The output spike train has a dimension of \((N_{steps}, 16)\) where 16 is the number of output channels and \(N_{steps}\) is the duration of the audio in seconds divided by the provided dt of the model.
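As a quick sanity check, the expected number of time steps can be estimated from the audio duration and dt. The helper below is a hypothetical convenience function, not part of Rockpool, and it assumes simple rounding; the exact length returned by AFESim3 may differ by a step depending on how the simulator handles the last partial bin.

```python
def expected_num_steps(duration_s: float, dt: float) -> int:
    """Estimate N_steps as the audio duration divided by dt (rounded)."""
    return int(round(duration_s / dt))


# The 1-second test sample with the dt used in this tutorial
n_steps = expected_num_steps(1.0, 0.009994)
print((n_steps, 16))
```

Comparing this estimate with out_external.shape is an easy way to confirm that the transform was configured with the intended dt.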