Published as a conference paper at ICLR 2021
LEAF: A LEARNABLE FRONTEND FOR AUDIO
CLASSIFICATION
Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry & Marco Tagliasacchi
Google Research
{neilz, oliviert, fcq, mtagliasacchi}@google.com
ABSTRACT
Mel-filterbanks are fixed, engineered audio features which emulate human percep-
tion and have been used through the history of audio understanding up to today.
However, their undeniable qualities are counterbalanced by the fundamental lim-
itations of handmade representations. In this work we show that we can train a
single learnable frontend that outperforms mel-filterbanks on a wide range of au-
dio signals, including speech, music, audio events and animal sounds, providing a
general-purpose learned frontend for audio classification. To do so, we introduce
a new principled, lightweight, fully learnable architecture that can be used as a
drop-in replacement of mel-filterbanks. Our system learns all operations of au-
dio feature extraction, from filtering to pooling, compression and normalization,
and can be integrated into any neural network at a negligible parameter cost. We
perform multi-task training on eight diverse audio classification tasks, and show
consistent improvements of our model over mel-filterbanks and previous learn-
able alternatives. Moreover, our system outperforms the current state-of-the-art
learnable frontend on Audioset, with orders of magnitude fewer parameters.
1 INTRODUCTION
Learning representations by backpropagation in deep neural networks has become the standard in
audio understanding, ranging from automatic speech recognition (ASR) (Hinton et al., 2012; Se-
nior et al., 2015) to music information retrieval (Arcas et al., 2017), as well as animal vocaliza-
tions (Lostanlen et al., 2018) and audio events (Hershey et al., 2017; Kong et al., 2019). Still, a strik-
ing constant along the history of audio classification is the mel-filterbanks, a fixed, hand-engineered
representation of sound. Mel-filterbanks first compute a spectrogram, using the squared modulus of
the short-term Fourier transform (STFT). Then, the spectrogram is passed through a bank of triangu-
lar bandpass filters, spaced on a logarithmic scale (the mel-scale) to replicate the non-linear human
perception of pitch (Stevens & Volkmann, 1940). Eventually, the resulting coefficients are passed
through a logarithm compression, to replicate our non-linear sensitivity to loudness (Fechner et al.,
1966). This approach of drawing inspiration from the human auditory system to design features
for machine learning has been historically successful (Davis & Mermelstein, 1980; Mogran et al.,
2004). Moreover, decades after the design of mel-filterbanks, Andén & Mallat (2014) showed that
they coincidentally exhibit desirable mathematical properties for representation learning, in particu-
lar shift-invariance and stability to small deformations. Hence, both from an auditory and a machine
learning perspective, mel-filterbanks represent strong audio features.
However, the design of mel-filterbanks is also flawed by biases. First, not only has the mel-scale been
revised multiple times (O’Shaughnessy, 1987; Umesh et al., 1999), but also the auditory experiments
that led to their original design could not be replicated afterwards (Greenwood, 1997). Similarly, better
alternatives to log-compression have been proposed, like cubic root for speech enhancement (Lyons
& Paliwal, 2008) or 10th root for ASR (Schluter et al., 2007). Moreover, even though matching
human perception provides good inductive biases for some application domains, e.g., ASR or music
understanding, these biases may also be detrimental, e.g. for tasks that require fine-grained resolu-
tion at high frequencies. Finally, the recent history of other fields like computer vision, in which the
rise of deep learning methods has allowed learning representations from raw pixels rather than from
engineered features (Krizhevsky et al., 2012), inspired us to take the same path.
[Figure 1 compares four pipelines: LEAF (Gabor conv → squared modulus → Gaussian lowpass → sPCEN), Time-Domain filterbanks (Conv1D → squared modulus → lowpass → log), SincNet (Sinc conv → LeakyReLU → max-pooling → LayerNorm), and mel-filterbanks (STFT → squared modulus → mel filters → log).]
Figure 1: Breakdown of the computation of mel-filterbanks, Time-Domain filterbanks, SincNet,
and the proposed LEAF frontend. Orange boxes are fixed, while computations in blue boxes are
learnable. Grey boxes represent activation functions.
These observations motivated replacing mel-filterbanks with learnable neural layers, ranging from
standard convolutional layers (Palaz et al., 2015) to dilated convolutions (Schneider et al., 2019),
as well as structured filters exploiting the characteristics of known filter families, such as Gamma-
tone (Sainath et al., 2015), Gabor (Zeghidour et al., 2018a; Noé et al., 2020), Sinc (Ravanelli &
Bengio, 2018; Pariente et al., 2020) or Spline (Balestriero et al., 2018) filters. While tasks such as
speech separation have already successfully adopted learnable frontends (Luo & Mesgarani, 2019;
Luo et al., 2019), we observe that most state-of-the-art approaches for audio classification (Kong
et al., 2019), ASR (Synnaeve et al., 2019) and speaker recognition (Villalba et al., 2020) still em-
ploy mel-filterbanks as input features, regardless of the backbone architecture.
In this work, we argue that a credible alternative to mel-filterbanks for classification should be
evaluated across many tasks, and propose the first extensive study of learnable frontends for audio
over a wide and diverse range of audio signals, including speech, music, audio events, and ani-
mal sounds. By breaking down mel-filterbanks into three components (filtering, pooling, compres-
sion/normalization), we propose LEAF, a novel frontend that is fully learnable in all its operations,
while being controlled by just a few hundred parameters. In a multi-task setting over 8 datasets,
we show that we can learn a single set of parameters that outperforms mel-filterbanks, as well as
previously proposed learnable alternatives. Moreover, these findings are replicated when training a
different model for each individual task. We also confirm these results on a challenging, large-scale
benchmark: classification on Audioset (Gemmeke et al., 2017). In addition, we show that the
inductive biases of our frontend (i.e., learning bandpass filters, lowpass filtering before downsampling,
and learning a per-channel compression) are general enough to benefit other systems, and we propose a new,
improved version of SincNet (Ravanelli & Bengio, 2018). To foster application to new tasks, we
will release the source code of all our models and baselines, as well as pre-trained frontends.
2 RELATED WORK
In the last decade, several works addressed the problem of learning the audio frontend, as an alter-
native to mel-filterbanks. The first notable contributions in this field emerged for ASR, with Jaitly
& Hinton (2011) pretraining Restricted Boltzmann Machines from the waveform, and Palaz et al.
(2013) training a hybrid DNN-HMM model, replacing mel-filterbanks by several layers of convo-
lution. However, these alternatives, as well as others proposed more recently (Tjandra et al., 2017;
Schneider et al., 2019), are composed of many layers, which makes a fair comparison with mel-
filterbanks difficult. In the following section, we focus on frontends that provide a lightweight,
drop-in replacement to mel-filterbanks, with comparable capacity.
2.1 LEARNING FILTERS FROM WAVEFORMS
A first attempt at learning the filters of mel-filterbanks was proposed by Sainath et al. (2013), where
a filterbank is initialized using the mel-scale and then learned together with the rest of the net-
work, taking a spectrogram as input. Instead, Sainath et al. (2015) and Hoshen et al. (2015) later
proposed to learn convolutional filters directly from raw waveforms, initialized with Gammatone fil-
ters (Schluter et al., 2007). In the same spirit, Zeghidour et al. (2018a) used the scattering transform
approximation of mel-filterbanks (Andén & Mallat, 2014) to propose the time-domain filterbanks, a
learnable frontend that approximates mel-filterbanks at initialization and can then be learned without
constraints (see Figure 1). More recently, the SincNet (Ravanelli & Bengio, 2018) model was pro-
posed, which computes a convolution with sine cardinal filters, a non-linearity and a max-pooling
operator (see Figure 1), as well as a variant using Gabor filters (Noé et al., 2020).
We take inspiration from these works to design a new learnable filtering layer. As detailed in Sec-
tion 3.1.2, we parametrize a complex-valued filtering layer with Gabor filters. Gabor filters are opti-
mally localized in time and frequency (Gabor, 1946), unlike Sinc filters that require using a window
function (Ravanelli & Bengio, 2018). Moreover, unlike Noé et al. (2020), who use complex-valued
layers in the rest of the network, we describe in Section 3.1.2 how using a squared modulus not only
brings back the signal to the real-valued domain (leading to compatibility with standard architec-
tures), but also performs shift-invariant Hilbert envelope extraction. Zeghidour et al. (2018a) also
apply a squared modulus non-linearity, however as described in Section 3.1.1, training unconstrained
filters can lead to overfitting and stability issues, which we solve with our approach.
2.2 LEARNING THE COMPRESSION AND THE NORMALIZATION
The problem of learning a compression and/or normalization function has received less attention in
the past literature. A notable contribution is the Per-Channel Energy Normalization (PCEN) (Wang
et al., 2017; Lostanlen et al., 2019), which was originally proposed for keyword spotting, outper-
forming log-compression. Later, Battenberg et al. (2017) and Lostanlen et al. (2018) confirmed the
advantages of PCEN, respectively for large scale ASR and animal bioacoustics. However, these pre-
vious works learn a compression on top of fixed mel-filterbanks. Instead, in this work we propose a
new version of PCEN and show for the first time that combining learnable filters, learnable pooling,
and learnable compression and normalization outperforms all other approaches.
3 MODEL
Let x ∈ R^T denote a one-dimensional waveform of T samples, available at the sampling frequency
Fs [Hz]. We decompose the frontend into a sequence of three components: i) filtering, which
passes x through a bank of bandpass filters followed by a non-linearity, operating at the original
sampling rate Fs; ii) pooling, which decimates the signal to reduce its temporal resolution; iii)
compression/normalization, which applies a non-linearity to reduce the dynamic range. Overall, the
frontend can be represented as a function Fψ : R^T → R^{M×N}, which maps the input waveform to a
2-dimensional feature space, where M denotes the number of temporal frames (typically M<T),
N the number of feature channels (which might correspond to frequency bins) and ψ the frontend
parameters. The features computed by the frontend are then fed to a model gθ(·) parametrized by θ.
The frontend and the model parameters are estimated by solving a supervised classification problem:
(θ∗, ψ∗) = arg min_{θ,ψ} E_{(x,y)∈D} L(gθ(Fψ(x)), y),        (1)
where (x, y) are samples in a labelled dataset D and L is a loss function. Our goal is to learn the
frontend parameters ψ end-to-end with the model parameters θ. To achieve this, it is necessary to
make all the frontend components learnable, so that we can solve the optimization problem in equa-
tion 1 with gradient descent. In the following we detail the design choices of each component.
3.1 FILTERING
The first block of the learnable frontend takes x as input, and computes a convolution with a bank
of complex-valued filters (ϕn)n=1..N , followed by a squared modulus operator, which brings back
its output to the real-valued domain. This convolution step has a stride of 1, therefore keeping the
input temporal resolution, and outputs the following time-frequency representation:
f_n = |x ∗ ϕ_n|² ∈ R^T,   n = 1, ..., N,        (2)
where ϕn ∈ C^W is a complex-valued 1-D filter of length W. It is possible to compute equation 2
without explicitly manipulating complex numbers. As proposed by Zeghidour et al. (2018a), to
[Figure 2: two panels of filter frequency responses, titled NormalizedConv1D and GaborConv1D.]
Figure 2: Frequency response of filters at convergence for two parametrizations: normalized 1D
filters and Gabor filters, both initialized on a mel scale. We highlight two filters among the 40 of the
filterbank: one in yellow in the low frequency range and one in pink at high frequencies.
produce the squared modulus of a complex-valued convolution with N filters, we compute instead
the convolution with 2N real-valued filters ϕ̃n, n = 1, ..., 2N, and perform squared l2-pooling
with size 2 and stride 2 along the channel axis to obtain the squared modulus, using adjacent filters
as real and imaginary part of ϕn. Formally:

f_n = |x ∗ ϕ̃_{2n−1}|² + |x ∗ ϕ̃_{2n}|² ∈ R^T,   n = 1, ..., N.        (3)
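For concreteness, a minimal NumPy sketch of equation 3 (function and variable names are ours, not those of the released implementation):

```python
import numpy as np

def squared_modulus_filterbank(x, filters):
    """Squared modulus of N complex filters expressed as 2N real filters (eq. 3).

    x:       (T,) input waveform.
    filters: (2N, W) real-valued filters; consecutive rows hold the real and
             imaginary parts of each complex filter phi_n.
    Returns an (N, T) non-negative time-frequency representation.
    """
    # Stride-1 convolution with every real filter, preserving the input length.
    responses = np.stack([np.convolve(x, f, mode="same") for f in filters])
    # Squared l2-pooling of size 2 and stride 2 along the channel axis:
    # |Re|^2 + |Im|^2 recovers |x * phi_n|^2.
    return responses[0::2] ** 2 + responses[1::2] ** 2
```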
We explore two different parametrizations for ϕn. One relies on standard fully parametrized convo-
lutional filters while the other makes use of learnable Gabor filters.
3.1.1 NORMALIZED 1D-CONVOLUTION
Inspired by Zeghidour et al. (2018a), the first version of our filtering component is a standard 1D
convolution, initialized with a bank of Gabor filters, which approximates the computation of a mel-
filterbank. Thus, at initialization, the output of the frontend is identical to that of a mel-filterbank, but
during training the filters can be learned by backpropagation. In our early experiments, we observed
several limitations of this approach. First, due to the unconstrained optimization, these filters not
only learn to select frequency bands of interest, but also learn a scaling factor. This might lead to
instability during training, because the same form of scaling can also be computed by the filter-wise
compression component, which is applied at a later stage in the frontend. We address this issue by
applying l2-normalization to the filter coefficients before computing the convolution. Second, the
unconstrained parametrization increases the number of degrees of freedom, making training prone
to overfitting. The first panel in Figure 2 shows the frequency response of a bank of N = 40 filters,
each parametrized with W = 401 coefficients. We observe that, at convergence, they are widely
spread across the frequency axis, and include negative frequencies. Moreover, the filters are very
spiky rather than smooth. To alleviate these issues, in the next section we introduce a parametrization
based on a bank of Gabor filters, which reduces the number of parameters to learn, while at the same
time enforcing during training a stable and interpretable representation.
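As a concrete illustration of the l2-normalization mentioned above, a small sketch (assuming the 2N real filters are stored as a (2N, W) array; names are ours):

```python
import numpy as np

def l2_normalize_filters(filters, eps=1e-12):
    """Rescale each filter to unit l2 norm before the convolution, so that the
    filters only select frequency bands and any gain is left to the later
    compression/normalization stage."""
    norms = np.linalg.norm(filters, axis=-1, keepdims=True)
    return filters / (norms + eps)
```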
3.1.2 GABOR 1D-CONVOLUTION
Gabor filters are produced by modulating a Gaussian kernel with a sinusoidal signal. These filters
provide several desirable properties. First, they have an optimal trade-off between localization in
time and frequency (Gabor, 1946), which makes them a suitable choice for a convolutional network
with finite-sized filters. This is in contrast to Sinc filters (Ravanelli & Bengio, 2018), which require
using a window function to smooth abrupt variations on each side of the filter. Second, the time
and frequency response of Gabor filters have the same functional form, thus leading to interpretable
bandpass filters, unlike the unconstrained filters described in the previous section. Finally, Gabor
filters are quasi-analytic (i.e., their frequency response is almost zero for negative frequencies) and
when combined with the squared modulus the resulting filterbank can be interpreted as a set of
subband Hilbert envelopes, which are invariant to small shifts. Due to this desirable property, they
have been previously used as (fixed) features for speech and speaker recognition (Falk & Chan,
2009; Thomas et al., 2008). Formally, Gabor filters are parametrized by their center frequencies
(ηn)n=1..N and inverse bandwidths (σn)n=1..N as follows:
ϕn(t) = e^{i2πηn t} · (1 / (√(2π) σn)) · e^{−t² / (2σn²)},   n = 1, ..., N,   t = −W/2, ..., W/2.        (4)
The frequency response of ϕn is a Gaussian centered at frequency ηn and of bandwidth 1/σn, both
expressed in normalized frequency units in [−1/2, +1/2]. Therefore, learning these parameters al-
lows learning a bank of smooth, quasi-analytic bandpass filters, with controllable center frequency
and bandwidth. In practice, to compute the output of the filtering component, we obtain the im-
pulse response of the Gabor filters ϕn(t) over the range t = −W/2,...,W/2, and convolve these
impulse responses with the input waveform. To ensure stability during training, we clip the center
frequencies (ηn)n=1..N to be in [0, 1/2], so that they lie in the positive part of the frequency range.
We also constrain the bandwidths (σn)n=1..N to lie in the range [4√(2 log 2), 2W√(2 log 2)], such that
the full-width at half-maximum of the frequency response lies between 1/W and 1/2.
Gabor filters have significantly fewer parameters than the normalized 1D-convolutions described in
Section 3.1.1. N filters of length W are described by 2N parameters, N for the center frequencies
and N for the bandwidths, against W × N for a standard 1D-convolution. In particular, when using
a window length of 25 ms, operating at a sampling rate of 16 kHz, then W = 401 samples, and
Gabor-based filtering accounts for 200 times fewer parameters than their unconstrained alternatives.
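A NumPy sketch of how the Gabor impulse responses of equation 4 can be built from the 2N scalar parameters, with the clipping constraints stated above (function and variable names are ours):

```python
import numpy as np

def gabor_filters(center_freqs, sigmas, W=401):
    """Complex Gabor impulse responses from 2N scalar parameters (eq. 4).

    center_freqs: (N,) center frequencies eta_n, in normalized frequency units.
    sigmas:       (N,) inverse-bandwidth parameters sigma_n, in samples.
    Returns an (N, W) complex array evaluated over t = -W/2, ..., W/2.
    """
    fwhm = np.sqrt(2.0 * np.log(2.0))
    # Training-time constraints: eta_n in [0, 1/2]; sigma_n such that the FWHM
    # of the frequency response lies between 1/W and 1/2.
    eta = np.clip(center_freqs, 0.0, 0.5)
    sigma = np.clip(sigmas, 4.0 * fwhm, 2.0 * W * fwhm)
    t = np.arange(-(W // 2), W // 2 + 1, dtype=np.float64)
    gaussian = np.exp(-t[None, :] ** 2 / (2.0 * sigma[:, None] ** 2))
    gaussian /= np.sqrt(2.0 * np.pi) * sigma[:, None]
    carrier = np.exp(2j * np.pi * eta[:, None] * t[None, :])
    return carrier * gaussian
```

With N = 40 and W = 401, `center_freqs` and `sigmas` together hold 80 values, against the 16,040 weights of an unconstrained 1D-convolution, which is the factor of roughly 200 mentioned above.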
3.1.3 TIME-FREQUENCY ANALYSIS AND LEARNABLE FILTERS
A spectrogram, in linear or mel scale, provides an ordered time-frequency representation: adjacent
frames represent consecutive time-steps, while frequencies monotonically increase along the feature
axis. A learnable frontend that performs filtering by means of convolution with a set of bandpass
filters also preserves ordering along the temporal axis. However, the ordering along the feature
axis is unconstrained. This can be problematic when applying subsequent operations that rely on
frequency ordering. These include, for example: i) operations that leverage local frequency infor-
mation, e.g., two-dimensional convolutions, which compute a feature representation based on local
time-frequency patches; ii) operations that leverage long-range dependencies along the frequency
axis, e.g., to capture the harmonic structure of the underlying signal; and iii) augmentation methods
that mask adjacent frequency bands, like SpecAugment (Park et al., 2019). To evaluate the impact
of enforcing an explicit ordering of the center frequencies of the learned filters, we compared the
result of training a frontend using Gabor filters with or without explicitly enforcing ordering of the
center frequencies. Interestingly, we observe that even without an explicit constraint, filters that are
ordered at the initialization tend to keep the same ordering throughout training, and that enforcing
sorted filters has no effect on the performance.
3.2 LEARNABLE LOWPASS POOLING
The output of the filtering component has the same temporal resolution as the input waveform. The
second step of a learnable frontend is to downsample the output of the filterbank to a lower sam-
pling rate, similarly to what happens in the STFT when computing mel-filterbanks. Previous work
relied on max-pooling (Sainath et al., 2015; Ravanelli & Bengio, 2018; Noé et al., 2020), lowpass-
filtering (Zeghidour et al., 2018a) or average pooling (Balestriero et al., 2018). Zeghidour et al.
(2018b) compare these methods on a speech recognition task and show a systematic improvement
when using lowpass filtering instead of max-pooling. More recently, Zhang (2019) showed that in
standard 2D convolutional architectures, including ResNet (He et al., 2016) and DenseNet (Huang
et al., 2017), a drop-in replacement of max-pooling and average pooling layers with a (fixed) low-
pass filter improves the performance for image classification. In the proposed frontend, we extend
the pooling layer of Zeghidour et al. (2018a) in two ways. First, while Zeghidour et al. (2018a)
adopt a single shared lowpass filter for all input channels, we implement lowpass filtering by means
of depthwise convolution, such that each input channel is associated with one lowpass filter. This
is useful because each channel in the learnable frontend is characterized by a different bandwidth,
and a specific lowpass filter can be learned for each of them. Second, we parametrize these lowpass
filters to have a Gaussian impulse response:
φn(t) = (1 / (√(2π) σn)) · e^{−t² / (2σn²)},   t = −W/2, ..., W/2.        (5)
Note that this is a particular case of Gabor filters with center frequency equal to 0 and learnable
bandwidth. With this choice, we can learn per-channel lowpass pooling functions while adding only
N parameters to the frontend model. We initialize all channels with a bandwidth of 0.4, which gives
a frequency response close to the Hann window used by mel-filterbanks.
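A sketch of this per-channel Gaussian lowpass pooling (equation 5), assuming the filtering output is stored as an (N, T) array and using the 401-sample window and 160-sample stride of our experimental setup; names are ours:

```python
import numpy as np

def gaussian_lowpass_pooling(features, sigmas, W=401, stride=160):
    """Depthwise Gaussian lowpass filtering (eq. 5) followed by decimation.

    features: (N, T) output of the filtering stage, one row per channel.
    sigmas:   (N,) learnable widths (in samples), one Gaussian kernel per
              channel, i.e., a depthwise convolution.
    Returns an (N, ceil(T / stride)) downsampled representation.
    """
    t = np.arange(-(W // 2), W // 2 + 1, dtype=np.float64)
    kernels = np.exp(-t[None, :] ** 2 / (2.0 * sigmas[:, None] ** 2))
    kernels /= np.sqrt(2.0 * np.pi) * sigmas[:, None]
    pooled = np.stack([np.convolve(f, k, mode="same")
                       for f, k in zip(features, kernels)])
    return pooled[:, ::stride]  # with stride=160 at 16 kHz, a 10 ms hop
```

Note that `sigmas` is expressed here in samples; the 0.4 initialization mentioned above refers to the paper's own parametrization of the bandwidth, which may use different units.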
3.3 LEARNING PER-CHANNEL COMPRESSION AND NORMALIZATION
When using a traditional frontend based on a mel-filterbank or STFT, the time-frequency features are
typically passed through a logarithmic compression, to replicate the non-linear human perception of
loudness (Fechner et al., 1966). The first limitation of this approach is that the same compression
function is applied to all frequency bins, regardless of their content. The second limitation is the
fixed choice of the non-linearity used by the compression function. While, in most cases, a loga-
rithmic function is used, other compression functions have been proposed and evaluated in the past,
including the cubic root (Lyons & Paliwal, 2008) and the 10th root (Schluter et al., 2007). This mo-
tivates learning the compression function as part of the model, rather than relying on a handcrafted
choice. In particular, Per-Channel Energy Normalization (Wang et al., 2017) was proposed as a
learnable alternative to log-compression and mean-variance normalization:
PCEN(F(t, n)) = ( F(t, n) / (ε + M(t, n))^{αn} + δn )^{rn} − δn^{rn},        (6)
where t = 1,...,M denotes the time-step and n = 1,...,N the channel index. In this parametriza-
tion, the time-frequency representation F is first normalized by an exponential moving average of its
past values M(t, n) = (1−s)M(t−1,n)+sF(t, n), controlled by a smoothing coefficient s and an
exponent αn, with ε being a small constant used to avoid dividing by zero. An offset δn is then added
before applying compression with the exponent rn (typically in [0, 1]). Wang et al. (2017) train αn,
δn, and rn, while treating s as a hyperparameter, or learning a convex combination of exponential
moving averages for different, fixed values of s. Instead, in this work we learn channel-dependent
smoothing coefficients sn, jointly with the rest of the parameters. We call this version sPCEN. This
approach was previously used by Schlüter & Lehner (2018) for singing voice detection, except that
they did not learn the exponents αn, while we learn all these parameters jointly. Our final frontend
cascades a Gabor 1D-convolution, a Gaussian lowpass pooling, and sPCEN. In the rest of the paper,
we refer to this model as LEAF, for “LEarnable Audio Frontend”.
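For clarity, a minimal sketch of sPCEN as defined by equation 6, with per-channel smoothing coefficients sn (all names are ours; the released implementation may differ in details such as the initialization of the moving average):

```python
import numpy as np

def spcen(F, s, alpha, delta, r, eps=1e-6):
    """sPCEN: PCEN with per-channel learnable smoothing coefficients (eq. 6).

    F:                  (M, N) time-frequency features (frames x channels).
    s, alpha, delta, r: (N,) per-channel smoothing, AGC exponent, offset, root.
    """
    M = np.empty_like(F)
    M[0] = F[0]  # simple initialization choice for the moving average
    for t in range(1, F.shape[0]):
        # Exponential moving average over past frames, one s_n per channel.
        M[t] = (1.0 - s) * M[t - 1] + s * F[t]
    return (F / (eps + M) ** alpha + delta) ** r - delta ** r
```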
4 EXPERIMENTS
We evaluate our frontend on three supervised learning problems: i) single-task classification; ii)
multi-task classification and iii) multi-label classification on Audioset. As baselines, we compare
our system to log-compressed mel-filterbanks, learnable Time-Domain filterbanks (Zeghidour et al.,
2018a)1 and SincNet (Ravanelli & Bengio, 2018)2. We reimplemented these baselines to match
their official implementation. In all our experiments, we keep the same common backbone network
architecture, which consists of a frontend, a convolutional encoder, and one (or more) head(s). We
adopt the lightweight version of EfficientNet (Tan & Le, 2019) (EfficientNetB0, with 4M param-
eters) as convolutional encoder. On AudioSet, we also experiment with a state-of-the-art CNN14
encoder (Kong et al., 2019), with 81 M parameters. A head is a single linear layer with a number of
outputs equal to the number of classes. In the multi-task setting we use a different head for each of
the target tasks, all sharing the same encoder.
The input signal sampled at Fs = 16 kHz is passed through the frontend which feeds into the convo-
lutional encoder. As baseline, we use a log-compressed mel-filterbank with 40 channels, computed
over windows of 25 ms with a stride of 10 ms. For a fair comparison, both LEAF and the learnable
baselines also have N = 40 filters, each with W = 401 coefficients (≈25 ms at 16 kHz). The learn-
able pooling is computed over 401 samples with a stride of 160 samples (10 ms at 16 kHz), giving
the same output dimension as mel-filterbanks. On AudioSet we use 64 channels instead of 40 as
Kong et al. (2019) observed improvements from using 64 mel-filters.
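For reference, the filter length and pooling stride above follow directly from the 25 ms window and 10 ms hop at 16 kHz; a small sketch (the +1 that makes the filter length odd and symmetric is our convention):

```python
sample_rate = 16000
filter_len = int(0.025 * sample_rate) + 1   # 401 samples, ~25 ms filters
hop = int(0.010 * sample_rate)              # 160 samples, 10 ms stride
frames_per_second = sample_rate // hop      # 100 output frames per second
```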
To address the variable length of the input sequences, we train on randomly sampled 1 second win-
dows. We train with ADAM (Kingma & Ba, 2014) and a learning rate of 10^−4 for 1M batches, with
1. https://github.com/facebookresearch/tdfbanks
2. https://github.com/mravanelli/SincNet
Table 1: Test accuracy (%) for single-task classification.

Task                  Mel           TD-fbanks     SincNet       LEAF
Acoustic scenes       99.2 ± 0.4    99.5 ± 0.3    96.7 ± 0.9    99.1 ± 0.5
Birdsong detection    78.6 ± 1.0    80.9 ± 0.9    78.0 ± 1.0    81.4 ± 0.9
Emotion recognition   49.1 ± 2.4    57.1 ± 2.4    44.2 ± 2.4    57.8 ± 2.4
Speaker Id. (VC)      31.9 ± 0.7    25.3 ± 0.7    43.5 ± 0.8    33.1 ± 0.7
Music (instrument)    72.0 ± 0.6    70.0 ± 0.6    70.3 ± 0.6    72.0 ± 0.6
Music (pitch)         91.5 ± 0.3    91.3 ± 0.3    83.8 ± 0.5    92.0 ± 0.3
Speech commands       92.4 ± 0.4    87.3 ± 0.4    89.2 ± 0.4    93.4 ± 0.3
Language Id.          76.5 ± 0.4    71.6 ± 0.5    78.9 ± 0.4    86.0 ± 0.4
Average               73.9 ± 0.8    72.9 ± 0.8    73.1 ± 0.9    76.9 ± 0.8
Table 2: Impact of the compression layer on the performance (single task, average accuracy in %).

           mel-filterbank                           LEAF filterbank
           log          PCEN         sPCEN          log          PCEN         sPCEN
Average    73.9 ± 0.8   76.4 ± 0.8   76.0 ± 0.7     74.6 ± 0.7   76.4 ± 0.8   76.9 ± 0.8
batch size 256. For Audioset experiments, we train with mixup (Zhang et al., 2017) and SpecAug-
ment (Park et al., 2019). During evaluation, we consider the full-length sequences, splitting them
into consecutive non-overlapping 1 second windows and averaging the output logits over windows.
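A sketch of this evaluation protocol, where `model` stands for any callable mapping a fixed-length waveform to class logits (a placeholder, not the paper's API):

```python
import numpy as np

def predict_clip(model, waveform, sample_rate=16000, window_seconds=1.0):
    """Average logits over consecutive non-overlapping 1 s windows."""
    win = int(window_seconds * sample_rate)
    n_windows = max(1, len(waveform) // win)
    logits = [model(waveform[i * win:(i + 1) * win]) for i in range(n_windows)]
    return np.mean(np.stack(logits), axis=0)
```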
4.1 SINGLE-TASK AUDIO CLASSIFICATION
We train independent single-task supervised models on 8 distinct classification problems: acoustic
scene classification on TUT (Heittola et al., 2018), birdsong detection (Stowell et al., 2018), emotion
recognition on Crema-D (Cao et al., 2014), speaker identification on VoxCeleb (Nagrani et al., 2017),
musical instrument and pitch detection on NSynth (Engel et al., 2017), keyword spotting on Speech
Commands (Warden, 2018), and language identification on VoxForge (Revay & Teschke, 2019). A
summary of the datasets used in our experiments is provided in Table A.1.
Table 1 reports the results for each task, with 95% confidence intervals representing the uncertainty
due to the limited test sample size. No class rebalancing is applied, neither during training, nor
during testing. On average, we observe that LEAF outperforms all alternatives. When considering
results for each individual task, we observe that LEAF outperforms or matches the accuracy of
other frontends, with the notable exception of SincNet on Voxceleb. This is consistent with the
fact that SincNet was originally proposed for speaker identification (Ravanelli & Bengio, 2018) and
illustrates the importance of evaluating over a wide range of tasks. To evaluate the robustness of our
results with respect to the choice of the datasets, we apply a statistical bootstrap (Efron & Tibshirani,
1993) to compute the non-parametric distribution of the difference between the accuracy obtained
with LEAF and each of the other frontends, when sampling datasets with replacement among the
eight datasets. We test the null hypothesis that the mean of the difference is zero and measure
the following one-sided p-values: pMel < 10^−5, pTD-fbanks < 10^−5, pSincNet = 0.059. Figure A.1
illustrates the corresponding bootstrap distribution, showing the statistical significance of our results
with respect to the choice of the datasets.
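A possible implementation of this bootstrap test, assuming the per-dataset accuracies are available as arrays (names and the number of resamples are ours):

```python
import numpy as np

def bootstrap_p_value(acc_leaf, acc_other, n_resamples=100000, seed=0):
    """One-sided bootstrap over datasets (Efron & Tibshirani, 1993).

    acc_leaf, acc_other: per-dataset accuracies (here, 8 values each).
    Datasets are resampled with replacement; the p-value is the fraction of
    resamples whose mean accuracy difference is not positive.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(acc_leaf, float) - np.asarray(acc_other, float)
    idx = rng.integers(0, len(diffs), size=(n_resamples, len(diffs)))
    return float(np.mean(diffs[idx].mean(axis=1) <= 0.0))
```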
We dive deeper into the comparison between the mel-filterbanks frontend and LEAF by studying the
effect of the compression component (log, PCEN, sPCEN). Table 2 shows that PCEN improves sig-
nificantly over log-compression on both mel- and learned filterbanks. The proposed LEAF frontend
adopts sPCEN, which gives the best performance. We also observe that for any choice of the com-
pression function, the learnable frontend matches or outperforms the corresponding mel-filterbank.
4.2 MULTI-TASK AUDIO CLASSIFICATION
To learn a general-purpose frontend that generalizes across tasks, we train a multi-task model that
uses the same LEAF parameters and encoder for all tasks, with a task-specific linear head. More
Table 3: Test accuracy (%) for multi-task classification.

Task                  Mel            TD-fbanks      SincNet        LEAF
Acoustic scenes       99.1 ± 0.5     98.3 ± 0.6     91.0 ± 1.4     98.9 ± 0.5
Birdsong detection    81.3 ± 0.9     82.3 ± 0.9     78.8 ± 0.9     81.9 ± 0.9
Emotion recognition   24.1 ± 2.1     24.4 ± 2.1     26.2 ± 2.1     31.9 ± 2.3
Speaker Id. (LBS)     100.0 ± 0.0    100.0 ± 0.0    100.0 ± 0.0    100.0 ± 0.0
Music (instrument)    70.7 ± 0.6     66.3 ± 0.6     67.4 ± 0.6     70.2 ± 0.6
Music (pitch)         88.5 ± 0.4     86.4 ± 0.4     81.2 ± 0.5     88.6 ± 0.4
Speech commands       93.6 ± 0.3     89.5 ± 0.4     91.4 ± 0.4     93.6 ± 0.3
Language Id.          64.9 ± 0.5     58.9 ± 0.5     60.8 ± 0.5     69.6 ± 0.5
Average               77.8 ± 0.7     75.8 ± 0.7     74.6 ± 0.8     79.3 ± 0.7
Table 4: Test AUC and d-prime (± standard deviation over three runs) on Audioset, with the number
of learnable parameters per frontend.

                         EfficientNetB0                CNN14 (ours)                  CNN14 (Kong et al., 2019)
Frontend     #Params     AUC             d-prime       AUC             d-prime       AUC       d-prime
Mel          0           0.968 ± .001    2.61 ± .02    0.972 ± .000    2.71 ± .01    0.973     2.73
Mel-PCEN     256         0.967 ± .000    2.60 ± .01    0.973 ± .000    2.72 ± .00    -         -
Wavegram     300 k       0.958 ± .000    2.44 ± .00    0.961 ± .001    2.50 ± .02    0.968     2.61
TD-fbanks    51 k        0.965 ± .001    2.57 ± .01    0.972 ± .000    2.70 ± .00    -         -
SincNet      256         0.961 ± .000    2.48 ± .00    0.970 ± .000    2.66 ± .01    -         -
SincNet+     448         0.966 ± .002    2.58 ± .04    0.973 ± .001    2.71 ± .01    -         -
LEAF         448         0.968 ± .001    2.63 ± .01    0.974 ± .000    2.74 ± .01    -         -
specifically, a single network with K heads (hθ1, ..., hθK) is trained on mini-batches, uniformly
sampled from the K datasets. A mini-batch of size B can be represented as (x_i^{ki}, y_i^{ki})_{i=1..B}, where
ki is the associated task the example has been sampled from. The multi-task loss function is now
computed on a mini-batch as the sum of the individual loss functions:

L = Σ_{i=1}^{B} Σ_{k=1}^{K} Lk( hθk( gθ(Fψ(x_i^{ki})) ), y_i^{ki} ) · δ(ki, k),        (7)

where θ and ψ represent the shared parameters of the encoder and frontend respectively, θk, k =
1, ..., K the task-specific parameters of the heads hθk(·), and δ is the Kronecker delta function.
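A minimal sketch of this loss, where `frontend`, `encoder`, `heads` and `losses` are placeholders for the actual network modules and per-task losses:

```python
def multitask_loss(frontend, encoder, heads, losses, batch):
    """Mini-batch multi-task loss of equation 7.

    heads, losses: lists of K task-specific callables.
    batch:         list of (x, y, k) triples, k being the task the example
                   was sampled from.
    """
    total = 0.0
    for x, y, k in batch:
        features = encoder(frontend(x))            # shared parameters psi, theta
        total += losses[k](heads[k](features), y)  # task-specific head theta_k
    return total
```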
We use the same set of tasks described in Section 4.1 with one exception, replacing VoxCeleb with
Librispeech as a representative speaker identification task. As illustrated in Table A.1, VoxCeleb
has many more classes (≈1,200) than any other task. This creates an imbalance in the number of
parameters in the heads, making training significantly slower and reducing the accuracy on all other
tasks, regardless of the frontend. Table 3 reports the accuracy obtained on each task, for all the
frontends. LEAF shows the best overall performance while matching or outperforming all other
methods on every task. We repeated the analysis using the bootstrap to evaluate the statistical
significance of these results with respect to the choice of the datasets and observed the following
p-values: pMel = 0.048, pTD-fbanks < 10^−5, pSincNet = 10^−5. Figure A.2 illustrates the corresponding
bootstrap distribution. To the best of our knowledge, it is the first time a learnable frontend is shown
to outperform mel-filterbanks over several tasks with a single parametrization.
4.3 MULTI-LABEL CLASSIFICATION ON AUDIOSET
Table 4 shows the performance (averaged over three runs) of the different frontends on Au-
dioset (Gemmeke et al., 2017), a large-scale multi-label dataset of sounds of all categories, described
by an ontology of 527 classes. Audioset examples are weakly labelled with one or more positive
labels, so we evaluate with the standard metric of this dataset, the balanced d-prime, a non-linear
transformation of the AUC averaged uniformly across all classes. EfficientNetB0 trained with LEAF
achieves a d-prime of 2.63, outperforming the level achieved when using mel-filterbanks (d-prime:
2.61) or other frontends. We compare this result with the state-of-the-art PANN model (Kong et al.,
2019), which reports a d-prime of 2.73 when trained on mel-filterbanks. Kong et al. (2019) also train
a learnable “Wavegram” frontend made of 9 layers of 1D- and 2D- convolutions, for a total of 300 k
parameters, reporting a d-prime equal to 2.61. EfficientNetB0 on LEAF thus outperforms this sys-
tem, while having a frontend with 670x fewer parameters, and a considerably smaller encoder. To
confirm these results, we also train EfficientNetB0 on the Wavegram and get similar findings. All
these findings are replicated when replacing EfficientNetB0 with our implementation of CNN14:
even though the gap between LEAF and baselines reduces as the scores get higher, CNN14 on
LEAF matches the current state-of-the-art on AudioSet. To conclude these experiments and show
that individual components of LEAF can benefit other frontends, we replace the max-pooling of
SincNet with our Gaussian lowpass filter, and LayerNorm (Ba et al., 2016) with the proposed sP-
CEN. This version, named “SincNet+” in Table 4, significantly outperforms the original version,
regardless of the encoder.
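For reference, the balanced d-prime used in this section can be recovered from the AUC under the standard equal-variance Gaussian assumption; a small sketch (not the paper's evaluation code):

```python
import numpy as np
from scipy.stats import norm

def d_prime(auc):
    """d' = sqrt(2) * Phi^{-1}(AUC), with Phi^{-1} the inverse standard
    normal CDF."""
    return np.sqrt(2.0) * norm.ppf(auc)

# Sanity check against Table 4 (the reported AUCs are rounded):
# d_prime(0.968) ~ 2.62 and d_prime(0.974) ~ 2.75.
```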
4.4 ANALYSIS OF LEARNED FILTERS, POOLING AND COMPRESSION
Figure A.3 illustrates the center frequencies learned by the Gabor filtering layer of LEAF on Au-
dioSet, and compares them to those of mel-filterbanks. At a high level, these filters do not deviate
much from their mel-scale initialization. On the one hand, this indicates that the mel-scale is a strong
initialization, a result consistent with previous work (Sainath et al., 2015; Zeghidour et al., 2018b).
On the other hand, there are differences at both ends of the range, with LEAF covering a wider
range of frequencies. For example, the lowest frequency filter is centered around 60 Hz, as opposed
to 100 Hz for mel-filterbanks. We believe this is one of the reasons behind the stronger performance
of LEAF on Audioset, as it focuses on a more appropriate frequency range to represent the underly-
ing audio events. Figure A.4 shows the learned Gaussian lowpass filters at convergence. We see that
they all deviate towards a larger frequency bandwidth (characterized by a smaller standard deviation
in the time domain), and that each filter has a different bandwidth, which confirms the utility of using
depthwise pooling. Figure A.5 shows the learned exponents rn (initialized at 2.0) and smoothing
coefficients sn (initialized at 0.04) of sPCEN, ordered by filter index. The exponents are spread in
[1.9, 2.6], which shows the importance of learning per-channel exponents, as well as the appropriate
choice of using a fixed cubic root in previous work (Lyons & Paliwal, 2008), as an alternative to
log-compression. Learning per-channel smoothing coefficients also allows deviating from the ini-
tialization. Interestingly, most filters keep a low coefficient (slowly moving average), except for the
filter with the highest frequency, which has a significantly faster moving average (s≈0.15).
4.5 ROBUSTNESS TO NOISE
We compare the robustness of LEAF and mel-filterbanks to synthetic noise, on the Speech Com-
mands dataset. To do so, we artificially add Gaussian noise to the waveforms both during training
and evaluation, with different gains to obtain a Signal-to-Noise Ratio (SNR) from + inf (no noise)
to −5 dB. Figure A.6 shows that while performance (averaged over three runs) degrades for all
models as SNR decreases, LEAF is more resilient than mel-filterbanks. In particular, when using
a logarithmic compression, the Gabor 1-D convolution and Gaussian pooling of LEAF maintain
a significantly higher accuracy than mel-filterbanks. Using PCEN has an even higher impact on
performance when using mel-filterbanks, the best results being obtained with LEAF + PCEN.
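A sketch of the corruption used here, adding white Gaussian noise scaled to a target SNR (the exact noise pipeline of our experiments may differ in details):

```python
import numpy as np

def add_gaussian_noise(x, snr_db, rng=None):
    """Add white Gaussian noise so that the resulting SNR is snr_db dB."""
    if rng is None:
        rng = np.random.default_rng(0)
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return x + rng.normal(0.0, np.sqrt(noise_power), size=x.shape)
```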
5 CONCLUSION
In this work we introduce LEAF, a fully learnable frontend for audio classification as an alternative
to using handcrafted mel-filterbanks. We demonstrate over a large range of tasks that our model is a
good drop-in replacement to these features with no adjustment to the task at hand, and can even learn
a single set of parameters for general-purpose audio classification while outperforming previously
proposed learnable frontends. In future work, we will take a further step toward removing hand-
crafted biases from the model. In particular, our model still relies on an underlying convolutional
architecture, with fixed filter length and stride. Learning these important parameters directly from
data would allow for an easier generalization across tasks with various sampling rates and frequency
content. Moreover, we believe that the general principle of learning to filter, pool and compress can
benefit the analysis of non-audio signals, such as seismic data or physiological recordings.
6 ACKNOWLEDGMENTS
The authors thank Dick Lyon, Vincent Lostanlen, Matt Harvey, and Alex Park for helpful discussions,
and are grateful to Julie Thomas for helping with the design of Figure 1. Finally, the authors
thank the reviewers of ICLR 2021 for their feedback, which helped improve the quality of this work.
REFERENCES
Joakim Andén and Stéphane Mallat. Deep scattering spectrum. IEEE Transactions on Signal Pro-
cessing, 62(16):4114–4128, 2014.
Blaise Agüera y Arcas, Beat Gfeller, Ruiqi Guo, Kevin Kilgour, Sanjiv Kumar, James Lyon, Julian
Odell, Marvin Ritter, Dominik Roblek, Matthew Sharifi, and Mihajlo Velimirovic. Now Playing:
Continuous low-power music recognition. Technical report, nov 2017. URL http://arxiv.
org/abs/1711.10958.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint
arXiv:1607.06450, 2016.
Randall Balestriero, Romain Cosentino, Hervé Glotin, and Richard Baraniuk. Spline filters for
end-to-end deep learning. In International Conference on Machine Learning, pp. 364–373, 2018.
Eric Battenberg, Rewon Child, Adam Coates, Christopher Fougner, Yashesh Gaur, Jiaji Huang,
Heewoo Jun, Ajay Kannan, Markus Kliegl, Atul Kumar, et al. Reducing bias in production
speech models. arXiv preprint arXiv:1705.04400, 2017.
Houwei Cao, David G Cooper, Michael K Keutmann, Ruben C Gur, Ani Nenkova, and Ragini
Verma. CREMA-D: Crowd-sourced emotional multimodal actors dataset. IEEE transactions on
affective computing, 5(4):377–390, 2014.
Steven Davis and Paul Mermelstein. Comparison of parametric representations for monosyllabic
word recognition in continuously spoken sentences. IEEE transactions on acoustics, speech, and
signal processing, 28(4):357–366, 1980.
Bradley Efron and Robert J. Tibshirani. An Introduction to the Bootstrap. Number 57 in Monographs
on Statistics and Applied Probability. Chapman & Hall/CRC, Boca Raton, Florida, USA, 1993.
Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck, Karen Simonyan, and
Mohammad Norouzi. Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders.
Technical report, apr 2017. URL http://arxiv.org/abs/1704.01279.
Tiago H Falk and Wai-Yip Chan. Modulation spectral features for robust far-field speaker identifi-
cation. IEEE Transactions on Audio, Speech, and Language Processing, 18(1):90–100, 2009.
Gustav Theodor Fechner, Davis H Howes, and Edwin Garrigues Boring. Elements of psychophysics,
volume 1. Holt, Rinehart and Winston New York, 1966.
Dennis Gabor. Theory of communication. part 1: The analysis of information. Journal of the
Institution of Electrical Engineers-Part III: Radio and Communication Engineering, 93(26):429–
441, 1946.
Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing
Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for
audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pp. 776–780. IEEE, 2017.
Donald D. Greenwood. The mel scale’s disqualifying bias and a consistency of pitch-difference
equisections in 1956 with equal cochlear distances and equal frequency ratios. Hearing Research,
103:199–224, 1997.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recog-
nition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.
770–778, 2016.
Toni Heittola, Annamaria Mesaros, and Tuomas Virtanen. TUT Urban Acoustic Scenes 2018, De-
velopment dataset, April 2018. URL https://doi.org/10.5281/zenodo.1228142.
Shawn Hershey, Sourish Chaudhuri, Daniel P.W. Ellis, Jort F Gemmeke, Aren Jansen, R Channing
Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, Malcolm Slaney, Ron J Weiss,
and Kevin Wilson. CNN architectures for large-scale audio classification. In ICASSP, IEEE
International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 131–
135, 2017. ISBN 9781509041176. doi: 10.1109/ICASSP.2017.7952132.
Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly,
Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks
for acoustic modeling in speech recognition: The shared views of four research groups. IEEE
Signal processing magazine, 29(6):82–97, 2012.
Yedid Hoshen, Ron Weiss, and Kevin W Wilson. Speech acoustic modeling from raw multichannel
waveforms. In International Conference on Acoustics, Speech, and Signal Processing, 2015.
Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected
convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pp. 4700–4708, 2017.
Navdeep Jaitly and Geoffrey Hinton. Learning a better representation of speech soundwaves using
restricted boltzmann machines. In 2011 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), pp. 5884–5887. IEEE, 2011.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
Qiuqiang Kong, Yin Cao, T. Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D. Plumbley.
Panns: Large-scale pretrained audio neural networks for audio pattern recognition. ArXiv,
abs/1912.10211, 2019.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convo-
lutional neural networks. In Advances in neural information processing systems, pp. 1097–1105,
2012.
V. Lostanlen, J. Salamon, M. Cartwright, B. McFee, A. Farnsworth, S. Kelling, and J. P. Bello.
Per-channel energy normalization: Why and how. IEEE Signal Processing Letters, 26(1):39–43,
2019.
Vincent Lostanlen, Justin Salamon, Andrew Farnsworth, Steve Kelling, and Juan Pablo Bello.
Birdvox-full-night: A dataset and benchmark for avian flight call detection. In 2018 IEEE Inter-
national Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 266–270. IEEE,
2018.
Yi Luo and Nima Mesgarani. Conv-tasnet: Surpassing ideal time-frequency magnitude masking for
speech separation. IEEE/ACM Trans. Audio, Speech & Language Processing, 27(8):1256–1266,
2019.
Yi Luo, Zhuo Chen, and Takuya Yoshioka. Dual-path rnn: efficient long sequence modeling for
time-domain single-channel speech separation. arXiv preprint arXiv:1910.06379, 2019.
James G Lyons and Kuldip K Paliwal. Effect of compressing the dynamic range of the power
spectrum in modulation filtering based speech enhancement. In Ninth Annual Conference of the
International Speech Communication Association, 2008.
Nelson Mogran, Herv� Bourlard, and Hynek Hermansky. Automatic speech recognition: An audi-
tory perspective. In Speech processing in the auditory system, pp. 309–338. Springer, 2004.
Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. Voxceleb: A large-scale speaker iden-
tification dataset. Interspeech 2017, Aug 2017. doi: 10.21437/interspeech.2017-950. URL
http://dx.doi.org/10.21437/Interspeech.2017-950.
Paul-Gauthier Noé, Titouan Parcollet, and Mohamed Morchid. Cgcnn: Complex gabor convolu-
tional neural network on raw speech. ICASSP 2020 - 2020 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pp. 7724–7728, 2020.
D. O’Shaughnessy. Speech communication : human and machine. 1987.
Dimitri Palaz, Ronan Collobert, and Mathew Magimai Doss. Estimating phoneme class condi-
tional probabilities from raw speech signal using convolutional neural networks. arXiv preprint
arXiv:1304.1018, 2013.
Dimitri Palaz, Mathew Magimai Doss, and Ronan Collobert. Convolutional neural networks-based
continuous speech recognition using raw speech signal. In 2015 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pp. 4295–4299. IEEE, 2015.
Manuel Pariente, Samuele Cornell, Antoine Deleforge, and Emmanuel Vincent. Filterbank de-
sign for end-to-end speech separation. In ICASSP 2020-2020 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pp. 6364–6368. IEEE, 2020.
Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and
Quoc V Le. Specaugment: A simple data augmentation method for automatic speech recognition.
arXiv preprint arXiv:1904.08779, 2019.
Mirco Ravanelli and Yoshua Bengio. Speaker recognition from raw waveform with sincnet. In 2018
IEEE Spoken Language Technology Workshop (SLT), pp. 1021–1028. IEEE, 2018.
Shauna Revay and Matthew Teschke. Multiclass language identification using deep learning on
spectral images of audio signals. CoRR, abs/1905.04348, 2019. URL http://arxiv.org/
abs/1905.04348.
T. Sainath, Brian Kingsbury, Abdel rahman Mohamed, and B. Ramabhadran. Learning filter banks
within a deep neural network framework. 2013 IEEE Workshop on Automatic Speech Recognition
and Understanding, pp. 297–302, 2013.
Tara N Sainath, Ron J Weiss, Andrew Senior, Kevin W Wilson, and Oriol Vinyals. Learning the
speech front-end with raw waveform cldnns. In Sixteenth Annual Conference of the International
Speech Communication Association, 2015.
Jan Schlüter and Bernhard Lehner. Zero-mean convolutions for level-invariant singing voice detec-
tion. In ISMIR, 2018.
Ralf Schluter, Ilja Bezrukov, Hermann Wagner, and Hermann Ney. Gammatone features and feature
combination for large vocabulary speech recognition. In 2007 IEEE International Conference on
Acoustics, Speech and Signal Processing-ICASSP’07, volume 4, pp. IV–649. IEEE, 2007.
Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. wav2vec: Unsupervised
pre-training for speech recognition. arXiv preprint arXiv:1904.05862, 2019.
Andrew Senior, Hasim Sak, Felix de Chaumont Quitry, Tara N. Sainath, and Kanishka Rao. Acoustic
modelling with cd-ctc-smbr lstm rnns. In ASRU, 2015.
Stanley S Stevens and John Volkmann. The relation of pitch to frequency: A revised scale. The
American Journal of Psychology, 53(3):329–353, 1940.
Dan Stowell, Mike Wood, Hanna Pamuła, Yannis Stylianou, and Hervé Glotin. Automatic
acoustic detection of birds through deep learning: the first Bird Audio Detection challenge. Tech-
nical report, 2018. URL https://arxiv.org/pdf/1807.05812.pdf.
Gabriel Synnaeve, Qiantong Xu, Jacob Kahn, Edouard Grave, Tatiana Likhomanenko, Vineel
Pratap, Anuroop Sriram, Vitaliy Liptchinsky, and Ronan Collobert. End-to-end asr: from super-
vised to semi-supervised learning with modern architectures. arXiv preprint arXiv:1911.08460,
2019.
Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural
networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th
International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, Cali-
fornia, USA, volume 97 of Proceedings of Machine Learning Research, pp. 6105–6114. PMLR,
2019. URL http://proceedings.mlr.press/v97/tan19a.html.
Samuel Thomas, Sriram Ganapathy, and Hynek Hermansky. Hilbert envelope based features for
far-field speech recognition. In International Workshop on Machine Learning for Multimodal
Interaction, pp. 119–124. Springer, 2008.
Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. Attention-based wav2text with feature trans-
fer learning. In 2017 IEEE automatic speech recognition and understanding workshop (ASRU),
pp. 309–315. IEEE, 2017.
S. Umesh, L. Cohen, and D. Nelson. Fitting the mel scale. 1999 IEEE International Conference
on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258), 1:
217–220 vol.1, 1999.
Jesús Villalba, Nanxin Chen, David Snyder, Daniel Garcia-Romero, Alan McCree, Gregory Sell,
Jonas Borgstrom, Leibny Paola García-Perera, Fred Richardson, Réda Dehak, et al. State-of-the-
art speaker recognition with neural network embeddings in nist sre18 and speakers in the wild
evaluations. Computer Speech & Language, 60:101026, 2020.
Yuxuan Wang, Pascal Getreuer, Thad Hughes, Richard F Lyon, and Rif A Saurous. Trainable
frontend for robust and far-field keyword spotting. In 2017 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pp. 5670–5674. IEEE, 2017.
Pete Warden. Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. Technical
report, 2018. URL https://arxiv.org/pdf/1804.03209.pdf.
Neil Zeghidour, Nicolas Usunier, Iasonas Kokkinos, Thomas Schatz, Gabriel Synnaeve, and Em-
manuel Dupoux. Learning filterbanks from raw speech for phone recognition. In 2018 IEEE
international conference on acoustics, speech and signal Processing (ICASSP), pp. 5509–5513.
IEEE, 2018a.
Neil Zeghidour, Nicolas Usunier, Gabriel Synnaeve, Ronan Collobert, and Emmanuel Dupoux. End-
to-end speech recognition from the raw waveform. Interspeech, abs/1806.07098, 2018b.
Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical
risk minimization. arXiv preprint arXiv:1710.09412, 2017.
Richard Zhang. Making convolutional networks shift-invariant again. arXiv preprint
arXiv:1904.11486, 2019.
A APPENDIX
Table A.1: Datasets used in the experiments. Default train/test splits are always adopted.

Task                  Name               Classes    Train examples    Test examples
Audio events          Audioset           527        1,832,720         17,695
Acoustic scenes       TUT Urban 2018     10         7,829             810
Birdsong detection    DCASE2018          2          32,129            3,561
Emotion recognition   Crema-D            6          5,146             820
Speaker Id.           Voxceleb           1,251      138,361           8,249
Speaker Id.           Librispeech        251        25,740            2,799
Music (instrument)    Nsynth             11         289,205           12,678
Music (pitch)         Nsynth             128        289,205           12,678
Speech commands       Speech commands    35         84,771            10,700
Language Id.          Voxforge           6          126,610           18,378
[Figure A.1: three histogram panels, titled LEAF - mel (p = 0.000), LEAF - tdfbanks (p = 0.000), and LEAF - SincNet (p = 0.059).]
Figure A.1: Distribution of the difference between the accuracy (%) obtained with LEAF and each
of the other frontends when sampling 8 datasets with replacement, in the single task setting.
[Figure A.2: three histogram panels, titled LEAF - mel (p = 0.048), LEAF - tdfbanks (p = 0.000), and LEAF - SincNet (p = 0.000).]
Figure A.2: Distribution of the difference between the accuracy obtained with LEAF and each of
the other frontends when sampling 8 datasets with replacement, in the multi-task setting.
[Figure A.3: center frequency (Hz, log scale) as a function of the filter index, for the mel-spectrogram and for LEAF trained on AudioSet.]
Figure A.3: Comparison between the filters learned by LEAF and the mel scale.
[Figure A.4: learned depthwise Gaussian lowpass filters, plotted over the 401-sample window.]
Figure A.4: Learned Gaussian lowpass filters of LEAF on AudioSet. The dotted line represents the
initialization, identical for all filters.
[Figure A.5: two panels plotted against filter index, titled PCEN smooth coefficients and PCEN root coefficients.]
Figure A.5: Values of learned sPCEN parameters of LEAF trained on AudioSet. The dotted lines
represent the initialization, identical for all filters.
[Figure A.6: test accuracy as a function of SNR for LEAF-PCEN, LEAF-log, Mel-PCEN and Mel-log.]
Figure A.6: Test accuracy (%) on the test set of the Speech Commands dataset with varying SNR.