ACLNET: EFFICIENT END-TO-END AUDIO CLASSIFICATION CNN
Jonathan J Huang, Juan Jose Alvarado Leanos
Intel Corporation
ABSTRACT
We propose an efficient end-to-end convolutional neural
network architecture, AclNet, for audio classification. When
trained with our data augmentation and regularization, we
achieved state-of-the-art performance on the ESC-50 corpus
with 85.65% accuracy. Our network allows configurations
such that memory and compute requirements are drastically
reduced, and a tradeoff analysis of accuracy and complexity is
presented. The analysis shows high accuracy at significantly
reduced computational complexity compared to existing solu-
tions. For example, a configuration with only 155k parameters and 49.3 million multiply-adds per second achieves 81.75% accuracy, exceeding the human accuracy of 81.3%. This improved efficiency
can enable always-on inference in energy-efficient platforms.
Index Terms— Convolutional neural networks, end-to-end CNN, environmental sound classification, audio classification
1. INTRODUCTION
Following the successes of image classification, convolutional
neural network (CNN) architectures have become popular-
ized for audio classification. Hershey et al. [1] showed that large image classification CNNs trained on huge amounts of weakly labeled YouTube data learn semantically meaningful representations that form the basis of a powerful classifier. In the
recent DCASE acoustic scene classification task [2], the top
submissions are mostly CNN-based [3], [4], [5]. Likewise,
many of the top results for the ESC-50 corpus [6] use various
forms of CNNs [7], [8], [9], [10].
While prior work on CNN audio classifiers has focused on accuracy for particular tasks, none that we are aware of has treated computational efficiency as a primary objective. The first contribution of this paper is a scalable architecture that, at the high end, achieves one of the best accuracies on ESC-50 and, at the low end, offers the flexibility to scale down to extremely small model sizes. The advantage of a scalable architecture is that it allows
for flexibility in inference platforms with various system con-
straints. The efficiency brought by our architecture allows for
low-power always-on inference on DSPs or dedicated neural
net accelerators [11], [12]. The second contribution is the ap-
plication of mixup data augmentation [13] for audio classifi-
cation, and we show that it is a major contributor to the high accuracy
due to improved generalization.
AclNet is an end-to-end (e2e) CNN architecture that takes a raw time-domain waveform as input, as opposed to the more popular technique of using spectral features such as mel-filterbank energies or mel-frequency cepstral coefficients (MFCC). One advantage of an e2e architecture is that the front end makes no assumptions about the spectral content. Its
feature representation is learned in a data-driven manner, thus
its features are optimized for the task at hand as long as there
is sufficient training data. Another advantage of e2e is that it eliminates spectral feature extraction, which simplifies the software or hardware implementation. Although other
e2e techniques [14], [15], [16], [8] have been studied, our
architecture has a focus on efficiency.
Several results from the optimization of CNNs in the image domain can be borrowed to make audio classification more efficient. Han et al. [17] used pruning, quantization, and Huffman encoding to reduce the complexity of CNNs. Singular value decomposition has been applied to DNNs to reduce model size [18]. AclNet draws its inspiration for efficient computation from MobileNet [19], whose depthwise separable convolutions we employ extensively in this work. With these techniques, human-level accuracy for ESC-
50 was achieved with only 155k parameters and 49.3 million
multiply-adds per second (MMACS).
The remainder of this paper is organized as follows. In
Section 2 we detail the AclNet architecture. Section 3 pro-
vides the details of the data augmentation and model training process. Section 4 presents our findings from the experiments, followed by the conclusion in Section 5.
2. ACLNET ARCHITECTURE
The AclNet architecture consists of two building blocks: the low-level feature (LLF) block shown in Table 1 and the high-level feature (HLF) block shown in Table 2.
2.1. Low-level features
The LLF can be viewed as a replacement for the spectral feature front end, and its two stages of strided 1-D convolutions are equivalent to an FIR decimation filterbank. With the time-domain
waveform as input, the LLF produces an output of 64 channels at a feature frame rate of 10 ms after the maxpool layer. In the example given in Table 1, a 1.28 s input produces an output tensor of dimension (64, 1, 128).
Although the number of parameters in the LLF is invariant to the stride values S1 and S2 and the number of intermediate channels C1, the choice of these values greatly influences the compute complexity and accuracy. Our experiments will show which settings give the most accurate results.
The example in Table 1 is for a 16 kHz sampling rate. The parameters S1 and S2 and all kernel sizes scale linearly with the sampling rate to preserve the same kernel time durations and the 10 ms output frame rate.
Table 1. AclNet low-level features, with 1.28 s of 16 kHz samples. Input dimension (1, 1, 20480).

Layer    | Stride | Out dim              | Out chans | Kernel size
Conv1    | S1     | C1, 1, 20480/S1      | C1        | 9
Conv2    | S2     | 64, 1, 20480/(S1S2)  | 64        | 5
Maxpool1 | 1      | 64, 1, 128           | 64        | 160/(S1S2)
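For concreteness, the LLF of Table 1 can be sketched in PyTorch as follows. The 'same' padding, the omission of batch normalization, and the default (C1, S1, S2) = (8, 2, 2) are illustrative assumptions rather than the exact training configuration.

```python
import torch
import torch.nn as nn

class LowLevelFeatures(nn.Module):
    """Two strided 1-D convolutions followed by max pooling, acting as a
    learned FIR decimation filterbank (Table 1). Sketch for 16 kHz input;
    C1, S1, S2 are configurable, batch normalization is omitted for brevity."""

    def __init__(self, c1=8, s1=2, s2=2):
        super().__init__()
        self.conv1 = nn.Conv1d(1, c1, kernel_size=9, stride=s1, padding=4)
        self.conv2 = nn.Conv1d(c1, 64, kernel_size=5, stride=s2, padding=2)
        # Pool the remaining samples down to one 64-channel frame per 10 ms.
        self.pool = nn.MaxPool1d(kernel_size=160 // (s1 * s2))

    def forward(self, x):
        # x: (batch, 1, 20480) for 1.28 s of 16 kHz audio
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = self.pool(x)       # -> (batch, 64, 128)
        return x.unsqueeze(2)  # -> (batch, 64, 1, 128), matching Table 1

llf = LowLevelFeatures()
print(llf(torch.randn(1, 1, 20480)).shape)  # torch.Size([1, 64, 1, 128])
```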
2.2. High-level features
Continuing from the LLF output, transposing the first two dimensions yields an image-like tensor with dimension (1, 64, 128). The rest of the HLF can thus follow a structure similar to image classification CNNs. We experimented with a number of architectures and found that a VGG-like architecture [20] provides good classification performance with well-understood building blocks. The architecture shown in Table 2 is a modified VGG. Besides changing the depth and channel width, the final layers of the network are also modified. The Conv12 layer is a 1 × 1 convolution that reduces the number of channels to the number of classes, which in the case of ESC-50 is 50. Each of the 50 channels is then average pooled over its 2 × 4 patch, and the result is passed through a softmax. The advantage of these final two layers is that the architecture can accept arbitrary-length inputs without any need to modify the number of hidden units in fully connected layers. An additional benefit of this way of pooling is that it has been shown to be effective for training on weakly labeled datasets [21].
Before the input to the Conv12 layer, we apply a dropout layer. We found a dropout probability of 0.2 to work well on this dataset.
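As an illustration of these final layers, the following PyTorch sketch wires up the dropout, the 1 × 1 Conv12 layer, and the average pooling head. Using AdaptiveAvgPool2d to collapse whatever spatial extent remains is our assumption for realizing the arbitrary-length property, not necessarily the exact original implementation.

```python
import torch
import torch.nn as nn

# Sketch of the final HLF layers: dropout, a 1x1 convolution mapping 512
# channels to the 50 ESC-50 classes, and global average pooling over the
# remaining (frequency, time) extent so that inputs of arbitrary length
# yield one score per class.
head = nn.Sequential(
    nn.Dropout(p=0.2),
    nn.Conv2d(512, 50, kernel_size=1),  # Conv12: (512, 2, 4) -> (50, 2, 4)
    nn.AdaptiveAvgPool2d(1),            # Avgpool1: (50, 2, 4) -> (50, 1, 1)
    nn.Flatten(),                       # -> (batch, 50) class logits
)

x = torch.randn(8, 512, 2, 4)  # Conv11 output for a 1.28 s input
logits = head(x)               # (8, 50); softmax applied in the loss
print(logits.shape)
```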
2.3. Convolutional layers details
Table 2. AclNet high-level features, with input dimension (1, 64, 128) from the LLF.

Layer    | Stride | Out dim      | Out chans | Kernel size
Conv3    | 1      | 32, 64, 128  | 32        | 3 × 3
Maxpool2 | 1      | 32, 32, 64   | 32        | 2 × 2
Conv4    | 1      | 64, 32, 64   | 64        | 3 × 3
Conv5    | 1      | 64, 32, 64   | 64        | 3 × 3
Maxpool3 | 1      | 64, 16, 32   | 64        | 2 × 2
Conv6    | 1      | 128, 16, 32  | 128       | 3 × 3
Conv7    | 1      | 128, 16, 32  | 128       | 3 × 3
Maxpool4 | 1      | 128, 8, 16   | 128       | 2 × 2
Conv8    | 1      | 256, 8, 16   | 256       | 3 × 3
Conv9    | 1      | 256, 8, 16   | 256       | 3 × 3
Maxpool5 | 1      | 256, 4, 8    | 256       | 2 × 2
Conv10   | 1      | 512, 4, 8    | 512       | 3 × 3
Conv11   | 1      | 512, 4, 8    | 512       | 3 × 3
Maxpool6 | 1      | 512, 2, 4    | 512       | 2 × 2
Conv12   | 1      | 50, 2, 4     | 50        | 1 × 1
Avgpool1 | 1      | 50           | 50        | 2 × 4

All convolutional layers shown in Tables 1 and 2, except the first layer of each block (i.e., Conv1 and Conv3), can be configured in one of two forms:
• Standard convolution (SC): the standard building block of a convolution layer, batch normalization, and ReLU activation.
• Depthwise separable convolution (DWSC): the convolution is decomposed into a depthwise convolution followed by a pointwise convolution, each followed by batch normalization and ReLU activation, as in MobileNet [19].
The advantage of DWSC is that it uses significantly fewer parameters and operations than SC, but typically at the cost of some degradation in accuracy. We explore the tradeoff between these two convolution types in our experiments.
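As a minimal PyTorch sketch (not the authors' exact code), the two block forms can be written as follows; hyperparameters not named in the paper are assumptions.

```python
import torch.nn as nn

def standard_conv(in_ch, out_ch, stride=1):
    """Standard convolution (SC) block: conv + batch norm + ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def depthwise_separable_conv(in_ch, out_ch, stride=1):
    """DWSC block: a depthwise 3x3 convolution followed by a pointwise 1x1
    convolution, each with batch norm and ReLU, as in MobileNet."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride, padding=1,
                  groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# Parameter comparison for a 256 -> 256 layer (ignoring batch norm):
#   SC:   3*3*256*256       = 589,824
#   DWSC: 3*3*256 + 256*256 =  67,840   (about 8.7x fewer)
```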
2.4. Width multiplier
As in MobileNet, our architecture also has a width multiplier
(WM) to control the complexity of the network. The WM
linearly scales the number of output channels from Conv3 to
Conv11. This parameter is an easy way to manage the ca-
pacity of the network, and our experiments will explore its
accuracy impact on the ESC-50 corpus.
3. EXPERIMENTAL METHODS
3.1. Dataset
We used ESC-50 to train and evaluate the models. ESC-50
contains a total of 2000 examples of environmental sounds arranged in 50 classes. We use the default five folds provided with the dataset for cross-validation in our performance evaluation. All sound files were converted to 16-bit, at 16 kHz and 44.1 kHz sampling rates, for two different sets of experiments. We removed the silent sections at the beginning and end of each recording.
3.2. Data augmentation
We experimented with different input lengths when training AclNet on ESC-50 and a proprietary dataset. Empirically, we found that inputs between 1 and 2 seconds gave the best results, so for the rest of the experiments we chose a 1.5 s input length. In the data loader of the training process, we use the following real-time data augmentation to generate each training example:
1. Choose a random 2 s segment of audio within a training file
2. Center the waveform to zero mean and normalize by the standard deviation
3. Resample the waveform by a random factor chosen uniformly in the range [0.8, 1.25]
4. Crop to exactly 1.5 s
5. Multiply the waveform by a random gain chosen uniformly in the range [−6.0, +6.0] dB
At test time, only the normalization step is applied, and the entire wave file is input to the network.
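A minimal NumPy sketch of this augmentation pipeline is given below. The random ranges follow the list above; the linear-interpolation resampler, the crop-from-the-start policy, and the zero-padding fallback are our simplifications rather than the exact implementation.

```python
import numpy as np

def augment(waveform, sample_rate=44100, clip_seconds=1.5):
    """Sketch of the real-time augmentation; assumes the file is >= 2 s long."""
    rng = np.random.default_rng()

    # 1. Choose a random 2 s segment within the training file.
    seg_len = int(2.0 * sample_rate)
    start = rng.integers(0, max(1, len(waveform) - seg_len))
    x = waveform[start:start + seg_len].astype(np.float32)

    # 2. Center to zero mean and normalize by the standard deviation.
    x = (x - x.mean()) / (x.std() + 1e-8)

    # 3. Resample by a random factor in [0.8, 1.25]
    #    (linear interpolation, interpreted here as a time-scale change).
    factor = rng.uniform(0.8, 1.25)
    new_len = int(len(x) * factor)
    x = np.interp(np.linspace(0, len(x) - 1, new_len), np.arange(len(x)), x)

    # 4. Crop (or pad, as a safety fallback) to exactly 1.5 s.
    out_len = int(clip_seconds * sample_rate)
    x = x[:out_len] if len(x) >= out_len else np.pad(x, (0, out_len - len(x)))

    # 5. Apply a random gain in [-6, +6] dB.
    gain_db = rng.uniform(-6.0, 6.0)
    return x * (10.0 ** (gain_db / 20.0))
```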
3.3. Mixup training
Mixup [13] is a recent technique to improve generalization
by increasing the support of the training distribution. In this
technique, a neighborhood is defined around each example
in the training data by constructing virtual training examples,
that is, pairs of virtual samples and virtual targets $(\tilde{x}, \tilde{y})$. Given two training examples $(x_i, y_i)$ and $(x_j, y_j)$, the new virtual pair is computed as:

$\tilde{x} = \lambda x_i + (1 - \lambda) x_j$   (1)
$\tilde{y} = \lambda y_i + (1 - \lambda) y_j$   (2)

where $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$. The hyperparameter α controls
the amount that is mixed in from the second example. Higher
values of α make the virtual pairs less similar to the origi-
nal unmixed training examples. We experimented with values
from 0.1 to 0.5.
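A sketch of how Eqs. (1) and (2) can be applied to a training batch is shown below; mixing each example with a randomly permuted copy of the batch is a common implementation choice and an assumption on our part.

```python
import numpy as np
import torch
import torch.nn.functional as F

def mixup_batch(x, y, alpha=0.1, num_classes=50):
    """Mix each example with a permuted partner using lambda ~ Beta(alpha, alpha),
    per Eqs. (1) and (2). x: (batch, ...) inputs, y: (batch,) integer labels."""
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(x.size(0))
    y_onehot = F.one_hot(y, num_classes).float()
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix

# Training then uses a soft-target loss, e.g.:
# loss = -(y_mix * F.log_softmax(model(x_mix), dim=1)).sum(dim=1).mean()
```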
3.4. Learning Settings
For all experiments, we used the stochastic gradient descent optimizer with momentum of 0.9, weight decay of 2e-4, and a batch size of 64. We trained the model with a learning rate schedule of three phases: 0.2 for the first 500 epochs, 0.04 for the next 1000, and 0.016 for the last 500, for a total of 2000 epochs per fold. Also, for the first 100 epochs we disabled the mixup procedure as a form of warm-up to improve initial convergence.

Fig. 1. Comparison of the effects of data augmentation and mixup on validation accuracy.
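These settings can be written down as the following PyTorch sketch; the placeholder model and the use of LambdaLR to realize the three-phase schedule are our assumptions, not necessarily the authors' exact code.

```python
import torch

model = torch.nn.Linear(10, 50)  # placeholder; substitute an AclNet instance
optimizer = torch.optim.SGD(model.parameters(), lr=0.2,
                            momentum=0.9, weight_decay=2e-4)

# Three-phase schedule: lr = 0.2 for epochs 0-499, 0.04 for 500-1499,
# and 0.016 for 1500-1999 (multipliers relative to the base lr of 0.2).
def lr_lambda(epoch):
    if epoch < 500:
        return 1.0
    if epoch < 1500:
        return 0.2
    return 0.08

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(2000):
    use_mixup = epoch >= 100  # mixup disabled during the 100 warm-up epochs
    # ... train one epoch with batch size 64, applying mixup if use_mixup ...
    scheduler.step()
```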
4. RESULTS AND ANALYSIS
4.1. Data augmentation and mixup
Several experiments were done to assess the effectiveness of
augmentation and mixup. Figure 1 shows the validation accu-
racy over the course of the training process for various com-
binations of augmentation as explained in Section 3. All ex-
periments were done using a WM of 1.0, and SC. We see an
obvious improvement with each individual augmentation, and
that mixup by itself is more effective than the other form of
augmentation. The best result was achieved when augmen-
tation was combined with mixup, which had an absolute im-
provement of more than 5% above the baseline without any
augmentation. We note that mixup is conceptually similar to between-class learning, which was also shown to work well for ESC-50 [8].
We also experimented with the choice of α in mixup and found that values between 0.1 and 0.2 worked well for the larger architectures; thus, for the remainder of the experiments, we default to this combined augmentation with mixup α = 0.1.
4.2. Low-level feature parameters
In EnvNet [16], analysis showed that two convolutions of kernel size 8 worked best for this dataset. Our experiments confirmed that two convolutions are optimal, but we also found that slightly reducing the kernel size of the second convolution had no impact on accuracy. Our best setting uses kernel sizes of 9 and 5 for the first two convolutions.
In order to determine the choice of other LLF parameters,
we did a grid search of the parameter space over these ranges:
C1 ∈ {8, 16, 32}, S1 ∈ {2, 4, 8}, S2 ∈ {2, 4}.

Table 3. ESC-50 5-fold accuracies with AclNet at select configurations.

Sampling rate | Conv type | LLF params (k) | LLF MMACS | HLF params (k) | HLF MMACS | Total params (k) | Total MMACS | Width multiplier | Accuracy (%)
16 kHz   | DWSC | 1.44 | 4.35  | 13.91   | 2.93   | 15.35   | 7.28   | 0.125 | 75.38
16 kHz   | DWSC | 1.44 | 4.35  | 153.43  | 31.07  | 154.87  | 35.42  | 0.5   | 80.40
16 kHz   | DWSC | 1.44 | 4.35  | 567.92  | 113.7  | 569.4   | 118.1  | 1.0   | 80.90
44.1 kHz | DWSC | 1.81 | 17.98 | 13.91   | 2.96   | 15.72   | 20.94  | 0.125 | 75.50
44.1 kHz | DWSC | 1.81 | 17.98 | 153.43  | 31.33  | 155.23  | 49.31  | 0.5   | 81.75
44.1 kHz | DWSC | 1.81 | 17.98 | 567.92  | 114.6  | 569.73  | 132.59 | 1.0   | 83.10
44.1 kHz | SC   | 6.99 | 80.9  | 77.21   | 8.88   | 84.21   | 131.17 | 0.125 | 82.30
44.1 kHz | SC   | 6.99 | 80.9  | 1190.0  | 132.72 | 1197.0  | 255.01 | 0.5   | 83.95
44.1 kHz | SC   | 6.99 | 80.9  | 4730.0  | 524.67 | 4737.0  | 646.97 | 1.0   | 85.0
44.1 kHz | SC   | 6.99 | 80.9  | 10620   | 786.56 | 10627   | 867.45 | 1.5   | 85.65
We trained AclNet using both the SC and DWSC settings with a width multiplier of 1.0, and found that the values (C1, S1, S2) = (8, 2, 2) for SC and (16, 2, 4) for DWSC gave the best accuracy. For the remainder of the experiments, we default to these best settings for SC and DWSC. The experiments showed about a 3% accuracy difference between the best and worst parameter choices for each setting. The best result in both cases was not the highest-complexity configuration, (32, 2, 2). We suspect the heavier LLF settings might be overfitting, and that with more training data we could reach a different conclusion.
4.3. Complexity versus accuracy
To understand the tradeoff between complexity and accuracy,
we ran three sets of experiments using 1) 16kHz input with
DWSC, 2) 44.1kHz input with DWSC, and 3) 44.1kHz input
with SC. For each set, we did the 5-fold validation with WM
configured at 1/32, 1/16, 0.125, 0.25, 0.5, 0.75, 1.0, 1.5, and
2.0. Figure 2 shows the accuracy versus MMACS for each of
the settings, color-coded by sets. For each of these settings,
increasing complexity generally led to better accuracy. The
exception is at the highest WM, where it is possible that we
hit diminishing returns from higher capacity. In all cases, the drop in accuracy steepens for WM below 0.25. Another observa-
tion is that for the same HLF settings, 44.1kHz sampling rate
improves accuracy by around 2%.
Table 3 shows a subset of these experiments, with details
of LLF, HLF, overall complexity and accuracy. Our best ac-
curacy of 85.65% was achieved with 44.1kHz sampling rate,
SC, and 1.5 WM. At the time of this writing, this is the best
single system accuracy reported for ESC-50 (second overall
behind an ensemble system [7]).

Fig. 2. Accuracy vs. million multiply-adds per second.

With the DWSC models, we can see that the total parameters and MMACS are significantly lower than for SC at the same WM. The 44.1 kHz, DWSC, 0.5 WM configuration achieves 81.75%, which exceeds the human accuracy of 81.3% [6], with only 155k parameters and 49.31 MMACS. We note that human accuracy is also exceeded with SC at a WM of 0.125, a model with a modest 84k parameters and 131.17 MMACS. As a comparison of complexity, EnvNetV2 [8], which at the time of this writing has the best single-model accuracy of 84.9%, uses 101M parameters and 1033 MMACS. Our best model, with an accuracy of 85.65%, has about 1/10 the parameters and 16% fewer operations.
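As a rough guide to how parameter and MAC counts of the kind reported in Table 3 can be obtained, the sketch below counts trainable parameters and accumulates convolution multiply-adds with forward hooks. The input shape (1.5 s at 44.1 kHz), the stand-in model, and the per-layer MAC formula are our assumptions; the paper's MMACS figures additionally normalize per second of audio.

```python
import torch
import torch.nn as nn

def count_params_and_macs(model, input_shape=(1, 1, 66150)):
    """Count trainable parameters and convolution multiply-adds (MACs)
    for one forward pass; a sketch, not the paper's exact accounting."""
    params = sum(p.numel() for p in model.parameters() if p.requires_grad)

    macs = 0
    def hook(module, inputs, output):
        nonlocal macs
        kernel_volume = 1
        for k in module.kernel_size:
            kernel_volume *= k
        # Each output element needs kernel_volume * (in_channels / groups) MACs.
        macs += output.numel() * kernel_volume * module.in_channels // module.groups

    handles = [m.register_forward_hook(hook) for m in model.modules()
               if isinstance(m, (nn.Conv1d, nn.Conv2d))]
    with torch.no_grad():
        model(torch.randn(*input_shape))
    for h in handles:
        h.remove()
    return params, macs

# Example with a stand-in model (substitute an AclNet instance):
model = nn.Sequential(nn.Conv1d(1, 64, kernel_size=9, stride=2, padding=4), nn.ReLU())
print(count_params_and_macs(model))
```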
5. CONCLUSION
We have presented a novel e2e CNN architecture, AclNet,
for audio classification. AclNet is a scalable architecture that achieved a state-of-the-art 85.65% accuracy at its high-compute setting, and better-than-human accuracy of 81.75% with only 155k parameters and 49.3 MMACS. To achieve low complexity with high accuracy, AclNet uses depthwise separable convolution blocks. The combination of mixup and data augmentation boosted accuracy by 5%, a major contribution to achieving one of the best results reported on the ESC-50 dataset.

6. REFERENCES
[1] Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis,
Jort F Gemmeke, Aren Jansen, R Channing Moore,
Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Sey-
bold, et al., “CNN architectures for large-scale audio
classification,” in Acoustics, Speech and Signal Process-
ing (ICASSP), 2017 IEEE International Conference on.
IEEE, 2017, pp. 131–135.
[2] Annamaria Mesaros, Toni Heittola, and Tuomas Virta-
nen, “A multi-device dataset for urban acoustic scene
classification,” Submitted to DCASE2018 Workshop,
2018.
[3] Yuma Sakashita and Masaki Aono, “Acoustic scene
classification by ensemble of spectrograms based on
adaptive temporal divisions,” Tech. Rep., DCASE2018
Challenge, September 2018.
[4] Matthias Dorfer, Bernhard Lehner, Hamid Eghbal-zadeh, Christoph Heindl, Fabian Paischer, and Gerhard Widmer, “Acoustic scene classification with fully con-
volutional neural networks and I-vectors,” Tech. Rep.,
DCASE2018 Challenge, September 2018.
[5] Hossein Zeinali, Lukas Burget, and Honza Cernocky,
“Convolutional neural networks and x-vector embed-
ding for DCASE2018 acoustic scene classification chal-
lenge,” Tech. Rep., DCASE2018 Challenge, September
2018.
[6] Karol J Piczak, “ESC: Dataset for environmental sound
classification,” in Proceedings of the 23rd ACM in-
ternational conference on Multimedia. ACM, 2015, pp.
1015–1018.
[7] Hardik B Sailor, Dharmesh M Agrawal, and Hemant A
Patil, “Unsupervised filterbank learning using convolu-
tional restricted boltzmann machine for environmental
sound classification,” Proc. Interspeech 2017, pp. 3107–
3111, 2017.
[8] Yuji Tokozume, Yoshitaka Ushiku, and Tatsuya Harada,
“Learning from between-class examples for deep sound
recognition,” in International Conference on Learning
Representations, 2018.
[9] Anurag Kumar, Maksim Khadkevich, and Christian
Fügen, “Knowledge transfer from weakly labeled au-
dio using convolutional neural network for sound events
and scenes,” in 2018 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2018, pp. 326–330.
[10] Rishabh N Tak, Dharmesh M Agrawal, and Hemant A
Patil, “Novel phase encoded mel filterbank energies
for environmental sound classification,” in International
Conference on Pattern Recognition and Machine Intel-
ligence. Springer, 2017, pp. 317–325.
[11] Michael Deisher and Andrzej Polonski, “Implementation
of efficient, low power deep neural networks on next-
generation intel client platforms,” IEEE SigPort, 2017.
[12] Mircea Horea Ionica and David Gregg, “The movid-
ius myriad architecture’s potential for scientific comput-
ing,” IEEE Micro, vol. 35, no. 1, pp. 6–14, 2015.
[13] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and
David Lopez-Paz, “mixup: Beyond empirical risk min-
imization,” arXiv preprint arXiv:1710.09412, 2017.
[14] Yusuf Aytar, Carl Vondrick, and Antonio Torralba,
“Soundnet: Learning sound representations from unla-
beled video,” in Advances in Neural Information Pro-
cessing Systems, 2016, pp. 892–900.
[15] Wei Dai, Chia Dai, Shuhui Qu, Juncheng Li, and Samar-
jit Das, “Very deep convolutional neural networks for
raw waveforms,” 2017 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP),
pp. 421–425, 2017.
[16] Yuji Tokozume and Tatsuya Harada, “Learning envi-
ronmental sounds with end-to-end convolutional neural
network,” in Acoustics, Speech and Signal Process-
ing (ICASSP), 2017 IEEE International Conference on.
IEEE, 2017, pp. 2721–2725.
[17] Song Han, Huizi Mao, and William J Dally, “Deep com-
pression: Compressing deep neural networks with prun-
ing, trained quantization and huffman coding,” arXiv
preprint arXiv:1510.00149, 2015.
[18] Jian Xue, Jinyu Li, Dong Yu, Mike Seltzer, and Yi-
fan Gong, “Singular value decomposition based low-
footprint speaker adaptation and personalization for
deep neural network,” in Acoustics, Speech and Signal
Processing (ICASSP), 2014 IEEE International Confer-
ence on. IEEE, 2014, pp. 6359–6363.
[19] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry
Kalenichenko, Weijun Wang, Tobias Weyand, Marco
Andreetto, and Hartwig Adam, “Mobilenets: Efficient
convolutional neural networks for mobile vision appli-
cations,” arXiv preprint arXiv:1704.04861, 2017.
[20] Karen Simonyan and Andrew Zisserman, “Very deep
convolutional networks for large-scale image recogni-
tion,” arXiv preprint arXiv:1409.1556, 2014.
[21] Ankit Shah, Anurag Kumar, Alexander G. Hauptmann,
and Bhiksha Raj, “A closer look at weak label learning
for audio events,” CoRR, vol. abs/1804.09288, 2018.