ACLNET: EFFICIENT END-TO-END AUDIO CLASSIFICATION CNN
Jonathan J Huang, Juan Jose Alvarado Leanos
Intel Corporation
ABSTRACT
We propose an efficient end-to-end convolutional neural
network architecture, AclNet, for audio classification. When
trained with our data augmentation and regularization, we
achieved state-of-the-art performance on the ESC-50 corpus
with 85.65% accuracy. Our network allows configurations
such that memory and compute requirements are drastically
reduced, and a tradeoff analysis of accuracy and complexity is
presented. The analysis shows high accuracy at significantly
reduced computational complexity compared to existing solu-
tions. For example, a configuration with only 155k parameters and 49.3 million multiply-adds per second achieves 81.75% accuracy, exceeding the human accuracy of 81.3%. This improved efficiency
can enable always-on inference in energy-efficient platforms.
Index Terms— Convolutional neural networks, end-to-end CNN, environmental sound classification, audio classification
1. INTRODUCTION
Following the successes of image classification, convolutional
neural network (CNN) architectures have become popular-
ized for audio classification. Hershey et al. [1] showed that large image classification CNNs trained on huge amounts of weakly labeled YouTube data learn semantically meaningful representations that form the basis of a powerful classifier. In the
recent DCASE acoustic scene classification task [2], the top
submissions are mostly CNN-based [3], [4], [5]. Likewise,
many of the top results for the ESC-50 corpus [6] use various
forms of CNNs [7], [8], [9], [10].
While prior work on CNN audio classifiers has focused on accuracy for particular tasks, none that we are aware of has treated computational efficiency as a primary objective. The first contribution of this paper is a scalable architecture that, at the high end, achieves one of the best accuracies on ESC-50 and, at the low end, offers the flexibility to scale down to extremely small model sizes. The advantage of a scalable architecture is that it allows
for flexibility in inference platforms with various system con-
straints. The efficiency brought by our architecture allows for
low-power always-on inference on DSPs or dedicated neural
net accelerators [11], [12]. The second contribution is the ap-
plication of mixup data augmentation [13] for audio classifi-
cation, and we show that it is a major contributor to the high accuracy
due to improved generalization.
AclNet is an end-to-end (e2e) CNN architecture that takes a raw time-domain waveform as input, as opposed to the more popular technique of using spectral features such as mel-filterbank energies or mel-frequency cepstral coefficients (MFCC). One advantage of an e2e architecture is that the front end makes no assumptions about the spectral content. Its
feature representation is learned in a data-driven manner, thus
its features are optimized for the task at hand as long as there
is sufficient training data. Another advantage of e2e is that it eliminates spectral feature extraction, which simplifies the software or hardware implementation. Although other
e2e techniques [14], [15], [16], [8] have been studied, our
architecture has a focus on efficiency.
Several results from the optimization of CNNs in the image domain can be borrowed to make audio classification more efficient. Han et al. [17] used pruning, quantization, and Huffman encoding to reduce the complexity of CNNs. Singular value decomposition has been applied to DNNs to reduce model size [18]. AclNet draws its inspiration for efficient computation from MobileNet [19], whose depthwise separable convolutions we employ extensively in this work. With these techniques, human-level accuracy for ESC-
50 was achieved with only 155k parameters and 49.3 million
multiply-adds per second (MMACS).
The remainder of this paper is organized as follows. In
Section 2 we detail the AclNet architecture. Section 3 pro-
vides the details of the data augmentation and model training process. Section 4 presents our findings from the experiments, followed by the conclusion in Section 5.
2. ACLNET ARCHITECTURE
The AclNet architecture consists of two building blocks: the low-level feature (LLF) block shown in Table 1 and the high-level feature (HLF) block shown in Table 2.
2.1. Low-level features
The LLF can be viewed as a replacement for the spectral feature front end, and its two stages of strided 1-D convolutions are equivalent to an FIR decimation filterbank. With the time-domain
waveform as input, the LLF produces an output of 64 channels at a feature frame rate of 10 ms after the maxpool layer. In the example given in Table 1, a 1.28 s input produces an output tensor of dimension (64, 1, 128).
Although the number of parameters in the LLF is invariant to the stride values S1 and S2 and the number of intermediate channels C1, the choice of these values greatly influences the compute complexity and accuracy. Our experiments will show which settings give the most accurate results.
The example in Table 1 is for a 16 kHz sampling rate. The parameters S1 and S2 and all kernel sizes scale linearly with the sampling rate to preserve the same kernel time durations and the 10 ms output frame rate.
Table 1. AclNet low-level features, with 1.28 s of 16 kHz samples. Input dimension (1, 1, 20480).

Layer    | Stride | Out dim              | Out chans | Kernel size
Conv1    | S1     | C1, 1, 20480/S1      | C1        | 9
Conv2    | S2     | 64, 1, 20480/(S1S2)  | 64        | 5
Maxpool1 | 1      | 64, 1, 128           | 64        | 160/(S1S2)
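For concreteness, the LLF of Table 1 can be sketched in PyTorch as follows. The 'same' padding, the omission of batch normalization, and the default (C1, S1, S2) = (8, 2, 2) are illustrative assumptions rather than the exact training configuration.

```python
import torch
import torch.nn as nn

class LowLevelFeatures(nn.Module):
    """Two strided 1-D convolutions followed by max pooling, acting as a
    learned FIR decimation filterbank (Table 1). Sketch for 16 kHz input;
    C1, S1, S2 are configurable, batch normalization is omitted for brevity."""

    def __init__(self, c1=8, s1=2, s2=2):
        super().__init__()
        self.conv1 = nn.Conv1d(1, c1, kernel_size=9, stride=s1, padding=4)
        self.conv2 = nn.Conv1d(c1, 64, kernel_size=5, stride=s2, padding=2)
        # Pool the remaining samples down to one 64-channel frame per 10 ms.
        self.pool = nn.MaxPool1d(kernel_size=160 // (s1 * s2))

    def forward(self, x):
        # x: (batch, 1, 20480) for 1.28 s of 16 kHz audio
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = self.pool(x)       # -> (batch, 64, 128)
        return x.unsqueeze(2)  # -> (batch, 64, 1, 128), matching Table 1

llf = LowLevelFeatures()
print(llf(torch.randn(1, 1, 20480)).shape)  # torch.Size([1, 64, 1, 128])
```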
2.2. High-level features
Continuing from the LLF output, transposing the first two dimensions yields an image-like tensor with dimension (1, 64, 128). The rest of the HLF can thus follow a structure similar to image classification CNNs. We experimented with a number of architectures and found that a VGG-like architecture [20] provides good classification performance with well-understood building blocks. The architecture shown in Table 2 is a modified VGG. Besides changing the depth and channel width, the final layers of the network are also modified. The Conv12 layer is a 1 × 1 convolution that reduces the number of channels to the number of classes, which in the case of ESC-50 is 50. Each of the 50 channels is then average pooled over its 2 × 4 patch, and the result is passed through a softmax. The advantage of these final two layers is that the architecture can accept arbitrary-length inputs without any need to modify the number of hidden units in fully connected layers. An additional benefit of this way of pooling is that it has been shown to be effective for training on weakly labeled datasets [21].
Before the input to the Conv12 layer, we apply a dropout layer. We found a dropout probability of 0.2 to work well on this dataset.
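As an illustration of these final layers, the following PyTorch sketch wires up the dropout, the 1 × 1 Conv12 layer, and the average pooling head. Using AdaptiveAvgPool2d to collapse whatever spatial extent remains is our assumption for realizing the arbitrary-length property, not necessarily the exact original implementation.

```python
import torch
import torch.nn as nn

# Sketch of the final HLF layers: dropout, a 1x1 convolution mapping 512
# channels to the 50 ESC-50 classes, and global average pooling over the
# remaining (frequency, time) extent so that inputs of arbitrary length
# yield one score per class.
head = nn.Sequential(
    nn.Dropout(p=0.2),
    nn.Conv2d(512, 50, kernel_size=1),  # Conv12: (512, 2, 4) -> (50, 2, 4)
    nn.AdaptiveAvgPool2d(1),            # Avgpool1: (50, 2, 4) -> (50, 1, 1)
    nn.Flatten(),                       # -> (batch, 50) class logits
)

x = torch.randn(8, 512, 2, 4)  # Conv11 output for a 1.28 s input
logits = head(x)               # (8, 50); softmax applied in the loss
print(logits.shape)
```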
2.3. Convolutional layers details
Table 2. AclNet high-level features, with input dimension (1, 64, 128) from the LLF.

Layer    | Stride | Out dim      | Out chans | Kernel size
Conv3    | 1      | 32, 64, 128  | 32        | 3 × 3
Maxpool2 | 1      | 32, 32, 64   | 32        | 2 × 2
Conv4    | 1      | 64, 32, 64   | 64        | 3 × 3
Conv5    | 1      | 64, 32, 64   | 64        | 3 × 3
Maxpool3 | 1      | 64, 16, 32   | 64        | 2 × 2
Conv6    | 1      | 128, 16, 32  | 128       | 3 × 3
Conv7    | 1      | 128, 16, 32  | 128       | 3 × 3
Maxpool4 | 1      | 128, 8, 16   | 128       | 2 × 2
Conv8    | 1      | 256, 8, 16   | 256       | 3 × 3
Conv9    | 1      | 256, 8, 16   | 256       | 3 × 3
Maxpool5 | 1      | 256, 4, 8    | 256       | 2 × 2
Conv10   | 1      | 512, 4, 8    | 512       | 3 × 3
Conv11   | 1      | 512, 4, 8    | 512       | 3 × 3
Maxpool6 | 1      | 512, 2, 4    | 512       | 2 × 2
Conv12   | 1      | 50, 2, 4     | 50        | 1 × 1
Avgpool1 | 1      | 50           | 50        | 2 × 4

All convolutional layers shown in Tables 1 and 2, except the first layer of each block (i.e., Conv1 and Conv3), can be configured in one of two forms:
• Standard convolution (SC): the standard building block of a convolution layer, batch normalization, and ReLU activation.
• Depthwise separable convolution (DWSC): the convolution is decomposed into a depthwise convolution followed by a pointwise convolution, each followed by batch normalization and ReLU activation, as in MobileNet [19].
The advantage of DWSC is that it uses significantly fewer parameters and operations than SC, but typically at the cost of some degradation in accuracy. We explore the tradeoff between these two convolution types in our experiments.
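As a minimal PyTorch sketch (not the authors' exact code), the two block forms can be written as follows; hyperparameters not named in the paper are assumptions.

```python
import torch.nn as nn

def standard_conv(in_ch, out_ch, stride=1):
    """Standard convolution (SC) block: conv + batch norm + ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def depthwise_separable_conv(in_ch, out_ch, stride=1):
    """DWSC block: a depthwise 3x3 convolution followed by a pointwise 1x1
    convolution, each with batch norm and ReLU, as in MobileNet."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride, padding=1,
                  groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# Parameter comparison for a 256 -> 256 layer (ignoring batch norm):
#   SC:   3*3*256*256       = 589,824
#   DWSC: 3*3*256 + 256*256 =  67,840   (about 8.7x fewer)
```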
2.4. Width multiplier
As in MobileNet, our architecture also has a width multiplier
(WM) to control the complexity of the network. The WM
linearly scales the number of output channels from Conv3 to
Conv11. This parameter is an easy way to manage the ca-
pacity of the network, and our experiments will explore its
accuracy impact on the ESC-50 corpus.
3. EXPERIMENTAL METHODS
3.1. Dataset
We used ESC-50 to train and evaluate the models. ESC-50
contains a total of 2000 examples of environmental sounds arranged in 50 classes. We use the default five folds provided with the dataset for cross-validation in our performance evaluation. All sound files were converted to 16-bit, at 16 kHz and 44.1 kHz sampling rates, for two different sets of experiments. We removed the silent sections at the beginning and end of each recording.
3.2. Data augmentation
We experimented with different input lengths when training AclNet on ESC-50 and a proprietary dataset. Empirically, we found that inputs between 1 and 2 seconds gave the best results, so for the rest of the experiments we chose a 1.5 s input length. In the data loader of the training process, we use the following real-time data augmentation to generate each training example:
1. Choose a random 2 s segment of audio within a training file
2. Center the waveform to zero mean and normalize by the standard deviation
3. Resample the waveform by a random factor chosen uniformly in the range [0.8, 1.25]
4. Crop to exactly 1.5 s
5. Multiply the waveform by a random gain chosen uniformly in the range [−6.0, +6.0] dB
At test time, only the normalization step is applied, and the entire wave file is input to the network.
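A minimal NumPy sketch of this augmentation pipeline is given below. The random ranges follow the list above; the linear-interpolation resampler, the crop-from-the-start policy, and the zero-padding fallback are our simplifications rather than the exact implementation.

```python
import numpy as np

def augment(waveform, sample_rate=44100, clip_seconds=1.5):
    """Sketch of the real-time augmentation; assumes the file is >= 2 s long."""
    rng = np.random.default_rng()

    # 1. Choose a random 2 s segment within the training file.
    seg_len = int(2.0 * sample_rate)
    start = rng.integers(0, max(1, len(waveform) - seg_len))
    x = waveform[start:start + seg_len].astype(np.float32)

    # 2. Center to zero mean and normalize by the standard deviation.
    x = (x - x.mean()) / (x.std() + 1e-8)

    # 3. Resample by a random factor in [0.8, 1.25]
    #    (linear interpolation, interpreted here as a time-scale change).
    factor = rng.uniform(0.8, 1.25)
    new_len = int(len(x) * factor)
    x = np.interp(np.linspace(0, len(x) - 1, new_len), np.arange(len(x)), x)

    # 4. Crop (or pad, as a safety fallback) to exactly 1.5 s.
    out_len = int(clip_seconds * sample_rate)
    x = x[:out_len] if len(x) >= out_len else np.pad(x, (0, out_len - len(x)))

    # 5. Apply a random gain in [-6, +6] dB.
    gain_db = rng.uniform(-6.0, 6.0)
    return x * (10.0 ** (gain_db / 20.0))
```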
3.3. Mixup training
Mixup [13] is a recent technique to improve generalization
by increasing the support of the training distribution. In this
technique, a neighborhood is defined around each example
in the training data by constructing virtual training examples,
that is, pairs of virtual samples and virtual targets $(\tilde{x}, \tilde{y})$. Given two training examples $(x_i, y_i)$ and $(x_j, y_j)$, the new virtual pair is computed as:

$\tilde{x} = \lambda x_i + (1 - \lambda) x_j$   (1)
$\tilde{y} = \lambda y_i + (1 - \lambda) y_j$   (2)

where $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$. The hyperparameter α controls
the amount that is mixed in from the second example. Higher
values of α make the virtual pairs less similar to the origi-
nal unmixed training examples. We experimented with values
from 0.1 to 0.5.
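A sketch of how Eqs. (1) and (2) can be applied to a training batch is shown below; mixing each example with a randomly permuted copy of the batch is a common implementation choice and an assumption on our part.

```python
import numpy as np
import torch
import torch.nn.functional as F

def mixup_batch(x, y, alpha=0.1, num_classes=50):
    """Mix each example with a permuted partner using lambda ~ Beta(alpha, alpha),
    per Eqs. (1) and (2). x: (batch, ...) inputs, y: (batch,) integer labels."""
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(x.size(0))
    y_onehot = F.one_hot(y, num_classes).float()
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix

# Training then uses a soft-target loss, e.g.:
# loss = -(y_mix * F.log_softmax(model(x_mix), dim=1)).sum(dim=1).mean()
```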
3.4. Learning Settings
For all experiments, we used the stochastic gradient descent optimizer with momentum of 0.9, weight decay of 2e-4, and a batch size of 64. We trained the model with a learning rate schedule of three phases: 0.2 for the first 500 epochs, 0.04 for the next 1000, and 0.016 for the last 500, for a total of 2000 epochs per fold. Also, for the first 100 epochs we disabled the mixup procedure as a form of warm-up to improve initial convergence.

Fig. 1. Comparison of the effects of data augmentation and mixup on validation accuracy.
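These settings can be written down as the following PyTorch sketch; the placeholder model and the use of LambdaLR to realize the three-phase schedule are our assumptions, not necessarily the authors' exact code.

```python
import torch

model = torch.nn.Linear(10, 50)  # placeholder; substitute an AclNet instance
optimizer = torch.optim.SGD(model.parameters(), lr=0.2,
                            momentum=0.9, weight_decay=2e-4)

# Three-phase schedule: lr = 0.2 for epochs 0-499, 0.04 for 500-1499,
# and 0.016 for 1500-1999 (multipliers relative to the base lr of 0.2).
def lr_lambda(epoch):
    if epoch < 500:
        return 1.0
    if epoch < 1500:
        return 0.2
    return 0.08

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(2000):
    use_mixup = epoch >= 100  # mixup disabled during the 100 warm-up epochs
    # ... train one epoch with batch size 64, applying mixup if use_mixup ...
    scheduler.step()
```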
4. RESULTS AND ANALYSIS
4.1. Data augmentation and mixup
Several experiments were done to assess the effectiveness of
augmentation and mixup. Figure 1 shows the validation accu-
racy over the course of the training process for various com-
binations of augmentation as explained in Section 3. All ex-
periments were done using a WM of 1.0, and SC. We see an
obvious improvement with each individual augmentation, and
that mixup by itself is more effective than the other form of
augmentation. The best result was achieved when augmen-
tation was combined with mixup, which had an absolute im-
provement of more than 5% above the baseline without any
augmentation. We note that mixup is conceptually similar to between-class learning, which was also shown to work well for ESC-50 [8].
We also experimented with the choice of α in mixup and found that values between 0.1 and 0.2 worked well for the larger architectures; thus, for the remainder of the experiments, we default to this combined augmentation with mixup α = 0.1.
4.2. Low-level feature parameters
In EnvNet [16], analysis showed that two convolutions of kernel size 8 worked best for this dataset. Our experiments confirmed that two convolutions are optimal, but we also found that slightly reducing the kernel size of the second convolution had no impact on accuracy. Our best setting uses kernel sizes of 9 and 5 for the first two convolutions.
In order to determine the choice of other LLF parameters,
we did a grid search of the parameter space over these ranges:
C1 ∈ {8, 16, 32}, S1 ∈ {2, 4, 8}, S2 ∈ {2, 4}.

Table 3. ESC-50 5-fold accuracies with AclNet at select configurations.

Sampling rate | Conv type | LLF params (k) | LLF MMACS | HLF params (k) | HLF MMACS | Total params (k) | Total MMACS | Width multiplier | Accuracy (%)
16 kHz   | DWSC | 1.44 | 4.35  | 13.91   | 2.93   | 15.35   | 7.28   | 0.125 | 75.38
16 kHz   | DWSC | 1.44 | 4.35  | 153.43  | 31.07  | 154.87  | 35.42  | 0.5   | 80.40
16 kHz   | DWSC | 1.44 | 4.35  | 567.92  | 113.7  | 569.4   | 118.1  | 1.0   | 80.90
44.1 kHz | DWSC | 1.81 | 17.98 | 13.91   | 2.96   | 15.72   | 20.94  | 0.125 | 75.50
44.1 kHz | DWSC | 1.81 | 17.98 | 153.43  | 31.33  | 155.23  | 49.31  | 0.5   | 81.75
44.1 kHz | DWSC | 1.81 | 17.98 | 567.92  | 114.6  | 569.73  | 132.59 | 1.0   | 83.10
44.1 kHz | SC   | 6.99 | 80.9  | 77.21   | 8.88   | 84.21   | 131.17 | 0.125 | 82.30
44.1 kHz | SC   | 6.99 | 80.9  | 1190.0  | 132.72 | 1197.0  | 255.01 | 0.5   | 83.95
44.1 kHz | SC   | 6.99 | 80.9  | 4730.0  | 524.67 | 4737.0  | 646.97 | 1.0   | 85.0
44.1 kHz | SC   | 6.99 | 80.9  | 10620   | 786.56 | 10627   | 867.45 | 1.5   | 85.65
We trained AclNet using both the SC and DWSC settings with a width multiplier of 1.0, and found that the values (C1, S1, S2) = (8, 2, 2) for SC and (16, 2, 4) for DWSC gave the best accuracy. For the remainder of the experiments, we default to these best settings for SC and DWSC. The experiments showed about a 3% accuracy difference between the best and worst parameter choices for each setting. The best result in both cases was not the highest-complexity configuration, (32, 2, 2). We suspect the heavier LLF settings might be overfitting, and that with more training data we could reach a different conclusion.
4.3. Complexity versus accuracy
To understand the tradeoff between complexity and accuracy,
we ran three sets of experiments using 1) 16kHz input with
DWSC, 2) 44.1kHz input with DWSC, and 3) 44.1kHz input
with SC. For each set, we did the 5-fold validation with WM
configured at 1/32, 1/16, 0.125, 0.25, 0.5, 0.75, 1.0, 1.5, and
2.0. Figure 2 shows the accuracy versus MMACS for each of
the settings, color-coded by sets. For each of these settings,
increasing complexity generally led to better accuracy. The
exception is at the highest WM, where it is possible that we
hit diminishing returns from higher capacity. In all cases, the drop in accuracy steepens for WM below 0.25. Another observa-
tion is that for the same HLF settings, 44.1kHz sampling rate
improves accuracy by around 2%.
Table 3 shows a subset of these experiments, with details
of LLF, HLF, overall complexity and accuracy. Our best ac-
curacy of 85.65% was achieved with 44.1kHz sampling rate,
SC, and 1.5 WM. At the time of this writing, this is the best
single system accuracy reported for ESC-50 (second overall
behind an ensemble system [7]).

Fig. 2. Accuracy vs. million multiply-adds per second.

With the DWSC models, we can see that the total parameters and MMACS are significantly lower than for SC at the same WM. The 44.1 kHz, DWSC, 0.5 WM configuration achieves 81.75%, which exceeds the human accuracy of 81.3% [6], with only 155k parameters and 49.31 MMACS. We note that human accuracy is also exceeded with SC at a WM of 0.125, a model with a modest 84k parameters and 131.17 MMACS. As a comparison of complexity, EnvNetV2 [8], which at the time of this writing has the best single-model accuracy of 84.9%, uses 101M parameters and 1033 MMACS. Our best model, with an accuracy of 85.65%, has about 1/10 the parameters and 16% fewer operations.
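As a rough guide to how parameter and MAC counts of the kind reported in Table 3 can be obtained, the sketch below counts trainable parameters and accumulates convolution multiply-adds with forward hooks. The input shape (1.5 s at 44.1 kHz), the stand-in model, and the per-layer MAC formula are our assumptions; the paper's MMACS figures additionally normalize per second of audio.

```python
import torch
import torch.nn as nn

def count_params_and_macs(model, input_shape=(1, 1, 66150)):
    """Count trainable parameters and convolution multiply-adds (MACs)
    for one forward pass; a sketch, not the paper's exact accounting."""
    params = sum(p.numel() for p in model.parameters() if p.requires_grad)

    macs = 0
    def hook(module, inputs, output):
        nonlocal macs
        kernel_volume = 1
        for k in module.kernel_size:
            kernel_volume *= k
        # Each output element needs kernel_volume * (in_channels / groups) MACs.
        macs += output.numel() * kernel_volume * module.in_channels // module.groups

    handles = [m.register_forward_hook(hook) for m in model.modules()
               if isinstance(m, (nn.Conv1d, nn.Conv2d))]
    with torch.no_grad():
        model(torch.randn(*input_shape))
    for h in handles:
        h.remove()
    return params, macs

# Example with a stand-in model (substitute an AclNet instance):
model = nn.Sequential(nn.Conv1d(1, 64, kernel_size=9, stride=2, padding=4), nn.ReLU())
print(count_params_and_macs(model))
```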
5. CONCLUSION
We have presented a novel e2e CNN architecture, AclNet,
for audio classification. AclNet is a scalable architecture that achieved a state-of-the-art 85.65% accuracy at its high-compute setting, and better-than-human accuracy of 81.75% with only 155k parameters and 49.3 MMACS. To achieve low complexity with high accuracy, AclNet uses depthwise separable convolution blocks. The combination of mixup and data augmentation boosted accuracy by 5%, a major contribution to achieving one of the best results reported on the ESC-50 dataset.

6. REFERENCES
[1] Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis,
Jort F Gemmeke, Aren Jansen, R Channing Moore,
Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Sey-
bold, et al., “CNN architectures for large-scale audio
classification,” in Acoustics, Speech and Signal Process-
ing (ICASSP), 2017 IEEE International Conference on.
IEEE, 2017, pp. 131–135.
[2] Annamaria Mesaros, Toni Heittola, and Tuomas Virta-
nen, “A multi-device dataset for urban acoustic scene
classification,” Submitted to DCASE2018 Workshop,
2018.
[3] Yuma Sakashita and Masaki Aono, “Acoustic scene
classification by ensemble of spectrograms based on
adaptive temporal divisions,” Tech. Rep., DCASE2018
Challenge, September 2018.
[4] Matthias Dorfer, Bernhard Lehner, Hamid Eghbal-zadeh, Christoph Heindl, Fabian Paischer, and Gerhard Widmer, “Acoustic scene classification with fully con-
volutional neural networks and I-vectors,” Tech. Rep.,
DCASE2018 Challenge, September 2018.
[5] Hossein Zeinali, Lukas Burget, and Honza Cernocky,
“Convolutional neural networks and x-vector embed-
ding for DCASE2018 acoustic scene classification chal-
lenge,” Tech. Rep., DCASE2018 Challenge, September
2018.
[6] Karol J Piczak, “ESC: Dataset for environmental sound
classification,” in Proceedings of the 23rd ACM in-
ternational conference on Multimedia. ACM, 2015, pp.
1015–1018.
[7] Hardik B Sailor, Dharmesh M Agrawal, and Hemant A
Patil, “Unsupervised filterbank learning using convolu-
tional restricted boltzmann machine for environmental
sound classification,” Proc. Interspeech 2017, pp. 3107–
3111, 2017.
[8] Yuji Tokozume, Yoshitaka Ushiku, and Tatsuya Harada,
“Learning from between-class examples for deep sound
recognition,” in International Conference on Learning
Representations, 2018.
[9] Anurag Kumar, Maksim Khadkevich, and Christian
Fügen, “Knowledge transfer from weakly labeled au-
dio using convolutional neural network for sound events
and scenes,” in 2018 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2018, pp. 326–330.
[10] Rishabh N Tak, Dharmesh M Agrawal, and Hemant A
Patil, “Novel phase encoded mel filterbank energies
for environmental sound classification,” in International
Conference on Pattern Recognition and Machine Intel-
ligence. Springer, 2017, pp. 317–325.
[11] Michael Deisher and Andrzej Polonski, “Implementation
of efficient, low power deep neural networks on next-
generation intel client platforms,” IEEE SigPort, 2017.
[12] Mircea Horea Ionica and David Gregg, “The movid-
ius myriad architecture’s potential for scientific comput-
ing,” IEEE Micro, vol. 35, no. 1, pp. 6–14, 2015.
[13] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and
David Lopez-Paz, “mixup: Beyond empirical risk min-
imization,” arXiv preprint arXiv:1710.09412, 2017.
[14] Yusuf Aytar, Carl Vondrick, and Antonio Torralba,
“Soundnet: Learning sound representations from unla-
beled video,” in Advances in Neural Information Pro-
cessing Systems, 2016, pp. 892–900.
[15] Wei Dai, Chia Dai, Shuhui Qu, Juncheng Li, and Samar-
jit Das, “Very deep convolutional neural networks for
raw waveforms,” 2017 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP),
pp. 421–425, 2017.
[16] Yuji Tokozume and Tatsuya Harada, “Learning envi-
ronmental sounds with end-to-end convolutional neural
network,” in Acoustics, Speech and Signal Process-
ing (ICASSP), 2017 IEEE International Conference on.
IEEE, 2017, pp. 2721–2725.
[17] Song Han, Huizi Mao, and William J Dally, “Deep com-
pression: Compressing deep neural networks with prun-
ing, trained quantization and huffman coding,” arXiv
preprint arXiv:1510.00149, 2015.
[18] Jian Xue, Jinyu Li, Dong Yu, Mike Seltzer, and Yi-
fan Gong, “Singular value decomposition based low-
footprint speaker adaptation and personalization for
deep neural network,” in Acoustics, Speech and Signal
Processing (ICASSP), 2014 IEEE International Confer-
ence on. IEEE, 2014, pp. 6359–6363.
[19] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry
Kalenichenko, Weijun Wang, Tobias Weyand, Marco
Andreetto, and Hartwig Adam, “Mobilenets: Efficient
convolutional neural networks for mobile vision appli-
cations,” arXiv preprint arXiv:1704.04861, 2017.
[20] Karen Simonyan and Andrew Zisserman, “Very deep
convolutional networks for large-scale image recogni-
tion,” arXiv preprint arXiv:1409.1556, 2014.
[21] Ankit Shah, Anurag Kumar, Alexander G. Hauptmann,
and Bhiksha Raj, “A closer look at weak label learning
for audio events,” CoRR, vol. abs/1804.09288, 2018.