[1] F. Eyben, F. Weninger, F. Gross, and B. Schuller, “Recent developments in openSMILE, the Munich open-source multimedia feature extractor,” in Multimedia, 2013.
[2] B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer, F. Ringeval, M. Chetouani, F. Weninger, F. Eyben, E. Marchi, M. Mortillaro, H. Salamin, A. Polychroniou, F. Valente, and S. K. Kim, “The Interspeech 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism,” in Interspeech, 2013.
[3] N. Jaitly and G. Hinton, “Learning a better representation of speech soundwaves using restricted Boltzmann machines,” in ICASSP, 2011.
[4] S. Dieleman and B. Schrauwen, “End-to-end learning for music audio,” in ICASSP, 2014.
[5] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, and S. Zafeiriou, “Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network,” in ICASSP, 2016.
[6] Y. LeCun and Y. Bengio, “Convolutional networks for images, speech, and time series,” The Handbook of Brain Theory and Neural Networks, vol. 3361, 1995.
[7] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, “PANNs: Large-scale pretrained audio neural networks for audio pattern recognition,” IEEE/ACM TASLP, vol. 28, pp. 2880–2894, 2020.
[8] Y. Gong, Y.-A. Chung, and J. Glass, “PSLA: Improving audio event classification with pretraining, sampling, labeling, and aggregation,” arXiv preprint arXiv:2102.01243, 2021.
[9] O. Rybakov, N. Kononenko, N. Subrahmanya, M. Visontai, and S. Laurenzo, “Streaming keyword spotting on mobile devices,” in Interspeech, 2020.
[10] P. Li, Y. Song, I. V. McLoughlin, W. Guo, and L.-R. Dai, “An attention pooling based representation learning method for speech emotion recognition,” in Interspeech, 2018.
[11] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in ICLR, 2021.
[12] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” arXiv preprint arXiv:2012.12877, 2020.
[13] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, F. E. Tay, J. Feng, and S. Yan, “Tokens-to-token ViT: Training vision transformers from scratch on ImageNet,” arXiv preprint arXiv:2101.11986, 2021.
[14] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in CVPR, 2009.
[15] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio Set: An ontology and human-labeled dataset for audio events,” in ICASSP, 2017.
[16] K. J. Piczak, “ESC: Dataset for environmental sound classification,” in Multimedia, 2015.
[17] P. Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,” arXiv preprint arXiv:1804.03209, 2018.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017.
[19] K. Miyazaki, T. Komatsu, T. Hayashi, S. Watanabe, T. Toda, and K. Takeda, “Convolution augmented transformer for semi-supervised sound event detection,” in DCASE, 2020.
[20] Q. Kong, Y. Xu, W. Wang, and M. D. Plumbley, “Sound event detection of weakly labelled data with CNN-transformer and automatic threshold optimization,” IEEE/ACM TASLP, vol. 28, pp. 2450–2460, 2020.
[21] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer: Convolution-augmented transformer for speech recognition,” in Interspeech, 2020.
[22] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in NAACL-HLT, 2019.
[23] G. Gwardys and D. M. Grzywczak, “Deep image features in music information retrieval,” IJET, vol. 60, no. 4, pp. 321–326, 2014.
[24] A. Guzhov, F. Raue, J. Hees, and A. Dengel, “ESResNet: Environmental sound classification based on visual domain models,” in ICPR, 2020.
[25] K. Palanisamy, D. Singhania, and A. Yao, “Rethinking CNN models for audio classification,” arXiv preprint arXiv:2007.11154, 2020.
[26] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
[27] M. Tan and Q. V. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in ICML, 2019.
[28] Y. Tokozume, Y. Ushiku, and T. Harada, “Learning from between-class examples for deep sound recognition,” in ICLR, 2018.
[29] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Interspeech, 2019.
[30] P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson, “Averaging weights leads to wider optima and better generalization,” in UAI, 2018.
[31] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[32] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
[33] H. B. Sailor, D. M. Agrawal, and H. A. Patil, “Unsupervised filterbank learning using convolutional restricted Boltzmann machine for environmental sound classification,” in Interspeech, 2017.
[34] S. Majumdar and B. Ginsburg, “MatchboxNet: 1D time-channel separable convolutional neural network architecture for speech commands recognition,” arXiv preprint arXiv:2004.08531, 2020.
[35] J. Lin, K. Kilgour, D. Roblek, and M. Sharifi, “Training keyword spotters with limited and synthesized speech data,” in ICASSP, 2020.