Published as a conference paper at ICLR 2021
Toni Heittola, Annamaria Mesaros, and Tuomas Virtanen. TUT Urban Acoustic Scenes 2018, Development dataset, 2018.
Shawn Hershey, Sourish Chaudhuri, Daniel P.W. Ellis, Jort F Gemmeke, Aren Jansen, R Channing
Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, Malcolm Slaney, Ron J Weiss,
and Kevin Wilson. CNN architectures for large-scale audio classification. In ICASSP, IEEE
International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 131–
135, 2017. ISBN 9781509041176. doi: 10.1109/ICASSP.2017.7952132.
Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly,
Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks
for acoustic modeling in speech recognition: The shared views of four research groups. IEEE
Signal processing magazine, 29(6):82–97, 2012.
Yedid Hoshen, Ron Weiss, and Kevin W Wilson. Speech acoustic modeling from raw multichannel
waveforms. In International Conference on Acoustics, Speech, and Signal Processing, 2015.
Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected
convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pp. 4700–4708, 2017.
Navdeep Jaitly and Geoffrey Hinton. Learning a better representation of speech soundwaves using
restricted Boltzmann machines. In 2011 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), pp. 5884–5887. IEEE, 2011.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
Qiuqiang Kong, Yin Cao, T. Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D. Plumbley. PANNs:
Large-scale pretrained audio neural networks for audio pattern recognition. arXiv preprint
arXiv:1912.10211, 2019.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional
neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105,
2012.
V. Lostanlen, J. Salamon, M. Cartwright, B. McFee, A. Farnsworth, S. Kelling, and J. P. Bello.
Per-channel energy normalization: Why and how. IEEE Signal Processing Letters, 26(1):39–43,
2019.
Vincent Lostanlen, Justin Salamon, Andrew Farnsworth, Steve Kelling, and Juan Pablo Bello.
BirdVox-full-night: A dataset and benchmark for avian flight call detection. In 2018 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 266–270.
IEEE, 2018.
Yi Luo and Nima Mesgarani. Conv-TasNet: Surpassing ideal time-frequency magnitude masking for
speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8):
1256–1266, 2019.
Yi Luo, Zhuo Chen, and Takuya Yoshioka. Dual-path RNN: Efficient long sequence modeling for
time-domain single-channel speech separation. arXiv preprint arXiv:1910.06379, 2019.
James G Lyons and Kuldip K Paliwal. Effect of compressing the dynamic range of the power
spectrum in modulation filtering based speech enhancement. In Ninth Annual Conference of the
International Speech Communication Association, 2008.
Nelson Morgan, Hervé Bourlard, and Hynek Hermansky. Automatic speech recognition: An auditory
perspective. In Speech Processing in the Auditory System, pp. 309–338. Springer, 2004.
Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. VoxCeleb: A large-scale speaker identification
dataset. Interspeech 2017, Aug 2017. doi: 10.21437/interspeech.2017-950. URL