[1] Sharath Adavanne and Tuomas Virtanen. Sound event detection using weakly labeled dataset
with stacked convolutional and recurrent neural network. In DCASE Workshop, 2017.
[2] Sharath Adavanne, Konstantinos Drossos, Emre �akır, and Tuomas Virtanen. Stacked convolu-
tional and recurrent neural networks for bird audio detection. In EUSIPCO, 2017.
[3] Miroslav Malik, Sharath Adavanne, Konstantinos Drossos, Tuomas Virtanen, Dasa Ticha,
and Roman Jarina. Stacked convolutional and recurrent neural networks for music emotion
recognition. In Sound and Music Computing Conference (SMC), 2017.
[4] Jordi Pons, Thomas Lidy, and Xavier Serra. Experimenting with musically motivated convolu-
tional neural networks. In Content-Based Multimedia Indexing (CBMI) Workshop, pages 1–6,
2016.
[5] Sander Dieleman and Benjamin Schrauwen. End-to-end learning for music audio. In IEEE
ICASSP, pages 6964–6968, 2014.
[6] Tara N Sainath, Ron J Weiss, Andrew Senior, Kevin W Wilson, and Oriol Vinyals. Learning the speech front-end with raw waveform CLDNNs. In INTERSPEECH, 2015.
[7] Wei Dai, Chia Dai, Shuhui Qu, Juncheng Li, and Samarjit Das. Very deep convolutional neural
networks for raw waveforms. In IEEE ICASSP, pages 421–425, 2017.
[8] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale
image recognition. ICLR, 2015.
[9] Jongpil Lee, Jiyoung Park, Keunhyoung Luke Kim, and Juhan Nam. Sample-level deep
convolutional neural networks for music auto-tagging using raw waveforms. In Sound and
Music Computing Conference (SMC), 2017.
[10] Jongpil Lee, Jiyoung Park, Sangeun Kum, Youngho Jeong, and Juhan Nam. Combining multi-
scale features using sample-level deep convolutional neural networks for weakly supervised
sound event detection. In DCASE Workshop, 2017.
[11] Taejun Kim, Jongpil Lee, and Juhan Nam. Sample-level cnn architectures for music auto-tagging
using raw waveforms. arXiv preprint arXiv:1710.10451, 2017.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In IEEE CVPR, pages 770–778, 2016.
[13] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
[14] Jongpil Lee and Juhan Nam. Multi-level and multi-scale feature aggregation using pre-
trained convolutional neural networks for music auto-tagging. IEEE Signal Processing Letters,
24(8):1208–1212, 2017.
[15] Edith Law, Kris West, Michael I Mandel, Mert Bay, and J Stephen Downie. Evaluation of
algorithms using games: The case of music tagging. In ISMIR, pages 387–392, 2009.
[16] Pete Warden. Speech commands: A public dataset for single-word speech recognition.
[17] TensorFlow Speech Recognition Challenge.
[18] Annamaria Mesaros, Toni Heittola, Aleksandr Diment, Benjamin Elizalde, Ankit Shah, Emmanuel Vincent, Bhiksha Raj, and Tuomas Virtanen. DCASE 2017 challenge setup: tasks, datasets and baseline system. In DCASE Workshop, 2017.
[19] Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In IEEE ICASSP, 2017.
[20] Yong Xu, Qiuqiang Kong, Wenwu Wang, and Mark D Plumbley. Surrey-CVSSP system for DCASE2017 challenge task4. DCASE Tech. Rep., 2017.
[21] Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer
features of a deep network. University of Montreal, 1341:3, 2009.