6. REFERENCES
[1] P. Verma and J. Smith, “A framework for contrastive and
generative learning of audio representations,” arXiv preprint
arXiv:2010.11459, 2020.
[2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning
for image recognition,” in Proceedings of the IEEE conference
on computer vision and pattern recognition, 2016, pp. 770–
778.
[3] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen,
W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio
set: An ontology and human-labeled dataset for audio events,”
in 2017 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780.
[4] A. Madani, B. McCann, N. Naik, N. S. Keskar, N. Anand, R. R. Eguchi, P.-S. Huang, and R. Socher, “Progen: Language modeling for protein generation,” arXiv preprint arXiv:2004.03497, 2020.
[5] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, 2020.
[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert:
Pre-training of deep bidirectional transformers for language
understanding,” arXiv preprint arXiv:1810.04805, 2018.
[7] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, I. Simon, C. Hawthorne, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck, “Music transformer,” arXiv preprint arXiv:1809.04281, 2018.
[8] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, “Videobert: A joint model for video and language representation learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7464–7473.
[9] R. Girdhar, J. Carreira, C. Doersch, and A. Zisserman, “Video
action transformer network,” in Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition,
2019, pp. 244–253.
[10] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn,
X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer,
G. Heigold, S. Gelly, et al., “An image is worth 16x16 words:
Transformers for image recognition at scale,” arXiv preprint
arXiv:2010.11929, 2020.
[11] N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer,
A. Ku, and D. Tran, “Image transformer,” in International
Conference on Machine Learning. PMLR, 2018, pp. 4055–
4064.
[12] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and
I. Sutskever, “Jukebox: A generative model for music,” arXiv
preprint arXiv:2005.00341, 2020.
[13] P. Verma and J. O. Smith, “Neural style transfer for audio
spectograms,” arXiv preprint arXiv:1801.01589, 2018.
[14] P. Verma, C. Chafe, and J. Berger, “Neuralogram: A deep
neural network based representation for audio signals,” arXiv
preprint arXiv:1904.05073, 2019.
[15] A. v. d. Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” arXiv preprint arXiv:1711.00937, 2017.
[16] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” arXiv preprint arXiv:2006.11477, 2020.
[17] A. Baevski, S. Schneider, and M. Auli, “vq-wav2vec: Self-
supervised learning of discrete speech representations,” arXiv
preprint arXiv:1910.05453, 2019.
[18] A. Haque, M. Guo, and P. Verma, “Conditional end-to-end
audio transforms,” arXiv preprint arXiv:1804.00047, 2018.
[19] A. Haque, M. Guo, P. Verma, and L. Fei-Fei, “Audio-linguistic embeddings for spoken sentences,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 7355–7359.
[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones,
A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all
you need,” arXiv preprint arXiv:1706.03762, 2017.
[21] E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra,
“Fsd50k: an open dataset of human-labeled sound events,”
arXiv preprint arXiv:2010.00475, 2020.
[22] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, et al., “Scipy 1.0: fundamental algorithms for scientific computing in Python,” Nature Methods, vol. 17, no. 3, pp. 261–272, 2020.
[23] G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush, “Opennmt: Open-source toolkit for neural machine translation,” in Proc. ACL, 2017.
[24] J. Berger, R. R. Coifman, and M. J. Goldberg, “Removing
noise from music using local trigonometric bases and wavelet
packets,” Journal of the Audio Engineering Society, vol. 42,
no. 10, pp. 808–818, 1994.
[25] P. Verma and R. W. Schafer, “Frequency estimation from waveforms using multi-layered neural networks,” in INTERSPEECH, 2016, pp. 2165–2169.
[26] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009, pp. 248–255.
[27] A. Tamkin, D. Jurafsky, and N. Goodman, “Language through a prism: A spectral approach for multiscale language representations,” Advances in Neural Information Processing Systems, vol. 33, 2020.
[28] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., “Tensorflow: A system for large-scale machine learning,” in 12th USENIX symposium on operating systems design and implementation (OSDI 16), 2016, pp. 265–283.
[29] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[30] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” arXiv preprint arXiv:2101.03961, 2021.
[31] R. Child, S. Gray, A. Radford, and I. Sutskever, “Generating long sequences with sparse transformers,” arXiv preprint arXiv:1904.10509, 2019.