AST: Audio Spectrogram Transformer
Yuan Gong, Yu-An Chung, James Glass
MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, USA
{yuangong, andyyuan, glass}@mit.edu
Abstract
In the past decade, convolutional neural networks (CNNs)
have been widely adopted as the main building block for end-
to-end audio classification models, which aim to learn a direct
mapping from audio spectrograms to corresponding labels. To
better capture long-range global context, a recent trend is to
add a self-attention mechanism on top of the CNN, forming a
CNN-attention hybrid model. However, it is unclear whether
the reliance on a CNN is necessary, and if neural networks
purely based on attention are sufficient to obtain good perfor-
mance in audio classification. In this paper, we answer the ques-
tion by introducing the Audio Spectrogram Transformer (AST),
the first convolution-free, purely attention-based model for au-
dio classification. We evaluate AST on various audio classifi-
cation benchmarks, where it achieves new state-of-the-art re-
sults of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50,
and 98.1% accuracy on Speech Commands V2.
Index Terms: audio classification, self-attention, Transformer
1. Introduction
With the advent of deep neural networks, over the last decade
audio classification research has moved from models based
on hand-crafted features [1, 2] to end-to-end models that di-
rectly map audio spectrograms to corresponding labels [3, 4, 5].
Specifically, convolutional neural networks (CNNs) [6] have
been widely used to learn representations from raw spectro-
grams for end-to-end modeling, as the inductive biases inherent
to CNNs such as spatial locality and translation equivariance
are believed to be helpful. In order to better capture long-range
global context, a recent trend is to add a self-attention mech-
anism on top of the CNN. Such CNN-attention hybrid mod-
els have achieved state-of-the-art (SOTA) results for many au-
dio classification tasks such as audio event classification [7, 8],
speech command recognition [9], and emotion recognition [10].
However, motivated by the success of purely attention-based
models in the vision domain [11, 12, 13], it is reasonable to ask
whether a CNN is still essential for audio classification.
To answer the question, we introduce the Audio Spectro-
gram Transformer (AST), a convolution-free, purely attention-
based model that is directly applied to an audio spectrogram
and can capture long-range global context even in the lowest
layers. Additionally, we propose an approach for transferring
knowledge from the Vision Transformer (ViT) [12] pretrained
on ImageNet [14] to AST, which can significantly improve the
performance. The advantages of AST are threefold. First, AST
has superior performance: we evaluate AST on a variety of au-
dio classification tasks and datasets including AudioSet [15],
ESC-50 [16] and Speech Commands [17]. AST outperforms
state-of-the-art systems on all these datasets. Second, AST nat-
urally supports variable-length inputs and can be applied to dif-
ferent tasks without any change of architecture (code at
https://github.com/YuanGongND/ast).
Figure 1: The proposed audio spectrogram transformer (AST)
architecture. The 2D audio spectrogram is split into a sequence
of 16×16 patches with overlap, and then linearly projected to
a sequence of 1-D patch embeddings. Each patch embedding
is added with a learnable positional embedding. An additional
classification token is prepended to the sequence. The output
embedding is input to a Transformer, and the output of the clas-
sification token is used for classification with a linear layer.
Specifically, the models we use for all aforementioned tasks have the same archi-
tecture while the input lengths vary from 1 sec. (Speech Com-
mands) to 10 sec. (AudioSet). In contrast, CNN-based models
typically require architecture tuning to obtain optimal perfor-
mance for different tasks. Third, compared with SOTA CNN-
attention hybrid models, AST features a simpler architecture
with fewer parameters, and converges faster during training. To
the best of our knowledge, AST is the first purely attention-
based audio classification model.
Related Work The proposed Audio Spectrogram Trans-
former, as the name suggests, is based on the Transformer ar-
chitecture [18], which was originally proposed for natural lan-
guage processing tasks. Recently, the Transformer has also
been adapted for audio processing, but is typically used in
conjunction with a CNN [19, 20, 21]. In [19, 20], the au-
thors stack a Transformer on top of a CNN, while in [21],
the authors combine a Transformer and a CNN in each model
block. Other efforts combine CNNs with simpler attention
modules [8, 7, 9]. The proposed AST differs from these stud-
ies in that it is convolution-free and purely based on attention
mechanisms. The closest work to ours is the Vision Trans-
former (ViT) [11, 12, 13], which is a Transformer architecture
for vision tasks. AST and ViT have similar architectures but
ViT has only been applied to fixed-dimensional inputs (images)
while AST can process variable-length audio inputs. In addi-
tion, we propose an approach to transfer knowledge from Ima-
geNet pretrained ViT to AST. We also conduct extensive exper-
iments to evaluate the design choices of AST on audio tasks.
2. Audio Spectrogram Transformer
2.1. Model Architecture
Figure 1 illustrates the proposed Audio Spectrogram Trans-
former (AST) architecture. First, the input audio waveform of t
seconds is converted into a sequence of 128-dimensional log
Mel filterbank (fbank) features computed with a 25ms Ham-
ming window every 10ms. This results in a 128×100t spectro-
gram as input to the AST. We then split the spectrogram into a
sequence of N 16×16 patches with an overlap of 6 in both the time
and frequency dimensions, where N = 12⌈(100t − 16)/10⌉ is
the number of patches and the effective input sequence length
for the Transformer. We flatten each 16×16 patch to a 1D patch
embedding of size 768 using a linear projection layer. We re-
fer to this linear projection layer as the patch embedding layer.
Since the Transformer architecture does not capture the input
order information and the patch sequence is also not in tem-
poral order, we add a trainable positional embedding (also of
size 768) to each patch embedding to allow the model to cap-
ture the spatial structure of the 2D audio spectrogram.
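As a concrete illustration of this front end, the following is a minimal sketch of the filterbank extraction and overlapping patch split, assuming a 16 kHz mono waveform and torchaudio's Kaldi-compatible fbank routine (the exact flags in the released code may differ):

```python
import torch
import torch.nn.functional as F
import torchaudio

def wav_to_patches(waveform: torch.Tensor, sr: int = 16000,
                   patch: int = 16, stride: int = 10) -> torch.Tensor:
    # 128-dimensional log Mel filterbank features, 25 ms Hamming window,
    # 10 ms shift: roughly a (100t, 128) matrix for a t-second clip.
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, sample_frequency=sr, num_mel_bins=128,
        frame_length=25.0, frame_shift=10.0, window_type="hamming",
        htk_compat=True, use_energy=False, dither=0.0)
    spec = fbank.t().unsqueeze(0).unsqueeze(0)   # (1, 1, 128, ~100t)
    # 16x16 patches with an overlap of 6, i.e. a stride of 16 - 6 = 10,
    # in both the time and frequency dimensions.
    patches = F.unfold(spec, kernel_size=patch, stride=stride)
    return patches.transpose(1, 2)               # (1, N, 256)
```

Each of the N flattened 256-dimensional patches is then mapped to a 768-dimensional embedding by the patch embedding layer and added to its positional embedding.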
Similar to [22], we prepend a [CLS] token to the beginning
of the sequence. The resulting sequence is then input to the
Transformer. A Transformer consists of several encoder and
decoder layers. Since AST is designed for classification tasks,
we only use the encoder of the Transformer. Intentionally, we
use the original Transformer encoder [18] architecture without
modification. The advantages of this simple setup are 1) the
standard Transformer architecture is easy to implement and re-
produce as it is off-the-shelf in TensorFlow and PyTorch, and
2) we intend to apply transfer learning for AST, and a stan-
dard architecture makes transfer learning easier. Specifically,
the Transformer encoder we use has an embedding dimension
of 768, 12 layers, and 12 heads, which are the same as those
in [12, 11]. The Transformer encoder’s output of the [CLS]
token serves as the audio spectrogram representation. A linear
layer with sigmoid activation maps the audio spectrogram rep-
resentation to labels for classification.
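For clarity, the encoder-plus-linear-head design described above can be sketched with PyTorch's built-in Transformer encoder as follows; this is an illustrative approximation, since the released implementation reuses a ViT/DeiT backbone whose details (e.g., layer-norm placement) differ:

```python
import torch
import torch.nn as nn

class ASTSketch(nn.Module):
    def __init__(self, n_patches: int, n_classes: int, dim: int = 768,
                 depth: int = 12, heads: int = 12, patch_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)            # patch embedding layer
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))  # [CLS] token
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, n_classes)            # linear classification layer

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, N, 256) flattened 16x16 spectrogram patches
        x = self.proj(patches)
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1) + self.pos
        x = self.encoder(x)
        # the [CLS] output is the clip-level representation
        return torch.sigmoid(self.head(x[:, 0]))
```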
Strictly speaking, the patch embedding layer can be viewed
as a single convolution layer with a large kernel and stride size,
and the projection layer in each Transformer block is equivalent
to a 1×1 convolution. However, the design is different from con-
ventional CNNs that have multiple layers and small kernel and
stride sizes. These Transformer models are usually referred to
as convolution-free to distinguish them from CNNs [11, 12].
2.2. ImageNet Pretraining
One disadvantage of the Transformer compared with CNNs is
that the Transformer needs more data to train [11]. In [11],
the authors point out that the Transformer only starts to out-
perform CNNs when the amount of training data exceeds 14
million images for image classification tasks. However, audio datasets typically
do not have such large amounts of data, which motivates us
to apply cross-modality transfer learning to AST since images
and audio spectrograms have similar formats. Transfer learn-
ing from vision tasks to audio tasks has been previously stud-
ied in [23, 24, 25, 8], but only for CNN-based models, where
ImageNet-pretrained CNN weights are used as initial CNN
weights for audio classification training. In practice, it is com-
putationally expensive to train a state-of-the-art vision model,
but many commonly used architectures (e.g., ResNet [26], Ef-
ficientNet [27]) have off-the-shelf ImageNet-pretrained mod-
els for both TensorFlow and PyTorch, making transfer learning
much easier. We also follow this regime by adapting an off-the-
shelf pretrained Vision Transformer (ViT) to AST.
While ViT and AST have similar architectures (e.g., both
use a standard Transformer, same patch size, same embedding
size), they are not the same, so a few modifications are needed
for the adaptation. First, the input to ViT is a 3-channel
image while the input to AST is a single-channel spectrogram.
We therefore average the weights corresponding to each of the
three input channels of the ViT patch embedding layer and use
them as the weights of the AST patch embedding layer. This
is equivalent to expanding a single-channel spectrogram to three
channels with the same content, but is computationally more
efficient. We also normalize the input audio spectrogram so that
the dataset mean and standard deviation are 0 and 0.5, respec-
tively. Second, the input shape of ViT is fixed (either 224×224
or 384×384), which is different from a typical audio spectro-
gram. In addition, the length of an audio spectrogram can be
variable. While the Transformer naturally supports variable in-
put length and can be directly transferred from ViT to AST, the
positional embedding needs to be carefully processed because
it learns to encode the spatial information during the ImageNet
training. We propose a cut and bi-linear interpolate method for
positional embedding adaptation. For example, for a ViT that
takes 384×384 image input and uses a patch size of 16×16,
the number of patches and corresponding positional embeddings
is 24×24 = 576 (ViT splits patches without overlap). An
AST that takes 10-second audio input has 12×100 patches,
each of which needs a positional embedding. We therefore cut
the first dimension and interpolate the second dimension of the
24×24 ViT positional embedding to 12×100 and use it as
the positional embedding for the AST. We directly reuse the
positional embedding for the [CLS] token. By doing this we
are able to transfer the 2D spatial knowledge from a pretrained
ViT to the AST even when the input shapes are different. Fi-
nally, since the classification task is essentially different, we
abandon the last classification layer of the ViT and reinitialize
a new one for AST. With this adaptation framework, the AST
can use various pretrained ViT weights for initialization. In this
work, we use pretrained weights of a data-efficient image Trans-
former (DeiT) [12], which is trained with CNN knowledge dis-
tillation on 384×384 images, has 87M parameters, and achieves
85.2% top-1 accuracy on ImageNet 2012. During ImageNet
training, DeiT has two [CLS] tokens; we average them as a
single [CLS] token for audio training.
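A hedged sketch of this adaptation is given below. The checkpoint name is one off-the-shelf DeiT model from the timm library, and the choice of which 12 rows of the 24×24 positional grid to keep after the cut is our assumption, as the paper does not fix it:

```python
import timm
import torch
import torch.nn.functional as F

# ImageNet-pretrained DeiT (384x384 input, 16x16 patches, distilled).
vit = timm.create_model("deit_base_distilled_patch16_384", pretrained=True)

# 1) Patch embedding: average the three RGB input channels into one,
#    matching AST's single-channel spectrogram input.
w = vit.patch_embed.proj.weight                  # (768, 3, 16, 16)
ast_patch_weight = w.mean(dim=1, keepdim=True)   # (768, 1, 16, 16)

# 2) Positional embedding: DeiT stores the [CLS] and distillation tokens
#    followed by 24*24 = 576 patch positions. Average the two token
#    embeddings, cut frequency to 12 rows, bilinearly interpolate time to 100.
pos = vit.pos_embed                              # (1, 578, 768)
cls_pos = pos[:, :2].mean(dim=1, keepdim=True)   # reused for AST's [CLS]
grid = pos[:, 2:].reshape(1, 24, 24, 768).permute(0, 3, 1, 2)
grid = grid[:, :, 6:18, :]                       # keep the central 12 rows (assumed)
grid = F.interpolate(grid, size=(12, 100), mode="bilinear", align_corners=False)
ast_pos = torch.cat([cls_pos, grid.flatten(2).transpose(1, 2)], dim=1)

# 3) The ViT classification head is discarded and a new linear layer is
#    initialized for the audio label set.
```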
3. Experiments
In this section, we focus on evaluating the AST on AudioSet
(Section 3.1) as weakly-labeled audio event classification is one
of the most challenging audio classification tasks. We present
our primary AudioSet results and ablation study in Section 3.1.2
and Section 3.1.3, respectively. We then present our experi-
ments on ESC-50 and Speech Commands V2 in Section 3.2.
3.1. AudioSet Experiments
3.1.1. Dataset and Training Details
AudioSet [15] is a collection of over 2 million 10-second au-
dio clips excised from YouTube videos and labeled with the
sounds that the clip contains from a set of 527 labels. The bal-
anced training, full training, and evaluation sets contain 22k,
2M, and 20k samples, respectively. For AudioSet experiments,
we use the exact same training pipeline as [8]. Specifically, we
use ImageNet pretraining (as described in Section 2.2), balanced
sampling (for full set experiments only), data augmentation

Table 1: Performance comparison of AST and previous methods
on AudioSet.

Model                 Architecture     Balanced mAP   Full mAP
Baseline [15]         CNN+MLP          -              0.314
PANNs [7]             CNN+Attention    0.278          0.439
PSLA [8] (Single)     CNN+Attention    0.319          0.444
PSLA (Ensemble-S)     CNN+Attention    0.345          0.464
PSLA (Ensemble-M)     CNN+Attention    0.362          0.474
AST (Single)          Pure Attention   0.347±0.001    0.459±0.000
AST (Ensemble-S)      Pure Attention   0.363          0.475
AST (Ensemble-M)      Pure Attention   0.378          0.485
(including mixup [28] with a mixup ratio of 0.5 and spectro-
gram masking [29] with a max time mask length of 192 frames
and a max frequency mask length of 48 bins), and model aggre-
gation (including weight averaging [30] and ensemble [31]). We
train the model with a batch size of 12, the Adam optimizer [32],
and use binary cross-entropy loss. We conduct experiments on
the official balanced and full training set and evaluate on the Au-
dioSet evaluation set. For balanced set experiments, we use an
initial learning rate of 5e-5 and train the model for 25 epochs;
the learning rate is halved every 5 epochs after the 10th
epoch. For full set experiments, we use an initial learning rate
of 1e-5 and train the model for 5 epochs; the learning rate is
halved every epoch after the 2nd epoch. We use the mean
average precision (mAP) as our main evaluation metric.
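The optimization and augmentation settings above can be summarized in the following hedged sketch; the milestone epochs are our reading of "after the 10th/2nd epoch", and BCEWithLogitsLoss assumes the model outputs logits rather than post-sigmoid probabilities:

```python
import torch
import torch.nn as nn
import torchaudio

def build_training(model: nn.Module, full_set: bool = False):
    # Balanced set: lr 5e-5, 25 epochs, halved every 5 epochs after the 10th.
    # Full set:     lr 1e-5,  5 epochs, halved every epoch after the 2nd.
    lr = 1e-5 if full_set else 5e-5
    milestones = [2, 3, 4] if full_set else [10, 15, 20]
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.5)
    criterion = nn.BCEWithLogitsLoss()   # multi-label binary cross-entropy
    return optimizer, scheduler, criterion

def spec_augment(spec: torch.Tensor) -> torch.Tensor:
    # Spectrogram masking [29]: max frequency mask of 48 bins, time mask of 192 frames.
    spec = torchaudio.transforms.FrequencyMasking(freq_mask_param=48)(spec)
    return torchaudio.transforms.TimeMasking(time_mask_param=192)(spec)
```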
3.1.2. AudioSet Results
We repeat each experiment three times with the same setup but
different random seeds and report the mean and standard devi-
ation. When AST is trained with the full AudioSet, the mAP at
the last epoch is 0.448±0.001. As in [8], we also use weight
averaging [30] and ensemble [31] strategies to further improve
the performance of AST. Specifically, for weight averaging, we
average all weights of the model checkpoints from the first to
the last epoch. The weight-averaged model achieves an mAP of
0.459±0.000, which is our best single model (weight averag-
ing does not increase the model size). For ensemble, we eval-
uate two settings: 1) Ensemble-S: we run the experiment three
times with the exact same setting, but with a different random
seed. We then average the output of the last checkpoint model of
each run. In this setting, the ensemble model achieves an mAP
of 0.475; 2) Ensemble-M: we ensemble models trained with
different settings; specifically, we ensemble the three models
in Ensemble-S together with another three models trained with
different patch split strategies (described in Section 3.1.3 and
shown in Table 5). In this setting, the ensemble model achieves
an mAP of 0.485, which is our best full model on AudioSet. As
shown in Table 1, the proposed AST outperforms the previous
best system in [8] in all settings. Note that we use the same
training pipeline as [8], and [8] also uses ImageNet pretrain-
ing, so it is a fair comparison. In addition, we use fewer models
(6) for our best ensemble models than [8] (10). Finally, it is
worth mentioning that AST training converges quickly; AST
only needs 5 training epochs, while in [8], the CNN-attention
hybrid model is trained for 30 epochs.
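The weight averaging and ensembling used here can be sketched as follows (a simplified version that assumes all checkpoints share the same floating-point parameter layout):

```python
import copy
import torch

def average_checkpoints(paths):
    # Weight averaging [30]: average the parameters of the saved per-epoch
    # checkpoints; the resulting single model has an unchanged size.
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")
        if avg is None:
            avg = copy.deepcopy(state)
        else:
            for k in avg:
                avg[k] = avg[k] + state[k]
    return {k: v / float(len(paths)) for k, v in avg.items()}

def ensemble_predictions(per_model_outputs):
    # Ensemble [31]: average the prediction scores of the individual models.
    return torch.stack(per_model_outputs, dim=0).mean(dim=0)
```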
We also conduct experiments with the balanced AudioSet
(about 1% of the full set) to evaluate the performance of AST
when the training data volume is smaller. For weight averag-
ing, we average all weights of the model checkpoints of the
Table 2: Performance impact due to ImageNet pretraining.
"Used" denotes the setting used by our optimal AST model.

                           Balanced Set   Full Set
No Pretrain                0.148          0.366
ImageNet Pretrain (Used)   0.347          0.459
Table 3: Performance of AST models initialized with different
ViT weights on balanced AudioSet and corresponding ViT mod-
els' top-1 accuracy on ImageNet 2012. (* Model is trained
without patch split overlap due to memory limitation.)

                         # Params   ImageNet   AudioSet
ViT Base [11]            86M        0.846      0.320
ViT Large [11]*          307M       0.851      0.330
DeiT w/o Distill [12]    86M        0.829      0.330
DeiT w/ Distill (Used)   87M        0.852      0.347
last 20 epochs. For Ensemble-S, we follow the same setting
used for the full AudioSet experiment; for Ensemble-M, we in-
clude 11 models trained with different random seeds (Table 1),
different pretrained weights (Table 3), different positional em-
bedding interpolation (Table 4), and different patch split strate-
gies (Table 5). The single, Ensemble-S, and Ensemble-M models
achieve 0.347±0.001, 0.363, and 0.378, respectively, all of which
outperform the previous best system. This demonstrates that AST can
work better than CNN-attention hybrid models even when the
training set is relatively small.
3.1.3. Ablation Study
We conduct a series of ablation studies to illustrate the design
choices for the AST. To save compute, we mainly conduct ab-
lation studies with the balanced AudioSet. For all experiments,
we use weight averaging but do not use ensembles.
Impact of ImageNet Pretraining. We compare ImageNet pre-
trained AST and randomly initialized AST. As shown in Ta-
ble 2, ImageNet pretrained AST noticeably outperforms ran-
domly initialized AST for both balanced and full AudioSet ex-
periments. The performance improvement of ImageNet pre-
training is more significant when the training data volume is
smaller, demonstrating that ImageNet pretraining can greatly
reduce the demand for in-domain audio data for AST. We fur-
ther study the impact of pretrained weights used. As shown in
Table 3, we compare the performance of AST models initialized
with pretrained weights of ViT-Base, ViT-Large, and DeiT mod-
els. These models have similar architectures but are trained with
different settings. We made the necessary architecture modifi-
cations for AST to reuse the weights. We find that the AST
initialized with the weights of the DeiT model with distillation,
which performs best on ImageNet 2012, also performs best on AudioSet.
Impact of Positional Embedding Adaptation. As mentioned
in Section 2.2, we use a cut and bi-linear interpolation approach
for positional embedding adaptation when transferring knowl-
edge from the Vision Transformer to the AST. We compare it
with a pretrained AST model with a randomly initialized posi-
tional embedding. As shown in Table 4, we find reinitializing
the positional embedding does not completely break the pre-
trained model as the model still performs better than a fully
randomly reinitialized model, but it does lead to a noticeable
performance drop compared with the proposed adaptation ap-
proach. This demonstrates the importance of transferring spatial

Table 4: Performance impact due to various positional embed-
ding adaptation settings.

                                 Balanced Set
Reinitialize                     0.305
Nearest Neighbor Interpolation   0.346
Bilinear Interpolation (Used)    0.347
Table 5: Performance impact due to various patch overlap sizes.

                   # Patches   Balanced Set   Full Set
No Overlap         512         0.336          0.451
Overlap-2          657         0.342          0.456
Overlap-4          850         0.344          0.455
Overlap-6 (Used)   1212        0.347          0.459
Table 6: Performance impact due to various patch shapes and
sizes. All models are trained with no patch split overlap.

                # Patches   w/o Pretrain   w/ Pretrain
128×2           512         0.154          -
16×16 (Used)    512         0.143          0.336
32×32           128         0.139          -
knowledge. Bilinear and nearest-neighbor interpolation lead to
similar performance.
Impact of Patch Split Overlap. We compare the performance
of models trained with different patch split overlaps [13]. As
shown in Table 5, the performance improves with the overlap
size for both balanced and full set experiments. However, in-
creasing the overlap also leads to longer patch sequence inputs
to the Transformer, which will quadratically increase the com-
putational overhead. Even with no patch split overlap, AST can
still outperform the previous best system in [8].
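For reference, the "# Patches" column of Table 5 can be reproduced with a simple sliding-window count; the assumption that a 10-second clip is padded to 1024 frames (rather than exactly 1000) is ours:

```python
def count_patches(frames: int, mel_bins: int = 128, patch: int = 16,
                  overlap: int = 0) -> int:
    # Sliding-window patch count with stride = patch - overlap on both axes.
    stride = patch - overlap
    per_axis = lambda n: (n - patch) // stride + 1
    return per_axis(mel_bins) * per_axis(frames)

# count_patches(1024, overlap=0) == 512, count_patches(1024, overlap=2) == 657,
# count_patches(1024, overlap=4) == 850, count_patches(1024, overlap=6) == 1212
```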
Impact of Patch Shape and Size. As mentioned in Sec-
tion 2.1, we split the audio spectrogram into 16×16 square
patches, so the input sequence to the Transformer cannot be in
temporal order. We hope the positional embedding can learn
to encode the 2D spatial information. An alternative way to
split the spectrogram is to slice it into rectangular patches in
temporal order. We compare both methods in Table 6: when
the area of the patch is the same (256), using 128×2 rectangular
patches leads to better performance than using 16×16 square
patches when both models are trained from scratch. However,
since there are no 128×2 patch-based ImageNet-pretrained
models, using 16×16 patches is still the current optimal solu-
tion. We also compare patches of different sizes; smaller
patches lead to better performance.
3.2. Results on ESC-50 and Speech Commands
The ESC-50 [16] dataset consists of 2,000 5-second environ-
mental audio recordings organized into 50 classes. The cur-
rent best results on ESC-50 are 86.5% accuracy (trained from
scratch, SOTA-S) [33] and 94.7% accuracy (with AudioSet pre-
training, SOTA-P) [7]. We compare AST with the SOTA mod-
els in these two settings; specifically, we train an AST model
with only ImageNet pretraining (AST-S) and an AST model
with ImageNet and AudioSet pretraining (AST-P). We train
both models with frequency/time masking [29] data augmen-
tation, a batch size of 48, and the Adam optimizer [32]
Table 7: Comparing AST and SOTA models on ESC-50 and
Speech Commands. "-S" and "-P" denote models trained with-
out and with additional audio data, respectively.

         ESC-50      Speech Commands V2 (35 classes)
SOTA-S   86.5 [33]   97.4 [34]
SOTA-P   94.7 [7]    97.7 [35]
AST-S    88.7±0.7    98.11±0.05
AST-P    95.6±0.4    97.88±0.03
for 20 epochs. We use initial learning rates of 1e-4 and 1e-5
for AST-S and AST-P, respectively, and decrease the learning
rate by a factor of 0.85 every epoch after the 5th epoch. We follow the
standard 5-fold cross-validation to evaluate our model, repeat
each experiment three times, and report the mean and standard
deviation. As shown in Table 7, AST-S achieves 88.7±0.7 and
AST-P achieves 95.6±0.4, both of which outperform the SOTA models in
the same setting. Of note, although ESC-50 has 1,600 training
samples for each fold, AST still works well with such a small
amount of data even without AudioSet pretraining.
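The exponential decay schedule shared by these experiments (and by the Speech Commands experiments below) can be expressed, for example, with a LambdaLR; epoch indices are assumed to start at 0:

```python
import torch

def make_decay_scheduler(optimizer: torch.optim.Optimizer):
    # Keep the learning rate constant for the first 5 epochs, then multiply
    # it by 0.85 every epoch ("a factor of 0.85 every epoch after the 5th").
    decay = lambda epoch: 1.0 if epoch < 5 else 0.85 ** (epoch - 4)
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=decay)
```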
Speech Commands V2 [17] is a dataset consisting of 105,829
1-second recordings of 35 common speech commands. The
training, validation, and test sets contain 84,843, 9,981, and
11,005 samples, respectively. We focus on the 35-class classi-
fication task. The SOTA model on Speech Commands V2 (35-
class classification) without additional audio data pretraining is
the time-channel separable convolutional neural network [34],
which achieves 97.4% on the test set. In [35], a CNN model
pretrained on an additional 200 million YouTube audio clips achieves
97.7% on the test set. We also evaluate AST in these two set-
tings. Specifically, we train an AST model with only ImageNet
pretraining (AST-S) and an AST model with ImageNet and
AudioSet pretraining (AST-P). We train both models with fre-
quency and time masking [29], random noise, and mixup [28]
augmentation, a batch size of 128, and the Adam optimizer [32].
We use an initial learning rate of 2.5e-4 and decrease the learn-
ing rate by a factor of 0.85 every epoch after the 5th epoch.
We train the model for up to 20 epochs, select the best
model using the validation set, and report the accuracy on
the test set. We repeat each experiment three times and re-
port the mean and standard deviation. The AST-S model achieves
98.11±0.05, outperforming the SOTA model in [9]. In addition,
we find AudioSet pretraining unnecessary for the speech com-
mand classification task as AST-S outperforms AST-P. To sum-
marize, while the input audio length varies from 1 sec. (Speech
Commands), 5 sec. (ESC-50) to 10 sec. (AudioSet) and content
varies from speech (Speech Commands) to non-speech (Au-
dioSet and ESC-50), we use a fixed AST architecture for all
three benchmarks and achieve SOTA results on all of them. This
indicates the potential of AST as a generic audio classifier.
4. Conclusions
Over the last decade, CNNs have become a common model
component for audio classification. In this work, we find CNNs
are not indispensable, and introduce the Audio Spectrogram
Transformer (AST), a convolution-free, purely attention-based
model for audio classification which features a simple architec-
ture and superior performance.
5. Acknowledgements
This work is partly supported by Signify.

6. References
[1] F. Eyben, F. Weninger, F. Gross, and B. Schuller, “Recent de-
velopments in openSMILE, the Munich open-source multimedia
feature extractor,” in Multimedia, 2013.
[2] B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer,
F. Ringeval, M. Chetouani, F. Weninger, F. Eyben, E. Marchi,
M. Mortillaro, H. Salamin, A. Polychroniou, F. Valente, and S. K.
Kim, “The Interspeech 2013 computational paralinguistics chal-
lenge: Social signals, conflict, emotion, autism,” in Interspeech,
2013.
[3] N. Jaitly and G. Hinton, “Learning a better representation of
speech soundwaves using restricted Boltzmann machines,” in
ICASSP, 2011.
[4] S. Dieleman and B. Schrauwen, “End-to-end learning for music
audio,” in ICASSP, 2014.
[5] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nico-
laou, B. Schuller, and S. Zafeiriou, “Adieu features? end-to-end
speech emotion recognition using a deep convolutional recurrent
network,” in ICASSP, 2016.
[6] Y. LeCun and Y. Bengio, “Convolutional networks for images,
speech, and time series,” The Handbook of Brain Theory and Neu-
ral Networks, vol. 3361, no. 10, p. 1995, 1995.
[7] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumb-
ley, “PANNs: Large-scale pretrained audio neural networks for
audio pattern recognition,” IEEE/ACM TASLP, vol. 28, pp. 2880–
2894, 2020.
[8] Y. Gong, Y.-A. Chung, and J. Glass, “PSLA: Improving audio
event classification with pretraining, sampling, labeling, and ag-
gregation,” arXiv preprint arXiv:2102.01243, 2021.
[9] O. Rybakov, N. Kononenko, N. Subrahmanya, M. Visontai, and
S. Laurenzo, “Streaming keyword spotting on mobile devices,” in
Interspeech, 2020.
[10] P. Li, Y. Song, I. V. McLoughlin, W. Guo, and L.-R. Dai, “An
attention pooling based representation learning method for speech
emotion recognition,” in Interspeech, 2018.
[11] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn,
X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold,
S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16
words: Transformers for image recognition at scale,” in ICLR,
2021.
[12] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and
H. Jégou, “Training data-efficient image transformers & distilla-
tion through attention,” arXiv preprint arXiv:2012.12877, 2020.
[13] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, F. E. Tay, J. Feng, and
S. Yan, “Tokens-to-token ViT: Training vision transformers from
scratch on ImageNet,” arXiv preprint arXiv:2101.11986, 2021.
[14] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei,
“ImageNet: A large-scale hierarchical image database,” in CVPR,
2009.
[15] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence,
R. C. Moore, M. Plakal, and M. Ritter, “Audio Set: An ontology
and human-labeled dataset for audio events,” in ICASSP, 2017.
[16] K. J. Piczak, “ESC: Dataset for environmental sound classifica-
tion,” in Multimedia, 2015.
[17] P. Warden, “Speech commands: A dataset for limited-vocabulary
speech recognition,” arXiv preprint arXiv:1804.03209, 2018.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,”
in NIPS, 2017.
[19] K. Miyazaki, T. Komatsu, T. Hayashi, S. Watanabe, T. Toda,
and K. Takeda, “Convolution augmented transformer for semi-
supervised sound event detection,” in DCASE, 2020.
[20] Q. Kong, Y. Xu, W. Wang, and M. D. Plumbley, “Sound event
detection of weakly labelled data with CNN-transformer and au-
tomatic threshold optimization,” IEEE/ACM TASLP, vol. 28, pp.
2450–2460, 2020.
[21] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu,
W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer:
Convolution-augmented transformer for speech recognition,” in
Interspeech, 2020.
[22] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-
training of deep bidirectional transformers for language under-
standing,” in NAACL-HLT, 2019.
[23] G. Gwardys and D. M. Grzywczak, “Deep image features in mu-
sic information retrieval,” IJET, vol. 60, no. 4, pp. 321–326, 2014.
[24] A. Guzhov, F. Raue, J. Hees, and A. Dengel, “ESResNet: Envi-
ronmental sound classification based on visual domain models,”
in ICPR, 2020.
[25] K. Palanisamy, D. Singhania, and A. Yao, “Rethinking CNN mod-
els for audio classification,” arXiv preprint arXiv:2007.11154,
2020.
[26] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” in CVPR, 2016.
[27] M. Tan and Q. V. Le, “EfficientNet: Rethinking model scaling for
convolutional neural networks,” in ICML, 2019.
[28] Y. Tokozume, Y. Ushiku, and T. Harada, “Learning from between-
class examples for deep sound recognition,” in ICLR, 2018.
[29] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D.
Cubuk, and Q. V. Le, “SpecAugment: A simple data augmen-
tation method for automatic speech recognition,” in Interspeech,
2019.
[30] P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G.
Wilson, “Averaging weights leads to wider optima and better gen-
eralization,” in UAI, 2018.
[31] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24,
no. 2, pp. 123–140, 1996.
[32] D. P. Kingma and J. Ba, “Adam: A method for stochastic opti-
mization,” in ICLR, 2015.
[33] H. B. Sailor, D. M. Agrawal, and H. A. Patil, “Unsupervised filter-
bank learning using convolutional restricted Boltzmann machine
for environmental sound classification.” in Interspeech, 2017.
[34] S. Majumdar and B. Ginsburg, “MatchboxNet: 1D time-channel
separable convolutional neural network architecture for speech
commands recognition,” arXiv preprint arXiv:2004.08531, 2020.
[35] J. Lin, K. Kilgour, D. Roblek, and M. Sharifi, “Training keyword
spotters with limited and synthesized speech data,” in ICASSP,
2020.