2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
October 17-20, 2021, New Paltz, NY
AUDIO TRANSFORMERS:
TRANSFORMER ARCHITECTURES FOR LARGE SCALE AUDIO UNDERSTANDING.
ADIEU CONVOLUTIONS∗
Prateek Verma and Jonathan Berger
Stanford University
450 Jane Stanford Way, Stanford CA, 94305,
prateekv, brg@stanford.edu
Figure 1: An overview of the proposed Audio Transformer architecture, using a front-end fully connected encoder with Transformer layers and pooling layers. It takes 1 s of input audio, divides it into patches of size 25 ms, and learns a front end whose output is fed to the Transformer.
ABSTRACT
Over the past two decades, CNN architectures have produced compelling models of sound perception and cognition, learning hierarchical organizations of features. Analogous to successes in computer vision, audio feature classification can be optimized for a particular task of interest, over a wide variety of datasets and labels. In fact, similar architectures designed for image understanding have proven effective for acoustic scene analysis. Here we propose applying Transformer based architectures without convolutional layers to raw audio signals. On the standard Free Sound 50K dataset, comprising 200 categories, our model outperforms convolutional models to produce state of the art results. This is significant because, unlike in natural language processing and computer vision, we do not perform unsupervised pre-training to outperform convolutional architectures. On the same training set, we show a significant improvement with respect to mean average precision benchmarks. We further improve the performance of Transformer architectures by using techniques such as pooling, inspired by convolutional networks designed in the past few years. In addition, we also show how multi-rate signal processing ideas inspired by wavelets can be applied to the Transformer embeddings to improve the results. We also show that our model learns a non-linear, non-constant bandwidth filter-bank, which provides an adaptable time-frequency front end representation for the task of audio understanding, different from other tasks, e.g. pitch estimation.*
* This is not the first time an acoustic scene understanding model without convolutions has been presented, although perhaps it is a first end-to-end one, to the best of our knowledge. At the same time as ViT, [1] showed how the same can be achieved in a two-step process.
Index Terms— Transformers, audio understanding, wavelets
1. INTRODUCTION AND RELATED WORK
Acoustic scene analysis is a classical signal processing and machine
learning problem whose goal is to predict the contents of an input
signal within a brief duration, typically one second. In addition
to modeling perception, computer simulation of hearing combined
with models of other sensory systems will help bridge the gap be-
tween humans and computers. For the past decade, CNNs have become a de-facto architecture for learning mappings from fixed dimensional inputs to fixed dimensional outputs [2, 3]. CNN architectures inspired by vision [2], adapted for acoustic scene understanding, achieve similar performance gains for audio as well.
The core backbone of this work is the Transformer architecture, which has recently produced state of the art results in a variety of domains, including protein sequences [4], text [5, 6], symbolic music [7], video [8, 9] and image understanding [10, 11]. By learning Transformers on latent representations and conditioning a WaveNet generator, [12] achieved compelling results in music generation and style transfer, which was not possible without the guidance of meta-data and convolutional architectures [13]. They have also been used in learning latent audio
representations such as [14, 15], for solving pseudo-tasks such as in-filling to learn time-dependent representations [1, 6]. As opposed to learning latent representations, the reduced time-scales of Transformers can advantageously model the input representations directly. A major drawback of convolutional architectures is that the same fixed filters are applied across the entire input. Transformers, on the other hand, take advantage of an attention mechanism, with the output at a location able to depend on the input at any other location.
The core idea of this work is to replace traditional convolutional
based architectures [3], combined convolutional and Transformer
architectures [16, 17], and recurrent architectures [18, 19] with a
purely Transformer based architecture. Our work is distinct from the method proposed in [1], which was not end-to-end and required two steps (specifically, learning a dictionary of latent codes, and using the discrete latent codes as input to Transformer architectures). Similar approaches were successfully used in areas such as speech recognition [16] to mimic BERT [6]. All these state of the art performances were possible due to the architectures' ability to model long term dependencies in the inputs, and the attention mechanism present in them, which enables focusing only on the part of the input that is important [20].
The organization of the paper is as follows: Section 1 introduces the problem and surveys the related literature, followed by the dataset we use to benchmark the results in Section 2. Section 3 details the methodology, followed by results and discussion in Section 4. We conclude the paper in Section 5, followed by our thoughts on future work and the references.
2. DATASET
We use FSD50K [21], an open dataset containing over 51k audio files, to train and evaluate our architectures. FSD50K comprises over 100 hours of manually labeled audio using 200 classes drawn from the AudioSet ontology. We chose to evaluate on it rather than the popular AudioSet [3], as FSD50K is freely available under a Creative Commons license, contains many more high quality audio annotations, and has twice the number of training examples in the balanced set-up. We used the provided training and validation splits to tune the model, and tested on the provided evaluation setup. In total there are 51,197 clips available, ranging from 0.3-30 s. We downsample all the clips to a 16 kHz sampling rate using [22]. We follow the same setup for reporting the results as done in [21]. All the training was carried out on 1 s audio chunks, with the labels inherited by all the chunks of clips longer than 1 s. For samples shorter than 1 s, the audio clip is repeated to make the duration 1 s, resulting in a single training example for that clip. On average, the duration per clip is 7.6 s, with 1.22 labels per clip, uploaded by 7225 user ids, thus encompassing a diverse range of sources, acoustic environments, microphones, and locations, to name a few.
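For concreteness, a minimal Python sketch of this chunking scheme is given below. The function name and the handling of any trailing partial second are our assumptions; only the 1 s chunking, label inheritance, and repeat-to-1 s behavior are stated above.

    import numpy as np

    SR = 16000  # all clips are resampled to 16 kHz

    def make_training_chunks(audio, labels):
        # Split one clip into non-overlapping 1 s chunks that all inherit the clip labels.
        # Clips shorter than 1 s are tiled (repeated) up to 1 s, giving a single example.
        # Dropping a trailing partial second is our assumption; the paper does not specify.
        if len(audio) < SR:
            reps = int(np.ceil(SR / len(audio)))
            return [(np.tile(audio, reps)[:SR], labels)]
        n_chunks = len(audio) // SR
        return [(audio[i * SR:(i + 1) * SR], labels) for i in range(n_chunks)]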
3. METHODOLOGY
3.1. Baseline Transformer Architectures
This section describes the Transformer architecture of [20] that we use to train the system, shown in Figure 1. A detailed explanation is given in [23], but for the sake of clarity and completeness we describe it here. As a black box, which we describe in more detail in this section, it takes as input a sequence of fixed length T and produces a sequence of the same length, with each element of a chosen dimension E, which denotes the size of the latent space. More specifically, it maps a sequence x = (x_1, x_2, ..., x_T) to a sequence z = (z_1, z_2, ..., z_T) of the same length T, where each element of (z_1, z_2, ..., z_T) has the chosen hyper-parameter dimension E, in our case 64, the size of the embedding. For the sake of brevity, we explain only one Transformer encoder; for a model with L layers, each such block is stacked on top of the previous one.
Each Transformer module consists of an attention block and a feed-forward block. The output of each of them is passed through a layer norm and a residual connection. So, after both the attention block and the feed-forward block, if the input to a sub-block (attention F_a or feed-forward F_ff) is a sequence x_b, instead of passing the output directly to the next module/sub-block, we pass along the layer-normed residual output x_bo as
x_bo = LayerNorm(x_b + F_a/ff(x_b)).
This follows the notion that layer-norm/skip connections help in better convergence and improved performance. We now describe each of the two sub-blocks that are part of the Transformer block, namely i) multi-headed causal attention and ii) the feed-forward architecture.
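A minimal NumPy sketch of this sub-block wrapper is shown below; the function names are ours, and the learned scale/offset parameters of layer norm are omitted for brevity.

    import numpy as np

    def layer_norm(x, eps=1e-6):
        # Normalize each time step across the embedding dimension E
        # (learned scale and offset parameters omitted for brevity).
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return (x - mean) / (std + eps)

    def sublayer(x_b, F):
        # x_bo = LayerNorm(x_b + F(x_b)): the residual-plus-layer-norm wrapper
        # applied around both the attention and the feed-forward sub-blocks.
        return layer_norm(x_b + F(x_b))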
3.1.1. Multi-Headed Causal Attention
A multi-headed causal attention function can be described as a weighting function that decides how to compute the output at each step. It learns a probabilistic score of how important each of the embeddings is while predicting the output. Multi-headed attention consists of first learning a probabilistic score, which is then multiplied with each of the inputs to determine how important each input is for predicting the embedding at a position pos in 1, 2, 3, ..., T. We use scaled dot-product attention as the attention mechanism. A query, key and value vector is learned for each position of each input. This is done by implicitly learning matrices W_Q, W_K and W_V that produce a query vector q, key vector k and value vector v for each input, for a single attention head. We take the dot product of the query and key vectors, multiply the result by a normalization factor (the inverse of the square root of the size of the vector, as done in [20]), and take a soft-max across all the inputs. Each value vector is multiplied by this score to get the output of the attention module. Mathematically, for a query matrix Q, key matrix K, and value matrix V, it is defined as
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
We can also learn multiple such attention maps for h attention heads, defined as
MultiHeadAttention(Q, K, V) = Concat(h_1, h_2, ..., h_h) W_o,
where each attention head h_i is defined as
h_i = Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / sqrt(d_k)) V_i,
and W_o is a matrix learned during training. In this work, we focus on a causal attention map, obtained by multiplying the attention scores with a triangular mask so that each attention head only gives weight to positions up to pos, with all future entries set to zero [5].
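A NumPy sketch of this causal scaled dot-product attention follows. The function names and the -1e9 masking constant are our choices, and the learned projection matrices are assumed given; this is an illustration, not the exact training-time implementation.

    import numpy as np

    def causal_attention(Q, K, V):
        # Q, K, V: (T, d_k). Scaled dot-product attention with a triangular mask
        # so that position pos only attends to positions <= pos.
        T, d_k = Q.shape
        scores = Q @ K.T / np.sqrt(d_k)                   # (T, T)
        mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # strictly future positions
        scores = np.where(mask, -1e9, scores)             # future entries get ~zero weight
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)    # softmax across the inputs
        return weights @ V                                # (T, d_k)

    def multi_head_causal_attention(x, W_Q, W_K, W_V, W_o, h):
        # x: (T, E); W_Q/W_K/W_V: lists of h projection matrices (E, d_k); W_o: (h*d_k, E).
        heads = [causal_attention(x @ W_Q[i], x @ W_K[i], x @ W_V[i]) for i in range(h)]
        return np.concatenate(heads, axis=-1) @ W_o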
3.1.2. Feed Forward Architecture & Positional Information
We weigh the saliency of each input at a position pos via multi-headed attention. Each input at position pos is then passed through a feed-forward architecture. For an input x_b, the output x_bo of a 2-layer feed-forward network with inner dimension d_ff is

FF(x_b) = max(0, x_b W_1 + b_1) W_2 + b_2.
We apply this function identically at each of the inputs. As described in [20], positional encodings are added to each of the inputs. As the input is passed in as a list, the model does not take relative position into account, and thus positional encodings are needed. For any position pos and dimension i of the latent space, we use sinusoidal functions, i.e. to each position pos and embedding dimension i in E we add
PE(pos, 2i) = sin(pos / 10000^(2i/E)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/E)).
This adds positional information for each point in time of the input with dimension E, before passing through the self-attention layers.
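A NumPy sketch of these two pieces, assuming an even embedding dimension E (the function names are ours):

    import numpy as np

    def positional_encoding(T, E):
        # PE(pos, 2i) = sin(pos / 10000^(2i/E)), PE(pos, 2i+1) = cos(pos / 10000^(2i/E)).
        pos = np.arange(T)[:, None]              # (T, 1)
        i = np.arange(0, E, 2)[None, :]          # (1, E/2), even dimensions
        angles = pos / np.power(10000.0, i / E)
        pe = np.zeros((T, E))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe

    def feed_forward(x_b, W1, b1, W2, b2):
        # FF(x_b) = max(0, x_b W1 + b1) W2 + b2, applied identically at every position.
        return np.maximum(0.0, x_b @ W1 + b1) @ W2 + b2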
3.2. Adapting Transformer Architecture for raw waveforms
We adapt Transformer architectures using ideas from traditional signal processing. Since the Transformer has O(n^2) complexity w.r.t. memory and computation requirements, we choose to follow the traditional route of windowing the signal. For all the experiments, as discussed before, we work with 1 s of audio input sampled at 16 kHz, yielding 16,000 samples.
Figure 2: Core idea of wavelets utilizing multi-scale learning (left), from [24], and its use to create a layer that operates on intermediate Transformer embeddings at various scales. We show a demo signal: we retain half of the embeddings and modify the other half using variable sized windows.
The window length is chosen to be 25 ms, using a non-overlapping rectangular window. The rectangular window lets the network learn an optimal windowing function which, as we will see in a few of the learned filters, adapts itself to the shape of a Hanning/Hamming window. We fix the front end to be a dense layer of 2048 neurons followed by another layer of size 64, primarily to adapt to the size of the embedding layer of the Transformer. A single dense layer of size 2048 successfully learned a filter-bank yielding a neural time-frequency representation, as shown in [25]. This design was chosen as it produced state of the art results, with feed-forward layers, for the equally difficult problem of pitch estimation in polyphonic audio [25]. Since Transformer layers only consist of attention + feed-forward blocks, we achieve an end-to-end architecture that does not have any convolutional operators.1 This yields a front end representation of 40 time steps, each of 64 dimensions (64 being a hyper-parameter). We choose 6 layers of the Transformer module, with the size of the latent code being 64 for each of the layers, 8 attention heads, and a 3-layer, 128-dim dense block to convert to the desired feature space. For comparison with a smaller model, we choose 3 layers of Transformers with a similar setup. The last layer of the Transformer is reduced to a smaller dimension using average pooling across time. The output of the last dense layer has dimension 200, chosen to be the same as the number of output labels.
1 Although not the norm nor in wide circulation, a fully connected layer is also a convolution operation with no stride and a receptive field size equal to the size of the input data.
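A sketch of the large-model front end under the stated sizes follows. The ReLU non-linearity on the 2048-unit layer is our assumption; the paper specifies only the layer widths and the 25 ms patching.

    import numpy as np

    SR, WIN, N_PATCHES = 16000, 400, 40   # 25 ms patches at 16 kHz, 40 per second

    def front_end(audio_1s, W1, b1, W2, b2):
        # Map 16,000 raw samples to a (40, 64) sequence for the Transformer.
        # W1: (400, 2048), W2: (2048, 64); both dense layers are learned end to end.
        patches = audio_1s.reshape(N_PATCHES, WIN)    # non-overlapping rectangular windows
        hidden = np.maximum(0.0, patches @ W1 + b1)   # 2048-unit dense layer (ReLU assumed)
        return hidden @ W2 + b2                       # 64-dim embedding per 25 ms patch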
3.3. Transformer Architectures Inspired From CNNs: Pooling
We explored further performance enhancements to the baseline Transformer architecture proposed in the previous section. For this we draw inspiration from the convolutional architectures used over the past decade to understand images [26] and audio [3]. Traditional models, e.g. ResNet-50 [2], use a combination of convolutional layers followed by pooling. The use of pooling layers has two advantages. It reduces the number of computations by reducing the size of the inputs to the higher layers. More importantly, it allows the higher-layer neurons to have much broader receptive field sizes, and allows the network to learn hierarchically ordered features, which is important for a particular problem. Pooled Transformers outperform the baseline Transformer architecture, while retaining the same number of parameters. In our experiments average pooling performed significantly better than max-pooling, as it retains more information about the signal. As described in Figure 1, we use pooling across time after every two layers of Transformers, with stride 1, to reduce the dimensionality of the input by a factor of 2, which yields a significant performance gain compared to the original Transformer architecture without pooling.
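A sketch of this pooling step between Transformer layers is given below. Non-overlapping windows of size 2 are our assumption, chosen to match the stated factor-of-2 reduction in sequence length.

    import numpy as np

    def avg_pool_time(z, pool=2):
        # Average-pool a Transformer output z of shape (T, E) across time.
        # Applied after every two Transformer layers; the pool size of 2 is assumed
        # here so that the sequence length is halved, as described above.
        T, E = z.shape
        T2 = (T // pool) * pool
        return z[:T2].reshape(T2 // pool, pool, E).mean(axis=1)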
3.4. Learning multi-scale embeddings
In this adaptation, we draw inspiration from wavelet decomposition and the success of pooling layers. We explored whether we can decompose the intermediate embeddings out of the Transformer at multiple scales, similar to the idea of wavelet decomposition. In order to achieve this, we fix our kernel to be an averaging operation across all windows chosen at a particular level. Notice that we choose different window sizes for different dimensions of the embedding along the time axis. The manner of implementation is again a design choice, and there are several interesting ideas possible in the future, including the choice of kernel. We draw inspiration from the work carried out in [24], as seen in Figure 2. We adapt the window size in factors of 1, 2, 4, 8 and so on, following a geometric progression. The averaged value is assigned to all of the elements of the window, as opposed to reducing the size as done in pooling, thus retaining the same size. This operation is fully differentiable, and can be trained in end-to-end architectures. This is different from the work on spectral filtering [27], as we firstly choose to operate with variable window sizes as opposed to fixed windows, and secondly do not use explicit hand-crafted bands of filters. Additionally, we choose to model the embedding-time space hierarchically, with only a few large windows and a large number of smaller windows, most of them of size 1 to retain the embeddings at their original scale. This retains the original Transformer embeddings, with half of the embeddings unchanged, while modifying the other half. This combination has been at the core of wavelet transforms.
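One way to realize this layer, as a NumPy sketch: half of the embedding dimensions are kept at scale 1, and the remaining dimensions are averaged over windows of increasing size, with the average written back so the (T, E) shape is preserved. The exact assignment of dimensions to window sizes is our assumption of one plausible split.

    import numpy as np

    def multiscale_average(z, windows=(1, 2, 4, 8)):
        # z: (T, E) intermediate Transformer embeddings.
        T, E = z.shape
        out = z.copy()
        # First half of the dimensions stay at their original scale (window = 1).
        groups = np.array_split(np.arange(E // 2, E), len(windows) - 1)
        for w, dims in zip(windows[1:], groups):
            for start in range(0, T, w):
                seg = out[start:start + w, dims]
                # Assign the window average to every element of the window,
                # retaining the original sequence length.
                out[start:start + w, dims] = seg.mean(axis=0)
        return out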
4. RESULTS & DISCUSSION
For all of the architectures, we only tuned the learning rate, to be consistent with the results shown in [21]. All of the Transformers have 6 layers (3 for the small Transformer) with 64-dim embeddings, 3-layer 128-neuron feed-forward blocks, and 8 attention heads. The front end consists of a 1024/2048 dimensional layer followed by a 64 dimensional dense layer, for the small and large Transformers respectively. We compared the same Transformer architectures with i) pooling layers and ii) multi-scale filters.
Figure 3: Sorted filters learned by the front end form a problem specific, non-linear, non-constant bandwidth filter-bank. This is shown by comparing it to that learned by the same front end for polyphonic pitch estimation, as shown in [25].
We observed that even the
smallest of the Transformer architectures outperforms traditional convolutional architectures. This is quite significant; unlike problems in vision [10], where the margin was not as large, here the improvement is substantial. Another observation is that the performance keeps improving with more depth. All the models were trained using the Tensorflow framework [28], with Huber loss as the error criterion between the predictions and the ground truth, using the Adam optimizer [29].
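As a hedged illustration of this training setup in the TensorFlow/Keras API: the helper name, the learning-rate value, and the use of model.compile are our assumptions; the paper states only the framework, loss, and optimizer.

    import tensorflow as tf

    def compile_for_training(model, learning_rate=1e-3):
        # model: any tf.keras.Model mapping raw 1 s audio to 200 per-class outputs.
        # Huber loss between predictions and the multi-hot ground truth, optimized
        # with Adam. The learning-rate value here is a placeholder; the paper only
        # says the learning rate was tuned.
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                      loss=tf.keras.losses.Huber())
        return model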
Table 1: Comparison of the proposed architectures on the mean average precision (mAP) metric. Even baseline Transformer architectures without any convolutional layers outperform widely used CNN architectures for acoustic scene understanding [21] by significant margins.
Neural Model Architecture                        mAP     # Param
CRNN [21]                                        0.417   0.96M
VGG-like [21]                                    0.434   0.27M
ResNet-18 [21]                                   0.373   11.3M
DenseNet-121 [21]                                0.425   12.5M
Small Transformer                                0.469   0.9M
Large 6-Layer Transformer                        0.525   2.3M
Large Transformer with multi-scale filters       ?       2.3M
Large 6-Layer Transformer with Pooling           0.537   2.3M
4.1. What the front end learns
We follow a strategy similar to that described in [25] to understand what the filters learn. We deploy the same front end in our work, which is again a feed-forward layer consisting of 2048 neurons, followed by a 64-dim dense layer. This is analogous to obtaining a mel-like representation that is learnable end-to-end. After training, we take the first layer and sort the filters according to the peaks of their Fourier representation. We see that it manages to learn a non-linear, non-constant bandwidth filterbank, as seen in Figure 3. We also see that, when using the same front end for two different applications, namely pitch estimation and acoustic scene understanding, the shape and the resolution of the learned filter-bank are different. In addition, we can also see a step-wise pattern, which shows multiple filters assigned to the same frequency bin to account for the phase variations of the input signals. Figure 4 depicts a few chosen filters for the sake of discussion here. We observe a variety of behaviors that can be interpreted from a signal processing perspective, and that take into account the characteristics of the input signal, i.e. frequency, timbre, and energy. We can see, in the center of the top row, that a filter learns a pure sinusoidal basis of a certain frequency. Furthermore, it also manages to learn a windowing function that closely resembles a Hanning/Hamming window. The filters in the left column, at the top and bottom, are characteristic of onset detectors, for a slow and a rapid onset respectively. Further, the filter present in the second row, third column shows a slowly moving signal, which may be latching onto the overall energy envelope of a signal for certain characteristic sounds. It is exciting and interesting to see these correlations to traditional signal processing ideas present in these filters.
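A sketch of the sorting procedure used for this visualization; the weight-matrix orientation and the function name are our assumptions.

    import numpy as np

    def sort_filters_by_peak(W1, sr=16000):
        # W1: (400, 2048) weights of the first front-end layer; each column is one
        # learned 25 ms filter. Sort filters by the frequency of the dominant peak
        # of their magnitude spectrum, as used for the filter-bank visualization.
        filters = W1.T                                    # (2048, 400)
        spectra = np.abs(np.fft.rfft(filters, axis=1))
        peak_bins = spectra.argmax(axis=1)
        order = np.argsort(peak_bins)
        peak_hz = peak_bins[order] * sr / filters.shape[1]
        return filters[order], peak_hz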
Figure 4: Filters learned from the first layer of the front end show strong correlations to signal processing concepts, particularly learning sinusoidal signals, onset detectors, energy envelopes, and windowing functions.
5. CONCLUSION & FUTURE WORK
We have shown how a Transformer architecture without any convolutional filters can be adapted for large scale audio understanding. This work shows considerable promise, as it has outperformed convolutional architecture based benchmarks by a significant margin. We show that our model can learn a time-frequency front end that is adaptable to the particular problem of interest, in this case large scale audio understanding. There are several possible research directions ahead. With advancements in Transformer architectures such as switch transformers [30] and sparse transformers [31], these results should further improve. Additionally, with the success of unsupervised representation learning architectures for audio [1], it will be interesting to do large scale pre-training to build robust audio representations. It will also be useful to explore a wider search over hyper-parameters to increase the reported precision scores.

6. REFERENCES
[1] P. Verma and J. Smith, “A framework for contrastive and
generative learning of audio representations,” arXiv preprint
arXiv:2010.11459, 2020.
[2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning
for image recognition,” in Proceedings of the IEEE conference
on computer vision and pattern recognition, 2016, pp. 770–
778.
[3] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen,
W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio
set: An ontology and human-labeled dataset for audio events,”
in 2017 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780.
[4] A. Madani, B. McCann, N. Naik, N. S. Keskar, N. Anand,
R. R. Eguchi, P.-S. Huang, and R. Socher, “Progen: Lan-
guage modeling for protein generation,” arXiv preprint
arXiv:2004.03497, 2020.
[5] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Ka-
plan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry,
A. Askell, et al., “Language models are few-shot learners,”
arXiv preprint arXiv:2005.14165, 2020.
[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert:
Pre-training of deep bidirectional transformers for language
understanding,” arXiv preprint arXiv:1810.04805, 2018.
[7] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, I. Si-
mon, C. Hawthorne, A. M. Dai, M. D. Hoffman, M. Din-
culescu, and D. Eck, “Music transformer,” arXiv preprint
arXiv:1809.04281, 2018.
[8] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid,
“Videobert: A joint model for video and language representa-
tion learning,” in Proceedings of the IEEE/CVF International
Conference on Computer Vision, 2019, pp. 7464–7473.
[9] R. Girdhar, J. Carreira, C. Doersch, and A. Zisserman, “Video
action transformer network,” in Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition,
2019, pp. 244–253.
[10] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn,
X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer,
G. Heigold, S. Gelly, et al., “An image is worth 16x16 words:
Transformers for image recognition at scale,” arXiv preprint
arXiv:2010.11929, 2020.
[11] N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer,
A. Ku, and D. Tran, “Image transformer,” in International
Conference on Machine Learning. PMLR, 2018, pp. 4055–
4064.
[12] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and
I. Sutskever, “Jukebox: A generative model for music,” arXiv
preprint arXiv:2005.00341, 2020.
[13] P. Verma and J. O. Smith, “Neural style transfer for audio
spectograms,” arXiv preprint arXiv:1801.01589, 2018.
[14] P. Verma, C. Chafe, and J. Berger, “Neuralogram: A deep
neural network based representation for audio signals,” arXiv
preprint arXiv:1904.05073, 2019.
[15] A. v. d. Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” arXiv preprint arXiv:1711.00937, 2017.
[16] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec
2.0: A framework for self-supervised learning of speech rep-
resentations,” arXiv preprint arXiv:2006.11477, 2020.
[17] A. Baevski, S. Schneider, and M. Auli, “vq-wav2vec: Self-
supervised learning of discrete speech representations,” arXiv
preprint arXiv:1910.05453, 2019.
[18] A. Haque, M. Guo, and P. Verma, “Conditional end-to-end
audio transforms,” arXiv preprint arXiv:1804.00047, 2018.
[19] A. Haque, M. Guo, P. Verma, and L. Fei-Fei, “Audio-linguistic
embeddings for spoken sentences,” in ICASSP 2019-2019
IEEE International Conference on Acoustics, Speech and Sig-
nal Processing (ICASSP). IEEE, 2019, pp. 7355–7359.
[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones,
A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all
you need,” arXiv preprint arXiv:1706.03762, 2017.
[21] E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra,
“Fsd50k: an open dataset of human-labeled sound events,”
arXiv preprint arXiv:2010.00475, 2020.
[22] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haber-
land, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson,
W. Weckesser, J. Bright, et al., “Scipy 1.0: fundamental al-
gorithms for scientific computing in python,” Nature methods,
vol. 17, no. 3, pp. 261–272, 2020.
[23] G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M.
Rush, “Opennmt: Open-source toolkit for neural machine
translation,” in Proc. ACL, 2017. [Online]. Available:
https://doi.org/10.18653/v1/P17-4012
[24] J. Berger, R. R. Coifman, and M. J. Goldberg, “Removing
noise from music using local trigonometric bases and wavelet
packets,” Journal of the Audio Engineering Society, vol. 42,
no. 10, pp. 808–818, 1994.
[25] P. Verma and R. W. Schafer, “Frequency estimation from
waveforms using multi-layered neural networks.” in INTER-
SPEECH, 2016, pp. 2165–2169.
[26] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-
Fei, “Imagenet: A large-scale hierarchical image database,” in
2009 IEEE conference on computer vision and pattern recog-
nition. Ieee, 2009, pp. 248–255.
[27] A. Tamkin, D. Jurafsky, and N. Goodman, “Language through
a prism: A spectral approach for multiscale language repre-
sentations,” Advances in Neural Information Processing Sys-
tems, vol. 33, 2020.
[28] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean,
M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., “Ten-
sorflow: A system for large-scale machine learning,” in 12th
{USENIX} symposium on operating systems design and im-
plementation ({OSDI} 16), 2016, pp. 265–283.
[29] D. P. Kingma and J. Ba, “Adam: A method for stochastic op-
timization,” arXiv preprint arXiv:1412.6980, 2014.
[30] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers:
Scaling to trillion parameter models with simple and
efficient sparsity,” Jan 2021. [Online]. Available: https:
//arxiv.org/abs/2101.03961
[31] R. Child, S. Gray, A. Radford, and I. Sutskever, “Generat-
ing long sequences with sparse transformers,” arXiv preprint
arXiv:1904.10509, 2019.