-
Centrality dependence of Lévy-stable two-pion Bose-Einstein correlations in $\sqrt{s_{_{NN}}}=200$ GeV Au$+$Au collisions
Authors:
PHENIX Collaboration,
N. J. Abdulameer,
U. Acharya,
A. Adare,
C. Aidala,
N. N. Ajitanand,
Y. Akiba,
R. Akimoto,
H. Al-Ta'ani,
J. Alexander,
A. Angerami,
K. Aoki,
N. Apadula,
Y. Aramaki,
H. Asano,
E. C. Aschenauer,
E. T. Atomssa,
T. C. Awes,
B. Azmoun,
V. Babintsev,
M. Bai,
B. Bannier,
K. N. Barish,
B. Bassalleck,
S. Bathe
, et al. (377 additional authors not shown)
Abstract:
The PHENIX experiment measured the centrality dependence of two-pion Bose-Einstein correlation functions in $\sqrt{s_{_{NN}}}=200$ GeV Au$+$Au collisions at the Relativistic Heavy Ion Collider at Brookhaven National Laboratory. The data are well represented by Lévy-stable source distributions. The extracted source parameters are the correlation-strength parameter $λ$, the Lévy index of stability $α$, and the Lévy-scale parameter $R$ as a function of transverse mass $m_T$ and centrality. The $λ(m_T)$ parameter is constant at larger values of $m_T$, but decreases as $m_T$ decreases. The Lévy scale parameter $R(m_T)$ decreases with $m_T$ and exhibits proportionality to the length scale of the nuclear overlap region. The Lévy exponent $α(m_T)$ is independent of $m_T$ within uncertainties in each investigated centrality bin, but shows a clear centrality dependence. At all centralities, the Lévy exponent $α$ is significantly different from that of Gaussian ($α=2$) or Cauchy ($α=1$) source distributions. Comparisons to the predictions of Monte-Carlo simulations of resonance-decay chains show that in all but the most peripheral centrality class (50%-60%), the obtained results are inconsistent with the measurements, unless a significant reduction of the in-medium mass of the $η'$ meson is included. In each centrality class, the best value of the in-medium $η'$ mass is compared to the mass of the $η$ meson, as well as to several theoretical predictions that consider restoration of $U_A(1)$ symmetry in hot hadronic matter.
Submitted 11 July, 2024;
originally announced July 2024.
-
Few-electron highly charged muonic Ar atoms verified by electronic $K$ x rays
Authors:
T. Okumura,
T. Azuma,
D. A. Bennett,
W. B. Doriese,
M. S. Durkin,
J. W. Fowler,
J. D. Gard,
T. Hashimoto,
R. Hayakawa,
Y. Ichinohe,
P. Indelicato,
T. Isobe,
S. Kanda,
D. Kato,
M. Katsuragawa,
N. Kawamura,
Y. Kino,
N. Kominato,
Y. Miyake,
K. M. Morgan,
H. Noda,
G. C. O'Neil,
S. Okada,
K. Okutsu,
N. Paul
, et al. (18 additional authors not shown)
Abstract:
Electronic $K$ x rays emitted by muonic Ar atoms in the gas phase were observed using a superconducting transition-edge-sensor microcalorimeter. The high-precision energy spectra provided a clear signature of the presence of muonic atoms accompanied by a few electrons, which have never been observed before. One-, two-, and three-electron bound, i.e., H-like, He-like, and Li-like, muonic Ar atoms were identified from electronic $K$ x rays and hyper-satellite $K$ x rays. These $K$ x rays are emitted after the charge transfer process by the collisions with surrounding Ar atoms. With the aid of theoretical calculations, we confirmed that the peak positions are consistent with the x-ray energies from highly charged Cl ions, and the intensities reflecting deexcitation dynamics were successfully understood by taking into account the interaction between the muon and bound electrons.
Submitted 10 July, 2024;
originally announced July 2024.
-
Multi-Convformer: Extending Conformer with Multiple Convolution Kernels
Authors:
Darshan Prabhu,
Yifan Peng,
Preethi Jyothi,
Shinji Watanabe
Abstract:
Convolutions have become essential in state-of-the-art end-to-end Automatic Speech Recognition (ASR) systems due to their efficient modelling of local context. Notably, their use in Conformers has led to superior performance compared to vanilla Transformer-based ASR systems. While components other than the convolution module in the Conformer have been reexamined, altering the convolution module itself has been far less explored. Towards this, we introduce Multi-Convformer that uses multiple convolution kernels within the convolution module of the Conformer in conjunction with gating. This improves the modeling of local dependencies at varying granularities. Our model rivals existing Conformer variants such as CgMLP and E-Branchformer in performance, while being more parameter efficient. We empirically compare our approach with Conformer and its variants across four different datasets and three different modelling paradigms and show up to 8% relative word error rate (WER) improvements.
Submitted 4 July, 2024;
originally announced July 2024.
-
Towards Robust Speech Representation Learning for Thousands of Languages
Authors:
William Chen,
Wangyou Zhang,
Yifan Peng,
Xinjian Li,
Jinchuan Tian,
Jiatong Shi,
Xuankai Chang,
Soumi Maiti,
Karen Livescu,
Shinji Watanabe
Abstract:
Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data. However, models are still far from supporting the world's 7000+ languages. We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages, extending the language coverage of SSL models 4-fold. We combine 1 million hours of speech from existing publicly accessible corpora with a newly created corpus of 7400+ hours from 4057 languages, which will be publicly released. To handle the diverse conditions of multilingual speech data, we augment the typical SSL masked prediction approach with a novel dereverberation objective, increasing robustness. We evaluate XEUS on several benchmarks, and show that it consistently outperforms or achieves comparable results to state-of-the-art (SOTA) SSL models across a variety of tasks. XEUS sets a new SOTA on the ML-SUPERB benchmark: it outperforms MMS 1B and w2v-BERT 2.0 v2 by 0.8% and 4.4% respectively, despite having fewer parameters or less pre-training data. Checkpoints, code, and data are available at https://www.wavlab.org/activities/2024/xeus/.
Submitted 2 July, 2024; v1 submitted 30 June, 2024;
originally announced July 2024.
-
Beyond Silence: Bias Analysis through Loss and Asymmetric Approach in Audio Anti-Spoofing
Authors:
Hye-jin Shim,
Md Sahidullah,
Jee-weon Jung,
Shinji Watanabe,
Tomi Kinnunen
Abstract:
Current trends in audio anti-spoofing detection research strive to improve models' ability to generalize across unseen attacks by learning to identify a variety of spoofing artifacts. This emphasis has primarily focused on the spoof class. Recently, several studies have noted that the distribution of silence differs between the two classes, which can serve as a shortcut. In this paper, we extend class-wise interpretations beyond silence. We employ loss analysis and asymmetric methodologies to move away from traditional attack-focused and result-oriented evaluations towards a deeper examination of model behaviors. Our investigations highlight the significant differences in training dynamics between the two classes, emphasizing the need for future research to focus on robust modeling of the bonafide class.
Submitted 24 June, 2024;
originally announced June 2024.
-
Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss
Authors:
Muhammad Shakeel,
Yui Sudo,
Yifan Peng,
Shinji Watanabe
Abstract:
Contextualized end-to-end automatic speech recognition has been an active research area, with recent efforts focusing on the implicit learning of contextual phrases based on the final loss objective. However, these approaches ignore the useful contextual knowledge encoded in the intermediate layers. We hypothesize that employing explicit biasing loss as an auxiliary task in the encoder intermediate layers may better align text tokens or audio frames with the desired objectives. Our proposed intermediate biasing loss brings more regularization and contextualization to the network. Our method outperforms a conventional contextual biasing baseline on the LibriSpeech corpus, achieving a relative improvement of 22.5% in biased word error rate (B-WER) and up to 44% compared to the non-contextual baseline with a biasing list size of 100. Moreover, employing RNN-transducer-driven joint decoding further reduces the unbiased word error rate (U-WER), resulting in a more robust network.
Submitted 23 June, 2024;
originally announced June 2024.
-
Decoder-only Architecture for Streaming End-to-end Speech Recognition
Authors:
Emiru Tsunoo,
Hayato Futami,
Yosuke Kashiwagi,
Siddhant Arora,
Shinji Watanabe
Abstract:
Decoder-only language models (LMs) have been successfully adopted for speech-processing tasks including automatic speech recognition (ASR). The LMs have ample expressiveness and perform efficiently. This efficiency is a suitable characteristic for streaming applications of ASR. In this work, we propose to use a decoder-only architecture for blockwise streaming ASR. In our approach, speech features are compressed using CTC output and context embedding using blockwise speech subnetwork, and are sequentially provided as prompts to the decoder. The decoder estimates the output tokens promptly at each block. To this end, we also propose a novel training scheme using random-length prefix prompts to make the model robust to the truncated prompts caused by blockwise processing. An experimental comparison shows that our proposed decoder-only streaming ASR achieves 8% relative word error rate reduction in the LibriSpeech test-other set while being twice as fast as the baseline model.
Submitted 23 June, 2024;
originally announced June 2024.
-
Diffusion-based Generative Modeling with Discriminative Guidance for Streamable Speech Enhancement
Authors:
Chenda Li,
Samuele Cornell,
Shinji Watanabe,
Yanmin Qian
Abstract:
Diffusion-based generative models (DGMs) have recently attracted attention in speech enhancement (SE) research as previous works showed a remarkable generalization capability. However, DGMs are also computationally intensive, as they usually require many iterations in the reverse diffusion process (RDP), making them impractical for streaming SE systems. In this paper, we propose to use discriminative scores from discriminative models in the first steps of the RDP. These discriminative scores require only one forward pass with the discriminative model for multiple RDP steps, thus greatly reducing computations. This approach also allows for performance improvements. We show that we can trade off between generative and discriminative capabilities as the number of steps with the discriminative score increases. Furthermore, we propose a novel streamable time-domain generative model with an algorithmic latency of 50 ms, which has no significant performance degradation compared to offline models.
Submitted 19 June, 2024;
originally announced June 2024.
-
Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting
Authors:
Yosuke Kashiwagi,
Hayato Futami,
Emiru Tsunoo,
Siddhant Arora,
Shinji Watanabe
Abstract:
End-to-end multilingual speech recognition models handle multiple languages through a single model, often incorporating language identification to automatically detect the language of incoming speech. Since the common scenario is where the language is already known, these models can perform as language-specific by using language information as prompts, which is particularly beneficial for attention-based encoder-decoder architectures. However, the Connectionist Temporal Classification (CTC) approach, which enhances recognition via joint decoding and multi-task training, does not normally incorporate language prompts due to its conditionally independent output tokens. To overcome this, we introduce an encoder prompting technique within the self-conditioned CTC framework, enabling language-specific adaptation of the CTC model in a zero-shot manner. Our method has been shown to significantly reduce errors by 28% on average and by 41% on low-resource languages.
Submitted 18 June, 2024;
originally announced June 2024.
-
Finding Task-specific Subnetworks in Multi-task Spoken Language Understanding Model
Authors:
Hayato Futami,
Siddhant Arora,
Yosuke Kashiwagi,
Emiru Tsunoo,
Shinji Watanabe
Abstract:
Recently, multi-task spoken language understanding (SLU) models have emerged, designed to address various speech processing tasks. However, these models often rely on a large number of parameters. Also, they often encounter difficulties in adapting to new data for a specific task without experiencing catastrophic forgetting of previously trained tasks. In this study, we propose finding task-specific subnetworks within a multi-task SLU model via neural network pruning. In addition to model compression, we expect that the forgetting of previously trained tasks can be mitigated by updating only a task-specific subnetwork. We conduct experiments on top of the state-of-the-art multi-task SLU model "UniverSLU", trained for several tasks such as emotion recognition (ER), intent classification (IC), and automatic speech recognition (ASR). We show that pruned models were successful in adapting to additional ASR or IC data with minimal performance degradation on previously trained tasks.
Submitted 18 June, 2024;
originally announced June 2024.
-
Review and Prospect of Algebraic Research in Equivalent Framework between Statistical Mechanics and Machine Learning Theory
Authors:
Sumio Watanabe
Abstract:
Mathematical equivalence between statistical mechanics and machine learning theory has been known since the 20th century, and research based on this equivalence has provided novel methodologies in both theoretical physics and statistical learning theory. For example, algebraic approaches in statistical mechanics, such as operator algebra, enable us to analyze phase transition phenomena mathematically. In this paper, for theoretical physicists who are interested in artificial intelligence, we review algebraic research in machine learning theory and discuss its prospects. If a learning machine has a hierarchical structure or latent variables, then the random Hamiltonian cannot be expressed by any quadratic perturbation because it has singularities. To study an equilibrium state defined by such a singular random Hamiltonian, an algebraic approach is necessary to derive the asymptotic form of the free energy and the generalization error. We also introduce the most recent advance: a theoretical foundation for the alignment of artificial intelligence is now being constructed based on algebraic learning theory. This paper is dedicated to the memory of Professor Huzihiro Araki, a pioneering founder of algebraic research in both statistical mechanics and quantum field theory.
Submitted 17 June, 2024; v1 submitted 31 May, 2024;
originally announced June 2024.
-
On the Evaluation of Speech Foundation Models for Spoken Language Understanding
Authors:
Siddhant Arora,
Ankita Pasad,
Chung-Ming Chien,
Jionghao Han,
Roshan Sharma,
Jee-weon Jung,
Hira Dhamyal,
William Chen,
Suwon Shon,
Hung-yi Lee,
Karen Livescu,
Shinji Watanabe
Abstract:
The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking of complex spoken language understanding (SLU) tasks, including both classification and sequence generation tasks, on natural speech. The benchmark has demonstrated preliminary success in using pre-trained speech foundation models (SFM) for these SLU tasks. However, the community still lacks a fine-grained understanding of the comparative utility of different SFMs. Inspired by this, we ask: which SFMs offer the most benefits for these complex SLU tasks, and what is the most effective approach for incorporating these SFMs? To answer this, we perform an extensive evaluation of multiple supervised and self-supervised SFMs using several evaluation protocols: (i) frozen SFMs with a lightweight prediction head, (ii) frozen SFMs with a complex prediction head, and (iii) fine-tuned SFMs with a lightweight prediction head. Although the supervised SFMs are pre-trained on much more speech recognition data (with labels), they do not always outperform self-supervised SFMs; the latter tend to perform at least as well as, and sometimes better than, supervised SFMs, especially on the sequence generation tasks in SLUE. While there is no universally optimal way of incorporating SFMs, the complex prediction head gives the best performance for most tasks, although it increases the inference time. We also introduce an open-source toolkit and performance leaderboard, SLUE-PERB, for these tasks and modeling strategies.
Submitted 14 June, 2024;
originally announced June 2024.
-
MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model
Authors:
Jiatong Shi,
Xutai Ma,
Hirofumi Inaguma,
Anna Sun,
Shinji Watanabe
Abstract:
Speech discrete representation has proven effective in various downstream applications due to its superior compression rate of the waveform, fast convergence during training, and compatibility with other modalities. Discrete units extracted from self-supervised learning (SSL) models have emerged as a prominent approach for obtaining speech discrete representation. However, while discrete units have shown effectiveness compared to spectral features, they still lag behind continuous SSL representations. In this work, we propose MMM, a multi-layer multi-residual multi-stream discrete units extraction method from SSL. Specifically, we introduce iterative residual vector quantization with K-means for different layers in an SSL model to extract multi-stream speech discrete representation. Through extensive experiments in speech recognition, speech resynthesis, and text-to-speech, we demonstrate that the proposed MMM can surpass or be on par with the performance of neural codecs under various conditions.
Submitted 14 June, 2024;
originally announced June 2024.
-
DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding
Authors:
Suwon Shon,
Kwangyoun Kim,
Yi-Te Hsu,
Prashant Sridhar,
Shinji Watanabe,
Karen Livescu
Abstract:
The integration of pre-trained text-based large language models (LLM) with speech input has enabled instruction-following capabilities for diverse speech tasks. This integration requires the use of a speech encoder, a speech adapter, and an LLM, trained on diverse tasks. We propose the use of discrete speech units (DSU), rather than continuous-valued speech encoder outputs, that are converted to the LLM token embedding space using the speech adapter. We generate DSU using a self-supervised speech encoder followed by k-means clustering. The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering. We also explore various types of DSU extracted from different layers of the self-supervised speech encoder, as well as Mel frequency Cepstral Coefficients (MFCC). Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
Submitted 13 June, 2024;
originally announced June 2024.
-
On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models
Authors:
Jinchuan Tian,
Yifan Peng,
William Chen,
Kwanghee Choi,
Karen Livescu,
Shinji Watanabe
Abstract:
The Open Whisper-style Speech Model (OWSM) series was introduced to achieve full transparency in building advanced speech-to-text (S2T) foundation models. To this end, OWSM models are trained on 25 public speech datasets, which are heterogeneous in multiple ways. In this study, we advance the OWSM series by introducing OWSM v3.2, which improves on prior models by investigating and addressing the impacts of this data heterogeneity. Our study begins with a detailed analysis of each dataset, from which we derive two key strategies: data filtering with proxy task to enhance data quality, and the incorporation of punctuation and true-casing using an open large language model (LLM). With all other configurations staying the same, OWSM v3.2 improves performance over the OWSM v3.1 baseline while using 15% less training data.
Submitted 13 June, 2024;
originally announced June 2024.
-
VISinger2+: End-to-End Singing Voice Synthesis Augmented by Self-Supervised Learning Representation
Authors:
Yifeng Yu,
Jiatong Shi,
Yuning Wu,
Shinji Watanabe
Abstract:
Singing Voice Synthesis (SVS) has witnessed significant advancements with the advent of deep learning techniques. However, a significant challenge in SVS is the scarcity of labeled singing voice data, which limits the effectiveness of supervised learning methods. In response to this challenge, this paper introduces a novel approach to enhance the quality of SVS by leveraging unlabeled data from pre-trained self-supervised learning models. Building upon the existing VISinger2 framework, this study integrates additional spectral feature information into the system to enhance its performance. The integration aims to harness the rich acoustic features from the pre-trained models, thereby enriching the synthesis and yielding a more natural and expressive singing voice. Experimental results in various corpora demonstrate the efficacy of this approach in improving the overall quality of synthesized singing voices in both objective and subjective metrics.
Submitted 12 June, 2024;
originally announced June 2024.
-
ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets
Authors:
Jiatong Shi,
Shih-Heng Wang,
William Chen,
Martijn Bartelds,
Vanya Bannihatti Kumar,
Jinchuan Tian,
Xuankai Chang,
Dan Jurafsky,
Karen Livescu,
Hung-yi Lee,
Shinji Watanabe
Abstract:
ML-SUPERB evaluates self-supervised learning (SSL) models on the tasks of language identification and automatic speech recognition (ASR). This benchmark treats the models as feature extractors and uses a single shallow downstream model, which can be fine-tuned for a downstream task. However, real-world use cases may require different configurations. This paper presents ML-SUPERB 2.0, which is a new benchmark for evaluating pre-trained SSL and supervised speech models across downstream models, fine-tuning setups, and efficient model adaptation approaches. We find performance improvements over the setup of ML-SUPERB. However, performance depends on the downstream model design. Also, we find large performance differences between languages and datasets, suggesting the need for more targeted approaches to improve multilingual ASR performance.
Submitted 12 June, 2024;
originally announced June 2024.
-
Self-Supervised Speech Representations are More Phonetic than Semantic
Authors:
Kwanghee Choi,
Ankita Pasad,
Tomohiko Nakamura,
Satoru Fukayama,
Karen Livescu,
Shinji Watanabe
Abstract:
Self-supervised speech models (S3Ms) have become an effective backbone for speech applications. Various analyses suggest that S3Ms encode linguistic properties. In this work, we seek a more fine-grained analysis of the word-level linguistic properties encoded in S3Ms. Specifically, we curate a novel dataset of near homophone (phonetically similar) and synonym (semantically similar) word pairs and measure the similarities between S3M word representation pairs. Our study reveals that S3M representations consistently and significantly exhibit more phonetic than semantic similarity. Further, we question whether widely used intent classification datasets such as Fluent Speech Commands and Snips Smartlights are adequate for measuring semantic abilities. Our simple baseline, using only the word identity, surpasses S3M-based models. This corroborates our findings and suggests that high scores on these datasets do not necessarily guarantee the presence of semantic content.
Submitted 12 June, 2024;
originally announced June 2024.
-
Neural Blind Source Separation and Diarization for Distant Speech Recognition
Authors:
Yoshiaki Bando,
Tomohiko Nakamura,
Shinji Watanabe
Abstract:
This paper presents a neural method for distant speech recognition (DSR) that jointly separates and diarizes speech mixtures without supervision by isolated signals. A standard separation method for multi-talker DSR is a statistical multichannel method called guided source separation (GSS). While GSS does not require signal-level supervision, it relies on speaker diarization results to handle unknown numbers of active speakers. To overcome this limitation, we introduce and train a neural inference model in a weakly-supervised manner, employing the objective function of a statistical separation method. This training requires only multichannel mixtures and their temporal annotations of speaker activities. In contrast to GSS, the trained model can jointly separate and diarize speech mixtures without any auxiliary information. The experiments with the AMI corpus show that our method outperforms GSS with oracle diarization results regarding word error rates. The code is available online.
Submitted 12 June, 2024;
originally announced June 2024.
-
Jet modification via $π^0$-hadron correlations in Au$+$Au collisions at $\sqrt{s_{_{NN}}}=200$ GeV
Authors:
PHENIX Collaboration,
N. J. Abdulameer,
U. Acharya,
A. Adare,
S. Afanasiev,
C. Aidala,
N. N. Ajitanand,
Y. Akiba,
H. Al-Bataineh,
J. Alexander,
M. Alfred,
K. Aoki,
N. Apadula,
L. Aphecetche,
J. Asai,
H. Asano,
E. T. Atomssa,
R. Averbeck,
T. C. Awes,
B. Azmoun,
V. Babintsev,
M. Bai,
G. Baksay,
L. Baksay,
A. Baldisseri
, et al. (510 additional authors not shown)
Abstract:
High-momentum two-particle correlations are a useful tool for studying jet-quenching effects in the quark-gluon plasma. Angular correlations between neutral-pion triggers and charged hadrons with transverse momenta in the range 4--12~GeV/$c$ and 0.5--7~GeV/$c$, respectively, have been measured by the PHENIX experiment in 2014 for Au$+$Au collisions at $\sqrt{s_{_{NN}}}=200$~GeV. Suppression is observed in the yield of high-momentum jet fragments opposite the trigger particle, which indicates jet suppression stemming from in-medium partonic energy loss, while enhancement is observed for low-momentum particles. The ratio and differences between the yield in Au$+$Au collisions and $p$$+$$p$ collisions, $I_{AA}$ and $Δ_{AA}$, as a function of the trigger-hadron azimuthal separation, $Δφ$, are measured for the first time at the Relativistic Heavy Ion Collider. These results better quantify how the yield of low-$p_T$ associated hadrons is enhanced at wide angle, which is crucial for studying energy loss as well as medium-response effects.
Submitted 12 June, 2024;
originally announced June 2024.
-
The Interspeech 2024 Challenge on Speech Processing Using Discrete Units
Authors:
Xuankai Chang,
Jiatong Shi,
Jinchuan Tian,
Yuning Wu,
Yuxun Tang,
Yihan Wu,
Shinji Watanabe,
Yossi Adi,
Xie Chen,
Qin Jin
Abstract:
Representing speech and audio signals in discrete units has become a compelling alternative to traditional high-dimensional feature vectors. Numerous studies have highlighted the efficacy of discrete units in various applications such as speech compression and restoration, speech recognition, and speech generation. To foster exploration in this domain, we introduce the Interspeech 2024 Challenge, which focuses on new speech processing benchmarks using discrete units. It encompasses three pivotal tasks, namely multilingual automatic speech recognition, text-to-speech, and singing voice synthesis, and aims to assess the potential applicability of discrete units in these tasks. This paper outlines the challenge designs and baseline descriptions. We also collate baseline and selected submission systems, along with preliminary findings, offering valuable contributions to future research in this evolving field.
Submitted 11 June, 2024;
originally announced June 2024.
-
EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation
Authors:
Julius Richter,
Yi-Chiao Wu,
Steven Krenn,
Simon Welker,
Bunlong Lay,
Shinji Watanabe,
Alexander Richard,
Timo Gerkmann
Abstract:
We release the EARS (Expressive Anechoic Recordings of Speech) dataset, a high-quality speech dataset comprising 107 speakers from diverse backgrounds, totaling 100 hours of clean, anechoic speech data. The dataset covers a large range of different speaking styles, including emotional speech, different reading styles, non-verbal sounds, and conversational freeform speech. We benchmark various methods for speech enhancement and dereverberation on the dataset and evaluate their performance through a set of instrumental metrics. In addition, we conduct a listening test with 20 participants for the speech enhancement task, where a generative method is preferred. We introduce a blind test set that allows for automatic online evaluation of uploaded data. Dataset download links and automatic evaluation server can be found online.
Submitted 11 June, 2024; v1 submitted 10 June, 2024;
originally announced June 2024.
-
To what extent can ASV systems naturally defend against spoofing attacks?
Authors:
Jee-weon Jung,
Xin Wang,
Nicholas Evans,
Shinji Watanabe,
Hye-jin Shim,
Hemlata Tak,
Sidhhant Arora,
Junichi Yamagishi,
Joon Son Chung
Abstract:
The current automatic speaker verification (ASV) task involves making binary decisions on two types of trials: target and non-target. However, emerging advancements in speech generation technology pose significant threats to the reliability of ASV systems. This study investigates whether ASV effortlessly acquires robustness against spoofing attacks (i.e., zero-shot capability) by systematically exploring diverse ASV systems and spoofing attacks, ranging from traditional to cutting-edge techniques. Through extensive analyses conducted on eight distinct ASV systems and 29 spoofing attack systems, we demonstrate that the evolution of ASV inherently incorporates defense mechanisms against spoofing attacks. Nevertheless, our findings also underscore that the advancement of spoofing attacks far outpaces that of ASV systems, hence necessitating further research on spoofing-robust ASV methodologies.
Submitted 14 June, 2024; v1 submitted 7 June, 2024;
originally announced June 2024.
-
URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement
Authors:
Wangyou Zhang,
Robin Scheibler,
Kohei Saijo,
Samuele Cornell,
Chenda Li,
Zhaoheng Ni,
Anurag Kumar,
Jan Pirklbauer,
Marvin Sach,
Shinji Watanabe,
Tim Fingscheidt,
Yanmin Qian
Abstract:
The last decade has witnessed significant advancements in deep learning-based speech enhancement (SE). However, most existing SE research has limitations on the coverage of SE sub-tasks, data diversity and amount, and evaluation metrics. To fill this gap and promote research toward universal SE, we establish a new SE challenge, named URGENT, to focus on the universality, robustness, and generalizability of SE. We aim to extend the SE definition to cover different sub-tasks to explore the limits of SE models, starting from denoising, dereverberation, bandwidth extension, and declipping. A novel framework is proposed to unify all these sub-tasks in a single model, allowing the use of all existing SE approaches. We collected public speech and noise data from different domains to construct diverse evaluation data. Finally, we discuss the insights gained from our preliminary baseline experiments based on both generative and discriminative SE methods with 12 curated metrics.
Submitted 7 June, 2024;
originally announced June 2024.
-
Non-Spherical Pauli Forbidden States in Deformed Halo Nuclei: Impact on the ${}^7\mathrm{Be}+p$ Resonant States in the Particle Rotor Model
Authors:
Shin Watanabe,
Antonio M. Moro
Abstract:
Background: An important aspect of reducing nuclear many-body problems to few-body models is the presence of Pauli forbidden (PF) states, which are excluded in fully antisymmetrized calculations. Insufficient treatments of PF states in deformed halo nuclei underscore the need for model refinement.
Purpose: We propose a new method utilizing Nilsson states as PF states in the orthogonality condition model, and investigate the impact of PF states on the properties of resonant states.
Method: We investigate the scattering states of ${}^8\mathrm{B}$ within the Particle Rotor Model (PRM) framework based on a deformed ${}^7\mathrm{Be}$ core and $p$ two-body model. We compare several methods for eliminating PF states and test them with the experimental data.
Results: Our model successfully reproduces the experimental excitation function for elastic scattering cross section by properly eliminating PF states. The same calculation predicts the presence of a low-energy bump in the inelastic scattering excitation function, although its position is overestimated by about 1 MeV compared to experimental data.
Conclusion: This study extends the applicability of the PRM, offering a comprehensive approach for exploring structures and reactions of loosely bound nuclei like ${}^8$B. Future integration with the continuum discretized coupled channels (CDCC) method promises to further advance the research.
Submitted 6 June, 2024;
originally announced June 2024.
-
Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement
Authors:
Wangyou Zhang,
Kohei Saijo,
Jee-weon Jung,
Chenda Li,
Shinji Watanabe,
Yanmin Qian
Abstract:
Deep learning-based speech enhancement (SE) models have achieved impressive performance in the past decade. Numerous advanced architectures have been designed to deliver state-of-the-art performance; however, their scalability potential remains unrevealed. Meanwhile, the majority of research focuses on small-sized datasets with restricted diversity, leading to a plateau in performance improvement. In this paper, we aim to provide new insights for addressing the above issues by exploring the scalability of SE models in terms of architectures, model sizes, compute budgets, and dataset sizes. Our investigation involves several popular SE architectures and speech data from different domains. Experiments reveal both similarities and distinctions between the scaling effects in SE and other tasks such as speech recognition. These findings further provide insights into the under-explored SE directions, e.g., larger-scale multi-domain corpora and efficiently scalable architectures.
Submitted 6 June, 2024;
originally announced June 2024.
-
4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders
Authors:
Yui Sudo,
Muhammad Shakeel,
Yosuke Fukumoto,
Brian Yan,
Jiatong Shi,
Yifan Peng,
Shinji Watanabe
Abstract:
End-to-end automatic speech recognition (E2E-ASR) can be classified into several network architectures, such as connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention-based encoder-decoder, and mask-predict models. Each network architecture has advantages and disadvantages, leading practitioners to switch between these different models depending on application requirements. Instead of building separate models, we propose a joint modeling scheme where four decoders (CTC, RNN-T, attention, and mask-predict) share the same encoder -- we refer to this as 4D modeling. The 4D model is trained using multitask learning, which will bring model regularization and maximize the model robustness thanks to their complementary properties. To efficiently train the 4D model, we introduce a two-stage training strategy that stabilizes multitask learning. In addition, we propose three novel one-pass beam search algorithms by combining three decoders (CTC, RNN-T, and attention) to further improve performance. These three beam search algorithms differ in which decoder is used as the primary decoder. We carefully evaluate the performance and computational tradeoffs associated with each algorithm. Experimental results demonstrate that the jointly trained 4D model outperforms the E2E-ASR models trained with only one individual decoder. Furthermore, we demonstrate that the proposed one-pass beam search algorithm outperforms the previously proposed CTC/attention decoding.
Submitted 5 June, 2024;
originally announced June 2024.
-
Less Peaky and More Accurate CTC Forced Alignment by Label Priors
Authors:
Ruizhe Huang,
Xiaohui Zhang,
Zhaoheng Ni,
Li Sun,
Moto Hira,
Jeff Hwang,
Vimal Manohar,
Vineel Pratap,
Matthew Wiesner,
Shinji Watanabe,
Daniel Povey,
Sanjeev Khudanpur
Abstract:
Connectionist temporal classification (CTC) models are known to have peaky output distributions. Such behavior is not a problem for automatic speech recognition (ASR), but it can cause inaccurate forced alignments (FA), especially at finer granularity, e.g., phoneme level. This paper aims at alleviating the peaky behavior of CTC and improving its suitability for forced alignment generation by leveraging label priors, so that the scores of alignment paths containing fewer blanks are boosted and maximized during training. As a result, our CTC model produces less peaky posteriors and is able to more accurately predict the offset of the tokens besides their onset. It outperforms the standard CTC model and a heuristics-based approach for obtaining CTC's token offset timestamps by 12-40% in phoneme and word boundary errors (PBE and WBE) measured on the Buckeye and TIMIT data. Compared with the most widely used FA toolkit Montreal Forced Aligner (MFA), our method performs similarly on PBE/WBE on Buckeye, yet falls behind MFA on TIMIT. Nevertheless, our method has a much simpler training pipeline and better runtime efficiency. Our training recipe and pretrained model are released in TorchAudio.
Submitted 15 June, 2024; v1 submitted 22 April, 2024;
originally announced June 2024.
-
YODAS: Youtube-Oriented Dataset for Audio and Speech
Authors:
Xinjian Li,
Shinnosuke Takamichi,
Takaaki Saeki,
William Chen,
Sayaka Shiota,
Shinji Watanabe
Abstract:
In this study, we introduce YODAS (YouTube-Oriented Dataset for Audio and Speech), a large-scale, multilingual dataset comprising currently over 500k hours of speech data in more than 100 languages, sourced from both labeled and unlabeled YouTube speech datasets. The labeled subsets, including manual or automatic subtitles, facilitate supervised model training. Conversely, the unlabeled subsets are apt for self-supervised learning applications. YODAS is distinctive as the first publicly available dataset of its scale, and it is distributed under a Creative Commons license. We introduce the collection methodology utilized for YODAS, which contributes to the large-scale speech dataset construction. Subsequently, we provide a comprehensive analysis of the speech and text contained within the dataset. Finally, we describe the speech recognition baselines over the top-15 languages.
Submitted 2 June, 2024;
originally announced June 2024.
-
Cross-Talk Reduction
Authors:
Zhong-Qiu Wang,
Anurag Kumar,
Shinji Watanabe
Abstract:
While far-field multi-talker mixtures are recorded, each speaker can wear a close-talk microphone so that close-talk mixtures can be recorded at the same time. Although each close-talk mixture has a high signal-to-noise ratio (SNR) of the wearer, it has a very limited range of applications, as it also contains significant cross-talk speech by other speakers and is not clean enough. In this context, we propose a novel task named cross-talk reduction (CTR) which aims at reducing cross-talk speech, and a novel solution named CTRnet which is based on unsupervised or weakly-supervised neural speech separation. In unsupervised CTRnet, close-talk and far-field mixtures are stacked as input for a DNN to estimate the close-talk speech of each speaker. It is trained in an unsupervised, discriminative way such that the DNN estimate for each speaker can be linearly filtered to cancel out the speaker's cross-talk speech captured at other microphones. In weakly-supervised CTRnet, we assume the availability of each speaker's activity timestamps during training, and leverage them to improve the training of unsupervised CTRnet. Evaluation results on a simulated two-speaker CTR task and on a real-recorded conversational speech separation and recognition task show the effectiveness and potential of CTRnet.
Submitted 30 May, 2024;
originally announced May 2024.
-
Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation
Authors:
Muhammad Shakeel,
Yui Sudo,
Yifan Peng,
Shinji Watanabe
Abstract:
End-to-end (E2E) automatic speech recognition (ASR) can operate in two modes: streaming and non-streaming, each with its pros and cons. Streaming ASR processes speech frames in real time as they are received, while non-streaming ASR waits for the entire speech utterance; thus, professionals may have to operate in either mode to satisfy their application. In this work, we present joint optimization of streaming and non-streaming ASR based on multi-decoder and knowledge distillation. Primarily, we study 1) the encoder integration of these ASR modules, followed by 2) separate decoders to make the switching mode flexible, and enhancing performance by 3) incorporating similarity-preserving knowledge distillation between the two modular encoders and decoders. Evaluation results show 2.6%-5.3% relative character error rate reductions (CERR) on CSJ for streaming ASR, and 8.3%-9.7% relative CERRs for non-streaming ASR within a single model compared to multiple standalone modules.
Submitted 22 May, 2024;
originally announced May 2024.
-
Contextualized Automatic Speech Recognition with Dynamic Vocabulary
Authors:
Yui Sudo,
Yosuke Fukumoto,
Muhammad Shakeel,
Yifan Peng,
Shinji Watanabe
Abstract:
Deep biasing (DB) improves the performance of end-to-end automatic speech recognition (E2E-ASR) for rare words or contextual phrases using a bias list. However, most existing methods treat bias phrases as sequences of subwords in a predefined static vocabulary, which can result in ineffective learning of the dependencies between subwords. More advanced techniques address this problem by incorporating additional text data, which increases the overall workload. This paper proposes a dynamic vocabulary where phrase-level bias tokens can be added during the inference phase. Each bias token represents an entire bias phrase within a single token, thereby eliminating the need to learn the dependencies between the subwords within the bias phrases. This method can be applied to various architectures because it only extends the embedding and output layers in common E2E-ASR architectures. Experimental results demonstrate that the proposed method improves the performance of bias phrases on English and Japanese datasets.
Submitted 22 May, 2024;
originally announced May 2024.
-
Acoustic modeling for Overlapping Speech Recognition: JHU Chime-5 Challenge System
Authors:
Vimal Manohar,
Szu-Jui Chen,
Zhiqi Wang,
Yusuke Fujita,
Shinji Watanabe,
Sanjeev Khudanpur
Abstract:
This paper summarizes our acoustic modeling efforts in the Johns Hopkins University speech recognition system for the CHiME-5 challenge to recognize highly-overlapped dinner party speech recorded by multiple microphone arrays. We explore data augmentation approaches, neural network architectures, front-end speech dereverberation, beamforming and robust i-vector extraction with comparisons of our in-house implementations and publicly available tools. We finally achieved a word error rate of 69.4% on the development set, which is an 11.7% absolute improvement over the previous baseline of 81.1%, and release this improved baseline with refined techniques/tools as an advanced CHiME-5 recipe.
Submitted 17 May, 2024;
originally announced May 2024.
-
Fluctuations in Spin Dynamics Excited by Pulsed Light
Authors:
Tetsuya Sato,
Shinichi Watanabe,
Mamoru Matsuo,
Takeo Kato
Abstract:
We theoretically investigate nonequilibrium spin fluctuations in a ferromagnet induced by a light pulse. Using a Lindblad equation consistent with the Landau-Lifshitz-Gilbert equation, we compute the autocorrelation function of magnetization. Our analysis reveals that this function comprises both thermal and nonequilibrium components. To examine the latter in detail, we introduce a Fano factor similar to nonequilibrium current noise in electronic circuits. We demonstrate that this factor encapsulates insights into the transfer of spin units to the environment. Our findings lay the groundwork for nonequilibrium spin noise spectroscopy, offering valuable insights into spin relaxation dynamics.
Submitted 17 May, 2024;
originally announced May 2024.
-
Unsupervised Work Behavior Pattern Extraction Based on Hierarchical Probabilistic Model
Authors:
Issei Saito,
Tomoaki Nakamura,
Toshiyuki Hatta,
Wataru Fujita,
Shintaro Watanabe,
Shotaro Miwa
Abstract:
Evolving consumer demands and market trends have led to businesses increasingly embracing a production approach that prioritizes flexibility and customization. Consequently, factory workers must engage in tasks that are more complex than before. Thus, productivity depends on each worker's skills in assembling products. Therefore, analyzing the behavior of a worker is crucial for work improvement. However, manual analysis is time-consuming and does not provide quick and accurate feedback. Machine learning methods have been used to automate the analyses; however, most of these methods need several labels for training. To this end, we extend the Gaussian process hidden semi-Markov model (GP-HSMM) to enable the rapid and automated analysis of worker behavior without pre-training. The model does not require labeled data and can automatically and accurately segment continuous motions into motion classes. The proposed model is a probabilistic model that hierarchically connects GP-HSMM and HSMM, enabling the extraction of behavioral patterns with different granularities. Furthermore, it mutually infers the parameters between the GP-HSMM and HSMM, resulting in accurate motion pattern extraction. We applied the proposed method to motion data in which workers assembled products at an actual production site. The accuracy of behavior pattern extraction was evaluated using normalized Levenshtein distance (NLD). The smaller the value of NLD, the more accurate the pattern extraction. The NLD of motion patterns captured by the GP-HSMM and HSMM layers in our proposed method was 0.50 and 0.33, respectively, both smaller than those of the baseline methods.
Submitted 16 May, 2024;
originally announced May 2024.
-
Dynamical and Static Structure Factors in Hedgehog-Antihedgehog Order in Icosahedral 1/1 Approximant Crystal
Authors:
Shinji Watanabe
Abstract:
Recent discoveries of magnetic long-range orders in the icosahedral quasicrystal and topological magnetic structures on the icosahedron (IC) as the hedgehog state and the antihedgehog state have attracted great interest. Here, we report our theoretical analysis of the dynamical as well as static structure of the hedgehog-antihedgehog order in the 1/1 approximant crystal (AC). By constructing the effective magnetic model for the rare-earth based AC, on the basis of the linear spin-wave theory, the excitation energy is shown to exhibit the reciprocal dispersion, as a consequence of preservation of the spatial inversion symmetry by the hedgehog-antihedgehog ordering. The static structure factor is shown to be expressed generally in the convolution form of the lattice structure factor and the magnetic structure factor on the IC(s) and the numerical calculation reveals the extinction rule. The dynamical structure factor shows that the high intensities appear in the low-energy branch along the $Γ$-X line and the R-$Γ$-M line in the reciprocal space.
Submitted 6 May, 2024;
originally announced May 2024.
-
A Large-Scale Evaluation of Speech Foundation Models
Authors:
Shu-wen Yang,
Heng-Jui Chang,
Zili Huang,
Andy T. Liu,
Cheng-I Lai,
Haibin Wu,
Jiatong Shi,
Xuankai Chang,
Hsiang-Sheng Tsai,
Wen-Chin Huang,
Tzu-hsun Feng,
Po-Han Chi,
Yist Y. Lin,
Yung-Sung Chuang,
Tzu-Hsien Huang,
Wei-Cheng Tseng,
Kushal Lakhotia,
Shang-Wen Li,
Abdelrahman Mohamed,
Shinji Watanabe,
Hung-yi Lee
Abstract:
The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific modeling and data annotation. This approach has proven crucial in the field of Natural Language Processing (NLP). However, the speech processing community lacks a similar setup to explore the paradigm systematically. In this work, we establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the paradigm for speech. We propose a unified multi-tasking framework to address speech processing tasks in SUPERB using a frozen foundation model followed by task-specialized, lightweight prediction heads. Combining our results with community submissions, we verify that the foundation model paradigm is promising for speech, and our multi-tasking framework is simple yet effective, as the best-performing foundation model shows competitive generalizability across most SUPERB tasks. For reproducibility and extensibility, we have developed a long-term maintained platform that enables deterministic benchmarking, allows for result sharing via an online leaderboard, and promotes collaboration through a community-driven benchmark database to support new development cycles. Finally, we conduct a series of analyses to offer an in-depth understanding of SUPERB and speech foundation models, including information flows across tasks inside the models, the correctness of the weighted-sum benchmarking protocol and the statistical significance and robustness of the benchmark.
Submitted 29 May, 2024; v1 submitted 14 April, 2024;
originally announced April 2024.
-
Microscale Hydrogen, Carbon, and Nitrogen Isotopic Diversity of Organic Matter in Asteroid Ryugu
Authors:
Larry R Nittler,
Jens Barosch,
Katherine Burgess,
Rhonda M Stroud,
Jianhua Wang,
Hikaru Yabuta,
Yuma Enokido,
Megumi Matsumoto,
Tomoki Nakamura,
Yoko Kebukawa,
Shohei Yamashita,
Yoshio Takahashi,
Laure Bejach,
Lydie Bonal,
George D Cody,
Emmanuel Dartois,
Alexandre Dazzi,
Bradley De Gregorio,
Ariane Deniset-Besseau,
Jean Duprat,
Cécile Engrand,
Minako Hashiguchi,
A. L. David Kilcoyne,
Mutsumi Komatsu,
Zita Martins
, et al. (35 additional authors not shown)
Abstract:
We report the H, C, and N isotopic compositions of microscale (0.2 to 2$μ$m) organic matter in samples of asteroid Ryugu and the Orgueil CI carbonaceous chondrite. Three regolith particles of asteroid Ryugu, returned by the Hayabusa2 spacecraft, and several fragments of Orgueil were analyzed by NanoSIMS isotopic imaging. The isotopic distributions of the Ryugu samples from two different collection spots are closely similar to each other and to the Orgueil samples, strengthening the proposed Ryugu-CI chondrite connection. Most individual sub-$μ$m organic grains have isotopic compositions within error of bulk values, but 2-8% of them are outliers exhibiting large isotopic enrichments or depletions in D, $^{15}$N, and/or $^{13}$C. The H, C and N isotopic compositions of the outliers are not correlated with each other: while some C-rich grains are both D- and $^{15}$N-enriched, many are enriched or depleted in one or the other system. This most likely points to a diversity in isotopic fractionation pathways and thus diversity in the local formation environments for the individual outlier grains. The observation of a relatively small population of isotopic outlier grains can be explained either by escape from nebular and/or parent body homogenization of carbonaceous precursor material or addition of later isotopic outlier grains. The strong chemical similarity of isotopically typical and isotopically outlying grains, as reflected by synchrotron x-ray absorption spectra, suggests a genetic connection and thus favors the former, homogenization scenario. However, the fact that even the least altered meteorites show the same pattern of a small population of outliers on top of a larger population of homogenized grains indicates that some or most of the homogenization occurred prior to accretion of the macromolecular organic grains into asteroidal parent bodies.
Submitted 12 April, 2024;
originally announced April 2024.
-
Development of a data overflow protection system for Super-Kamiokande to maximize data from nearby supernovae
Authors:
M. Mori,
K. Abe,
Y. Hayato,
K. Hiraide,
K. Hosokawa,
K. Ieki,
M. Ikeda,
J. Kameda,
Y. Kanemura,
R. Kaneshima,
Y. Kashiwagi,
Y. Kataoka,
S. Miki,
S. Mine,
M. Miura,
S. Moriyama,
Y. Nakano,
M. Nakahata,
S. Nakayama,
Y. Noguchi,
K. Okamoto,
K. Sato,
H. Sekiya,
H. Shiba,
K. Shimizu
, et al. (230 additional authors not shown)
Abstract:
Neutrinos from very nearby supernovae, such as Betelgeuse, are expected to generate more than ten million events over 10\,s in Super-Kamiokande (SK). At such large event rates, the buffers of the SK analog-to-digital conversion board (QBEE) will overflow, causing random loss of data that is critical for understanding the dynamics of the supernova explosion mechanism. To solve this problem, two new DAQ modules were developed to aid in the observation of very nearby supernovae. The first of these, the SN module, is designed to save only the number of hit PMTs during a supernova burst, and the second, the Veto module, prescales the high-rate neutrino events to prevent the QBEE from overflowing based on information from the SN module. In the event of a very nearby supernova, these modules allow SK to reconstruct the time evolution of the neutrino event rate from beginning to end using both QBEE and SN module data. This paper presents the development and testing of these modules together with an analysis of supernova-like data generated with a flashing laser diode. We demonstrate that the Veto module successfully prevents DAQ overflows for Betelgeuse-like supernovae, and we demonstrate the long-term stability of the new modules. During normal running the Veto module is found to issue DAQ vetoes a few times per month, resulting in a total dead time of less than 1\,ms, and does not influence ordinary operations. Additionally, using simulation data we find that supernovae closer than 800~pc will trigger the Veto module, resulting in a prescaling of the observed neutrino data.
Submitted 12 April, 2024;
originally announced April 2024.
-
LV-CTC: Non-autoregressive ASR with CTC and latent variable models
Authors:
Yuya Fujita,
Shinji Watanabe,
Xuankai Chang,
Takashi Maekaku
Abstract:
Non-autoregressive (NAR) models for automatic speech recognition (ASR) aim to achieve high accuracy and fast inference by simplifying the autoregressive (AR) generation process of conventional models. Connectionist temporal classification (CTC) is one of the key techniques used in NAR ASR models. In this paper, we propose a new model combining CTC and a latent variable model, which is one of the state-of-the-art models in the neural machine translation research field. A new neural network architecture and formulation specialized for ASR application are introduced. In the proposed model, CTC alignment is assumed to be dependent on the latent variables that are expected to capture dependencies between tokens. Experimental results on a 100-hour subset of the LibriSpeech corpus showed the best recognition accuracy among CTC-based NAR models. On the TED-LIUM2 corpus, the proposed model achieves the best recognition accuracy among the compared models, including AR E2E models, with faster inference speed.
Submitted 28 March, 2024;
originally announced March 2024.
-
Efficient exploration of high-Tc superconductors by a gradient-based composition design
Authors:
Akihiro Fujii,
Koji Shimizu,
Satoshi Watanabe
Abstract:
We propose a material design method via gradient-based optimization on compositions, overcoming the limitations of traditional methods: exhaustive database searches and conditional generation models. It optimizes inputs via backpropagation, aligning the model's output closely with the target property and facilitating the discovery of unlisted materials and precise property determination. Our method is also capable of adaptive optimization under new conditions without retraining. Applying the method to the exploration of high-Tc superconductors, we identified potential compositions beyond existing databases and discovered new hydrogen superconductors via conditional optimization. This method is versatile and significantly advances material design by enabling efficient, extensive searches and adaptability to new constraints.
Submitted 20 March, 2024;
originally announced March 2024.
-
Wav2Gloss: Generating Interlinear Glossed Text from Speech
Authors:
Taiqi He,
Kwanghee Choi,
Lindia Tjuatja,
Nathaniel R. Robinson,
Jiatong Shi,
Shinji Watanabe,
Graham Neubig,
David R. Mortensen,
Lori Levin
Abstract:
Thousands of the world's languages are in danger of extinction--a tremendous threat to cultural identities and human language diversity. Interlinear Glossed Text (IGT) is a form of linguistic annotation that can support documentation and resource creation for these languages' communities. IGT typically consists of (1) transcriptions, (2) morphological segmentation, (3) glosses, and (4) free translations to a majority language. We propose Wav2Gloss: a task in which these four annotation components are extracted automatically from speech, and introduce the first dataset to this end, Fieldwork: a corpus of speech with all these annotations, derived from the work of field linguists, covering 37 languages, with standard formatting, and train/dev/test splits. We provide various baselines to lay the groundwork for future research on IGT generation from speech, such as end-to-end versus cascaded, monolingual versus multilingual, and single-task versus multi-task approaches.
Submitted 5 June, 2024; v1 submitted 19 March, 2024;
originally announced March 2024.
-
Measurements of the charge ratio and polarization of cosmic-ray muons with the Super-Kamiokande detector
Authors:
H. Kitagawa,
T. Tada,
K. Abe,
C. Bronner,
Y. Hayato,
K. Hiraide,
K. Hosokawa,
K. Ieki,
M. Ikeda,
J. Kameda,
Y. Kanemura,
R. Kaneshima,
Y. Kashiwagi,
Y. Kataoka,
S. Miki,
S. Mine,
M. Miura,
S. Moriyama,
Y. Nakano,
M. Nakahata,
S. Nakayama,
Y. Noguchi,
K. Okamoto,
K. Sato,
H. Sekiya
, et al. (231 additional authors not shown)
Abstract:
We present the results of the charge ratio ($R$) and polarization ($P^μ_{0}$) measurements using the decay electron events collected from 2008 September to 2022 June by the Super-Kamiokande detector. Because of its underground location and long operation, we performed high precision measurements by accumulating cosmic-ray muons. We measured the muon charge ratio to be $R=1.32 \pm 0.02$ $(\mathrm{stat.}{+}\mathrm{syst.})$ at $E_μ\cos θ_{\mathrm{Zenith}}=0.7^{+0.3}_{-0.2}$ $\mathrm{TeV}$, where $E_μ$ is the muon energy and $θ_{\mathrm{Zenith}}$ is the zenith angle of incoming cosmic-ray muons. This result is consistent with the Honda flux model, while it suggests a $1.9σ$ tension with the $πK$ model. We also measured the muon polarization at the production location to be $P^μ_{0}=0.52 \pm 0.02$ $(\mathrm{stat.}{+}\mathrm{syst.})$ at the muon momentum of $0.9^{+0.6}_{-0.1}$ $\mathrm{TeV}/c$ at the surface of the mountain; this suggests a $1.5σ$ tension with the Honda flux model. This is the most precise experimental determination to date of the cosmic-ray muon polarization near $1~\mathrm{TeV}/c$. These measurement results are useful for improving the atmospheric neutrino simulations.
Submitted 13 March, 2024;
originally announced March 2024.
-
FOXSI-2: Upgrades of the Focusing Optics X-ray Solar Imager for its Second Flight
Authors:
Steven Christe,
Lindsay Glesener,
Camilo Buitrago-Casas,
Shin-Nosuke Ishikawa,
Brian Ramsey,
Mikhail Gubarev,
Kiranmayee Kilaru,
Jeffery J. Kolodziejczak,
Shin Watanabe,
Tadayuki Takahashi,
Hiroyasu Tajima,
Paul Turin,
Van Shourt,
Natalie Foster,
Sam Krucker
Abstract:
The Focusing Optics X-ray Solar Imager (FOXSI) sounding rocket payload flew for the second time on 2014 December 11. To enable direct Hard X-Ray (HXR) imaging spectroscopy, FOXSI makes use of grazing-incidence replicated focusing optics combined with fine-pitch solid-state detectors. FOXSI's first flight provided the first HXR focused images of the Sun. For FOXSI's second flight several updates were made to the instrument including updating the optics and detectors as well as adding a new Solar Aspect and Alignment System (SAAS). This paper provides an overview of these updates as well as a discussion of their measured performance.
Submitted 12 March, 2024;
originally announced March 2024.
-
Aligning Speech to Languages to Enhance Code-switching Speech Recognition
Authors:
Hexin Liu,
Xiangyu Zhang,
Leibny Paola Garcia,
Andy W. H. Khong,
Eng Siong Chng,
Shinji Watanabe
Abstract:
Code-switching (CS) refers to the switching of languages within a speech signal and results in language confusion for automatic speech recognition (ASR). To address language confusion, we propose the language alignment loss that performs frame-level language identification using pseudo language labels learned from the ASR decoder. This eliminates the need for frame-level language annotations. To further tackle the complex token alternatives for language modeling in bilingual scenarios, we propose to employ large language models via a generative error correction method. A linguistic hint that incorporates language information (derived from the proposed language alignment loss and decoded hypotheses) is introduced to guide the prompting of large language models. The proposed methods are evaluated on the SEAME dataset and data from the ASRU 2019 Mandarin-English code-switching speech recognition challenge. The incorporation of the proposed language alignment loss demonstrates a higher CS-ASR performance with only a negligible increase in the number of parameters on both datasets compared to the baseline model. This work also highlights the efficacy of language alignment loss in balancing primary-language-dominant bilingual data during training, with an 8.6% relative improvement on the ASRU dataset compared to the baseline model. Performance evaluation using large language models reveals the advantage of the linguistic hint by achieving 14.1% and 5.5% relative improvement on test sets of the ASRU and SEAME datasets, respectively.
Submitted 9 March, 2024;
originally announced March 2024.
-
Fast Benchmarking of Asynchronous Multi-Fidelity Optimization on Zero-Cost Benchmarks
Authors:
Shuhei Watanabe,
Neeratyoy Mallik,
Edward Bergman,
Frank Hutter
Abstract:
While deep learning has celebrated many successes, its results often hinge on the meticulous selection of hyperparameters (HPs). However, the time-consuming nature of deep learning training makes HP optimization (HPO) a costly endeavor, slowing down the development of efficient HPO tools. While zero-cost benchmarks, which provide performance and runtime without actual training, offer a solution for non-parallel setups, they fall short in parallel setups as each worker must communicate its queried runtime to return its evaluation in the exact order. This work addresses this challenge by introducing a user-friendly Python package that facilitates efficient parallel HPO with zero-cost benchmarks. Our approach calculates the exact return order based on the information stored in the file system, eliminating the need for long waiting times and enabling much faster HPO evaluations. We first verify the correctness of our approach through extensive testing. Experiments with 6 popular HPO libraries then show its applicability to diverse libraries and its ability to achieve over 1000x speedup compared to a traditional approach. Our package can be installed via pip install mfhpo-simulator.
Submitted 17 April, 2024; v1 submitted 4 March, 2024;
originally announced March 2024.
-
Investigation of the determination of nuclear deformation using high-energy heavy-ion scattering
Authors:
Shin Watanabe,
Takenori Furumoto,
Wataru Horiuchi,
Tadahiro Suhara,
Yasutaka Taniguchi
Abstract:
Background: Nuclear deformation provides a crucial characteristic of nuclear structure. Conventionally, the quadrupole deformation length of a nucleus, $δ_{2}$, has often been determined based on a macroscopic model through a deformed nuclear potential with the deformation length $δ^{\rm (pot)}_{2}$, which is determined to reproduce the nuclear scattering data. This approach assumes $δ_{2}=δ^{\rm (pot)}_{2}$ although there is no theoretical foundation. Purpose: We systematically clarify the relationship between $δ_{2}$ and $δ^{\rm (pot)}_{2}$ for high-energy heavy-ion scattering to evaluate the validity of the conventional approach to determining the nuclear deformation. Method: The deformation lengths for the $^{12}$C inelastic scattering by $^{12}$C, $^{16}$O, $^{40}$Ca, and $^{208}$Pb targets at $E/A$ = 50--400 MeV are examined. First, we perform microscopic coupled-channel (CC) calculations to relate $δ_{2}$ of the deformed density to the inelastic scattering cross section. Second, we use the deformed potential model to determine $δ^{\rm (pot)}_{2}$ so as to reproduce the microscopic CC result. We then compare $δ^{\rm (pot)}_{2}$ with $δ_{2}$. Results: We find that $δ^{\rm (pot)}_{2}$ is about 20--40% smaller than the presumed $δ_{2}$, showing strong energy and target dependence. Conclusion: Our results suggest that one needs to be careful when the deformed potential model for high-energy heavy-ion scattering is used to extract the nuclear deformation. The conventional approach may systematically underestimate the deformation length $δ_2$.
Submitted 29 February, 2024;
originally announced February 2024.
-
TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages
Authors:
Minsu Kim,
Jee-weon Jung,
Hyeongseop Rha,
Soumi Maiti,
Siddhant Arora,
Xuankai Chang,
Shinji Watanabe,
Yong Man Ro
Abstract:
Jointly processing multi-modal information is becoming an essential task. However, the limited amount of paired multi-modal data and the large computational requirements of multi-modal learning hinder development. We propose a novel Tri-Modal Translation (TMT) model that translates between arbitrary modalities spanning speech, image, and text. We introduce a novel viewpoint, where we interpret different modalities as different languages, and treat multi-modal translation as a well-established machine translation problem. To this end, we tokenize speech and image data into discrete tokens, which provide a unified interface across modalities and significantly decrease the computational cost. In the proposed TMT, a multi-modal encoder-decoder conducts the core translation, whereas modality-specific processing is conducted only within the tokenization and detokenization stages. We evaluate the proposed TMT on all six modality translation tasks. TMT consistently outperforms single-model counterparts, demonstrating that unifying tasks is beneficial not only for practicality but also for performance.
Submitted 25 February, 2024;
originally announced February 2024.
-
Uncovering the Sign of Nuclear Deformations: Prolate or Oblate Shape via Low-Energy $α$ Inelastic Scattering
Authors:
Shin Watanabe,
Yoshiki Suzuki,
Masaaki Kimura,
Kazuyuki Ogata
Abstract:
Background: Understanding nuclear shape is a crucial problem in nuclear physics. In particular, determining the sign of quadrupole deformation, i.e., whether prolate or oblate, remains a challenging problem. Purpose: Our aim is to propose a method for determining the sign of quadrupole deformation using $α$ inelastic scattering data and to demonstrate its effectiveness. Method: Our approach is the standard coupled-channel method based on the macroscopic model. We utilize the nuclear reorientation effect, a phenomenon associated with the self-coupling of excited states, as a probe sensitive to the sign of deformation. Results: We first provide an overview of how the reorientation effect influences inelastic scattering cross sections, and numerically confirm its validity in realistic cases. We then demonstrate that the sign of deformation can be uniquely determined from inelastic scattering cross section data. Conclusion: Our technique offers a systematic approach for determining the sign of deformation in both stable and unstable nuclei. The broad applicability of $α$ inelastic scattering will make it a valuable tool for studying the shape of nuclei, especially unstable nuclei.
Submitted 21 February, 2024;
originally announced February 2024.
-
OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification
Authors:
Yifan Peng,
Yui Sudo,
Muhammad Shakeel,
Shinji Watanabe
Abstract:
There has been an increasing interest in large speech models that can perform multiple tasks in a single model. Such models usually adopt an encoder-decoder or decoder-only architecture due to their popularity and good performance in many domains. However, autoregressive models can be slower during inference compared to non-autoregressive models and also have potential risks of hallucination. Though prior studies observed promising results of non-autoregressive models for certain tasks at small scales, it remains unclear if they can be scaled to speech-to-text generation in diverse languages and tasks. Inspired by the Open Whisper-style Speech Model (OWSM) project, we propose OWSM-CTC, a novel encoder-only speech foundation model based on Connectionist Temporal Classification (CTC). It is trained on 180k hours of public audio data for multilingual automatic speech recognition (ASR), speech translation (ST), and language identification (LID). Compared to encoder-decoder OWSM, our OWSM-CTC achieves competitive results on ASR and up to 24% relative improvement on ST, while it is more robust and 3 to 4 times faster for inference. OWSM-CTC also improves the long-form ASR result with 20x speed-up. We will publicly release our code, pre-trained model, and training logs to promote open science in speech foundation models.
Submitted 16 June, 2024; v1 submitted 19 February, 2024;
originally announced February 2024.