Skip to main content

Showing 1–50 of 848 results for author: Watanabe, S

  1. arXiv:2407.08586  [pdf, other

    nucl-ex

    Centrality dependence of Lévy-stable two-pion Bose-Einstein correlations in $\sqrt{s_{_{NN}}}=200$ GeV Au$+$Au collisions

    Authors: PHENIX Collaboration, N. J. Abdulameer, U. Acharya, A. Adare, C. Aidala, N. N. Ajitanand, Y. Akiba, R. Akimoto, H. Al-Ta'ani, J. Alexander, A. Angerami, K. Aoki, N. Apadula, Y. Aramaki, H. Asano, E. C. Aschenauer, E. T. Atomssa, T. C. Awes, B. Azmoun, V. Babintsev, M. Bai, B. Bannier, K. N. Barish, B. Bassalleck, S. Bathe , et al. (377 additional authors not shown)

    Abstract: The PHENIX experiment measured the centrality dependence of two-pion Bose-Einstein correlation functions in $\sqrt{s_{_{NN}}}=200$~GeV Au$+$Au collisions at the Relativistic Heavy Ion Collider at Brookhaven National Laboratory. The data are well represented by Lévy-stable source distributions. The extracted source parameters are the correlation-strength parameter $λ$, the Lévy index of stability… ▽ More

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: 401 authors from 75 institutions, 20 pages, 15 figures, 2 tables. v1 is version submitted to Physical Review C. HEPdata tables for the points plotted in figures for this and previous PHENIX publications are (or will be) publicly available at http://www.phenix.bnl.gov/papers.html

  2. arXiv:2407.07977  [pdf, other

    physics.atom-ph

    Few-electron highly charged muonic Ar atoms verified by electronic $K$ x rays

    Authors: T. Okumura, T. Azuma, D. A. Bennett, W. B. Doriese, M. S. Durkin, J. W. Fowler, J. D. Gard, T. Hashimoto, R. Hayakawa, Y. Ichinohe, P. Indelicato, T. Isobe, S. Kanda, D. Kato, M. Katsuragawa, N. Kawamura, Y. Kino, N. Kominato, Y. Miyake, K. M. Morgan, H. Noda, G. C. O'Neil, S. Okada, K. Okutsu, N. Paul , et al. (18 additional authors not shown)

    Abstract: Electronic $K$ x rays emitted by muonic Ar atoms in the gas phase were observed using a superconducting transition-edge-sensor microcalorimeter. The high-precision energy spectra provided a clear signature of the presence of muonic atoms accompanied by a few electrons, which have never been observed before. One-, two-, and three-electron bound, i.e., H-like, He-like, and Li-like, muonic Ar atoms w… ▽ More

    Submitted 10 July, 2024; originally announced July 2024.

  3. arXiv:2407.03718  [pdf, other

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Multi-Convformer: Extending Conformer with Multiple Convolution Kernels

    Authors: Darshan Prabhu, Yifan Peng, Preethi Jyothi, Shinji Watanabe

    Abstract: Convolutions have become essential in state-of-the-art end-to-end Automatic Speech Recognition~(ASR) systems due to their efficient modelling of local context. Notably, its use in Conformers has led to superior performance compared to vanilla Transformer-based ASR systems. While components other than the convolution module in the Conformer have been reexamined, altering the convolution module itse… ▽ More

    Submitted 4 July, 2024; originally announced July 2024.

    Comments: Accepted to INTERSPEECH 2024

  4. arXiv:2407.00837  [pdf, other

    cs.CL cs.AI cs.SD eess.AS

    Towards Robust Speech Representation Learning for Thousands of Languages

    Authors: William Chen, Wangyou Zhang, Yifan Peng, Xinjian Li, Jinchuan Tian, Jiatong Shi, Xuankai Chang, Soumi Maiti, Karen Livescu, Shinji Watanabe

    Abstract: Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data. However, models are still far from supporting the world's 7000+ languages. We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages, extending the language coverage of SSL models 4-fold. We combine 1 millio… ▽ More

    Submitted 2 July, 2024; v1 submitted 30 June, 2024; originally announced July 2024.

    Comments: Updated affiliations; 20 pages

  5. arXiv:2406.17246  [pdf, other

    cs.SD cs.AI eess.AS

    Beyond Silence: Bias Analysis through Loss and Asymmetric Approach in Audio Anti-Spoofing

    Authors: Hye-jin Shim, Md Sahidullah, Jee-weon Jung, Shinji Watanabe, Tomi Kinnunen

    Abstract: Current trends in audio anti-spoofing detection research strive to improve models' ability to generalize across unseen attacks by learning to identify a variety of spoofing artifacts. This emphasis has primarily focused on the spoof class. Recently, several studies have noted that the distribution of silence differs between the two classes, which can serve as a shortcut. In this paper, we extend c… ▽ More

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: 5 pages, 1 figure, 5 tables

  6. arXiv:2406.16120  [pdf, other

    eess.AS cs.CL cs.SD

    Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss

    Authors: Muhammad Shakeel, Yui Sudo, Yifan Peng, Shinji Watanabe

    Abstract: Contextualized end-to-end automatic speech recognition has been an active research area, with recent efforts focusing on the implicit learning of contextual phrases based on the final loss objective. However, these approaches ignore the useful contextual knowledge encoded in the intermediate layers. We hypothesize that employing explicit biasing loss as an auxiliary task in the encoder intermediat… ▽ More

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: Accepted to INTERSPEECH 2024

  7. arXiv:2406.16107  [pdf, ps, other

    eess.AS cs.CL

    Decoder-only Architecture for Streaming End-to-end Speech Recognition

    Authors: Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe

    Abstract: Decoder-only language models (LMs) have been successfully adopted for speech-processing tasks including automatic speech recognition (ASR). The LMs have ample expressiveness and perform efficiently. This efficiency is a suitable characteristic for streaming applications of ASR. In this work, we propose to use a decoder-only architecture for blockwise streaming ASR. In our approach, speech features… ▽ More

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: Accepted for Interspeech 2024

  8. arXiv:2406.13471  [pdf, other

    eess.AS cs.SD

    Diffusion-based Generative Modeling with Discriminative Guidance for Streamable Speech Enhancement

    Authors: Chenda Li, Samuele Cornell, Shinji Watanabe, Yanmin Qian

    Abstract: Diffusion-based generative models (DGMs) have recently attracted attention in speech enhancement research (SE) as previous works showed a remarkable generalization capability. However, DGMs are also computationally intensive, as they usually require many iterations in the reverse diffusion process (RDP), making them impractical for streaming SE systems. In this paper, we propose to use discriminat… ▽ More

    Submitted 19 June, 2024; originally announced June 2024.

  9. arXiv:2406.12611  [pdf, other

    cs.SD cs.CL eess.AS

    Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting

    Authors: Yosuke Kashiwagi, Hayato Futami, Emiru Tsunoo, Siddhant Arora, Shinji Watanabe

    Abstract: End-to-end multilingual speech recognition models handle multiple languages through a single model, often incorporating language identification to automatically detect the language of incoming speech. Since the common scenario is where the language is already known, these models can perform as language-specific by using language information as prompts, which is particularly beneficial for attentio… ▽ More

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH 2024

  10. arXiv:2406.12317  [pdf, other

    cs.CL eess.AS

    Finding Task-specific Subnetworks in Multi-task Spoken Language Understanding Model

    Authors: Hayato Futami, Siddhant Arora, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe

    Abstract: Recently, multi-task spoken language understanding (SLU) models have emerged, designed to address various speech processing tasks. However, these models often rely on a large number of parameters. Also, they often encounter difficulties in adapting to new data for a specific task without experiencing catastrophic forgetting of previously trained tasks. In this study, we propose finding task-specif… ▽ More

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech2024

  11. arXiv:2406.10234  [pdf, ps, other

    cond-mat.stat-mech cs.LG math.ST stat.ML

    Review and Prospect of Algebraic Research in Equivalent Framework between Statistical Mechanics and Machine Learning Theory

    Authors: Sumio Watanabe

    Abstract: Mathematical equivalence between statistical mechanics and machine learning theory has been known since the 20th century, and researches based on such equivalence have provided novel methodology in both theoretical physics and statistical learning theory. For example, algebraic approach in statistical mechanics such as operator algebra enables us to analyze phase transition phenomena mathematicall… ▽ More

    Submitted 17 June, 2024; v1 submitted 31 May, 2024; originally announced June 2024.

  12. arXiv:2406.10083  [pdf, other

    cs.CL cs.SD eess.AS

    On the Evaluation of Speech Foundation Models for Spoken Language Understanding

    Authors: Siddhant Arora, Ankita Pasad, Chung-Ming Chien, Jionghao Han, Roshan Sharma, Jee-weon Jung, Hira Dhamyal, William Chen, Suwon Shon, Hung-yi Lee, Karen Livescu, Shinji Watanabe

    Abstract: The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking of complex spoken language understanding (SLU) tasks, including both classification and sequence generation tasks, on natural speech. The benchmark has demonstrated preliminary success in using pre-trained speech foundation models (SFM) for th… ▽ More

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted at ACL Findings 2024

  13. arXiv:2406.09869  [pdf, ps, other

    cs.SD eess.AS

    MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model

    Authors: Jiatong Shi, Xutai Ma, Hirofumi Inaguma, Anna Sun, Shinji Watanabe

    Abstract: Speech discrete representation has proven effective in various downstream applications due to its superior compression rate of the waveform, fast convergence during training, and compatibility with other modalities. Discrete units extracted from self-supervised learning (SSL) models have emerged as a prominent approach for obtaining speech discrete representation. However, while discrete units hav… ▽ More

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech2024

  14. arXiv:2406.09345  [pdf, other

    cs.CL cs.SD eess.AS

    DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding

    Authors: Suwon Shon, Kwangyoun Kim, Yi-Te Hsu, Prashant Sridhar, Shinji Watanabe, Karen Livescu

    Abstract: The integration of pre-trained text-based large language models (LLM) with speech input has enabled instruction-following capabilities for diverse speech tasks. This integration requires the use of a speech encoder, a speech adapter, and an LLM, trained on diverse tasks. We propose the use of discrete speech units (DSU), rather than continuous-valued speech encoder outputs, that are converted to t… ▽ More

    Submitted 13 June, 2024; originally announced June 2024.

  15. arXiv:2406.09282  [pdf, other

    cs.CL cs.SD eess.AS

    On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models

    Authors: Jinchuan Tian, Yifan Peng, William Chen, Kwanghee Choi, Karen Livescu, Shinji Watanabe

    Abstract: The Open Whisper-style Speech Model (OWSM) series was introduced to achieve full transparency in building advanced speech-to-text (S2T) foundation models. To this end, OWSM models are trained on 25 public speech datasets, which are heterogeneous in multiple ways. In this study, we advance the OWSM series by introducing OWSM v3.2, which improves on prior models by investigating and addressing the i… ▽ More

    Submitted 13 June, 2024; originally announced June 2024.

  16. arXiv:2406.08761  [pdf, other

    cs.SD eess.AS

    VISinger2+: End-to-End Singing Voice Synthesis Augmented by Self-Supervised Learning Representation

    Authors: Yifeng Yu, Jiatong Shi, Yuning Wu, Shinji Watanabe

    Abstract: Singing Voice Synthesis (SVS) has witnessed significant advancements with the advent of deep learning techniques. However, a significant challenge in SVS is the scarcity of labeled singing voice data, which limits the effectiveness of supervised learning methods. In response to this challenge, this paper introduces a novel approach to enhance the quality of SVS by leveraging unlabeled data from pr… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: 4 pages, 2 figures

  17. arXiv:2406.08641  [pdf, ps, other

    cs.SD cs.CL eess.AS

    ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets

    Authors: Jiatong Shi, Shih-Heng Wang, William Chen, Martijn Bartelds, Vanya Bannihatti Kumar, Jinchuan Tian, Xuankai Chang, Dan Jurafsky, Karen Livescu, Hung-yi Lee, Shinji Watanabe

    Abstract: ML-SUPERB evaluates self-supervised learning (SSL) models on the tasks of language identification and automatic speech recognition (ASR). This benchmark treats the models as feature extractors and uses a single shallow downstream model, which can be fine-tuned for a downstream task. However, real-world use cases may require different configurations. This paper presents ML-SUPERB~2.0, which is a ne… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  18. arXiv:2406.08619  [pdf, other

    cs.CL cs.LG eess.AS

    Self-Supervised Speech Representations are More Phonetic than Semantic

    Authors: Kwanghee Choi, Ankita Pasad, Tomohiko Nakamura, Satoru Fukayama, Karen Livescu, Shinji Watanabe

    Abstract: Self-supervised speech models (S3Ms) have become an effective backbone for speech applications. Various analyses suggest that S3Ms encode linguistic properties. In this work, we seek a more fine-grained analysis of the word-level linguistic properties encoded in S3Ms. Specifically, we curate a novel dataset of near homophone (phonetically similar) and synonym (semantically similar) word pairs and… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024. Source code at https://github.com/juice500ml/phonetic_semantic_probing

  19. arXiv:2406.08396  [pdf, other

    eess.AS cs.AI

    Neural Blind Source Separation and Diarization for Distant Speech Recognition

    Authors: Yoshiaki Bando, Tomohiko Nakamura, Shinji Watanabe

    Abstract: This paper presents a neural method for distant speech recognition (DSR) that jointly separates and diarizes speech mixtures without supervision by isolated signals. A standard separation method for multi-talker DSR is a statistical multichannel method called guided source separation (GSS). While GSS does not require signal-level supervision, it relies on speaker diarization results to handle unkn… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: 5 pages, 3 figures, accepted to INTERSPEECH 2024

  20. arXiv:2406.08301  [pdf, other

    nucl-ex

    Jet modification via $π^0$-hadron correlations in Au$+$Au collisions at $\sqrt{s_{_{NN}}}=200$ GeV

    Authors: PHENIX Collaboration, N. J. Abdulameer, U. Acharya, A. Adare, S. Afanasiev, C. Aidala, N. N. Ajitanand, Y. Akiba, H. Al-Bataineh, J. Alexander, M. Alfred, K. Aoki, N. Apadula, L. Aphecetche, J. Asai, H. Asano, E. T. Atomssa, R. Averbeck, T. C. Awes, B. Azmoun, V. Babintsev, M. Bai, G. Baksay, L. Baksay, A. Baldisseri , et al. (510 additional authors not shown)

    Abstract: High-momentum two-particle correlations are a useful tool for studying jet-quenching effects in the quark-gluon plasma. Angular correlations between neutral-pion triggers and charged hadrons with transverse momenta in the range 4--12~GeV/$c$ and 0.5--7~GeV/$c$, respectively, have been measured by the PHENIX experiment in 2014 for Au$+$Au collisions at $\sqrt{s_{_{NN}}}=200$~GeV. Suppression is obs… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: 534 authors from 83 institutions, 12 pages, 7 figures. v1 is version submitted to Physical Review C. HEPdata tables for the points plotted in figures for this and previous PHENIX publications are (or will be) publicly available at http://www.phenix.bnl.gov/papers.html

  21. arXiv:2406.07725  [pdf, ps, other

    cs.SD eess.AS

    The Interspeech 2024 Challenge on Speech Processing Using Discrete Units

    Authors: Xuankai Chang, Jiatong Shi, Jinchuan Tian, Yuning Wu, Yuxun Tang, Yihan Wu, Shinji Watanabe, Yossi Adi, Xie Chen, Qin Jin

    Abstract: Representing speech and audio signals in discrete units has become a compelling alternative to traditional high-dimensional feature vectors. Numerous studies have highlighted the efficacy of discrete units in various applications such as speech compression and restoration, speech recognition, and speech generation. To foster exploration in this domain, we introduce the Interspeech 2024 Challenge,… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: This manuscript has been accepted by Interspeech2024

  22. arXiv:2406.06185  [pdf, other

    eess.AS cs.LG cs.SD

    EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation

    Authors: Julius Richter, Yi-Chiao Wu, Steven Krenn, Simon Welker, Bunlong Lay, Shinji Watanabe, Alexander Richard, Timo Gerkmann

    Abstract: We release the EARS (Expressive Anechoic Recordings of Speech) dataset, a high-quality speech dataset comprising 107 speakers from diverse backgrounds, totaling in 100 hours of clean, anechoic speech data. The dataset covers a large range of different speaking styles, including emotional speech, different reading styles, non-verbal sounds, and conversational freeform speech. We benchmark various m… ▽ More

    Submitted 11 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  23. arXiv:2406.05339  [pdf, other

    eess.AS cs.AI

    To what extent can ASV systems naturally defend against spoofing attacks?

    Authors: Jee-weon Jung, Xin Wang, Nicholas Evans, Shinji Watanabe, Hye-jin Shim, Hemlata Tak, Sidhhant Arora, Junichi Yamagishi, Joon Son Chung

    Abstract: The current automatic speaker verification (ASV) task involves making binary decisions on two types of trials: target and non-target. However, emerging advancements in speech generation technology pose significant threats to the reliability of ASV systems. This study investigates whether ASV effortlessly acquires robustness against spoofing attacks (i.e., zero-shot capability) by systematically ex… ▽ More

    Submitted 14 June, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

    Comments: 5 pages, 3 figures, 3 tables, Interspeech 2024

  24. arXiv:2406.04660  [pdf, other

    eess.AS cs.SD

    URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement

    Authors: Wangyou Zhang, Robin Scheibler, Kohei Saijo, Samuele Cornell, Chenda Li, Zhaoheng Ni, Anurag Kumar, Jan Pirklbauer, Marvin Sach, Shinji Watanabe, Tim Fingscheidt, Yanmin Qian

    Abstract: The last decade has witnessed significant advancements in deep learning-based speech enhancement (SE). However, most existing SE research has limitations on the coverage of SE sub-tasks, data diversity and amount, and evaluation metrics. To fill this gap and promote research toward universal SE, we establish a new SE challenge, named URGENT, to focus on the universality, robustness, and generaliza… ▽ More

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: 6 pages, 3 figures, 3 tables. Accepted by Interspeech 2024. An extended version of the accepted manuscript with appendix

  25. arXiv:2406.04565  [pdf, ps, other

    nucl-th

    Non-Spherical Pauli Forbidden States in Deformed Halo Nuclei: Impact on the ${}^7\mathrm{Be}+p$ Resonant States in the Particle Rotor Model

    Authors: Shin Watanabe, Antonio M. Moro

    Abstract: Background: An important aspect of reducing nuclear many-body problems to few-body models is the presence of Pauli forbidden (PF) states, which are excluded in fully antisymmetrized calculations. Insufficient treatments of PF states in deformed halo nuclei underscore the need for model refinement. Purpose: We propose a new method utilizing Nilsson states as PF states in the orthogonality conditi… ▽ More

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: 8 pages, 6 figures

  26. arXiv:2406.04269  [pdf, other

    eess.AS cs.SD

    Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement

    Authors: Wangyou Zhang, Kohei Saijo, Jee-weon Jung, Chenda Li, Shinji Watanabe, Yanmin Qian

    Abstract: Deep learning-based speech enhancement (SE) models have achieved impressive performance in the past decade. Numerous advanced architectures have been designed to deliver state-of-the-art performance; however, their scalability potential remains unrevealed. Meanwhile, the majority of research focuses on small-sized datasets with restricted diversity, leading to a plateau in performance improvement.… ▽ More

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: 5 pages, 3 figures, 4 tables, Accepted by Interspeech 2024

  27. arXiv:2406.02950  [pdf, other

    eess.AS cs.CL cs.SD

    4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders

    Authors: Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Brian Yan, Jiatong Shi, Yifan Peng, Shinji Watanabe

    Abstract: End-to-end automatic speech recognition (E2E-ASR) can be classified into several network architectures, such as connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention-based encoder-decoder, and mask-predict models. Each network architecture has advantages and disadvantages, leading practitioners to switch between these different models depending on appl… ▽ More

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: submitted to IEEE/ACM Transactions on Audio Speech and Language Processing

  28. arXiv:2406.02560  [pdf, other

    eess.AS cs.AI cs.CL cs.LG

    Less Peaky and More Accurate CTC Forced Alignment by Label Priors

    Authors: Ruizhe Huang, Xiaohui Zhang, Zhaoheng Ni, Li Sun, Moto Hira, Jeff Hwang, Vimal Manohar, Vineel Pratap, Matthew Wiesner, Shinji Watanabe, Daniel Povey, Sanjeev Khudanpur

    Abstract: Connectionist temporal classification (CTC) models are known to have peaky output distributions. Such behavior is not a problem for automatic speech recognition (ASR), but it can cause inaccurate forced alignments (FA), especially at finer granularity, e.g., phoneme level. This paper aims at alleviating the peaky behavior for CTC and improve its suitability for forced alignment generation, by leve… ▽ More

    Submitted 15 June, 2024; v1 submitted 22 April, 2024; originally announced June 2024.

    Comments: Accepted by ICASSP 2024. Github repo: https://github.com/huangruizhe/audio/tree/aligner_label_priors

  29. arXiv:2406.00899  [pdf, other

    cs.CL cs.SD eess.AS

    YODAS: Youtube-Oriented Dataset for Audio and Speech

    Authors: Xinjian Li, Shinnosuke Takamichi, Takaaki Saeki, William Chen, Sayaka Shiota, Shinji Watanabe

    Abstract: In this study, we introduce YODAS (YouTube-Oriented Dataset for Audio and Speech), a large-scale, multilingual dataset comprising currently over 500k hours of speech data in more than 100 languages, sourced from both labeled and unlabeled YouTube speech datasets. The labeled subsets, including manual or automatic subtitles, facilitate supervised model training. Conversely, the unlabeled subsets ar… ▽ More

    Submitted 2 June, 2024; originally announced June 2024.

    Comments: ASRU 2023

  30. arXiv:2405.20402  [pdf, other

    eess.AS cs.SD eess.SP

    Cross-Talk Reduction

    Authors: Zhong-Qiu Wang, Anurag Kumar, Shinji Watanabe

    Abstract: While far-field multi-talker mixtures are recorded, each speaker can wear a close-talk microphone so that close-talk mixtures can be recorded at the same time. Although each close-talk mixture has a high signal-to-noise ratio (SNR) of the wearer, it has a very limited range of applications, as it also contains significant cross-talk speech by other speakers and is not clean enough. In this context… ▽ More

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: in International Joint Conference on Artificial Intelligence (IJCAI), 2024

  31. arXiv:2405.13514  [pdf, other

    eess.AS cs.CL cs.SD

    Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation

    Authors: Muhammad Shakeel, Yui Sudo, Yifan Peng, Shinji Watanabe

    Abstract: End-to-end (E2E) automatic speech recognition (ASR) can operate in two modes: streaming and non-streaming, each with its pros and cons. Streaming ASR processes the speech frames in real-time as it is being received, while non-streaming ASR waits for the entire speech utterance; thus, professionals may have to operate in either mode to satisfy their application. In this work, we present joint optim… ▽ More

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: Accepted to IEEE ICASSP 2024 workshop Hands-free Speech Communication and Microphone Arrays (HSCMA 2024)

  32. arXiv:2405.13344  [pdf, other

    eess.AS cs.CL cs.SD

    Contextualized Automatic Speech Recognition with Dynamic Vocabulary

    Authors: Yui Sudo, Yosuke Fukumoto, Muhammad Shakeel, Yifan Peng, Shinji Watanabe

    Abstract: Deep biasing (DB) improves the performance of end-to-end automatic speech recognition (E2E-ASR) for rare words or contextual phrases using a bias list. However, most existing methods treat bias phrases as sequences of subwords in a predefined static vocabulary, which can result in ineffective learning of the dependencies between subwords. More advanced techniques address this problem by incorporat… ▽ More

    Submitted 22 May, 2024; originally announced May 2024.

  33. Acoustic modeling for Overlapping Speech Recognition: JHU Chime-5 Challenge System

    Authors: Vimal Manohar, Szu-Jui Chen, Zhiqi Wang, Yusuke Fujita, Shinji Watanabe, Sanjeev Khudanpur

    Abstract: This paper summarizes our acoustic modeling efforts in the Johns Hopkins University speech recognition system for the CHiME-5 challenge to recognize highly-overlapped dinner party speech recorded by multiple microphone arrays. We explore data augmentation approaches, neural network architectures, front-end speech dereverberation, beamforming and robust i-vector extraction with comparisons of our i… ▽ More

    Submitted 17 May, 2024; originally announced May 2024.

    Comments: Published in: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

    Journal ref: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 2019, pp. 6665-6669

  34. arXiv:2405.10522  [pdf, other

    cond-mat.mes-hall

    Fluctuations in Spin Dynamics Excited by Pulsed Light

    Authors: Tetsuya Sato, Shinichi Watanabe, Mamoru Matsuo, Takeo Kato

    Abstract: We theoretically investigate nonequilibrium spin fluctuations in a ferromagnet induced by a light pulse. Using a Lindblad equation consistent with the Landau-Lifshitz-Gilbert equation, we compute the autocorrelation function of magnetization. Our analysis reveals that this function comprises both thermal and nonequilibrium components. To examine the latter in detail, we introduce a Fano factor sim… ▽ More

    Submitted 17 May, 2024; originally announced May 2024.

    Comments: 11 pages, 3 figures

  35. arXiv:2405.09838  [pdf, other

    cs.LG

    Unsupervised Work Behavior Pattern Extraction Based on Hierarchical Probabilistic Model

    Authors: Issei Saito, Tomoaki Nakamura, Toshiyuki Hatta, Wataru Fujita, Shintaro Watanabe, Shotaro Miwa

    Abstract: Evolving consumer demands and market trends have led to businesses increasingly embracing a production approach that prioritizes flexibility and customization. Consequently, factory workers must engage in tasks that are more complex than before. Thus, productivity depends on each worker's skills in assembling products. Therefore, analyzing the behavior of a worker is crucial for work improvement.… ▽ More

    Submitted 16 May, 2024; originally announced May 2024.

  36. Dynamical and Static Structure Factors in Hedgehog-Antihedgehog Order in Icosahedral 1/1 Approximant Crystal

    Authors: Shinji Watanabe

    Abstract: Recent discoveries of magnetic long-range orders in the icosahedral quasicrystal and topological magnetic structures on the icosahedron (IC) as the hedgehog state and the antihedgehog state have attracted great interest. Here, we report our theoretical analysis of the dynamical as well as static structure of the hedgehog-antihedgehog order in the 1/1 approximant crystal (AC). By constructing the e… ▽ More

    Submitted 6 May, 2024; originally announced May 2024.

    Comments: 8 pages, 6 figures

    Journal ref: Physical Review B Vol. 109, 184404 (2024)

  37. arXiv:2404.09385  [pdf, other

    eess.AS cs.CL eess.SP

    A Large-Scale Evaluation of Speech Foundation Models

    Authors: Shu-wen Yang, Heng-Jui Chang, Zili Huang, Andy T. Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi, Yist Y. Lin, Yung-Sung Chuang, Tzu-Hsien Huang, Wei-Cheng Tseng, Kushal Lakhotia, Shang-Wen Li, Abdelrahman Mohamed, Shinji Watanabe, Hung-yi Lee

    Abstract: The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific modeling and data annotation. This approach has proven crucial in the field of Natural Language Processing (NLP). However, the speech processing community lacks a similar setup to explore the paradigm systematically. In this work,… ▽ More

    Submitted 29 May, 2024; v1 submitted 14 April, 2024; originally announced April 2024.

    Comments: The extended journal version for SUPERB and SUPERB-SG. Published in IEEE/ACM TASLP. The Arxiv version is preferred

  38. arXiv:2404.08795  [pdf

    astro-ph.EP

    Microscale Hydrogen, Carbon, and Nitrogen Isotopic Diversity of Organic Matter in Asteroid Ryugu

    Authors: Larry R Nittler, Jens Barosch, Katherine Burgess, Rhonda M Stroud, Jianhua Wang, Hikaru Yabuta, Yuma Enokido, Megumi Matsumoto, Tomoki Nakamura, Yoko Kebukawa, Shohei Yamashita, Yoshio Takahashi, Laure Bejach, Lydie Bonal, George D Cody, Emmanuel Dartois, Alexandre Dazzi, Bradley De Gregorio, Ariane Deniset-Besseau, Jean Duprat, Cécile Engrand, Minako Hashiguchi, A. L. David Kilcoyne, Mutsumi Komatsu, Zita Martins , et al. (35 additional authors not shown)

    Abstract: We report the H, C, and N isotopic compositions of microscale (0.2 to 2$μ$m) organic matter in samples of asteroid Ryugu and the Orgueil CI carbonaceous chondrite. Three regolith particles of asteroid Ryugu, returned by the Hayabusa2 spacecraft, and several fragments of Orgueil were analyzed by NanoSIMS isotopic imaging. The isotopic distributions of the Ryugu samples from two different collection… ▽ More

    Submitted 12 April, 2024; originally announced April 2024.

    Comments: Accepted for publication in Earth and Planetary Science Letters. 8 figures (plus 3 supplementary figs), two tables

  39. arXiv:2404.08725  [pdf, other

    astro-ph.IM astro-ph.HE astro-ph.SR hep-ex

    Development of a data overflow protection system for Super-Kamiokande to maximize data from nearby supernovae

    Authors: M. Mori, K. Abe, Y. Hayato, K. Hiraide, K. Hosokawa, K. Ieki, M. Ikeda, J. Kameda, Y. Kanemura, R. Kaneshima, Y. Kashiwagi, Y. Kataoka, S. Miki, S. Mine, M. Miura, S. Moriyama, Y. Nakano, M. Nakahata, S. Nakayama, Y. Noguchi, K. Okamoto, K. Sato, H. Sekiya, H. Shiba, K. Shimizu , et al. (230 additional authors not shown)

    Abstract: Neutrinos from very nearby supernovae, such as Betelgeuse, are expected to generate more than ten million events over 10\,s in Super-Kamokande (SK). At such large event rates, the buffers of the SK analog-to-digital conversion board (QBEE) will overflow, causing random loss of data that is critical for understanding the dynamics of the supernova explosion mechanism. In order to solve this problem,… ▽ More

    Submitted 12 April, 2024; originally announced April 2024.

    Comments: 28 pages, 18 figures. Submitted to PTEP

  40. arXiv:2403.19207  [pdf, other

    eess.AS

    LV-CTC: Non-autoregressive ASR with CTC and latent variable models

    Authors: Yuya Fujita, Shinji Watanabe, Xuankai Chang, Takashi Maekaku

    Abstract: Non-autoregressive (NAR) models for automatic speech recognition (ASR) aim to achieve high accuracy and fast inference by simplifying the autoregressive (AR) generation process of conventional models. Connectionist temporal classification (CTC) is one of the key techniques used in NAR ASR models. In this paper, we propose a new model combining CTC and a latent variable model, which is one of the s… ▽ More

    Submitted 28 March, 2024; originally announced March 2024.

  41. arXiv:2403.13627  [pdf, other

    cond-mat.supr-con cond-mat.mtrl-sci cs.LG

    Efficient exploration of high-Tc superconductors by a gradient-based composition design

    Authors: Akihiro Fujii, Koji Shimizu, Satoshi Watanabe

    Abstract: We propose a material design method via gradient-based optimization on compositions, overcoming the limitations of traditional methods: exhaustive database searches and conditional generation models. It optimizes inputs via backpropagation, aligning the model's output closely with the target property and facilitating the discovery of unlisted materials and precise property determination. Our metho… ▽ More

    Submitted 20 March, 2024; originally announced March 2024.

  42. arXiv:2403.13169  [pdf, other

    cs.CL

    Wav2Gloss: Generating Interlinear Glossed Text from Speech

    Authors: Taiqi He, Kwanghee Choi, Lindia Tjuatja, Nathaniel R. Robinson, Jiatong Shi, Shinji Watanabe, Graham Neubig, David R. Mortensen, Lori Levin

    Abstract: Thousands of the world's languages are in danger of extinction--a tremendous threat to cultural identities and human language diversity. Interlinear Glossed Text (IGT) is a form of linguistic annotation that can support documentation and resource creation for these languages' communities. IGT typically consists of (1) transcriptions, (2) morphological segmentation, (3) glosses, and (4) free transl… ▽ More

    Submitted 5 June, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

    Comments: ACL 2024 camera ready version

  43. arXiv:2403.08619  [pdf, other

    hep-ex astro-ph.HE

    Measurements of the charge ratio and polarization of cosmic-ray muons with the Super-Kamiokande detector

    Authors: H. Kitagawa, T. Tada, K. Abe, C. Bronner, Y. Hayato, K. Hiraide, K. Hosokawa, K. Ieki, M. Ikeda, J. Kameda, Y. Kanemura, R. Kaneshima, Y. Kashiwagi, Y. Kataoka, S. Miki, S. Mine, M. Miura, S. Moriyama, Y. Nakano, M. Nakahata, S. Nakayama, Y. Noguchi, K. Okamoto, K. Sato, H. Sekiya , et al. (231 additional authors not shown)

    Abstract: We present the results of the charge ratio ($R$) and polarization ($P^μ_{0}$) measurements using the decay electron events collected from 2008 September to 2022 June by the Super-Kamiokande detector. Because of its underground location and long operation, we performed high precision measurements by accumulating cosmic-ray muons. We measured the muon charge ratio to be $R=1.32 \pm 0.02$… ▽ More

    Submitted 13 March, 2024; originally announced March 2024.

    Comments: 29 pages, 45 figures

  44. arXiv:2403.07610  [pdf

    astro-ph.IM astro-ph.SR

    FOXSI-2: Upgrades of the Focusing Optics X-ray Solar Imager for its Second Flight

    Authors: Steven Christe, Lindsay Glesener, Camilo Buitrago-Casas, Shin-Nosuke Ishikawa, Brian Ramsey, Mikhail Gubarev, Kiranmayee Kilaru, Jeffery J. Kolodziejczak, Shin Watanabe, Tadayuki Takahashi, Hiroyasu Tajima, Paul Turin, Van Shourt, Natalie Foster, Sam Krucker

    Abstract: The Focusing Optics X-ray Solar Imager (FOXSI) sounding rocket payload flew for the second time on 2014 December 11. To enable direct Hard X-Ray (HXR) imaging spectroscopy, FOXSI makes use of grazing-incidence replicated focusing optics combined with fine-pitch solid-state detectors. FOXSI's first flight provided the first HXR focused images of the Sun. For FOXSI's second flight several updates we… ▽ More

    Submitted 12 March, 2024; originally announced March 2024.

    Comments: 12 pages, 10 figures, 3 tables

    Report number: Vol. 5, No. 1 (2016) 1640005 (12 pages)

    Journal ref: Journal of Astronomical Instrumentation, 2016, Volume 05, Number 01, 1640005

  45. arXiv:2403.05887  [pdf, other

    eess.AS

    Aligning Speech to Languages to Enhance Code-switching Speech Recognition

    Authors: Hexin Liu, Xiangyu Zhang, Leibny Paola Garcia, Andy W. H. Khong, Eng Siong Chng, Shinji Watanabe

    Abstract: Code-switching (CS) refers to the switching of languages within a speech signal and results in language confusion for automatic speech recognition (ASR). To address language confusion, we propose the language alignment loss that performs frame-level language identification using pseudo language labels learned from the ASR decoder. This eliminates the need for frame-level language annotations. To f… ▽ More

    Submitted 9 March, 2024; originally announced March 2024.

    Comments: Manuscript submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

  46. arXiv:2403.01888  [pdf, other

    cs.AI cs.LG

    Fast Benchmarking of Asynchronous Multi-Fidelity Optimization on Zero-Cost Benchmarks

    Authors: Shuhei Watanabe, Neeratyoy Mallik, Edward Bergman, Frank Hutter

    Abstract: While deep learning has celebrated many successes, its results often hinge on the meticulous selection of hyperparameters (HPs). However, the time-consuming nature of deep learning training makes HP optimization (HPO) a costly endeavor, slowing down the development of efficient HPO tools. While zero-cost benchmarks, which provide performance and runtime without actual training, offer a solution fo… ▽ More

    Submitted 17 April, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

    Comments: Submitted to AutoML Conference 2024 ABCD Track

  47. arXiv:2402.19008  [pdf, ps, other

    nucl-th

    Investigation of the determination of nuclear deformation using high-energy heavy-ion scattering

    Authors: Shin Watanabe, Takenori Furumoto, Wataru Horiuchi, Tadahiro Suhara, Yasutaka Taniguchi

    Abstract: Background: Nuclear deformation provides a crucial characteristic of nuclear structure. Conventionally, the quadrupole deformation length of a nucleus, $δ_{2}$, has often been determined based on a macroscopic model through a deformed nuclear potential with the deformation length $δ^{\rm (pot)}_{2}$, which is determined to reproduce the nuclear scattering data. This approach assumes… ▽ More

    Submitted 29 February, 2024; originally announced February 2024.

    Comments: 6 pages, 4 figures

    Report number: NITEP 200

  48. arXiv:2402.16021  [pdf, other

    cs.CL cs.AI cs.CV eess.AS

    TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages

    Authors: Minsu Kim, Jee-weon Jung, Hyeongseop Rha, Soumi Maiti, Siddhant Arora, Xuankai Chang, Shinji Watanabe, Yong Man Ro

    Abstract: The capability to jointly process multi-modal information is becoming an essential task. However, the limited number of paired multi-modal data and the large computational requirements in multi-modal learning hinder the development. We propose a novel Tri-Modal Translation (TMT) model that translates between arbitrary modalities spanning speech, image, and text. We introduce a novel viewpoint, whe… ▽ More

    Submitted 25 February, 2024; originally announced February 2024.

  49. arXiv:2402.13832  [pdf, ps, other

    nucl-th

    Uncovering the Sign of Nuclear Deformations: Prolate or Oblate Shape via Low-Energy $α$ Inelastic Scattering

    Authors: Shin Watanabe, Yoshiki Suzuki, Masaaki Kimura, Kazuyuki Ogata

    Abstract: Background: Understanding nuclear shape is a crucial problem in nuclear physics. In particular, determining the sign of quadrupole deformation, i.e., whether prolate or oblate, remains a challenging problem. Purpose: Our aim is to propose a method for determining the sign of quadrupole deformation using $α$ inelastic scattering data and to demonstrate its effectiveness. Method: Our approach is the… ▽ More

    Submitted 21 February, 2024; originally announced February 2024.

    Comments: 7 pages, 8 figures

  50. arXiv:2402.12654  [pdf, other

    cs.CL cs.SD eess.AS

    OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification

    Authors: Yifan Peng, Yui Sudo, Muhammad Shakeel, Shinji Watanabe

    Abstract: There has been an increasing interest in large speech models that can perform multiple tasks in a single model. Such models usually adopt an encoder-decoder or decoder-only architecture due to their popularity and good performance in many domains. However, autoregressive models can be slower during inference compared to non-autoregressive models and also have potential risks of hallucination. Thou… ▽ More

    Submitted 16 June, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

    Comments: Accepted at ACL 2024 main conference