Showing 1–50 of 347 results for author: Meng, H

  1. arXiv:2407.08551  [pdf, other]

    cs.CL cs.SD eess.AS

    Autoregressive Speech Synthesis without Vector Quantization

    Authors: Lingwei Meng, Long Zhou, Shujie Liu, Sanyuan Chen, Bing Han, Shujie Hu, Yanqing Liu, Jinyu Li, Sheng Zhao, Xixin Wu, Helen Meng, Furu Wei

    Abstract: We present MELLE, a novel continuous-valued tokens based language modeling approach for text to speech synthesis (TTS). MELLE autoregressively generates continuous mel-spectrogram frames directly from text condition, bypassing the need for vector quantization, which are originally designed for audio compression and sacrifice fidelity compared to mel-spectrograms. Specifically, (i) instead of cross…

    Submitted 11 July, 2024; originally announced July 2024.

  2. arXiv:2407.07810  [pdf, other]

    cs.LG cs.AI cs.CL

    Transformer Alignment in Large Language Models

    Authors: Murdock Aubry, Haoming Meng, Anton Sugolov, Vardan Papyan

    Abstract: Large Language Models (LLMs) have made significant strides in natural language processing, and a precise understanding of the internal mechanisms driving their success is essential. We regard LLMs as transforming embeddings via a discrete, coupled, nonlinear, dynamical system in high dimensions. This perspective motivates tracing the trajectories of individual tokens as they pass through transform…

    Submitted 10 July, 2024; originally announced July 2024.

  3. arXiv:2407.07518  [pdf, other]

    cs.CV

    Multi-modal Crowd Counting via a Broker Modality

    Authors: Haoliang Meng, Xiaopeng Hong, Chenhao Wang, Miao Shang, Wangmeng Zuo

    Abstract: Multi-modal crowd counting involves estimating crowd density from both visual and thermal/depth images. This task is challenging due to the significant gap between these distinct modalities. In this paper, we propose a novel approach by introducing an auxiliary broker modality and on this basis frame the task as a triple-modal learning problem. We devise a fusion-based method to generate this brok…

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: This is the preprint version of the paper and supplemental material to appear in ECCV 2024. Please cite the final published version. Code is available at https://github.com/HenryCilence/Broker-Modality-Crowd-Counting

  4. arXiv:2407.06310  [pdf, other]

    cs.SD cs.AI cs.HC cs.LG eess.AS

    Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation

    Authors: Mengzhe Geng, Xurong Xie, Jiajun Deng, Zengrui Jin, Guinan Li, Tianzi Wang, Shujie Hu, Zhaoqing Li, Helen Meng, Xunying Liu

    Abstract: The application of data-intensive automatic speech recognition (ASR) technologies to dysarthric and elderly adult speech is confronted by their mismatch against healthy and nonaged voices, data scarcity and large speaker-level variability. To this end, this paper proposes two novel data-efficient methods to learn homogeneous dysarthric and elderly speaker-level features for rapid, on-the-fly test-…

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: In submission to IEEE/ACM Transactions on Audio, Speech, and Language Processing

  5. arXiv:2407.01850  [pdf, other]

    cs.CL

    Purple-teaming LLMs with Adversarial Defender Training

    Authors: Jingyan Zhou, Kun Li, Junan Li, Jiawen Kang, Minda Hu, Xixin Wu, Helen Meng

    Abstract: Existing efforts in safeguarding LLMs are limited in actively exposing the vulnerabilities of the target LLM and readily adapting to newly emerging safety risks. To address this, we present Purple-teaming LLMs with Adversarial Defender training (PAD), a pipeline designed to safeguard LLMs by novelly incorporating the red-teaming (attack) and blue-teaming (safety training) techniques. In PAD, we au…

    Submitted 1 July, 2024; originally announced July 2024.

  6. Einstein-Podolsky-Rosen steering paradox "2=1'' for $N$ qubits

    Authors: Zhi-Jie Liu, Jie Zhou, Hui-Xian Meng, Xing-Yan Fan, Mi Xie, Fu-lin Zhang, Jing-Ling Chen

    Abstract: Einstein-Podolsky-Rosen (EPR) paradox highlights the absence of a local realistic explanation for quantum mechanics, and shows the incompatibility of the local-hidden-state models with quantum theory. For $N$-qubit states, or more importantly, the $N$-qubit mixed states, we present the EPR steering paradox in the form of the contradictory equality "2=1". We show that the contradiction holds for an…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: 12 pages, 0 figure

    Journal ref: Modern Physics Letters A Vol. 39, No. 9, 2450030 (2024)

  7. arXiv:2406.14092  [pdf, other]

    cs.CL eess.AS

    Seamless Language Expansion: Enhancing Multilingual Mastery in Self-Supervised Models

    Authors: Jing Xu, Minglin Wu, Xixin Wu, Helen Meng

    Abstract: Self-supervised (SSL) models have shown great performance in various downstream tasks. However, they are typically developed for limited languages, and may encounter new languages in real-world. Developing a SSL model for each new language is costly. Thus, it is vital to figure out how to efficiently adapt existed SSL models to a new language without impairing its original abilities. We propose ad…

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  8. arXiv:2406.13963  [pdf, ps, other]

    cs.CV

    SSAD: Self-supervised Auxiliary Detection Framework for Panoramic X-ray based Dental Disease Diagnosis

    Authors: Zijian Cai, Xinquan Yang, Xuguang Li, Xiaoling Luo, Xuechen Li, Linlin Shen, He Meng, Yongqiang Deng

    Abstract: Panoramic X-ray is a simple and effective tool for diagnosing dental diseases in clinical practice. When deep learning models are developed to assist dentist in interpreting panoramic X-rays, most of their performance suffers from the limited annotated data, which requires dentist's expertise and a lot of time cost. Although self-supervised learning (SSL) has been proposed to address this challeng…

    Submitted 19 June, 2024; originally announced June 2024.

  9. arXiv:2406.12236  [pdf, other]

    eess.AS cs.SD eess.SP

    Binaural Selective Attention Model for Target Speaker Extraction

    Authors: Hanyu Meng, Qiquan Zhang, Xiangyu Zhang, Vidhyasaharan Sethu, Eliathamby Ambikairajah

    Abstract: The remarkable ability of humans to selectively focus on a target speaker in cocktail party scenarios is facilitated by binaural audio processing. In this paper, we present a binaural time-domain Target Speaker Extraction model based on the Filter-and-Sum Network (FaSNet). Inspired by human selective hearing, our proposed model introduces target speaker embedding into separators using a multi-head…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH2024

  10. arXiv:2406.10991  [pdf, other]

    cs.CL

    Adaptive Query Rewriting: Aligning Rewriters through Marginal Probability of Conversational Answers

    Authors: Tianhua Zhang, Kun Li, Hongyin Luo, Xixin Wu, James Glass, Helen Meng

    Abstract: Query rewriting is a crucial technique for passage retrieval in open-domain conversational question answering (CQA). It decontexualizes conversational queries into self-contained questions suitable for off-the-shelf retrievers. Existing methods attempt to incorporate retriever's preference during the training of rewriting models. However, these approaches typically rely on extensive annotations su…

    Submitted 16 June, 2024; originally announced June 2024.

  11. arXiv:2406.10152  [pdf, other]

    cs.SD eess.AS

    Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and Recognition

    Authors: Guinan Li, Jiajun Deng, Youjun Chen, Mengzhe Geng, Shujie Hu, Zhe Li, Zengrui Jin, Tianzi Wang, Xurong Xie, Helen Meng, Xunying Liu

    Abstract: This paper proposes joint speaker feature learning methods for zero-shot adaptation of audio-visual multichannel speech separation and recognition systems. xVector and ECAPA-TDNN speaker encoders are connected using purpose-built fusion blocks and tightly integrated with the complete system training. Experiments conducted on LRS3-TED data simulated multichannel overlapped speech suggest that joint…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  12. arXiv:2406.10056  [pdf, other]

    cs.SD eess.AS

    UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner

    Authors: Dongchao Yang, Haohan Guo, Yuanyuan Wang, Rongjie Huang, Xiang Li, Xu Tan, Xixin Wu, Helen Meng

    Abstract: The Large Language models (LLMs) have demonstrated supreme capabilities in text understanding and generation, but cannot be directly applied to cross-modal tasks without fine-tuning. This paper proposes a cross-modal in-context learning approach, empowering the frozen LLMs to achieve multiple audio tasks in a few-shot style without any parameter update. Specifically, we propose a novel and LLMs-dr…

    Submitted 14 June, 2024; originally announced June 2024.

  13. arXiv:2406.10034  [pdf, other]

    cs.SD cs.AI eess.AS

    Towards Effective and Efficient Non-autoregressive Decoding Using Block-based Attention Mask

    Authors: Tianzi Wang, Xurong Xie, Zhaoqing Li, Shoukang Hu, Zengrui Jing, Jiajun Deng, Mingyu Cui, Shujie Hu, Mengzhe Geng, Guinan Li, Helen Meng, Xunying Liu

    Abstract: This paper proposes a novel non-autoregressive (NAR) block-based Attention Mask Decoder (AMD) that flexibly balances performance-efficiency trade-offs for Conformer ASR systems. AMD performs parallel NAR inference within contiguous blocks of output labels that are concealed using attention masks, while conducting left-to-right AR prediction and history context amalgamation between blocks. A beam s…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: 5 pages, 2 figures, 2 tables, Interspeech24 conference

  14. arXiv:2406.08336  [pdf, other]

    cs.SD cs.CV eess.AS

    CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction

    Authors: Xueyuan Chen, Dongchao Yang, Dingdong Wang, Xixin Wu, Zhiyong Wu, Helen Meng

    Abstract: Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech. It still suffers from low speaker similarity and poor prosody naturalness. In this paper, we propose a multi-modal DSR model by leveraging neural codec language modeling to improve the reconstruction results, especially for the speaker similarity and prosody naturalness. Our proposed model consists of: (…

    Submitted 24 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  15. arXiv:2406.08275  [pdf]

    cond-mat.mtrl-sci

    An accurate and transferable machine learning interatomic potential for equimolar and non-equimolar high-entropy diborides

    Authors: Hong Meng, Yiwen Liu, Hulei Yu, Lei Zhuang, Yanhui Chu

    Abstract: Machine learning interatomic potentials have become a powerful tool to achieve molecular dynamics (MD) simulations with the accuracy of ab initio methods while beyond their length and timescale limitations. Here, we develop an efficient moment tensor potential (MTP) for high-entropy diborides (HEBs) based on unary and binary diborides with Ti-V-Cr-Zr-Nb-Mo-Hf-Ta-W principal elements. Notably, the…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: 17 pages, 3 figures

  16. arXiv:2406.08243  [pdf]

    cond-mat.mtrl-sci

    An efficient strategy to construct general machine learning potentials for high-entropy ceramics

    Authors: Yiwen Liu, Hong Meng, Zijie Zhu, Hulei Yu, Lei Zhuang, Yanhui Chu

    Abstract: Molecular dynamics (MD) simulations are considered an efficient and low-cost means to develop remarkable properties of high-entropy ceramics with vast composition space, yet the lack of general potentials severely limits their applications. Herein, taking high-entropy carbides (HECs) as the model, we propose a strategy to efficiently construct a general neuroevolution potential (NEP) with broad co…

    Submitted 18 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

    Comments: 28 pages, 6 figures

  17. arXiv:2406.06326  [pdf, other]

    cs.CL

    Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching

    Authors: Xiaoying Zhang, Baolin Peng, Ye Tian, Jingyan Zhou, Yipeng Zhang, Haitao Mi, Helen Meng

    Abstract: Large language models (LLMs) often struggle to provide up-to-date information due to their one-time training and the constantly evolving nature of the world. To keep LLMs current, existing approaches typically involve continued pre-training on new documents. However, they frequently face difficulties in extracting stored knowledge. Motivated by the remarkable success of the Feynman Technique in ef…

    Submitted 15 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

    Comments: 30 pages

  18. arXiv:2406.05358  [pdf, other]

    cs.LG math.OC

    Reinforcement Learning for Intensity Control: An Application to Choice-Based Network Revenue Management

    Authors: Huiling Meng, Ningyuan Chen, Xuefeng Gao

    Abstract: Intensity control is a type of continuous-time dynamic optimization problems with many important applications in Operations Research including queueing and revenue management. In this study, we adapt the reinforcement learning framework to intensity control using choice-based network revenue management as a case study, which is a classical problem in revenue management that features a large state…

    Submitted 8 June, 2024; originally announced June 2024.

  19. arXiv:2406.02940  [pdf, other]

    cs.SD eess.AS

    Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder

    Authors: Haohan Guo, Fenglong Xie, Dongchao Yang, Hui Lu, Xixin Wu, Helen Meng

    Abstract: VQ-VAE, as a mainstream approach of speech tokenizer, has been troubled by ``index collapse'', where only a small number of codewords are activated in large codebooks. This work proposes product-quantized (PQ) VAE with more codebooks but fewer codewords to address this problem and build large-codebook speech tokenizers. It encodes speech features into multiple VQ subspaces and composes them into c…

    Submitted 5 June, 2024; originally announced June 2024.

  20. arXiv:2406.02328  [pdf, other]

    cs.SD eess.AS

    SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models

    Authors: Dongchao Yang, Dingdong Wang, Haohan Guo, Xueyuan Chen, Xixin Wu, Helen Meng

    Abstract: In this study, we propose a simple and efficient Non-Autoregressive (NAR) text-to-speech (TTS) system based on diffusion, named SimpleSpeech. Its simpleness shows in three aspects: (1) It can be trained on the speech-only dataset, without any alignment information; (2) It directly takes plain text as input and generates speech through an NAR way; (3) It tries to model speech in a finite and compac…

    Submitted 14 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: Accepted by InterSpeech 2024

  21. arXiv:2405.16090  [pdf, other]

    cs.HC eess.SP

    EEG-DBNet: A Dual-Branch Network for Temporal-Spectral Decoding in Motor-Imagery Brain-Computer Interfaces

    Authors: Xicheng Lou, Xinwei Li, Hongying Meng, Jun Hu, Meili Xu, Yue Zhao, Jiazhang Yang, Zhangyong Li

    Abstract: Motor imagery electroencephalogram (EEG)-based brain-computer interfaces (BCIs) offer significant advantages for individuals with restricted limb mobility. However, challenges such as low signal-to-noise ratio and limited spatial resolution impede accurate feature extraction from EEG signals, thereby affecting the classification accuracy of different actions. To address these challenges, this stud…

    Submitted 19 June, 2024; v1 submitted 25 May, 2024; originally announced May 2024.

  22. arXiv:2405.05758  [pdf, other]

    cs.HC cs.CL cs.CY

    Exploring the Potential of Human-LLM Synergy in Advancing Qualitative Analysis: A Case Study on Mental-Illness Stigma

    Authors: Han Meng, Yitian Yang, Yunan Li, Jungup Lee, Yi-Chieh Lee

    Abstract: Qualitative analysis is a challenging, yet crucial aspect of advancing research in the field of Human-Computer Interaction (HCI). Recent studies show that large language models (LLMs) can perform qualitative coding within existing schemes, but their potential for collaborative human-LLM discovery and new insight generation in qualitative analysis is still underexplored. To bridge this gap and adva…

    Submitted 9 May, 2024; originally announced May 2024.

    Comments: 55 pages

  23. arXiv:2405.00074  [pdf, other]

    cs.LG cs.SE

    PAODING: A High-fidelity Data-free Pruning Toolkit for Debloating Pre-trained Neural Networks

    Authors: Mark Huasong Meng, Hao Guan, Liuhuo Wan, Sin Gee Teo, Guangdong Bai, Jin Song Dong

    Abstract: We present PAODING, a toolkit to debloat pretrained neural network models through the lens of data-free pruning. To preserve the model fidelity, PAODING adopts an iterative process, which dynamically measures the effect of deleting a neuron to identify candidates that have the least impact to the output layer. Our evaluation shows that PAODING can significantly reduce the model size, generalize on…

    Submitted 30 April, 2024; originally announced May 2024.

    Comments: 3 pages

  24. arXiv:2404.09011  [pdf, other]

    cs.CV cs.LG

    PracticalDG: Perturbation Distillation on Vision-Language Models for Hybrid Domain Generalization

    Authors: Zining Chen, Weiqiu Wang, Zhicheng Zhao, Fei Su, Aidong Men, Hongying Meng

    Abstract: Domain Generalization (DG) aims to resolve distribution shifts between source and target domains, and current DG methods are default to the setting that data from source and target domains share identical categories. Nevertheless, there exists unseen classes from target domains in practical scenarios. To address this issue, Open Set Domain Generalization (OSDG) has emerged and several methods have…

    Submitted 13 April, 2024; originally announced April 2024.

    Comments: Accepted to CVPR2024

  25. arXiv:2404.06702  [pdf, other]

    eess.AS cs.SD eess.SP

    What is Learnt by the LEArnable Front-end (LEAF)? Adapting Per-Channel Energy Normalisation (PCEN) to Noisy Conditions

    Authors: Hanyu Meng, Vidhyasaharan Sethu, Eliathamby Ambikairajah

    Abstract: There is increasing interest in the use of the LEArnable Front-end (LEAF) in a variety of speech processing systems. However, there is a dearth of analyses of what is actually learnt and the relative importance of training the different components of the front-end. In this paper, we investigate this question on keyword spotting, speech-based emotion recognition and language identification tasks an…

    Submitted 9 April, 2024; originally announced April 2024.

    Comments: Interspeech 2023 Proceeding

    Journal ref: Interspeech 2023

  26. arXiv:2403.16078  [pdf, other]

    cs.SD eess.AS

    Target Speech Extraction with Pre-trained AV-HuBERT and Mask-And-Recover Strategy

    Authors: Wenxuan Wu, Xueyuan Chen, Xixin Wu, Haizhou Li, Helen Meng

    Abstract: Audio-visual target speech extraction (AV-TSE) is one of the enabling technologies in robotics and many audio-visual applications. One of the challenges of AV-TSE is how to effectively utilize audio-visual synchronization information in the process. AV-HuBERT can be a useful pre-trained model for lip-reading, which has not been adopted by AV-TSE. In this paper, we would like to explore the way to…

    Submitted 24 March, 2024; originally announced March 2024.

    Comments: Accepted by IJCNN 2024

  27. arXiv:2403.09326  [pdf, other]

    cs.GR cs.AI

    HeadEvolver: Text to Head Avatars via Expressive and Attribute-Preserving Mesh Deformation

    Authors: Duotun Wang, Hengyu Meng, Zeyu Cai, Zhijing Shao, Qianxi Liu, Lin Wang, Mingming Fan, Xiaohang Zhan, Zeyu Wang

    Abstract: We present HeadEvolver, a novel framework to generate stylized head avatars from text guidance. HeadEvolver uses locally learnable mesh deformation from a template head mesh, producing high-quality digital assets for detail-preserving editing and animation. To tackle the challenges of lacking fine-grained and semantic-aware local shape control in global deformation through Jacobians, we introduce…

    Submitted 10 June, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

    Comments: 12 pages, 17 figures

    ACM Class: I.2.6; I.3.8

  28. arXiv:2403.05834  [pdf, other]

    cs.MM cs.SD eess.AS

    Enhancing Expressiveness in Dance Generation via Integrating Frequency and Music Style Information

    Authors: Qiaochu Huang, Xu He, Boshi Tang, Haolin Zhuang, Liyang Chen, Shuochen Gao, Zhiyong Wu, Haozhi Huang, Helen Meng

    Abstract: Dance generation, as a branch of human motion generation, has attracted increasing attention. Recently, a few works attempt to enhance dance expressiveness, which includes genre matching, beat alignment, and dance dynamics, from certain aspects. However, the enhancement is quite limited as they lack comprehensive consideration of the aforementioned three factors. In this paper, we propose Expressi…

    Submitted 9 March, 2024; originally announced March 2024.

  29. arXiv:2403.00606  [pdf, other]

    cs.CV

    Flattening Singular Values of Factorized Convolution for Medical Images

    Authors: Zexin Feng, Na Zeng, Jiansheng Fang, Xingyue Wang, Xiaoxi Lu, Heng Meng, Jiang Liu

    Abstract: Convolutional neural networks (CNNs) have long been the paradigm of choice for robust medical image processing (MIP). Therefore, it is crucial to effectively and efficiently deploy CNNs on devices with different computing capabilities to support computer-aided diagnosis. Many methods employ factorized convolutional layers to alleviate the burden of limited computational resources at the expense of…

    Submitted 1 March, 2024; originally announced March 2024.

  30. arXiv:2402.10642  [pdf, other]

    eess.AS cs.AI

    Speaking in Wavelet Domain: A Simple and Efficient Approach to Speed up Speech Diffusion Model

    Authors: Xiangyu Zhang, Daijiao Liu, Hexin Liu, Qiquan Zhang, Hanyu Meng, Leibny Paola Garcia, Eng Siong Chng, Lina Yao

    Abstract: Recently, Denoising Diffusion Probabilistic Models (DDPMs) have attained leading performances across a diverse range of generative tasks. However, in the field of speech synthesis, although DDPMs exhibit impressive performance, their long training duration and substantial inference costs hinder practical deployment. Existing approaches primarily focus on enhancing inference speed, while approaches…

    Submitted 16 February, 2024; originally announced February 2024.

  31. arXiv:2402.09267  [pdf, other]

    cs.CL cs.AI

    Self-Alignment for Factuality: Mitigating Hallucinations in LLMs via Self-Evaluation

    Authors: Xiaoying Zhang, Baolin Peng, Ye Tian, Jingyan Zhou, Lifeng Jin, Linfeng Song, Haitao Mi, Helen Meng

    Abstract: Despite showing increasingly human-like abilities, large language models (LLMs) often struggle with factual inaccuracies, i.e. "hallucinations", even when they hold relevant knowledge. To address these hallucinations, current approaches typically necessitate high-quality human factuality annotations. In this work, we explore Self-Alignment for Factuality, where we leverage the self-evaluation capa…

    Submitted 11 June, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

    Comments: 20 pages

    Journal ref: ACL2024 Main

  32. arXiv:2402.06171  [pdf, other]

    cs.LG

    Pushing Boundaries: Mixup's Influence on Neural Collapse

    Authors: Quinn Fisher, Haoming Meng, Vardan Papyan

    Abstract: Mixup is a data augmentation strategy that employs convex combinations of training instances and their respective labels to augment the robustness and calibration of deep neural networks. Despite its widespread adoption, the nuanced mechanisms that underpin its success are not entirely understood. The observed phenomenon of Neural Collapse, where the last-layer activations and classifier of deep n…

    Submitted 8 February, 2024; originally announced February 2024.

    Comments: Published as a conference paper at the International Conference on Learning Representations (ICLR 2024)

  33. arXiv:2402.03494  [pdf, other]

    cs.AI cs.RO

    Beyond Text: Utilizing Vocal Cues to Improve Decision Making in LLMs for Robot Navigation Tasks

    Authors: Xingpeng Sun, Haoming Meng, Souradip Chakraborty, Amrit Singh Bedi, Aniket Bera

    Abstract: While LLMs excel in processing text in these human conversations, they struggle with the nuances of verbal instructions in scenarios like social navigation, where ambiguity and uncertainty can erode trust in robotic and other AI systems. We can address this shortcoming by moving beyond text and additionally focusing on the paralinguistic features of these audio responses. These features are the as…

    Submitted 23 April, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

    Comments: 28 pages, 7 figures

  34. arXiv:2401.17796  [pdf, other]

    cs.SD eess.AS

    Exploiting Audio-Visual Features with Pretrained AV-HuBERT for Multi-Modal Dysarthric Speech Reconstruction

    Authors: Xueyuan Chen, Yuejiao Wang, Xixin Wu, Disong Wang, Zhiyong Wu, Xunying Liu, Helen Meng

    Abstract: Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech by improving the intelligibility and naturalness. This is a challenging task especially for patients with severe dysarthria and speaking in complex, noisy acoustic environments. To address these challenges, we propose a novel multi-modal framework to utilize visual information, e.g., lip movements, in DSR…

    Submitted 31 January, 2024; originally announced January 2024.

    Comments: Accepted by ICASSP 2024

  35. arXiv:2401.14664  [pdf, other]

    cs.SD cs.CL eess.AS

    UNIT-DSR: Dysarthric Speech Reconstruction System Using Speech Unit Normalization

    Authors: Yuejiao Wang, Xixin Wu, Disong Wang, Lingwei Meng, Helen Meng

    Abstract: Dysarthric speech reconstruction (DSR) systems aim to automatically convert dysarthric speech into normal-sounding speech. The technology eases communication with speakers affected by the neuromotor disorder and enhances their social inclusion. NED-based (Neural Encoder-Decoder) systems have significantly improved the intelligibility of the reconstructed speech as compared with GAN-based (Generati…

    Submitted 26 January, 2024; originally announced January 2024.

    Comments: Accepted to ICASSP 2024

  36. arXiv:2401.13276  [pdf, other]

    eess.AS

    SCNet: Sparse Compression Network for Music Source Separation

    Authors: Weinan Tong, Jiaxu Zhu, Jun Chen, Shiyin Kang, Tao Jiang, Yang Li, Zhiyong Wu, Helen Meng

    Abstract: Deep learning-based methods have made significant achievements in music source separation. However, obtaining good results while maintaining a low model complexity remains challenging in super wide-band music source separation. Previous works either overlook the differences in subbands or inadequately address the problem of information loss when generating subband features. In this paper, we propo…

    Submitted 24 January, 2024; originally announced January 2024.

    Comments: Accepted by ICASSP 2024

  37. arXiv:2401.13225  [pdf, ps, other]

    hep-ex

    A New Look at the Scalar Meson $f_0(500)$ via $D^+\to π^+π^-\ell^+ν_\ell$ Decays

    Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai, X. Cai , et al. (615 additional authors not shown)

    Abstract: Using $2.93~\mathrm{fb}^{-1}$ of $e^+e^-$ collision data collected with the BESIII detector at the center-of-mass energy of 3.773 GeV, we investigate the semileptonic decays $D^+\to π^+π^- \ell^+ν_\ell$ ($\ell=e$ and $μ$). The $D^+\to f_0(500)μ^+ν_μ$ decay is observed for the first time. By analyzing simultaneously the differential decay rates of $D^+\to f_0(500) μ^+ν_μ$ and…

    Submitted 4 February, 2024; v1 submitted 23 January, 2024; originally announced January 2024.

    Comments: Supplemental Materials added in this version

    Report number: BAM-00660

  38. arXiv:2401.07532  [pdf, other]

    cs.SD cs.AI eess.AS

    Multi-view MidiVAE: Fusing Track- and Bar-view Representations for Long Multi-track Symbolic Music Generation

    Authors: Zhiwei Lin, Jun Chen, Boshi Tang, Binzhu Sha, Jing Yang, Yaolong Ju, Fan Fan, Shiyin Kang, Zhiyong Wu, Helen Meng

    Abstract: Variational Autoencoders (VAEs) constitute a crucial component of neural symbolic music generation, among which some works have yielded outstanding results and attracted considerable attention. Nevertheless, previous VAEs still encounter issues with overly long feature sequences and generated results lack contextual coherence, thus the challenge of modeling long multi-track symbolic music still re…

    Submitted 15 January, 2024; originally announced January 2024.

    Comments: Accepted by ICASSP 2024

  39. arXiv:2401.04152  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    Cross-Speaker Encoding Network for Multi-Talker Speech Recognition

    Authors: Jiawen Kang, Lingwei Meng, Mingyu Cui, Haohan Guo, Xixin Wu, Xunying Liu, Helen Meng

    Abstract: End-to-end multi-talker speech recognition has garnered great interest as an effective approach to directly transcribe overlapped speech from multiple speakers. Current methods typically adopt either 1) single-input multiple-output (SIMO) models with a branched encoder, or 2) single-input single-output (SISO) models based on attention-based encoder-decoder architecture with serialized output train…

    Submitted 8 January, 2024; originally announced January 2024.

    Comments: Accepted by ICASSP2024

  40. arXiv:2312.15567  [pdf, other]

    cs.HC

    Conversational Co-Speech Gesture Generation via Modeling Dialog Intention, Emotion, and Context with Diffusion Models

    Authors: Haiwei Xue, Sicheng Yang, Zhensong Zhang, Zhiyong Wu, Minglei Li, Zonghong Dai, Helen Meng

    Abstract: Audio-driven co-speech human gesture generation has made remarkable advancements recently. However, most previous works only focus on single person audio-driven gesture generation. We aim at solving the problem of conversational co-speech gesture generation that considers multiple participants in a conversation, which is a novel and challenging task due to the difficulty of simultaneously incorpor…

    Submitted 10 January, 2024; v1 submitted 24 December, 2023; originally announced December 2023.

    Comments: 5 pages, 2 figures, Accepted for publication at the 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024)

  41. arXiv:2312.15463  [pdf, other]

    eess.AS cs.SD

    Consistent and Relevant: Rethink the Query Embedding in General Sound Separation

    Authors: Yuanyuan Wang, Hangting Chen, Dongchao Yang, Jianwei Yu, Chao Weng, Zhiyong Wu, Helen Meng

    Abstract: The query-based audio separation usually employs specific queries to extract target sources from a mixture of audio signals. Currently, most query-based separation models need additional networks to obtain query embedding. In this way, separation model is optimized to be adapted to the distribution of query embedding. However, query embedding may exhibit mismatches with separation models due to in…

    Submitted 24 December, 2023; originally announced December 2023.

    Comments: Accepted by ICASSP 2024

  42. arXiv:2312.14184  [pdf]

    cs.CL cs.AI cs.LG

    Large Language Models in Medical Term Classification and Unexpected Misalignment Between Response and Reasoning

    Authors: Xiaodan Zhang, Sandeep Vemulapalli, Nabasmita Talukdar, Sumyeong Ahn, Jiankun Wang, Han Meng, Sardar Mehtab Bin Murtaza, Aakash Ajay Dave, Dmitry Leshchiner, Dimitri F. Joseph, Martin Witteveen-Lane, Dave Chesla, Jiayu Zhou, Bin Chen

    Abstract: This study assesses the ability of state-of-the-art large language models (LLMs) including GPT-3.5, GPT-4, Falcon, and LLaMA 2 to identify patients with mild cognitive impairment (MCI) from discharge summaries and examines instances where the models' responses were misaligned with their reasoning. Utilizing the MIMIC-IV v2.2 database, we focused on a cohort aged 65 and older, verifying MCI diagnos…

    Submitted 19 December, 2023; originally announced December 2023.

  43. arXiv:2312.13127  [pdf, other]

    eess.IV cs.CV

    Pixel-to-Abundance Translation: Conditional Generative Adversarial Networks Based on Patch Transformer for Hyperspectral Unmixing

    Authors: Li Wang, Xiaohua Zhang, Longfei Li, Hongyun Meng, Xianghai Cao

    Abstract: Spectral unmixing is a significant challenge in hyperspectral image processing. Existing unmixing methods utilize prior knowledge about the abundance distribution to solve the regularization optimization problem, where the difficulty lies in choosing appropriate prior knowledge and solving the complex regularization optimization problem. To solve these problems, we propose a hyperspectral conditio…

    Submitted 20 December, 2023; originally announced December 2023.

  44. arXiv:2312.12181  [pdf, other]

    cs.SD cs.AI eess.AS

    StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based Pre-training for Expressive Audiobook Speech Synthesis

    Authors: Xueyuan Chen, Xi Wang, Shaofei Zhang, Lei He, Zhiyong Wu, Xixin Wu, Helen Meng

    Abstract: The expressive quality of synthesized speech for audiobooks is limited by generalized model architecture and unbalanced style distribution in the training data. To address these issues, in this paper, we propose a self-supervised style enhancing method with VQ-VAE-based pre-training for expressive audiobook speech synthesis. Firstly, a text style encoder is pre-trained with a large amount of unlab…

    Submitted 19 December, 2023; originally announced December 2023.

    Comments: Accepted to ICASSP 2024

  45. arXiv:2312.11858  [pdf, other]

    cs.LG cs.SI

    SimCalib: Graph Neural Network Calibration based on Similarity between Nodes

    Authors: Boshi Tang, Zhiyong Wu, Xixin Wu, Qiaochu Huang, Jun Chen, Shun Lei, Helen Meng

    Abstract: Graph neural networks (GNNs) have exhibited impressive performance in modeling graph data as exemplified in various applications. Recently, the GNN calibration problem has attracted increasing attention, especially in cost-sensitive scenarios. Previous work has gained empirical insights on the issue, and devised effective approaches for it, but theoretical supports still fall short. In this work,…

    Submitted 18 December, 2023; originally announced December 2023.

  46. arXiv:2312.10899  [pdf, other]

    cs.CV

    MagicScroll: Nontypical Aspect-Ratio Image Generation for Visual Storytelling via Multi-Layered Semantic-Aware Denoising

    Authors: Bingyuan Wang, Hengyu Meng, Zeyu Cai, Lanjiong Li, Yue Ma, Qifeng Chen, Zeyu Wang

    Abstract: Visual storytelling often uses nontypical aspect-ratio images like scroll paintings, comic strips, and panoramas to create an expressive and compelling narrative. While generative AI has achieved great success and shown the potential to reshape the creative industry, it remains a challenge to generate coherent and engaging content with arbitrary size and controllable style, concept, and layout, al…

    Submitted 17 December, 2023; originally announced December 2023.

    Comments: Project page: https://magicscroll.github.io/

  47. arXiv:2312.04919  [pdf, other]

    cs.SD eess.AS

    Neural Concatenative Singing Voice Conversion: Rethinking Concatenation-Based Approach for One-Shot Singing Voice Conversion

    Authors: Binzhu Sha, Xu Li, Zhiyong Wu, Ying Shan, Helen Meng

    Abstract: Any-to-any singing voice conversion (SVC) is confronted with the challenge of ``timbre leakage'' issue caused by inadequate disentanglement between the content and the speaker timbre. To address this issue, this study introduces NeuCoSVC, a novel neural concatenative SVC framework. It consists of a self-supervised learning (SSL) representation extractor, a neural harmonic signal generator, and a w…

    Submitted 8 January, 2024; v1 submitted 8 December, 2023; originally announced December 2023.

  48. Injecting linguistic knowledge into BERT for Dialogue State Tracking

    Authors: Xiaohan Feng, Xixin Wu, Helen Meng

    Abstract: Dialogue State Tracking (DST) models often employ intricate neural network architectures, necessitating substantial training data, and their inference process lacks transparency. This paper proposes a method that extracts linguistic knowledge via an unsupervised framework and subsequently utilizes this knowledge to augment BERT's performance and interpretability in DST tasks. The knowledge extract…

    Submitted 2 July, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

    Comments: Accepted for publication at IEEE Access

    Journal ref: IEEE Access 12 (2024) 93761-93770

  49. arXiv:2311.15577  [pdf, ps, other]

    cond-mat.supr-con cond-mat.mes-hall

    Adiabatic phase pumping in S/F/S hybrids with non-coplanar magnetization

    Authors: A. A. Kopasov, Zh. Devizorova, H. Meng, S. V. Mironov, A. S. Mel'nikov, A. I. Buzdin

    Abstract: We study the distinctive features of the phase pumping effect in Josephson transport through a three-layered ferromagnet F$_1$/F/F$_2$ with non-coplanar magnetization. Using Gor'kov and Bogoliubov-de Gennes formalisms we go beyond the quasiclassical approximation and analyze the dependence of the spontaneous Josephson phase $ψ$ on the exchange field $h$ in the F layer and details of magnetization…

    Submitted 27 November, 2023; originally announced November 2023.

    Comments: 16 pages, 9 figures

  50. arXiv:2311.03681  [pdf, ps, other]

    quant-ph

    The Iteration Formula of (n,2,d) Full-correlated Multi-component Bell Function and Its Applications

    Authors: Hui-Xian Meng, Yu Zhang, Xing-Yan Fan, Jie Zhou, Wei-Min Shang, Jing-Ling Chen

    Abstract: It is very difficult and important to construct Bell inequalities for n-partite, k-settings of measurement, and d-dimensional (n,k,d) systems. Inspired by the iteration formula form of the Mermin-Ardehali-Belinskiĭ-Klyshko (MABK) inequality, we generalize the multi-component correlation functions for bipartite d-dimensional systems to n-partite ones, and construct the corresponding Bell inequali…

    Submitted 6 November, 2023; originally announced November 2023.