Skip to main content

Showing 1–50 of 846 results for author: Zhao, S

  1. arXiv:2407.12592  [pdf, other

    cs.CV

    VegeDiff: Latent Diffusion Model for Geospatial Vegetation Forecasting

    Authors: Sijie Zhao, Hao Chen, Xueliang Zhang, Pengfeng Xiao, Lei Bai, Wanli Ouyang

    Abstract: In the context of global climate change and frequent extreme weather events, forecasting future geospatial vegetation states under these conditions is of significant importance. The vegetation change process is influenced by the complex interplay between dynamic meteorological variables and static environmental variables, leading to high levels of uncertainty. Existing deterministic methods are in… ▽ More

    Submitted 17 July, 2024; originally announced July 2024.

    Comments: 15 pages, 8 figures

  2. arXiv:2407.12415  [pdf, other

    cs.LG

    Not All Frequencies Are Created Equal:Towards a Dynamic Fusion of Frequencies in Time-Series Forecasting

    Authors: Xingyu Zhang, Siyu Zhao, Zeen Song, Huijie Guo, Jianqi Zhang, Changwen Zheng, Wenwen Qiang

    Abstract: Long-term time series forecasting is a long-standing challenge in various applications. A central issue in time series forecasting is that methods should expressively capture long-term dependency. Furthermore, time series forecasting methods should be flexible when applied to different scenarios. Although Fourier analysis offers an alternative to effectively capture reusable and periodic patterns… ▽ More

    Submitted 18 July, 2024; v1 submitted 17 July, 2024; originally announced July 2024.

    Comments: Accpeted by ACMMM2024

  3. Demonstrating PilotAR: A Tool to Assist Wizard-of-Oz Pilot Studies with OHMD

    Authors: Nuwan Janaka, Runze Cai, Shengdong Zhao, David Hsu

    Abstract: While pilot studies help to identify potential interesting research directions, the additional requirements in AR/MR make it challenging to conduct quick and dirty pilot studies efficiently with Optical See-Through Head-Mounted Displays (OST HMDs, OHMDs). To overcome these challenges, including the inability to observe and record in-context user interactions, increased task load, and difficulties… ▽ More

    Submitted 17 July, 2024; originally announced July 2024.

    Comments: 10 pages, 5 figures, 1 table

    Journal ref: UbiComp Companion (2024)

  4. arXiv:2407.12229  [pdf, other

    eess.AS cs.AI eess.SP

    Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech

    Authors: Haibin Wu, Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Daniel Tompkins, Chung-Hsien Tsai, Canrun Li, Zhen Xiao, Sheng Zhao, Jinyu Li, Naoyuki Kanda

    Abstract: People change their tones of voice, often accompanied by nonverbal vocalizations (NVs) such as laughter and cries, to convey rich emotions. However, most text-to-speech (TTS) systems lack the capability to generate speech with rich emotions, including NVs. This paper introduces EmoCtrl-TTS, an emotion-controllable zero-shot TTS that can generate highly emotional speech with NVs for any speaker. Em… ▽ More

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: See https://aka.ms/emoctrl-tts for demo samples

  5. arXiv:2407.12019  [pdf, other

    cs.CL cs.AI

    DIM: Dynamic Integration of Multimodal Entity Linking with Large Language Model

    Authors: Shezheng Song, Shasha Li, Jie Yu, Shan Zhao, Xiaopeng Li, Jun Ma, Xiaodong Liu, Zhuo Li, Xiaoguang Mao

    Abstract: Our study delves into Multimodal Entity Linking, aligning the mention in multimodal information with entities in knowledge base. Existing methods are still facing challenges like ambiguous entity representations and limited image information utilization. Thus, we propose dynamic entity extraction using ChatGPT, which dynamically extracts entities and enhances datasets. We also propose a method: Dy… ▽ More

    Submitted 27 June, 2024; originally announced July 2024.

    Comments: Published on PRCV24

  6. arXiv:2407.11921  [pdf, other

    cs.CV cs.CR

    IPA-NeRF: Illusory Poisoning Attack Against Neural Radiance Fields

    Authors: Wenxiang Jiang, Hanwei Zhang, Shuo Zhao, Zhongwen Guo, Hao Wang

    Abstract: Neural Radiance Field (NeRF) represents a significant advancement in computer vision, offering implicit neural network-based scene representation and novel view synthesis capabilities. Its applications span diverse fields including robotics, urban mapping, autonomous navigation, virtual reality/augmented reality, etc., some of which are considered high-risk AI applications. However, despite its wi… ▽ More

    Submitted 18 July, 2024; v1 submitted 16 July, 2024; originally announced July 2024.

  7. arXiv:2407.11044  [pdf, other

    cs.LG cs.AI

    Generalizing soft actor-critic algorithms to discrete action spaces

    Authors: Le Zhang, Yong Gu, Xin Zhao, Yanshuo Zhang, Shu Zhao, Yifei Jin, Xinxin Wu

    Abstract: ATARI is a suite of video games used by reinforcement learning (RL) researchers to test the effectiveness of the learning algorithm. Receiving only the raw pixels and the game score, the agent learns to develop sophisticated strategies, even to the comparable level of a professional human games tester. Ideally, we also want an agent requiring very few interactions with the environment. Previous co… ▽ More

    Submitted 7 July, 2024; originally announced July 2024.

    Comments: Chinese Conference on Pattern Recognition and Computer Vision (PRCV) 2024. GitHub Repo https://github.com/lezhang-thu/bigger-better-faster-SAC

  8. arXiv:2407.10943  [pdf, other

    cs.RO cs.CV

    GRUtopia: Dream General Robots in a City at Scale

    Authors: Hanqing Wang, Jiahe Chen, Wensi Huang, Qingwei Ben, Tai Wang, Boyu Mi, Tao Huang, Siheng Zhao, Yilun Chen, Sizhe Yang, Peizhou Cao, Wenye Yu, Zichao Ye, Jialun Li, Junfeng Long, Zirui Wang, Huiling Wang, Ying Zhao, Zhongying Tu, Yu Qiao, Dahua Lin, Jiangmiao Pang

    Abstract: Recent works have been exploring the scaling laws in the field of Embodied AI. Given the prohibitive costs of collecting real-world data, we believe the Simulation-to-Real (Sim2Real) paradigm is a crucial step for scaling the learning of embodied models. This paper introduces project GRUtopia, the first simulated interactive 3D society designed for various robots. It features several advancements:… ▽ More

    Submitted 15 July, 2024; originally announced July 2024.

  9. arXiv:2407.10876  [pdf, other

    cs.CV

    RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception

    Authors: Chunliang Li, Wencheng Han, Junbo Yin, Sanyuan Zhao, Jianbing Shen

    Abstract: Concurrent processing of multiple autonomous driving 3D perception tasks within the same spatiotemporal scene poses a significant challenge, in particular due to the computational inefficiencies and feature competition between tasks when using traditional multi-task learning approaches. This paper addresses these issues by proposing a novel unified representation, RepVF, which harmonizes the repre… ▽ More

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  10. arXiv:2407.10833  [pdf, other

    eess.IV cs.CV

    MoE-DiffIR: Task-customized Diffusion Priors for Universal Compressed Image Restoration

    Authors: Yulin Ren, Xin Li, Bingchen Li, Xingrui Wang, Mengxi Guo, Shijie Zhao, Li Zhang, Zhibo Chen

    Abstract: We present MoE-DiffIR, an innovative universal compressed image restoration (CIR) method with task-customized diffusion priors. This intends to handle two pivotal challenges in the existing CIR methods: (i) lacking adaptability and universality for different image codecs, e.g., JPEG and WebP; (ii) poor texture generation capability, particularly at low bitrates. Specifically, our MoE-DiffIR develo… ▽ More

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  11. arXiv:2407.09029  [pdf, other

    cs.MM cs.CV cs.SD eess.AS

    Enhancing Emotion Recognition in Incomplete Data: A Novel Cross-Modal Alignment, Reconstruction, and Refinement Framework

    Authors: Haoqin Sun, Shiwan Zhao, Shaokai Li, Xiangyu Kong, Xuechen Wang, Aobo Kong, Jiaming Zhou, Yong Chen, Wenjia Zeng, Yong Qin

    Abstract: Multimodal emotion recognition systems rely heavily on the full availability of modalities, suffering significant performance declines when modal data is incomplete. To tackle this issue, we present the Cross-Modal Alignment, Reconstruction, and Refinement (CM-ARR) framework, an innovative approach that sequentially engages in cross-modal alignment, reconstruction, and refinement phases to handle… ▽ More

    Submitted 12 July, 2024; originally announced July 2024.

  12. arXiv:2407.08995  [pdf, other

    cs.CL

    Self-Prompt Tuning: Enable Autonomous Role-Playing in LLMs

    Authors: Aobo Kong, Shiwan Zhao, Hao Chen, Qicheng Li, Yong Qin, Ruiqi Sun, Xin Zhou, Jiaming Zhou, Haoqin Sun

    Abstract: Recent advancements in LLMs have showcased their remarkable role-playing capabilities, able to accurately simulate the dialogue styles and cognitive processes of various roles based on different instructions and contexts. Studies indicate that assigning LLMs the roles of experts, a strategy known as role-play prompting, can enhance their performance in the corresponding domains. However, the promp… ▽ More

    Submitted 12 July, 2024; originally announced July 2024.

  13. arXiv:2407.08551  [pdf, other

    cs.CL cs.SD eess.AS

    Autoregressive Speech Synthesis without Vector Quantization

    Authors: Lingwei Meng, Long Zhou, Shujie Liu, Sanyuan Chen, Bing Han, Shujie Hu, Yanqing Liu, Jinyu Li, Sheng Zhao, Xixin Wu, Helen Meng, Furu Wei

    Abstract: We present MELLE, a novel continuous-valued tokens based language modeling approach for text to speech synthesis (TTS). MELLE autoregressively generates continuous mel-spectrogram frames directly from text condition, bypassing the need for vector quantization, which are originally designed for audio compression and sacrifice fidelity compared to mel-spectrograms. Specifically, (i) instead of cross… ▽ More

    Submitted 11 July, 2024; originally announced July 2024.

  14. arXiv:2407.08501  [pdf, other

    cs.HC

    SelfIE: Self-Initiated Explorable Instructions Towards Enhanced User Experience

    Authors: Hyeongcheol Kim, Katherine Fennedy, Georgia Zhang, Can Liu, Shengdong Zhao

    Abstract: Given the widespread use of procedural instructions with non-linear access (situational information retrieval), there has been a proposal to accommodate both linear and non-linear usage in instructional design. However, it has received inadequate scholarly attention, leading to limited exploration. This paper introduces Self-Initiated Explorable (SelfIE) instructions, a new design concept aiming a… ▽ More

    Submitted 10 July, 2024; originally announced July 2024.

  15. arXiv:2407.07077  [pdf, other

    cs.CV cs.AI

    ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction

    Authors: Shaozhe Hao, Kai Han, Zhengyao Lv, Shihao Zhao, Kwan-Yee K. Wong

    Abstract: While personalized text-to-image generation has enabled the learning of a single concept from multiple images, a more practical yet challenging scenario involves learning multiple concepts within a single image. However, existing works tackling this scenario heavily rely on extensive human annotations. In this paper, we introduce a novel task named Unsupervised Concept Extraction (UCE) that consid… ▽ More

    Submitted 9 July, 2024; originally announced July 2024.

    Comments: ECCV 2024, Project page: https://haoosz.github.io/ConceptExpress/

  16. arXiv:2407.05934  [pdf, other

    cs.LG cs.AI

    Graph Anomaly Detection with Noisy Labels by Reinforcement Learning

    Authors: Zhu Wang, Shuang Zhou, Junnan Dong, Chang Yang, Xiao Huang, Shengjie Zhao

    Abstract: Graph anomaly detection (GAD) has been widely applied in many areas, e.g., fraud detection in finance and robot accounts in social networks. Existing methods are dedicated to identifying the outlier nodes that deviate from normal ones. While they heavily rely on high-quality annotation, which is hard to obtain in real-world scenarios, this could lead to severely degraded performance based on noisy… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

  17. arXiv:2407.05758  [pdf, other

    eess.IV cs.AI cs.CV

    Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports

    Authors: Yutong Zhang, Yi Pan, Tianyang Zhong, Peixin Dong, Kangni Xie, Yuxiao Liu, Hanqi Jiang, Zhengliang Liu, Shijie Zhao, Tuo Zhang, Xi Jiang, Dinggang Shen, Tianming Liu, Xin Zhang

    Abstract: Medical images and radiology reports are crucial for diagnosing medical conditions, highlighting the importance of quantitative analysis for clinical decision-making. However, the diversity and cross-source heterogeneity of these data challenge the generalizability of current data-mining methods. Multimodal large language models (MLLMs) have recently transformed many domains, significantly affecti… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

  18. arXiv:2407.03245  [pdf, other

    cs.RO cs.AI eess.SY

    TieBot: Learning to Knot a Tie from Visual Demonstration through a Real-to-Sim-to-Real Approach

    Authors: Weikun Peng, Jun Lv, Yuwei Zeng, Haonan Chen, Siheng Zhao, Jichen Sun, Cewu Lu, Lin Shao

    Abstract: The tie-knotting task is highly challenging due to the tie's high deformation and long-horizon manipulation actions. This work presents TieBot, a Real-to-Sim-to-Real learning from visual demonstration system for the robots to learn to knot a tie. We introduce the Hierarchical Feature Matching approach to estimate a sequence of tie's meshes from the demonstration video. With these estimated meshes… ▽ More

    Submitted 3 July, 2024; v1 submitted 3 July, 2024; originally announced July 2024.

    Comments: fix few typos

  19. arXiv:2407.00596  [pdf, other

    eess.IV cs.CV

    HATs: Hierarchical Adaptive Taxonomy Segmentation for Panoramic Pathology Image Analysis

    Authors: Ruining Deng, Quan Liu, Can Cui, Tianyuan Yao, Juming Xiong, Shunxing Bao, Hao Li, Mengmeng Yin, Yu Wang, Shilin Zhao, Yucheng Tang, Haichun Yang, Yuankai Huo

    Abstract: Panoramic image segmentation in computational pathology presents a remarkable challenge due to the morphologically complex and variably scaled anatomy. For instance, the intricate organization in kidney pathology spans multiple layers, from regions like the cortex and medulla to functional units such as glomeruli, tubules, and vessels, down to various cell types. In this paper, we propose a novel… ▽ More

    Submitted 30 June, 2024; originally announced July 2024.

    Comments: arXiv admin note: text overlap with arXiv:2402.19286

  20. arXiv:2406.19247  [pdf, other

    cs.CV

    Local Manifold Learning for No-Reference Image Quality Assessment

    Authors: Timin Gao, Wensheng Pan, Yan Zhang, Sicheng Zhao, Shengchuan Zhang, Xiawu Zheng, Ke Li, Liujuan Cao, Rongrong Ji

    Abstract: Contrastive learning has considerably advanced the field of Image Quality Assessment (IQA), emerging as a widely adopted technique. The core mechanism of contrastive learning involves minimizing the distance between quality-similar (positive) examples while maximizing the distance between quality-dissimilar (negative) examples. Despite its successes, current contrastive learning methods often negl… ▽ More

    Submitted 27 June, 2024; originally announced June 2024.

  21. arXiv:2406.18485  [pdf, other

    cs.DC

    LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism

    Authors: Diandian Gu, Peng Sun, Qinghao Hu, Ting Huang, Xun Chen, Yingtong Xiong, Guoteng Wang, Qiaoling Chen, Shangchun Zhao, Jiarui Fang, Yonggang Wen, Tianwei Zhang, Xin Jin, Xuanzhe Liu

    Abstract: Efficiently training LLMs with long sequences is important yet challenged by the massive computation and memory requirements. Sequence parallelism has been proposed to tackle these problems, but existing methods suffer from scalability or efficiency issues. We propose LoongTrain, a novel system to efficiently train LLMs with long sequences at scale. The core of LoongTrain is the 2D-Attention mecha… ▽ More

    Submitted 26 June, 2024; originally announced June 2024.

  22. arXiv:2406.18020  [pdf, other

    cs.LG cs.AI physics.chem-ph

    MolFusion: Multimodal Fusion Learning for Molecular Representations via Multi-granularity Views

    Authors: Muzhen Cai, Sendong Zhao, Haochun Wang, Yanrui Du, Zewen Qiang, Bing Qin, Ting Liu

    Abstract: Artificial Intelligence predicts drug properties by encoding drug molecules, aiding in the rapid screening of candidates. Different molecular representations, such as SMILES and molecule graphs, contain complementary information for molecular encoding. Thus exploiting complementary information from different molecular representations is one of the research priorities in molecular encoding. Most ex… ▽ More

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: 8 pages, 5 figures

  23. arXiv:2406.18009  [pdf, other

    eess.AS cs.SD

    E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

    Authors: Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, Yanqing Liu, Sheng Zhao, Naoyuki Kanda

    Abstract: This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-autoregressive zero-shot text-to-speech system that offers human-level naturalness and state-of-the-art speaker similarity and intelligibility. In the E2 TTS framework, the text input is converted into a character sequence with filler tokens. The flow-matching-based mel spectrogram generator is then trained based on the… ▽ More

    Submitted 25 June, 2024; originally announced June 2024.

  24. arXiv:2406.16605  [pdf, other

    cs.CL cs.AI cs.LG stat.ME

    CLEAR: Can Language Models Really Understand Causal Graphs?

    Authors: Sirui Chen, Mengying Xu, Kun Wang, Xingyu Zeng, Rui Zhao, Shengjie Zhao, Chaochao Lu

    Abstract: Causal reasoning is a cornerstone of how humans interpret the world. To model and reason about causality, causal graphs offer a concise yet effective solution. Given the impressive advancements in language models, a crucial question arises: can they really understand causal graphs? To this end, we pioneer an investigation into language models' understanding of causal graphs. Specifically, we devel… ▽ More

    Submitted 24 June, 2024; originally announced June 2024.

  25. arXiv:2406.15570  [pdf, other

    cs.CL cs.LG

    DEM: Distribution Edited Model for Training with Mixed Data Distributions

    Authors: Dhananjay Ram, Aditya Rawal, Momchil Hardalov, Nikolaos Pappas, Sheng Zha

    Abstract: Training with mixed data distributions is a common and important part of creating multi-task and instruction-following models. The diversity of the data distributions and cost of joint training makes the optimization procedure extremely challenging. Data mixing methods partially address this problem, albeit having a sub-optimal performance across data sources and require multiple expensive trainin… ▽ More

    Submitted 21 June, 2024; originally announced June 2024.

    MSC Class: 68T50 ACM Class: F.2.2; I.2.7

  26. arXiv:2406.14565  [pdf, other

    cs.GR cs.CV

    ReflectanceFusion: Diffusion-based text to SVBRDF Generation

    Authors: Bowen Xue, Giuseppe Claudio Guarnera, Shuang Zhao, Zahra Montazeri

    Abstract: We introduce Reflectance Diffusion, a new neural text-to-texture model capable of generating high-fidelity SVBRDF maps from textual descriptions. Our method leverages a tandem neural approach, consisting of two modules, to accurately model the distribution of spatially varying reflectance as described by text prompts. Initially, we employ a pre-trained stable diffusion 2 model to generate a latent… ▽ More

    Submitted 25 April, 2024; originally announced June 2024.

  27. arXiv:2406.14449  [pdf, other

    cs.AI

    APEER: Automatic Prompt Engineering Enhances Large Language Model Reranking

    Authors: Can Jin, Hongwu Peng, Shiyu Zhao, Zhenting Wang, Wujiang Xu, Ligong Han, Jiahui Zhao, Kai Zhong, Sanguthevar Rajasekaran, Dimitris N. Metaxas

    Abstract: Large Language Models (LLMs) have significantly enhanced Information Retrieval (IR) across various modules, such as reranking. Despite impressive performance, current zero-shot relevance ranking with LLMs heavily relies on human prompt engineering. Existing automatic prompt engineering algorithms primarily focus on language modeling and classification tasks, leaving the domain of IR, particularly… ▽ More

    Submitted 20 June, 2024; originally announced June 2024.

  28. arXiv:2406.13444  [pdf, other

    cs.CL cs.CV

    VDebugger: Harnessing Execution Feedback for Debugging Visual Programs

    Authors: Xueqing Wu, Zongyu Lin, Songyan Zhao, Te-Lin Wu, Pan Lu, Nanyun Peng, Kai-Wei Chang

    Abstract: Visual programs are executable code generated by large language models to address visual reasoning problems. They decompose complex questions into multiple reasoning steps and invoke specialized models for each step to solve the problems. However, these programs are prone to logic errors, with our preliminary evaluation showing that 58% of the total errors are caused by program logic errors. Debug… ▽ More

    Submitted 27 June, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

    Comments: update reference

  29. arXiv:2406.12434  [pdf, other

    cs.SD cs.LG eess.AS

    Towards Audio Codec-based Speech Separation

    Authors: Jia Qi Yip, Shengkui Zhao, Dianwen Ng, Eng Siong Chng, Bin Ma

    Abstract: Recent improvements in neural audio codec (NAC) models have generated interest in adopting pre-trained codecs for a variety of speech processing applications to take advantage of the efficiencies gained from high compression, but these have yet been applied to the speech separation (SS) task. SS can benefit from high compression because the compute required for traditional SS models makes them imp… ▽ More

    Submitted 5 July, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

    Comments: This paper was accepted by Interspeech 2024

  30. arXiv:2406.12376  [pdf, other

    cs.CR

    DCS Chain: A Flexible Private Blockchain System

    Authors: Jianwu Zheng, Siyuan Zhao, Zheng Wang, Li Pan, Jianhua Li

    Abstract: Blockchain technology has seen tremendous development over the past few years. Despite the emergence of numerous blockchain systems, they all suffer from various limitations, which can all be attributed to the fundamental issue posed by the DCS trilemma. In light of this, this work introduces a novel private blockchain system named DCS Chain. The core idea is to quantify the DCS metrics and dynami… ▽ More

    Submitted 19 June, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

  31. arXiv:2406.12195  [pdf, other

    quant-ph cs.LG

    Quantum Compiling with Reinforcement Learning on a Superconducting Processor

    Authors: Z. T. Wang, Qiuhao Chen, Yuxuan Du, Z. H. Yang, Xiaoxia Cai, Kaixuan Huang, Jingning Zhang, Kai Xu, Jun Du, Yinan Li, Yuling Jiao, Xingyao Wu, Wu Liu, Xiliang Lu, Huikai Xu, Yirong Jin, Ruixia Wang, Haifeng Yu, S. P. Zhao

    Abstract: To effectively implement quantum algorithms on noisy intermediate-scale quantum (NISQ) processors is a central task in modern quantum technology. NISQ processors feature tens to a few hundreds of noisy qubits with limited coherence times and gate operations with errors, so NISQ algorithms naturally require employing circuits of short lengths via quantum compilation. Here, we develop a reinforcemen… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

  32. arXiv:2406.11827  [pdf, other

    cs.CL cs.AI cs.LG

    WPO: Enhancing RLHF with Weighted Preference Optimization

    Authors: Wenxuan Zhou, Ravi Agrawal, Shujian Zhang, Sathish Reddy Indurthi, Sanqiang Zhao, Kaiqiang Song, Silei Xu, Chenguang Zhu

    Abstract: Reinforcement learning from human feedback (RLHF) is a promising solution to align large language models (LLMs) more closely with human values. Off-policy preference optimization, where the preference data is obtained from other models, is widely adopted due to its cost efficiency and scalability. However, off-policy preference optimization often suffers from a distributional gap between the polic… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

  33. arXiv:2406.11233  [pdf, other

    cs.LG cs.AI cs.CL

    Probing the Decision Boundaries of In-context Learning in Large Language Models

    Authors: Siyan Zhao, Tung Nguyen, Aditya Grover

    Abstract: In-context learning is a key paradigm in large language models (LLMs) that enables them to generalize to new tasks and domains by simply prompting these models with a few exemplars without explicit parameter updates. Many attempts have been made to understand in-context learning in LLMs as a function of model scale, pretraining data, and other factors. In this work, we propose a new mechanism to p… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: 18 pages, 18 figures

  34. arXiv:2406.11200  [pdf, other

    cs.LG cs.CL

    AvaTaR: Optimizing LLM Agents for Tool-Assisted Knowledge Retrieval

    Authors: Shirley Wu, Shiyu Zhao, Qian Huang, Kexin Huang, Michihiro Yasunaga, Kaidi Cao, Vassilis N. Ioannidis, Karthik Subbian, Jure Leskovec, James Zou

    Abstract: Large language model (LLM) agents have demonstrated impressive capability in utilizing external tools and knowledge to boost accuracy and reduce hallucinations. However, developing the prompting techniques that make LLM agents able to effectively use external tools and knowledge is a heuristic and laborious task. Here, we introduce AvaTaR, a novel and automatic framework that optimizes an LLM agen… ▽ More

    Submitted 17 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

    Comments: 19 pages, 8 figures, 6 tables

  35. arXiv:2406.10744  [pdf, other

    cs.CV

    Technique Report of CVPR 2024 PBDL Challenges

    Authors: Ying Fu, Yu Li, Shaodi You, Boxin Shi, Linwei Chen, Yunhao Zou, Zichun Wang, Yichen Li, Yuze Han, Yingkai Zhang, Jianan Wang, Qinglin Liu, Wei Yu, Xiaoqian Lv, Jianing Li, Shengping Zhang, Xiangyang Ji, Yuanpei Chen, Yuhan Zhang, Weihang Peng, Liwen Zhang, Zhe Xu, Dingyong Gou, Cong Li, Senyan Xu , et al. (75 additional authors not shown)

    Abstract: The intersection of physics-based vision and deep learning presents an exciting frontier for advancing computer vision technologies. By leveraging the principles of physics to inform and enhance deep learning models, we can develop more robust and accurate vision systems. Physics-based vision aims to invert the processes to recover scene properties such as shape, reflectance, light distribution, a… ▽ More

    Submitted 12 July, 2024; v1 submitted 15 June, 2024; originally announced June 2024.

    Comments: CVPR 2024 PBDL Challenges: https://pbdl-ws.github.io/pbdl2024/challenge/index.html

  36. arXiv:2406.07855  [pdf, other

    cs.CL cs.SD eess.AS

    VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment

    Authors: Bing Han, Long Zhou, Shujie Liu, Sanyuan Chen, Lingwei Meng, Yanming Qian, Yanqing Liu, Sheng Zhao, Jinyu Li, Furu Wei

    Abstract: With the help of discrete neural audio codecs, large language models (LLM) have increasingly been recognized as a promising methodology for zero-shot Text-to-Speech (TTS) synthesis. However, sampling based decoding strategies bring astonishing diversity to generation, but also pose robustness issues such as typos, omissions and repetition. In addition, the high sampling rate of audio also brings h… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: 15 pages, 5 figures

  37. arXiv:2406.06852  [pdf, other

    cs.CR cs.AI cs.CL

    A Survey of Backdoor Attacks and Defenses on Large Language Models: Implications for Security Measures

    Authors: Shuai Zhao, Meihuizi Jia, Zhongliang Guo, Leilei Gan, Jie Fu, Yichao Feng, Fengjun Pan, Luu Anh Tuan

    Abstract: The large language models (LLMs), which bridge the gap between human language understanding and complex problem-solving, achieve state-of-the-art performance on several NLP tasks, particularly in few-shot and zero-shot settings. Despite the demonstrable efficacy of LMMs, due to constraints on computational resources, users have to engage with open-source language models or outsource the entire tra… ▽ More

    Submitted 13 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

  38. arXiv:2406.05699  [pdf, ps, other

    eess.AS cs.AI eess.SP

    An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS

    Authors: Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Hemin Yang, Zirun Zhu, Min Tang, Yufei Xia, Jinzhu Li, Sheng Zhao, Jinyu Li, Naoyuki Kanda

    Abstract: Recently, zero-shot text-to-speech (TTS) systems, capable of synthesizing any speaker's voice from a short audio prompt, have made rapid advancements. However, the quality of the generated speech significantly deteriorates when the audio prompt contains noise, and limited research has been conducted to address this issue. In this paper, we explored various strategies to enhance the quality of audi… ▽ More

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: Accepted to INTERSPEECH2024

  39. arXiv:2406.05370  [pdf, other

    cs.CL cs.SD eess.AS

    VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers

    Authors: Sanyuan Chen, Shujie Liu, Long Zhou, Yanqing Liu, Xu Tan, Jinyu Li, Sheng Zhao, Yao Qian, Furu Wei

    Abstract: This paper introduces VALL-E 2, the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time. Based on its predecessor, VALL-E, the new iteration introduces two significant enhancements: Repetition Aware Sampling refines the original nucleus sampling process by accounting for token repetition in… ▽ More

    Submitted 17 June, 2024; v1 submitted 8 June, 2024; originally announced June 2024.

    Comments: Demo posted

  40. arXiv:2406.03814  [pdf, other

    cs.CL cs.SD eess.AS

    Improving Zero-Shot Chinese-English Code-Switching ASR with kNN-CTC and Gated Monolingual Datastores

    Authors: Jiaming Zhou, Shiwan Zhao, Hui Wang, Tian-Hao Zhang, Haoqin Sun, Xuechen Wang, Yong Qin

    Abstract: The kNN-CTC model has proven to be effective for monolingual automatic speech recognition (ASR). However, its direct application to multilingual scenarios like code-switching, presents challenges. Although there is potential for performance improvement, a kNN-CTC model utilizing a single bilingual datastore can inadvertently introduce undesirable noise from the alternative language. To address thi… ▽ More

    Submitted 13 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

  41. arXiv:2406.03085  [pdf, other

    cs.LG cs.IR

    Exploring User Retrieval Integration towards Large Language Models for Cross-Domain Sequential Recommendation

    Authors: Tingjia Shen, Hao Wang, Jiaqing Zhang, Sirui Zhao, Liangyue Li, Zulong Chen, Defu Lian, Enhong Chen

    Abstract: Cross-Domain Sequential Recommendation (CDSR) aims to mine and transfer users' sequential preferences across different domains to alleviate the long-standing cold-start issue. Traditional CDSR models capture collaborative information through user and item modeling while overlooking valuable semantic information. Recently, Large Language Model (LLM) has demonstrated powerful semantic reasoning capa… ▽ More

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: 10 pages, 5 figures

    ACM Class: I.2.7

  42. arXiv:2406.02990  [pdf, other

    cs.CV

    Predicting Genetic Mutation from Whole Slide Images via Biomedical-Linguistic Knowledge Enhanced Multi-label Classification

    Authors: Gexin Huang, Chenfei Wu, Mingjie Li, Xiaojun Chang, Ling Chen, Ying Sun, Shen Zhao, Xiaodan Liang, Liang Lin

    Abstract: Predicting genetic mutations from whole slide images is indispensable for cancer diagnosis. However, existing work training multiple binary classification models faces two challenges: (a) Training multiple binary classifiers is inefficient and would inevitably lead to a class imbalance problem. (b) The biological relationships among genes are overlooked, which limits the prediction performance. To… ▽ More

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: 16 pages, 8 figures, and 3 tables

  43. arXiv:2406.02653  [pdf, other

    eess.IV cs.AI cs.CV cs.LG

    Pancreatic Tumor Segmentation as Anomaly Detection in CT Images Using Denoising Diffusion Models

    Authors: Reza Babaei, Samuel Cheng, Theresa Thai, Shangqing Zhao

    Abstract: Despite the advances in medicine, cancer has remained a formidable challenge. Particularly in the case of pancreatic tumors, characterized by their diversity and late diagnosis, early detection poses a significant challenge crucial for effective treatment. The advancement of deep learning techniques, particularly supervised algorithms, has significantly propelled pancreatic tumor detection in the… ▽ More

    Submitted 4 June, 2024; originally announced June 2024.

  44. arXiv:2406.02609  [pdf, other

    cs.LG cs.AI

    Less is More: Pseudo-Label Filtering for Continual Test-Time Adaptation

    Authors: Jiayao Tan, Fan Lyu, Chenggong Ni, Tingliang Feng, Fuyuan Hu, Zhang Zhang, Shaochuang Zhao, Liang Wang

    Abstract: Continual Test-Time Adaptation (CTTA) aims to adapt a pre-trained model to a sequence of target domains during the test phase without accessing the source data. To adapt to unlabeled data from unknown domains, existing methods rely on constructing pseudo-labels for all samples and updating the model through self-training. However, these pseudo-labels often involve noise, leading to insufficient ad… ▽ More

    Submitted 12 July, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

    Comments: arXiv admin note: text overlap with arXiv:2310.03335 by other authors

  45. arXiv:2406.02009  [pdf, other

    eess.AS cs.CL cs.SD

    Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis

    Authors: Kun Zhou, Shengkui Zhao, Yukun Ma, Chong Zhang, Hao Wang, Dianwen Ng, Chongjia Ni, Nguyen Trung Hieu, Jia Qi Yip, Bin Ma

    Abstract: Recent language model-based text-to-speech (TTS) frameworks demonstrate scalability and in-context learning capabilities. However, they suffer from robustness issues due to the accumulation of errors in speech unit predictions during autoregressive language modeling. In this paper, we propose a phonetic enhanced language modeling method to improve the performance of TTS models. We leverage self-su… ▽ More

    Submitted 11 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  46. arXiv:2406.00017  [pdf, other

    cs.CL cs.AI cs.MM

    PTA: Enhancing Multimodal Sentiment Analysis through Pipelined Prediction and Translation-based Alignment

    Authors: Shezheng Song, Shasha Li, Shan Zhao, Chengyu Wang, Xiaopeng Li, Jie Yu, Qian Wan, Jun Ma, Tianwei Yan, Wentao Ma, Xiaoguang Mao

    Abstract: Multimodal aspect-based sentiment analysis (MABSA) aims to understand opinions in a granular manner, advancing human-computer interaction and other fields. Traditionally, MABSA methods use a joint prediction approach to identify aspects and sentiments simultaneously. However, we argue that joint models are not always superior. Our analysis shows that joint models struggle to align relevant text to… ▽ More

    Submitted 13 June, 2024; v1 submitted 22 May, 2024; originally announced June 2024.

    Comments: Code will be released upon publication

  47. arXiv:2405.21075  [pdf, other

    cs.CV cs.CL

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Authors: Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, Xing Sun

    Abstract: In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point in recent advancements. However, the predominant focus remains on developing their capabilities in static image understanding. The potential of MLLMs in processing sequential visual data is still insufficiently explored, highlighting the absence of a comprehensive, high-quality… ▽ More

    Submitted 16 June, 2024; v1 submitted 31 May, 2024; originally announced May 2024.

    Comments: Project Page: https://video-mme.github.io

  48. arXiv:2405.20279  [pdf, other

    cs.CV cs.AI eess.IV

    CV-VAE: A Compatible Video VAE for Latent Generative Video Models

    Authors: Sijie Zhao, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Muyao Niu, Xiaoyu Li, Wenbo Hu, Ying Shan

    Abstract: Spatio-temporal compression of videos, utilizing networks such as Variational Autoencoders (VAE), plays a crucial role in OpenAI's SORA and numerous other video generative models. For instance, many LLM-like video models learn the distribution of discrete tokens derived from 3D VAEs within the VQVAE framework, while most diffusion-based video models capture the distribution of continuous latent ex… ▽ More

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: Project Page: https://ailab-cvc.github.io/cvvae/index.html

  49. arXiv:2405.17916  [pdf, other

    cs.CV

    Boosting General Trimap-free Matting in the Real-World Image

    Authors: Leo Shan Wenzhang Zhou Grace Zhao

    Abstract: Image matting aims to obtain an alpha matte that separates foreground objects from the background accurately. Recently, trimap-free matting has been well studied because it requires only the original image without any extra input. Such methods usually extract a rough foreground by itself to take place trimap as further guidance. However, the definition of 'foreground' lacks a unified standard and… ▽ More

    Submitted 28 May, 2024; originally announced May 2024.

    Comments: 13 pages, 8 figures

  50. arXiv:2405.17809  [pdf, other

    cs.CL cs.AI cs.SD eess.AS

    TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation

    Authors: Chenyang Le, Yao Qian, Dongmei Wang, Long Zhou, Shujie Liu, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Sheng Zhao, Michael Zeng

    Abstract: There is a rising interest and trend in research towards directly translating speech from one language to another, known as end-to-end speech-to-speech translation. However, most end-to-end models struggle to outperform cascade models, i.e., a pipeline framework by concatenating speech recognition, machine translation and text-to-speech models. The primary challenges stem from the inherent complex… ▽ More

    Submitted 28 May, 2024; originally announced May 2024.

    Comments: Work in progress