Skip to main content

Showing 1–50 of 555 results for author: Wei, X

  1. arXiv:2407.10759  [pdf, other

    eess.AS cs.CL cs.LG

    Qwen2-Audio Technical Report

    Authors: Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, Jingren Zhou

    Abstract: We introduce the latest progress of Qwen-Audio, a large-scale audio-language model called Qwen2-Audio, which is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions. In contrast to complex hierarchical tags, we have simplified the pre-training process by utilizing natural language prompts for different data an… ▽ More

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: https://github.com/QwenLM/Qwen2-Audio. Checkpoints, codes and scripts will be opensoursed soon

  2. arXiv:2407.10671  [pdf, other

    cs.CL cs.AI

    Qwen2 Technical Report

    Authors: An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin , et al. (37 additional authors not shown)

    Abstract: This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language models, encompassing a parameter range from 0.5 to 72 billion, featuring dense models and a Mixture-of-Experts model. Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.5, a… ▽ More

    Submitted 17 July, 2024; v1 submitted 15 July, 2024; originally announced July 2024.

    Comments: 25 pages, 1 figure

  3. arXiv:2407.09835  [pdf, other

    cs.CL

    Investigating Low-Rank Training in Transformer Language Models: Efficiency and Scaling Analysis

    Authors: Xiuying Wei, Skander Moalla, Razvan Pascanu, Caglar Gulcehre

    Abstract: State-of-the-art LLMs often rely on scale with high computational costs, which has sparked a research agenda to reduce parameter counts and costs without significantly impacting performance. Our study focuses on Transformer-based LLMs, specifically applying low-rank parametrization to the computationally intensive feedforward networks (FFNs), which are less studied than attention blocks. In contra… ▽ More

    Submitted 13 July, 2024; originally announced July 2024.

    Comments: Accepted by ICML 2024 Next Generation of Sequence Modeling Architectures Workshop. arXiv admin note: substantial text overlap with arXiv:2406.16450

  4. arXiv:2407.08739  [pdf, other

    cs.CV

    MAVIS: Mathematical Visual Instruction Tuning

    Authors: Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Yichi Zhang, Ziyu Guo, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Bin Wei, Shanghang Zhang, Peng Gao, Hongsheng Li

    Abstract: Multi-modal Large Language Models (MLLMs) have recently emerged as a significant focus in academia and industry. Despite their proficiency in general multi-modal scenarios, the mathematical problem-solving capabilities in visual contexts remain insufficiently explored. We identify three key areas within MLLMs that need to be improved: visual encoding of math diagrams, diagram-language alignment, a… ▽ More

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: Work in progress. Data and Models are released at https://github.com/ZrrSkywalker/MAVIS

  5. arXiv:2407.08489  [pdf, other

    cs.CV

    Projecting Points to Axes: Oriented Object Detection via Point-Axis Representation

    Authors: Zeyang Zhao, Qilong Xue, Yuhang He, Yifan Bai, Xing Wei, Yihong Gong

    Abstract: This paper introduces the point-axis representation for oriented object detection, emphasizing its flexibility and geometrically intuitive nature with two key components: points and axes. 1) Points delineate the spatial extent and contours of objects, providing detailed shape descriptions. 2) Axes define the primary directionalities of objects, providing essential orientation cues crucial for prec… ▽ More

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: 19 pages,7 figures,accpeted by ECCV24!

  6. arXiv:2407.04521  [pdf, ps, other

    math.OC cs.LG q-fin.CP

    Unified continuous-time q-learning for mean-field game and mean-field control problems

    Authors: Xiaoli Wei, Xiang Yu, Fengyi Yuan

    Abstract: This paper studies the continuous-time q-learning in the mean-field jump-diffusion models from the representative agent's perspective. To overcome the challenge when the population distribution may not be directly observable, we introduce the integrated q-function in decoupled form (decoupled Iq-function) and establish its martingale characterization together with the value function, which provide… ▽ More

    Submitted 5 July, 2024; originally announced July 2024.

  7. arXiv:2407.02129  [pdf, other

    cs.HC

    ReliaAvatar: A Robust Real-Time Avatar Animator with Integrated Motion Prediction

    Authors: Bo Qian, Zhenhuan Wei, Jiashuo Li, Xing Wei

    Abstract: Efficiently estimating the full-body pose with minimal wearable devices presents a worthwhile research direction. Despite significant advancements in this field, most current research neglects to explore full-body avatar estimation under low-quality signal conditions, which is prevalent in practical usage. To bridge this gap, we summarize three scenarios that may be encountered in real-world appli… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

  8. arXiv:2407.01445  [pdf, other

    cs.LG cs.CV

    FastCLIP: A Suite of Optimization Techniques to Accelerate CLIP Training with Limited Resources

    Authors: Xiyuan Wei, Fanjiang Ye, Ori Yonay, Xingyu Chen, Baixi Sun, Dingwen Tao, Tianbao Yang

    Abstract: Existing studies of training state-of-the-art Contrastive Language-Image Pretraining (CLIP) models on large-scale data involve hundreds of or even thousands of GPUs due to the requirement of a large batch size. However, such a large amount of resources is not accessible to most people. While advanced compositional optimization techniques for optimizing global contrastive losses have been demonstra… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: 23 pages

  9. arXiv:2406.16450  [pdf, other

    cs.CL

    Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers

    Authors: Xiuying Wei, Skander Moalla, Razvan Pascanu, Caglar Gulcehre

    Abstract: State-of-the-art results in large language models (LLMs) often rely on scale, which becomes computationally expensive. This has sparked a research agenda to reduce these models' parameter count and computational costs without significantly impacting their performance. Our study focuses on transformer-based LLMs, specifically targeting the computationally intensive feedforward networks (FFN), which… ▽ More

    Submitted 24 June, 2024; originally announced June 2024.

  10. arXiv:2406.15768  [pdf, other

    cs.CV

    MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

    Authors: Guanqun Wang, Xinyu Wei, Jiaming Liu, Ray Zhang, Yichi Zhang, Kevin Zhang, Maurice Chong, Shanghang Zhang

    Abstract: In recent years, multimodal large language models (MLLMs) have shown remarkable capabilities in tasks like visual question answering and common sense reasoning, while visual perception models have made significant strides in perception tasks, such as detection and segmentation. However, MLLMs mainly focus on high-level image-text interpretations and struggle with fine-grained visual understanding,… ▽ More

    Submitted 22 June, 2024; originally announced June 2024.

    Comments: 14 pages, 8 figures

  11. arXiv:2406.12324  [pdf, other

    cs.RO

    AutoDSL: Automated domain-specific language design for structural representation of procedures with constraints

    Authors: Yu-Zhe Shi, Haofei Hou, Zhangqian Bi, Fanxu Meng, Xiang Wei, Lecheng Ruan, Qining Wang

    Abstract: Accurate representation of procedures in restricted scenarios, such as non-standardized scientific experiments, requires precise depiction of constraints. Unfortunately, Domain-specific Language (DSL), as an effective tool to express constraints structurally, often requires case-by-case hand-crafting, necessitating customized, labor-intensive efforts. To overcome this challenge, we introduce the A… ▽ More

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (ACL'24)

  12. arXiv:2406.11833  [pdf, other

    cs.CV cs.AI cs.LG

    MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

    Authors: Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, Pan Zhang, Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, Jiaqi Wang

    Abstract: Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models(LVLMs). While current open-source LVLMs demonstrate promising performance in simplified scenarios such as single-turn single-image input, they fall short in real-world conversation scenarios such as following instructions in a long context history wit… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: This project is available at https://github.com/Liuziyu77/MMDU

  13. arXiv:2406.10527  [pdf, other

    cs.CV

    Panoptic-FlashOcc: An Efficient Baseline to Marry Semantic Occupancy with Panoptic via Instance Center

    Authors: Zichen Yu, Changyong Shu, Qianpu Sun, Junjie Linghu, Xiaobao Wei, Jiangyong Yu, Zongdai Liu, Dawei Yang, Hui Li, Yan Chen

    Abstract: Panoptic occupancy poses a novel challenge by aiming to integrate instance occupancy and semantic occupancy within a unified framework. However, there is still a lack of efficient solutions for panoptic occupancy. In this paper, we propose Panoptic-FlashOcc, a straightforward yet robust 2D feature framework that enables realtime panoptic occupancy. Building upon the lightweight design of FlashOcc,… ▽ More

    Submitted 15 June, 2024; originally announced June 2024.

  14. arXiv:2406.08418  [pdf, other

    cs.CV cs.AI

    OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

    Authors: Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, Jiashuo Yu, Hao Tian, Jiasheng Zhou, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Zhenxiang Li, Pei Chu, Yi Wang , et al. (15 additional authors not shown)

    Abstract: Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale an… ▽ More

    Submitted 12 July, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

  15. arXiv:2406.07057  [pdf, other

    cs.CL cs.AI cs.CV cs.LG

    Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study

    Authors: Yichi Zhang, Yao Huang, Yitong Sun, Chang Liu, Zhe Zhao, Zhengwei Fang, Yifan Wang, Huanran Chen, Xiao Yang, Xingxing Wei, Hang Su, Yinpeng Dong, Jun Zhu

    Abstract: Despite the superior capabilities of Multimodal Large Language Models (MLLMs) across diverse tasks, they still face significant trustworthiness challenges. Yet, current literature on the assessment of trustworthy MLLMs remains limited, lacking a holistic evaluation to offer thorough insights into future improvements. In this work, we establish MultiTrust, the first comprehensive and unified benchm… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: 100 pages, 84 figures, 33 tables

  16. arXiv:2406.05485  [pdf, other

    cs.CV

    Training-Free Robust Interactive Video Object Segmentation

    Authors: Xiaoli Wei, Zhaoqing Wang, Yandong Guo, Chunxia Zhang, Tongliang Liu, Mingming Gong

    Abstract: Interactive video object segmentation is a crucial video task, having various applications from video editing to data annotating. However, current approaches struggle to accurately segment objects across diverse domains. Recently, Segment Anything Model (SAM) introduces interactive visual prompts and demonstrates impressive performance across different domains. In this paper, we propose a training… ▽ More

    Submitted 8 June, 2024; originally announced June 2024.

  17. arXiv:2406.04743  [pdf, other

    cs.LG cs.CR cs.DC stat.AP

    When Swarm Learning meets energy series data: A decentralized collaborative learning design based on blockchain

    Authors: Lei Xu, Yulong Chen, Yuntian Chen, Longfeng Nie, Xuetao Wei, Liang Xue, Dongxiao Zhang

    Abstract: Machine learning models offer the capability to forecast future energy production or consumption and infer essential unknown variables from existing data. However, legal and policy constraints within specific energy sectors render the data sensitive, presenting technical hurdles in utilizing data from diverse sources. Therefore, we propose adopting a Swarm Learning (SL) scheme, which replaces the… ▽ More

    Submitted 7 June, 2024; originally announced June 2024.

  18. arXiv:2406.04325  [pdf, other

    cs.CV

    ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

    Authors: Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, Jiaqi Wang

    Abstract: We present the ShareGPT4Video series, aiming to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) via dense and precise captions. The series comprises: 1) ShareGPT4Video, 40K GPT4V annotated dense captions of videos with various lengths and sources, developed through carefully designed data filtering and annotating st… ▽ More

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Project Page: https://sharegpt4video.github.io/

  19. arXiv:2406.03794  [pdf, other

    cs.LG

    Infusing Self-Consistency into Density Functional Theory Hamiltonian Prediction via Deep Equilibrium Models

    Authors: Zun Wang, Chang Liu, Nianlong Zou, He Zhang, Xinran Wei, Lin Huang, Lijun Wu, Bin Shao

    Abstract: In this study, we introduce a unified neural network architecture, the Deep Equilibrium Density Functional Theory Hamiltonian (DEQH) model, which incorporates Deep Equilibrium Models (DEQs) for predicting Density Functional Theory (DFT) Hamiltonians. The DEQH model inherently captures the self-consistency nature of Hamiltonian, a critical aspect often overlooked by traditional machine learning app… ▽ More

    Submitted 6 June, 2024; originally announced June 2024.

  20. arXiv:2405.20323  [pdf, other

    cs.CV cs.AI

    $\textit{S}^3$Gaussian: Self-Supervised Street Gaussians for Autonomous Driving

    Authors: Nan Huang, Xiaobao Wei, Wenzhao Zheng, Pengju An, Ming Lu, Wei Zhan, Masayoshi Tomizuka, Kurt Keutzer, Shanghang Zhang

    Abstract: Photorealistic 3D reconstruction of street scenes is a critical technique for developing real-world simulators for autonomous driving. Despite the efficacy of Neural Radiance Fields (NeRF) for driving scenes, 3D Gaussian Splatting (3DGS) emerges as a promising direction due to its faster speed and more explicit representation. However, most existing street 3DGS methods require tracked 3D vehicle b… ▽ More

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: Code is available at: https://github.com/nnanhuang/S3Gaussian/

  21. arXiv:2405.18860  [pdf, other

    cs.RO

    Empowering Embodied Manipulation: A Bimanual-Mobile Robot Manipulation Dataset for Household Tasks

    Authors: Tianle Zhang, Dongjiang Li, Yihang Li, Zecui Zeng, Lin Zhao, Lei Sun, Yue Chen, Xuelong Wei, Yibing Zhan, Lusong Li, Xiaodong He

    Abstract: The advancements in embodied AI are increasingly enabling robots to tackle complex real-world tasks, such as household manipulation. However, the deployment of robots in these environments remains constrained by the lack of comprehensive bimanual-mobile robot manipulation data that can be learned. Existing datasets predominantly focus on single-arm manipulation tasks, while the few dual-arm datase… ▽ More

    Submitted 6 June, 2024; v1 submitted 29 May, 2024; originally announced May 2024.

  22. arXiv:2405.18729  [pdf, other

    cs.LG cs.AI

    Preferred-Action-Optimized Diffusion Policies for Offline Reinforcement Learning

    Authors: Tianle Zhang, Jiayi Guan, Lin Zhao, Yihang Li, Dongjiang Li, Zecui Zeng, Lei Sun, Yue Chen, Xuelong Wei, Lusong Li, Xiaodong He

    Abstract: Offline reinforcement learning (RL) aims to learn optimal policies from previously collected datasets. Recently, due to their powerful representational capabilities, diffusion models have shown significant potential as policy models for offline RL issues. However, previous offline RL algorithms based on diffusion policies generally adopt weighted regression to improve the policy. This approach opt… ▽ More

    Submitted 28 May, 2024; originally announced May 2024.

  23. arXiv:2405.18361  [pdf, other

    cs.CV

    Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving?

    Authors: Yifan Bai, Dongming Wu, Yingfei Liu, Fan Jia, Weixin Mao, Ziheng Zhang, Yucheng Zhao, Jianbing Shen, Xing Wei, Tiancai Wang, Xiangyu Zhang

    Abstract: Rapid advancements in Autonomous Driving (AD) tasks turned a significant shift toward end-to-end fashion, particularly in the utilization of vision-language models (VLMs) that integrate robust logical reasoning and cognitive abilities to enable comprehensive end-to-end planning. However, these VLM-based approaches tend to integrate 2D vision tokenizers and a large language model (LLM) for ego-car… ▽ More

    Submitted 28 May, 2024; originally announced May 2024.

  24. arXiv:2405.17935  [pdf, other

    cs.CL cs.AI

    Tool Learning with Large Language Models: A Survey

    Authors: Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, Ji-Rong Wen

    Abstract: Recently, tool learning with large language models (LLMs) has emerged as a promising paradigm for augmenting the capabilities of LLMs to tackle highly complex problems. Despite growing attention and rapid advancements in this field, the existing literature remains fragmented and lacks systematic organization, posing barriers to entry for newcomers. This gap motivates us to conduct a comprehensive… ▽ More

    Submitted 30 May, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

  25. arXiv:2405.16865  [pdf, other

    q-bio.NC cs.LG stat.ML

    An Investigation of Conformal Isometry Hypothesis for Grid Cells

    Authors: Dehong Xu, Ruiqi Gao, Wen-Hao Zhang, Xue-Xin Wei, Ying Nian Wu

    Abstract: This paper investigates the conformal isometry hypothesis as a potential explanation for the emergence of hexagonal periodic patterns in the response maps of grid cells. The hypothesis posits that the activities of the population of grid cells form a high-dimensional vector in the neural space, representing the agent's self-position in 2D physical space. As the agent moves in the 2D physical space… ▽ More

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: arXiv admin note: text overlap with arXiv:2310.19192

  26. arXiv:2405.16089  [pdf, other

    cs.CL cs.IR

    COLT: Towards Completeness-Oriented Tool Retrieval for Large Language Models

    Authors: Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, Ji-Rong Wen

    Abstract: Recently, the integration of external tools with Large Language Models (LLMs) has emerged as a promising approach to overcome the inherent constraints of their pre-training data. However, realworld applications often involve a diverse range of tools, making it infeasible to incorporate all tools directly into LLMs due to constraints on input length and response time. Therefore, to fully exploit th… ▽ More

    Submitted 25 May, 2024; originally announced May 2024.

  27. arXiv:2405.14702  [pdf, other

    cs.CV cs.AI

    G3: An Effective and Adaptive Framework for Worldwide Geolocalization Using Large Multi-Modality Models

    Authors: Pengyue Jia, Yiding Liu, Xiaopeng Li, Xiangyu Zhao, Yuhao Wang, Yantong Du, Xiao Han, Xuetao Wei, Shuaiqiang Wang, Dawei Yin

    Abstract: Worldwide geolocalization aims to locate the precise location at the coordinate level of photos taken anywhere on the Earth. It is very challenging due to 1) the difficulty of capturing subtle location-aware visual semantics, and 2) the heterogeneous geographical distribution of image data. As a result, existing studies have clear limitations when scaled to a worldwide context. They may easily con… ▽ More

    Submitted 23 May, 2024; originally announced May 2024.

  28. arXiv:2405.12079  [pdf, other

    cs.DC cs.OS

    PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation

    Authors: Zhuobin Huang, Xingda Wei, Yingyi Hao, Rong Chen, Mingcong Han, Jinyu Gu, Haibo Chen

    Abstract: Checkpointing (C) and restoring (R) are key components for GPU tasks. POS is an OS-level GPU C/R system: It can transparently checkpoint or restore processes that use the GPU, without requiring any cooperation from the application, a key feature required by modern systems like the cloud. Moreover, POS is the first OS-level C/R system that can concurrently execute C/R with the application execution… ▽ More

    Submitted 20 May, 2024; originally announced May 2024.

  29. arXiv:2405.11335  [pdf, other

    cs.CR

    Detecting Complex Multi-step Attacks with Explainable Graph Neural Network

    Authors: Wei Liu, Peng Gao, Haotian Zhang, Ke Li, Weiyong Yang, Xingshen Wei, Jiwu Shu

    Abstract: Complex multi-step attacks have caused significant damage to numerous critical infrastructures. To detect such attacks, graph neural network based methods have shown promising results by modeling the system's events as a graph. However, existing methods still face several challenges when deployed in practice. First, there is a lack of sufficient real attack data especially considering the large vo… ▽ More

    Submitted 13 June, 2024; v1 submitted 18 May, 2024; originally announced May 2024.

    Comments: Corresponding author: Peng Gao (gao.itslab@gmail.com)

  30. arXiv:2405.11236  [pdf, other

    cs.CV

    TriLoRA: Integrating SVD for Advanced Style Personalization in Text-to-Image Generation

    Authors: Chengcheng Feng, Mu He, Qiuyu Tian, Haojie Yin, Xiaofang Zhao, Hongwei Tang, Xingqiang Wei

    Abstract: As deep learning technology continues to advance, image generation models, especially models like Stable Diffusion, are finding increasingly widespread application in visual arts creation. However, these models often face challenges such as overfitting, lack of stability in generated results, and difficulties in accurately capturing the features desired by creators during the fine-tuning process.… ▽ More

    Submitted 13 June, 2024; v1 submitted 18 May, 2024; originally announced May 2024.

  31. arXiv:2405.09150  [pdf, other

    cs.CV

    Curriculum Dataset Distillation

    Authors: Zhiheng Ma, Anjia Cao, Funing Yang, Xing Wei

    Abstract: Most dataset distillation methods struggle to accommodate large-scale datasets due to their substantial computational and memory requirements. In this paper, we present a curriculum-based dataset distillation framework designed to harmonize scalability with efficiency. This framework strategically distills synthetic images, adhering to a curriculum that transitions from simple to complex. By incor… ▽ More

    Submitted 15 May, 2024; originally announced May 2024.

  32. arXiv:2405.08816  [pdf, other

    cs.CV cs.RO

    The RoboDrive Challenge: Drive Anytime Anywhere in Any Condition

    Authors: Lingdong Kong, Shaoyuan Xie, Hanjiang Hu, Yaru Niu, Wei Tsang Ooi, Benoit R. Cottereau, Lai Xing Ng, Yuexin Ma, Wenwei Zhang, Liang Pan, Kai Chen, Ziwei Liu, Weichao Qiu, Wei Zhang, Xu Cao, Hao Lu, Ying-Cong Chen, Caixin Kang, Xinning Zhou, Chengyang Ying, Wentao Shang, Xingxing Wei, Yinpeng Dong, Bo Yang, Shengyin Jiang , et al. (66 additional authors not shown)

    Abstract: In the realm of autonomous driving, robust perception under out-of-distribution conditions is paramount for the safe deployment of vehicles. Challenges such as adverse weather, sensor malfunctions, and environmental unpredictability can severely impact the performance of autonomous systems. The 2024 RoboDrive Challenge was crafted to propel the development of driving perception technologies that c… ▽ More

    Submitted 29 May, 2024; v1 submitted 14 May, 2024; originally announced May 2024.

    Comments: ICRA 2024; 32 pages, 24 figures, 5 tables; Code at https://robodrive-24.github.io/

  33. arXiv:2405.08340  [pdf, other

    cs.CR cs.CV

    Achieving Resolution-Agnostic DNN-based Image Watermarking:A Novel Perspective of Implicit Neural Representation

    Authors: Yuchen Wang, Xingyu Zhu, Guanhui Ye, Shiyao Zhang, Xuetao Wei

    Abstract: DNN-based watermarking methods are rapidly developing and delivering impressive performances. Recent advances achieve resolution-agnostic image watermarking by reducing the variant resolution watermarking problem to a fixed resolution watermarking problem. However, such a reduction process can potentially introduce artifacts and low robustness. To address this issue, we propose the first, to the b… ▽ More

    Submitted 14 May, 2024; originally announced May 2024.

  34. arXiv:2405.08245  [pdf

    cs.CV cs.AI

    Progressive enhancement and restoration for mural images under low-light and defected conditions based on multi-receptive field strategy

    Authors: Xiameng Wei, Binbin Fan, Ying Wang, Yanxiang Feng, Laiyi Fu

    Abstract: Ancient murals are valuable cultural heritage with great archaeological value. They provide insights into ancient religions, ceremonies, folklore, among other things through their content. However, due to long-term oxidation and inadequate protection, ancient murals have suffered continuous damage, including peeling and mold etc. Additionally, since ancient murals were typically painted indoors, t… ▽ More

    Submitted 16 July, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

  35. Selective Focus: Investigating Semantics Sensitivity in Post-training Quantization for Lane Detection

    Authors: Yunqian Fan, Xiuying Wei, Ruihao Gong, Yuqing Ma, Xiangguo Zhang, Qi Zhang, Xianglong Liu

    Abstract: Lane detection (LD) plays a crucial role in enhancing the L2+ capabilities of autonomous driving, capturing widespread attention. The Post-Processing Quantization (PTQ) could facilitate the practical application of LD models, enabling fast speeds and limited memories without labeled data. However, prior PTQ methods do not consider the complex LD outputs that contain physical semantics, such as off… ▽ More

    Submitted 10 May, 2024; originally announced May 2024.

    Comments: Accepted by AAAI-24

    Journal ref: AAAI 2024, 38, 11936-11943

  36. arXiv:2405.05808  [pdf, other

    cs.CV

    Fast and Controllable Post-training Sparsity: Learning Optimal Sparsity Allocation with Global Constraint in Minutes

    Authors: Ruihao Gong, Yang Yong, Zining Wang, Jinyang Guo, Xiuying Wei, Yuqing Ma, Xianglong Liu

    Abstract: Neural network sparsity has attracted many research interests due to its similarity to biological schemes and high energy efficiency. However, existing methods depend on long-time training or fine-tuning, which prevents large-scale applications. Recently, some works focusing on post-training sparsity (PTS) have emerged. They get rid of the high training cost but usually suffer from distinct accura… ▽ More

    Submitted 9 May, 2024; originally announced May 2024.

  37. arXiv:2405.05538  [pdf, other

    cs.CV

    A Survey on Personalized Content Synthesis with Diffusion Models

    Authors: Xulu Zhang, Xiao-Yong Wei, Wengyu Zhang, Jinlin Wu, Zhaoxiang Zhang, Zhen Lei, Qing Li

    Abstract: Recent advancements in generative models have significantly impacted content creation, leading to the emergence of Personalized Content Synthesis (PCS). With a small set of user-provided examples, PCS aims to customize the subject of interest to specific user-defined prompts. Over the past two years, more than 150 methods have been proposed. However, existing surveys mainly focus on text-to-image… ▽ More

    Submitted 9 May, 2024; originally announced May 2024.

  38. arXiv:2405.04765  [pdf, other

    cs.LG cs.AI cs.DC

    When Foresight Pruning Meets Zeroth-Order Optimization: Efficient Federated Learning for Low-Memory Devices

    Authors: Pengyu Zhang, Yingjie Liu, Yingbo Zhou, Xiao Du, Xian Wei, Ting Wang, Mingsong Chen

    Abstract: Although Federated Learning (FL) enables collaborative learning in Artificial Intelligence of Things (AIoT) design, it fails to work on low-memory AIoT devices due to its heavy memory usage. To address this problem, various federated pruning methods are proposed to reduce memory usage during inference. However, few of them can substantially mitigate the memory burdens during pruning and training.… ▽ More

    Submitted 7 May, 2024; originally announced May 2024.

  39. arXiv:2405.04434  [pdf, other

    cs.CL cs.AI

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    Authors: DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding , et al. (132 additional authors not shown)

    Abstract: We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference… ▽ More

    Submitted 19 June, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

  40. arXiv:2405.03267  [pdf, other

    cs.DC cs.DB cs.IR

    Characterizing the Dilemma of Performance and Index Size in Billion-Scale Vector Search and Breaking It with Second-Tier Memory

    Authors: Rongxin Cheng, Yifan Peng, Xingda Wei, Hongrui Xie, Rong Chen, Sijie Shen, Haibo Chen

    Abstract: Vector searches on large-scale datasets are critical to modern online services like web search and RAG, which necessity storing the datasets and their index on the secondary storage like SSD. In this paper, we are the first to characterize the trade-off of performance and index size in existing SSD-based graph and cluster indexes: to improve throughput by 5.7$\times$ and 1.7$\times$, these indexes… ▽ More

    Submitted 7 May, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

  41. arXiv:2404.17466  [pdf, other

    physics.comp-ph cs.LG physics.plasm-ph

    FTL: Transfer Learning Nonlinear Plasma Dynamic Transitions in Low Dimensional Embeddings via Deep Neural Networks

    Authors: Zhe Bai, Xishuo Wei, William Tang, Leonid Oliker, Zhihong Lin, Samuel Williams

    Abstract: Deep learning algorithms provide a new paradigm to study high-dimensional dynamical behaviors, such as those in fusion plasma systems. Development of novel model reduction methods, coupled with detection of abnormal modes with plasma physics, opens a unique opportunity for building efficient models to identify plasma instabilities for real-time control. Our Fusion Transfer Learning (FTL) model dem… ▽ More

    Submitted 26 April, 2024; originally announced April 2024.

    Comments: 18 pages, 10 figures

    MSC Class: 76W05; 68T45 ACM Class: J.2; I.2.10

  42. arXiv:2404.17360  [pdf, other

    cs.CV

    UniRGB-IR: A Unified Framework for Visible-Infrared Downstream Tasks via Adapter Tuning

    Authors: Maoxun Yuan, Bo Cui, Tianyi Zhao, Xingxing Wei

    Abstract: Semantic analysis on visible (RGB) and infrared (IR) images has gained attention for its ability to be more accurate and robust under low-illumination and complex weather conditions. Due to the lack of pre-trained foundation models on the large-scale infrared image datasets, existing methods prefer to design task-specific frameworks and directly fine-tune them with pre-trained foundation models on… ▽ More

    Submitted 26 April, 2024; originally announced April 2024.

  43. arXiv:2404.16821  [pdf, other

    cs.CV

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    Authors: Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai , et al. (10 additional authors not shown)

    Abstract: In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model -- InternViT-6B, boosting its visual… ▽ More

    Submitted 29 April, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

    Comments: Technical report

  44. arXiv:2404.15380  [pdf, other

    cs.LG cs.AI

    ControlTraj: Controllable Trajectory Generation with Topology-Constrained Diffusion Model

    Authors: Yuanshao Zhu, James Jianqiao Yu, Xiangyu Zhao, Qidong Liu, Yongchao Ye, Wei Chen, Zijian Zhang, Xuetao Wei, Yuxuan Liang

    Abstract: Generating trajectory data is among promising solutions to addressing privacy concerns, collection costs, and proprietary restrictions usually associated with human mobility analyses. However, existing trajectory generation methods are still in their infancy due to the inherent diversity and unpredictability of human activities, grappling with issues such as fidelity, flexibility, and generalizabi… ▽ More

    Submitted 23 April, 2024; originally announced April 2024.

  45. arXiv:2404.15081  [pdf, other

    cs.CV cs.CR cs.LG

    Perturbing Attention Gives You More Bang for the Buck: Subtle Imaging Perturbations That Efficiently Fool Customized Diffusion Models

    Authors: Jingyao Xu, Yuetong Lu, Yandong Li, Siyang Lu, Dongdong Wang, Xiang Wei

    Abstract: Diffusion models (DMs) embark a new era of generative modeling and offer more opportunities for efficient generating high-quality and realistic data samples. However, their widespread use has also brought forth new challenges in model security, which motivates the creation of more effective adversarial attackers on DMs to understand its vulnerability. We propose CAAT, a simple but generic and effi… ▽ More

    Submitted 14 June, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

    Comments: Published at CVPR 2024, code:https://github.com/CO2-cityao/CAAT

  46. arXiv:2404.12385  [pdf, other

    cs.CV cs.GR

    MeshLRM: Large Reconstruction Model for High-Quality Mesh

    Authors: Xinyue Wei, Kai Zhang, Sai Bi, Hao Tan, Fujun Luan, Valentin Deschaintre, Kalyan Sunkavalli, Hao Su, Zexiang Xu

    Abstract: We propose MeshLRM, a novel LRM-based approach that can reconstruct a high-quality mesh from merely four input images in less than one second. Different from previous large reconstruction models (LRMs) that focus on NeRF-based reconstruction, MeshLRM incorporates differentiable mesh extraction and rendering within the LRM framework. This allows for end-to-end mesh reconstruction by fine-tuning a p… ▽ More

    Submitted 18 April, 2024; originally announced April 2024.

  47. arXiv:2404.12139  [pdf, other

    cs.CV

    Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models

    Authors: Shouwei Ruan, Yinpeng Dong, Hanqing Liu, Yao Huang, Hang Su, Xingxing Wei

    Abstract: Vision-Language Pre-training (VLP) models like CLIP have achieved remarkable success in computer vision and particularly demonstrated superior robustness to distribution shifts of 2D images. However, their robustness under 3D viewpoint variations is still limited, which can hinder the development for real-world applications. This paper successfully addresses this concern while keeping VLPs' origin… ▽ More

    Submitted 18 April, 2024; originally announced April 2024.

    Comments: 20 pages

  48. arXiv:2404.09392  [pdf, ps, other

    cs.IT cs.LG cs.NI eess.SP

    An Autoencoder-Based Constellation Design for AirComp in Wireless Federated Learning

    Authors: Yujia Mu, Xizixiang Wei, Cong Shen

    Abstract: Wireless federated learning (FL) relies on efficient uplink communications to aggregate model updates across distributed edge devices. Over-the-air computation (a.k.a. AirComp) has emerged as a promising approach for addressing the scalability challenge of FL over wireless links with limited communication resources. Unlike conventional methods, AirComp allows multiple edge devices to transmit upli… ▽ More

    Submitted 14 April, 2024; originally announced April 2024.

  49. arXiv:2404.08408  [pdf, other

    cs.LG cs.AI eess.SP physics.geo-ph

    Seismic First Break Picking in a Higher Dimension Using Deep Graph Learning

    Authors: Hongtao Wang, Li Long, Jiangshe Zhang, Xiaoli Wei, Chunxia Zhang, Zhenbo Guo

    Abstract: Contemporary automatic first break (FB) picking methods typically analyze 1D signals, 2D source gathers, or 3D source-receiver gathers. Utilizing higher-dimensional data, such as 2D or 3D, incorporates global features, improving the stability of local picking. Despite the benefits, high-dimensional data requires structured input and increases computational demands. Addressing this, we propose a no… ▽ More

    Submitted 12 April, 2024; originally announced April 2024.

  50. arXiv:2404.06119  [pdf, other

    cs.CV

    DreamView: Injecting View-specific Text Guidance into Text-to-3D Generation

    Authors: Junkai Yan, Yipeng Gao, Qize Yang, Xihan Wei, Xuansong Xie, Ancong Wu, Wei-Shi Zheng

    Abstract: Text-to-3D generation, which synthesizes 3D assets according to an overall text description, has significantly progressed. However, a challenge arises when the specific appearances need customizing at designated viewpoints but referring solely to the overall description for generating 3D objects. For instance, ambiguity easily occurs when producing a T-shirt with distinct patterns on its front and… ▽ More

    Submitted 12 July, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

    Comments: Accepted to ECCV 2024, camera ready version