Showing 1–50 of 267 results for author: Shao, W

  1. arXiv:2407.02797  [pdf, other]

    cs.RO cs.CV

    Solving Motion Planning Tasks with a Scalable Generative Model

    Authors: Yihan Hu, Siqi Chai, Zhening Yang, Jingyu Qian, Kun Li, Wenxin Shao, Haichao Zhang, Wei Xu, Qiang Liu

    Abstract: As autonomous driving systems are being deployed to millions of vehicles, there is a pressing need to improve the system's scalability and safety and to reduce the engineering cost. A realistic, scalable, and practical simulator of the driving world is highly desired. In this paper, we present an efficient solution based on generative models which learns the dynamics of the driving scenes. With this mod…

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: ECCV2024

  2. arXiv:2406.12229  [pdf, other]

    cs.AI cs.LG

    Spatially Resolved Gene Expression Prediction from Histology via Multi-view Graph Contrastive Learning with HSIC-bottleneck Regularization

    Authors: Changxi Chi, Hang Shi, Qi Zhu, Daoqiang Zhang, Wei Shao

    Abstract: The rapid development of spatial transcriptomics (ST) enables the measurement of gene expression at spatial resolution, making it possible to simultaneously profile the gene expression, spatial locations of spots, and the matched histopathological images. However, the cost of collecting ST data is much higher than that of acquiring histopathological images, and thus several studies attempt to predict the…

    Submitted 17 June, 2024; originally announced June 2024.

  3. arXiv:2406.11802  [pdf, other]

    cs.CV

    PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models

    Authors: Fanqing Meng, Wenqi Shao, Lixin Luo, Yahong Wang, Yiran Chen, Quanfeng Lu, Yue Yang, Tianshuo Yang, Kaipeng Zhang, Yu Qiao, Ping Luo

    Abstract: Text-to-image (T2I) models have made substantial progress in generating images from textual prompts. However, they frequently fail to produce images consistent with physical commonsense, a vital capability for applications in world simulation and everyday tasks. Current T2I evaluation benchmarks focus on metrics such as accuracy, bias, and safety, neglecting the evaluation of models' internal know…

    Submitted 17 June, 2024; originally announced June 2024.

  4. arXiv:2406.10125  [pdf, other]

    cs.CV

    MapVision: CVPR 2024 Autonomous Grand Challenge Mapless Driving Tech Report

    Authors: Zhongyu Yang, Mai Liu, Jinluo Xie, Yueming Zhang, Chen Shen, Wei Shao, Jichao Jiao, Tengfei Xing, Runbo Hu, Pengfei Xu

    Abstract: Autonomous driving without high-definition (HD) maps demands a higher level of active scene understanding. In this competition, the organizers provided the multi-perspective camera images and standard-definition (SD) maps to explore the boundaries of scene reasoning capabilities. We found that most existing algorithms construct Bird's Eye View (BEV) features from these multi-perspective images and…

    Submitted 14 June, 2024; originally announced June 2024.

  5. arXiv:2406.08845  [pdf, other]

    cs.CV

    Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability,Reproducibility, and Practicality

    Authors: Tianle Zhang, Langtian Ma, Yuchen Yan, Yuchen Zhang, Kai Wang, Yue Yang, Ziyao Guo, Wenqi Shao, Yang You, Yu Qiao, Ping Luo, Kaipeng Zhang

    Abstract: Recent text-to-video (T2V) technology advancements, as demonstrated by models such as Gen2, Pika, and Sora, have significantly broadened its applicability and popularity. Despite these strides, evaluating these models poses substantial challenges. Primarily, due to the limitations inherent in automatic metrics, manual evaluation is often considered a superior method for assessing T2V generation. H…

    Submitted 13 June, 2024; originally announced June 2024.

  6. arXiv:2406.08451  [pdf, other]

    cs.CV

    GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

    Authors: Quanfeng Lu, Wenqi Shao, Zitao Liu, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, Yu Qiao, Ping Luo

    Abstract: Smartphone users often navigate across multiple applications (apps) to complete tasks such as sharing content between social media platforms. Autonomous Graphical User Interface (GUI) navigation agents can enhance user experience in communication, entertainment, and productivity by streamlining workflows and reducing manual intervention. However, prior GUI agents often trained with datasets compri…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: 16 pages, 8 figures, a cross-app GUI navigation dataset

  7. arXiv:2406.07230  [pdf, other]

    cs.CV cs.AI

    Needle In A Multimodal Haystack

    Authors: Weiyun Wang, Shuibo Zhang, Yiming Ren, Yuchen Duan, Tiantong Li, Shuo Liu, Mengkang Hu, Zhe Chen, Kaipeng Zhang, Lewei Lu, Xizhou Zhu, Ping Luo, Yu Qiao, Jifeng Dai, Wenqi Shao, Wenhai Wang

    Abstract: With the rapid advancement of multimodal large language models (MLLMs), their evaluation has become increasingly comprehensive. However, understanding long multimodal content, as a foundational ability for real-world applications, remains underexplored. In this work, we present Needle In A Multimodal Haystack (MM-NIAH), the first benchmark specifically designed to systematically evaluate the capab…

    Submitted 11 June, 2024; originally announced June 2024.

  8. arXiv:2406.04035  [pdf, other]

    cs.LG cs.AI

    STEMO: Early Spatio-temporal Forecasting with Multi-Objective Reinforcement Learning

    Authors: Wei Shao, Yufan Kang, Ziyan Peng, Xiao Xiao, Lei Wang, Yuhui Yang, Flora D Salim

    Abstract: Accuracy and timeliness are often conflicting goals in prediction tasks. Premature predictions may yield a higher rate of false alarms, whereas delaying predictions to gather more information can render them too late to be useful. In applications such as wildfires, crimes, and traffic jams, timely forecasting is vital for safeguarding human life and property. Consequently, finding a balanc…

    Submitted 18 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

    Comments: Accepted paper in KDD 2024

  9. arXiv:2406.03404  [pdf, other]

    cs.LG cs.AI cs.CR

    ST-DPGAN: A Privacy-preserving Framework for Spatiotemporal Data Generation

    Authors: Wei Shao, Rongyi Zhu, Cai Yang, Chandra Thapa, Muhammad Ejaz Ahmed, Seyit Camtepe, Rui Zhang, DuYong Kim, Hamid Menouar, Flora D. Salim

    Abstract: Spatiotemporal data is prevalent in a wide range of edge devices, such as those used in personal communication and financial transactions. Recent advancements have sparked a growing interest in integrating spatiotemporal analysis with large-scale language models. However, spatiotemporal data often contains sensitive information, making it unsuitable for open third-party access. To address this cha…

    Submitted 4 June, 2024; originally announced June 2024.

  10. arXiv:2406.01363  [pdf, other]

    cs.CL cs.IR

    Privacy in LLM-based Recommendation: Recent Advances and Future Directions

    Authors: Sichun Luo, Wei Shao, Yuxuan Yao, Jian Xu, Mingyang Liu, Qintong Li, Bowei He, Maolin Wang, Guanzhi Deng, Hanxu Hou, Xinyi Zhang, Linqi Song

    Abstract: Nowadays, large language models (LLMs) have been integrated with conventional recommendation models to improve recommendation performance. However, while most existing works have focused on improving model performance, the privacy issue has received comparatively little attention. In this paper, we review recent advancements in privacy within LLM-based recommendation, categorizing th…

    Submitted 3 June, 2024; originally announced June 2024.

  11. Promoting Two-sided Fairness in Dynamic Vehicle Routing Problem

    Authors: Yufan Kang, Rongsheng Zhang, Wei Shao, Flora D. Salim, Jeffrey Chan

    Abstract: The Dynamic Vehicle Routing Problem (DVRP) is an extension of the classic Vehicle Routing Problem (VRP), which is a fundamental problem in logistics and transportation. Typically, DVRPs involve two stakeholders: service providers that deliver services to customers and customers who raise requests from different locations. Many real-world applications can be formulated as DVRP such as ridesharing and…

    Submitted 29 May, 2024; originally announced May 2024.

  12. arXiv:2405.17763  [pdf]

    quant-ph physics.app-ph physics.atm-clus physics.optics

    Capturing dynamics and thermodynamics of a three-level quantum heat engine via programmable quantum circuits

    Authors: Gao-xiang Deng, Zhe He, Yu Liu, Wei Shao, Zheng Cui

    Abstract: This research employs the Kraus representation and Sz.-Nagy dilation theorem to model a three-level quantum heat engine on quantum circuits, investigating its dynamic evolution and thermodynamic performance. The feasibility of the dynamic model is validated by tracking the changes of population. On the basis of reinforcement learning algorithm, the optimal cycle of the quantum heat engine for maximal ave…

    Submitted 27 May, 2024; originally announced May 2024.

  13. arXiv:2405.14858  [pdf, other]

    cs.CV

    Mamba-R: Vision Mamba ALSO Needs Registers

    Authors: Feng Wang, Jiahao Wang, Sucheng Ren, Guoyizhe Wei, Jieru Mei, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie

    Abstract: Similar to Vision Transformers, this paper identifies artifacts also present within the feature maps of Vision Mamba. These artifacts, corresponding to high-norm tokens emerging in low-information background areas of images, appear much more severe in Vision Mamba -- they exist prevalently even with the tiny-sized model and activate extensively across background regions. To mitigate this issue, we…

    Submitted 23 May, 2024; originally announced May 2024.

  14. arXiv:2405.14802  [pdf, other]

    eess.IV cs.CV

    Fast-DDPM: Fast Denoising Diffusion Probabilistic Models for Medical Image-to-Image Generation

    Authors: Hongxu Jiang, Muhammad Imran, Linhai Ma, Teng Zhang, Yuyin Zhou, Muxuan Liang, Kuang Gong, Wei Shao

    Abstract: Denoising diffusion probabilistic models (DDPMs) have achieved unprecedented success in computer vision. However, they remain underutilized in medical imaging, a field crucial for disease diagnosis and treatment planning. This is primarily due to the high computational cost associated with (1) the use of large number of time steps (e.g., 1,000) in diffusion processes and (2) the increased dimensio…

    Submitted 23 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

  15. arXiv:2405.14554  [pdf, other]

    cs.CV cs.AI

    UDKAG: Augmenting Large Vision-Language Models with Up-to-Date Knowledge

    Authors: Chuanhao Li, Zhen Li, Chenchen Jing, Shuo Liu, Wenqi Shao, Yuwei Wu, Ping Luo, Yu Qiao, Kaipeng Zhang

    Abstract: Large vision-language models (LVLMs), such as the LLaVA series, are ignorant of up-to-date knowledge because they cannot be updated frequently due to the large amount of resources required, and therefore fail in many cases. For example, if an LVLM was released in January 2024, it wouldn't know the detailed plot of the new movie Dune 2, which wasn't released until February 2024. To solve the pro…

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: 12 pages, 6 figures, a framework to augment large vision-language models with up-to-date knowledge

  16. arXiv:2405.10496  [pdf, other]

    cs.IT eess.SP

    Electromagnetic Information Theory for Holographic MIMO Communications

    Authors: Li Wei, Tierui Gong, Chongwen Huang, Zhaoyang Zhang, Wei E. I. Sha, Zhi Ning Chen, Linglong Dai, Merouane Debbah, Chau Yuen

    Abstract: Holographic multiple-input multiple-output (HMIMO) utilizes a compact antenna array to form a nearly continuous aperture, thereby enabling higher capacity and more flexible configurations compared with conventional MIMO systems, making it attractive in current scientific research. Key questions naturally arise regarding the potential of HMIMO to surpass Shannon's theoretical limits and how far it…

    Submitted 25 May, 2024; v1 submitted 16 May, 2024; originally announced May 2024.

  17. arXiv:2405.05945  [pdf, other]

    cs.CV

    Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

    Authors: Peng Gao, Le Zhuo, Dongyang Liu, Ruoyi Du, Xu Luo, Longtian Qiu, Yuhang Zhang, Chen Lin, Rongjie Huang, Shijie Geng, Renrui Zhang, Junlin Xi, Wenqi Shao, Zhengkai Jiang, Tianshuo Yang, Weicai Ye, He Tong, Jingwen He, Yu Qiao, Hongsheng Li

    Abstract: Sora unveils the potential of scaling Diffusion Transformer for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details. In this technical report, we introduce the Lumina-T2X family - a series of Flow-based Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized attention, as a unified f…

    Submitted 13 June, 2024; v1 submitted 9 May, 2024; originally announced May 2024.

    Comments: Technical Report; Code at: https://github.com/Alpha-VLLM/Lumina-T2X

  18. arXiv:2405.00130  [pdf, other]

    eess.IV cs.CV cs.LG

    A Flexible 2.5D Medical Image Segmentation Approach with In-Slice and Cross-Slice Attention

    Authors: Amarjeet Kumar, Hongxu Jiang, Muhammad Imran, Cyndi Valdes, Gabriela Leon, Dahyun Kang, Parvathi Nataraj, Yuyin Zhou, Michael D. Weiss, Wei Shao

    Abstract: Deep learning has become the de facto method for medical image segmentation, with 3D segmentation models excelling in capturing complex 3D structures and 2D models offering high computational efficiency. However, segmenting 2.5D images, which have high in-plane but low through-plane resolution, is a relatively unexplored challenge. While applying 2D models to individual slices of a 2.5D image is f…

    Submitted 30 April, 2024; originally announced May 2024.

  19. arXiv:2404.16017  [pdf, other]

    cs.CV cs.AI cs.GT cs.LG

    RetinaRegNet: A Versatile Approach for Retinal Image Registration

    Authors: Vishal Balaji Sivaraman, Muhammad Imran, Qingyue Wei, Preethika Muralidharan, Michelle R. Tamplin, Isabella M. Grumbach, Randy H. Kardon, Jui-Kai Wang, Yuyin Zhou, Wei Shao

    Abstract: We introduce the RetinaRegNet model, which can achieve state-of-the-art performance across various retinal image registration tasks. RetinaRegNet does not require training on any retinal images. It begins by establishing point correspondences between two retinal images using image features derived from diffusion models. This process involves the selection of feature points from the moving image us…

    Submitted 20 May, 2024; v1 submitted 24 April, 2024; originally announced April 2024.

  20. arXiv:2404.16006  [pdf, other]

    cs.CV

    MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

    Authors: Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Runjian Chen, Peng Xu, Renrui Zhang, Haozhe Zhang, Peng Gao, Yali Wang, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao

    Abstract: Large Vision-Language Models (LVLMs) show significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover a limited number of multimodal tasks testing rudimentary capabilities, falling short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to…

    Submitted 24 April, 2024; originally announced April 2024.

    Comments: 77 pages, 41 figures

  21. arXiv:2404.06773  [pdf, other]

    cs.CV

    Adapting LLaMA Decoder to Vision Transformer

    Authors: Jiahao Wang, Wenqi Shao, Mengzhao Chen, Chengyue Wu, Yong Liu, Taiqiang Wu, Kaipeng Zhang, Songyang Zhang, Kai Chen, Ping Luo

    Abstract: This work examines whether decoder-only Transformers such as LLaMA, which were originally designed for large language models (LLMs), can be adapted to the computer vision field. We first "LLaMAfy" a standard ViT step-by-step to align with LLaMA's architecture, and find that directly applying a causal mask to the self-attention brings an attention collapse issue, resulting in the failure to the net…

    Submitted 27 May, 2024; v1 submitted 10 April, 2024; originally announced April 2024.

    Comments: 23 pages, 11 figures

  22. arXiv:2404.01342  [pdf, other]

    cs.CL cs.AI

    DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model

    Authors: Lirui Zhao, Yue Yang, Kaipeng Zhang, Wenqi Shao, Yuxin Zhang, Yu Qiao, Ping Luo, Rongrong Ji

    Abstract: Text-to-image (T2I) generative models have attracted significant attention and found extensive applications within and beyond academic research. For example, the Civitai community, a platform for T2I innovation, currently hosts an impressive array of 74,492 distinct models. However, this diversity presents a formidable challenge in selecting the most appropriate model and parameters, a process tha…

    Submitted 31 March, 2024; originally announced April 2024.

    Comments: Published as a conference paper at CVPR 2024

  23. arXiv:2403.20194  [pdf, other]

    cs.MM

    ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Capability for Large Vision-Language Models

    Authors: Shuo Liu, Kaining Ying, Hao Zhang, Yue Yang, Yuqi Lin, Tianle Zhang, Chuanhao Li, Yu Qiao, Ping Luo, Wenqi Shao, Kaipeng Zhang

    Abstract: This paper presents ConvBench, a novel multi-turn conversation evaluation benchmark tailored for Large Vision-Language Models (LVLMs). Unlike existing benchmarks that assess individual capabilities in single-turn dialogues, ConvBench adopts a three-level multimodal capability hierarchy, mimicking human cognitive processes by stacking up perception, reasoning, and creativity. Each level focuses on…

    Submitted 25 April, 2024; v1 submitted 29 March, 2024; originally announced March 2024.

  24. arXiv:2403.18271  [pdf, other]

    cs.CV

    Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding

    Authors: Zhiheng Cheng, Qingyue Wei, Hongru Zhu, Yan Wang, Liangqiong Qu, Wei Shao, Yuyin Zhou

    Abstract: The Segment Anything Model (SAM) has garnered significant attention for its versatile segmentation abilities and intuitive prompt-based interface. However, its application in medical imaging presents challenges, requiring either substantial training costs and extensive medical datasets for full model fine-tuning or high-quality prompts for optimal performance. This paper introduces H-SAM: a prompt…

    Submitted 27 March, 2024; originally announced March 2024.

    Comments: CVPR 2024

  25. arXiv:2403.14354  [pdf, other]

    cs.CV

    LDTR: Transformer-based Lane Detection with Anchor-chain Representation

    Authors: Zhongyu Yang, Chen Shen, Wei Shao, Tengfei Xing, Runbo Hu, Pengfei Xu, Hua Chai, Ruini Xue

    Abstract: Despite recent advances in lane detection methods, scenarios with limited- or no-visual-clue of lanes due to factors such as lighting conditions and occlusion remain challenging and crucial for automated driving. Moreover, current lane representations require complex post-processing and struggle with specific instances. Inspired by the DETR architecture, we propose LDTR, a transformer-based model…

    Submitted 21 March, 2024; originally announced March 2024.

    Comments: Accepted by CVM 2024 and CVMJ. 16 pages, 14 figures

  26. arXiv:2403.09346  [pdf, other]

    cs.CV cs.AI

    AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Adversarial Visual-Instructions

    Authors: Hao Zhang, Wenqi Shao, Hong Liu, Yongqiang Ma, Ping Luo, Yu Qiao, Kaipeng Zhang

    Abstract: Large Vision-Language Models (LVLMs) have shown significant progress in responding well to visual-instructions from users. However, these instructions, encompassing images and text, are susceptible to both intentional and inadvertent attacks. Despite the critical importance of LVLMs' robustness against such threats, current research in this area remains limited. To bridge this gap, we introduce AV…

    Submitted 14 March, 2024; originally announced March 2024.

  27. arXiv:2403.05970  [pdf, other]

    cs.IT eess.SP

    Electromagnetic Hybrid Beamforming for Holographic Communications

    Authors: Ran Ji, Chongwen Huang, Xiaoming Chen, Wei E. I. Sha, Linglong Dai, Jiguang He, Zhaoyang Zhang, Chau Yuen, Mérouane Debbah

    Abstract: It is well known that there is inherent radiation pattern distortion for the commercial base station antenna array, which usually needs three antenna sectors to cover the whole space. To eliminate pattern distortion and further enhance beamforming performance, we propose an electromagnetic hybrid beamforming (EHB) scheme based on a three-dimensional (3D) superdirective holographic antenna array. S…

    Submitted 9 March, 2024; originally announced March 2024.

    Comments: 13 pages

  28. arXiv:2403.02297  [pdf, other]

    cs.RO

    Uncertainty-Aware Prediction and Application in Planning for Autonomous Driving: Definitions, Methods, and Comparison

    Authors: Wenbo Shao, Jiahui Xu, Zhong Cao, Hong Wang, Jun Li

    Abstract: Autonomous driving systems face the formidable challenge of navigating intricate and dynamic environments with uncertainty. This study presents a unified prediction and planning framework that concurrently models short-term aleatoric uncertainty (SAU), long-term aleatoric uncertainty (LAU), and epistemic uncertainty (EU) to predict and establish a robust foundation for planning in dynamic contexts…

    Submitted 4 March, 2024; originally announced March 2024.

    Comments: 14 pages, 7 figures

  29. arXiv:2403.02118  [pdf, other]

    cs.CY cs.AI cs.CV

    Position: Towards Implicit Prompt For Text-To-Image Models

    Authors: Yue Yang, Yuqi Lin, Hong Liu, Wenqi Shao, Runjian Chen, Hailong Shang, Yu Wang, Yu Qiao, Kaipeng Zhang, Ping Luo

    Abstract: Recent text-to-image (T2I) models have had great success, and many benchmarks have been proposed to evaluate their performance and safety. However, they only consider explicit prompts while neglecting implicit prompts (which hint at a target without explicitly mentioning it). These prompts may get rid of safety constraints and pose potential threats to the applications of these models. This position pap…

    Submitted 28 May, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

  30. arXiv:2403.01504  [pdf, ps, other]

    physics.optics math-ph physics.app-ph

    Spin and Orbital Angular Momenta of Electromagnetic Waves: From Classical to Quantum Forms

    Authors: Wei E. I. Sha, Zhihao Lan, Menglin L. N. Chen, Yongpin P. Chen, Sheng Sun

    Abstract: Angular momenta of electromagnetic waves are important both in concepts and applications. In this work, we systematically discuss two types of angular momenta, i.e., spin angular momentum and orbital angular momentum in various cases, e.g., with source and without source, in classical and quantum forms. Numerical results demonstrating how to extract the topological charge of a classical vortex bea…

    Submitted 3 March, 2024; originally announced March 2024.

    Comments: 5 pages, 3 figures

    Journal ref: IEEE Journal on Multiscale and Multiphysics Computational Techniques, 2024

  31. arXiv:2402.19385  [pdf, other]

    cs.RO cs.CV

    Towards Safe and Reliable Autonomous Driving: Dynamic Occupancy Set Prediction

    Authors: Wenbo Shao, Jiahui Xu, Wenhao Yu, Jun Li, Hong Wang

    Abstract: In the rapidly evolving field of autonomous driving, reliable prediction is pivotal for vehicular safety. However, trajectory predictions often deviate from actual paths, particularly in complex and challenging environments, leading to significant errors. To address this issue, our study introduces a novel method for Dynamic Occupancy Set (DOS) prediction, which effectively combines advanced trajecto…

    Submitted 2 June, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

    Comments: Accepted by IEEE IV 2024

  32. arXiv:2402.19004  [pdf, other]

    cs.CV eess.IV

    RSAM-Seg: A SAM-based Approach with Prior Knowledge Integration for Remote Sensing Image Semantic Segmentation

    Authors: Jie Zhang, Xubing Yang, Rui Jiang, Wei Shao, Li Zhang

    Abstract: The development of high-resolution remote sensing satellites has provided great convenience for research work related to remote sensing. Segmentation and extraction of specific targets are essential tasks when facing the vast and complex remote sensing images. Recently, the introduction of Segment Anything Model (SAM) provides a universal pre-training model for image segmentation tasks. While the…

    Submitted 29 February, 2024; originally announced February 2024.

    Comments: 12 pages, 11 figures

  33. arXiv:2402.16880  [pdf, other]

    cs.LG cs.AI cs.CL

    BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation

    Authors: Peng Xu, Wenqi Shao, Mengzhao Chen, Shitao Tang, Kaipeng Zhang, Peng Gao, Fengwei An, Yu Qiao, Ping Luo

    Abstract: Large language models (LLMs) have demonstrated outstanding performance in various tasks, such as text summarization, text question-answering, etc. While their performance is impressive, the computational footprint due to their vast number of parameters can be prohibitive. Existing solutions such as SparseGPT and Wanda attempt to alleviate this issue through weight pruning. However, their layer…

    Submitted 19 April, 2024; v1 submitted 18 February, 2024; originally announced February 2024.

  34. arXiv:2402.16117  [pdf, other]

    cs.RO cs.AI cs.CV

    RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis

    Authors: Yao Mu, Junting Chen, Qinglong Zhang, Shoufa Chen, Qiaojun Yu, Chongjian Ge, Runjian Chen, Zhixuan Liang, Mengkang Hu, Chaofan Tao, Peize Sun, Haibao Yu, Chao Yang, Wenqi Shao, Wenhai Wang, Jifeng Dai, Yu Qiao, Mingyu Ding, Ping Luo

    Abstract: Robotic behavior synthesis, the problem of understanding multimodal inputs and generating precise physical control for robots, is an important part of Embodied AI. Despite successes in applying multimodal large language models for high-level understanding, it remains challenging to translate these conceptual understandings into detailed robotic actions while achieving generalization across various…

    Submitted 25 February, 2024; originally announced February 2024.

  35. arXiv:2402.14623  [pdf, other]

    cs.RO cs.AI cs.CL cs.CV

    RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation

    Authors: Junting Chen, Yao Mu, Qiaojun Yu, Tianming Wei, Silang Wu, Zhecheng Yuan, Zhixuan Liang, Chao Yang, Kaipeng Zhang, Wenqi Shao, Yu Qiao, Huazhe Xu, Mingyu Ding, Ping Luo

    Abstract: Rapid progress in high-level task planning and code generation for open-world robot manipulation has been witnessed in Embodied AI. However, previous studies put much effort into general common sense reasoning and task planning capabilities of large-scale language or multi-modal models, relatively little effort on ensuring the deployability of generated code on real robots, and other fundamental c…

    Submitted 22 February, 2024; originally announced February 2024.

    Comments: 10 pages of main paper, 4 pages of appendix; 10 figures in main paper, 3 figures in appendix

    ACM Class: I.2.7; I.2.8; I.2.9; I.2.10

  36. arXiv:2402.09181  [pdf, other]

    eess.IV cs.CV

    OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

    Authors: Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, Ping Luo

    Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in various multimodal tasks. However, their potential in the medical domain remains largely unexplored. A significant challenge arises from the scarcity of diverse medical images spanning various modalities and anatomical regions, which is essential in real-world medical applications. To solve this problem, in this pape…

    Submitted 21 April, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

  37. arXiv:2402.05935  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

    Authors: Dongyang Liu, Renrui Zhang, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, Kaipeng Zhang, Wenqi Shao, Chao Xu, Conghui He, Junjun He, Hao Shao, Pan Lu, Hongsheng Li, Yu Qiao, Peng Gao

    Abstract: We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX. To improve the architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we…

    Submitted 26 June, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

    Comments: Accepted by ICML 2024. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory

  38. arXiv:2402.04924  [pdf, other]

    cs.LG

    Two Trades is not Baffled: Condensing Graph via Crafting Rational Gradient Matching

    Authors: Tianle Zhang, Yuchen Zhang, Kun Wang, Kai Wang, Beining Yang, Kaipeng Zhang, Wenqi Shao, Ping Liu, Joey Tianyi Zhou, Yang You

    Abstract: Training on large-scale graphs has achieved remarkable results in graph representation learning, but its cost and storage have raised growing concerns. As one of the most promising directions, graph condensation methods address these issues by employing gradient matching, aiming to condense the full graph into a more concise yet information-rich synthetic set. Though encouraging, these strategies…

    Submitted 30 June, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

    Comments: An effective method for graph condensation

  39. arXiv:2401.15833  [pdf]

    quant-ph physics.app-ph

    Experimental demonstration of steady-state dynamics of three-level quantum heat engine using superconducting quantum circuits

    Authors: Gao-xiang Deng, Haoqiang Ai, Wei Shao, Yu Liu, Zheng Cui

    Abstract: The three-level system represents the smallest quantum system capable of autonomous cycling in quantum heat engines. This study proposes a method to simulate the steady-state dynamics of a three-level quantum heat engine by designing and implementing superconducting quantum circuits. Following error mitigation, the outcomes from the quantum circuit model designed in this study, when executed on a…

    Submitted 14 May, 2024; v1 submitted 28 January, 2024; originally announced January 2024.

  40. arXiv:2401.13049  [pdf, other]

    eess.IV cs.AI cs.CV cs.GT cs.LG

    CIS-UNet: Multi-Class Segmentation of the Aorta in Computed Tomography Angiography via Context-Aware Shifted Window Self-Attention

    Authors: Muhammad Imran, Jonathan R Krebs, Veera Rajasekhar Reddy Gopu, Brian Fazzone, Vishal Balaji Sivaraman, Amarjeet Kumar, Chelsea Viscardi, Robert Evans Heithaus, Benjamin Shickel, Yuyin Zhou, Michol A Cooper, Wei Shao

    Abstract: Advancements in medical imaging and endovascular grafting have facilitated minimally invasive treatments for aortic diseases. Accurate 3D segmentation of the aorta and its branches is crucial for interventions, as inaccurate segmentation can lead to erroneous surgical planning and endograft construction. Previous methods simplified aortic segmentation as a binary image segmentation problem, overlo…

    Submitted 23 January, 2024; originally announced January 2024.

  41. arXiv:2401.12888  [pdf, other]

    cs.RO cs.CV

    Data-Centric Evolution in Autonomous Driving: A Comprehensive Survey of Big Data System, Data Mining, and Closed-Loop Technologies

    Authors: Lincan Li, Wei Shao, Wei Dong, Yijun Tian, Qiming Zhang, Kaixiang Yang, Wenjie Zhang

    Abstract: The aspiration of the next generation's autonomous driving (AD) technology relies on the dedicated integration and interaction among intelligent perception, prediction, planning, and low-level control. There has been a huge bottleneck regarding the upper bound of autonomous driving algorithm performance; a consensus in academia and industry holds that the key to surmounting the bottleneck lies i…

    Submitted 26 January, 2024; v1 submitted 23 January, 2024; originally announced January 2024.

  42. arXiv:2401.11723  [pdf, other]

    cs.CR cs.AI

    Unraveling Attacks in Machine Learning-based IoT Ecosystems: A Survey and the Open Libraries Behind Them

    Authors: Chao Liu, Boxi Chen, Wei Shao, Chris Zhang, Kelvin Wong, Yi Zhang

    Abstract: The advent of the Internet of Things (IoT) has brought forth an era of unprecedented connectivity, with an estimated 80 billion smart devices expected to be in operation by the end of 2025. These devices facilitate a multitude of smart applications, enhancing the quality of life and efficiency across various domains. Machine Learning (ML) serves as a crucial technology, not only for analyzing IoT-…

    Submitted 26 January, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

  43. arXiv:2401.04097  [pdf, other]

    hep-th astro-ph.CO gr-qc hep-ph

    Stringy Spacetime Uncertainty Principle and a Modified Trans-Planckian Censorship Criterion

    Authors: Robert Brandenberger, Pei-Ming Ho, Hikaru Kawai, Wei-Hsiang Shao

    Abstract: We study the implications of the stringy space-time uncertainty relation (STUR) for inflationary cosmology. By demanding that no fluctuation modes that exit the Hubble radius are affected by the nonlocality resulting from the STUR, we find an upper bound on the number of e-foldings of inflation. The bound is a factor of 2 weaker than what results from the Trans-Planckian Censorship Criterion (TCC)…

    Submitted 8 January, 2024; originally announced January 2024.

    Comments: 7 pages, 1 figure

  44. arXiv:2401.02384  [pdf, other]

    cs.CV

    ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning

    Authors: Fanqing Meng, Wenqi Shao, Quanfeng Lu, Peng Gao, Kaipeng Zhang, Yu Qiao, Ping Luo

    Abstract: Charts play a vital role in data visualization, understanding data patterns, and informed decision-making. However, their unique combination of graphical elements (e.g., bars, lines) and textual components (e.g., labels, legends) poses challenges for general-purpose multimodal models. While vision-language models trained on chart data excel in comprehension, they struggle with generalization. To a…

    Submitted 15 February, 2024; v1 submitted 4 January, 2024; originally announced January 2024.

    Comments: Updated and corrected experimental results, removal of inappropriate experiments, and a more comprehensive experimental setup

  45. arXiv:2312.16018  [pdf, other]

    cs.IR

    RecRanker: Instruction Tuning Large Language Model as Ranker for Top-k Recommendation

    Authors: Sichun Luo, Bowei He, Haohan Zhao, Wei Shao, Yanlin Qi, Yinya Huang, Aojun Zhou, Yuxuan Yao, Zongpeng Li, Yuanzhang Xiao, Mingjie Zhan, Linqi Song

    Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities and have been extensively deployed across various domains, including recommender systems. Prior research has employed specialized prompts to leverage the in-context learning capabilities of LLMs for recommendation purposes. More recent studies have utilized instruction tuning techniques to align LLMs with human prefere…

    Submitted 31 March, 2024; v1 submitted 26 December, 2023; originally announced December 2023.

  46. arXiv:2312.15878  [pdf, other]

    cond-mat.quant-gas

    Collisions of Majorana Zero Modes

    Authors: Liang-Liang Wang, Wenjun Shao, Jian Li

    Abstract: We investigate the collisions of Majorana zero modes, which are presented as inter-soliton collisional events in fermionic superfluids with spin-orbit coupling. Our results demonstrate that the zero-energy splitting, induced by the overlapping of inter-soliton Majorana wave-functions upon collision, generates an effective repulsive force for Majorana states, which in turn protected themselves aga…

    Submitted 25 December, 2023; originally announced December 2023.

    Comments: 7 pages, 5 figures

  47. arXiv:2312.12742  [pdf, other]

    cs.CV

    Cached Transformers: Improving Transformers with Differentiable Memory Cache

    Authors: Zhaoyang Zhang, Wenqi Shao, Yixiao Ge, Xiaogang Wang, Jinwei Gu, Ping Luo

    Abstract: This work introduces a new Transformer model called Cached Transformer, which uses Gated Recurrent Cached (GRC) attention to extend the self-attention mechanism with a differentiable memory cache of tokens. GRC attention enables attending to both past and current tokens, increasing the receptive field of attention and allowing for exploring long-range dependencies. By utilizing a recurrent gating…

    Submitted 19 December, 2023; originally announced December 2023.

    Comments: AAAI 2024

  48. arXiv:2312.05732  [pdf, ps, other]

    quant-ph

    Reply to "Comment on `Generalized James' effective Hamiltonian method'"

    Authors: Wenjun Shao, Chunfeng Wu, Xun-Li Feng

    Abstract: In the preceding Comment [1] it was claimed that the third-order Hamiltonian obtained in our original paper [2] is not Hermitian for general situations when considering time-dependence, and that the way of deriving the effective third-order expansion is not very rigorous. To reply to the comment, we should emphasize the following three points: first of all, the third-order Hamiltonian given in our paper is…

    Submitted 20 February, 2024; v1 submitted 9 December, 2023; originally announced December 2023.

    Comments: 9 pages

  49. arXiv:2311.18765  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    MLLMs-Augmented Visual-Language Representation Learning

    Authors: Yanqing Liu, Kai Wang, Wenqi Shao, Ping Luo, Yu Qiao, Mike Zheng Shou, Kaipeng Zhang, Yang You

    Abstract: Visual-language pre-training has achieved remarkable success in many multi-modal tasks, largely attributed to the availability of large-scale image-text datasets. In this work, we demonstrate that Multi-modal Large Language Models (MLLMs) can enhance visual-language representation learning by establishing richer image-text associations for image-text datasets. Our approach is simple, utilizing MLL…

    Submitted 13 March, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

  50. arXiv:2311.17442  [pdf]

    physics.app-ph

    Quantifying Nonradiative Recombination and Resistive Losses in Perovskite Photovoltaics: A Modified Diode Model Approach

    Authors: Minshen Lin, Xuehui Xu, Hong Tian, Yang Michael Yang, Wei E. I. Sha, Wenxing Zhong

    Abstract: Pinpointing the origin of inefficiency can expedite the process of optimizing the efficiency of perovskite photovoltaics. However, it is challenging to discern and quantify the different loss pathways in a complete perovskite photovoltaic device under operational conditions. To address this challenge, we propose a modified diode model that can quantify bulk/interface defect-assisted recombination…

    Submitted 30 November, 2023; v1 submitted 29 November, 2023; originally announced November 2023.

    Comments: 26 pages, 6 figures, published in Solar RRL