Two Classes of Optimal Multi-Input Structures for Node Computations in Message Passing Algorithms

Abstract: In this paper, we delve into the computations performed at a node within a message-passing algorithm. We investigate low complexity/latency multi-input structures that can be adopted by the node for computing outgoing messages $y = (y_1, y_2, \ldots, y_n)$ from incoming messages $x = (x_1, x_2, \ldots, x_n)$, where each $y_j$, $j = 1, 2, \ldots, n$, is computed via a multi-way tree with leaves $x$ excluding $x_j$. Specifically, we propose two classes of structures for different scenarios. For the scenario where complexity has a higher priority than latency, the star-tree-based structures are proposed. The complexity-optimal ones (as well as their lowest latency) of such structures are obtained, which have the near-lowest (and sometimes the lowest) complexity among all structures. For the scenario where latency has a higher priority than complexity, the isomorphic-directed-rooted-tree-based structures are proposed. The latency-optimal ones (as well as their lowest complexity) of such structures are obtained, which are proved to have the lowest latency among all structures.

Submitted 11 July, 2024; originally announced July 2024.
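
As a rough illustration of the node computation described in the abstract above (each outgoing $y_j$ combines every incoming message except $x_j$), the sketch below uses a plain prefix/suffix scheme with a two-input associative operator. This is a textbook baseline and not the star-tree-based or isomorphic-directed-rooted-tree-based structures the paper proposes; the function name `leave_one_out` and the choice of operators are assumptions made for illustration.

```python
import operator
from typing import Callable, List, TypeVar

T = TypeVar("T")


def leave_one_out(x: List[T], op: Callable[[T, T], T]) -> List[T]:
    """Return y where y[j] combines all x[i] with i != j using `op`.

    `op` is assumed to be associative (e.g. addition or min).
    Hypothetical helper for illustration only.
    """
    n = len(x)
    if n < 2:
        raise ValueError("need at least two incoming messages")

    # prefix[j] combines x[0..j-1]; defined for j >= 1.
    prefix: List[T] = [x[0]] * n
    for j in range(2, n):
        prefix[j] = op(prefix[j - 1], x[j - 1])

    # suffix[j] combines x[j+1..n-1]; defined for j <= n-2.
    suffix: List[T] = [x[n - 1]] * n
    for j in range(n - 3, -1, -1):
        suffix[j] = op(x[j + 1], suffix[j + 1])

    y: List[T] = []
    for j in range(n):
        if j == 0:
            y.append(suffix[0])          # nothing to the left of x[0]
        elif j == n - 1:
            y.append(prefix[n - 1])      # nothing to the right of x[n-1]
        else:
            y.append(op(prefix[j], suffix[j]))
    return y


sums = leave_one_out([1, 2, 3, 4], operator.add)
mins = leave_one_out([3, 1, 4], min)
```

This baseline uses roughly 3n two-input operations and has latency that grows linearly in n, which is the kind of complexity/latency trade-off the abstract sets out to improve on with multi-input structures.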

arXiv:2406.18796 [pdf, other]

Protecting three-dimensional entanglement from correlated amplitude damping channel

Authors: Xing Xiao, Wen-Rui Huang, Tian-Xiang Lu, Yan-Ling Li

Abstract: Quantum entanglement is a crucial resource in quantum information processing, and protecting it against noise poses a significant challenge. This paper introduces two strategies for preserving qutrit-qutrit entanglement in the presence of correlated amplitude damping (CAD) noise: weak measurement (WM) and environment-assisted measurement (EAM), both combined with quantum measurement reversal (QMR). Two prototypical classes of three-dimensional entangled states are examined. The findings demonstrate that while the WM+QMR method can partially retain entanglement, the EAM+QMR approach is more effective at protecting entanglement as well as enhancing success probabilities, particularly for specific qutrit-qutrit entangled states. Additionally, we thoroughly discuss the impact of correlation effects on entanglement protection and the enhancement of success probability. Our results provide valuable insights into defending high-dimensional entanglement from CAD noise, thus offering practical solutions for the advancement of quantum information technologies.

Submitted 26 June, 2024; originally announced June 2024.

Comments: 14 pages, 6 figures, comments are welcome!

arXiv:2406.18070 [pdf, other]

EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation

Authors: Baoqi Pei, Guo Chen, Jilan Xu, Yuping He, Yicheng Liu, Kanghua Pan, Yifei Huang, Yali Wang, Tong Lu, Limin Wang, Yu Qiao

Abstract: In this report, we present our solutions to the EgoVis Challenges in CVPR 2024, including five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge. Building upon the video-language two-tower model and leveraging our meticulously organized egocentric video data, we introduce a novel foundation model called EgoVideo. This model is specifically designed to cater to the unique characteristics of egocentric videos and provides strong support for our competition submissions. In the Ego4D challenges, we tackle various tasks including Natural Language Queries, Step Grounding, Moment Queries, Short-term Object Interaction Anticipation, and Long-term Action Anticipation. In addition, we also participate in the EPIC-Kitchens challenge, where we engage in the Action Recognition, Multiple Instance Retrieval, and Domain Adaptation for Action Recognition tracks. By adapting EgoVideo to these diverse tasks, we showcase its versatility and effectiveness in different egocentric video analysis scenarios, demonstrating the powerful representation ability of EgoVideo as an egocentric foundation model. Our codebase and pretrained models are publicly available at https://github.com/OpenGVLab/EgoVideo.

Submitted 30 June, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

Comments: Champion solutions in the EgoVis CVPR 2024 workshop

arXiv:2406.18036 [pdf, other]

Operating Single-Photon Circulator by Spinning Optical Resonators

Authors: Jing Li, Tian-Xiang Lu, Meiyu Peng, Le-Man Kuang, Hui Jing, Lan Zhou

Abstract: A circulator is one of the crucial devices in quantum networks and simulations. We propose a four-port circulator that regulates the flow of single photons at multi-frequency points by studying the coherent transmission of a single photon in a coupled system of two resonators and two waveguides. When both resonators are static or rotate at the same angular velocity, single-photon transport demonstrates reciprocity; however, when the angular velocities differ, four distinct frequency points emerge where photon circulation can occur. In particular, when the angular velocities of the two resonators are equal and opposite, there are two different frequency points where photon circulation can be achieved, and there is a frequency point where a single photon input from any waveguide can be completely routed to the other waveguide. Interestingly, by rotating the two resonators, the single-photon circulation suppressed by the internal defect-induced backscattering can be restored.

Submitted 25 June, 2024; originally announced June 2024.

Comments: 12 pages, 5 figures

arXiv:2406.14673 [pdf, other]

Insights into LLM Long-Context Failures: When Transformers Know but Don't Tell

Authors: Taiming Lu, Muhan Gao, Kuai Yu, Adam Byerly, Daniel Khashabi

Abstract: Large Language Models (LLMs) exhibit positional bias, struggling to utilize information from the middle or end of long contexts. Our study explores LLMs' long-context reasoning by probing their hidden representations. We find that while LLMs encode the position of target information, they often fail to leverage this in generating accurate responses. This reveals a disconnect between information retrieval and utilization, a "know but don't tell" phenomenon. We further analyze the relationship between extraction time and final accuracy, offering insights into the underlying mechanics of transformer models.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.13748 [pdf, other]

Every Language Counts: Learn and Unlearn in Multilingual LLMs

Authors: Taiming Lu, Philipp Koehn

Abstract: This paper investigates the propagation of harmful information in multilingual large language models (LLMs) and evaluates the efficacy of various unlearning methods. We demonstrate that fake information, regardless of the language it is in, once introduced into these models through training data, can spread across different languages, compromising the integrity and reliability of the generated content. Our findings reveal that standard unlearning techniques, which typically focus on English data, are insufficient in mitigating the spread of harmful content in multilingual contexts and could inadvertently reinforce harmful content across languages. We show that only by addressing harmful responses in both English and the original language of the harmful data can we effectively eliminate generations for all languages. This underscores the critical need for comprehensive unlearning strategies that consider the multilingual nature of modern LLMs to enhance their safety and reliability across diverse linguistic landscapes.

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.08418 [pdf, other]

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Authors: Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, Jiashuo Yu, Hao Tian, Jiasheng Zhou, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Zhenxiang Li, Pei Chu, Yi Wang, et al. (15 additional authors not shown)

Abstract: Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale and diversity of current image-text interleaved data restrict the development of multimodal large language models. In this paper, we introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset. Using an efficient data engine, we filter and extract large-scale high-quality documents, which contain 8.6 billion images and 1,696 billion text tokens. Compared to counterparts (e.g., MMC4, OBELICS), our dataset 1) has 15 times larger scales while maintaining good data quality; 2) features more diverse sources, including both English and non-English websites as well as video-centric websites; 3) is more flexible, easily degradable from an image-text interleaved format to pure text corpus and image-text pairs. Through comprehensive analysis and experiments, we validate the quality, usability, and effectiveness of the proposed dataset. We hope this could provide a solid data foundation for future multimodal model research. Code and data are released at https://github.com/OpenGVLab/OmniCorpus.

Submitted 12 July, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.08394 [pdf, other]

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

Authors: Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Wenhai Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Ping Luo, Yu Qiao, Jifeng Dai

Abstract: We present VisionLLM v2, an end-to-end generalist multimodal large model (MLLM) that unifies visual perception, understanding, and generation within a single framework. Unlike traditional MLLMs limited to text output, VisionLLM v2 significantly broadens its application scope. It excels not only in conventional visual question answering (VQA) but also in open-ended, cross-domain vision tasks such as object localization, pose estimation, and image generation and editing. To this end, we propose a new information transmission mechanism termed "super link", as a medium to connect MLLM with task-specific decoders. It not only allows flexible transmission of task information and gradient feedback between the MLLM and multiple downstream decoders but also effectively resolves training conflicts in multi-tasking scenarios. In addition, to support the diverse range of tasks, we carefully collected and combed training data from hundreds of public vision and vision-language tasks. In this way, our model can be joint-trained end-to-end on hundreds of vision language tasks and generalize to these tasks using a set of shared parameters through different user prompts, achieving performance comparable to task-specific models. We believe VisionLLM v2 will offer a new perspective on the generalization of MLLMs.

Submitted 14 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

Comments: 43 pages

arXiv:2406.07971 [pdf, other]

It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF

Authors: Taiming Lu, Lingfeng Shen, Xinyu Yang, Weiting Tan, Beidi Chen, Huaxiu Yao

Abstract: Reinforcement Learning from Human Feedback (RLHF) involves training policy models (PMs) and reward models (RMs) to align language models with human preferences. Instead of focusing solely on PMs and RMs independently, we propose to examine their interactions during fine-tuning, introducing the concept of seamlessness. Our study starts with observing the saturation phenomenon, where continual improvements in RM and PM do not translate into RLHF progress. Our analysis shows that RMs fail to assign proper scores to PM responses, resulting in a 35% mismatch rate with human preferences, highlighting a significant discrepancy between PM and RM. To measure seamlessness between PM and RM without human effort, we propose an automatic metric, SEAM. SEAM quantifies the discrepancies between PM and RM judgments induced by data samples. We validate the effectiveness of SEAM in data selection and model augmentation. Our experiments demonstrate that (1) using SEAM-filtered data for RL training improves RLHF performance by 4.5%, and (2) SEAM-guided model augmentation results in a 4% performance improvement over standard augmentation methods.

Submitted 13 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.03530 [pdf, other]

Fractional Chern Insulators in Twisted Bilayer MoTe$_2$: A Composite Fermion Perspective

Authors: Tianhong Lu, Luiz H. Santos

Abstract: The discovery of Fractional Chern Insulators (FCIs) in twisted bilayer MoTe$_2$ has sparked significant interest in fractional topological matter without external magnetic fields. Unlike the flat dispersion of Landau levels, moiré electronic states are influenced by lattice effects within a nanometer-scale superlattice. This study examines the impact of these lattice effects on the topological phases in twisted bilayer MoTe$_2$, uncovering a family of FCIs with Abelian anyonic quasiparticles. Using a composite fermion approach, we identify a sequence of FCIs with fractional Hall conductivities $\sigma_{xy} = \frac{C}{2C + 1} \frac{e^2}{h}$ linked to partial filling $\nu_{\,\text{h}}$ of holes of the topmost moiré valence band. These states emerge from incompressible composite fermion bands of Chern number $C$ within a complex Hofstadter spectrum. This approach explains FCIs with Hall conductivities $\sigma_{xy} = (2/3) e^2/h$ and $\sigma_{xy} = (3/5) e^2/h$ at fractional fillings $\nu_{\,\text{h}} = 2/3$ and $\nu_{\,\text{h}} = 3/5$ observed in experiments, and uncovers other fractal FCI states. The Hofstadter spectrum reveals new phenomena, distinct from Landau levels, including a higher-order Van Hove singularity (HOVHS) at half-filling, leading to novel quantum phase transitions. This work offers a comprehensive framework for understanding FCIs in transition metal dichalcogenide moiré systems and highlights mechanisms for topological quantum criticality.

Submitted 10 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

Comments: Main text: 5 pages and 4 figures. Updated version with improved figures and enhanced text presentation

arXiv:2406.02039 [pdf, other]

LMB: Augmenting PCIe Devices with CXL-Linked Memory Buffer

Authors: Jiapin Wang, Xiangping Zhang, Chenlei Tang, Xiang Chen, Tao Lu

Abstract: PCIe devices, such as SSDs and GPUs, are pivotal in modern data centers, and their value is set to grow amidst the emergence of AI and large models. However, these devices face an onboard DRAM shortage due to internal space limitations, which prevent accommodation of sufficient DRAM modules alongside flash or GPU processing chips. Current solutions, which either curb device-internal memory usage or supplement it with slower non-DRAM media, prove inadequate or performance-compromising. This paper introduces the Linked Memory Buffer (LMB), a scalable solution utilizing the CXL memory expander to tackle device onboard memory deficiencies. The low latency of CXL enables LMB to utilize emerging DRAM memory expanders to efficiently supplement device onboard DRAM with minimal impact on performance.

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2405.19511 [pdf, other]

Planet-Planet Scattering and ZLK Migration -- The Dynamical History of HAT-P-11

Authors: Tiger Lu, Qier An, Gongjie Li, Sarah C. Millholland, G. Mirek Brandt, Timothy D. Brandt

Abstract: The two planets of the HAT-P-11 system represent fascinating dynamical puzzles due to their significant eccentricities and orbital misalignments. In particular, HAT-P-11 b is on a close-in orbit that tides should have circularized well within the age of the system. Here we propose a two-step dynamical process that can reproduce all intriguing aspects of the system. We first invoke planet-planet scattering to generate significant eccentricities and mutual inclinations between the planets. We then propose that this misalignment initiated von-Zeipel-Lidov-Kozai cycles and high-eccentricity migration that ultimately brought HAT-P-11 b to its present-day orbit. We find that this scenario is fully consistent only when significant tidally-driven radius inflation is accounted for during the tidal migration. We present a suite of N-body simulations exploring each phase of evolution and show that this scenario is consistent with all observational posteriors and the reported age of the system.

Submitted 29 May, 2024; originally announced May 2024.

Comments: 17 pages, 8 figures, submitted to ApJ

arXiv:2405.19510 [pdf, other]

Significant mutual inclinations between the stellar spin and the orbits of both planets in the HAT-P-11 system

Authors: Qier An, Tiger Lu, G. Mirek Brandt, Timothy D Brandt, Gongjie Li

Abstract: Planet-star and planet-planet obliquity encode a planetary system's dynamical history, but both obliquities are hard to measure for misaligned systems with close-in companions. HAT-P-11 is a K4 star with two known planets: a close-in, misaligned super-Neptune with a $\approx$5-day orbit, and an outer super-Jupiter with a $\approx$10-year orbit. In this work we present a joint orbit fit of the HAT-P-11 system with astrometry and RV data. By combining our results with previous constraints on the orientation of the star and the inner planet, we find that all three angular momenta -- those of the star, planet b, and planet c -- are significantly misaligned. We confirm the status of planet c as a super-Jupiter, with $3.06 \pm 0.42$ Jupiter mass, at a semimajor axis of $4.192 \pm 0.07$ AU, and planet b's minimum mass of $0.073 \pm 0.0053$ Jupiter mass. We present the posterior probability distribution of obliquity between star A and planet c, and between planet b and planet c.

Submitted 29 May, 2024; originally announced May 2024.

Comments: 12 pages, 8 figures, submitted to AJ

arXiv:2405.07527 [pdf, other]

Train Faster, Perform Better: Modular Adaptive Training in Over-Parameterized Models

Authors: Yubin Shi, Yixuan Chen, Mingzhi Dong, Xiaochen Yang, Dongsheng Li, Yujiang Wang, Robert P. Dick, Qin Lv, Yingying Zhao, Fan Yang, Tun Lu, Ning Gu, Li Shang

Abstract: Despite their prevalence in deep-learning communities, over-parameterized models convey high demands of computational costs for proper training. This work studies the fine-grained, modular-level learning dynamics of over-parameterized models to attain a more efficient and fruitful training strategy. Empirical evidence reveals that when scaling down into network modules, such as heads in self-attention models, we can observe varying learning patterns implicitly associated with each module's trainability. To describe such modular-level learning capabilities, we introduce a novel concept dubbed modular neural tangent kernel (mNTK), and we demonstrate that the quality of a module's learning is tightly associated with its mNTK's principal eigenvalue $\lambda_{\max}$. A large $\lambda_{\max}$ indicates that the module learns features with better convergence, while those miniature ones may impact generalization negatively. Inspired by the discovery, we propose a novel training strategy termed Modular Adaptive Training (MAT) to update those modules with their $\lambda_{\max}$ exceeding a dynamic threshold selectively, concentrating the model on learning common features and ignoring those inconsistent ones. Unlike most existing training schemes with a complete BP cycle across all network modules, MAT can significantly save computations by its partially-updating strategy and can further improve performance. Experiments show that MAT nearly halves the computational cost of model training and outperforms the accuracy of baselines.

Submitted 13 May, 2024; originally announced May 2024.

Comments: Accepted at NeurIPS 2023

arXiv:2405.03800 [pdf, other]

TRACE: a Time-Reversible Algorithm for Close Encounters

Authors: Tiger Lu, David M. Hernandez, Hanno Rein

Abstract: We present TRACE, a time-reversible hybrid integrator for the planetary N-body problem. Like hybrid symplectic integrators, TRACE can resolve close encounters between particles while retaining many of the accuracy and speed advantages of a fixed timestep symplectic method such as the Wisdom-Holman map. TRACE switches methods time-reversibly during close encounters following the prescription of Hernandez and Dehnen (2023). In this paper we describe the derivation and implementation of TRACE and study its performance for a variety of astrophysical systems. In all our test cases TRACE is at least as accurate and fast as the hybrid symplectic integrator MERCURIUS. In many cases TRACE's performance is vastly superior to that of MERCURIUS. In test cases with planet-planet close encounters, TRACE is as accurate as MERCURIUS with a 13x speedup. If close encounters with the central star are considered, TRACE achieves good error performance while MERCURIUS fails to give qualitatively correct results. In ensemble tests of violent scattering systems, TRACE matches the high-accuracy IAS15 while providing a 20x speedup. In large N systems simulating lunar accretion, TRACE qualitatively gives the same results as IAS15 but at a 47x speedup. We also discuss some cases such as von Zeipel-Lidov-Kozai cycles where hybrid integrators perform poorly and provide some guidance on which integrator to use for which system. TRACE is freely available within the REBOUND package.

Submitted 6 May, 2024; originally announced May 2024.

Comments: Submitted to MNRAS

arXiv:2404.16821 [pdf, other]

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Authors: Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, et al. (10 additional authors not shown)

Abstract: In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model -- InternViT-6B, boosting its visual understanding capabilities and allowing it to be transferred and reused in different LLMs. (2) Dynamic High-Resolution: we divide images into tiles ranging from 1 to 40 of 448$\times$448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes and document images, and annotated them with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks. Code has been released at https://github.com/OpenGVLab/InternVL.

Submitted 29 April, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

Comments: Technical report
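
The "Dynamic High-Resolution" step in the abstract above (splitting an image into 1 to 40 tiles of 448×448 pixels according to aspect ratio and resolution) can be pictured with a small sketch. This is only one plausible reading, not the released InternVL code: the function name `choose_grid`, the rule of picking the grid with the closest aspect ratio, and the tie-break on tile count are all assumptions made for illustration.

```python
from typing import Tuple


def choose_grid(width: int, height: int,
                tile: int = 448, max_tiles: int = 40) -> Tuple[int, int]:
    """Pick a (cols, rows) tile grid for an image of the given size.

    Assumed rule (hypothetical, for illustration):
      1. consider every grid with 1 <= cols * rows <= max_tiles;
      2. prefer the grid whose aspect ratio cols/rows is closest
         to the image aspect ratio width/height;
      3. break ties by the tile count closest to the number of
         tile-sized areas the image actually covers.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")

    target_ratio = width / height
    wanted_tiles = (width * height) / (tile * tile)

    best_key = None
    best_grid = (1, 1)
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            key = (abs(cols / rows - target_ratio),
                   abs(cols * rows - wanted_tiles))
            if best_key is None or key < best_key:
                best_key, best_grid = key, (cols, rows)
    return best_grid


# The image would then be resized to (cols * tile, rows * tile)
# and cut into cols * rows crops of tile x tile pixels.
square = choose_grid(448, 448)
wide = choose_grid(896, 448)
uhd = choose_grid(3840, 2160)
```

Under these assumed rules a 4K frame cannot use an exact 16:9 grid within the 40-tile budget, so the search settles on the nearest ratio that fits.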

arXiv:2404.14316 [pdf, other]

Automated Long Answer Grading with RiceChem Dataset

Authors: Shashank Sonkar, Kangqi Ni, Lesa Tran Lu, Kristi Kincaid, John S. Hutchinson, Richard G. Baraniuk

Abstract: We introduce a new area of study in the field of educational Natural Language Processing: Automated Long Answer Grading (ALAG). Distinguishing itself from Automated Short Answer Grading (ASAG) and Automated Essay Grading (AEG), ALAG presents unique challenges due to the complexity and multifaceted nature of fact-based long answers. To study ALAG, we introduce RiceChem, a dataset derived from a college chemistry course, featuring real student responses to long-answer questions with an average word count notably higher than typical ASAG datasets. We propose a novel approach to ALAG by formulating it as a rubric entailment problem, employing natural language inference models to verify whether each criterion, represented by a rubric item, is addressed in the student's response. This formulation enables the effective use of MNLI for transfer learning, significantly improving the performance of models on the RiceChem dataset. We demonstrate the importance of rubric-based formulation in ALAG, showcasing its superiority over traditional score-based approaches in capturing the nuances of student responses. We also investigate the performance of models in cold start scenarios, providing valuable insights into the practical deployment considerations in educational settings. Lastly, we benchmark state-of-the-art open-sourced Large Language Models (LLMs) on RiceChem and compare their results to GPT models, highlighting the increased complexity of ALAG compared to ASAG. Despite leveraging the benefits of a rubric-based approach and transfer learning from MNLI, the lower performance of LLMs on RiceChem underscores the significant difficulty posed by the ALAG task. With this work, we offer a fresh perspective on grading long, fact-based answers and introduce a new dataset to stimulate further research in this important area. Code: https://github.com/luffycodes/Automated-Long-Answer-Grading.

Submitted 22 April, 2024; originally announced April 2024.

arXiv:2404.13680 [pdf, other]

Zero-shot High-fidelity and Pose-controllable Character Animation

Authors: Bingwen Zhu, Fanyi Wang, Tianyi Lu, Peng Liu, Jingwen Su, Jinxiu Liu, Yanhao Zhang, Zuxuan Wu, Guo-Jun Qi, Yu-Gang Jiang

Abstract: Image-to-video (I2V) generation aims to create a video sequence from a single image, which requires high temporal coherence and visual fidelity. However, existing approaches suffer from inconsistency of character appearances and poor preservation of fine details. Moreover, they require a large amount of video data for training, which can be computationally demanding. To address these limitations, we propose PoseAnimate, a novel zero-shot I2V framework for character animation. PoseAnimate contains three key components: (1) a Pose-Aware Control Module (PACM) that incorporates diverse pose signals into text embeddings, to preserve character-independent content and maintain precise alignment of actions; (2) a Dual Consistency Attention Module (DCAM) that enhances temporal consistency and retains character identity and intricate background details; and (3) a Mask-Guided Decoupling Module (MGDM) that refines distinct feature perception abilities, improving animation fidelity by decoupling the character and background. We also propose a Pose Alignment Transition Algorithm (PATA) to ensure smooth action transition. Extensive experimental results demonstrate that our approach outperforms the state-of-the-art training-based methods in terms of character consistency and detail fidelity. Moreover, it maintains a high level of temporal coherence throughout the generated animations.

Submitted 5 June, 2024; v1 submitted 21 April, 2024; originally announced April 2024.

Comments: 10 pages, 5 figures

arXiv:2404.11044 [pdf, other]

Asynchronous Memory Access Unit: Exploiting Massive Parallelism for Far Memory Access

Authors: Luming Wang, Xu Zhang, Songyue Wang, Zhuolun Jiang, Tianyue Lu, Mingyu Chen, Siwei Luo, Keji Huang

Abstract: The growing memory demands of modern applications have driven the adoption of far memory technologies in data centers to provide cost-effective, high-capacity memory solutions. However, far memory presents new performance challenges because its access latencies are significantly longer and more variable than local DRAM. For applications to achieve acceptable performance on far memory, a high degree of memory-level parallelism (MLP) is needed to tolerate the long access latency. While modern out-of-order processors are capable of exploiting a certain degree of MLP, they are constrained by resource limitations and hardware complexity. The key obstacle is the synchronous memory access semantics of traditional load/store instructions, which occupy critical hardware resources for a long time. The longer far memory latencies exacerbate this limitation. This paper proposes a set of Asynchronous Memory Access Instructions (AMI) and its supporting function unit, Asynchronous Memory Access Unit (AMU), inside a contemporary Out-of-Order Core. AMI separates memory request issuing from response handling to reduce resource occupation. Additionally, AMU architecture supports up to several hundreds of asynchronous memory requests through re-purposing a portion of L2 Cache as scratchpad memory (SPM) to provide sufficient temporal storage. Together with a coroutine-based programming framework, this scheme can achieve significantly higher MLP for hiding far memory latencies.
Evaluation with a cycle-accurate simulation shows AMI achieves 2.42x speedup on average for memory-bound benchmarks with 1 us additional far memory latency. Over 130 outstanding requests are supported with 26.86x speedup for GUPS (random access) with 5 us latency. These demonstrate how the techniques tackle far memory performance impacts through explicit MLP expression and latency adaptation.

Submitted 16 April, 2024; originally announced April 2024.

arXiv:2404.06514 [pdf, other]

Disentangling transitions in topological order induced by boundary decoherence

Authors: Tsung-Cheng Lu

Abstract: We study the entanglement structure of topological orders subject to decoherence on the bipartition boundary. Focusing on the toric codes in $d$ space dimensions for $d=2,3,4$, we explore whether the boundary decoherence may be able to induce a disentangling transition, characterized by the destruction of mixed-state long-range entanglement across the bipartition, measured by topological entanglement negativity. A key insight of our approach is the connection between the negativity spectrum of the decohered mixed states and emergent symmetry-protected topological orders under certain symmetry-preserving perturbation localized on the bipartition boundary. This insight allows us to analytically derive the exact results of entanglement negativity without using a replica trick.

Submitted 26 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

Comments: 16 pages, 5 figures; typos fixed

arXiv:2404.00723 [pdf, other]

Quantum Weak Force Sensing with Squeezed Magnomechanics

Authors: Qian Zhang, Jie Wang, Tian-Xiang Lu, Franco Nori, Hui Jing

Abstract: Cavity magnomechanics, exhibiting remarkable experimental tunability, rich magnonic nonlinearities, and compatibility with various quantum systems, has witnessed considerable advances in recent years. However, the potential benefits of using cavity magnomechanical (CMM) systems in further improving the performance of quantum-enhanced sensing for weak forces remain largely unexplored. Here we show that the performance of a quantum CMM sensor can be significantly enhanced beyond the standard quantum limit (SQL), by squeezing the magnons. We find that, for comparable parameters, two orders of enhancement in force sensitivity can be achieved in comparison with the case without the magnon squeezing. Moreover, we show optimal parameter regimes of homodyne angle for minimizing added quantum noise. Our findings provide a promising approach for highly tunable and compatible quantum force sensing using hybrid CMM devices, with potential applications ranging from quantum precision measurements to quantum information processing.

Submitted 31 March, 2024; originally announced April 2024.

arXiv:2403.17898 [pdf, other]

Octree-GS: Towards Consistent Real-time Rendering with LOD-Structured 3D Gaussians

Authors: Kerui Ren, Lihan Jiang, Tao Lu, Mulin Yu, Linning Xu, Zhangkai Ni, Bo Dai

Abstract: The recent 3D Gaussian splatting (3D-GS) has shown remarkable rendering fidelity and efficiency compared to NeRF-based neural scene representations. While demonstrating the potential for real-time rendering, 3D-GS encounters rendering bottlenecks in large scenes with complex details due to an excessive number of Gaussian primitives located within the viewing frustum. This limitation is particularly noticeable in zoom-out views and can lead to inconsistent rendering speeds in scenes with varying details. Moreover, it often struggles to capture the corresponding level of detail at different scales with its heuristic density control operation. Inspired by the Level-of-Detail (LOD) techniques, we introduce Octree-GS, featuring an LOD-structured 3D Gaussian approach supporting level-of-detail decomposition for scene representation that contributes to the final rendering results. Our model dynamically selects the appropriate level from the set of multi-resolution anchor points, ensuring consistent rendering performance with adaptive LOD adjustments while maintaining high-fidelity rendering results.

Submitted 26 March, 2024; originally announced March 2024.

Comments: Project page: https://city-super.github.io/octree-gs/

arXiv:2403.16964 [pdf, other]

GSDF: 3DGS Meets SDF for Improved Rendering and Reconstruction

Authors: Mulin Yu, Tao Lu, Linning Xu, Lihan Jiang, Yuanbo Xiangli, Bo Dai

Abstract: Presenting a 3D scene from multiview images remains a core and long-standing challenge in computer vision and computer graphics. Two main requirements lie in rendering and reconstruction. Notably, SOTA rendering quality is usually achieved with neural volumetric rendering techniques, which rely on aggregated point/primitive-wise color and neglect the underlying scene geometry. Learning of neural implicit surfaces was sparked by the success of neural rendering. Current works either constrain the distribution of density fields or the shape of primitives, resulting in degraded rendering quality and flaws on the learned scene surfaces. The efficacy of such methods is limited by the inherent constraints of the chosen neural representation, which struggles to capture fine surface details, especially for larger, more intricate scenes. To address these issues, we introduce GSDF, a novel dual-branch architecture that combines the benefits of a flexible and efficient 3D Gaussian Splatting (3DGS) representation with neural Signed Distance Fields (SDF). The core idea is to leverage and enhance the strengths of each branch while alleviating their limitations through mutual guidance and joint supervision. We show on diverse scenes that our design unlocks the potential for more accurate and detailed surface reconstructions, and at the same time benefits 3DGS rendering with structures that are more aligned with the underlying geometry.

Submitted 25 March, 2024; originally announced March 2024.

Comments: Project page: https://city-super.github.io/GSDF

arXiv:2403.12995 [pdf, other]

ESM All-Atom: Multi-scale Protein Language Model for Unified Molecular Modeling

Authors: Kangjie Zheng, Siyu Long, Tianyu Lu, Junwei Yang, Xinyu Dai, Ming Zhang, Zaiqing Nie, Wei-Ying Ma, Hao Zhou

Abstract: Protein language models have demonstrated significant potential in the field of protein engineering. However, current protein language models primarily operate at the residue scale, which limits their ability to provide information at the atom level. This limitation prevents us from fully exploiting the capabilities of protein language models for applications involving both proteins and small molecules. In this paper, we propose ESM-AA (ESM All-Atom), a novel approach that enables atom-scale and residue-scale unified molecular modeling. ESM-AA achieves this by pre-training on multi-scale code-switch protein sequences and utilizing a multi-scale position encoding to capture relationships among residues and atoms. Experimental results indicate that ESM-AA surpasses previous methods in protein-molecule tasks, demonstrating the full utilization of protein language models. Further investigations reveal that through unified molecular modeling, ESM-AA not only gains molecular knowledge but also retains its understanding of proteins. The source code of ESM-AA is publicly released at https://github.com/zhengkangjie/ESM-AA.

Submitted 12 June, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

Comments: ICML2024 camera-ready, update some experimental results, add github url, fix some typos

arXiv:2403.09979 [pdf, other]

Quantum Advantage of One-Way Squeezing in Enhancing Weak-Force Sensing

Authors: Jie Wang, Qian Zhang, Ya-Feng Jiao, Sheng-Dian Zhang, Tian-Xiang Lu, Zhipeng Li, Cheng-Wei Qiu, Hui Jing

Abstract: Cavity optomechanical (COM) sensors, featuring efficient light-motion couplings, have been widely used for ultrasensitive measurements of various physical quantities ranging from displacements to accelerations or weak forces. Previous works, however, have mainly focused on reciprocal COM systems. Here, we propose how to further improve the performance of quantum COM sensors by breaking reciprocal symmetry in the purely quantum regime. Specifically, we consider a spinning COM resonator and show that by selectively driving it in opposite directions, highly nonreciprocal optical squeezing can emerge, which in turn provides an efficient way to surpass the standard quantum limit that otherwise exists in conventional reciprocal devices. Our work confirms that breaking reciprocal symmetry, already achieved in diverse systems well beyond spinning systems, can serve as a new strategy to further enhance the abilities of advanced quantum sensors, for applications ranging from testing fundamental physical laws to practical quantum metrology.

Submitted 14 March, 2024; originally announced March 2024.

Comments: 7 pages, 3 figures

arXiv:2403.09626 [pdf, other]

Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding

Authors: Guo Chen, Yifei Huang, Jilan Xu, Baoqi Pei, Zhe Chen, Zhiqi Li, Jiahao Wang, Kunchang Li, Tong Lu, Limin Wang

Abstract: Understanding videos is one of the fundamental directions in computer vision research, with extensive efforts dedicated to exploring various architectures such as RNN, 3D CNN, and Transformers. The newly proposed architecture of state space model, e.g., Mamba, shows promising traits to extend its success in long sequence modeling to video modeling. To assess whether Mamba can be a viable alternative to Transformers in the video understanding domain, in this work, we conduct a comprehensive set of studies, probing different roles Mamba can play in modeling videos, while investigating diverse tasks where Mamba could exhibit superiority. We categorize Mamba into four roles for modeling videos, deriving a Video Mamba Suite composed of 14 models/modules, and evaluating them on 12 video understanding tasks. Our extensive experiments reveal the strong potential of Mamba on both video-only and video-language tasks while showing promising efficiency-performance trade-offs. We hope this work could provide valuable data points and insights for future research on video understanding. Code is public: https://github.com/OpenGVLab/video-mamba-suite.

Submitted 14 March, 2024; originally announced March 2024.

Comments: Technical Report

arXiv:2403.04247 [pdf, other]

UltraWiki: Ultra-fine-grained Entity Set Expansion with Negative Seed Entities

Authors: Yangning Li, Qingsong Lv, Tianyu Yu, Yinghui Li, Shulin Huang, Tingwei Lu, Xuming Hu, Wenhao Jiang, Hai-Tao Zheng, Hui Wang

Abstract: Entity Set Expansion (ESE) aims to identify new entities belonging to the same semantic class as a given set of seed entities. Traditional methods primarily relied on positive seed entities to represent a target semantic class, which poses a challenge for the representation of ultra-fine-grained semantic classes. Ultra-fine-grained semantic classes are defined based on fine-grained semantic classes with more specific attribute constraints. Describing them with positive seed entities alone causes two issues: (i) ambiguity among ultra-fine-grained semantic classes, and (ii) inability to define "unwanted" semantics. Due to these inherent shortcomings, previous methods struggle to address the ultra-fine-grained ESE (Ultra-ESE). To solve this issue, we first introduce negative seed entities in the inputs, which belong to the same fine-grained semantic class as the positive seed entities but differ in certain attributes. Negative seed entities eliminate the semantic ambiguity by contrast between positive and negative attributes. Meanwhile, they provide a straightforward way to express "unwanted". To assess model performance in Ultra-ESE, we constructed UltraWiki, the first large-scale dataset tailored for Ultra-ESE. UltraWiki encompasses 236 ultra-fine-grained semantic classes, where each query is represented with 3-5 positive and negative seed entities. A retrieval-based framework RetExpan and a generation-based framework GenExpan are proposed to comprehensively assess the efficacy of large language models from two different paradigms in Ultra-ESE.
Moreover, we devised three strategies to enhance models' comprehension of ultra-fine-grained entity semantics: contrastive learning, retrieval augmentation, and chain-of-thought reasoning. Extensive experiments confirm the effectiveness of our proposed strategies and also reveal that there remains a large space for improvement in Ultra-ESE.

Submitted 23 April, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

Comments: Initial Version

arXiv:2403.03419 [pdf, other]

Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimization

Authors: Shitong Duan, Xiaoyuan Yi, Peng Zhang, Tun Lu, Xing Xie, Ning Gu

Abstract: Large language models (LLMs) have revolutionized the role of AI, yet also pose potential risks of propagating unethical content. Alignment technologies have been introduced to steer LLMs towards human preference, gaining increasing attention. Despite notable breakthroughs in this direction, existing methods heavily rely on high-quality positive-negative training pairs, suffering from noisy labels and the marginal distinction between preferred and dispreferred response data. Given recent LLMs' proficiency in generating helpful responses, this work pivots towards a new research focus: achieving alignment using solely human-annotated negative samples, preserving helpfulness while reducing harmfulness. For this purpose, we propose Distributional Dispreference Optimization (D$^2$O), which maximizes the discrepancy between the generated responses and the dispreferred ones to effectively eschew harmful information. We theoretically demonstrate that D$^2$O is equivalent to learning a distributional instead of instance-level preference model reflecting human dispreference against the distribution of negative responses. Besides, D$^2$O integrates an implicit Jeffrey Divergence regularization to balance the exploitation and exploration of reference policies and converges to a non-negative one during training. Extensive experiments demonstrate that our method achieves comparable generation quality and surpasses the latest baselines in producing less harmful and more informative responses with better training stability and faster convergence.

Submitted 5 March, 2024; originally announced March 2024.

arXiv:2403.02308 [pdf, other]

Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures

Authors: Yuchen Duan, Weiyun Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Hongsheng Li, Jifeng Dai, Wenhai Wang

Abstract: Transformers have revolutionized computer vision and natural language processing, but their high computational complexity limits their application in high-resolution image processing and long-context analysis. This paper introduces Vision-RWKV (VRWKV), a model adapted from the RWKV model used in the NLP field with necessary modifications for vision tasks. Similar to the Vision Transformer (ViT), our model is designed to efficiently handle sparse inputs and demonstrate robust global processing capabilities, while also scaling up effectively, accommodating both large-scale parameters and extensive datasets. Its distinctive advantage lies in its reduced spatial aggregation complexity, which renders it exceptionally adept at processing high-resolution images seamlessly, eliminating the necessity for windowing operations. Our evaluations demonstrate that VRWKV surpasses ViT's performance in image classification and has significantly faster speeds and lower memory usage processing high-resolution inputs. In dense prediction tasks, it outperforms window-based models, maintaining comparable speeds. These results highlight VRWKV's potential as a more efficient alternative for visual perception tasks. Code is released at https://github.com/OpenGVLab/Vision-RWKV.

Submitted 7 March, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

arXiv:2402.19327 [pdf]

GPTFF: A high-accuracy out-of-the-box universal AI force field for arbitrary inorganic materials

Authors: Fankai Xie, Tenglong Lu, Sheng Meng, Miao Liu

Abstract: This study introduces a novel AI force field, namely graph-based pre-trained transformer force field (GPTFF), which can simulate arbitrary inorganic systems with good precision and generalizability. Harnessing a large trove of data and the attention mechanism of transformer algorithms, the model can accurately predict energy, atomic forces, and stress with Mean Absolute Error (MAE) values of 32 meV/atom, 71 meV/Å, and 0.365 GPa, respectively. The dataset used to train the model includes 37.8 million single-point energies, 11.7 billion force pairs, and 340.2 million stresses. We also demonstrated that GPTFF can be universally used to simulate various physical systems, such as crystal structure optimization, phase transition simulations, and mass transport.

Submitted 29 February, 2024; originally announced February 2024.

arXiv:2402.15991 [pdf, other]

$C^3$: Confidence Calibration Model Cascade for Inference-Efficient Cross-Lingual Natural Language Understanding

Authors: Taixi Lu, Haoyu Wang, Huajie Shao, Jing Gao, Huaxiu Yao

Abstract: Cross-lingual natural language understanding (NLU) is a critical task in natural language processing (NLP). Recent advancements have seen multilingual pre-trained language models (mPLMs) significantly enhance the performance of these tasks. However, mPLMs necessitate substantial resources and incur high computational costs during inference, posing challenges for deployment in real-world and real-time systems. Existing model cascade methods seek to enhance inference efficiency by greedily selecting the lightest model capable of processing the current input from a variety of models, based on model confidence scores. Nonetheless, deep models tend to exhibit overconfidence, and confidence distributions vary across languages. This leads to the emission of confident but incorrect predictions by smaller models, hindering their ability to generalize effectively across test languages. In this study, we introduce a confidence calibration model cascade ($C^3$) method. This approach, simple yet effective, involves calibration prior to cascade inference, thereby enhancing cascade accuracy through more reliable predictions. Extensive experiments conducted on three cross-lingual benchmarks demonstrate that $C^3$ significantly outperforms all state-of-the-art baselines.

Submitted 25 February, 2024; originally announced February 2024.

arXiv:2402.08426 [pdf, other]

Frequency-aware Graph Signal Processing for Collaborative Filtering

Authors: Jiafeng Xia, Dongsheng Li, Hansu Gu, Tun Lu, Peng Zhang, Li Shang, Ning Gu

Abstract: Graph Signal Processing (GSP) based recommendation algorithms have recently attracted lots of attention due to their high efficiency. However, these methods failed to consider the importance of various interactions that reflect unique user/item characteristics and failed to utilize user and item high-order neighborhood information to model user preference, thus leading to sub-optimal performance. To address the above issues, we propose a frequency-aware graph signal processing method (FaGSP) for collaborative filtering. Firstly, we design a Cascaded Filter Module, consisting of an ideal high-pass filter and an ideal low-pass filter that work in a successive manner, to capture both unique and common user/item characteristics to more accurately model user preference. Then, we devise a Parallel Filter Module, consisting of two low-pass filters that can easily capture the hierarchy of neighborhood, to fully utilize high-order neighborhood information of users/items for more accurate user preference modeling. Finally, we combine these two modules via a linear model to further improve recommendation accuracy. Extensive experiments on six public datasets demonstrate the superiority of our method from the perspectives of prediction accuracy and training efficiency compared with state-of-the-art GCN-based recommendation methods and GSP-based recommendation methods.

Submitted 13 February, 2024; originally announced February 2024.

arXiv:2402.02374 [pdf, other]

PromptRR: Diffusion Models as Prompt Generators for Single Image Reflection Removal

Authors: Tao Wang, Wanglong Lu, Kaihao Zhang, Wenhan Luo, Tae-Kyun Kim, Tong Lu, Hongdong Li, Ming-Hsuan Yang

Abstract: Existing single image reflection removal (SIRR) methods using deep learning tend to miss key low-frequency (LF) and high-frequency (HF) differences in images, affecting their effectiveness in removing reflections. To address this problem, this paper proposes a novel prompt-guided reflection removal (PromptRR) framework that uses frequency information as new visual prompts for better reflection performance. Specifically, the proposed framework decouples the reflection removal process into the prompt generation and subsequent prompt-guided restoration. For the prompt generation, we first propose a prompt pre-training strategy to train a frequency prompt encoder that encodes the ground-truth image into LF and HF prompts. Then, we adopt diffusion models (DMs) as prompt generators to generate the LF and HF prompts estimated by the pre-trained frequency prompt encoder. For the prompt-guided restoration, we integrate specially generated prompts into the PromptFormer network, employing a novel Transformer-based prompt block to effectively steer the model toward enhanced reflection removal. The results on commonly used benchmarks show that our method outperforms state-of-the-art approaches. The code and models are available at https://github.com/TaoWangzj/PromptRR.

Submitted 4 February, 2024; originally announced February 2024.

Comments: 10 pages, 10 figures

arXiv:2401.16668 [pdf, other]

doi 10.1145/3613904.3642317

InteractOut: Leveraging Interaction Proxies as Input Manipulation Strategies for Reducing Smartphone Overuse

Authors: Tao Lu, Hongxiao Zheng, Tianying Zhang, Xuhai Xu, Anhong Guo

Abstract: Smartphone overuse poses risks to people's physical and mental health. However, current intervention techniques mainly focus on explicitly changing screen content (i.e., output) and often fail to persistently reduce smartphone overuse due to being over-restrictive or over-flexible. We present the design and implementation of InteractOut, a suite of implicit input manipulation techniques that leverage interaction proxies to weakly inhibit the natural execution of common user gestures on mobile devices. We present a design space for input manipulations and demonstrate 8 Android implementations of input interventions. We first conducted a pilot lab study (N=30) to evaluate the usability of these interventions. Based on the results, we then performed a 5-week within-subject field experiment (N=42) to evaluate InteractOut in real-world scenarios. Compared to the traditional and common timed lockout technique, InteractOut significantly reduced the usage time by an additional 15.6% and opening frequency by 16.5% on participant-selected target apps. InteractOut also achieved a 25.3% higher user acceptance rate, and resulted in less frustration and better user experience according to participants' subjective feedback. InteractOut demonstrates a new direction for smartphone overuse intervention and serves as a strong complementary set of techniques with existing methods.

Submitted 19 February, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

Comments: CHI 2024

arXiv:2401.15261 [pdf, other]

Vanishing-Point-Guided Video Semantic Segmentation of Driving Scenes

Authors: Diandian Guo, Deng-Ping Fan, Tongyu Lu, Christos Sakaridis, Luc Van Gool

Abstract: The estimation of implicit cross-frame correspondences and the high computational cost have long been major challenges in video semantic segmentation (VSS) for driving scenes. Prior works utilize keyframes, feature propagation, or cross-frame attention to address these issues. By contrast, we are the first to harness vanishing point (VP) priors for more effective segmentation. Intuitively, objects near VPs (i.e., away from the vehicle) are less discernible. Moreover, they tend to move radially away from the VP over time in the usual case of a forward-facing camera, a straight road, and linear forward motion of the vehicle. Our novel, efficient network for VSS, named VPSeg, incorporates two modules that utilize exactly this pair of static and dynamic VP priors: sparse-to-dense feature mining (DenseVP) and VP-guided motion fusion (MotionVP). MotionVP employs VP-guided motion estimation to establish explicit correspondences across frames and help attend to the most relevant features from neighboring frames, while DenseVP enhances weak dynamic features in distant regions around VPs. These modules operate within a context-detail framework, which separates contextual features from high-resolution local features at different input resolutions to reduce computational costs. Contextual and local features are integrated through contextualized motion attention (CMA) for the final prediction. Extensive experiments on two popular driving segmentation benchmarks, Cityscapes and ACDC, demonstrate that VPSeg outperforms previous SOTA methods, with only modest computational overhead.

Submitted 25 April, 2024; v1 submitted 26 January, 2024; originally announced January 2024.

Comments: CVPR 2024 highlight

arXiv:2401.13723 [pdf, other]

doi 10.3847/25c2cfeb.9580e9c8

Emerging Researchers in Exoplanetary Science (ERES): Lessons Learned in Conference Organization for Early-Career Researchers

Authors: W. Garrett Levine, Konstantin Gerbig, Emma M. Louden, Tiger Lu, Cheng-Han Hsieh, Christopher O'Connor, Rixin Li, Jiayin Dong

Abstract: Since 2015, the Emerging Researchers in Exoplanetary Science (ERES) conference has provided a venue for early-career researchers in exoplanetary astronomy, astrophysics, and planetary science to share their research, network, and build new collaborations. ERES stands out in that it is spearheaded by early-career researchers, providing a unique attendance experience for the participants and a professional experience for the organizers. In this Bulletin, we share experiences and lessons learned from the perspective of the organizing committee for the 2023 edition of ERES. For this eighth ERES conference, we hosted over 100 participants in New Haven, CT, for a three-day program. This manuscript is aimed primarily toward groups of early-career scientists who are planning a conference for their fields of study. We anticipate that this Bulletin will continue dialogue within the academic community about best practices for equitable event organization.

Submitted 24 January, 2024; originally announced January 2024.

Comments: To appear in the Bulletin of the American Astronomical Society (see DOI); 13 pages, 6 figures

arXiv:2401.10529 [pdf, other]

Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences

Authors: Xiyao Wang, Yuhang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, Feihong He, Jaehong Yoon, Taixi Lu, Gedas Bertasius, Mohit Bansal, Huaxiu Yao, Furong Huang

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated proficiency in handling a variety of visual-language tasks. However, current MLLM benchmarks are predominantly designed to evaluate reasoning based on static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, has been less investigated. To address this challenge, this paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities. Mementos features 4,761 diverse image sequences with varying lengths. We also employ a GPT-4 assisted method to evaluate MLLM reasoning performance. Through a careful evaluation of nine recent MLLMs on Mementos, including GPT-4V and Gemini, we find that they struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects and their corresponding behaviors. Our quantitative analysis and case studies identify three key factors impacting MLLMs' sequential image reasoning: the correlation between object and behavioral hallucinations, the influence of cooccurring behaviors, and the compounding impact of behavioral hallucinations. Our dataset is available at https://github.com/umd-huang-lab/Mementos.

Submitted 24 January, 2024; v1 submitted 19 January, 2024; originally announced January 2024.

Comments: 27 pages, 23 figures

arXiv:2401.10208 [pdf, other]

MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer

Authors: Changyao Tian, Xizhou Zhu, Yuwen Xiong, Weiyun Wang, Zhe Chen, Wenhai Wang, Yuntao Chen, Lewei Lu, Tong Lu, Jie Zhou, Hongsheng Li, Yu Qiao, Jifeng Dai

Abstract: Developing generative models for interleaved image-text data has both research and practical value. It requires models to understand the interleaved sequences and subsequently generate images and text. However, existing attempts are limited by the issue that the fixed number of visual tokens cannot efficiently capture image details, which is particularly problematic in the multi-image scenarios. To address this, this paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data. It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features in the previous context during the generation process. MM-Interleaved is end-to-end pre-trained on both paired and interleaved image-text corpora. It is further enhanced through a supervised fine-tuning phase, wherein the model improves its ability to follow complex multi-modal instructions. Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions. Code and models are available at https://github.com/OpenGVLab/MM-Interleaved.

Submitted 2 April, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

Comments: 20 pages, 9 figures, 17 tables

arXiv:2401.08614 [pdf, ps, other]

Computing the Haar state of $\mathcal{O}(SL_q(3))$ using value preserving (anti)homomorphisms

Authors: Ting Lu

Abstract: In this paper, we introduce two (anti)homomorphisms that preserve the Haar state values of monomials. Together with the modular automorphism, the three (anti)homomorphisms are used in our new algorithm to compute the Haar states of monomials on $\mathcal{O}(SL_q(3))$. Comparing with the algorithm proposed in the author's previous work \cite{lu2023}, the new algorithm reduces the linear relations used in the computation to a half.

Submitted 26 April, 2024; v1 submitted 1 December, 2023; originally announced January 2024.

Comments: Removed text overlap with arXiv:2301.12683. Updated the introduction section and changed title for section 2

MSC Class: 20G42(Primary) 46L53(Secondary)

arXiv:2401.08036 [pdf, other]

3D Lane Detection from Front or Surround-View using Joint-Modeling & Matching

Authors: Haibin Zhou, Huabing Zhou, Jun Chang, Tao Lu, Jiayi Ma

Abstract: 3D lanes offer a more comprehensive understanding of the road surface geometry than 2D lanes, thereby providing crucial references for driving decisions and trajectory planning. While many efforts aim to improve prediction accuracy, we recognize that an efficient network can bring results closer to lane modeling. However, if the modeling data is imprecise, the results might not accurately capture the real-world scenario. Therefore, accurate lane modeling is essential to align prediction results closely with the environment. This study centers on efficient and accurate lane modeling, proposing a joint modeling approach that combines Bezier curves and interpolation methods. Furthermore, based on this lane modeling approach, we developed a Global2Local Lane Matching method with Bezier Control-Point and Key-Point, which serve as a comprehensive solution that leverages hierarchical features with two mathematical models to ensure a precise match. We also introduce a novel 3D Spatial Encoder, representing an exploration of 3D surround-view lane detection research. The framework is suitable for front-view or surround-view 3D lane detection. By directly outputting the key points of lanes in 3D space, it overcomes the limitations of anchor-based methods, enabling accurate prediction of closed-loop or U-shaped lanes and effective adaptation to complex road conditions. This innovative method establishes a new benchmark in front-view 3D lane detection on the Openlane dataset and achieves competitive performance in surround-view 2D lane detection on the Argoverse2 dataset.

Submitted 28 May, 2024; v1 submitted 15 January, 2024; originally announced January 2024.

Comments: Accepted to IEEE Transactions on Intelligent Vehicles (T-IV). 13 pages with 9 figures and 6 tables

arXiv:2401.06197 [pdf, other]

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

Authors: Yuwen Xiong, Zhiqi Li, Yuntao Chen, Feng Wang, Xizhou Zhu, Jiapeng Luo, Wenhai Wang, Tong Lu, Hongsheng Li, Yu Qiao, Lewei Lu, Jie Zhou, Jifeng Dai

Abstract: We introduce Deformable Convolution v4 (DCNv4), a highly efficient and effective operator designed for a broad spectrum of vision applications. DCNv4 addresses the limitations of its predecessor, DCNv3, with two key enhancements: 1. removing softmax normalization in spatial aggregation to enhance its dynamic property and expressive power and 2. optimizing memory access to minimize redundant operations for speedup. These improvements result in a significantly faster convergence compared to DCNv3 and a substantial increase in processing speed, with DCNv4 achieving more than three times the forward speed. DCNv4 demonstrates exceptional performance across various tasks, including image classification, instance and semantic segmentation, and notably, image generation. When integrated into generative models like U-Net in the latent diffusion model, DCNv4 outperforms its baseline, underscoring its possibility to enhance generative models. In practical applications, replacing DCNv3 with DCNv4 in the InternImage model to create FlashInternImage results in up to 80% speed increase and further performance improvement without further modifications. The advancements in speed and efficiency of DCNv4, combined with its robust performance across diverse vision tasks, show its potential as a foundational building block for future vision models.

Submitted 11 January, 2024; originally announced January 2024.

Comments: Tech report; Code: https://github.com/OpenGVLab/DCNv4

arXiv:2401.01552 [pdf, other]

doi 10.1609/aaai.v38i5.28268

CRA-PCN: Point Cloud Completion with Intra- and Inter-level Cross-Resolution Transformers

Authors: Yi Rong, Haoran Zhou, Lixin Yuan, Cheng Mei, Jiahao Wang, Tong Lu

Abstract: Point cloud completion is an indispensable task for recovering complete point clouds due to incompleteness caused by occlusion, limited sensor resolution, etc. The family of coarse-to-fine generation architectures has recently exhibited great success in point cloud completion and gradually became mainstream. In this work, we unveil one of the key ingredients behind these methods: meticulously devised feature extraction operations with explicit cross-resolution aggregation. We present Cross-Resolution Transformer that efficiently performs cross-resolution aggregation with local attention mechanisms. With the help of our recursive designs, the proposed operation can capture more scales of features than common aggregation operations, which is beneficial for capturing fine geometric characteristics. While prior methodologies have ventured into various manifestations of inter-level cross-resolution aggregation, the effectiveness of intra-level one and their combination has not been analyzed. With unified designs, Cross-Resolution Transformer can perform intra- or inter-level cross-resolution aggregation by switching inputs. We integrate two forms of Cross-Resolution Transformers into one up-sampling block for point generation, and following the coarse-to-fine manner, we construct CRA-PCN to incrementally predict complete shapes with stacked up-sampling blocks. Extensive experiments demonstrate that our method outperforms state-of-the-art methods by a large margin on several widely used benchmarks. Codes are available at https://github.com/EasyRy/CRA-PCN.

Submitted 14 February, 2024; v1 submitted 3 January, 2024; originally announced January 2024.

Comments: Accepted to AAAI 2024

arXiv:2312.17235 [pdf, other]

A Simple LLM Framework for Long-Range Video Question-Answering

Authors: Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, Gedas Bertasius

Abstract: We present LLoVi, a language-based framework for long-range video question-answering (LVQA). Unlike prior long-range video understanding methods, which are often costly and require specialized long-range video modeling design (e.g., memory queues, state-space layers, etc.), our approach uses a frame/clip-level visual captioner (e.g., BLIP2, LaViLa, LLaVA) coupled with a Large Language Model (GPT-3.5, GPT-4) leading to a simple yet surprisingly effective LVQA framework. Specifically, we decompose short and long-range modeling aspects of LVQA into two stages. First, we use a short-term visual captioner to generate textual descriptions of short video clips (0.5-8s in length) densely sampled from a long input video. Afterward, an LLM aggregates the densely extracted short-term captions to perform long-range temporal reasoning needed to understand the whole video and answer a question. To analyze what makes our simple framework so effective, we thoroughly evaluate various components of our system. Our empirical analysis reveals that the choice of the visual captioner and LLM is critical for good LVQA performance. Furthermore, we show that a specialized prompt that asks the LLM first to summarize the noisy short-term visual captions and then answer a given input question leads to a significant LVQA performance boost. On EgoSchema, which is best known as a very long-form video question-answering benchmark, our method achieves 50.3% accuracy, outperforming the previous best-performing approach by 18.1% (absolute gain). In addition, our approach outperforms the previous state-of-the-art by 4.1% and 3.1% on NeXT-QA and IntentQA. We also extend LLoVi to grounded LVQA and show that it outperforms all prior methods on the NeXT-GQA dataset. We will release our code at https://github.com/CeeZh/LLoVi.

Submitted 26 February, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

arXiv:2312.15690 [pdf, other]

Word length-aware text spotting: Enhancing detection and recognition in dense text image

Authors: Hao Wang, Huabing Zhou, Yanduo Zhang, Tao Lu, Jiayi Ma

Abstract: Scene text spotting is essential in various computer vision applications, enabling extracting and interpreting textual information from images. However, existing methods often neglect the spatial semantics of word images, leading to suboptimal detection recall rates for long and short words within long-tailed word length distributions that exist prominently in dense scenes. In this paper, we present WordLenSpotter, a novel word length-aware spotter for scene text image detection and recognition, improving the spotting capabilities for long and short words, particularly in the tail data of dense text images. We first design an image encoder equipped with a dilated convolutional fusion module to integrate multiscale text image features effectively. Then, leveraging the Transformer framework, we synergistically optimize text detection and recognition accuracy after iteratively refining text region image features using the word length prior. Specially, we design a Spatial Length Predictor module (SLP) using character count prior tailored to different word lengths to constrain the regions of interest effectively. Furthermore, we introduce a specialized word Length-aware Segmentation (LenSeg) proposal head, enhancing the network's capacity to capture the distinctive features of long and short terms within categories characterized by long-tailed distributions. Comprehensive experiments on public datasets and our dense text spotting dataset DSTD1500 demonstrate the superiority of our proposed methods, particularly in dense text image detection and recognition tasks involving long-tailed word length distributions encompassing a range of long and short words.

Submitted 25 December, 2023; originally announced December 2023.

arXiv:2312.14238 [pdf, other]

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Authors: Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai

Abstract: The exponential growth of large language models (LLMs) has opened up numerous possibilities for multimodal AGI systems. However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data from various sources. This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks including visual perception tasks such as image-level or pixel-level recognition, vision-language tasks such as zero-shot image/video classification, zero-shot image/video-text retrieval, and link with LLMs to create multi-modal dialogue systems. It has powerful visual capabilities and can be a good alternative to the ViT-22B. We hope that our research could contribute to the development of multi-modal large models. Code and models are available at https://github.com/OpenGVLab/InternVL.

Submitted 15 January, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

Comments: 25 pages, 5 figures, 28 tables

arXiv:2312.11968 [pdf, other]

Multi-color nonreciprocal optical amplifier with spinning active optomechanics

Authors: Ru-Ting Sun, Mei-Yu Peng, Tian-Xiang Lu, Ya-Feng Jiao, Jie Wang, Qian Zhang, Hui Jing

Abstract: We propose to achieve a multi-color nonreciprocal optical amplifier, a crucial device in optical communication and information processing, by spinning an active resonator. We show that in such a device, due to the interplay of the Sagnac effect and the optical gain, nonreciprocal signal amplification can be realized, accompanied by a giant enhancement of optical group delay from $0.3\;\mathrm{ms}$ to $35\;\mathrm{ms}$ in a chosen direction, which is otherwise unattainable in a passive device. Also, coherent amplification of higher-order optical sidebands and a slow-to-fast light switch can be achieved by tuning both the pump power and the spinning velocity. Our work provides a unique and accessible way, well-compatible with other existing techniques, to realize multi-color nonreciprocal optical amplifiers for more flexible control of optical fields.

Submitted 19 December, 2023; originally announced December 2023.

Comments: 8 pages, 4 figures

arXiv:2312.11805 [pdf, other]

Gemini: A Family of Highly Capable Multimodal Models

Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee , et al. (1325 additional authors not shown)

Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.

Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.06896 [pdf, other]

Quantum squeezing induced nonreciprocal phonon laser

Authors: Tian-Xiang Lu, Yan Wang, Keyu Xia, Xing Xiao, Le-Man Kuang, Hui Jing

Abstract: Phonon lasers or coherent amplifications of mechanical oscillations have provided powerful tools for both fundamental studies of coherent acoustics and diverse applications ranging from ultrasensitive force sensing to phononic information processing. Here, we propose how to achieve directional phonon lasing with an optomechanical resonator coupled to a nonlinear optical resonator. We find that, by pumping the nonlinear resonator, directional optical squeezing can occur along the pump direction. As a result, we can achieve the directional mechanical gain by utilizing the directional optical squeezing, thus leading to nonreciprocal phonon lasing with a well-tunable directional power threshold. Our work shows a feasible way to build nonreciprocal phonon lasers with various nonlinear optical mediums, which are important for such a wide range of applications as directional acoustic amplifiers, invisible sound sensing or imaging, and one-way phononic networks.

Submitted 11 December, 2023; originally announced December 2023.

Comments: 11 pages, 4 figures

arXiv:2312.03988 [pdf, other]

Enhanced high-dimensional teleportation in correlated amplitude damping noise by weak measurement and environment-assisted measurement

Authors: Xing Xiao, Tian-Xiang Lu, Yan-Ling Li

Abstract: High-dimensional teleportation provides various benefits in quantum networks and repeaters, but all these advantages rely on the high-quality distribution of high-dimensional entanglement over a noisy channel. It is essential to consider correlation effects when two entangled qutrits travel consecutively through the same channel. In this paper, we present two strategies for enhancing qutrit teleportation in correlated amplitude damping (CAD) noise by weak measurement (WM) and environment-assisted measurement (EAM). The fidelity of both approaches has been dramatically improved due to the probabilistic nature of WM and EAM. We have observed that the correlation effects of CAD noise result in an increase in the probability of success. A comparison has demonstrated that the EAM scheme consistently outperforms the WM scheme in regard to fidelity. Our research expands the capabilities of WM and EAM as quantum techniques to combat CAD noise in qutrit teleportation, facilitating the development of advanced quantum technologies in high-dimensional systems.

Submitted 2 July, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

Comments: 18 pages, 5 figures. Figure 1 has been replaced

arXiv:2312.03031 [pdf, other]

Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?

Authors: Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, Jose M. Alvarez

Abstract: End-to-end autonomous driving recently emerged as a promising research direction to target autonomy from a full-stack perspective. Along this line, many of the latest works follow an open-loop evaluation setting on nuScenes to study the planning behavior. In this paper, we delve deeper into the problem by conducting thorough analyses and demystifying more devils in the details. We initially observed that the nuScenes dataset, characterized by relatively simple driving scenarios, leads to an under-utilization of perception information in end-to-end models incorporating ego status, such as the ego vehicle's velocity. These models tend to rely predominantly on the ego vehicle's status for future path planning. Beyond the limitations of the dataset, we also note that current metrics do not comprehensively assess the planning quality, leading to potentially biased conclusions drawn from existing benchmarks. To address this issue, we introduce a new metric to evaluate whether the predicted trajectories adhere to the road. We further propose a simple baseline able to achieve competitive results without relying on perception annotations. Given the current limitations on the benchmark and metrics, we suggest the community reassess relevant prevailing research and be cautious whether the continued pursuit of state-of-the-art would yield convincing and universal conclusions. Code and models are available at https://github.com/NVlabs/BEV-Planner

Submitted 2 June, 2024; v1 submitted 5 December, 2023; originally announced December 2023.

Comments: Accepted to CVPR 2024
