Navi2Gaze: Leveraging Foundation Models for Navigation and Target Gazing

Authors: Jun Zhu, Zihao Du, Haotian Xu, Fengbo Lan, Zilong Zheng, Bo Ma, Shengjie Wang, Tao Zhang

Abstract: Task-aware navigation continues to be a challenging area of research, especially in scenarios involving open vocabulary. Previous studies primarily focus on finding suitable locations for task completion, often overlooking the importance of the robot's pose. However, the robot's orientation is crucial for successfully completing tasks because of how objects are arranged (e.g., to open a refrigerator door). Humans intuitively navigate to objects with the right orientation using semantics and common sense. For instance, when opening a refrigerator, we naturally stand in front of it rather than to the side. Recent advances suggest that Vision-Language Models (VLMs) can provide robots with similar common sense. Therefore, we develop a VLM-driven method called Navigation-to-Gaze (Navi2Gaze) for efficient navigation and object gazing based on task descriptions. This method uses the VLM to score and select the best pose from numerous candidates automatically. In evaluations on multiple photorealistic simulation benchmarks, Navi2Gaze significantly outperforms existing approaches and precisely determines the optimal orientation relative to target objects.

Submitted 12 July, 2024; originally announced July 2024.
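
The score-and-select step described in the Navi2Gaze abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the ring-shaped candidate sampling, the function names, and especially `front_alignment_score`, a purely geometric stand-in that takes the place of the VLM scorer. The abstract does not say how candidates are generated or how the VLM is queried.

```python
import math

def candidate_poses(target_xy, radius, n):
    """Hypothetical sampler: n poses on a ring around the target, each facing it."""
    tx, ty = target_xy
    poses = []
    for k in range(n):
        angle = 2 * math.pi * k / n
        x = tx + radius * math.cos(angle)
        y = ty + radius * math.sin(angle)
        yaw = math.atan2(ty - y, tx - x)  # heading that points at the target
        poses.append((x, y, yaw))
    return poses

def front_alignment_score(target_xy, front_dir):
    """Stand-in for a VLM score: highest when the pose lies in front of the object.

    front_dir is an assumed unit vector giving the object's front direction.
    """
    tx, ty = target_xy
    fx, fy = front_dir

    def score(pose):
        x, y, _ = pose
        dx, dy = x - tx, y - ty
        norm = math.hypot(dx, dy) or 1.0
        # Cosine between the object's front direction and the direction to the pose.
        return (dx * fx + dy * fy) / norm

    return score

def select_best_pose(poses, score_fn):
    """Return the candidate with the highest score."""
    return max(poses, key=score_fn)

# Example: an object at the origin whose front faces +x.
poses = candidate_poses((0.0, 0.0), radius=1.0, n=8)
best = select_best_pose(poses, front_alignment_score((0.0, 0.0), (1.0, 0.0)))
```

Because `select_best_pose` accepts any callable, the geometric scorer could be swapped for a model-backed one without changing the selection logic.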

arXiv:2407.09027 [pdf, other]

Exploring the role of criticality in the quantum Otto cycle fueled by the anisotropic quantum Rabi-Stark model

Authors: He-Guang Xu, Jiasen Jin, Norton G. de Almeida, G. D. de Moraes Neto

Abstract: Quantum heat machines, encompassing heat engines, refrigerators, heaters, and accelerators, represent the forefront of quantum thermodynamics, offering a novel paradigm for converting heat energy into useful mechanical work. Leveraging quantum mechanical principles, these machines promise superior efficiency and performance compared to classical counterparts, with potential applications in renewable energy and quantum computing. This paper investigates a quantum Otto engine operating in both ideal and finite-time scenarios, employing a two-level system interacting with a harmonic oscillator within the framework of the anisotropic quantum Rabi-Stark model (AQRSM) as the working medium. This model is notable for exhibiting both first-order and continuous quantum phase transitions. By focusing on quantum heat engines, our study reveals that these phase transitions critically modulate the efficiency and power of AQRSM-based engines, outperforming quantum engines fueled by a working medium with a harmonic spectrum. Additionally, we explore the impacts of quantum friction and conduct limit cycle analysis in finite-time operations, providing insights into optimizing quantum heat engines for practical implementation.

Submitted 12 July, 2024; originally announced July 2024.

arXiv:2407.08706 [pdf, other]

HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models

Authors: Runhui Huang, Xinpeng Ding, Chunwei Wang, Jianhua Han, Yulong Liu, Hengshuang Zhao, Hang Xu, Lu Hou, Wei Zhang, Xiaodan Liang

Abstract: High-resolution inputs enable Large Vision-Language Models (LVLMs) to discern finer visual details, enhancing their comprehension capabilities. To reduce the training and computation costs caused by high-resolution input, one promising direction is to use sliding windows to slice the input into uniform patches, each matching the input size of the well-trained vision encoder. Although efficient, this slicing strategy leads to the fragmentation of the original input, i.e., the continuity of contextual information and spatial geometry is lost across patches, adversely affecting performance in cross-patch context perception and position-specific tasks. To overcome these shortcomings, we introduce HiRes-LLaVA, a novel framework designed to efficiently process any size of high-resolution input without altering the original contextual and geometric information. HiRes-LLaVA comprises two innovative components: (i) a SliceRestore adapter that reconstructs sliced patches into their original form, efficiently extracting both global and local features via down-up-sampling and convolution layers, and (ii) a Self-Mining Sampler that compresses the vision tokens based on themselves, preserving the original context and positional information while reducing training overhead. To assess the ability to handle context fragmentation, we construct a new benchmark, EntityGrid-QA, consisting of edge-related and position-related tasks. Our comprehensive experiments demonstrate the superiority of HiRes-LLaVA on both existing public benchmarks and on EntityGrid-QA, particularly on document-oriented tasks, establishing new standards for handling high-resolution inputs.
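
The sliding-window slicing that the HiRes-LLaVA abstract describes can be sketched as follows, assuming non-overlapping windows, zero padding at the right and bottom edges, and an image stored as a NumPy array. The function names `slice_into_patches` and `stitch_patches` are hypothetical. `stitch_patches` is plain pixel-level reassembly meant only to show that the grid layout is what ties patches back to their positions; it is not the paper's SliceRestore adapter, which the abstract says works through down-up-sampling and convolution layers.

```python
import numpy as np

def slice_into_patches(image, patch):
    """Zero-pad to a multiple of `patch`, then cut patch x patch windows (row-major)."""
    h, w = image.shape[:2]
    pad_h = (-h) % patch
    pad_w = (-w) % patch
    # Pad only the two spatial axes; leave any channel axis untouched.
    pad = [(0, pad_h), (0, pad_w)] + [(0, 0)] * (image.ndim - 2)
    padded = np.pad(image, pad)
    rows = padded.shape[0] // patch
    cols = padded.shape[1] // patch
    patches = [
        padded[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
        for r in range(rows)
        for c in range(cols)
    ]
    return patches, (rows, cols)

def stitch_patches(patches, grid, original_shape):
    """Reassemble row-major patches and crop away the padding."""
    rows, cols = grid
    strips = [
        np.concatenate(patches[r * cols:(r + 1) * cols], axis=1)
        for r in range(rows)
    ]
    full = np.concatenate(strips, axis=0)
    return full[:original_shape[0], :original_shape[1]]

# Example: a 5 x 7 RGB-like array cut into 4 x 4 windows.
image = np.arange(5 * 7 * 3).reshape(5, 7, 3)
patches, grid = slice_into_patches(image, 4)
restored = stitch_patches(patches, grid, image.shape)
```

Each patch on its own carries no record of where it sat in the image, so an object that straddles a window boundary is split between patches. That loss of cross-patch continuity is the fragmentation the abstract refers to.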