PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer

Authors: Tongkun Guan, Chengyu Lin, Wei Shen, Xiaokang Yang

Abstract: Handwritten Mathematical Expression Recognition (HMER) has wide applications in human-machine interaction scenarios, such as digitized education and automated offices. Recently, sequence-based models with encoder-decoder architectures have been commonly adopted to address this task by directly predicting LaTeX sequences of expression images. However, these methods only implicitly learn the syntax rules provided by LaTeX, which may fail to describe the position and hierarchical relationship between symbols due to complex structural relations and diverse handwriting styles. To overcome this challenge, we propose a position forest transformer (PosFormer) for HMER, which jointly optimizes two tasks: expression recognition and position recognition, to explicitly enable position-aware symbol feature representation learning. Specifically, we first design a position forest that models the mathematical expression as a forest structure and parses the relative position relationships between symbols. Without requiring extra annotations, each symbol is assigned a position identifier in the forest to denote its relative spatial position. Second, we propose an implicit attention correction module to accurately capture attention for HMER in the sequence-based decoder architecture. Extensive experiments validate the superiority of PosFormer, which consistently outperforms the state-of-the-art methods with gains of 2.03%/1.22%/2.00%, 1.83%, and 4.62% on the single-line CROHME 2014/2016/2019, multi-line M2E, and complex MNE datasets, respectively, with no additional latency or computational cost. Code is available at https://github.com/SJTU-DeepVisionLab/PosFormer.

Submitted 10 July, 2024; originally announced July 2024.

Comments: Accepted by ECCV2024

arXiv:2406.18054 [pdf, other]

Leveraging Pre-trained Models for FF-to-FFPE Histopathological Image Translation

Authors: Qilai Zhang, Jiawen Li, Peiran Liao, Jiali Hu, Tian Guan, Anjia Han, Yonghong He

Abstract: The two primary types of Hematoxylin and Eosin (H&E) slides in histopathology are Formalin-Fixed Paraffin-Embedded (FFPE) and Fresh Frozen (FF). FFPE slides offer high-quality histopathological images but require a labor-intensive acquisition process. In contrast, FF slides can be prepared quickly, but the image quality is relatively poor. Our task is to translate FF images into FFPE style, thereby improving the image quality for diagnostic purposes. In this paper, we propose Diffusion-FFPE, a method for FF-to-FFPE histopathological image translation using a pre-trained diffusion model. Specifically, we employ a one-step diffusion model as the generator and fine-tune it with LoRA adapters using adversarial learning objectives. To ensure that the model effectively captures both global structural information and local details, we propose a multi-scale feature fusion (MFF) module. This module utilizes two VAE encoders to extract features of varying image sizes and performs feature fusion before feeding them into the UNet. Furthermore, we utilize a pre-trained vision-language model for histopathology as the backbone for the discriminator to further improve performance. We conducted FF-to-FFPE translation experiments on the TCGA-NSCLC datasets, and our method achieved better performance compared to other methods. The code and models are released at https://github.com/QilaiZhang/Diffusion-FFPE.

Submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.10900 [pdf, other]

AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models

Authors: Xiyang Wu, Tianrui Guan, Dianqi Li, Shuaiyi Huang, Xiaoyu Liu, Xijun Wang, Ruiqi Xian, Abhinav Shrivastava, Furong Huang, Jordan Lee Boyd-Graber, Tianyi Zhou, Dinesh Manocha

Abstract: Large vision-language models (LVLMs) hallucinate: certain context cues in an image may trigger the language module's overconfident and incorrect reasoning on abnormal or hypothetical objects. Though a few benchmarks have been developed to investigate LVLM hallucinations, they mainly rely on hand-crafted corner cases whose failure patterns may hardly generalize, and finetuning on them could undermine their validity. These limitations motivate us to develop the first automatic benchmark generation approach, AUTOHALLUSION, that harnesses a few principal strategies to create diverse hallucination examples. It probes the language modules in LVLMs for context cues and uses them to synthesize images by: (1) adding objects abnormal to the context cues; (2) for two co-occurring objects, keeping one and excluding the other; or (3) removing objects closely tied to the context cues. It then generates image-based questions whose ground-truth answers contradict the language module's prior. A model has to overcome contextual biases and distractions to reach correct answers, while incorrect or inconsistent answers indicate hallucinations. AUTOHALLUSION enables us to create new benchmarks at minimal cost and thus overcomes the fragility of hand-crafted benchmarks. It also reveals common failure patterns and reasons, providing key insights to detect, avoid, or control hallucinations. Comprehensive evaluations of top-tier LVLMs, e.g., GPT-4V(ision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, show a 97.7% and 98.7% success rate of hallucination induction on synthetic and real-world datasets of AUTOHALLUSION, paving the way for a long battle against hallucinations.

Submitted 16 June, 2024; originally announced June 2024.

arXiv:2406.00672 [pdf, other]

Task-oriented Embedding Counts: Heuristic Clustering-driven Feature Fine-tuning for Whole Slide Image Classification

Authors: Xuenian Wang, Shanshan Shi, Renao Yan, Qiehe Sun, Lianghui Zhu, Tian Guan, Yonghong He

Abstract: In the field of whole slide image (WSI) classification, multiple instance learning (MIL) serves as a promising approach, commonly decoupled into feature extraction and aggregation. In this paradigm, our observation reveals that discriminative embeddings are crucial for aggregation to the final prediction. Among all feature updating strategies, task-oriented ones can capture characteristics specifically for certain tasks. However, they can be prone to overfitting and contaminated by samples assigned with noisy labels. To address this issue, we propose a heuristic clustering-driven feature fine-tuning method (HC-FT) to enhance the performance of multiple instance learning by providing purified positive and hard negative samples. Our method first employs a well-trained MIL model to evaluate the confidence of patches. Then, patches with high confidence are marked as positive samples, while the remaining patches are used to identify crucial negative samples. After two rounds of heuristic clustering and selection, purified positive and hard negative samples are obtained to facilitate feature fine-tuning. The proposed method is evaluated on both CAMELYON16 and BRACS datasets, achieving an AUC of 97.13% and 85.85%, respectively, consistently outperforming all compared methods.

Submitted 2 June, 2024; originally announced June 2024.

arXiv:2405.05363 [pdf, other]

LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation

Authors: Tianrui Guan, Yurou Yang, Harry Cheng, Muyuan Lin, Richard Kim, Rajasimman Madhivanan, Arnie Sen, Dinesh Manocha

Abstract: In this paper, we present LOC-ZSON, a novel Language-driven Object-Centric image representation for object navigation tasks within complex scenes. We propose an object-centric image representation and corresponding losses for visual-language model (VLM) fine-tuning, which can handle complex object-level queries. In addition, we design a novel LLM-based augmentation and prompt templates for stability during training and zero-shot inference. We implement our method on the Astro robot and deploy it in both simulated and real-world environments for zero-shot object navigation. We show that our proposed method can achieve an improvement of 1.38-13.38% in terms of text-to-image recall on different benchmark settings for the retrieval task. For object navigation, we show the benefit of our approach in simulation and the real world, with 5% and 16.67% improvement in terms of navigation success rate, respectively.

Submitted 8 May, 2024; originally announced May 2024.

Comments: Accepted to ICRA 2024

arXiv:2404.12777 [pdf, other]

EfficientGS: Streamlining Gaussian Splatting for Large-Scale High-Resolution Scene Representation

Authors: Wenkai Liu, Tao Guan, Bin Zhu, Lili Ju, Zikai Song, Dan Li, Yuesong Wang, Wei Yang

Abstract: In the domain of 3D scene representation, 3D Gaussian Splatting (3DGS) has emerged as a pivotal technology. However, its application to large-scale, high-resolution scenes (exceeding 4k$\times$4k pixels) is hindered by the excessive computational requirements for managing a large number of Gaussians. Addressing this, we introduce 'EfficientGS', an advanced approach that optimizes 3DGS for high-resolution, large-scale scenes. We analyze the densification process in 3DGS and identify areas of Gaussian over-proliferation. We propose a selective strategy, limiting Gaussian increase to key primitives, thereby enhancing the representational efficiency. Additionally, we develop a pruning mechanism to remove redundant Gaussians, those that are merely auxiliary to adjacent ones. For further enhancement, we integrate a sparse order increment for Spherical Harmonics (SH), designed to alleviate storage constraints and reduce training overhead. Our empirical evaluations, conducted on a range of datasets including extensive 4K+ aerial images, demonstrate that 'EfficientGS' not only expedites training and rendering times but also achieves this with a model size approximately tenfold smaller than conventional 3DGS while maintaining high rendering fidelity.

Submitted 19 April, 2024; originally announced April 2024.

arXiv:2404.03187 [pdf, other]

AGL-NET: Aerial-Ground Cross-Modal Global Localization with Varying Scales

Authors: Tianrui Guan, Ruiqi Xian, Xijun Wang, Xiyang Wu, Mohamed Elnoor, Daeun Song, Dinesh Manocha

Abstract: We present AGL-NET, a novel learning-based method for global localization using LiDAR point clouds and satellite maps. AGL-NET tackles two critical challenges: bridging the representation gap between image and point modalities for robust feature matching, and handling inherent scale discrepancies between the global view and the local view. To address these challenges, AGL-NET leverages a unified network architecture with a novel two-stage matching design. The first stage extracts informative neural features directly from raw sensor data and performs initial feature matching. The second stage refines this matching process by extracting informative skeleton features and incorporating a novel scale alignment step to rectify scale variations between LiDAR and map data. Furthermore, a novel scale and skeleton loss function guides the network toward learning scale-invariant feature representations, eliminating the need for pre-processing satellite maps. This significantly improves real-world applicability in scenarios with unknown map scales. To facilitate rigorous performance evaluation, we introduce a meticulously designed dataset within the CARLA simulator specifically tailored for metric localization training and assessment. The code and dataset will be made publicly available.

Submitted 4 April, 2024; originally announced April 2024.

arXiv:2403.13235 [pdf, other]

AMCO: Adaptive Multimodal Coupling of Vision and Proprioception for Quadruped Robot Navigation in Outdoor Environments

Authors: Mohamed Elnoor, Kasun Weerakoon, Adarsh Jagan Sathyamoorthy, Tianrui Guan, Vignesh Rajagopal, Dinesh Manocha

Abstract: We present AMCO, a novel navigation method for quadruped robots that adaptively combines vision-based and proprioception-based perception capabilities. Our approach uses three cost maps (a general knowledge map, a traversability history map, and a current proprioception map), which are derived from a robot's vision and proprioception data, and couples them to obtain a coupled traversability cost map for navigation. The general knowledge map encodes terrains semantically segmented from visual sensing, and represents a terrain's typically expected traversability. The traversability history map encodes the robot's recent proprioceptive measurements on a terrain and its semantic segmentation as a cost map. Further, the robot's present proprioceptive measurement is encoded as a cost map in the current proprioception map. As the general knowledge map and traversability history map rely on semantic segmentation, we evaluate the reliability of the visual sensory data by estimating the brightness and motion blur of input RGB images and accordingly combine the three cost maps to obtain the coupled traversability cost map used for navigation. Leveraging this adaptive coupling, the robot can depend on the most reliable input modality available. Finally, we present a novel planner that selects appropriate gaits and velocities for traversing challenging outdoor environments using the coupled traversability cost map. We demonstrate AMCO's navigation performance in different real-world outdoor environments and observe a 10.8%-34.9% reduction w.r.t. two stability metrics, and up to 50% improvement in terms of success rate compared to current navigation methods.

Submitted 19 March, 2024; originally announced March 2024.

Comments: 8 pages

arXiv:2403.12414 [pdf, other]

Development of low-radon ultra-pure water for the Jiangmen Underground Neutrino Observatory

Authors: T. Y. Guan, Y. P. Zhang, B. Wang, C. Guo, J. C. Liu, Q. Tang, C. G. Yang, C. Li

Abstract: The Jiangmen Underground Neutrino Observatory (JUNO) is a state-of-the-art liquid scintillator-based neutrino physics experiment under construction in South China. To reduce the background from external radioactivities, a water Cherenkov detector composed of 35~kton ultra-pure water and 2,400 20-inch photomultiplier tubes is developed. Even after specialized treatment, ultra-pure water still contains trace levels of radioactive elements that can contribute to the detector background. Among these, $^{222}$Rn is particularly significant. To address this, an online radon removal system based on the JUNO prototype has been developed. By integrating micro-bubble generators to enhance the degasser's radon removal efficiency, the radon concentration in water can be reduced to the 1~mBq/m$^{3}$ level, meeting the stringent requirements of JUNO. Additionally, a highly sensitive online radon concentration measurement system capable of detecting concentrations of $\sim$1~mBq/m$^3$ has been developed to monitor the radon concentration in water. In this paper, the details regarding both systems will be presented.

Submitted 18 March, 2024; originally announced March 2024.

Comments: 20 pages, 13 figures

arXiv:2403.11193 [pdf, other]

Neural Markov Random Field for Stereo Matching

Authors: Tongfan Guan, Chen Wang, Yun-Hui Liu

Abstract: Stereo matching is a core task for many computer vision and robotics applications. Despite their dominance in traditional stereo methods, the hand-crafted Markov Random Field (MRF) models lack sufficient modeling accuracy compared to end-to-end deep models. While deep learning representations have greatly improved the unary terms of the MRF models, the overall accuracy is still severely limited by the hand-crafted pairwise terms and message passing. To address these issues, we propose a neural MRF model, where both potential functions and message passing are designed using data-driven neural networks. Our fully data-driven model is built on the foundation of variational inference theory, to prevent convergence issues and retain stereo MRF's graph inductive bias. To make the inference tractable and scale well to high-resolution images, we also propose a Disparity Proposal Network (DPN) to adaptively prune the search space of disparity. The proposed approach ranks $1^{st}$ on both KITTI 2012 and 2015 leaderboards among all published methods while running faster than 100 ms. This approach significantly outperforms prior global methods, e.g., lowering D1 metric by more than 50% on KITTI 2015. In addition, our method exhibits strong cross-domain generalization and can recover sharp edges. The code is available at https://github.com/aeolusguan/NMRF.

Submitted 21 March, 2024; v1 submitted 17 March, 2024; originally announced March 2024.

Comments: Accepted to CVPR 2024

arXiv:2403.10858 [pdf, other]

RetMIL: Retentive Multiple Instance Learning for Histopathological Whole Slide Image Classification

Authors: Hongbo Chu, Qiehe Sun, Jiawen Li, Yuxuan Chen, Lizhong Zhang, Tian Guan, Anjia Han, Yonghong He

Abstract: Histopathological whole slide image (WSI) analysis with deep learning has become a research focus in computational pathology. The current paradigm is mainly based on multiple instance learning (MIL), in which approaches with Transformer as the backbone are well discussed. These methods convert WSI tasks into sequence tasks by representing patches as tokens in the WSI sequence. However, the feature complexity brought by high heterogeneity and the ultra-long sequences brought by gigapixel size make Transformer-based MIL suffer from the challenges of high memory consumption, slow inference speed, and lack of performance. To this end, we propose a retentive MIL method called RetMIL, which processes WSI sequences through a hierarchical feature propagation structure. At the local level, the WSI sequence is divided into multiple subsequences. Tokens of each subsequence are updated through a parallel linear retention mechanism and aggregated utilizing an attention layer. At the global level, subsequences are fused into a global sequence, then updated through a serial retention mechanism, and finally the slide-level representation is obtained through a global attention pooling. We conduct experiments on the public CAMELYON and BRACS datasets and a public-internal LUNG dataset, confirming that RetMIL not only achieves state-of-the-art performance but also significantly reduces computational overhead. Our code will be made accessible shortly.

Submitted 16 March, 2024; originally announced March 2024.

Comments: under review

arXiv:2403.09606 [pdf, ps, other]

Large Language Models and Causal Inference in Collaboration: A Comprehensive Survey

Authors: Xiaoyu Liu, Paiheng Xu, Junda Wu, Jiaxin Yuan, Yifan Yang, Yuhang Zhou, Fuxiao Liu, Tianrui Guan, Haoliang Wang, Tong Yu, Julian McAuley, Wei Ai, Furong Huang

Abstract: Causal inference has shown potential in enhancing the predictive accuracy, fairness, robustness, and explainability of Natural Language Processing (NLP) models by capturing causal relationships among variables. The emergence of generative Large Language Models (LLMs) has significantly impacted various NLP domains, particularly through their advanced reasoning capabilities. This survey focuses on evaluating and improving LLMs from a causal view in the following areas: understanding and improving the LLMs' reasoning capacity, addressing fairness and safety issues in LLMs, complementing LLMs with explanations, and handling multimodality. Meanwhile, LLMs' strong reasoning capacities can in turn contribute to the field of causal inference by aiding causal relationship discovery and causal effect estimations. This review explores the interplay between causal inference frameworks and LLMs from both perspectives, emphasizing their collective potential to further the development of more advanced and equitable artificial intelligence systems.

Submitted 14 March, 2024; originally announced March 2024.

arXiv:2403.07719 [pdf, other]

Dynamic Graph Representation with Knowledge-aware Attention for Histopathology Whole Slide Image Analysis

Authors: Jiawen Li, Yuxuan Chen, Hongbo Chu, Qiehe Sun, Tian Guan, Anjia Han, Yonghong He

Abstract: Histopathological whole slide image (WSI) classification has become a foundational task in medical microscopic image processing. Prevailing approaches involve learning WSIs as instance-bag representations, emphasizing significant instances but struggling to capture the interactions between instances. Additionally, conventional graph representation methods utilize explicit spatial positions to construct topological structures but restrict the flexible interaction capabilities between instances at arbitrary locations, particularly when spatially distant. In response, we propose a novel dynamic graph representation algorithm that conceptualizes WSIs as a form of the knowledge graph structure. Specifically, we dynamically construct neighbors and directed edge embeddings based on the head and tail relationships between instances. Then, we devise a knowledge-aware attention mechanism that can update the head node features by learning the joint attention score of each neighbor and edge. Finally, we obtain a graph-level embedding through the global pooling process of the updated head, serving as an implicit representation for the WSI classification. Our end-to-end graph representation learning approach has outperformed the state-of-the-art WSI analysis methods on three TCGA benchmark datasets and in-house test sets. Our code is available at https://github.com/WonderLandxD/WiKG.

Submitted 12 March, 2024; originally announced March 2024.

Comments: Accepted by CVPR 2024

arXiv:2402.13614 [pdf, other]

Developing a $μ$Bq/m$^{3}$ level $^{226}$Ra concentration in water measurement system for the Jiangmen Underground Neutrino Observatory

Authors: C. Li, B. Wang, Y. Liu, C. Guo, Y. P. Zhang, J. C. Liu, Q. Tang, T. Y. Guan, C. G. Yang

Abstract: The Jiangmen Underground Neutrino Observatory (JUNO), a 20~kton multi-purpose low background Liquid Scintillator (LS) detector, was proposed primarily to determine the neutrino mass ordering. To suppress the radioactivity from the surrounding rocks and tag cosmic muons, the JUNO central detector is submerged in a Water Cherenkov Detector (WCD). In addition to being used in the WCD, ultrapure water is used in LS filling, for which the $^{226}$Ra concentration in water needs to be less than 50~$μ$Bq/m$^3$. To precisely measure the $^{226}$Ra concentration in water, a 6.0~$μ$Bq/m$^3$ $^{226}$Ra concentration in water measurement system has been developed. In this paper, the details of the measurement system as well as the $^{226}$Ra concentration measurement result in regular EWII ultrapure water will be presented.

Submitted 21 February, 2024; originally announced February 2024.

Comments: 16 pages, 7 figures

arXiv:2402.10340 [pdf, other]

Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics

Authors: Xiyang Wu, Souradip Chakraborty, Ruiqi Xian, Jing Liang, Tianrui Guan, Fuxiao Liu, Brian M. Sadler, Dinesh Manocha, Amrit Singh Bedi

Abstract: In this paper, we highlight the critical issues of robustness and safety associated with integrating large language models (LLMs) and vision-language models (VLMs) into robotics applications. Recent works focus on using LLMs and VLMs to improve the performance of robotics tasks, such as manipulation and navigation. Despite these improvements, analyzing the safety of such systems remains underexplored yet extremely critical. LLMs and VLMs are highly susceptible to adversarial inputs, prompting a significant inquiry into the safety of robotic systems. This concern is important because robots operate in the physical world where erroneous actions can result in severe consequences. This paper explores this issue thoroughly, presenting a mathematical formulation of potential attacks on LLM/VLM-based robotic systems and offering experimental evidence of the safety challenges. Our empirical findings highlight a significant vulnerability: simple modifications to the input can drastically reduce system effectiveness. Specifically, our results demonstrate an average performance deterioration of 19.4% under minor input prompt modifications and a more alarming 29.1% under slight perceptual changes. These findings underscore the urgent need for robust countermeasures to ensure the safe and reliable deployment of advanced LLM/VLM-based robotic systems.

Submitted 16 June, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

arXiv:2312.05490 [pdf, other]

Shapley Values-enabled Progressive Pseudo Bag Augmentation for Whole Slide Image Classification

Authors: Renao Yan, Qiehe Sun, Cheng Jin, Yiqing Liu, Yonghong He, Tian Guan, Hao Chen

Abstract: In computational pathology, whole slide image (WSI) classification presents a formidable challenge due to its gigapixel resolution and limited fine-grained annotations. Multiple instance learning (MIL) offers a weakly supervised solution, yet refining instance-level information from bag-level labels remains complex. While most of the conventional MIL methods use attention scores to estimate instance importance scores (IIS) which contribute to the prediction of the slide labels, these often lead to skewed attention distributions and inaccuracies in identifying crucial instances. To address these issues, we propose a new approach inspired by cooperative game theory: employing Shapley values to assess each instance's contribution, thereby improving IIS estimation. The computation of the Shapley value is then accelerated using attention, while retaining the enhanced instance identification and prioritization. We further introduce a framework for the progressive assignment of pseudo bags based on estimated IIS, encouraging more balanced attention distributions in MIL models. Our extensive experiments on CAMELYON-16, BRACS, and TCGA-LUNG datasets show our method's superiority over existing state-of-the-art approaches, offering enhanced interpretability and class-wise insights. We will release the code upon acceptance.

Submitted 15 May, 2024; v1 submitted 9 December, 2023; originally announced December 2023.

Comments: submitted to IEEE TRANSACTIONS ON MEDICAL IMAGING

arXiv:2312.05286 [pdf, other]

Bridging Synthetic and Real Worlds for Pre-training Scene Text Detectors

Authors: Tongkun Guan, Wei Shen, Xue Yang, Xuehui Wang, Xiaokang Yang

Abstract: Existing scene text detection methods typically rely on extensive real data for training. Due to the lack of annotated real images, recent works have attempted to exploit large-scale labeled synthetic data (LSD) for pre-training text detectors. However, a synth-to-real domain gap emerges, further limiting the performance of text detectors. Differently, in this work, we propose FreeReal, a real-domain-aligned pre-training paradigm that enables the complementary strengths of both LSD and unlabeled real data (URD). Specifically, to bridge real and synthetic worlds for pre-training, a glyph-based mixing mechanism (GlyphMix) is tailored for text images. GlyphMix delineates the character structures of synthetic images and embeds them as graffiti-like units onto real images. Without introducing real domain drift, GlyphMix freely yields real-world images with annotations derived from synthetic labels. Furthermore, when given free fine-grained synthetic labels, GlyphMix can effectively bridge the linguistic domain gap stemming from English-dominated LSD to URD in various languages. Without bells and whistles, FreeReal achieves average gains of 1.97%, 3.90%, 3.85%, and 4.56% in improving the performance of FCENet, PSENet, PANet, and DBNet methods, respectively, consistently outperforming previous pre-training methods by a substantial margin across four public datasets. Code is available at https://github.com/SJTU-DeepVisionLab/FreeReal.

Submitted 10 July, 2024; v1 submitted 8 December, 2023; originally announced December 2023.

Comments: Accepted by ECCV2024

arXiv:2311.11499 [pdf]

Flexible generation of structured terahertz fields via programmable exchange-biased spintronic emitters

Authors: Shunjia Wang, Wentao Qin, Tongyang Guan, Jingyu Liu, Qingnan Cai, Sheng Zhang, Lei Zhou, Yan Zhang, Yizheng Wu, Zhensheng Tao

Abstract: Structured light, particularly in the terahertz frequency range, holds considerable potential for a diverse range of applications. However, the generation and control of structured terahertz radiation pose major challenges. In this work, we demonstrate a novel programmable spintronic emitter that can flexibly generate a variety of structured terahertz waves. This is achieved through the precise and high-resolution programming of the magnetization pattern on the emitter surface, utilizing laser-assisted local field cooling of an exchange-biased ferromagnetic heterostructure. Moreover, we outline a generic design strategy for realizing specific complex structured terahertz fields in the far field. Our device successfully demonstrates the generation of terahertz waves with diverse structured polarization states, including spatially separated circular polarizations, azimuthal or radial polarization states, and a full Poincare beam. This innovation opens a new avenue for designing and generating structured terahertz radiations, with potential applications in terahertz microscopy, communication, quantum information, and light-matter interactions.

Submitted 19 November, 2023; originally announced November 2023.

arXiv:2310.14566 [pdf, other]

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

Authors: Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, Tianyi Zhou

Abstract: We introduce HallusionBench, a comprehensive benchmark designed for the evaluation of image-context reasoning. This benchmark presents significant challenges to advanced large visual-language models (LVLMs), such as GPT-4V(Vision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, by emphasizing nuanced understanding and interpretation of visual data. The benchmark comprises 346 images paired with 1129 questions, all meticulously crafted by human experts. We introduce a novel structure for these visual questions designed to establish control groups. This structure enables us to conduct a quantitative analysis of the models' response tendencies, logical consistency, and various failure modes. In our evaluation on HallusionBench, we benchmarked 15 different models, highlighting a 31.42% question-pair accuracy achieved by the state-of-the-art GPT-4V. Notably, all other evaluated models achieve accuracy below 16%. Moreover, our analysis not only highlights the observed failure modes, including language hallucination and visual illusion, but also deepens an understanding of these pitfalls. Our comprehensive case studies within HallusionBench shed light on the challenges of hallucination and illusion in LVLMs. Based on these insights, we suggest potential pathways for their future improvement. The benchmark and codebase can be accessed at https://github.com/tianyi-lab/HallusionBench.

Submitted 25 March, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

Comments: Accepted to CVPR 2024

arXiv:2310.14524 [pdf, other]

Study on the radon adsorption capability of low-background activated carbon

Authors: Chi Li, Yongpeng Zhang, Lidan Lv, Jinchang Liu, Cong Guo, Changgen Yang, Tingyu Guan, Yu Liu, Yu Lei, Quan Tang

Abstract: Radon is a significant background source in rare event detection experiments. Activated Carbon (AC) adsorption is widely used for effective radon removal. The selection of AC considers its adsorption capacity and radioactive background. In this study, using self-developed devices, we screened and identified a new kind of low-background AC from Qingdao Inaf Technology Company that has very high radon adsorption capacity. By adjusting the average pore size to 2.3 nm, this AC demonstrates a radon adsorption capacity of 2.6 or 4.7 times higher than Saratech or Carboact activated carbon under the same conditions.

Submitted 22 October, 2023; originally announced October 2023.

Comments: 21 pages, 7 figures

arXiv:2310.07191 [pdf, other]

$pκ$-Curves: Interpolatory curves with curvature approximating a parabola

Authors: Zhihao Wang, Juan Cao, Tuan Guan, Zhonggui Chen, Yongjie Jessica Zhang

Abstract: This paper introduces a novel class of fair and interpolatory curves called $pκ$-curves. These curves are comprised of smoothly stitched Bézier curve segments, where the curvature distribution of each segment is made to closely resemble a parabola, resulting in an aesthetically pleasing shape. Moreover, each segment passes through an interpolated point at a parameter where the parabola has an extremum, encouraging the alignment of interpolated points with curvature extrema. To achieve these properties, we tailor an energy function that guides the optimization process to obtain the desired curve characteristics. Additionally, we develop an efficient algorithm and an initialization method, enabling interactive modeling of the $pκ$-curves without the need for global optimization. We provide various examples and comparisons with existing state-of-the-art methods to demonstrate the curve modeling capabilities and visually pleasing appearance of $pκ$-curves.

Submitted 11 October, 2023; originally announced October 2023.

arXiv:2307.06344 [pdf, other]

The Whole Pathological Slide Classification via Weakly Supervised Learning

Authors: Qiehe Sun, Jiawen Li, Jin Xu, Junru Cheng, Tian Guan, Yonghong He

Abstract: Due to its superior efficiency in utilizing annotations and addressing gigapixel-sized images, multiple instance learning (MIL) has shown great promise as a framework for whole slide image (WSI) classification in digital pathology diagnosis. However, existing methods tend to focus on advanced aggregators with different structures, often overlooking the intrinsic features of H&E pathological slides. To address this limitation, we introduced two pathological priors: nuclear heterogeneity of diseased cells and spatial correlation of pathological tiles. Leveraging the former, we proposed a data augmentation method that utilizes stain separation during extractor training via a contrastive learning strategy to obtain instance-level representations. We then described the spatial relationships between the tiles using an adjacency matrix. By integrating these two views, we designed a multi-instance framework for analyzing H&E-stained tissue images based on pathological inductive bias, encompassing feature extraction, filtering, and aggregation. Extensive experiments on the Camelyon16 breast dataset and TCGA-NSCLC Lung dataset demonstrate that our proposed framework can effectively handle tasks related to cancer detection and differentiation of subtypes, outperforming state-of-the-art medical image classification methods based on MIL. The code will be released later.

Submitted 12 July, 2023; originally announced July 2023.

arXiv:2306.10003 [pdf, other]

C2F2NeUS: Cascade Cost Frustum Fusion for High Fidelity and Generalizable Neural Surface Reconstruction

Authors: Luoyuan Xu, Tao Guan, Yuesong Wang, Wenkai Liu, Zhaojie Zeng, Junle Wang, Wei Yang

Abstract: There is an emerging effort to combine the two popular 3D frameworks using Multi-View Stereo (MVS) and Neural Implicit Surfaces (NIS) with a specific focus on the few-shot / sparse view setting. In this paper, we introduce a novel integration scheme that combines the multi-view stereo with neural signed distance function representations, which potentially overcomes the limitations of both methods. MVS uses per-view depth estimation and cross-view fusion to generate accurate surfaces, while NIS relies on a common coordinate volume. Based on this strategy, we propose to construct per-view cost frustum for finer geometry estimation, and then fuse cross-view frustums and estimate the implicit signed distance functions to tackle artifacts that are due to noise and holes in the produced surface reconstruction. We further apply a cascade frustum fusion strategy to effectively capture global-local information and structural consistency. Finally, we apply cascade sampling and a pseudo-geometric loss to foster stronger integration between the two architectures. Extensive experiments demonstrate that our method reconstructs robust surfaces and outperforms existing state-of-the-art methods.

Submitted 14 August, 2023; v1 submitted 16 June, 2023; originally announced June 2023.

Comments: Accepted by ICCV2023

arXiv:2306.06236 [pdf, other]

iPLAN: Intent-Aware Planning in Heterogeneous Traffic via Distributed Multi-Agent Reinforcement Learning

Authors: Xiyang Wu, Rohan Chandra, Tianrui Guan, Amrit Singh Bedi, Dinesh Manocha

Abstract: Navigating safely and efficiently in dense and heterogeneous traffic scenarios is challenging for autonomous vehicles (AVs) due to their inability to infer the behaviors or intentions of nearby drivers. In this work, we introduce a distributed multi-agent reinforcement learning (MARL) algorithm that can predict trajectories and intents in dense and heterogeneous traffic scenarios. Our approach for intent-aware planning, iPLAN, allows agents to infer nearby drivers' intents solely from their local observations. We model two distinct incentives for agents' strategies: Behavioral Incentive for high-level decision-making based on their driving behavior or personality and Instant Incentive for motion planning for collision avoidance based on the current traffic state. Our approach enables agents to infer their opponents' behavior incentives and integrate this inferred information into their decision-making and motion-planning processes. We perform experiments on two simulation environments, Non-Cooperative Navigation and Heterogeneous Highway. In Heterogeneous Highway, results show that, compared with centralized training decentralized execution (CTDE) MARL baselines such as QMIX and MAPPO, our method yields a 4.3% and 38.4% higher episodic reward in mild and chaotic traffic, with 48.1% higher success rate and 80.6% longer survival time in chaotic traffic. We also compare with a decentralized training decentralized execution (DTDE) baseline IPPO and demonstrate a higher episodic reward of 12.7% and 6.3% in mild traffic and chaotic traffic, 25.3% higher success rate, and 13.7% longer survival time.

Submitted 21 August, 2023; v1 submitted 9 June, 2023; originally announced June 2023.

arXiv:2305.12437 [pdf, other]

PLAR: Prompt Learning for Action Recognition

Authors: Xijun Wang, Ruiqi Xian, Tianrui Guan, Dinesh Manocha

Abstract: We present a new general learning approach, Prompt Learning for Action Recognition (PLAR), which leverages the strengths of prompt learning to guide the learning process. Our approach is designed to predict the action label by helping the models focus on the descriptions or instructions associated with actions in the input videos. Our formulation uses various prompts, including learnable prompts, auxiliary visual information, and large vision models to improve the recognition performance. In particular, we design a learnable prompt method that learns to dynamically generate prompts from a pool of prompt experts under different inputs. By sharing the same objective with the task, our proposed PLAR can optimize prompts that guide the model's predictions while explicitly learning input-invariant (prompt experts pool) and input-specific (data-dependent) prompt knowledge. We evaluate our approach on datasets consisting of both ground camera videos and aerial videos, and scenes with single-agent and multi-agent actions. In practice, we observe a 3.17-10.2% accuracy improvement on the aerial multi-agent dataset Okutama and a 1.0-3.6% improvement on the ground camera single-agent dataset Something Something V2. We plan to release our code on the WWW.

Submitted 14 November, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

arXiv:2304.00692 [pdf]

doi 10.1117/1.AP.5.5.056006

Nonrelativistic and nonmagnetic control of terahertz charge currents via electrical anisotropy in RuO2 and IrO2

Authors: Sheng Zhang, Yongwei Cui, Shunjia Wang, Haoran Chen, Yaxin Liu, Wentao Qin, Tongyang Guan, Chuanshan Tian, Zhe Yuan, Lei Zhou, Yizheng Wu, Zhensheng Tao

Abstract: Precise and ultrafast control over photo-induced charge currents across nanoscale interfaces could lead to important applications in energy harvesting, ultrafast electronics, and coherent terahertz sources. Recent studies have shown that several relativistic mechanisms, including inverse spin-Hall effect, inverse Rashba-Edelstein effect and inverse spin-orbit-torque effect, can convert longitudinally injected spin-polarized currents from magnetic materials to transverse charge currents, thereby harnessing these currents for terahertz generation. However, these mechanisms typically require external magnetic fields and suffer from low spin-polarization rates and low efficiencies of relativistic spin-to-charge conversion. In this work, we present a novel nonrelativistic and nonmagnetic mechanism that directly utilizes the photo-excited high-density charge currents across the interface. We demonstrate that the electrical anisotropy of conductive oxides RuO2 and IrO2 can effectively deflect injected charge currents to the transverse direction, resulting in efficient and broadband terahertz radiation. Importantly, this new mechanism has the potential to offer much higher conversion efficiency compared to previous methods, as conductive materials with large electrical anisotropy are readily available, whereas further increasing the spin-Hall angle of heavy-metal materials would be challenging. Our new findings offer exciting possibilities for directly utilizing these photo-excited high-density currents across metallic interfaces for ultrafast electronics and terahertz spectroscopy.

Submitted 2 April, 2023; originally announced April 2023.

Journal ref: Advanced Photonics (2023)

arXiv:2303.17778 [pdf, other]

CrossLoc3D: Aerial-Ground Cross-Source 3D Place Recognition

Authors: Tianrui Guan, Aswath Muthuselvam, Montana Hoover, Xijun Wang, Jing Liang, Adarsh Jagan Sathyamoorthy, Damon Conover, Dinesh Manocha

Abstract: We present CrossLoc3D, a novel 3D place recognition method that solves a large-scale point matching problem in a cross-source setting. Cross-source point cloud data corresponds to point sets captured by depth sensors with different accuracies or from different distances and perspectives. We address the challenges in terms of developing 3D place recognition methods that account for the representation gap between points captured by different sources. Our method handles cross-source data by utilizing multi-grained features and selecting convolution kernel sizes that correspond to the most prominent features. Inspired by the diffusion models, our method uses a novel iterative refinement process that gradually shifts the embedding spaces from different sources to a single canonical space for better metric learning. In addition, we present CS-Campus3D, the first 3D aerial-ground cross-source dataset consisting of point cloud data from both aerial and ground LiDAR scans. The point clouds in CS-Campus3D have representation gaps and other features like different views, point densities, and noise patterns. We show that our CrossLoc3D algorithm can achieve an improvement of 4.74% - 15.37% in terms of the top 1 average recall on our CS-Campus3D benchmark and achieves performance comparable to state-of-the-art 3D place recognition methods on the Oxford RobotCar. The code and CS-Campus3D benchmark will be available at github.com/rayguan97/crossloc3d.

Submitted 29 September, 2023; v1 submitted 30 March, 2023; originally announced March 2023.

arXiv:2303.14502 [pdf, other]

VERN: Vegetation-aware Robot Navigation in Dense Unstructured Outdoor Environments

Authors: Adarsh Jagan Sathyamoorthy, Kasun Weerakoon, Tianrui Guan, Mason Russell, Damon Conover, Jason Pusey, Dinesh Manocha

Abstract: We propose a novel method for autonomous legged robot navigation in densely vegetated environments with a variety of pliable/traversable and non-pliable/untraversable vegetation. We present a novel few-shot learning classifier that can be trained on a few hundred RGB images to differentiate flora that can be navigated through, from the ones that must be circumvented. Using the vegetation classification and 2D lidar scans, our method constructs a vegetation-aware traversability cost map that accurately represents the pliable and non-pliable obstacles with lower, and higher traversability costs, respectively. Our cost map construction accounts for misclassifications of the vegetation and further lowers the risk of collisions, freezing and entrapment in vegetation during navigation. Furthermore, we propose holonomic recovery behaviors for the robot for scenarios where it freezes, or gets physically entrapped in dense, pliable vegetation. We demonstrate our method on a Boston Dynamics Spot robot in real-world unstructured environments with sparse and dense tall grass, bushes, trees, etc. We observe an increase of 25-90% in success rates, 10-90% decrease in freezing rate, and up to 65% decrease in the false positive rate compared to existing methods.

Submitted 25 March, 2023; originally announced March 2023.

Comments: 8 Pages, 5 figures

arXiv:2303.01589 [pdf, other]

doi 10.1109/ICRA48891.2023.10160564

AZTR: Aerial Video Action Recognition with Auto Zoom and Temporal Reasoning

Authors: Xijun Wang, Ruiqi Xian, Tianrui Guan, Celso M. de Melo, Stephen M. Nogar, Aniket Bera, Dinesh Manocha

Abstract: We propose a novel approach for aerial video action recognition. Our method is designed for videos captured using UAVs and can run on edge or mobile devices. We present a learning-based approach that uses customized auto zoom to automatically identify the human target and scale it appropriately. This makes it easier to extract the key features and reduces the computational overhead. We also present an efficient temporal reasoning algorithm to capture the action information along the spatial and temporal domains within a controllable computational cost. Our approach has been implemented and evaluated both on the desktop with high-end GPUs and on the low power Robotics RB5 Platform for robots and drones. In practice, we achieve 6.1-7.4% improvement over SOTA in Top-1 accuracy on the RoCoG-v2 dataset, 8.3-10.4% improvement on the UAV-Human dataset and 3.2% improvement on the Drone Action dataset.

Submitted 2 March, 2023; originally announced March 2023.

Comments: Accepted for publication at ICRA 2023

arXiv:2301.06982 [pdf, other]

doi 10.1088/1748-0221/18/04/P04006

Research of radon diffusion behavior in liquid scintillator

Authors: Z. F. Xu, C. Guo, J. C. Liu, Y. P. Zhang, P. Zhang, C. G. Yang, Q. Tang, Y. Liu, C. Li, T. Y. Guan

Abstract: The background caused by radon and its daughters is an important background in the low background liquid scintillator (LS) detectors. The study of the diffusion behaviour of radon in the LS contributes to the analysis of the related background caused by radon. Methodologies and devices for measuring the diffusion coefficient and solubility of radon in materials are developed and described. The radon diffusion coefficient of the LS was measured for the first time and in addition the solubility coefficient was also obtained. In addition, the radon diffusion coefficient of the polyolefine film which is consistent with data in the literature was measured to verify the reliability of the diffusion device.

Submitted 28 January, 2023; v1 submitted 17 January, 2023; originally announced January 2023.

Comments: 9 pages, 7 figures

arXiv:2301.00959 [pdf, ps, other]

System upgrade for $μ$Bq/m$^3$ level $^{222}$Rn concentration measurement

Authors: Y. Liu, Y. P. Zhang, J. C. Liu, C. Guo, C. G. Yang, P. Zhang, Q. Tang, Z. F. Xu, C. Li, T. Y. Guan, S. B. Wang

Abstract: The Jiangmen Underground Neutrino Observatory (JUNO), a 20 kton multipurpose underground liquid scintillator detector, was proposed for the determination of the neutrino mass hierarchy as its primary physics goal. The central detector will be submerged in a water Cherenkov detector to lower the background from the environment and cosmic muons. Radon is one of the primary background sources. Nitrogen will be used in several sub-systems, and a highly sensitive radon detector has to be developed to measure its radon concentration. A system has been developed based on $^{222}$Rn enrichment of activated carbon and $^{222}$Rn detection based on the electrostatic collection. This paper presents the details of a $μ$Bq/m$^3$ level $^{222}$Rn concentration measurement system and gives detailed information about how the adsorption coefficient was measured and how the temperature, flow rate, and $^{222}$Rn concentration affect the adsorption coefficient.

Submitted 24 September, 2023; v1 submitted 3 January, 2023; originally announced January 2023.

Comments: 18 pages, 9 figures

arXiv:2211.00288 [pdf, other]

Self-supervised Character-to-Character Distillation for Text Recognition

Authors: Tongkun Guan, Wei Shen, Xue Yang, Qi Feng, Zekun Jiang, Xiaokang Yang

Abstract: When handling complicated text images (e.g., irregular structures, low resolution, heavy occlusion, and uneven illumination), existing supervised text recognition methods are data-hungry. Although these methods employ large-scale synthetic text images to reduce the dependence on annotated real images, the domain gap still limits the recognition performance. Therefore, exploring the robust text feature representations on unlabeled real images by self-supervised learning is a good solution. However, existing self-supervised text recognition methods conduct sequence-to-sequence representation learning by roughly splitting the visual features along the horizontal axis, which limits the flexibility of the augmentations, as large geometric-based augmentations may lead to sequence-to-sequence feature inconsistency. Motivated by this, we propose a novel self-supervised Character-to-Character Distillation method, CCD, which enables versatile augmentations to facilitate general text representation learning. Specifically, we delineate the character structures of unlabeled real images by designing a self-supervised character segmentation module. Following this, CCD easily enriches the diversity of local characters while keeping their pairwise alignment under flexible augmentations, using the transformation matrix between two augmented views from images. Experiments demonstrate that CCD achieves state-of-the-art results, with average performance gains of 1.38% in text recognition, 1.7% in text segmentation, 0.24 dB (PSNR) and 0.0321 (SSIM) in text super-resolution. Code is available at https://github.com/TongkunGuan/CCD.

Submitted 18 August, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

Comments: Accepted by ICCV2023

arXiv:2209.07725 [pdf, other]

VINet: Visual and Inertial-based Terrain Classification and Adaptive Navigation over Unknown Terrain

Authors: Tianrui Guan, Ruitao Song, Zhixian Ye, Liangjun Zhang

Abstract: We present a visual and inertial-based terrain classification network (VINet) for robotic navigation over different traversable surfaces. We use a novel navigation-based labeling scheme for terrain classification and generalization on unknown surfaces. Our proposed perception method and adaptive scheduling control framework can make predictions according to terrain navigation properties and lead to better performance on both terrain classification and navigation control on known and unknown surfaces. Our VINet can achieve 98.37% in terms of accuracy under supervised setting on known terrains and improve the accuracy by 8.51% on unknown terrains compared to previous methods. We deploy VINet on a mobile tracked robot for trajectory following and navigation on different terrains, and we demonstrate an improvement of 10.3% compared to a baseline controller in terms of RMSE.

Submitted 1 March, 2023; v1 submitted 16 September, 2022; originally announced September 2022.

arXiv:2209.05722 [pdf, other]

GrASPE: Graph based Multimodal Fusion for Robot Navigation in Unstructured Outdoor Environments

Authors: Kasun Weerakoon, Adarsh Jagan Sathyamoorthy, Jing Liang, Tianrui Guan, Utsav Patel, Dinesh Manocha

Abstract: We present a novel trajectory traversability estimation and planning algorithm for robot navigation in complex outdoor environments. We incorporate multimodal sensory inputs from an RGB camera, 3D LiDAR, and the robot's odometry sensor to train a prediction model to estimate candidate trajectories' success probabilities based on partially reliable multi-modal sensor observations. We encode high-dimensional multi-modal sensory inputs to low-dimensional feature vectors using encoder networks and represent them as a connected graph. The graph is then used to train an attention-based Graph Neural Network (GNN) to predict trajectory success probabilities. We further analyze the number of features in the image (corners) and point cloud data (edges and planes) separately to quantify their reliability to augment the weights of the feature graph representation used in our GNN. During runtime, our model utilizes multi-sensor inputs to predict the success probabilities of the trajectories generated by a local planner to avoid potential collisions and failures. Our algorithm demonstrates robust predictions when one or more sensor modalities are unreliable or unavailable in complex outdoor environments. We evaluate our algorithm's navigation performance using a Spot robot in real-world outdoor environments. We observe an increase of 10-30% in terms of navigation success rate and a 13-15% decrease in false positive estimations compared to the state-of-the-art navigation methods.

Submitted 16 May, 2023; v1 submitted 13 September, 2022; originally announced September 2022.

arXiv:2207.13848 [pdf, other]

Predicting the Output Structure of Sparse Matrix Multiplication with Sampled Compression Ratio

Authors: Zhaoyang Du, Yijin Guan, Tianchan Guan, Dimin Niu, Nianxiong Tan, Xiaopeng Yu, Hongzhong Zheng, Jianyi Meng, Xiaolang Yan, Yuan Xie

Abstract: Sparse general matrix multiplication (SpGEMM) is a fundamental building block in numerous scientific applications. One critical task of SpGEMM is to compute or predict the structure of the output matrix (i.e., the number of nonzero elements per output row) for efficient memory allocation and load balance, which impact the overall performance of SpGEMM. Existing work either precisely calculates the output structure or adopts upper-bound or sampling-based methods to predict the output structure. However, these methods either take much execution time or are not accurate enough. In this paper, we propose a novel sampling-based method with better accuracy and low costs compared to the existing sampling-based method. The proposed method first predicts the compression ratio of SpGEMM by leveraging the number of intermediate products (denoted as FLOP) and the number of nonzero elements (denoted as NNZ) of the same sampled result matrix. And then, the predicted output structure is obtained by dividing the FLOP per output row by the predicted compression ratio. We also propose a reference design of the existing sampling-based method with optimized computing overheads to demonstrate the better accuracy of the proposed method. We construct 625 test cases with various matrix dimensions and sparse structures to evaluate the prediction accuracy. Experimental results show that the absolute relative errors of the proposed method and the reference design are 1.56% and 8.12%, respectively, on average, and 25% and 156%, respectively, in the worst case.

Submitted 27 July, 2022; originally announced July 2022.

Comments: This paper has been submitted to the IEEE International Conference on Parallel and Distributed Systems (ICPADS). 8 pages, 2 figures, 3 tables

ACM Class: F.2.1; G.3; D.1.3; G.1.3

arXiv:2206.08059 [pdf]

doi 10.1109/JLT.2023.3247447

Intrinsic Spectrum Analysis of Laser Dynamics Based on Fractional Fourier Transform

Authors: Ligang Huang, Tianyi Lan, Chaoze Zhang, Laiyang Dang, Tianyu Guan, Bowen Zheng, Shunli Liu, Lei Gao, Wei Huang, Guolu Yin, Tao Zhu

Abstract: Intrinsic spectrum that results from the coupling of spontaneous emission in a laser cavity, can determine the energy concentration and coherence of lasers, which is crucial for the optical high-precision measurement. Up to now, it is hard to analyze the intrinsic spectrum in the high-speed laser dynamics process, especially under the condition of fast wavelength sweeping. In this work, a new method to analyze the laser intrinsic spectrum is proposed with the laser energy decomposition to a series of chirp-frequency signals, which is realized by fractional Fourier transform (FRFT) of the coherently reconstructed laser waveform. The new understanding of the energy distribution of lasers contributes to the accurate characterization of laser dynamical parameters in time-frequency domain. In the proof-of-concept experiment, the time-frequency dynamical process of a commercial wavelength swept laser is tested with different wavelength-scanning speeds, and the most suitable measurement time window width required for the FRFT-based narrowest spectrum is also explored. The proposed analysis method of laser dynamical parameters will promote the understanding of laser dynamics, and benefit for the optical precision measurement applications.

Submitted 16 June, 2022; originally announced June 2022.

arXiv:2206.07244 [pdf, other]

OpSparse: a Highly Optimized Framework for Sparse General Matrix Multiplication on GPUs

Authors: Zhaoyang Du, Yijin Guan, Tianchan Guan, Dimin Niu, Linyong Huang, Hongzhong Zheng, Yuan Xie

Abstract: Sparse general matrix multiplication (SpGEMM) is an important and expensive computation primitive in many real-world applications. Due to SpGEMM's inherent irregularity and the vast diversity of its input matrices, developing high-performance SpGEMM implementation on modern processors such as GPUs is challenging. The state-of-the-art SpGEMM libraries (i.e., $nsparse$ and $spECK$) adopt several algorithms to tackle the challenges of global load balance, local load balance, and allocation of the result matrix. While these libraries focus on the high-level algorithm design for SpGEMM, they neglect several low-level architecture-specific optimizations, which causes inefficient implementations in their libraries. In this paper, we classify their inefficient implementations into seven categories. Based on our observations, we propose a highly optimized SpGEMM library called $OpSparse$. The optimizations in $OpSparse$ include 1) optimizing the binning method by improving the utilization of the shared memory, 2) optimizing the hashing method by reducing the access to the hash table, 3) improving the trade-off between hash collision rate and hardware utilization in the hashing method by setting appropriate binning ranges, 4) reducing the overheads of global memory utilization by minimizing the global memory usage of the metadata, and 5) improving the execution parallelism by overlapping global memory allocation with kernel execution. Performance evaluations with 26 commonly used matrices on an Nvidia Tesla V100 GPU show that $OpSparse$ achieves up to $27.8\times$, $1.81\times$, and $2.04\times$ performance speedup over three state-of-the-art libraries: $cuSPARSE$, $nsparse$, and $spECK$, respectively.

Submitted 14 June, 2022; originally announced June 2022.

Comments: This paper was submitted to IEEE Access on May 7, 2022, and is currently under review. 20 pages, 11 figures, 5 tables

MSC Class: 68-02; 68W10; 65F50 ACM Class: D.1.3; G.1.3

arXiv:2206.06611 [pdf, other]

Accelerating CPU-Based Sparse General Matrix Multiplication With Binary Row Merging

Authors: Zhaoyang Du, Yijin Guan, Tianchan Guan, Dimin Niu, Hongzhong Zheng, Yuan Xie

Abstract: Sparse general matrix multiplication (SpGEMM) is a fundamental building block for many real-world applications. Since SpGEMM is a well-known memory-bounded application with vast and irregular memory accesses, considering the memory access efficiency is of critical importance for SpGEMM's performance. Yet, the existing methods put less consideration into the memory subsystem and achieved suboptimal performance. In this paper, we thoroughly analyze the memory access patterns of SpGEMM and their influences on the memory subsystem. Based on the analysis, we propose a novel and more efficient accumulation method named BRMerge for the multi-core CPU architectures. The BRMerge accumulation method follows the row-wise dataflow. It first accesses the $B$ matrix, generates the intermediate lists for one output row, and stores these intermediate lists in a consecutive memory space, which is implemented by a ping-pong buffer. It then immediately merges these intermediate lists generated in the previous phase two by two in a tree-like hierarchy between two ping-pong buffers. The architectural benefits of BRMerge are 1) streaming access patterns, 2) minimized TLB cache miss rate, and 3) reasonably high L1/L2 cache hit rates, which result in both low access latency and high bandwidth utilization when performing SpGEMM. Based on the BRMerge accumulation method, we propose two SpGEMM libraries named BRMerge-Upper and BRMerge-Precise, which use different allocation methods. Performance evaluations with 26 commonly used benchmarks on two CPU servers show that the proposed SpGEMM libraries significantly outperform the state-of-the-art SpGEMM libraries.

Submitted 19 August, 2022; v1 submitted 14 June, 2022; originally announced June 2022.

Comments: This work has been accepted by IEEE Access (DOI:10.1109/ACCESS.2022.3193937). 12 pages, 6 figures, 2 tables

MSC Class: 68-02; 68W10; 65F50 ACM Class: D.1.3; G.1.3

arXiv:2206.05840 [pdf, other]

GAN based Data Augmentation to Resolve Class Imbalance

Authors: Sairamvinay Vijayaraghavan, Terry Guan, Jason Song

Abstract: The number of credit card fraud has been growing as technology grows and people can take advantage of it. Therefore, it is very important to implement a robust and effective method to detect such frauds. The machine learning algorithms are appropriate for these tasks since they try to maximize the accuracy of predictions and hence can be relied upon. However, there is an impending flaw wherein machine learning models may not perform well due to the presence of an imbalance across classes distribution within the sample set. So, in many related tasks, the datasets have a very small number of observed fraud cases (sometimes around 1 percent positive fraud instances found). Therefore, this imbalance presence may impact any learning model's behavior by predicting all labels as the majority class, hence allowing no scope for generalization in the predictions made by the model. We trained Generative Adversarial Network (GAN) to generate a large number of convincing (and reliable) synthetic examples of the minority class that can be used to alleviate the class imbalance within the training set and hence generalize the learning of the data more effectively.

Submitted 12 June, 2022; originally announced June 2022.

arXiv:2205.03517 [pdf, other]

AdaptiveON: Adaptive Outdoor Local Navigation Method For Stable and Reliable Actions

Authors: Jing Liang, Kasun Weerakoon, Tianrui Guan, Nare Karapetyan, Dinesh Manocha

Abstract: We present a novel outdoor navigation algorithm to generate stable and efficient actions to navigate a robot to reach a goal. We use a multi-stage training pipeline and show that our approach produces policies that result in stable and reliable robot navigation on complex terrains. Based on the Proximal Policy Optimization (PPO) algorithm, we developed a novel method to achieve multiple capabilities for outdoor navigation tasks, namely alleviating the robot's drifting, keeping the robot stable on bumpy terrains, avoiding climbing on hills with steep elevation changes, and avoiding collisions. Our training process mitigates the reality (sim-to-real) gap by introducing generalized environmental and robotic parameters and training with rich features of Lidar perception in a high-fidelity Unity simulator. We evaluate our method in both simulation and real world environments using Clearpath Husky and Jackal robots. Further, we compare our method against the state-of-the-art approaches and observe that, in the real world it improves stability by at least 30.7% on uneven terrains, reduces drifting by 8.08% and decreases the elevation changes by 14.75%.

Submitted 6 December, 2022; v1 submitted 6 May, 2022; originally announced May 2022.

Comments: 10 pages

arXiv:2203.15459 [pdf, other]

Influence of Communication Among Shared Developers on the Productivity of Open Source Software Projects

Authors: Sairamvinay Vijayaraghavan, Jinxiao Song, Terry Guan, Seongwoo Choi, Sutej Kulkarni

Abstract: Many software developers rely on open source software for developing their applications and writing their source codes. Measuring an independent project's overall productivity is still an open problem for many technology companies. In this project, we address to bridge the gap of analyzing which are the most important features for prediction of a productivity based system. We have chosen to collect data from GitHub via their application programming interfaces (API) and analyze the data we gathered to understand the relation between the average time to close an issue and the features that we collected. Since most of the data we gathered were not Gaussian, we had to preprocess the data using outlier detection and applying transformations before statistical modeling. The best model we observed was polynomial regression with degree 5. Overall, we noticed that there are many aspects of software development that make developers increase their productivity.

Submitted 25 March, 2022; originally announced March 2022.

arXiv:2203.10694 [pdf, other]

FAR: Fourier Aerial Video Recognition

Authors: Divya Kothandaraman, Tianrui Guan, Xijun Wang, Sean Hu, Ming Lin, Dinesh Manocha

Abstract: We present an algorithm, Fourier Activity Recognition (FAR), for UAV video activity recognition. Our formulation uses a novel Fourier object disentanglement method to innately separate out the human agent (which is typically small) from the background. Our disentanglement technique operates in the frequency domain to characterize the extent of temporal change of spatial pixels, and exploits convolution-multiplication properties of Fourier transform to map this representation to the corresponding object-background entangled features obtained from the network. To encapsulate contextual information and long-range space-time dependencies, we present a novel Fourier Attention algorithm, which emulates the benefits of self-attention by modeling the weighted outer product in the frequency domain. Our Fourier attention formulation uses much fewer computations than self-attention. We have evaluated our approach on multiple UAV datasets including UAV Human RGB, UAV Human Night, Drone Action, and NEC Drone. We demonstrate a relative improvement of 8.02% - 38.69% in top-1 accuracy and up to 3 times faster over prior works.

Submitted 18 July, 2022; v1 submitted 20 March, 2022; originally announced March 2022.

Comments: ECCV 2022 Poster paper

arXiv:2203.03382 [pdf, other]

Self-supervised Implicit Glyph Attention for Text Recognition

Authors: Tongkun Guan, Chaochen Gu, Jingzheng Tu, Xue Yang, Qi Feng, Yudi Zhao, Xiaokang Yang, Wei Shen

Abstract: The attention mechanism has become the de facto module in scene text recognition (STR) methods, due to its capability of extracting character-level representations. These methods can be summarized into implicit attention based and supervised attention based, depending on how the attention is computed, i.e., implicit attention and supervised attention are learned from sequence-level text annotations and character-level bounding box annotations, respectively. Implicit attention, as it may extract coarse or even incorrect spatial regions as character attention, is prone to suffering from an alignment-drifted issue. Supervised attention can alleviate the above issue, but it is character category-specific, which requires extra laborious character-level bounding box annotations and would be memory-intensive when handling languages with larger character categories. To address the aforementioned issues, we propose a novel attention mechanism for STR, self-supervised implicit glyph attention (SIGA). SIGA delineates the glyph structures of text images by jointly self-supervised text segmentation and implicit attention alignment, which serve as the supervision to improve attention correctness without extra character-level annotations. Experimental results demonstrate that SIGA performs consistently and significantly better than previous attention-based STR methods, in terms of both attention correctness and final recognition performance on publicly available context benchmarks and our contributed contextless benchmarks.

Submitted 15 May, 2023; v1 submitted 7 March, 2022; originally announced March 2022.

Comments: CVPR2023

arXiv:2202.12873 [pdf, other]

TerraPN: Unstructured Terrain Navigation using Online Self-Supervised Learning

Authors: Adarsh Jagan Sathyamoorthy, Kasun Weerakoon, Tianrui Guan, Jing Liang, Dinesh Manocha

Abstract: We present TerraPN, a novel method that learns the surface properties (traction, bumpiness, deformability, etc.) of complex outdoor terrains directly from robot-terrain interactions through self-supervised learning, and uses it for autonomous robot navigation. Our method uses RGB images of terrain surfaces and the robot's velocities as inputs, and the IMU vibrations and odometry errors experienced by the robot as labels for self-supervision. Our method computes a surface cost map that differentiates smooth, high-traction surfaces (low navigation costs) from bumpy, slippery, deformable surfaces (high navigation costs). We compute the cost map by non-uniformly sampling patches from the input RGB image by detecting boundaries between surfaces resulting in low inference times (47.27% lower) compared to uniform sampling and existing segmentation methods. We present a novel navigation algorithm that accounts for a surface's cost, computes cost-based acceleration limits for the robot, and dynamically feasible, collision-free trajectories. TerraPN's surface cost prediction can be trained in ~25 minutes for five different surfaces, compared to several hours for previous learning-based segmentation methods. In terms of navigation, our method outperforms previous works in terms of success rates (up to 35.84% higher), vibration cost of the trajectories (up to 21.52% lower), and slowing the robot on bumpy, deformable surfaces (up to 46.76% slower) in different scenarios.

Submitted 22 June, 2022; v1 submitted 25 February, 2022; originally announced February 2022.

Comments: 10 pages, 6 figures

arXiv:2202.09746 [pdf, ps, other]

Ultrasensitive refractive index sensor with rotatory biased weak measurement

Authors: Chongqi Zhou, Yang Xu, Xiaonan Zhang, Zhangyan Li, Tian Guan, Yonghong He, Yanhong Ji

Abstract: A modified weak measurement scheme, rotatory biased weak measurement, is proposed to significantly improve the sensitivity and resolution of the refractive index sensor on a total reflection structure. This method introduces an additional phase in the post-selected procedure and generates an extinction point in the spectrum distribution. The biased post-selection makes smaller coupling strength available, which leads to an enhancement of phase sensitivity and refractive index sensitivity. In rotatory biased weak measurement, we achieve an enhanced refractive index sensitivity of 13605 nm/RIU compared to 1644 nm/RIU in standard weak measurement. The performance of sensors with different sensitivity is analyzed, and we find the optimal refractive index resolution of sensors increases with sensitivity. In this work, we demonstrate an optimal refractive index resolution of $4\times10^{-7}$ RIU on a total reflection structure. The rabbit anti-mouse IgG and mouse IgG binding reaction experiments demonstrate that our system has a high response to the concentration of IgG in a wide range and the limit of detection is 15 ng/mL. The improvements in this work are helpful to the optimizations of other optical sensors with weak measurement.

Submitted 21 April, 2022; v1 submitted 20 February, 2022; originally announced February 2022.

Comments: 8 pages, 6 figures

arXiv:2202.07505 [pdf, ps, other]

A note on $\partial$-bilipschitz mappings in quasiconvex metric spaces

Authors: Tiantian Guan, Saminathan Ponnusamy, Qingshan Zhou

Abstract: This paper focuses on properties of $\partial$-biLipschitz mappings, which were recently introduced by Butler. We establish several characterizations for the class of $\partial$-biLipschitz mappings between domains in quasiconvex metric spaces. As an application, we show that a locally quasisymmetric equivalence between uniform metric spaces is quasimöbius, quantitatively.

Submitted 15 February, 2022; originally announced February 2022.

Comments: 17 pages; To appear in Bulletin des sciences mathématiques

MSC Class: Primary: 30C65; 30F45; 53C23; Secondary: 30C20

arXiv:2202.02800 [pdf, other]

doi 10.14778/3489496.3489508

Learning to be a Statistician: Learned Estimator for Number of Distinct Values

Authors: Renzhi Wu, Bolin Ding, Xu Chu, Zhewei Wei, Xiening Dai, Tao Guan, Jingren Zhou

Abstract: Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems, such as columnstore compression and data profiling. In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples. Such efficient estimation is critical for tasks where it is prohibitive to scan the data even once. Existing sample-based estimators typically rely on heuristics or assumptions and do not have robust performance across different datasets as the assumptions on data can easily break. On the other hand, deriving an estimator from a principled formulation such as maximum likelihood estimation is very challenging due to the complex structure of the formulation. We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator. To this end, we need to answer several questions: i) how to make the learned model workload agnostic; ii) how to obtain training data; iii) how to perform model training. We derive conditions of the learning framework under which the learned model is workload agnostic, in the sense that the model/estimator can be trained with synthetically generated training data, and then deployed into any data warehouse simply as, e.g., user-defined functions (UDFs), to offer efficient (within microseconds on CPU) and accurate NDV estimations for unseen tables and workloads. We compare the learned estimator with the state-of-the-art sample-based estimators on nine real-world datasets to demonstrate its superior estimation accuracy. We publish our code for training data generation, model training, and the learned estimator online for reproducibility.

Submitted 6 February, 2022; originally announced February 2022.

Comments: Published at International Conference on Very Large Data Bases (VLDB) 2022

Journal ref: PVLDB, 15(2): 272 - 284, 2022

arXiv:2110.12663 [pdf, other]

doi 10.1109/TCSVT.2022.3156390

Industrial Scene Text Detection with Refined Feature-attentive Network

Authors: Tongkun Guan, Chaochen Gu, Changsheng Lu, Jingzheng Tu, Qi Feng, Kaijie Wu, Xinping Guan

Abstract: Detecting the marking characters of industrial metal parts remains challenging due to low visual contrast, uneven illumination, corroded character structures, and cluttered background of metal part images. Affected by these factors, bounding boxes generated by most existing methods locate low-contrast text areas inaccurately. In this paper, we propose a refined feature-attentive network (RFN) to solve the inaccurate localization problem. Specifically, we design a parallel feature integration mechanism to construct an adaptive feature representation from multi-resolution features, which enhances the perception of multi-scale texts at each scale-specific level to generate a high-quality attention map. Then, an attentive refinement network is developed by the attention map to rectify the location deviation of candidate boxes. In addition, a re-scoring mechanism is designed to select text boxes with the best rectified location. Moreover, we construct two industrial scene text datasets, including a total of 102156 images and 1948809 text instances with various character structures and metal parts. Extensive experiments on our dataset and four public datasets demonstrate that our proposed method achieves the state-of-the-art performance.

Submitted 29 March, 2022; v1 submitted 25 October, 2021; originally announced October 2021.

arXiv:2109.12780 [pdf, other]

Pommerenke's theorem on Gromov hyperbolic domains

Authors: Qingshan Zhou, Antti Rasila, Tiantian Guan

Abstract: We establish a version of a classical theorem of Pommerenke, which is a diameter version of the Gehring-Hayman inequality on Gromov hyperbolic domains of $\mathbb{R}^n$. Two applications are given. Firstly, we generalize Ostrowski's Faltensatz to quasihyperbolic geodesics of Gromov hyperbolic domains. Secondly, we prove that unbounded uniform domains can be characterized in the terms of Gromov hyperbolicity and a naturally quasisymmetric correspondence on the boundary, where the Gromov boundary is equipped with a Hamenstädt metric (defined by using a Busemann function).

Submitted 26 September, 2021; originally announced September 2021.

Comments: 28 pages, 4 figures

MSC Class: Primary: 30C65; 30F45; Secondary: 30C20

arXiv:2109.06250 [pdf, other]

TNS: Terrain Traversability Mapping and Navigation System for Autonomous Excavators

Authors: Tianrui Guan, Zhenpeng He, Ruitao Song, Dinesh Manocha, Liangjun Zhang

Abstract: We present a terrain traversability mapping and navigation system (TNS) for autonomous excavator applications in an unstructured environment. We use an efficient approach to extract terrain features from RGB images and 3D point clouds and incorporate them into a global map for planning and navigation. Our system can adapt to changing environments and update the terrain information in real-time. Moreover, we present a novel dataset, the Complex Worksite Terrain (CWT) dataset, which consists of RGB images from construction sites with seven categories based on navigability. Our novel algorithms improve the mapping accuracy over previous SOTA methods by 4.17-30.48% and reduce MSE on the traversability map by 13.8-71.4%. We have combined our mapping approach with planning and control modules in an autonomous excavator navigation system and observe 49.3% improvement in the overall success rate. Based on TNS, we demonstrate the first autonomous excavator that can navigate through unstructured environments consisting of deep pits, steep hills, rock piles, and other complex terrain features.

Submitted 5 May, 2022; v1 submitted 13 September, 2021; originally announced September 2021.

Showing 1–50 of 86 results for author: Guan, T