-
Training-Free Large Model Priors for Multiple-in-One Image Restoration
Authors:
Xuanhua He,
Lang Li,
Yingying Wang,
Hui Zheng,
Ke Cao,
Keyu Yan,
Rui Li,
Chengjun Xie,
Jie Zhang,
Man Zhou
Abstract:
Image restoration aims to reconstruct latent clear images from their degraded versions. Despite notable achievements, existing methods predominantly focus on handling specific degradation types and thus require specialized models, impeding real-world applications in dynamic degradation scenarios. To address this issue, we propose the Large Model Driven Image Restoration framework (LMDIR), a novel multiple-in-one image restoration paradigm that leverages the generic priors from large multi-modal language models (MMLMs) and pretrained diffusion models. In detail, LMDIR integrates three key kinds of prior knowledge: 1) global degradation knowledge from MMLMs, 2) scene-aware contextual descriptions generated by MMLMs, and 3) fine-grained high-quality reference images synthesized by diffusion models guided by MMLM descriptions. Building on these priors, our architecture comprises a query-based prompt encoder, a degradation-aware transformer block injecting global degradation knowledge, a content-aware transformer block incorporating scene descriptions, and a reference-based transformer block incorporating fine-grained image priors. This design facilitates a single-stage training paradigm that addresses various degradations while supporting both automatic and user-guided restoration. Extensive experiments demonstrate that our method outperforms state-of-the-art competitors on multiple evaluation benchmarks.
Submitted 18 July, 2024;
originally announced July 2024.
-
DSO: A GPU Energy Efficiency Optimizer by Fusing Dynamic and Static Information
Authors:
Qiang Wang,
Laiyi Li,
Weile Luo,
Yijia Zhang,
Bingqiang Wang
Abstract:
Increased reliance on graphics processing units (GPUs) for high-intensity computing tasks raises challenges regarding energy consumption. To address this issue, dynamic voltage and frequency scaling (DVFS) has emerged as a promising technique for conserving energy while maintaining the quality of service (QoS) of GPU applications. However, existing solutions using DVFS are hindered by inefficiency or inaccuracy, as they depend on either dynamic or static information alone, respectively, which prevents them from being adopted in practical power management schemes. To this end, we propose a novel energy efficiency optimizer, called DSO, to explore a lightweight solution that leverages both dynamic and static information to model and optimize GPU energy efficiency. DSO first proposes a novel theoretical energy efficiency model that reflects the DVFS roofline phenomenon and considers the tradeoff between performance and energy. It then applies machine learning techniques to predict the parameters of this model from both GPU kernel runtime metrics and static code features. Experiments on modern DVFS-enabled GPUs indicate that DSO can enhance energy efficiency by 19% while keeping performance loss within a 5% margin.
Submitted 17 July, 2024;
originally announced July 2024.
-
Matching-Driven Deep Reinforcement Learning for Energy-Efficient Transmission Parameter Allocation in Multi-Gateway LoRa Networks
Authors:
Ziqi Lin,
Xu Zhang,
Shimin Gong,
Lanhua Li,
Zhou Su,
Bo Gu
Abstract:
Long-range (LoRa) communication technology, distinguished by its low power consumption and long communication range, is widely used in the Internet of Things. Nevertheless, the LoRa MAC layer adopts pure ALOHA for medium access control, which may suffer from severe packet collisions as the network scale expands, consequently reducing the system energy efficiency (EE). To address this issue, it is critical to carefully allocate transmission parameters such as the channel (CH), transmission power (TP) and spreading factor (SF) to each end device (ED). Owing to the low duty cycle and sporadic traffic of LoRa networks, evaluating the system EE under various parameter settings proves to be time-consuming. Consequently, we propose an analytical model aimed at calculating the system EE while fully considering the impact of multiple gateways, duty cycling, quasi-orthogonal SFs and capture effects. On this basis, we investigate a joint CH, SF and TP allocation problem, with the objective of optimizing the system EE for uplink transmissions. Due to the NP-hard complexity of the problem, the optimization problem is decomposed into two subproblems: CH assignment and SF/TP assignment. First, a matching-based algorithm is introduced to address the CH assignment subproblem. Then, an attention-based multiagent reinforcement learning technique is employed to address the SF/TP assignment subproblem for EDs allocated to the same CH, which reduces the number of learning agents to achieve fast convergence. The simulation outcomes indicate that the proposed approach converges quickly under various parameter settings and obtains significantly better system EE than baseline algorithms.
Submitted 17 July, 2024;
originally announced July 2024.
-
Comprehensive Review and Empirical Evaluation of Causal Discovery Algorithms for Numerical Data
Authors:
Wenjin Niu,
Zijun Gao,
Liyan Song,
Lingbo Li
Abstract:
Causal analysis has become an essential component in understanding the underlying causes of phenomena across various fields. Despite its significance, the existing literature on causal discovery algorithms is fragmented, with inconsistent methodologies and a lack of comprehensive evaluations. This study addresses these gaps by conducting an exhaustive review and empirical evaluation of causal discovery methods for numerical data, aiming to provide a clearer and more structured understanding of the field. Our research began with a comprehensive literature review spanning over a decade, revealing that existing surveys fall short in covering the vast array of causal discovery advancements. We meticulously analyzed over 200 scholarly articles to identify 24 distinct algorithms. This extensive analysis led to the development of a novel taxonomy tailored to the complexities of causal discovery, categorizing methods into six main types. Addressing the lack of comprehensive evaluations, our study conducts an extensive empirical assessment of more than 20 causal discovery algorithms on synthetic and real-world datasets. We categorize synthetic datasets based on size, linearity, and noise distribution, employ 5 evaluation metrics, and summarize the top-3 algorithm recommendations for different data scenarios. The recommendations have been validated on 2 real-world datasets. Our results highlight the significant impact of dataset characteristics on algorithm performance. Moreover, a metadata extraction strategy was developed to assist users in algorithm selection on unknown datasets. The accuracy of estimating metadata is higher than 80%. Based on these insights, we offer professional and practical recommendations to help users choose the most suitable causal discovery methods for their specific dataset needs.
Submitted 17 July, 2024;
originally announced July 2024.
-
Contrastive Adversarial Training for Unsupervised Domain Adaptation
Authors:
Jiahong Chen,
Zhilin Zhang,
Lucy Li,
Behzad Shahrasbi,
Arjun Mishra
Abstract:
Domain adversarial training has proven effective at finding domain-invariant feature representations and has been successfully adopted for various domain adaptation tasks. However, recent advances in large models (e.g., vision transformers) and the emergence of complex adaptation scenarios (e.g., DomainNet) make adversarial training easily biased towards the source domain and hard to adapt to the target domain. The reason is twofold: large-model training relies on a large amount of labelled data from the source domain, and labelled data from the target domain is lacking for fine-tuning. Existing approaches have largely focused on either enhancing the discriminator or improving training stability for the backbone networks. Due to unbalanced competition between the feature extractor and the discriminator during adversarial training, existing solutions fail to function well on complex datasets. To address this issue, we propose a novel contrastive adversarial training (CAT) approach that leverages the labeled source domain samples to reinforce and regulate feature generation for the target domain. Typically, the regulation forces the target feature distribution to be similar to the source feature distribution. CAT addresses three major challenges in adversarial learning: 1) ensuring that the feature distributions from the two domains are as indistinguishable as possible for the discriminator, resulting in more robust domain-invariant feature generation; 2) encouraging target samples to move closer to the source in the feature space, reducing the requirement for generalizing a classifier trained on the labeled source domain to the unlabeled target domain; 3) avoiding direct alignment of unpaired source and target samples within a mini-batch. CAT can be easily plugged into existing models and exhibits significant performance improvements.
Submitted 17 July, 2024;
originally announced July 2024.
-
Case2Code: Learning Inductive Reasoning with Synthetic Data
Authors:
Yunfan Shao,
Linyang Li,
Yichuan Ma,
Peiji Li,
Demin Song,
Qinyuan Cheng,
Shimin Li,
Xiaonan Li,
Pengyu Wang,
Qipeng Guo,
Hang Yan,
Xipeng Qiu,
Xuanjing Huang,
Dahua Lin
Abstract:
Complex reasoning is an impressive ability shown by large language models (LLMs). Most LLMs are skilled in deductive reasoning, such as chain-of-thought prompting or iterative tool use to solve challenging tasks step by step. In this paper, we focus on evaluating and teaching LLMs to conduct inductive reasoning, that is, LLMs are supposed to infer underlying rules by observing examples or sequential transformations. However, collecting large-scale and diverse human-generated inductive data is challenging. We focus on data synthesis in the code domain and propose a Case2Code task by exploiting the expressiveness and correctness of programs. Specifically, we collect a diverse set of executable programs, synthesize input-output transformations for each program, and force LLMs to infer the underlying code implementations based on the synthetic I/O cases. We first evaluate representative LLMs on the synthesized Case2Code task and demonstrate that case-to-code induction is challenging for LLMs. Then, we synthesize large-scale Case2Code training samples to train LLMs to perform inductive reasoning. Experimental results show that such induction training not only benefits in-distribution Case2Code performance but also enhances various coding abilities of the trained LLMs, demonstrating the great potential of learning inductive reasoning via synthetic data.
Submitted 17 July, 2024;
originally announced July 2024.
-
RankTower: A Synergistic Framework for Enhancing Two-Tower Pre-Ranking Model
Authors:
YaChen Yan,
Liubo Li
Abstract:
In large-scale ranking systems, cascading architectures have been widely adopted to achieve a balance between efficiency and effectiveness. The pre-ranking module plays a vital role in selecting a subset of candidates for the subsequent ranking module. It is crucial for the pre-ranking model to maintain a balance between efficiency and accuracy to adhere to online latency constraints. In this paper, we propose a novel neural network architecture called RankTower, which is designed to efficiently capture user-item interactions while following the user-item decoupling paradigm to ensure online inference efficiency. The proposed approach employs a hybrid training objective that learns from samples obtained from the full stage of the cascade ranking system, optimizing different objectives for varying sample spaces. This strategy aims to enhance the pre-ranking model's ranking capability and improve its alignment with the existing cascade ranking system. Experimental results on public datasets demonstrate that RankTower significantly outperforms state-of-the-art pre-ranking models.
Submitted 17 July, 2024;
originally announced July 2024.
-
HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects
Authors:
Xintao Lv,
Liang Xu,
Yichao Yan,
Xin Jin,
Congsheng Xu,
Shuwen Wu,
Yifan Liu,
Lincheng Li,
Mengxiao Bi,
Wenjun Zeng,
Xiaokang Yang
Abstract:
Generating human-object interactions (HOIs) is critical with the tremendous advances of digital avatars. Existing datasets are typically limited to humans interacting with a single object while neglecting the ubiquitous manipulation of multiple objects. Thus, we propose HIMO, a large-scale MoCap dataset of full-body human interacting with multiple objects, containing 3.3K 4D HOI sequences and 4.08M 3D HOI frames. We also annotate HIMO with detailed textual descriptions and temporal segments, benchmarking two novel tasks of HOI synthesis conditioned on either the whole text prompt or the segmented text prompts as fine-grained timeline control. To address these novel tasks, we propose a dual-branch conditional diffusion model with a mutual interaction module for HOI synthesis. Besides, an auto-regressive generation pipeline is also designed to obtain smooth transitions between HOI segments. Experimental results demonstrate the generalization ability to unseen object geometries and temporal compositions.
Submitted 17 July, 2024;
originally announced July 2024.
-
Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning
Authors:
Yunbin Tu,
Liang Li,
Li Su,
Chenggang Yan,
Qingming Huang
Abstract:
Change captioning aims to succinctly describe the semantic change between a pair of similar images, while being immune to distractors (illumination and viewpoint changes). Under these distractors, unchanged objects often exhibit pseudo changes in location and scale, and certain objects might overlap others, resulting in perturbed and less discriminative features between the two images. However, most existing methods directly capture the difference between them, which risks obtaining error-prone difference features. In this paper, we propose a distractors-immune representation learning network that correlates the corresponding channels of two image representations and decorrelates different ones in a self-supervised manner, thus attaining a pair of stable image representations under distractors. Then, the model can better model the interaction between them to capture reliable difference features for caption generation. To yield words based on the most related difference features, we further design a cross-modal contrastive regularization, which regularizes the cross-modal alignment by maximizing the contrastive alignment between the attended difference features and generated words. Extensive experiments show that our method outperforms the state-of-the-art methods on four public datasets. The code is available at https://github.com/tuyunbin/DIRL.
Submitted 16 July, 2024;
originally announced July 2024.
-
Uniformly Accelerated Motion Model for Inter Prediction
Authors:
Zhuoyuan Li,
Yao Li,
Chuanbo Tang,
Li Li,
Dong Liu,
Feng Wu
Abstract:
Inter prediction is a key technology to reduce the temporal redundancy in video coding. In natural videos, there are usually multiple moving objects with variable velocity, resulting in complex motion fields that are difficult to represent compactly. In Versatile Video Coding (VVC), existing inter prediction methods usually assume uniform-speed motion between consecutive frames and use linear models for motion estimation (ME) and motion compensation (MC), which may not handle the complex motion fields of the real world well. To address these issues, we introduce a uniformly accelerated motion model (UAMM) to exploit motion-related elements (velocity, acceleration) of moving objects between the video frames, and further combine them to help the inter prediction methods handle variable motion in the temporal domain. Specifically, first, we present the theory of UAMM. Second, based on that, we propose UAMM-based parameter derivation and extrapolation schemes in the coding process. Third, we integrate the UAMM into existing inter prediction modes (Merge, MMVD, CIIP) to achieve higher prediction accuracy. The proposed method is implemented in the VVC reference software, VTM version 12.0. Experimental results show that the proposed method achieves up to 0.38% and on average 0.13% BD-rate reduction compared to the VTM anchor, under the Low-delay P configuration, with a slight increase in time complexity on the encoding/decoding side.
Submitted 16 July, 2024;
originally announced July 2024.
-
Navigating the Data Trading Crossroads: An Interdisciplinary Survey
Authors:
Yi Yu,
Jingru Yu,
Xuhong Wang,
Juanjuan Li,
Yilun Lin,
Conghui He,
Yanqing Yang,
Yu Qiao,
Li Li,
Fei-Yue Wang
Abstract:
Data has been increasingly recognized as a critical factor in the future economy. However, constructing an efficient data trading market faces challenges such as privacy breaches, data monopolies, and misuse. Despite numerous studies proposing algorithms to protect privacy and methods for pricing data, a comprehensive understanding of these issues and systemic solutions remain elusive. This paper provides an extensive review and evaluation of data trading research, aiming to identify existing problems, research gaps, and propose potential solutions. We categorize the challenges into three main areas: Compliance Challenges, Collateral Consequences, and Costly Transactions (the "3C problems"), all stemming from ambiguity in data rights. Through a quantitative analysis of the literature, we observe a paradigm shift from isolated solutions to integrated approaches. Addressing the unresolved issue of right ambiguity, we introduce the novel concept of "data usufruct," which allows individuals to use and benefit from data they do not own. This concept helps reframe data as a more conventional factor of production and aligns it with established economic theories, paving the way for a comprehensive framework of research theories, technical tools, and platforms. We hope this survey provides valuable insights and guidance for researchers, practitioners, and policymakers, thereby contributing to digital economy advancements.
Submitted 16 July, 2024;
originally announced July 2024.
-
Building Intelligence Identification System via Large Language Model Watermarking: A Survey and Beyond
Authors:
Xuhong Wang,
Haoyu Jiang,
Yi Yu,
Jingru Yu,
Yilun Lin,
Ping Yi,
Yingchun Wang,
Qiao Yu,
Li Li,
Fei-Yue Wang
Abstract:
Large Language Models (LLMs) are increasingly integrated into diverse industries, posing substantial security risks due to unauthorized replication and misuse. To mitigate these concerns, robust identification mechanisms are widely acknowledged as an effective strategy. Identification systems for LLMs now rely heavily on watermarking technology to manage and protect intellectual property and ensure data security. However, previous studies have primarily concentrated on the basic principles of algorithms and lacked a comprehensive analysis of watermarking theory and practice from the perspective of intelligent identification. To bridge this gap, firstly, we explore how a robust identity recognition system can be effectively implemented and managed within LLMs by various participants using watermarking technology. Secondly, we propose a mathematical framework based on mutual information theory, which systematizes the identification process to achieve more precise and customized watermarking. Additionally, we present a comprehensive evaluation of performance metrics for LLM watermarking, reflecting participant preferences and advancing discussions on its identification applications. Lastly, we outline the existing challenges in current watermarking technologies and theoretical frameworks, and provide directional guidance to address these challenges. Our systematic classification and detailed exposition aim to enhance the comparison and evaluation of various methods, fostering further research and development toward a transparent, secure, and equitable LLM ecosystem.
Submitted 16 July, 2024; v1 submitted 15 July, 2024;
originally announced July 2024.
-
IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation
Authors:
Yuanhao Zhai,
Kevin Lin,
Linjie Li,
Chung-Ching Lin,
Jianfeng Wang,
Zhengyuan Yang,
David Doermann,
Junsong Yuan,
Zicheng Liu,
Lijuan Wang
Abstract:
Significant advances have been made in human-centric video generation, yet the joint video-depth generation problem remains underexplored. Most existing monocular depth estimation methods may not generalize well to synthesized images or videos, and multi-view-based methods have difficulty controlling the human appearance and motion. In this work, we present IDOL (unIfied Dual-mOdal Latent diffusion) for high-quality human-centric joint video-depth generation. Our IDOL consists of two novel designs. First, to enable dual-modal generation and maximize the information exchange between video and depth generation, we propose a unified dual-modal U-Net, a parameter-sharing framework for joint video and depth denoising, wherein a modality label guides the denoising target, and cross-modal attention enables the mutual information flow. Second, to ensure a precise video-depth spatial alignment, we propose a motion consistency loss that enforces consistency between the video and depth feature motion fields, leading to harmonized outputs. Additionally, a cross-attention map consistency loss is applied to align the cross-attention map of the video denoising with that of the depth denoising, further facilitating spatial alignment. Extensive experiments on the TikTok and NTU120 datasets show our superior performance, significantly surpassing existing methods in terms of video FVD and depth accuracy.
Submitted 15 July, 2024;
originally announced July 2024.
-
In-Loop Filtering via Trained Look-Up Tables
Authors:
Zhuoyuan Li,
Jiacheng Li,
Yao Li,
Li Li,
Dong Liu,
Feng Wu
Abstract:
In-loop filtering (ILF) is a key technology for removing artifacts in image/video coding standards. Recently, neural network-based in-loop filtering methods have achieved remarkable coding gains beyond the capability of advanced video coding standards, making them a powerful coding tool candidate for future video coding standards. However, the use of deep neural networks brings heavy time and computational complexity and high demands on high-performance hardware, which makes them challenging to apply in general coding scenarios. To address this limitation, inspired by explorations in image restoration, we propose an efficient and practical in-loop filtering scheme by adopting the Look-up Table (LUT). We train the DNN for in-loop filtering within a fixed filtering reference range, and cache the output values of the DNN into a LUT by traversing all possible inputs. At testing time in the coding process, the filtered pixel is generated by locating the input pixels (the to-be-filtered pixel with reference pixels) and interpolating the cached filtered pixel values. To further enable a large filtering reference range with the limited storage cost of the LUT, we introduce an enhanced indexing mechanism in the filtering process and a clipping/finetuning mechanism in training. The proposed method is implemented in the Versatile Video Coding (VVC) reference software, VTM-11.0. Experimental results show that the ultrafast, very fast, and fast modes of the proposed method achieve on average 0.13%/0.34%/0.51% and 0.10%/0.27%/0.39% BD-rate reduction under the all intra (AI) and random access (RA) configurations, respectively. In particular, our method has modest time and computational complexity, with only 101%/102%-104%/108% time increase, 0.13-0.93 kMACs/pixel, and only 164-1148 KB storage cost for a single model. Our solution may shed light on the journey of practical neural network-based coding tool evolution.
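The cache-then-interpolate idea described in this abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: `toy_filter` is a hypothetical stand-in for a trained network, the filter sees only one reference pixel, the inputs are sampled every 16 levels, and lookups use plain bilinear interpolation. The paper's actual indexing and interpolation scheme is not given in the listing.

```python
def build_lut(filter_fn, levels, step):
    """Cache filter_fn(center, neighbor) for every pair of sampled input values."""
    return [[filter_fn(i * step, j * step) for j in range(levels)]
            for i in range(levels)]


def lut_filter(lut, center, neighbor, step):
    """Approximate filter_fn(center, neighbor) by bilinear interpolation of cached values."""
    levels = len(lut)
    # Locate the grid cell; clamp so that i0 + 1 and j0 + 1 stay inside the table.
    i0 = min(int(center // step), levels - 2)
    j0 = min(int(neighbor // step), levels - 2)
    # Fractional position of the query inside the cell.
    ti = center / step - i0
    tj = neighbor / step - j0
    top = lut[i0][j0] * (1 - tj) + lut[i0][j0 + 1] * tj
    bottom = lut[i0 + 1][j0] * (1 - tj) + lut[i0 + 1][j0 + 1] * tj
    return top * (1 - ti) + bottom * ti


def toy_filter(center, neighbor):
    # Hypothetical stand-in for a trained network: a fixed weighted average.
    return 0.75 * center + 0.25 * neighbor


STEP = 16                   # assumed sampling interval of the input range
LEVELS = 256 // STEP + 1    # grid points 0, 16, ..., 256 cover 8-bit pixel values
LUT = build_lut(toy_filter, LEVELS, STEP)

filtered = lut_filter(LUT, 100, 37, STEP)
```

Because the toy filter is linear, the interpolated lookup reproduces it exactly; for a real nonlinear network the lookup would only approximate the cached function, with the sampling interval trading accuracy against table size.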
Submitted 15 July, 2024;
originally announced July 2024.
-
Parallel Ising Annealer via Gradient-based Hamiltonian Monte Carlo
Authors:
Hao Wang,
Zixuan Liu,
Zhixin Xie,
Langyu Li,
Zibo Miao,
Wei Cui,
Yu Pan
Abstract:
The Ising annealer is a promising quantum-inspired computing architecture for combinatorial optimization problems. In this paper, we introduce an Ising annealer based on Hamiltonian Monte Carlo, which updates the variables of all dimensions in parallel. The main innovation is the fusion of an approximate gradient-based approach into the Ising annealer, which introduces significant acceleration and allows a portable and scalable implementation on commercial FPGAs. Comprehensive simulation and hardware experiments show that the proposed Ising annealer has promising performance and scalability on all types of benchmark problems when compared to other Ising annealers, including state-of-the-art hardware. In particular, we have built a prototype annealer that solves Ising problems with both integer and fractional coefficients and up to 200 spins on a single low-cost FPGA board, whose performance is demonstrated to be better than that of the state-of-the-art quantum hardware D-Wave 2000Q and similar to that of the expensive coherent Ising machine. The sub-linear scalability of the annealer signifies its potential in solving challenging combinatorial optimization problems and evaluating the advantage of quantum hardware.
Submitted 14 July, 2024;
originally announced July 2024.
-
RAPiD-Seg: Range-Aware Pointwise Distance Distribution Networks for 3D LiDAR Segmentation
Authors:
Li Li,
Hubert P. H. Shum,
Toby P. Breckon
Abstract:
3D point clouds play a pivotal role in outdoor scene perception, especially in the context of autonomous driving. Recent advancements in 3D LiDAR segmentation often focus intensely on the spatial positioning and distribution of points for accurate segmentation. However, these methods, while robust in variable conditions, encounter challenges due to sole reliance on coordinates and point intensity, leading to poor isometric invariance and suboptimal segmentation. To tackle this challenge, our work introduces Range-Aware Pointwise Distance Distribution (RAPiD) features and the associated RAPiD-Seg architecture. Our RAPiD features exhibit rigid transformation invariance and effectively adapt to variations in point density, with a design focus on capturing the localized geometry of neighboring structures. They utilize inherent LiDAR isotropic radiation and semantic categorization for enhanced local representation and computational efficiency, while incorporating a 4D distance metric that integrates geometric and surface material reflectivity for improved semantic segmentation. To effectively embed high-dimensional RAPiD features, we propose a double-nested autoencoder structure with a novel class-aware embedding objective to encode high-dimensional features into manageable voxel-wise embeddings. Additionally, we propose RAPiD-Seg which incorporates a channel-wise attention fusion and two effective RAPiD-Seg variants, further optimizing the embedding for enhanced performance and generalization. Our method outperforms contemporary LiDAR segmentation work in terms of mIoU on SemanticKITTI (76.1) and nuScenes (83.6) datasets.
Submitted 14 July, 2024;
originally announced July 2024.
-
Optimizing ROI Benefits Vehicle ReID in ITS
Authors:
Mei Qiu,
Lauren Ann Christopher,
Lingxi Li,
Stanley Chien,
Yaobin Chen
Abstract:
Vehicle re-identification (ReID) is a computer vision task that matches the same vehicle across different cameras or viewpoints in a surveillance system. This is crucial for Intelligent Transportation Systems (ITS), where the effectiveness is influenced by the regions from which vehicle images are cropped. This study explores whether optimal vehicle detection regions, guided by detection confidence scores, can enhance feature matching and ReID tasks. Using our framework with multiple Regions of Interest (ROIs) and lane-wise vehicle counts, we employed YOLOv8 for detection and DeepSORT for tracking across twelve Indiana Highway videos, including two pairs of videos from non-overlapping cameras. Tracked vehicle images were cropped from inside and outside the ROIs at five-frame intervals. Features were extracted using pre-trained models: ResNet50, ResNeXt50, Vision Transformer, and Swin-Transformer. Feature consistency was assessed through cosine similarity, information entropy, and clustering variance. Results showed that features from images cropped inside ROIs had higher mean cosine similarity values compared to those involving one image inside and one outside the ROIs. The most significant difference was observed during night conditions (0.7842 inside vs. 0.5 outside the ROI with Swin-Transformer) and in cross-camera scenarios (0.75 inside-inside vs. 0.52 inside-outside the ROI with Vision Transformer). Information entropy and clustering variance further supported that features in ROIs are more consistent. These findings suggest that strategically selected ROIs can enhance tracking performance and ReID accuracy in ITS.
Submitted 13 July, 2024;
originally announced July 2024.
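The feature-consistency comparison described in the abstract above (mean cosine similarity of inside-inside versus inside-outside ROI pairs) can be illustrated with a minimal sketch. The function names and the two-dimensional toy feature vectors are hypothetical and are not taken from the paper; real features would come from a pre-trained backbone such as those the abstract lists.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_cosine_similarity(pairs):
    """Average cosine similarity over a list of (feature_a, feature_b) pairs."""
    sims = [cosine_similarity(a, b) for a, b in pairs]
    return sum(sims) / len(sims)


# Hypothetical toy features (assumption: not data from the paper).
inside_inside = [([1.0, 0.1], [0.9, 0.2])]    # both crops inside the ROI
inside_outside = [([1.0, 0.1], [0.2, 0.9])]   # one crop inside, one outside

print(mean_cosine_similarity(inside_inside))
print(mean_cosine_similarity(inside_outside))
```

A higher mean for the inside-inside group than for the inside-outside group would correspond to the kind of consistency gap the abstract reports.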
-
DisQ: A Markov Decision Process Based Language for Quantum Distributed Systems
Authors:
Le Chang,
Saitej Yavvari,
Rance Cleaveland,
Samik Basu,
Liyi Li
Abstract:
The development of quantum computers has reached a great milestone, in spite of restrictions on an important quantum resource: the number of qubits that can be entangled in a single-location quantum computer. Recently, there has been some work to combine single-location quantum computing and quantum networking techniques to develop distributed quantum systems such that large entangled qubit groups can be established through remote processors, and quantum algorithms can be executed distributively. We present DisQ as a framework to facilitate the rewrites of quantum algorithms to their distributed versions. The core of DisQ is a distributed quantum programming language that combines the concepts of Chemical Abstract Machine (CHAM) and Markov Decision Processes (MDP) with the objective of clearly distinguishing quantum concurrent and distributed behaviors. Based on the DisQ language, we develop a simulation relation for verifying the equivalence of a quantum algorithm and its distributed versions. We present several case studies, such as quantum addition and Shor's algorithm, to demonstrate their equivalent rewrites to distributed versions.
Submitted 12 July, 2024;
originally announced July 2024.
-
From Easy to Hard: Learning Curricular Shape-aware Features for Robust Panoptic Scene Graph Generation
Authors:
Hanrong Shi,
Lin Li,
Jun Xiao,
Yueting Zhuang,
Long Chen
Abstract:
Panoptic Scene Graph Generation (PSG) aims to generate a comprehensive graph-structure representation based on panoptic segmentation masks. Despite remarkable progress in PSG, almost all existing methods neglect the importance of shape-aware features, which inherently focus on the contours and boundaries of objects. To bridge this gap, we propose a model-agnostic Curricular shApe-aware FEature (CAFE) learning strategy for PSG. Specifically, we incorporate shape-aware features (i.e., mask features and boundary features) into PSG, moving beyond reliance solely on bbox features. Furthermore, drawing inspiration from human cognition, we propose to integrate shape-aware features in an easy-to-hard manner. To achieve this, we categorize the predicates into three groups based on cognition learning difficulty and correspondingly divide the training process into three stages. Each stage utilizes a specialized relation classifier to distinguish specific groups of predicates. As the learning difficulty of predicates increases, these classifiers are equipped with features of ascending complexity. We also incorporate knowledge distillation to retain knowledge acquired in earlier stages. Due to its model-agnostic nature, CAFE can be seamlessly incorporated into any PSG model. Extensive experiments and ablations on two PSG tasks under both robust and zero-shot PSG have attested to the superiority and robustness of our proposed CAFE, which outperforms existing state-of-the-art methods by a large margin.
Submitted 12 July, 2024;
originally announced July 2024.
-
Audio Spotforming Using Nonnegative Tensor Factorization with Attractor-Based Regularization
Authors:
Shoma Ayano,
Li Li,
Shogo Seki,
Daichi Kitamura
Abstract:
Spotforming is a target-speaker extraction technique that uses multiple microphone arrays. This method applies beamforming (BF) to each microphone array, and the common components among the BF outputs are estimated as the target source. This study proposes a new common component extraction method based on nonnegative tensor factorization (NTF) for higher model interpretability and more robust spotforming against hyperparameters. Moreover, attractor-based regularization was introduced to facilitate the automatic selection of optimal target bases in the NTF. Experimental results show that the proposed method performs better than conventional methods in spotforming performance and also shows some characteristics suitable for practical use.
Submitted 11 July, 2024;
originally announced July 2024.
-
Unifying 3D Representation and Control of Diverse Robots with a Single Camera
Authors:
Sizhe Lester Li,
Annan Zhang,
Boyuan Chen,
Hanna Matusik,
Chao Liu,
Daniela Rus,
Vincent Sitzmann
Abstract:
Mirroring the complex structures and diverse functions of natural organisms is a long-standing challenge in robotics. Modern fabrication techniques have dramatically expanded feasible hardware, yet deploying these systems requires control software to translate desired motions into actuator commands. While conventional robots can easily be modeled as rigid links connected via joints, it remains an open challenge to model and control bio-inspired robots that are often multi-material or soft, lack sensing capabilities, and may change their material properties with use. Here, we introduce Neural Jacobian Fields, an architecture that autonomously learns to model and control robots from vision alone. Our approach makes no assumptions about the robot's materials, actuation, or sensing, requires only a single camera for control, and learns to control the robot without expert intervention by observing the execution of random commands. We demonstrate our method on a diverse set of robot manipulators, varying in actuation, materials, fabrication, and cost. Our approach achieves accurate closed-loop control and recovers the causal dynamic structure of each robot. By enabling robot control with a generic camera as the only sensor, we anticipate our work will dramatically broaden the design space of robotic systems and serve as a starting point for lowering the barrier to robotic automation.
Submitted 11 July, 2024;
originally announced July 2024.
-
AutoBencher: Creating Salient, Novel, Difficult Datasets for Language Models
Authors:
Xiang Lisa Li,
Evan Zheran Liu,
Percy Liang,
Tatsunori Hashimoto
Abstract:
Evaluation is critical for assessing capabilities, tracking scientific progress, and informing model selection. In this paper, we present three desiderata for a good benchmark for language models: (i) salience (e.g., knowledge about World War II is more salient than a random day in history), (ii) novelty (i.e., the benchmark reveals new trends in model rankings not shown by previous benchmarks), and (iii) difficulty (i.e., the benchmark should be difficult for existing models, leaving headroom for future improvement). We operationalize these three desiderata and cast benchmark creation as a search problem, that of finding benchmarks that satisfy all three desiderata. To tackle this search problem, we present AutoBencher, which uses a language model to automatically search for datasets that meet the three desiderata. AutoBencher uses privileged information (e.g., relevant documents) to construct reliable datasets, and adaptivity with reranking to optimize for the search objective. We use AutoBencher to create datasets for math, multilingual, and knowledge-intensive question answering. The scalability of AutoBencher allows it to test fine-grained categories and tail knowledge, creating datasets that are on average 27% more novel and 22% more difficult than existing benchmarks. A closer investigation of our constructed datasets shows that we can identify specific gaps in LM knowledge that are not captured by existing benchmarks, such as Gemini Pro performing much worse on question answering about the Permian Extinction and Fordism, while OpenAGI-7B performs surprisingly well on QA about COVID-19.
Submitted 11 July, 2024;
originally announced July 2024.
-
Controlling Space and Time with Diffusion Models
Authors:
Daniel Watson,
Saurabh Saxena,
Lala Li,
Andrea Tagliasacchi,
David J. Fleet
Abstract:
We present 4DiM, a cascaded diffusion model for 4D novel view synthesis (NVS), conditioned on one or more images of a general scene, and a set of camera poses and timestamps. To overcome challenges due to limited availability of 4D training data, we advocate joint training on 3D (with camera pose), 4D (pose+time) and video (time but no pose) data and propose a new architecture that enables the same. We further advocate the calibration of SfM posed data using monocular metric depth estimators for metric scale camera control. For model evaluation, we introduce new metrics to enrich and overcome shortcomings of current evaluation schemes, demonstrating state-of-the-art results in both fidelity and pose control compared to existing diffusion models for 3D NVS, while at the same time adding the ability to handle temporal dynamics. 4DiM is also used for improved panorama stitching, pose-conditioned video to video translation, and several other tasks. For an overview see https://4d-diffusion.github.io
Submitted 10 July, 2024;
originally announced July 2024.
-
Study on Aspect Ratio Variability toward Robustness of Vision Transformer-based Vehicle Re-identification
Authors:
Mei Qiu,
Lauren Christopher,
Lingxi Li
Abstract:
Vision Transformers (ViTs) have excelled in vehicle re-identification (ReID) tasks. However, non-square aspect ratios of image or video input might significantly affect the re-identification performance. To address this issue, we propose a novel ViT-based ReID framework in this paper, which fuses models trained on a variety of aspect ratios. Our main contributions are threefold: (i) We analyze aspect ratio performance on VeRi-776 and VehicleID datasets, guiding input settings based on aspect ratios of original images. (ii) We introduce patch-wise mixup intra-image during ViT patchification (guided by spatial attention scores) and implement uneven stride for better object aspect ratio matching. (iii) We propose a dynamic feature fusing ReID network, enhancing model robustness. Our ReID method achieves a significantly improved mean Average Precision (mAP) of 91.0% compared to the closest state-of-the-art (CAL) result of 80.9% on the VehicleID dataset.
Submitted 10 July, 2024;
originally announced July 2024.
-
Transmission Design for XL-RIS-Aided Massive MIMO System with Visibility Regions
Authors:
Luchu Li,
Kangda Zhi,
Cunhua Pan
Abstract:
This paper proposes a two-timescale transmission scheme for extremely large-scale (XL)-reconfigurable intelligent surfaces (RIS)-aided massive multi-input multi-output (MIMO) systems considering visibility regions (VRs). The beamforming of base stations (BS) is designed based on rapidly changing instantaneous channel state information (CSI), while the phase shifts of RIS are configured based on slowly changing statistical CSI. Specifically, we first formulate a system model with spatially correlated Rician fading channels and introduce the concept of VRs. Then, we derive a closed-form approximate expression for the achievable rate applicable to any number of BS antennas and RIS elements, and analyze the impact of VRs on system performance and complexity. Next, we solve the problem of maximizing the minimum user rate by optimizing the phase shifts of RIS through an algorithm based on accelerated gradient ascent. Finally, we present numerical results to demonstrate the performance of the gradient algorithm from different aspects and reveal the low system complexity of deploying XL-RIS in massive MIMO systems with the help of VRs.
Submitted 17 May, 2024;
originally announced July 2024.
-
Metasurface-based Snapshot Shortwave-Infrared Hyperspectral Image Reconstruction with Inter and Intra Prior Learning Network
Authors:
Linqiang Li,
Jinglei Hao,
Yongqiang Zhao,
Pan Liu,
Haofang Yan,
Ziqin Zhang,
Seong G. Kong
Abstract:
Shortwave-infrared (SWIR) spectral information, ranging from 1 μm to 2.5 μm, breaks the limitations of traditional color cameras in acquiring scene information and has been used in many fields. However, conventional SWIR hyperspectral imaging systems face challenges due to their bulky setups and low acquisition speed. In this work, we introduce a snapshot SWIR hyperspectral imaging system based on a metasurface filter and a corresponding filter selection method to achieve the lowest correlation coefficient among these filters. This system has the advantages of small size and snapshot imaging. We propose a novel inter and intra prior learning unfolding framework to achieve high-quality SWIR hyperspectral image reconstruction, which bridges the gap between prior learning and cross-stage information interaction. We also design an adaptive feature transfer mechanism to adaptively transfer the contextual correlation of multi-scale encoder features to prevent detailed information loss in the decoder. Experimental results demonstrate that our method can reconstruct HSI with high speed and superior performance over existing methods.
Submitted 10 July, 2024; v1 submitted 10 July, 2024;
originally announced July 2024.
-
Integrated Sensing and Communications for Resource Allocation in Non-Terrestrial Networks
Authors:
Israel Leyva-Mayorga,
Fabio Saggese,
Lintao Li,
Petar Popovski
Abstract:
The integration of Non-Terrestrial Networks (NTNs) with Low Earth Orbit (LEO) satellite constellations into 5G and Beyond is essential to achieve truly global connectivity. A distinctive characteristic of LEO mega-constellations is that they constitute a global infrastructure with predictable dynamics, which enables the pre-planned allocation of the radio resources. However, the different bands that can be used for ground-to-satellite communication are affected differently by atmospheric conditions such as precipitation, which introduces uncertainty on the attenuation of the communication links at high frequencies. Based on this, we present a compelling case for applying integrated sensing and communications (ISAC) in heterogeneous and multi-layer LEO satellite constellations over wide areas. Specifically, we present an ISAC framework and frame structure to accurately estimate the attenuation in the communication links due to precipitation, with the aim of finding the optimal serving satellites and resource allocation for downlink communication with users on ground. The results show that, by dedicating an adequate amount of resources for sensing and solving the association and resource allocation problems jointly, it is feasible to increase the average throughput by 59% and the fairness by 600% when compared to solving these problems separately.
Submitted 9 July, 2024;
originally announced July 2024.
-
TVR-Ranking: A Dataset for Ranked Video Moment Retrieval with Imprecise Queries
Authors:
Renjie Liang,
Li Li,
Chongzhi Zhang,
Jing Wang,
Xizhou Zhu,
Aixin Sun
Abstract:
In this paper, we propose the task of Ranked Video Moment Retrieval (RVMR) to locate a ranked list of matching moments from a collection of videos, through queries in natural language. Although a few related tasks have been proposed and studied by CV, NLP, and IR communities, RVMR is the task that best reflects the practical setting of moment search. To facilitate research in RVMR, we develop the TVR-Ranking dataset, based on the raw videos and existing moment annotations provided in the TVR dataset. Our key contribution is the manual annotation of relevance levels for 94,442 query-moment pairs. We then develop the $NDCG@K, IoU \geq \mu$ evaluation metric for this new task and conduct experiments to evaluate three baseline models. Our experiments show that the new RVMR task brings new challenges to existing models and we believe this new dataset contributes to the research on multi-modality search. The dataset is available at https://github.com/Ranking-VMR/TVR-Ranking
Submitted 9 July, 2024;
originally announced July 2024.
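The abstract above names an NDCG@K metric combined with an IoU threshold but does not define it. The sketch below shows one plausible reading, and every detail in it is an assumption rather than the authors' definition: a predicted moment earns the annotated relevance of a ground-truth moment only if the two come from the same video and their temporal IoU is at least mu, each ground-truth moment can be matched once, and gains are linear in relevance. The function names and tuple layouts are hypothetical.

```python
import math


def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(predictions, ground_truth, k, mu):
    """NDCG@K where a prediction scores only if its IoU with a match is >= mu.

    predictions: ranked list of (video_id, start, end)
    ground_truth: list of (video_id, start, end, relevance)
    """
    used = set()   # ground-truth moments already credited (assumption)
    gains = []
    for vid, start, end in predictions[:k]:
        best_gain, best_idx = 0.0, None
        for idx, (g_vid, g_start, g_end, rel) in enumerate(ground_truth):
            if idx in used or g_vid != vid:
                continue
            iou = temporal_iou((start, end), (g_start, g_end))
            if iou >= mu and rel > best_gain:
                best_gain, best_idx = rel, idx
        if best_idx is not None:
            used.add(best_idx)
        gains.append(best_gain)
    # Ideal ranking: the annotated relevances sorted from highest to lowest.
    ideal = sorted((g[3] for g in ground_truth), reverse=True)[:k]
    idcg = dcg(ideal)
    return dcg(gains) / idcg if idcg > 0 else 0.0
```

Under these assumptions, raising mu makes the metric stricter about localization, while K controls how deep into the ranked list predictions are rewarded.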
-
LLM for Mobile: An Initial Roadmap
Authors:
Daihang Chen,
Yonghui Liu,
Mingyi Zhou,
Yanjie Zhao,
Haoyu Wang,
Shuai Wang,
Xiao Chen,
Tegawendé F. Bissyandé,
Jacques Klein,
Li Li
Abstract:
When mobile meets LLMs, mobile app users deserve to have more intelligent usage experiences. For this to happen, we argue that there is a strong need to apply LLMs for the mobile ecosystem. We therefore provide a research roadmap for guiding our fellow researchers to achieve that as a whole. In this roadmap, we sum up six directions that we believe are urgently required for research to enable native intelligence in mobile devices. In each direction, we further summarize the current research progress and the gaps that still need to be filled by our fellow researchers.
Submitted 9 July, 2024;
originally announced July 2024.
-
General and Task-Oriented Video Segmentation
Authors:
Mu Chen,
Liulei Li,
Wenguan Wang,
Ruijie Quan,
Yi Yang
Abstract:
We present GvSeg, a general video segmentation framework for addressing four different video segmentation tasks (i.e., instance, semantic, panoptic, and exemplar-guided) while maintaining an identical architectural design. Currently, there is a trend towards developing general video segmentation solutions that can be applied across multiple tasks. This streamlines research endeavors and simplifies deployment. However, such a highly homogenized framework in current design, where each element maintains uniformity, could overlook the inherent diversity among different tasks and lead to suboptimal performance. To tackle this, GvSeg: i) provides a holistic disentanglement and modeling for segment targets, thoroughly examining them from the perspective of appearance, position, and shape, and on this basis, ii) reformulates the query initialization, matching and sampling strategies in alignment with the task-specific requirement. These architecture-agnostic innovations empower GvSeg to effectively address each unique task by accommodating the specific properties that characterize them. Extensive experiments on seven gold-standard benchmark datasets demonstrate that GvSeg surpasses all existing specialized/general solutions by a significant margin on four different video segmentation tasks.
Submitted 9 July, 2024;
originally announced July 2024.
-
C2C: Component-to-Composition Learning for Zero-Shot Compositional Action Recognition
Authors:
Rongchang Li,
Zhenhua Feng,
Tianyang Xu,
Linze Li,
Xiao-Jun Wu,
Muhammad Awais,
Sara Atito,
Josef Kittler
Abstract:
Compositional actions consist of dynamic (verbs) and static (objects) concepts. Humans can easily recognize unseen compositions using the learned concepts. For machines, solving such a problem requires a model to recognize unseen actions composed of previously observed verbs and objects, which in turn requires so-called compositional generalization ability. To facilitate this research, we propose a novel Zero-Shot Compositional Action Recognition (ZS-CAR) task. For evaluating the task, we construct a new benchmark, Something-composition (Sth-com), based on the widely used Something-Something V2 dataset. We also propose a novel Component-to-Composition (C2C) learning method to solve the new ZS-CAR task. C2C includes an independent component learning module and a composition inference module. Lastly, we devise an enhanced training strategy to address the challenges of component variation between seen and unseen compositions and to handle the subtle balance between learning seen and unseen actions. The experimental results demonstrate that the proposed framework significantly surpasses the existing compositional generalization methods and sets a new state-of-the-art. The new Sth-com benchmark and code are available at https://github.com/RongchangLi/ZSCAR_C2C.
Submitted 8 July, 2024;
originally announced July 2024.
-
Few-Shot Keyword Spotting from Mixed Speech
Authors:
Junming Yuan,
Ying Shi,
LanTian Li,
Dong Wang,
Askar Hamdulla
Abstract:
Few-shot keyword spotting (KWS) aims to detect unknown keywords with limited training samples. A commonly used approach is the pre-training and fine-tuning framework. While effective in clean conditions, this approach struggles with mixed keyword spotting -- simultaneously detecting multiple keywords blended in an utterance, which is crucial in real-world applications. Previous research has proposed a Mix-Training (MT) approach to solve the problem; however, it has never been tested in the few-shot scenario. In this paper, we investigate the possibility of using MT and other relevant methods to solve the two practical challenges together: few-shot and mixed speech. Experiments conducted on the LibriSpeech and Google Speech Command corpora demonstrate that MT is highly effective on this task when employed in either the pre-training phase or the fine-tuning phase. Moreover, combining SSL-based large-scale pre-training (HuBert) and MT fine-tuning yields very strong results in all the test conditions.
Submitted 4 July, 2024;
originally announced July 2024.
-
LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages
Authors:
Yinquan Lu,
Wenhao Zhu,
Lei Li,
Yu Qiao,
Fei Yuan
Abstract:
Large Language Models (LLMs) demonstrate remarkable translation capabilities in high-resource language tasks, yet their performance in low-resource languages is hindered by insufficient multilingual data during pre-training. To address this, we dedicate 35,000 A100-SXM4-80GB GPU hours to conducting extensive multilingual continual pre-training on the LLaMA series models, enabling translation support across more than 100 languages. Through a comprehensive analysis of training strategies, such as vocabulary expansion and data augmentation, we develop LLaMAX. Remarkably, without sacrificing its generalization ability, LLaMAX achieves significantly higher translation performance compared to existing open-source LLMs (by more than 10 spBLEU points) and performs on par with the specialized translation model (M2M-100-12B) on the Flores-101 benchmark. Extensive experiments indicate that LLaMAX can serve as a robust multilingual foundation model. The code (https://github.com/CONE-MT/LLaMAX/) and models (https://huggingface.co/LLaMAX/) are publicly available.
Submitted 8 July, 2024;
originally announced July 2024.
-
Explainable Image Recognition via Enhanced Slot-attention Based Classifier
Authors:
Bowen Wang,
Liangzhi Li,
Jiahao Zhang,
Yuta Nakashima,
Hajime Nagahara
Abstract:
Comprehending the behaviors of deep learning models is of utmost importance. In this realm, Explainable Artificial Intelligence (XAI) has emerged as a promising avenue, garnering increasing interest in recent years. Despite this, most existing methods primarily depend on gradients or input perturbation, which often fail to embed explanations directly within the model's decision-making process. Addressing this gap, we introduce ESCOUTER, a visually explainable classifier based on a modified slot attention mechanism. ESCOUTER distinguishes itself by not only delivering high classification accuracy but also offering more transparent insights into the reasoning behind its decisions. It differs from prior approaches in two significant aspects: (a) ESCOUTER incorporates explanations into the final confidence scores for each category, providing a more intuitive interpretation, and (b) it offers positive or negative explanations for all categories, elucidating "why an image belongs to a certain category" or "why it does not." A novel loss function specifically for ESCOUTER is designed to fine-tune the model's behavior, enabling it to toggle between positive and negative explanations. Moreover, an area loss is designed to adjust the size of the explanatory regions for a more precise explanation. Our method, rigorously tested across various datasets and XAI metrics, outperformed previous state-of-the-art methods, solidifying its effectiveness as an explanatory tool.
Submitted 8 July, 2024;
originally announced July 2024.
-
PANS: Probabilistic Airway Navigation System for Real-time Robust Bronchoscope Localization
Authors:
Qingyao Tian,
Zhen Chen,
Huai Liao,
Xinyan Huang,
Bingyu Yang,
Lujie Li,
Hongbin Liu
Abstract:
Accurate bronchoscope localization is essential for pulmonary interventions, by providing six degrees of freedom (DOF) in airway navigation. However, the robustness of current vision-based methods is often compromised in clinical practice, and they struggle to perform in real-time and to generalize across cases unseen during training. To overcome these challenges, we propose a novel Probabilistic Airway Navigation System (PANS), leveraging a Monte-Carlo method with pose hypotheses and likelihoods to achieve robust and real-time bronchoscope localization. Specifically, our PANS incorporates diverse visual representations (e.g., odometry and landmarks) by leveraging two key modules: the Depth-based Motion Inference (DMI) and the Bronchial Semantic Analysis (BSA). To generate the pose hypotheses of the bronchoscope for PANS, we devise the DMI to accurately propagate the estimation of pose hypotheses over time. Moreover, to estimate an accurate pose likelihood, we devise the BSA module, which effectively distinguishes between similar bronchial regions in endoscopic images, along with a novel metric to assess the congruence between estimated depth maps and the segmented airway structure. Under this probabilistic formulation, our PANS is capable of achieving 6-DOF bronchoscope localization with superior accuracy and robustness. Extensive experiments on the collected pulmonary intervention dataset comprising 10 clinical cases confirm the advantage of our PANS over state-of-the-art methods, in terms of both robustness and generalization in localizing deeper airway branches and the efficiency of real-time inference. The proposed PANS reveals its potential to be a reliable tool in the operating room, promising to enhance the quality and safety of pulmonary interventions.
Submitted 7 July, 2024;
originally announced July 2024.
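Illustrative sketch (not from the paper): the abstract describes a Monte-Carlo method that keeps a set of pose hypotheses, propagates them over time, and scores them with a likelihood. The generic predict-weight-resample loop behind that idea can be shown on a 1-D toy pose. Everything specific here is an assumption made for illustration: the scalar pose, the Gaussian likelihood, the noise levels, and the function names. In PANS the propagation comes from the DMI module and the likelihood from the BSA module, over 6-DOF poses.

```python
import math
import random


def particle_filter_step(particles, motion, observation,
                         motion_noise=0.1, obs_noise=0.5, rng=random):
    """One predict-weight-resample step of a generic Monte Carlo localizer."""
    # Predict: move every pose hypothesis by the motion estimate plus noise.
    predicted = [p + motion + rng.gauss(0.0, motion_noise) for p in particles]
    # Weight: Gaussian likelihood of the observation under each hypothesis
    # (an assumed stand-in for a learned, image-based likelihood).
    weights = [math.exp(-0.5 * ((observation - p) / obs_noise) ** 2)
               for p in predicted]
    if sum(weights) == 0.0:
        # Every hypothesis is implausible; fall back to uniform weights.
        weights = [1.0] * len(predicted)
    # Resample: draw hypotheses in proportion to their likelihood.
    return rng.choices(predicted, weights=weights, k=len(predicted))


def estimate(particles):
    """Point estimate of the pose: the mean of all hypotheses."""
    return sum(particles) / len(particles)
```

Resampling in proportion to the weights is what lets the hypothesis set recover after a poor initial guess: implausible poses die out and plausible ones are duplicated.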
-
Forest2Seq: Revitalizing Order Prior for Sequential Indoor Scene Synthesis
Authors:
Qi Sun,
Hang Zhou,
Wengang Zhou,
Li Li,
Houqiang Li
Abstract:
Synthesizing realistic 3D indoor scenes is a challenging task that traditionally relies on manual arrangement and annotation by expert designers. Recent advances in autoregressive models have automated this process, but they often lack semantic understanding of the relationships and hierarchies present in real-world scenes, yielding limited performance. In this paper, we propose Forest2Seq, a framework that formulates indoor scene synthesis as an order-aware sequential learning problem. Forest2Seq organizes the inherently unordered collection of scene objects into structured, ordered hierarchical scene trees and forests. By employing a clustering-based algorithm and a breadth-first traversal, Forest2Seq derives meaningful orderings and utilizes a transformer to generate realistic 3D scenes autoregressively. Experimental results on standard benchmarks demonstrate Forest2Seq's superiority in synthesizing more realistic scenes compared to top-performing baselines, with significant improvements in FID and KL scores. Our additional experiments for downstream tasks and ablation studies also confirm the importance of incorporating order as a prior in 3D scene generation.
Submitted 7 July, 2024;
originally announced July 2024.
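Illustrative sketch (not from the paper): the abstract says that scene objects are organized into trees and forests and that a breadth-first traversal derives the ordering fed to the transformer. The traversal step alone can be sketched as follows. The hierarchy is hand-written and hypothetical; in Forest2Seq it comes from a clustering-based algorithm that is not reproduced here.

```python
from collections import deque


def breadth_first_order(roots, children):
    """Flatten a forest into a sequence, visiting shallower objects first."""
    order = []
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        order.append(node)
        # Children are queued behind everything already waiting at this depth.
        queue.extend(children.get(node, []))
    return order


# Hypothetical bedroom: two trees, each rooted at a large anchor object.
children = {
    "bed": ["nightstand_left", "nightstand_right"],
    "nightstand_left": ["lamp"],
    "desk": ["chair"],
}
sequence = breadth_first_order(["bed", "desk"], children)
# sequence == ["bed", "desk", "nightstand_left", "nightstand_right", "chair", "lamp"]
```

With this ordering, large anchor objects are generated before the smaller objects that depend on them, which is the kind of order prior the abstract refers to.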
-
Crowdsourced reviews reveal substantial disparities in public perceptions of parking
Authors:
Lingyao Li,
Songhua Hu,
Ly Dinh,
Libby Hemphill
Abstract:
Due to increased reliance on private vehicles and growing travel demand, parking remains a longstanding urban challenge globally. Quantifying parking perceptions is paramount as it enables decision-makers to identify problematic areas and make informed decisions on parking management. This study introduces a cost-effective and widely accessible data source, crowdsourced online reviews, to investigate public perceptions of parking across the U.S. Specifically, we examine 4,987,483 parking-related reviews for 1,129,460 points of interest (POIs) across 911 core-based statistical areas (CBSAs) sourced from Google Maps. We employ the Bidirectional Encoder Representations from Transformers (BERT) model to classify the parking sentiment and conduct regression analyses to explore its relationships with socio-spatial factors. Findings reveal significant variations in parking sentiment across POI types and CBSAs, with Restaurant POIs showing the most negative sentiment. Regression results further indicate that denser urban areas with higher proportions of African Americans and Hispanics and lower socioeconomic status are more likely to exhibit negative parking sentiment. Interestingly, an inverse relationship between parking supply and sentiment is observed, indicating that increasing supply does not necessarily improve parking experiences. Finally, our textual analysis identifies keywords associated with positive or negative sentiments and highlights disparities between urban and rural areas. Overall, this study demonstrates the potential of a novel data source and methodological framework in measuring parking sentiment, offering valuable insights that help identify hyperlocal parking issues and guide targeted parking management strategies.
Submitted 6 July, 2024;
originally announced July 2024.
-
Computational Graph Representation of Equations System Constructors in Hierarchical Circuit Simulation
Authors:
Zichao Long,
Lin Li,
Lei Han,
Xianglong Meng,
Chongjun Ding,
Ruiyan Li,
Wu Jiang,
Fuchen Ding,
Jiaqing Yue,
Zhichao Li,
Yisheng Hu,
Ding Li,
Heng Liao
Abstract:
Equations system constructors of hierarchical circuits play a central role in device modeling, nonlinear equations solving, and circuit design automation. However, existing constructors present limitations in applications to different extents. For example, the costs of developing and reusing device models -- especially coarse-grained equivalent models of circuit modules -- remain high while parameter sensitivity analysis is complex and inefficient. Inspired by differentiable programming and leveraging the ecosystem benefits of open-source software, we propose an equations system constructor using the computational graph representation, along with its JSON format netlist, to address these limitations. This representation allows for runtime dependencies between signals and subcircuit/device parameters. The proposed method streamlines the model development process and facilitates end-to-end computation of gradients of equations remainders with respect to parameters. This paper discusses in detail the overarching concept of hierarchical subcircuit/device decomposition and nested invocation by drawing parallels to functions in programming languages, and introduces rules for parameters passing and gradient propagation across hierarchical circuit modules. The presented numerical examples, including (1) an uncoupled CMOS model representation using "equivalent circuit decomposition+dynamic parameters" and (2) operational amplifier (OpAmp) auto device sizing, have demonstrated that the proposed method supports circuit simulation and design and particularly subcircuit modeling with improved efficiency, simplicity, and decoupling compared to existing techniques.
Submitted 4 July, 2024;
originally announced July 2024.
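Illustrative sketch (not from the paper): the abstract mentions end-to-end gradients of equation remainders with respect to parameters, and compares nested subcircuits to function calls. A tiny way to see both ideas is forward-mode differentiation with dual numbers on a resistor divider. Forward mode is swapped in here for brevity and is not claimed to be the paper's computational-graph scheme; the `Dual` class, the helper names, and the component values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A value paired with its derivative with respect to one chosen parameter."""
    value: float
    grad: float = 0.0

    def __add__(self, other):
        return Dual(self.value + other.value, self.grad + other.grad)

    def __sub__(self, other):
        return Dual(self.value - other.value, self.grad - other.grad)

    def __truediv__(self, other):
        quotient_grad = (self.grad * other.value
                         - self.value * other.grad) / other.value ** 2
        return Dual(self.value / other.value, quotient_grad)


def _lift(x):
    """Wrap plain numbers as constants (zero derivative)."""
    return x if isinstance(x, Dual) else Dual(float(x))


def resistor_current(v_a, v_b, resistance):
    """Device model: current flowing from node a to node b (Ohm's law)."""
    return (v_a - v_b) / resistance


def divider_residual(v_mid, v_in, r1, r2):
    """Subcircuit: Kirchhoff current-law remainder at the divider midpoint."""
    v_mid, v_in, r1, r2 = map(_lift, (v_mid, v_in, r1, r2))
    return (resistor_current(v_mid, v_in, r1)
            + resistor_current(v_mid, Dual(0.0), r2))


# Seed r1 with derivative 1 to obtain d(residual)/d(r1) in the same pass.
residual = divider_residual(2.0, 5.0, Dual(1000.0, 1.0), 2000.0)
```

The subcircuit calls the device model like an ordinary function, and the derivative propagates through that call without any hand-written sensitivity formula.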
-
Serialized Output Training by Learned Dominance
Authors:
Ying Shi,
Lantian Li,
Shi Yin,
Dong Wang,
Jiqing Han
Abstract:
Serialized Output Training (SOT) has showcased state-of-the-art performance in multi-talker speech recognition by sequentially decoding the speech of individual speakers. To address the challenging label-permutation issue, prior methods have relied on either the Permutation Invariant Training (PIT) or the time-based First-In-First-Out (FIFO) rule. This study presents a model-based serialization strategy that incorporates an auxiliary module into the Attention Encoder-Decoder architecture, autonomously identifying the crucial factors to order the output sequence of the speech components in multi-talker speech. Experiments conducted on the LibriSpeech and LibriMix databases reveal that our approach significantly outperforms the PIT and FIFO baselines in both 2-mix and 3-mix scenarios. Further analysis shows that the serialization module identifies dominant speech components in a mixture by factors including loudness and gender, and orders speech components based on the dominance score.
Submitted 4 July, 2024;
originally announced July 2024.
-
Unifying quantum spatial search, state transfer and uniform sampling on graphs: simple and exact
Authors:
Qingwen Wang,
Ying Jiang,
Lvzhou Li
Abstract:
This article presents a novel and succinct algorithmic framework via alternating quantum walks, unifying quantum spatial search, state transfer and uniform sampling on a large class of graphs. Using the framework, we can achieve exact uniform sampling over all vertices and perfect state transfer between any two vertices, provided that the eigenvalues of the Laplacian matrix of the graph are all integers. Furthermore, if the graph is vertex-transitive as well, then we can achieve deterministic quantum spatial search that finds a marked vertex with certainty. In contrast, existing quantum search algorithms generally have a certain probability of failure. Even if the graph is not vertex-transitive, such as the complete bipartite graph, we can still adjust the algorithmic framework to obtain deterministic spatial search, which shows its flexibility. Besides unifying and improving many previous results, our work provides new results on more graphs. The approach is easy to use since it has a succinct formalism that depends only on the depth of the Laplacian eigenvalue set of the graph, and may shed light on the solution of more problems related to graphs.
Submitted 1 July, 2024;
originally announced July 2024.
-
Certainly Uncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness
Authors:
Khyathi Raghavi Chandu,
Linjie Li,
Anas Awadalla,
Ximing Lu,
Jae Sung Park,
Jack Hessel,
Lijuan Wang,
Yejin Choi
Abstract:
The ability to acknowledge the inevitable uncertainty in their knowledge and reasoning is a prerequisite for AI systems to be truly truthful and reliable. In this paper, we present a taxonomy of uncertainty specific to vision-language AI systems, distinguishing between epistemic uncertainty (arising from a lack of information) and aleatoric uncertainty (due to inherent unpredictability), and further explore finer categories within each. Based on this taxonomy, we synthesize a benchmark dataset, CertainlyUncertain, featuring 178K visual question answering (VQA) samples as contrastive pairs. This is achieved by 1) inpainting images to turn previously answerable questions into unanswerable ones; and 2) using image captions to prompt large language models for both answerable and unanswerable questions. Additionally, we introduce a new metric, confidence-weighted accuracy, that is well correlated with both accuracy and calibration error, to address the shortcomings of existing metrics.
Submitted 2 July, 2024;
originally announced July 2024.
-
Fast and Efficient: Mask Neural Fields for 3D Scene Segmentation
Authors:
Zihan Gao,
Lingling Li,
Licheng Jiao,
Fang Liu,
Xu Liu,
Wenping Ma,
Yuwei Guo,
Shuyuan Yang
Abstract:
Understanding 3D scenes is a crucial challenge in computer vision research with applications spanning multiple domains. Recent advancements in distilling 2D vision-language foundation models into neural fields, like NeRF and 3DGS, enable open-vocabulary segmentation of 3D scenes from 2D multi-view images without the need for precise 3D annotations. While effective, the per-pixel distillation of high-dimensional CLIP features introduces ambiguity and necessitates complex regularization strategies, adding inefficiencies during training. This paper presents MaskField, which enables fast and efficient 3D open-vocabulary segmentation with neural fields under weak supervision. Unlike previous methods, MaskField distills masks rather than dense high-dimensional CLIP features. MaskField employs neural fields as binary mask generators and supervises them with masks generated by SAM and classified by coarse CLIP features. MaskField overcomes ambiguous object boundaries by naturally introducing SAM-segmented object shapes without extra regularization during training. By circumventing the direct handling of high-dimensional CLIP features during training, MaskField is particularly compatible with explicit scene representations like 3DGS. Our extensive experiments show that MaskField not only surpasses prior state-of-the-art methods but also achieves remarkably fast convergence, outperforming previous methods with just 5 minutes of training. We hope that MaskField will inspire further exploration into how neural fields can be trained to comprehend 3D scenes from 2D models.
Submitted 1 July, 2024;
originally announced July 2024.
-
FedEx: Expediting Federated Learning over Heterogeneous Mobile Devices by Overlapping and Participant Selection
Authors:
Jiaxiang Geng,
Boyu Li,
Xiaoqi Qin,
Yixuan Li,
Liang Li,
Yanzhao Hou,
Miao Pan
Abstract:
Training latency is critical for the success of the many intriguing applications sparked by federated learning (FL) over heterogeneous mobile devices. By overlapping local gradient transmission with continuous local computing, FL can remarkably reduce its training latency over homogeneous clients, yet it encounters severe model staleness, model drift, memory cost and straggler issues in heterogeneous environments. To unleash the full potential of overlapping, we propose FedEx, a novel federated learning approach to expedite FL training over mobile devices under data, computing and wireless heterogeneity. FedEx redefines the overlapping procedure with staleness ceilings to constrain memory consumption and make overlapping compatible with participant selection (PS) designs. Then, FedEx characterizes the PS utility function by considering the latency reduced by overlapping, and provides a holistic PS solution to address the straggler issue. FedEx also introduces a simple but effective metric to trigger overlapping, in order to avoid model drift. Experimental results show that, compared with its peer designs, FedEx demonstrates substantial reductions in FL training latency over heterogeneous mobile devices with limited memory cost.
Submitted 2 July, 2024; v1 submitted 30 June, 2024;
originally announced July 2024.
-
FoldGPT: Simple and Effective Large Language Model Compression Scheme
Authors:
Songwei Liu,
Chao Zeng,
Lianqiang Li,
Chenqian Yan,
Lean Fu,
Xing Mei,
Fangmin Chen
Abstract:
The demand for deploying large language models (LLMs) on mobile devices continues to increase, driven by escalating data security concerns and cloud costs. However, network bandwidth and memory limitations pose challenges for deploying billion-level models on mobile devices. In this study, we investigate the outputs of different layers across various scales of LLMs and find that the outputs of most layers exhibit significant similarity. Moreover, this similarity becomes more pronounced as the model size increases, indicating substantial redundancy in the depth direction of the LLMs. Based on this observation, we propose an efficient model volume compression strategy, termed FoldGPT, which combines block removal and block parameter sharing. This strategy consists of three parts: (1) Based on the learnable gating parameters, we determine the block importance ranking while modeling the coupling effect between blocks. Then we delete some redundant layers based on the given removal rate. (2) For the retained blocks, we apply a specially designed group parameter sharing strategy, where blocks within the same group share identical weights, significantly compressing the number of parameters and slightly reducing latency overhead. (3) After sharing these blocks, we "cure" the mismatch caused by sparsity with a minor amount of fine-tuning and introduce a tail-layer distillation strategy to improve the performance. Experiments demonstrate that FoldGPT outperforms previous state-of-the-art (SOTA) methods in efficient model compression, demonstrating the feasibility of achieving model lightweighting through straightforward block removal and parameter sharing.
Submitted 30 June, 2024;
originally announced July 2024.
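Illustrative sketch (not from the paper): the first two parts of the strategy in the abstract above, removing the least important blocks at a given removal rate and then letting the retained blocks share weights within groups, can be shown with plain Python lists. The importance scores below are made-up numbers; in FoldGPT they come from learnable gating parameters that also model coupling between blocks. The function names, the fixed group size, and the placeholder weights are assumptions, and the fine-tuning and distillation of part (3) are omitted.

```python
def prune_blocks(importance, removal_rate):
    """Drop the lowest-importance blocks; return surviving indices in order."""
    num_remove = int(len(importance) * removal_rate)
    ranked = sorted(range(len(importance)), key=lambda i: importance[i])
    removed = set(ranked[:num_remove])
    return [i for i in range(len(importance)) if i not in removed]


def group_blocks(kept, group_size):
    """Partition the retained blocks into consecutive groups."""
    return [kept[i:i + group_size] for i in range(0, len(kept), group_size)]


def share_within_groups(groups, make_weights):
    """Map every block to its group's single, shared parameter object."""
    shared = {}
    for group_id, group in enumerate(groups):
        weights = make_weights(group_id)  # one parameter set per group
        for block in group:
            shared[block] = weights       # same object, not a copy
    return shared


# Hypothetical importance scores for an 8-block model.
importance = [0.9, 0.2, 0.8, 0.1, 0.7, 0.6, 0.3, 0.95]
kept = prune_blocks(importance, removal_rate=0.25)
groups = group_blocks(kept, group_size=2)
params = share_within_groups(groups, lambda group_id: [0.0] * 4)
```

Here 8 blocks shrink to 6 executed blocks backed by only 3 parameter sets, which shows why sharing cuts the parameter count far more than it cuts latency.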
-
IVCA: Inter-Relation-Aware Video Complexity Analyzer
Authors:
Junqi Liao,
Yao Li,
Zhuoyuan Li,
Li Li,
Dong Liu
Abstract:
To meet the real-time analysis requirements of video streaming applications, we propose an inter-relation-aware video complexity analyzer (IVCA) as an extension to VCA. The IVCA addresses the limitation of VCA by considering inter-frame relations, namely motion and reference structure. First, we enhance the accuracy of temporal features by introducing feature-domain motion estimation into the IVCA. Next, drawing inspiration from the hierarchical reference structure in codecs, we design layer-aware weights to adjust the majorities of frame complexity in different layers. Additionally, we expand the scope of temporal features by considering the frames that are referred to, rather than relying solely on the previous frame. Experimental results show a significant improvement in complexity estimation accuracy achieved by IVCA, with a minimal increase in time complexity.
Submitted 28 June, 2024;
originally announced July 2024.
-
Parallax-tolerant Image Stitching via Segmentation-guided Multi-homography Warping
Authors:
Tianli Liao,
Ce Wang,
Lei Li,
Guangen Liu,
Nan Li
Abstract:
Large parallax between images is an intractable issue in image stitching. Various warping-based methods are proposed to address it, yet the results are unsatisfactory. In this paper, we propose a novel image stitching method using multi-homography warping guided by image segmentation. Specifically, we leverage the Segment Anything Model to segment the target image into numerous contents and partition the feature points into multiple subsets via the energy-based multi-homography fitting algorithm. The multiple subsets of feature points are used to calculate the corresponding multiple homographies. For each segmented content in the overlapping region, we select its best-fitting homography with the lowest photometric error. For each segmented content in the non-overlapping region, we calculate a weighted combination of the linearized homographies. Finally, the target image is warped via the best-fitting homographies to align with the reference image, and the final panorama is generated via linear blending. Comprehensive experimental results on the public datasets demonstrate that our method provides the best alignment accuracy by a large margin, compared with the state-of-the-art methods. The source code is available at https://github.com/tlliao/multi-homo-warp.
Submitted 28 June, 2024;
originally announced June 2024.
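Illustrative sketch (not from the paper): for each segmented content in the overlapping region, the abstract selects the candidate homography with the lowest photometric error. A minimal version of that selection is shown below. The images are stand-in analytic functions, the error is a mean absolute intensity difference, and all names are hypothetical; the authors' implementation is at the repository linked in the abstract.

```python
def warp_point(h, x, y):
    """Apply a 3x3 homography (nested lists) to a pixel coordinate."""
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return u, v


def photometric_error(h, pixels, target_intensity, reference_intensity):
    """Mean absolute intensity difference after warping a segment's pixels."""
    total = 0.0
    for x, y in pixels:
        u, v = warp_point(h, x, y)
        total += abs(target_intensity(x, y) - reference_intensity(u, v))
    return total / len(pixels)


def best_homography(candidates, pixels, target_intensity, reference_intensity):
    """Index of the candidate that aligns this segment best."""
    return min(
        range(len(candidates)),
        key=lambda i: photometric_error(
            candidates[i], pixels, target_intensity, reference_intensity),
    )


# Toy data: the target is the reference shifted by 3 pixels along x.
def reference(u, v):
    return u + 2 * v


def target(x, y):
    return reference(x + 3, y)


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
shift_x3 = [[1, 0, 3], [0, 1, 0], [0, 0, 1]]
segment = [(0, 0), (1, 0), (0, 1), (1, 1)]
choice = best_homography([identity, shift_x3], segment, target, reference)
```

Running the selection per segment rather than per image is what allows different depth layers to follow different homographies.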
-
Deep Convolutional Neural Networks Meet Variational Shape Compactness Priors for Image Segmentation
Authors:
Kehui Zhang,
Lingfeng Li,
Hao Liu,
Jing Yuan,
Xue-Cheng Tai
Abstract:
Shape compactness is a key geometrical property to describe interesting regions in many image segmentation tasks. In this paper, we propose two novel algorithms to solve the introduced image segmentation problem that incorporates a shape-compactness prior. Existing algorithms for such a problem often suffer from computational inefficiency, difficulty in reaching a local minimum, and the need to fine-tune the hyperparameters. To address these issues, we propose a novel optimization model along with its equivalent primal-dual model and introduce a new optimization algorithm based on primal-dual threshold dynamics (PD-TD). Additionally, we relax the solution constraint and propose another novel primal-dual soft threshold-dynamics algorithm (PD-STD) to achieve superior performance. Based on the variational explanation of the sigmoid layer, the proposed PD-STD algorithm can be integrated into Deep Neural Networks (DNNs) to enforce compact regions as image segmentation results. Compared to existing deep learning methods, extensive experiments demonstrated that the proposed algorithms outperformed state-of-the-art algorithms in numerical efficiency and effectiveness, especially while applying to the popular networks of DeepLabV3 and IrisParseNet with higher IoU, dice, and compactness metrics on noisy Iris datasets. In particular, the proposed algorithms significantly improve IoU by 20% training on a highly noisy image dataset.
Submitted 23 May, 2024;
originally announced June 2024.
-
GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model
Authors:
Ling Li,
Yu Ye,
Bingchuan Jiang,
Wei Zeng
Abstract:
This work tackles the problem of geo-localization with a new paradigm using a large vision-language model (LVLM) augmented with human inference knowledge. A primary challenge here is the scarcity of data for training the LVLM - existing street-view datasets often contain numerous low-quality images lacking visual clues, and lack any reasoning inference. To address the data-quality issue, we devise a CLIP-based network to quantify the degree of street-view images being locatable, leading to the creation of a new dataset comprising highly locatable street views. To enhance reasoning inference, we integrate external knowledge obtained from real geo-localization games, tapping into valuable human inference capabilities. The data are utilized to train GeoReasoner, which undergoes fine-tuning through dedicated reasoning and location-tuning stages. Qualitative and quantitative evaluations illustrate that GeoReasoner outperforms counterpart LVLMs by more than 25% at country-level and 38% at city-level geo-localization tasks, and surpasses StreetCLIP performance while requiring fewer training resources. The data and code are available at https://github.com/lingli1996/GeoReasoner.
Submitted 3 June, 2024;
originally announced June 2024.
-
Symbolic Learning Enables Self-Evolving Agents
Authors:
Wangchunshu Zhou,
Yixin Ou,
Shengwei Ding,
Long Li,
Jialong Wu,
Tiannan Wang,
Jiamin Chen,
Shuai Wang,
Xiaohua Xu,
Ningyu Zhang,
Huajun Chen,
Yuchen Eleanor Jiang
Abstract:
The AI community has been exploring a pathway to artificial general intelligence (AGI) by developing "language agents", which are complex large language model (LLM) pipelines involving both prompting techniques and tool usage methods. While language agents have demonstrated impressive capabilities for many real-world tasks, a fundamental limitation of current language agent research is that it is model-centric, or engineering-centric. That is to say, progress on the prompts, tools, and pipelines of language agents requires substantial manual engineering effort from human experts rather than automatic learning from data. We believe the transition from model-centric, or engineering-centric, to data-centric, i.e., the ability of language agents to autonomously learn and evolve in environments, is the key for them to possibly achieve AGI.
In this work, we introduce agent symbolic learning, a systematic framework that enables language agents to optimize themselves on their own in a data-centric way using symbolic optimizers. Specifically, we consider agents as symbolic networks where learnable weights are defined by prompts, tools, and the way they are stacked together. Agent symbolic learning is designed to optimize the symbolic network within language agents by mimicking two fundamental algorithms in connectionist learning: back-propagation and gradient descent. Instead of dealing with numeric weights, agent symbolic learning works with natural language simulacrums of weights, loss, and gradients. We conduct proof-of-concept experiments on both standard benchmarks and complex real-world tasks and show that agent symbolic learning enables language agents to update themselves after being created and deployed in the wild, resulting in "self-evolving agents".
Submitted 26 June, 2024;
originally announced June 2024.
-
Retrieval-style In-Context Learning for Few-shot Hierarchical Text Classification
Authors:
Huiyao Chen,
Yu Zhao,
Zulong Chen,
Mengjia Wang,
Liangyue Li,
Meishan Zhang,
Min Zhang
Abstract:
Hierarchical text classification (HTC) is an important task with broad applications, while few-shot HTC has gained increasing interest recently. While in-context learning (ICL) with large language models (LLMs) has achieved significant success in few-shot learning, it is not as effective for HTC because of the expansive hierarchical label sets and extremely-ambiguous labels. In this work, we introduce the first ICL-based framework with LLM for few-shot HTC. We exploit a retrieval database to identify relevant demonstrations, and an iterative policy to manage multi-layer hierarchical labels. Particularly, we equip the retrieval database with HTC label-aware representations for the input texts, which is achieved by continual training on a pretrained language model with masked language modeling (MLM), layer-wise classification (CLS, specifically for HTC), and a novel divergent contrastive learning (DCL, mainly for adjacent semantically-similar labels) objective. Experimental results on three benchmark datasets demonstrate superior performance of our method, and we can achieve state-of-the-art results in few-shot HTC.
Submitted 29 June, 2024; v1 submitted 25 June, 2024;
originally announced June 2024.
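Illustrative sketch (not from the paper): the abstract uses a retrieval database to pick relevant demonstrations for in-context learning. The retrieval-then-prompt step can be sketched with a bag-of-words cosine similarity, which is an assumed stand-in for the label-aware representations the paper trains with MLM, CLS and DCL objectives. The example texts, label paths, prompt layout, and function names are all hypothetical, and the iterative layer-by-layer policy is not reproduced.

```python
import math
from collections import Counter


def embed(text):
    """Toy representation: lower-cased word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, database, k=2):
    """Return the k labelled examples most similar to the query."""
    q = embed(query)
    ranked = sorted(database, key=lambda item: cosine(q, embed(item[0])),
                    reverse=True)
    return ranked[:k]


def build_prompt(query, demonstrations):
    """Lay out the demonstrations, then the unlabelled query, for an LLM."""
    blocks = [f"Text: {text}\nLabel path: {' > '.join(path)}"
              for text, path in demonstrations]
    blocks.append(f"Text: {query}\nLabel path:")
    return "\n\n".join(blocks)


# Hypothetical labelled examples, each with a two-level label path.
database = [
    ("the team won the football match", ("sports", "football")),
    ("new graphics card released for gaming", ("technology", "hardware")),
    ("striker scores twice in cup final", ("sports", "football")),
    ("central bank raises interest rates", ("business", "finance")),
]
query = "football team wins the cup final"
demos = retrieve(query, database, k=2)
prompt = build_prompt(query, demos)
```

Because only the nearest examples reach the prompt, the model sees a handful of plausible label paths instead of the whole hierarchy.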