-
LLMBox: A Comprehensive Library for Large Language Models
Authors:
Tianyi Tang,
Yiwen Hu,
Bingqian Li,
Wenyang Luo,
Zijing Qin,
Haoxiang Sun,
Jiapeng Wang,
Shiyi Xu,
Xiaoxue Cheng,
Geyang Guo,
Han Peng,
Bowen Zheng,
Yiru Tang,
Yingqian Min,
Yushuo Chen,
Jie Chen,
Yuanqian Zhao,
Luran Ding,
Yuhao Wang,
Zican Dong,
Chunxuan Xia,
Junyi Li,
Kun Zhou,
Wayne Xin Zhao,
Ji-Rong Wen
Abstract:
To facilitate research on large language models (LLMs), this paper presents a comprehensive and unified library, LLMBox, to ease the development, use, and evaluation of LLMs. The library has three main merits: (1) a unified data interface that supports the flexible implementation of various training strategies, (2) a comprehensive evaluation that covers extensive tasks, datasets, and models, and (3) practical considerations, especially user-friendliness and efficiency. With our library, users can easily reproduce existing methods, train new models, and conduct comprehensive performance comparisons. To rigorously test LLMBox, we conduct extensive experiments across a diverse range of evaluation settings, and the experimental results demonstrate the effectiveness and efficiency of our library in supporting various implementations related to LLMs. A detailed introduction and usage guidance can be found at https://github.com/RUCAIBox/LLMBox.
Submitted 7 July, 2024;
originally announced July 2024.
-
Evolutionary Trigger Detection and Lightweight Model Repair Based Backdoor Defense
Authors:
Qi Zhou,
Zipeng Ye,
Yubo Tang,
Wenjian Luo,
Yuhui Shi,
Yan Jia
Abstract:
Deep Neural Networks (DNNs) have been widely used in many areas such as autonomous driving and face recognition. However, DNN models are vulnerable to backdoor attacks. A backdoor in a DNN model can be activated by a poisoned input containing a trigger, leading to a wrong prediction, which causes serious security issues in applications. It is challenging for current defenses to eliminate the backdoor effectively with limited computing resources, especially when the sizes and numbers of the triggers vary, as in the physical world. We propose an efficient backdoor defense based on evolutionary trigger detection and lightweight model repair. In the first phase of our method, the CAM-focus Evolutionary Trigger Filter (CETF) is proposed for trigger detection. CETF is an effective sample-preprocessing method based on an evolutionary algorithm, and our experimental results show that CETF not only accurately distinguishes images with triggers from clean images, but can also be widely used in practice owing to its simplicity and stability across different backdoor attack situations. In the second phase of our method, we leverage several lightweight unlearning methods with the trigger detected by CETF for model repair, which also demonstrates the underlying correlation of the backdoor with Batch Normalization layers. Source code will be published after acceptance.
Submitted 7 July, 2024;
originally announced July 2024.
-
Safety-Critical Control with Uncertainty Quantification using Adaptive Conformal Prediction
Authors:
Hao Zhou,
Yanze Zhang,
Wenhao Luo
Abstract:
Safety assurance is critical in the planning and control of robotic systems. For robots operating in the real world, the safety-critical design often needs to explicitly address uncertainties and the pre-computed guarantees often rely on the assumption of the particular distribution of the uncertainty. However, it is difficult to characterize the actual uncertainty distribution beforehand and thus the established safety guarantee may be violated due to possible distribution mismatch. In this paper, we propose a novel safe control framework that provides a high-probability safety guarantee for stochastic dynamical systems following unknown distributions of motion noise. Specifically, this framework adopts adaptive conformal prediction to dynamically quantify the prediction uncertainty from online observations and combines that with the probabilistic extension of the control barrier functions (CBFs) to characterize the uncertainty-aware control constraints. By integrating the constraints in the model predictive control scheme, it allows robots to adaptively capture the true prediction uncertainty online in a distribution-free setting and enjoys formally provable high-probability safety assurance. Simulation results on multi-robot systems with stochastic single-integrator dynamics and unicycle dynamics are provided to demonstrate the effectiveness of our framework.
Submitted 8 July, 2024; v1 submitted 3 July, 2024;
originally announced July 2024.
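The abstract above does not spell out how the prediction uncertainty is adapted online, so the following is only a minimal sketch of the widely used adaptive conformal update, in which the miscoverage level is nudged after each observation. The function names, the step size `gamma`, and the use of a scalar nonconformity score are assumptions made for illustration, not the authors' implementation.

```python
import math


def conformal_radius(scores, alpha_t):
    """Return the conformal (1 - alpha_t) quantile of past nonconformity scores.

    `scores` is a hypothetical list of past prediction errors
    (for example, distances between predicted and observed positions).
    """
    if alpha_t <= 0.0:
        return math.inf  # full coverage requested: the region is unbounded
    if alpha_t >= 1.0:
        return 0.0       # no coverage requested: the region is a point
    ordered = sorted(scores)
    # Conformal rank with the usual (n + 1) finite-sample correction.
    k = math.ceil((1.0 - alpha_t) * (len(ordered) + 1))
    if k > len(ordered):
        return math.inf  # too few samples to certify this level
    return ordered[k - 1]


def update_alpha(alpha_t, target_alpha, miscovered, gamma=0.05):
    """One adaptive step: alpha_{t+1} = alpha_t + gamma * (target_alpha - err_t).

    err_t is 1 when the latest observation fell outside the predicted region,
    so a miss lowers alpha and widens the next region.
    """
    err_t = 1.0 if miscovered else 0.0
    return alpha_t + gamma * (target_alpha - err_t)
```

In a control loop, the radius returned by `conformal_radius` would inflate the safety constraint at each step, and `update_alpha` would be called once the true state is observed.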
-
Task-oriented Over-the-air Computation for Edge-device Co-inference with Balanced Classification Accuracy
Authors:
Xiang Jiao,
Dingzhu Wen,
Guangxu Zhu,
Wei Jiang,
Wu Luo,
Yuanming Shi
Abstract:
Edge-device co-inference, which concerns the cooperation between edge devices and an edge server for completing inference tasks over wireless networks, has been a promising technique for enabling various kinds of intelligent services at the network edge, e.g., auto-driving. In this paradigm, the design objective of the network shifts from the traditional communication throughput to the effective and efficient execution of the inference task underpinned by the network, measured by, e.g., the inference accuracy and latency. In this paper, a task-oriented over-the-air computation scheme is proposed for a multi-device artificial intelligence system. In particular, a novel tractable inference accuracy metric, called minimum pair-wise discriminant gain, is proposed for classification tasks. Unlike prior work measuring the average over all class pairs in feature space, it measures the minimum distance over all class pairs. By maximizing the minimum pair-wise discriminant gain instead of its average counterpart, any pair of classes can be better separated in the feature space, thus leading to a balanced and improved inference accuracy for all classes. Besides, this paper jointly optimizes the minimum discriminant gain of all feature elements instead of separately maximizing that of each element as in existing designs. As a result, the transmit power can be adaptively allocated to the feature elements according to their different contributions to the inference accuracy, opening an extra degree of freedom to improve inference performance. Extensive experiments are conducted using a concrete use case of human motion recognition to verify the superiority of the proposed design over the benchmarking scheme.
Submitted 1 July, 2024;
originally announced July 2024.
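To illustrate why the abstract above contrasts a minimum with an average over class pairs, here is a small sketch. The distance used (squared difference of class means divided by a shared variance) and the class names are assumptions chosen for illustration; the paper's exact definition of discriminant gain is not given in the abstract.

```python
from itertools import combinations


def pairwise_gains(class_means, variance):
    """Hypothetical separation score for every pair of classes.

    `class_means` maps a class label to its mean feature vector;
    `variance` is an assumed shared per-element noise variance.
    """
    gains = {}
    for (a, mu_a), (b, mu_b) in combinations(sorted(class_means.items()), 2):
        squared_distance = sum((x - y) ** 2 for x, y in zip(mu_a, mu_b))
        gains[(a, b)] = squared_distance / variance
    return gains


def min_and_average_gain(class_means, variance):
    """Return (minimum pair-wise gain, average pair-wise gain)."""
    values = list(pairwise_gains(class_means, variance).values())
    return min(values), sum(values) / len(values)


if __name__ == "__main__":
    means = {"walk": [0.0, 0.0], "run": [1.0, 0.0], "sit": [4.0, 0.0]}
    worst, average = min_and_average_gain(means, variance=1.0)
    # The average looks healthy while the closest pair is barely separated.
    print(f"minimum gain: {worst:.2f}, average gain: {average:.2f}")
```

A design that maximizes only the average can leave one pair of classes nearly indistinguishable, which is the imbalance the minimum-based metric targets.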
-
Observation of the Electromagnetic Dalitz Transition $h_c \rightarrow e^+e^-η_c$
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
S. Ahmed,
M. Albrecht,
R. Aliberti,
A. Amoroso,
M. R. An,
Q. An,
X. H. Bai,
Y. Bai,
O. Bakina,
R. Baldini Ferroli,
I. Balossino,
Y. Ban,
K. Begzsuren,
N. Berger,
M. Bertani,
D. Bettoni,
F. Bianchi,
J. Bloms,
A. Bortone,
I. Boyko,
R. A. Briere
, et al. (495 additional authors not shown)
Abstract:
Using $(27.12\pm 0.14)\times10^8$ $ψ(3686)$ decays and data samples of $e^+e^-$ collisions with $\sqrt{s}$ from 4.130 to 4.780~GeV collected with the BESIII detector, we report the first observation of the electromagnetic Dalitz transition $h_c\to e^+e^-η_c$ with a statistical significance of $5.4σ$. We measure the ratio of the branching fractions $\frac{\mathcal{B}(h_c\rightarrow e^+e^-η_c)}{\mathcal{B}(h_c\rightarrow γη_c)}$ separately for the $h_c$ samples produced via $ψ(3686)\toπ^0h_c$ and $e^+e^-\toπ^+π^-h_c$. The average ratio is determined to be $(0.59\pm0.10(\text{stat.})\pm0.04(\text{syst.}))\%$, where the uncertainty includes both statistical and systematic components.
Submitted 2 July, 2024; v1 submitted 28 June, 2024;
originally announced July 2024.
-
Enhancing interfacial thermal transport by nanostructures: Monte Carlo simulations with ab initio phonon properties
Authors:
Wenzhu Luo,
Neng Wang,
Wenlei Lian,
Ershuai Yin,
Qiang Li
Abstract:
Recent experiments have indicated that employing nanostructures can enhance interfacial heat transport, but the mechanism by which different structural morphologies and dimensions contribute to the full-spectrum phonon interfacial transport remains unclear. In this paper, a multiscale method to study the thermal transfer at nanostructured interfaces is developed by combining density functional calculation, Monte Carlo simulation, and the diffuse mismatch method. The changes in the transport paths and contributions to thermal conductance of different frequency phonons caused by changes in nanostructure morphology and size are investigated. The results show that, compared to the triangular and trapezoidal nanostructures, the rectangular nanostructures are more beneficial in enhancing the probability of the reflected phonons encountering the interface, and thus the phonon interfacial transmittance. The nanostructure makes the interfacial heat flow extremely heterogeneous, with significant transverse heat flow occurring at the sidewalls, resulting in a new thermal conduction pathway. The phenomena of multiple reflections and double transmission together lead to the existence of an optimal dimension that maximizes the nanostructures' enhancement effect on interfacial heat transfer. The optimal nanostructure width is 100 nm when the height is 100 nm, and the maximum interfacial thermal conductance enhancement ratio is 1.31. These results can guide the design of heat transfer enhancement structures at the interfaces of actual high-power chips.
Submitted 27 June, 2024;
originally announced June 2024.
-
Consistency Models Made Easy
Authors:
Zhengyang Geng,
Ashwini Pokle,
William Luo,
Justin Lin,
J. Zico Kolter
Abstract:
Consistency models (CMs) are an emerging class of generative models that offer faster sampling than traditional diffusion models. CMs enforce that all points along a sampling trajectory are mapped to the same initial point. But this target leads to resource-intensive training: for example, as of 2024, training a SoTA CM on CIFAR-10 takes one week on 8 GPUs. In this work, we propose an alternative scheme for training CMs, vastly improving the efficiency of building such models. Specifically, by expressing CM trajectories via a particular differential equation, we argue that diffusion models can be viewed as a special case of CMs with a specific discretization. We can thus fine-tune a consistency model starting from a pre-trained diffusion model and progressively approximate the full consistency condition to stronger degrees over the training process. Our resulting method, which we term Easy Consistency Tuning (ECT), achieves vastly improved training times while also improving upon the quality of previous methods: for example, ECT achieves a 2-step FID of 2.73 on CIFAR-10 within 1 hour on a single A100 GPU, matching Consistency Distillation trained for hundreds of GPU hours. Owing to this computational efficiency, we investigate the scaling law of CMs under ECT, showing that they seem to obey classic power law scaling, hinting at their ability to improve efficiency and performance at larger scales. Code (https://github.com/locuslab/ect) is available.
Submitted 20 June, 2024;
originally announced June 2024.
-
NTIRE 2024 Challenge on Night Photography Rendering
Authors:
Egor Ershov,
Artyom Panshin,
Oleg Karasev,
Sergey Korchagin,
Shepelev Lev,
Alexandr Startsev,
Daniil Vladimirov,
Ekaterina Zaychenkova,
Nikola Banić,
Dmitrii Iarchuk,
Maria Efimova,
Radu Timofte,
Arseniy Terekhin,
Shuwei Yue,
Yuyang Liu,
Minchen Wei,
Lu Xu,
Chao Zhang,
Yasi Wang,
Furkan Kınlı,
Doğa Yılmaz,
Barış Özcan,
Furkan Kıraç,
Shuai Liu,
Jingyuan Xiao
, et al. (25 additional authors not shown)
Abstract:
This paper presents a review of the NTIRE 2024 challenge on night photography rendering. The goal of the challenge was to find solutions that process raw camera images taken in nighttime conditions and thereby produce photo-quality output images in the standard RGB (sRGB) space. Unlike the previous year's competition, the challenge images were collected with a mobile phone, and the speed of the algorithms was measured alongside the quality of their output. To evaluate the results, a sufficient number of viewers were asked to assess the visual quality of the proposed solutions, considering the subjective nature of the task. There were 2 nominations: quality and efficiency. The top 5 solutions in terms of output quality were sorted by evaluation time (see Fig. 1). The top-ranking participants' solutions effectively represent the state of the art in nighttime photography rendering. More results can be found at https://nightimaging.org.
Submitted 18 June, 2024;
originally announced June 2024.
-
Decentralized Multi-Robot Line-of-Sight Connectivity Maintenance under Uncertainty
Authors:
Yupeng Yang,
Yiwei Lyu,
Yanze Zhang,
Sha Yi,
Wenhao Luo
Abstract:
In this paper, we propose a novel decentralized control method to maintain Line-of-Sight connectivity for multi-robot networks in the presence of Gaussian-distributed localization uncertainty. In contrast to most existing work that assumes perfect positional information about robots or enforces overly restrictive rigid formation against uncertainty, our method enables robots to preserve Line-of-Sight connectivity with high probability under unbounded Gaussian-like positional noises while remaining minimally intrusive to the original robots' tasks. This is achieved by a motion coordination framework that jointly optimizes the set of existing Line-of-Sight edges to preserve and control revisions to the nominal task-related controllers, subject to the safety constraints and the corresponding composition of uncertainty-aware Line-of-Sight control constraints. Such compositional control constraints, expressed by our novel notion of probabilistic Line-of-Sight connectivity barrier certificates (PrLOS-CBC) for pairwise robots using control barrier functions, explicitly characterize the deterministic admissible control space for the two robots. The resulting motion ensures Line-of-Sight connectedness for the robot team with high probability. Furthermore, we propose a fully decentralized algorithm that decomposes the motion coordination framework by interleaving the composite constraint specification and solving for the resulting optimization-based controllers. The optimality of our approach is justified by theoretical proofs. Simulation and real-world experiment results are given to demonstrate the effectiveness of our method.
Submitted 18 June, 2024;
originally announced June 2024.
-
Prior Normality Prompt Transformer for Multi-class Industrial Image Anomaly Detection
Authors:
Haiming Yao,
Yunkang Cao,
Wei Luo,
Weihang Zhang,
Wenyong Yu,
Weiming Shen
Abstract:
Image anomaly detection plays a pivotal role in industrial inspection. Traditional approaches often demand distinct models for specific categories, resulting in substantial deployment costs. This raises concerns about multi-class anomaly detection, where a unified model is developed for multiple classes. However, applying conventional methods, particularly reconstruction-based models, directly to multi-class scenarios encounters challenges such as identical shortcut learning, hindering effective discrimination between normal and abnormal instances. To tackle this issue, our study introduces the Prior Normality Prompt Transformer (PNPT) method for multi-class image anomaly detection. PNPT strategically incorporates normal semantics prompting to mitigate the "identical mapping" problem. This entails integrating a prior normality prompt into the reconstruction process, yielding a dual-stream model. This innovative architecture combines normal prior semantics with abnormal samples, enabling dual-stream reconstruction grounded in both prior knowledge and intrinsic sample characteristics. PNPT comprises four essential modules: Class-Specific Normality Prompting Pool (CS-NPP), Hierarchical Patch Embedding (HPE), Semantic Alignment Coupling Encoding (SACE), and Contextual Semantic Conditional Decoding (CSCD). Experimental validation on diverse benchmark datasets and real-world industrial applications highlights PNPT's superior performance in multi-class industrial anomaly detection.
Submitted 17 June, 2024;
originally announced June 2024.
-
Taking a Deep Breath: Enhancing Language Modeling of Large Language Models with Sentinel Tokens
Authors:
Weiyao Luo,
Suncong Zheng,
Heming Xia,
Weikang Wang,
Yan Lei,
Tianyu Liu,
Shuang Chen,
Zhifang Sui
Abstract:
Large language models (LLMs) have shown promising efficacy across various tasks, becoming powerful tools in numerous aspects of human life. However, Transformer-based LLMs suffer performance degradation when modeling long-term contexts because they discard some information to reduce computational overhead. In this work, we propose a simple yet effective method to enable LLMs to take a deep breath, encouraging them to summarize information contained within discrete text chunks. Specifically, we segment the text into multiple chunks and insert a special token <SR> at the end of each chunk. We then modify the attention mask to integrate the chunk's information into the corresponding <SR> token. This enables LLMs to interpret information not only from historical individual tokens but also from the <SR> token, which aggregates the chunk's semantic information. Experiments on language modeling and out-of-domain downstream tasks validate the superiority of our approach.
Submitted 16 June, 2024;
originally announced June 2024.
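The chunking and masking steps described in the abstract above can be sketched as follows. The abstract does not give the exact masking rule, so this sketch assumes one plausible reading: each <SR> token attends only to its own chunk, while ordinary tokens keep standard causal attention and can therefore also see earlier <SR> tokens. The function names and the fixed `chunk_size` are hypothetical.

```python
SR = "<SR>"  # sentinel token name taken from the abstract


def insert_sentinels(tokens, chunk_size):
    """Split `tokens` into fixed-size chunks and append <SR> after each chunk."""
    out = []
    for start in range(0, len(tokens), chunk_size):
        out.extend(tokens[start:start + chunk_size])
        out.append(SR)
    return out


def build_attention_mask(tokens):
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed rule (not confirmed by the abstract): an <SR> token sees only
    the tokens of its own chunk plus itself; every other token uses
    ordinary causal attention over all earlier positions.
    """
    n = len(tokens)
    mask = [[False] * n for _ in range(n)]
    chunk_start = 0
    for i, token in enumerate(tokens):
        if token == SR:
            for j in range(chunk_start, i + 1):
                mask[i][j] = True
            chunk_start = i + 1  # the next chunk begins after this sentinel
        else:
            for j in range(i + 1):
                mask[i][j] = True
    return mask


if __name__ == "__main__":
    sequence = insert_sentinels(["a", "b", "c"], chunk_size=2)
    print(sequence)
    for row in build_attention_mask(sequence):
        print("".join("1" if allowed else "0" for allowed in row))
```

Restricting each sentinel to its own chunk is what would force it to act as a summary of that chunk; a real implementation would express the same mask as a tensor passed to the attention layers.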
-
M-LRM: Multi-view Large Reconstruction Model
Authors:
Mengfei Li,
Xiaoxiao Long,
Yixun Liang,
Weiyu Li,
Yuan Liu,
Peng Li,
Xiaowei Chi,
Xingqun Qi,
Wei Xue,
Wenhan Luo,
Qifeng Liu,
Yike Guo
Abstract:
Despite recent advancements in the Large Reconstruction Model (LRM) demonstrating impressive results, when extending its input from a single image to multiple images, it exhibits inefficiencies, subpar geometric and texture quality, as well as slower convergence than expected. This is because LRM formulates 3D reconstruction as a naive images-to-3D translation problem, ignoring the strong 3D coherence among the input images. In this paper, we propose a Multi-view Large Reconstruction Model (M-LRM) designed to efficiently reconstruct high-quality 3D shapes from multiple views in a 3D-aware manner. Specifically, we introduce a multi-view consistent cross-attention scheme to enable M-LRM to accurately query information from the input images. Moreover, we employ the 3D priors of the input multi-view images to initialize the tri-plane tokens. Compared to LRM, the proposed M-LRM can produce a tri-plane NeRF with $128 \times 128$ resolution and generate 3D shapes of high fidelity. Experimental studies demonstrate that our model achieves a significant performance gain and faster training convergence than LRM. Project page: https://murphylmf.github.io/M-LRM/
Submitted 11 June, 2024;
originally announced June 2024.
-
Global-Regularized Neighborhood Regression for Efficient Zero-Shot Texture Anomaly Detection
Authors:
Haiming Yao,
Wei Luo,
Yunkang Cao,
Yiheng Zhang,
Wenyong Yu,
Weiming Shen
Abstract:
Texture surface anomaly detection finds widespread applications in industrial settings. However, existing methods often necessitate gathering numerous samples for model training. Moreover, they predominantly operate within a closed-set detection framework, limiting their ability to identify anomalies beyond the training dataset. To tackle these challenges, this paper introduces a novel zero-shot texture anomaly detection method named Global-Regularized Neighborhood Regression (GRNR). Unlike conventional approaches, GRNR can detect anomalies on arbitrary textured surfaces without any training data or cost. Drawing from human visual cognition, GRNR derives two intrinsic prior supports directly from the test texture image: local neighborhood priors characterized by coherent similarities and global normality priors featuring typical normal patterns. The fundamental principle of GRNR involves utilizing the two extracted intrinsic support priors for self-reconstructive regression of the query sample. This process employs the transformation facilitated by local neighbor support while being regularized by global normality support, aiming to not only achieve visually consistent reconstruction results but also preserve normality properties. We validate the effectiveness of GRNR across various industrial scenarios using eight benchmark datasets, demonstrating its superior detection performance without the need for training data. Remarkably, our method is applicable to open-set texture defect detection and can even surpass existing vanilla approaches that require extensive training.
Submitted 11 June, 2024;
originally announced June 2024.
-
Advancing Tool-Augmented Large Language Models: Integrating Insights from Errors in Inference Trees
Authors:
Sijia Chen,
Yibo Wang,
Yi-Feng Wu,
Qing-Guo Chen,
Zhao Xu,
Weihua Luo,
Kaifu Zhang,
Lijun Zhang
Abstract:
Tool-augmented large language models (LLMs) leverage tools, often in the form of APIs, to enhance their reasoning capabilities on complex tasks, thus taking on the role of intelligent agents interacting with the real world. The recently introduced ToolLLaMA model by Qin et al. [2024] utilizes the depth-first search-based decision tree (DFSDT) method for reasoning with $16000+$ real-world APIs, which effectively improves the planning and inferencing performance of tool-augmented LLMs compared to traditional chain reasoning approaches. However, their approach only employs successful paths from decision trees (also called inference trees) for supervised fine-tuning (SFT) during training, which does not fully exploit the advantages of the tree of thought. In this study, we propose an inference trajectory optimization framework based on the preference data extracted from decision trees to address this limitation. We first introduce a novel method for constructing preference data from the tree of thought, capitalizing on the failed explorations previously overlooked in the trees. Specifically, we generate an effective step-wise preference dataset, named ToolPreference, for tool use based on the ToolBench dataset. In the subsequent training phase, we first fine-tune the LLM with tool-usage expert trajectories and then use these step-wise preference pairs for direct preference optimization (DPO) to update the policy of the LLM, resulting in our ToolPrefer-LLaMA (TP-LLaMA) model. Our experiments demonstrate that by obtaining insights from errors in inference trees, TP-LLaMA significantly outperforms the baselines across almost all test scenarios by a large margin and exhibits better generalization capabilities with unseen APIs. At the same time, TP-LLaMA has also demonstrated superior reasoning efficiency compared to the baselines, making it more suitable for complex tool-usage reasoning tasks.
Submitted 11 June, 2024;
originally announced June 2024.
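The abstract above uses step-wise preference pairs for direct preference optimization (DPO). As a rough illustration, the standard per-pair DPO loss can be written in a few lines; the argument names and the default `beta` are assumptions, and this sketch works on scalar log-probabilities rather than on the authors' actual training pipeline.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a response (here, a
    preferred or rejected tool-use step) under the policy being trained or
    under the frozen reference model. `beta` scales the implicit reward.
    """
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # The policy already prefers the successful step more than the reference does,
    # so the loss is below log(2), the value at zero margin.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=0.5))
```

In the setting the abstract describes, the chosen response would come from a successful branch of the decision tree and the rejected one from a failed exploration at the same step.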
-
HalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination Evaluation
Authors:
Wen Luo,
Tianshu Shen,
Wei Li,
Guangyue Peng,
Richeng Xuan,
Houfeng Wang,
Xi Yang
Abstract:
Large Language Models (LLMs) have significantly advanced the field of Natural Language Processing (NLP), achieving remarkable performance across diverse tasks and enabling widespread real-world applications. However, LLMs are prone to hallucination, generating content that either conflicts with established knowledge or is unfaithful to the original sources. Existing hallucination benchmarks primarily focus on sentence- or passage-level hallucination detection, neglecting dialogue-level evaluation, hallucination localization, and rationale provision. They also predominantly target factuality hallucinations while underestimating faithfulness hallucinations, often relying on labor-intensive or non-specialized evaluators. To address these limitations, we propose HalluDial, the first comprehensive large-scale benchmark for automatic dialogue-level hallucination evaluation. HalluDial encompasses both spontaneous and induced hallucination scenarios, covering factuality and faithfulness hallucinations. The benchmark includes 4,094 dialogues with a total of 146,856 samples. Leveraging HalluDial, we conduct a comprehensive meta-evaluation of LLMs' hallucination evaluation capabilities in information-seeking dialogues and introduce a specialized judge language model, HalluJudge. The high data quality of HalluDial enables HalluJudge to achieve superior or competitive performance in hallucination evaluation, facilitating the automatic assessment of dialogue-level hallucinations in LLMs and providing valuable insights into this phenomenon. The dataset and the code are available at https://github.com/FlagOpen/HalluDial.
Submitted 11 June, 2024;
originally announced June 2024.
-
MIPI 2024 Challenge on Few-shot RAW Image Denoising: Methods and Results
Authors:
Xin Jin,
Chunle Guo,
Xiaoming Li,
Zongsheng Yue,
Chongyi Li,
Shangchen Zhou,
Ruicheng Feng,
Yuekun Dai,
Peiqing Yang,
Chen Change Loy,
Ruoqi Li,
Chang Liu,
Ziyi Wang,
Yao Du,
Jingjing Yang,
Long Bao,
Heng Sun,
Xiangyu Kong,
Xiaoxia Xing,
Jinlong Wu,
Yuanyang Xue,
Hyunhee Park,
Sejun Song,
Changho Kim,
Jingfan Tan
, et al. (17 additional authors not shown)
Abstract:
The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photography and imaging (MIPI). Building on the achievements of the previous MIPI Workshops held at ECCV 2022 and CVPR 2023, we introduce our third MIPI challenge including three tracks focusing on novel image sensors and imaging algorithms. In this paper, we summarize and review the Few-shot RAW Image Denoising track on MIPI 2024. In total, 165 participants were successfully registered, and 7 teams submitted results in the final testing phase. The developed solutions in this challenge achieved state-of-the-art performance on Few-shot RAW Image Denoising. More details of this challenge and the link to the dataset can be found at https://mipichallenge.org/MIPI2024.
Submitted 11 June, 2024;
originally announced June 2024.
-
M3GIA: A Cognition Inspired Multilingual and Multimodal General Intelligence Ability Benchmark
Authors:
Wei Song,
Yadong Li,
Jianhua Xu,
Guowei Wu,
Lingfeng Ming,
Kexin Yi,
Weihua Luo,
Houyi Li,
Yi Du,
Fangda Guo,
Kaicheng Yu
Abstract:
As recent multi-modality large language models (MLLMs) have shown formidable proficiency on various complex tasks, there has been increasing debate over whether these models could eventually mirror human intelligence. However, existing benchmarks focus solely on evaluating task performance, such as the accuracy of identifying the attribute of an object. Combining well-developed cognitive science to understand the intelligence of MLLMs beyond superficial achievements remains largely unexplored. To this end, we introduce the first cognitive-driven multi-lingual and multi-modal benchmark to evaluate the general intelligence ability of MLLMs, dubbed M3GIA. Specifically, we identify five key cognitive factors based on the well-recognized Cattell-Horn-Carroll (CHC) model of intelligence and propose a novel evaluation metric. In addition, since most MLLMs are trained to perform in different languages, a natural question arises: is language a key factor influencing the cognitive ability of MLLMs? As such, we go beyond English to encompass other languages based on their popularity, including Chinese, French, Spanish, Portuguese and Korean, to construct our M3GIA. We make sure all the data relevant to the cultural backgrounds are collected from their native context to avoid English-centric bias. We collected a significant corpus of data from human participants, revealing that the most advanced MLLM reaches the lower boundary of human intelligence in English. Yet, there remains a pronounced disparity in the other five languages assessed. We also reveal an interesting winner-takes-all phenomenon that is aligned with the discovery in cognitive studies. Our benchmark will be open-sourced, with the aspiration of facilitating the enhancement of cognitive capabilities in MLLMs.
Submitted 14 June, 2024; v1 submitted 8 June, 2024;
originally announced June 2024.
-
Wings: Learning Multimodal LLMs without Text-only Forgetting
Authors:
Yi-Kai Zhang,
Shiyin Lu,
Yang Li,
Yanqing Ma,
Qing-Guo Chen,
Zhao Xu,
Weihua Luo,
Kaifu Zhang,
De-Chuan Zhan,
Han-Jia Ye
Abstract:
Multimodal large language models (MLLMs), initiated with a trained LLM, first align images with text and then fine-tune on multimodal mixed inputs. However, the MLLM catastrophically forgets the text-only instructions, which do not include images and can be addressed within the initial LLM. In this paper, we present Wings, a novel MLLM that excels in both text-only dialogues and multimodal comprehension. Analyzing MLLM attention in multimodal instructions reveals that text-only forgetting is related to the attention shifts from pre-image to post-image text. From that, we construct extra modules that act as the boosted learner to compensate for the attention shift. The complementary visual and textual learners, like "wings" on either side, are connected in parallel within each layer's attention block. Initially, image and text inputs are aligned with visual learners operating alongside the main attention, balancing focus on visual elements. Textual learners are later collaboratively integrated with attention-based routing to blend the outputs of the visual and textual learners. We design the Low-Rank Residual Attention (LoRRA) to guarantee high efficiency for learners. Our experimental results demonstrate that Wings outperforms equally-scaled MLLMs in both text-only and visual question-answering tasks. On a newly constructed Interleaved Image-Text (IIT) benchmark, Wings exhibits superior performance from text-only-rich to multimodal-rich question-answering tasks.
Submitted 5 June, 2024;
originally announced June 2024.
-
Follow-Your-Pose v2: Multiple-Condition Guided Character Image Animation for Stable Pose Control
Authors:
Jingyun Xue,
Hongfa Wang,
Qi Tian,
Yue Ma,
Andong Wang,
Zhiyuan Zhao,
Shaobo Min,
Wenzhe Zhao,
Kaihao Zhang,
Heung-Yeung Shum,
Wei Liu,
Mengyang Liu,
Wenhan Luo
Abstract:
Pose-controllable character video generation is in high demand with extensive applications for fields such as automatic advertising and content creation on social media platforms. While existing character image animation methods using pose sequences and reference images have shown promising performance, they tend to struggle with incoherent animation in complex scenarios, such as multiple character animation and body occlusion. Additionally, current methods require large-scale high-quality videos with stable backgrounds and temporal consistency as training datasets; otherwise, their performance deteriorates greatly. These two issues hinder the practical utilization of character image animation tools. In this paper, we propose a practical and robust framework, Follow-Your-Pose v2, which can be trained on noisy open-sourced videos readily available on the internet. Multi-condition guiders are designed to address the challenges of background stability, body occlusion in multi-character generation, and consistency of character appearance. Moreover, to fill the gap of fair evaluation of multi-character pose animation, we propose a new benchmark comprising approximately 4,000 frames. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods by a margin of over 35% across 2 datasets and on 7 metrics. Meanwhile, qualitative assessments reveal a significant improvement in the quality of generated video, particularly in scenarios involving complex backgrounds and multi-character body occlusion, suggesting the superiority of our approach.
Submitted 12 June, 2024; v1 submitted 5 June, 2024;
originally announced June 2024.
-
Parrot: Multilingual Visual Instruction Tuning
Authors:
Hai-Long Sun,
Da-Wei Zhou,
Yang Li,
Shiyin Lu,
Chao Yi,
Qing-Guo Chen,
Zhao Xu,
Weihua Luo,
Kaifu Zhang,
De-Chuan Zhan,
Han-Jia Ye
Abstract:
The rapid development of Multimodal Large Language Models (MLLMs) like GPT-4V has marked a significant step towards artificial general intelligence. Existing methods mainly focus on aligning vision encoders with LLMs through supervised fine-tuning (SFT) to endow LLMs with multimodal abilities, making MLLMs' inherent ability to react to multiple languages progressively deteriorate as the training process evolves. We empirically find that the imbalanced SFT datasets, primarily composed of English-centric image-text pairs, lead to significantly reduced performance in non-English languages. This is due to the failure of aligning the vision encoder and LLM with multilingual tokens during the SFT process. In this paper, we introduce Parrot, a novel method that utilizes textual guidance to drive visual token alignment at the language level. Parrot makes the visual tokens condition on diverse language inputs and uses Mixture-of-Experts (MoE) to promote the alignment of multilingual tokens. Specifically, to enhance non-English visual token alignment, we compute the cross-attention using the initial visual features and textual embeddings, the result of which is then fed into the MoE router to select the most relevant experts. The selected experts subsequently convert the initial visual tokens into language-specific visual tokens. Moreover, considering the current lack of benchmarks for evaluating multilingual capabilities within the field, we collect and make available a Massive Multilingual Multimodal Benchmark, named MMMB, which includes 6 languages, 15 categories, and 12,000 questions. Our method not only demonstrates state-of-the-art performance on multilingual MMBench and MMMB, but also excels across a broad range of multimodal tasks. Both the source code and the training dataset of Parrot will be made publicly available.
Submitted 4 June, 2024;
originally announced June 2024.
-
Observational test for $f(Q)$ gravity with weak gravitational lensing
Authors:
Qingqing Wang,
Xin Ren,
Yi-Fu Cai,
Wentao Luo,
Emmanuel N. Saridakis
Abstract:
In this article we confront a class of $f(Q)$ gravity models with observational data of galaxy-galaxy lensing. Specifically, we consider the $f(Q)$ gravity models containing a small quadratic correction when compared with General Relativity (GR), and quantify this correction by a model parameter $α$. To derive the observational constraints, we start by extracting the spherically symmetric solutions which correspond to the deviations from the Schwarzschild solution that depend on the model parameter in a two-fold way, i.e., a renormalized mass and a new term proportional to $r^{-2}$. Then, we calculate the effective lensing potential, the deflection angle, the shear component, and the effective Excess Surface Density (ESD) profile. After that, we employ the group catalog and shape catalog from the SDSS DR7 for the lens and source samples respectively. Moreover, we handle the off-center radius as a free parameter and constrain it using MCMC. Concerning the deviation parameter from GR we derive $α=1.202^{+0.277}_{-0.179}\times 10^{-6} {\rm Mpc}^{-2}$ at 1 $σ$ confidence level, and then compare the fitting efficiency with the standard $Λ$CDM paradigm by applying the AIC and BIC information criteria. Our results indicate that the $f(Q)$ corrections alongside off-center effects yield a scenario that is slightly favored.
Submitted 31 May, 2024;
originally announced June 2024.
-
Ovis: Structural Embedding Alignment for Multimodal Large Language Model
Authors:
Shiyin Lu,
Yang Li,
Qing-Guo Chen,
Zhao Xu,
Weihua Luo,
Kaifu Zhang,
Han-Jia Ye
Abstract:
Current Multimodal Large Language Models (MLLMs) typically integrate a pre-trained LLM with another pre-trained vision transformer through a connector, such as an MLP, endowing the LLM with visual capabilities. However, the misalignment between two embedding strategies in MLLMs -- the structural textual embeddings based on an embedding look-up table and the continuous embeddings generated directly by the vision encoder -- poses challenges for a more seamless fusion of visual and textual information. We propose Ovis, a novel MLLM architecture designed to structurally align visual and textual embeddings. Ovis integrates an additional learnable visual embedding table into the visual encoder's process. To capture rich visual semantics, each image patch indexes the visual embedding table multiple times, resulting in a final visual embedding that is a probabilistic combination of the indexed embeddings. This structural approach mirrors the method used for generating textual embeddings. Empirical evaluations on various multimodal benchmarks show that Ovis outperforms open-source MLLMs of similar parameter scales and even surpasses the proprietary model Qwen-VL-Plus overall. These results highlight the potential of Ovis' structured visual representation for advancing MLLM architectural design and promoting more effective multimodal learning. Code, datasets, and models are available at https://github.com/AIDC-AI/Ovis.
Submitted 17 June, 2024; v1 submitted 31 May, 2024;
originally announced May 2024.
-
Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character
Authors:
Siyuan Ma,
Weidi Luo,
Yu Wang,
Xiaogeng Liu
Abstract:
With the advent and widespread deployment of Multimodal Large Language Models (MLLMs), ensuring their safety has become increasingly critical. Achieving this objective requires proactively discovering the vulnerabilities of MLLMs by exploring attack methods. Thus, structure-based jailbreak attacks, where harmful semantic content is embedded within images, have been proposed to mislead the models. However, previous structure-based jailbreak methods mainly focus on transforming the format of malicious queries, such as converting harmful content into images through typography, which lacks sufficient jailbreak effectiveness and generalizability. To address these limitations, we first introduce the concept of "Role-play" into MLLM jailbreak attacks and propose a novel and effective method called Visual Role-play (VRP). Specifically, VRP leverages Large Language Models to generate detailed descriptions of high-risk characters and create corresponding images based on the descriptions. When paired with benign role-play instruction texts, these high-risk character images effectively mislead MLLMs into generating malicious responses by enacting characters with negative attributes. We further extend our VRP method into a universal setup to demonstrate its generalizability. Extensive experiments on popular benchmarks show that VRP outperforms the strongest baselines, Query relevant and FigStep, by an average Attack Success Rate (ASR) margin of 14.3% across all models.
Submitted 12 June, 2024; v1 submitted 25 May, 2024;
originally announced May 2024.
-
Machine-Learning based photon counting for PMT waveforms and its application to the improvement of the energy resolution in large liquid scintillator detectors
Authors:
Wei Jiang,
Guihong Huang,
Zhen Liu,
Wuming Luo,
Liangjian Wen,
Jianyi Luo
Abstract:
Photomultiplier tubes (PMTs) are widely used in particle experiments for photon detection. PMT waveform analysis is crucial for high-precision measurement of the position and energy of incident particles in liquid scintillator (LS) detectors. A key factor contributing to the energy resolution in large liquid scintillator detectors with PMTs is the charge smearing of PMTs. This paper presents a machine-learning-based photon counting method for PMT waveforms and its application to the energy reconstruction, using the JUNO experiment as an example. The results indicate that leveraging the photon counting information from the machine learning model can partially mitigate the impact of PMT charge smearing and lead to a relative 2.0% to 2.8% improvement on the energy resolution at different energies.
Submitted 28 May, 2024;
originally announced May 2024.
-
JUNO Sensitivity to Invisible Decay Modes of Neutrons
Authors:
JUNO Collaboration,
Angel Abusleme,
Thomas Adam,
Kai Adamowicz,
Shakeel Ahmad,
Rizwan Ahmed,
Sebastiano Aiello,
Fengpeng An,
Qi An,
Giuseppe Andronico,
Nikolay Anfimov,
Vito Antonelli,
Tatiana Antoshkina,
João Pedro Athayde Marcondes de André,
Didier Auguste,
Weidong Bai,
Nikita Balashov,
Wander Baldini,
Andrea Barresi,
Davide Basilico,
Eric Baussan,
Marco Bellato,
Marco Beretta,
Antonio Bergnoli,
Daniel Bick, et al. (635 additional authors not shown)
Abstract:
We explore the decay of bound neutrons into invisible particles (e.g., $n\rightarrow 3 ν$ or $nn \rightarrow 2 ν$) in the JUNO liquid scintillator detector. The invisible decay includes two decay modes: $ n \rightarrow { inv} $ and $ nn \rightarrow { inv} $. The invisible decays of $s$-shell neutrons in $^{12}{\rm C}$ will leave a highly excited residual nucleus. Subsequently, some de-excitation modes of the excited residual nuclei can produce a time- and space-correlated triple coincidence signal in the JUNO detector. Based on a full Monte Carlo simulation informed by the latest available data, we estimate all backgrounds, including inverse beta decay events of the reactor antineutrino $\barν_e$, natural radioactivity, cosmogenic isotopes and neutral current interactions of atmospheric neutrinos. Pulse shape discrimination and multivariate analysis techniques are employed to further suppress backgrounds. With two years of exposure, JUNO is expected to give an order of magnitude improvement compared to the current best limits. After 10 years of data taking, the JUNO expected sensitivities at a 90% confidence level are $τ/B( n \rightarrow { inv} ) > 5.0 \times 10^{31} \, {\rm yr}$ and $τ/B( nn \rightarrow { inv} ) > 1.4 \times 10^{32} \, {\rm yr}$.
Submitted 27 May, 2024;
originally announced May 2024.
-
CoCoGesture: Toward Coherent Co-speech 3D Gesture Generation in the Wild
Authors:
Xingqun Qi,
Hengyuan Zhang,
Yatian Wang,
Jiahao Pan,
Chen Liu,
Peng Li,
Xiaowei Chi,
Mengfei Li,
Qixun Zhang,
Wei Xue,
Shanghang Zhang,
Wenhan Luo,
Qifeng Liu,
Yike Guo
Abstract:
Deriving co-speech 3D gestures has seen tremendous progress in virtual avatar animation. Yet, the existing methods often produce stiff and unreasonable gestures with unseen human speech inputs due to the limited 3D speech-gesture data. In this paper, we propose CoCoGesture, a novel framework enabling vivid and diverse gesture synthesis from unseen human speech prompts. Our key insight is built upon the custom-designed pretrain-finetune training paradigm. At the pretraining stage, we aim to formulate a large generalizable gesture diffusion model by learning the abundant postures manifold. Therefore, to alleviate the scarcity of 3D data, we first construct a large-scale co-speech 3D gesture dataset containing more than 40M meshed posture instances across 4.3K speakers, dubbed GES-X. Then, we scale up the large unconditional diffusion model to 1B parameters and pre-train it to be our gesture experts. At the finetune stage, we present the audio ControlNet that incorporates the human voice as condition prompts to guide the gesture generation. Here, we construct the audio ControlNet through a trainable copy of our pre-trained diffusion model. Moreover, we design a novel Mixture-of-Gesture-Experts (MoGE) block to adaptively fuse the audio embedding from the human speech and the gesture features from the pre-trained gesture experts with a routing mechanism. This design ensures that the audio embedding is temporally coordinated with motion features while preserving vivid and diverse gesture generation. Extensive experiments demonstrate that our proposed CoCoGesture outperforms the state-of-the-art methods on the zero-shot speech-to-gesture generation. The dataset will be publicly available at: https://mattie-e.github.io/GES-X/
Submitted 27 May, 2024;
originally announced May 2024.
-
Transmission Interface Power Flow Adjustment: A Deep Reinforcement Learning Approach based on Multi-task Attribution Map
Authors:
Shunyu Liu,
Wei Luo,
Yanzhen Zhou,
Kaixuan Chen,
Quan Zhang,
Huating Xu,
Qinglai Guo,
Mingli Song
Abstract:
Transmission interface power flow adjustment is a critical measure to ensure the secure and economical operation of power systems. However, conventional model-based adjustment schemes are limited by the increasing variations and uncertainties occurring in power systems, where the adjustment problems of different transmission interfaces are often treated as several independent tasks, ignoring their coupling relationship and even leading to conflicting decisions. In this paper, we introduce a novel data-driven deep reinforcement learning (DRL) approach to handle multiple power flow adjustment tasks jointly instead of learning each task from scratch. At the heart of the proposed method is a multi-task attribution map (MAM), which enables the DRL agent to explicitly attribute each transmission interface task to different power system nodes with task-adaptive attention weights. Based on this MAM, the agent can further provide effective strategies to solve the multi-task adjustment problem with a near-optimal operation cost. Simulation results on the IEEE 118-bus system, a realistic 300-bus system in China, and a very large European system with 9241 buses demonstrate that the proposed method significantly improves the performance compared with several baseline methods, and exhibits high interpretability with the learnable MAM.
Submitted 24 May, 2024;
originally announced May 2024.
-
Enhancing Interaction Modeling with Agent Selection and Physical Methods for Trajectory Prediction
Authors:
Shiji Huang,
Lei Ye,
Min Chen,
Wenhai Luo,
Chenqi Xu,
Deyuan Liang,
Dihong Wang
Abstract:
In this study, we address the limitations inherent in most existing vehicle trajectory prediction methodologies that indiscriminately incorporate all agents within a predetermined proximity when accounting for inter-agent interactions. These approaches commonly employ attention-based architectures or graph neural networks for encoding interactions, which introduces three challenges: (i) The indiscriminate selection of all nearby agents substantially escalates the computational demands of the model, particularly in interaction-rich scenarios. (ii) Moreover, the simplistic feature extraction of current time agents falls short of adequately capturing the nuanced dynamics of interactions. (iii) Compounded by the inherently low interpretability of attention mechanisms and graph neural networks, there is a propensity for the model to allocate unreliable correlation coefficients to certain agents, adversely impacting the accuracy of trajectory predictions. To mitigate these issues, we introduce ASPILin, a novel approach that enhances the selection of interacting agents by considering their current and future lanes, extending this consideration across all historical frames. Utilizing the states of the agents, we estimate the nearest future distance between agents and the time needed to reach this distance. We then combine these with their current distances to derive a physical correlation coefficient to encode interactions. Experiments conducted on popular trajectory prediction datasets demonstrate that our method is efficient and straightforward, outperforming other state-of-the-art methods.
Submitted 21 May, 2024;
originally announced May 2024.
-
EmoEdit: Evoking Emotions through Image Manipulation
Authors:
Jingyuan Yang,
Jiawei Feng,
Weibin Luo,
Dani Lischinski,
Daniel Cohen-Or,
Hui Huang
Abstract:
Affective Image Manipulation (AIM) seeks to modify user-provided images to evoke specific emotional responses. This task is inherently complex due to its twofold objective: significantly evoking the intended emotion, while preserving the original image composition. Existing AIM methods primarily adjust color and style, often failing to elicit precise and profound emotional shifts. Drawing on psychological insights, we extend AIM by incorporating content modifications to enhance emotional impact. We introduce EmoEdit, a novel two-stage framework comprising emotion attribution and image editing. In the emotion attribution stage, we leverage a Vision-Language Model (VLM) to create hierarchies of semantic factors that represent abstract emotions. In the image editing stage, the VLM identifies the most relevant factors for the provided image, and guides a generative editing model to perform affective modifications. A ranking technique that we developed selects the best edit, balancing between emotion fidelity and structure integrity. To validate EmoEdit, we assembled a dataset of 416 images, categorized into positive, negative, and neutral classes. Our method is evaluated both qualitatively and quantitatively, demonstrating superior performance compared to existing state-of-the-art techniques. Additionally, we showcase EmoEdit's potential in various manipulation tasks, including emotion-oriented and semantics-oriented editing.
Submitted 21 May, 2024;
originally announced May 2024.
-
Era3D: High-Resolution Multiview Diffusion using Efficient Row-wise Attention
Authors:
Peng Li,
Yuan Liu,
Xiaoxiao Long,
Feihu Zhang,
Cheng Lin,
Mengfei Li,
Xingqun Qi,
Shanghang Zhang,
Wenhan Luo,
Ping Tan,
Wenping Wang,
Qifeng Liu,
Yike Guo
Abstract:
In this paper, we introduce Era3D, a novel multiview diffusion method that generates high-resolution multiview images from a single-view image. Despite significant advancements in multiview generation, existing methods still suffer from camera prior mismatch, inefficacy, and low resolution, resulting in poor-quality multiview images. Specifically, these methods assume that the input images should comply with a predefined camera type, e.g. a perspective camera with a fixed focal length, leading to distorted shapes when the assumption fails. Moreover, the full-image or dense multiview attention they employ leads to an exponential explosion of computational complexity as image resolution increases, resulting in prohibitively expensive training costs. To bridge the gap between assumption and reality, Era3D first proposes a diffusion-based camera prediction module to estimate the focal length and elevation of the input image, which allows our method to generate images without shape distortions. Furthermore, a simple but efficient attention layer, named row-wise attention, is used to enforce epipolar priors in the multiview diffusion, facilitating efficient cross-view information fusion. Consequently, compared with state-of-the-art methods, Era3D generates high-quality multiview images with up to a 512*512 resolution while reducing computation complexity by 12x. Comprehensive experiments demonstrate that Era3D can reconstruct high-quality and detailed 3D meshes from diverse single-view input images, significantly outperforming baseline multiview diffusion methods. Project page: https://penghtyx.github.io/Era3D/.
Submitted 29 May, 2024; v1 submitted 19 May, 2024;
originally announced May 2024.
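The Era3D abstract above contrasts dense multiview attention with row-wise attention. As a rough illustration of why restricting attention to rows helps, the sketch below counts query-key pairs in each scheme. It assumes the views are aligned so that corresponding points share an image row; the function names, the view count, and the feature-map size are hypothetical. It is not taken from the paper and does not reproduce the reported 12x figure, which depends on the full model.

```python
def dense_attention_pairs(num_views: int, size: int) -> int:
    """Query-key pairs when every token attends to every token of every view."""
    tokens = num_views * size * size
    return tokens * tokens


def row_wise_attention_pairs(num_views: int, size: int) -> int:
    """Query-key pairs when a token attends only to tokens in the same row, across all views."""
    tokens_per_row = num_views * size  # one row gathered from each view
    # The same-row attention is repeated independently for each of the `size` rows.
    return size * tokens_per_row * tokens_per_row


if __name__ == "__main__":
    n, s = 6, 64  # hypothetical view count and feature-map side length
    ratio = dense_attention_pairs(n, s) // row_wise_attention_pairs(n, s)
    print(f"dense / row-wise pair count = {ratio}")
```

Under these assumptions the pair count drops by a factor equal to the side length of the feature map, so the saving grows with resolution.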
-
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
Authors:
Yunxin Li,
Shenyuan Jiang,
Baotian Hu,
Longyue Wang,
Wanqi Zhong,
Wenhan Luo,
Lin Ma,
Min Zhang
Abstract:
Recent advancements in Multimodal Large Language Models (MLLMs) underscore the significance of scalable models and data to boost performance, yet this often incurs substantial computational costs. Although the Mixture of Experts (MoE) architecture has been employed to efficiently scale large language and image-text models, these efforts typically involve fewer experts and limited modalities. To address this, our work presents the pioneering attempt to develop a unified MLLM with the MoE architecture, named Uni-MoE that can handle a wide array of modalities. Specifically, it features modality-specific encoders with connectors for a unified multimodal representation. We also implement a sparse MoE architecture within the LLMs to enable efficient training and inference through modality-level data parallelism and expert-level model parallelism. To enhance the multi-expert collaboration and generalization, we present a progressive training strategy: 1) Cross-modality alignment using various connectors with different cross-modality data, 2) Training modality-specific experts with cross-modality instruction data to activate experts' preferences, and 3) Tuning the Uni-MoE framework utilizing Low-Rank Adaptation (LoRA) on mixed multimodal instruction data. We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets. The extensive experimental results demonstrate Uni-MoE's principal advantage of significantly reducing performance bias in handling mixed multimodal datasets, alongside improved multi-expert collaboration and generalization. Our findings highlight the substantial potential of MoE frameworks in advancing MLLMs and the code is available at https://github.com/HITsz-TMG/UMOE-Scaling-Unified-Multimodal-LLMs.
Submitted 18 May, 2024;
originally announced May 2024.
-
The well-posedness and blow up phenomenon for a Tsunamis model with time-fractional derivative
Authors:
Bingbing Dai,
Wei Luo,
Zhaoyang Yin,
Pei Zheng
Abstract:
This paper is concerned with the well-posedness of time-fractional shallow-water equations, which have received little attention. In the realm of fractional calculus, numerous types of fractional derivatives have been explored in the literature. Among these, one of the most notable and well-structured ones is the conformable fractional derivative. In this paper, we delve into the local well-posedness of the fractional tsunami shallow-water mathematical model in the critical Besov space $B^{\frac{3}{2}}_{2,1}$. Under some symmetric and sign conditions, we show that the strong solution will blow up in finite time.
Submitted 17 May, 2024;
originally announced May 2024.
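For readers unfamiliar with the conformable fractional derivative mentioned in the entry above, its standard definition is sketched here. The symbols $T_\alpha$, $f$, and $\varepsilon$ are notation chosen for this note and are not taken from the abstract; the paper may use a different convention.

```latex
% Conformable fractional derivative of order \alpha \in (0,1], for t > 0:
T_\alpha f(t) \;=\; \lim_{\varepsilon \to 0}
  \frac{f\!\left(t + \varepsilon\, t^{\,1-\alpha}\right) - f(t)}{\varepsilon}.

% If f is differentiable at t, this reduces to a weighted ordinary derivative:
T_\alpha f(t) \;=\; t^{\,1-\alpha}\, f'(t),
\qquad\text{so that } T_1 f = f'.
```

The second identity is what makes this derivative "well-structured": it obeys the usual product and chain rules up to the weight $t^{1-\alpha}$.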
-
IBD-PSC: Input-level Backdoor Detection via Parameter-oriented Scaling Consistency
Authors:
Linshan Hou,
Ruili Feng,
Zhongyun Hua,
Wei Luo,
Leo Yu Zhang,
Yiming Li
Abstract:
Deep neural networks (DNNs) are vulnerable to backdoor attacks, where adversaries can maliciously trigger model misclassifications by implanting a hidden backdoor during model training. This paper proposes a simple yet effective input-level backdoor detection (dubbed IBD-PSC) as a 'firewall' to filter out malicious testing images. Our method is motivated by an intriguing phenomenon, i.e., parameter-oriented scaling consistency (PSC), where the prediction confidences of poisoned samples are significantly more consistent than those of benign ones when amplifying model parameters. In particular, we provide theoretical analysis to safeguard the foundations of the PSC phenomenon. We also design an adaptive method to select BN layers to scale up for effective detection. Extensive experiments are conducted on benchmark datasets, verifying the effectiveness and efficiency of our IBD-PSC method and its resistance to adaptive attacks. Codes are available at BackdoorBox (https://github.com/THUYimingLi/BackdoorBox).
Submitted 2 June, 2024; v1 submitted 15 May, 2024;
originally announced May 2024.
-
Robust non-Abelian even-denominator fractional Chern insulator in twisted bilayer MoTe$_2$
Authors:
Feng Chen,
Wei-Wei Luo,
Wei Zhu,
D. N. Sheng
Abstract:
A recent experiment observes a series of quantum spin Hall effects in transition metal dichalcogenide moiré MoTe$_2$ [K. Kang et al., Nature 628, 522-526 (2024)]. Among them, the filling $ν=3$ state points to a time-reversal pair of edge states resembling those of the even-denominator fractional Chern insulators (FCIs). Inspired by this discovery, we investigate whether a robust incompressible quantum Hall liquid can be stabilized in the half-filled Chern band of twisted MoTe$_2$ bilayers. We use the continuum model with parameters relevant to twisted MoTe$_2$ bilayers and obtain three consecutive nearly flat Chern bands with the same Chern number. Crucially, when the second moiré miniband is half-filled, signatures of non-Abelian states are found via exact diagonalization calculations, including the stable six-fold ground state degeneracy which grows more robust for larger lattice sizes and is consistent with an even-denominator FCI state. We further perform flux insertion simulations to reveal a 1/2 quantized many-body Chern number as direct evidence of topological order. Furthermore, the ground state density structure factors show no sharp peak, indicating no charge density wave order. This evidence signals the potential of realizing the non-Abelian state at zero magnetic field in twisted bilayer MoTe$_2$ at the fractional hole filling 3/2.
Submitted 27 May, 2024; v1 submitted 14 May, 2024;
originally announced May 2024.
-
Testing Cotton gravity as dark matter substitute with weak lensing
Authors:
Geyu Mo,
Qingqing Wang,
Xin Ren,
Weitong Yan,
Yen Chin Ong,
Wentao Luo
Abstract:
Harada proposed a modified theory of gravity called Cotton gravity, and argued that it successfully explains the rotation curves of $84$ galaxies without the need for dark matter. In this work we use the galaxy-galaxy lensing technique to test whether the modification effect of Cotton gravity can indeed be a viable substitute for dark matter. Using the spherically symmetric solution of Cotton gravity, we obtain the deflection angle via the Gauss-Bonnet theorem and the weak lensing shear. We use five galaxy catalogs divided into 5 stellar mass bins from the Sloan Digital Sky Survey Data Release 7 (SDSS DR7), each of which is further divided into blue star-forming galaxy and red passive galaxy sub-catalogs. We find that Cotton gravity on its own deviates significantly from the measured galaxy-galaxy lensing signals, and thus cannot replace the role of dark matter. If we consider the combination of dark matter and Cotton gravity, the modification is tightly constrained. Our analysis also applies to other modified gravity theories in which an additional linear term appears in the Schwarzschild solution.
Submitted 12 May, 2024;
originally announced May 2024.
-
Aux-NAS: Exploiting Auxiliary Labels with Negligibly Extra Inference Cost
Authors:
Yuan Gao,
Weizhong Zhang,
Wenhan Luo,
Lin Ma,
Jin-Gang Yu,
Gui-Song Xia,
Jiayi Ma
Abstract:
We aim at exploiting additional auxiliary labels from an independent (auxiliary) task to boost the primary task performance which we focus on, while preserving a single task inference cost of the primary task. While most existing auxiliary learning methods are optimization-based relying on loss weights/gradients manipulation, our method is architecture-based with a flexible asymmetric structure for the primary and auxiliary tasks, which produces different networks for training and inference. Specifically, starting from two single task networks/branches (each representing a task), we propose a novel method with evolving networks where only primary-to-auxiliary links exist as the cross-task connections after convergence. These connections can be removed during the primary task inference, resulting in a single-task inference cost. We achieve this by formulating a Neural Architecture Search (NAS) problem, where we initialize bi-directional connections in the search space and guide the NAS optimization converging to an architecture with only the single-side primary-to-auxiliary connections. Moreover, our method can be incorporated with optimization-based auxiliary learning approaches. Extensive experiments with six tasks on NYU v2, CityScapes, and Taskonomy datasets using VGG, ResNet, and ViT backbones validate the promising performance. The codes are available at https://github.com/ethanygao/Aux-NAS.
Submitted 9 May, 2024;
originally announced May 2024.
-
Separated Pairs of Submodules in Hilbert $C^*$-modules
Authors:
R. Eskandari,
W. Luo,
M. S. Moslehian,
Q. Xu,
H. Zhang
Abstract:
We introduce the notion of the separated pair of closed submodules in the setting of Hilbert $C^*$-modules. We demonstrate that even in the case of Hilbert spaces this concept has several nice characterizations enriching the theory of separated pairs of subspaces in Hilbert spaces. Let $\mathscr H$ and $\mathscr K$ be orthogonally complemented closed submodules of a Hilbert $C^*$-module $\mathscr E$. We establish that $ (\mathscr H,\mathscr K)$ is a separated pair in $\mathscr{E}$ if and only if there are idempotents $Π_1$ and $Π_2$ such that $Π_1Π_2=Π_2Π_1=0$ and $\mathscr R(Π_1)=\mathscr H$ and $\mathscr R(Π_2)=\mathscr K$. We show that $\mathscr R(Π_1+λΠ_2)$ is closed for each $λ\in \mathbb{C}$ if and only if $\mathscr R(Π_1+Π_2)$ is closed.
We use the localization of Hilbert $C^*$-modules to define the angle between closed submodules. We prove that if $(\mathscr H^\perp,\mathscr K^\perp)$ is concordant, then $(\mathscr H^{\perp\perp},\mathscr K^{\perp\perp})$ is a separated pair if the cosine of this angle is less than one. We also present some surprising examples to illustrate our results.
Submitted 8 May, 2024;
originally announced May 2024.
-
Variational Schrödinger Diffusion Models
Authors:
Wei Deng,
Weijian Luo,
Yixin Tan,
Marin Biloš,
Yu Chen,
Yuriy Nevmyvaka,
Ricky T. Q. Chen
Abstract:
Schrödinger bridge (SB) has emerged as the go-to method for optimizing transportation plans in diffusion models. However, SB requires estimating the intractable forward score functions, inevitably resulting in the costly implicit training loss based on simulated trajectories. To improve the scalability while preserving efficient transportation plans, we leverage variational inference to linearize the forward score functions (variational scores) of SB and restore simulation-free properties in training backward scores. We propose the variational Schrödinger diffusion model (VSDM), where the forward process is a multivariate diffusion and the variational scores are adaptively optimized for efficient transport. Theoretically, we use stochastic approximation to prove the convergence of the variational scores and show the convergence of the adaptively generated samples based on the optimal variational scores. Empirically, we test the algorithm in simulated examples and observe that VSDM is efficient in generations of anisotropic shapes and yields straighter sample trajectories compared to the single-variate diffusion. We also verify the scalability of the algorithm in real-world data and achieve competitive unconditional generation performance in CIFAR10 and conditional generation in time series modeling. Notably, VSDM no longer depends on warm-up initializations and has become tuning-friendly in training large-scale experiments.
Submitted 19 June, 2024; v1 submitted 8 May, 2024;
originally announced May 2024.
-
Point-Spread Function errors for weak lensing - density cross-correlations. Application to UNIONS
Authors:
Ziwen Zhang,
Martin Kilbinger,
Fabian Hervas Peters,
Qinxun Li,
Wentao Luo,
Lucie Baumont,
Jean-Charles Cuillandre,
Sebastien Fabbro,
Stephen Gwyn,
Alan McConnachie,
Anna Wittje
Abstract:
Aims: Calibrating the point spread function (PSF) is a fundamental part of weak gravitational lensing analyses. Even with corrected galaxy images, imperfect calibrations can introduce biases. We propose an analytical framework for quantifying PSF-induced systematics as diagnostics for cross-correlation measurements of weak lensing with density tracers, e.g., galaxy-galaxy lensing. We show how those systematics propagate to physical parameters of the density tracers. Those diagnostics only require a shape catalogue of PSF stars and foreground galaxy positions. Methods: We consider the PSF-induced multiplicative bias, and introduce three second-order statistics as additive biases. We compute both biases for the weak-lensing derived halo mass of spectroscopic foreground galaxy samples, in particular, their effect on the tangential shear and fitted halo mass as a function of stellar mass. In addition, we assess their impact on the recently published black-hole - halo-mass relation for type I Active Galactic Nuclei (AGNs). Results: Using weak-lensing catalogues from the Ultraviolet Near Infrared Optical Northern Survey (UNIONS) and Dark Energy Survey (DES), we find the multiplicative biases in the tangential shear to be less than $0.5\%$. No correlations between additive bias and galaxy properties of the foreground sample are detected. The combined PSF systematics affect low-mass galaxies and small angular scales; halo mass estimates can be biased by up to 18$\%$ for a sample of central galaxies in the stellar mass range 9.0 $\leq$ log $M_*/\rm M_{\odot}$ < 9.5. Conclusions: The PSF-induced multiplicative bias is a subdominant contribution to current studies of weak-lensing - density cross-correlations, but might become significant for upcoming Stage-VI surveys. For samples with a low tangential shear, additive PSF systematics can induce a significant bias on derived properties such as halo mass.
Submitted 6 May, 2024;
originally announced May 2024.
-
Exploring Spatial Context: A Comprehensive Bibliography of GWR and MGWR
Authors:
A. Stewart Fotheringham,
Chen-Lun Kao,
Hanchen Yu,
Sarah Bardin,
Taylor Oshan,
Ziqi Li,
Mehak Sachdeva,
Wei Luo
Abstract:
Local spatial models such as Geographically Weighted Regression (GWR) and Multiscale Geographically Weighted Regression (MGWR) serve as instrumental tools to capture intrinsic contextual effects through the estimates of the local intercepts and behavioral contextual effects through estimates of the local slope parameters. GWR and MGWR provide simple implementation yet powerful frameworks that could be extended to various disciplines that handle spatial data. This bibliography aims to serve as a comprehensive compilation of peer-reviewed papers that have utilized GWR or MGWR as a primary analytical method to conduct spatial analyses and acts as a useful guide to anyone searching the literature for previous examples of local statistical modeling in a wide variety of application fields.
Submitted 24 April, 2024;
originally announced April 2024.
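As background for the GWR/MGWR entry above, the local estimator that standard GWR fits at each location can be written as follows. This is the textbook formulation, not a formula quoted from the bibliography; the symbols $(u_i, v_i)$, $W$, $K$, $d_{ij}$, and $h$ are notation assumed for this note.

```latex
% Local coefficient estimate at location (u_i, v_i), a weighted least-squares fit:
\hat{\boldsymbol\beta}(u_i, v_i)
  \;=\; \left( X^{\top} W(u_i, v_i)\, X \right)^{-1} X^{\top} W(u_i, v_i)\, \mathbf{y},

% where W is diagonal, with kernel weights that decay with distance d_{ij}
% from location i to observation j, controlled by a bandwidth h:
W(u_i, v_i) \;=\; \operatorname{diag}\!\big( K(d_{i1}/h), \dots, K(d_{in}/h) \big).
```

GWR uses a single bandwidth $h$ for all covariates, whereas MGWR allows a separate bandwidth for each covariate, so that different relationships can vary over different spatial scales.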
-
WaveSleepNet: An Interpretable Network for Expert-like Sleep Staging
Authors:
Yan Pei,
Wei Luo
Abstract:
Although deep learning algorithms have proven their efficiency in automatic sleep staging, the widespread skepticism about their "black-box" nature has limited their clinical acceptance. In this study, we propose WaveSleepNet, an interpretable neural network for sleep staging that reasons in a similar way to sleep experts. In this network, we utilize the latent space representations generated during training to identify characteristic wave prototypes corresponding to different sleep stages. The feature representation of an input signal is segmented into patches within the latent space, each of which is compared against the learned wave prototypes. The proximity between these patches and the wave prototypes is quantified through scores, indicating the prototypes' presence and relative proportion within the signal. The scores serve as the decision-making criteria for final sleep staging. During training, an ensemble of loss functions is employed for the prototypes' diversity and robustness. Furthermore, the learned wave prototypes are visualized by analysing occlusion sensitivity. The efficacy of WaveSleepNet is validated across three public datasets, achieving sleep staging performance that is on par with the state-of-the-art models when several WaveSleepNets are combined into a larger network. A detailed case study examined the decision-making process of WaveSleepNet, which aligns closely with American Academy of Sleep Medicine (AASM) manual guidelines. Another case study systematically explained the reasons behind the misidentifications of each sleep stage. WaveSleepNet's transparent process provides specialists with direct access to the physiological significance of its criteria, allowing for future adaptation or enrichment by sleep experts.
Submitted 10 April, 2024;
originally announced April 2024.
-
Effective Individual Fairest Community Search over Heterogeneous Information Networks
Authors:
Taige Zhao,
Jianxin Li,
Ningning Cui,
Wei Luo
Abstract:
Community search over heterogeneous information networks (HINs) has been applied to a wide range of domains, such as activity organization and team formation. In these scenarios, the members of a group who receive the same treatment often have different levels of activity and workloads, which causes unfairness in the treatment between active members and inactive members (called individual unfairness). However, existing works do not pay attention to individual fairness and do not sufficiently consider the rich semantics of HINs (e.g., high-order structure), which prevents complex queries. To fill the gap, we formally define the problem of individual fairest community search over HINs (denoted as IFCS), which aims to find a set of vertices from the HIN that have the same type, close relationships, and a small difference in activity level, and which has been demonstrated to be NP-hard. To solve it, we first develop an exploration-based filter that effectively reduces the search space of the community. Further, to avoid repeated computation and prune unfair communities in advance, we propose a message-based scheme and a lower bound-based scheme. Finally, we conduct extensive experiments on four real-world datasets to demonstrate the effectiveness and efficiency of our proposed algorithms, which are at least 3 times faster than the baseline solution.
Submitted 18 April, 2024;
originally announced April 2024.
-
A unified generalization of inverse regression via adaptive column selection
Authors:
Yin Jin,
Wei Luo
Abstract:
A bottleneck of sufficient dimension reduction (SDR) in the modern era is that, among numerous methods, only the sliced inverse regression (SIR) is generally applicable under the high-dimensional settings. The higher-order inverse regression methods, which form a major family of SDR methods that are superior to SIR in the population level, suffer from the dimensionality of their intermediate matrix-valued parameters that have an excessive number of columns. In this paper, we propose the generic idea of using a small subset of columns of the matrix-valued parameter for SDR estimation, which breaks the convention of using the ambient matrix for the higher-order inverse regression methods. With the aid of a quick column selection procedure, we then generalize these methods as well as their ensembles towards sparsity under the ultrahigh-dimensional settings, in a uniform manner that resembles sparse SIR and without additional assumptions. This is the first promising attempt in the literature to free the higher-order inverse regression methods from their dimensionality, which facilitates the applicability of SDR. The gain of column selection with respect to SDR estimation efficiency is also studied under the fixed-dimensional settings. Simulation studies and a real data example are provided at the end.
Submitted 12 April, 2024;
originally announced April 2024.
-
Homography Guided Temporal Fusion for Road Line and Marking Segmentation
Authors:
Shan Wang,
Chuong Nguyen,
Jiawei Liu,
Kaihao Zhang,
Wenhan Luo,
Yanhao Zhang,
Sundaram Muthu,
Fahira Afzal Maken,
Hongdong Li
Abstract:
Reliable segmentation of road lines and markings is critical to autonomous driving. Our work is motivated by the observations that road lines and markings are (1) frequently occluded in the presence of moving vehicles, shadow, and glare and (2) highly structured with low intra-class shape variance and overall high appearance consistency. To solve these issues, we propose a Homography Guided Fusion (HomoFusion) module to exploit temporally-adjacent video frames for complementary cues facilitating the correct classification of the partially occluded road lines or markings. To reduce computational complexity, a novel surface normal estimator is proposed to establish spatial correspondences between the sampled frames, allowing the HomoFusion module to perform a pixel-to-pixel attention mechanism in updating the representation of the occluded road lines or markings. Experiments on ApolloScape, a large-scale lane mark segmentation dataset, and ApolloScape Night with artificial simulated night-time road conditions, demonstrate that our method outperforms other existing SOTA lane mark segmentation models with less than 9% of their parameters and computational complexity. We show that exploiting available camera intrinsic data and ground plane assumption for cross-frame correspondence can lead to a light-weight network with significantly improved performances in speed and accuracy. We also prove the versatility of our HomoFusion approach by applying it to the problem of water puddle segmentation and achieving SOTA performance.
Submitted 11 April, 2024;
originally announced April 2024.
-
$\frac{5}{2}$ fractional quantum Hall state in GaAs with Landau level mixing
Authors:
Wenchen Luo,
Muaath Abdulwahab,
Xiang Liu,
Hao Wang
Abstract:
Landau level mixing is the key to understanding the mysterious $5/2$ fractional quantum Hall effect in GaAs quantum wells. Theoretical calculations with and without Landau level mixing show striking differences. However, the way to deal with the considerably strong Landau level mixing in GaAs is still unsatisfactory. We develop a method combining the screening and the perturbation theories to study the nature of the $5/2$ fractional quantum Hall effect in GaAs efficiently. The screening, which has succeeded in explaining ZnO systems, integrates out the low-energy Landau levels close to the related Landau level, while the other high-energy Landau levels are integrated out by the perturbation theory. We find that the ground states still hold the quasi-triplet degeneracy which implies the Pfaffian nature of the system. Furthermore, the particle-hole symmetry is only weakly violated since the particle-hole parity is close to unity. We propose that the ground state in the finite-size calculations can be approximated as a variational superposition of the Pfaffian and anti-Pfaffian states. In the experimental environment the symmetrized Pfaffian component is dominant, so the corresponding thermal conductance of around $2.5$ quanta can be understood.
Submitted 6 May, 2024; v1 submitted 3 April, 2024;
originally announced April 2024.
-
JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks
Authors:
Weidi Luo,
Siyuan Ma,
Xiaogeng Liu,
Xiaoyu Guo,
Chaowei Xiao
Abstract:
With the rapid advancements in Multimodal Large Language Models (MLLMs), securing these models against malicious inputs while aligning them with human values has emerged as a critical challenge. In this paper, we investigate an important and unexplored question of whether techniques that successfully jailbreak Large Language Models (LLMs) can be equally effective in jailbreaking MLLMs. To explore this issue, we introduce JailBreakV-28K, a pioneering benchmark designed to assess the transferability of LLM jailbreak techniques to MLLMs, thereby evaluating the robustness of MLLMs against diverse jailbreak attacks. Utilizing a dataset of 2,000 malicious queries that is also proposed in this paper, we generate 20,000 text-based jailbreak prompts using advanced jailbreak attacks on LLMs, alongside 8,000 image-based jailbreak inputs from recent MLLM jailbreak attacks; our comprehensive dataset thus includes 28,000 test cases across a spectrum of adversarial scenarios. Our evaluation of 10 open-source MLLMs reveals a notably high Attack Success Rate (ASR) for attacks transferred from LLMs, highlighting a critical vulnerability in MLLMs that stems from their text-processing capabilities. Our findings underscore the urgent need for future research to address alignment vulnerabilities in MLLMs from both textual and visual inputs.
Submitted 3 July, 2024; v1 submitted 3 April, 2024;
originally announced April 2024.
-
Context-Aware Integration of Language and Visual References for Natural Language Tracking
Authors:
Yanyan Shao,
Shuting He,
Qi Ye,
Yuchao Feng,
Wenhan Luo,
Jiming Chen
Abstract:
Tracking by natural language specification (TNL) aims to consistently localize a target in a video sequence given a linguistic description in the initial frame. Existing methodologies perform language-based and template-based matching for target reasoning separately and merge the matching results from the two sources. This approach suffers from tracking drift when the language and visual templates misalign with the dynamic target state, and from ambiguity in the later merging stage. To tackle these issues, we propose a joint multi-modal tracking framework with 1) a prompt modulation module to leverage the complementarity between temporal visual templates and language expressions, enabling precise and context-aware appearance and linguistic cues, and 2) a unified target decoding module to integrate the multi-modal reference cues and execute the integrated queries on the search image to predict the target location directly in an end-to-end manner. This design ensures spatio-temporal consistency by leveraging historical visual information and introduces an integrated solution, generating predictions in a single step. Extensive experiments conducted on TNL2K, OTB-Lang, LaSOT, and RefCOCOg validate the efficacy of our proposed approach. The results demonstrate competitive performance against state-of-the-art methods for both tracking and grounding.
Submitted 29 March, 2024;
originally announced March 2024.
-
Neural Image Compression with Quantization Rectifier
Authors:
Wei Luo,
Bo Chen
Abstract:
Neural image compression has been shown to outperform traditional image codecs in terms of rate-distortion performance. However, quantization introduces errors in the compression process, which can degrade the quality of the compressed image. Although existing approaches address the train-test mismatch problem incurred during quantization, the random impact of quantization on the expressiveness of image features remains unsolved. This paper presents a novel quantization rectifier (QR) method for image compression that leverages image feature correlation to mitigate the impact of quantization. Our method designs a neural network architecture that predicts unquantized features from the quantized ones, preserving feature expressiveness for better image reconstruction quality. We develop a soft-to-predictive training technique to integrate QR into existing neural image codecs. In evaluation, we integrate QR into state-of-the-art neural image codecs and compare enhanced models and baselines on the widely-used Kodak benchmark. The results show consistent coding efficiency improvement by QR with a negligible increase in the running time.
Submitted 25 March, 2024;
originally announced March 2024.
-
Scale Decoupled Distillation
Authors:
Shicai Wei,
Chunbo Luo,
Yang Luo
Abstract:
Logit knowledge distillation has attracted increasing attention in recent studies due to its practicality. However, it often suffers from inferior performance compared to feature knowledge distillation. In this paper, we argue that existing logit-based methods may be sub-optimal since they only leverage the global logit output, which couples multiple kinds of semantic knowledge. This may transfer ambiguous knowledge to the student and mislead its learning. To this end, we propose a simple but effective method, i.e., Scale Decoupled Distillation (SDD), for logit knowledge distillation. SDD decouples the global logit output into multiple local logit outputs and establishes distillation pipelines for them. This helps the student to mine and inherit fine-grained and unambiguous logit knowledge. Moreover, the decoupled knowledge can be further divided into consistent and complementary logit knowledge, which transfer the semantic information and sample ambiguity, respectively. By increasing the weight of complementary parts, SDD can guide the student to focus more on ambiguous samples, improving its discrimination ability. Extensive experiments on several benchmark datasets demonstrate the effectiveness of SDD for a wide range of teacher-student pairs, especially in the fine-grained classification task. Code is available at: https://github.com/shicaiwei123/SDD-CVPR2024
Submitted 20 March, 2024;
originally announced March 2024.
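The decoupling idea in the Scale Decoupled Distillation abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each network exposes a square map of per-position class logits, obtains local logits by averaging that map over grid cells at several scales, treats a local region as "consistent" when its teacher prediction matches the global teacher prediction and "complementary" otherwise, and up-weights complementary regions. All function names, the scales, the temperature, and the weight value are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def region_logits(logit_map, scale):
    """Average per-position logits over each cell of a scale x scale grid.

    `logit_map` is a square nested list [row][col][class]; its side length
    is assumed to be divisible by `scale`. Scale 1 gives the global logits.
    """
    step = len(logit_map) // scale
    n_classes = len(logit_map[0][0])
    regions = []
    for r in range(scale):
        for c in range(scale):
            cells = [logit_map[i][j]
                     for i in range(r * step, (r + 1) * step)
                     for j in range(c * step, (c + 1) * step)]
            regions.append([sum(v[k] for v in cells) / len(cells)
                            for k in range(n_classes)])
    return regions


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda k: values[k])


def decoupled_distillation_loss(teacher_map, student_map, scales=(1, 2),
                                temperature=4.0, complementary_weight=2.0):
    """Sum of weighted KL terms over local regions at every scale."""
    global_pred = argmax(region_logits(teacher_map, 1)[0])
    loss = 0.0
    for scale in scales:
        teacher_regions = region_logits(teacher_map, scale)
        student_regions = region_logits(student_map, scale)
        for t_logits, s_logits in zip(teacher_regions, student_regions):
            # Complementary regions disagree with the global teacher prediction.
            consistent = argmax(t_logits) == global_pred
            weight = 1.0 if consistent else complementary_weight
            loss += weight * kl_divergence(softmax(t_logits, temperature),
                                           softmax(s_logits, temperature))
    return loss


# Hypothetical 2x2 logit maps with two classes, for illustration only.
teacher = [[[2.0, 0.0], [2.0, 0.0]],
           [[0.0, 1.0], [2.0, 0.0]]]
student = [[[0.0, 0.0], [0.0, 0.0]],
           [[0.0, 0.0], [0.0, 0.0]]]
toy_loss = decoupled_distillation_loss(teacher, student)
```

In the toy teacher map, the bottom-left position prefers class 1 while the global average prefers class 0, so that region is treated as complementary and its KL term is doubled. The repository linked in the abstract is the place to check the actual design.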
-
Optical manipulation of the topological phase in ZrTe5 revealed by time- and angle-resolved photoemission
Authors:
Chaozhi Huang,
Chengyang Xu,
Fengfeng Zhu,
Shaofeng Duan,
Jianzhe Liu,
Lingxiao Gu,
Shichong Wang,
Haoran Liu,
Dong Qian,
Weidong Luo,
Wentao Zhang
Abstract:
High-resolution time- and angle-resolved photoemission measurements were conducted on the topological insulator ZrTe5. With strong femtosecond photoexcitation, a possible ultrafast phase transition from a weak to a strong topological insulating phase was experimentally realized by recovering the energy gap inversion on a time scale shorter than 0.15 ps. This photoinduced transient strong topological phase can last longer than 2 ps at the highest excitation fluence studied, and it cannot be attributed to the photoinduced heating of electrons or modification of the conduction band filling. Additionally, the measured unoccupied electronic states are consistent with the first-principles calculation based on experimental crystal lattice constants, which favor a strong topological insulating phase. These findings provide new insights into the longstanding controversy about the strong and weak topological properties in ZrTe5, and they suggest that many-body effects including electron-electron interactions must be taken into account to understand the equilibrium weak topological insulating phase in ZrTe5.
Submitted 18 March, 2024;
originally announced March 2024.