Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training

Authors: Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Jiahao Xu, Tian Liang, Pinjia He, Zhaopeng Tu

Abstract: This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs) by identifying and tackling a refusal position bias within safety tuning data, which compromises the models' ability to appropriately refuse generating unsafe content. We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance with harmful prompts at any response position, significantly enhancing their safety capabilities. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation (MLE) with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by prepending a segment of harmful response to a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence. Our empirical evaluation, conducted using LLaMA3 and Mistral model families across six attack scenarios, demonstrates that our method not only improves model safety without compromising performance but also surpasses well-known models such as GPT-4 in defending against attacks. Importantly, our approach successfully defends against recent advanced attack methods (e.g., CodeAttack) that have jailbroken GPT-4 and LLaMA3-70B-Instruct. Our code and data can be found at https://github.com/RobustNLP/DeRTa.

Submitted 12 July, 2024; originally announced July 2024.
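
The "MLE with Harmful Response Prefix" component in the abstract above can be pictured as a data-construction step. The sketch below is illustrative only and is not taken from the authors' repository: the function name `build_prefixed_example`, the token lists, and the choice to compute the loss only on the safe-response tokens are all assumptions made for this example.

```python
import random
from typing import List, Tuple


def build_prefixed_example(
    prompt: List[str],
    harmful: List[str],
    safe: List[str],
    rng: random.Random,
) -> Tuple[List[str], List[int]]:
    """Place a random-length harmful prefix in front of a safe response.

    Returns the token sequence and a 0/1 loss mask. Masking the loss to
    the safe tokens is an assumption of this sketch, not a detail stated
    in the abstract.
    """
    cut = rng.randint(0, len(harmful))  # prefix length, possibly zero
    prefix = harmful[:cut]
    tokens = prompt + prefix + safe
    # Only the safe-response tokens contribute to the training loss here.
    loss_mask = [0] * (len(prompt) + len(prefix)) + [1] * len(safe)
    return tokens, loss_mask


# Placeholder tokens stand in for real text.
prompt = ["<user>", "q1", "q2"]
harmful = ["h1", "h2", "h3", "h4"]
safe = ["Sorry,", "I", "cannot", "help."]
tokens, loss_mask = build_prefixed_example(prompt, harmful, safe, random.Random(0))
```

Because the prefix length is sampled, the refusal can begin at different positions across training examples, which is the property the abstract attributes to this component.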

arXiv:2407.09111 [pdf, other]

Inference Optimization of Foundation Models on AI Accelerators

Authors: Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis

Abstract: Powerful foundation models, including large language models (LLMs), with Transformer architectures have ushered in a new era of Generative AI across various industries. Industry and the research community have witnessed a large number of new applications based on those foundation models. Such applications include question answering, customer service, image and video generation, and code completion, among others. However, as the number of model parameters reaches hundreds of billions, their deployment incurs prohibitive inference costs and high latency in real-world scenarios. As a result, the demand for cost-effective and fast inference using AI accelerators is ever higher. To this end, our tutorial offers a comprehensive discussion on complementary inference optimization techniques using AI accelerators. Beginning with an overview of basic Transformer architectures and deep learning system frameworks, we dive deep into system optimization techniques for fast and memory-efficient attention computations and discuss how they can be implemented efficiently on AI accelerators. Next, we describe architectural elements that are key for fast transformer inference. Finally, we examine various model compression and fast decoding strategies in the same context.

Submitted 12 July, 2024; originally announced July 2024.

Comments: Tutorial published at KDD 2024. Camera-ready version

arXiv:2407.08112 [pdf, other]

How Well Can a Long Sequence Model Model Long Sequences? Comparing Architechtural Inductive Biases on Long-Context Abilities

Authors: Jerry Huang

Abstract: Long sequences occur in abundance within real-world scenarios, hence properly modelling them opens numerous downstream use-cases. Deep neural networks, however, have often struggled with these for a variety of reasons. Recent advances, both in system engineering as well as model design, have enabled the scaling up of models that are purported to support extended context lengths. In particular, the state-space and linear recurrent neural network families of models can hypothetically extend to infinite sequence length. However, is this too good to be true? We conduct an evaluation to show that while such claims may be sound theoretically, there remain large practical gaps that are empirically observed. In particular, recurrent models still suffer in the same settings as long-context LLMs with attention. We further show that different inductive biases have inconsistent extrapolation capabilities, highlighting the need to further study such paradigms and investigate why long-context models seemingly fail to behave as one might expect.

Submitted 10 July, 2024; originally announced July 2024.

Comments: Work In Progress. 9 pages

arXiv:2407.06567 [pdf, other]

FinCon: A Synthesized LLM Multi-Agent System with Conceptual Verbal Reinforcement for Enhanced Financial Decision Making

Authors: Yangyang Yu, Zhiyuan Yao, Haohang Li, Zhiyang Deng, Yupeng Cao, Zhi Chen, Jordan W. Suchow, Rong Liu, Zhenyu Cui, Denghui Zhang, Koduvayur Subbalakshmi, Guojun Xiong, Yueru He, Jimin Huang, Dong Li, Qianqian Xie

Abstract: Large language models (LLMs) have demonstrated notable potential in conducting complex tasks and are increasingly utilized in various financial applications. However, high-quality sequential financial investment decision-making remains challenging. These tasks require multiple interactions with a volatile environment for every decision, demanding sufficient intelligence to maximize returns and manage risks. Although LLMs have been used to develop agent systems that surpass human teams and yield impressive investment returns, opportunities to enhance multi-sourced information synthesis and optimize decision-making outcomes through timely experience refinement remain unexplored. Here, we introduce FinCon, an LLM-based multi-agent framework with CONceptual verbal reinforcement tailored for diverse FINancial tasks. Inspired by effective real-world investment firm organizational structures, FinCon utilizes a manager-analyst communication hierarchy. This structure allows for synchronized cross-functional agent collaboration towards unified goals through natural language interactions and equips each agent with greater memory capacity than humans. Additionally, a risk-control component in FinCon enhances decision quality by episodically initiating a self-critiquing mechanism to update systematic investment beliefs. The conceptualized beliefs serve as verbal reinforcement for the agent's future behavior and can be selectively propagated to the appropriate node that requires knowledge updates. This feature significantly improves performance while reducing unnecessary peer-to-peer communication costs. Moreover, FinCon demonstrates strong generalization capabilities in various financial tasks, including single stock trading and portfolio management.

Submitted 10 July, 2024; v1 submitted 9 July, 2024; originally announced July 2024.

Comments: LLM Applications, LLM Agents, Financial Technology, Quantitative Finance, Algorithmic Trading, Cognitive Science

arXiv:2407.06546 [pdf, other]

Exploring the Causality of End-to-End Autonomous Driving

Authors: Jiankun Li, Hao Li, Jiangjiang Liu, Zhikang Zou, Xiaoqing Ye, Fan Wang, Jizhou Huang, Hua Wu, Haifeng Wang

Abstract: Deep learning-based models are widely deployed in autonomous driving areas, especially the increasingly noticed end-to-end solutions. However, the black-box property of these models raises concerns about their trustworthiness and safety for autonomous driving, and how to debug the causality has become a pressing concern. Despite some existing research on the explainability of autonomous driving, there is currently no systematic solution to help researchers debug and identify the key factors that lead to the final predicted action of end-to-end autonomous driving. In this work, we propose a comprehensive approach to explore and analyze the causality of end-to-end autonomous driving. First, we validate the essential information that the final planning depends on by using controlled variables and counterfactual interventions for qualitative analysis. Then, we quantitatively assess the factors influencing model decisions by visualizing and statistically analyzing the response of key model inputs. Finally, based on the comprehensive study of the multi-factorial end-to-end autonomous driving system, we have developed a strong baseline and a tool for exploring causality in the closed-loop simulator CARLA. It leverages the essential input sources to obtain a well-designed model, resulting in highly competitive capabilities. As far as we know, our work is the first to unveil the mystery of end-to-end autonomous driving and turn the black box into a white one. Thorough closed-loop experiments demonstrate that our method can be applied to end-to-end autonomous driving solutions for causality debugging. Code will be available at https://github.com/bdvisl/DriveInsight.

Submitted 9 July, 2024; originally announced July 2024.

arXiv:2407.06317 [pdf, other]

Enhanced Safety in Autonomous Driving: Integrating Latent State Diffusion Model for End-to-End Navigation

Authors: Jianuo Huang, Zhenlong Fang

Abstract: With the advancement of autonomous driving, ensuring safety during motion planning and navigation is becoming more and more important. However, most end-to-end planning methods suffer from a lack of safety. This research addresses the safety issue in the control optimization problem of autonomous driving, formulated as Constrained Markov Decision Processes (CMDPs). We propose a novel, model-based approach for policy optimization, utilizing a conditional Value-at-Risk based Soft Actor Critic to manage constraints in complex, high-dimensional state spaces effectively. Our method introduces a worst-case actor to guide safe exploration, ensuring rigorous adherence to safety requirements even in unpredictable scenarios. The policy optimization employs the Augmented Lagrangian method and leverages latent diffusion models to predict and simulate future trajectories. This dual approach not only aids in navigating environments safely but also refines the policy's performance by integrating distribution modeling to account for environmental uncertainties. Empirical evaluations conducted in both simulated and real environments demonstrate that our approach outperforms existing methods in terms of safety, efficiency, and decision-making capabilities.

Submitted 9 July, 2024; v1 submitted 8 July, 2024; originally announced July 2024.
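
The abstract above relies on conditional Value-at-Risk (CVaR) as its risk measure. As a rough illustration of what that quantity is, and not of how the authors estimate it inside their actor-critic, the sketch below computes a simple empirical CVaR from sampled costs: the mean of the worst (1 - alpha) fraction of samples. The function name `empirical_cvar` and the sample values are assumptions made for this example.

```python
import math
from typing import Sequence


def empirical_cvar(costs: Sequence[float], alpha: float = 0.9) -> float:
    """Mean of the worst (1 - alpha) fraction of sampled costs.

    A plain sample-based estimator; higher cost is treated as worse.
    """
    if not costs:
        raise ValueError("costs must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    ordered = sorted(costs, reverse=True)  # worst outcomes first
    # Round before ceil so float noise such as 3.0000000000000004
    # does not pull one extra sample into the tail.
    k = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    tail = ordered[:k]
    return sum(tail) / len(tail)


costs = [float(c) for c in range(1, 11)]      # 1.0, 2.0, ..., 10.0
tail_risk = empirical_cvar(costs, alpha=0.8)  # mean of the two largest costs
average = empirical_cvar(costs, alpha=0.0)    # reduces to the plain mean
```

Constraining this tail mean rather than the plain expectation penalizes rare but severe outcomes, which is the usual motivation for CVaR in safety-constrained control.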

arXiv:2407.06204 [pdf, other]

A Survey on Mixture of Experts

Authors: Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, Jiayi Huang

Abstract: Large language models (LLMs) have garnered unprecedented advancements across diverse fields, ranging from natural language processing to computer vision and beyond. The prowess of LLMs is underpinned by their substantial model size, extensive and diverse datasets, and the vast computational power harnessed during training, all of which contribute to the emergent abilities of LLMs (e.g., in-context learning) that are not present in small models. Within this context, the mixture of experts (MoE) has emerged as an effective method for substantially scaling up model capacity with minimal computation overhead, gaining significant attention from academia and industry. Despite its growing prevalence, a systematic and comprehensive review of the literature on MoE is lacking. This survey seeks to bridge that gap, serving as an essential resource for researchers delving into the intricacies of MoE. We first briefly introduce the structure of the MoE layer, followed by proposing a new taxonomy of MoE. Next, we overview the core designs for various MoE models including both algorithmic and systemic aspects, alongside collections of available open-source implementations, hyperparameter configurations and empirical evaluations. Furthermore, we delineate the multifaceted applications of MoE in practice, and outline some potential directions for future research. To facilitate ongoing updates and the sharing of cutting-edge developments in MoE research, we have established a resource repository accessible at https://github.com/withinmiaov/A-Survey-on-Mixture-of-Experts.

Submitted 26 June, 2024; originally announced July 2024.

arXiv:2407.05749 [pdf, other]

LDGCN: An Edge-End Lightweight Dual GCN Based on Single-Channel EEG for Driver Drowsiness Monitoring

Authors: Jingwei Huang, Chuansheng Wang, Jiayan Huang, Haoyi Fan, Antoni Grau, Fuquan Zhang

Abstract: Driver drowsiness electroencephalography (EEG) signal monitoring can alert drivers to their drowsiness status in a timely manner, thereby reducing the probability of traffic accidents. Graph convolutional networks (GCNs) have shown significant advancements in processing the non-stationary, time-varying, and non-Euclidean nature of EEG signals. However, the existing single-channel EEG adjacency graph construction process lacks interpretability, which hinders the ability of GCNs to effectively extract adjacency graph features, thus affecting the performance of drowsiness monitoring. To address this issue, we propose an edge-end lightweight dual graph convolutional network (LDGCN). Specifically, we are the first to incorporate neurophysiological knowledge to design a Baseline Drowsiness Status Adjacency Graph (BDSAG), which characterizes driver drowsiness status. Additionally, to express more features within limited EEG data, we introduce the Augmented Graph-level Module (AGM). This module captures global and local information at the graph level, ensuring that BDSAG features remain intact while enhancing effective feature expression capability. Furthermore, to deploy our method on the fourth-generation Raspberry Pi, we utilize Adaptive Pruning Optimization (APO) on both channels and neurons, reducing inference latency by almost half. Experiments on benchmark datasets demonstrate that LDGCN offers the best trade-off between monitoring performance and hardware resource utilization compared to existing state-of-the-art algorithms. All our source code can be found at https://github.com/BryantDom/Driver-Drowsiness-Monitoring.

Submitted 8 July, 2024; originally announced July 2024.

arXiv:2407.05679 [pdf, other]

BEVWorld: A Multimodal World Model for Autonomous Driving via Unified BEV Latent Space

Authors: Yumeng Zhang, Shi Gong, Kaixin Xiong, Xiaoqing Ye, Xiao Tan, Fan Wang, Jizhou Huang, Hua Wu, Haifeng Wang

Abstract: World models are receiving increasing attention in autonomous driving for their ability to predict potential future scenarios. In this paper, we present BEVWorld, a novel approach that tokenizes multimodal sensor inputs into a unified and compact Bird's Eye View (BEV) latent space for environment modeling. The world model consists of two parts: the multi-modal tokenizer and the latent BEV sequence diffusion model. The multi-modal tokenizer first encodes multi-modality information and the decoder is able to reconstruct the latent BEV tokens into LiDAR and image observations by ray-casting rendering in a self-supervised manner. Then the latent BEV sequence diffusion model predicts future scenarios given action tokens as conditions. Experiments demonstrate the effectiveness of BEVWorld in autonomous driving tasks, showcasing its capability in generating future scenes and benefiting downstream tasks such as perception and motion prediction. Code will be available at https://github.com/zympsyche/BevWorld.

Submitted 8 July, 2024; originally announced July 2024.

Comments: 10 pages

arXiv:2407.05573 [pdf]

Spatio-Temporal Encoding and Decoding-Based Method for Future Human Activity Skeleton Synthesis

Authors: Tingyu Liu, Jun Huang, Chenyi Weng

Abstract: Inferring future activity information based on observed activity data is a crucial step to improve the accuracy of early activity prediction. Traditional methods based on generative adversarial networks (GANs) or joint learning frameworks can achieve good prediction accuracy under low observation ratios, but they usually have high computational costs. In view of this, this paper proposes a spatio-temporal encoding and decoding-based method for future human activity skeleton synthesis. Firstly, algorithms such as time control, discrete cosine transform, and low-pass filtering are used to cut or pad the skeleton sequences. Secondly, the encoder and decoder are responsible for extracting intermediate semantic encoding from observed skeleton sequences and inferring future sequences from the intermediate semantic encoding, respectively. Finally, joint displacement error, velocity error, and acceleration error, three higher-order kinematic features, are used as key components of the loss function to optimize model parameters. Experimental results show that the proposed future skeleton synthesis algorithm performs better than some existing algorithms. It generates skeleton sequences with smaller errors and fewer model parameters, effectively providing future information for early activity prediction.

Submitted 7 July, 2024; originally announced July 2024.
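
The abstract above names joint displacement, velocity, and acceleration errors as components of the loss. One plausible reading, sketched below, takes velocity and acceleration as first and second finite differences over frames and sums the three mean joint errors. The weights, the Euclidean distance, and the names `diff`, `mean_l2`, and `kinematic_loss` are assumptions of this example; the listing does not give the paper's exact formulation.

```python
from typing import List

Frame = List[List[float]]   # joints x coordinates for one frame
Sequence = List[Frame]      # frames x joints x coordinates


def diff(seq: Sequence) -> Sequence:
    """Finite difference along time: frame t+1 minus frame t."""
    return [
        [[b - a for a, b in zip(j0, j1)] for j0, j1 in zip(f0, f1)]
        for f0, f1 in zip(seq, seq[1:])
    ]


def mean_l2(pred: Sequence, target: Sequence) -> float:
    """Mean Euclidean distance between corresponding joints."""
    dists = [
        sum((p - t) ** 2 for p, t in zip(jp, jt)) ** 0.5
        for fp, ft in zip(pred, target)
        for jp, jt in zip(fp, ft)
    ]
    return sum(dists) / len(dists)


def kinematic_loss(pred: Sequence, target: Sequence,
                   w_pos: float = 1.0, w_vel: float = 1.0,
                   w_acc: float = 1.0) -> float:
    """Weighted sum of displacement, velocity, and acceleration errors.

    Needs at least three frames so the second difference is non-empty.
    """
    vel_p, vel_t = diff(pred), diff(target)
    acc_p, acc_t = diff(vel_p), diff(vel_t)
    return (w_pos * mean_l2(pred, target)
            + w_vel * mean_l2(vel_p, vel_t)
            + w_acc * mean_l2(acc_p, acc_t))


# One joint moving along x; the prediction is offset by a constant (3, 4).
target = [[[0.0, 0.0]], [[1.0, 0.0]], [[2.0, 0.0]]]
pred = [[[3.0, 4.0]], [[4.0, 4.0]], [[5.0, 4.0]]]
offset_loss = kinematic_loss(pred, target)
```

In this toy case only the displacement term is non-zero, since a constant offset leaves velocities and accelerations unchanged; the higher-order terms instead penalize jitter and implausible motion.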

arXiv:2407.05350 [pdf]

Multiple boundary states in bilayer and decorated Su-Schrieffer-Heeger-like models

Authors: Shengqun Guo, Jinke Huang, Ruimin Huang, Fengjiang Zhuang, Zhili Lin, Weibin Qiu

Abstract: Topological boundary states have attracted widespread fascination due to their series of intriguing properties. In this paper, we investigate the multiple boundary states within two kinds of extended Su-Schrieffer-Heeger (SSH) models. The coexistence of boundary states that exist both in the bulk and band gaps is realized based on the bilayer SSH-like model, which consists of two conventional square-root SSH models that are directly coupled. We further show the square-root topology within the decorated SSH-like model, which supports multiple boundary states that could be embedded into the bulk continuum by tuning the hopping parameters. In addition, the connection between the decorated SSH-like model and its effectively decomposed counterparts is revealed. Our results broaden insight into the multiple boundary states and open up an exciting avenue for the future exploration of square-root topology.

Submitted 7 July, 2024; originally announced July 2024.

arXiv:2407.04932 [pdf, other]

Helicity-changing Decays of Cosmological Relic Neutrinos

Authors: Jihong Huang, Shun Zhou

Abstract: In this paper, we examine the possibility that massive neutrinos are unstable due to their invisible decays $ν^{}_i \to ν^{}_j + φ$, where $ν^{}_i$ and $ν^{}_j$ (for $i, j = 1, 2, 3$) are any two of neutrino mass eigenstates with masses $m^{}_i > m^{}_j$ and $φ$ is a massless Nambu-Goldstone boson, and explore the implications for the detection of cosmological relic neutrinos in the present Universe. First, we carry out a complete calculation of neutrino decay rates in the general case where the individual helicities of parent and daughter neutrinos are specified. Then, the invisible decays of cosmological relic neutrinos are studied and their impact on the capture rates on the beta-decaying nuclei (e.g., $ν^{}_e + {^3{\rm H}} \to {^3{\rm He}} + e^-$) is analyzed. The invisible decays of massive neutrinos could substantially change the capture rates in the PTOLEMY-like experiments when compared to the case of stable neutrinos. In particular, we find that the helicity-changing decays of Dirac neutrinos play an important role whereas those of Majorana neutrinos have no practical effects.

Submitted 5 July, 2024; originally announced July 2024.

Comments: 28 pages, 7 figures

arXiv:2407.04069 [pdf, other]

A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations

Authors: Md Tahmid Rahman Laskar, Sawsan Alqahtani, M Saiful Bari, Mizanur Rahman, Mohammad Abdullah Matin Khan, Haidar Khan, Israt Jahan, Amran Bhuiyan, Chee Wei Tan, Md Rizwan Parvez, Enamul Hoque, Shafiq Joty, Jimmy Huang

Abstract: Large Language Models (LLMs) have recently gained significant attention due to their remarkable capabilities in performing diverse tasks across various domains. However, a thorough evaluation of these models is crucial before deploying them in real-world applications to ensure they produce reliable performance. Despite the well-established importance of evaluating LLMs in the community, the complexity of the evaluation process has led to varied evaluation setups, causing inconsistencies in findings and interpretations. To address this, we systematically review the primary challenges and limitations causing these inconsistencies and unreliable evaluations in various steps of LLM evaluation. Based on our critical review, we present our perspectives and recommendations to ensure LLM evaluations are reproducible, reliable, and robust.

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2407.03202 [pdf, other]

Clifford Circuits Augmented Time-Dependent Variational Principle

Authors: Xiangjian Qian, Jiale Huang, Mingpu Qin

Abstract: The recently proposed Clifford Circuits Augmented Matrix Product States (CA-MPS) (arXiv:2405.09217) seamlessly augments Density Matrix Renormalization Group (DMRG) with Clifford circuits. In CA-MPS, the entanglement from stabilizers is transferred to the Clifford circuits, which can be easily handled according to the Gottesman-Knill theorem. As a result, the MPS needs only to deal with the non-stabilizer entanglement, which greatly reduces the bond dimension and the resources required for the accurate simulation of many-body systems. In this work, we generalize CA-MPS to the framework of Time-Dependent Variational Principle (TDVP) for time evolution simulations. In this method, we apply Clifford circuits to the resulting MPS in each TDVP step with a two-site sweeping process similar to that in DMRG, aiming at reducing the entanglement entropy in the MPS, and the Hamiltonian is transformed accordingly using the chosen Clifford circuits. As in CA-MPS, the Clifford circuits do not increase the number of terms in the Hamiltonian, which makes the overhead very small in the new method. We test this method in both the XXZ chain and the two-dimensional Heisenberg model. The results show that the Clifford circuits augmented TDVP method can reduce the entanglement entropy in the time evolution process and hence makes the simulation reliable for longer times. The Clifford circuits augmented Time-Dependent Variational Principle provides a useful tool for the simulation of time evolution process of many-body systems in the future.

Submitted 3 July, 2024; originally announced July 2024.

arXiv:2407.02973 [pdf, other]

NOEMA formIng Cluster survEy (NICE): Characterizing eight massive galaxy groups at $1.5 < z < 4$ in the COSMOS field

Authors: Nikolaj B. Sillassen, Shuowen Jin, Georgios E. Magdis, Emanuele Daddi, Tao Wang, Shiying Lu, Hanwen Sun, Vinod Arumugam, Daizhong Liu, Malte Brinch, Chiara D'Eugenio, Raphael Gobat, Carlos Gómez-Guijarro, Michael Rich, Eva Schinnerer, Veronica Strazzullo, Qinghua Tan, Francesco Valentino, Yijun Wang, Mengyuan Xiao, Luwenjia Zhou, David Blánquez-Sesé, Zheng Cai, Yanmei Chen, Laure Ciesla, et al. (19 additional authors not shown)

Abstract: The NOEMA formIng Cluster survEy (NICE) is a large program targeting 69 massive galaxy group candidates at $z>2$ in six deep fields. We report spectroscopic confirmation of eight groups at $1.65\leq z\leq3.61$ in COSMOS. Homogeneously selected as significant overdensities of red IRAC sources with red Herschel colors, four groups are confirmed by CO and [CI] with NOEMA 3mm observations, three are confirmed with ALMA, and one is confirmed by H$α$ from Subaru/FMOS. We constructed the integrated FIR SEDs for the eight groups, obtaining total IR SFR $=260-1300~{\rm M_\odot}$~yr$^{-1}$. We adopted six methods to estimate the dark matter masses, including stellar mass to halo mass relations, overdensity with galaxy bias, and NFW profile fitting to radial stellar mass density. We found the radial stellar mass densities are consistent with an NFW profile, supporting that they are collapsed structures hosted by a single dark matter halo. The best halo mass estimates are $\log(M_{\rm h}/{\rm M_\odot})=12.8-13.7$ with uncertainty of 0.3 dex. From halo mass estimates, we derive baryonic accretion rate ${\rm BAR}=(1-8)\times10^{3}\,{\rm M_{\odot}/yr}$ for this sample. We find a quasi-linear correlation between the integrated SFR/BAR and the theoretical halo mass limit for cold streams, $M_{\rm stream}/M_{\rm h}$, with ${\rm SFR/BAR}=10^{-0.46\pm0.22}\left({M_{\rm stream}/M_{\rm h}}\right)^{0.71\pm0.16}$ with a scatter of $0.40\,{\rm dex}$. Further, we compare halo masses and stellar masses with simulations, and find all structures are consistent with being progenitors of $M_{\rm h}(z=0)>10^{14}\,{\rm M_{\odot}}$ galaxy clusters, and the most massive central galaxies have stellar masses consistent with brightest cluster galaxies (BCGs) progenitors in the TNG300 simulation. The results strongly suggest these structures are forming massive galaxy clusters via baryonic and dark matter accretion.

Submitted 5 July, 2024; v1 submitted 3 July, 2024; originally announced July 2024.

Comments: 44 pages (27pp appendix), 32 figures, 18 tables, accepted for publication in A&A

arXiv:2407.02803 [pdf, other]

KnobCF: Uncertainty-aware Knob Tuning

Authors: Yu Yan, Junfang Huang, Hongzhi Wang, Jian Geng, Kaixin Zhang, Tao Yu

Abstract: Knob tuning aims to optimize database performance by searching for the most effective knob configuration under a certain workload. Existing works suffer from two significant problems. On the one hand, there exist multiple similar or even useless evaluations of knob tuning, even with diverse searching methods, because of the different sensitivities of knobs on a certain workload. On the other hand, a single evaluation of a knob configuration may lead to overestimation or underestimation because of uncertainty in query performance. To solve the above problems, we propose a decoupled query uncertainty-aware knob classifier, called KnobCF, to enhance knob tuning. Our method has three significant contributions: (1) We propose a novel concept of uncertainty-aware knob configuration estimation to enhance the knob tuning process. (2) We provide an effective few-shot uncertainty knob estimator without extra time consumption in training data collection, which has a high time efficiency in practical tuning tasks. (3) Our method provides a general framework that could be easily deployed in any knob tuning task because we make no changes to the knob tuners and the database management system. Our experiments on four open-source benchmarks demonstrate that our method effectively reduces useless evaluations and improves the tuning results. Especially in TPCC, our method achieves competitive tuning results with only 60% to 70% time consumption compared to the full workload evaluations.

Submitted 3 July, 2024; originally announced July 2024.

arXiv:2407.01781 [pdf, other]

doi 10.1145/3658226

fVDB: A Deep-Learning Framework for Sparse, Large-Scale, and High-Performance Spatial Intelligence

Authors: Francis Williams, Jiahui Huang, Jonathan Swartz, Gergely Klár, Vijay Thakkar, Matthew Cong, Xuanchi Ren, Ruilong Li, Clement Fuji-Tsang, Sanja Fidler, Eftychios Sifakis, Ken Museth

Abstract: We present fVDB, a novel GPU-optimized framework for deep learning on large-scale 3D data. fVDB provides a complete set of differentiable primitives to build deep learning architectures for common tasks in 3D learning such as convolution, pooling, attention, ray-tracing, meshing, etc. fVDB simultaneously provides a much larger feature set (primitives and operators) than established frameworks with no loss in efficiency: our operators match or exceed the performance of other frameworks with narrower scope. Furthermore, fVDB can process datasets with much larger footprint and spatial resolution than prior works, while providing a competitive memory footprint on small inputs. To achieve this combination of versatility and performance, fVDB relies on a single novel VDB index grid acceleration structure paired with several key innovations including GPU accelerated sparse grid construction, convolution using tensorcores, fast ray tracing kernels using a Hierarchical Digital Differential Analyzer algorithm (HDDA), and jagged tensors. Our framework is fully integrated with PyTorch enabling interoperability with existing pipelines, and we demonstrate its effectiveness on a number of representative tasks such as large-scale point-cloud segmentation, high resolution 3D generative modeling, unbounded scale Neural Radiance Fields, and large-scale point cloud reconstruction.

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.01679 [pdf, other]

Constraints on the gas-phase C/O ratio of DR Tau's outer disk from CS, SO, and C$_2$H observations

Authors: Jane Huang, Edwin A. Bergin, Romane Le Gal, Sean M. Andrews, Jaehan Bae, Luke Keyte, J. A. Sturm

Abstract: Millimeter wavelength observations of Class II protoplanetary disks often display strong emission from hydrocarbons and high CS/SO values, providing evidence that the gas-phase C/O ratio commonly exceeds 1 in their outer regions. We present new NOEMA observations of CS $5-4$, SO $7_6-6_5$ and $5_6-4_5$, C$_2$H $N=3-2$, HCN $3-2$, HCO$^+$ $3-2$, and H$^{13}$CO$^+$ $3-2$ in the DR Tau protoplanetary disk at a resolution of $\sim0.4''$ (80 au). Estimates for the disk-averaged CS/SO ratio range from $\sim0.4-0.5$, the lowest value reported thus far for a T Tauri disk. At a projected separation of $\sim180$ au northeast of the star, the SO moment maps exhibit a clump that has no counterpart in the other lines, and the CS/SO value decreases to $<0.2$ at its location. Thermochemical models calculated with DALI indicate that DR Tau's low CS/SO ratio and faint C$_2$H emission can be explained by a gas-phase C/O ratio that is $<1$ at the disk radii traced by NOEMA. Comparisons of DR Tau's SO emission to maps of extended structures traced by $^{13}$CO suggest that late infall may contribute to driving down the gas-phase C/O ratio of its disk.

Submitted 1 July, 2024; originally announced July 2024.

Comments: Accepted by ApJ

arXiv:2407.01654 [pdf, other]

A thermodynamically consistent phase-field lattice Boltzmann method for two-phase electrohydrodynamic flows

Authors: Fang Xiong, Lei Wang, Jiangxu Huang, Kang Luo

Abstract: In this work, we aim to develop a phase-field based lattice Boltzmann (LB) method for simulating two-phase electrohydrodynamics (EHD) flows, which allows for different properties (densities, viscosities, conductivity and permittivity) of each phase while maintaining thermodynamic consistency. To this end, we first present a theoretical analysis on the two-phase EHD flows by using the Onsager's variational principle, which is an extension of Rayleigh's principle of least energy dissipation and, naturally, guarantees thermodynamic consistency. It shows that the governing equations of the model include the hydrodynamic equations, Cahn-Hilliard equation coupled with additional electrical effect, and the full Poisson-Nernst-Planck electrokinetic equations. After that, a coupled lattice Boltzmann (LB) scheme is constructed for simulating two-phase EHD flows. In particular, in order to handle two-phase EHD flows with a relatively larger electric permittivity ratio, we also introduce a delicately designed discrete forcing term into the LB equation for electrostatic field. Moreover, some numerical examples including two-phase EHD flows in planar layers and charge diffusion of a Gaussian bell are simulated with the developed LB method. It is shown that our numerical scheme shares a second-order convergence rate in space in predicting electric potential and charge density. Finally, we used the current model to simulate the deformation of a droplet under an electric field and the dynamics of droplet detachment in reversed electrowetting.
Our numerical results align well with the theoretic solutions, and the available experimental/numerical data, demonstrating that the proposed method is feasible for simulating two-phase EHD flows.

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.01541 [pdf]

Integration of Computer Networks and Artificial Neural Networks for an AI-based Network Operator

Authors: Binbin Wu, Jingyu Xu, Yifan Zhang, Bo Liu, Yulu Gong, Jiaxin Huang

Abstract: This paper proposes an integrated approach combining computer networks and artificial neural networks to construct an intelligent network operator, functioning as an AI model. State information from computer networks is transformed into embedded vectors, enabling the operator to efficiently recognize different pieces of information and accurately output appropriate operations for the computer network at each step. The operator has undergone comprehensive testing, achieving a 100% accuracy rate, thus eliminating operational risks. Furthermore, a novel algorithm is proposed to emphasize crucial training losses, aiming to enhance the efficiency of operator training. Additionally, a simple computer network simulator is created and encapsulated into training and testing environment components, enabling automation of the data collection, training, and testing processes. This abstract outlines the core contributions of the paper while highlighting the innovative methodology employed in the development and validation of the AI-based network operator.

Submitted 9 April, 2024; originally announced July 2024.

arXiv:2407.01312 [pdf, other]

ToCoAD: Two-Stage Contrastive Learning for Industrial Anomaly Detection

Authors: Yun Liang, Zhiguang Hu, Junjie Huang, Donglin Di, Anyang Su, Lei Fan

Abstract: Current unsupervised anomaly detection approaches perform well on public datasets but struggle with specific anomaly types due to the domain gap between pre-trained feature extractors and target-specific domains. To tackle this issue, this paper presents a two-stage training strategy, called ToCoAD. In the first stage, a discriminative network is trained by using synthetic anomalies in a self-supervised learning manner. This network is then utilized in the second stage to provide a negative feature guide, aiding in the training of the feature extractor through bootstrap contrastive learning. This approach enables the model to progressively learn the distribution of anomalies specific to industrial datasets, effectively enhancing its generalizability to various types of anomalies. Extensive experiments are conducted to demonstrate the effectiveness of our proposed two-stage training strategy, and our model produces competitive performance, achieving pixel-level AUROC scores of 98.21%, 98.43% and 97.70% on MVTec AD, VisA and BTAD respectively.

Submitted 1 July, 2024; originally announced July 2024.

Comments: 11 pages, 7 figures

arXiv:2407.01015 [pdf, other]

Bayesian Entropy Neural Networks for Physics-Aware Prediction

Authors: Rahul Rathnakumar, Jiayu Huang, Hao Yan, Yongming Liu

Abstract: This paper addresses the need for deep learning models to integrate well-defined constraints into their outputs, driven by their application in surrogate models, learning with limited data and partial information, and scenarios requiring flexible model behavior to incorporate non-data sample information. We introduce Bayesian Entropy Neural Networks (BENN), a framework grounded in Maximum Entropy (MaxEnt) principles, designed to impose constraints on Bayesian Neural Network (BNN) predictions. BENN is capable of constraining not only the predicted values but also their derivatives and variances, ensuring a more robust and reliable model output. To achieve simultaneous uncertainty quantification and constraint satisfaction, we employ the method of multipliers approach. This allows for the concurrent estimation of neural network parameters and the Lagrangian multipliers associated with the constraints. Our experiments, spanning diverse applications such as beam deflection modeling and microstructure generation, demonstrate the effectiveness of BENN. The results highlight significant improvements over traditional BNNs and showcase competitive performance relative to contemporary constrained deep learning methods.

Submitted 1 July, 2024; originally announced July 2024.

Comments: 15 pages

ACM Class: I.5.1

arXiv:2407.00741 [pdf, other]

Diffusion Models for Offline Multi-agent Reinforcement Learning with Safety Constraints

Authors: Jianuo Huang

Abstract: In recent advancements in Multi-agent Reinforcement Learning (MARL), its application has extended to various safety-critical scenarios. However, most methods focus on online learning, which presents substantial risks when deployed in real-world settings. Addressing this challenge, we introduce an innovative framework integrating diffusion models within the MARL paradigm. This approach notably enhances the safety of actions taken by multiple agents through risk mitigation while modeling coordinated action. Our framework is grounded in the Centralized Training with Decentralized Execution (CTDE) architecture, augmented by a Diffusion Model for prediction trajectory generation. Additionally, we incorporate a specialized algorithm to further ensure operational safety. We evaluate our model against baselines on the DSRL benchmark. Experiment results demonstrate that our model not only adheres to stringent safety constraints but also achieves superior performance compared to existing methodologies. This underscores the potential of our approach in advancing the safety and efficacy of MARL in real-world applications.

Submitted 3 July, 2024; v1 submitted 30 June, 2024; originally announced July 2024.

arXiv:2407.00468 [pdf, other]

MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

Authors: Jinsheng Huang, Liang Chen, Taian Guo, Fu Zeng, Yusheng Zhao, Bohan Wu, Ye Yuan, Haozhe Zhao, Zhihui Guo, Yichi Zhang, Jingyang Yuan, Wei Ju, Luchen Liu, Tianyu Liu, Baobao Chang, Ming Zhang

Abstract: Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial performance, undermining the credibility of these evaluations. To address this issue while maintaining the efficiency of MCQ evaluations, we propose MMEvalPro, a benchmark designed to avoid Type-I errors through a trilogy evaluation pipeline and more rigorous metrics. For each original question from existing benchmarks, human annotators augment it by creating one perception question and one knowledge anchor question through a meticulous annotation process. MMEvalPro comprises $2,138$ question triplets, totaling $6,414$ distinct questions. Two-thirds of these questions are manually labeled by human experts, while the rest are sourced from existing benchmarks (MMMU, ScienceQA, and MathVista). Compared with the existing benchmarks, our experiments with the latest LLMs and LMMs demonstrate that MMEvalPro is more challenging (the best LMM lags behind human performance by $31.73\%$, compared to an average gap of $8.03\%$ in previous benchmarks) and more trustworthy (the best LLM trails the best LMM by $23.09\%$, whereas the gap for previous benchmarks is just $14.64\%$).
Our in-depth analysis explains the reason for the large performance gap and justifies the trustworthiness of evaluation, underscoring its significant potential for advancing future research.

Submitted 29 June, 2024; originally announced July 2024.

Comments: 21 pages, code released at https://github.com/chenllliang/MMEvalPro, Homepage at https://mmevalpro.github.io/

arXiv:2407.00431 [pdf, other]

Location embedding based pairwise distance learning for fine-grained diagnosis of urinary stones

Authors: Qiangguo Jin, Jiapeng Huang, Changming Sun, Hui Cui, Ping Xuan, Ran Su, Leyi Wei, Yu-Jie Wu, Chia-An Wu, Henry B. L. Duh, Yueh-Hsun Lu

Abstract: The precise diagnosis of urinary stones is crucial for devising effective treatment strategies. The diagnostic process, however, is often complicated by the low contrast between stones and surrounding tissues, as well as the variability in stone locations across different patients. To address this issue, we propose a novel location embedding based pairwise distance learning network (LEPD-Net) that leverages low-dose abdominal X-ray imaging combined with location information for the fine-grained diagnosis of urinary stones. LEPD-Net enhances the representation of stone-related features through context-aware region enhancement, incorporates critical location knowledge via stone location embedding, and achieves recognition of fine-grained objects with our innovative fine-grained pairwise distance learning. Additionally, we have established an in-house dataset on urinary tract stones to demonstrate the effectiveness of our proposed approach. Comprehensive experiments conducted on this dataset reveal that our framework significantly surpasses existing state-of-the-art methods.

Submitted 29 June, 2024; originally announced July 2024.

Journal ref: MICCAI 2024

arXiv:2406.18657 [pdf, other]

Exploring the Complex Ionization Environment of the Turbulent DM Tau Disk

Authors: Deryl E. Long, L. Ilsedore Cleeves, Fred C. Adams, Sean Andrews, Edwin A. Bergin, Viviana V. Guzmán, Jane Huang, A. Meredith Hughes, Chunhua Qi, Kamber Schwarz, Jacob B. Simon, David Wilner

Abstract: Ionization drives important chemical and dynamical processes within protoplanetary disks, including the formation of organics and water in the cold midplane and the transportation of material via accretion and magneto-hydrodynamic (MHD) flows. Understanding these ionization-driven processes is crucial for understanding disk evolution and planet formation. We use new and archival ALMA observations of HCO+, H13CO+, and N2H+ to produce the first forward-modeled 2D ionization constraints for the DM Tau protoplanetary disk. We include ionization from multiple sources and explore the disk chemistry under a range of ionizing conditions. Abundances from our 2D chemical models are post-processed using non-LTE radiative transfer, visibility sampling, and imaging, and are compared directly to the observed radial emission profiles. The observations are best fit by a modestly reduced CR ionization rate ($ζ_{CR}$ ~ 10$^{-18}$ s$^{-1}$) and a hard X-ray spectrum (hardness ratio [HR] = 0.3), which we associate with stellar flaring conditions. Our best-fit model under-produces emission in the inner disk, suggesting that there may be an additional mechanism enhancing ionization in DM Tau's inner disk. Overall, our findings highlight the complexity of ionization in protoplanetary disks and the need for high resolution multi-line studies.

Submitted 26 June, 2024; originally announced June 2024.

Comments: 18 pages, 12 figures, accepted to be published in The Astrophysical Journal (June 25, 2024)

arXiv:2406.18522 [pdf, other]

ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation

Authors: Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Ruijie Zhu, Xinhua Cheng, Jiebo Luo, Li Yuan

Abstract: We propose a novel text-to-video (T2V) generation benchmark, ChronoMagic-Bench, to evaluate the temporal and metamorphic capabilities of the T2V models (e.g. Sora and Lumiere) in time-lapse video generation. In contrast to existing benchmarks that focus on the visual quality and textual relevance of generated videos, ChronoMagic-Bench focuses on the model's ability to generate time-lapse videos with significant metamorphic amplitude and temporal coherence. The benchmark probes T2V models for their physics, biology, and chemistry capabilities, in a free-form text query. For these purposes, ChronoMagic-Bench introduces 1,649 prompts and real-world videos as references, categorized into four major types of time-lapse videos: biological, human-created, meteorological, and physical phenomena, which are further divided into 75 subcategories. This categorization comprehensively evaluates the model's capacity to handle diverse and complex transformations. To accurately align human preference with the benchmark, we introduce two new automatic metrics, MTScore and CHScore, to evaluate the videos' metamorphic attributes and temporal coherence. MTScore measures the metamorphic amplitude, reflecting the degree of change over time, while CHScore assesses the temporal coherence, ensuring the generated videos maintain logical progression and continuity.
Based on the ChronoMagic-Bench, we conduct comprehensive manual evaluations of ten representative T2V models, revealing their strengths and weaknesses across different categories of prompts, and providing a thorough evaluation framework that addresses current gaps in video generation research. Moreover, we create a large-scale ChronoMagic-Pro dataset, containing 460k high-quality pairs of 720p time-lapse videos and detailed captions ensuring high physical pertinence and large metamorphic amplitude.

Submitted 26 June, 2024; originally announced June 2024.

Comments: 31 pages, 15 figures

arXiv:2406.18139 [pdf, other]

LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference

Authors: Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan

Abstract: Long-context Multimodal Large Language Models (MLLMs) demand substantial computational resources for inference as the growth of their multimodal Key-Value (KV) cache, in response to increasing input lengths, challenges memory and time efficiency. Unlike single-modality LLMs that manage only textual contexts, the KV cache of long-context MLLMs includes representations from multiple images with temporal and spatial relationships and related textual contexts. The predominance of image tokens means traditional optimizations for LLMs' KV caches are unsuitable for multimodal long-context settings, and no prior works have addressed this challenge. In this work, we introduce LOOK-M, a pioneering, fine-tuning-free approach that efficiently reduces the multimodal KV cache size while maintaining performance comparable to a full cache. We observe that during prompt prefill, the model prioritizes more textual attention over image features, and based on the multimodal interaction observation, a new proposed text-prior method is explored to compress the KV cache. Furthermore, to mitigate the degradation of image contextual information, we propose several compensatory strategies using KV pairs merging. LOOK-M demonstrates that with a significant reduction in KV Cache memory usage, such as reducing it by 80% in some cases, it not only achieves up to 1.5x faster decoding but also maintains or even enhances performance across a variety of long context multimodal tasks.

Submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.17880 [pdf, other]

MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval

Authors: Weitong Cai, Jiabo Huang, Shaogang Gong, Hailin Jin, Yang Liu

Abstract: Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query. Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches with a fraction of the prominent video content in the foreground with limited wording diversity. This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text. It confines the cross-modal alignment knowledge within the scope of a limited text corpus, thereby leading to sub-optimal visual-textual modeling and poor generalizability. By leveraging the visual-textual understanding capability of multi-modal large language models (MLLM), in this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization. To effectively maintain temporal sensibility for localization, we design to get text narratives for each certain video timestamp and construct a structured text paragraph with time information, which is temporally aligned with the visual content. Then we perform cross-modal feature merging between the temporal-aware narratives and corresponding video temporal features to produce semantic-enhanced video representation sequences for query localization. Subsequently, we introduce a uni-modal narrative-query matching mechanism, which encourages the model to extract complementary information from contextual cohesive descriptions for improved retrieval.
Extensive experiments on two benchmarks show the effectiveness and generalizability of our proposed method.