
arXiv:2407.08506 [pdf, other]

Imitation Learning for Robotic Assisted Ultrasound Examination of Deep Venous Thrombosis using Kernelized Movement Primitives

Authors: Diego Dall'Alba, Lorenzo Busellato, Thiusius Rajeeth Savarimuthu, Zhuoqi Cheng, Iñigo Iturrate

Abstract: Deep Vein Thrombosis (DVT) is a common yet potentially fatal condition, often leading to critical complications like pulmonary embolism. DVT is commonly diagnosed using Ultrasound (US) imaging, which can be inconsistent due to its high dependence on the operator's skill. Robotic US Systems (RUSs) aim to improve diagnostic test consistency but face challenges with the complex scanning pattern needed for DVT assessment, where precise control over US probe pressure is crucial for indirectly detecting occlusions. This work introduces an imitation learning method, based on Kernelized Movement Primitives (KMP), to standardize DVT US exams by training an autonomous robotic controller using sonographer demonstrations. A new recording device design enhances demonstration ergonomics, integrating with US probes and enabling seamless force and position data recording. KMPs are used to capture scanning skills, linking scan trajectory and force, enabling generalization beyond the demonstrations. Our approach, evaluated on synthetic models and volunteers, shows that the KMP-based RUS can replicate an expert's force control and image quality in DVT US examination. It outperforms previous methods using manually defined force profiles, improving exam standardization and reducing reliance on specialized sonographers.

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2407.07053 [pdf, other]

Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model

Authors: Wenqi Zhang, Zhenglin Cheng, Yuanyu He, Mengna Wang, Yongliang Shen, Zeqi Tan, Guiyang Hou, Mingqian He, Yanna Ma, Weiming Lu, Yueting Zhuang

Abstract: Although most current large multimodal models (LMMs) can already understand photos of natural scenes and portraits, their understanding of abstract images, e.g., charts, maps, or layouts, and their visual reasoning capabilities remain quite rudimentary. They often struggle with simple daily tasks, such as reading time from a clock, understanding a flowchart, or planning a route using a road map. In light of this, we design a multi-modal self-instruct, utilizing large language models and their code capabilities to synthesize massive abstract images and visual reasoning instructions across daily scenarios. Our strategy effortlessly creates a multimodal benchmark with 11,193 instructions for eight visual scenarios: charts, tables, simulated maps, dashboards, flowcharts, relation graphs, floor plans, and visual puzzles. This benchmark, constructed with simple lines and geometric elements, exposes the shortcomings of most advanced LMMs like Claude-3.5-Sonnet and GPT-4o in abstract image understanding, spatial relations reasoning, and visual element induction. Besides, to verify the quality of our synthetic data, we fine-tune an LMM using 62,476 synthetic chart, table and road map instructions. The results demonstrate improved chart understanding and map navigation performance, and also demonstrate potential benefits for other visual reasoning tasks. Our code is available at: https://github.com/zwq2018/Multi-modal-Self-instruct.

Submitted 10 July, 2024; v1 submitted 9 July, 2024; originally announced July 2024.

Comments: code: https://github.com/zwq2018/Multi-modal-Self-instruct dataset: https://huggingface.co/datasets/zwq2018/Multi-modal-Self-instruct Leaderboard: https://multi-modal-self-instruct.github.io/

arXiv:2407.05118 [pdf, other]

SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding

Authors: Zixu Cheng, Yujiang Pu, Shaogang Gong, Parisa Kordjamshidi, Yu Kong

Abstract: Temporal grounding, a.k.a. video moment retrieval, aims at locating video segments corresponding to a given query sentence. The compositional nature of natural language enables localization beyond predefined events, posing a certain challenge to the compositional generalizability of existing methods. Recent studies establish the correspondence between videos and queries in a decompose-reconstruct manner to achieve compositional generalization. However, they only consider dominant primitives and build negative queries through random sampling and recombination, resulting in semantically implausible negatives that hinder the models from learning rational compositions. In addition, recent DETR-based methods still underperform in compositional temporal grounding, showing irrational saliency responses when given negative queries that have subtle differences from positive queries. To address these limitations, we first propose a large language model-driven method for negative query construction, utilizing GPT-3.5-Turbo to generate semantically plausible hard negative queries. Subsequently, we introduce a coarse-to-fine saliency ranking strategy, which encourages the model to learn the multi-granularity semantic relationships between videos and hierarchical negative queries to boost compositional generalization. Extensive experiments on two challenging benchmarks validate the effectiveness and generalizability of our proposed method. Our code is available at https://github.com/zxccade/SHINE.

Submitted 6 July, 2024; originally announced July 2024.

Comments: Accepted to ECCV 2024

arXiv:2407.03636 [pdf, other]

Diff-Restorer: Unleashing Visual Prompts for Diffusion-based Universal Image Restoration

Authors: Yuhong Zhang, Hengsheng Zhang, Xinning Chai, Zhengxue Cheng, Rong Xie, Li Song, Wenjun Zhang

Abstract: Image restoration is a classic low-level problem aimed at recovering high-quality images from low-quality images with various degradations such as blur, noise, rain, haze, etc. However, due to the inherent complexity and non-uniqueness of degradation in real-world images, it is challenging for a model trained for single tasks to handle real-world restoration problems effectively. Moreover, existing methods often suffer from over-smoothing and lack of realism in the restored results. To address these issues, we propose Diff-Restorer, a universal image restoration method based on the diffusion model, aiming to leverage the prior knowledge of Stable Diffusion to remove degradation while generating high perceptual quality restoration results. Specifically, we utilize the pre-trained visual language model to extract visual prompts from degraded images, including semantic and degradation embeddings. The semantic embeddings serve as content prompts to guide the diffusion model for generation. In contrast, the degradation embeddings modulate the Image-guided Control Module to generate spatial priors for controlling the spatial structure of the diffusion process, ensuring faithfulness to the original image. Additionally, we design a Degradation-aware Decoder to perform structural correction and convert the latent code to the pixel domain. We conducted comprehensive qualitative and quantitative analysis on restoration tasks with different degradations, demonstrating the effectiveness and superiority of our approach.

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2406.19859 [pdf, other]

MetaDesigner: Advancing Artistic Typography through AI-Driven, User-Centric, and Multilingual WordArt Synthesis

Authors: Jun-Yan He, Zhi-Qi Cheng, Chenyang Li, Jingdong Sun, Qi He, Wangmeng Xiang, Hanyuan Chen, Jin-Peng Lan, Xianhui Lin, Kang Zhu, Bin Luo, Yifeng Geng, Xuansong Xie, Alexander G. Hauptmann

Abstract: MetaDesigner revolutionizes artistic typography synthesis by leveraging the strengths of Large Language Models (LLMs) to drive a design paradigm centered around user engagement. At the core of this framework lies a multi-agent system comprising the Pipeline, Glyph, and Texture agents, which collectively enable the creation of customized WordArt, ranging from semantic enhancements to the imposition of complex textures. MetaDesigner incorporates a comprehensive feedback mechanism that harnesses insights from multimodal models and user evaluations to refine and enhance the design process iteratively. Through this feedback loop, the system adeptly tunes hyperparameters to align with user-defined stylistic and thematic preferences, generating WordArt that not only meets but exceeds user expectations of visual appeal and contextual relevance. Empirical validations highlight MetaDesigner's capability to effectively serve diverse WordArt applications, consistently producing aesthetically appealing and context-sensitive results.

Submitted 4 July, 2024; v1 submitted 28 June, 2024; originally announced June 2024.

Comments: 18 pages, 16 figures, Project: https://modelscope.cn/studios/WordArt/WordArt

arXiv:2406.19236 [pdf, other]

Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions

Authors: Minghan Li, Heng Li, Zhi-Qi Cheng, Yifei Dong, Yuxuan Zhou, Jun-Yan He, Qi Dai, Teruko Mitamura, Alexander G. Hauptmann

Abstract: Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions. However, current VLN frameworks often rely on static environments and optimal expert supervision, limiting their real-world applicability. To address this, we introduce Human-Aware Vision-and-Language Navigation (HA-VLN), extending traditional VLN by incorporating dynamic human activities and relaxing key assumptions. We propose the Human-Aware 3D (HA3D) simulator, which combines dynamic human activities with the Matterport3D dataset, and the Human-Aware Room-to-Room (HA-R2R) dataset, extending R2R with human activity descriptions. To tackle HA-VLN challenges, we present the Expert-Supervised Cross-Modal (VLN-CM) and Non-Expert-Supervised Decision Transformer (VLN-DT) agents, utilizing cross-modal fusion and diverse training strategies for effective navigation in dynamic human environments. A comprehensive evaluation, including metrics considering human activities, and systematic analysis of HA-VLN's unique challenges, underscores the need for further research to enhance HA-VLN agents' real-world robustness and adaptability. Ultimately, this work provides benchmarks and insights for future research on embodied AI and Sim2Real transfer, paving the way for more realistic and applicable VLN systems in human-populated environments.

Submitted 4 July, 2024; v1 submitted 27 June, 2024; originally announced June 2024.

Comments: 30 pages, 18 figures, Project Page: https://lpercc.github.io/HA3D_simulator/

arXiv:2406.15877 [pdf, other]

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Authors: Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu , et al. (8 additional authors not shown)

Abstract: Automated software engineering has been greatly empowered by the recent advances in Large Language Models (LLMs) for programming. While current benchmarks have shown that LLMs can perform various software engineering tasks like human developers, the majority of their evaluations are limited to short and self-contained algorithmic tasks. Solving challenging and practical programming tasks requires the capability of utilizing diverse function calls as tools to efficiently implement functionalities like data analysis and web development. In addition, using multiple tools to solve a task needs compositional reasoning by accurately understanding complex instructions. Fulfilling both of these characteristics can pose a great challenge for LLMs. To assess how well LLMs can solve challenging and practical programming tasks, we introduce BigCodeBench, a benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained programming tasks. To evaluate LLMs rigorously, each programming task encompasses 5.6 test cases on average, with an average branch coverage of 99%. In addition, we propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions only with essential information. Our extensive evaluation of 60 LLMs shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%. The results underscore the need for further advancements in this area.

Submitted 26 June, 2024; v1 submitted 22 June, 2024; originally announced June 2024.

Comments: 44 pages, 14 figures, 7 tables, built with love by the BigCode community :)

arXiv:2406.15835 [pdf]

Alternating-Chiral Charge Density Waves and Hybrid Ferrimagnetism in Monolayered NbTe2

Authors: Yusong Bai, Guohua Cao, Jinghao Deng, Haomin Fei, Xiaoyu Lin, Leiqiang Li, Chao Zhu, Zemin Pan, Tao Jian, Da Huo, Zhengbo Cheng, Chih-Kang Shih, Ping Cui, Chendong Zhang, Zhenyu Zhang

Abstract: Intertwining of different quantum degrees of freedom manifests exotic quantum phenomena in many-body systems, especially in reduced dimensionality. Here we show that monolayered NbTe2 serves as an ideal platform where lattice, charge, and spin degrees of freedom manifest cooperatively, leading to a new and threading order of chirality. By using spin-polarized scanning tunneling microscopy/spectroscopy, we reveal that the $\sqrt{19} \times \sqrt{19}$ phase of NbTe2 is encoded with both alternating-chiral atomic displacements and charge density waves, characterized by two chiral units of opposite handedness within the reconstructed cell. We show unambiguous evidence for emergent spin polarizations spreading over the primitive cell, with the magnetization orientation synchronized with alternating handedness of chiral order. Our first-principles studies identify the origin of intertwined orders being correlation driven, with the threading order of chirality emerging when the on-site Coulomb repulsion exceeds a critical value. The spin ordering is further shown to be of hybrid ferrimagnetic nature, contributed by the itinerant electrons and localized d-orbitals. Collectively, these findings expand the realm of chiral order in correlated electron systems, and facilitate an appealing platform for chiral spintronic and related applications.

Submitted 22 June, 2024; originally announced June 2024.

arXiv:2406.15060 [pdf, other]

Evidence for Three-$α$ Breathing Modes Uncovered by Control Neural Network

Authors: Zheng Cheng, Mengjiao Lyu, Takayuki Myo, Hisashi Horiuchi, Hiroshi Toki, Zhongzhou Ren, Masahiro Isaka, Mengyun Mao, Hiroki Takemoto, Niu Wan, Wenlong You, Qing Zhao

Abstract: This work introduces a new Control Neural Network (Ctrl.NN) method to uncover evidence of an exotic quantum state, i.e., the breathing modes in 3-$α$ resonant states of the $^{12}$C nucleus. We provide the most precise microscopic description to date for the $^{12}$C energy spectrum, identify two new exotic breathing states, and uncover strong evidence that directly connects the recent experimental observations to the breathing modes. The Ctrl.NN method significantly simplifies numerical calculations of quantum systems under multiple constraints and offers a new perspective for solving the nuclear many-body problem.

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2406.14025 [pdf]

Direct Observation of Dendrites Nucleation in Li Metal Battery by Machine Learning Accelerated Molecular Simulations under Realistic Electrochemical Conditions

Authors: Taiping Hu, Haichao Huang, Guobing Zhou, Xinyan Wang, Zheng Cheng, Fangjia Fu, Xiaoxu Wang, Fuzhi Dai, Kuang Yu, Shenzhen Xu

Abstract: Uncontrollable dendrite growth during electrochemical cycles leads to low Coulombic efficiency and critical safety issues in Li metal batteries. Hence, a comprehensive understanding of the dendrite formation mechanism is essential for further enhancing the performance of Li metal batteries. Machine learning accelerated molecular dynamics (MD) simulations can provide atomic-scale resolution for various key processes at an ab-initio level of accuracy. However, traditional MD simulation tools can hardly capture Li electrochemical deposition, due to the lack of an electrochemical constant potential (ConstP) condition. In this work, we propose a ConstP approach that combines a machine learning force field with the charge equilibration method to reveal the dynamic process of Li dendrite nucleation at Li metal anode surfaces. Our results show that both dead Li cluster formation and inhomogeneous Li electro-deposition can induce Li dendrite nucleation. We further reveal that the local aggregation of Li atoms in amorphous inorganic components of the solid electrolyte interphase is the key factor triggering the nucleation process. Overall, our simulations provide microscopic insights into Li dendrite formation in Li metal anodes. More importantly, we present an efficient and accurate simulation method for modeling realistic ConstP conditions, which holds considerable potential for broader applications in modeling of complex electrochemical interfaces.

Submitted 3 July, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.13702 [pdf]

Van-Hove annihilation and nematic instability on a Kagome lattice

Authors: Yu-Xiao Jiang, Sen Shao, Wei Xia, M. Michael Denner, Julian Ingham, Md Shafayat Hossain, Qingzheng Qiu, Xiquan Zheng, Hongyu Chen, Zi-Jia Cheng, Xian P. Yang, Byunghoon Kim, Jia-Xin Yin, Songbo Zhang, Maksim Litskevich, Qi Zhang, Tyler A. Cochran, Yingying Peng, Guoqing Chang, Yanfeng Guo, Ronny Thomale, Titus Neupert, M. Zahid Hasan

Abstract: Novel states of matter arise in quantum materials due to strong interactions among electrons. A nematic phase breaks the point group symmetry of the crystal lattice and is known to emerge in correlated materials. Here we report the observation of an intra-unit-cell nematic order and signatures of Pomeranchuk instability in the Kagome metal ScV6Sn6. Using scanning tunneling microscopy and spectroscopy, we reveal a stripe-like nematic order breaking the crystal rotational symmetry within the Kagome lattice itself. Moreover, we identify a set of van Hove singularities adhering to the Kagome layer electrons, which appear along one direction of the Brillouin zone while being annihilated along other high-symmetry directions, revealing a rotational symmetry breaking. Via detailed spectroscopic maps, we further observe an elliptical deformation of the Fermi surface, which provides direct evidence for an electronically mediated nematic order. Our work not only bridges the gap between electronic nematicity and Kagome physics, but also sheds light on the potential mechanism for realizing symmetry-broken phases in correlated electron systems.

Submitted 19 June, 2024; originally announced June 2024.

Comments: 19 pages, 5 figures, accepted for publication in Nature Materials

arXiv:2406.11161 [pdf, other]

Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning

Authors: Zebang Cheng, Zhi-Qi Cheng, Jun-Yan He, Jingdong Sun, Kai Wang, Yuxiang Lin, Zheng Lian, Xiaojiang Peng, Alexander Hauptmann

Abstract: Accurate emotion perception is crucial for various applications, including human-computer interaction, education, and counseling. However, traditional single-modality approaches often fail to capture the complexity of real-world emotional expressions, which are inherently multimodal. Moreover, existing Multimodal Large Language Models (MLLMs) face challenges in integrating audio and recognizing subtle facial micro-expressions. To address this, we introduce the MERR dataset, containing 28,618 coarse-grained and 4,487 fine-grained annotated samples across diverse emotional categories. This dataset enables models to learn from varied scenarios and generalize to real-world applications. Furthermore, we propose Emotion-LLaMA, a model that seamlessly integrates audio, visual, and textual inputs through emotion-specific encoders. By aligning features into a shared space and employing a modified LLaMA model with instruction tuning, Emotion-LLaMA significantly enhances both emotional recognition and reasoning capabilities. Extensive evaluations show Emotion-LLaMA outperforms other MLLMs, achieving top scores in Clue Overlap (7.83) and Label Overlap (6.25) on EMER, an F1 score of 0.9036 on the MER2023 challenge, and the highest UAR (45.59) and WAR (59.37) in zero-shot evaluations on the DFEW dataset.

Submitted 16 June, 2024; originally announced June 2024.

Comments: 37 pages, 12 figures, Project: https://github.com/ZebangCheng/Emotion-LLaMA, Demo: https://huggingface.co/spaces/ZebangCheng/Emotion-LLaMA

arXiv:2406.10575 [pdf, ps, other]

The necessity of (co)unit in nearly Frobenius algebra

Authors: Zhiyun Cheng, Ziyi Lei

Abstract: In this article, we are concerned with the concept of nearly Frobenius algebra, which corresponds to most 2D-TQFTs in which each cobordism admits no critical points of index 0 or 2. We prove that any nearly Frobenius algebra over a principal ideal domain with surjective multiplication and injective comultiplication is indeed a Frobenius algebra. The motivation for this study mainly comes from the investigation of potential constructions of link homology.

Submitted 15 June, 2024; originally announced June 2024.

Comments: 23 pages, 6 figures

MSC Class: 16T10; 16S10; 57K18

arXiv:2406.09375 [pdf, other]

Learning conditional distributions on continuous spaces

Authors: Cyril Bénézet, Ziteng Cheng, Sebastian Jaimungal

Abstract: We investigate sample-based learning of conditional distributions on multi-dimensional unit boxes, allowing for different dimensions of the feature and target spaces. Our approach involves clustering data near varying query points in the feature space to create empirical measures in the target space. We employ two distinct clustering schemes: one based on a fixed-radius ball and the other on nearest neighbors. We establish upper bounds for the convergence rates of both methods and, from these bounds, deduce optimal configurations for the radius and the number of neighbors. We propose to incorporate the nearest neighbors method into neural network training, as our empirical analysis indicates it has better performance in practice. For efficiency, our training process utilizes approximate nearest neighbors search with random binary space partitioning. Additionally, we employ the Sinkhorn algorithm and a sparsity-enforced transport plan. Our empirical findings demonstrate that, with a suitably designed structure, the neural network has the ability to adapt to a suitable level of Lipschitz continuity locally. For reproducibility, our code is available at https://github.com/zcheng-a/LCD_kNN.

Submitted 13 June, 2024; originally announced June 2024.
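
The two clustering schemes named in the abstract above (fixed-radius ball and nearest neighbors) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the synthetic data, and the choices of `k` and `radius` are all assumptions made for illustration, and exact brute-force search stands in for the approximate nearest-neighbor search the abstract mentions.

```python
import math
import random


def knn_atoms(features, targets, query, k):
    """Targets of the k samples whose features lie closest to `query`.

    The returned list is read as the atoms of an empirical measure
    (each atom with weight 1/k) approximating the law of Y given X = query.
    """
    order = sorted(range(len(features)),
                   key=lambda i: math.dist(features[i], query))
    return [targets[i] for i in order[:k]]


def ball_atoms(features, targets, query, radius):
    """Targets of all samples whose features fall within `radius` of `query`."""
    return [t for f, t in zip(features, targets)
            if math.dist(f, query) <= radius]


# Hypothetical data on unit boxes: 2-D features, 1-D target.
rng = random.Random(0)
features = [(rng.random(), rng.random()) for _ in range(2000)]
# Target is the first feature coordinate plus small noise, clipped to [0, 1].
targets = [min(1.0, max(0.0, f[0] + rng.gauss(0.0, 0.05))) for f in features]

query = (0.5, 0.5)
knn_sample = knn_atoms(features, targets, query, k=50)
ball_sample = ball_atoms(features, targets, query, radius=0.1)
knn_mean = sum(knn_sample) / len(knn_sample)
```

With this synthetic target, the mean of `knn_sample` should sit near the first coordinate of the query. The ball scheme returns a variable number of atoms depending on local data density, whereas the nearest-neighbor scheme always returns exactly `k`.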

arXiv:2406.09180 [pdf, other]

Detection-Rate-Emphasized Multi-objective Evolutionary Feature Selection for Network Intrusion Detection

Authors: Zi-Hang Cheng, Haopu Shang, Chao Qian

Abstract: Network intrusion detection is one of the most important issues in the field of cyber security, and various machine learning techniques have been applied to build intrusion detection systems. However, since the number of features describing network connections is often large, and some features are redundant or noisy, feature selection is necessary in such scenarios and can improve both efficiency and accuracy. Recently, some researchers have focused on using multi-objective evolutionary algorithms (MOEAs) to select features. But usually, they only consider the number of features and classification accuracy as the objectives, resulting in unsatisfactory performance on a critical metric, detection rate. This leads to many real attacks being missed and brings huge losses to the network system. In this paper, we propose DR-MOFS to model the feature selection problem in network intrusion detection as a three-objective optimization problem, where the number of features, accuracy and detection rate are optimized simultaneously, and use MOEAs to solve it. Experiments on two popular network intrusion detection datasets, NSL-KDD and UNSW-NB15, show that in most cases the proposed method can outperform previous methods, i.e., lead to fewer features, higher accuracy and higher detection rate.

Submitted 13 June, 2024; originally announced June 2024.
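
To make the three-objective formulation in the abstract above concrete, the sketch below shows Pareto dominance over (number of features, accuracy, detection rate). It is only an illustration of the comparison rule that multi-objective methods rely on, not the DR-MOFS algorithm itself; the `Candidate` type and all numeric values are hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """One feature subset, summarized by its three objective values."""
    n_features: int        # to be minimized
    accuracy: float        # to be maximized
    detection_rate: float  # to be maximized


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is no worse than `b` on every objective and better on one."""
    no_worse = (a.n_features <= b.n_features
                and a.accuracy >= b.accuracy
                and a.detection_rate >= b.detection_rate)
    strictly_better = (a.n_features < b.n_features
                       or a.accuracy > b.accuracy
                       or a.detection_rate > b.detection_rate)
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical objective values, invented for illustration only.
candidates = [
    Candidate(10, 0.90, 0.80),
    Candidate(12, 0.90, 0.78),  # dominated by the first candidate
    Candidate(8, 0.85, 0.70),
    Candidate(15, 0.92, 0.88),
]
front = pareto_front(candidates)
```

Here the second candidate drops out because the first uses fewer features with equal accuracy and a higher detection rate. The other three are mutually non-dominated trade-offs, which is the kind of set a three-objective search would present to the user.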

arXiv:2406.08689 [pdf, other]

Security of AI Agents

Authors: Yifeng He, Ethan Wang, Yuyang Rong, Zifei Cheng, Hao Chen

Abstract: The study and development of AI agents have been boosted by large language models. AI agents can function as intelligent assistants and complete tasks on behalf of their users with access to tools and the ability to execute commands in their environments. Through studying and experiencing the workflow of typical AI agents, we have raised several concerns regarding their security. These potential vulnerabilities are not addressed by the frameworks used to build the agents, nor by research aimed at improving the agents. In this paper, we identify and describe these vulnerabilities in detail from a system security perspective, emphasizing their causes and severe effects. Furthermore, we introduce defense mechanisms corresponding to each vulnerability with meticulous design and experiments to evaluate their viability. Altogether, this paper contextualizes the security issues in the current development of AI agents and delineates methods to make AI agents safer and more reliable.

Submitted 20 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.07476 [pdf, other]

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Authors: Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing

Abstract: In this paper, we present VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching the multimodal understanding capabilities of the model by seamlessly incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even gets close to some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 exhibits reasonable improvements in audio-only and audio-video question-answering (AQA & OE-AVQA) benchmarks over existing models. These advancements underline VideoLLaMA 2's superior performance in multimodal comprehension, setting a new standard for intelligent video analysis systems. All models are public to facilitate further research.

Submitted 17 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

Comments: ZC, SL, HZ, YX, and XL contributed equally to this project

arXiv:2406.06396 [pdf, other]

Lightwave-controlled relativistic plasma mirrors

Authors: Marie Ouillé, Jaismeen Kaur, Zhao Cheng, Stefan Haessler, Rodrigo Lopez-Martens

Abstract: We report on attosecond-scale control of high-harmonic and electron emission from plasma mirrors driven by relativistic-intensity near-single-cycle lightwaves at kHz repetition rate. By controlling the waveform of the intense light transient, we reproducibly form a sub-cycle temporal intensity gate at the plasma mirror surface, leading to the observation of extreme ultraviolet spectral continua, characteristic of isolated attosecond pulse generation. We also observe the correlated emission of a waveform-dependent relativistic electron beam, paving the way towards fully lightwave-controlled dynamics of relativistic plasma mirrors.

Submitted 11 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

arXiv:2406.06279 [pdf, other]

Multi-Prompting Decoder Helps Better Language Understanding

Authors: Zifeng Cheng, Zhaoling Chen, Zhiwei Jiang, Yafeng Yin, Shiping Ge, Yuliang Liu, Qing Gu

Abstract: Recent Pre-trained Language Models (PLMs) usually only provide users with the inference APIs, namely the emerging Model-as-a-Service (MaaS) setting. To adapt MaaS PLMs to downstream tasks without accessing their parameters and gradients, some existing methods focus on the output-side adaptation of PLMs, viewing the PLM as an encoder and then optimizing a task-specific decoder for decoding the output hidden states and class scores of the PLM. Despite the effectiveness of these methods, they only use a single prompt to query PLMs for decoding, leading to a heavy reliance on the quality of the adopted prompt. In this paper, we propose a simple yet effective Multi-Prompting Decoder (MPD) framework for MaaS adaptation. The core idea is to query PLMs with multiple different prompts for each sample, thereby obtaining multiple output hidden states and class scores for subsequent decoding. Such a multi-prompting decoding paradigm can simultaneously mitigate reliance on the quality of a single prompt, alleviate the issue of data scarcity under the few-shot setting, and provide richer knowledge extracted from PLMs. Specifically, we propose two decoding strategies: multi-prompting decoding with optimal transport for hidden states and calibrated decoding for class scores. Extensive experiments demonstrate that our method achieves new state-of-the-art results on multiple natural language understanding datasets under the few-shot setting.

Submitted 10 June, 2024; originally announced June 2024.

arXiv:2406.06104 [pdf]

Correlated electrons of the flat band in charge density wave state of 4Hb-TaSexS2-x

Authors: Yanyan Geng, Jianfeng Guo, Fanyu Meng, Manyu Wang, Shuo Mi, Li Huang, Rui Xu, Fei Pang, Kai Liu, Shancai Wang, Hong-Jun Gao, Weichang Zhou, Wei Ji, Hechang Lei, Zhihai Cheng

Abstract: Many intriguing quantum states of matter, such as unconventional superconductivity, magnetic phases and fractional quantum Hall physics, emerge from the spatially-correlated localized electrons in the flat band of solid materials. By using scanning tunneling microscopy and spectroscopy (STM/STS), we report the real-space investigation of correlated electrons in the flat band of superlattice 4Hb-TaSexS2-x. In contrast with the pristine 4Hb-TaS2, the selenium (Se) substitutions significantly affect the interfacial transfer of correlated electrons between the CDW states of 1T- and 1H-TaS2 layers, and contribute real-space fractional electron-filling configurations with the distributed electron-filled and -void SoD clusters of 1T-layer. The site-specific STS spectra directly reveal their respective prominent spectral weight above EF and symmetric Mott-like spectra. In addition, the spatial distributions of these electron-filled SoDs in the 1T-layer of 4Hb-TaSe0.7S1.3 demonstrate different local short-range patterning, clearly indicating the complex neighboring interactions among the localized electrons in the flat band of 1T-layer. Our results not only provide in-depth insight into correlated electrons in the flat CDW band, but also provide a simple platform to manipulate the electron-correlation-related quantum states.

Submitted 10 June, 2024; originally announced June 2024.

Comments: 18 pages, 4 figures

arXiv:2406.06031 [pdf, other]

A WT-ResNet based fault diagnosis model for the urban rail train transmission system

Authors: Zuyu Cheng, Zhengcai Zhao, Yixiao Wang, Wentao Guo, Yufei Wang, Xiang Gao

Abstract: This study presents a novel fault diagnosis model for urban rail transit systems based on Wavelet Transform Residual Neural Network (WT-ResNet). The model integrates the advantages of wavelet transform for feature extraction and ResNet for pattern recognition, offering enhanced diagnostic accuracy and robustness. Experimental results demonstrate the effectiveness of the proposed model in identifying faults in urban rail trains, paving the way for improved maintenance strategies and reduced downtime.

Submitted 10 June, 2024; originally announced June 2024.

Comments: 12 pages, 10 figures

arXiv:2406.05857 [pdf, other]

doi 10.1109/TPAMI.2024.3412632

Self-supervised Adversarial Training of Monocular Depth Estimation against Physical-World Attacks

Authors: Zhiyuan Cheng, Cheng Han, James Liang, Qifan Wang, Xiangyu Zhang, Dongfang Liu

Abstract: Monocular Depth Estimation (MDE) plays a vital role in applications such as autonomous driving. However, various attacks target MDE models, with physical attacks posing significant threats to system security. Traditional adversarial training methods, which require ground-truth labels, are not directly applicable to MDE models that lack ground-truth depth. Some self-supervised model hardening techniques (e.g., contrastive learning) overlook the domain knowledge of MDE, resulting in suboptimal performance. In this work, we introduce a novel self-supervised adversarial training approach for MDE models, leveraging view synthesis without the need for ground-truth depth. We enhance adversarial robustness against real-world attacks by incorporating L_0-norm-bounded perturbation during training. We evaluate our method against supervised learning-based and contrastive learning-based approaches specifically designed for MDE. Our experiments with two representative MDE networks demonstrate improved robustness against various adversarial attacks, with minimal impact on benign performance.

Submitted 9 June, 2024; originally announced June 2024.

Comments: Accepted in TPAMI'24. Extended from our ICLR'23 publication (arXiv:2301.13487). arXiv admin note: substantial text overlap with arXiv:2301.13487

arXiv:2406.05731 [pdf, ps, other]

Nonlinear saturation of reversed shear Alfven eigenmode via high-frequency quasi-mode generation

Authors: Zhiwen Cheng, Guangyu Wei, Lei Ye, Zhiyong Qiu

Abstract: A nonlinear saturation mechanism for reversed shear Alfven eigenmode (RSAE) is proposed and analysed, and is shown to be of relevance to typical reactor parameter region. The saturation is achieved through the generation of high-frequency quasi-mode due to nonlinear coupling of two RSAEs, which is then damped due to coupling with the shear Alfven continuum, and leads to the nonlinear saturation of the primary RSAEs. An estimation of the nonlinear damping rate is also provided.

Submitted 9 June, 2024; originally announced June 2024.

Comments: submitted to Plasma Physics and Technology

arXiv:2406.04538 [pdf, other]

A unified framework for prediction of vortex-induced vibration based on the nonlinear identification of general wake oscillator modeling

Authors: Zhi Cheng, Fue-Sang Lien, Earl H. Dowell

Abstract: In this paper, we present novel identification strategies to develop a unified framework for vortex-induced vibration (VIV) prediction based on the general semi-empirical wake oscillator. Greybox nonlinear system identification method accompanying high-fidelity computational fluid dynamics (CFD) and/or experimental data could be applied for the identification process. The proposed template of general wake oscillators contains low- to high-order damping terms to be identified for characterizing the possible flow dynamics. Two different strategies, including individual identification of single wake oscillator and overall identification of coupled VIV control equations, are proposed. VIV system consisting of an elastically-mounted circular cylinder submerged in laminar flow at Reynolds number of 100 is considered. Both strategies have been tested and have exhibited high accuracy. The second strategy, i.e., overall identification of coupled VIV control equations, would be more suitable for the future framework because its training process considers the effect of fluid damping. A detailed mathematical introduction to future works on framework development covering the wide Reynolds number range is addressed. The proposed unified framework is a landmark update of past wake oscillators both in terms of prediction accuracy and physical principles and has considerable research significance and practical engineering value.

Submitted 13 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

Comments: This version is not intended to be sent for peer review and is merely a summary of some preliminary work

arXiv:2406.02607 [pdf, other]

Flow-Induced Vibration of Flexible Hydrofoil Within Cavitating Turbulent Flow

Authors: Zhi Cheng, Rajeev Jaiman

Abstract: The flow-induced vibration and cavitation dynamics of three-dimensional flow past a cantilever flexible hydrofoil are investigated using a large eddy simulation (LES) model, a homogeneous mixture cavitation model and the structural modes superposition method. The present work aims to explore a potential mechanism responsible for a propeller singing behavior, and thus focuses on the synchronized hydroelastic coupling among the pressure pulsation inside the flow field, the cavitation generation and the structural vibration. To begin, we validate the tip vortex dynamics of a flexible hydrofoil against the available experimental data. Our results demonstrate that the tip vortex shedding and the blade vibration are responsible for the intense peak in the low-frequency tonal components of the noise source, and the trailing-edge vortex shedding induces broadband components. Additionally, the generation of sheet cavitation induces considerable synchronized hydrofoil vibration (subjected to a flutter-like response), and affects the pressure fluctuations in the flow field, which further dominate the features of the underwater noise sources. It is suggested that the cavitation behavior and structural vibrations co-dominate the characteristics of singing noise from a propeller blade.

Submitted 2 June, 2024; originally announced June 2024.

Report number: OMAE 2024-125985

arXiv:2406.01850 [pdf, other]

Cell-free massive MIMO Channels in an Urban Environment -- Measurements and Channel Statistics

Authors: Yuning Zhang, Thomas Choi, Zihang Cheng, Jorge Gomez-Ponce, Issei Kanno, Masaaki Ito, Andreas F. Molisch

Abstract: Cell-free massive MIMO (CF-mMIMO), where each user equipment (UE) is connected to multiple access points (APs), is emerging as an important component for 5G and 6G cellular systems. Accurate channel models based on measurements are required to optimize their design and deployment. This paper presents an extensive measurement campaign for CF-mMIMO in an urban environment. A new "virtual AP" technique measures channels between 80 UE locations and more than 20,000 possible microcellular AP locations. Measurements are done at 3.5 GHz carrier frequency with 350 MHz bandwidth (BW). The paper describes the measurement setup and data processing, shows sample results and their physical interpretation, and provides statistics for key quantities such as pathloss, shadowing, delay spread (DS), and delay window. We find pathloss coefficients of 2.9 and 10.4 for line-of-sight (LOS) and non line-of-sight (NLOS), respectively, where the high LOS coefficient is mainly because larger distance leads to more grazing angle of incidence and thus lower antenna gain in our setup. Shadowing standard deviations are 5.1/16.6 dB, and root mean squared (RMS) DSs are -80.6/-72.6 dBs. The measurements can also be used for parameterizing a CUNEC-type model, which will be reported in future work.

Submitted 3 June, 2024; originally announced June 2024.

Comments: submitted to IEEE TWC

arXiv:2406.01007 [pdf, other]

Measurement of Electron Antineutrino Oscillation Amplitude and Frequency via Neutron Capture on Hydrogen at Daya Bay

Authors: Daya Bay collaboration, F. P. An, W. D. Bai, A. B. Balantekin, M. Bishai, S. Blyth, G. F. Cao, J. Cao, J. F. Chang, Y. Chang, H. S. Chen, H. Y. Chen, S. M. Chen, Y. Chen, Y. X. Chen, Z. Y. Chen, J. Cheng, J. Cheng, Y. -C. Cheng, Z. K. Cheng, J. J. Cherwinka, M. C. Chu, J. P. Cummings, O. Dalager, F. S. Deng, et al. (177 additional authors not shown)

Abstract: This Letter reports the first measurement of the oscillation amplitude and frequency of reactor antineutrinos at Daya Bay via neutron capture on hydrogen using 1958 days of data. With over 3.6 million signal candidates, an optimized candidate selection, improved treatment of backgrounds and efficiencies, refined energy calibration, and an energy response model for the capture-on-hydrogen sensitive region, the relative $\overline{ν}_{e}$ rates and energy spectra variation among the near and far detectors gives $\mathrm{sin}^22θ_{13} = 0.0759_{-0.0049}^{+0.0050}$ and $Δm^2_{32} = (2.72^{+0.14}_{-0.15})\times10^{-3}$ eV$^2$ assuming the normal neutrino mass ordering, and $Δm^2_{32} = (-2.83^{+0.15}_{-0.14})\times10^{-3}$ eV$^2$ for the inverted neutrino mass ordering. This estimate of $\sin^2 2θ_{13}$ is consistent with and essentially independent from the one obtained using the capture-on-gadolinium sample at Daya Bay. The combination of these two results yields $\mathrm{sin}^22θ_{13}= 0.0833\pm0.0022$, which represents an 8% relative improvement in precision regarding the Daya Bay full 3158-day capture-on-gadolinium result.

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2405.20617 [pdf, other]

Large-scale Outdoor Cell-free mMIMO Channel Measurement in an Urban Scenario at 3.5 GHz

Authors: Yuning Zhang, Thomas Choi, Zihang Cheng, Issei Kanno, Masaaki Ito, Jorge Gomez-Ponce, Hussein Hammoud, Bowei Wu, Ashwani Pradhan, Kelvin Arana, Pramod Krishna, Tianyi Yang, Tyler Chen, Ishita Vasishtha, Haoyu Xie, Linyu Sun, Andreas F. Molisch

Abstract: The design of cell-free massive MIMO (CF-mMIMO) systems requires accurate, measurement-based channel models. This paper provides the first results from by far the most extensive outdoor measurement campaign for CF-mMIMO channels in an urban environment. We measured impulse responses between over 20,000 potential access point (AP) locations and 80 user equipments (UEs) at 3.5 GHz with 350 MHz bandwidth (BW). Measurements use a "virtual array" approach at the AP and a hybrid switched/virtual approach at the UE. This paper describes the sounder design, measurement environment, data processing, and sample results, particularly the evolution of the power-delay profiles (PDPs) as a function of the AP locations, and its relation to the propagation environment.

Submitted 6 June, 2024; v1 submitted 31 May, 2024; originally announced May 2024.

Comments: Submitted to: VTC 2024-Fall

arXiv:2405.20325 [pdf, other]

MotionFollower: Editing Video Motion via Lightweight Score-Guided Diffusion

Authors: Shuyuan Tu, Qi Dai, Zihao Zhang, Sicheng Xie, Zhi-Qi Cheng, Chong Luo, Xintong Han, Zuxuan Wu, Yu-Gang Jiang

Abstract: Despite impressive advancements in diffusion-based video editing models in altering video attributes, there has been limited exploration into modifying motion information while preserving the original protagonist's appearance and background. In this paper, we propose MotionFollower, a lightweight score-guided diffusion model for video motion editing. To introduce conditional controls to the denoising process, MotionFollower leverages two of our proposed lightweight signal controllers, one for poses and the other for appearances, both of which consist of convolution blocks without involving heavy attention calculations. Further, we design a score guidance principle based on a two-branch architecture, including the reconstruction and editing branches, which significantly enhances the modeling capability of texture details and complicated backgrounds. Concretely, we enforce several consistency regularizers and losses during the score estimation. The resulting gradients thus inject appropriate guidance to the intermediate latents, forcing the model to preserve the original background details and protagonists' appearances without interfering with the motion modification. Experiments demonstrate the competitive motion editing ability of MotionFollower qualitatively and quantitatively. Compared with MotionEditor, the most advanced motion editing model, MotionFollower achieves an approximately 80% reduction in GPU memory while delivering superior motion editing performance and exclusively supporting large camera movements and actions.

Submitted 30 May, 2024; originally announced May 2024.

Comments: 23 pages, 18 figures. Project page at https://francis-rings.github.io/MotionFollower/

MSC Class: 68T45; 68T10

arXiv:2405.18997 [pdf, other]

Kernel Semi-Implicit Variational Inference

Authors: Ziheng Cheng, Longlin Yu, Tianyu Xie, Shiyue Zhang, Cheng Zhang

Abstract: Semi-implicit variational inference (SIVI) extends traditional variational families with semi-implicit distributions defined in a hierarchical manner. Due to the intractable densities of semi-implicit distributions, classical SIVI often resorts to surrogates of evidence lower bound (ELBO) that would introduce biases for training. A recent advancement in SIVI, named SIVI-SM, utilizes an alternative score matching objective made tractable via a minimax formulation, albeit requiring an additional lower-level optimization. In this paper, we propose kernel SIVI (KSIVI), a variant of SIVI-SM that eliminates the need for lower-level optimization through kernel tricks. Specifically, we show that when optimizing over a reproducing kernel Hilbert space (RKHS), the lower-level problem has an explicit solution. This way, the upper-level objective becomes the kernel Stein discrepancy (KSD), which is readily computable for stochastic gradient descent due to the hierarchical structure of semi-implicit variational distributions. An upper bound for the variance of the Monte Carlo gradient estimators of the KSD objective is derived, which allows us to establish novel convergence guarantees of KSIVI. We demonstrate the effectiveness and efficiency of KSIVI on both synthetic distributions and a variety of real data Bayesian inference tasks.

Submitted 29 May, 2024; originally announced May 2024.

Comments: ICML 2024 camera ready

arXiv:2405.18347 [pdf, other]

Dataset Growth

Authors: Ziheng Qin, Zhaopan Xu, Yukun Zhou, Zangwei Zheng, Zebang Cheng, Hao Tang, Lei Shang, Baigui Sun, Xiaojiang Peng, Radu Timofte, Hongxun Yao, Kai Wang, Yang You

Abstract: Deep learning benefits from the growing abundance of available data. Meanwhile, efficiently dealing with the growing data scale has become a challenge. Data publicly available are from different sources with various qualities, and it is impractical to do manual cleaning against noise and redundancy given today's data scale. There are existing techniques for cleaning/selecting the collected data. However, these methods are mainly proposed for offline settings that target one of the cleanness and redundancy problems. In practice, data are growing exponentially with both problems. This leads to repeated data curation with sub-optimal efficiency. To tackle this challenge, we propose InfoGrowth, an efficient online algorithm for data cleaning and selection, resulting in a growing dataset that keeps up to date with awareness of cleanliness and diversity. InfoGrowth can improve data quality/efficiency on both single-modal and multi-modal tasks, with an efficient and scalable design. Its framework makes it practical for real-world data engines.

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.17509 [pdf, other]

Reference Neural Operators: Learning the Smooth Dependence of Solutions of PDEs on Geometric Deformations

Authors: Ze Cheng, Zhongkai Hao, Xiaoqiang Wang, Jianing Huang, Youjia Wu, Xudan Liu, Yiru Zhao, Songming Liu, Hang Su

Abstract: For partial differential equations on domains of arbitrary shapes, existing works of neural operators attempt to learn a mapping from geometries to solutions. It often requires a large dataset of geometry-solution pairs in order to obtain a sufficiently accurate neural operator. However, for many industrial applications, e.g., engineering design optimization, it can be prohibitive to satisfy the requirement since even a single simulation may take hours or days of computation. To address this issue, we propose reference neural operators (RNO), a novel way of implementing neural operators, i.e., to learn the smooth dependence of solutions on geometric deformations. Specifically, given a reference solution, RNO can predict solutions corresponding to arbitrary deformations of the referred geometry. This approach turns out to be much more data efficient. Through extensive experiments, we show that RNO can learn the dependence across various types and different numbers of geometry objects with relatively small datasets. RNO outperforms baseline models in accuracy by a large lead and achieves up to 80% error reduction.

Submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.16577 [pdf, other]

Reflected Flow Matching

Authors: Tianyu Xie, Yu Zhu, Longlin Yu, Tong Yang, Ziheng Cheng, Shiyue Zhang, Xiangyu Zhang, Cheng Zhang

Abstract: Continuous normalizing flows (CNFs) learn an ordinary differential equation to transform prior samples into data. Flow matching (FM) has recently emerged as a simulation-free approach for training CNFs by regressing a velocity model towards the conditional velocity field. However, on constrained domains, the learned velocity model may lead to undesirable flows that result in highly unnatural samples, e.g., oversaturated images, due to both flow matching error and simulation error. To address this, we add a boundary constraint term to CNFs, which leads to reflected CNFs that keep trajectories within the constrained domains. We propose reflected flow matching (RFM) to train the velocity model in reflected CNFs by matching the conditional velocity fields in a simulation-free manner, similar to the vanilla FM. Moreover, the analytical form of conditional velocity fields in RFM avoids potentially biased approximations, making it superior to existing score-based generative models on constrained domains. We demonstrate that RFM achieves comparable or better results on standard image benchmarks and produces high-quality class-conditioned samples under high guidance weight.

Submitted 26 May, 2024; originally announced May 2024.

Comments: ICML 2024 camera-ready

arXiv:2405.15553 [pdf, other]

Massive MIMO-ISAC System With 1-Bit ADCs/DACs

Authors: Bowen Wang, Hongyu Li, Bin Liao, Ziyang Cheng

Abstract: This paper investigates a hardware-efficient massive multiple-input multiple-output integrated sensing and communication (MIMO-ISAC) system with 1-bit analog-to-digital converters (ADCs)/digital-to-analog converters (DACs). The proposed system, referred to as 1BitISAC, employs 1-bit DACs at the ISAC transmitter and 1-bit ADCs at the sensing receiver, achieving significant reductions in power consumption and hardware costs. For such kind of systems, two 1BitISAC joint transceiver designs, i.e., i) quality of service constrained 1BitISAC design and ii) quality of detection constrained design, are considered and the corresponding problems are formulated. In order to address these problems, we thoroughly analyze the radar detection performance after 1-bit ADCs quantization and the communication bit error rate. This analysis yields new design insights and leads to unique radar and communication metrics, which enables us to simplify the original problems and employ majorization-minimization and integer linear programming methods to solve the problems. Numerical results are provided to validate the performance analysis of the proposed 1BitISAC and to compare with other ISAC configurations. The superiority of the proposed 1BitISAC system in terms of balancing ISAC performance and energy efficiency is also demonstrated.

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2405.15350 [pdf, ps, other]

Coloring invariants for links in $Σ_g\times S^1$

Authors: Zhiyun Cheng, Hongzhu Gao

Abstract: Let $Σ_g$ be a closed oriented surface of genus $g$. In this paper we discuss how to define coloring invariants and their generalizations for links in $Σ_g\times S^1$.

Submitted 24 May, 2024; originally announced May 2024.

Comments: 11 pages, 8 figures

MSC Class: 57K10; 57K12

arXiv:2405.15133 [pdf, other]

Modeling of Hydroacoustic Noise from Marine Propellers with Tip Vortex Cavitation

Authors: Zhi Cheng, Suraj Kashyap, Brendan Smoker, Giorgio Burella, Rajeev Jaiman

Abstract: The present work aims to study the cavitating turbulent flow of a full-scale marine propeller and explore the physical mechanism underpinning the underwater radiated noise. We employ the standard dynamic large eddy simulation for the turbulent wake flow and the Schnerr-Sauer cavitation model, while the Ffowcs-Williams-Hawkings acoustic analogy is considered for the hydroacoustic modeling. For the current investigation, we consider a well-known Potsdam Propeller Test Case to analyze the turbulent cavitating flow and the associated hydroacoustic emissions. To begin, the modeling framework is validated using the available experimental data, and distinctive double-helical tip vortex cavitation and its qualitative patterns along the vortex trajectory are captured. In comparison to the non-cavitating condition, the pressure distribution on the propeller surface is more disordered for the cavitating condition, which is further reflected by a relatively stronger power of both low-frequency tonal peaks and high-frequency broadband components in the spectrum of thrust generation. Specifically, the generation of cavitation leads to the enhancement of the monopole noise source and the breakdown of cavitation bubbles as well as vortex structures in the turbulent wake. Furthermore, the tonal noise with the frequency corresponding to the harmonics of blade passing frequency is also enhanced. Generally speaking, the generation of cavitation structures enhances the hydroacoustics energy of URN at all orientations, especially in the downstream direction with sound pressure level increasing up to 20 dB.

Submitted 23 May, 2024; originally announced May 2024.

Report number: OMAE2024-125991

arXiv:2405.14297 [pdf, other]

Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models

Authors: Yongxin Guo, Zhenglin Cheng, Xiaoying Tang, Tao Lin

Abstract: The Sparse Mixture of Experts (SMoE) has been widely employed to enhance the efficiency of training and inference for Transformer-based foundational models, yielding promising results. However, the performance of SMoE heavily depends on the choice of hyper-parameters, such as the number of experts and the number of experts to be activated (referred to as top-k), resulting in significant computational overhead due to the extensive model training by searching over various hyper-parameter configurations. As a remedy, we introduce the Dynamic Mixture of Experts (DynMoE) technique. DynMoE incorporates (1) a novel gating method that enables each token to automatically determine the number of experts to activate, and (2) an adaptive process that automatically adjusts the number of experts during training. Extensive numerical results across Vision, Language, and Vision-Language tasks demonstrate the effectiveness of our approach to achieve competitive performance compared to GMoE for vision and language tasks, and MoE-LLaVA for vision-language tasks, while maintaining efficiency by activating fewer parameters. Our code is available at https://github.com/LINs-lab/DynMoE.

Submitted 23 May, 2024; originally announced May 2024.

Comments: 9 pages, 21 figures

arXiv:2405.12072 [pdf, other]

Real topological phonons in 3D carbon allotropes

Authors: Xiaotian Wang, Jingbo Bai, Jianhua Wang, Zhenxiang Cheng, Shifeng Qian, Wenhong Wang, Gang Zhang, Zhi-Ming Yu, Yugui Yao

Abstract: There has been a significant focus on real topological systems that enjoy space-time inversion symmetry (PT) and lack spin-orbit coupling. While the theoretical classification of the real topology has been established, more progress has yet to be made in the materials realization of such real topological systems in three dimensions (3D). To address this crucial issue, by selecting the carbon-based material candidates as targets, we perform high-throughput computing to inspect the real topology in the phonon spectrums of the 3D carbon allotropes in the Samara Carbon Allotrope Database (SACADA). Among 1192 kinds of 3D carbon allotropes, we find 65 real topological systems with a phononic real Chern insulating (PRCI) state, 2 real topological systems with a phononic real nodal line (PRNL) state, 10 real topological systems with a phononic real Dirac point (PRDP) state, and 8 real topological systems with a phononic real triple-point pair (PRTPP) state. This greatly expands the material candidates with real topology, especially for the gapless topological phonons. We exhibit the PRCI, PRNL, PRTPP, and PRDP states of 27-SG. 166-pcu-h, 1081-SG. 194-42T13-CA, 52-SG. 141-gis, and 132-SG. 191-3,4T157 as illustrative examples, and explore the second-order boundary mode, i.e., phononic hinge mode. Among the four examples, the materials 1081-SG. 194-42T13-CA and 52-SG. 141-gis are so ideal that the PRNL and PRTPP in them are well separated from other bands, and the phononic hinge mode can be clearly observed. This study aims to broaden the understanding of 3D topological phonons, and emphasizes the potential of 3D carbon allotropes as a valuable framework for exploring the fascinating physics related to phononic hinge modes and phononic real topology.

Submitted 20 May, 2024; originally announced May 2024.

arXiv:2405.11667 [pdf, other]

The Limits and Potentials of Local SGD for Distributed Heterogeneous Learning with Intermittent Communication

Authors: Kumar Kshitij Patel, Margalit Glasgow, Ali Zindari, Lingxiao Wang, Sebastian U. Stich, Ziheng Cheng, Nirmit Joshi, Nathan Srebro

Abstract: Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms in practice, including mini-batch SGD. Despite this success, theoretically proving the dominance of local SGD in settings with reasonable data heterogeneity has been difficult, creating a significant gap between theory and practice. In this paper, we provide new lower bounds for local SGD under existing first-order data heterogeneity assumptions, showing that these assumptions are insufficient to prove the effectiveness of local update steps. Furthermore, under these same assumptions, we demonstrate the min-max optimality of accelerated mini-batch SGD, which fully resolves our understanding of distributed optimization for several problem classes. Our results emphasize the need for better models of data heterogeneity to understand the effectiveness of local SGD in practice. Towards this end, we consider higher-order smoothness and heterogeneity assumptions, providing new upper bounds that imply the dominance of local SGD over mini-batch SGD when data heterogeneity is low.

Submitted 19 May, 2024; originally announced May 2024.

arXiv:2405.10313 [pdf, other]

How Far Are We From AGI

Authors: Tao Feng, Chuanyang Jin, Jingyu Liu, Kunlun Zhu, Haoqin Tu, Zirui Cheng, Guanyu Lin, Jiaxuan You

Abstract: The evolution of artificial intelligence (AI) has profoundly impacted human society, driving significant advancements in multiple sectors. Yet, the escalating demands on AI have highlighted the limitations of AI's current offerings, catalyzing a movement towards Artificial General Intelligence (AGI). AGI, distinguished by its ability to execute diverse real-world tasks with efficiency and effectiveness comparable to human intelligence, reflects a paramount milestone in AI evolution. While existing works have summarized specific recent advancements of AI, they lack a comprehensive discussion of AGI's definitions, goals, and developmental trajectories. Different from existing survey papers, this paper delves into the pivotal questions of our proximity to AGI and the strategies necessary for its realization through extensive surveys, discussions, and original perspectives. We start by articulating the requisite capability frameworks for AGI, integrating the internal, interface, and system dimensions. As the realization of AGI requires more advanced capabilities and adherence to stringent constraints, we further discuss necessary AGI alignment technologies to harmonize these factors. Notably, we emphasize the importance of approaching AGI responsibly by first defining the key levels of AGI progression, followed by the evaluation framework that situates the status-quo, and finally giving our roadmap of how to reach the pinnacle of AGI. Moreover, to give tangible insights into the ubiquitous impact of the integration of AI, we outline existing challenges and potential pathways toward AGI in multiple domains. In sum, serving as a pioneering exploration into the current state and future trajectory of AGI, this paper aims to foster a collective comprehension and catalyze broader public discussions among researchers and practitioners on AGI.

Submitted 16 May, 2024; originally announced May 2024.

arXiv:2405.08463 [pdf, other]

A Timely Survey on Vision Transformer for Deepfake Detection

Authors: Zhikan Wang, Zhongyao Cheng, Jiajie Xiong, Xun Xu, Tianrui Li, Bharadwaj Veeravalli, Xulei Yang

Abstract: In recent years, the rapid advancement of deepfake technology has revolutionized content creation, lowering forgery costs while elevating quality. However, this progress brings forth pressing concerns such as infringements on individual rights, national security threats, and risks to public safety. To counter these challenges, various detection methodologies have emerged, with Vision Transformer (ViT)-based approaches showcasing superior performance in generality and efficiency. This survey presents a timely overview of ViT-based deepfake detection models, categorized into standalone, sequential, and parallel architectures. Furthermore, it succinctly delineates the structure and characteristics of each model. By analyzing existing research and addressing future directions, this survey aims to equip researchers with a nuanced understanding of ViT's pivotal role in deepfake detection, serving as a valuable reference for both academic and practical pursuits in this domain.

Submitted 14 May, 2024; originally announced May 2024.

arXiv:2405.07281 [pdf, ps, other]

Movable Antennas Aided Multicast MISO Communication Systems

Authors: Zhenqiao Cheng, Nanxi Li, Ruizhe Long, Jianchi Zhu, Chongjun Ouyang, Peng Chen

Abstract: A novel multicast communication system with movable antennas (MAs) is proposed, where the antenna position optimization is exploited to enhance the transmission rate. Specifically, an MA-assisted two-user multicast multiple-input single-output system is considered. The joint optimization of the transmit beamforming vector and transmit MA positions is studied by modeling the motion of the MA elements as discrete movements. A low-complexity greedy search-based algorithm is proposed to tackle this non-convex inter-programming problem. A branch-and-bound (BAB)-based method is proposed to achieve the optimal multicast rate with lower time complexity than the brute-force search by assuming the two users suffer similar line-of-sight path losses. Numerical results reveal that the proposed MA systems significantly improve the multicast rate compared to conventional fixed-position antennas (FPAs)-based systems.

Submitted 12 May, 2024; originally announced May 2024.

Comments: 5 pages

arXiv:2405.03064 [pdf, other]

RICE: Breaking Through the Training Bottlenecks of Reinforcement Learning with Explanation

Authors: Zelei Cheng, Xian Wu, Jiahao Yu, Sabrina Yang, Gang Wang, Xinyu Xing

Abstract: Deep reinforcement learning (DRL) is playing an increasingly important role in real-world applications. However, obtaining an optimally performing DRL agent for complex tasks, especially with sparse rewards, remains a significant challenge. The training of a DRL agent can be often trapped in a bottleneck without further progress. In this paper, we propose RICE, an innovative refining scheme for reinforcement learning that incorporates explanation methods to break through the training bottlenecks. The high-level idea of RICE is to construct a new initial state distribution that combines both the default initial states and critical states identified through explanation methods, thereby encouraging the agent to explore from the mixed initial states. Through careful design, we can theoretically guarantee that our refining scheme has a tighter sub-optimality bound. We evaluate RICE in various popular RL environments and real-world applications. The results demonstrate that RICE significantly outperforms existing refining schemes in enhancing agent performance.

Submitted 5 June, 2024; v1 submitted 5 May, 2024; originally announced May 2024.

Comments: Accepted by ICML 2024

arXiv:2405.00587 [pdf, other]

GraCo: Granularity-Controllable Interactive Segmentation

Authors: Yian Zhao, Kehan Li, Zesen Cheng, Pengchong Qiao, Xiawu Zheng, Rongrong Ji, Chang Liu, Li Yuan, Jie Chen

Abstract: Interactive Segmentation (IS) segments specific objects or parts in the image according to user input. Current IS pipelines fall into two categories: single-granularity output and multi-granularity output. The latter aims to alleviate the spatial ambiguity present in the former. However, the multi-granularity output pipeline suffers from limited interaction flexibility and produces redundant results. In this work, we introduce Granularity-Controllable Interactive Segmentation (GraCo), a novel approach that allows precise control of prediction granularity by introducing additional parameters to input. This enhances the customization of the interactive system and eliminates redundancy while resolving ambiguity. Nevertheless, the exorbitant cost of annotating multi-granularity masks and the lack of available datasets with granularity annotations make it difficult for models to acquire the necessary guidance to control output granularity. To address this problem, we design an any-granularity mask generator that exploits the semantic property of the pre-trained IS model to automatically generate abundant mask-granularity pairs without requiring additional manual annotation. Based on these pairs, we propose a granularity-controllable learning strategy that efficiently imparts the granularity controllability to the IS model. Extensive experiments on intricate scenarios at object and part levels demonstrate that our GraCo has significant advantages over previous methods. This highlights the potential of GraCo to be a flexible annotation tool, capable of adapting to diverse segmentation scenarios. The project page: https://zhao-yian.github.io/GraCo.

Submitted 16 May, 2024; v1 submitted 1 May, 2024; originally announced May 2024.

Comments: CVPR2024 Highlight, Project: https://zhao-yian.github.io/GraCo

arXiv:2404.18398 [pdf, other]

MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis

Authors: Xiang Li, Zhi-Qi Cheng, Jun-Yan He, Xiaojiang Peng, Alexander G. Hauptmann

Abstract: Emotional Text-to-Speech (E-TTS) synthesis has gained significant attention in recent years due to its potential to enhance human-computer interaction. However, current E-TTS approaches often struggle to capture the complexity of human emotions, primarily relying on oversimplified emotional labels or single-modality inputs. To address these limitations, we propose the Multimodal Emotional Text-to-Speech System (MM-TTS), a unified framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. MM-TTS consists of two key components: (1) the Emotion Prompt Alignment Module (EP-Align), which employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information; and (2) the Emotion Embedding-Induced TTS (EMI-TTS), which integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions. Extensive evaluations across diverse datasets demonstrate the superior performance of MM-TTS compared to traditional E-TTS models. Objective metrics, including Word Error Rate (WER) and Character Error Rate (CER), show significant improvements on ESD dataset, with MM-TTS achieving scores of 7.35% and 3.07%, respectively. Subjective assessments further validate that MM-TTS generates speech with emotional fidelity and naturalness comparable to human speech. Our code and pre-trained models are publicly available at https://anonymous.4open.science/r/MMTTS-D214

Submitted 28 April, 2024; originally announced April 2024.

arXiv:2404.18243 [pdf, other]

LEGENT: Open Platform for Embodied Agents

Authors: Zhili Cheng, Zhitong Wang, Jinyi Hu, Shengding Hu, An Liu, Yuge Tu, Pengkai Li, Lei Shi, Zhiyuan Liu, Maosong Sun

Abstract: Despite advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), their integration into language-grounded, human-like embodied agents remains incomplete, hindering complex real-life task performance in physical environments. Existing integrations often feature limited open sourcing, challenging collective progress in this field. We introduce LEGENT, an open, scalable platform for developing embodied agents using LLMs and LMMs. LEGENT offers a dual approach: a rich, interactive 3D environment with communicable and actionable agents, paired with a user-friendly interface, and a sophisticated data generation pipeline utilizing advanced algorithms to exploit supervision from simulated worlds at scale. In our experiments, an embryonic vision-language-action model trained on LEGENT-generated data surpasses GPT-4V in embodied tasks, showcasing promising generalization capabilities.

Submitted 28 April, 2024; originally announced April 2024.

Comments: Demo Paper

arXiv:2404.18166 [pdf, other]

Behavior-Contextualized Item Preference Modeling for Multi-Behavior Recommendation

Authors: Mingshi Yan, Fan Liu, Jing Sun, Fuming Sun, Zhiyong Cheng, Yahong Han

Abstract: In recommender systems, multi-behavior methods have demonstrated their effectiveness in mitigating issues like data sparsity, a common challenge in traditional single-behavior recommendation approaches. These methods typically infer user preferences from various auxiliary behaviors and apply them to the target behavior for recommendations. However, this direct transfer can introduce noise to the target behavior in recommendation, due to variations in user attention across different behaviors. To address this issue, this paper introduces a novel approach, Behavior-Contextualized Item Preference Modeling (BCIPM), for multi-behavior recommendation. Our proposed Behavior-Contextualized Item Preference Network discerns and learns users' specific item preferences within each behavior. It then considers only those preferences relevant to the target behavior for final recommendations, significantly reducing noise from auxiliary behaviors. These auxiliary behaviors are utilized solely for training the network parameters, thereby refining the learning process without compromising the accuracy of the target behavior recommendations. To further enhance the effectiveness of BCIPM, we adopt a strategy of pre-training the initial embeddings. This step is crucial for enriching the item-aware preferences, particularly in scenarios where data related to the target behavior is sparse. Comprehensive experiments conducted on four real-world datasets demonstrate BCIPM's superior performance compared to several leading state-of-the-art models, validating the robustness and efficiency of our proposed approach.

Submitted 28 April, 2024; originally announced April 2024.

Comments: This paper has been accepted by SIGIR 2024

arXiv:2404.17936 [pdf, other]

FDCE-Net: Underwater Image Enhancement with Embedding Frequency and Dual Color Encoder

Authors: Zheng Cheng, Guodong Fan, Jingchun Zhou, Min Gan, C. L. Philip Chen

Abstract: Underwater images often suffer from various issues such as low brightness, color shift, blurred details, and noise due to light absorption and scattering caused by water and suspended particles. Previous underwater image enhancement (UIE) methods have primarily focused on spatial domain enhancement, neglecting the frequency domain information inherent in the images. However, the degradation factors of underwater images are closely intertwined in the spatial domain. Although certain methods focus on enhancing images in the frequency domain, they overlook the inherent relationship between the image degradation factors and the information present in the frequency domain. As a result, these methods frequently enhance certain attributes of the improved image while inadequately addressing or even exacerbating other attributes. Moreover, many existing methods heavily rely on prior knowledge to address color shift problems in underwater images, limiting their flexibility and robustness. In order to overcome these limitations, we propose the Embedding Frequency and Dual Color Encoder Network (FDCE-Net) in our paper. The FDCE-Net consists of two main structures: (1) Frequency Spatial Network (FS-Net) aims to achieve initial enhancement by utilizing our designed Frequency Spatial Residual Block (FSRB) to decouple image degradation factors in the frequency domain and enhance different attributes separately. (2) To tackle the color shift issue, we introduce the Dual-Color Encoder (DCE). The DCE establishes correlations between color and semantic representations through cross-attention and leverages multi-scale image features to guide the optimization of adaptive color query. The final enhanced images are generated by combining the outputs of FS-Net and DCE through a fusion network. These images exhibit rich details, clear textures, low noise and natural colors.

Submitted 27 April, 2024; originally announced April 2024.

Comments: 16 pages, 13 figures

arXiv:2404.17297 [pdf, ps, other]

Denotation-based Compositional Compiler Verification

Authors: Zhang Cheng, Jiyang Wu, Di Wang, Qinxiang Cao

Abstract: A desired but challenging property of compiler verification is compositionality in the sense that the compilation correctness of a program can be deduced from that of its substructures ranging from statements, functions, and modules incrementally. Previously proposed approaches have devoted extensive effort to module-level compositionality based on small-step semantics and simulation theories. This paper proposes a novel compiler verification framework based on denotational semantics for better compositionality. Specifically, our denotational semantics is defined by semantic functions that map a syntactic component to a semantic domain composed of multiple behavioral \emph{sets}, and compiler correctness is defined by the behavioral refinement between semantic domains of the source and the target programs. Therefore, when proving compiler correctness, we can extensively leverage the algebraic properties of sets. Another important contribution is that our formalization of denotational semantics captures the full meaning of a program and bridges the gap between those based on conventional powerdomains and what realistic compiler verification actually needs. We demonstrate our denotation-based framework viable and practical by applying it to the verification of the front-end of CompCert and showing that the compositionality from the compilation correctness of sub-statements to statements, from functions to modules, and from modules to the whole program (i.e., module-level compositionality) can be achieved similarly.

Submitted 15 May, 2024; v1 submitted 26 April, 2024; originally announced April 2024.

Comments: 38 pages, 8 figures

arXiv:2404.16580 [pdf, other]

A New Two-Sided Sketching Algorithm for Large-Scale Tensor Decomposition Based on Discrete Cosine Transformation

Authors: Zhiguang Cheng, Gaohang Yu, Xiaohao Cai, Liqun Qi

Abstract: Large tensors are frequently encountered in various fields such as computer vision, scientific simulations, sensor networks, and data mining. However, these tensors are often too large for convenient processing, transfer, or storage. Fortunately, they typically exhibit a low-rank structure that can be leveraged through tensor decomposition. Despite this, performing large-scale tensor decomposition can be time-consuming. Sketching is a useful technique to reduce the dimensionality of the data. In this study, we introduce a novel two-sided sketching method based on the $t$-product decomposition and the discrete cosine transformation. We conduct a thorough theoretical analysis to assess the approximation error of the proposed method. Specifically, we enhance the algorithm with power iteration to achieve more precise approximate solutions. Extensive numerical experiments and comparisons on low-rank approximation of color images and grayscale videos illustrate the efficiency and effectiveness of the proposed approach in terms of both CPU time and approximation accuracy.

Submitted 28 April, 2024; v1 submitted 25 April, 2024; originally announced April 2024.
