subscribe to arXiv mailings

The Role of Depth, Width, and Tree Size in Expressiveness of Deep Forest

Authors: Shen-Huan Lyu, Jin-Hui Wu, Qin-Cheng Zheng, Baoliu Ye

Abstract: Random forests are classical ensemble algorithms that construct multiple randomized decision trees and aggregate their predictions using naive averaging. \citet{zhou2019deep} further propose a deep forest algorithm with multi-layer forests, which outperforms random forests in various tasks. The performance of deep forests is related to three hyperparameters in practice: depth, width, and tree size… ▽ More Random forests are classical ensemble algorithms that construct multiple randomized decision trees and aggregate their predictions using naive averaging. \citet{zhou2019deep} further propose a deep forest algorithm with multi-layer forests, which outperforms random forests in various tasks. The performance of deep forests is related to three hyperparameters in practice: depth, width, and tree size, but little has been known about its theoretical explanation. This work provides the first upper and lower bounds on the approximation complexity of deep forests concerning the three hyperparameters. Our results confirm the distinctive role of depth, which can exponentially enhance the expressiveness of deep forests compared with width and tree size. Experiments confirm the theoretical findings. △ Less

Submitted 6 July, 2024; originally announced July 2024.

Journal ref: In: Proceedings of the 27th European Conference on Artificial Intelligence, 2024

arXiv:2407.03107 [pdf]

Design of a UE5-based digital twin platform

Authors: Shaoqiu Lyu, Muzhi Wang, Sunrui Zhang, Shengzhi Wang

Abstract: Aiming at the current mainstream 3D scene engine learning and building cost is too high, this thesis proposes a digital twin platform design program based on Unreal Engine 5 (UE5). It aims to provide a universal platform construction design process to effectively reduce the learning cost of large-scale scene construction. Taking an actual project of a unit as an example, the overall cycle work of… ▽ More Aiming at the current mainstream 3D scene engine learning and building cost is too high, this thesis proposes a digital twin platform design program based on Unreal Engine 5 (UE5). It aims to provide a universal platform construction design process to effectively reduce the learning cost of large-scale scene construction. Taking an actual project of a unit as an example, the overall cycle work of platform building is explained, and the digital twin and data visualization technologies and applications based on UE5 are analyzed. By summarizing the project implementation into a process approach, the standardization and operability of the process pathway is improved. △ Less

Submitted 3 July, 2024; originally announced July 2024.

arXiv:2406.16943 [pdf, other]

doi 10.1109/CSCAIoT62585.2024.00005

EarDA: Towards Accurate and Data-Efficient Earable Activity Sensing

Authors: Shengzhe Lyu, Yongliang Chen, Di Duan, Renqi Jia, Weitao Xu

Abstract: In the realm of smart sensing with the Internet of Things, earable devices are empowered with the capability of multi-modality sensing and intelligence of context-aware computing, leading to its wide usage in Human Activity Recognition (HAR). Nonetheless, unlike the movements captured by Inertial Measurement Unit (IMU) sensors placed on the upper or lower body, those motion signals obtained from e… ▽ More In the realm of smart sensing with the Internet of Things, earable devices are empowered with the capability of multi-modality sensing and intelligence of context-aware computing, leading to its wide usage in Human Activity Recognition (HAR). Nonetheless, unlike the movements captured by Inertial Measurement Unit (IMU) sensors placed on the upper or lower body, those motion signals obtained from earable devices show significant changes in amplitudes and patterns, especially in the presence of dynamic and unpredictable head movements, posing a significant challenge for activity classification. In this work, we present EarDA, an adversarial-based domain adaptation system to extract the domain-independent features across different sensor locations. Moreover, while most deep learning methods commonly rely on training with substantial amounts of labeled data to offer good accuracy, the proposed scheme can release the potential usage of publicly available smartphone-based IMU datasets. Furthermore, we explore the feasibility of applying a filter-based data processing method to mitigate the impact of head movement. EarDA, the proposed system, enables more data-efficient and accurate activity sensing. It achieves an accuracy of 88.8% under HAR task, demonstrating a significant 43% improvement over methods without domain adaptation. This clearly showcases its effectiveness in mitigating domain gaps. △ Less

Submitted 18 June, 2024; originally announced June 2024.

Comments: accepted by 2024 IEEE Coupling of Sensing & Computing in AIoT Systems (CSCAIoT)

arXiv:2406.10427 [pdf, other]

Adaptive Randomized Smoothing: Certifying Multi-Step Defences against Adversarial Examples

Authors: Saiyue Lyu, Shadab Shaikh, Frederick Shpilevskiy, Evan Shelhamer, Mathias Lécuyer

Abstract: We propose Adaptive Randomized Smoothing (ARS) to certify the predictions of our test-time adaptive models against adversarial examples. ARS extends the analysis of randomized smoothing using f-Differential Privacy to certify the adaptive composition of multiple steps. For the first time, our theory covers the sound adaptive composition of general and high-dimensional functions of noisy input. We… ▽ More We propose Adaptive Randomized Smoothing (ARS) to certify the predictions of our test-time adaptive models against adversarial examples. ARS extends the analysis of randomized smoothing using f-Differential Privacy to certify the adaptive composition of multiple steps. For the first time, our theory covers the sound adaptive composition of general and high-dimensional functions of noisy input. We instantiate ARS on deep image classification to certify predictions against adversarial examples of bounded $L_{\infty}$ norm. In the $L_{\infty}$ threat model, our flexibility enables adaptation through high-dimensional input-dependent masking. We design adaptivity benchmarks, based on CIFAR-10 and CelebA, and show that ARS improves accuracy by $2$ to $5\%$ points. On ImageNet, ARS improves accuracy by $1$ to $3\%$ points over standard RS without adaptivity. △ Less

Submitted 14 June, 2024; originally announced June 2024.

arXiv:2406.04745 [pdf, other]

Confidence-aware Contrastive Learning for Selective Classification

Authors: Yu-Chang Wu, Shen-Huan Lyu, Haopu Shang, Xiangyu Wang, Chao Qian

Abstract: Selective classification enables models to make predictions only when they are sufficiently confident, aiming to enhance safety and reliability, which is important in high-stakes scenarios. Previous methods mainly use deep neural networks and focus on modifying the architecture of classification layers to enable the model to estimate the confidence of its prediction. This work provides a generaliz… ▽ More Selective classification enables models to make predictions only when they are sufficiently confident, aiming to enhance safety and reliability, which is important in high-stakes scenarios. Previous methods mainly use deep neural networks and focus on modifying the architecture of classification layers to enable the model to estimate the confidence of its prediction. This work provides a generalization bound for selective classification, disclosing that optimizing feature layers helps improve the performance of selective classification. Inspired by this theory, we propose to explicitly improve the selective classification model at the feature level for the first time, leading to a novel Confidence-aware Contrastive Learning method for Selective Classification, CCL-SC, which similarizes the features of homogeneous instances and differentiates the features of heterogeneous instances, with the strength controlled by the model's confidence. The experimental results on typical datasets, i.e., CIFAR-10, CIFAR-100, CelebA, and ImageNet, show that CCL-SC achieves significantly lower selective risk than state-of-the-art methods, across almost all coverage degrees. Moreover, it can be combined with existing methods to bring further improvement. △ Less

Submitted 7 June, 2024; originally announced June 2024.

Comments: Accepted by ICML 2024

arXiv:2406.01112 [pdf, other]

BACON: Bayesian Optimal Condensation Framework for Dataset Distillation

Authors: Zheng Zhou, Hongbo Zhao, Guangliang Cheng, Xiangtai Li, Shuchang Lyu, Wenquan Feng, Qi Zhao

Abstract: Dataset Distillation (DD) aims to distill knowledge from extensive datasets into more compact ones while preserving performance on the test set, thereby reducing storage costs and training expenses. However, existing methods often suffer from computational intensity, particularly exhibiting suboptimal performance with large dataset sizes due to the lack of a robust theoretical framework for analyz… ▽ More Dataset Distillation (DD) aims to distill knowledge from extensive datasets into more compact ones while preserving performance on the test set, thereby reducing storage costs and training expenses. However, existing methods often suffer from computational intensity, particularly exhibiting suboptimal performance with large dataset sizes due to the lack of a robust theoretical framework for analyzing the DD problem. To address these challenges, we propose the BAyesian optimal CONdensation framework (BACON), which is the first work to introduce the Bayesian theoretical framework to the literature of DD. This framework provides theoretical support for enhancing the performance of DD. Furthermore, BACON formulates the DD problem as the minimization of the expected risk function in joint probability distributions using the Bayesian framework. Additionally, by analyzing the expected risk function for optimal condensation, we derive a numerically feasible lower bound based on specific assumptions, providing an approximate solution for BACON. We validate BACON across several datasets, demonstrating its superior performance compared to existing state-of-the-art methods. For instance, under the IPC-10 setting, BACON achieves a 3.46% accuracy gain over the IDM method on the CIFAR-10 dataset and a 3.10% gain on the TinyImageNet dataset. Our extensive experiments confirm the effectiveness of BACON and its seamless integration with existing methods, thereby enhancing their performance for the DD task. Code and distilled datasets are available at BACON. △ Less

Submitted 3 June, 2024; originally announced June 2024.

Comments: 22 pages, 10 figures

arXiv:2406.00985 [pdf, other]

MultiEdits: Simultaneous Multi-Aspect Editing with Text-to-Image Diffusion Models

Authors: Mingzhen Huang, Jialing Cai, Shan Jia, Vishnu Suresh Lokhande, Siwei Lyu

Abstract: Text-driven image synthesis has made significant advancements with the development of diffusion models, transforming how visual content is generated from text prompts. Despite these advances, text-driven image editing, a key area in computer graphics, faces unique challenges. A major challenge is making simultaneous edits across multiple objects or attributes. Applying these methods sequentially f… ▽ More Text-driven image synthesis has made significant advancements with the development of diffusion models, transforming how visual content is generated from text prompts. Despite these advances, text-driven image editing, a key area in computer graphics, faces unique challenges. A major challenge is making simultaneous edits across multiple objects or attributes. Applying these methods sequentially for multi-aspect edits increases computational demands and efficiency losses. In this paper, we address these challenges with significant contributions. Our main contribution is the development of MultiEdits, a method that seamlessly manages simultaneous edits across multiple attributes. In contrast to previous approaches, MultiEdits not only preserves the quality of single attribute edits but also significantly improves the performance of multitasking edits. This is achieved through an innovative attention distribution mechanism and a multi-branch design that operates across several processing heads. Additionally, we introduce the PIE-Bench++ dataset, an expansion of the original PIE-Bench dataset, to better support evaluating image-editing tasks involving multiple objects and attributes simultaneously. This dataset is a benchmark for evaluating text-driven image editing methods in multifaceted scenarios. Dataset and code are available at https://mingzhenhuang.com/projects/MultiEdits.html. △ Less

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2405.18320 [pdf, other]

Self-Supervised Learning Based Handwriting Verification

Authors: Mihir Chauhan, Mohammad Abuzar Shaikh, Bina Ramamurthy, Mingchen Gao, Siwei Lyu, Sargur Srihari

Abstract: We present SSL-HV: Self-Supervised Learning approaches applied to the task of Handwriting Verification. This task involves determining whether a given pair of handwritten images originate from the same or different writer distribution. We have compared the performance of multiple generative, contrastive SSL approaches against handcrafted feature extractors and supervised learning on CEDAR AND data… ▽ More We present SSL-HV: Self-Supervised Learning approaches applied to the task of Handwriting Verification. This task involves determining whether a given pair of handwritten images originate from the same or different writer distribution. We have compared the performance of multiple generative, contrastive SSL approaches against handcrafted feature extractors and supervised learning on CEDAR AND dataset. We show that ResNet based Variational Auto-Encoder (VAE) outperforms other generative approaches achieving 76.3% accuracy, while ResNet-18 fine-tuned using Variance-Invariance-Covariance Regularization (VICReg) outperforms other contrastive approaches achieving 78% accuracy. Using a pre-trained VAE and VICReg for the downstream task of writer verification we observed a relative improvement in accuracy of 6.7% and 9% over ResNet-18 supervised baseline with 10% writer labels. △ Less

Submitted 28 May, 2024; originally announced May 2024.

Comments: 14 pages, 6 figures, 2 tables

arXiv:2405.17837 [pdf, other]

Enabling Generative Design Tools with LLM Agents for Building Novel Devices: A Case Study on Fluidic Computation Interfaces

Authors: Qiuyu Lu, Jiawei Fang, Zhihao Yao, Yue Yang, Shiqing Lyu, Haipeng Mi, Lining Yao

Abstract: In the field of Human-Computer Interaction (HCI), the development of interactive devices represents a significant area of focus. The advent of novel hardware and advanced fabrication techniques has underscored the demand for specialized design tools that democratize the prototyping process for such cutting-edge devices. While these tools simplify the process through parametric design and simulatio… ▽ More In the field of Human-Computer Interaction (HCI), the development of interactive devices represents a significant area of focus. The advent of novel hardware and advanced fabrication techniques has underscored the demand for specialized design tools that democratize the prototyping process for such cutting-edge devices. While these tools simplify the process through parametric design and simulation, they typically require a certain learning curve and often fall short in facilitating creative ideation. In this study, we employ fluidic computation interface as a case study to investigate the potential of augmenting design tools of physical devices with Large Language Model (LLM) agents. Enhanced by LLM agents, the Generative Design Tool (GDT) can comprehend the capabilities and limitations of newly developed devices; it can propose varied, insightful, and practical application scenarios, and recommend device designs that are technically and contextually appropriate. Furthermore, it generates the necessary design parameters for the traditional part of the design tool to visualize results and produce support files for fabrication. This paper outlines the GDT's framework, implementation, and performance, while also contemplating its prospects and the obstacles encountered. △ Less

Submitted 28 May, 2024; originally announced May 2024.

Comments: 25 pages, 12 figures

arXiv:2405.11326 [pdf, other]

On the Trajectory Regularity of ODE-based Diffusion Sampling

Authors: Defang Chen, Zhenyu Zhou, Can Wang, Chunhua Shen, Siwei Lyu

Abstract: Diffusion-based generative models use stochastic differential equations (SDEs) and their equivalent ordinary differential equations (ODEs) to establish a smooth connection between a complex data distribution and a tractable prior distribution. In this paper, we identify several intriguing trajectory properties in the ODE-based sampling process of diffusion models. We characterize an implicit denoi… ▽ More Diffusion-based generative models use stochastic differential equations (SDEs) and their equivalent ordinary differential equations (ODEs) to establish a smooth connection between a complex data distribution and a tractable prior distribution. In this paper, we identify several intriguing trajectory properties in the ODE-based sampling process of diffusion models. We characterize an implicit denoising trajectory and discuss its vital role in forming the coupled sampling trajectory with a strong shape regularity, regardless of the generated content. We also describe a dynamic programming-based scheme to make the time schedule in sampling better fit the underlying trajectory structure. This simple strategy requires minimal modification to any given ODE-based numerical solvers and incurs negligible computational cost, while delivering superior performance in image generation, especially in $5\sim 10$ function evaluations. △ Less

Submitted 18 May, 2024; originally announced May 2024.

Comments: ICML 2024, 30 pages

arXiv:2405.08487 [pdf, other]

Semantic Contextualization of Face Forgery: A New Definition, Dataset, and Detection Method

Authors: Mian Zou, Baosheng Yu, Yibing Zhan, Siwei Lyu, Kede Ma

Abstract: In recent years, deep learning has greatly streamlined the process of generating realistic fake face images. Aware of the dangers, researchers have developed various tools to spot these counterfeits. Yet none asked the fundamental question: What digital manipulations make a real photographic face image fake, while others do not? In this paper, we put face forgery in a semantic context and define t… ▽ More In recent years, deep learning has greatly streamlined the process of generating realistic fake face images. Aware of the dangers, researchers have developed various tools to spot these counterfeits. Yet none asked the fundamental question: What digital manipulations make a real photographic face image fake, while others do not? In this paper, we put face forgery in a semantic context and define that computational methods that alter semantic face attributes to exceed human discrimination thresholds are sources of face forgery. Guided by our new definition, we construct a large face forgery image dataset, where each image is associated with a set of labels organized in a hierarchical graph. Our dataset enables two new testing protocols to probe the generalization of face forgery detectors. Moreover, we propose a semantics-oriented face forgery detection method that captures label relations and prioritizes the primary task (\ie, real or fake face detection). We show that the proposed dataset successfully exposes the weaknesses of current detectors as the test set and consistently improves their generalizability as the training set. Additionally, we demonstrate the superiority of our semantics-oriented method over traditional binary and multi-class classification-based detectors. △ Less

Submitted 14 May, 2024; originally announced May 2024.

arXiv:2405.04051 [pdf, ps, other]

On the quantization goodness of polar lattices

Authors: Ling Liu, Shanxiang Lyu, Cong Ling, Baoming Bai

Abstract: In this work, we prove that polar lattices, when tailored for lossy compression, are quantization-good in the sense that their normalized second moments approach $\frac{1}{2πe}$ as the dimension of lattices increases. It has been predicted by Zamir et al. \cite{ZamirQZ96} that the Entropy Coded Dithered Quantization (ECDQ) system using quantization-good lattices can achieve the rate-distortion bou… ▽ More In this work, we prove that polar lattices, when tailored for lossy compression, are quantization-good in the sense that their normalized second moments approach $\frac{1}{2πe}$ as the dimension of lattices increases. It has been predicted by Zamir et al. \cite{ZamirQZ96} that the Entropy Coded Dithered Quantization (ECDQ) system using quantization-good lattices can achieve the rate-distortion bound of i.i.d. Gaussian sources. In our previous work \cite{LingQZ}, we established that polar lattices are indeed capable of attaining the same objective. It is reasonable to conjecture that polar lattices also demonstrate quantization goodness in the context of lossy compression. This study confirms this hypothesis. △ Less

Submitted 13 May, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

Comments: 12 pages, 5 figures, submitted to IEEE for possible publication

arXiv:2405.00135 [pdf, other]

Improving Channel Resilience for Task-Oriented Semantic Communications: A Unified Information Bottleneck Approach

Authors: Shuai Lyu, Yao Sun, Linke Guo, Xiaoyong Yuan, Fang Fang, Lan Zhang, Xianbin Wang

Abstract: Task-oriented semantic communications (TSC) enhance radio resource efficiency by transmitting task-relevant semantic information. However, current research often overlooks the inherent semantic distinctions among encoded features. Due to unavoidable channel variations from time and frequency-selective fading, semantically sensitive feature units could be more susceptible to erroneous inference if… ▽ More Task-oriented semantic communications (TSC) enhance radio resource efficiency by transmitting task-relevant semantic information. However, current research often overlooks the inherent semantic distinctions among encoded features. Due to unavoidable channel variations from time and frequency-selective fading, semantically sensitive feature units could be more susceptible to erroneous inference if corrupted by dynamic channels. Therefore, this letter introduces a unified channel-resilient TSC framework via information bottleneck. This framework complements existing TSC approaches by controlling information flow to capture fine-grained feature-level semantic robustness. Experiments on a case study for real-time subchannel allocation validate the framework's effectiveness. △ Less

Submitted 30 April, 2024; originally announced May 2024.

Comments: This work has been submitted to the IEEE Communications Letters

arXiv:2404.19171 [pdf, other]

Explicit Correlation Learning for Generalizable Cross-Modal Deepfake Detection

Authors: Cai Yu, Shan Jia, Xiaomeng Fu, Jin Liu, Jiahe Tian, Jiao Dai, Xi Wang, Siwei Lyu, Jizhong Han

Abstract: With the rising prevalence of deepfakes, there is a growing interest in developing generalizable detection methods for various types of deepfakes. While effective in their specific modalities, traditional detection methods fall short in addressing the generalizability of detection across diverse cross-modal deepfakes. This paper aims to explicitly learn potential cross-modal correlation to enhance… ▽ More With the rising prevalence of deepfakes, there is a growing interest in developing generalizable detection methods for various types of deepfakes. While effective in their specific modalities, traditional detection methods fall short in addressing the generalizability of detection across diverse cross-modal deepfakes. This paper aims to explicitly learn potential cross-modal correlation to enhance deepfake detection towards various generation scenarios. Our approach introduces a correlation distillation task, which models the inherent cross-modal correlation based on content information. This strategy helps to prevent the model from overfitting merely to audio-visual synchronization. Additionally, we present the Cross-Modal Deepfake Dataset (CMDFD), a comprehensive dataset with four generation methods to evaluate the detection of diverse cross-modal deepfakes. The experimental results on CMDFD and FakeAVCeleb datasets demonstrate the superior generalizability of our method over existing state-of-the-art methods. Our code and data can be found at \url{https://github.com/ljj898/CMDFD-Dataset-and-Deepfake-Detection}. △ Less

Submitted 29 April, 2024; originally announced April 2024.

Comments: accepted by ICME 2024

arXiv:2404.18033 [pdf, other]

Exposing Text-Image Inconsistency Using Diffusion Models

Authors: Mingzhen Huang, Shan Jia, Zhou Zhou, Yan Ju, Jialing Cai, Siwei Lyu

Abstract: In the battle against widespread online misinformation, a growing problem is text-image inconsistency, where images are misleadingly paired with texts with different intent or meaning. Existing classification-based methods for text-image inconsistency can identify contextual inconsistencies but fail to provide explainable justifications for their decisions that humans can understand. Although more… ▽ More In the battle against widespread online misinformation, a growing problem is text-image inconsistency, where images are misleadingly paired with texts with different intent or meaning. Existing classification-based methods for text-image inconsistency can identify contextual inconsistencies but fail to provide explainable justifications for their decisions that humans can understand. Although more nuanced, human evaluation is impractical at scale and susceptible to errors. To address these limitations, this study introduces D-TIIL (Diffusion-based Text-Image Inconsistency Localization), which employs text-to-image diffusion models to localize semantic inconsistencies in text and image pairs. These models, trained on large-scale datasets act as ``omniscient" agents that filter out irrelevant information and incorporate background knowledge to identify inconsistencies. In addition, D-TIIL uses text embeddings and modified image regions to visualize these inconsistencies. To evaluate D-TIIL's efficacy, we introduce a new TIIL dataset containing 14K consistent and inconsistent text-image pairs. Unlike existing datasets, TIIL enables assessment at the level of individual words and image regions and is carefully designed to represent various inconsistencies. D-TIIL offers a scalable and evidence-based approach to identifying and localizing text-image inconsistency, providing a robust framework for future research combating misinformation. △ Less

Submitted 27 April, 2024; originally announced April 2024.

arXiv:2404.13146 [pdf, other]

DeepFake-O-Meter v2.0: An Open Platform for DeepFake Detection

Authors: Yan Ju, Chengzhe Sun, Shan Jia, Shuwei Hou, Zhaofeng Si, Soumyya Kanti Datta, Lipeng Ke, Riky Zhou, Anita Nikolich, Siwei Lyu

Abstract: Deepfakes, as AI-generated media, have increasingly threatened media integrity and personal privacy with realistic yet fake digital content. In this work, we introduce an open-source and user-friendly online platform, DeepFake-O-Meter v2.0, that integrates state-of-the-art methods for detecting Deepfake images, videos, and audio. Built upon DeepFake-O-Meter v1.0, we have made significant upgrades… ▽ More Deepfakes, as AI-generated media, have increasingly threatened media integrity and personal privacy with realistic yet fake digital content. In this work, we introduce an open-source and user-friendly online platform, DeepFake-O-Meter v2.0, that integrates state-of-the-art methods for detecting Deepfake images, videos, and audio. Built upon DeepFake-O-Meter v1.0, we have made significant upgrades and improvements in platform architecture design, including user interaction, detector integration, job balancing, and security management. The platform aims to offer everyday users a convenient service for analyzing DeepFake media using multiple state-of-the-art detection algorithms. It ensures secure and private delivery of the analysis results. Furthermore, it serves as an evaluation and benchmarking platform for researchers in digital media forensics to compare the performance of multiple algorithms on the same input. We have also conducted detailed usage analysis based on the collected data to gain deeper insights into our platform's statistics. This involves analyzing two-month trends in user activity and evaluating the processing efficiency of each detector. △ Less

Submitted 27 June, 2024; v1 submitted 19 April, 2024; originally announced April 2024.

arXiv:2403.14077 [pdf, other]

Can ChatGPT Detect DeepFakes? A Study of Using Multimodal Large Language Models for Media Forensics

Authors: Shan Jia, Reilin Lyu, Kangran Zhao, Yize Chen, Zhiyuan Yan, Yan Ju, Chuanbo Hu, Xin Li, Baoyuan Wu, Siwei Lyu

Abstract: DeepFakes, which refer to AI-generated media content, have become an increasing concern due to their use as a means for disinformation. Detecting DeepFakes is currently solved with programmed machine learning algorithms. In this work, we investigate the capabilities of multimodal large language models (LLMs) in DeepFake detection. We conducted qualitative and quantitative experiments to demonstrat… ▽ More DeepFakes, which refer to AI-generated media content, have become an increasing concern due to their use as a means for disinformation. Detecting DeepFakes is currently solved with programmed machine learning algorithms. In this work, we investigate the capabilities of multimodal large language models (LLMs) in DeepFake detection. We conducted qualitative and quantitative experiments to demonstrate multimodal LLMs and show that they can expose AI-generated images through careful experimental design and prompt engineering. This is interesting, considering that LLMs are not inherently tailored for media forensic tasks, and the process does not require programming. We discuss the limitations of multimodal LLMs for these tasks and suggest possible improvements. △ Less

Submitted 11 June, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

arXiv:2403.13358 [pdf, other]

GeRM: A Generalist Robotic Model with Mixture-of-experts for Quadruped Robot

Authors: Wenxuan Song, Han Zhao, Pengxiang Ding, Can Cui, Shangke Lyu, Yaning Fan, Donglin Wang

Abstract: Multi-task robot learning holds significant importance in tackling diverse and complex scenarios. However, current approaches are hindered by performance issues and difficulties in collecting training datasets. In this paper, we propose GeRM (Generalist Robotic Model). We utilize offline reinforcement learning to optimize data utilization strategies to learn from both demonstrations and sub-optima… ▽ More Multi-task robot learning holds significant importance in tackling diverse and complex scenarios. However, current approaches are hindered by performance issues and difficulties in collecting training datasets. In this paper, we propose GeRM (Generalist Robotic Model). We utilize offline reinforcement learning to optimize data utilization strategies to learn from both demonstrations and sub-optimal data, thus surpassing the limitations of human demonstrations. Thereafter, we employ a transformer-based VLA network to process multi-modal inputs and output actions. By introducing the Mixture-of-Experts structure, GeRM allows faster inference speed with higher whole model capacity, and thus resolves the issue of limited RL parameters, enhancing model performance in multi-task learning while controlling computational costs. Through a series of experiments, we demonstrate that GeRM outperforms other methods across all tasks, while also validating its efficiency in both training and inference processes. Additionally, we uncover its potential to acquire emergent skills. Additionally, we contribute the QUARD-Auto dataset, collected automatically to support our training approach and foster advancements in multi-task quadruped robot learning. This work presents a new paradigm for reducing the cost of collecting robot data and driving progress in the multi-task learning community. You can reach our project and video through the link: https://songwxuan.github.io/GeRM/ . △ Less

Submitted 9 April, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

arXiv:2403.12631 [pdf]

PointGrasp: Point Cloud-based Grasping for Tendon-driven Soft Robotic Glove Applications

Authors: Chen Hu, Shirui Lyu, Eojin Rho, Daekyum Kim, Shan Luo, Letizia Gionfrida

Abstract: Controlling hand exoskeletons to assist individuals with grasping tasks poses a challenge due to the difficulty in understanding user intentions. We propose that most daily grasping tasks during activities of daily living (ADL) can be deduced by analyzing object geometries (simple and complex) from 3D point clouds. The study introduces PointGrasp, a real-time system designed for identifying househ… ▽ More Controlling hand exoskeletons to assist individuals with grasping tasks poses a challenge due to the difficulty in understanding user intentions. We propose that most daily grasping tasks during activities of daily living (ADL) can be deduced by analyzing object geometries (simple and complex) from 3D point clouds. The study introduces PointGrasp, a real-time system designed for identifying household scenes semantically, aiming to support and enhance assistance during ADL for tailored end-to-end grasping tasks. The system comprises an RGB-D camera with an inertial measurement unit and a microprocessor integrated into a tendon-driven soft robotic glove. The RGB-D camera processes 3D scenes at a rate exceeding 30 frames per second. The proposed pipeline demonstrates an average RMSE of 0.8 $\pm$ 0.39 cm for simple and 0.11 $\pm$ 0.06 cm for complex geometries. Within each mode, it identifies and pinpoints reachable objects. This system shows promise in end-to-end vision-driven robotic-assisted rehabilitation manual tasks. △ Less

Submitted 19 March, 2024; originally announced March 2024.

Comments: 6 pages, 8 figures, conference

ACM Class: I.2; I.4

arXiv:2403.03101 [pdf, other]

KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents

Authors: Yuqi Zhu, Shuofei Qiao, Yixin Ou, Shumin Deng, Ningyu Zhang, Shiwei Lyu, Yue Shen, Lei Liang, Jinjie Gu, Huajun Chen

Abstract: Large Language Models (LLMs) have demonstrated great potential in complex reasoning tasks, yet they fall short when tackling more sophisticated challenges, especially when interacting with environments through generating executable actions. This inadequacy primarily stems from the lack of built-in action knowledge in language agents, which fails to effectively guide the planning trajectories durin… ▽ More Large Language Models (LLMs) have demonstrated great potential in complex reasoning tasks, yet they fall short when tackling more sophisticated challenges, especially when interacting with environments through generating executable actions. This inadequacy primarily stems from the lack of built-in action knowledge in language agents, which fails to effectively guide the planning trajectories during task solving and results in planning hallucination. To address this issue, we introduce KnowAgent, a novel approach designed to enhance the planning capabilities of LLMs by incorporating explicit action knowledge. Specifically, KnowAgent employs an action knowledge base and a knowledgeable self-learning strategy to constrain the action path during planning, enabling more reasonable trajectory synthesis, and thereby enhancing the planning performance of language agents. Experimental results on HotpotQA and ALFWorld based on various backbone models demonstrate that KnowAgent can achieve comparable or superior performance to existing baselines. Further analysis indicates the effectiveness of KnowAgent in terms of planning hallucinations mitigation. Code is available in https://github.com/zjunlp/KnowAgent. △ Less

Submitted 5 March, 2024; originally announced March 2024.

Comments: Work in progress. Project page: https://zjunlp.github.io/project/KnowAgent/ Code: https://github.com/zjunlp/KnowAgent

arXiv:2402.06749 [pdf]

Copper phosphate micro-flowers coated with indocyanine green and iron oxide nanoparticles for in vivo localization optoacoustic tomography and magnetic actuation

Authors: Daniil Nozdriukhin, Shuxin Lyu, Jerome Bonvin, Michael Reiss, Daniel Razansky, Xose Luis Dean-Ben

Abstract: Efficient drug delivery is a major challenge in modern medicine and pharmaceutical research. Micrometer-scale robots have recently been proposed as a promising venue to amplify precision of drug administration. Remotely controlled microrobots sufficiently small to navigate through microvascular networks can reach any part of the human body, yet real-time tracking is crucial for providing precise g… ▽ More Efficient drug delivery is a major challenge in modern medicine and pharmaceutical research. Micrometer-scale robots have recently been proposed as a promising venue to amplify precision of drug administration. Remotely controlled microrobots sufficiently small to navigate through microvascular networks can reach any part of the human body, yet real-time tracking is crucial for providing precise guidance and verifying successful arrival at the target. In vivo deep-tissue monitoring of individual microrobots is currently hampered by the lack of sensitivity and/or spatio-temporal resolution of commonly used clinical imaging modalities. We synthesized biocompatible and biodegradable copper phosphate micro-flowers loaded with indocyanine green and iron oxide nanoparticles to enable in vivo individual detection with localization optoacoustic tomography. We demonstrate magnetic actuation and optoacoustic tracking of these decorated micro-flowers at a per-particle level. Functional super-resolution imaging achieved via tracking intravenously injected particles provides a means of identifying microvascular targets and quantifying blood flow, while the versatile carrying capacity can be further exploited for transporting multiple types of drug formulations. △ Less

Submitted 9 February, 2024; originally announced February 2024.

arXiv:2402.01154 [pdf, other]

Towards Quantum-Safe Federated Learning via Homomorphic Encryption: Learning with Gradients

Authors: Guangfeng Yan, Shanxiang Lyu, Hanxu Hou, Zhiyong Zheng, Linqi Song

Abstract: This paper introduces a privacy-preserving distributed learning framework via private-key homomorphic encryption. Thanks to the randomness of the quantization of gradients, our learning with error (LWE) based encryption can eliminate the error terms, thus avoiding the issue of error expansion in conventional LWE-based homomorphic encryption. The proposed system allows a large number of learning pa… ▽ More This paper introduces a privacy-preserving distributed learning framework via private-key homomorphic encryption. Thanks to the randomness of the quantization of gradients, our learning with error (LWE) based encryption can eliminate the error terms, thus avoiding the issue of error expansion in conventional LWE-based homomorphic encryption. The proposed system allows a large number of learning participants to engage in neural network-based deep learning collaboratively over an honest-but-curious server, while ensuring the cryptographic security of participants' uploaded gradients. △ Less

Submitted 2 February, 2024; originally announced February 2024.

arXiv:2401.17255 [pdf, other]

Towards Quantum Simulation of Non-Markovian Open Quantum Dynamics: A Universal and Compact Theory

Authors: Xiang Li, Su-Xiang Lyu, Yao Wang, Rui-Xue Xu, Xiao Zheng, YiJing Yan

Abstract: Non-Markovianity, the intricate dependence of an open quantum system on its temporal evolution history, holds tremendous implications across various scientific disciplines. However, accurately characterizing the complex non-Markovian effects has posed a formidable challenge for numerical simulations. Despite the promising potential of emerging quantum computing technologies, the pursuit of a unive… ▽ More Non-Markovianity, the intricate dependence of an open quantum system on its temporal evolution history, holds tremendous implications across various scientific disciplines. However, accurately characterizing the complex non-Markovian effects has posed a formidable challenge for numerical simulations. Despite the promising potential of emerging quantum computing technologies, the pursuit of a universal theory enabling practical implementation of quantum computation algorithms remains ongoing. In this paper, we present a major advancement in bridging this critical gap: the development of dissipaton-embedded quantum master equation in second quantization (DQME-SQ). The DQME-SQ is an exact and compact theory, and its unique capabilities are demonstrated through digital quantum simulations of non-Markovian dissipative dynamics in both bosonic and fermionic environments. Our new theoretical developments pave the way for efficient exploration of complex open quantum systems. △ Less

Submitted 8 February, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

Comments: abstract and figures update

arXiv:2401.10113 [pdf, ps, other]

Exposing Lip-syncing Deepfakes from Mouth Inconsistencies

Authors: Soumyya Kanti Datta, Shan Jia, Siwei Lyu

Abstract: A lip-syncing deepfake is a digitally manipulated video in which a person's lip movements are created convincingly using AI models to match altered or entirely new audio. Lip-syncing deepfakes are a dangerous type of deepfakes as the artifacts are limited to the lip region and more difficult to discern. In this paper, we describe a novel approach, LIP-syncing detection based on mouth INConsistency… ▽ More A lip-syncing deepfake is a digitally manipulated video in which a person's lip movements are created convincingly using AI models to match altered or entirely new audio. Lip-syncing deepfakes are a dangerous type of deepfakes as the artifacts are limited to the lip region and more difficult to discern. In this paper, we describe a novel approach, LIP-syncing detection based on mouth INConsistency (LIPINC), for lip-syncing deepfake detection by identifying temporal inconsistencies in the mouth region. These inconsistencies are seen in the adjacent frames and throughout the video. Our model can successfully capture these irregularities and outperforms the state-of-the-art methods on several benchmark deepfake datasets. Code is available at https://github.com/skrantidatta/LIPINC △ Less

Submitted 3 June, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

arXiv:2312.17431 [pdf, other]

MVPatch: More Vivid Patch for Adversarial Camouflaged Attacks on Object Detectors in the Physical World

Authors: Zheng Zhou, Hongbo Zhao, Ju Liu, Qiaosheng Zhang, Liwei Geng, Shuchang Lyu, Wenquan Feng

Abstract: Recent investigations demonstrate that adversarial patches can be utilized to manipulate the result of object detection models. However, the conspicuous patterns on these patches may draw more attention and raise suspicions among humans. Moreover, existing works have primarily focused on enhancing the efficacy of attacks in the physical domain, rather than seeking to optimize their stealth attribu… ▽ More Recent investigations demonstrate that adversarial patches can be utilized to manipulate the result of object detection models. However, the conspicuous patterns on these patches may draw more attention and raise suspicions among humans. Moreover, existing works have primarily focused on enhancing the efficacy of attacks in the physical domain, rather than seeking to optimize their stealth attributes and transferability potential. To address these issues, we introduce a dual-perception-based attack framework that generates an adversarial patch known as the More Vivid Patch (MVPatch). The framework consists of a model-perception degradation method and a human-perception improvement method. To derive the MVPatch, we formulate an iterative process that simultaneously constrains the efficacy of multiple object detectors and refines the visual correlation between the generated adversarial patch and a realistic image. Our method employs a model-perception-based approach that reduces the object confidence scores of several object detectors to boost the transferability of adversarial patches. Further, within the human-perception-based framework, we put forward a lightweight technique for visual similarity measurement that facilitates the development of inconspicuous and natural adversarial patches and eliminates the reliance on additional generative models. Additionally, we introduce the naturalness score and transferability score as metrics for an unbiased assessment of various adversarial patches' natural appearance and transferability capacity. Extensive experiments demonstrate that the proposed MVPatch algorithm achieves superior attack transferability compared to similar algorithms in both digital and physical domains while also exhibiting a more natural appearance. These findings emphasize the remarkable stealthiness and transferability of the proposed MVPatch attack algorithm. △ Less

Submitted 11 January, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

Comments: 14 pages, 8 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2312.09785 [pdf, other]

RJUA-QA: A Comprehensive QA Dataset for Urology

Authors: Shiwei Lyu, Chenfei Chi, Hongbo Cai, Lei Shi, Xiaoyan Yang, Lei Liu, Xiang Chen, Deng Zhao, Zhiqiang Zhang, Xianguo Lyu, Ming Zhang, Fangzhou Li, Xiaowei Ma, Yue Shen, Jinjie Gu, Wei Xue, Yiran Huang

Abstract: We introduce RJUA-QA, a novel medical dataset for question answering (QA) and reasoning with clinical evidence, contributing to bridge the gap between general large language models (LLMs) and medical-specific LLM applications. RJUA-QA is derived from realistic clinical scenarios and aims to facilitate LLMs in generating reliable diagnostic and advice. The dataset contains 2,132 curated Question-Co… ▽ More We introduce RJUA-QA, a novel medical dataset for question answering (QA) and reasoning with clinical evidence, contributing to bridge the gap between general large language models (LLMs) and medical-specific LLM applications. RJUA-QA is derived from realistic clinical scenarios and aims to facilitate LLMs in generating reliable diagnostic and advice. The dataset contains 2,132 curated Question-Context-Answer pairs, corresponding about 25,000 diagnostic records and clinical cases. The dataset covers 67 common urological disease categories, where the disease coverage exceeds 97.6\% of the population seeking medical services in urology. Each data instance in RJUA-QA comprises: (1) a question mirroring real patient to inquiry about clinical symptoms and medical conditions, (2) a context including comprehensive expert knowledge, serving as a reference for medical examination and diagnosis, (3) a doctor response offering the diagnostic conclusion and suggested examination guidance, (4) a diagnosed clinical disease as the recommended diagnostic outcome, and (5) clinical advice providing recommendations for medical examination. RJUA-QA is the first medical QA dataset for clinical reasoning over the patient inquiries, where expert-level knowledge and experience are required for yielding diagnostic conclusions and medical examination advice. A comprehensive evaluation is conducted to evaluate the performance of both medical-specific and general LLMs on the RJUA-QA dataset. Our data is are publicly available at \url{https://github.com/alipay/RJU_Ant_QA}. △ Less

Submitted 7 January, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

Comments: An initial version

arXiv:2312.09639 [pdf, other]

doi 10.1145/3511808.3557655

Multiple Instance Learning for Uplift Modeling

Authors: Yao Zhao, Haipeng Zhang, Shiwei Lyu, Ruiying Jiang, Jinjie Gu, Guannan Zhang

Abstract: Uplift modeling is widely used in performance marketing to estimate effects of promotion campaigns (e.g., increase of customer retention rate). Since it is impossible to observe outcomes of a recipient in treatment (e.g., receiving a certain promotion) and control (e.g., without promotion) groups simultaneously (i.e., counter-factual), uplift models are mainly trained on instances of treatment and… ▽ More Uplift modeling is widely used in performance marketing to estimate effects of promotion campaigns (e.g., increase of customer retention rate). Since it is impossible to observe outcomes of a recipient in treatment (e.g., receiving a certain promotion) and control (e.g., without promotion) groups simultaneously (i.e., counter-factual), uplift models are mainly trained on instances of treatment and control groups separately to form two models respectively, and uplifts are predicted by the difference of predictions from these two models (i.e., two-model method). When responses are noisy and the treatment effect is fractional, induced individual uplift predictions will be inaccurate, resulting in targeting undesirable customers. Though it is impossible to obtain the ideal ground-truth individual uplifts, known as Individual Treatment Effects (ITEs), alternatively, an average uplift of a group of users, called Average Treatment Effect (ATE), can be observed from experimental deliveries. Upon this, similar to Multiple Instance Learning (MIL) in which each training sample is a bag of instances, our framework sums up individual user uplift predictions for each bag of users as its bag-wise ATE prediction, and regularizes it to its ATE label, thus learning more accurate individual uplifts. Additionally, to amplify the fractional treatment effect, bags are composed of instances with adjacent individual uplift predictions, instead of random instances. Experiments conducted on two datasets show the effectiveness and universality of the proposed framework. △ Less

Submitted 15 December, 2023; originally announced December 2023.

Comments: short paper of CIKM22(full version)

Journal ref: Proceedings of the 31st ACM International Conference on Information and Knowledge Management (2022) 4727-4731

arXiv:2312.05738 [pdf, other]

FedReverse: Multiparty Reversible Deep Neural Network Watermarking

Authors: Junlong Mao, Huiyi Tang, Yi Zhang, Fengxia Liu, Zhiyong Zheng, Shanxiang Lyu

Abstract: The proliferation of Deep Neural Networks (DNN) in commercial applications is expanding rapidly. Simultaneously, the increasing complexity and cost of training DNN models have intensified the urgency surrounding the protection of intellectual property associated with these trained models. In this regard, DNN watermarking has emerged as a crucial safeguarding technique. This paper presents FedRever… ▽ More The proliferation of Deep Neural Networks (DNN) in commercial applications is expanding rapidly. Simultaneously, the increasing complexity and cost of training DNN models have intensified the urgency surrounding the protection of intellectual property associated with these trained models. In this regard, DNN watermarking has emerged as a crucial safeguarding technique. This paper presents FedReverse, a novel multiparty reversible watermarking approach for robust copyright protection while minimizing performance impact. Unlike existing methods, FedReverse enables collaborative watermark embedding from multiple parties after model training, ensuring individual copyright claims. In addition, FedReverse is reversible, enabling complete watermark removal with unanimous client consent. FedReverse demonstrates perfect covering, ensuring that observations of watermarked content do not reveal any information about the hidden watermark. Additionally, it showcases resistance against Known Original Attacks (KOA), making it highly challenging for attackers to forge watermarks or infer the key. This paper further evaluates FedReverse through comprehensive simulations involving Multi-layer Perceptron (MLP) and Convolutional Neural Networks (CNN) trained on the MNIST dataset. The simulations demonstrate FedReverse's robustness, reversibility, and minimal impact on model accuracy across varying embedding parameters and multiple client scenarios. △ Less

Submitted 9 December, 2023; originally announced December 2023.

Comments: 13 pages

arXiv:2311.18196 [pdf, ps, other]

Formal lifting of dualizing complexes and consequences

Authors: Shiji Lyu

Abstract: We show that for a Noetherian ring $A$ that is $I$-adically complete for an ideal $I$, if $A/I$ admits a dualizing complex, so does $A$. We discuss several consequences of this result. We also consider a generalization of the notion of dualizing complexes to infinite-dimensional rings and prove the results in this generality. In addition, we give an alternative proof of the fact that every excelle… ▽ More We show that for a Noetherian ring $A$ that is $I$-adically complete for an ideal $I$, if $A/I$ admits a dualizing complex, so does $A$. We discuss several consequences of this result. We also consider a generalization of the notion of dualizing complexes to infinite-dimensional rings and prove the results in this generality. In addition, we give an alternative proof of the fact that every excellent Henselian local ring admits a dualizing complex, using ultrapower. △ Less

Submitted 29 November, 2023; originally announced November 2023.

Comments: 19 pages. Comments welcome!

arXiv:2311.11278 [pdf, other]

Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection

Authors: Zhiyuan Yan, Yuhao Luo, Siwei Lyu, Qingshan Liu, Baoyuan Wu

Abstract: Deepfake detection faces a critical generalization hurdle, with performance deteriorating when there is a mismatch between the distributions of training and testing data. A broadly received explanation is the tendency of these detectors to be overfitted to forgery-specific artifacts, rather than learning features that are widely applicable across various forgeries. To address this issue, we propos… ▽ More Deepfake detection faces a critical generalization hurdle, with performance deteriorating when there is a mismatch between the distributions of training and testing data. A broadly received explanation is the tendency of these detectors to be overfitted to forgery-specific artifacts, rather than learning features that are widely applicable across various forgeries. To address this issue, we propose a simple yet effective detector called LSDA (\underline{L}atent \underline{S}pace \underline{D}ata \underline{A}ugmentation), which is based on a heuristic idea: representations with a wider variety of forgeries should be able to learn a more generalizable decision boundary, thereby mitigating the overfitting of method-specific features (see Fig.~\ref{fig:toy}). Following this idea, we propose to enlarge the forgery space by constructing and simulating variations within and across forgery features in the latent space. This approach encompasses the acquisition of enriched, domain-specific features and the facilitation of smoother transitions between different forgery types, effectively bridging domain gaps. Our approach culminates in refining a binary classifier that leverages the distilled knowledge from the enhanced features, striving for a generalizable deepfake detector. Comprehensive experiments show that our proposed method is surprisingly effective and transcends state-of-the-art detectors across several widely used benchmarks. △ Less

Submitted 28 March, 2024; v1 submitted 19 November, 2023; originally announced November 2023.

arXiv:2311.06712 [pdf, other]

PuzzleTuning: Explicitly Bridge Pathological and Natural Image with Puzzles

Authors: Tianyi Zhang, Shangqing Lyu, Yanli Lei, Sicheng Chen, Nan Ying, Yufang He, Yu Zhao, Yunlu Feng, Hwee Kuan Lee, Guanglei Zhang

Abstract: Pathological image analysis is a crucial field in computer vision. Due to the annotation scarcity in the pathological field, pre-training with self-supervised learning (SSL) is widely applied to learn on unlabeled images. However, the current SSL-based pathological pre-training: (1) does not explicitly explore the essential focuses of the pathological field, and (2) does not effectively bridge wit… ▽ More Pathological image analysis is a crucial field in computer vision. Due to the annotation scarcity in the pathological field, pre-training with self-supervised learning (SSL) is widely applied to learn on unlabeled images. However, the current SSL-based pathological pre-training: (1) does not explicitly explore the essential focuses of the pathological field, and (2) does not effectively bridge with and thus take advantage of the knowledge from natural images. To explicitly address them, we propose our large-scale PuzzleTuning framework, containing the following innovations. Firstly, we define three task focuses that can effectively bridge knowledge of pathological and natural domain: appearance consistency, spatial consistency, and restoration understanding. Secondly, we devise a novel multiple puzzle restoring task, which explicitly pre-trains the model regarding these focuses. Thirdly, we introduce an explicit prompt-tuning process to incrementally integrate the domain-specific knowledge. It builds a bridge to align the large domain gap between natural and pathological images. Additionally, a curriculum-learning training strategy is designed to regulate task difficulty, making the model adaptive to the puzzle restoring complexity. Experimental results show that our PuzzleTuning framework outperforms the previous state-of-the-art methods in various downstream tasks on multiple datasets. The code, demo, and pre-trained weights are available at https://github.com/sagizty/PuzzleTuning. △ Less

Submitted 22 April, 2024; v1 submitted 11 November, 2023; originally announced November 2023.

Comments: 13 pages, 9 figures, 8 tables

arXiv:2311.06015 [pdf]

RSG: Fast Learning Adaptive Skills for Quadruped Robots by Skill Graph

Authors: Hongyin Zhang, Diyuan Shi, Zifeng Zhuang, Han Zhao, Zhenyu Wei, Feng Zhao, Sibo Gai, Shangke Lyu, Donglin Wang

Abstract: Developing robotic intelligent systems that can adapt quickly to unseen wild situations is one of the critical challenges in pursuing autonomous robotics. Although some impressive progress has been made in walking stability and skill learning in the field of legged robots, their ability to fast adaptation is still inferior to that of animals in nature. Animals are born with massive skills needed t… ▽ More Developing robotic intelligent systems that can adapt quickly to unseen wild situations is one of the critical challenges in pursuing autonomous robotics. Although some impressive progress has been made in walking stability and skill learning in the field of legged robots, their ability to fast adaptation is still inferior to that of animals in nature. Animals are born with massive skills needed to survive, and can quickly acquire new ones, by composing fundamental skills with limited experience. Inspired by this, we propose a novel framework, named Robot Skill Graph (RSG) for organizing massive fundamental skills of robots and dexterously reusing them for fast adaptation. Bearing a structure similar to the Knowledge Graph (KG), RSG is composed of massive dynamic behavioral skills instead of static knowledge in KG and enables discovering implicit relations that exist in be-tween of learning context and acquired skills of robots, serving as a starting point for understanding subtle patterns existing in robots' skill learning. Extensive experimental results demonstrate that RSG can provide rational skill inference upon new tasks and environments and enable quadruped robots to adapt to new scenarios and learn new skills rapidly. △ Less

Submitted 10 November, 2023; originally announced November 2023.

arXiv:2311.05836 [pdf, other]

UMedNeRF: Uncertainty-aware Single View Volumetric Rendering for Medical Neural Radiance Fields

Authors: Jing Hu, Qinrui Fan, Shu Hu, Siwei Lyu, Xi Wu, Xin Wang

Abstract: In the field of clinical medicine, computed tomography (CT) is an effective medical imaging modality for the diagnosis of various pathologies. Compared with X-ray images, CT images can provide more information, including multi-planar slices and three-dimensional structures for clinical diagnosis. However, CT imaging requires patients to be exposed to large doses of ionizing radiation for a long ti… ▽ More In the field of clinical medicine, computed tomography (CT) is an effective medical imaging modality for the diagnosis of various pathologies. Compared with X-ray images, CT images can provide more information, including multi-planar slices and three-dimensional structures for clinical diagnosis. However, CT imaging requires patients to be exposed to large doses of ionizing radiation for a long time, which may cause irreversible physical harm. In this paper, we propose an Uncertainty-aware MedNeRF (UMedNeRF) network based on generated radiation fields. The network can learn a continuous representation of CT projections from 2D X-ray images by obtaining the internal structure and depth information and using adaptive loss weights to ensure the quality of the generated images. Our model is trained on publicly available knee and chest datasets, and we show the results of CT projection rendering with a single X-ray and compare our method with other methods based on generated radiation fields. △ Less

Submitted 1 March, 2024; v1 submitted 9 November, 2023; originally announced November 2023.

arXiv:2311.04732 [pdf, other]

General-purpose machine-learned potential for 16 elemental metals and their alloys

Authors: Keke Song, Rui Zhao, Jiahui Liu, Yanzhou Wang, Eric Lindgren, Yong Wang, Shunda Chen, Ke Xu, Ting Liang, Penghua Ying, Nan Xu, Zhiqiang Zhao, Jiuyang Shi, Junjie Wang, Shuang Lyu, Zezhu Zeng, Shirong Liang, Haikuan Dong, Ligang Sun, Yue Chen, Zhuhua Zhang, Wanlin Guo, Ping Qian, Jian Sun, Paul Erhart , et al. (3 additional authors not shown)

Abstract: Machine-learned potentials (MLPs) have exhibited remarkable accuracy, yet the lack of general-purpose MLPs for a broad spectrum of elements and their alloys limits their applicability. Here, we present a feasible approach for constructing a unified general-purpose MLP for numerous elements, demonstrated through a model (UNEP-v1) for 16 elemental metals and their alloys. To achieve a complete repre… ▽ More Machine-learned potentials (MLPs) have exhibited remarkable accuracy, yet the lack of general-purpose MLPs for a broad spectrum of elements and their alloys limits their applicability. Here, we present a feasible approach for constructing a unified general-purpose MLP for numerous elements, demonstrated through a model (UNEP-v1) for 16 elemental metals and their alloys. To achieve a complete representation of the chemical space, we show, via principal component analysis and diverse test datasets, that employing one-component and two-component systems suffices. Our unified UNEP-v1 model exhibits superior performance across various physical properties compared to a widely used embedded-atom method potential, while maintaining remarkable efficiency. We demonstrate our approach's effectiveness through reproducing experimentally observed chemical order and stable phases, and large-scale simulations of plasticity and primary radiation damage in MoTaVW alloys. This work represents a significant leap towards a unified general-purpose MLP encompassing the periodic table, with profound implications for materials science. △ Less

Submitted 12 June, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

Comments: Main text with 17 pages and 8 figures; supplementary with 26 figures and 4 tables; source code and training/test data available

arXiv:2311.02926 [pdf, other]

Deep Image Semantic Communication Model for Artificial Intelligent Internet of Things

Authors: Li Ping Qian, Yi Zhang, Sikai Lyu, Huijie Zhu, Yuan Wu, Xuemin Sherman Shen, Xiaoniu Yang

Abstract: With the rapid development of Artificial Intelligent Internet of Things (AIoT), the image data from AIoT devices has been witnessing the explosive increasing. In this paper, a novel deep image semantic communication model is proposed for the efficient image communication in AIoT. Particularly, at the transmitter side, a high-precision image semantic segmentation algorithm is proposed to extract th… ▽ More With the rapid development of Artificial Intelligent Internet of Things (AIoT), the image data from AIoT devices has been witnessing the explosive increasing. In this paper, a novel deep image semantic communication model is proposed for the efficient image communication in AIoT. Particularly, at the transmitter side, a high-precision image semantic segmentation algorithm is proposed to extract the semantic information of the image to achieve significant compression of the image data. At the receiver side, a semantic image restoration algorithm based on Generative Adversarial Network (GAN) is proposed to convert the semantic image to a real scene image with detailed information. Simulation results demonstrate that the proposed image semantic communication model can improve the image compression ratio and recovery accuracy by 71.93% and 25.07% on average in comparison with WebP and CycleGAN, respectively. More importantly, our demo experiment shows that the proposed model reduces the total delay by 95.26% in the image communication, when comparing with the original image transmission. △ Less

Submitted 8 November, 2023; v1 submitted 6 November, 2023; originally announced November 2023.

arXiv:2310.17902 [pdf]

CPIA Dataset: A Comprehensive Pathological Image Analysis Dataset for Self-supervised Learning Pre-training

Authors: Nan Ying, Yanli Lei, Tianyi Zhang, Shangqing Lyu, Chunhui Li, Sicheng Chen, Zeyu Liu, Yu Zhao, Guanglei Zhang

Abstract: Pathological image analysis is a crucial field in computer-aided diagnosis, where deep learning is widely applied. Transfer learning using pre-trained models initialized on natural images has effectively improved the downstream pathological performance. However, the lack of sophisticated domain-specific pathological initialization hinders their potential. Self-supervised learning (SSL) enables pre… ▽ More Pathological image analysis is a crucial field in computer-aided diagnosis, where deep learning is widely applied. Transfer learning using pre-trained models initialized on natural images has effectively improved the downstream pathological performance. However, the lack of sophisticated domain-specific pathological initialization hinders their potential. Self-supervised learning (SSL) enables pre-training without sample-level labels, which has great potential to overcome the challenge of expensive annotations. Thus, studies focusing on pathological SSL pre-training call for a comprehensive and standardized dataset, similar to the ImageNet in computer vision. This paper presents the comprehensive pathological image analysis (CPIA) dataset, a large-scale SSL pre-training dataset combining 103 open-source datasets with extensive standardization. The CPIA dataset contains 21,427,877 standardized images, covering over 48 organs/tissues and about 100 kinds of diseases, which includes two main data types: whole slide images (WSIs) and characteristic regions of interest (ROIs). A four-scale WSI standardization process is proposed based on the uniform resolution in microns per pixel (MPP), while the ROIs are divided into three scales artificially. This multi-scale dataset is built with the diagnosis habits under the supervision of experienced senior pathologists. The CPIA dataset facilitates a comprehensive pathological understanding and enables pattern discovery explorations. Additionally, to launch the CPIA dataset, several state-of-the-art (SOTA) baselines of SSL pre-training and downstream evaluation are specially conducted. The CPIA dataset along with baselines is available at https://github.com/zhanglab2021/CPIA_Dataset. △ Less

Submitted 27 October, 2023; originally announced October 2023.

arXiv:2310.14374 [pdf, other]

OV-VG: A Benchmark for Open-Vocabulary Visual Grounding

Authors: Chunlei Wang, Wenquan Feng, Xiangtai Li, Guangliang Cheng, Shuchang Lyu, Binghao Liu, Lijiang Chen, Qi Zhao

Abstract: Open-vocabulary learning has emerged as a cutting-edge research area, particularly in light of the widespread adoption of vision-based foundational models. Its primary objective is to comprehend novel concepts that are not encompassed within a predefined vocabulary. One key facet of this endeavor is Visual Grounding, which entails locating a specific region within an image based on a corresponding… ▽ More Open-vocabulary learning has emerged as a cutting-edge research area, particularly in light of the widespread adoption of vision-based foundational models. Its primary objective is to comprehend novel concepts that are not encompassed within a predefined vocabulary. One key facet of this endeavor is Visual Grounding, which entails locating a specific region within an image based on a corresponding language description. While current foundational models excel at various visual language tasks, there's a noticeable absence of models specifically tailored for open-vocabulary visual grounding. This research endeavor introduces novel and challenging OV tasks, namely Open-Vocabulary Visual Grounding and Open-Vocabulary Phrase Localization. The overarching aim is to establish connections between language descriptions and the localization of novel objects. To facilitate this, we have curated a comprehensive annotated benchmark, encompassing 7,272 OV-VG images and 1,000 OV-PL images. In our pursuit of addressing these challenges, we delved into various baseline methodologies rooted in existing open-vocabulary object detection, VG, and phrase localization frameworks. Surprisingly, we discovered that state-of-the-art methods often falter in diverse scenarios. Consequently, we developed a novel framework that integrates two critical components: Text-Image Query Selection and Language-Guided Feature Attention. These modules are designed to bolster the recognition of novel categories and enhance the alignment between visual and linguistic information. Extensive experiments demonstrate the efficacy of our proposed framework, which consistently attains SOTA performance across the OV-VG task. Additionally, ablation studies provide further evidence of the effectiveness of our innovative models. Codes and datasets will be made publicly available at https://github.com/cv516Buaa/OV-VG. △ Less

Submitted 22 October, 2023; originally announced October 2023.

arXiv:2310.07525 [pdf, other]

ViT-A*: Legged Robot Path Planning using Vision Transformer A*

Authors: Jianwei Liu, Shirui Lyu, Denis Hadjivelichkov, Valerio Modugno, Dimitrios Kanoulas

Abstract: Legged robots, particularly quadrupeds, offer promising navigation capabilities, especially in scenarios requiring traversal over diverse terrains and obstacle avoidance. This paper addresses the challenge of enabling legged robots to navigate complex environments effectively through the integration of data-driven path-planning methods. We propose an approach that utilizes differentiable planners,… ▽ More Legged robots, particularly quadrupeds, offer promising navigation capabilities, especially in scenarios requiring traversal over diverse terrains and obstacle avoidance. This paper addresses the challenge of enabling legged robots to navigate complex environments effectively through the integration of data-driven path-planning methods. We propose an approach that utilizes differentiable planners, allowing the learning of end-to-end global plans via a neural network for commanding quadruped robots. The approach leverages 2D maps and obstacle specifications as inputs to generate a global path. To enhance the functionality of the developed neural network-based path planner, we use Vision Transformers (ViT) for map pre-processing, to enable the effective handling of larger maps. Experimental evaluations on two real robotic quadrupeds (Boston Dynamics Spot and Unitree Go1) demonstrate the effectiveness and versatility of the proposed approach in generating reliable path plans. △ Less

Submitted 11 October, 2023; originally announced October 2023.

Comments: 6 pages, 6 figures, conference

Journal ref: IEEE-RAS International Conference on Humanoids Robots (Humanoids) 2023

arXiv:2310.06750 [pdf, other]

Unraveling the Thermodynamic Enigma between Fast and Slow Coronal Mass Ejections

Authors: Soumyaranjan Khuntia, Wageesh Mishra, Sudheer K Mishra, Yuming Wang, Jie Zhang, Shaoyu Lyu

Abstract: Coronal Mass Ejections (CMEs) are the most energetic expulsions of magnetized plasma from the Sun that play a crucial role in space weather dynamics. This study investigates the diverse kinematics and thermodynamic evolution of two CMEs (CME1: 2011 September 24 and CME2: 2018 August 20) at coronal heights where thermodynamic measurements are limited. The peak 3D propagation speed of CME1 is high (… ▽ More Coronal Mass Ejections (CMEs) are the most energetic expulsions of magnetized plasma from the Sun that play a crucial role in space weather dynamics. This study investigates the diverse kinematics and thermodynamic evolution of two CMEs (CME1: 2011 September 24 and CME2: 2018 August 20) at coronal heights where thermodynamic measurements are limited. The peak 3D propagation speed of CME1 is high (1,885 km/s) with two-phase expansion (rapid and nearly constant), while the peak 3D propagation speed of CME2 is slow (420 km/s) with only a gradual expansion. We estimate the distance-dependent variations in the polytropic index, heating rate, temperature, and internal forces implementing the revised FRIS model, taking inputs of 3D kinematics estimated from the GCS model. We find CME1 exhibiting heat-release during its early-rapid acceleration decrease and jumps to the heat-absorption state during its constant acceleration phase. In contrast to CME1, CME2 shows a gradual transition from the near-adiabatic to the heat-absorption state during its gradually increasing acceleration. Our analysis reveals that although both CMEs show differential heating, they experience heat-absorption during their later propagation phases, approaching the isothermal state. The faster CME1 achieves an adiabatic state followed by an isothermal state at smaller distances from the Sun than the slower CME2. We also find that the expansion of CMEs is primarily influenced by centrifugal and thermal pressure forces, with the Lorentz force impeding expansion. Multi-wavelength observations of flux-ropes at source regions support the FRIS model-derived findings at initially observed lower coronal heights. △ Less

Submitted 10 October, 2023; originally announced October 2023.

Comments: 23 pages, 9 figures, accepted for publication in The Astrophysical Journal (ApJ)

arXiv:2310.04637 [pdf, other]

doi 10.1109/IROS.2018.8594268

Efficient State Estimation with Constrained Rao-Blackwellized Particle Filter

Authors: Shuai Li, Siwei Lyu, Jeff Trinkle

Abstract: Due to the limitations of the robotic sensors, during a robotic manipulation task, the acquisition of the object's state can be unreliable and noisy. Combining an accurate model of multi-body dynamic system with Bayesian filtering methods has been shown to be able to filter out noise from the object's observed states. However, efficiency of these filtering methods suffers from samples that violate… ▽ More Due to the limitations of the robotic sensors, during a robotic manipulation task, the acquisition of the object's state can be unreliable and noisy. Combining an accurate model of multi-body dynamic system with Bayesian filtering methods has been shown to be able to filter out noise from the object's observed states. However, efficiency of these filtering methods suffers from samples that violate the physical constraints, e.g., no penetration constraint. In this paper, we propose a Rao-Blackwellized Particle Filter (RBPF) that samples the contact states and updates the object's poses using Kalman filters. This RBPF also enforces the physical constraints on the samples by solving a quadratic programming problem. By comparing our method with methods that does not consider physical constraints, we show that our proposed RBPF is not only able to estimate the object's states, e.g., poses, more accurately but also able to infer unobserved states, e.g., velocities, with higher precision. △ Less

Submitted 6 October, 2023; originally announced October 2023.

arXiv:2310.03827 [pdf, other]

Integrating Audio-Visual Features for Multimodal Deepfake Detection

Authors: Sneha Muppalla, Shan Jia, Siwei Lyu

Abstract: Deepfakes are AI-generated media in which an image or video has been digitally modified. The advancements made in deepfake technology have led to privacy and security issues. Most deepfake detection techniques rely on the detection of a single modality. Existing methods for audio-visual detection do not always surpass that of the analysis based on single modalities. Therefore, this paper proposes… ▽ More Deepfakes are AI-generated media in which an image or video has been digitally modified. The advancements made in deepfake technology have led to privacy and security issues. Most deepfake detection techniques rely on the detection of a single modality. Existing methods for audio-visual detection do not always surpass that of the analysis based on single modalities. Therefore, this paper proposes an audio-visual-based method for deepfake detection, which integrates fine-grained deepfake identification with binary classification. We categorize the samples into four types by combining labels specific to each single modality. This method enhances the detection under intra-domain and cross-domain testing. △ Less

Submitted 5 October, 2023; originally announced October 2023.

arXiv:2310.00405 [pdf, other]

Controlling Neural Style Transfer with Deep Reinforcement Learning

Authors: Chengming Feng, Jing Hu, Xin Wang, Shu Hu, Bin Zhu, Xi Wu, Hongtu Zhu, Siwei Lyu

Abstract: Controlling the degree of stylization in the Neural Style Transfer (NST) is a little tricky since it usually needs hand-engineering on hyper-parameters. In this paper, we propose the first deep Reinforcement Learning (RL) based architecture that splits one-step style transfer into a step-wise process for the NST task. Our RL-based method tends to preserve more details and structures of the content… ▽ More Controlling the degree of stylization in the Neural Style Transfer (NST) is a little tricky since it usually needs hand-engineering on hyper-parameters. In this paper, we propose the first deep Reinforcement Learning (RL) based architecture that splits one-step style transfer into a step-wise process for the NST task. Our RL-based method tends to preserve more details and structures of the content image in early steps, and synthesize more style patterns in later steps. It is a user-easily-controlled style-transfer method. Additionally, as our RL-based model performs the stylization progressively, it is lightweight and has lower computational complexity than existing one-step Deep Learning (DL) based models. Experimental results demonstrate the effectiveness and robustness of our method. △ Less

Submitted 30 September, 2023; originally announced October 2023.

Comments: Accepted by IJCAI 2023. The contributions of Chengming Feng and Jing Hu to this paper were equal. arXiv admin note: text overlap with arXiv:2309.13672

arXiv:2310.00359 [pdf, other]

Improving Cross-dataset Deepfake Detection with Deep Information Decomposition

Authors: Shanmin Yang, Shu Hu, Bin Zhu, Ying Fu, Siwei Lyu, Xi Wu, Xin Wang

Abstract: Deepfake technology poses a significant threat to security and social trust. Although existing detection methods have demonstrated high performance in identifying forgeries within datasets using the same techniques for training and testing, they suffer from sharp performance degradation when faced with cross-dataset scenarios where unseen deepfake techniques are tested. To address this challenge,… ▽ More Deepfake technology poses a significant threat to security and social trust. Although existing detection methods have demonstrated high performance in identifying forgeries within datasets using the same techniques for training and testing, they suffer from sharp performance degradation when faced with cross-dataset scenarios where unseen deepfake techniques are tested. To address this challenge, we propose a deep information decomposition (DID) framework in this paper. Unlike most existing deepfake detection methods, our framework prioritizes high-level semantic features over visual artifacts. Specifically, it decomposes facial features into deepfake-related and irrelevant information and optimizes the deepfake information for real/fake discrimination to be independent of other factors. Our approach improves the robustness of deepfake detection against various irrelevant information changes and enhances the generalization ability of the framework to detect unseen forgery methods. Extensive experimental comparisons with existing state-of-the-art detection methods validate the effectiveness and superiority of the DID framework on cross-dataset deepfake detection. △ Less

Submitted 30 September, 2023; originally announced October 2023.

arXiv:2309.13672 [pdf, other]

RL-I2IT: Image-to-Image Translation with Deep Reinforcement Learning

Authors: Xin Wang, Ziwei Luo, Jing Hu, Chengming Feng, Shu Hu, Bin Zhu, Xi Wu, Hongtu Zhu, Xin Li, Siwei Lyu

Abstract: Most existing Image-to-Image Translation (I2IT) methods generate images in a single run of a deep learning (DL) model. However, designing such a single-step model is always challenging, requiring a huge number of parameters and easily falling into bad global minimums and overfitting. In this work, we reformulate I2IT as a step-wise decision-making problem via deep reinforcement learning (DRL) and… ▽ More Most existing Image-to-Image Translation (I2IT) methods generate images in a single run of a deep learning (DL) model. However, designing such a single-step model is always challenging, requiring a huge number of parameters and easily falling into bad global minimums and overfitting. In this work, we reformulate I2IT as a step-wise decision-making problem via deep reinforcement learning (DRL) and propose a novel framework that performs RL-based I2IT (RL-I2IT). The key feature in the RL-I2IT framework is to decompose a monolithic learning process into small steps with a lightweight model to progressively transform a source image successively to a target image. Considering that it is challenging to handle high dimensional continuous state and action spaces in the conventional RL framework, we introduce meta policy with a new concept Plan to the standard Actor-Critic model, which is of a lower dimension than the original image and can facilitate the actor to generate a tractable high dimensional action. In the RL-I2IT framework, we also employ a task-specific auxiliary learning strategy to stabilize the training process and improve the performance of the corresponding task. Experiments on several I2IT tasks demonstrate the effectiveness and robustness of the proposed method when facing high-dimensional continuous action space problems. Our implementation of the RL-I2IT framework is available at https://github.com/Algolzw/SPAC-Deformable-Registration. △ Less

Submitted 7 June, 2024; v1 submitted 24 September, 2023; originally announced September 2023.

arXiv:2309.05145 [pdf, other]

Outlier Robust Adversarial Training

Authors: Shu Hu, Zhenhuan Yang, Xin Wang, Yiming Ying, Siwei Lyu

Abstract: Supervised learning models are challenged by the intrinsic complexities of training data such as outliers and minority subpopulations and intentional attacks at inference time with adversarial samples. While traditional robust learning methods and the recent adversarial training approaches are designed to handle each of the two challenges, to date, no work has been done to develop models that are… ▽ More Supervised learning models are challenged by the intrinsic complexities of training data such as outliers and minority subpopulations and intentional attacks at inference time with adversarial samples. While traditional robust learning methods and the recent adversarial training approaches are designed to handle each of the two challenges, to date, no work has been done to develop models that are robust with regard to the low-quality training data and the potential adversarial attack at inference time simultaneously. It is for this reason that we introduce Outlier Robust Adversarial Training (ORAT) in this work. ORAT is based on a bi-level optimization formulation of adversarial training with a robust rank-based loss function. Theoretically, we show that the learning objective of ORAT satisfies the $\mathcal{H}$-consistency in binary classification, which establishes it as a proper surrogate to adversarial 0/1 loss. Furthermore, we analyze its generalization ability and provide uniform convergence rates in high probability. ORAT can be optimized with a simple algorithm. Experimental evaluations on three benchmark datasets demonstrate the effectiveness and robustness of ORAT in handling outliers and adversarial attacks. Our code is available at https://github.com/discovershu/ORAT. △ Less

Submitted 10 September, 2023; originally announced September 2023.

Comments: Accepted by The 15th Asian Conference on Machine Learning (ACML 2023)

arXiv:2308.12577 [pdf, other]

doi 10.1016/j.knosys.2024.111563

REB: Reducing Biases in Representation for Industrial Anomaly Detection

Authors: Shuai Lyu, Dongmei Mo, Waikeung Wong

Abstract: Existing representation-based methods usually conduct industrial anomaly detection in two stages: obtain feature representations with a pre-trained model and perform distance measures for anomaly detection. Among them, K-nearest neighbor (KNN) retrieval-based anomaly detection methods show promising results. However, the features are not fully exploited as these methods ignore domain bias of pre-t… ▽ More Existing representation-based methods usually conduct industrial anomaly detection in two stages: obtain feature representations with a pre-trained model and perform distance measures for anomaly detection. Among them, K-nearest neighbor (KNN) retrieval-based anomaly detection methods show promising results. However, the features are not fully exploited as these methods ignore domain bias of pre-trained models and the difference of local density in feature space, which limits the detection performance. In this paper, we propose Reducing Biases (REB) in representation by considering the domain bias and building a self-supervised learning task for better domain adaption with a defect generation strategy (DefectMaker) that ensures a strong diversity in the synthetic defects. Additionally, we propose a local-density KNN (LDKNN) to reduce the local density bias in the feature space and obtain effective anomaly detection. The proposed REB method achieves a promising result of 99.5\% Im.AUROC on the widely used MVTec AD, with smaller backbone networks such as Vgg11 and Resnet18. The method also achieves an impressive 88.8\% Im.AUROC on the MVTec LOCO AD dataset and a remarkable 96.0\% on the BTAD dataset, outperforming other representation-based approaches. These results indicate the effectiveness and efficiency of REB for practical industrial applications. Code:https://github.com/ShuaiLYU/REB. △ Less

Submitted 17 May, 2024; v1 submitted 24 August, 2023; originally announced August 2023.

Comments: 14 pages, 7 figures, 7 tables

arXiv:2308.09611 [pdf, other]

Language-guided Human Motion Synthesis with Atomic Actions

Authors: Yuanhao Zhai, Mingzhen Huang, Tianyu Luan, Lu Dong, Ifeoma Nwogu, Siwei Lyu, David Doermann, Junsong Yuan

Abstract: Language-guided human motion synthesis has been a challenging task due to the inherent complexity and diversity of human behaviors. Previous methods face limitations in generalization to novel actions, often resulting in unrealistic or incoherent motion sequences. In this paper, we propose ATOM (ATomic mOtion Modeling) to mitigate this problem, by decomposing actions into atomic actions, and emplo… ▽ More Language-guided human motion synthesis has been a challenging task due to the inherent complexity and diversity of human behaviors. Previous methods face limitations in generalization to novel actions, often resulting in unrealistic or incoherent motion sequences. In this paper, we propose ATOM (ATomic mOtion Modeling) to mitigate this problem, by decomposing actions into atomic actions, and employing a curriculum learning strategy to learn atomic action composition. First, we disentangle complex human motions into a set of atomic actions during learning, and then assemble novel actions using the learned atomic actions, which offers better adaptability to new actions. Moreover, we introduce a curriculum learning training strategy that leverages masked motion modeling with a gradual increase in the mask ratio, and thus facilitates atomic action assembly. This approach mitigates the overfitting problem commonly encountered in previous methods while enforcing the model to learn better motion representations. We demonstrate the effectiveness of ATOM through extensive experiments, including text-to-motion and action-to-motion synthesis tasks. We further illustrate its superiority in synthesizing plausible and coherent text-guided human motion sequences. △ Less

Submitted 18 August, 2023; originally announced August 2023.

Comments: Accepted to ACM MM 2023, code: https://github.com/yhZhai/ATOM

arXiv:2308.08896 [pdf, other]

Optimal Resource Allocation for U-Shaped Parallel Split Learning

Authors: Song Lyu, Zheng Lin, Guanqiao Qu, Xianhao Chen, Xiaoxia Huang, Pan Li

Abstract: Split learning (SL) has emerged as a promising approach for model training without revealing the raw data samples from the data owners. However, traditional SL inevitably leaks label privacy as the tail model (with the last layers) should be placed on the server. To overcome this limitation, one promising solution is to utilize U-shaped architecture to leave both early layers and last layers on th… ▽ More Split learning (SL) has emerged as a promising approach for model training without revealing the raw data samples from the data owners. However, traditional SL inevitably leaks label privacy as the tail model (with the last layers) should be placed on the server. To overcome this limitation, one promising solution is to utilize U-shaped architecture to leave both early layers and last layers on the user side. In this paper, we develop a novel parallel U-shaped split learning and devise the optimal resource optimization scheme to improve the performance of edge networks. In the proposed framework, multiple users communicate with an edge server for SL. We analyze the end-to-end delay of each client during the training process and design an efficient resource allocation algorithm, called LSCRA, which finds the optimal computing resource allocation and split layers. Our experimental results show the effectiveness of LSCRA and that U-shaped parallel split learning can achieve a similar performance with other SL baselines while preserving label privacy. Index Terms: U-shaped network, split learning, label privacy, resource allocation, 5G/6G edge networks. △ Less

Submitted 8 October, 2023; v1 submitted 17 August, 2023; originally announced August 2023.

Comments: 6 pages, 6 figures

arXiv:2308.01520 [pdf, other]

COMICS: End-to-end Bi-grained Contrastive Learning for Multi-face Forgery Detection

Authors: Cong Zhang, Honggang Qi, Shuhui Wang, Yuezun Li, Siwei Lyu

Abstract: DeepFakes have raised serious societal concerns, leading to a great surge in detection-based forensics methods in recent years. Face forgery recognition is a standard detection method that usually follows a two-phase pipeline. While those methods perform well in ideal experimental environment, they face challenges when dealing with DeepFakes in the wild involving complex background and multiple fa… ▽ More DeepFakes have raised serious societal concerns, leading to a great surge in detection-based forensics methods in recent years. Face forgery recognition is a standard detection method that usually follows a two-phase pipeline. While those methods perform well in ideal experimental environment, they face challenges when dealing with DeepFakes in the wild involving complex background and multiple faces of varying sizes. Moreover, most face forgery recognition methods can only process one face at a time. One straightforward way to address this issue is to simultaneous process multi-face by integrating face extraction and forgery detection in an end-to-end fashion by adapting advanced object detection architectures. However, as these object detection architectures are designed to capture the discriminative features of different object categories rather than the subtle forgery traces among the faces, the direct adaptation suffers from limited representation ability. In this paper, we propose COMICS, an end-to-end framework for multi-face forgery detection. COMICS integrates face extraction and forgery detection in a seamless manner and adapts to advanced object detection architectures. The proposed bi-grained contrastive learning approach explores face forgery traces at both the coarse- and fine-grained levels. Specifically, coarse-grained level contrastive learning captures the discriminative features among positive and negative proposal pairs at multiple layers produced by the proposal generator, and fine-grained level contrastive learning captures the pixel-wise discrepancy between the forged and original areas of the same face and the pixel-wise content inconsistency among different faces. Extensive experiments on the OpenForensics and FFIW datasets demonstrate that our method outperforms other counterparts and shows great potential for being integrated into various architectures. △ Less

Submitted 24 May, 2024; v1 submitted 2 August, 2023; originally announced August 2023.

arXiv:2308.00964 [pdf, other]

ForensicsForest Family: A Series of Multi-scale Hierarchical Cascade Forests for Detecting GAN-generated Faces

Authors: Jiucui Lu, Jiaran Zhou, Junyu Dong, Bin Li, Siwei Lyu, Yuezun Li

Abstract: The prominent progress in generative models has significantly improved the reality of generated faces, bringing serious concerns to society. Since recent GAN-generated faces are in high realism, the forgery traces have become more imperceptible, increasing the forensics challenge. To combat GAN-generated faces, many countermeasures based on Convolutional Neural Networks (CNNs) have been spawned du… ▽ More The prominent progress in generative models has significantly improved the reality of generated faces, bringing serious concerns to society. Since recent GAN-generated faces are in high realism, the forgery traces have become more imperceptible, increasing the forensics challenge. To combat GAN-generated faces, many countermeasures based on Convolutional Neural Networks (CNNs) have been spawned due to their strong learning ability. In this paper, we rethink this problem and explore a new approach based on forest models instead of CNNs. Specifically, we describe a simple and effective forest-based method set called {\em ForensicsForest Family} to detect GAN-generate faces. The proposed ForensicsForest family is composed of three variants, which are {\em ForensicsForest}, {\em Hybrid ForensicsForest} and {\em Divide-and-Conquer ForensicsForest} respectively. ForenscisForest is a newly proposed Multi-scale Hierarchical Cascade Forest, which takes semantic, frequency and biology features as input, hierarchically cascades different levels of features for authenticity prediction, and then employs a multi-scale ensemble scheme that can comprehensively consider different levels of information to improve the performance further. Based on ForensicsForest, we develop Hybrid ForensicsForest, an extended version that integrates the CNN layers into models, to further refine the effectiveness of augmented features. Moreover, to reduce the memory cost in training, we propose Divide-and-Conquer ForensicsForest, which can construct a forest model using only a portion of training samplings. In the training stage, we train several candidate forest models using the subsets of training samples. Then a ForensicsForest is assembled by picking the suitable components from these candidate forest models... △ Less

Submitted 26 April, 2024; v1 submitted 2 August, 2023; originally announced August 2023.

Comments: To Appear in IEEE TIFS 2024

Showing 1–50 of 233 results for author: Lyu, S