-
WiNet: Wavelet-based Incremental Learning for Efficient Medical Image Registration
Authors:
Xinxing Cheng,
Xi Jia,
Wenqi Lu,
Qiufu Li,
Linlin Shen,
Alexander Krull,
Jinming Duan
Abstract:
Deep image registration has demonstrated exceptional accuracy and fast inference. Recent advances have adopted either multiple cascades or pyramid architectures to estimate dense deformation fields in a coarse-to-fine manner. However, due to the cascaded nature and repeated composition/warping operations on feature maps, these methods negatively increase memory usage during training and testing. M…
▽ More
Deep image registration has demonstrated exceptional accuracy and fast inference. Recent advances have adopted either multiple cascades or pyramid architectures to estimate dense deformation fields in a coarse-to-fine manner. However, due to the cascaded nature and repeated composition/warping operations on feature maps, these methods negatively increase memory usage during training and testing. Moreover, such approaches lack explicit constraints on the learning process of small deformations at different scales, thus lacking explainability. In this study, we introduce a model-driven WiNet that incrementally estimates scale-wise wavelet coefficients for the displacement/velocity field across various scales, utilizing the wavelet coefficients derived from the original input image pair. By exploiting the properties of the wavelet transform, these estimated coefficients facilitate the seamless reconstruction of a full-resolution displacement/velocity field via our devised inverse discrete wavelet transform (IDWT) layer. This approach avoids the complexities of cascading networks or composition operations, making our WiNet an explainable and efficient competitor with other coarse-to-fine methods. Extensive experimental results from two 3D datasets show that our WiNet is accurate and GPU efficient. The code is available at https://github.com/x-xc/WiNet .
△ Less
Submitted 18 July, 2024;
originally announced July 2024.
-
Unveiling Structural Memorization: Structural Membership Inference Attack for Text-to-Image Diffusion Models
Authors:
Qiao Li,
Xiaomeng Fu,
Xi Wang,
Jin Liu,
Xingyu Gao,
Jiao Dai,
Jizhong Han
Abstract:
With the rapid advancements of large-scale text-to-image diffusion models, various practical applications have emerged, bringing significant convenience to society. However, model developers may misuse the unauthorized data to train diffusion models. These data are at risk of being memorized by the models, thus potentially violating citizens' privacy rights. Therefore, in order to judge whether a…
▽ More
With the rapid advancements of large-scale text-to-image diffusion models, various practical applications have emerged, bringing significant convenience to society. However, model developers may misuse the unauthorized data to train diffusion models. These data are at risk of being memorized by the models, thus potentially violating citizens' privacy rights. Therefore, in order to judge whether a specific image is utilized as a member of a model's training set, Membership Inference Attack (MIA) is proposed to serve as a tool for privacy protection. Current MIA methods predominantly utilize pixel-wise comparisons as distinguishing clues, considering the pixel-level memorization characteristic of diffusion models. However, it is practically impossible for text-to-image models to memorize all the pixel-level information in massive training sets. Therefore, we move to the more advanced structure-level memorization. Observations on the diffusion process show that the structures of members are better preserved compared to those of nonmembers, indicating that diffusion models possess the capability to remember the structures of member images from training sets. Drawing on these insights, we propose a simple yet effective MIA method tailored for text-to-image diffusion models. Extensive experimental results validate the efficacy of our approach. Compared to current pixel-level baselines, our approach not only achieves state-of-the-art performance but also demonstrates remarkable robustness against various distortions.
△ Less
Submitted 18 July, 2024;
originally announced July 2024.
-
Sharpness-diversity tradeoff: improving flat ensembles with SharpBalance
Authors:
Haiquan Lu,
Xiaotian Liu,
Yefan Zhou,
Qunli Li,
Kurt Keutzer,
Michael W. Mahoney,
Yujun Yan,
Huanrui Yang,
Yaoqing Yang
Abstract:
Recent studies on deep ensembles have identified the sharpness of the local minima of individual learners and the diversity of the ensemble members as key factors in improving test-time performance. Building on this, our study investigates the interplay between sharpness and diversity within deep ensembles, illustrating their crucial role in robust generalization to both in-distribution (ID) and o…
▽ More
Recent studies on deep ensembles have identified the sharpness of the local minima of individual learners and the diversity of the ensemble members as key factors in improving test-time performance. Building on this, our study investigates the interplay between sharpness and diversity within deep ensembles, illustrating their crucial role in robust generalization to both in-distribution (ID) and out-of-distribution (OOD) data. We discover a trade-off between sharpness and diversity: minimizing the sharpness in the loss landscape tends to diminish the diversity of individual members within the ensemble, adversely affecting the ensemble's improvement. The trade-off is justified through our theoretical analysis and verified empirically through extensive experiments. To address the issue of reduced diversity, we introduce SharpBalance, a novel training approach that balances sharpness and diversity within ensembles. Theoretically, we show that our training strategy achieves a better sharpness-diversity trade-off. Empirically, we conducted comprehensive evaluations in various data sets (CIFAR-10, CIFAR-100, TinyImageNet) and showed that SharpBalance not only effectively improves the sharpness-diversity trade-off, but also significantly improves ensemble performance in ID and OOD scenarios.
△ Less
Submitted 17 July, 2024;
originally announced July 2024.
-
Is Sarcasm Detection A Step-by-Step Reasoning Process in Large Language Models?
Authors:
Ben Yao,
Yazhou Zhang,
Qiuchi Li,
Jing Qin
Abstract:
Elaborating a series of intermediate reasoning steps significantly improves the ability of large language models (LLMs) to solve complex problems, as such steps would evoke LLMs to think sequentially. However, human sarcasm understanding is often considered an intuitive and holistic cognitive process, in which various linguistic, contextual, and emotional cues are integrated to form a comprehensiv…
▽ More
Elaborating a series of intermediate reasoning steps significantly improves the ability of large language models (LLMs) to solve complex problems, as such steps would evoke LLMs to think sequentially. However, human sarcasm understanding is often considered an intuitive and holistic cognitive process, in which various linguistic, contextual, and emotional cues are integrated to form a comprehensive understanding of the speaker's true intention, which is argued not be limited to a step-by-step reasoning process. To verify this argument, we introduce a new prompting framework called SarcasmCue, which contains four prompting strategies, $viz.$ chain of contradiction (CoC), graph of cues (GoC), bagging of cues (BoC) and tensor of cues (ToC), which elicits LLMs to detect human sarcasm by considering sequential and non-sequential prompting methods. Through a comprehensive empirical comparison on four benchmarking datasets, we show that the proposed four prompting methods outperforms standard IO prompting, CoT and ToT with a considerable margin, and non-sequential prompting generally outperforms sequential prompting.
△ Less
Submitted 17 July, 2024;
originally announced July 2024.
-
Encapsulating Knowledge in One Prompt
Authors:
Qi Li,
Runpeng Yu,
Xinchao Wang
Abstract:
This paradigm encapsulates knowledge from various models into a solitary prompt without altering the original models or requiring access to the training data, which enables us to achieve efficient and convenient knowledge transfer in more realistic scenarios. From a practicality standpoint, this paradigm not only for the first time proves the effectiveness of Visual Prompt in data inaccessible con…
▽ More
This paradigm encapsulates knowledge from various models into a solitary prompt without altering the original models or requiring access to the training data, which enables us to achieve efficient and convenient knowledge transfer in more realistic scenarios. From a practicality standpoint, this paradigm not only for the first time proves the effectiveness of Visual Prompt in data inaccessible contexts, but also solves the problems of low model reusability and high storage resource consumption faced by traditional Data-Free Knowledge Transfer, which means that we can realize the parallel knowledge transfer of multiple models without modifying any source model. Extensive experiments across various datasets and models demonstrate the efficacy of the proposed KiOP knowledge transfer paradigm. Without access to real training data and with rigorous storage capacity constraints, it is also capable of yielding considerable outcomes when dealing with cross-model backbone setups and handling parallel knowledge transfer processing requests with multiple (more than 2) models.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
Snail-Radar: A large-scale diverse dataset for the evaluation of 4D-radar-based SLAM systems
Authors:
Jianzhu Huai,
Binliang Wang,
Yuan Zhuang,
Yiwen Chen,
Qipeng Li,
Yulong Han,
Charles Toth
Abstract:
4D radars are increasingly favored for odometry and mapping of autonomous systems due to their robustness in harsh weather and dynamic environments. Existing datasets, however, often cover limited areas and are typically captured using a single platform. To address this gap, we present a diverse large-scale dataset specifically designed for 4D radar-based localization and mapping. This dataset was…
▽ More
4D radars are increasingly favored for odometry and mapping of autonomous systems due to their robustness in harsh weather and dynamic environments. Existing datasets, however, often cover limited areas and are typically captured using a single platform. To address this gap, we present a diverse large-scale dataset specifically designed for 4D radar-based localization and mapping. This dataset was gathered using three different platforms: a handheld device, an e-bike, and an SUV, under a variety of environmental conditions, including clear days, nighttime, and heavy rain. The data collection occurred from September 2023 to February 2024, encompassing diverse settings such as roads in a vegetated campus and tunnels on highways. Each route was traversed multiple times to facilitate place recognition evaluations. The sensor suite included a 3D lidar, 4D radars, stereo cameras, consumer-grade IMUs, and a GNSS/INS system. Sensor data packets were synchronized to GNSS time using a two-step process: a convex hull algorithm was applied to smooth host time jitter, and then odometry and correlation algorithms were used to correct constant time offsets. Extrinsic calibration between sensors was achieved through manual measurements and subsequent nonlinear optimization. The reference motion for the platforms was generated by registering lidar scans to a terrestrial laser scanner (TLS) point cloud map using a lidar inertial odometry (LIO) method in localization mode. Additionally, a data reversion technique was introduced to enable backward LIO processing. We believe this dataset will boost research in radar-based point cloud registration, odometry, mapping, and place recognition.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models
Authors:
Pengxiang Li,
Zhi Gao,
Bofei Zhang,
Tao Yuan,
Yuwei Wu,
Mehrtash Harandi,
Yunde Jia,
Song-Chun Zhu,
Qing Li
Abstract:
Vision language models (VLMs) have achieved impressive progress in diverse applications, becoming a prevalent research direction. In this paper, we build FIRE, a feedback-refinement dataset, consisting of 1.1M multi-turn conversations that are derived from 27 source datasets, empowering VLMs to spontaneously refine their responses based on user feedback across diverse tasks. To scale up the data c…
▽ More
Vision language models (VLMs) have achieved impressive progress in diverse applications, becoming a prevalent research direction. In this paper, we build FIRE, a feedback-refinement dataset, consisting of 1.1M multi-turn conversations that are derived from 27 source datasets, empowering VLMs to spontaneously refine their responses based on user feedback across diverse tasks. To scale up the data collection, FIRE is collected in two components: FIRE-100K and FIRE-1M, where FIRE-100K is generated by GPT-4V, and FIRE-1M is freely generated via models trained on FIRE-100K. Then, we build FIRE-Bench, a benchmark to comprehensively evaluate the feedback-refining capability of VLMs, which contains 11K feedback-refinement conversations as the test data, two evaluation settings, and a model to provide feedback for VLMs. We develop the FIRE-LLaVA model by fine-tuning LLaVA on FIRE-100K and FIRE-1M, which shows remarkable feedback-refining capability on FIRE-Bench and outperforms untrained VLMs by 50%, making more efficient user-agent interactions and underscoring the significance of the FIRE dataset.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
EndoFinder: Online Image Retrieval for Explainable Colorectal Polyp Diagnosis
Authors:
Ruijie Yang,
Yan Zhu,
Peiyao Fu,
Yizhe Zhang,
Zhihua Wang,
Quanlin Li,
Pinghong Zhou,
Xian Yang,
Shuo Wang
Abstract:
Determining the necessity of resecting malignant polyps during colonoscopy screen is crucial for patient outcomes, yet challenging due to the time-consuming and costly nature of histopathology examination. While deep learning-based classification models have shown promise in achieving optical biopsy with endoscopic images, they often suffer from a lack of explainability. To overcome this limitatio…
▽ More
Determining the necessity of resecting malignant polyps during colonoscopy screen is crucial for patient outcomes, yet challenging due to the time-consuming and costly nature of histopathology examination. While deep learning-based classification models have shown promise in achieving optical biopsy with endoscopic images, they often suffer from a lack of explainability. To overcome this limitation, we introduce EndoFinder, a content-based image retrieval framework to find the 'digital twin' polyp in the reference database given a newly detected polyp. The clinical semantics of the new polyp can be inferred referring to the matched ones. EndoFinder pioneers a polyp-aware image encoder that is pre-trained on a large polyp dataset in a self-supervised way, merging masked image modeling with contrastive learning. This results in a generic embedding space ready for different downstream clinical tasks based on image retrieval. We validate the framework on polyp re-identification and optical biopsy tasks, with extensive experiments demonstrating that EndoFinder not only achieves explainable diagnostics but also matches the performance of supervised classification models. EndoFinder's reliance on image retrieval has the potential to support diverse downstream decision-making tasks during real-time colonoscopy procedures.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
Backdoor Graph Condensation
Authors:
Jiahao Wu,
Ning Lu,
Zeiyu Dai,
Wenqi Fan,
Shengcai Liu,
Qing Li,
Ke Tang
Abstract:
Recently, graph condensation has emerged as a prevalent technique to improve the training efficiency for graph neural networks (GNNs). It condenses a large graph into a small one such that a GNN trained on this small synthetic graph can achieve comparable performance to a GNN trained on a large graph. However, while existing graph condensation studies mainly focus on the best trade-off between gra…
▽ More
Recently, graph condensation has emerged as a prevalent technique to improve the training efficiency for graph neural networks (GNNs). It condenses a large graph into a small one such that a GNN trained on this small synthetic graph can achieve comparable performance to a GNN trained on a large graph. However, while existing graph condensation studies mainly focus on the best trade-off between graph size and the GNNs' performance (model utility), the security issues of graph condensation have not been studied. To bridge this research gap, we propose the task of backdoor graph condensation. While graph backdoor attacks have been extensively explored, applying existing graph backdoor methods for graph condensation is not practical since they can undermine the model utility and yield low attack success rate.
To alleviate these issues, we introduce two primary objectives for backdoor attacks against graph condensation: 1) the injection of triggers cannot affect the quality of condensed graphs, maintaining the utility of GNNs trained on them; and 2) the effectiveness of triggers should be preserved throughout the condensation process, achieving high attack success rate. To pursue the objectives, we devise the first backdoor attack against graph condensation, denoted as BGC. Specifically, we inject triggers during condensation and iteratively update the triggers to ensure effective attacks. Further, we propose a poisoned node selection module to minimize the influence of triggers on condensed graphs' quality. The extensive experiments demonstrate the effectiveness of our attack. BGC achieves a high attack success rate (close to 1.0) and good model utility in all cases. Furthermore, the results demonstrate our method's resilience against multiple defense methods. Finally, we conduct comprehensive studies to analyze the factors that influence the attack performance.
△ Less
Submitted 16 July, 2024; v1 submitted 3 July, 2024;
originally announced July 2024.
-
LVCP: LiDAR-Vision Tightly Coupled Collaborative Real-time Relative Positioning
Authors:
Zhuozhu Jian,
Qixuan Li,
Shengtao Zheng,
Xueqian Wang,
Xinlei Chen
Abstract:
In air-ground collaboration scenarios without GPS and prior maps, the relative positioning of drones and unmanned ground vehicles (UGVs) has always been a challenge. For a drone equipped with monocular camera and an UGV equipped with LiDAR as an external sensor, we propose a robust and real-time relative pose estimation method (LVCP) based on the tight coupling of vision and LiDAR point cloud info…
▽ More
In air-ground collaboration scenarios without GPS and prior maps, the relative positioning of drones and unmanned ground vehicles (UGVs) has always been a challenge. For a drone equipped with monocular camera and an UGV equipped with LiDAR as an external sensor, we propose a robust and real-time relative pose estimation method (LVCP) based on the tight coupling of vision and LiDAR point cloud information, which does not require prior information such as maps or precise initial poses. Given that large-scale point clouds generated by 3D sensors has more accurate spatial geometric information than the feature point cloud generated by image, we utilize LiDAR point clouds to correct the drift in visual-inertial odometry (VIO) when the camera undergoes significant shaking or the IMU has a low signal-to-noise ratio. To achieve this, we propose a novel coarse-to-fine framework for LiDAR-vision collaborative localization. In this framework, we construct point-plane association based on spatial geometric information, and innovatively construct a point-aided Bundle Adjustment (BA) problem as the backend to simultaneously estimate the relative pose of the camera and LiDAR and correct the VIO drift. In this process, we propose a particle swarm optimization (PSO) based sampling algorithm to complete the coarse estimation of the current camera-LiDAR pose. In this process, the initial pose of the camera used for sampling is obtained based on VIO propagation, and the valid feature-plane association number (VFPN) is used to trigger PSO-sampling process. Additionally, we propose a method that combines Structure from Motion (SFM) and multi-level sampling to initialize the algorithm, addressing the challenge of lacking initial values.
△ Less
Submitted 15 July, 2024;
originally announced July 2024.
-
SEMINAR: Search Enhanced Multi-modal Interest Network and Approximate Retrieval for Lifelong Sequential Recommendation
Authors:
Kaiming Shen,
Xichen Ding,
Zixiang Zheng,
Yuqi Gong,
Qianqian Li,
Zhongyi Liu,
Guannan Zhang
Abstract:
The modeling of users' behaviors is crucial in modern recommendation systems. A lot of research focuses on modeling users' lifelong sequences, which can be extremely long and sometimes exceed thousands of items. These models use the target item to search for the most relevant items from the historical sequence. However, training lifelong sequences in click through rate (CTR) prediction or personal…
▽ More
The modeling of users' behaviors is crucial in modern recommendation systems. A lot of research focuses on modeling users' lifelong sequences, which can be extremely long and sometimes exceed thousands of items. These models use the target item to search for the most relevant items from the historical sequence. However, training lifelong sequences in click through rate (CTR) prediction or personalized search ranking (PSR) is extremely difficult due to the insufficient learning problem of ID embedding, especially when the IDs in the lifelong sequence features do not exist in the samples of training dataset. Additionally, existing target attention mechanisms struggle to learn the multi-modal representations of items in the sequence well. The distribution of multi-modal embedding (text, image and attributes) output of user's interacted items are not properly aligned and there exist divergence across modalities. We also observe that users' search query sequences and item browsing sequences can fully depict users' intents and benefit from each other. To address these challenges, we propose a unified lifelong multi-modal sequence model called SEMINAR-Search Enhanced Multi-Modal Interest Network and Approximate Retrieval. Specifically, a network called Pretraining Search Unit (PSU) learns the lifelong sequences of multi-modal query-item pairs in a pretraining-finetuning manner with multiple objectives: multi-modal alignment, next query-item pair prediction, query-item relevance prediction, etc. After pretraining, the downstream model restores the pretrained embedding as initialization and finetunes the network. To accelerate the online retrieval speed of multi-modal embedding, we propose a multi-modal codebook-based product quantization strategy to approximate the exact attention calculati
△ Less
Submitted 15 July, 2024;
originally announced July 2024.
-
Order parameters and phase transitions of continual learning in deep neural networks
Authors:
Haozhe Shan,
Qianyi Li,
Haim Sompolinsky
Abstract:
Continual learning (CL) enables animals to learn new tasks without erasing prior knowledge. CL in artificial neural networks (NNs) is challenging due to catastrophic forgetting, where new learning degrades performance on older tasks. While various techniques exist to mitigate forgetting, theoretical insights into when and why CL fails in NNs are lacking. Here, we present a statistical-mechanics th…
▽ More
Continual learning (CL) enables animals to learn new tasks without erasing prior knowledge. CL in artificial neural networks (NNs) is challenging due to catastrophic forgetting, where new learning degrades performance on older tasks. While various techniques exist to mitigate forgetting, theoretical insights into when and why CL fails in NNs are lacking. Here, we present a statistical-mechanics theory of CL in deep, wide NNs, which characterizes the network's input-output mapping as it learns a sequence of tasks. It gives rise to order parameters (OPs) that capture how task relations and network architecture influence forgetting and knowledge transfer, as verified by numerical evaluations. We found that the input and rule similarity between tasks have different effects on CL performance. In addition, the theory predicts that increasing the network depth can effectively reduce overlap between tasks, thereby lowering forgetting. For networks with task-specific readouts, the theory identifies a phase transition where CL performance shifts dramatically as tasks become less similar, as measured by the OPs. Sufficiently low similarity leads to catastrophic anterograde interference, where the network retains old tasks perfectly but completely fails to generalize new learning. Our results delineate important factors affecting CL performance and suggest strategies for mitigating forgetting.
△ Less
Submitted 14 July, 2024;
originally announced July 2024.
-
Prompting Whole Slide Image Based Genetic Biomarker Prediction
Authors:
Ling Zhang,
Boxiang Yun,
Xingran Xie,
Qingli Li,
Xinxing Li,
Yan Wang
Abstract:
Prediction of genetic biomarkers, e.g., microsatellite instability and BRAF in colorectal cancer is crucial for clinical decision making. In this paper, we propose a whole slide image (WSI) based genetic biomarker prediction method via prompting techniques. Our work aims at addressing the following challenges: (1) extracting foreground instances related to genetic biomarkers from gigapixel WSIs, a…
▽ More
Prediction of genetic biomarkers, e.g., microsatellite instability and BRAF in colorectal cancer is crucial for clinical decision making. In this paper, we propose a whole slide image (WSI) based genetic biomarker prediction method via prompting techniques. Our work aims at addressing the following challenges: (1) extracting foreground instances related to genetic biomarkers from gigapixel WSIs, and (2) the interaction among the fine-grained pathological components in WSIs.Specifically, we leverage large language models to generate medical prompts that serve as prior knowledge in extracting instances associated with genetic biomarkers. We adopt a coarse-to-fine approach to mine biomarker information within the tumor microenvironment. This involves extracting instances related to genetic biomarkers using coarse medical prior knowledge, grouping pathology instances into fine-grained pathological components and mining their interactions. Experimental results on two colorectal cancer datasets show the superiority of our method, achieving 91.49% in AUC for MSI classification. The analysis further shows the clinical interpretability of our method. Code is publicly available at https://github.com/DeepMed-Lab-ECNU/PromptBio.
△ Less
Submitted 26 June, 2024;
originally announced July 2024.
-
Provable Privacy Advantages of Decentralized Federated Learning via Distributed Optimization
Authors:
Wenrui Yu,
Qiongxiu Li,
Milan Lopuhaä-Zwakenberg,
Mads Græsbøll Christensen,
Richard Heusdens
Abstract:
Federated learning (FL) emerged as a paradigm designed to improve data privacy by enabling data to reside at its source, thus embedding privacy as a core consideration in FL architectures, whether centralized or decentralized. Contrasting with recent findings by Pasquini et al., which suggest that decentralized FL does not empirically offer any additional privacy or security benefits over centrali…
▽ More
Federated learning (FL) emerged as a paradigm designed to improve data privacy by enabling data to reside at its source, thus embedding privacy as a core consideration in FL architectures, whether centralized or decentralized. Contrasting with recent findings by Pasquini et al., which suggest that decentralized FL does not empirically offer any additional privacy or security benefits over centralized models, our study provides compelling evidence to the contrary. We demonstrate that decentralized FL, when deploying distributed optimization, provides enhanced privacy protection - both theoretically and empirically - compared to centralized approaches. The challenge of quantifying privacy loss through iterative processes has traditionally constrained the theoretical exploration of FL protocols. We overcome this by conducting a pioneering in-depth information-theoretical privacy analysis for both frameworks. Our analysis, considering both eavesdropping and passive adversary models, successfully establishes bounds on privacy leakage. We show information theoretically that the privacy loss in decentralized FL is upper bounded by the loss in centralized FL. Compared to the centralized case where local gradients of individual participants are directly revealed, a key distinction of optimization-based decentralized FL is that the relevant information includes differences of local gradients over successive iterations and the aggregated sum of different nodes' gradients over the network. This information complicates the adversary's attempt to infer private data. To bridge our theoretical insights with practical applications, we present detailed case studies involving logistic regression and deep neural networks. These examples demonstrate that while privacy leakage remains comparable in simpler models, complex models like deep neural networks exhibit lower privacy risks under decentralized FL.
△ Less
Submitted 12 July, 2024;
originally announced July 2024.
-
Self-Prompt Tuning: Enable Autonomous Role-Playing in LLMs
Authors:
Aobo Kong,
Shiwan Zhao,
Hao Chen,
Qicheng Li,
Yong Qin,
Ruiqi Sun,
Xin Zhou,
Jiaming Zhou,
Haoqin Sun
Abstract:
Recent advancements in LLMs have showcased their remarkable role-playing capabilities, able to accurately simulate the dialogue styles and cognitive processes of various roles based on different instructions and contexts. Studies indicate that assigning LLMs the roles of experts, a strategy known as role-play prompting, can enhance their performance in the corresponding domains. However, the promp…
▽ More
Recent advancements in LLMs have showcased their remarkable role-playing capabilities, able to accurately simulate the dialogue styles and cognitive processes of various roles based on different instructions and contexts. Studies indicate that assigning LLMs the roles of experts, a strategy known as role-play prompting, can enhance their performance in the corresponding domains. However, the prompt needs to be manually designed for the given problem, requiring certain expertise and iterative modifications. To this end, we propose self-prompt tuning, making LLMs themselves generate role-play prompts through fine-tuning. Leveraging the LIMA dataset as our foundational corpus, we employ GPT-4 to annotate role-play prompts for each data points, resulting in the creation of the LIMA-Role dataset. We then fine-tune LLMs like Llama-2-7B and Mistral-7B on LIMA-Role. Consequently, the self-prompt tuned LLMs can automatically generate expert role prompts for any given question. We extensively evaluate self-prompt tuned LLMs on widely used NLP benchmarks and open-ended question test. Our empirical results illustrate that self-prompt tuned LLMs outperform standard instruction tuned baselines across most datasets. This highlights the great potential of utilizing fine-tuning to enable LLMs to self-prompt, thereby automating complex prompting strategies. We release the dataset, models, and code at this \href{https://anonymous.4open.science/r/Self-Prompt-Tuning-739E/}{url}.
△ Less
Submitted 12 July, 2024;
originally announced July 2024.
-
Bora: Biomedical Generalist Video Generation Model
Authors:
Weixiang Sun,
Xiaocao You,
Ruizhe Zheng,
Zhengqing Yuan,
Xiang Li,
Lifang He,
Quanzheng Li,
Lichao Sun
Abstract:
Generative models hold promise for revolutionizing medical education, robot-assisted surgery, and data augmentation for medical AI development. Diffusion models can now generate realistic images from text prompts, while recent advancements have demonstrated their ability to create diverse, high-quality videos. However, these models often struggle with generating accurate representations of medical…
▽ More
Generative models hold promise for revolutionizing medical education, robot-assisted surgery, and data augmentation for medical AI development. Diffusion models can now generate realistic images from text prompts, while recent advancements have demonstrated their ability to create diverse, high-quality videos. However, these models often struggle with generating accurate representations of medical procedures and detailed anatomical structures. This paper introduces Bora, the first spatio-temporal diffusion probabilistic model designed for text-guided biomedical video generation. Bora leverages Transformer architecture and is pre-trained on general-purpose video generation tasks. It is fine-tuned through model alignment and instruction tuning using a newly established medical video corpus, which includes paired text-video data from various biomedical fields. To the best of our knowledge, this is the first attempt to establish such a comprehensive annotated biomedical video dataset. Bora is capable of generating high-quality video data across four distinct biomedical domains, adhering to medical expert standards and demonstrating consistency and diversity. This generalist video generative model holds significant potential for enhancing medical consultation and decision-making, particularly in resource-limited settings. Additionally, Bora could pave the way for immersive medical training and procedure planning. Extensive experiments on distinct medical modalities such as endoscopy, ultrasound, MRI, and cell tracking validate the effectiveness of our model in understanding biomedical instructions and its superior performance across subjects compared to state-of-the-art generation models.
△ Less
Submitted 15 July, 2024; v1 submitted 11 July, 2024;
originally announced July 2024.
-
Distributed Backdoor Attacks on Federated Graph Learning and Certified Defenses
Authors:
Yuxin Yang,
Qiang Li,
Jinyuan Jia,
Yuan Hong,
Binghui Wang
Abstract:
Federated graph learning (FedGL) is an emerging federated learning (FL) framework that extends FL to learn graph data from diverse sources. FL for non-graph data has shown to be vulnerable to backdoor attacks, which inject a shared backdoor trigger into the training data such that the trained backdoored FL model can predict the testing data containing the trigger as the attacker desires. However,…
▽ More
Federated graph learning (FedGL) is an emerging federated learning (FL) framework that extends FL to learn graph data from diverse sources. FL for non-graph data has shown to be vulnerable to backdoor attacks, which inject a shared backdoor trigger into the training data such that the trained backdoored FL model can predict the testing data containing the trigger as the attacker desires. However, FedGL against backdoor attacks is largely unexplored, and no effective defense exists.
In this paper, we aim to address such significant deficiency. First, we propose an effective, stealthy, and persistent backdoor attack on FedGL. Our attack uses a subgraph as the trigger and designs an adaptive trigger generator that can derive the effective trigger location and shape for each graph. Our attack shows that empirical defenses are hard to detect/remove our generated triggers. To mitigate it, we further develop a certified defense for any backdoored FedGL model against the trigger with any shape at any location. Our defense involves carefully dividing a testing graph into multiple subgraphs and designing a majority vote-based ensemble classifier on these subgraphs. We then derive the deterministic certified robustness based on the ensemble classifier and prove its tightness. We extensively evaluate our attack and defense on six graph datasets. Our attack results show our attack can obtain > 90% backdoor accuracy in almost all datasets. Our defense results show, in certain cases, the certified accuracy for clean testing graphs against an arbitrary trigger with size 20 can be close to the normal accuracy under no attack, while there is a moderate gap in other cases. Moreover, the certified backdoor accuracy is always 0 for backdoored testing graphs generated by our attack, implying our defense can fully mitigate the attack. Source code is available at: https://github.com/Yuxin104/Opt-GDBA.
△ Less
Submitted 11 July, 2024;
originally announced July 2024.
-
MetaUrban: A Simulation Platform for Embodied AI in Urban Spaces
Authors:
Wayne Wu,
Honglin He,
Yiran Wang,
Chenda Duan,
Jack He,
Zhizheng Liu,
Quanyi Li,
Bolei Zhou
Abstract:
Public urban spaces like streetscapes and plazas serve residents and accommodate social life in all its vibrant variations. Recent advances in Robotics and Embodied AI make public urban spaces no longer exclusive to humans. Food delivery bots and electric wheelchairs have started sharing sidewalks with pedestrians, while diverse robot dogs and humanoids have recently emerged in the street. Ensurin…
▽ More
Public urban spaces like streetscapes and plazas serve residents and accommodate social life in all its vibrant variations. Recent advances in Robotics and Embodied AI make public urban spaces no longer exclusive to humans. Food delivery bots and electric wheelchairs have started sharing sidewalks with pedestrians, while diverse robot dogs and humanoids have recently emerged in the street. Ensuring the generalizability and safety of these forthcoming mobile machines is crucial when navigating through the bustling streets in urban spaces. In this work, we present MetaUrban, a compositional simulation platform for Embodied AI research in urban spaces. MetaUrban can construct an infinite number of interactive urban scenes from compositional elements, covering a vast array of ground plans, object placements, pedestrians, vulnerable road users, and other mobile agents' appearances and dynamics. We design point navigation and social navigation tasks as the pilot study using MetaUrban for embodied AI research and establish various baselines of Reinforcement Learning and Imitation Learning. Experiments demonstrate that the compositional nature of the simulated environments can substantially improve the generalizability and safety of the trained mobile agents. MetaUrban will be made publicly available to provide more research opportunities and foster safe and trustworthy embodied AI in urban spaces.
△ Less
Submitted 11 July, 2024;
originally announced July 2024.
-
Tactics, Techniques, and Procedures (TTPs) in Interpreted Malware: A Zero-Shot Generation with Large Language Models
Authors:
Ying Zhang,
Xiaoyan Zhou,
Hui Wen,
Wenjia Niu,
Jiqiang Liu,
Haining Wang,
Qiang Li
Abstract:
Nowadays, the open-source software (OSS) ecosystem suffers from security threats of software supply chain (SSC) attacks. Interpreted OSS malware plays a vital role in SSC attacks, as criminals have an arsenal of attack vectors to deceive users into installing malware and executing malicious activities. In this paper, we introduce tactics, techniques, and procedures (TTPs) proposed by MITRE ATT\&CK…
▽ More
Nowadays, the open-source software (OSS) ecosystem suffers from security threats of software supply chain (SSC) attacks. Interpreted OSS malware plays a vital role in SSC attacks, as criminals have an arsenal of attack vectors to deceive users into installing malware and executing malicious activities. In this paper, we introduce tactics, techniques, and procedures (TTPs) proposed by MITRE ATT\&CK into the interpreted malware analysis to characterize different phases of an attack lifecycle. Specifically, we propose GENTTP, a zero-shot approach to extracting a TTP of an interpreted malware package. GENTTP leverages large language models (LLMs) to automatically generate a TTP, where the input is a malicious package, and the output is a deceptive tactic and an execution tactic of attack vectors. To validate the effectiveness of GENTTP, we collect two datasets for evaluation: a dataset with ground truth labels and a large dataset in the wild. Experimental results show that GENTTP can generate TTPs with high accuracy and efficiency. To demonstrate GENTTP's benefits, we build an LLM-based Chatbot from 3,700+ PyPI malware's TTPs. We further conduct a quantitative analysis of malware's TTPs at a large scale. Our main findings include: (1) many OSS malicious packages share a relatively stable TTP, even with the increasing emergence of malware and attack campaigns, (2) a TTP reflects characteristics of a malware-based attack, and (3) an attacker's intent behind the malware is linked to a TTP.
△ Less
Submitted 11 July, 2024;
originally announced July 2024.
-
Spatially-Variant Degradation Model for Dataset-free Super-resolution
Authors:
Shaojie Guo,
Haofei Song,
Qingli Li,
Yan Wang
Abstract:
This paper focuses on the dataset-free Blind Image Super-Resolution (BISR). Unlike existing dataset-free BISR methods that focus on obtaining a degradation kernel for the entire image, we are the first to explicitly design a spatially-variant degradation model for each pixel. Our method also benefits from having a significantly smaller number of learnable parameters compared to data-driven spatial…
▽ More
This paper focuses on the dataset-free Blind Image Super-Resolution (BISR). Unlike existing dataset-free BISR methods that focus on obtaining a degradation kernel for the entire image, we are the first to explicitly design a spatially-variant degradation model for each pixel. Our method also benefits from having a significantly smaller number of learnable parameters compared to data-driven spatially-variant BISR methods. Concretely, each pixel's degradation kernel is expressed as a linear combination of a learnable dictionary composed of a small number of spatially-variant atom kernels. The coefficient matrices of the atom degradation kernels are derived using membership functions of fuzzy set theory. We construct a novel Probabilistic BISR model with tailored likelihood function and prior terms. Subsequently, we employ the Monte Carlo EM algorithm to infer the degradation kernels for each pixel. Our method achieves a significant improvement over other state-of-the-art BISR methods, with an average improvement of 1 dB (2x).Code will be released at https://github.com/shaojieguoECNU/SVDSR.
△ Less
Submitted 11 July, 2024;
originally announced July 2024.
-
Beyond Benchmarking: A New Paradigm for Evaluation and Assessment of Large Language Models
Authors:
Jin Liu,
Qingquan Li,
Wenlong Du
Abstract:
In current benchmarks for evaluating large language models (LLMs), there are issues such as evaluation content restriction, untimely updates, and lack of optimization guidance. In this paper, we propose a new paradigm for the measurement of LLMs: Benchmarking-Evaluation-Assessment. Our paradigm shifts the "location" of LLM evaluation from the "examination room" to the "hospital". Through conductin…
▽ More
In current benchmarks for evaluating large language models (LLMs), there are issues such as evaluation content restriction, untimely updates, and lack of optimization guidance. In this paper, we propose a new paradigm for the measurement of LLMs: Benchmarking-Evaluation-Assessment. Our paradigm shifts the "location" of LLM evaluation from the "examination room" to the "hospital". Through conducting a "physical examination" on LLMs, it utilizes specific task-solving as the evaluation content, performs deep attribution of existing problems within LLMs, and provides recommendation for optimization.
△ Less
Submitted 10 July, 2024;
originally announced July 2024.
-
Using Galaxy Evolution as Source of Physics-Based Ground Truth for Generative Models
Authors:
Yun Qi Li,
Tuan Do,
Evan Jones,
Bernie Boscoe,
Kevin Alfaro,
Zooey Nguyen
Abstract:
Generative models producing images have enormous potential to advance discoveries across scientific fields and require metrics capable of quantifying the high dimensional output. We propose that astrophysics data, such as galaxy images, can test generative models with additional physics-motivated ground truths in addition to human judgment. For example, galaxies in the Universe form and change ove…
▽ More
Generative models producing images have enormous potential to advance discoveries across scientific fields and require metrics capable of quantifying the high dimensional output. We propose that astrophysics data, such as galaxy images, can test generative models with additional physics-motivated ground truths in addition to human judgment. For example, galaxies in the Universe form and change over billions of years, following physical laws and relationships that are both easy to characterize and difficult to encode in generative models. We build a conditional denoising diffusion probabilistic model (DDPM) and a conditional variational autoencoder (CVAE) and test their ability to generate realistic galaxies conditioned on their redshifts (galaxy ages). This is one of the first studies to probe these generative models using physically motivated metrics. We find that both models produce comparable realistic galaxies based on human evaluation, but our physics-based metrics are better able to discern the strengths and weaknesses of the generative models. Overall, the DDPM model performs better than the CVAE on the majority of the physics-based metrics. Ultimately, if we can show that generative models can learn the physics of galaxy evolution, they have the potential to unlock new astrophysical discoveries.
△ Less
Submitted 9 July, 2024;
originally announced July 2024.
-
Hypergraph based Understanding for Document Semantic Entity Recognition
Authors:
Qiwei Li,
Zuchao Li,
Ping Wang,
Haojun Ai,
Hai Zhao
Abstract:
Semantic entity recognition is an important task in the field of visually-rich document understanding. It distinguishes the semantic types of text by analyzing the position relationship between text nodes and the relation between text content. The existing document understanding models mainly focus on entity categories while ignoring the extraction of entity boundaries. We build a novel hypergraph…
▽ More
Semantic entity recognition is an important task in the field of visually-rich document understanding. It distinguishes the semantic types of text by analyzing the position relationship between text nodes and the relation between text content. The existing document understanding models mainly focus on entity categories while ignoring the extraction of entity boundaries. We build a novel hypergraph attention document semantic entity recognition framework, HGA, which uses hypergraph attention to focus on entity boundaries and entity categories at the same time. It can conduct a more detailed analysis of the document text representation analyzed by the upstream model and achieves a better performance of semantic information. We apply this method on the basis of GraphLayoutLM to construct a new semantic entity recognition model HGALayoutLM. Our experiment results on FUNSD, CORD, XFUND and SROIE show that our method can effectively improve the performance of semantic entity recognition tasks based on the original model. The results of HGALayoutLM on FUNSD and XFUND reach the new state-of-the-art results.
△ Less
Submitted 9 July, 2024;
originally announced July 2024.
-
Nonrigid Reconstruction of Freehand Ultrasound without a Tracker
Authors:
Qi Li,
Ziyi Shen,
Qianye Yang,
Dean C. Barratt,
Matthew J. Clarkson,
Tom Vercauteren,
Yipeng Hu
Abstract:
Reconstructing 2D freehand Ultrasound (US) frames into 3D space without using a tracker has recently seen advances with deep learning. Predicting good frame-to-frame rigid transformations is often accepted as the learning objective, especially when the ground-truth labels from spatial tracking devices are inherently rigid transformations. Motivated by a) the observed nonrigid deformation due to so…
▽ More
Reconstructing 2D freehand Ultrasound (US) frames into 3D space without using a tracker has recently seen advances with deep learning. Predicting good frame-to-frame rigid transformations is often accepted as the learning objective, especially when the ground-truth labels from spatial tracking devices are inherently rigid transformations. Motivated by a) the observed nonrigid deformation due to soft tissue motion during scanning, and b) the highly sensitive prediction of rigid transformation, this study investigates the methods and their benefits in predicting nonrigid transformations for reconstructing 3D US. We propose a novel co-optimisation algorithm for simultaneously estimating rigid transformations among US frames, supervised by ground-truth from a tracker, and a nonrigid deformation, optimised by a regularised registration network. We show that these two objectives can be either optimised using meta-learning or combined by weighting. A fast scattered data interpolation is also developed for enabling frequent reconstruction and registration of non-parallel US frames, during training. With a new data set containing over 357,000 frames in 720 scans, acquired from 60 subjects, the experiments demonstrate that, due to an expanded thus easier-to-optimise solution space, the generalisation is improved with the added deformation estimation, with respect to the rigid ground-truth. The global pixel reconstruction error (assessing accumulative prediction) is lowered from 18.48 to 16.51 mm, compared with baseline rigid-transformation-predicting methods. Using manually identified landmarks, the proposed co-optimisation also shows potentials in compensating nonrigid tissue motion at inference, which is not measurable by tracker-provided ground-truth. The code and data used in this paper are made publicly available at https://github.com/QiLi111/NR-Rec-FUS.
△ Less
Submitted 14 July, 2024; v1 submitted 8 July, 2024;
originally announced July 2024.
-
Cross Prompting Consistency with Segment Anything Model for Semi-supervised Medical Image Segmentation
Authors:
Juzheng Miao,
Cheng Chen,
Keli Zhang,
Jie Chuai,
Quanzheng Li,
Pheng-Ann Heng
Abstract:
Semi-supervised learning (SSL) has achieved notable progress in medical image segmentation. To achieve effective SSL, a model needs to be able to efficiently learn from limited labeled data and effectively exploiting knowledge from abundant unlabeled data. Recent developments in visual foundation models, such as the Segment Anything Model (SAM), have demonstrated remarkable adaptability with impro…
▽ More
Semi-supervised learning (SSL) has achieved notable progress in medical image segmentation. To achieve effective SSL, a model needs to be able to efficiently learn from limited labeled data and effectively exploiting knowledge from abundant unlabeled data. Recent developments in visual foundation models, such as the Segment Anything Model (SAM), have demonstrated remarkable adaptability with improved sample efficiency. To harness the power of foundation models for application in SSL, we propose a cross prompting consistency method with segment anything model (CPC-SAM) for semi-supervised medical image segmentation. Our method employs SAM's unique prompt design and innovates a cross-prompting strategy within a dual-branch framework to automatically generate prompts and supervisions across two decoder branches, enabling effectively learning from both scarce labeled and valuable unlabeled data. We further design a novel prompt consistency regularization, to reduce the prompt position sensitivity and to enhance the output invariance under different prompts. We validate our method on two medical image segmentation tasks. The extensive experiments with different labeled-data ratios and modalities demonstrate the superiority of our proposed method over the state-of-the-art SSL methods, with more than 9% Dice improvement on the breast cancer segmentation task.
△ Less
Submitted 7 July, 2024;
originally announced July 2024.
-
FM-OSD: Foundation Model-Enabled One-Shot Detection of Anatomical Landmarks
Authors:
Juzheng Miao,
Cheng Chen,
Keli Zhang,
Jie Chuai,
Quanzheng Li,
Pheng-Ann Heng
Abstract:
One-shot detection of anatomical landmarks is gaining significant attention for its efficiency in using minimal labeled data to produce promising results. However, the success of current methods heavily relies on the employment of extensive unlabeled data to pre-train an effective feature extractor, which limits their applicability in scenarios where a substantial amount of unlabeled data is unava…
▽ More
One-shot detection of anatomical landmarks is gaining significant attention for its efficiency in using minimal labeled data to produce promising results. However, the success of current methods heavily relies on the employment of extensive unlabeled data to pre-train an effective feature extractor, which limits their applicability in scenarios where a substantial amount of unlabeled data is unavailable. In this paper, we propose the first foundation model-enabled one-shot landmark detection (FM-OSD) framework for accurate landmark detection in medical images by utilizing solely a single template image without any additional unlabeled data. Specifically, we use the frozen image encoder of visual foundation models as the feature extractor, and introduce dual-branch global and local feature decoders to increase the resolution of extracted features in a coarse to fine manner. The introduced feature decoders are efficiently trained with a distance-aware similarity learning loss to incorporate domain knowledge from the single template image. Moreover, a novel bidirectional matching strategy is developed to improve both robustness and accuracy of landmark detection in the case of scattered similarity map obtained by foundation models. We validate our method on two public anatomical landmark detection datasets. By using solely a single template image, our method demonstrates significant superiority over strong state-of-the-art one-shot landmark detection methods.
△ Less
Submitted 7 July, 2024;
originally announced July 2024.
-
UltraEdit: Instruction-based Fine-Grained Image Editing at Scale
Authors:
Haozhe Zhao,
Xiaojian Ma,
Liang Chen,
Shuzheng Si,
Rujie Wu,
Kaikai An,
Peiyu Yu,
Minjia Zhang,
Qing Li,
Baobao Chang
Abstract:
This paper presents UltraEdit, a large-scale (approximately 4 million editing samples), automatically generated dataset for instruction-based image editing. Our key idea is to address the drawbacks in existing image editing datasets like InstructPix2Pix and MagicBrush, and provide a systematic approach to producing massive and high-quality image editing samples. UltraEdit offers several distinct a…
▽ More
This paper presents UltraEdit, a large-scale (approximately 4 million editing samples), automatically generated dataset for instruction-based image editing. Our key idea is to address the drawbacks in existing image editing datasets like InstructPix2Pix and MagicBrush, and provide a systematic approach to producing massive and high-quality image editing samples. UltraEdit offers several distinct advantages: 1) It features a broader range of editing instructions by leveraging the creativity of large language models (LLMs) alongside in-context editing examples from human raters; 2) Its data sources are based on real images, including photographs and artworks, which provide greater diversity and reduced bias compared to datasets solely generated by text-to-image models; 3) It also supports region-based editing, enhanced by high-quality, automatically produced region annotations. Our experiments show that canonical diffusion-based editing baselines trained on UltraEdit set new records on MagicBrush and Emu-Edit benchmarks. Our analysis further confirms the crucial role of real image anchors and region-based editing data. The dataset, code, and models can be found in https://ultra-editing.github.io.
△ Less
Submitted 7 July, 2024;
originally announced July 2024.
-
On Evaluating The Performance of Watermarked Machine-Generated Texts Under Adversarial Attacks
Authors:
Zesen Liu,
Tianshuo Cong,
Xinlei He,
Qi Li
Abstract:
Large Language Models (LLMs) excel in various applications, including text generation and complex tasks. However, the misuse of LLMs raises concerns about the authenticity and ethical implications of the content they produce, such as deepfake news, academic fraud, and copyright infringement. Watermarking techniques, which embed identifiable markers in machine-generated text, offer a promising solu…
▽ More
Large Language Models (LLMs) excel in various applications, including text generation and complex tasks. However, the misuse of LLMs raises concerns about the authenticity and ethical implications of the content they produce, such as deepfake news, academic fraud, and copyright infringement. Watermarking techniques, which embed identifiable markers in machine-generated text, offer a promising solution to these issues by allowing for content verification and origin tracing. Unfortunately, the robustness of current LLM watermarking schemes under potential watermark removal attacks has not been comprehensively explored.
In this paper, to fill this gap, we first systematically comb the mainstream watermarking schemes and removal attacks on machine-generated texts, and then we categorize them into pre-text (before text generation) and post-text (after text generation) classes so that we can conduct diversified analyses. In our experiments, we evaluate eight watermarks (five pre-text, three post-text) and twelve attacks (two pre-text, ten post-text) across 87 scenarios. Evaluation results indicate that (1) KGW and Exponential watermarks offer high text quality and watermark retention but remain vulnerable to most attacks; (2) Post-text attacks are found to be more efficient and practical than pre-text attacks; (3) Pre-text watermarks are generally more imperceptible, as they do not alter text fluency, unlike post-text watermarks; (4) Additionally, combined attack methods can significantly increase effectiveness, highlighting the need for more robust watermarking solutions. Our study underscores the vulnerabilities of current techniques and the necessity for developing more resilient schemes.
△ Less
Submitted 5 July, 2024;
originally announced July 2024.
-
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
Authors:
Sibo Yi,
Yule Liu,
Zhen Sun,
Tianshuo Cong,
Xinlei He,
Jiaxing Song,
Ke Xu,
Qi Li
Abstract:
Large Language Models (LLMs) have performed exceptionally in various text-generative tasks, including question answering, translation, code completion, etc. However, the over-assistance of LLMs has raised the challenge of "jailbreaking", which induces the model to generate malicious responses against the usage policy and society by designing adversarial prompts. With the emergence of jailbreak att…
▽ More
Large Language Models (LLMs) have performed exceptionally in various text-generative tasks, including question answering, translation, code completion, etc. However, the over-assistance of LLMs has raised the challenge of "jailbreaking", which induces the model to generate malicious responses against the usage policy and society by designing adversarial prompts. With the emergence of jailbreak attack methods exploiting different vulnerabilities in LLMs, the corresponding safety alignment measures are also evolving. In this paper, we propose a comprehensive and detailed taxonomy of jailbreak attack and defense methods. For instance, the attack methods are divided into black-box and white-box attacks based on the transparency of the target model. Meanwhile, we classify defense methods into prompt-level and model-level defenses. Additionally, we further subdivide these attack and defense methods into distinct sub-classes and present a coherent diagram illustrating their relationships. We also conduct an investigation into the current evaluation methods and compare them from different perspectives. Our findings aim to inspire future research and practical implementations in safeguarding LLMs against adversarial attacks. Above all, although jailbreak remains a significant concern within the community, we believe that our work enhances the understanding of this domain and provides a foundation for developing more secure LLMs.
△ Less
Submitted 5 July, 2024;
originally announced July 2024.
-
Measurement Embedded Schrödinger Bridge for Inverse Problems
Authors:
Yuang Wang,
Pengfei Jin,
Siyeop Yoon,
Matthew Tivnan,
Quanzheng Li,
Li Zhang,
Dufan Wu
Abstract:
Score-based diffusion models are frequently employed as structural priors in inverse problems. However, their iterative denoising process, initiated from Gaussian noise, often results in slow inference speeds. The Image-to-Image Schrödinger Bridge (I$^2$SB), which begins with the corrupted image, presents a promising alternative as a prior for addressing inverse problems. In this work, we introduc…
▽ More
Score-based diffusion models are frequently employed as structural priors in inverse problems. However, their iterative denoising process, initiated from Gaussian noise, often results in slow inference speeds. The Image-to-Image Schrödinger Bridge (I$^2$SB), which begins with the corrupted image, presents a promising alternative as a prior for addressing inverse problems. In this work, we introduce the Measurement Embedded Schrödinger Bridge (MESB). MESB establishes Schrödinger Bridges between the distribution of corrupted images and the distribution of clean images given observed measurements. Based on optimal transport theory, we derive the forward and backward processes of MESB. Through validation on diverse inverse problems, our proposed approach exhibits superior performance compared to existing Schrödinger Bridge-based inverse problems solvers in both visual quality and quantitative metrics.
△ Less
Submitted 22 May, 2024;
originally announced July 2024.
-
Raw Text is All you Need: Knowledge-intensive Multi-turn Instruction Tuning for Large Language Model
Authors:
Xia Hou,
Qifeng Li,
Jian Yang,
Tongliang Li,
Linzheng Chai,
Xianjie Wu,
Hangyuan Ji,
Zhoujun Li,
Jixuan Nie,
Jingbo Dun,
Wenfeng Song
Abstract:
Instruction tuning as an effective technique aligns the outputs of large language models (LLMs) with human preference. But how to generate the seasonal multi-turn dialogues from raw documents for instruction tuning still requires further exploration. In this paper, we present a novel framework named R2S that leverages the CoD-Chain of Dialogue logic to guide large language models (LLMs) in generat…
▽ More
Instruction tuning as an effective technique aligns the outputs of large language models (LLMs) with human preference. But how to generate the seasonal multi-turn dialogues from raw documents for instruction tuning still requires further exploration. In this paper, we present a novel framework named R2S that leverages the CoD-Chain of Dialogue logic to guide large language models (LLMs) in generating knowledge-intensive multi-turn dialogues for instruction tuning. By integrating raw documents from both open-source datasets and domain-specific web-crawled documents into a benchmark K-BENCH, we cover diverse areas such as Wikipedia (English), Science (Chinese), and Artifacts (Chinese). Our approach first decides the logic flow of the current dialogue and then prompts LLMs to produce key phrases for sourcing relevant response content. This methodology enables the creation of the G I NSTRUCT instruction dataset, retaining raw document knowledge within dialoguestyle interactions. Utilizing this dataset, we fine-tune GLLM, a model designed to transform raw documents into structured multi-turn dialogues, thereby injecting comprehensive domain knowledge into the SFT model for enhanced instruction tuning. This work signifies a stride towards refining the adaptability and effectiveness of LLMs in processing and generating more accurate, contextually nuanced responses across various fields.
△ Less
Submitted 3 July, 2024;
originally announced July 2024.
-
Light-SLAM: A Robust Deep-Learning Visual SLAM System Based on LightGlue under Challenging Lighting Conditions
Authors:
Zhiqi Zhao,
Chang Wu,
Xiaotong Kong,
Zejie Lv,
Xiaoqi Du,
Qiyan Li
Abstract:
Simultaneous Localization and Mapping (SLAM) has become a critical technology for intelligent transportation systems and autonomous robots and is widely used in autonomous driving. However, traditional manual feature-based methods in challenging lighting environments make it difficult to ensure robustness and accuracy. Some deep learning-based methods show potential but still have significant draw…
▽ More
Simultaneous Localization and Mapping (SLAM) has become a critical technology for intelligent transportation systems and autonomous robots and is widely used in autonomous driving. However, traditional manual feature-based methods in challenging lighting environments make it difficult to ensure robustness and accuracy. Some deep learning-based methods show potential but still have significant drawbacks. To address this problem, we propose a novel hybrid system for visual SLAM based on the LightGlue deep learning network. It uses deep local feature descriptors to replace traditional hand-crafted features and a more efficient and accurate deep network to achieve fast and precise feature matching. Thus, we use the robustness of deep learning to improve the whole system. We have combined traditional geometry-based approaches to introduce a complete visual SLAM system for monocular, binocular, and RGB-D sensors. We thoroughly tested the proposed system on four public datasets: KITTI, EuRoC, TUM, and 4Season, as well as on actual campus scenes. The experimental results show that the proposed method exhibits better accuracy and robustness in adapting to low-light and strongly light-varying environments than traditional manual features and deep learning-based methods. It can also run on GPU in real time.
△ Less
Submitted 10 May, 2024;
originally announced July 2024.
-
RVISA: Reasoning and Verification for Implicit Sentiment Analysis
Authors:
Wenna Lai,
Haoran Xie,
Guandong Xu,
Qing Li
Abstract:
With an increasing social demand for fine-grained sentiment analysis (SA), implicit sentiment analysis (ISA) poses a significant challenge with the absence of salient cue words in expressions. It necessitates reliable reasoning to understand how the sentiment is aroused and thus determine implicit sentiments. In the era of Large Language Models (LLMs), Encoder-Decoder (ED) LLMs have gained popular…
▽ More
With an increasing social demand for fine-grained sentiment analysis (SA), implicit sentiment analysis (ISA) poses a significant challenge with the absence of salient cue words in expressions. It necessitates reliable reasoning to understand how the sentiment is aroused and thus determine implicit sentiments. In the era of Large Language Models (LLMs), Encoder-Decoder (ED) LLMs have gained popularity to serve as backbone models for SA applications, considering impressive text comprehension and reasoning ability among diverse tasks. On the other hand, Decoder-only (DO) LLMs exhibit superior natural language generation and in-context learning capabilities. However, their responses may contain misleading or inaccurate information. To identify implicit sentiment with reliable reasoning, this study proposes RVISA, a two-stage reasoning framework that harnesses the generation ability of DO LLMs and the reasoning ability of ED LLMs to train an enhanced reasoner. Specifically, we adopt three-hop reasoning prompting to explicitly furnish sentiment elements as cues. The generated rationales are utilized to fine-tune an ED LLM into a skilled reasoner. Additionally, we develop a straightforward yet effective verification mechanism to ensure the reliability of the reasoning learning. We evaluated the proposed method on two benchmark datasets and achieved state-of-the-art results in ISA performance.
△ Less
Submitted 2 July, 2024;
originally announced July 2024.
-
CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models
Authors:
Ying Nie,
Binwei Yan,
Tianyu Guo,
Hao Liu,
Haoyu Wang,
Wei He,
Binfan Zheng,
Weihao Wang,
Qiang Li,
Weijian Sun,
Yunhe Wang,
Dacheng Tao
Abstract:
Large language models (LLMs) have achieved remarkable performance on various NLP tasks, yet their potential in more challenging and domain-specific task, such as finance, has not been fully explored. In this paper, we present CFinBench: a meticulously crafted, the most comprehensive evaluation benchmark to date, for assessing the financial knowledge of LLMs under Chinese context. In practice, to b…
▽ More
Large language models (LLMs) have achieved remarkable performance on various NLP tasks, yet their potential in more challenging and domain-specific task, such as finance, has not been fully explored. In this paper, we present CFinBench: a meticulously crafted, the most comprehensive evaluation benchmark to date, for assessing the financial knowledge of LLMs under Chinese context. In practice, to better align with the career trajectory of Chinese financial practitioners, we build a systematic evaluation from 4 first-level categories: (1) Financial Subject: whether LLMs can memorize the necessary basic knowledge of financial subjects, such as economics, statistics and auditing. (2) Financial Qualification: whether LLMs can obtain the needed financial qualified certifications, such as certified public accountant, securities qualification and banking qualification. (3) Financial Practice: whether LLMs can fulfill the practical financial jobs, such as tax consultant, junior accountant and securities analyst. (4) Financial Law: whether LLMs can meet the requirement of financial laws and regulations, such as tax law, insurance law and economic law. CFinBench comprises 99,100 questions spanning 43 second-level categories with 3 question types: single-choice, multiple-choice and judgment. We conduct extensive experiments of 50 representative LLMs with various model size on CFinBench. The results show that GPT4 and some Chinese-oriented models lead the benchmark, with the highest average accuracy being 60.16%, highlighting the challenge presented by CFinBench. The dataset and evaluation code are available at https://cfinbench.github.io/.
△ Less
Submitted 2 July, 2024;
originally announced July 2024.
-
Looking From the Future: Multi-order Iterations Can Enhance Adversarial Attack Transferability
Authors:
Zijian Ying,
Qianmu Li,
Tao Wang,
Zhichao Lian,
Shunmei Meng,
Xuyun Zhang
Abstract:
Various methods try to enhance adversarial transferability by improving the generalization from different perspectives. In this paper, we rethink the optimization process and propose a novel sequence optimization concept, which is named Looking From the Future (LFF). LFF makes use of the original optimization process to refine the very first local optimization choice. Adapting the LFF concept to t…
▽ More
Various methods try to enhance adversarial transferability by improving the generalization from different perspectives. In this paper, we rethink the optimization process and propose a novel sequence optimization concept, which is named Looking From the Future (LFF). LFF makes use of the original optimization process to refine the very first local optimization choice. Adapting the LFF concept to the adversarial attack task, we further propose an LFF attack as well as an MLFF attack with better generalization ability. Furthermore, guiding with the LFF concept, we propose an $LLF^{\mathcal{N}}$ attack which entends the LFF attack to a multi-order attack, further enhancing the transfer attack ability. All our proposed methods can be directly applied to the iteration-based attack methods. We evaluate our proposed method on the ImageNet1k dataset by applying several SOTA adversarial attack methods under four kinds of tasks. Experimental results show that our proposed method can greatly enhance the attack transferability. Ablation experiments are also applied to verify the effectiveness of each component. The source code will be released after this paper is accepted.
△ Less
Submitted 1 July, 2024;
originally announced July 2024.
-
Universal Approximation Theory: The basic theory for large language models
Authors:
Wei Wang,
Qing Li
Abstract:
Language models have emerged as a critical area of focus in artificial intelligence, particularly with the introduction of groundbreaking innovations like ChatGPT. Large-scale Transformer networks have quickly become the leading approach for advancing natural language processing algorithms. Built on the Transformer architecture, these models enable interactions that closely mimic human communicati…
▽ More
Language models have emerged as a critical area of focus in artificial intelligence, particularly with the introduction of groundbreaking innovations like ChatGPT. Large-scale Transformer networks have quickly become the leading approach for advancing natural language processing algorithms. Built on the Transformer architecture, these models enable interactions that closely mimic human communication and, equipped with extensive knowledge, can even assist in guiding human tasks. Despite their impressive capabilities and growing complexity, a key question remains-the theoretical foundations of large language models (LLMs). What makes Transformer so effective for powering intelligent language applications, such as translation and coding? What underlies LLMs' ability for In-Context Learning (ICL)? How does the LoRA scheme enhance the fine-tuning of LLMs? And what supports the practicality of pruning LLMs? To address these critical questions and explore the technological strategies within LLMs, we leverage the Universal Approximation Theory (UAT) to offer a theoretical backdrop, shedding light on the mechanisms that underpin these advancements.
△ Less
Submitted 1 July, 2024;
originally announced July 2024.
-
Robust and Reliable Early-Stage Website Fingerprinting Attacks via Spatial-Temporal Distribution Analysis
Authors:
Xinhao Deng,
Qi Li,
Ke Xu
Abstract:
Website Fingerprinting (WF) attacks identify the websites visited by users by performing traffic analysis, compromising user privacy. Particularly, DL-based WF attacks demonstrate impressive attack performance. However, the effectiveness of DL-based WF attacks relies on the collected complete and pure traffic during the page loading, which impacts the practicality of these attacks. The WF performa…
▽ More
Website Fingerprinting (WF) attacks identify the websites visited by users by performing traffic analysis, compromising user privacy. Particularly, DL-based WF attacks demonstrate impressive attack performance. However, the effectiveness of DL-based WF attacks relies on the collected complete and pure traffic during the page loading, which impacts the practicality of these attacks. The WF performance is rather low under dynamic network conditions and various WF defenses, particularly when the analyzed traffic is only a small part of the complete traffic. In this paper, we propose Holmes, a robust and reliable early-stage WF attack. Holmes utilizes temporal and spatial distribution analysis of website traffic to effectively identify websites in the early stages of page loading. Specifically, Holmes develops adaptive data augmentation based on the temporal distribution of website traffic and utilizes a supervised contrastive learning method to extract the correlations between the early-stage traffic and the pre-collected complete traffic. Holmes accurately identifies traffic in the early stages of page loading by computing the correlation of the traffic with the spatial distribution information, which ensures robust and reliable detection according to early-stage traffic. We extensively evaluate Holmes using six datasets. Compared to nine existing DL-based WF attacks, Holmes improves the F1-score of identifying early-stage traffic by an average of 169.18%. Furthermore, we replay the traffic of visiting real-world dark web websites. Holmes successfully identifies dark web websites when the ratio of page loading on average is only 21.71%, with an average precision improvement of 169.36% over the existing WF attacks.
△ Less
Submitted 30 June, 2024;
originally announced July 2024.
-
Investigating and Mitigating the Multimodal Hallucination Snowballing in Large Vision-Language Models
Authors:
Weihong Zhong,
Xiaocheng Feng,
Liang Zhao,
Qiming Li,
Lei Huang,
Yuxuan Gu,
Weitao Ma,
Yuan Xu,
Bing Qin
Abstract:
Though advanced in understanding visual information with human languages, Large Vision-Language Models (LVLMs) still suffer from multimodal hallucinations. A natural concern is that during multimodal interaction, the generated hallucinations could influence the LVLMs' subsequent generation. Thus, we raise a question: When presented with a query relevant to the previously generated hallucination, w…
▽ More
Though advanced in understanding visual information with human languages, Large Vision-Language Models (LVLMs) still suffer from multimodal hallucinations. A natural concern is that during multimodal interaction, the generated hallucinations could influence the LVLMs' subsequent generation. Thus, we raise a question: When presented with a query relevant to the previously generated hallucination, will LVLMs be misled and respond incorrectly, even though the ground visual information exists? To answer this, we propose a framework called MMHalSnowball to evaluate LVLMs' behaviors when encountering generated hallucinations, where LVLMs are required to answer specific visual questions within a curated hallucinatory conversation. Crucially, our experiment shows that the performance of open-source LVLMs drops by at least $31\%$, indicating that LVLMs are prone to accept the generated hallucinations and make false claims that they would not have supported without distractions. We term this phenomenon Multimodal Hallucination Snowballing. To mitigate this, we further propose a training-free method called Residual Visual Decoding, where we revise the output distribution of LVLMs with the one derived from the residual visual input, providing models with direct access to the visual information. Experiments show that our method can mitigate more than $24\%$ of the snowballed multimodal hallucination while maintaining capabilities.
△ Less
Submitted 29 June, 2024;
originally announced July 2024.
-
OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
Authors:
Zihao Wang,
Shaofei Cai,
Zhancun Mu,
Haowei Lin,
Ceyao Zhang,
Xuejie Liu,
Qing Li,
Anji Liu,
Xiaojian Ma,
Yitao Liang
Abstract:
We present OmniJARVIS, a novel Vision-Language-Action (VLA) model for open-world instruction-following agents in open-world Minecraft. Compared to prior works that either emit textual goals to separate controllers or produce the control command directly, OmniJARVIS seeks a different path to ensure both strong reasoning and efficient decision-making capabilities via unified tokenization of multimod…
▽ More
We present OmniJARVIS, a novel Vision-Language-Action (VLA) model for open-world instruction-following agents in open-world Minecraft. Compared to prior works that either emit textual goals to separate controllers or produce the control command directly, OmniJARVIS seeks a different path to ensure both strong reasoning and efficient decision-making capabilities via unified tokenization of multimodal interaction data. First, we introduce a self-supervised approach to learn a behavior encoder that produces discretized tokens for behavior trajectories $τ$ = {$o_0$, $a_0$, $\dots$} and an imitation learning (IL) policy decoder conditioned on these tokens. These additional behavior tokens will be augmented to the vocabulary of pretrained Multimodal Language Models (MLMs). With this encoder, we then pack long-term multimodal interactions involving task instructions, memories, thoughts, observations, textual responses, behavior trajectories, etc. into unified token sequences and model them with autoregressive transformers. Thanks to the semantically meaningful behavior tokens, the resulting VLA model, OmniJARVIS, can reason (by producing chain-of-thoughts), plan, answer questions, and act (by producing behavior tokens for the IL policy decoder). OmniJARVIS demonstrates excellent performances on a comprehensive collection of atomic, programmatic, and open-ended tasks in open-world Minecraft. Our analysis further unveils the crucial design principles in interaction data formation, unified tokenization, and its scaling potentials.
△ Less
Submitted 27 June, 2024;
originally announced July 2024.
-
Zero-Query Adversarial Attack on Black-box Automatic Speech Recognition Systems
Authors:
Zheng Fang,
Tao Wang,
Lingchen Zhao,
Shenyi Zhang,
Bowen Li,
Yunjie Ge,
Qi Li,
Chao Shen,
Qian Wang
Abstract:
In recent years, extensive research has been conducted on the vulnerability of ASR systems, revealing that black-box adversarial example attacks pose significant threats to real-world ASR systems. However, most existing black-box attacks rely on queries to the target ASRs, which is impractical when queries are not permitted. In this paper, we propose ZQ-Attack, a transfer-based adversarial attack…
▽ More
In recent years, extensive research has been conducted on the vulnerability of ASR systems, revealing that black-box adversarial example attacks pose significant threats to real-world ASR systems. However, most existing black-box attacks rely on queries to the target ASRs, which is impractical when queries are not permitted. In this paper, we propose ZQ-Attack, a transfer-based adversarial attack on ASR systems in the zero-query black-box setting. Through a comprehensive review and categorization of modern ASR technologies, we first meticulously select surrogate ASRs of diverse types to generate adversarial examples. Following this, ZQ-Attack initializes the adversarial perturbation with a scaled target command audio, rendering it relatively imperceptible while maintaining effectiveness. Subsequently, to achieve high transferability of adversarial perturbations, we propose a sequential ensemble optimization algorithm, which iteratively optimizes the adversarial perturbation on each surrogate model, leveraging collaborative information from other models. We conduct extensive experiments to evaluate ZQ-Attack. In the over-the-line setting, ZQ-Attack achieves a 100% success rate of attack (SRoA) with an average signal-to-noise ratio (SNR) of 21.91dB on 4 online speech recognition services, and attains an average SRoA of 100% and SNR of 19.67dB on 16 open-source ASRs. For commercial intelligent voice control devices, ZQ-Attack also achieves a 100% SRoA with an average SNR of 15.77dB in the over-the-air setting.
△ Less
Submitted 27 June, 2024;
originally announced June 2024.
-
CMRxRecon2024: A Multi-Modality, Multi-View K-Space Dataset Boosting Universal Machine Learning for Accelerated Cardiac MRI
Authors:
Zi Wang,
Fanwen Wang,
Chen Qin,
Jun Lyu,
Ouyang Cheng,
Shuo Wang,
Yan Li,
Mengyao Yu,
Haoyu Zhang,
Kunyuan Guo,
Zhang Shi,
Qirong Li,
Ziqiang Xu,
Yajing Zhang,
Hao Li,
Sha Hua,
Binghua Chen,
Longyu Sun,
Mengting Sun,
Qin Li,
Ying-Hua Chu,
Wenjia Bai,
Jing Qin,
Xiahai Zhuang,
Claudia Prieto
, et al. (7 additional authors not shown)
Abstract:
Cardiac magnetic resonance imaging (MRI) has emerged as a clinically gold-standard technique for diagnosing cardiac diseases, thanks to its ability to provide diverse information with multiple modalities and anatomical views. Accelerated cardiac MRI is highly expected to achieve time-efficient and patient-friendly imaging, and then advanced image reconstruction approaches are required to recover h…
▽ More
Cardiac magnetic resonance imaging (MRI) has emerged as a clinically gold-standard technique for diagnosing cardiac diseases, thanks to its ability to provide diverse information with multiple modalities and anatomical views. Accelerated cardiac MRI is highly expected to achieve time-efficient and patient-friendly imaging, and then advanced image reconstruction approaches are required to recover high-quality, clinically interpretable images from undersampled measurements. However, the lack of publicly available cardiac MRI k-space dataset in terms of both quantity and diversity has severely hindered substantial technological progress, particularly for data-driven artificial intelligence. Here, we provide a standardized, diverse, and high-quality CMRxRecon2024 dataset to facilitate the technical development, fair evaluation, and clinical transfer of cardiac MRI reconstruction approaches, towards promoting the universal frameworks that enable fast and robust reconstructions across different cardiac MRI protocols in clinical practice. To the best of our knowledge, the CMRxRecon2024 dataset is the largest and most diverse publicly available cardiac k-space dataset. It is acquired from 330 healthy volunteers, covering commonly used modalities, anatomical views, and acquisition trajectories in clinical cardiac MRI workflows. Besides, an open platform with tutorials, benchmarks, and data processing tools is provided to facilitate data usage, advanced method development, and fair performance evaluation.
△ Less
Submitted 27 June, 2024;
originally announced June 2024.
-
Artificial Immune System of Secure Face Recognition Against Adversarial Attacks
Authors:
Min Ren,
Yunlong Wang,
Yuhao Zhu,
Yongzhen Huang,
Zhenan Sun,
Qi Li,
Tieniu Tan
Abstract:
Insect production for food and feed presents a promising supplement to ensure food safety and address the adverse impacts of agriculture on climate and environment in the future. However, optimisation is required for insect production to realise its full potential. This can be by targeted improvement of traits of interest through selective breeding, an approach which has so far been underexplored…
▽ More
Insect production for food and feed presents a promising supplement to ensure food safety and address the adverse impacts of agriculture on climate and environment in the future. However, optimisation is required for insect production to realise its full potential. This can be by targeted improvement of traits of interest through selective breeding, an approach which has so far been underexplored and underutilised in insect farming. Here we present a comprehensive review of the selective breeding framework in the context of insect production. We systematically evaluate adjustments of selective breeding techniques to the realm of insects and highlight the essential components integral to the breeding process. The discussion covers every step of a conventional breeding scheme, such as formulation of breeding objectives, phenotyping, estimation of genetic parameters and breeding values, selection of appropriate breeding strategies, and mitigation of issues associated with genetic diversity depletion and inbreeding. This review combines knowledge from diverse disciplines, bridging the gap between animal breeding, quantitative genetics, evolutionary biology, and entomology, offering an integrated view of the insect breeding research area and uniting knowledge which has previously remained scattered across diverse fields of expertise.
△ Less
Submitted 26 June, 2024;
originally announced June 2024.
-
Cross-Modal Spherical Aggregation for Weakly Supervised Remote Sensing Shadow Removal
Authors:
Kaichen Chi,
Wei Jing,
Junjie Li,
Qiang Li,
Qi Wang
Abstract:
Remote sensing shadow removal, which aims to recover contaminated surface information, is tricky since shadows typically display overwhelmingly low illumination intensities. In contrast, the infrared image is robust toward significant light changes, providing visual clues complementary to the visible image. Nevertheless, the existing methods ignore the collaboration between heterogeneous modalitie…
▽ More
Remote sensing shadow removal, which aims to recover contaminated surface information, is tricky since shadows typically display overwhelmingly low illumination intensities. In contrast, the infrared image is robust toward significant light changes, providing visual clues complementary to the visible image. Nevertheless, the existing methods ignore the collaboration between heterogeneous modalities, leading to undesired quality degradation. To fill this gap, we propose a weakly supervised shadow removal network with a spherical feature space, dubbed S2-ShadowNet, to explore the best of both worlds for visible and infrared modalities. Specifically, we employ a modal translation (visible-to-infrared) model to learn the cross-domain mapping, thus generating realistic infrared samples. Then, Swin Transformer is utilized to extract strong representational visible/infrared features. Simultaneously, the extracted features are mapped to the smooth spherical manifold, which alleviates the domain shift through regularization. Well-designed similarity loss and orthogonality loss are embedded into the spherical space, prompting the separation of private visible/infrared features and the alignment of shared visible/infrared features through constraints on both representation content and orientation. Such a manner encourages implicit reciprocity between modalities, thus providing a novel insight into shadow removal. Notably, ground truth is not available in practice, thus S2-ShadowNet is trained by cropping shadow and shadow-free patches from the shadow image itself, avoiding stereotypical and strict pair data acquisition. More importantly, we contribute a large-scale weakly supervised shadow removal benchmark, including 4000 shadow images with corresponding shadow masks.
△ Less
Submitted 25 June, 2024;
originally announced June 2024.
-
Retrieval Augmented Instruction Tuning for Open NER with Large Language Models
Authors:
Tingyu Xie,
Jian Zhang,
Yan Zhang,
Yuanyuan Liang,
Qi Li,
Hongwei Wang
Abstract:
The strong capability of large language models (LLMs) has been applied to information extraction (IE) through either retrieval augmented prompting or instruction tuning (IT). However, the best way to incorporate information with LLMs for IE remains an open question. In this paper, we explore Retrieval Augmented Instruction Tuning (RA-IT) for IE, focusing on the task of open named entity recognitio…
▽ More
The strong capability of large language models (LLMs) has been applied to information extraction (IE) through either retrieval augmented prompting or instruction tuning (IT). However, the best way to incorporate information with LLMs for IE remains an open question. In this paper, we explore Retrieval Augmented Instruction Tuning (RA-IT) for IE, focusing on the task of open named entity recognition (NER). Specifically, for each training sample, we retrieve semantically similar examples from the training dataset as the context and prepend them to the input of the original instruction. To evaluate our RA-IT approach more thoroughly, we construct a Chinese IT dataset for open NER and evaluate RA-IT in both English and Chinese scenarios. Experimental results verify the effectiveness of RA-IT across various data sizes and in both English and Chinese scenarios. We also conduct thorough studies to explore the impacts of various retrieval strategies in the proposed RA-IT framework. Code and data are available at: https://github.com/Emma1066/Retrieval-Augmented-IT-OpenNER
△ Less
Submitted 25 June, 2024;
originally announced June 2024.
-
Towards Comprehensive Preference Data Collection for Reward Modeling
Authors:
Yulan Hu,
Qingyang Li,
Sheng Ouyang,
Ge Chen,
Kaihui Chen,
Lijun Mei,
Xucheng Ye,
Fuzheng Zhang,
Yong Liu
Abstract:
Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment of large language models (LLMs) with human preferences, thereby enhancing the quality of responses generated. A critical component of RLHF is the reward model, which is trained on preference data and outputs a scalar reward during the inference stage. However, the collection of preference data still lacks thorough investig…
▽ More
Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment of large language models (LLMs) with human preferences, thereby enhancing the quality of responses generated. A critical component of RLHF is the reward model, which is trained on preference data and outputs a scalar reward during the inference stage. However, the collection of preference data still lacks thorough investigation. Recent studies indicate that preference data is collected either by AI or humans, where chosen and rejected instances are identified among pairwise responses. We question whether this process effectively filters out noise and ensures sufficient diversity in collected data. To address these concerns, for the first time, we propose a comprehensive framework for preference data collection, decomposing the process into four incremental steps: Prompt Generation, Response Generation, Response Filtering, and Human Labeling. This structured approach ensures the collection of high-quality preferences while reducing reliance on human labor. We conducted comprehensive experiments based on the data collected at different stages, demonstrating the effectiveness of the proposed data collection method.
△ Less
Submitted 24 June, 2024;
originally announced June 2024.
-
Imperative Learning: A Self-supervised Neural-Symbolic Learning Framework for Robot Autonomy
Authors:
Chen Wang,
Kaiyi Ji,
Junyi Geng,
Zhongqiang Ren,
Taimeng Fu,
Fan Yang,
Yifan Guo,
Haonan He,
Xiangyu Chen,
Zitong Zhan,
Qiwei Du,
Shaoshu Su,
Bowen Li,
Yuheng Qiu,
Yi Du,
Qihang Li,
Yifan Yang,
Xiao Lin,
Zhipeng Zhao
Abstract:
Data-driven methods such as reinforcement and imitation learning have achieved remarkable success in robot autonomy. However, their data-centric nature still hinders them from generalizing well to ever-changing environments. Moreover, collecting large datasets for robotic tasks is often impractical and expensive. To overcome these challenges, we introduce a new self-supervised neural-symbolic (NeS…
▽ More
Data-driven methods such as reinforcement and imitation learning have achieved remarkable success in robot autonomy. However, their data-centric nature still hinders them from generalizing well to ever-changing environments. Moreover, collecting large datasets for robotic tasks is often impractical and expensive. To overcome these challenges, we introduce a new self-supervised neural-symbolic (NeSy) computational framework, imperative learning (IL), for robot autonomy, leveraging the generalization abilities of symbolic reasoning. The framework of IL consists of three primary components: a neural module, a reasoning engine, and a memory system. We formulate IL as a special bilevel optimization (BLO), which enables reciprocal learning over the three modules. This overcomes the label-intensive obstacles associated with data-driven approaches and takes advantage of symbolic reasoning concerning logical reasoning, physical principles, geometric analysis, etc. We discuss several optimization techniques for IL and verify their effectiveness in five distinct robot autonomy tasks including path planning, rule induction, optimal control, visual odometry, and multi-robot routing. Through various experiments, we show that IL can significantly enhance robot autonomy capabilities and we anticipate that it will catalyze further research across diverse domains.
△ Less
Submitted 6 July, 2024; v1 submitted 23 June, 2024;
originally announced June 2024.
-
SALI: Short-term Alignment and Long-term Interaction Network for Colonoscopy Video Polyp Segmentation
Authors:
Qiang Hu,
Zhenyu Yi,
Ying Zhou,
Fang Peng,
Mei Liu,
Qiang Li,
Zhiwei Wang
Abstract:
Colonoscopy videos provide richer information in polyp segmentation for rectal cancer diagnosis. However, the endoscope's fast moving and close-up observing make the current methods suffer from large spatial incoherence and continuous low-quality frames, and thus yield limited segmentation accuracy. In this context, we focus on robust video polyp segmentation by enhancing the adjacent feature cons…
▽ More
Colonoscopy videos provide richer information in polyp segmentation for rectal cancer diagnosis. However, the endoscope's fast moving and close-up observing make the current methods suffer from large spatial incoherence and continuous low-quality frames, and thus yield limited segmentation accuracy. In this context, we focus on robust video polyp segmentation by enhancing the adjacent feature consistency and rebuilding the reliable polyp representation. To achieve this goal, we in this paper propose SALI network, a hybrid of Short-term Alignment Module (SAM) and Long-term Interaction Module (LIM). The SAM learns spatial-aligned features of adjacent frames via deformable convolution and further harmonizes them to capture more stable short-term polyp representation. In case of low-quality frames, the LIM stores the historical polyp representations as a long-term memory bank, and explores the retrospective relations to interactively rebuild more reliable polyp features for the current segmentation. Combing SAM and LIM, the SALI network of video segmentation shows a great robustness to the spatial variations and low-visual cues. Benchmark on the large-scale SUNSEG verifies the superiority of SALI over the current state-of-the-arts by improving Dice by 2.1%, 2.5%, 4.1% and 1.9%, for the four test sub-sets, respectively. Codes are at https://github.com/Scatteredrain/SALI.
△ Less
Submitted 19 June, 2024;
originally announced June 2024.
-
Biomedical Visual Instruction Tuning with Clinician Preference Alignment
Authors:
Hejie Cui,
Lingjun Mao,
Xin Liang,
Jieyu Zhang,
Hui Ren,
Quanzheng Li,
Xiang Li,
Carl Yang
Abstract:
Recent advancements in multimodal foundation models have showcased impressive capabilities in understanding and reasoning with visual and textual information. Adapting these foundation models trained for general usage to specialized domains like biomedicine requires large-scale domain-specific instruction datasets. While existing works have explored curating such datasets automatically, the result…
▽ More
Recent advancements in multimodal foundation models have showcased impressive capabilities in understanding and reasoning with visual and textual information. Adapting these foundation models trained for general usage to specialized domains like biomedicine requires large-scale domain-specific instruction datasets. While existing works have explored curating such datasets automatically, the resultant datasets are not explicitly aligned with domain expertise. In this work, we propose a data-centric framework, Biomedical Visual Instruction Tuning with Clinician Preference Alignment (BioMed-VITAL), that incorporates clinician preferences into both stages of generating and selecting instruction data for tuning biomedical multimodal foundation models. First, during the generation stage, we prompt the GPT-4V generator with a diverse set of clinician-selected demonstrations for preference-aligned data candidate generation. Then, during the selection phase, we train a separate selection model, which explicitly distills clinician and policy-guided model preferences into a rating function to select high-quality data for medical instruction tuning. Results show that the model tuned with the instruction-following data from our method demonstrates a significant improvement in open visual chat (18.5% relatively) and medical VQA (win rate up to 81.73%). Our instruction-following data and models are available at BioMed-VITAL.github.io.
△ Less
Submitted 16 July, 2024; v1 submitted 18 June, 2024;
originally announced June 2024.
-
AutoFirm: Automatically Identifying Reused Libraries inside IoT Firmware at Large-Scale
Authors:
YongLe Chen,
Feng Ma,
Ying Zhang,
YongZhong He,
Haining Wang,
Qiang Li
Abstract:
The Internet of Things (IoT) has become indispensable to our daily lives and work. Unfortunately, developers often reuse software libraries in the IoT firmware, leading to a major security concern. If vulnerabilities or insecure versions of these libraries go unpatched, a massive number of IoT devices can be impacted. In this paper, we propose the AutoFirm, an automated tool for detecting reused l…
▽ More
The Internet of Things (IoT) has become indispensable to our daily lives and work. Unfortunately, developers often reuse software libraries in the IoT firmware, leading to a major security concern. If vulnerabilities or insecure versions of these libraries go unpatched, a massive number of IoT devices can be impacted. In this paper, we propose the AutoFirm, an automated tool for detecting reused libraries in IoT firmware at a large scale. Specifically, AutoFirm leverages the syntax information (library name and version) to determine whether IoT firmware reuses the libraries. We conduct a large-scale empirical study of reused libraries of IoT firmware, investigating more than 6,900+ firmware and 2,700+ distinct vulnerabilities affecting 11,300+ vulnerable versions from 349 open-source software libraries. Leveraging this diverse information set, we conduct a qualitative assessment of vulnerable library versions to understand security gaps and the misplaced trust of libraries in IoT firmware. Our research reveals that: manufacturers neglected to update outdated libraries for IoT firmware in 67.3\% of cases; on average, outdated libraries persisted for over 1.34 years prior to remediation; vulnerabilities of software libraries have posed server threats to widespread IoT devices.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions
Authors:
Xunzhi Wang,
Zhuowei Zhang,
Qiongyu Li,
Gaonan Chen,
Mengting Hu,
Zhiyu li,
Bitong Luo,
Hang Gao,
Zhixin Han,
Haotian Wang
Abstract:
The rapid development of large language models (LLMs) has shown promising practical results. However, their low interpretability often leads to errors in unforeseen circumstances, limiting their utility. Many works have focused on creating comprehensive evaluation systems, but previous benchmarks have primarily assessed problem-solving abilities while neglecting the response's uncertainty, which m…
▽ More
The rapid development of large language models (LLMs) has shown promising practical results. However, their low interpretability often leads to errors in unforeseen circumstances, limiting their utility. Many works have focused on creating comprehensive evaluation systems, but previous benchmarks have primarily assessed problem-solving abilities while neglecting the response's uncertainty, which may result in unreliability. Recent methods for measuring LLM reliability are resource-intensive and unable to test black-box models. To address this, we propose UBENCH, a comprehensive benchmark for evaluating LLM reliability. UBENCH includes 3,978 multiple-choice questions covering knowledge, language, understanding, and reasoning abilities. Experimental results show that UBENCH has achieved state-of-the-art performance, while its single-sampling method significantly saves computational resources compared to baseline methods that require multiple samplings. Additionally, based on UBENCH, we evaluate the reliability of 15 popular LLMs, finding GLM4 to be the most outstanding, closely followed by GPT-4. We also explore the impact of Chain-of-Thought prompts, role-playing prompts, option order, and temperature on LLM reliability, analyzing the varying effects on different LLMs.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.