-
Identifying Smart Contract Security Issues in Code Snippets from Stack Overflow
Authors:
Jiachi Chen,
Chong Chen,
Jiang Hu,
John Grundy,
Yanlin Wang,
Ting Chen,
Zibin Zheng
Abstract:
Smart contract developers frequently seek solutions to developmental challenges on Q&A platforms such as Stack Overflow (SO). Although community responses often provide viable solutions, the embedded code snippets can also contain hidden vulnerabilities. Integrating such code directly into smart contracts may make them susceptible to malicious attacks. We conducted an online survey and received 74 responses from smart contract developers. The results of this survey indicate that the majority (86.4%) of participants do not sufficiently consider security when reusing SO code snippets. Despite the existence of various tools designed to detect vulnerabilities in smart contracts, these tools are typically developed for analyzing fully-completed smart contracts and thus are ineffective for analyzing typical code snippets as found on SO. We introduce SOChecker, the first tool designed to identify potential vulnerabilities in incomplete SO smart contract code snippets. SOChecker first leverages a fine-tuned Llama2 model for code completion, followed by the application of symbolic execution methods for vulnerability detection. Our experimental results, derived from a dataset comprising 897 code snippets collected from smart contract-related SO posts, demonstrate that SOChecker achieves an F1 score of 68.2%, greatly surpassing GPT-3.5 and GPT-4 (20.9% and 33.2% F1 Scores respectively). Our findings underscore the need to improve the security of code snippets from Q&A websites.
Submitted 18 July, 2024;
originally announced July 2024.
-
Retrieval-Augmented Generation for Natural Language Processing: A Survey
Authors:
Shangyu Wu,
Ying Xiong,
Yufei Cui,
Haolun Wu,
Can Chen,
Ye Yuan,
Lianming Huang,
Xue Liu,
Tei-Wei Kuo,
Nan Guan,
Chun Jason Xue
Abstract:
Large language models (LLMs) have demonstrated great success in various fields, benefiting from their huge number of parameters that store knowledge. However, LLMs still suffer from several key issues, such as hallucination problems, knowledge update issues, and a lack of domain-specific expertise. The appearance of retrieval-augmented generation (RAG), which leverages an external knowledge database to augment LLMs, makes up for those drawbacks of LLMs. This paper reviews all significant techniques of RAG, especially in the retriever and the retrieval fusions. Besides, tutorial code is provided for implementing the representative techniques in RAG. This paper further discusses the RAG training, including RAG with/without datastore update. Then, we introduce the application of RAG in representative natural language processing tasks and industrial scenarios. Finally, this paper discusses the future directions and challenges of RAG for promoting its development.
Submitted 18 July, 2024;
originally announced July 2024.
-
GenRC: Generative 3D Room Completion from Sparse Image Collections
Authors:
Ming-Feng Li,
Yueh-Feng Ku,
Hong-Xuan Yen,
Chi Liu,
Yu-Lun Liu,
Albert Y. C. Chen,
Cheng-Hao Kuo,
Min Sun
Abstract:
Sparse RGBD scene completion is a challenging task especially when considering consistent textures and geometries throughout the entire scene. Different from existing solutions that rely on human-designed text prompts or predefined camera trajectories, we propose GenRC, an automated training-free pipeline to complete a room-scale 3D mesh with high-fidelity textures. To achieve this, we first project the sparse RGBD images to a highly incomplete 3D mesh. Instead of iteratively generating novel views to fill in the void, we utilized our proposed E-Diffusion to generate a view-consistent panoramic RGBD image which ensures global geometry and appearance consistency. Furthermore, we maintain the input-output scene stylistic consistency through textual inversion to replace human-designed text prompts. To bridge the domain gap among datasets, E-Diffusion leverages models trained on large-scale datasets to generate diverse appearances. GenRC outperforms state-of-the-art methods under most appearance and geometric metrics on ScanNet and ARKitScenes datasets, even though GenRC is not trained on these datasets nor using predefined camera trajectories. Project page: https://minfenli.github.io/GenRC
Submitted 17 July, 2024;
originally announced July 2024.
-
A Study on the Situation of Connected Car Patent Portfolios
Authors:
Abel C. H. Chen,
Chia-Shen Chang
Abstract:
In recent years, the countries of the world have drafted the specifications of connected cars; for instance, the Security Credential Management System (SCMS) has been proposed by United States Department of Transportation (USDOT), and the Cooperative Intelligent Transportation System (C-ITS) Credential Management System (CCMS) has been proposed by European Union (EU). Therefore, several companies have developed the technology and productions of connected cars based on specifications, and connected car patent portfolios have been proactively performed. Therefore, this study uses Patent Search System (PSS) to find and analyze the contents of patents for obtaining the innovation reports of connected cars according to patents. This study considers the single-factor and two-factors to analyze the relationships of annuals, major technology leaders, major market leaders, and major technology and applications for exploring the patent portfolios of technology leaders and market leaders in connected cars.
Submitted 26 June, 2024;
originally announced July 2024.
-
Efficient and Flexible Differet-Radix Montgomery Modular Multiplication for Hardware Implementation
Authors:
Yuxuan Zhang,
Hua Guo,
Chen Chen,
Yewei Guan,
Xiyong Zhang,
Zhenyu Guan
Abstract:
Montgomery modular multiplication is widely used in public key cryptosystems (PKC) and directly affects the efficiency of upper systems. However, the modulus is getting larger due to the increasing demand for security, which results in a heavy computing cost. High-performance implementation of Montgomery modular multiplication is urgently required to ensure highly efficient operations in PKC. However, existing high-speed implementations still need a large amount of redundant computing to simplify the intermediate result. Support for the redundant representation is extremely limited in Montgomery modular multiplication. In this paper, we propose an efficient parallel variant of iterative Montgomery modular multiplication, called DRMMM, that allows the quotient to be computed in multiple iterations. In this variant, terms in the intermediate result and the quotient in each iteration are computed in different radix such that computation of the quotient can be pipelined. Based on the proposed variant, we also design a high-performance hardware implementation architecture for faster operation. In the architecture, the intermediate result in every iteration is denoted as three parts to free it from redundant computations. Finally, to support FPGA-based systems, we design operators based on the FPGA underlying architecture for better area-time performance. The result of implementation and experiment shows that our method reduces the output latency by 38.3% compared with the fastest design on FPGA.
Submitted 17 July, 2024;
originally announced July 2024.
-
ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference
Authors:
Mengcheng Lan,
Chaofeng Chen,
Yiping Ke,
Xinjiang Wang,
Litong Feng,
Wayne Zhang
Abstract:
Despite the success of large-scale pretrained Vision-Language Models (VLMs) especially CLIP in various open-vocabulary tasks, their application to semantic segmentation remains challenging, producing noisy segmentation maps with mis-segmented regions. In this paper, we carefully re-investigate the architecture of CLIP, and identify residual connections as the primary source of noise that degrades segmentation quality. With a comparative analysis of statistical properties in the residual connection and the attention output across different pretrained models, we discover that CLIP's image-text contrastive training paradigm emphasizes global features at the expense of local discriminability, leading to noisy segmentation results. In response, we propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation. We introduce three simple modifications to the final layer: removing the residual connection, implementing the self-self attention, and discarding the feed-forward network. ClearCLIP consistently generates clearer and more accurate segmentation maps and outperforms existing approaches across multiple benchmarks, affirming the significance of our discoveries.
Submitted 17 July, 2024;
originally announced July 2024.
-
Frequency Guidance Matters: Skeletal Action Recognition by Frequency-Aware Mixed Transformer
Authors:
Wenhan Wu,
Ce Zheng,
Zihao Yang,
Chen Chen,
Srijan Das,
Aidong Lu
Abstract:
Recently, transformers have demonstrated great potential for modeling long-term dependencies from skeleton sequences and thereby gained ever-increasing attention in skeleton action recognition. However, the existing transformer-based approaches heavily rely on the naive attention mechanism for capturing the spatiotemporal features, which falls short in learning discriminative representations that exhibit similar motion patterns. To address this challenge, we introduce the Frequency-aware Mixed Transformer (FreqMixFormer), specifically designed for recognizing similar skeletal actions with subtle discriminative motions. First, we introduce a frequency-aware attention module to unweave skeleton frequency representations by embedding joint features into frequency attention maps, aiming to distinguish the discriminative movements based on their frequency coefficients. Subsequently, we develop a mixed transformer architecture to incorporate spatial features with frequency features to model the comprehensive frequency-spatial patterns. Additionally, a temporal transformer is proposed to extract the global correlations across frames. Extensive experiments show that FreqMixFormer outperforms SOTA on 3 popular skeleton action recognition datasets, including NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets.
Submitted 17 July, 2024;
originally announced July 2024.
-
GPT-4V Cannot Generate Radiology Reports Yet
Authors:
Yuyang Jiang,
Chacha Chen,
Dang Nguyen,
Benjamin M. Mervak,
Chenhao Tan
Abstract:
GPT-4V's purported strong multimodal abilities raise interest in using it to automate radiology report writing, but thorough evaluations are lacking. In this work, we perform a systematic evaluation of GPT-4V in generating radiology reports on two chest X-ray report datasets: MIMIC-CXR and IU X-Ray. We attempt to directly generate reports using GPT-4V through different prompting strategies and find that it fails terribly in both lexical metrics and clinical efficacy metrics. To understand the low performance, we decompose the task into two steps: 1) the medical image reasoning step of predicting medical condition labels from images; and 2) the report synthesis step of generating reports from (groundtruth) conditions. We show that GPT-4V's performance in image reasoning is consistently low across different prompts. In fact, the distributions of model-predicted labels remain constant regardless of which groundtruth conditions are present on the image, suggesting that the model is not interpreting chest X-rays meaningfully. Even when given groundtruth conditions in report synthesis, its generated reports are less correct and less natural-sounding than a finetuned LLaMA-2. Altogether, our findings cast doubt on the viability of using GPT-4V in a radiology workflow.
Submitted 16 July, 2024;
originally announced July 2024.
-
A Benchmark for Fairness-Aware Graph Learning
Authors:
Yushun Dong,
Song Wang,
Zhenyu Lei,
Zaiyi Zheng,
Jing Ma,
Chen Chen,
Jundong Li
Abstract:
Fairness-aware graph learning has gained increasing attention in recent years. Nevertheless, there lacks a comprehensive benchmark to evaluate and compare different fairness-aware graph learning methods, which blocks practitioners from choosing appropriate ones for broader real-world applications. In this paper, we present an extensive benchmark on ten representative fairness-aware graph learning methods. Specifically, we design a systematic evaluation protocol and conduct experiments on seven real-world datasets to evaluate these methods from multiple perspectives, including group fairness, individual fairness, the balance between different fairness criteria, and computational efficiency. Our in-depth analysis reveals key insights into the strengths and limitations of existing methods. Additionally, we provide practical guidance for applying fairness-aware graph learning methods in applications. To the best of our knowledge, this work serves as an initial step towards comprehensively understanding representative fairness-aware graph learning methods to facilitate future advancements in this area.
Submitted 16 July, 2024;
originally announced July 2024.
-
SemiAdv: Query-Efficient Black-Box Adversarial Attack with Unlabeled Images
Authors:
Mingyuan Fan,
Yang Liu,
Cen Chen,
Ximeng Liu
Abstract:
Adversarial attack has garnered considerable attention due to its profound implications for the secure deployment of robots in sensitive security scenarios. To potentially push for advances in the field, this paper studies the adversarial attack in the black-box setting and proposes an unlabeled data-driven adversarial attack method, called SemiAdv. Specifically, SemiAdv achieves the following breakthroughs compared with previous works. First, by introducing the semi-supervised learning technique into the adversarial attack, SemiAdv substantially decreases the number of queries required for generating adversarial samples. On average, SemiAdv only needs to query a few hundred times to launch an effective attack with more than 90% success rate. Second, many existing black-box adversarial attacks require massive labeled data to mitigate the difference between the local substitute model and the remote target model for a good attack performance. In contrast, SemiAdv relaxes this limitation and is capable of utilizing unlabeled raw data to launch an effective attack. Finally, our experiments show that SemiAdv saves up to 12x query accesses for generating adversarial samples while maintaining a competitive attack success rate compared with state-of-the-art attacks.
Submitted 12 July, 2024;
originally announced July 2024.
-
Omni-Dimensional Frequency Learner for General Time Series Analysis
Authors:
Xianing Chen,
Hanting Chen,
Hailin Hu
Abstract:
Frequency domain representation of time series features offers a concise representation for handling real-world time series data with inherent complexity and dynamic nature. However, current frequency-based methods with complex operations still fall short of state-of-the-art time domain methods for general time series analysis. In this work, we present the Omni-Dimensional Frequency Learner (ODFL) model, based on an in-depth analysis of three aspects of the spectrum feature: the channel redundancy property among the frequency dimension, the sparse and un-salient frequency energy distribution among the frequency dimension, and the semantic diversity among the variable dimension. Technically, our method is composed of a semantic-adaptive global filter with attention to the un-salient frequency bands and partial operation among the channel dimension. Empirical results show that ODFL achieves consistent state-of-the-art performance in five mainstream time series analysis tasks, including short- and long-term forecasting, imputation, classification, and anomaly detection, offering a promising foundation for time series analysis.
Submitted 14 July, 2024;
originally announced July 2024.
-
Defending Against Repetitive-based Backdoor Attacks on Semi-supervised Learning through Lens of Rate-Distortion-Perception Trade-off
Authors:
Cheng-Yi Lee,
Ching-Chia Kao,
Cheng-Han Yeh,
Chun-Shien Lu,
Chia-Mu Yu,
Chu-Song Chen
Abstract:
Semi-supervised learning (SSL) has achieved remarkable performance with a small fraction of labeled data by leveraging vast amounts of unlabeled data from the Internet. However, this large pool of untrusted data is extremely vulnerable to data poisoning, leading to potential backdoor attacks. Current backdoor defenses are not yet effective against such a vulnerability in SSL. In this study, we propose a novel method, Unlabeled Data Purification (UPure), to disrupt the association between trigger patterns and target classes by introducing perturbations in the frequency domain. By leveraging the Rate-Distortion-Perception (RDP) trade-off, we further identify the frequency band, where the perturbations are added, and justify this selection. Notably, UPure purifies poisoned unlabeled data without the need of extra clean labeled data. Extensive experiments on four benchmark datasets and five SSL algorithms demonstrate that UPure effectively reduces the attack success rate from 99.78% to 0% while maintaining model accuracy.
Submitted 14 July, 2024;
originally announced July 2024.
-
3DEgo: 3D Editing on the Go!
Authors:
Umar Khalid,
Hasan Iqbal,
Azib Farooq,
Jing Hua,
Chen Chen
Abstract:
We introduce 3DEgo to address a novel problem of directly synthesizing photorealistic 3D scenes from monocular videos guided by textual prompts. Conventional methods construct a text-conditioned 3D scene through a three-stage process, involving pose estimation using Structure-from-Motion (SfM) libraries like COLMAP, initializing the 3D model with unedited images, and iteratively updating the dataset with edited images to achieve a 3D scene with text fidelity. Our framework streamlines the conventional multi-stage 3D editing process into a single-stage workflow by overcoming the reliance on COLMAP and eliminating the cost of model initialization. We apply a diffusion model to edit video frames prior to 3D scene creation by incorporating our designed noise blender module for enhancing multi-view editing consistency, a step that does not require additional training or fine-tuning of T2I diffusion models. 3DEgo utilizes 3D Gaussian Splatting to create 3D scenes from the multi-view consistent edited frames, capitalizing on the inherent temporal continuity and explicit point cloud data. 3DEgo demonstrates remarkable editing precision, speed, and adaptability across a variety of video sources, as validated by extensive evaluations on six datasets, including our own prepared GS25 dataset. Project Page: https://3dego.github.io/
Submitted 14 July, 2024;
originally announced July 2024.
-
LeRF: Learning Resampling Function for Adaptive and Efficient Image Interpolation
Authors:
Jiacheng Li,
Chang Chen,
Fenglong Song,
Youliang Yan,
Zhiwei Xiong
Abstract:
Image resampling is a basic technique that is widely employed in daily applications, such as camera photo editing. Recent deep neural networks (DNNs) have made impressive progress in performance by introducing learned data priors. Still, these methods are not the perfect substitute for interpolation, due to the drawbacks in efficiency and versatility. In this work, we propose a novel method of Learning Resampling Function (termed LeRF), which takes advantage of both the structural priors learned by DNNs and the locally continuous assumption of interpolation. Specifically, LeRF assigns spatially varying resampling functions to input image pixels and learns to predict the hyper-parameters that determine the shapes of these resampling functions with a neural network. Based on the formulation of LeRF, we develop a family of models, including both efficiency-orientated and performance-orientated ones. To achieve interpolation-level efficiency, we adopt look-up tables (LUTs) to accelerate the inference of the learned neural network. Furthermore, we design a directional ensemble strategy and edge-sensitive indexing patterns to better capture local structures. On the other hand, to obtain DNN-level performance, we propose an extension of LeRF to enable it in cooperation with pre-trained upsampling models for cascaded resampling. Extensive experiments show that the efficiency-orientated version of LeRF runs as fast as interpolation, generalizes well to arbitrary transformations, and outperforms interpolation significantly, e.g., up to 3dB PSNR gain over Bicubic for x2 upsampling on Manga109. Besides, the performance-orientated version of LeRF reaches comparable performance with existing DNNs at much higher efficiency, e.g., less than 25% running time on a desktop GPU.
Submitted 13 July, 2024;
originally announced July 2024.
-
Performance Comparison of Various Modes of Advanced Encryption Standard
Authors:
Abel C. H. Chen
Abstract:
With the maturation of quantum computing technology, many cryptographic methods are gradually facing threats from quantum computing. Although the Grover algorithm can accelerate search speeds, current research indicates that the Advanced Encryption Standard (AES) method can still enhance security by increasing the length of the secret key. However, the AES method involves multiple modes in implementation, and not all modes are secure. Therefore, this study proposes a normalized Gini impurity (NGI) to verify the security of each mode, using encrypted images as a case study for empirical analysis. Furthermore, this study primarily compares the Electronic Codebook (ECB) mode, Cipher Block Chaining (CBC) mode, Counter (CTR) mode, Counter with CBC-Message Authentication Code (MAC) (CCM) mode, and Galois Counter Mode (GCM).
Submitted 21 May, 2024;
originally announced July 2024.
-
Heterogeneous Subgraph Network with Prompt Learning for Interpretable Depression Detection on Social Media
Authors:
Chen Chen,
Mingwei Li,
Fenghuan Li,
Haopeng Chen,
Yuankun Lin
Abstract:
Massive social media data can reflect people's authentic thoughts, emotions, communication, etc., and therefore can be analyzed for early detection of mental health problems such as depression. Existing works on early depression detection on social media lack interpretability and neglect the heterogeneity of social media data. Furthermore, they overlook the global interaction among users. To address these issues, we develop a novel method that leverages a Heterogeneous Subgraph Network with Prompt Learning (HSNPL) and contrastive learning mechanisms. Specifically, prompt learning is employed to map users' implicit psychological symbols with excellent interpretability while deep semantic and diverse behavioral features are incorporated by a heterogeneous information network. Then, the heterogeneous graph network with a dual attention mechanism is constructed to model the relationships among heterogeneous social information at the feature level. Furthermore, the heterogeneous subgraph network integrating subgraph attention and self-supervised contrastive learning is developed to explore complicated interactions among users and groups at the user level. Extensive experimental results demonstrate that our proposed method significantly outperforms state-of-the-art methods for depression detection on social media.
Submitted 12 July, 2024;
originally announced July 2024.
-
AUITestAgent: Automatic Requirements Oriented GUI Function Testing
Authors:
Yongxiang Hu,
Xuan Wang,
Yingchuan Wang,
Yu Zhang,
Shiyu Guo,
Chaoyi Chen,
Xin Wang,
Yangfan Zhou
Abstract:
The Graphical User Interface (GUI) is how users interact with mobile apps. To ensure it functions properly, testing engineers have to make sure it functions as intended, based on test requirements that are typically written in natural language. While widely adopted manual testing and script-based methods are effective, they demand substantial effort due to the vast number of GUI pages and rapid it…
▽ More
The Graphical User Interface (GUI) is how users interact with mobile apps. To ensure it functions properly, testing engineers have to make sure it functions as intended, based on test requirements that are typically written in natural language. While widely adopted manual testing and script-based methods are effective, they demand substantial effort due to the vast number of GUI pages and rapid iterations in modern mobile apps. This paper introduces AUITestAgent, the first automatic, natural language-driven GUI testing tool for mobile apps, capable of fully automating the entire process of GUI interaction and function verification. Since test requirements typically contain interaction commands and verification oracles. AUITestAgent can extract GUI interactions from test requirements via dynamically organized agents. Then, AUITestAgent employs a multi-dimensional data extraction strategy to retrieve data relevant to the test requirements from the interaction trace and perform verification. Experiments on customized benchmarks demonstrate that AUITestAgent outperforms existing tools in the quality of generated GUI interactions and achieved the accuracy of verifications of 94%. Moreover, field deployment in Meituan has shown AUITestAgent's practical usability, with it detecting 4 new functional bugs during 10 regression tests in two months.
Submitted 12 July, 2024;
originally announced July 2024.
-
UICrit: Enhancing Automated Design Evaluation with a UICritique Dataset
Authors:
Peitong Duan,
Chin-yi Chen,
Gang Li,
Bjoern Hartmann,
Yang Li
Abstract:
Automated UI evaluation can be beneficial for the design process; for example, to compare different UI designs, or conduct automated heuristic evaluation. LLM-based UI evaluation, in particular, holds the promise of generalizability to a wide variety of UI types and evaluation tasks. However, current LLM-based techniques do not yet match the performance of human evaluators. We hypothesize that automatic evaluation can be improved by collecting a targeted UI feedback dataset and then using this dataset to enhance the performance of general-purpose LLMs. We present a targeted dataset of 3,059 design critiques and quality ratings for 983 mobile UIs, collected from seven experienced designers. We carried out an in-depth analysis to characterize the dataset's features. We then applied this dataset to achieve a 55% performance gain in LLM-generated UI feedback via various few-shot and visual prompting techniques. We also discuss future applications of this dataset, including training a reward model for generative UI techniques, and fine-tuning a tool-agnostic multi-modal LLM that automates UI evaluation.
Submitted 15 July, 2024; v1 submitted 11 July, 2024;
originally announced July 2024.
-
GTA: A Benchmark for General Tool Agents
Authors:
Jize Wang,
Zerun Ma,
Yining Li,
Songyang Zhang,
Cailian Chen,
Kai Chen,
Xinyi Le
Abstract:
Significant focus has been placed on integrating large language models (LLMs) with various tools in developing general-purpose agents. This poses a challenge to LLMs' tool-use capabilities. However, there are evident gaps between existing tool-use evaluations and real-world scenarios. Current evaluations often use AI-generated queries, single-step tasks, dummy tools, and text-only interactions, failing to reveal the agents' real-world problem-solving abilities effectively. To address this, we propose GTA, a benchmark for General Tool Agents, featuring three main aspects: (i) Real user queries: human-written queries with simple real-world objectives but implicit tool-use, requiring the LLM to reason about the suitable tools and plan the solution steps. (ii) Real deployed tools: an evaluation platform equipped with tools across perception, operation, logic, and creativity categories to evaluate the agents' actual task execution performance. (iii) Real multimodal inputs: authentic image files, such as spatial scenes, web page screenshots, tables, code snippets, and printed/handwritten materials, used as the query contexts to align closely with real-world scenarios. We design 229 real-world tasks and executable tool chains to evaluate mainstream LLMs. Our findings show that real-world user queries are challenging for existing LLMs, with GPT-4 completing less than 50% of the tasks and most LLMs achieving below 25%. This evaluation reveals the bottlenecks in the tool-use capabilities of current LLMs in real-world scenarios, which provides future direction for advancing general-purpose tool agents. The code and dataset are available at https://github.com/open-compass/GTA.
Submitted 11 July, 2024;
originally announced July 2024.
-
Highway Networks for Improved Surface Reconstruction: The Role of Residuals and Weight Updates
Authors:
A. Noorizadegan,
Y. C. Hon,
D. L. Young,
C. S. Chen
Abstract:
Surface reconstruction from point clouds is a fundamental challenge in computer graphics and medical imaging. In this paper, we explore the application of advanced neural network architectures for the accurate and efficient reconstruction of surfaces from data points. We introduce a novel variant of the Highway network (Hw) called Square-Highway (SqrHw) within the context of multilayer perceptrons and investigate its performance alongside plain neural networks and a simplified Hw in various numerical examples. These examples include the reconstruction of simple and complex surfaces, such as spheres, human hands, and intricate models like the Stanford Bunny. We analyze the impact of factors such as the number of hidden layers, interior and exterior points, and data distribution on surface reconstruction quality. Our results show that the proposed SqrHw architecture outperforms other neural network configurations, achieving faster convergence and higher-quality surface reconstructions. Additionally, we demonstrate the SqrHw's ability to predict surfaces over missing data, a valuable feature for challenging applications like medical imaging. Furthermore, our study delves into further details, demonstrating that the proposed method based on highway networks yields more stable weight norms and backpropagation gradients compared to the Plain Network architecture. This research not only advances the field of computer graphics but also holds utility for other purposes such as function interpolation and physics-informed neural networks, which integrate multilayer perceptrons into their algorithms.
Submitted 10 July, 2024;
originally announced July 2024.
-
Stable Weight Updating: A Key to Reliable PDE Solutions Using Deep Learning
Authors:
A. Noorizadegan,
R. Cavoretto,
D. L. Young,
C. S. Chen
Abstract:
Background: Deep learning techniques, particularly neural networks, have revolutionized computational physics, offering powerful tools for solving complex partial differential equations (PDEs). However, ensuring stability and efficiency remains a challenge, especially in scenarios involving nonlinear and time-dependent equations. Methodology: This paper introduces novel residual-based architectures, namely the Simple Highway Network and the Squared Residual Network, designed to enhance stability and accuracy in physics-informed neural networks (PINNs). These architectures augment traditional neural networks by incorporating residual connections, which facilitate smoother weight updates and improve backpropagation efficiency. Results: Through extensive numerical experiments across various examples, including linear and nonlinear, time-dependent and time-independent PDEs, we demonstrate the efficacy of the proposed architectures. The Squared Residual Network, in particular, exhibits robust performance, achieving enhanced stability and accuracy compared to conventional neural networks. These findings underscore the potential of residual-based architectures in advancing deep learning for PDEs and computational physics applications.
Submitted 10 July, 2024;
originally announced July 2024.
-
MNeRV: A Multilayer Neural Representation for Videos
Authors:
Qingling Chang,
Haohui Yu,
Shuxuan Fu,
Zhiqiang Zeng,
Chuangquan Chen
Abstract:
As a novel video representation method, Neural Representations for Videos (NeRV) has shown great potential in the fields of video compression, video restoration, and video interpolation. In the process of representing videos using NeRV, each frame corresponds to an embedding, which is then reconstructed into a video frame sequence after passing through a small number of decoding layers (E-NeRV, HNeRV, etc.). However, this small number of decoding layers can easily lead to the problem of redundant model parameters due to the large proportion of parameters in a single decoding layer, which greatly restricts the video regression ability of neural network models. In this paper, we propose a multilayer neural representation for videos (MNeRV) and design a new decoder M-Decoder and its matching encoder M-Encoder. MNeRV has more encoding and decoding layers, which effectively alleviates the problem of redundant model parameters caused by too few layers. In addition, we design MNeRV blocks to perform more uniform and effective parameter allocation between decoding layers. In the field of video regression reconstruction, we achieve better reconstruction quality (+4.06 PSNR) with fewer parameters. Finally, we showcase MNeRV performance in downstream tasks such as video restoration and video interpolation. The source code of MNeRV is available at https://github.com/Aaronbtb/MNeRV.
Submitted 9 July, 2024;
originally announced July 2024.
-
Pretraining-finetuning Framework for Efficient Co-design: A Case Study on Quadruped Robot Parkour
Authors:
Ci Chen,
Jiyu Yu,
Haojian Lu,
Hongbo Gao,
Rong Xiong,
Yue Wang
Abstract:
In nature, animals with exceptional locomotion abilities, such as cougars, often possess asymmetric fore and hind legs, with their powerful hind legs acting as reservoirs of energy for leaps. This observation inspired us: could optimizing the leg length of quadruped robots endow them with similar locomotive capabilities? In this paper, we propose an approach that co-optimizes the mechanical structure and control policy to boost the locomotive prowess of quadruped robots. Specifically, we introduce a novel pretraining-finetuning framework, which not only guarantees optimal control strategies for each mechanical candidate but also ensures time efficiency. Additionally, we have devised an innovative training method for our pretraining network, integrating spatial domain randomization with regularization methods, markedly improving the network's generalizability. Our experimental results indicate that the proposed pretraining-finetuning framework significantly enhances the overall co-design performance with less time consumption. Moreover, the co-design strategy substantially exceeds the conventional method of independently optimizing control strategies, further improving the robot's locomotive performance and providing an innovative approach to enhancing the extreme parkour capabilities of quadruped robots.
Submitted 9 July, 2024;
originally announced July 2024.
-
Implicit Regression in Subspace for High-Sensitivity CEST Imaging
Authors:
Chu Chen,
Yang Liu,
Se Weon Park,
Jizhou Li,
Kannie W. Y. Chan,
Raymond H. F. Chan
Abstract:
Chemical Exchange Saturation Transfer (CEST) MRI can significantly enhance the detection of low-concentration proteins and metabolites through exchangeable protons. The clinical application of CEST, however, is constrained by the low contrast and low signal-to-noise ratio (SNR) of the acquired data. Denoising, as one of the post-processing stages for CEST data, can effectively improve the accuracy of CEST quantification. In this work, by modeling spatially variant z-spectra in a low-dimensional subspace, we introduce Implicit Regression in Subspace (IRIS), an unsupervised denoising algorithm that exploits the continuous-mapping property of implicit neural representations. Experiments conducted on both synthetic and in-vivo data demonstrate that our proposed method surpasses other CEST denoising methods in both qualitative and quantitative performance.
Submitted 9 July, 2024;
originally announced July 2024.
-
Uni-ELF: A Multi-Level Representation Learning Framework for Electrolyte Formulation Design
Authors:
Boshen Zeng,
Sian Chen,
Xinxin Liu,
Changhong Chen,
Bin Deng,
Xiaoxu Wang,
Zhifeng Gao,
Yuzhi Zhang,
Weinan E,
Linfeng Zhang
Abstract:
Advancements in lithium battery technology heavily rely on the design and engineering of electrolytes. However, current schemes for molecular design and recipe optimization of electrolytes lack an effective computational-experimental closed loop and often fall short in accurately predicting diverse electrolyte formulation properties. In this work, we introduce Uni-ELF, a novel multi-level representation learning framework to advance electrolyte design. Our approach involves two-stage pretraining: reconstructing three-dimensional molecular structures at the molecular level using the Uni-Mol model, and predicting statistical structural properties (e.g., radial distribution functions) from molecular dynamics simulations at the mixture level. Through this comprehensive pretraining, Uni-ELF is able to capture intricate molecular and mixture-level information, which significantly enhances its predictive capability. As a result, Uni-ELF substantially outperforms state-of-the-art methods in predicting both molecular properties (e.g., melting point, boiling point, synthesizability) and formulation properties (e.g., conductivity, Coulombic efficiency). Moreover, Uni-ELF can be seamlessly integrated into an automatic experimental design workflow. We believe this innovative framework will pave the way for automated AI-based electrolyte design and engineering.
Submitted 8 July, 2024;
originally announced July 2024.
-
iLLM-TSC: Integration reinforcement learning and large language model for traffic signal control policy improvement
Authors:
Aoyu Pang,
Maonan Wang,
Man-On Pun,
Chung Shue Chen,
Xi Xiong
Abstract:
Urban congestion remains a critical challenge, with traffic signal control (TSC) emerging as a potent solution. TSC is often modeled as a Markov Decision Process problem and then solved using reinforcement learning (RL), which has proven effective. However, existing RL-based TSC systems often overlook imperfect observations caused by degraded communication, such as packet loss, delays, and noise, as well as rare real-life events not included in the reward function, such as unconsidered emergency vehicles. To address these limitations, we introduce a novel integration framework that combines a large language model (LLM) with RL. This framework is designed to manage overlooked elements in the reward function and gaps in state information, thereby enhancing the policies of RL agents. In our approach, RL initially makes decisions based on observed data. Subsequently, LLMs evaluate these decisions to verify their reasonableness. If a decision is found to be unreasonable, it is adjusted accordingly. Additionally, this approach can be seamlessly integrated with existing RL-based TSC systems without necessitating modifications. Extensive testing confirms that our approach reduces the average waiting time by 17.5% in degraded communication conditions as compared to traditional RL methods, underscoring its potential to advance practical RL applications in intelligent transportation systems. The related code can be found at https://github.com/Traffic-Alpha/iLLM-TSC.
Submitted 8 July, 2024;
originally announced July 2024.
-
Cross Prompting Consistency with Segment Anything Model for Semi-supervised Medical Image Segmentation
Authors:
Juzheng Miao,
Cheng Chen,
Keli Zhang,
Jie Chuai,
Quanzheng Li,
Pheng-Ann Heng
Abstract:
Semi-supervised learning (SSL) has achieved notable progress in medical image segmentation. To achieve effective SSL, a model needs to be able to efficiently learn from limited labeled data and effectively exploit knowledge from abundant unlabeled data. Recent developments in visual foundation models, such as the Segment Anything Model (SAM), have demonstrated remarkable adaptability with improved sample efficiency. To harness the power of foundation models for application in SSL, we propose a cross prompting consistency method with the Segment Anything Model (CPC-SAM) for semi-supervised medical image segmentation. Our method employs SAM's unique prompt design and introduces a cross-prompting strategy within a dual-branch framework to automatically generate prompts and supervisions across two decoder branches, enabling effective learning from both scarce labeled and valuable unlabeled data. We further design a novel prompt consistency regularization to reduce the prompt position sensitivity and to enhance the output invariance under different prompts. We validate our method on two medical image segmentation tasks. The extensive experiments with different labeled-data ratios and modalities demonstrate the superiority of our proposed method over the state-of-the-art SSL methods, with more than 9% Dice improvement on the breast cancer segmentation task.
Submitted 7 July, 2024;
originally announced July 2024.
-
FM-OSD: Foundation Model-Enabled One-Shot Detection of Anatomical Landmarks
Authors:
Juzheng Miao,
Cheng Chen,
Keli Zhang,
Jie Chuai,
Quanzheng Li,
Pheng-Ann Heng
Abstract:
One-shot detection of anatomical landmarks is gaining significant attention for its efficiency in using minimal labeled data to produce promising results. However, the success of current methods heavily relies on the use of extensive unlabeled data to pre-train an effective feature extractor, which limits their applicability in scenarios where a substantial amount of unlabeled data is unavailable. In this paper, we propose the first foundation model-enabled one-shot landmark detection (FM-OSD) framework for accurate landmark detection in medical images, using only a single template image without any additional unlabeled data. Specifically, we use the frozen image encoder of visual foundation models as the feature extractor, and introduce dual-branch global and local feature decoders to increase the resolution of extracted features in a coarse-to-fine manner. The introduced feature decoders are efficiently trained with a distance-aware similarity learning loss to incorporate domain knowledge from the single template image. Moreover, a novel bidirectional matching strategy is developed to improve both the robustness and accuracy of landmark detection when the similarity maps obtained by foundation models are scattered. We validate our method on two public anatomical landmark detection datasets. Using only a single template image, our method demonstrates significant superiority over strong state-of-the-art one-shot landmark detection methods.
Submitted 7 July, 2024;
originally announced July 2024.
-
Learning Label Refinement and Threshold Adjustment for Imbalanced Semi-Supervised Learning
Authors:
Zeju Li,
Ying-Qiu Zheng,
Chen Chen,
Saad Jbabdi
Abstract:
Semi-supervised learning (SSL) algorithms struggle to perform well when exposed to imbalanced training data. In this scenario, the generated pseudo-labels can exhibit a bias towards the majority class, and models that employ these pseudo-labels can further amplify this bias. Here we investigate pseudo-labeling strategies for imbalanced SSL, including pseudo-label refinement and threshold adjustment, through the lens of statistical analysis. We find that existing SSL algorithms which generate pseudo-labels using heuristic strategies or uncalibrated model confidence are unreliable when imbalanced class distributions bias pseudo-labels. To address this, we introduce SEmi-supervised learning with pseudo-label optimization based on VALidation data (SEVAL) to enhance the quality of pseudo-labeling for imbalanced SSL. We propose to learn refinement and thresholding parameters from a partition of the training dataset in a class-balanced way. SEVAL adapts to specific tasks with improved pseudo-label accuracy and ensures pseudo-label correctness on a per-class basis. Our experiments show that SEVAL surpasses state-of-the-art SSL methods, delivering more accurate and effective pseudo-labels in various imbalanced SSL situations. SEVAL, with its simplicity and flexibility, can enhance various SSL techniques effectively. The code is publicly available at https://github.com/ZerojumpLine/SEVAL.
Submitted 7 July, 2024;
originally announced July 2024.
-
BadCLM: Backdoor Attack in Clinical Language Models for Electronic Health Records
Authors:
Weimin Lyu,
Zexin Bi,
Fusheng Wang,
Chao Chen
Abstract:
The advent of clinical language models integrated into electronic health records (EHR) for clinical decision support has marked a significant advancement, leveraging the depth of clinical notes for improved decision-making. Despite their success, the potential vulnerabilities of these models remain largely unexplored. This paper examines backdoor attacks on clinical language models, introducing an attention-based backdoor attack method, BadCLM (Bad Clinical Language Models). This technique clandestinely embeds a backdoor within the models, causing them to produce incorrect predictions when a pre-defined trigger is present in inputs, while functioning accurately otherwise. We demonstrate the efficacy of BadCLM through an in-hospital mortality prediction task with the MIMIC III dataset, showcasing its potential to compromise model integrity. Our findings illuminate a significant security risk in clinical decision support systems and pave the way for future endeavors in fortifying clinical language models against such vulnerabilities.
Submitted 6 July, 2024;
originally announced July 2024.
-
Toward a Unified Metadata Schema for Ecological Momentary Assessment with Voice-First Virtual Assistants
Authors:
Chen Chen,
Khalil Mrini,
Kemeberly Charles,
Ella T. Lifset,
Michael Hogarth,
Alison A. Moore,
Nadir Weibel,
Emilia Farcas
Abstract:
Ecological momentary assessment (EMA) is used to evaluate subjects' behaviors and moods in their natural environments, yet collecting real-time and self-report data with EMA is challenging due to user burden. Integrating voice into EMA data collection platforms through today's intelligent virtual assistants (IVAs) is promising due to their hands-free and eye-free nature. However, efficiently managing conversations and EMAs is non-trivial and time-consuming due to the ambiguity of the voice input. We approach this problem by rethinking the data modeling of EMA questions and what is needed to deploy them on voice-first user interfaces. We propose a unified metadata schema that models EMA questions and the necessary attributes to effectively and efficiently integrate voice as a new EMA modality. Our schema allows user experience researchers to write simple rules that can be rendered at run-time, instead of having to edit the source code. We showcase an example EMA survey implemented with our schema, which can run on multiple voice-only and voice-first devices. We believe that our work will accelerate the iterative prototyping and design process of real-world voice-based EMA data collection platforms.
Submitted 6 July, 2024;
originally announced July 2024.
-
Feedback-Driven Automated Whole Bug Report Reproduction for Android Apps
Authors:
Dingbang Wang,
Yu Zhao,
Sidong Feng,
Zhaoxu Zhang,
William G. J. Halfond,
Chunyang Chen,
Xiaoxia Sun,
Jiangfan Shi,
Tingting Yu
Abstract:
In software development, bug report reproduction is a challenging task. This paper introduces ReBL, a novel feedback-driven approach that leverages GPT-4, a large-scale language model, to automatically reproduce Android bug reports. Unlike traditional methods, ReBL bypasses the use of Step to Reproduce (S2R) entities. Instead, it leverages the entire textual bug report and employs innovative prompts to enhance GPT's contextual reasoning. This approach is more flexible and context-aware than the traditional step-by-step entity matching approach, resulting in improved accuracy and effectiveness. In addition to handling crash reports, ReBL has the capability of handling non-crash bug reports. Our evaluation of 96 Android bug reports (73 crash and 23 non-crash) demonstrates that ReBL successfully reproduced 90.63% of these reports, averaging only 74.98 seconds per bug report. Additionally, ReBL outperformed three existing tools in both success rate and speed.
Submitted 6 July, 2024;
originally announced July 2024.
-
Consistency and Discrepancy-Based Contrastive Tripartite Graph Learning for Recommendations
Authors:
Linxin Guo,
Yaochen Zhu,
Min Gao,
Yinghui Tao,
Junliang Yu,
Chen Chen
Abstract:
Tripartite graph-based recommender systems markedly diverge from traditional models by recommending unique combinations such as user groups and item bundles. Despite their effectiveness, these systems exacerbate the longstanding cold-start problem in traditional recommender systems, because any number of user groups or item bundles can be formed among users or items. To address this issue, we introduce a Consistency and Discrepancy-based graph contrastive learning method for tripartite graph-based Recommendation. This approach leverages two novel meta-path-based metrics, consistency and discrepancy, to capture nuanced, implicit associations between the recommended objects and the recommendees. These metrics, indicative of high-order similarities, can be efficiently calculated with infinite graph convolutional network (GCN) layers under a multi-objective optimization framework, using the limit theory of GCN.
Submitted 6 July, 2024;
originally announced July 2024.
-
Unraveling Radiomics Complexity: Strategies for Optimal Simplicity in Predictive Modeling
Authors:
Mahdi Ait Lhaj Loutfi,
Teodora Boblea Podasca,
Alex Zwanenburg,
Taman Upadhaya,
Jorge Barrios,
David R. Raleigh,
William C. Chen,
Dante P. I. Capaldi,
Hong Zheng,
Olivier Gevaert,
Jing Wu,
Alvin C. Silva,
Paul J. Zhang,
Harrison X. Bai,
Jan Seuntjens,
Steffen Löck,
Patrick O. Richard,
Olivier Morin,
Caroline Reinhold,
Martin Lepage,
Martin Vallières
Abstract:
Background: The high dimensionality of radiomic feature sets, the variability in radiomic feature types and potentially high computational requirements all underscore the need for an effective method to identify the smallest set of predictive features for a given clinical problem. Purpose: Develop a methodology and tools to identify and explain the smallest set of predictive radiomic features. Materials and Methods: 89,714 radiomic features were extracted from five cancer datasets: low-grade glioma, meningioma, non-small cell lung cancer (NSCLC), and two renal cell carcinoma cohorts (n=2104). Features were categorized by computational complexity into morphological, intensity, texture, linear filters, and nonlinear filters. Models were trained and evaluated on each complexity level using the area under the curve (AUC). The most informative features were identified, and their importance was explained. The optimal complexity level and associated most informative features were identified using systematic statistical significance analyses and a false discovery avoidance procedure, respectively. Their predictive importance was explained using a novel tree-based method. Results: MEDimage, a new open-source tool, was developed to facilitate radiomic studies. Morphological features were optimal for MRI-based meningioma (AUC: 0.65) and low-grade glioma (AUC: 0.68). Intensity features were optimal for CECT-based renal cell carcinoma (AUC: 0.82) and CT-based NSCLC (AUC: 0.76). Texture features were optimal for MRI-based renal cell carcinoma (AUC: 0.72). Tuning the Hounsfield unit range improved results for CECT-based renal cell carcinoma (AUC: 0.86). Conclusion: Our proposed methodology and software can estimate the optimal radiomics complexity level for specific medical outcomes, potentially simplifying the use of radiomics in predictive modeling across various contexts.
Submitted 5 July, 2024;
originally announced July 2024.
-
MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
Authors:
Zhaorun Chen,
Yichao Du,
Zichen Wen,
Yiyang Zhou,
Chenhang Cui,
Zhenzhen Weng,
Haoqin Tu,
Chaoqi Wang,
Zhengwei Tong,
Qinglan Huang,
Canyu Chen,
Qinghao Ye,
Zhihong Zhu,
Yuqing Zhang,
Jiawei Zhou,
Zhuokai Zhao,
Rafael Rafailov,
Chelsea Finn,
Huaxiu Yao
Abstract:
While text-to-image models like DALLE-3 and Stable Diffusion are rapidly proliferating, they often encounter challenges such as hallucination, bias, and the production of unsafe, low-quality output. To effectively address these issues, it is crucial to align these models with desired behaviors based on feedback from a multimodal judge. Despite their significance, current multimodal judges frequently undergo inadequate evaluation of their capabilities and limitations, potentially leading to misalignment and unsafe fine-tuning outcomes. To address this issue, we introduce MJ-Bench, a novel benchmark which incorporates a comprehensive preference dataset to evaluate multimodal judges in providing feedback for image generation models across four key perspectives: alignment, safety, image quality, and bias. Specifically, we evaluate a large variety of multimodal judges including smaller-sized CLIP-based scoring models, open-source VLMs (e.g. LLaVA family), and closed-source VLMs (e.g. GPT-4o, Claude 3) on each decomposed subcategory of our preference dataset. Experiments reveal that closed-source VLMs generally provide better feedback, with GPT-4o outperforming other judges on average. Compared with open-source VLMs, smaller-sized scoring models can provide better feedback regarding text-image alignment and image quality, while VLMs provide more accurate feedback regarding safety and generation bias due to their stronger reasoning capabilities. Further studies in feedback scale reveal that VLM judges can generally provide more accurate and stable feedback in natural language (Likert-scale) than numerical scales. Notably, human evaluations on end-to-end fine-tuned models using separate feedback from these multimodal judges provide similar conclusions, further confirming the effectiveness of MJ-Bench. All data, code, and models are available at https://huggingface.co/MJ-Bench.
Submitted 5 July, 2024;
originally announced July 2024.
-
KeyVideoLLM: Towards Large-scale Video Keyframe Selection
Authors:
Hao Liang,
Jiapeng Li,
Tianyi Bai,
Chong Chen,
Conghui He,
Bin Cui,
Wentao Zhang
Abstract:
Recently, with the rise of web videos, managing and understanding large-scale video datasets has become increasingly important. Video Large Language Models (VideoLLMs) have emerged in recent years due to their strong video understanding capabilities. However, training and inference processes for VideoLLMs demand vast amounts of data, presenting significant challenges to data management, particularly regarding efficiency, robustness, and effectiveness. In this work, we present KeyVideoLLM, a text-video frame similarity-based keyframe selection method designed to manage VideoLLM data efficiently, robustly, and effectively. Specifically, KeyVideoLLM achieves a remarkable data compression rate of up to 60.9 times, substantially lowering disk space requirements, which proves its high efficiency. Additionally, it maintains a 100% selection success rate across all video formats and scales, enhances processing speed by up to 200 times compared to existing keyframe selection methods, and does not require hyperparameter tuning. Beyond its outstanding efficiency and robustness, KeyVideoLLM further improves model performance in video question-answering tasks during both training and inference stages. Notably, it consistently achieved the state-of-the-art (SoTA) experimental results on diverse datasets.
Submitted 3 July, 2024;
originally announced July 2024.
-
Vision-driven Automated Mobile GUI Testing via Multimodal Large Language Model
Authors:
Zhe Liu,
Cheng Li,
Chunyang Chen,
Junjie Wang,
Boyu Wu,
Yawen Wang,
Jun Hu,
Qing Wang
Abstract:
With the advancement of software rendering techniques, GUI pages in mobile apps now encompass a wealth of visual information, where the visual semantics of each page contribute to the overall app logic, presenting new challenges to software testing. Despite the progress in automated Graphical User Interface (GUI) testing, the absence of testing oracles has constrained its efficacy to identify only crash bugs with evident abnormal signals. Nonetheless, there are still a considerable number of non-crash bugs, ranging from unexpected behaviors to misalignments, often evading detection by existing techniques. While these bugs can exhibit visual cues that serve as potential testing oracles, they often entail a sequence of screenshots, and detecting them necessitates an understanding of the operational logic among GUI page transitions, which is challenging for traditional techniques. Considering the remarkable performance of Multimodal Large Language Models (MLLM) in visual and language understanding, this paper proposes a vision-driven automated GUI testing approach, VisionDroid, to detect non-crash functional bugs with MLLM. It begins by extracting GUI text information and aligning it with screenshots to form a vision prompt, enabling MLLM to understand GUI context. The function-aware explorer then employs MLLM for deeper and function-oriented GUI page exploration, while the logic-aware bug detector segments the entire exploration history into logically cohesive parts and prompts the MLLM for bug detection. We evaluate VisionDroid on three datasets and compare it with 10 baselines, demonstrating its excellent performance. The ablation study further proves the contribution of each module. Moreover, VisionDroid identifies 29 new bugs on Google Play, of which 19 have been confirmed and fixed.
Submitted 3 July, 2024;
originally announced July 2024.
-
AcuVR: Enhancing Acupuncture Training Workflow with Virtual Reality
Authors:
Menghe Zhang,
Chen Chen,
Matin Yarmand,
Anish Rajeshkumar,
Nadir Weibel
Abstract:
Acupuncture is a widely adopted medical practice that involves inserting thin needles into specific points on the body to alleviate pain and treat various health conditions. Current learning practices heavily rely on 2D atlases and practice on peers, which are notably less intuitive and pose risks, particularly in sensitive areas such as the eyes. To address these challenges, we introduce AcuVR, a Virtual Reality (VR) based system designed to add a layer of interactivity and realism. This innovation aims to reduce the risks associated with practicing acupuncture techniques while offering more effective learning strategies. Furthermore, AcuVR incorporates medical imaging and standardized anatomy models, enabling the simulation of customized acupuncture scenarios. This feature represents a significant advancement beyond the limitations of conventional resources such as atlases and textbooks, facilitating a more immersive and personalized learning experience. The evaluation study with eight acupuncture students and practitioners revealed high participant satisfaction and pointed to the effectiveness and potential of AcuVR as a valuable addition to acupuncture training.
Submitted 2 July, 2024;
originally announced July 2024.
-
GlyphDraw2: Automatic Generation of Complex Glyph Posters with Diffusion Models and Large Language Models
Authors:
Jian Ma,
Yonglin Deng,
Chen Chen,
Haonan Lu,
Zhenyu Yang
Abstract:
Posters play a crucial role in marketing and advertising, contributing significantly to industrial design by enhancing visual communication and brand visibility. With recent advances in controllable text-to-image diffusion models, more concise research is now focusing on rendering text within synthetic images. Despite improvements in text rendering accuracy, the field of end-to-end poster generation remains underexplored. This complex task involves striking a balance between text rendering accuracy and automated layout to produce high-resolution images with variable aspect ratios. To tackle this challenge, we propose an end-to-end text rendering framework employing a triple cross-attention mechanism rooted in align learning, designed to create precise poster text within detailed contextual backgrounds. Additionally, we introduce a high-resolution dataset that exceeds 1024 pixels in image resolution. Our approach leverages the SDXL architecture. Extensive experiments validate the ability of our method to generate poster images featuring intricate and contextually rich backgrounds. Codes will be available at https://github.com/OPPO-Mente-Lab/GlyphDraw2.
Submitted 2 July, 2024;
originally announced July 2024.
-
Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization
Authors:
Yuchen Hu,
Chen Chen,
Siyin Wang,
Eng Siong Chng,
Chao Zhang
Abstract:
In this paper, we propose reverse inference optimization (RIO), a simple and effective method designed to enhance the robustness of autoregressive-model-based zero-shot text-to-speech (TTS) systems using reinforcement learning from human feedback (RLHF). To assess the quality of speech produced by the TTS system without human annotations, RIO introduces a novel concept termed as reverse inference based on the Bayesian principle, which suggests that a high-quality generated speech should be able to be used as a prompt for subsequent generation using the same TTS model. By leveraging reverse inference as the standard to select exemplars used in RLHF from the speech samples generated by the TTS system itself, RIO steers the subsequent optimization towards a direction of enhancing the TTS robustness. The RIO framework, comprising sampling, automatic annotating, and learning, obviates the need for a reward model or pairwise preference data, and significantly improves the stability of zero-shot TTS performance by reducing the discrepancies between training and inference conditions. Our experimental results verify that RIO can effectively improve both subjective and objective metrics, including mean opinion scores, word error rates, and speaker similarity. Remarkably, RIO can also diminish the incidence of bad outputs to nearly zero percent, rivalling the robustness when using ground-truth speech as the prompt.
Submitted 2 July, 2024;
originally announced July 2024.
-
LPViT: Low-Power Semi-structured Pruning for Vision Transformers
Authors:
Kaixin Xu,
Zhe Wang,
Chunyun Chen,
Xue Geng,
Jie Lin,
Xulei Yang,
Min Wu,
Xiaoli Li,
Weisi Lin
Abstract:
Vision transformers have emerged as a promising alternative to convolutional neural networks for various image analysis tasks, offering comparable or superior performance. However, one significant drawback of ViTs is their resource-intensive nature, leading to increased memory footprint, computation complexity, and power consumption. To democratize this high-performance technology and make it more environmentally friendly, it is essential to compress ViT models, reducing their resource requirements while maintaining high performance. In this paper, we introduce a new block-structured pruning to address the resource-intensive issue for ViTs, offering a balanced trade-off between accuracy and hardware acceleration. Unlike unstructured pruning or channel-wise structured pruning, block pruning leverages the block-wise structure of linear layers, resulting in more efficient matrix multiplications. To optimize this pruning scheme, our paper proposes a novel hardware-aware learning objective that simultaneously maximizes speedup and minimizes power consumption during inference, tailored to the block sparsity structure. This objective eliminates the need for empirical look-up tables and focuses solely on reducing parametrized layer connections. Moreover, our paper provides a lightweight algorithm to achieve post-training pruning for ViTs, utilizing second-order Taylor approximation and empirical optimization to solve the proposed hardware-aware objective. Extensive experiments on ImageNet are conducted across various ViT architectures, including DeiT-B and DeiT-S, demonstrating competitive performance with other pruning methods and achieving a remarkable balance between accuracy preservation and power savings. Especially, we achieve up to 3.93x and 1.79x speedups on dedicated hardware and GPUs respectively for DeiT-B, and also observe an inference power reduction by 1.4x on real-world GPUs.
Submitted 12 July, 2024; v1 submitted 2 July, 2024;
originally announced July 2024.
-
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models
Authors:
Haibo Jin,
Leyang Hu,
Xinuo Li,
Peiyan Zhang,
Chonghan Chen,
Jun Zhuang,
Haohan Wang
Abstract:
The rapid evolution of artificial intelligence (AI) through developments in Large Language Models (LLMs) and Vision-Language Models (VLMs) has brought significant advancements across various technological domains. While these models enhance capabilities in natural language processing and visual interactive tasks, their growing adoption raises critical concerns regarding security and ethical alignment. This survey provides an extensive review of the emerging field of jailbreaking--deliberately circumventing the ethical and operational boundaries of LLMs and VLMs--and the consequent development of defense mechanisms. Our study categorizes jailbreaks into seven distinct types and elaborates on defense strategies that address these vulnerabilities. Through this comprehensive examination, we identify research gaps and propose directions for future studies to enhance the security frameworks of LLMs and VLMs. Our findings underscore the necessity for a unified perspective that integrates both jailbreak strategies and defensive solutions to foster a robust, secure, and reliable environment for the next generation of language models. More details can be found on our website: https://chonghan-chen.com/llm-jailbreak-zoo-survey/.
Submitted 25 June, 2024;
originally announced July 2024.
-
Data on the Move: Traffic-Oriented Data Trading Platform Powered by AI Agent with Common Sense
Authors:
Yi Yu,
Shengyue Yao,
Tianchen Zhou,
Yexuan Fu,
Jingru Yu,
Ding Wang,
Xuhong Wang,
Cen Chen,
Yilun Lin
Abstract:
In the digital era, data has become a pivotal asset, advancing technologies such as autonomous driving. Despite this, data trading faces challenges like the absence of robust pricing methods and the lack of trustworthy trading mechanisms. To address these challenges, we introduce a traffic-oriented data trading platform named Data on The Move (DTM), integrating traffic simulation, data trading, and Artificial Intelligence (AI) agents. The DTM platform supports evidence-based data value evaluation and AI-based trading mechanisms. Leveraging the common sense capabilities of Large Language Models (LLMs) to assess traffic state and data value, DTM can determine reasonable traffic data pricing through multi-round interaction and simulations. Moreover, DTM provides a pricing method validation by simulating traffic systems, multi-agent interactions, and the heterogeneity and irrational behaviors of individuals in the trading market. Within the DTM platform, entities such as connected vehicles and traffic light controllers could engage in information collecting, data pricing, trading, and decision-making. Simulation results demonstrate that our proposed AI agent-based pricing approach enhances data trading by offering rational prices, as evidenced by the observed improvement in traffic efficiency. This underscores the effectiveness and practical value of DTM, offering new perspectives for the evolution of data markets and smart cities. To the best of our knowledge, this is the first study employing LLMs in data pricing and a pioneering data trading practice in the field of intelligent vehicles and smart cities.
Submitted 1 July, 2024;
originally announced July 2024.
-
Supercharging Federated Learning with Flower and NVIDIA FLARE
Authors:
Holger R. Roth,
Daniel J. Beutel,
Yan Cheng,
Javier Fernandez Marques,
Heng Pan,
Chester Chen,
Zhihong Zhang,
Yuhong Wen,
Sean Yang,
Isaac Yang,
Yuan-Ting Hsieh,
Ziyue Xu,
Daguang Xu,
Nicholas D. Lane,
Andrew Feng
Abstract:
Several open-source systems, such as Flower and NVIDIA FLARE, have been developed in recent years while focusing on different aspects of federated learning (FL). Flower is dedicated to implementing a cohesive approach to FL, analytics, and evaluation. Over time, Flower has cultivated extensive strategies and algorithms tailored for FL application development, fostering a vibrant FL community in research and industry. Conversely, FLARE has prioritized the creation of an enterprise-ready, resilient runtime environment explicitly designed for FL applications in production environments. In this paper, we describe our initial integration of both frameworks and show how they can work together to supercharge the FL ecosystem as a whole. Through the seamless integration of Flower and FLARE, applications crafted within the Flower framework can effortlessly operate within the FLARE runtime environment without necessitating any modifications. This initial integration streamlines the process, eliminating complexities and ensuring smooth interoperability between the two platforms, thus enhancing the overall efficiency and accessibility of FL applications.
Submitted 21 May, 2024;
originally announced July 2024.
-
HUWSOD: Holistic Self-training for Unified Weakly Supervised Object Detection
Authors:
Liujuan Cao,
Jianghang Lin,
Zebo Hong,
Yunhang Shen,
Shaohui Lin,
Chao Chen,
Rongrong Ji
Abstract:
Most WSOD methods rely on traditional object proposals to generate candidate regions and are confronted with unstable training, which easily gets stuck in a poor local optimum. In this paper, we introduce a unified, high-capacity weakly supervised object detection (WSOD) network called HUWSOD, which utilizes a comprehensive self-training framework without needing external modules or additional supervision. HUWSOD innovatively incorporates a self-supervised proposal generator and an autoencoder proposal generator with a multi-rate resampling pyramid to replace traditional object proposals, enabling end-to-end WSOD training and inference. Additionally, we implement a holistic self-training scheme that refines detection scores and coordinates through step-wise entropy minimization and consistency-constraint regularization, ensuring consistent predictions across stochastic augmentations of the same image. Extensive experiments on PASCAL VOC and MS COCO demonstrate that HUWSOD competes with state-of-the-art WSOD methods, eliminating the need for offline proposals and additional data. The peak performance of HUWSOD approaches that of fully-supervised Faster R-CNN. Our findings also indicate that randomly initialized boxes, although significantly different from well-designed offline object proposals, are effective for WSOD training.
Submitted 27 June, 2024;
originally announced June 2024.
-
Detecting Machine-Generated Texts: Not Just "AI vs Humans" and Explainability is Complicated
Authors:
Jiazhou Ji,
Ruizhe Li,
Shujun Li,
Jie Guo,
Weidong Qiu,
Zheng Huang,
Chiyu Chen,
Xiaoyu Jiang,
Xinru Lu
Abstract:
As LLMs rapidly advance, concerns are growing about the actual authorship of texts we see online and in the real world. The task of distinguishing LLM-authored texts is complicated by the nuanced and overlapping behaviors of both machines and humans. In this paper, we challenge the current practice of considering LLM-generated text detection a binary classification task of differentiating human from AI. Instead, we introduce a novel ternary text classification scheme, adding an "undecided" category for texts that could be attributed to either source, and we show that this new category is crucial to understanding how to make the detection result more explainable to lay users. This research shifts the paradigm from merely classifying to explaining machine-generated texts, emphasizing the need for detectors to provide clear and understandable explanations to users. Our study involves creating four new datasets comprising texts from various LLMs and human authors. Based on the new datasets, we performed binary classification tests to ascertain the most effective SOTA detection methods and identified SOTA LLMs capable of producing harder-to-detect texts. We constructed a new dataset of texts generated by two top-performing LLMs and human authors, and asked three human annotators to produce ternary labels with explanation notes. This dataset was used to investigate how three top-performing SOTA detectors behave in the new ternary classification context. Our results highlight why the "undecided" category is much needed from the viewpoint of explainability. Additionally, we conducted an analysis of the explainability of the three best-performing detectors and the explanation notes of the human annotators, revealing insights about the complexity of explainable detection of machine-generated texts. Finally, we propose guidelines for developing future detection systems with improved explanatory power.
Submitted 26 June, 2024;
originally announced June 2024.
-
Human-free Prompted Based Anomaly Detection: prompt optimization with Meta-guiding prompt scheme
Authors:
Pi-Wei Chen,
Jerry Chun-Wei Lin,
Jia Ji,
Feng-Hao Yeh,
Chao-Chun Chen
Abstract:
Pre-trained vision-language models (VLMs) are highly adaptable to various downstream tasks through few-shot learning, making prompt-based anomaly detection a promising approach. Traditional methods depend on human-crafted prompts that require prior knowledge of specific anomaly types. Our goal is to develop a human-free prompt-based anomaly detection framework that optimally learns prompts through data-driven methods, eliminating the need for human intervention. The primary challenge in this approach is the lack of anomalous samples during the training phase. Additionally, the Vision Transformer (ViT)-based image encoder in VLMs is not ideal for pixel-wise anomaly segmentation due to a locality feature mismatch between the original image and the output feature map. To tackle the first challenge, we have developed the Object-Attention Anomaly Generation Module (OAGM) to synthesize anomaly samples for training. Furthermore, our Meta-Guiding Prompt-Tuning Scheme (MPTS) iteratively adjusts the gradient-based optimization direction of learnable prompts to avoid overfitting to the synthesized anomalies. For the second challenge, we propose Locality-Aware Attention, which ensures that each local patch feature attends only to nearby patch features, preserving the locality features corresponding to their original locations. This framework allows for the optimal prompt embeddings by searching in the continuous latent space via backpropagation, free from human semantic constraints. Additionally, the modified locality-aware attention improves the precision of pixel-wise anomaly segmentation.
Submitted 26 June, 2024;
originally announced June 2024.
-
Large Language Models for Cuffless Blood Pressure Measurement From Wearable Biosignals
Authors:
Zengding Liu,
Chen Chen,
Jiannong Cao,
Minglei Pan,
Jikui Liu,
Nan Li,
Fen Miao,
Ye Li
Abstract:
Large language models (LLMs) have captured significant interest from both academia and industry due to their impressive performance across various textual tasks. However, the potential of LLMs to analyze physiological time-series data remains an emerging research field. Particularly, there is a notable gap in the utilization of LLMs for analyzing wearable biosignals to achieve cuffless blood pressure (BP) measurement, which is critical for the management of cardiovascular diseases. This paper presents the first work to explore the capacity of LLMs to perform cuffless BP estimation based on wearable biosignals. We extracted physiological features from electrocardiogram (ECG) and photoplethysmogram (PPG) signals and designed context-enhanced prompts by combining these features with BP domain knowledge and user information. Subsequently, we adapted LLMs to BP estimation tasks through fine-tuning. To evaluate the proposed approach, we conducted assessments of ten advanced LLMs using a comprehensive public dataset of wearable biosignals from 1,272 participants. The experimental results demonstrate that the optimally fine-tuned LLM significantly surpasses conventional task-specific baselines, achieving an estimation error of 0.00 $\pm$ 9.25 mmHg for systolic BP and 1.29 $\pm$ 6.37 mmHg for diastolic BP. Notably, the ablation studies highlight the benefits of our context enhancement strategy, leading to an 8.9% reduction in mean absolute error for systolic BP estimation. This paper pioneers the exploration of LLMs for cuffless BP measurement, providing a potential solution to enhance the accuracy of cuffless BP measurement.
Submitted 4 July, 2024; v1 submitted 26 June, 2024;
originally announced June 2024.
-
Fine-tuning Diffusion Models for Enhancing Face Quality in Text-to-image Generation
Authors:
Zhenyi Liao,
Qingsong Xie,
Chen Chen,
Hannan Lu,
Zhijie Deng
Abstract:
Diffusion models (DMs) have achieved significant success in generating imaginative images given textual descriptions. However, they are likely to fall short when it comes to real-life scenarios with intricate details. The low-quality, unrealistic human faces in text-to-image generation are one of the most prominent issues, hindering the wide application of DMs in practice. To address this issue, we first assess the face quality of generations from popular pre-trained DMs with the aid of human annotators and then evaluate the alignment between human judgments and existing metrics such as ImageReward, Human Preference Score, Aesthetic Score Predictor, and Face Quality Assessment. Observing that existing metrics can be unsatisfactory for quantifying face quality, we develop a novel metric named Face Score (FS) by fine-tuning ImageReward on a dataset of (good, bad) face pairs cheaply crafted by an inpainting pipeline of DMs. Extensive studies reveal that FS aligns better with human judgments. FS also opens the door to refining DMs for better face generation. To achieve this, we incorporate a guidance loss on the denoising trajectories of the aforementioned face pairs for fine-tuning pre-trained DMs such as Stable Diffusion V1.5 and Realistic Vision V5.1. Intuitively, such a loss pushes the trajectory of bad faces toward that of good ones. Comprehensive experiments verify the efficacy of our approach for improving face quality while preserving general capability.
Submitted 24 June, 2024;
originally announced June 2024.
-
Mind's Eye: Image Recognition by EEG via Multimodal Similarity-Keeping Contrastive Learning
Authors:
Chi-Sheng Chen,
Chun-Shu Wei
Abstract:
Decoding images from non-invasive electroencephalographic (EEG) signals has been a grand challenge in understanding how the human brain processes visual information in real-world scenarios. To cope with the issues of low signal-to-noise ratio and nonstationarity, this paper introduces a MUltimodal Similarity-keeping contrastivE learning (MUSE) framework for zero-shot EEG-based image classification. We develop a series of multivariate time-series encoders tailored for EEG signals and assess the efficacy of regularized contrastive EEG-Image pretraining using an extensive visual EEG dataset. Our method achieves state-of-the-art performance, with a top-1 accuracy of 19.3% and a top-5 accuracy of 48.8% in 200-way zero-shot image classification. Furthermore, we visualize neural patterns via model interpretation, shedding light on the visual processing dynamics in the human brain. The code repository for this work is available at: https://github.com/ChiShengChen/MUSE_EEG.
Submitted 5 June, 2024;
originally announced June 2024.