Realization of Conditional Operations through Transition Pathway Engineering

Authors: Sheng Zhang, Peng Duan, Yun-Jie Wang, Tian-Le Wang, Peng Wang, Ren-Ze Zhao, Xiao-Yan Yang, Ze-An Zhao, Liang-Liang Guo, Yong Chen, Hai-Feng Zhang, Lei Du, Hao-Ran Tao, Zhi-Fei Li, Yuan Wu, Zhi-Long Jia, Wei-Cheng Kong, Zhao-Yun Chen, Yu-Chun Wu, Guo-Ping Guo

Abstract: In the NISQ era, achieving large-scale quantum computing demands compact circuits to mitigate decoherence and gate error accumulation. Quantum operations with diverse degrees of freedom hold promise for circuit compression, but conventional approaches encounter challenges in simultaneously adjusting multiple parameters. Here, we propose a transition composite gate (TCG) scheme grounded on state-selective transition path engineering, enabling more expressive conditional operations. We experimentally validate a controlled unitary (CU) gate as an example, with independent and continuous parameters. By adjusting the parameters of $\rm X^{12}$ gate, we obtain the CU family with a fidelity range of 95.2% to 99.0% leveraging quantum process tomography (QPT). To demonstrate the capability of circuit compression, we use TCG scheme to prepare 3-qubit Greenberger-Horne-Zeilinger (GHZ) and W states, with the fidelity of 96.77% and 95.72%. TCG can achieve the reduction in circuit depth of about 40% and 44% compared with the use of CZ gates only. Moreover, we show that short-path TCG (SPTCG) can further reduce the state-preparation circuit time cost. The TCG scheme exhibits advantages in certain quantum circuits and shows significant potential for large-scale quantum algorithms.

Submitted 10 July, 2024; v1 submitted 9 July, 2024; originally announced July 2024.

Comments: 21 pages, 12 figures
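
For reference, the two 3-qubit target states named in the abstract above have the following standard textbook definitions. These are the conventional forms and are not quoted from the paper, which may use a different qubit ordering or phase convention.

```latex
\[
|\mathrm{GHZ}\rangle = \frac{1}{\sqrt{2}}\bigl(|000\rangle + |111\rangle\bigr),
\qquad
|\mathrm{W}\rangle = \frac{1}{\sqrt{3}}\bigl(|001\rangle + |010\rangle + |100\rangle\bigr)
\]
```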

arXiv:2407.05237 [pdf, ps, other]

Privacy of the last iterate in cyclically-sampled DP-SGD on nonconvex composite losses

Authors: Weiwei Kong, Mónica Ribero

Abstract: Differentially private stochastic gradient descent (DP-SGD) refers to a family of optimization algorithms that provide a guaranteed level of differential privacy (DP) through DP accounting techniques. However, current accounting techniques make assumptions that diverge significantly from practical DP-SGD implementations. For example, they may assume the loss function is Lipschitz continuous and convex, sample the batches randomly with replacement, or omit the gradient clipping step. In this work, we analyze the most commonly used variant of DP-SGD, in which we sample batches cyclically with replacement, perform gradient clipping, and only release the last DP-SGD iterate. More specifically - without assuming convexity, smoothness, or Lipschitz continuity of the loss function - we establish new Rényi differential privacy (RDP) bounds for the last DP-SGD iterate under the mild assumption that (i) the DP-SGD stepsize is small relative to the topological constants in the loss function, and (ii) the loss function is weakly-convex. Moreover, we show that our bounds converge to previously established convex bounds when the weak-convexity parameter of the objective function approaches zero. In the case of non-Lipschitz smooth loss functions, we provide a weaker bound that scales well in terms of the number of DP-SGD iterations.

Submitted 6 July, 2024; originally announced July 2024.

MSC Class: 65K10 (Primary); 60G15; 68P27 ACM Class: G.3; G.1.6
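
The DP-SGD variant described in the abstract above (fixed cyclic pass over batches, per-example gradient clipping, Gaussian noise, release of the last iterate only) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the squared-error loss, the function names, and every parameter (`step`, `clip_norm`, `noise_mult`) are assumptions chosen for the example. It omits the composite (proximal) term from the paper's title and computes no RDP privacy bound.

```python
import math
import random


def clip(vec, max_norm):
    """Rescale vec so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm <= max_norm or norm == 0.0:
        return list(vec)
    return [v * max_norm / norm for v in vec]


def grad_squared_error(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 in w (illustrative loss, an assumption)."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [residual * xi for xi in x]


def dp_sgd_last_iterate(data, dim, batch_size, epochs, step,
                        clip_norm, noise_mult, seed=0):
    """Run noisy clipped SGD over a fixed batch order; return the final weights."""
    rng = random.Random(seed)
    w = [0.0] * dim
    # Batches are formed once and visited in the same order every epoch
    # (our reading of "cyclic" sampling; hypothetical helper, not from the paper).
    batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
    for _ in range(epochs):
        for batch in batches:
            total = [0.0] * dim
            for x, y in batch:
                g = clip(grad_squared_error(w, x, y), clip_norm)
                total = [t + gi for t, gi in zip(total, g)]
            # Gaussian noise with standard deviation noise_mult * clip_norm.
            noisy = [t + rng.gauss(0.0, noise_mult * clip_norm) for t in total]
            w = [wi - step * n / len(batch) for wi, n in zip(w, noisy)]
    # Intermediate iterates are discarded; only the last one is released.
    return w
```

Returning only the final `w` mirrors the "last iterate" setting in the abstract. An implementation that also exposed intermediate weights would correspond to a different privacy analysis.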

arXiv:2406.15512 [pdf, ps, other]

Quantum Mechanics in Curved Space(time) with a Noncommutative Geometric Perspective

Authors: Otto C. W. Kong

Abstract: We have previously presented a version of the Weak Equivalence Principle for a quantum particle as an exact analog of the classical case, based on the Heisenberg picture analysis of free particle motion. Here, we take that to a full formalism of quantum mechanics in a generic curved space(time). Our basic perspective is to take seriously the noncommutative symplectic geometry corresponding to the quantum observable algebra. Particle position coordinate transformations and a nontrivial metric assigning an invariant inner product to vectors, and covectors, are implemented accordingly. That allows an analog to the classical picture of the phase space as the cotangent bundle. The mass-independent quantum geodesic equations as equations of free particle motion under a generic metric as a quantum observable are obtained from an invariant Hamiltonian. Hermiticity of momentum observables is to be taken as reference frame dependent. Our results have a big contrast to the alternative obtained based on the Schrödinger wavefunction representation. Hence, the work points to a very different approach to quantum gravity.

Submitted 20 June, 2024; originally announced June 2024.

Comments: 24 pages in RevTex, no figure

Report number: NCU-HEP-k102

arXiv:2406.12282 [pdf, other]

SAGDFN: A Scalable Adaptive Graph Diffusion Forecasting Network for Multivariate Time Series Forecasting

Authors: Yue Jiang, Xiucheng Li, Yile Chen, Shuai Liu, Weilong Kong, Antonis F. Lentzakis, Gao Cong

Abstract: Time series forecasting is essential for our daily activities and precise modeling of the complex correlations and shared patterns among multiple time series is essential for improving forecasting performance. Spatial-Temporal Graph Neural Networks (STGNNs) are widely used in multivariate time series forecasting tasks and have achieved promising performance on multiple real-world datasets for their ability to model the underlying complex spatial and temporal dependencies. However, existing studies have mainly focused on datasets comprising only a few hundred sensors due to the heavy computational cost and memory cost of spatial-temporal GNNs. When applied to larger datasets, these methods fail to capture the underlying complex spatial dependencies and exhibit limited scalability and performance. To this end, we present a Scalable Adaptive Graph Diffusion Forecasting Network (SAGDFN) to capture complex spatial-temporal correlation for large-scale multivariate time series and thereby, leading to exceptional performance in multivariate time series forecasting tasks. The proposed SAGDFN is scalable to datasets of thousands of nodes without the need of prior knowledge of spatial correlation. Extensive experiments demonstrate that SAGDFN achieves comparable performance with state-of-the-art baselines on one real-world dataset of 207 nodes and outperforms all state-of-the-art baselines by a significant margin on three real-world datasets of 2000 nodes.

Submitted 18 June, 2024; originally announced June 2024.

Comments: Accepted at ICDE 2024

arXiv:2406.06063 [pdf, other]

Enabling Large-Scale and High-Precision Fluid Simulations on Near-Term Quantum Computers

Authors: Zhao-Yun Chen, Teng-Yang Ma, Chuang-Chao Ye, Liang Xu, Ming-Yang Tan, Xi-Ning Zhuang, Xiao-Fan Xu, Yun-Jie Wang, Tai-Ping Sun, Yong Chen, Lei Du, Liang-Liang Guo, Hai-Feng Zhang, Hao-Ran Tao, Tian-Le Wang, Xiao-Yan Yang, Ze-An Zhao, Peng Wang, Sheng Zhang, Chi Zhang, Ren-Ze Zhao, Zhi-Long Jia, Wei-Cheng Kong, Meng-Han Dou, Jun-Chao Wang, et al. (7 additional authors not shown)

Abstract: Quantum computational fluid dynamics (QCFD) offers a promising alternative to classical computational fluid dynamics (CFD) by leveraging quantum algorithms for higher efficiency. This paper introduces a comprehensive QCFD method, including an iterative method "Iterative-QLS" that suppresses error in quantum linear solver, and a subspace method to scale the solution to a larger size. We implement our method on a superconducting quantum computer, demonstrating successful simulations of steady Poiseuille flow and unsteady acoustic wave propagation. The Poiseuille flow simulation achieved a relative error of less than $0.2\%$, and the unsteady acoustic wave simulation solved a 5043-dimensional matrix. We emphasize the utilization of the quantum-classical hybrid approach in applications of near-term quantum computers. By adapting to quantum hardware constraints and offering scalable solutions for large-scale CFD problems, our method paves the way for practical applications of near-term quantum computers in computational science.

Submitted 19 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

Comments: 31 pages, 10 figures

arXiv:2405.10339 [pdf, ps, other]

Noncommutative Number Systems for Quantum Information

Authors: Otto C. W. Kong

Abstract: Dirac talked about q-numbers versus c-numbers. Quantum observables are q-number variables that generally do not commute among themselves. He was proposing to have a generalized form of numbers as elements of a noncommutative algebra. That was Dirac's appreciation of the mathematical properties of the physical quantities as presented in Heisenberg's new quantum theory. After all, the familiar real, or complex, number system only came into existence through the history of mathematics. Values of physical quantities having a commutative product is an assumption that is not compatible with quantum physics. The revolutionary idea of Heisenberg and Dirac was pulled back to a much more conservative setting by the work of Schrödinger, followed by Born and Bohr. What Bohr missed is that the real number values we obtained from our measurements are only a consequence of the design of the kind of experiments and our using real numbers to calibrate the output scales of our apparatus. It is only our modeling of the information obtained about the physical quantities rather than what Nature dictates. We have proposed an explicit notion of definite noncommutative values of observables that gives a picture of quantum mechanics as realistic as the classical theory. In this article, we illustrate how matrices can be taken as noncommutative (q-)numbers serving as the values of physical quantities, each to be seen as a piece of quantum information. Our main task is to clarify the subtle issues involved in setting up a conventional scheme assigning matrices as values to the physical quantities.

Submitted 14 May, 2024; originally announced May 2024.

Comments: 18 pages in Revtex, no figure

Report number: NCU-HEP-k103

arXiv:2405.08311 [pdf, ps, other]

A Decoupling and Aggregating Framework for Joint Extraction of Entities and Relations

Authors: Yao Wang, Xin Liu, Weikun Kong, Hai-Tao Yu, Teeradaj Racharak, Kyoung-Sook Kim, Minh Le Nguyen

Abstract: Named Entity Recognition and Relation Extraction are two crucial and challenging subtasks in the field of Information Extraction. Despite the successes achieved by the traditional approaches, fundamental research questions remain open. First, most recent studies use parameter sharing for a single subtask or shared features for both subtasks, ignoring their semantic differences. Second, information interaction mainly focuses on the two subtasks, leaving the fine-grained information interaction among the subtask-specific features of encoding subjects, relations, and objects unexplored. Motivated by the aforementioned limitations, we propose a novel model to jointly extract entities and relations. The main novelties are as follows: (1) We propose to decouple the feature encoding process into three parts, namely encoding subjects, encoding objects, and encoding relations. Thanks to this, we are able to use fine-grained subtask-specific features. (2) We propose novel inter-aggregation and intra-aggregation strategies to enhance the information interaction and construct individual fine-grained subtask-specific features, respectively. The experimental results demonstrate that our model outperforms several previous state-of-the-art models. Extensive additional experiments further confirm the effectiveness of our model.

Submitted 14 May, 2024; originally announced May 2024.

arXiv:2405.06995 [pdf, other]

Benchmarking Cross-Domain Audio-Visual Deception Detection

Authors: Xiaobao Guo, Zitong Yu, Nithish Muthuchamy Selvaraj, Bingquan Shen, Adams Wai-Kin Kong, Alex C. Kot

Abstract: Automated deception detection is crucial for assisting humans in accurately assessing truthfulness and identifying deceptive behavior. Conventional contact-based techniques, like polygraph devices, rely on physiological signals to determine the authenticity of an individual's statements. Nevertheless, recent developments in automated deception detection have demonstrated that multimodal features derived from both audio and video modalities may outperform human observers on publicly available datasets. Despite these positive findings, the generalizability of existing audio-visual deception detection approaches across different scenarios remains largely unexplored. To close this gap, we present the first cross-domain audio-visual deception detection benchmark, that enables us to assess how well these methods generalize for use in real-world scenarios. We used widely adopted audio and visual features and different architectures for benchmarking, comparing single-to-single and multi-to-single domain generalization performance. To further exploit the impacts using data from multiple source domains for training, we investigate three types of domain sampling strategies, including domain-simultaneous, domain-alternating, and domain-by-domain for multi-to-single domain generalization evaluation. Furthermore, we proposed the Attention-Mixer fusion method to improve performance, and we believe that this new cross-domain benchmark will facilitate future research in audio-visual deception detection. Protocols and source code are available at https://github.com/Redaimao/cross_domain_DD.

Submitted 11 May, 2024; originally announced May 2024.

Comments: 10 pages

arXiv:2405.06361 [pdf, other]

Certified $\ell_2$ Attribution Robustness via Uniformly Smoothed Attributions

Authors: Fan Wang, Adams Wai-Kin Kong

Abstract: Model attribution is a popular tool to explain the rationales behind model predictions. However, recent work suggests that the attributions are vulnerable to minute perturbations, which can be added to input samples to fool the attributions while maintaining the prediction outputs. Although empirical studies have shown positive performance via adversarial training, an effective certified defense method is eminently needed to understand the robustness of attributions. In this work, we propose to use uniform smoothing technique that augments the vanilla attributions by noises uniformly sampled from a certain space. It is proved that, for all perturbations within the attack region, the cosine similarity between uniformly smoothed attribution of perturbed sample and the unperturbed sample is guaranteed to be lower bounded. We also derive alternative formulations of the certification that is equivalent to the original one and provides the maximum size of perturbation or the minimum smoothing radius such that the attribution can not be perturbed. We evaluate the proposed method on three datasets and show that the proposed method can effectively protect the attributions from attacks, regardless of the architecture of networks, training schemes and the size of the datasets.

Submitted 10 May, 2024; originally announced May 2024.

arXiv:2405.01825 [pdf, other]

Improving Concept Alignment in Vision-Language Concept Bottleneck Models

Authors: Nithish Muthuchamy Selvaraj, Xiaobao Guo, Bingquan Shen, Adams Wai-Kin Kong, Alex Kot

Abstract: Concept Bottleneck Models (CBM) map the input image to a high-level human-understandable concept space and then make class predictions based on these concepts. Recent approaches automate the construction of CBM by prompting Large Language Models (LLM) to generate text concepts and then use Vision Language Models (VLM) to obtain concept scores to train a CBM. However, it is desired to build CBMs with concepts defined by human experts instead of LLM generated concepts to make them more trustworthy. In this work, we take a closer inspection on the faithfulness of VLM concept scores for such expert-defined concepts in domains like fine-grain bird species classification and animal classification. Our investigations reveal that frozen VLMs, like CLIP, struggle to correctly associate a concept to the corresponding visual input despite achieving a high classification performance. To address this, we propose a novel Contrastive Semi-Supervised (CSS) learning method which uses a few labeled concept examples to improve concept alignment (activate truthful visual concepts) in CLIP model. Extensive experiments on three benchmark datasets show that our approach substantially increases the concept accuracy and classification accuracy, yet requires only a fraction of the human-annotated concept labels. To further improve the classification performance, we also introduce a new class-level intervention procedure for fine-grain classification problems that identifies the confounding classes and intervenes their concept space to reduce errors.

Submitted 2 May, 2024; originally announced May 2024.

arXiv:2404.15409 [pdf, ps, other]

Insufficient Statistics Perturbation: Stable Estimators for Private Least Squares

Authors: Gavin Brown, Jonathan Hayase, Samuel Hopkins, Weihao Kong, Xiyang Liu, Sewoong Oh, Juan C. Perdomo, Adam Smith

Abstract: We present a sample- and time-efficient differentially private algorithm for ordinary least squares, with error that depends linearly on the dimension and is independent of the condition number of $X^\top X$, where $X$ is the design matrix. All prior private algorithms for this task require either $d^{3/2}$ examples, error growing polynomially with the condition number, or exponential time. Our near-optimal accuracy guarantee holds for any dataset with bounded statistical leverage and bounded residuals. Technically, we build on the approach of Brown et al. (2023) for private mean estimation, adding scaled noise to a carefully designed stable nonprivate estimator of the empirical regression vector.

Submitted 23 April, 2024; originally announced April 2024.

Comments: 42 pages, 3 figures

arXiv:2404.09516 [pdf, other]

State Space Model for New-Generation Network Alternative to Transformers: A Survey

Authors: Xiao Wang, Shiao Wang, Yuhe Ding, Yuehang Li, Wentao Wu, Yao Rong, Weizhe Kong, Ju Huang, Shihao Li, Haoxiang Yang, Ziwen Wang, Bo Jiang, Chenglong Li, Yaowei Wang, Yonghong Tian, Jin Tang

Abstract: In the post-deep learning era, the Transformer architecture has demonstrated its powerful performance across pre-trained big models and various downstream tasks. However, the enormous computational demands of this architecture have deterred many researchers. To further reduce the complexity of attention models, numerous efforts have been made to design more efficient methods. Among them, the State Space Model (SSM), as a possible replacement for the self-attention based Transformer model, has drawn more and more attention in recent years. In this paper, we give the first comprehensive review of these works and also provide experimental comparisons and analysis to better demonstrate the features and advantages of SSM. Specifically, we first give a detailed description of principles to help the readers quickly capture the key ideas of SSM. After that, we dive into the reviews of existing SSMs and their various applications, including natural language processing, computer vision, graph, multi-modal and multi-media, point cloud/event stream, time series data, and other domains. In addition, we give statistical comparisons and analysis of these models and hope it helps the readers to understand the effectiveness of different structures on various tasks. Then, we propose possible research points in this direction to better promote the development of the theoretical model and application of SSM. More related works will be continuously updated on the following GitHub: https://github.com/Event-AHU/Mamba_State_Space_Model_Paper_List.

Submitted 15 April, 2024; originally announced April 2024.

Comments: The First review of State Space Model (SSM)/Mamba and their applications in artificial intelligence, 33 pages

arXiv:2403.18401 [pdf, other]

Force generation by a cylindrical cell under stationary osmolytes synthesis

Authors: Wei-Yuan Kong, Antonio Mosciatti Jofré, Manon Quiros, Marie-Béatrice Bogeat-Triboulot, Evelyne Kolb, Etienne Couturier

Abstract: Turgor is the driving force of plant growth, making possible for roots to overcome soil resistance or for stems to counteract gravity. Maintaining a constant growth rate while avoiding the cell content dilution, which would progressively stop the inward water flux, imposes the production or import of osmolytes in proportion to the increase of volume. We coin this phenomenon stationary osmoregulation. The article explores the quantitative consequences of this hypothesis on the interaction of a cylindrical cell growing axially against an obstacle. An instantaneous axial compression of a pressurized cylindrical cell generates a force and a pressure jump which both decrease toward a lower value once water has flowed out of the cell to reach the water potential equilibrium. In a first part, the article derives analytical formula for these force and over-pressure both before and after relaxation. In a second part, we describe how the coupling of the Lockhart's growth law with the stationary osmoregulation hypothesis predicts a transient slowdown in growth due to contact before a re-acceleration in growth. We finally compare these predictions with the output of an elastic growth model which ignores the osmotic origin of growth: models only match in the early phase of contact for high stiffness obstacle.

Submitted 3 July, 2024; v1 submitted 27 March, 2024; originally announced March 2024.

arXiv:2403.10214 [pdf, other]

Enhanced Coherence-Aware Network with Hierarchical Disentanglement for Aspect-Category Sentiment Analysis

Authors: Jin Cui, Fumiyo Fukumoto, Xinfeng Wang, Yoshimi Suzuki, Jiyi Li, Noriko Tomuro, Wanzeng Kong

Abstract: Aspect-category-based sentiment analysis (ACSA), which aims to identify aspect categories and predict their sentiments has been intensively studied due to its wide range of NLP applications. Most approaches mainly utilize intrasentential features. However, a review often includes multiple different aspect categories, and some of them do not explicitly appear in the review. Even in a sentence, there is more than one aspect category with its sentiments, and they are entangled intra-sentence, which makes the model fail to discriminately preserve all sentiment characteristics. In this paper, we propose an enhanced coherence-aware network with hierarchical disentanglement (ECAN) for ACSA tasks. Specifically, we explore coherence modeling to capture the contexts across the whole review and to help the implicit aspect and sentiment identification. To address the issue of multiple aspect categories and sentiment entanglement, we propose a hierarchical disentanglement module to extract distinct categories and sentiment features. Extensive experimental and visualization results show that our ECAN effectively decouples multiple categories and sentiments entangled in the coherence representations and achieves state-of-the-art (SOTA) performance. Our codes and data are available online: https://github.com/cuijin-23/ECAN.

Submitted 15 March, 2024; originally announced March 2024.

Comments: Accepted by LREC-COLING 2024

arXiv:2403.10021 [pdf, other]

Time-Frequency Jointed Imperceptible Adversarial Attack to Brainprint Recognition with Deep Learning Models

Authors: Hangjie Yi, Yuhang Ming, Dongjun Liu, Wanzeng Kong

Abstract: EEG-based brainprint recognition with deep learning models has garnered much attention in biometric identification. Yet, studies have indicated vulnerability to adversarial attacks in deep learning models with EEG inputs. In this paper, we introduce a novel adversarial attack method that jointly attacks time-domain and frequency-domain EEG signals by employing wavelet transform. Different from most existing methods which only target time-domain EEG signals, our method not only takes advantage of the time-domain attack's potent adversarial strength but also benefits from the imperceptibility inherent in frequency-domain attack, achieving a better balance between attack performance and imperceptibility. Extensive experiments are conducted in both white- and grey-box scenarios and the results demonstrate that our attack method achieves state-of-the-art attack performance on three datasets and three deep-learning models. In the meanwhile, the perturbations in the signals attacked by our method are barely perceptible to the human visual system.

Submitted 30 June, 2024; v1 submitted 15 March, 2024; originally announced March 2024.

Comments: This work is accepted by ICME 2024

arXiv:2403.06135 [pdf, other]

MACE: Mass Concept Erasure in Diffusion Models

Authors: Shilin Lu, Zilan Wang, Leyang Li, Yanzhu Liu, Adams Wai-Kin Kong

Abstract: The rapid expansion of large-scale text-to-image diffusion models has raised growing concerns regarding their potential misuse in creating harmful or misleading content. In this paper, we introduce MACE, a finetuning framework for the task of mass concept erasure. This task aims to prevent models from generating images that embody unwanted concepts when prompted. Existing concept erasure methods are typically restricted to handling fewer than five concepts simultaneously and struggle to find a balance between erasing concept synonyms (generality) and maintaining unrelated concepts (specificity). In contrast, MACE differs by successfully scaling the erasure scope up to 100 concepts and by achieving an effective balance between generality and specificity. This is achieved by leveraging closed-form cross-attention refinement along with LoRA finetuning, collectively eliminating the information of undesirable concepts. Furthermore, MACE integrates multiple LoRAs without mutual interference. We conduct extensive evaluations of MACE against prior methods across four different tasks: object erasure, celebrity erasure, explicit content erasure, and artistic style erasure. Our results reveal that MACE surpasses prior methods in all evaluated tasks. Code is available at https://github.com/Shilin-LU/MACE.

Submitted 10 March, 2024; originally announced March 2024.

Comments: Accepted by CVPR 2024

arXiv:2401.08189 [pdf, other]

PRewrite: Prompt Rewriting with Reinforcement Learning

Authors: Weize Kong, Spurthi Amba Hombaiah, Mingyang Zhang, Qiaozhu Mei, Michael Bendersky

Abstract: Prompt engineering is critical for the development of LLM-based applications. However, it is usually done manually in a "trial and error" fashion that can be time consuming, ineffective, and sub-optimal. Even for the prompts which seemingly work well, there is always a lingering question: can the prompts be made better with further modifications? To address these problems, we investigate automated prompt engineering in this paper. Specifically, we propose PRewrite, an automated method to rewrite an under-optimized prompt to a more effective prompt. We instantiate the prompt rewriter using an LLM. The rewriter LLM is trained using reinforcement learning to optimize the performance on a given downstream task. We conduct experiments on diverse benchmark datasets, which demonstrate the effectiveness of PRewrite.

Submitted 10 June, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

arXiv:2401.06954 [pdf, other]

Bridging the Preference Gap between Retrievers and LLMs

Authors: Zixuan Ke, Weize Kong, Cheng Li, Mingyang Zhang, Qiaozhu Mei, Michael Bendersky

Abstract: Large Language Models (LLMs) have demonstrated superior results across a wide range of tasks, and Retrieval-augmented Generation (RAG) is an effective way to enhance the performance by locating relevant information and placing it into the context window of the LLM. However, the relationship between retrievers and LLMs in a RAG is still under-investigated. Most existing work treats the retriever and the LLM as independent components and leaves a gap between retrieving human-"friendly" information and assembling an LLM-"friendly" context. In this work, we examine a novel bridge mechanism. We validate the ranking and selection assumptions of retrievers in the context of RAG and propose a framework that chains together supervised and reinforcement learning to train a bridge model that optimizes the connection between the retriever and the LLM. Empirical results demonstrate the effectiveness of our method in both question-answering and personalized generation tasks.

Submitted 20 February, 2024; v1 submitted 12 January, 2024; originally announced January 2024.

arXiv:2312.11805 [pdf, other]

Gemini: A Family of Highly Capable Multimodal Models

Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.

Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.09538 [pdf, other]

AEGIS-Net: Attention-guided Multi-Level Feature Aggregation for Indoor Place Recognition

Authors: Yuhang Ming, Jian Ma, Xingrui Yang, Weichen Dai, Yong Peng, Wanzeng Kong

Abstract: We present AEGIS-Net, a novel indoor place recognition model that takes in RGB point clouds and generates global place descriptors by aggregating lower-level color, geometry features and higher-level implicit semantic features. However, rather than simple feature concatenation, self-attention modules are employed to select the most important local features that best describe an indoor place. Our AEGIS-Net is made of a semantic encoder, a semantic decoder and an attention-guided feature embedding. The model is trained in a 2-stage process with the first stage focusing on an auxiliary semantic segmentation task and the second one on the place recognition task. We evaluate our AEGIS-Net on the ScanNetPR dataset and compare its performance with a pre-deep-learning feature-based method and five state-of-the-art deep-learning-based methods. Our AEGIS-Net achieves exceptional performance and outperforms all six methods.

Submitted 15 December, 2023; originally announced December 2023.

Comments: Accepted by 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024)

arXiv:2311.16416 [pdf, other]

A Combinatorial Approach to Robust PCA

Authors: Weihao Kong, Mingda Qiao, Rajat Sen

Abstract: We study the problem of recovering Gaussian data under adversarial corruptions when the noises are low-rank and the corruptions are on the coordinate level. Concretely, we assume that the Gaussian noises lie in an unknown $k$-dimensional subspace $U \subseteq \mathbb{R}^d$, and $s$ randomly chosen coordinates of each data point fall into the control of an adversary. This setting models the scenario of learning from high-dimensional yet structured data that are transmitted through a highly-noisy channel, so that the data points are unlikely to be entirely clean. Our main result is an efficient algorithm that, when $ks^2 = O(d)$, recovers every single data point up to a nearly-optimal $\ell_1$ error of $\tilde O(ks/d)$ in expectation. At the core of our proof is a new analysis of the well-known Basis Pursuit (BP) method for recovering a sparse signal, which is known to succeed under additional assumptions (e.g., incoherence or the restricted isometry property) on the underlying subspace $U$. In contrast, we present a novel approach via studying a natural combinatorial problem and show that, over the randomness in the support of the sparse signal, a high-probability error bound is possible even if the subspace $U$ is arbitrary.

Submitted 27 November, 2023; originally announced November 2023.

Comments: To appear at ITCS 2024

arXiv:2311.14580 [pdf, other]

Large Language Models as Automated Aligners for benchmarking Vision-Language Models

Authors: Yuanfeng Ji, Chongjian Ge, Weikai Kong, Enze Xie, Zhengying Liu, Zhengguo Li, Ping Luo

Abstract: With the advancements in Large Language Models (LLMs), Vision-Language Models (VLMs) have reached a new level of sophistication, showing notable competence in executing intricate cognition and reasoning tasks. However, existing evaluation benchmarks, primarily relying on rigid, hand-crafted datasets to measure task-specific performance, face significant limitations in assessing the alignment of these increasingly anthropomorphic models with human intelligence. In this work, we address the limitations via Auto-Bench, which delves into exploring LLMs as proficient aligners, measuring the alignment between VLMs and human intelligence and value through automatic data curation and assessment. Specifically, for data curation, Auto-Bench utilizes LLMs (e.g., GPT-4) to automatically generate a vast set of question-answer-reasoning triplets via prompting on visual symbolic representations (e.g., captions, object locations, instance relationships, etc.). The curated data closely matches human intent, owing to the extensive world knowledge embedded in LLMs. Through this pipeline, a total of 28.5K human-verified and 3,504K unfiltered question-answer-reasoning triplets have been curated, covering 4 primary abilities and 16 sub-abilities. We subsequently engage LLMs like GPT-3.5 to serve as judges, implementing the quantitative and qualitative automated assessments to facilitate a comprehensive evaluation of VLMs.
Our validation results reveal that LLMs are proficient in both evaluation data curation and model assessment, achieving an average agreement rate of 85%. We envision Auto-Bench as a flexible, scalable, and comprehensive benchmark for evaluating the evolving sophisticated VLMs.

Submitted 24 November, 2023; originally announced November 2023.

arXiv:2311.14464 [pdf, other]

Finite Volume Features, Global Geometry Representations, and Residual Training for Deep Learning-based CFD Simulation

Authors: Loh Sher En Jessica, Naheed Anjum Arafat, Wei Xian Lim, Wai Lee Chan, Adams Wai Kin Kong

Abstract: Computational fluid dynamics (CFD) simulation is an irreplaceable modelling step in many engineering designs, but it is often computationally expensive. Some graph neural network (GNN)-based CFD methods have been proposed. However, the current methods inherit the weakness of traditional numerical simulators, as well as ignore the cell characteristics in the mesh used in the finite volume method, a common method in practical CFD applications. Specifically, the input nodes in these GNN methods have very limited information about any object immersed in the simulation domain and its surrounding environment. Also, the cell characteristics of the mesh such as cell volume, face surface area, and face centroid are not included in the message-passing operations in the GNN methods. To address these weaknesses, this work proposes two novel geometric representations: Shortest Vector (SV) and Directional Integrated Distance (DID). Extracted from the mesh, the SV and DID provide a global geometry perspective to each input node, thus removing the need to collect this information through message-passing. This work also introduces the use of Finite Volume Features (FVF) in the graph convolutions as node and edge attributes, enabling its message-passing operations to adjust to different nodes. Finally, this work is the first to demonstrate how residual training, with the availability of low-resolution data, can be adopted to improve the flow field prediction accuracy.
Experimental results on two datasets with five different state-of-the-art GNN methods for CFD indicate that SV, DID, FVF and residual training can effectively reduce the predictive error of current GNN-based methods by as much as 41%.

Submitted 24 November, 2023; originally announced November 2023.

arXiv:2311.08362 [pdf, other]

Transformers can optimally learn regression mixture models

Authors: Reese Pathak, Rajat Sen, Weihao Kong, Abhimanyu Das

Abstract: Mixture models arise in many regression problems, but most methods have seen limited adoption partly due to these algorithms' highly-tailored and model-specific nature. On the other hand, transformers are flexible, neural sequence models that present the intriguing possibility of providing general-purpose prediction methods, even in this mixture setting. In this work, we investigate the hypothesis that transformers can learn an optimal predictor for mixtures of regressions. We construct a generative process for a mixture of linear regressions for which the decision-theoretic optimal procedure is given by data-driven exponential weights on a finite set of parameters. We observe that transformers achieve low mean-squared error on data generated via this process. By probing the transformer's output at inference time, we also show that transformers typically make predictions that are close to the optimal predictor. Our experiments also demonstrate that transformers can learn mixtures of regressions in a sample-efficient fashion and are somewhat robust to distribution shifts. We complement our experimental observations by proving constructively that the decision-theoretic optimal procedure is indeed implementable by a transformer.

Submitted 14 November, 2023; originally announced November 2023.

Comments: 24 pages, 9 figures

arXiv:2311.05383 [pdf]

Improving Hand Recognition in Uncontrolled and Uncooperative Environments using Multiple Spatial Transformers and Loss Functions

Authors: Wojciech Michal Matkowski, Xiaojie Li, Adams Wai Kin Kong

Abstract: The prevalence of smartphones and consumer cameras has led to more evidence in the form of digital images, which are mostly taken in uncontrolled and uncooperative environments. In these images, criminals likely hide or cover their faces while their hands are observable in some cases, creating a challenging use case for forensic investigation. Many existing hand-based recognition methods perform well for hand images collected in controlled environments with user cooperation. However, their performance deteriorates significantly in uncontrolled and uncooperative environments. A recent work has exposed the potential of hand recognition in these environments. However, only the palmar regions were considered, and the recognition performance is still far from satisfactory. To improve the recognition accuracy, an algorithm integrating a multi-spatial transformer network (MSTN) and multiple loss functions is proposed to fully utilize information in full hand images. MSTN is first employed to localize the palms and fingers and estimate the alignment parameters. Then, the aligned images are further fed into pretrained convolutional neural networks, where features are extracted. Finally, a training scheme with multiple loss functions is used to train the network end-to-end. To demonstrate the effectiveness of the proposed algorithm, the trained model is evaluated on the NTU-PI-v1 database and six benchmark databases from different domains.
Experimental results show that the proposed algorithm performs significantly better than the existing methods in these uncontrolled and uncooperative environments and has good generalization capabilities to samples from different domains.

Submitted 9 November, 2023; originally announced November 2023.

arXiv:2310.12570 [pdf, other]

DA-TransUNet: Integrating Spatial and Channel Dual Attention with Transformer U-Net for Medical Image Segmentation

Authors: Guanqun Sun, Yizhi Pan, Weikun Kong, Zichang Xu, Jianhua Ma, Teeradaj Racharak, Le-Minh Nguyen, Junyi Xin

Abstract: Accurate medical image segmentation is critical for disease quantification and treatment evaluation. While traditional Unet architectures and their transformer-integrated variants excel in automated segmentation tasks, they lack the ability to harness the intrinsic position and channel features of images. Existing models also struggle with parameter efficiency and computational complexity, often due to the extensive use of Transformers. To address these issues, this study proposes a novel deep medical image segmentation framework, called DA-TransUNet, aiming to integrate the Transformer and dual attention block (DA-Block) into the traditional U-shaped architecture. Unlike earlier transformer-based U-net models, DA-TransUNet utilizes Transformers and DA-Block to integrate not only global and local features, but also image-specific positional and channel features, improving the performance of medical image segmentation. By incorporating a DA-Block at the embedding layer and within each skip connection layer, we substantially enhance feature extraction capabilities and improve the efficiency of the encoder-decoder structure. DA-TransUNet demonstrates superior performance in medical image segmentation tasks, consistently outperforming state-of-the-art techniques across multiple datasets. In summary, DA-TransUNet offers a significant advancement in medical image segmentation, providing an effective and powerful alternative to existing techniques.
Our architecture stands out for its ability to improve segmentation accuracy, thereby advancing the field of automated medical image diagnostics. The codes and parameters of our model will be publicly available at https://github.com/SUN-1024/DA-TransUnet.

Submitted 14 November, 2023; v1 submitted 19 October, 2023; originally announced October 2023.

arXiv:2310.10688 [pdf, other]

A decoder-only foundation model for time-series forecasting

Authors: Abhimanyu Das, Weihao Kong, Rajat Sen, Yichen Zhou

Abstract: Motivated by recent advances in large language models for Natural Language Processing (NLP), we design a time-series foundation model for forecasting whose out-of-the-box zero-shot performance on a variety of public datasets comes close to the accuracy of state-of-the-art supervised forecasting models for each individual dataset. Our model is based on pretraining a patched-decoder style attention model on a large time-series corpus, and can work well across different forecasting history lengths, prediction lengths and temporal granularities.

Submitted 17 April, 2024; v1 submitted 14 October, 2023; originally announced October 2023.

arXiv:2310.05116 [pdf, other]

Utilizing Contextual Clues and Role Correlations for Enhancing Document-level Event Argument Extraction

Authors: Wanlong Liu, Dingyi Zeng, Li Zhou, Yichen Xiao, Weishan Kong, Malu Zhang, Shaohuan Cheng, Hongyang Zhao, Wenyu Chen

Abstract: Document-level event argument extraction is a crucial yet challenging task within the field of information extraction. Current mainstream approaches primarily focus on the information interaction between event triggers and their arguments, facing two limitations: insufficient context interaction and the ignorance of event correlations. Here, we introduce a novel framework named CARLG (Contextual Aggregation of clues and Role-based Latent Guidance), comprising two innovative components: the Contextual Clues Aggregation (CCA) and the Role-based Latent Information Guidance (RLIG). The CCA module leverages the attention weights derived from a pre-trained encoder to adaptively assimilate broader contextual information, while the RLIG module aims to capture the semantic correlations among event roles. We then instantiate the CARLG framework into two variants based on two types of current mainstream EAE approaches. Notably, our CARLG framework introduces less than 1% new parameters yet significantly improves the performance. Comprehensive experiments across the RAMS, WikiEvents, and MLEE datasets confirm the superiority of CARLG, showing significant superiority in terms of both performance and inference speed compared to major benchmarks. Further analyses demonstrate the effectiveness of the proposed modules.

Submitted 3 April, 2024; v1 submitted 8 October, 2023; originally announced October 2023.

Comments: pre-submission

arXiv:2310.03610 [pdf, other]

Interpreting the Value of Flexibility in AC Security-Constrained Transmission Expansion Planning via a Cooperative Game Framework

Authors: Andrey Churkin, Wangwei Kong, Mohammad Iman Alizadeh, Florin Capitanescu, Pierluigi Mancarella, Eduardo A. Martínez Ceseña

Abstract: Security-constrained transmission expansion planning (SCTEP) is an inherently complex problem that requires simultaneously solving multiple contingency states of the system (usually corresponding to N-1 security criterion). Existing studies focus on effectively finding optimal solutions; however, single optimal solutions are not sufficient to interpret the value of flexibility (e.g., from energy storage systems) and support system planners in well-informed decision making. In view of planning uncertainties, it is necessary to estimate the contributions of flexibility to various objectives and prioritise the most effective investments. In this regard, this work introduces a SCTEP tool that enables interpreting the value of flexibility in terms of contributions to avoided load curtailment and total expected system cost reduction. Inspired by cooperative game theory, the tool ranks the contributions of flexibility providers and compares them against traditional line reinforcements. This information can be used by system planners to prioritise investments with higher contributions and synergistic capabilities.

Submitted 14 March, 2024; v1 submitted 5 October, 2023; originally announced October 2023.

Comments: Submitted to PSCC 2024

arXiv:2310.03104 [pdf, other]

DP-SGD for non-decomposable objective functions

Authors: William Kong, Andrés Muñoz Medina, Mónica Ribero

Abstract: Unsupervised pre-training is a common step in developing computer vision models and large language models. In this setting, the absence of labels requires the use of similarity-based loss functions, such as contrastive loss, that favor minimizing the distance between similar inputs and maximizing the distance between distinct inputs. As privacy concerns mount, training these models using differential privacy has become more important. However, due to how inputs are generated for these losses, one of their undesirable properties is that their $L_2$ sensitivity can grow with increasing batch size. This property is particularly disadvantageous for differentially private training methods, such as DP-SGD. To overcome this issue, we develop a new DP-SGD variant for similarity-based loss functions -- in particular the commonly used contrastive loss -- that manipulates gradients of the objective function in a novel way to obtain a sensitivity of the summed gradient that is $O(1)$ for batch size $n$. We test our DP-SGD variant on some preliminary CIFAR-10 pre-training and CIFAR-100 finetuning tasks and show that, in both tasks, our method's performance comes close to that of a non-private model and generally outperforms DP-SGD applied directly to the contrastive loss.

Submitted 4 October, 2023; originally announced October 2023.

arXiv:2310.00296 [pdf, other]

QUIZ: An Arbitrary Volumetric Point Matching Method for Medical Image Registration

Authors: Lin Liu, Xinxin Fan, Haoyang Liu, Chulong Zhang, Weibin Kong, Jingjing Dai, Yuming Jiang, Yaoqin Xie, Xiaokun Liang

Abstract: Rigid pre-registration involving local-global matching or other large deformation scenarios is crucial. Current popular methods rely on unsupervised learning based on grayscale similarity, but under circumstances where different poses lead to varying tissue structures, or where image quality is poor, these methods tend to exhibit instability and inaccuracies. In this study, we propose a novel method for medical image registration based on arbitrary voxel point of interest matching, called query point quizzer (QUIZ). QUIZ focuses on the correspondence between local-global matching points, specifically employing CNN for feature extraction and utilizing the Transformer architecture for global point matching queries, followed by applying average displacement for local image rigid transformation. We have validated this approach on a large deformation dataset of cervical cancer patients, with results indicating substantially smaller deviations compared to state-of-the-art methods. Remarkably, even for cross-modality subjects, it achieves results surpassing the current state-of-the-art.

Submitted 30 September, 2023; originally announced October 2023.

arXiv:2310.00152 [pdf, other]

doi 10.1145/3589334.3645408

Learning to Rewrite Prompts for Personalized Text Generation

Authors: Cheng Li, Mingyang Zhang, Qiaozhu Mei, Weize Kong, Michael Bendersky

Abstract: Facilitated by large language models (LLMs), personalized text generation has become a rapidly growing research direction. Most existing studies focus on designing specialized models for a particular domain, or they require fine-tuning the LLMs to generate personalized text. We consider a typical scenario in which the large language model, which generates personalized output, is frozen and can only be accessed through APIs. Under this constraint, all one can do is to improve the input text (i.e., text prompts) sent to the LLM, a procedure that is usually done manually. In this paper, we propose a novel method to automatically revise prompts for personalized text generation. The proposed method takes the initial prompts generated by a state-of-the-art, multistage framework for personalized generation and rewrites a few critical components that summarize and synthesize the personal context. The prompt rewriter employs a training paradigm that chains together supervised learning (SL) and reinforcement learning (RL), where SL reduces the search space of RL and RL facilitates end-to-end training of the rewriter. Using datasets from three representative domains, we demonstrate that the rewritten prompts outperform both the original prompts and the prompts optimized via supervised learning or reinforcement learning alone.
In-depth analysis of the rewritten prompts shows that they are not only human readable, but also able to guide manual revision of prompts when there is limited resource to employ reinforcement learning to train the prompt rewriter, or when it is costly to deploy an automatic prompt rewriter for inference.

Submitted 8 February, 2024; v1 submitted 29 September, 2023; originally announced October 2023.

Comments: In Proceedings of the ACM Web Conference 2024 (WWW '24)

arXiv:2309.03095 [pdf, ps, other]

Equivalence Principle for Quantum Mechanics in the Heisenberg Picture

Authors: Otto C. W. Kong

Abstract: We present an exact quantum observable analog of the weak equivalence principle for a `relativistic' quantum particle. The quantum geodesic equations are obtained from Heisenberg equations of motion as an exact analog of a fully covariant classical Hamiltonian evolution picture, with the proper identification of the canonical momentum variables as $p_μ$, rather than $p^μ$. We discuss the meaning of the equations in relation to projective measurements as well as equations with solution curves as ones in the noncommutative geometric picture of spacetime, and a plausible approach to quantum gravity as a theory about quantum observables as physical quantities including the notion of quantum coordinate transformation.

Submitted 14 May, 2024; v1 submitted 5 September, 2023; originally announced September 2023.

Comments: 19 pages in Revtex, no figure; proof-read version published

Report number: NCU-HEP-k101

Journal ref: Class. Quantum Grav. 41 (2024) 085013

arXiv:2309.01973 [pdf, other]

Linear Regression using Heterogeneous Data Batches

Authors: Ayush Jain, Rajat Sen, Weihao Kong, Abhimanyu Das, Alon Orlitsky

Abstract: In many learning applications, data are collected from multiple sources, each providing a \emph{batch} of samples that by itself is insufficient to learn its input-output relationship. A common approach assumes that the sources fall in one of several unknown subgroups, each with an unknown input distribution and input-output relationship. We consider one of this setup's most fundamental and important manifestations where the output is a noisy linear combination of the inputs, and there are $k$ subgroups, each with its own regression vector. Prior work~\cite{kong2020meta} showed that with abundant small-batches, the regression vectors can be learned with only few, $\tilde{\Omega}(k^{3/2})$, batches of medium-size with $\tilde{\Omega}(\sqrt{k})$ samples each. However, the paper requires that the input distribution for all $k$ subgroups be isotropic Gaussian, and states that removing this assumption is an ``interesting and challenging problem". We propose a novel gradient-based algorithm that improves on the existing results in several ways. It extends the applicability of the algorithm by: (1) allowing the subgroups' underlying input distributions to be different, unknown, and heavy-tailed; (2) recovering all subgroups followed by a significant proportion of batches even for infinite $k$; (3) removing the separation requirement between the regression vectors; (4) reducing the number of batches and allowing smaller batch sizes.

Submitted 5 September, 2023; originally announced September 2023.

arXiv:2308.13219 [pdf, other]

Physics-informed neural networks for unsteady incompressible flows with time-dependent moving boundaries

Authors: Yongzheng Zhu, Weizhen Kong, Jian Deng, Xin Bian

Abstract: Physics-informed neural networks (PINNs) employed in fluid mechanics deal primarily with stationary boundaries. This hinders the capability to address a wide range of flow problems involving moving bodies. To this end, we propose a novel extension, which enables PINNs to solve incompressible flows with time-dependent moving boundaries. More specifically, we impose Dirichlet constraints of velocity at the moving interfaces and define new loss functions for the corresponding training points. Moreover, we refine training points for flows around the moving boundaries for accuracy. This effectively enforces the no-slip condition of the moving boundaries. With an initial condition, the extended PINNs solve unsteady flow problems with time-dependent moving boundaries and still have the flexibility to leverage partial data to reconstruct the entire flow field. Therefore, the extended version inherits the amalgamation of both physics and data from the original PINNs. With a series of typical flow problems, we demonstrate the effectiveness and accuracy of the extended PINNs. The proposed concept allows for solving inverse problems as well, which calls for further investigations.

Submitted 25 August, 2023; originally announced August 2023.

arXiv:2308.12531 [pdf, other]

CARE: Co-Attention Network for Joint Entity and Relation Extraction

Authors: Wenjun Kong, Yamei Xia

Abstract: Joint entity and relation extraction is the fundamental task of information extraction, consisting of two subtasks: named entity recognition and relation extraction. However, most existing joint extraction methods suffer from issues of feature confusion or inadequate interaction between the two subtasks. Addressing these challenges, in this work, we propose a Co-Attention network for joint entity and Relation Extraction (CARE). Our approach includes adopting a parallel encoding strategy to learn separate representations for each subtask, aiming to avoid feature overlap or confusion. At the core of our approach is the co-attention module that captures two-way interaction between the two subtasks, allowing the model to leverage entity information for relation prediction and vice versa, thus promoting mutual enhancement. Through extensive experiments on three benchmark datasets for joint entity and relation extraction (NYT, WebNLG, and SciERC), we demonstrate that our proposed model outperforms existing baseline models. Our code will be available at https://github.com/kwj0x7f/CARE.

Submitted 27 March, 2024; v1 submitted 23 August, 2023; originally announced August 2023.

Comments: Accepted by LREC-COLING 2024

arXiv:2308.04001 [pdf, other]

Explicit Topology Optimization of Conforming Voronoi Foams

Authors: Ming Li, Jingqiao Hu, Wei Chen, Weipeng Kong, Jin Huang

Abstract: Topology optimization is able to maximally leverage the high DOFs and mechanical potentiality of porous foams but faces three fundamental challenges: conforming to free-form outer shapes, maintaining geometric connectivity between adjacent cells, and achieving high simulation accuracy. To resolve the issues, borrowing the concept from Voronoi tessellation, we propose to use the site (or seed) positions and radii of the beams as the DOFs for open-cell foam design. Such DOFs cover extensive design space and have clear geometrical meaning, which makes it easy to provide explicit controls (e.g. granularity). During the gradient-based optimization, the foam topology can change freely, and some seeds may even be pushed out of the shape, which greatly alleviates the challenges of prescribing a fixed underlying grid. The mechanical property of our foam is computed from its highly heterogeneous density field counterpart discretized on a background mesh, with a much improved accuracy via a new material-aware numerical coarsening method. We also explore the differentiability of the open-cell Voronoi foams w.r.t. its seed locations, and propose a local finite difference method to estimate the derivatives efficiently. We do not only show the improved foam performance of our Voronoi foam in comparison with classical topology optimization approaches, but also demonstrate its advantages in various settings, especially when the target volume fraction is extremely low.

Submitted 7 August, 2023; originally announced August 2023.

arXiv:2308.02809 [pdf]

3D front tip fields in creeping solids under constraint effects: a higher-order asymptotic solution

Authors: Weichen Kong, Yanwei Dai, Yinghua Liu

Abstract: As one of the most important topics studied in creep fracture mechanics, the mechanics fields at three-dimensional (3D) sharp V-notches and crack tips have drawn tremendous attention. Despite many years of effort on constraint theory for creeping solids, it remains unclear how in-plane and out-of-plane constraint effects interact for 3D sharp V-notches and cracks in creeping solids. To shed light on this topic, a 3D higher-order term solution for sharp V-notches in creeping materials subjected to mode 1 loading is established by introducing the out-of-plane factor, which is the out-of-plane stress divided by the sum of the in-plane normal stresses. The solution naturally degenerates to that of a 3D crack. Based on the 3D higher-order term solution, a new fracture parameter is proposed and combined with to characterize 3D constraint effect. It is found that the stress exponents and angular distribution of higher-order term for 3D notches and cracks are highly related to . The proposed higher-order term solutions show better agreement with the FEA results than the 3D leading-term and 2D two-term solutions, especially for smaller notch angles and ligament widths. Moreover, the presented 3D constraint theory shows that effects of and are highly interlinked rather than simply separated. It implies that the 3D constraint level may be significantly influenced by . The 3D mathematical solutions discussed in this paper could enhance the understanding of the 3D effect and have the potential to explain the 3D constraint effect on notches and cracks under creep conditions.
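
The verbal definition of the out-of-plane factor in the abstract above can be written as a formula. This is only a sketch: the symbol $T_z$ and the index convention (directions 1 and 2 in-plane, direction 3 along the notch or crack front) are assumptions made here for illustration and are not given in the listing.

```latex
% Out-of-plane factor: out-of-plane normal stress divided by the sum of
% the in-plane normal stresses. Symbol and axis labels are assumed.
T_z = \frac{\sigma_{33}}{\sigma_{11} + \sigma_{22}}
```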

Submitted 5 August, 2023; originally announced August 2023.

Comments: 56 pages, 25 figures

arXiv:2307.12493 [pdf, other]

TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition

Authors: Shilin Lu, Yanzhu Liu, Adams Wai-Kin Kong

Abstract: Text-driven diffusion models have exhibited impressive generative capabilities, enabling various image editing tasks. In this paper, we propose TF-ICON, a novel Training-Free Image COmpositioN framework that harnesses the power of text-driven diffusion models for cross-domain image-guided composition. This task aims to seamlessly integrate user-provided objects into a specific visual context. Current diffusion-based methods often involve costly instance-based optimization or finetuning of pretrained models on customized datasets, which can potentially undermine their rich prior. In contrast, TF-ICON can leverage off-the-shelf diffusion models to perform cross-domain image-guided composition without requiring additional training, finetuning, or optimization. Moreover, we introduce the exceptional prompt, which contains no information, to facilitate text-driven diffusion models in accurately inverting real images into latent representations, forming the basis for compositing. Our experiments show that equipping Stable Diffusion with the exceptional prompt outperforms state-of-the-art inversion methods on various datasets (CelebA-HQ, COCO, and ImageNet), and that TF-ICON surpasses prior baselines in versatile visual domains. Code is available at https://github.com/Shilin-LU/TF-ICON

Submitted 10 October, 2023; v1 submitted 23 July, 2023; originally announced July 2023.

Comments: Accepted by ICCV 2023

arXiv:2307.05608 [pdf, other]

DP-Auditorium: a Large Scale Library for Auditing Differential Privacy

Authors: William Kong, Andrés Muñoz Medina, Mónica Ribero, Umar Syed

Abstract: New regulations and increased awareness of data privacy have led to the deployment of new and more efficient differentially private mechanisms across public institutions and industries. Ensuring the correctness of these mechanisms is therefore crucial to ensure the proper protection of data. However, since differential privacy is a property of the mechanism itself, and not of an individual output, testing whether a mechanism is differentially private is not a trivial task. While ad hoc testing techniques exist under specific assumptions, no concerted effort has been made by the research community to develop a flexible and extendable tool for testing differentially private mechanisms. This paper introduces DP-Auditorium as a step advancing research in this direction. DP-Auditorium abstracts the problem of testing differential privacy into two steps: (1) measuring the distance between distributions, and (2) finding neighboring datasets where a mechanism generates output distributions maximizing such distance. From a technical point of view, we propose three new algorithms for evaluating the distance between distributions. While these algorithms are well-established in the statistics community, we provide new estimation guarantees that exploit the fact that we are only interested in verifying whether a mechanism is differentially private, and not in obtaining an exact estimate of the distance between two distributions. DP-Auditorium is easily extensible, as demonstrated in this paper by implementing a well-known approximate differential privacy testing algorithm into our library. We provide an extensive comparison to date of multiple testers across varying sample sizes and differential privacy parameters, demonstrating that there is no single tester that dominates all others, and that a combination of different techniques is required to ensure proper testing of mechanisms.

Submitted 18 December, 2023; v1 submitted 10 July, 2023; originally announced July 2023.

arXiv:2306.07096 [pdf, other]

Global and Local Semantic Completion Learning for Vision-Language Pre-training

Authors: Rong-Cheng Tu, Yatai Ji, Jie Jiang, Weijie Kong, Chengfei Cai, Wenzhe Zhao, Hongfa Wang, Yujiu Yang, Wei Liu

Abstract: Cross-modal alignment plays a crucial role in vision-language pre-training (VLP) models, enabling them to capture meaningful associations across different modalities. For this purpose, numerous masked modeling tasks have been proposed for VLP to further promote cross-modal interactions. The core idea of previous masked modeling tasks is to focus on reconstructing the masked tokens based on visible context for learning local-local alignment. However, most of them pay little attention to the global semantic features generated for the masked data, resulting in a limited cross-modal alignment ability of global representations to local features of the other modality. Therefore, in this paper, we propose a novel Global and Local Semantic Completion Learning (GLSCL) task to facilitate global-local alignment and local-local alignment simultaneously. Specifically, the GLSCL task complements the missing semantics of masked data and recovers global and local features by cross-modal interactions. Our GLSCL consists of masked global semantic completion (MGSC) and masked local token completion (MLTC). MGSC promotes learning more representative global features, which have a great impact on the performance of downstream tasks, while MLTC reconstructs modal-fusion local tokens, further enhancing accurate comprehension of multimodal data. To evaluate the proposed approaches on cross-modal alignment, we develop a validation benchmark called ALIGN-BENCH. Moreover, we present a flexible vision encoder, enabling our model to simultaneously perform image-text and video-text multimodal tasks. Experimental results show that our proposed method obtains state-of-the-art performance on various vision-language benchmarks, such as visual question answering, image-text retrieval, and video-text retrieval.

Submitted 5 December, 2023; v1 submitted 12 June, 2023; originally announced June 2023.

Comments: arXiv admin note: substantial text overlap with arXiv:2211.13437

arXiv:2306.05608 [pdf]

Coexistence of surface oxygen vacancy and interface conducting states in LaAlO3/SrTiO3 revealed by low-angle resonant soft X-ray scattering

Authors: Ming Yang, Ariando Ariando, Caozheng Diao, James C Lee, Kaushik Jayaraman, Mansoor B A Jalil, Serban Smadici, Shengwei Zeng, Jun Zhou, Weilong Kong, Mark B. H. Breese, Sankar Dhar, Yuan Ping Feng, Peter Abbamonte, Thirumalai Venkatesan, Andrivo Rusydi

Abstract: Oxide heterostructures have shown rich physics phenomena, particularly in the conjunction of exotic insulator-metal transition (IMT) at the interface between polar insulator LaAlO3 and non-polar insulator SrTiO3 (LaAlO3/SrTiO3). The polarization catastrophe model has suggested an electronic reconstruction yielding metallicity at both the interface and surface. Another scenario is the occurrence of surface oxygen vacancy at LaAlO3 (surface-Ov), which has predicted surface-to-interface charge transfer yielding a metallic interface but an insulating surface. To clarify the origin of IMT, one should probe surface-Ov and the associated electronic structures at both the surface and the buried interface simultaneously. Here, using low-angle resonant soft X-ray scattering (LA-RSXS) supported with first-principles calculations, we reveal the co-existence of the surface-Ov state and the interface conducting state only in conducting LaAlO3/SrTiO3 (001) films. Interestingly, both the surface-Ov state and the interface conducting state are absent for the insulating film. As a function of Ov density, while the surface-Ov state is responsible for the IMT, the spatial charge distribution is found responsible for a transition from two-dimensional-like to three-dimensional-like conduction accompanied by spectral weight transfer, revealing the importance of electronic correlation. Our results show the importance of surface-Ov in determining interface properties and provide a new strategy in utilizing LA-RSXS to directly probe the surface and buried interface electronic properties in complex oxide heterostructures.

Submitted 8 June, 2023; originally announced June 2023.

arXiv:2306.00676 [pdf, other]

Hyperspectral Target Detection Based on Low-Rank Background Subspace Learning and Graph Laplacian Regularization

Authors: Dunbin Shen, Xiaorui Ma, Wenfeng Kong, Jiacheng Tian, Hongyu Wang

Abstract: Hyperspectral target detection is good at finding dim and small objects based on spectral characteristics. However, existing representation-based methods are hindered by the problem of the unknown background dictionary and insufficient utilization of spatial information. To address these issues, this paper proposes an efficient optimizing approach based on low-rank representation (LRR) and graph Laplacian regularization (GLR). Firstly, to obtain a complete and pure background dictionary, we propose an LRR-based background subspace learning method by jointly mining the low-dimensional structure of all pixels. Secondly, to fully exploit local spatial relationships and capture the underlying geometric structure, a local region-based GLR is employed to estimate the coefficients. Finally, the desired detection map is generated by computing the ratio of representation errors from binary hypothesis testing. The experiments conducted on two benchmark datasets validate the effectiveness and superiority of the approach. For reproduction, the accompanying code is available at https://github.com/shendb2022/LRBSL-GLR.

Submitted 1 June, 2023; originally announced June 2023.

Comments: 4 pages, 3 figures, 1 table

arXiv:2305.17445 [pdf, other]

Synthesizing Speech Test Cases with Text-to-Speech? An Empirical Study on the False Alarms in Automated Speech Recognition Testing

Authors: Julia Kaiwen Lau, Kelvin Kai Wen Kong, Julian Hao Yong, Per Hoong Tan, Zhou Yang, Zi Qian Yong, Joshua Chern Wey Low, Chun Yong Chong, Mei Kuan Lim, David Lo

Abstract: Recent studies have proposed the use of Text-To-Speech (TTS) systems to automatically synthesise speech test cases at scale and uncover a large number of failures in ASR systems. However, the failures uncovered by synthetic test cases may not reflect the actual performance of an ASR system when it transcribes human audio; we refer to such failures as false alarms. Given a failed test case synthesised from TTS systems, which consists of TTS-generated audio and the corresponding ground truth text, we feed the human audio stating the same text to an ASR system. If human audio can be correctly transcribed, an instance of a false alarm is detected. In this study, we investigate false alarm occurrences in five popular ASR systems using synthetic audio generated from four TTS systems and human audio obtained from two commonly used datasets. Our results show that the least number of false alarms is identified when testing Deepspeech, and the number of false alarms is the highest when testing Wav2vec2. On average, false alarm rates range from 21% to 34% in all five ASR systems. Among the four TTS systems used, Google TTS produces the least number of false alarms (17%), and Espeak TTS produces the highest number of false alarms (32%). Additionally, we build a false alarm estimator that flags potential false alarms, which achieves promising results: a precision of 98.3%, a recall of 96.4%, an accuracy of 98.5%, and an F1 score of 97.3%. Our study provides insight into the appropriate selection of TTS systems to generate high-quality speech to test ASR systems. Additionally, a false alarm estimator can be a way to minimise the impact of false alarms and help developers choose suitable test inputs when evaluating ASR systems. The source code used in this paper is publicly available on GitHub at https://github.com/julianyonghao/FAinASRtest.
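
As a quick sanity check on the estimator metrics quoted in the abstract above, the F1 score can be recomputed from the reported precision and recall using the standard harmonic-mean definition. The sketch below is illustrative only; the function name `f1_score` is a hypothetical helper and is not taken from the paper's repository.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (standard F1)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract for the estimator.
reported_precision = 0.983
reported_recall = 0.964

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.973"
```

The recomputed value agrees with the reported F1 score of 97.3%.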

Submitted 18 July, 2023; v1 submitted 27 May, 2023; originally announced May 2023.

Comments: 13 pages, Accepted at ISSTA2023

arXiv:2305.09697 [pdf, ps, other]

doi 10.1016/j.cjph.2023.04.008

$E=mc^2$ versus Symmetry for Lorentz Covariant Physics

Authors: Otto C. W. Kong, Hock King Ting

Abstract: The famous equation $E=mc^2$ is a version of particle mass being essentially the magnitude of the (energy-)momentum four-vector in the setting of `relativistic' dynamics, which can be seen as dictated by the Poincaré symmetry adopted as the relativity symmetry. However, as Einstein himself suggested, the naive notion of momentum as mass times velocity may not be right. The Hamiltonian formulation perspective gives exactly such a setting which in the case of motion of a charged particle under an electromagnetic field actually has the right, canonical, momentum four-vector with an evolving magnitude. The important simple result seems to have missed proper appreciation. In relation to that, we present clear arguments against taking the Poincaré symmetry as the fundamental symmetry behind `relativistic' quantum dynamics, and discuss the proper symmetry theoretical formulation and the necessary picture of the covariant Hamiltonian dynamics with an evolution parameter that is, in general, not a particle proper time. In fact, it is obvious that the action of any position operator of a quantum state violates the on-shell mass condition. The phenomenologically quite successful quantum field theories are `second quantized' versions of `relativistic' quantum mechanics. We present a way for some reconciliation of that with our symmetry picture and discuss implications.

Submitted 16 May, 2023; originally announced May 2023.

Comments: 22 pages Revtex, no figure, published version

Report number: NCU-HEP-k096

Journal ref: Chin. J. Phys. 83 (2023) 480-488

arXiv:2304.08424 [pdf, other]

Long-term Forecasting with TiDE: Time-series Dense Encoder

Authors: Abhimanyu Das, Weihao Kong, Andrew Leach, Shaan Mathur, Rajat Sen, Rose Yu

Abstract: Recent work has shown that simple linear models can outperform several Transformer based approaches in long term time-series forecasting. Motivated by this, we propose a Multi-layer Perceptron (MLP) based encoder-decoder model, Time-series Dense Encoder (TiDE), for long-term time-series forecasting that enjoys the simplicity and speed of linear models while also being able to handle covariates and non-linear dependencies. Theoretically, we prove that the simplest linear analogue of our model can achieve near optimal error rate for linear dynamical systems (LDS) under some assumptions. Empirically, we show that our method can match or outperform prior approaches on popular long-term time-series forecasting benchmarks while being 5-10x faster than the best Transformer based model.

Submitted 4 April, 2024; v1 submitted 17 April, 2023; originally announced April 2023.

arXiv:2303.12745 [pdf, other]

Audio-Visual Deception Detection: DOLOS Dataset and Parameter-Efficient Crossmodal Learning

Authors: Xiaobao Guo, Nithish Muthuchamy Selvaraj, Zitong Yu, Adams Wai-Kin Kong, Bingquan Shen, Alex Kot

Abstract: Deception detection in conversations is a challenging yet important task, having pivotal applications in many fields such as credibility assessment in business, multimedia anti-frauds, and custom security. Despite this, deception detection research is hindered by the lack of high-quality deception datasets, as well as the difficulties of learning multimodal features effectively. To address this issue, we introduce DOLOS (the name "DOLOS" comes from Greek mythology), the largest gameshow deception detection dataset with rich deceptive conversations. DOLOS includes 1,675 video clips featuring 213 subjects, and it has been labeled with audio-visual feature annotations. We provide train-test, duration, and gender protocols to investigate the impact of different factors. We benchmark our dataset on previously proposed deception detection approaches. To further improve the performance by fine-tuning fewer parameters, we propose Parameter-Efficient Crossmodal Learning (PECL), where a Uniform Temporal Adapter (UT-Adapter) explores temporal attention in transformer-based architectures, and a crossmodal fusion module, Plug-in Audio-Visual Fusion (PAVF), combines crossmodal information from audio-visual features. Based on the rich fine-grained audio-visual annotations on DOLOS, we also exploit multi-task learning to enhance performance by concurrently predicting deception and audio-visual features. Experimental results demonstrate the desired quality of the DOLOS dataset and the effectiveness of the PECL. The DOLOS dataset and the source codes are available at https://github.com/NMS05/Audio-Visual-Deception-Detection-DOLOS-Dataset-and-Parameter-Efficient-Crossmodal-Learning/tree/main.

Submitted 3 August, 2023; v1 submitted 9 March, 2023; originally announced March 2023.

Comments: 11 pages, 6 figures

arXiv:2303.12531 [pdf, other]

Experimental Implementation of Short-Path Non-adiabatic Geometric Gates in a Superconducting Circuit

Authors: Xin-Xin Yang, Liang-Liang Guo, Hai-Feng Zhang, Lei Du, Chi Zhang, Hao-Ran Tao, Yong Chen, Peng Duan, Zhi-Long Jia, Wei-Cheng Kong, Guo-Ping Guo

Abstract: Non-adiabatic geometric quantum computation (NGQC) has attracted a lot of attention for noise-resilient quantum control. However, previous implementations of NGQC require long evolution paths that make them more vulnerable to incoherent errors than their dynamical counterparts. In this work, we experimentally realize a universal short-path non-adiabatic geometric gate set (SP-NGQC) with a two-times shorter evolution path on a superconducting quantum processor. Characterizing with both quantum process tomography and randomized benchmarking methods, we report an average single-qubit gate fidelity of 99.86% and a two-qubit gate fidelity of 97.9%. Additionally, we demonstrate superior robustness of the single-qubit SP-NGQC gates to Rabi frequency error in certain regions of parameter space by comparing their performance to that of the dynamical gates and the earlier NGQC gates.

Submitted 22 March, 2023; originally announced March 2023.

Comments: 15 pages, 11 figures

Journal ref: Physical Review Applied (2023)

arXiv:2303.03131 [pdf, other]

Video Question Answering Using CLIP-Guided Visual-Text Attention

Authors: Shuhong Ye, Weikai Kong, Chenglin Yao, Jianfeng Ren, Xudong Jiang

Abstract: Cross-modal learning of video and text plays a key role in Video Question Answering (VideoQA). In this paper, we propose a visual-text attention mechanism that uses Contrastive Language-Image Pre-training (CLIP), trained on a large number of general-domain language-image pairs, to guide the cross-modal learning for VideoQA. Specifically, we first extract video features using a TimeSformer and text features using a BERT from the target application domain, and utilize CLIP to extract a pair of visual-text features from the general-knowledge domain through the domain-specific learning. We then propose Cross-domain Learning to extract the attention information between visual and linguistic features across the target domain and general domain. The set of CLIP-guided visual-text features is integrated to predict the answer. The proposed method is evaluated on the MSVD-QA and MSRVTT-QA datasets, and outperforms state-of-the-art methods.

Submitted 8 March, 2023; v1 submitted 6 March, 2023; originally announced March 2023.

Comments: Submitted to the 2023 IEEE International Conference on Image Processing (ICIP 2023)

ACM Class: I.2.10

arXiv:2303.03105 [pdf, other]

Confidence-based Event-centric Online Video Question Answering on a Newly Constructed ATBS Dataset

Authors: Weikai Kong, Shuhong Ye, Chenglin Yao, Jianfeng Ren

Abstract: Deep neural networks facilitate video question answering (VideoQA), but real-world applications on video streams such as CCTV and live casts place higher demands on the solver. To address the challenges of VideoQA on long videos of unknown length, we define a new set of problems called Online Open-ended Video Question Answering (O^2VQA). It requires an online state-updating mechanism for the solver to decide if the collected information is sufficient to conclude an answer. We then propose a Confidence-based Event-centric Online Video Question Answering (CEO-VQA) model to solve this problem. Furthermore, a dataset called Answer Target in Background Stream (ATBS) is constructed to evaluate this newly developed online VideoQA application. Compared to the baseline VideoQA method that watches the whole video, the experimental results show that the proposed method achieves a significant performance gain.

Submitted 7 March, 2023; v1 submitted 6 March, 2023; originally announced March 2023.

Comments: Accepted for publication at the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2023)

Showing 1–50 of 259 results for author: Kong, W