subscribe to arXiv mailings

arXiv:2407.09281 [pdf, other]

Predicting and Understanding Human Action Decisions: Insights from Large Language Models and Cognitive Instance-Based Learning

Authors: Thuy Ngoc Nguyen, Kasturi Jamale, Cleotilde Gonzalez

Abstract: Large Language Models (LLMs) have demonstrated their capabilities across various tasks, from language translation to complex reasoning. Understanding and predicting human behavior and biases are crucial for artificial intelligence (AI) assisted systems to provide useful assistance, yet it remains an open question whether these models can achieve this. This paper addresses this gap by leveraging th… ▽ More Large Language Models (LLMs) have demonstrated their capabilities across various tasks, from language translation to complex reasoning. Understanding and predicting human behavior and biases are crucial for artificial intelligence (AI) assisted systems to provide useful assistance, yet it remains an open question whether these models can achieve this. This paper addresses this gap by leveraging the reasoning and generative capabilities of the LLMs to predict human behavior in two sequential decision-making tasks. These tasks involve balancing between exploitative and exploratory actions and handling delayed feedback, both essential for simulating real-life decision processes. We compare the performance of LLMs with a cognitive instance-based learning (IBL) model, which imitates human experiential decision-making. Our findings indicate that LLMs excel at rapidly incorporating feedback to enhance prediction accuracy. In contrast, the cognitive IBL model better accounts for human exploratory behaviors and effectively captures loss aversion bias, i.e., the tendency to choose a sub-optimal goal with fewer step-cost penalties rather than exploring to find the optimal choice, even with limited experience. The results highlight the benefits of integrating LLMs with cognitive architectures, suggesting that this synergy could enhance the modeling and understanding of complex human decision-making patterns. △ Less

Submitted 12 July, 2024; originally announced July 2024.

arXiv:2407.09035 [pdf, other]

GPC: Generative and General Pathology Image Classifier

Authors: Anh Tien Nguyen, Jin Tae Kwak

Abstract: Deep learning has been increasingly incorporated into various computational pathology applications to improve its efficiency, accuracy, and robustness. Although successful, most previous approaches for image classification have crucial drawbacks. There exist numerous tasks in pathology, but one needs to build a model per task, i.e., a task-specific model, thereby increasing the number of models, t… ▽ More Deep learning has been increasingly incorporated into various computational pathology applications to improve its efficiency, accuracy, and robustness. Although successful, most previous approaches for image classification have crucial drawbacks. There exist numerous tasks in pathology, but one needs to build a model per task, i.e., a task-specific model, thereby increasing the number of models, training resources, and cost. Moreover, transferring arbitrary task-specific model to another task is still a challenging problem. Herein, we propose a task-agnostic generative and general pathology image classifier, so called GPC, that aims at learning from diverse kinds of pathology images and conducting numerous classification tasks in a unified model. GPC, equipped with a convolutional neural network and a Transformer-based language model, maps pathology images into a high-dimensional feature space and generates pertinent class labels as texts via the image-to-text classification mechanism. We evaluate GPC on six datasets for four different pathology image classification tasks. Experimental results show that GPC holds considerable potential for developing an effective and efficient universal model for pathology image analysis. △ Less

Submitted 12 July, 2024; originally announced July 2024.

Comments: MICCAI-MedAGI 2023 (Best Paper Honorable Mention)

arXiv:2407.09030 [pdf, other]

CAMP: Continuous and Adaptive Learning Model in Pathology

Authors: Anh Tien Nguyen, Keunho Byeon, Kyungeun Kim, Boram Song, Seoung Wan Chae, Jin Tae Kwak

Abstract: There exist numerous diagnostic tasks in pathology. Conventional computational pathology formulates and tackles them as independent and individual image classification problems, thereby resulting in computational inefficiency and high costs. To address the challenges, we propose a generic, unified, and universal framework, called a continuous and adaptive learning model in pathology (CAMP), for pa… ▽ More There exist numerous diagnostic tasks in pathology. Conventional computational pathology formulates and tackles them as independent and individual image classification problems, thereby resulting in computational inefficiency and high costs. To address the challenges, we propose a generic, unified, and universal framework, called a continuous and adaptive learning model in pathology (CAMP), for pathology image classification. CAMP is a generative, efficient, and adaptive classification model that can continuously adapt to any classification task by leveraging pathology-specific prior knowledge and learning taskspecific knowledge with minimal computational cost and without forgetting the knowledge from the existing tasks. We evaluated CAMP on 22 datasets, including 1,171,526 patches and 11,811 pathology slides, across 17 classification tasks. CAMP achieves state-of-theart classification performance on a wide range of datasets and tasks at both patch- and slide-levels and reduces up to 94% of computation time and 85% of storage memory in comparison to the conventional classification models. Our results demonstrate that CAMP can offer a fundamental transformation in pathology image classification, paving the way for the fully digitized and computerized pathology practice. △ Less

Submitted 12 July, 2024; originally announced July 2024.

Comments: Under review

arXiv:2407.08872 [pdf, other]

Visual Multi-Object Tracking with Re-Identification and Occlusion Handling using Labeled Random Finite Sets

Authors: Linh Van Ma, Tran Thien Dat Nguyen, Changbeom Shim, Du Yong Kim, Namkoo Ha, Moongu Jeon

Abstract: This paper proposes an online visual multi-object tracking (MOT) algorithm that resolves object appearance-reappearance and occlusion. Our solution is based on the labeled random finite set (LRFS) filtering approach, which in principle, addresses disappearance, appearance, reappearance, and occlusion via a single Bayesian recursion. However, in practice, existing numerical approximations cause rea… ▽ More This paper proposes an online visual multi-object tracking (MOT) algorithm that resolves object appearance-reappearance and occlusion. Our solution is based on the labeled random finite set (LRFS) filtering approach, which in principle, addresses disappearance, appearance, reappearance, and occlusion via a single Bayesian recursion. However, in practice, existing numerical approximations cause reappearing objects to be initialized as new tracks, especially after long periods of being undetected. In occlusion handling, the filter's efficacy is dictated by trade-offs between the sophistication of the occlusion model and computational demand. Our contribution is a novel modeling method that exploits object features to address reappearing objects whilst maintaining a linear complexity in the number of detections. Moreover, to improve the filter's occlusion handling, we propose a fuzzy detection model that takes into consideration the overlapping areas between tracks and their sizes. We also develop a fast version of the filter to further reduce the computational time. △ Less

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2407.08470 [pdf, other]

Brain Tumor Segmentation in MRI Images with 3D U-Net and Contextual Transformer

Authors: Thien-Qua T. Nguyen, Hieu-Nghia Nguyen, Thanh-Hieu Bui, Thien B. Nguyen-Tat, Vuong M. Ngo

Abstract: This research presents an enhanced approach for precise segmentation of brain tumor masses in magnetic resonance imaging (MRI) using an advanced 3D-UNet model combined with a Context Transformer (CoT). By architectural expansion CoT, the proposed model extends its architecture to a 3D format, integrates it smoothly with the base model to utilize the complex contextual information found in MRI scan… ▽ More This research presents an enhanced approach for precise segmentation of brain tumor masses in magnetic resonance imaging (MRI) using an advanced 3D-UNet model combined with a Context Transformer (CoT). By architectural expansion CoT, the proposed model extends its architecture to a 3D format, integrates it smoothly with the base model to utilize the complex contextual information found in MRI scans, emphasizing how elements rely on each other across an extended spatial range. The proposed model synchronizes tumor mass characteristics from CoT, mutually reinforcing feature extraction, facilitating the precise capture of detailed tumor mass structures, including location, size, and boundaries. Several experimental results present the outstanding segmentation performance of the proposed method in comparison to current state-of-the-art approaches, achieving Dice score of 82.0%, 81.5%, 89.0% for Enhancing Tumor, Tumor Core and Whole Tumor, respectively, on BraTS2019. △ Less

Submitted 11 July, 2024; originally announced July 2024.

Comments: 6 pages, 7 figures

arXiv:2407.07917 [pdf, other]

Non-Cooperative Backdoor Attacks in Federated Learning: A New Threat Landscape

Authors: Tuan Nguyen, Dung Thuy Nguyen, Khoa D Doan, Kok-Seng Wong

Abstract: Despite the promise of Federated Learning (FL) for privacy-preserving model training on distributed data, it remains susceptible to backdoor attacks. These attacks manipulate models by embedding triggers (specific input patterns) in the training data, forcing misclassification as predefined classes during deployment. Traditional single-trigger attacks and recent work on cooperative multiple-trigge… ▽ More Despite the promise of Federated Learning (FL) for privacy-preserving model training on distributed data, it remains susceptible to backdoor attacks. These attacks manipulate models by embedding triggers (specific input patterns) in the training data, forcing misclassification as predefined classes during deployment. Traditional single-trigger attacks and recent work on cooperative multiple-trigger attacks, where clients collaborate, highlight limitations in attack realism due to coordination requirements. We investigate a more alarming scenario: non-cooperative multiple-trigger attacks. Here, independent adversaries introduce distinct triggers targeting unique classes. These parallel attacks exploit FL's decentralized nature, making detection difficult. Our experiments demonstrate the alarming vulnerability of FL to such attacks, where individual backdoors can be successfully learned without impacting the main task. This research emphasizes the critical need for robust defenses against diverse backdoor attacks in the evolving FL landscape. While our focus is on empirical analysis, we believe it can guide backdoor research toward more realistic settings, highlighting the crucial role of FL in building robust defenses against diverse backdoor threats. The code is available at \url{https://anonymous.4open.science/r/nba-980F/}. △ Less

Submitted 5 July, 2024; originally announced July 2024.

arXiv:2407.07472 [pdf, other]

Rectifier: Code Translation with Corrector via LLMs

Authors: Xin Yin, Chao Ni, Tien N. Nguyen, Shaohua Wang, Xiaohu Yang

Abstract: Software migration is garnering increasing attention with the evolution of software and society. Early studies mainly relied on handcrafted translation rules to translate between two languages, the translation process is error-prone and time-consuming. In recent years, researchers have begun to explore the use of pre-trained large language models (LLMs) in code translation. However, code translati… ▽ More Software migration is garnering increasing attention with the evolution of software and society. Early studies mainly relied on handcrafted translation rules to translate between two languages, the translation process is error-prone and time-consuming. In recent years, researchers have begun to explore the use of pre-trained large language models (LLMs) in code translation. However, code translation is a complex task that LLMs would generate mistakes during code translation, they all produce certain types of errors when performing code translation tasks, which include (1) compilation error, (2) runtime error, (3) functional error, and (4) non-terminating execution. We found that the root causes of these errors are very similar (e.g. failure to import packages, errors in loop boundaries, operator errors, and more). In this paper, we propose a general corrector, namely Rectifier, which is a micro and universal model for repairing translation errors. It learns from errors generated by existing LLMs and can be widely applied to correct errors generated by any LLM. The experimental results on translation tasks between C++, Java, and Python show that our model has effective repair ability, and cross experiments also demonstrate the robustness of our method. △ Less

Submitted 10 July, 2024; originally announced July 2024.

Comments: arXiv admin note: text overlap with arXiv:2308.03109, arXiv:2302.03908 by other authors

arXiv:2407.07421 [pdf, other]

doi 10.1109/TNET.2024.3423780

Federated PCA on Grassmann Manifold for IoT Anomaly Detection

Authors: Tung-Anh Nguyen, Long Tan Le, Tuan Dung Nguyen, Wei Bao, Suranga Seneviratne, Choong Seon Hong, Nguyen H. Tran

Abstract: With the proliferation of the Internet of Things (IoT) and the rising interconnectedness of devices, network security faces significant challenges, especially from anomalous activities. While traditional machine learning-based intrusion detection systems (ML-IDS) effectively employ supervised learning methods, they possess limitations such as the requirement for labeled data and challenges with hi… ▽ More With the proliferation of the Internet of Things (IoT) and the rising interconnectedness of devices, network security faces significant challenges, especially from anomalous activities. While traditional machine learning-based intrusion detection systems (ML-IDS) effectively employ supervised learning methods, they possess limitations such as the requirement for labeled data and challenges with high dimensionality. Recent unsupervised ML-IDS approaches such as AutoEncoders and Generative Adversarial Networks (GAN) offer alternative solutions but pose challenges in deployment onto resource-constrained IoT devices and in interpretability. To address these concerns, this paper proposes a novel federated unsupervised anomaly detection framework, FedPCA, that leverages Principal Component Analysis (PCA) and the Alternating Directions Method Multipliers (ADMM) to learn common representations of distributed non-i.i.d. datasets. Building on the FedPCA framework, we propose two algorithms, FEDPE in Euclidean space and FEDPG on Grassmann manifolds. Our approach enables real-time threat detection and mitigation at the device level, enhancing network resilience while ensuring privacy. Moreover, the proposed algorithms are accompanied by theoretical convergence rates even under a subsampling scheme, a novel result. Experimental results on the UNSW-NB15 and TON-IoT datasets show that our proposed methods offer performance in anomaly detection comparable to nonlinear baselines, while providing significant improvements in communication and memory efficiency, underscoring their potential for securing IoT networks. △ Less

Submitted 10 July, 2024; originally announced July 2024.

Comments: Accepted for publication at IEEE/ACM Transactions on Networking

Journal ref: IEEE/ACM Transactions on Networking On page(s): 1-16 Print ISSN: 1063-6692 Online ISSN: 1558-2566 Digital Object Identifier: 10.1109/TNET.2024.3423780

arXiv:2407.07369 [pdf, ps, other]

Viscosity estimation for 2D pipe flows I. Construction, consistency, asymptotic normality

Authors: Thi Hien Nguyen, Armen Shirikyan

Abstract: We consider the motion of incompressible viscous fluid in a rectangle, imposing the periodicity condition in one direction and the no-slip boundary condition in the other. Assuming that the flow is subject to an external random force, white in time and regular in space, we construct an estimator for the viscosity using only observations of the enstrophy. The goal of the paper is to prove that the… ▽ More We consider the motion of incompressible viscous fluid in a rectangle, imposing the periodicity condition in one direction and the no-slip boundary condition in the other. Assuming that the flow is subject to an external random force, white in time and regular in space, we construct an estimator for the viscosity using only observations of the enstrophy. The goal of the paper is to prove that the estimator is strongly consistent and asymptotically normal. The proof of consistency is based on the explicit formula for the estimator and some bounds for trajectories, while that of asymptotic normality uses in addition mixing properties of the Navier-Stokes flow. △ Less

Submitted 10 July, 2024; originally announced July 2024.

MSC Class: 35Q30; 37L55; 62M05; 76D06

arXiv:2407.07360 [pdf, other]

Towards a text-based quantitative and explainable histopathology image analysis

Authors: Anh Tien Nguyen, Trinh Thi Le Vuong, Jin Tae Kwak

Abstract: Recently, vision-language pre-trained models have emerged in computational pathology. Previous works generally focused on the alignment of image-text pairs via the contrastive pre-training paradigm. Such pre-trained models have been applied to pathology image classification in zero-shot learning or transfer learning fashion. Herein, we hypothesize that the pre-trained vision-language models can be… ▽ More Recently, vision-language pre-trained models have emerged in computational pathology. Previous works generally focused on the alignment of image-text pairs via the contrastive pre-training paradigm. Such pre-trained models have been applied to pathology image classification in zero-shot learning or transfer learning fashion. Herein, we hypothesize that the pre-trained vision-language models can be utilized for quantitative histopathology image analysis through a simple image-to-text retrieval. To this end, we propose a Text-based Quantitative and Explainable histopathology image analysis, which we call TQx. Given a set of histopathology images, we adopt a pre-trained vision-language model to retrieve a word-of-interest pool. The retrieved words are then used to quantify the histopathology images and generate understandable feature embeddings due to the direct mapping to the text description. To evaluate the proposed method, the text-based embeddings of four histopathology image datasets are utilized to perform clustering and classification tasks. The results demonstrate that TQx is able to quantify and analyze histopathology images that are comparable to the prevalent visual models in computational pathology. △ Less

Submitted 10 July, 2024; originally announced July 2024.

Comments: MICCAI 2024 - Early acceptance (Top 11%)

arXiv:2407.06826 [pdf, other]

VRDSynth: Synthesizing Programs for Multilingual Visually Rich Document Information Extraction

Authors: Thanh-Dat Nguyen, Tung Do-Viet, Hung Nguyen-Duy, Tuan-Hai Luu, Hung Le, Bach Le, Patanamon, Thongtanunam

Abstract: Businesses need to query visually rich documents (VRDs) like receipts, medical records, and insurance forms to make decisions. Existing techniques for extracting entities from VRDs struggle with new layouts or require extensive pre-training data. We introduce VRDSynth, a program synthesis method to automatically extract entity relations from multilingual VRDs without pre-training data. To capture… ▽ More Businesses need to query visually rich documents (VRDs) like receipts, medical records, and insurance forms to make decisions. Existing techniques for extracting entities from VRDs struggle with new layouts or require extensive pre-training data. We introduce VRDSynth, a program synthesis method to automatically extract entity relations from multilingual VRDs without pre-training data. To capture the complexity of VRD domain, we design a domain-specific language (DSL) to capture spatial and textual relations to describe the synthesized programs. Along with this, we also derive a new synthesis algorithm utilizing frequent spatial relations, search space pruning, and a combination of positive, negative, and exclusive programs to improve coverage. We evaluate VRDSynth on the FUNSD and XFUND benchmarks for semantic entity linking, consisting of 1,592 forms in 8 languages. VRDSynth outperforms state-of-the-art pre-trained models (LayoutXLM, InfoXLMBase, and XLMRobertaBase) in 5, 6, and 7 out of 8 languages, respectively, improving the F1 score by 42% over LayoutXLM in English. To test the extensibility of the model, we further improve VRDSynth with automated table recognition, creating VRDSynth(Table), and compare it with extended versions of the pre-trained models, InfoXLM(Large) and XLMRoberta(Large). VRDSynth(Table) outperforms these baselines in 4 out of 8 languages and in average F1 score. VRDSynth also significantly reduces memory footprint (1M and 380MB vs. 1.48GB and 3GB for LayoutXLM) while maintaining similar time efficiency. △ Less

Submitted 9 July, 2024; originally announced July 2024.

Comments: Accepted in ISSTA'24

arXiv:2407.06581 [pdf, other]

Vision language models are blind

Authors: Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, Anh Totti Nguyen

Abstract: Large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro are powering countless image-text applications and scoring high on many vision-understanding benchmarks. We propose BlindTest, a suite of 7 visual tasks absurdly easy to humans such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c) which letter is being circled in a word; and (… ▽ More Large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro are powering countless image-text applications and scoring high on many vision-understanding benchmarks. We propose BlindTest, a suite of 7 visual tasks absurdly easy to humans such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c) which letter is being circled in a word; and (d) counting the number of circles in a Olympic-like logo. Surprisingly, four state-of-the-art VLMs are, on average, only 56.20% accurate on our benchmark, with \newsonnet being the best (73.77% accuracy). On BlindTest, VLMs struggle with tasks that requires precise spatial information and counting (from 0 to 10), sometimes providing an impression of a person with myopia seeing fine details as blurry and making educated guesses. Code is available at: https://vlmsareblind.github.io/ △ Less

Submitted 12 July, 2024; v1 submitted 9 July, 2024; originally announced July 2024.

arXiv:2407.06142 [pdf, ps, other]

Delay-Aware Robust Edge Network Hardening Under Decision-Dependent Uncertainty

Authors: Jiaming Cheng, Duong Thuy Anh Nguyen, Ni Trieu, Duong Tung Nguyen

Abstract: Edge computing promises to offer low-latency and ubiquitous computation to numerous devices at the network edge. For delay-sensitive applications, link delays can have a direct impact on service quality. These delays can fluctuate drastically over time due to various factors such as network congestion, changing traffic conditions, cyberattacks, component failures, and natural disasters. Thus, it i… ▽ More Edge computing promises to offer low-latency and ubiquitous computation to numerous devices at the network edge. For delay-sensitive applications, link delays can have a direct impact on service quality. These delays can fluctuate drastically over time due to various factors such as network congestion, changing traffic conditions, cyberattacks, component failures, and natural disasters. Thus, it is crucial to efficiently harden the edge network to mitigate link delay variation as well as ensure a stable and improved user experience. To this end, we propose a novel robust model for optimal edge network hardening, considering the link delay uncertainty. Departing from the existing literature that treats uncertainties as exogenous, our model incorporates an endogenous uncertainty set to properly capture the impact of hardening and workload allocation decisions on link delays. However, the endogenous set introduces additional complexity to the problem due to the interdependence between decisions and uncertainties. We present two efficient methods to transform the problem into a solvable form. Extensive numerical results are shown to demonstrate the effectiveness of the proposed approach. △ Less

Submitted 8 July, 2024; originally announced July 2024.

Comments: 14 pages, 18 figures

arXiv:2407.06045 [pdf, other]

OpenCIL: Benchmarking Out-of-Distribution Detection in Class-Incremental Learning

Authors: Wenjun Miao, Guansong Pang, Trong-Tung Nguyen, Ruohang Fang, Jin Zheng, Xiao Bai

Abstract: Class incremental learning (CIL) aims to learn a model that can not only incrementally accommodate new classes, but also maintain the learned knowledge of old classes. Out-of-distribution (OOD) detection in CIL is to retain this incremental learning ability, while being able to reject unknown samples that are drawn from different distributions of the learned classes. This capability is crucial to… ▽ More Class incremental learning (CIL) aims to learn a model that can not only incrementally accommodate new classes, but also maintain the learned knowledge of old classes. Out-of-distribution (OOD) detection in CIL is to retain this incremental learning ability, while being able to reject unknown samples that are drawn from different distributions of the learned classes. This capability is crucial to the safety of deploying CIL models in open worlds. However, despite remarkable advancements in the respective CIL and OOD detection, there lacks a systematic and large-scale benchmark to assess the capability of advanced CIL models in detecting OOD samples. To fill this gap, in this study we design a comprehensive empirical study to establish such a benchmark, named $\textbf{OpenCIL}$. To this end, we propose two principled frameworks for enabling four representative CIL models with 15 diverse OOD detection methods, resulting in 60 baseline models for OOD detection in CIL. The empirical evaluation is performed on two popular CIL datasets with six commonly-used OOD datasets. One key observation we find through our comprehensive evaluation is that the CIL models can be severely biased towards the OOD samples and newly added classes when they are exposed to open environments. Motivated by this, we further propose a new baseline for OOD detection in CIL, namely Bi-directional Energy Regularization ($\textbf{BER}$), which is specially designed to mitigate these two biases in different CIL models by having energy regularization on both old and new classes. Its superior performance is justified in our experiments. All codes and datasets are open-source at https://github.com/mala-lab/OpenCIL. △ Less

Submitted 9 July, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

arXiv:2407.05469 [pdf, other]

Smart Camera Parking System With Auto Parking Spot Detection

Authors: Tuan T. Nguyen, Mina Sartipi

Abstract: Given the rising urban population and the consequential rise in traffic congestion, the implementation of smart parking systems has emerged as a critical matter of concern. Smart parking solutions use cameras, sensors, and algorithms like computer vision to find available parking spaces. This method improves parking place recognition, reduces traffic and pollution, and optimizes travel time. In re… ▽ More Given the rising urban population and the consequential rise in traffic congestion, the implementation of smart parking systems has emerged as a critical matter of concern. Smart parking solutions use cameras, sensors, and algorithms like computer vision to find available parking spaces. This method improves parking place recognition, reduces traffic and pollution, and optimizes travel time. In recent years, computer vision-based approaches have been widely used. However, most existing studies rely on manually labeled parking spots, which has implications for the cost and practicality of implementation. To solve this problem, we propose a novel approach PakLoc, which automatically localize parking spots. Furthermore, we present the PakSke module, which automatically adjust the rotation and the size of detected bounding box. The efficacy of our proposed methodology on the PKLot dataset results in a significant reduction in human labor of 94.25\%. Another fundamental aspect of a smart parking system is its capacity to accurately determine and indicate the state of parking spots within a parking lot. The conventional approach involves employing classification techniques to forecast the condition of parking spots based on the bounding boxes derived from manually labeled grids. In this study, we provide a novel approach called PakSta for identifying the state of parking spots automatically. Our method utilizes object detector from PakLoc to simultaneously determine the occupancy status of all parking lots within a video frame. Our proposed method PakSta exhibits a competitive performance on the PKLot dataset when compared to other classification methods. △ Less

Submitted 7 July, 2024; originally announced July 2024.

arXiv:2407.05452 [pdf, other]

Semantic Segmentation for Real-World and Synthetic Vehicle's Forward-Facing Camera Images

Authors: Tuan T. Nguyen, Phan Le, Yasir Hassan, Mina Sartipi

Abstract: In this paper, we present the submission to the 5th Annual Smoky Mountains Computational Sciences Data Challenge, Challenge 3. This is the solution for semantic segmentation problem in both real-world and synthetic images from a vehicle s forward-facing camera. We concentrate in building a robust model which performs well across various domains of different outdoor situations such as sunny, snowy,… ▽ More In this paper, we present the submission to the 5th Annual Smoky Mountains Computational Sciences Data Challenge, Challenge 3. This is the solution for semantic segmentation problem in both real-world and synthetic images from a vehicle s forward-facing camera. We concentrate in building a robust model which performs well across various domains of different outdoor situations such as sunny, snowy, rainy, etc. In particular, our method is developed with two main directions: model development and domain adaptation. In model development, we use the High Resolution Network (HRNet) as the baseline. Then, this baseline s result is processed by two coarse-to-fine models: Object-Contextual Representations (OCR) and Hierarchical Multi-scale Attention (HMA) to get the better robust feature. For domain adaption, we implement the Domain-Based Batch Normalization (DNB) to reduce the distribution shift from diverse domains. Our proposed method yield 81.259 mean intersection-over-union (mIoU) in validation set. This paper studies the effectiveness of employing real-world and synthetic data to handle the domain adaptation in semantic segmentation problem. △ Less

Submitted 7 July, 2024; originally announced July 2024.

Comments: 13 pages

arXiv:2407.05205 [pdf, other]

The AI Companion in Education: Analyzing the Pedagogical Potential of ChatGPT in Computer Science and Engineering

Authors: Zhangying He, Thomas Nguyen, Tahereh Miari, Mehrdad Aliasgari, Setareh Rafatirad, Hossein Sayadi

Abstract: Artificial Intelligence (AI), with ChatGPT as a prominent example, has recently taken center stage in various domains including higher education, particularly in Computer Science and Engineering (CSE). The AI revolution brings both convenience and controversy, offering substantial benefits while lacking formal guidance on their application. The primary objective of this work is to comprehensively… ▽ More Artificial Intelligence (AI), with ChatGPT as a prominent example, has recently taken center stage in various domains including higher education, particularly in Computer Science and Engineering (CSE). The AI revolution brings both convenience and controversy, offering substantial benefits while lacking formal guidance on their application. The primary objective of this work is to comprehensively analyze the pedagogical potential of ChatGPT in CSE education, understanding its strengths and limitations from the perspectives of educators and learners. We employ a systematic approach, creating a diverse range of educational practice problems within CSE field, focusing on various subjects such as data science, programming, AI, machine learning, networks, and more. According to our examinations, certain question types, like conceptual knowledge queries, typically do not pose significant challenges to ChatGPT, and thus, are excluded from our analysis. Alternatively, we focus our efforts on developing more in-depth and personalized questions and project-based tasks. These questions are presented to ChatGPT, followed by interactions to assess its effectiveness in delivering complete and meaningful responses. To this end, we propose a comprehensive five-factor reliability analysis framework to evaluate the responses. This assessment aims to identify when ChatGPT excels and when it faces challenges. Our study concludes with a correlation analysis, delving into the relationships among subjects, task types, and limiting factors. This analysis offers valuable insights to enhance ChatGPT's utility in CSE education, providing guidance to educators and students regarding its reliability and efficacy. △ Less

Submitted 23 April, 2024; originally announced July 2024.

Comments: conference, 13 pages

arXiv:2407.04992 [pdf, other]

Scalable Variational Causal Discovery Unconstrained by Acyclicity

Authors: Nu Hoang, Bao Duong, Thin Nguyen

Abstract: Bayesian causal discovery offers the power to quantify epistemic uncertainties among a broad range of structurally diverse causal theories potentially explaining the data, represented in forms of directed acyclic graphs (DAGs). However, existing methods struggle with efficient DAG sampling due to the complex acyclicity constraint. In this study, we propose a scalable Bayesian approach to effective… ▽ More Bayesian causal discovery offers the power to quantify epistemic uncertainties among a broad range of structurally diverse causal theories potentially explaining the data, represented in forms of directed acyclic graphs (DAGs). However, existing methods struggle with efficient DAG sampling due to the complex acyclicity constraint. In this study, we propose a scalable Bayesian approach to effectively learn the posterior distribution over causal graphs given observational data thanks to the ability to generate DAGs without explicitly enforcing acyclicity. Specifically, we introduce a novel differentiable DAG sampling method that can generate a valid acyclic causal graph by mapping an unconstrained distribution of implicit topological orders to a distribution over DAGs. Given this efficient DAG sampling scheme, we are able to model the posterior distribution over causal graphs using a simple variational distribution over a continuous domain, which can be learned via the variational inference framework. Extensive empirical experiments on both simulated and real datasets demonstrate the superior performance of the proposed model compared to several state-of-the-art baselines. △ Less

Submitted 6 July, 2024; originally announced July 2024.

Comments: Accepted at ECAI 2024

arXiv:2407.04980 [pdf, other]

Enabling Causal Discovery in Post-Nonlinear Models with Normalizing Flows

Authors: Nu Hoang, Bao Duong, Thin Nguyen

Abstract: Post-nonlinear (PNL) causal models stand out as a versatile and adaptable framework for modeling intricate causal relationships. However, accurately capturing the invertibility constraint required in PNL models remains challenging in existing studies. To address this problem, we introduce CAF-PoNo (Causal discovery via Normalizing Flows for Post-Nonlinear models), harnessing the power of the norma… ▽ More Post-nonlinear (PNL) causal models stand out as a versatile and adaptable framework for modeling intricate causal relationships. However, accurately capturing the invertibility constraint required in PNL models remains challenging in existing studies. To address this problem, we introduce CAF-PoNo (Causal discovery via Normalizing Flows for Post-Nonlinear models), harnessing the power of the normalizing flows architecture to enforce the crucial invertibility constraint in PNL models. Through normalizing flows, our method precisely reconstructs the hidden noise, which plays a vital role in cause-effect identification through statistical independence testing. Furthermore, the proposed approach exhibits remarkable extensibility, as it can be seamlessly expanded to facilitate multivariate causal discovery via causal order identification, empowering us to efficiently unravel complex causal relationships. Extensive experimental evaluations on both simulated and real datasets consistently demonstrate that the proposed method outperforms several state-of-the-art approaches in both bivariate and multivariate causal discovery tasks. △ Less

Submitted 6 July, 2024; originally announced July 2024.

Comments: Acepted at ECAI 2024

arXiv:2407.04489 [pdf, other]

Dude: Dual Distribution-Aware Context Prompt Learning For Large Vision-Language Model

Authors: Duy M. H. Nguyen, An T. Le, Trung Q. Nguyen, Nghiem T. Diep, Tai Nguyen, Duy Duong-Tran, Jan Peters, Li Shen, Mathias Niepert, Daniel Sonntag

Abstract: Prompt learning methods are gaining increasing attention due to their ability to customize large vision-language models to new domains using pre-trained contextual knowledge and minimal training data. However, existing works typically rely on optimizing unified prompt inputs, often struggling with fine-grained classification tasks due to insufficient discriminative attributes. To tackle this, we c… ▽ More Prompt learning methods are gaining increasing attention due to their ability to customize large vision-language models to new domains using pre-trained contextual knowledge and minimal training data. However, existing works typically rely on optimizing unified prompt inputs, often struggling with fine-grained classification tasks due to insufficient discriminative attributes. To tackle this, we consider a new framework based on a dual context of both domain-shared and class-specific contexts, where the latter is generated by Large Language Models (LLMs) such as GPTs. Such dual prompt methods enhance the model's feature representation by joining implicit and explicit factors encoded in LLM knowledge. Moreover, we formulate the Unbalanced Optimal Transport (UOT) theory to quantify the relationships between constructed prompts and visual tokens. Through partial matching, UOT can properly align discrete sets of visual tokens and prompt embeddings under different mass distributions, which is particularly valuable for handling irrelevant or noisy elements, ensuring that the preservation of mass does not restrict transport solutions. Furthermore, UOT's characteristics integrate seamlessly with image augmentation, expanding the training sample pool while maintaining a reasonable distance between perturbed images and prompt inputs. Extensive experiments across few-shot classification and adapter settings substantiate the superiority of our model over current state-of-the-art baselines. △ Less

Submitted 5 July, 2024; originally announced July 2024.

Comments: Version 1

arXiv:2407.04408 [pdf, ps, other]

Hybrid Receiver Design for Massive MIMO-OFDM with Low-Resolution ADCs and Oversampling

Authors: Mengyuan Ma, Nhan Thanh Nguyen, Italo Atzeni, Markku Juntti

Abstract: Low-resolution analog-to-digital converters (ADCs) and hybrid beamforming have emerged as efficient solutions to reduce power consumption with satisfactory spectral efficiency (SE) in massive multiple-input multiple-output (MIMO) systems. In this paper, we investigate the performance of a hybrid receiver in uplink massive MIMO orthogonal frequency-division multiplexing (OFDM) systems with low-reso… ▽ More Low-resolution analog-to-digital converters (ADCs) and hybrid beamforming have emerged as efficient solutions to reduce power consumption with satisfactory spectral efficiency (SE) in massive multiple-input multiple-output (MIMO) systems. In this paper, we investigate the performance of a hybrid receiver in uplink massive MIMO orthogonal frequency-division multiplexing (OFDM) systems with low-resolution ADCs and oversampling. Considering both the temporal and spatial correlation of the quantization distortion (QD), we derive a closed-form approximation of the frequency-domain QD covariance matrix, which facilitates the evaluation of the system SE. Then we jointly design the analog and baseband combiners to maximize the SE. The formulated problem is significantly challenging due to the constant-modulus constraint of the analog combiner and its coupling with the digital one. To overcome the challenges, we transform the objective function into an equivalent but more tractable form and then iteratively update the analog and digital combiner. Numerical simulations verify the superiority of the proposed algorithm compared to the considered benchmarks and show the resilience of the hybrid receiver to beam squint for low-resolution systems. Furthermore, the results show that the proposed hybrid receiver design with oversampling can achieve significantly higher energy efficiency compared to the digital one. △ Less

Submitted 5 July, 2024; originally announced July 2024.

Comments: 6 pages, 4 figures, submitted to GlobeCom 2024

arXiv:2407.03796 [pdf, ps, other]

Joint Beamforming Design and Bit Allocation in Massive MIMO with Resolution-Adaptive ADCs

Authors: Mengyuan Ma, Nhan Thanh Nguyen, Italo Atzeni, Markku Juntti

Abstract: Low-resolution analog-to-digital converters (ADCs) have emerged as a promising technology for reducing power consumption and complexity in massive multiple-input multiple-output (MIMO) systems while maintaining satisfactory spectral and energy efficiencies (SE/EE). In this work, we first identify the essential properties of optimal quantization and leverage them to derive a closed-form approximati… ▽ More Low-resolution analog-to-digital converters (ADCs) have emerged as a promising technology for reducing power consumption and complexity in massive multiple-input multiple-output (MIMO) systems while maintaining satisfactory spectral and energy efficiencies (SE/EE). In this work, we first identify the essential properties of optimal quantization and leverage them to derive a closed-form approximation of the covariance matrix of the quantization distortion. The theoretical finding facilitates the system SE analysis in the presence of low-resolution ADCs. We then focus on the joint optimization of the transmit-receive beamforming and bit allocation to maximize the SE under constraints on the transmit power and the total number of active ADC bits. To solve the resulting mixed-integer problem, we first develop an efficient beamforming design for fixed ADC resolutions. Then, we propose a low-complexity heuristic algorithm to iteratively optimize the ADC resolutions and beamforming matrices. Numerical results for a $64 \times 64$ MIMO system demonstrate that the proposed design offers $6\%$ improvement in both SE and EE with $40\%$ fewer active ADC bits compared with the uniform bit allocation. Furthermore, we numerically show that receiving more data streams with low-resolution ADCs can achieve higher SE and EE compared to receiving fewer data streams with high-resolution ADCs. △ Less

Submitted 4 July, 2024; originally announced July 2024.

Comments: 13 pages, 14 figures

arXiv:2407.03788 [pdf, other]

Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning

Authors: Thong Nguyen, Yi Bin, Xiaobao Wu, Xinshuai Dong, Zhiyuan Hu, Khoi Le, Cong-Duy Nguyen, See-Kiong Ng, Luu Anh Tuan

Abstract: Data quality stands at the forefront of deciding the effectiveness of video-language representation learning. However, video-text pairs in previous data typically do not align perfectly with each other, which might lead to video-language representations that do not accurately reflect cross-modal semantics. Moreover, previous data also possess an uneven distribution of concepts, thereby hampering t… ▽ More Data quality stands at the forefront of deciding the effectiveness of video-language representation learning. However, video-text pairs in previous data typically do not align perfectly with each other, which might lead to video-language representations that do not accurately reflect cross-modal semantics. Moreover, previous data also possess an uneven distribution of concepts, thereby hampering the downstream performance across unpopular subjects. To address these problems, we propose a contrastive objective with a subtractive angular margin to regularize cross-modal representations in their effort to reach perfect similarity. Furthermore, to adapt to the non-uniform concept distribution, we propose a multi-layer perceptron (MLP)-parameterized weighting function that maps loss values to sample weights which enable dynamic adjustment of the model's focus throughout the training. With the training guided by a small amount of unbiased meta-data and augmented by video-text data generated by large vision-language model, we improve video-language representations and achieve superior performances on commonly used video question answering and text-video retrieval datasets. △ Less

Submitted 4 July, 2024; originally announced July 2024.

Comments: Accepted to ECCV 2024

arXiv:2407.03665 [pdf, other]

Heterogeneous Hypergraph Embedding for Recommendation Systems

Authors: Darnbi Sakong, Viet Hung Vu, Thanh Trung Huynh, Phi Le Nguyen, Hongzhi Yin, Quoc Viet Hung Nguyen, Thanh Tam Nguyen

Abstract: Recent advancements in recommender systems have focused on integrating knowledge graphs (KGs) to leverage their auxiliary information. The core idea of KG-enhanced recommenders is to incorporate rich semantic information for more accurate recommendations. However, two main challenges persist: i) Neglecting complex higher-order interactions in the KG-based user-item network, potentially leading to… ▽ More Recent advancements in recommender systems have focused on integrating knowledge graphs (KGs) to leverage their auxiliary information. The core idea of KG-enhanced recommenders is to incorporate rich semantic information for more accurate recommendations. However, two main challenges persist: i) Neglecting complex higher-order interactions in the KG-based user-item network, potentially leading to sub-optimal recommendations, and ii) Dealing with the heterogeneous modalities of input sources, such as user-item bipartite graphs and KGs, which may introduce noise and inaccuracies. To address these issues, we present a novel Knowledge-enhanced Heterogeneous Hypergraph Recommender System (KHGRec). KHGRec captures group-wise characteristics of both the interaction network and the KG, modeling complex connections in the KG. Using a collaborative knowledge heterogeneous hypergraph (CKHG), it employs two hypergraph encoders to model group-wise interdependencies and ensure explainability. Additionally, it fuses signals from the input graphs with cross-view self-supervised learning and attention mechanisms. Extensive experiments on four real-world datasets show our model's superiority over various state-of-the-art baselines, with an average 5.18\% relative improvement. Additional tests on noise resilience, missing data, and cold-start problems demonstrate the robustness of our KHGRec framework. Our model and evaluation datasets are publicly available at \url{https://github.com/viethungvu1998/KHGRec}. △ Less

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2407.03611 [pdf, other]

An Empirical Study on Capability of Large Language Models in Understanding Code Semantics

Authors: Thu-Trang Nguyen, Thanh Trong Vu, Hieu Dinh Vo, Son Nguyen

Abstract: Large Language Models for Code (code LLMs) have demonstrated remarkable performance across various software engineering (SE) tasks, increasing the application of code LLMs in software development. Despite the success of code LLMs, there remain significant concerns about the actual capabilities and reliability of these models, "whether these models really learn the semantics of code from the traini… ▽ More Large Language Models for Code (code LLMs) have demonstrated remarkable performance across various software engineering (SE) tasks, increasing the application of code LLMs in software development. Despite the success of code LLMs, there remain significant concerns about the actual capabilities and reliability of these models, "whether these models really learn the semantics of code from the training data and leverage the learned knowledge to perform the SE tasks". In this paper, we introduce EMPICA, a comprehensive framework designed to systematically and empirically evaluate the capabilities of code LLMs in understanding code semantics. Specifically, EMPICA systematically introduces controlled modifications/transformations into the input code and examines the models' responses. Generally, code LLMs must be robust to semantically equivalent code inputs and be sensitive to non-equivalent ones for all SE tasks. Specifically, for every SE task, given an input code snippet c and its semantic equivalent variants, code LLMs must robustly produce consistent/equivalent outputs while they are expected to generate different outputs for c and its semantic non-equivalent variants. Our experimental results on three representative code understanding tasks, including code summarization, method name prediction, and output prediction, reveal that the robustness and sensitivity of the state-of-the-art code LLMs to code transformations vary significantly across tasks and transformation operators. In addition, the code LLMs exhibit better robustness to the semantic preserving transformations than their sensitivity to the semantic non-preserving transformations. These results highlight a need to enhance the model's capabilities of understanding code semantics, especially the sensitivity property. △ Less

Submitted 3 July, 2024; originally announced July 2024.

arXiv:2407.03144 [pdf, other]

Venomancer: Towards Imperceptible and Target-on-Demand Backdoor Attacks in Federated Learning

Authors: Son Nguyen, Thinh Nguyen, Khoa D Doan, Kok-Seng Wong

Abstract: Federated Learning (FL) is a distributed machine learning approach that maintains data privacy by training on decentralized data sources. Similar to centralized machine learning, FL is also susceptible to backdoor attacks, where an attacker can compromise some clients by injecting a backdoor trigger into local models of those clients, leading to the global model's behavior being manipulated as des… ▽ More Federated Learning (FL) is a distributed machine learning approach that maintains data privacy by training on decentralized data sources. Similar to centralized machine learning, FL is also susceptible to backdoor attacks, where an attacker can compromise some clients by injecting a backdoor trigger into local models of those clients, leading to the global model's behavior being manipulated as desired by the attacker. Most backdoor attacks in FL assume a predefined target class and require control over a large number of clients or knowledge of benign clients' information. Furthermore, they are not imperceptible and are easily detected by human inspection due to clear artifacts left on the poison data. To overcome these challenges, we propose Venomancer, an effective backdoor attack that is imperceptible and allows target-on-demand. Specifically, imperceptibility is achieved by using a visual loss function to make the poison data visually indistinguishable from the original data. Target-on-demand property allows the attacker to choose arbitrary target classes via conditional adversarial training. Additionally, experiments showed that the method is robust against state-of-the-art defenses such as Norm Clipping, Weak DP, Krum, Multi-Krum, RLR, FedRAD, Deepsight, and RFLBAT. The source code is available at https://github.com/nguyenhongson1902/Venomancer. △ Less

Submitted 11 July, 2024; v1 submitted 3 July, 2024; originally announced July 2024.

arXiv:2407.03110 [pdf, other]

A Toolchain for Comprehensive Audio/Video Analysis Using Deep Learning Based Multimodal Approach (A use case of riot or violent context detection)

Authors: Lam Pham, Phat Lam, Tin Nguyen, Hieu Tang, Alexander Schindler

Abstract: In this paper, we present a toolchain for a comprehensive audio/video analysis by leveraging deep learning based multimodal approach. To this end, different specific tasks of Speech to Text (S2T), Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), Visual Object Detection (VOD), Image Captioning (IC), and Video Captioning (VC) are conducted and integrated into the toolchain. By co… ▽ More In this paper, we present a toolchain for a comprehensive audio/video analysis by leveraging deep learning based multimodal approach. To this end, different specific tasks of Speech to Text (S2T), Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), Visual Object Detection (VOD), Image Captioning (IC), and Video Captioning (VC) are conducted and integrated into the toolchain. By combining individual tasks and analyzing both audio \& visual data extracted from input video, the toolchain offers various audio/video-based applications: Two general applications of audio/video clustering, comprehensive audio/video summary and a specific application of riot or violent context detection. Furthermore, the toolchain presents a flexible and adaptable architecture that is effective to integrate new models for further audio/video-based applications. △ Less

Submitted 2 May, 2024; originally announced July 2024.

arXiv:2407.02966 [pdf, other]

Efficient Forward-Mode Algorithmic Derivatives of Geant4

Authors: Max Aehle, Xuan Tung Nguyen, Mihály Novák, Tommaso Dorigo, Nicolas R. Gauger, Jan Kieseler, Markus Klute, Vassil Vassilev

Abstract: We have applied an operator-overloading forward-mode algorithmic differentiation tool to the Monte-Carlo particle simulation toolkit Geant4. Our differentiated version of Geant4 allows computing mean pathwise derivatives of user-defined outputs of Geant4 applications with respect to user-defined inputs. This constitutes a major step towards enabling gradient-based optimization techniques in high-e… ▽ More We have applied an operator-overloading forward-mode algorithmic differentiation tool to the Monte-Carlo particle simulation toolkit Geant4. Our differentiated version of Geant4 allows computing mean pathwise derivatives of user-defined outputs of Geant4 applications with respect to user-defined inputs. This constitutes a major step towards enabling gradient-based optimization techniques in high-energy physics, as well as other application domains of Geant4. This is a preliminary report on the technical aspects of applying operator-overloading AD to Geant4, as well as a first analysis of some results obtained by our differentiated Geant4 prototype. We plan to follow up with a more refined analysis. △ Less

Submitted 3 July, 2024; originally announced July 2024.

arXiv:2407.02828 [pdf]

Quantum Serverless Paradigm and Application Development using the QFaaS Framework

Authors: Hoa T. Nguyen, Bui Binh An Pham, Muhammad Usman, Rajkumar Buyya

Abstract: Quantum computing has the potential to solve complex problems beyond the capabilities of classical computers. However, its practical use is currently limited due to early-stage quantum software engineering and the constraints of Noisy Intermediate-Scale Quantum (NISQ) devices. To address this issue, this chapter introduces the concept of serverless quantum computing with examples using QFaaS, a pr… ▽ More Quantum computing has the potential to solve complex problems beyond the capabilities of classical computers. However, its practical use is currently limited due to early-stage quantum software engineering and the constraints of Noisy Intermediate-Scale Quantum (NISQ) devices. To address this issue, this chapter introduces the concept of serverless quantum computing with examples using QFaaS, a practical Quantum Function-as-a-Service framework. This framework utilizes the serverless computing model to simplify quantum application development and deployment by abstracting the complexities of quantum hardware and enhancing application portability across different quantum software development kits and quantum backends. The chapter provides comprehensive documentation and guidelines for deploying and using QFaaS, detailing the setup, component deployment, and examples of service-oriented quantum applications. This framework offers a promising approach to overcoming current limitations and advancing the practical software engineering of quantum computing. △ Less

Submitted 3 July, 2024; originally announced July 2024.

Comments: Guidelines for deploying and using the QFaaS Framework (for the original paper, see https://doi.org/10.1016/j.future.2024.01.018)

arXiv:2407.02748 [pdf, other]

DRLQ: A Deep Reinforcement Learning-based Task Placement for Quantum Cloud Computing

Authors: Hoa T. Nguyen, Muhammad Usman, Rajkumar Buyya

Abstract: The quantum cloud computing paradigm presents unique challenges in task placement due to the dynamic and heterogeneous nature of quantum computation resources. Traditional heuristic approaches fall short in adapting to the rapidly evolving landscape of quantum computing. This paper proposes DRLQ, a novel Deep Reinforcement Learning (DRL)-based technique for task placement in quantum cloud computin… ▽ More The quantum cloud computing paradigm presents unique challenges in task placement due to the dynamic and heterogeneous nature of quantum computation resources. Traditional heuristic approaches fall short in adapting to the rapidly evolving landscape of quantum computing. This paper proposes DRLQ, a novel Deep Reinforcement Learning (DRL)-based technique for task placement in quantum cloud computing environments, addressing the optimization of task completion time and quantum task scheduling efficiency. It leverages the Deep Q Network (DQN) architecture, enhanced with the Rainbow DQN approach, to create a dynamic task placement strategy. This approach is one of the first in the field of quantum cloud resource management, enabling adaptive learning and decision-making for quantum cloud environments and effectively optimizing task placement based on changing conditions and resource availability. We conduct extensive experiments using the QSimPy simulation toolkit to evaluate the performance of our method, demonstrating substantial improvements in task execution efficiency and a reduction in the need to reschedule quantum tasks. Our results show that utilizing the DRLQ approach for task placement can significantly reduce total quantum task completion time by 37.81% to 72.93% and prevent task rescheduling attempts compared to other heuristic approaches. △ Less

Submitted 2 July, 2024; originally announced July 2024.

Comments: Accepted paper at IEEE CLOUD 2024 conference

arXiv:2407.02190 [pdf, other]

I2EKF-LO: A Dual-Iteration Extended Kalman Filter Based LiDAR Odometry

Authors: Wenlu Yu, Jie Xu, Chengwei Zhao, Lijun Zhao, Thien-Minh Nguyen, Shenghai Yuan, Mingming Bai, Lihua Xie

Abstract: LiDAR odometry is a pivotal technology in the fields of autonomous driving and autonomous mobile robotics. However, most of the current works focus on nonlinear optimization methods, and still existing many challenges in using the traditional Iterative Extended Kalman Filter (IEKF) framework to tackle the problem: IEKF only iterates over the observation equation, relying on a rough estimate of the… ▽ More LiDAR odometry is a pivotal technology in the fields of autonomous driving and autonomous mobile robotics. However, most of the current works focus on nonlinear optimization methods, and still existing many challenges in using the traditional Iterative Extended Kalman Filter (IEKF) framework to tackle the problem: IEKF only iterates over the observation equation, relying on a rough estimate of the initial state, which is insufficient to fully eliminate motion distortion in the input point cloud; the system process noise is difficult to be determined during state estimation of the complex motions; and the varying motion models across different sensor carriers. To address these issues, we propose the Dual-Iteration Extended Kalman Filter (I2EKF) and the LiDAR odometry based on I2EKF (I2EKF-LO). This approach not only iterates over the observation equation but also leverages state updates to iteratively mitigate motion distortion in LiDAR point clouds. Moreover, it dynamically adjusts process noise based on the confidence level of prior predictions during state estimation and establishes motion models for different sensor carriers to achieve accurate and efficient state estimation. Comprehensive experiments demonstrate that I2EKF-LO achieves outstanding levels of accuracy and computational efficiency in the realm of LiDAR odometry. Additionally, to foster community development, our code is open-sourced.https://github.com/YWL0720/I2EKF-LO. △ Less

Submitted 2 July, 2024; originally announced July 2024.

Comments: Accepted by IROS 2024

arXiv:2407.01987 [pdf, other]

AHMsys: An Automated HVAC Modeling System for BIM Project

Authors: Long Hoang Dang, Duy-Hung Nguyen, Thai Quang Le, Thinh Truong Nguyen, Clark Mei, Vu Hoang

Abstract: This paper presents a novel system, named AHMsys, designed to automate the process of generating 3D Heating, Ventilation, and Air Conditioning (HVAC) models from 2D Computer-Aided Design (CAD) drawings, a key component of Building Information Modeling (BIM). By automatically preprocessing and extracting essential HVAC object information then creating detailed 3D models, our proposed AHMsys signifi… ▽ More This paper presents a novel system, named AHMsys, designed to automate the process of generating 3D Heating, Ventilation, and Air Conditioning (HVAC) models from 2D Computer-Aided Design (CAD) drawings, a key component of Building Information Modeling (BIM). By automatically preprocessing and extracting essential HVAC object information then creating detailed 3D models, our proposed AHMsys significantly reduced the 20 percent work schedule of the BIM process in Akila. This advancement highlights the essential impact of integrating AI technologies in managing the lifecycle of a digital representation of the building. △ Less

Submitted 2 July, 2024; originally announced July 2024.

arXiv:2407.01963 [pdf, other]

Towards Unsupervised Speaker Diarization System for Multilingual Telephone Calls Using Pre-trained Whisper Model and Mixture of Sparse Autoencoders

Authors: Phat Lam, Lam Pham, Tin Nguyen, Hieu Tang, Thinh Pham, Loi Khanh Nguyen, Alexander Schindler

Abstract: Existing speaker diarization systems heavily rely on large amounts of manually annotated data, which is labor-intensive and challenging to collect in real-world scenarios. Additionally, the language-specific constraint in speaker diarization systems significantly hinders their applicability and scalability in multilingual settings. In this paper, we therefore propose a cluster-based speaker diariz… ▽ More Existing speaker diarization systems heavily rely on large amounts of manually annotated data, which is labor-intensive and challenging to collect in real-world scenarios. Additionally, the language-specific constraint in speaker diarization systems significantly hinders their applicability and scalability in multilingual settings. In this paper, we therefore propose a cluster-based speaker diarization system for multilingual telephone call applications. The proposed system supports multiple languages and does not require large-scale annotated data for the training process as leveraging the multilingual Whisper model to extract speaker embeddings and proposing a novel Mixture of Sparse Autoencoders (Mix-SAE) network architecture for unsupervised speaker clustering. Experimental results on the evaluating dataset derived from two-speaker subsets of CALLHOME and CALLFRIEND telephonic speech corpora demonstrate superior efficiency of the proposed Mix-SAE network to other autoencoder-based clustering methods. The overall performance of our proposed system also indicates the promising potential of our approach in developing unsupervised multilingual speaker diarization applications within the context of limited annotated data and enhancing the integration ability into comprehensive multi-task speech analysis systems (i.e. multiple tasks of speech-to-text, language detection, speaker diarization integrated in a low-complexity system). △ Less

Submitted 7 July, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

Comments: 8 pages, 7 figures

arXiv:2407.01777 [pdf, other]

Deepfake Audio Detection Using Spectrogram-based Feature and Ensemble of Deep Learning Models

Authors: Lam Pham, Phat Lam, Truong Nguyen, Huyen Nguyen, Alexander Schindler

Abstract: In this paper, we propose a deep learning based system for the task of deepfake audio detection. In particular, the draw input audio is first transformed into various spectrograms using three transformation methods of Short-time Fourier Transform (STFT), Constant-Q Transform (CQT), Wavelet Transform (WT) combined with different auditory-based filters of Mel, Gammatone, linear filters (LF), and dis… ▽ More In this paper, we propose a deep learning based system for the task of deepfake audio detection. In particular, the draw input audio is first transformed into various spectrograms using three transformation methods of Short-time Fourier Transform (STFT), Constant-Q Transform (CQT), Wavelet Transform (WT) combined with different auditory-based filters of Mel, Gammatone, linear filters (LF), and discrete cosine transform (DCT). Given the spectrograms, we evaluate a wide range of classification models based on three deep learning approaches. The first approach is to train directly the spectrograms using our proposed baseline models of CNN-based model (CNN-baseline), RNN-based model (RNN-baseline), C-RNN model (C-RNN baseline). Meanwhile, the second approach is transfer learning from computer vision models such as ResNet-18, MobileNet-V3, EfficientNet-B0, DenseNet-121, SuffleNet-V2, Swint, Convnext-Tiny, GoogLeNet, MNASsnet, RegNet. In the third approach, we leverage the state-of-the-art audio pre-trained models of Whisper, Seamless, Speechbrain, and Pyannote to extract audio embeddings from the input spectrograms. Then, the audio embeddings are explored by a Multilayer perceptron (MLP) model to detect the fake or real audio samples. Finally, high-performance deep learning models from these approaches are fused to achieve the best performance. We evaluated our proposed models on ASVspoof 2019 benchmark dataset. Our best ensemble model achieved an Equal Error Rate (EER) of 0.03, which is highly competitive to top-performing systems in the ASVspoofing 2019 challenge. Experimental results also highlight the potential of selective spectrograms and deep learning approaches to enhance the task of audio deepfake detection. △ Less

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.01110 [pdf]

SecGenAI: Enhancing Security of Cloud-based Generative AI Applications within Australian Critical Technologies of National Interest

Authors: Christoforus Yoga Haryanto, Minh Hieu Vu, Trung Duc Nguyen, Emily Lomempow, Yulia Nurliana, Sona Taheri

Abstract: The rapid advancement of Generative AI (GenAI) technologies offers transformative opportunities within Australia's critical technologies of national interest while introducing unique security challenges. This paper presents SecGenAI, a comprehensive security framework for cloud-based GenAI applications, with a focus on Retrieval-Augmented Generation (RAG) systems. SecGenAI addresses functional, in… ▽ More The rapid advancement of Generative AI (GenAI) technologies offers transformative opportunities within Australia's critical technologies of national interest while introducing unique security challenges. This paper presents SecGenAI, a comprehensive security framework for cloud-based GenAI applications, with a focus on Retrieval-Augmented Generation (RAG) systems. SecGenAI addresses functional, infrastructure, and governance requirements, integrating end-to-end security analysis to generate specifications emphasizing data privacy, secure deployment, and shared responsibility models. Aligned with Australian Privacy Principles, AI Ethics Principles, and guidelines from the Australian Cyber Security Centre and Digital Transformation Agency, SecGenAI mitigates threats such as data leakage, adversarial attacks, and model inversion. The framework's novel approach combines advanced machine learning techniques with robust security measures, ensuring compliance with Australian regulations while enhancing the reliability and trustworthiness of GenAI systems. This research contributes to the field of intelligent systems by providing actionable strategies for secure GenAI implementation in industry, fostering innovation in AI applications, and safeguarding national interests. △ Less

Submitted 1 July, 2024; originally announced July 2024.

Comments: 10 pages, 4 figures, 9 tables, submitted to the 2024 11th International Conference on Soft Computing & Machine Intelligence (ISCMI 2024)

arXiv:2407.00895 [pdf, other]

Large-Amplitude, Easy-Plane Spin-Orbit Torque Oscillators Driven by Out-of-Plane Spin Current: A Micromagnetic Study

Authors: Daniel Kubler, David A. Smith, Tommy Nguyen, Fernando Ramos-Diaz, Satoru Emori, Vivek P. Amin

Abstract: Spin torque oscillators are spintronic devices that generate a periodic output signal from a non-periodic input, making them promising candidates for applications like microwave communications and neuromorphic computing. However, traditional spin torque oscillators suffer from a limited precessional cone angle and thermal stability, as well as a need for an applied bias magnetic field. We use micr… ▽ More Spin torque oscillators are spintronic devices that generate a periodic output signal from a non-periodic input, making them promising candidates for applications like microwave communications and neuromorphic computing. However, traditional spin torque oscillators suffer from a limited precessional cone angle and thermal stability, as well as a need for an applied bias magnetic field. We use micromagnetic simulations to demonstrate a novel spin torque oscillator that relies on spin-orbit effects in ferromagnets to overcome these limitations. The key mechanism behind this oscillator is the generation of an out-of-plane spin current, in which both the spin flow and the spin orientation are out-of-plane. The torque from this spin current enables easy-plane coherent magnetic precession with a large cone angle and high thermal stability over a micron-scale lateral area. Moreover, the precession occurs about an internal field in the free layer, thereby eliminating the need for an external bias field. We demonstrate the feasibility of an easy-plane spin-orbit torque oscillator at room temperature over a wide parameter space, including the ratio of the out-of-plane spin current to the conventional spin-Hall spin current, presenting exciting possibilities for this novel spintronic device. △ Less

Submitted 30 June, 2024; originally announced July 2024.

arXiv:2407.00710 [pdf, other]

Weighted Missing Linear Discriminant Analysis: An Explainable Approach for Classification with Missing Data

Authors: Tuan L. Vo, Uyen Dang, Thu Nguyen

Abstract: As Artificial Intelligence (AI) models are gradually being adopted in real-life applications, the explainability of the model used is critical, especially in high-stakes areas such as medicine, finance, etc. Among the commonly used models, Linear Discriminant Analysis (LDA) is a widely used classification tool that is also explainable thanks to its ability to model class distributions and maximize… ▽ More As Artificial Intelligence (AI) models are gradually being adopted in real-life applications, the explainability of the model used is critical, especially in high-stakes areas such as medicine, finance, etc. Among the commonly used models, Linear Discriminant Analysis (LDA) is a widely used classification tool that is also explainable thanks to its ability to model class distributions and maximize class separation through linear feature combinations. Nevertheless, real-world data is frequently incomplete, presenting significant challenges for classification tasks and model explanations. In this paper, we propose a novel approach to LDA under missing data, termed \textbf{\textit{Weighted missing Linear Discriminant Analysis (WLDA)}}, to directly classify observations in data that contains missing values without imputation effectively by estimating the parameters directly on missing data and use a weight matrix for missing values to penalize missing entries during classification. Furthermore, we also analyze the theoretical properties and examine the explainability of the proposed technique in a comprehensive manner. Experimental results demonstrate that WLDA outperforms conventional methods by a significant margin, particularly in scenarios where missing values are present in both training and test sets. △ Less

Submitted 30 June, 2024; originally announced July 2024.

arXiv:2407.00609 [pdf, other]

ESGNN: Towards Equivariant Scene Graph Neural Network for 3D Scene Understanding

Authors: Quang P. M. Pham, Khoi T. N. Nguyen, Lan C. Ngo, Truong Do, Truong Son Hy

Abstract: Scene graphs have been proven to be useful for various scene understanding tasks due to their compact and explicit nature. However, existing approaches often neglect the importance of maintaining the symmetry-preserving property when generating scene graphs from 3D point clouds. This oversight can diminish the accuracy and robustness of the resulting scene graphs, especially when handling noisy, m… ▽ More Scene graphs have been proven to be useful for various scene understanding tasks due to their compact and explicit nature. However, existing approaches often neglect the importance of maintaining the symmetry-preserving property when generating scene graphs from 3D point clouds. This oversight can diminish the accuracy and robustness of the resulting scene graphs, especially when handling noisy, multi-view 3D data. This work, to the best of our knowledge, is the first to implement an Equivariant Graph Neural Network in semantic scene graph generation from 3D point clouds for scene understanding. Our proposed method, ESGNN, outperforms existing state-of-the-art approaches, demonstrating a significant improvement in scene estimation with faster convergence. ESGNN demands low computational resources and is easy to implement from available frameworks, paving the way for real-time applications such as robotics and computer vision. △ Less

Submitted 30 June, 2024; originally announced July 2024.

arXiv:2407.00535 [pdf, other]

AI-powered multimodal modeling of personalized hemodynamics in aortic stenosis

Authors: Caglar Ozturk, Daniel H. Pak, Luca Rosalia, Debkalpa Goswami, Mary E. Robakowski, Raymond McKay, Christopher T. Nguyen, James S. Duncan, Ellen T. Roche

Abstract: Aortic stenosis (AS) is the most common valvular heart disease in developed countries. High-fidelity preclinical models can improve AS management by enabling therapeutic innovation, early diagnosis, and tailored treatment planning. However, their use is currently limited by complex workflows necessitating lengthy expert-driven manual operations. Here, we propose an AI-powered computational framewo… ▽ More Aortic stenosis (AS) is the most common valvular heart disease in developed countries. High-fidelity preclinical models can improve AS management by enabling therapeutic innovation, early diagnosis, and tailored treatment planning. However, their use is currently limited by complex workflows necessitating lengthy expert-driven manual operations. Here, we propose an AI-powered computational framework for accelerated and democratized patient-specific modeling of AS hemodynamics from computed tomography. First, we demonstrate that our automated meshing algorithms can generate task-ready geometries for both computational and benchtop simulations with higher accuracy and 100 times faster than existing approaches. Then, we show that our approach can be integrated with fluid-structure interaction and soft robotics models to accurately recapitulate a broad spectrum of clinical hemodynamic measurements of diverse AS patients. The efficiency and reliability of these algorithms make them an ideal complementary tool for personalized high-fidelity modeling of AS biomechanics, hemodynamics, and treatment planning. △ Less

Submitted 29 June, 2024; originally announced July 2024.

Comments: CO and DHP contributed equally to this work. JSD and ETR are corresponding authors

arXiv:2407.00411 [pdf, other]

Explainability of Machine Learning Models under Missing Data

Authors: Tuan L. Vo, Thu Nguyen, Hugo L. Hammer, Michael A. Riegler, Pal Halvorsen

Abstract: Missing data is a prevalent issue that can significantly impair model performance and interpretability. This paper briefly summarizes the development of the field of missing data with respect to Explainable Artificial Intelligence and experimentally investigates the effects of various imputation methods on the calculation of Shapley values, a popular technique for interpreting complex machine lear… ▽ More Missing data is a prevalent issue that can significantly impair model performance and interpretability. This paper briefly summarizes the development of the field of missing data with respect to Explainable Artificial Intelligence and experimentally investigates the effects of various imputation methods on the calculation of Shapley values, a popular technique for interpreting complex machine learning models. We compare different imputation strategies and assess their impact on feature importance and interaction as determined by Shapley values. Moreover, we also theoretically analyze the effects of missing values on Shapley values. Importantly, our findings reveal that the choice of imputation method can introduce biases that could lead to changes in the Shapley values, thereby affecting the interpretability of the model. Moreover, and that a lower test prediction mean square error (MSE) may not imply a lower MSE in Shapley values and vice versa. Also, while Xgboost is a method that could handle missing data directly, using Xgboost directly on missing data can seriously affect interpretability compared to imputing the data before training Xgboost. This study provides a comprehensive evaluation of imputation methods in the context of model interpretation, offering practical guidance for selecting appropriate techniques based on dataset characteristics and analysis objectives. The results underscore the importance of considering imputation effects to ensure robust and reliable insights from machine learning models. △ Less

Submitted 29 June, 2024; originally announced July 2024.

arXiv:2406.20077 [pdf, other]

HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Model

Authors: Hieu T. Nguyen, Yiwen Chen, Vikram Voleti, Varun Jampani, Huaizu Jiang

Abstract: We introduce HouseCrafter, a novel approach that can lift a floorplan into a complete large 3D indoor scene (e.g., a house). Our key insight is to adapt a 2D diffusion model, which is trained on web-scale images, to generate consistent multi-view color (RGB) and depth (D) images across different locations of the scene. Specifically, the RGB-D images are generated autoregressively in a batch-wise m… ▽ More We introduce HouseCrafter, a novel approach that can lift a floorplan into a complete large 3D indoor scene (e.g., a house). Our key insight is to adapt a 2D diffusion model, which is trained on web-scale images, to generate consistent multi-view color (RGB) and depth (D) images across different locations of the scene. Specifically, the RGB-D images are generated autoregressively in a batch-wise manner along sampled locations based on the floorplan, where previously generated images are used as condition to the diffusion model to produce images at nearby locations. The global floorplan and attention design in the diffusion model ensures the consistency of the generated images, from which a 3D scene can be reconstructed. Through extensive evaluation on the 3D-Front dataset, we demonstrate that HouseCraft can generate high-quality house-scale 3D scenes. Ablation studies also validate the effectiveness of different design choices. We will release our code and model weights. Project page: https://neu-vi.github.io/houseCrafter/ △ Less

Submitted 28 June, 2024; originally announced June 2024.

arXiv:2406.19753 [pdf, other]

Backdoor Attack in Prompt-Based Continual Learning

Authors: Trang Nguyen, Anh Tran, Nhat Ho

Abstract: Prompt-based approaches offer a cutting-edge solution to data privacy issues in continual learning, particularly in scenarios involving multiple data suppliers where long-term storage of private user data is prohibited. Despite delivering state-of-the-art performance, its impressive remembering capability can become a double-edged sword, raising security concerns as it might inadvertently retain p… ▽ More Prompt-based approaches offer a cutting-edge solution to data privacy issues in continual learning, particularly in scenarios involving multiple data suppliers where long-term storage of private user data is prohibited. Despite delivering state-of-the-art performance, its impressive remembering capability can become a double-edged sword, raising security concerns as it might inadvertently retain poisoned knowledge injected during learning from private user data. Following this insight, in this paper, we expose continual learning to a potential threat: backdoor attack, which drives the model to follow a desired adversarial target whenever a specific trigger is present while still performing normally on clean samples. We highlight three critical challenges in executing backdoor attacks on incremental learners and propose corresponding solutions: (1) \emph{Transferability}: We employ a surrogate dataset and manipulate prompt selection to transfer backdoor knowledge to data from other suppliers; (2) \emph{Resiliency}: We simulate static and dynamic states of the victim to ensure the backdoor trigger remains robust during intense incremental learning processes; and (3) \emph{Authenticity}: We apply binary cross-entropy loss as an anti-cheating factor to prevent the backdoor trigger from devolving into adversarial noise. Extensive experiments across various benchmark datasets and continual learners validate our continual backdoor framework, achieving up to $100\%$ attack success rate, with further ablation studies confirming our contributions' effectiveness. △ Less

Submitted 28 June, 2024; originally announced June 2024.

arXiv:2406.19445 [pdf, other]

X-Ray Constraints on Dark Photon Tridents

Authors: Tim Linden, Thong T. Q. Nguyen, Tim M. P. Tait

Abstract: Dark photons that are sufficiently light and/or weakly-interacting represent a compelling vision of dark matter. Dark photon decay into three photons, which we call the dark photon trident, can be the dominant channel when the dark photon mass falls below the electron pair threshold and can produce a significant flux of x-rays. We use 16 years of data from INTEGRAL/SPI to constrain sub-MeV dark ph… ▽ More Dark photons that are sufficiently light and/or weakly-interacting represent a compelling vision of dark matter. Dark photon decay into three photons, which we call the dark photon trident, can be the dominant channel when the dark photon mass falls below the electron pair threshold and can produce a significant flux of x-rays. We use 16 years of data from INTEGRAL/SPI to constrain sub-MeV dark photon decay, producing new worlds-best constraints on the kinetic mixing parameter for dark photon masses between 61 keV and 1022 keV, and comment on the potential for future x-ray observatories to discover the trident decay process. △ Less

Submitted 27 June, 2024; originally announced June 2024.

Comments: 4+3 pages, 4 figures. Comments are welcome!

arXiv:2406.18851 [pdf, other]

LICO: Large Language Models for In-Context Molecular Optimization

Authors: Tung Nguyen, Aditya Grover

Abstract: Optimizing black-box functions is a fundamental problem in science and engineering. To solve this problem, many approaches learn a surrogate function that estimates the underlying objective from limited historical evaluations. Large Language Models (LLMs), with their strong pattern-matching capabilities via pretraining on vast amounts of data, stand out as a potential candidate for surrogate model… ▽ More Optimizing black-box functions is a fundamental problem in science and engineering. To solve this problem, many approaches learn a surrogate function that estimates the underlying objective from limited historical evaluations. Large Language Models (LLMs), with their strong pattern-matching capabilities via pretraining on vast amounts of data, stand out as a potential candidate for surrogate modeling. However, directly prompting a pretrained language model to produce predictions is not feasible in many scientific domains due to the scarcity of domain-specific data in the pretraining corpora and the challenges of articulating complex problems in natural language. In this work, we introduce LICO, a general-purpose model that extends arbitrary base LLMs for black-box optimization, with a particular application to the molecular domain. To achieve this, we equip the language model with a separate embedding layer and prediction layer, and train the model to perform in-context predictions on a diverse set of functions defined over the domain. Once trained, LICO can generalize to unseen molecule properties simply via in-context prompting. LICO achieves state-of-the-art performance on PMO, a challenging molecular optimization benchmark comprising over 20 objective functions. △ Less

Submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.17381 [pdf, other]

Forget but Recall: Incremental Latent Rectification in Continual Learning

Authors: Nghia D. Nguyen, Hieu Trung Nguyen, Ang Li, Hoang Pham, Viet Anh Nguyen, Khoa D. Doan

Abstract: Intrinsic capability to continuously learn a changing data stream is a desideratum of deep neural networks (DNNs). However, current DNNs suffer from catastrophic forgetting, which hinders remembering past knowledge. To mitigate this issue, existing Continual Learning (CL) approaches either retain exemplars for replay, regularize learning, or allocate dedicated capacity for new tasks. This paper in… ▽ More Intrinsic capability to continuously learn a changing data stream is a desideratum of deep neural networks (DNNs). However, current DNNs suffer from catastrophic forgetting, which hinders remembering past knowledge. To mitigate this issue, existing Continual Learning (CL) approaches either retain exemplars for replay, regularize learning, or allocate dedicated capacity for new tasks. This paper investigates an unexplored CL direction for incremental learning called Incremental Latent Rectification or ILR. In a nutshell, ILR learns to propagate with correction (or rectify) the representation from the current trained DNN backward to the representation space of the old task, where performing predictive decisions is easier. This rectification process only employs a chain of small representation mapping networks, called rectifier units. Empirical experiments on several continual learning benchmarks, including CIFAR10, CIFAR100, and Tiny ImageNet, demonstrate the effectiveness and potential of this novel CL direction compared to existing representative CL methods. △ Less

Submitted 25 June, 2024; originally announced June 2024.

arXiv:2406.17376 [pdf, other]

Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection

Authors: Duc-Tuan Truong, Ruijie Tao, Tuan Nguyen, Hieu-Thi Luong, Kong Aik Lee, Eng Siong Chng

Abstract: Recent synthetic speech detectors leveraging the Transformer model have superior performance compared to the convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of the multi-head self-attention (MHSA) in the Transformer model, which learns the temporal relationship of each input token. However, artifacts of synthetic speech can be located in sp… ▽ More Recent synthetic speech detectors leveraging the Transformer model have superior performance compared to the convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of the multi-head self-attention (MHSA) in the Transformer model, which learns the temporal relationship of each input token. However, artifacts of synthetic speech can be located in specific regions of both frequency channels and temporal segments, while MHSA neglects this temporal-channel dependency of the input sequence. In this work, we proposed a Temporal-Channel Modeling (TCM) module to enhance MHSA's capability for capturing temporal-channel dependencies. Experimental results on the ASVspoof 2021 show that with only 0.03M additional parameters, the TCM module can outperform the state-of-the-art system by 9.25% in EER. Further ablation study reveals that utilizing both temporal and channel information yields the most improvement for detecting synthetic speech. △ Less

Submitted 25 June, 2024; originally announced June 2024.

Comments: Accepted by INTERSPEECH 2024

arXiv:2406.16777 [pdf, other]

Blending LLMs into Cascaded Speech Translation: KIT's Offline Speech Translation System for IWSLT 2024

Authors: Sai Koneru, Thai-Binh Nguyen, Ngoc-Quan Pham, Danni Liu, Zhaolin Li, Alexander Waibel, Jan Niehues

Abstract: Large Language Models (LLMs) are currently under exploration for various tasks, including Automatic Speech Recognition (ASR), Machine Translation (MT), and even End-to-End Speech Translation (ST). In this paper, we present KIT's offline submission in the constrained + LLM track by incorporating recently proposed techniques that can be added to any cascaded speech translation. Specifically, we inte… ▽ More Large Language Models (LLMs) are currently under exploration for various tasks, including Automatic Speech Recognition (ASR), Machine Translation (MT), and even End-to-End Speech Translation (ST). In this paper, we present KIT's offline submission in the constrained + LLM track by incorporating recently proposed techniques that can be added to any cascaded speech translation. Specifically, we integrate Mistral-7B\footnote{mistralai/Mistral-7B-Instruct-v0.1} into our system to enhance it in two ways. Firstly, we refine the ASR outputs by utilizing the N-best lists generated by our system and fine-tuning the LLM to predict the transcript accurately. Secondly, we refine the MT outputs at the document level by fine-tuning the LLM, leveraging both ASR and MT predictions to improve translation quality. We find that integrating the LLM into the ASR and MT systems results in an absolute improvement of $0.3\%$ in Word Error Rate and $0.65\%$ in COMET for tst2019 test set. In challenging test sets with overlapping speakers and background noise, we find that integrating LLM is not beneficial due to poor ASR performance. Here, we use ASR with chunked long-form decoding to improve context usage that may be unavailable when transcribing with Voice Activity Detection segmentation alone. △ Less

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.16685 [pdf, other]

A locking-free isogeometric thin shell formulation based on higher order accurate local strain projection via approximate dual splines

Authors: Thi-Hoa Nguyen, René R. Hiemstra, Dominik Schillinger

Abstract: We present a novel isogeometric discretization approach for the Kirchhoff-Love shell formulation based on the Hellinger-Reissner variational principle. For mitigating membrane locking, we discretize the independent strains with spline basis functions that are one degree lower than those used for the displacements. To enable computationally efficient condensation of the independent strains, we firs… ▽ More We present a novel isogeometric discretization approach for the Kirchhoff-Love shell formulation based on the Hellinger-Reissner variational principle. For mitigating membrane locking, we discretize the independent strains with spline basis functions that are one degree lower than those used for the displacements. To enable computationally efficient condensation of the independent strains, we first discretize the variations of the independent strains with approximate dual splines to obtain a projection matrix that is close to a diagonal matrix. We then diagonalize this strain projection matrix via row-sum lumping. The combination of approximate dual test functions with row-sum lumping enables the direct condensation of the independent strain fields at the quadrature point level, while maintaining higher-order accuracy at optimal rates of convergence. We illustrate the numerical properties and the performance of our approach through numerical benchmarks, including a curved Euler-Bernoulli beam and the examples of the shell obstacle course. △ Less

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.16656 [pdf, ps, other]

Permutation Codes Correcting Multiple Deletions

Authors: Shuche Wang, The Nguyen, Yeow Meng Chee, Van Khu Vu

Abstract: Permutation codes in the Ulam metric, which can correct multiple deletions, have been investigated extensively recently owing to their applications. In this work, we are interested in the maximum size of the permutation codes in the Ulam metric and aim to design permutation codes that can correct multiple deletions with efficient decoding algorithms. We first present an improvement on the Gilbert-… ▽ More Permutation codes in the Ulam metric, which can correct multiple deletions, have been investigated extensively recently owing to their applications. In this work, we are interested in the maximum size of the permutation codes in the Ulam metric and aim to design permutation codes that can correct multiple deletions with efficient decoding algorithms. We first present an improvement on the Gilbert--Varshamov bound of the maximum size of these permutation codes which is the best-known lower bound. Next, we focus on designing permutation codes in the Ulam metric with a decoding algorithm. These constructed codes are the best-known permutation codes that can correct multiple deletions. In particular, the constructed permutation codes can correct $t$ deletions with at most $(3t-1) \log n+o(\log n)$ bits of redundancy where $n$ is the length of the code. Finally, we provide an efficient decoding algorithm for our constructed permutation codes. △ Less

Submitted 24 June, 2024; originally announced June 2024.

Comments: 9 pages

arXiv:2406.15749 [pdf, ps, other]

Decay of CP-even Higgs $H\rightarrow h γγ$ in Two Higgs Doublet Model: (I) one-loop analytic results, ward identity checks

Authors: Khiem Hong Phan, Dzung Tri Tran, Thanh Huy Nguyen

Abstract: We present the first analytical expressions for one-loop induced contributions for the decay channels of CP-even Higgs $H\rightarrow h γγ$ with $h$ being standard model-like Higgs boson within the framework of Two Higgs Doublet Model in this paper. One-loop form factors for the decay processes are written in terms of the scalar Passarino-Veltman functions following the general notations of the pac… ▽ More We present the first analytical expressions for one-loop induced contributions for the decay channels of CP-even Higgs $H\rightarrow h γγ$ with $h$ being standard model-like Higgs boson within the framework of Two Higgs Doublet Model in this paper. One-loop form factors for the decay processes are written in terms of the scalar Passarino-Veltman functions following the general notations of the package~{\tt LoopTools} as well as the library {\tt Collier}. Subsequently, physical results for the decay processes can be generated numerically by using one of the above-mentioned packages. The analytical expressions shown in this paper, are verified by several numerical checks, for examples, the ultraviolet (UV) and the infrared (IR) finiteness for one-loop amplitude. Furthermore, the amplitude must be followed the so-called ward identity due to on-shell photons in final states. The identity can also be tested numerically in this work. We find that the numerical results for the checks are good stability. In phenomenological studies, the differential decay rates as functions of the invariant of two photons in final state of $H\rightarrow h γγ$ are first studied in parameter space for all types of Two Higgs Doublet Models. △ Less

Submitted 22 June, 2024; originally announced June 2024.

Comments: 39 pages, 8 Figures, 9 Tables

Report number: DTU_2024-03

Showing 1–50 of 3,743 results for author: Nguyen, T