subscribe to arXiv mailings

Analyzing the Generalization and Reliability of Steering Vectors -- ICML 2024

Authors: Daniel Tan, David Chanin, Aengus Lynch, Dimitrios Kanoulas, Brooks Paige, Adria Garriga-Alonso, Robert Kirk

Abstract: Steering vectors (SVs) are a new approach to efficiently adjust language model behaviour at inference time by intervening on intermediate model activations. They have shown promise in terms of improving both capabilities and model alignment. However, the reliability and generalisation properties of this approach are unknown. In this work, we rigorously investigate these properties, and show that s… ▽ More Steering vectors (SVs) are a new approach to efficiently adjust language model behaviour at inference time by intervening on intermediate model activations. They have shown promise in terms of improving both capabilities and model alignment. However, the reliability and generalisation properties of this approach are unknown. In this work, we rigorously investigate these properties, and show that steering vectors have substantial limitations both in- and out-of-distribution. In-distribution, steerability is highly variable across different inputs. Depending on the concept, spurious biases can substantially contribute to how effective steering is for each input, presenting a challenge for the widespread use of steering vectors. Out-of-distribution, while steering vectors often generalise well, for several concepts they are brittle to reasonable changes in the prompt, resulting in them failing to generalise well. Overall, our findings show that while steering can work well in the right circumstances, there remain many technical difficulties of applying steering vectors to guide models' behaviour at scale. △ Less

Submitted 17 July, 2024; originally announced July 2024.

arXiv:2406.09801 [pdf, other]

RaNeuS: Ray-adaptive Neural Surface Reconstruction

Authors: Yida Wang, David Joseph Tan, Nassir Navab, Federico Tombari

Abstract: Our objective is to leverage a differentiable radiance field \eg NeRF to reconstruct detailed 3D surfaces in addition to producing the standard novel view renderings. There have been related methods that perform such tasks, usually by utilizing a signed distance field (SDF). However, the state-of-the-art approaches still fail to correctly reconstruct the small-scale details, such as the leaves, ro… ▽ More Our objective is to leverage a differentiable radiance field \eg NeRF to reconstruct detailed 3D surfaces in addition to producing the standard novel view renderings. There have been related methods that perform such tasks, usually by utilizing a signed distance field (SDF). However, the state-of-the-art approaches still fail to correctly reconstruct the small-scale details, such as the leaves, ropes, and textile surfaces. Considering that different methods formulate and optimize the projection from SDF to radiance field with a globally constant Eikonal regularization, we improve with a ray-wise weighting factor to prioritize the rendering and zero-crossing surface fitting on top of establishing a perfect SDF. We propose to adaptively adjust the regularization on the signed distance field so that unsatisfying rendering rays won't enforce strong Eikonal regularization which is ineffective, and allow the gradients from regions with well-learned radiance to effectively back-propagated to the SDF. Consequently, balancing the two objectives in order to generate accurate and detailed surfaces. Additionally, concerning whether there is a geometric bias between the zero-crossing surface in SDF and rendering points in the radiance field, the projection becomes adjustable as well depending on different 3D locations during optimization. Our proposed \textit{RaNeuS} are extensively evaluated on both synthetic and real datasets, achieving state-of-the-art results on both novel view synthesis and geometric reconstruction. △ Less

Submitted 14 June, 2024; originally announced June 2024.

Comments: 3DV 2024, oral. In: Proceedings of the IEEE/CVF International Conference on 3D Vision (2023)

arXiv:2406.08989 [pdf, other]

ToneUnit: A Speech Discretization Approach for Tonal Language Speech Synthesis

Authors: Dehua Tao, Daxin Tan, Yu Ting Yeung, Xiao Chen, Tan Lee

Abstract: Representing speech as discretized units has numerous benefits in supporting downstream spoken language processing tasks. However, the approach has been less explored in speech synthesis of tonal languages like Mandarin Chinese. Our preliminary experiments on Chinese speech synthesis reveal the issue of "tone shift", where a synthesized speech utterance contains correct base syllables but incorrec… ▽ More Representing speech as discretized units has numerous benefits in supporting downstream spoken language processing tasks. However, the approach has been less explored in speech synthesis of tonal languages like Mandarin Chinese. Our preliminary experiments on Chinese speech synthesis reveal the issue of "tone shift", where a synthesized speech utterance contains correct base syllables but incorrect tones. To address the issue, we propose the ToneUnit framework, which leverages annotated data with tone labels as CTC supervision to learn tone-aware discrete speech units for Mandarin Chinese speech. Our findings indicate that the discrete units acquired through the TonUnit resolve the "tone shift" issue in synthesized Chinese speech and yield favorable results in English synthesis. Moreover, the experimental results suggest that finite scalar quantization enhances the effectiveness of ToneUnit. Notably, ToneUnit can work effectively even with minimal annotated data. △ Less

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2405.15095 [pdf, other]

Compilation for Dynamically Field-Programmable Qubit Arrays with Efficient and Provably Near-Optimal Scheduling

Authors: Daniel Bochen Tan, Wan-Hsuan Lin, Jason Cong

Abstract: Dynamically field-programmable qubit arrays based on neutral atoms have high fidelity and highly parallel gates for quantum computing. However, it is challenging for compilers to fully leverage the novel flexibility offered by such hardware while respecting its various constraints. In this study, we break down the compilation for this architecture into three tasks: scheduling, placement, and routi… ▽ More Dynamically field-programmable qubit arrays based on neutral atoms have high fidelity and highly parallel gates for quantum computing. However, it is challenging for compilers to fully leverage the novel flexibility offered by such hardware while respecting its various constraints. In this study, we break down the compilation for this architecture into three tasks: scheduling, placement, and routing. We formulate these three problems and present efficient solutions to them. Notably, our scheduling based on graph edge coloring is provably near-optimal in terms of two-qubit gate stage count (at most one more than the optimum), the fidelity bottleneck of this platform. As a result, our compiler, Enola, produces higher fidelity results compared to existing works, e.g., 3.7X stage reduction and 5.9X fidelity improvement on the benchmark set used by OLSQ-DPQA, the current state of the art. Additionally, Enola is highly scalable, e.g., within 30 minutes, it can compile circuits with 10,000 qubits, a scale sufficient for the current era of quantum computing. Enola is open source at https://github.com/UCLA-VAST/Enola △ Less

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2404.19664 [pdf, other]

Towards Generalist Robot Learning from Internet Video: A Survey

Authors: Robert McCarthy, Daniel C. H. Tan, Dominik Schmidt, Fernando Acero, Nathan Herr, Yilun Du, Thomas G. Thuruthel, Zhibin Li

Abstract: This survey presents an overview of methods for learning from video (LfV) in the context of reinforcement learning (RL) and robotics. We focus on methods capable of scaling to large internet video datasets and, in the process, extracting foundational knowledge about the world's dynamics and physical human behaviour. Such methods hold great promise for developing general-purpose robots. We open w… ▽ More This survey presents an overview of methods for learning from video (LfV) in the context of reinforcement learning (RL) and robotics. We focus on methods capable of scaling to large internet video datasets and, in the process, extracting foundational knowledge about the world's dynamics and physical human behaviour. Such methods hold great promise for developing general-purpose robots. We open with an overview of fundamental concepts relevant to the LfV-for-robotics setting. This includes a discussion of the exciting benefits LfV methods can offer (e.g., improved generalization beyond the available robot data) and commentary on key LfV challenges (e.g., missing information in video and LfV distribution shifts). Our literature review begins with an analysis of video foundation model techniques that can extract knowledge from large, heterogeneous video datasets. Next, we review methods that specifically leverage video data for robot learning. Here, we categorise work according to which RL knowledge modality (KM) benefits from the use of video data. We additionally highlight techniques for mitigating LfV challenges, including reviewing action representations that address missing action labels in video. Finally, we examine LfV datasets and benchmarks, before concluding with a discussion of challenges and opportunities in LfV. Here, we advocate for scalable foundation model approaches that can leverage the full range of internet video data, and that target the learning of the most promising RL KMs: the policy and dynamics model. Overall, we hope this survey will serve as a comprehensive reference for the emerging field of LfV, catalysing further research in the area and facilitating progress towards the development of general-purpose robots. △ Less

Submitted 7 June, 2024; v1 submitted 30 April, 2024; originally announced April 2024.

Comments: Updated formatting. Reduced paper length and made other minor improvements

arXiv:2404.18369 [pdf, other]

A SAT Scalpel for Lattice Surgery: Representation and Synthesis of Subroutines for Surface-Code Fault-Tolerant Quantum Computing

Authors: Daniel Bochen Tan, Murphy Yuezhen Niu, Craig Gidney

Abstract: Quantum error correction is necessary for large-scale quantum computing. A promising quantum error correcting code is the surface code. For this code, fault-tolerant quantum computing (FTQC) can be performed via lattice surgery, i.e., splitting and merging patches of code. Given the frequent use of certain lattice-surgery subroutines (LaS), it becomes crucial to optimize their design in order to m… ▽ More Quantum error correction is necessary for large-scale quantum computing. A promising quantum error correcting code is the surface code. For this code, fault-tolerant quantum computing (FTQC) can be performed via lattice surgery, i.e., splitting and merging patches of code. Given the frequent use of certain lattice-surgery subroutines (LaS), it becomes crucial to optimize their design in order to minimize the overall spacetime volume of FTQC. In this study, we define the variables to represent LaS and the constraints on these variables. Leveraging this formulation, we develop a synthesizer for LaS, LaSsynth, that encodes a LaS construction problem into a SAT instance, subsequently querying SAT solvers for a solution. Starting from a baseline design, we can gradually invoke the solver with shrinking spacetime volume to derive more compact designs. Due to our foundational formulation and the use of SAT solvers, LaSsynth can exhaustively explore the design space, yielding optimal designs in volume. For example, it achieves 8% and 18% volume reduction respectively over two states-of-the-art human designs for the 15-to-1 T-factory, a bottleneck in FTQC. △ Less

Submitted 17 May, 2024; v1 submitted 28 April, 2024; originally announced April 2024.

Comments: To appear in 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)

arXiv:2404.06224 [pdf, other]

Low-Cost Generation and Evaluation of Dictionary Example Sentences

Authors: Bill Cai, Clarence Boon Liang Ng, Daniel Tan, Shelvia Hotama

Abstract: Dictionary example sentences play an important role in illustrating word definitions and usage, but manually creating quality sentences is challenging. Prior works have demonstrated that language models can be trained to generate example sentences. However, they relied on costly customized models and word sense datasets for generation and evaluation of their work. Rapid advancements in foundationa… ▽ More Dictionary example sentences play an important role in illustrating word definitions and usage, but manually creating quality sentences is challenging. Prior works have demonstrated that language models can be trained to generate example sentences. However, they relied on costly customized models and word sense datasets for generation and evaluation of their work. Rapid advancements in foundational models present the opportunity to create low-cost, zero-shot methods for the generation and evaluation of dictionary example sentences. We introduce a new automatic evaluation metric called OxfordEval that measures the win-rate of generated sentences against existing Oxford Dictionary sentences. OxfordEval shows high alignment with human judgments, enabling large-scale automated quality evaluation. We experiment with various LLMs and configurations to generate dictionary sentences across word classes. We complement this with a novel approach of using masked language models to identify and select sentences that best exemplify word meaning. The eventual model, FM-MLM, achieves over 85.1% win rate against Oxford baseline sentences according to OxfordEval, compared to 39.8% win rate for prior model-generated sentences. △ Less

Submitted 9 April, 2024; originally announced April 2024.

arXiv:2403.04765 [pdf, other]

Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed

Authors: Yifan Wang, Xingyi He, Sida Peng, Dongli Tan, Xiaowei Zhou

Abstract: We present a novel method for efficiently producing semi-dense matches across images. Previous detector-free matcher LoFTR has shown remarkable matching capability in handling large-viewpoint change and texture-poor scenarios but suffers from low efficiency. We revisit its design choices and derive multiple improvements for both efficiency and accuracy. One key observation is that performing the t… ▽ More We present a novel method for efficiently producing semi-dense matches across images. Previous detector-free matcher LoFTR has shown remarkable matching capability in handling large-viewpoint change and texture-poor scenarios but suffers from low efficiency. We revisit its design choices and derive multiple improvements for both efficiency and accuracy. One key observation is that performing the transformer over the entire feature map is redundant due to shared local information, therefore we propose an aggregated attention mechanism with adaptive token selection for efficiency. Furthermore, we find spatial variance exists in LoFTR's fine correlation module, which is adverse to matching accuracy. A novel two-stage correlation layer is proposed to achieve accurate subpixel correspondences for accuracy improvement. Our efficiency optimized model is $\sim 2.5\times$ faster than LoFTR which can even surpass state-of-the-art efficient sparse matching pipeline SuperPoint + LightGlue. Moreover, extensive experiments show that our method can achieve higher accuracy compared with competitive semi-dense matchers, with considerable efficiency benefits. This opens up exciting prospects for large-scale or latency-sensitive applications such as image retrieval and 3D reconstruction. Project page: https://zju3dv.github.io/efficientloftr. △ Less

Submitted 11 March, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

Comments: CVPR 2024; Project page: https://zju3dv.github.io/efficientloftr

arXiv:2402.10551 [pdf, other]

Personalised Drug Identifier for Cancer Treatment with Transformers using Auxiliary Information

Authors: Aishwarya Jayagopal, Hansheng Xue, Ziyang He, Robert J. Walsh, Krishna Kumar Hariprasannan, David Shao Peng Tan, Tuan Zea Tan, Jason J. Pitt, Anand D. Jeyasekharan, Vaibhav Rajan

Abstract: Cancer remains a global challenge due to its growing clinical and economic burden. Its uniquely personal manifestation, which makes treatment difficult, has fuelled the quest for personalized treatment strategies. Thus, genomic profiling is increasingly becoming part of clinical diagnostic panels. Effective use of such panels requires accurate drug response prediction (DRP) models, which are chall… ▽ More Cancer remains a global challenge due to its growing clinical and economic burden. Its uniquely personal manifestation, which makes treatment difficult, has fuelled the quest for personalized treatment strategies. Thus, genomic profiling is increasingly becoming part of clinical diagnostic panels. Effective use of such panels requires accurate drug response prediction (DRP) models, which are challenging to build due to limited labelled patient data. Previous methods to address this problem have used various forms of transfer learning. However, they do not explicitly model the variable length sequential structure of the list of mutations in such diagnostic panels. Further, they do not utilize auxiliary information (like patient survival) for model training. We address these limitations through a novel transformer based method, which surpasses the performance of state-of-the-art DRP models on benchmark data. We also present the design of a treatment recommendation system (TRS), which is currently deployed at the National University Hospital, Singapore and is being evaluated in a clinical trial. △ Less

Submitted 16 February, 2024; originally announced February 2024.

arXiv:2402.03046 [pdf, other]

Open RL Benchmark: Comprehensive Tracked Experiments for Reinforcement Learning

Authors: Shengyi Huang, Quentin Gallouédec, Florian Felten, Antonin Raffin, Rousslan Fernand Julien Dossa, Yanxiao Zhao, Ryan Sullivan, Viktor Makoviychuk, Denys Makoviichuk, Mohamad H. Danesh, Cyril Roumégous, Jiayi Weng, Chufan Chen, Md Masudur Rahman, João G. M. Araújo, Guorui Quan, Daniel Tan, Timo Klein, Rujikorn Charakorn, Mark Towers, Yann Berthelot, Kinal Mehta, Dipam Chakraborty, Arjun KG, Valentin Charraut , et al. (8 additional authors not shown)

Abstract: In many Reinforcement Learning (RL) papers, learning curves are useful indicators to measure the effectiveness of RL algorithms. However, the complete raw data of the learning curves are rarely available. As a result, it is usually necessary to reproduce the experiments from scratch, which can be time-consuming and error-prone. We present Open RL Benchmark, a set of fully tracked RL experiments, i… ▽ More In many Reinforcement Learning (RL) papers, learning curves are useful indicators to measure the effectiveness of RL algorithms. However, the complete raw data of the learning curves are rarely available. As a result, it is usually necessary to reproduce the experiments from scratch, which can be time-consuming and error-prone. We present Open RL Benchmark, a set of fully tracked RL experiments, including not only the usual data such as episodic return, but also all algorithm-specific and system metrics. Open RL Benchmark is community-driven: anyone can download, use, and contribute to the data. At the time of writing, more than 25,000 runs have been tracked, for a cumulative duration of more than 8 years. Open RL Benchmark covers a wide range of RL libraries and reference implementations. Special care is taken to ensure that each experiment is precisely reproducible by providing not only the full parameters, but also the versions of the dependencies used to generate it. In addition, Open RL Benchmark comes with a command-line interface (CLI) for easy fetching and generating figures to present the results. In this document, we include two case studies to demonstrate the usefulness of Open RL Benchmark in practice. To the best of our knowledge, Open RL Benchmark is the first RL benchmark of its kind, and the authors hope that it will improve and facilitate the work of researchers in the field. △ Less

Submitted 5 February, 2024; originally announced February 2024.

Comments: Under review

arXiv:2401.16356 [pdf, other]

cDVGAN: One Flexible Model for Multi-class Gravitational Wave Signal and Glitch Generation

Authors: Tom Dooney, Lyana Curier, Daniel Tan, Melissa Lopez, Chris Van Den Broeck, Stefano Bromuri

Abstract: Simulating realistic time-domain observations of gravitational waves (GWs) and GW detector glitches can help in advancing GW data analysis. Simulated data can be used in downstream tasks by augmenting datasets for signal searches, balancing data sets for machine learning, and validating detection schemes. In this work, we present Conditional Derivative GAN (cDVGAN), a novel conditional model in th… ▽ More Simulating realistic time-domain observations of gravitational waves (GWs) and GW detector glitches can help in advancing GW data analysis. Simulated data can be used in downstream tasks by augmenting datasets for signal searches, balancing data sets for machine learning, and validating detection schemes. In this work, we present Conditional Derivative GAN (cDVGAN), a novel conditional model in the Generative Adversarial Network framework for simulating multiple classes of time-domain observations that represent gravitational waves (GWs) and detector glitches. cDVGAN can also generate generalized hybrid samples that span the variation between classes through interpolation in the conditioned class vector. cDVGAN introduces an additional player into the typical 2-player adversarial game of GANs, where an auxiliary discriminator analyzes the first-order derivative time-series. Our results show that this provides synthetic data that better captures the features of the original data. cDVGAN conditions on three classes, two denoised from LIGO blip and tomte glitch events from its 3rd observing run (O3), and the third representing binary black hole (BBH) mergers. Our proposed cDVGAN outperforms 4 different baseline GAN models in replicating the features of the three classes. Specifically, our experiments show that training convolutional neural networks (CNNs) with our cDVGAN-generated data improves the detection of samples embedded in detector noise beyond the synthetic data from other state-of-the-art GAN models. Our best synthetic dataset yields as much as a 4.2% increase in area-under-the-curve (AUC) performance compared to synthetic datasets from baseline GANs. Moreover, training the CNN with hybrid samples from our cDVGAN outperforms CNNs trained only on the standard classes, when identifying real samples embedded in LIGO detector background (4% AUC improvement for cDVGAN). △ Less

Submitted 5 June, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

Comments: 20 pages, 17 figures, 5 tables

arXiv:2401.13807 [pdf, other]

Depth-Optimal Addressing of 2D Qubit Array with 1D Controls Based on Exact Binary Matrix Factorization

Authors: Daniel Bochen Tan, Shuohao Ping, Jason Cong

Abstract: Reducing control complexity is essential for achieving large-scale quantum computing. However, reducing control knobs may compromise the ability to independently address each qubit. Recent progress in neutral atom-based platforms suggests that rectangular (row-column) addressing may strike a balance between control granularity and flexibility for 2D qubit arrays. This scheme allows addressing qubi… ▽ More Reducing control complexity is essential for achieving large-scale quantum computing. However, reducing control knobs may compromise the ability to independently address each qubit. Recent progress in neutral atom-based platforms suggests that rectangular (row-column) addressing may strike a balance between control granularity and flexibility for 2D qubit arrays. This scheme allows addressing qubits on the intersections of a set of rows and columns each time. While quadratically reducing controls, it may necessitate more depth. We formulate the depth-optimal rectangular addressing problem as exact binary matrix factorization, an NP-hard problem also appearing in communication complexity and combinatorial optimization. We introduce a satisfiability modulo theories-based solver for this problem, and a heuristic, row packing, performing close to the optimal solver on various benchmarks. Furthermore, we discuss rectangular addressing in the context of fault-tolerant quantum computing, leveraging a natural two-level structure. △ Less

Submitted 22 March, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

arXiv:2311.16241 [pdf, other]

SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance

Authors: Lukas Hoyer, David Joseph Tan, Muhammad Ferjad Naeem, Luc Van Gool, Federico Tombari

Abstract: In semi-supervised semantic segmentation, a model is trained with a limited number of labeled images along with a large corpus of unlabeled images to reduce the high annotation effort. While previous methods are able to learn good segmentation boundaries, they are prone to confuse classes with similar visual appearance due to the limited supervision. On the other hand, vision-language models (VLMs… ▽ More In semi-supervised semantic segmentation, a model is trained with a limited number of labeled images along with a large corpus of unlabeled images to reduce the high annotation effort. While previous methods are able to learn good segmentation boundaries, they are prone to confuse classes with similar visual appearance due to the limited supervision. On the other hand, vision-language models (VLMs) are able to learn diverse semantic knowledge from image-caption datasets but produce noisy segmentation due to the image-level training. In SemiVL, we propose to integrate rich priors from VLM pre-training into semi-supervised semantic segmentation to learn better semantic decision boundaries. To adapt the VLM from global to local reasoning, we introduce a spatial fine-tuning strategy for label-efficient learning. Further, we design a language-guided decoder to jointly reason over vision and language. Finally, we propose to handle inherent ambiguities in class labels by providing the model with language guidance in the form of class definitions. We evaluate SemiVL on 4 semantic segmentation datasets, where it significantly outperforms previous semi-supervised methods. For instance, SemiVL improves the state-of-the-art by +13.5 mIoU on COCO with 232 annotated images and by +6.1 mIoU on Pascal VOC with 92 labels. Project page: https://github.com/google-research/semivl △ Less

Submitted 27 November, 2023; originally announced November 2023.

arXiv:2311.16190 [pdf, other]

Q-Pilot: Field Programmable Qubit Array Compilation with Flying Ancillas

Authors: Hanrui Wang, Daniel Bochen Tan, Pengyu Liu, Yilian Liu, Jiaqi Gu, Jason Cong, Song Han

Abstract: Neutral atom arrays have become a promising platform for quantum computing, especially the field programmable qubit array (FPQA) endowed with the unique capability of atom movement. This feature allows dynamic alterations in qubit connectivity during runtime, which can reduce the cost of executing long-range gates and improve parallelism. However, this added flexibility introduces new challenges i… ▽ More Neutral atom arrays have become a promising platform for quantum computing, especially the field programmable qubit array (FPQA) endowed with the unique capability of atom movement. This feature allows dynamic alterations in qubit connectivity during runtime, which can reduce the cost of executing long-range gates and improve parallelism. However, this added flexibility introduces new challenges in circuit compilation. Inspired by the placement and routing strategies for FPGAs, we propose to map all data qubits to fixed atoms while utilizing movable atoms to route for 2-qubit gates between data qubits. Coined flying ancillas, these mobile atoms function as ancilla qubits, dynamically generated and recycled during execution. We present Q-Pilot, a scalable compiler for FPQA employing flying ancillas to maximize circuit parallelism. For two important quantum applications, quantum simulation and the Quantum Approximate Optimization Algorithm (QAOA), we devise domain-specific routing strategies. In comparison to alternative technologies such as superconducting devices or fixed atom arrays, Q-Pilot effectively harnesses the flexibility of FPQA, achieving reductions of 1.4x, 27.7x, and 6.3x in circuit depth for 100-qubit random, quantum simulation, and QAOA circuits, respectively. △ Less

Submitted 6 May, 2024; v1 submitted 25 November, 2023; originally announced November 2023.

Comments: 10 pages, 16 figures; Published as a conference paper at DAC 2024

arXiv:2311.15123 [pdf, other]

Atomique: A Quantum Compiler for Reconfigurable Neutral Atom Arrays

Authors: Hanrui Wang, Pengyu Liu, Daniel Bochen Tan, Yilian Liu, Jiaqi Gu, David Z. Pan, Jason Cong, Umut A. Acar, Song Han

Abstract: The neutral atom array has gained prominence in quantum computing for its scalability and operation fidelity. Previous works focus on fixed atom arrays (FAAs) that require extensive SWAP operations for long-range interactions. This work explores a novel architecture reconfigurable atom arrays (RAAs), also known as field programmable qubit arrays (FPQAs), which allows for coherent atom movements du… ▽ More The neutral atom array has gained prominence in quantum computing for its scalability and operation fidelity. Previous works focus on fixed atom arrays (FAAs) that require extensive SWAP operations for long-range interactions. This work explores a novel architecture reconfigurable atom arrays (RAAs), also known as field programmable qubit arrays (FPQAs), which allows for coherent atom movements during circuit execution under some constraints. Such atom movements, which are unique to this architecture, could reduce the cost of long-range interactions significantly if the atom movements could be scheduled strategically. In this work, we introduce Atomique, a compilation framework designed for qubit mapping, atom movement, and gate scheduling for RAA. Atomique contains a qubit-array mapper to decide the coarse-grained mapping of the qubits to arrays, leveraging MAX k-Cut on a constructed gate frequency graph to minimize SWAP overhead. Subsequently, a qubit-atom mapper determines the fine-grained mapping of qubits to specific atoms in the array and considers load balance to prevent hardware constraint violations. We further propose a router that identifies parallel gates, schedules them simultaneously, and reduces depth. We evaluate Atomique across 20+ diverse benchmarks, including generic circuits (arbitrary, QASMBench, SupermarQ), quantum simulation, and QAOA circuits. Atomique consistently outperforms IBM Superconducting, FAA with long-range gates, and FAA with rectangular and triangular topologies, achieving significant reductions in depth and the number of two-qubit gates. △ Less

Submitted 2 May, 2024; v1 submitted 25 November, 2023; originally announced November 2023.

Comments: 17 pages, 26 figures; Published as a conference paper at ISCA 2024

arXiv:2309.15487 [pdf, other]

Tackling VQA with Pretrained Foundation Models without Further Training

Authors: Alvin De Jun Tan, Bingquan Shen

Abstract: Large language models (LLMs) have achieved state-of-the-art results in many natural language processing tasks. They have also demonstrated ability to adapt well to different tasks through zero-shot or few-shot settings. With the capability of these LLMs, researchers have looked into how to adopt them for use with Visual Question Answering (VQA). Many methods require further training to align the i… ▽ More Large language models (LLMs) have achieved state-of-the-art results in many natural language processing tasks. They have also demonstrated ability to adapt well to different tasks through zero-shot or few-shot settings. With the capability of these LLMs, researchers have looked into how to adopt them for use with Visual Question Answering (VQA). Many methods require further training to align the image and text embeddings. However, these methods are computationally expensive and requires large scale image-text dataset for training. In this paper, we explore a method of combining pretrained LLMs and other foundation models without further training to solve the VQA problem. The general idea is to use natural language to represent the images such that the LLM can understand the images. We explore different decoding strategies for generating textual representation of the image and evaluate their performance on the VQAv2 dataset. △ Less

Submitted 27 September, 2023; originally announced September 2023.

arXiv:2309.15486 [pdf, other]

Transferability of Representations Learned using Supervised Contrastive Learning Trained on a Multi-Domain Dataset

Authors: Alvin De Jun Tan, Clement Tan, Chai Kiat Yeo

Abstract: Contrastive learning has shown to learn better quality representations than models trained using cross-entropy loss. They also transfer better to downstream datasets from different domains. However, little work has been done to explore the transferability of representations learned using contrastive learning when trained on a multi-domain dataset. In this paper, a study has been conducted using th… ▽ More Contrastive learning has shown to learn better quality representations than models trained using cross-entropy loss. They also transfer better to downstream datasets from different domains. However, little work has been done to explore the transferability of representations learned using contrastive learning when trained on a multi-domain dataset. In this paper, a study has been conducted using the Supervised Contrastive Learning framework to learn representations from the multi-domain DomainNet dataset and then evaluate the transferability of the representations learned on other downstream datasets. The fixed feature linear evaluation protocol will be used to evaluate the transferability on 7 downstream datasets that were chosen across different domains. The results obtained are compared to a baseline model that was trained using the widely used cross-entropy loss. Empirical results from the experiments showed that on average, the Supervised Contrastive Learning model performed 6.05% better than the baseline model on the 7 downstream datasets. The findings suggest that Supervised Contrastive Learning models can potentially learn more robust representations that transfer better across domains than cross-entropy models when trained on a multi-domain dataset. △ Less

Submitted 27 September, 2023; originally announced September 2023.

arXiv:2306.04026 [pdf, other]

Value Functions are Control Barrier Functions: Verification of Safe Policies using Control Theory

Authors: Daniel C. H. Tan, Fernando Acero, Robert McCarthy, Dimitrios Kanoulas, Zhibin Li

Abstract: Guaranteeing safe behaviour of reinforcement learning (RL) policies poses significant challenges for safety-critical applications, despite RL's generality and scalability. To address this, we propose a new approach to apply verification methods from control theory to learned value functions. By analyzing task structures for safety preservation, we formalize original theorems that establish links b… ▽ More Guaranteeing safe behaviour of reinforcement learning (RL) policies poses significant challenges for safety-critical applications, despite RL's generality and scalability. To address this, we propose a new approach to apply verification methods from control theory to learned value functions. By analyzing task structures for safety preservation, we formalize original theorems that establish links between value functions and control barrier functions. Further, we propose novel metrics for verifying value functions in safe control tasks and practical implementation details to improve learning. Our work presents a novel method for certificate learning, which unlocks a diversity of verification techniques from control theory for RL policies, and marks a significant step towards a formal framework for the general, scalable, and verifiable design of RL-based control systems. Code and videos are available at this https url: https://rl-cbf.github.io/ △ Less

Submitted 5 December, 2023; v1 submitted 6 June, 2023; originally announced June 2023.

arXiv:2306.03487 [pdf, other]

doi 10.22331/q-2024-03-14-1281

Compiling Quantum Circuits for Dynamically Field-Programmable Neutral Atoms Array Processors

Authors: Daniel Bochen Tan, Dolev Bluvstein, Mikhail D. Lukin, Jason Cong

Abstract: Dynamically field-programmable qubit arrays (DPQA) have recently emerged as a promising platform for quantum information processing. In DPQA, atomic qubits are selectively loaded into arrays of optical traps that can be reconfigured during the computation itself. Leveraging qubit transport and parallel, entangling quantum operations, different pairs of qubits, even those initially far away, can be… ▽ More Dynamically field-programmable qubit arrays (DPQA) have recently emerged as a promising platform for quantum information processing. In DPQA, atomic qubits are selectively loaded into arrays of optical traps that can be reconfigured during the computation itself. Leveraging qubit transport and parallel, entangling quantum operations, different pairs of qubits, even those initially far away, can be entangled at different stages of the quantum program execution. Such reconfigurability and non-local connectivity present new challenges for compilation, especially in the layout synthesis step which places and routes the qubits and schedules the gates. In this paper, we consider a DPQA architecture that contains multiple arrays and supports 2D array movements, representing cutting-edge experimental platforms. Within this architecture, we discretize the state space and formulate layout synthesis as a satisfiability modulo theories problem, which can be solved by existing solvers optimally in terms of circuit depth. For a set of benchmark circuits generated by random graphs with complex connectivities, our compiler OLSQ-DPQA reduces the number of two-qubit entangling gates on small problem instances by 1.7x compared to optimal compilation results on a fixed planar architecture. To further improve scalability and practicality of the method, we introduce a greedy heuristic inspired by the iterative peeling approach in classical integrated circuit routing. Using a hybrid approach that combined the greedy and optimal methods, we demonstrate that our DPQA-based compiled circuits feature reduced scaling overhead compared to a grid fixed architecture, resulting in 5.1X less two-qubit gates for 90 qubit quantum circuits. These methods enable programmable, complex quantum circuits with neutral atom quantum computers, as well as informing both future compilers and future hardware choices. △ Less

Submitted 1 July, 2024; v1 submitted 6 June, 2023; originally announced June 2023.

Comments: Version accepted by Quantum. 21 pages, 9 figures, 7 tables. An extended abstract was presented at the 41st International Conference on Computer-Aided Design (ICCAD '22)

Journal ref: Quantum 8, 1281 (2024)

arXiv:2303.17408 [pdf, other]

P-Transformer: A Prompt-based Multimodal Transformer Architecture For Medical Tabular Data

Authors: Yucheng Ruan, Xiang Lan, Daniel J. Tan, Hairil Rizal Abdullah, Mengling Feng

Abstract: Medical tabular data, abundant in Electronic Health Records (EHRs), is a valuable resource for diverse medical tasks such as risk prediction. While deep learning approaches, particularly transformer-based models, have shown remarkable performance in tabular data prediction, there are still problems remained for existing work to be effectively adapted into medical domain, such as under-utilization… ▽ More Medical tabular data, abundant in Electronic Health Records (EHRs), is a valuable resource for diverse medical tasks such as risk prediction. While deep learning approaches, particularly transformer-based models, have shown remarkable performance in tabular data prediction, there are still problems remained for existing work to be effectively adapted into medical domain, such as under-utilization of unstructured free-texts, limited exploration of textual information in structured data, and data corruption. To address these issues, we propose P-Transformer, a Prompt-based multimodal Transformer architecture designed specifically for medical tabular data. This framework consists two critical components: a tabular cell embedding generator and a tabular transformer. The former efficiently encodes diverse modalities from both structured and unstructured tabular data into a harmonized language semantic space with the help of pre-trained sentence encoder and medical prompts. The latter integrates cell representations to generate patient embeddings for various medical tasks. In comprehensive experiments on two real-world datasets for three medical tasks, P-Transformer demonstrated the improvements with 10.9%/11.0% on RMSE/MAE, 0.5%/2.2% on RMSE/MAE, and 1.6%/0.8% on BACC/AUROC compared to state-of-the-art (SOTA) baselines in predictability. Notably, the model exhibited strong resilience to data corruption in the structured data, particularly when the corruption rates are high. △ Less

Submitted 9 January, 2024; v1 submitted 30 March, 2023; originally announced March 2023.

arXiv:2301.10894 [pdf, other]

Perceptive Locomotion with Controllable Pace and Natural Gait Transitions Over Uneven Terrains

Authors: Daniel Chee Hian Tan, Jenny Zhang, Michael, Chuah, Zhibin Li

Abstract: This work developed a learning framework for perceptive legged locomotion that combines visual feedback, proprioceptive information, and active gait regulation of foot-ground contacts. The perception requires only one forward-facing camera to obtain the heightmap, and the active regulation of gait paces and traveling velocity are realized through our formulation of CPG-based high-level imitation o… ▽ More This work developed a learning framework for perceptive legged locomotion that combines visual feedback, proprioceptive information, and active gait regulation of foot-ground contacts. The perception requires only one forward-facing camera to obtain the heightmap, and the active regulation of gait paces and traveling velocity are realized through our formulation of CPG-based high-level imitation of foot-ground contacts. Through this framework, an end-user has the ability to command task-level inputs to control different walking speeds and gait frequencies according to the traversal of different terrains, which enables more reliable negotiation with encountered obstacles. The results demonstrated that the learned perceptive locomotion policy followed task-level control inputs with intended behaviors, and was robust in presence of unseen terrains and external force perturbations. A video demonstration can be found at https://youtu.be/OTzlWzDfAe8, and the codebase at https://github.com/jennyzzt/perceptual-locomotion. △ Less

Submitted 30 January, 2023; v1 submitted 25 January, 2023; originally announced January 2023.

arXiv:2212.06145 [pdf, other]

AP: Selective Activation for De-sparsifying Pruned Neural Networks

Authors: Shiyu Liu, Rohan Ghosh, Dylan Tan, Mehul Motani

Abstract: The rectified linear unit (ReLU) is a highly successful activation function in neural networks as it allows networks to easily obtain sparse representations, which reduces overfitting in overparameterized networks. However, in network pruning, we find that the sparsity introduced by ReLU, which we quantify by a term called dynamic dead neuron rate (DNR), is not beneficial for the pruned network. I… ▽ More The rectified linear unit (ReLU) is a highly successful activation function in neural networks as it allows networks to easily obtain sparse representations, which reduces overfitting in overparameterized networks. However, in network pruning, we find that the sparsity introduced by ReLU, which we quantify by a term called dynamic dead neuron rate (DNR), is not beneficial for the pruned network. Interestingly, the more the network is pruned, the smaller the dynamic DNR becomes during optimization. This motivates us to propose a method to explicitly reduce the dynamic DNR for the pruned network, i.e., de-sparsify the network. We refer to our method as Activating-while-Pruning (AP). We note that AP does not function as a stand-alone method, as it does not evaluate the importance of weights. Instead, it works in tandem with existing pruning methods and aims to improve their performance by selective activation of nodes to reduce the dynamic DNR. We conduct extensive experiments using popular networks (e.g., ResNet, VGG) via two classical and three state-of-the-art pruning methods. The experimental results on public datasets (e.g., CIFAR-10/100) suggest that AP works well with existing pruning methods and improves the performance by 3% - 4%. For larger scale datasets (e.g., ImageNet) and state-of-the-art networks (e.g., vision transformer), we observe an improvement of 2% - 3% with AP as opposed to without. Lastly, we conduct an ablation study to examine the effectiveness of the components comprising AP. △ Less

Submitted 9 December, 2022; originally announced December 2022.

Comments: 16 Pages

arXiv:2212.03398

Analysis and Utilization of Entrainment on Acoustic and Emotion Features in User-agent Dialogue

Authors: Daxin Tan, Nikos Kargas, David McHardy, Constantinos Papayiannis, Antonio Bonafonte, Marek Strelec, Jonas Rohnke, Agis Oikonomou Filandras, Trevor Wood

Abstract: Entrainment is the phenomenon by which an interlocutor adapts their speaking style to align with their partner in conversations. It has been found in different dimensions as acoustic, prosodic, lexical or syntactic. In this work, we explore and utilize the entrainment phenomenon to improve spoken dialogue systems for voice assistants. We first examine the existence of the entrainment phenomenon in… ▽ More Entrainment is the phenomenon by which an interlocutor adapts their speaking style to align with their partner in conversations. It has been found in different dimensions as acoustic, prosodic, lexical or syntactic. In this work, we explore and utilize the entrainment phenomenon to improve spoken dialogue systems for voice assistants. We first examine the existence of the entrainment phenomenon in human-to-human dialogues in respect to acoustic feature and then extend the analysis to emotion features. The analysis results show strong evidence of entrainment in terms of both acoustic and emotion features. Based on this findings, we implement two entrainment policies and assess if the integration of entrainment principle into a Text-to-Speech (TTS) system improves the synthesis performance and the user experience. It is found that the integration of the entrainment principle into a TTS system brings performance improvement when considering acoustic features, while no obvious improvement is observed when considering emotion features. △ Less

Submitted 6 December, 2022; originally announced December 2022.

Comments: This version has been removed by arXiv administrators because the submitter did not have the right to assign a license at the time of submission

arXiv:2211.11674 [pdf, other]

Shape, Pose, and Appearance from a Single Image via Bootstrapped Radiance Field Inversion

Authors: Dario Pavllo, David Joseph Tan, Marie-Julie Rakotosaona, Federico Tombari

Abstract: Neural Radiance Fields (NeRF) coupled with GANs represent a promising direction in the area of 3D reconstruction from a single view, owing to their ability to efficiently model arbitrary topologies. Recent work in this area, however, has mostly focused on synthetic datasets where exact ground-truth poses are known, and has overlooked pose estimation, which is important for certain downstream appli… ▽ More Neural Radiance Fields (NeRF) coupled with GANs represent a promising direction in the area of 3D reconstruction from a single view, owing to their ability to efficiently model arbitrary topologies. Recent work in this area, however, has mostly focused on synthetic datasets where exact ground-truth poses are known, and has overlooked pose estimation, which is important for certain downstream applications such as augmented reality (AR) and robotics. We introduce a principled end-to-end reconstruction framework for natural images, where accurate ground-truth poses are not available. Our approach recovers an SDF-parameterized 3D shape, pose, and appearance from a single image of an object, without exploiting multiple views during training. More specifically, we leverage an unconditional 3D-aware generator, to which we apply a hybrid inversion scheme where a model produces a first guess of the solution which is then refined via optimization. Our framework can de-render an image in as few as 10 steps, enabling its use in practical scenarios. We demonstrate state-of-the-art results on a variety of real and synthetic benchmarks. △ Less

Submitted 20 March, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

Comments: CVPR 2023. Code and models are available at https://github.com/google-research/nerf-from-image

arXiv:2209.13112 [pdf, other]

Automated Sex Classification of Children's Voices and Changes in Differentiating Factors with Age

Authors: Fuling Chen, Roberto Togneri, Murray Maybery, Diana Weiting Tan

Abstract: Sex classification of children's voices allows for an investigation of the development of secondary sex characteristics which has been a key interest in the field of speech analysis. This research investigated a broad range of acoustic features from scripted and spontaneous speech and applied a hierarchical clustering-based machine learning model to distinguish the sex of children aged between 5 a… ▽ More Sex classification of children's voices allows for an investigation of the development of secondary sex characteristics which has been a key interest in the field of speech analysis. This research investigated a broad range of acoustic features from scripted and spontaneous speech and applied a hierarchical clustering-based machine learning model to distinguish the sex of children aged between 5 and 15 years. We proposed an optimal feature set and our modelling achieved an average F1 score (the harmonic mean of the precision and recall) of 0.84 across all ages. Our results suggest that the sex classification is generally more accurate when a model is developed for each year group rather than for children in 4-year age bands, with classification accuracy being better for older age groups. We found that spontaneous speech could provide more helpful cues in sex classification than scripted speech, especially for children younger than 7 years. For younger age groups, a broad range of acoustic factors contributed evenly to sex classification, while for older age groups, F0-related acoustic factors were found to be the most critical predictors generally. Other important acoustic factors for older age groups include vocal tract length estimators, spectral flux, loudness and unvoiced features. △ Less

Submitted 26 September, 2022; originally announced September 2022.

arXiv:2209.12213 [pdf, other]

ECO-TR: Efficient Correspondences Finding Via Coarse-to-Fine Refinement

Authors: Dongli Tan, Jiang-Jiang Liu, Xingyu Chen, Chao Chen, Ruixin Zhang, Yunhang Shen, Shouhong Ding, Rongrong Ji

Abstract: Modeling sparse and dense image matching within a unified functional correspondence model has recently attracted increasing research interest. However, existing efforts mainly focus on improving matching accuracy while ignoring its efficiency, which is crucial for realworld applications. In this paper, we propose an efficient structure named Efficient Correspondence Transformer (ECO-TR) by finding… ▽ More Modeling sparse and dense image matching within a unified functional correspondence model has recently attracted increasing research interest. However, existing efforts mainly focus on improving matching accuracy while ignoring its efficiency, which is crucial for realworld applications. In this paper, we propose an efficient structure named Efficient Correspondence Transformer (ECO-TR) by finding correspondences in a coarse-to-fine manner, which significantly improves the efficiency of functional correspondence model. To achieve this, multiple transformer blocks are stage-wisely connected to gradually refine the predicted coordinates upon a shared multi-scale feature extraction network. Given a pair of images and for arbitrary query coordinates, all the correspondences are predicted within a single feed-forward pass. We further propose an adaptive query-clustering strategy and an uncertainty-based outlier detection module to cooperate with the proposed framework for faster and better predictions. Experiments on various sparse and dense matching tasks demonstrate the superiority of our method in both efficiency and effectiveness against existing state-of-the-arts. △ Less

Submitted 25 September, 2022; originally announced September 2022.

Comments: Accepted by ECCV2022

arXiv:2209.06429 [pdf, other]

A Hybrid Deep Learning Model-based Remaining Useful Life Estimation for Reed Relay with Degradation Pattern Clustering

Authors: Chinthaka Gamanayake, Yan Qin, Chau Yuen, Lahiru Jayasinghe, Dominique-Ea Tan, Jenny Low

Abstract: Reed relay serves as the fundamental component of functional testing, which closely relates to the successful quality inspection of electronics. To provide accurate remaining useful life (RUL) estimation for reed relay, a hybrid deep learning network with degradation pattern clustering is proposed based on the following three considerations. First, multiple degradation behaviors are observed for r… ▽ More Reed relay serves as the fundamental component of functional testing, which closely relates to the successful quality inspection of electronics. To provide accurate remaining useful life (RUL) estimation for reed relay, a hybrid deep learning network with degradation pattern clustering is proposed based on the following three considerations. First, multiple degradation behaviors are observed for reed relay, and hence a dynamic time wrapping-based $K$-means clustering is offered to distinguish degradation patterns from each other. Second, although proper selections of features are of great significance, few studies are available to guide the selection. The proposed method recommends operational rules for easy implementation purposes. Third, a neural network for remaining useful life estimation (RULNet) is proposed to address the weakness of the convolutional neural network (CNN) in capturing temporal information of sequential data, which incorporates temporal correlation ability after high-level feature representation of convolutional operation. In this way, three variants of RULNet are constructed with health indicators, features with self-organizing map, or features with curve fitting. Ultimately, the proposed hybrid model is compared with the typical baseline models, including CNN and long short-term memory network (LSTM), through a practical reed relay dataset with two distinct degradation manners. The results from both degradation cases demonstrate that the proposed method outperforms CNN and LSTM regarding the index root mean squared error. △ Less

Submitted 14 September, 2022; originally announced September 2022.

Comments: This paper has been acctepted by IEEE Transactions on Industrial Informatics

arXiv:2205.03899 [pdf]

doi 10.1007/s11263-022-01588-7

SoftPool++: An Encoder-Decoder Network for Point Cloud Completion

Authors: Yida Wang, David Joseph Tan, Nassir Navab, Federico Tombari

Abstract: We propose a novel convolutional operator for the task of point cloud completion. One striking characteristic of our approach is that, conversely to related work it does not require any max-pooling or voxelization operation. Instead, the proposed operator used to learn the point cloud embedding in the encoder extracts permutation-invariant features from the point cloud via a soft-pooling of featur… ▽ More We propose a novel convolutional operator for the task of point cloud completion. One striking characteristic of our approach is that, conversely to related work it does not require any max-pooling or voxelization operation. Instead, the proposed operator used to learn the point cloud embedding in the encoder extracts permutation-invariant features from the point cloud via a soft-pooling of feature activations, which are able to preserve fine-grained geometric details. These features are then passed on to a decoder architecture. Due to the compression in the encoder, a typical limitation of this type of architectures is that they tend to lose parts of the input shape structure. We propose to overcome this limitation by using skip connections specifically devised for point clouds, where links between corresponding layers in the encoder and the decoder are established. As part of these connections, we introduce a transformation matrix that projects the features from the encoder to the decoder and vice-versa. The quantitative and qualitative results on the task of object completion from partial scans on the ShapeNet dataset show that incorporating our approach achieves state-of-the-art performance in shape completion both at low and high resolutions. △ Less

Submitted 8 May, 2022; originally announced May 2022.

Comments: Accepted in International Journal of Computer Vision

Journal ref: Int J Comput Vis 130, 1145-1164 (2022)

arXiv:2204.05460 [pdf, other]

CorrectSpeech: A Fully Automated System for Speech Correction and Accent Reduction

Authors: Daxin Tan, Liqun Deng, Nianzu Zheng, Yu Ting Yeung, Xin Jiang, Xiao Chen, Tan Lee

Abstract: This study propose a fully automated system for speech correction and accent reduction. Consider the application scenario that a recorded speech audio contains certain errors, e.g., inappropriate words, mispronunciations, that need to be corrected. The proposed system, named CorrectSpeech, performs the correction in three steps: recognizing the recorded speech and converting it into time-stamped s… ▽ More This study propose a fully automated system for speech correction and accent reduction. Consider the application scenario that a recorded speech audio contains certain errors, e.g., inappropriate words, mispronunciations, that need to be corrected. The proposed system, named CorrectSpeech, performs the correction in three steps: recognizing the recorded speech and converting it into time-stamped symbol sequence, aligning recognized symbol sequence with target text to determine locations and types of required edit operations, and generating the corrected speech. Experiments show that the quality and naturalness of corrected speech depend on the performance of speech recognition and alignment modules, as well as the granularity level of editing operations. The proposed system is evaluated on two corpora: a manually perturbed version of VCTK and L2-ARCTIC. The results demonstrate that our system is able to correct mispronunciation and reduce accent in speech recordings. Audio samples are available online for demonstration https://daxintan-cuhk.github.io/CorrectSpeech/ . △ Less

Submitted 13 October, 2022; v1 submitted 11 April, 2022; originally announced April 2022.

Comments: Accepted by ISCSLP 2022

arXiv:2203.17190 [pdf, other]

Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech

Authors: Guangyan Zhang, Kaitao Song, Xu Tan, Daxin Tan, Yuzi Yan, Yanqing Liu, Gang Wang, Wei Zhou, Tao Qin, Tan Lee, Sheng Zhao

Abstract: Recently, leveraging BERT pre-training to improve the phoneme encoder in text to speech (TTS) has drawn increasing attention. However, the works apply pre-training with character-based units to enhance the TTS phoneme encoder, which is inconsistent with the TTS fine-tuning that takes phonemes as input. Pre-training only with phonemes as input can alleviate the input mismatch but lack the ability t… ▽ More Recently, leveraging BERT pre-training to improve the phoneme encoder in text to speech (TTS) has drawn increasing attention. However, the works apply pre-training with character-based units to enhance the TTS phoneme encoder, which is inconsistent with the TTS fine-tuning that takes phonemes as input. Pre-training only with phonemes as input can alleviate the input mismatch but lack the ability to model rich representations and semantic information due to limited phoneme vocabulary. In this paper, we propose MixedPhoneme BERT, a novel variant of the BERT model that uses mixed phoneme and sup-phoneme representations to enhance the learning capability. Specifically, we merge the adjacent phonemes into sup-phonemes and combine the phoneme sequence and the merged sup-phoneme sequence as the model input, which can enhance the model capacity to learn rich contextual representations. Experiment results demonstrate that our proposed Mixed-Phoneme BERT significantly improves the TTS performance with 0.30 CMOS gain compared with the FastSpeech 2 baseline. The Mixed-Phoneme BERT achieves 3x inference speedup and similar voice quality to the previous TTS pre-trained model PnG BERT △ Less

Submitted 19 July, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

Comments: Accepted by interspeech 2022

arXiv:2203.16600 [pdf, other]

Learning Local Displacements for Point Cloud Completion

Authors: Yida Wang, David Joseph Tan, Nassir Navab, Federico Tombari

Abstract: We propose a novel approach aimed at object and semantic scene completion from a partial scan represented as a 3D point cloud. Our architecture relies on three novel layers that are used successively within an encoder-decoder structure and specifically developed for the task at hand. The first one carries out feature extraction by matching the point features to a set of pre-trained local descripto… ▽ More We propose a novel approach aimed at object and semantic scene completion from a partial scan represented as a 3D point cloud. Our architecture relies on three novel layers that are used successively within an encoder-decoder structure and specifically developed for the task at hand. The first one carries out feature extraction by matching the point features to a set of pre-trained local descriptors. Then, to avoid losing individual descriptors as part of standard operations such as max-pooling, we propose an alternative neighbor-pooling operation that relies on adopting the feature vectors with the highest activations. Finally, up-sampling in the decoder modifies our feature extraction in order to increase the output dimension. While this model is already able to achieve competitive results with the state of the art, we further propose a way to increase the versatility of our approach to process point clouds. To this aim, we introduce a second model that assembles our layers within a transformer architecture. We evaluate both architectures on object and indoor scene completion tasks, achieving state-of-the-art performance. △ Less

Submitted 30 March, 2022; originally announced March 2022.

Comments: Conference on Computer Vision and Pattern Recognition (CVPR) 2022

arXiv:2201.05675 [pdf, other]

Transformers in Action: Weakly Supervised Action Segmentation

Authors: John Ridley, Huseyin Coskun, David Joseph Tan, Nassir Navab, Federico Tombari

Abstract: The video action segmentation task is regularly explored under weaker forms of supervision, such as transcript supervision, where a list of actions is easier to obtain than dense frame-wise labels. In this formulation, the task presents various challenges for sequence modeling approaches due to the emphasis on action transition points, long sequence lengths, and frame contextualization, making the… ▽ More The video action segmentation task is regularly explored under weaker forms of supervision, such as transcript supervision, where a list of actions is easier to obtain than dense frame-wise labels. In this formulation, the task presents various challenges for sequence modeling approaches due to the emphasis on action transition points, long sequence lengths, and frame contextualization, making the task well-posed for transformers. Given developments enabling transformers to scale linearly, we demonstrate through our architecture how they can be applied to improve action alignment accuracy over the equivalent RNN-based models with the attention mechanism focusing around salient action transition regions. Additionally, given the recent focus on inference-time transcript selection, we propose a supplemental transcript embedding approach to select transcripts more quickly at inference-time. Furthermore, we subsequently demonstrate how this approach can also improve the overall segmentation performance. Finally, we evaluate our proposed methods across the benchmark datasets to better understand the applicability of transformers and the importance of transcript selection on this video-driven weakly-supervised task. △ Less

Submitted 20 January, 2022; v1 submitted 14 January, 2022; originally announced January 2022.

Comments: Under Review

arXiv:2201.01669 [pdf, other]

Using Deep Learning with Large Aggregated Datasets for COVID-19 Classification from Cough

Authors: Esin Darici Haritaoglu, Nicholas Rasmussen, Daniel C. H. Tan, Jennifer Ranjani J., Jaclyn Xiao, Gunvant Chaudhari, Akanksha Rajput, Praveen Govindan, Christian Canham, Wei Chen, Minami Yamaura, Laura Gomezjurado, Aaron Broukhim, Amil Khanzada, Mert Pilanci

Abstract: The Covid-19 pandemic has been one of the most devastating events in recent history, claiming the lives of more than 5 million people worldwide. Even with the worldwide distribution of vaccines, there is an apparent need for affordable, reliable, and accessible screening techniques to serve parts of the World that do not have access to Western medicine. Artificial Intelligence can provide a soluti… ▽ More The Covid-19 pandemic has been one of the most devastating events in recent history, claiming the lives of more than 5 million people worldwide. Even with the worldwide distribution of vaccines, there is an apparent need for affordable, reliable, and accessible screening techniques to serve parts of the World that do not have access to Western medicine. Artificial Intelligence can provide a solution utilizing cough sounds as a primary screening mode for COVID-19 diagnosis. This paper presents multiple models that have achieved relatively respectable performance on the largest evaluation dataset currently presented in academic literature. Through investigation of a self-supervised learning model (Area under the ROC curve, AUC = 0.807) and a convolutional nerual network (CNN) model (AUC = 0.802), we observe the possibility of model bias with limited datasets. Moreover, we observe that performance increases with training data size, showing the need for the worldwide collection of data to help combat the Covid-19 pandemic with non-traditional means. △ Less

Submitted 29 March, 2022; v1 submitted 5 January, 2022; originally announced January 2022.

arXiv:2112.13384 [pdf, other]

doi 10.1145/3487351.3488276

Will You Dance To The Challenge? Predicting User Participation of TikTok Challenges

Authors: Lynnette Hui Xian Ng, John Yeh Han Tan, Darryl Jing Heng Tan, Roy Ka-Wei Lee

Abstract: TikTok is a popular new social media, where users express themselves through short video clips. A common form of interaction on the platform is participating in "challenges", which are songs and dances for users to iterate upon. Challenge contagion can be measured through replication reach, i.e., users uploading videos of their participation in the challenges. The uniqueness of the TikTok platform… ▽ More TikTok is a popular new social media, where users express themselves through short video clips. A common form of interaction on the platform is participating in "challenges", which are songs and dances for users to iterate upon. Challenge contagion can be measured through replication reach, i.e., users uploading videos of their participation in the challenges. The uniqueness of the TikTok platform where both challenge content and user preferences are evolving requires the combination of challenge and user representation. This paper investigates social contagion of TikTok challenges through predicting a user's participation. We propose a novel deep learning model, deepChallenger, to learn and combine latent user and challenge representations from past videos to perform this user-challenge prediction task. We collect a dataset of over 7,000 videos from 12 trending challenges on the ForYouPage, the app's landing page, and over 10,000 videos from 1303 users. Extensive experiments are conducted and the results show that our proposed deepChallenger (F1=0.494) outperforms baselines (F1=0.188) in the prediction task. △ Less

Submitted 26 December, 2021; originally announced December 2021.

Comments: Accepted at ASONAM 2021

arXiv:2110.03887 [pdf, other]

Environment Aware Text-to-Speech Synthesis

Authors: Daxin Tan, Guangyan Zhang, Tan Lee

Abstract: This study aims at designing an environment-aware text-to-speech (TTS) system that can generate speech to suit specific acoustic environments. It is also motivated by the desire to leverage massive data of speech audio from heterogeneous sources in TTS system development. The key idea is to model the acoustic environment in speech audio as a factor of data variability and incorporate it as a condi… ▽ More This study aims at designing an environment-aware text-to-speech (TTS) system that can generate speech to suit specific acoustic environments. It is also motivated by the desire to leverage massive data of speech audio from heterogeneous sources in TTS system development. The key idea is to model the acoustic environment in speech audio as a factor of data variability and incorporate it as a condition in the process of neural network based speech synthesis. Two embedding extractors are trained with two purposely constructed datasets for characterization and disentanglement of speaker and environment factors in speech. A neural network model is trained to generate speech from extracted speaker and environment embeddings. Objective and subjective evaluation results demonstrate that the proposed TTS system is able to effectively disentangle speaker and environment factors and synthesize speech audio that carries designated speaker characteristics and environment attribute. Audio samples are available online for demonstration https://daxintan-cuhk.github.io/Environment-Aware-TTS/ . △ Less

Submitted 6 August, 2022; v1 submitted 8 October, 2021; originally announced October 2021.

Comments: Accepted by Interspeech 2022

arXiv:2110.03857 [pdf, other]

A study on the efficacy of model pre-training in developing neural text-to-speech system

Authors: Guangyan Zhang, Yichong Leng, Daxin Tan, Ying Qin, Kaitao Song, Xu Tan, Sheng Zhao, Tan Lee

Abstract: In the development of neural text-to-speech systems, model pre-training with a large amount of non-target speakers' data is a common approach. However, in terms of ultimately achieved system performance for target speaker(s), the actual benefits of model pre-training are uncertain and unstable, depending very much on the quantity and text content of training data. This study aims to understand bet… ▽ More In the development of neural text-to-speech systems, model pre-training with a large amount of non-target speakers' data is a common approach. However, in terms of ultimately achieved system performance for target speaker(s), the actual benefits of model pre-training are uncertain and unstable, depending very much on the quantity and text content of training data. This study aims to understand better why and how model pre-training can positively contribute to TTS system performance. It is postulated that the pre-training process plays a critical role in learning text-related variation in speech, while further training with the target speaker's data aims to capture the speaker-related variation. Different test sets are created with varying degrees of similarity to target speaker data in terms of text content. Experiments show that leveraging a speaker-independent TTS trained on speech data with diverse text content can improve the target speaker TTS on domain-mismatched text. We also attempt to reduce the amount of pre-training data for a new text domain and improve the data and computational efficiency. It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size. △ Less

Submitted 7 October, 2021; originally announced October 2021.

arXiv:2108.02821 [pdf, other]

Applying the Information Bottleneck Principle to Prosodic Representation Learning

Authors: Guangyan Zhang, Ying Qin, Daxin Tan, Tan Lee

Abstract: This paper describes a novel design of a neural network-based speech generation model for learning prosodic representation.The problem of representation learning is formulated according to the information bottleneck (IB) principle. A modified VQ-VAE quantized layer is incorporated in the speech generation model to control the IB capacity and adjust the balance between reconstruction power and dise… ▽ More This paper describes a novel design of a neural network-based speech generation model for learning prosodic representation.The problem of representation learning is formulated according to the information bottleneck (IB) principle. A modified VQ-VAE quantized layer is incorporated in the speech generation model to control the IB capacity and adjust the balance between reconstruction power and disentangle capability of the learned representation. The proposed model is able to learn word-level prosodic representations from speech data. With an optimized IB capacity, the learned representations not only are adequate to reconstruct the original speech but also can be used to transfer the prosody onto different textual content. Extensive results of the objective and subjective evaluation are presented to demonstrate the effect of IB capacity control, the effectiveness, and potential usage of the learned prosodic representation in controllable neural speech generation. △ Less

Submitted 5 August, 2021; originally announced August 2021.

Comments: To be appeared in Interspeech 2021

arXiv:2107.01554 [pdf, other]

EditSpeech: A Text Based Speech Editing System Using Partial Inference and Bidirectional Fusion

Authors: Daxin Tan, Liqun Deng, Yu Ting Yeung, Xin Jiang, Xiao Chen, Tan Lee

Abstract: This paper presents the design, implementation and evaluation of a speech editing system, named EditSpeech, which allows a user to perform deletion, insertion and replacement of words in a given speech utterance, without causing audible degradation in speech quality and naturalness. The EditSpeech system is developed upon a neural text-to-speech (NTTS) synthesis framework. Partial inference and bi… ▽ More This paper presents the design, implementation and evaluation of a speech editing system, named EditSpeech, which allows a user to perform deletion, insertion and replacement of words in a given speech utterance, without causing audible degradation in speech quality and naturalness. The EditSpeech system is developed upon a neural text-to-speech (NTTS) synthesis framework. Partial inference and bidirectional fusion are proposed to effectively incorporate the contextual information related to the edited region and achieve smooth transition at both left and right boundaries. Distortion introduced to the unmodified parts of the utterance is alleviated. The EditSpeech system is developed and evaluated on English and Chinese in multi-speaker scenarios. Objective and subjective evaluation demonstrate that EditSpeech outperforms a few baseline systems in terms of low spectral distortion and preferred speech quality. Audio samples are available online for demonstration https://daxintan-cuhk.github.io/EditSpeech/ . △ Less

Submitted 7 October, 2021; v1 submitted 4 July, 2021; originally announced July 2021.

Comments: Accepted by ASRU 2021

arXiv:2103.17086 [pdf, other]

Deep adaptive fuzzy clustering for evolutionary unsupervised representation learning

Authors: Dayu Tan, Zheng Huang, Xin Peng, Weimin Zhong, Vladimir Mahalec

Abstract: Cluster assignment of large and complex images is a crucial but challenging task in pattern recognition and computer vision. In this study, we explore the possibility of employing fuzzy clustering in a deep neural network framework. Thus, we present a novel evolutionary unsupervised learning representation model with iterative optimization. It implements the deep adaptive fuzzy clustering (DAFC) s… ▽ More Cluster assignment of large and complex images is a crucial but challenging task in pattern recognition and computer vision. In this study, we explore the possibility of employing fuzzy clustering in a deep neural network framework. Thus, we present a novel evolutionary unsupervised learning representation model with iterative optimization. It implements the deep adaptive fuzzy clustering (DAFC) strategy that learns a convolutional neural network classifier from given only unlabeled data samples. DAFC consists of a deep feature quality-verifying model and a fuzzy clustering model, where deep feature representation learning loss function and embedded fuzzy clustering with the weighted adaptive entropy is implemented. We joint fuzzy clustering to the deep reconstruction model, in which fuzzy membership is utilized to represent a clear structure of deep cluster assignments and jointly optimize for the deep representation learning and clustering. Also, the joint model evaluates current clustering performance by inspecting whether the re-sampled data from estimated bottleneck space have consistent clustering properties to progressively improve the deep clustering model. Comprehensive experiments on a variety of datasets show that the proposed method obtains a substantially better performance for both reconstruction and clustering quality when compared to the other state-of-the-art deep clustering methods, as demonstrated with the in-depth analysis in the extensive experiments. △ Less

Submitted 31 March, 2021; originally announced March 2021.

arXiv:2103.04699 [pdf, other]

CUHK-EE Voice Cloning System for ICASSP 2021 M2VoC Challenge

Authors: Daxin Tan, Hingpang Huang, Guangyan Zhang, Tan Lee

Abstract: This paper presents the CUHK-EE voice cloning system for ICASSP 2021 M2VoC challenge. The challenge provides two Mandarin speech corpora: the AIShell-3 corpus of 218 speakers with noise and reverberation and the MST corpus including high-quality speech of one male and one female speakers. 100 and 5 utterances of 3 target speakers in different voice and style are provided in track 1 and 2 respectiv… ▽ More This paper presents the CUHK-EE voice cloning system for ICASSP 2021 M2VoC challenge. The challenge provides two Mandarin speech corpora: the AIShell-3 corpus of 218 speakers with noise and reverberation and the MST corpus including high-quality speech of one male and one female speakers. 100 and 5 utterances of 3 target speakers in different voice and style are provided in track 1 and 2 respectively, and the participants are required to synthesize speech in target speaker's voice and style. We take part in the track 1 and carry out voice cloning based on 100 utterances of target speakers. An end-to-end voicing cloning system is developed to accomplish the task, which includes: 1. a text and speech front-end module with the help of forced alignment, 2. an acoustic model combining Tacotron2 and DurIAN to predict melspectrogram, 3. a Hifigan vocoder for waveform generation. Our system comprises three stages: multi-speaker training stage, target speaker adaption stage and target speaker synthesis stage. Our team is identified as T17. The subjective evaluation results provided by the challenge organizer demonstrate the effectiveness of our system. Audio samples are available at our demo page: https://daxintan-cuhk.github.io/CUHK-EE-system-M2VoC-challenge/ . △ Less

Submitted 5 July, 2021; v1 submitted 8 March, 2021; originally announced March 2021.

arXiv:2102.07982 [pdf, other]

Voice Gender Scoring and Independent Acoustic Characterization of Perceived Masculinity and Femininity

Authors: Fuling Chen, Roberto Togneri, Murray Maybery, Diana Tan

Abstract: Previous research has found that voices can provide reliable information to be used for gender classification with a high level of accuracy. In social psychology, perceived masculinity and femininity (masculinity and femininity rated by humans) has often been considered an important feature when investigating the influence of vocal features on social behaviours. While previous studies have charact… ▽ More Previous research has found that voices can provide reliable information to be used for gender classification with a high level of accuracy. In social psychology, perceived masculinity and femininity (masculinity and femininity rated by humans) has often been considered an important feature when investigating the influence of vocal features on social behaviours. While previous studies have characterised the acoustic features that contributed to perceivers' judgements of speakers' masculinity or femininity, there is limited research on developing a machine masculinity/femininity scoring model and characterizing the independent acoustic factors that contribute to perceivers' masculinity and femininity judgements. In this work, we first propose a machine scoring model of perceived masculinity/femininity based on the Extreme Random Forest and then characterize the independent and meaningful acoustic factors that contribute to perceivers' judgements by using a correlation matrix based hierarchical clustering method. Our results show that the machine ratings of masculinity and femininity strongly correlated with the human ratings of masculinity and femininity when we used an optimal speech duration of 7 seconds, with a correlation coefficient of up to .63 for females and .77 for males. Nine independent clusters of acoustic measures were generated from our modelling of femininity judgements for female voices and eight clusters were found for masculinity judgements for male voices. The results revealed that, for both genders, the F0 mean is the most important acoustic measure affecting the judgement of acoustic-related masculinity and femininity. The F3 mean, F4 mean and VTL estimators were found to be highly inter-correlated and appeared in the same cluster, forming the second most significant factor in influencing the assessment of acoustic-related masculinity and femininity. △ Less

Submitted 4 August, 2022; v1 submitted 16 February, 2021; originally announced February 2021.

Comments: 24 pages, 7 figures, journal

arXiv:2012.14629 [pdf, other]

TrustMAE: A Noise-Resilient Defect Classification Framework using Memory-Augmented Auto-Encoders with Trust Regions

Authors: Daniel Stanley Tan, Yi-Chun Chen, Trista Pei-Chun Chen, Wei-Chao Chen

Abstract: In this paper, we propose a framework called TrustMAE to address the problem of product defect classification. Instead of relying on defective images that are difficult to collect and laborious to label, our framework can accept datasets with unlabeled images. Moreover, unlike most anomaly detection methods, our approach is robust against noises, or defective images, in the training dataset. Our f… ▽ More In this paper, we propose a framework called TrustMAE to address the problem of product defect classification. Instead of relying on defective images that are difficult to collect and laborious to label, our framework can accept datasets with unlabeled images. Moreover, unlike most anomaly detection methods, our approach is robust against noises, or defective images, in the training dataset. Our framework uses a memory-augmented auto-encoder with a sparse memory addressing scheme to avoid over-generalizing the auto-encoder, and a novel trust-region memory updating scheme to keep the noises away from the memory slots. The result is a framework that can reconstruct defect-free images and identify the defective regions using a perceptual distance network. When compared against various state-of-the-art baselines, our approach performs competitively under noise-free MVTec datasets. More importantly, it remains effective at a noise level up to 40% while significantly outperforming other baselines. △ Less

Submitted 29 December, 2020; originally announced December 2020.

Comments: Accepted at WACV 2021

arXiv:2011.08534 [pdf, other]

A Divide et Impera Approach for 3D Shape Reconstruction from Multiple Views

Authors: Riccardo Spezialetti, David Joseph Tan, Alessio Tonioni, Keisuke Tateno, Federico Tombari

Abstract: Estimating the 3D shape of an object from a single or multiple images has gained popularity thanks to the recent breakthroughs powered by deep learning. Most approaches regress the full object shape in a canonical pose, possibly extrapolating the occluded parts based on the learned priors. However, their viewpoint invariant technique often discards the unique structures visible from the input imag… ▽ More Estimating the 3D shape of an object from a single or multiple images has gained popularity thanks to the recent breakthroughs powered by deep learning. Most approaches regress the full object shape in a canonical pose, possibly extrapolating the occluded parts based on the learned priors. However, their viewpoint invariant technique often discards the unique structures visible from the input images. In contrast, this paper proposes to rely on viewpoint variant reconstructions by merging the visible information from the given views. Our approach is divided into three steps. Starting from the sparse views of the object, we first align them into a common coordinate system by estimating the relative pose between all the pairs. Then, inspired by the traditional voxel carving, we generate an occupancy grid of the object taken from the silhouette on the images and their relative poses. Finally, we refine the initial reconstruction to build a clean 3D model which preserves the details from each viewpoint. To validate the proposed method, we perform a comprehensive evaluation on the ShapeNet reference benchmark in terms of relative pose estimation and 3D shape reconstruction. △ Less

Submitted 18 November, 2020; v1 submitted 17 November, 2020; originally announced November 2020.

Comments: Accepted to 3DV 2020 as oral

arXiv:2011.03943 [pdf, other]

doi 10.21437/Interspeech.2021-1129

Fine-grained Style Modeling, Transfer and Prediction in Text-to-Speech Synthesis via Phone-Level Content-Style Disentanglement

Authors: Daxin Tan, Tan Lee

Abstract: This paper presents a novel design of neural network system for fine-grained style modeling, transfer and prediction in expressive text-to-speech (TTS) synthesis. Fine-grained modeling is realized by extracting style embeddings from the mel-spectrograms of phone-level speech segments. Collaborative learning and adversarial learning strategies are applied in order to achieve effective disentangleme… ▽ More This paper presents a novel design of neural network system for fine-grained style modeling, transfer and prediction in expressive text-to-speech (TTS) synthesis. Fine-grained modeling is realized by extracting style embeddings from the mel-spectrograms of phone-level speech segments. Collaborative learning and adversarial learning strategies are applied in order to achieve effective disentanglement of content and style factors in speech and alleviate the "content leakage" problem in style modeling. The proposed system can be used for varying-content speech style transfer in the single-speaker scenario. The results of objective and subjective evaluation show that our system performs better than other fine-grained speech style transfer models, especially in the aspect of content preservation. By incorporating a style predictor, the proposed system can also be used for text-to-speech synthesis. Audio samples are provided for system demonstration https://daxintan-cuhk.github.io/pl-csd-speech . △ Less

Submitted 7 October, 2021; v1 submitted 8 November, 2020; originally announced November 2020.

Comments: Accepted by Interspeech 2021

arXiv:2008.07358 [pdf, other]

SoftPoolNet: Shape Descriptor for Point Cloud Completion and Classification

Authors: Yida Wang, David Joseph Tan, Nassir Navab, Federico Tombari

Abstract: Point clouds are often the default choice for many applications as they exhibit more flexibility and efficiency than volumetric data. Nevertheless, their unorganized nature -- points are stored in an unordered way -- makes them less suited to be processed by deep learning pipelines. In this paper, we propose a method for 3D object completion and classification based on point clouds. We introduce a… ▽ More Point clouds are often the default choice for many applications as they exhibit more flexibility and efficiency than volumetric data. Nevertheless, their unorganized nature -- points are stored in an unordered way -- makes them less suited to be processed by deep learning pipelines. In this paper, we propose a method for 3D object completion and classification based on point clouds. We introduce a new way of organizing the extracted features based on their activations, which we name soft pooling. For the decoder stage, we propose regional convolutions, a novel operator aimed at maximizing the global activation entropy. Furthermore, inspired by the local refining procedure in Point Completion Network (PCN), we also propose a patch-deforming operation to simulate deconvolutional operations for point clouds. This paper proves that our regional activation can be incorporated in many point cloud architectures like AtlasNet and PCN, leading to better performance for geometric completion. We evaluate our approach on different 3D tasks such as object completion and classification, achieving state-of-the-art accuracy. △ Less

Submitted 17 August, 2020; originally announced August 2020.

Comments: accepted in ECCV 2020 as oral

arXiv:2001.07108 [pdf, other]

Spectral Pyramid Graph Attention Network for Hyperspectral Image Classification

Authors: Tinghuai Wang, Guangming Wang, Kuan Eeik Tan, Donghui Tan

Abstract: Convolutional neural networks (CNN) have made significant advances in hyperspectral image (HSI) classification. However, standard convolutional kernel neglects the intrinsic connections between data points, resulting in poor region delineation and small spurious predictions. Furthermore, HSIs have a unique continuous data distribution along the high dimensional spectrum domain - much remains to be… ▽ More Convolutional neural networks (CNN) have made significant advances in hyperspectral image (HSI) classification. However, standard convolutional kernel neglects the intrinsic connections between data points, resulting in poor region delineation and small spurious predictions. Furthermore, HSIs have a unique continuous data distribution along the high dimensional spectrum domain - much remains to be addressed in characterizing the spectral contexts considering the prohibitively high dimensionality and improving reasoning capability in light of the limited amount of labelled data. This paper presents a novel architecture which explicitly addresses these two issues. Specifically, we design an architecture to encode the multiple spectral contextual information in the form of spectral pyramid of multiple embedding spaces. In each spectral embedding space, we propose graph attention mechanism to explicitly perform interpretable reasoning in the spatial domain based on the connection in spectral feature space. Experiments on three HSI datasets demonstrate that the proposed architecture can significantly improve the classification accuracy compared with the existing methods. △ Less

Submitted 20 January, 2020; originally announced January 2020.

Comments: 7 pages, 6 figures, 4 tables

arXiv:1910.09156 [pdf]

Easy Java/JavaScript Simulations as a tool for Learning Analytics

Authors: Francisco Esquembre, Félix J. García Clemente, Rafael Chicón, Lawrence Wee, Leong Tze Kwang, Darren Tan

Abstract: In this paper we introduce the new and planned features of Easy Java/JavaScript Simulations (EJS) to support Learning Analytics (LA) and Educational Data Mining (EDM) research and practice in the use of simulations for the teaching and self-learning of natural sciences and engineering. Simulations created with EJS can now be easily embedded in a popular Learning Management System using a new plug-… ▽ More In this paper we introduce the new and planned features of Easy Java/JavaScript Simulations (EJS) to support Learning Analytics (LA) and Educational Data Mining (EDM) research and practice in the use of simulations for the teaching and self-learning of natural sciences and engineering. Simulations created with EJS can now be easily embedded in a popular Learning Management System using a new plug-in that allows creation of full-fledged instructional units that also collect and record fine-grained, instructional-savvy data of the students interaction with the simulation. The resulting data can then be mined to obtain information about students performance, behaviors, or learning procedures with the intention to support student learning, provide instructors with timely information about student performance, and also help optimize the pedagogic design of the simulations themselves. We describe the current development and architecture, as well as future directions for testing and extending the current capabilities of EJS as a modelling and authoring tool to support LA and EDM research and practice in the use of simulations for teaching science, in particular in the context of the increasingly popular online learning platforms. △ Less

Submitted 21 October, 2019; originally announced October 2019.

Comments: 7 pages, 4 figures, conference paper Shih, J. L. et al. (Eds.) (2019). Proceedings of the 27th International Conference on Computers in Education. Taiwan: Asia-Pacific Society for Computers in Education

arXiv:1909.01106 [pdf, other]

ForkNet: Multi-branch Volumetric Semantic Completion from a Single Depth Image

Authors: Yida Wang, David Joseph Tan, Nassir Navab, Federico Tombari

Abstract: We propose a novel model for 3D semantic completion from a single depth image, based on a single encoder and three separate generators used to reconstruct different geometric and semantic representations of the original and completed scene, all sharing the same latent space. To transfer information between the geometric and semantic branches of the network, we introduce paths between them concaten… ▽ More We propose a novel model for 3D semantic completion from a single depth image, based on a single encoder and three separate generators used to reconstruct different geometric and semantic representations of the original and completed scene, all sharing the same latent space. To transfer information between the geometric and semantic branches of the network, we introduce paths between them concatenating features at corresponding network layers. Motivated by the limited amount of training samples from real scenes, an interesting attribute of our architecture is the capacity to supplement the existing dataset by generating a new training dataset with high quality, realistic scenes that even includes occlusion and real noise. We build the new dataset by sampling the features directly from latent space which generates a pair of partial volumetric surface and completed volumetric semantic surface. Moreover, we utilize multiple discriminators to increase the accuracy and realism of the reconstructions. We demonstrate the benefits of our approach on standard benchmarks for the two most common completion tasks: semantic 3D scene completion and 3D object completion. △ Less

Submitted 3 September, 2019; originally announced September 2019.

Comments: Accepted in International Conference on Computer Vision 2019

arXiv:1903.03804 [pdf, other]

Program Classification Using Gated Graph Attention Neural Network for Online Programming Service

Authors: Mingming Lu, Dingwu Tan, Naixue Xiong, Zailiang Chen, Haifeng Li

Abstract: The online programing services, such as Github,TopCoder, and EduCoder, have promoted a lot of social interactions among the service users. However, the existing social interactions is rather limited and inefficient due to the rapid increasing of source-code repositories, which is difficult to explore manually. The emergence of source-code mining provides a promising way to analyze those source cod… ▽ More The online programing services, such as Github,TopCoder, and EduCoder, have promoted a lot of social interactions among the service users. However, the existing social interactions is rather limited and inefficient due to the rapid increasing of source-code repositories, which is difficult to explore manually. The emergence of source-code mining provides a promising way to analyze those source codes, so that those source codes can be relatively easy to understand and share among those service users. Among all the source-code mining attempts,program classification lays a foundation for various tasks related to source-code understanding, because it is impossible for a machine to understand a computer program if it cannot classify the program correctly. Although numerous machine learning models, such as the Natural Language Processing (NLP) based models and the Abstract Syntax Tree (AST) based models, have been proposed to classify computer programs based on their corresponding source codes, the existing works cannot fully characterize the source codes from the perspective of both the syntax and semantic information. To address this problem, we proposed a Graph Neural Network (GNN) based model, which integrates data flow and function call information to the AST,and applies an improved GNN model to the integrated graph, so as to achieve the state-of-art program classification accuracy. The experiment results have shown that the proposed work can classify programs with accuracy over 97%. △ Less

Submitted 9 March, 2019; originally announced March 2019.

Comments: 12 pages, 27 figures

arXiv:1810.10901 [pdf, other]

doi 10.1109/3DV.2018.00056

Adversarial Semantic Scene Completion from a Single Depth Image

Authors: Yida Wang, David Joseph Tan, Nassir Navab, Federico Tombari

Abstract: We propose a method to reconstruct, complete and semantically label a 3D scene from a single input depth image. We improve the accuracy of the regressed semantic 3D maps by a novel architecture based on adversarial learning. In particular, we suggest using multiple adversarial loss terms that not only enforce realistic outputs with respect to the ground truth, but also an effective embedding of th… ▽ More We propose a method to reconstruct, complete and semantically label a 3D scene from a single input depth image. We improve the accuracy of the regressed semantic 3D maps by a novel architecture based on adversarial learning. In particular, we suggest using multiple adversarial loss terms that not only enforce realistic outputs with respect to the ground truth, but also an effective embedding of the internal features. This is done by correlating the latent features of the encoder working on partial 2.5D data with the latent features extracted from a variational 3D auto-encoder trained to reconstruct the complete semantic scene. In addition, differently from other approaches that operate entirely through 3D convolutions, at test time we retain the original 2.5D structure of the input during downsampling to improve the effectiveness of the internal representation of our model. We test our approach on the main benchmark datasets for semantic scene completion to qualitatively and quantitatively assess the effectiveness of our proposal. △ Less

Submitted 25 October, 2018; originally announced October 2018.

Comments: 2018 International Conference on 3D Vision (3DV)

Journal ref: 2018 International Conference on 3D Vision (3DV), Verona, Italy, 2018, pp. 426-434

Showing 1–50 of 60 results for author: Tan, D