-
Every Pixel Has its Moments: Ultra-High-Resolution Unpaired Image-to-Image Translation via Dense Normalization
Authors:
Ming-Yang Ho,
Che-Ming Wu,
Min-Sheng Wu,
Yufeng Jane Tseng
Abstract:
Recent advancements in ultra-high-resolution unpaired image-to-image translation have aimed to mitigate the constraints imposed by limited GPU memory through patch-wise inference. Nonetheless, existing methods often compromise between the reduction of noticeable tiling artifacts and the preservation of color and hue contrast, attributed to the reliance on global image- or patch-level statistics in the instance normalization layers. In this study, we introduce a Dense Normalization (DN) layer designed to estimate pixel-level statistical moments. This approach effectively diminishes tiling artifacts while concurrently preserving local color and hue contrasts. To address the computational demands of pixel-level estimation, we further propose an efficient interpolation algorithm. Moreover, we devise a parallelism strategy that enables the DN layer to operate in a single pass. Through extensive experiments, we demonstrate that our method surpasses all existing approaches in performance. Notably, our DN layer is hyperparameter-free and can be seamlessly integrated into most unpaired image-to-image translation frameworks without necessitating retraining. Overall, our work paves the way for future exploration in handling images of arbitrary resolutions within the realm of unpaired image-to-image translation. Code is available at: https://github.com/Kaminyou/Dense-Normalization.
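A minimal NumPy sketch of the idea of pixel-level moments: per-patch, per-channel means and standard deviations are placed at patch centers and bilinearly interpolated to every pixel before normalization, so neighboring patches share smoothly varying statistics instead of switching abruptly at tile borders. The regular patch grid, the bilinear weights, the clamped borders, and the names `patch_moments`, `upsample_bilinear` and `dense_normalize` are illustrative assumptions; this is not the authors' interpolation algorithm or their single-pass parallel implementation (see the linked repository for that).

```python
import numpy as np


def patch_moments(x, patch):
    """Per-patch, per-channel mean and std of x with shape (C, H, W).

    Assumes H and W are multiples of `patch`; returns two arrays of shape
    (C, H // patch, W // patch)."""
    c, h, w = x.shape
    gh, gw = h // patch, w // patch
    blocks = x.reshape(c, gh, patch, gw, patch)
    return blocks.mean(axis=(2, 4)), blocks.std(axis=(2, 4))


def upsample_bilinear(grid, patch):
    """Interpolate a (C, gh, gw) grid whose values sit at patch centers to a
    (C, gh * patch, gw * patch) per-pixel map, clamping at the borders."""
    c, gh, gw = grid.shape
    # Pixel-center coordinates expressed in patch-center (grid) units.
    ys = np.clip((np.arange(gh * patch) + 0.5) / patch - 0.5, 0, gh - 1)
    xs = np.clip((np.arange(gw * patch) + 0.5) / patch - 0.5, 0, gw - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, gh - 1)
    x1 = np.minimum(x0 + 1, gw - 1)
    wy = (ys - y0)[:, None]  # (H, 1) vertical weights
    wx = (xs - x0)[None, :]  # (1, W) horizontal weights
    top = grid[:, y0][:, :, x0] * (1 - wx) + grid[:, y0][:, :, x1] * wx
    bottom = grid[:, y1][:, :, x0] * (1 - wx) + grid[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bottom * wy


def dense_normalize(x, patch, eps=1e-5):
    """Normalize every pixel with its own interpolated mean and std."""
    mean, std = patch_moments(x, patch)
    mu = upsample_bilinear(mean, patch)
    sigma = upsample_bilinear(std, patch)
    return (x - mu) / (sigma + eps)
```

When every patch has identical statistics, the interpolated moments are constant and the result reduces to ordinary per-patch instance normalization; when statistics differ, the moments change gradually across tile borders rather than in a step.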
Submitted 5 July, 2024;
originally announced July 2024.
-
CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems
Authors:
Haibin Wu,
Yuan Tseng,
Hung-yi Lee
Abstract:
Current state-of-the-art (SOTA) codec-based audio synthesis systems can mimic anyone's voice with just a 3-second sample from that specific unseen speaker. Unfortunately, malicious attackers may exploit these technologies, causing misuse and security issues. Anti-spoofing models have been developed to detect fake speech. However, whether current SOTA anti-spoofing models can effectively counter deepfake audio from codec-based speech synthesis systems remains an open question. In this paper, we curate an extensive collection of contemporary SOTA codec models, employing them to re-create synthesized speech. This endeavor leads to the creation of CodecFake, the first codec-based deepfake audio dataset. Additionally, we verify that anti-spoofing models trained on commonly used datasets cannot detect synthesized speech from current codec-based speech generation systems. The proposed CodecFake dataset empowers these models to counter this challenge effectively.
Submitted 11 June, 2024;
originally announced June 2024.
-
A DeNoising FPN With Transformer R-CNN for Tiny Object Detection
Authors:
Hou-I Liu,
Yu-Wen Tseng,
Kai-Cheng Chang,
Pin-Jyun Wang,
Hong-Han Shuai,
Wen-Huang Cheng
Abstract:
Despite notable advancements in the field of computer vision, the precise detection of tiny objects continues to pose a significant challenge, largely owing to the minuscule pixel representation allocated to these objects in imagery data. This challenge resonates profoundly in the domain of geoscience and remote sensing, where high-fidelity detection of tiny objects can facilitate a myriad of applications ranging from urban planning to environmental monitoring. In this paper, we propose a new framework, namely, DeNoising FPN with Trans R-CNN (DNTR), to improve the performance of tiny object detection. DNTR consists of an easy plug-in design, DeNoising FPN (DN-FPN), and an effective Transformer-based detector, Trans R-CNN. Specifically, feature fusion in the feature pyramid network is important for detecting multiscale objects. However, noisy features may be produced during the fusion process since there is no regularization between the features of different scales. Therefore, we introduce a DN-FPN module that utilizes contrastive learning to suppress noise in each level's features in the top-down path of FPN. Second, based on the two-stage framework, we replace the obsolete R-CNN detector with a novel Trans R-CNN detector to focus on the representation of tiny objects with self-attention. Experimental results show that our DNTR outperforms the baselines by at least 17.4% in terms of APvt on the AI-TOD dataset and 9.6% in terms of AP on the VisDrone dataset, respectively. Our code will be available at https://github.com/hoiliu-0801/DNTR.
Submitted 15 June, 2024; v1 submitted 9 June, 2024;
originally announced June 2024.
-
MP-PolarMask: A Faster and Finer Instance Segmentation for Concave Images
Authors:
Ke-Lei Wang,
Pin-Hsuan Chou,
Young-Ching Chou,
Chia-Jen Liu,
Cheng-Kuan Lin,
Yu-Chee Tseng
Abstract:
While there are many models for instance segmentation, PolarMask stands out as a unique one that represents an object in a polar coordinate system. With an anchor-box-free design and a single-stage framework that conducts detection and segmentation at the same time, PolarMask has been shown to balance efficiency and accuracy. Hence, it can easily be connected with other downstream real-time applications. In this work, we observe that PolarMask has two deficiencies: (i) an inability to represent concave objects and (ii) inefficiency in using ray regression. We propose MP-PolarMask (Multi-Point PolarMask), which takes advantage of multiple polar systems. The main idea is to extend from one main polar system to four auxiliary polar systems, making it capable of representing more complicated convex-and-concave-mixed shapes. We validate MP-PolarMask on both general objects and food objects of the COCO dataset, and the results demonstrate significant improvements of 13.69% in AP_L and 7.23% in AP over PolarMask with 36 rays.
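In PolarMask-style representations, a contour is encoded by ray lengths cast from a center at uniformly spaced angles (36 rays in the comparison above). The sketch below decodes such rays into a polygon, measures it with the shoelace formula, and rasterizes it. Because each angle carries exactly one ray length, a single polar system can only describe shapes that are star-shaped with respect to its center, which is why concave objects are hard for it. Merging several polar systems by a pixel-wise union of their masks is an assumed, illustrative combination rule, not the one used in the MP-PolarMask paper, and all function names are hypothetical.

```python
import numpy as np


def polar_to_contour(center, ray_lengths):
    """Decode n ray lengths, cast at uniform angles from `center`, into an
    (n, 2) array of (x, y) contour vertices."""
    n = len(ray_lengths)
    theta = 2 * np.pi * np.arange(n) / n
    d = np.asarray(ray_lengths, dtype=float)
    cx, cy = center
    return np.stack([cx + d * np.cos(theta), cy + d * np.sin(theta)], axis=1)


def polygon_area(vertices):
    """Shoelace formula for a simple polygon."""
    x, y = vertices[:, 0], vertices[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))


def rasterize(vertices, height, width):
    """Boolean mask of pixel centers inside the polygon (even-odd rule)."""
    ys, xs = np.mgrid[0:height, 0:width] + 0.5
    inside = np.zeros((height, width), dtype=bool)
    x, y = vertices[:, 0], vertices[:, 1]
    for xa, ya, xb, yb in zip(x, y, np.roll(x, -1), np.roll(y, -1)):
        # Does this edge straddle the horizontal line through the pixel center?
        crosses = (ya > ys) != (yb > ys)
        with np.errstate(divide="ignore", invalid="ignore"):
            x_int = xa + (ys - ya) * (xb - xa) / (yb - ya)
        # Toggle for each crossing that lies to the right of the pixel.
        inside ^= crosses & (xs < x_int)
    return inside


def union_of_polar_masks(systems, height, width):
    """Hypothetical multi-point combination: OR the masks of several
    (center, ray_lengths) polar systems."""
    mask = np.zeros((height, width), dtype=bool)
    for center, rays in systems:
        mask |= rasterize(polar_to_contour(center, rays), height, width)
    return mask
```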
Submitted 3 June, 2024;
originally announced June 2024.
-
Two Tales of Persona in LLMs: A Survey of Role-Playing and Personalization
Authors:
Yu-Min Tseng,
Yu-Chao Huang,
Teng-Yun Hsiao,
Wei-Lin Chen,
Chao-Wei Huang,
Yu Meng,
Yun-Nung Chen
Abstract:
The concept of persona, originally adopted in dialogue literature, has resurged as a promising framework for tailoring large language models (LLMs) to specific contexts (e.g., personalized search, LLM-as-a-judge). However, the growing research on leveraging persona in LLMs is relatively disorganized and lacks a systematic taxonomy. To close the gap, we present a comprehensive survey to categorize the current state of the field. We identify two lines of research, namely (1) LLM Role-Playing, where personas are assigned to LLMs, and (2) LLM Personalization, where LLMs account for user personas. Additionally, we introduce existing methods for LLM personality evaluation. To the best of our knowledge, we present the first survey for role-playing and personalization in LLMs under the unified view of persona. We continuously maintain a paper collection to foster future endeavors: https://github.com/MiuLab/PersonaLLM-Survey
Submitted 26 June, 2024; v1 submitted 3 June, 2024;
originally announced June 2024.
-
Spin-orbital excitations encoding the magnetic phase transition in the van der Waals antiferromagnet FePS$_{3}$
Authors:
Yuan Wei,
Yi Tseng,
Hebatalla Elnaggar,
Wenliang Zhang,
Teguh Citra Asmara,
Eugenio Paris,
Gabriele Domaine,
Vladimir N. Strocov,
Luc Testa,
Virgile Favre,
Mario Di Luca,
Mitali Banerjee,
Andrew R. Wildes,
Frank M. F. de Groot,
Henrik M. Ronnow,
Thorsten Schmitt
Abstract:
In the rich phases of van der Waals (vdW) materials featuring intertwined electronic order and collective phenomena, characterizing elementary dynamics that entail the low-energy Hamiltonian and electronic degrees of freedom is of paramount importance. Here we performed resonant inelastic X-ray scattering (RIXS) to characterize the spin-orbital ground and excited states of the vdW antiferromagnetic insulator FePS$_{3}$, as well as their relation to magnetism. We observed a spectral enhancement of spin-orbital multiplet transitions at $\sim$100 and $\sim$220 meV, as well as of the quasielastic response, when entering the zig-zag antiferromagnetic phase, where the spectral changes develop an order-parameter-like evolution with temperature. By comparing with ligand field theory calculations, we discovered the essential role of trigonal lattice distortion and negative metal-ligand charge transfer in accounting for these emergent excitations. Such spectral profiles are further examined upon confinement by mechanical exfoliation. We reveal their spectral robustness down to the few-atomic-layer limit, in accordance with the persistent antiferromagnetic state previously reported in optical measurements. Our study demonstrates the versatile RIXS capability to resolve magneto-crystalline anisotropy and charge-transfer energetics. These results provide crucial insight into how the spontaneous magnetic symmetry breaking stabilizes in the quasi-two-dimensional limit for the vdW magnet FePS$_{3}$.
Submitted 22 May, 2024;
originally announced May 2024.
-
Word-specific tonal realizations in Mandarin
Authors:
Yu-Ying Chuang,
Melanie J. Bell,
Yu-Hsiang Tseng,
R. Harald Baayen
Abstract:
The pitch contours of Mandarin two-character words are generally understood as being shaped by the underlying tones of the constituent single-character words, in interaction with articulatory constraints imposed by factors such as speech rate, co-articulation with adjacent tones, segmental make-up, and predictability. This study shows that tonal realization is also partially determined by words' meanings. We first show, on the basis of a Taiwan corpus of spontaneous conversations, using the generalized additive regression model, and focusing on the rise-fall tone pattern, that after controlling for effects of speaker and context, word type is a stronger predictor of pitch realization than all the previously established word-form related predictors combined. Importantly, the addition of information about meaning in context improves prediction accuracy even further. We then proceed to show, using computational modeling with context-specific word embeddings, that token-specific pitch contours predict word type with 50% accuracy on held-out data, and that context-sensitive, token-specific embeddings can predict the shape of pitch contours with 30% accuracy. These accuracies, which are an order of magnitude above chance level, suggest that the relation between words' pitch contours and their meanings is sufficiently strong to be functional for language users. The theoretical implications of these empirical findings are discussed.
Submitted 11 May, 2024;
originally announced May 2024.
-
EmoVIT: Revolutionizing Emotion Insights with Visual Instruction Tuning
Authors:
Hongxia Xie,
Chu-Jun Peng,
Yu-Wen Tseng,
Hung-Jen Chen,
Chan-Feng Hsu,
Hong-Han Shuai,
Wen-Huang Cheng
Abstract:
Visual Instruction Tuning represents a novel learning paradigm involving the fine-tuning of pre-trained language models using task-specific instructions. This paradigm shows promising zero-shot results in various natural language processing tasks but remains unexplored in visual emotion understanding. In this work, we focus on enhancing the model's proficiency in understanding and adhering to instructions related to emotional contexts. Initially, we identify key visual clues critical to visual emotion recognition. Subsequently, we introduce a novel GPT-assisted pipeline for generating emotion visual instruction data, effectively addressing the scarcity of annotated instruction data in this domain. Expanding on the groundwork established by InstructBLIP, our proposed EmoVIT architecture incorporates emotion-specific instruction data, leveraging the powerful capabilities of Large Language Models to enhance performance. Through extensive experiments, our model showcases its proficiency in emotion classification, adeptness in affective reasoning, and competence in comprehending humor. The comparative analysis provides a robust benchmark for Emotion Visual Instruction Tuning in the era of LLMs, providing valuable insights and opening avenues for future exploration in this domain. Our code is available at https://github.com/aimmemotion/EmoVIT.
Submitted 25 April, 2024;
originally announced April 2024.
-
Nature of excitons and their ligand-mediated delocalization in nickel dihalide charge-transfer insulators
Authors:
Connor A. Occhialini,
Yi Tseng,
Hebatalla Elnaggar,
Qian Song,
Mark Blei,
Seth Ariel Tongay,
Valentina Bisogni,
Frank M. F. de Groot,
Jonathan Pelliciari,
Riccardo Comin
Abstract:
The fundamental optical excitations of correlated transition-metal compounds are typically identified with multielectronic transitions localized at the transition-metal site, such as $dd$ transitions. In this vein, intense interest has surrounded the appearance of sharp, below band-gap optical transitions, i.e. excitons, within the magnetic phase of correlated Ni$^{2+}$ van der Waals magnets. The interplay of magnetic and charge-transfer insulating ground states in Ni$^{2+}$ systems raises intriguing questions on the roles of long-range magnetic order and of metal-ligand charge transfer in the exciton nature, which inspired microscopic descriptions beyond typical $dd$ excitations. Here we study the impact of charge-transfer and magnetic order on the excitation spectrum of the nickel dihalides (NiX$_2$, X $=$ Cl, Br, and I) using Ni-$L_3$ resonant inelastic x-ray scattering (RIXS). In all compounds, we detect sharp excitations, analogous to the recently reported excitons, and assign them to spin-singlet multiplets of octahedrally-coordinated Ni$^{2+}$ stabilized by intra-atomic Hund's exchange. Additionally, we demonstrate that these excitons are dispersive using momentum resolved RIXS. Our data evidence a ligand-mediated multiplet dispersion, which is tuned by the charge-transfer gap and independent of the presence of long-range magnetic order. This reveals the mechanisms governing non-local interactions of on-site $dd$ excitations with the surrounding crystal/magnetic structure, in analogy to ground state superexchange. These measurements thus establish the roles of magnetic order, self-doped ligand holes, and intersite coupling mechanisms for the properties of $dd$ excitations in charge-transfer insulators.
Submitted 16 April, 2024;
originally announced April 2024.
-
Unraveling the Mn $L_3$-edge RIXS spectrum of lightly manganese doped Sr$_{3}$Ru$_{2}$O$_{7}$
Authors:
Wei-Yang Chen,
Shih-Wen Huang,
Yi Tseng,
Wenliang Zhang,
Eugenio Paris,
Teguh Citra Asmara,
Jenn-Min Lee,
Thorsten Schmitt,
Yu-Cheng Shao,
Yi-De Chuang,
Byron Freelon,
Dao-Xin Yao,
Trinanjan Datta
Abstract:
A resonant inelastic x-ray scattering (RIXS) experiment was performed at the Mn $L_3$ edge. A 10 $\%$ Mn-doped Sr$_{3}$Ru$_{2}$O$_{7}$ compound, in which the Mn$^{3+}$ ions are in the 3$d^4$ state, was probed for $dd$ excitations. The dilute doping concentration allows one to treat the dopant Mn$^{3+}$ ions as effectively free in the host ruthenium compound. The local nature of $dd$ RIXS spectroscopy permits one to use a single-site model to simulate the experimental spectra. The simulated spectra reproduce the in-plane [100] experimental RIXS spectrum. We also predict the intensity for the in-plane [110] direction and the out-of-plane spin orientation configuration [001]. Based on our single-ion model, we were able to fit the experimental data to obtain the crystal field parameters, the 10Dq value, and the intra-orbital spin-flip energy 2$\mathcal{J}$ (or $3J_{H}$, where $J_{H}$ is the Hund's energy) of the Mn$^{3+}$ ion. Utilizing our computed RIXS quantum transition amplitudes between the various $d$ orbitals of the Mn$^{3+}$ ion, the expression for the Kramers-Heisenberg cross section, and a self-consistent fitting procedure, we also identify the energy boundaries of the non-spin-flip and spin-flip $dd$ excitations present in the experimental data. From our fitting procedure we obtain $2\mathcal{J} (3J_{H})=2.06$ eV, a value in excellent agreement with that computed from the free-ion Racah parameters. We also identified the charge-transfer boundary. In addition to predicting the microscopic parameters, we find a quantum spin-flip transition in the non-cross ($σ_{in}-σ_{out}$, $π_{in}-π_{out}$) x-ray polarization channels of the $dd$ RIXS spectra. A similar transition was previously predicted to occur in the $π-π$ channel of the magnon spectrum in the non-collinear, non-coplanar Kagome compound composed of Cu$^{2+}$ 3d$^{9}$ ions.
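Reading the quoted fit literally (the abstract equates $2\mathcal{J}$ with $3J_{H}$), the single fitted number fixes both parameters:

```latex
2\mathcal{J} = 3J_{H} = 2.06~\text{eV}
\quad\Longrightarrow\quad
\mathcal{J} = 1.03~\text{eV}, \qquad
J_{H} = \frac{2.06~\text{eV}}{3} \approx 0.69~\text{eV}.
```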
Submitted 3 April, 2024;
originally announced April 2024.
-
Help Supporters: Exploring the Design Space of Assistive Technologies to Support Face-to-Face Help Between Blind and Sighted Strangers
Authors:
Yuanyang Teng,
Connor Courtien,
David Angel Rios,
Yves M. Tseng,
Jacqueline Gibson,
Maryam Aziz,
Avery Reyna,
Rajan Vaish,
Brian A. Smith
Abstract:
Blind and low-vision (BLV) people face many challenges when venturing into public environments, often wishing it were easier to get help from people nearby. Ironically, while many sighted individuals are willing to help, such interactions are infrequent. Asking for help is socially awkward for BLV people, and sighted people lack experience in helping BLV people. Through a mixed-ability research-through-design process, we explore four diverse approaches toward how assistive technology can serve as help supporters that collaborate with both BLV and sighted parties throughout the help process. These approaches span two phases: the connection phase (finding someone to help) and the collaboration phase (facilitating help after finding someone). Our findings from a 20-participant mixed-ability study reveal how help supporters can best facilitate connection, which types of information they should present during both phases, and more. We discuss design implications for future approaches to support face-to-face help.
Submitted 12 March, 2024;
originally announced March 2024.
-
Large Language Multimodal Models for 5-Year Chronic Disease Cohort Prediction Using EHR Data
Authors:
Jun-En Ding,
Phan Nguyen Minh Thao,
Wen-Chih Peng,
Jian-Zhe Wang,
Chun-Cheng Chug,
Min-Chen Hsieh,
Yun-Chien Tseng,
Ling Chen,
Dongsheng Luo,
Chi-Te Wang,
Pei-fu Chen,
Feng Liu,
Fang-Ming Hung
Abstract:
Chronic diseases such as diabetes are the leading causes of morbidity and mortality worldwide. Numerous studies have applied various deep learning models to diagnosis. However, most previous studies had certain limitations, including reliance on publicly available datasets (e.g., MIMIC) and imbalanced data. In this study, we collected five-year electronic health records (EHRs) from the Taiwan hospital database, including 1,420,596 clinical notes, 387,392 laboratory test results, and more than 1,505 laboratory test items, focusing on pre-training large language models. We proposed a novel Large Language Multimodal Models (LLMMs) framework incorporating multimodal data from clinical notes and laboratory test results for the prediction of chronic disease risk. Our method combined a text embedding encoder and a multi-head attention layer to learn laboratory test values, utilizing a deep neural network (DNN) module to merge blood features with chronic disease semantics into a latent space. In our experiments, we observe that clinicalBERT and PubMed-BERT, when combined with attention fusion, can achieve an accuracy of 73% in multiclass chronic disease and diabetes prediction. By transforming laboratory test values into textual descriptions and employing the Flan-T5 model, we achieved a 76% area under the ROC curve (AUROC), demonstrating the effectiveness of leveraging numerical text data for training and inference in language models. This approach significantly improves the accuracy of early-stage diabetes prediction.
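The abstract does not give the template used to turn laboratory values into text for Flan-T5. A minimal sketch of one possible serialization, assuming one sentence per test and a reference range for each item; the `LabResult` fields, the sentence wording, and the example values are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class LabResult:
    name: str
    value: float
    unit: str
    low: float   # lower bound of the reference range (hypothetical field)
    high: float  # upper bound of the reference range (hypothetical field)


def describe(result: LabResult) -> str:
    """Render one numeric lab value as a short English sentence."""
    if result.value < result.low:
        status = "below"
    elif result.value > result.high:
        status = "above"
    else:
        status = "within"
    return (f"{result.name} is {result.value:g} {result.unit}, "
            f"{status} the reference range "
            f"{result.low:g}-{result.high:g} {result.unit}.")


def serialize(results) -> str:
    """Concatenate per-test sentences into one text input for a language model."""
    return " ".join(describe(r) for r in results)


# Illustrative values only.
print(serialize([LabResult("HbA1c", 7.2, "%", 4.0, 5.6)]))
```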
Submitted 2 March, 2024;
originally announced March 2024.
-
Emergence of interfacial magnetism in strongly-correlated nickelate-titanate superlattices
Authors:
Teguh Citra Asmara,
Robert J. Green,
Andreas Suter,
Yuan Wei,
Wenliang Zhang,
Grant Harris,
Yi Tseng,
Tianlun Yu,
Davide Betto,
Mirian Garcia-Fernandez,
Stefano Agrestini,
Yannick Maximilian Klein,
Neeraj Kumar,
Carlos William Galdino,
Zaher Salman,
Thomas Prokscha,
Marisa Medarde,
Elisabeth Müller,
Yona Soh,
Nicholas B. Brookes,
Ke-Jin Zhou,
Milan Radovic,
Thorsten Schmitt
Abstract:
Strongly-correlated transition-metal oxides are widely known for their various exotic phenomena. This is exemplified by rare-earth nickelates such as LaNiO$_{3}$, which possess intimate interconnections between their electronic, spin, and lattice degrees of freedom. Their properties can be further enhanced by pairing them in hybrid heterostructures, which can lead to hidden phases and emergent phenomena. An important example is the LaNiO$_{3}$/LaTiO$_{3}$ superlattice, where an interlayer electron transfer has been observed from LaTiO$_{3}$ into LaNiO$_{3}$ and is predicted to result in a high-spin state. However, macroscopic emergence of magnetic order has so far not been observed. Here, by using muon spin rotation, x-ray absorption, and resonant inelastic x-ray scattering, we present direct evidence of an emergent antiferromagnetic order with high magnon energy and exchange interactions at the LaNiO$_{3}$/LaTiO$_{3}$ interface. As the magnetism is purely interfacial, a single LaNiO$_{3}$/LaTiO$_{3}$ interface can essentially behave as an atomically thin quasi-two-dimensional antiferromagnet, potentially allowing its technological utilisation in advanced spintronic devices. Furthermore, its strong quasi-two-dimensional magnetic correlations and orbitally-polarized planar ligand holes make its electronic and magnetic configurations resemble the precursor states of superconducting cuprates and nickelates, but with an S $\rightarrow$ 1 spin state instead.
Submitted 1 March, 2024;
originally announced March 2024.
-
Mechanical detection of nuclear decays
Authors:
Jiaxiang Wang,
T. W. Penny,
Juan Recoaro,
Benjamin Siegel,
Yu-Han Tseng,
David C. Moore
Abstract:
We report the detection of individual nuclear $α$ decays through the mechanical recoil of the entire micron-sized particle in which the decaying nuclei are embedded. Momentum conservation ensures that such measurements are sensitive to any particles emitted in the decay, including neutral particles that may otherwise evade detection with existing techniques. Detection of the minuscule recoil of an object more than $10^{12}$ times more massive than the emitted particles is made possible by recently developed techniques in levitated optomechanics, which enable high-precision optical control and measurement of the mechanical motion of optically trapped particles. Observation of a change in the net charge of the particle coincident with the recoil allows decays to be identified with background levels at the micro-Becquerel level. The techniques developed here may find use in fields ranging from nuclear forensics to dark matter and neutrino physics.
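A back-of-the-envelope sketch of the momentum-conservation argument: if the alpha particle escapes while the daughter nucleus stays embedded, the host particle, initially at rest, recoils with equal and opposite momentum. The 5 MeV kinetic energy (the typical order of magnitude for alpha emitters), the 1.5 µm radius, and the silica-like density of 2000 kg/m³ are assumed illustrative values, not parameters reported in the paper.

```python
import math

MEV_TO_J = 1.602176634e-13  # 1 MeV in joules (exact SI value)
C = 299_792_458.0           # speed of light, m/s
ALPHA_REST_MEV = 3727.379   # alpha-particle rest energy, MeV


def alpha_momentum(kinetic_mev):
    """Relativistic momentum (kg*m/s) from (pc)^2 = T^2 + 2*T*mc^2."""
    pc_mev = math.sqrt(kinetic_mev**2 + 2 * kinetic_mev * ALPHA_REST_MEV)
    return pc_mev * MEV_TO_J / C


def sphere_mass(radius_m, density_kg_m3):
    """Mass of a homogeneous sphere (assumed particle shape)."""
    return density_kg_m3 * 4.0 / 3.0 * math.pi * radius_m**3


def recoil_velocity(kinetic_mev, radius_m, density_kg_m3):
    """Recoil speed of the host particle, assuming the escaping alpha carries
    away all of the decay momentum and the particle starts at rest."""
    return alpha_momentum(kinetic_mev) / sphere_mass(radius_m, density_kg_m3)


print(f"recoil speed: {recoil_velocity(5.0, 1.5e-6, 2000.0):.2e} m/s")
```

With these assumed values the sphere is roughly 4 × 10^12 times heavier than the alpha particle and recoils at a few micrometers per second, which illustrates why the high-precision measurement of levitated optomechanics is needed.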
Submitted 8 July, 2024; v1 submitted 18 January, 2024;
originally announced February 2024.
-
REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR
Authors:
Liang-Hsuan Tseng,
En-Pei Hu,
Cheng-Han Chiang,
Yuan Tseng,
Hung-yi Lee,
Lin-shan Lee,
Shao-Hua Sun
Abstract:
Unsupervised automatic speech recognition (ASR) aims to learn the mapping between the speech signal and its corresponding textual transcription without the supervision of paired speech-text data. A word/phoneme in the speech signal is represented by a segment of speech signal with variable length and unknown boundary, and this segmental structure makes learning the mapping between speech and text challenging, especially without paired data. In this paper, we propose REBORN, Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR. REBORN alternates between (1) training a segmentation model that predicts the boundaries of the segmental structures in speech signals and (2) training the phoneme prediction model, whose input is the speech feature segmented by the segmentation model, to predict a phoneme transcription. Since supervised data for training the segmentation model is not available, we use reinforcement learning to train the segmentation model to favor segmentations that yield phoneme sequence predictions with a lower perplexity. We conduct extensive experiments and find that under the same setting, REBORN outperforms all prior unsupervised ASR models on LibriSpeech, TIMIT, and five non-English languages in Multilingual LibriSpeech. We comprehensively analyze why the boundaries learned by REBORN improve the unsupervised ASR performance.
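To illustrate the kind of signal the segmentation model is trained on, here is a minimal sketch: frame features are mean-pooled within predicted segments (the input to the phoneme predictor), and a segmentation is scored by the negative perplexity of the resulting phoneme sequence under a language model. The mean pooling, the toy bigram LM stored as a dictionary, and the function names are assumptions for illustration; REBORN's actual phoneme predictor, language model, and reinforcement-learning update are not reproduced here.

```python
import numpy as np


def pool_segments(features, boundaries):
    """Mean-pool frame features of shape (T, D) into segments.

    boundaries[t] == 1 marks the first frame of a new segment; frame 0 always
    starts one. Returns an array of shape (num_segments, D)."""
    b = np.asarray(boundaries).astype(bool)
    b[0] = True
    starts = np.flatnonzero(b)
    ends = np.append(starts[1:], len(features))
    return np.stack([features[s:e].mean(axis=0) for s, e in zip(starts, ends)])


def bigram_perplexity(phonemes, logprob, bos="<s>"):
    """Perplexity of a phoneme sequence under a toy bigram LM given as a dict
    mapping (previous, current) -> natural-log probability."""
    prev, total = bos, 0.0
    for p in phonemes:
        total += logprob[(prev, p)]
        prev = p
    return float(np.exp(-total / len(phonemes)))


def segmentation_reward(phonemes, logprob):
    """Higher reward for segmentations whose predicted phonemes the LM finds
    more plausible (i.e. lower perplexity)."""
    return -bigram_perplexity(phonemes, logprob)
```

In a full system, a reward like this would feed a policy-gradient update of the boundary predictor; which estimator REBORN uses is not stated in the abstract.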
Submitted 28 May, 2024; v1 submitted 6 February, 2024;
originally announced February 2024.
-
Resolving Regular Polysemy in Named Entities
Authors:
Shu-Kai Hsieh,
Yu-Hsiang Tseng,
Hsin-Yu Chou,
Ching-Wen Yang,
Yu-Yun Chang
Abstract:
Word sense disambiguation primarily addresses the lexical ambiguity of common words based on a predefined sense inventory. Conversely, proper names are usually considered to denote an ad-hoc real-world referent. Once the reference is decided, the ambiguity is purportedly resolved. However, proper names also exhibit ambiguities through appellativization, i.e., they act like common words and may denote different aspects of their referents. We propose to address the ambiguities of proper names in the light of regular polysemy, which we formalize as dot objects. This paper introduces a combined word sense disambiguation (WSD) model for disambiguating common words against Chinese Wordnet (CWN) and proper names as dot objects. The model leverages the flexibility of a gloss-based model architecture, which takes advantage of the glosses and example sentences of CWN. We show that the model achieves competitive results on both common and proper nouns, even on a relatively sparse sense dataset. Aside from being a performant WSD tool, the model further facilitates the future development of the lexical resource.
Submitted 18 January, 2024;
originally announced January 2024.
-
Scale-Aware Crowd Count Network with Annotation Error Correction
Authors:
Yi-Kuan Hsieh,
Jun-Wei Hsieh,
Yu-Chee Tseng,
Ming-Ching Chang,
Li Xin
Abstract:
Traditional crowd counting networks suffer from information loss when feature maps are downsized through pooling layers, leading to inaccuracies in counting crowds at a distance. Existing methods often assume correct annotations during training, disregarding the impact of noisy annotations, especially in crowded scenes. Furthermore, the use of a fixed Gaussian kernel fails to account for the varying pixel distribution with respect to the camera distance. To overcome these challenges, we propose a Scale-Aware Crowd Counting Network (SACC-Net) that introduces a "scale-aware" architecture capable of correcting noisy annotations. For the first time, we simultaneously model labeling errors (mean) and scale variations (variance) by spatially-varying Gaussian distributions to produce fine-grained heat maps for crowd counting. Furthermore, the proposed adaptive Gaussian kernel variance enables the model to learn dynamically with a low-rank approximation, leading to improved convergence efficiency with comparable accuracy. The performance of SACC-Net is extensively evaluated on four public datasets: UCF-QNRF, UCF_CC_50, NWPU, and ShanghaiTech A-B. Experimental results demonstrate that SACC-Net outperforms all state-of-the-art methods, validating its effectiveness in achieving superior crowd counting accuracy.
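Modeling label error as a mean shift and scale as a variance can be sketched as a density map built from per-point Gaussians. In this sketch the offsets and standard deviations are plain inputs (in SACC-Net they are learned, and the low-rank variance approximation is not modeled); each kernel is normalized to unit mass on the grid so that the map sums to the head count. `density_map` is a hypothetical name.

```python
import numpy as np

def density_map(points, offsets, sigmas, shape):
    """Sum of per-point 2-D Gaussians, each normalized to unit mass on the grid.

    points:  (N, 2) annotated (row, col) head positions
    offsets: (N, 2) per-point mean shifts, standing in for annotation error
    sigmas:  (N,) per-point standard deviations, standing in for scale
    """
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    heat = np.zeros(shape, dtype=float)
    for (r, c), (dr, dc), s in zip(points, offsets, sigmas):
        mu_r, mu_c = r + dr, c + dc
        g = np.exp(-((rows - mu_r) ** 2 + (cols - mu_c) ** 2) / (2.0 * s ** 2))
        heat += g / g.sum()  # unit mass per head, so heat.sum() equals the count
    return heat
```

Summing the returned map recovers the number of annotated heads, while a nonzero offset moves a peak away from its clicked location and a larger sigma spreads it for nearby, larger-looking heads.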
Submitted 27 December, 2023;
originally announced December 2023.
-
PointNeRF++: A multi-scale, point-based Neural Radiance Field
Authors:
Weiwei Sun,
Eduard Trulls,
Yang-Che Tseng,
Sneha Sambandam,
Gopal Sharma,
Andrea Tagliasacchi,
Kwang Moo Yi
Abstract:
Point clouds offer an attractive source of information to complement images in neural scene representations, especially when few images are available. Neural rendering methods based on point clouds do exist, but they do not perform well when the point cloud quality is low (e.g., sparse or incomplete), which is often the case with real-world data. We overcome these problems with a simple representation that aggregates point clouds at multiple scale levels with sparse voxel grids at different resolutions. To deal with point cloud sparsity, we average across multiple scale levels, but only among those that are valid, i.e., that have enough neighboring points in proximity to the ray of a pixel. To help model areas without points, we add a global voxel at the coarsest scale, thus unifying "classical" and point-based NeRF formulations. We validate our method on the NeRF Synthetic, ScanNet, and KITTI-360 datasets, outperforming the state of the art, with a significant gap compared to other NeRF-based methods, especially on more challenging scenes.
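The valid-level averaging can be sketched for a single sample along a ray, assuming a hypothetical threshold `min_points` on the number of nearby points and treating the global voxel as a coarsest level that is always valid, so a sample with no nearby points falls back to the global feature. Feature vectors are plain lists here; the paper's learned per-level features and its exact validity rule are not reproduced.

```python
def aggregate_scales(level_features, neighbor_counts, global_feature, min_points=2):
    """Average feature vectors over valid scale levels plus a global coarsest level.

    level_features:  one feature vector per scale level (fine to coarse)
    neighbor_counts: number of points near the ray sample at each level
    global_feature:  feature from the global voxel, assumed always valid here
    min_points:      hypothetical validity threshold
    """
    valid = [f for f, n in zip(level_features, neighbor_counts) if n >= min_points]
    valid.append(global_feature)
    # Average dimension by dimension over the valid levels only.
    return [sum(values) / len(valid) for values in zip(*valid)]
```

Invalid levels are excluded rather than zero-filled, so a sparse fine level cannot drag the average toward an uninformative value.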
Submitted 21 March, 2024; v1 submitted 4 December, 2023;
originally announced December 2023.
-
Magnon interactions in a moderately correlated Mott insulator
Authors:
Qisi Wang,
S. Mustafi,
E. Fogh,
N. Astrakhantsev,
Z. He,
I. Biało,
Ying Chan,
L. Martinelli,
M. Horio,
O. Ivashko,
N. E. Shaik,
K. von Arx,
Y. Sassa,
E. Paris,
M. H. Fischer,
Y. Tseng,
N. B. Christensen,
A. Galdi,
D. G. Schlom,
K. M. Shen,
T. Schmitt,
H. M. Rønnow,
J. Chang
Abstract:
Quantum fluctuations in low-dimensional systems and near quantum phase transitions have significant influences on material properties. Yet, it is difficult to experimentally gauge the strength and importance of quantum fluctuations. Here we provide a resonant inelastic x-ray scattering study of magnon excitations in Mott insulating cuprates. From a thin film of SrCuO$_2$, single- and bi-magnon dispersions are derived. Using an effective Heisenberg Hamiltonian generated from the Hubbard model, we show that the single-magnon dispersion is only described satisfactorily when significant quantum corrections stemming from magnon-magnon interactions are included. Comparative results on La$_2$CuO$_4$ indicate that quantum fluctuations are much stronger in SrCuO$_2$, suggesting closer proximity to a magnetic quantum critical point. Monte Carlo calculations reveal that other magnetic orders may compete with the antiferromagnetic Néel order as the ground state. Our results indicate that SrCuO$_2$, due to strong quantum fluctuations, is a unique starting point for the exploration of novel magnetic ground states.
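A minimal reference point, assuming only a nearest-neighbour exchange $J$ on a square lattice with unit lattice constant: linear spin-wave theory for the spin-1/2 Heisenberg antiferromagnet gives $\omega(\mathbf{k}) = 2 Z_c J \sqrt{1-\gamma_{\mathbf{k}}^2}$ with $\gamma_{\mathbf{k}} = (\cos k_x + \cos k_y)/2$, where $Z_c \approx 1.18$ is a uniform quantum renormalization factor. This is not the paper's effective Hamiltonian, which is derived from the Hubbard model and may contain further-neighbour and ring-exchange terms; it is the baseline that such terms and interaction corrections modify. It gives identical energies at $(\pi, 0)$ and $(\pi/2, \pi/2)$, so any dispersion measured along the zone boundary must come from ingredients beyond this picture.

```python
import math

def lswt_dispersion(kx, ky, J, Zc=1.18):
    """Renormalized linear spin-wave energy of a spin-1/2 square-lattice antiferromagnet.

    kx, ky are in units of the inverse lattice constant; the result has the units of J.
    """
    gamma = 0.5 * (math.cos(kx) + math.cos(ky))
    # max() guards against tiny negative values from floating-point rounding
    return 2.0 * Zc * J * math.sqrt(max(0.0, 1.0 - gamma * gamma))

for label, (kx, ky) in {"(0,0)": (0.0, 0.0), "(pi,0)": (math.pi, 0.0),
                        "(pi/2,pi/2)": (math.pi / 2, math.pi / 2)}.items():
    print(label, lswt_dispersion(kx, ky, J=1.0))
```

The Goldstone mode vanishes at $(0,0)$, and the two zone-boundary points come out degenerate at $2 Z_c J$.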
Submitted 26 June, 2024; v1 submitted 28 November, 2023;
originally announced November 2023.
-
Single- and two-particle observables in the Emery model: a dynamical mean-field perspective
Authors:
Yi-Ting Tseng,
M. O. Malcolms,
Henri Menke,
Marcel Klett,
Thomas Schäfer,
P. Hansmann
Abstract:
We compare the dynamical mean-field descriptions of the single-band Hubbard model and the three-band Emery model at the one- and two-particle level for parameters relevant to high-Tc superconductors. We show that even within dynamical mean-field theory, accounting solely for temporal fluctuations, the intrinsic multi-orbital nature of the Emery model introduces effective non-local correlations. These lead to a non-Curie-like temperature dependence of the magnetic susceptibility, also seen in nuclear magnetic resonance experiments in the pseudogap regime by M. Avramovska et al. [Journal of Superconductivity and Novel Magnetism 33, 2621 (2020)]. We demonstrate the agreement of our results with these experiments for a large range of dopings and trace the effective non-local correlations back to an emerging oxygen-copper singlet by analyzing a minimal finite-size cluster model. Despite this correct description of the hallmark of the pseudogap at the two-particle level, i.e., the drop in the Knight shift of nuclear magnetic resonance, dynamical mean-field theory fails to properly describe the spectral properties of the pseudogap.
Submitted 15 November, 2023;
originally announced November 2023.
-
Federated Learning for Sparse Principal Component Analysis
Authors:
Sin Cheng Ciou,
Pin Jui Chen,
Elvin Y. Tseng,
Yuh-Jye Lee
Abstract:
In the rapidly evolving realm of machine learning, algorithm effectiveness often faces limitations due to data quality and availability. Traditional approaches grapple with data sharing due to legal and privacy concerns. The federated learning framework addresses this challenge. Federated learning is a decentralized approach where model training occurs on client sides, preserving privacy by keeping data localized. Instead of sending raw data to a central server, only model updates are exchanged, enhancing data security. We apply this framework to Sparse Principal Component Analysis (SPCA) in this work. SPCA aims to attain sparse component loadings while maximizing data variance for improved interpretability. Besides the L1-norm regularization term in conventional SPCA, we add a smoothing function to facilitate gradient-based optimization methods. Moreover, in order to improve computational efficiency, we introduce a least squares approximation to the original SPCA. This enables analytic solutions in the optimization process, leading to substantial computational improvements. Within the federated framework, we formulate SPCA as a consensus optimization problem, which can be solved using the Alternating Direction Method of Multipliers (ADMM). Our extensive experiments involve both IID and non-IID random features across various data owners. Results on synthetic and public datasets affirm the efficacy of our federated SPCA approach.
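Consensus ADMM can be sketched on a simpler stand-in objective: each data owner $i$ holds $(A_i, b_i)$ and a local copy $x_i$, and all copies must agree with a global $z$ that carries an L1 penalty, $\min \sum_i \tfrac12\lVert A_i x_i - b_i\rVert_2^2 + \lambda\lVert z\rVert_1$ subject to $x_i = z$. This is the textbook global-variable consensus form, not the paper's SPCA objective or its smoothing function; `consensus_admm` and its parameters are hypothetical. Only the local solves touch raw data, and the server only sees the messages $x_i + u_i$.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def consensus_admm(As, bs, lam=0.0, rho=10.0, iters=500):
    """Global-variable consensus ADMM over a list of client datasets."""
    n, d = len(As), As[0].shape[1]
    us = [np.zeros(d) for _ in range(n)]
    z = np.zeros(d)
    # Each client precomputes its local system matrix once; raw data never leaves it.
    lhs = [A.T @ A + rho * np.eye(d) for A in As]
    rhs = [A.T @ b for A, b in zip(As, bs)]
    for _ in range(iters):
        # Local x-updates (run on the clients): a closed-form ridge-like solve.
        xs = [np.linalg.solve(L, r + rho * (z - u)) for L, r, u in zip(lhs, rhs, us)]
        # Server z-update: average the clients' messages, then apply the L1 prox.
        z = soft_threshold(np.mean([x + u for x, u in zip(xs, us)], axis=0), lam / (n * rho))
        # Local dual updates.
        us = [u + x - z for u, x in zip(us, xs)]
    return z
```

With `lam=0` the consensus variable should match an ordinary least-squares fit on the pooled data, which is a convenient sanity check; a positive `lam` drives small coordinates of `z` to exactly zero through the soft-threshold step, mirroring how an L1 term encourages sparse loadings.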
Submitted 14 November, 2023;
originally announced November 2023.
-
ActiveAI: Introducing AI Literacy for Middle School Learners with Goal-based Scenario Learning
Authors:
Ying Jui Tseng,
Gautam Yadav
Abstract:
The ActiveAI project addresses key challenges in AI education for students in grades 7-9 by providing an engaging AI literacy learning experience based on the AI4K12 knowledge framework. Utilizing learning science mechanisms such as goal-based scenarios, immediate feedback, project-based learning, and intelligent agents, the app incorporates a variety of learner inputs like sliders, steppers, and collectors to enhance understanding. In these courses, students work on real-world scenarios like analyzing sentiment in social media comments. This helps them learn to effectively engage with AI systems and develop their ability to evaluate AI-generated output. The Learning Engineering Process (LEP) guided the project's creation and data instrumentation, focusing on design and impact. The project is currently in the implementation stage, leveraging the intelligent tutor design principles for app development. The extended abstract presents the foundational design and development, with further evaluation and research to be conducted in the future.
Submitted 21 August, 2023;
originally announced September 2023.
-
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
Authors:
Yuan Tseng,
Layne Berry,
Yi-Ting Chen,
I-Hsiang Chiu,
Hsuan-Hao Lin,
Max Liu,
Puyuan Peng,
Yi-Jen Shih,
Hung-Yu Wang,
Haibin Wu,
Po-Yao Huang,
Chun-Mao Lai,
Shang-Wen Li,
David Harwath,
Yu Tsao,
Shinji Watanabe,
Abdelrahman Mohamed,
Chi-Luen Feng,
Hung-yi Lee
Abstract:
Audio-visual representation learning aims to develop systems with human-like perception by utilizing correlation between auditory and visual information. However, current models often focus on a limited set of tasks, and generalization abilities of learned representations are unclear. To this end, we propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations on 7 datasets covering 5 audio-visual tasks in speech and audio processing. We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks, emphasizing the need for future study on improving universal model performance. In addition, we show that representations may be improved with intermediate-task fine-tuning and audio event classification with AudioSet serves as a strong intermediate task. We release our benchmark with evaluation code and a model submission platform to encourage further research in audio-visual learning.
Submitted 19 March, 2024; v1 submitted 19 September, 2023;
originally announced September 2023.
-
Tracking Players in a Badminton Court by Two Cameras
Authors:
Young-Ching Chou,
Shen-Ru Zhang,
Bo-Wei Chen,
Hong-Qi Chen,
Cheng-Kuan Lin,
Yu-Chee Tseng
Abstract:
This study proposes a simple method for multi-object tracking (MOT) of players on a badminton court. We leverage two off-the-shelf cameras, one above the court and the other at its side. The top camera tracks the players' trajectories, while the side camera analyzes the players' pixel features. By computing the correlations between adjacent frames and combining the information from the two cameras, MOT of badminton players is obtained. This two-camera approach addresses the challenge of player occlusion and overlapping on a badminton court, providing player trajectory tracking and multi-angle analysis. The presented system offers insights into the positions and movements of badminton players, thus serving as a coaching or self-training tool for badminton players to improve their gaming strategies.
Submitted 9 August, 2023;
originally announced August 2023.
-
LLMs as Workers in Human-Computational Algorithms? Replicating Crowdsourcing Pipelines with LLMs
Authors:
Tongshuang Wu,
Haiyi Zhu,
Maya Albayrak,
Alexis Axon,
Amanda Bertsch,
Wenxing Deng,
Ziqi Ding,
Bill Guo,
Sireesh Gururaja,
Tzu-Sheng Kuo,
Jenny T. Liang,
Ryan Liu,
Ihita Mandal,
Jeremiah Milbauer,
Xiaolin Ni,
Namrata Padmanabhan,
Subhashini Ramkumar,
Alexis Sudjianto,
Jordan Taylor,
Ying-Jui Tseng,
Patricia Vaidos,
Zhijin Wu,
Wei Wu,
Chenyang Yang
Abstract:
LLMs have shown promise in replicating human-like behavior in crowdsourcing tasks that were previously thought to be exclusive to human abilities. However, current efforts focus mainly on simple atomic tasks. We explore whether LLMs can replicate more complex crowdsourcing pipelines. We find that modern LLMs can simulate some of crowdworkers' abilities in these "human computation algorithms," but the level of success is variable and influenced by requesters' understanding of LLM capabilities, the specific skills required for sub-tasks, and the optimal interaction modality for performing these sub-tasks. We reflect on the different sensitivities of humans and LLMs to instructions, stress the importance of enabling human-facing safeguards for LLMs, and discuss the potential of training humans and LLMs with complementary skill sets. Crucially, we show that replicating crowdsourcing pipelines offers a valuable platform to investigate (1) the relative strengths of LLMs on different tasks (by cross-comparing their performances on sub-tasks) and (2) LLMs' potential in complex tasks, where they can complete part of the tasks while leaving others to humans.
Submitted 19 July, 2023; v1 submitted 19 July, 2023;
originally announced July 2023.
-
A discontinuity and cusp capturing PINN for Stokes interface problems with discontinuous viscosity and singular forces
Authors:
Yu-Hau Tseng,
Ming-Chih Lai
Abstract:
In this paper, we present a discontinuity and cusp capturing physics-informed neural network (PINN) to solve Stokes equations with a piecewise-constant viscosity and singular force along an interface. We first reformulate the governing equations in each fluid domain separately and replace the singular force effect with the traction balance equation between the solutions on the two sides of the interface. Since the pressure is discontinuous and the velocity has discontinuous derivatives across the interface, we use a network consisting of two fully-connected sub-networks that approximate the pressure and velocity, respectively. The two sub-networks share the same primary coordinate input arguments but take different augmented feature inputs. These two augmented inputs provide the interface information, for which we assume that a level set function is given and its zero level set indicates the position of the interface. The pressure sub-network uses an indicator function as an augmented input to capture the function discontinuity, while the velocity sub-network uses a cusp-enforced level set function to capture the derivative discontinuities via the traction balance equation. We perform a series of numerical experiments to solve two- and three-dimensional Stokes interface problems and compare the accuracy with the augmented immersed interface methods in the literature. Our results indicate that even a shallow network with a moderate number of neurons and sufficient training data points can achieve prediction accuracy comparable to that of immersed interface methods.
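How the augmented inputs let smooth networks represent a jump and a kink can be sketched with hypothetical choices: the interface is the circle $\phi(x, y) = x^2 + y^2 - 0.25 = 0$, the indicator is 1 where $\phi > 0$, and the cusp-enforced feature is taken to be $|\phi|$. The two toy functions `pressure` and `velocity` stand in for trained sub-networks; each is smooth in its three inputs (coordinates plus feature), yet the pressure jumps and the velocity's normal derivative changes sign across the interface.

```python
def level_set(x, y, r0=0.5):
    """Hypothetical interface: the circle x^2 + y^2 = r0^2 (phi < 0 inside)."""
    return x * x + y * y - r0 * r0

def indicator(x, y):
    """Augmented input for the pressure sub-network: 1 outside, 0 inside."""
    return 1.0 if level_set(x, y) > 0 else 0.0

def cusp_feature(x, y):
    """Augmented input for the velocity sub-network; |phi| is an assumed cusp-enforced choice."""
    return abs(level_set(x, y))

# Toy stand-ins for trained sub-networks, each smooth in (x, y, feature).
def pressure(x, y):
    return x + y + 2.0 * indicator(x, y)

def velocity(x, y):
    return x * y + 3.0 * cusp_feature(x, y)
```

Crossing the interface at $(0.5, 0)$, the pressure jumps by 2 while the velocity stays continuous, and its one-sided $x$-derivatives are about $+3$ and $-3$.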
Submitted 10 September, 2023; v1 submitted 18 July, 2023;
originally announced July 2023.
-
Contextualizing Problems to Student Interests at Scale in Intelligent Tutoring System Using Large Language Models
Authors:
Gautam Yadav,
Ying-Jui Tseng,
Xiaolin Ni
Abstract:
Contextualizing problems to align with student interests can significantly improve learning outcomes. However, this task often presents scalability challenges due to resource and time constraints. Recent advancements in Large Language Models (LLMs) like GPT-4 offer potential solutions to these issues. This study explores GPT-4's ability to contextualize problems within CTAT, an intelligent tutoring system, aiming to increase student engagement and enhance learning outcomes. Through iterative prompt engineering, we achieved meaningful contextualization that preserved the difficulty and original intent of the problem, without altering values or overcomplicating the questions. While our research highlights the potential of LLMs in educational settings, we acknowledge current limitations, particularly with geometry problems, and emphasize the need for ongoing evaluation and research. Future work includes systematic studies to measure the impact of this tool on students' learning outcomes and enhancements to handle a broader range of problems.
Submitted 31 May, 2023;
originally announced June 2023.
-
Vec2Gloss: definition modeling leveraging contextualized vectors with Wordnet gloss
Authors:
Yu-Hsiang Tseng,
Mao-Chang Ku,
Wei-Ling Chen,
Yu-Lin Chang,
Shu-Kai Hsieh
Abstract:
Contextualized embeddings have proven to be powerful tools in multiple NLP tasks. Nonetheless, challenges regarding their interpretability and capability to represent lexical semantics still remain. In this paper, we propose that the task of definition modeling, which aims to generate the human-readable definition of a word, provides a route to evaluate or understand the high-dimensional semantic vectors. We propose a Vec2Gloss model, which produces the gloss from the target word's contextualized embeddings. The generated glosses of this study are made possible by the systematic gloss patterns provided by Chinese Wordnet. We devise two dependency indices to measure the semantic and contextual dependency, which are used to analyze the generated texts at the gloss and token levels. Our results indicate that the proposed Vec2Gloss model opens a new perspective on the lexical-semantic applications of contextualized embeddings.
Submitted 28 May, 2023;
originally announced May 2023.
-
Lexical Retrieval Hypothesis in Multimodal Context
Authors:
Po-Ya Angela Wang,
Pin-Er Chen,
Hsin-Yu Chou,
Yu-Hsiang Tseng,
Shu-Kai Hsieh
Abstract:
Multimodal corpora have become an essential language resource for language science and grounded natural language processing (NLP) systems due to the growing need to understand and interpret human communication across various channels. In this paper, we first present our efforts in building the first Multimodal Corpus for Languages in Taiwan (MultiMoco). Based on the corpus, we conduct a case study investigating the Lexical Retrieval Hypothesis (LRH), specifically examining whether the hand gestures co-occurring with speech constants facilitate lexical retrieval or serve other discourse functions. With detailed annotations on eight parliamentary interpellations in Taiwan Mandarin, we explore the co-occurrence between speech constants and non-verbal features (i.e., head movement, face movement, hand gesture, and function of hand gesture). Our findings suggest that while hand gestures do serve as facilitators for lexical retrieval in some cases, they also serve the purpose of information emphasis. This study highlights the potential of the MultiMoco Corpus to provide an important resource for in-depth analysis and further research in multimodal communication studies.
Submitted 28 May, 2023;
originally announced May 2023.
-
Exploring Affordance and Situated Meaning in Image Captions: A Multimodal Analysis
Authors:
Pin-Er Chen,
Po-Ya Angela Wang,
Hsin-Yu Chou,
Yu-Hsiang Tseng,
Shu-Kai Hsieh
Abstract:
This paper explores the grounding issue regarding multimodal semantic representation from a computational cognitive-linguistic view. We annotate images from the Flickr30k dataset with five perceptual properties: Affordance, Perceptual Salience, Object Number, Gaze Cueing, and Ecological Niche Association (ENA), and examine their association with textual elements in the image captions. Our findings reveal that images with Gibsonian affordance show a higher frequency of captions containing 'holding-verbs' and 'container-nouns' compared to images displaying telic affordance. Perceptual Salience, Object Number, and ENA are also associated with the choice of linguistic expressions. Our study demonstrates that comprehensive understanding of objects or events requires cognitive attention, semantic nuances in language, and integration across multiple modalities. We highlight the vital importance of situated meaning and affordance grounding in natural language understanding, with the potential to advance human-like interpretation in various scenarios.
Submitted 24 October, 2023; v1 submitted 23 May, 2023;
originally announced May 2023.
-
GPTutor: a ChatGPT-powered programming tool for code explanation
Authors:
Eason Chen,
Ray Huang,
Han-Shin Chen,
Yuen-Hsien Tseng,
Liang-Yi Li
Abstract:
Learning new programming skills requires tailored guidance. With the emergence of advanced Natural Language Generation models like the ChatGPT API, there is now a possibility of creating a convenient and personalized tutoring system with AI for computer science education. This paper presents GPTutor, a ChatGPT-powered programming tool, which is a Visual Studio Code extension using the ChatGPT API to provide programming code explanations. By integrating the Visual Studio Code API, GPTutor can comprehensively analyze the provided code by referencing the relevant source code. As a result, GPTutor can use designed prompts to explain the selected code with a pop-up message. GPTutor is now published on the Visual Studio Code Extension Marketplace, and its source code is openly accessible on GitHub. Preliminary evaluation indicates that GPTutor delivers the most concise and accurate explanations compared to vanilla ChatGPT and GitHub Copilot. Moreover, the feedback from students and teachers indicated that GPTutor is user-friendly and can explain the given code satisfactorily. Finally, we discuss possible future research directions for GPTutor. These include enhancing its performance and personalization via further prompt programming, as well as evaluating the effectiveness of GPTutor with real users.
Submitted 15 June, 2023; v1 submitted 2 May, 2023;
originally announced May 2023.
-
Privacy-Preserving Video Conferencing via Thermal-Generative Images
Authors:
Sheng-Yang Chiu,
Yu-Ting Huang,
Chieh-Ting Lin,
Yu-Chee Tseng,
Jen-Jee Chen,
Meng-Hsuan Tu,
Bo-Chen Tung,
YuJou Nieh
Abstract:
Due to the COVID-19 pandemic, video conferencing has evolved into a new paradigm of communication and teamwork. However, private and personal information can be easily leaked through cameras during video conferencing. This includes leakage of a person's appearance as well as the contents in the background. This paper proposes a novel way of using online low-resolution thermal images as conditions to guide the synthesis of RGB images, bringing a promising solution for real-time video conferencing when privacy leakage is a concern. SPADE-SR (Spatially-Adaptive De-normalization with Self Resampling), a variant of SPADE, is adopted to incorporate the spatial property of a thermal heatmap and the non-thermal property of a normal, privacy-free pre-recorded RGB image provided in the form of a latent code. We create a PAIR-LRT-Human (LRT = Low-Resolution Thermal) dataset to validate our claims. The result enables a convenient way of video conferencing where users no longer need to groom themselves and tidy up backgrounds for a short meeting. Additionally, it allows a user to switch to a different appearance and background during a conference.
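SPADE-style modulation can be sketched as follows, with simplifications that are assumptions rather than the SPADE-SR design: features are normalized with per-channel statistics of a single sample (SPADE itself uses batch statistics), the low-resolution thermal map is upsampled by nearest-neighbour repetition, and the per-pixel scale `gamma` and shift `beta` come from per-channel 1x1 linear maps of the thermal value instead of a small convolutional network. The self-resampling step and the latent code of the pre-recorded RGB image are not modeled; `spade_modulate` and its weight arguments are hypothetical.

```python
import numpy as np

def spade_modulate(features, thermal, w_gamma, b_gamma, w_beta, b_beta, eps=1e-5):
    """SPADE-style modulation of a (C, H, W) feature map by a low-res (h, w) thermal map.

    w_gamma, b_gamma, w_beta, b_beta: per-channel (C,) weights of hypothetical 1x1 maps.
    """
    _, H, W = features.shape
    h, w = thermal.shape
    # Nearest-neighbour upsampling; assumes H and W are integer multiples of h and w.
    t = np.repeat(np.repeat(thermal, H // h, axis=0), W // w, axis=1)
    # Parameter-free normalization with per-channel statistics of this one sample.
    mean = features.mean(axis=(1, 2), keepdims=True)
    var = features.var(axis=(1, 2), keepdims=True)
    normalized = (features - mean) / np.sqrt(var + eps)
    # Spatially varying scale and shift computed from the thermal value at each pixel.
    gamma = w_gamma[:, None, None] * t[None] + b_gamma[:, None, None]
    beta = w_beta[:, None, None] * t[None] + b_beta[:, None, None]
    return gamma * normalized + beta
```

The key design point this illustrates is that the conditioning map enters after normalization, so spatial layout from the thermal image is not washed out by the normalization statistics.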
Submitted 28 March, 2023; v1 submitted 15 March, 2023;
originally announced March 2023.
-
Cascading and Direct Approaches to Unsupervised Constituency Parsing on Spoken Sentences
Authors:
Yuan Tseng,
Cheng-I Lai,
Hung-yi Lee
Abstract:
Past work on unsupervised parsing is constrained to written form. In this paper, we present the first study on unsupervised spoken constituency parsing given unlabeled spoken sentences and unpaired textual data. The goal is to determine the spoken sentences' hierarchical syntactic structure in the form of constituency parse trees, such that each node is a span of audio that corresponds to a constituent. We compare two approaches: (1) cascading an unsupervised automatic speech recognition (ASR) model and an unsupervised parser to obtain parse trees on ASR transcripts, and (2) directly training an unsupervised parser on continuous word-level speech representations. This is done by first splitting utterances into sequences of word-level segments, and aggregating self-supervised speech representations within segments to obtain segment embeddings. We find that separately training a parser on the unpaired text and directly applying it on ASR transcripts for inference produces better results for unsupervised parsing. Additionally, our results suggest that accurate segmentation alone may be sufficient to parse spoken sentences accurately. Finally, we show the direct approach may learn head-directionality correctly for both head-initial and head-final languages without any explicit inductive bias.
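The segment-embedding step of the direct approach can be sketched with mean pooling as the aggregation, which is an assumption since the abstract only says representations are aggregated within segments. Frame features are plain lists, and `boundaries` lists the strictly increasing frame indices where segments start, beginning at 0; both names are hypothetical.

```python
def segment_embeddings(frames, boundaries):
    """Mean-pool frame features within word-level segments.

    frames:     list of T feature vectors (lists of equal length)
    boundaries: strictly increasing segment start indices, beginning at 0
    """
    ends = list(boundaries[1:]) + [len(frames)]
    pooled = []
    for start, end in zip(boundaries, ends):
        seg = frames[start:end]
        # Average each feature dimension over the frames of this segment.
        pooled.append([sum(col) / len(seg) for col in zip(*seg)])
    return pooled
```

Each resulting vector then plays the role of a word token for the unsupervised parser, so the parser never sees frame-level detail.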
Submitted 9 May, 2023; v1 submitted 15 March, 2023;
originally announced March 2023.
-
Self-supervised learning-based general laboratory progress pretrained model for cardiovascular event detection
Authors:
Li-Chin Chen,
Kuo-Hsuan Hung,
Yi-Ju Tseng,
Hsin-Yao Wang,
Tse-Min Lu,
Wei-Chieh Huang,
Yu Tsao
Abstract:
The inherent nature of patient data poses several challenges. Prevalent cases amass substantial longitudinal data owing to their patient volume and consistent follow-ups; however, longitudinal laboratory data are notorious for their irregularity, temporality, missingness, and sparsity. In contrast, recruitment for rare or specific cases is often constrained by their limited patient numbers and episodic observations. This study employed self-supervised learning (SSL) to pretrain a generalized laboratory progress (GLP) model that captures the overall progression of six common laboratory markers in prevalent cardiovascular cases, with the intention of transferring this knowledge to aid in the detection of specific cardiovascular events. GLP implemented a two-stage training approach to leverage the information embedded within interpolated data and amplify the performance of SSL. After pretraining, GLP is transferred to TVR detection. The proposed two-stage training improved on the performance of pure SSL, and the transferability of GLP proved distinctive. After GLP processing, classification improved markedly, with average accuracy rising from 0.63 to 0.90. All evaluated metrics were substantially better (p < 0.01) than before GLP processing. Our study engages in translational engineering by transferring the progression of cardiovascular laboratory parameters from one patient group to another, transcending the limitations of data availability. This transfer of disease progression optimizes examination and treatment strategies and improves patient prognosis while relying on commonly available laboratory parameters. Extending this approach to other diseases holds great promise.
Submitted 7 September, 2023; v1 submitted 13 March, 2023;
originally announced March 2023.
-
Detection of Berezinskii--Kosterlitz--Thouless transitions for the two-dimensional $q$-state clock models with neural networks
Authors:
Yaun-Heng Tseng,
Fu-Jiun Jiang
Abstract:
Using supervised neural networks (NN), we study the phase transitions of the two-dimensional (2D) 6- and 8-state clock models on the square lattice. The employed NN has only one input layer, one hidden layer of 2 neurons, and one output layer. In addition, the NN is trained without any prior information about the considered models. Interestingly, despite its simple architecture, the supervised NN not only detects both Berezinskii--Kosterlitz--Thouless (BKT) transitions but also determines the transition temperatures with reasonably high accuracy. It is remarkable that an NN with such a simple structure, trained without any input from the studied models, can be used to study topological phase transitions. The outcomes shown here, together with those previously demonstrated in the literature, suggest that it is feasible to construct a universal NN applicable to investigating the phase transitions of many systems.
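The sketch below shows the forward pass of a network with the stated shape: an input layer, one hidden layer of 2 neurons, and an output layer. The abstract does not specify several details, so these are assumptions for illustration only: the sigmoid hidden activation, the softmax output, and all names (`forward`, `w_hidden`, `w_out`, ...). Training is omitted.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs: Sequence[float]) -> List[float]:
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def forward(
    x: Sequence[float],
    w_hidden: Sequence[Sequence[float]],  # 2 rows, one per hidden neuron
    b_hidden: Sequence[float],
    w_out: Sequence[Sequence[float]],     # one row per output neuron (count assumed)
    b_out: Sequence[float],
) -> List[float]:
    """Return output-class probabilities for one input configuration x."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(w_out, b_out)]
    return softmax(logits)


# Usage: a 4-site toy input with all-zero weights gives a uniform output.
x = [1.0, -1.0, 1.0, -1.0]
zeros2x4 = [[0.0] * 4, [0.0] * 4]
print(forward(x, zeros2x4, [0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]))
# → [0.5, 0.5]
```

With only two hidden units, the network's entire representation of a configuration is a 2-vector. This is what makes such a model cheap to inspect.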
Submitted 1 March, 2023;
originally announced March 2023.
-
Spin waves in a ferromagnetic topological metal
Authors:
Wenliang Zhang,
Teguh Citra Asmara,
Yi Tseng,
Junbo Li,
Yimin Xiong,
Vladimir N. Strocov,
Y. Soh,
Thorsten Schmitt,
Gabriel Aeppli
Abstract:
In most metals, charges and spins can hop rapidly between atoms, yielding strong dispersion of their energy versus momentum. There are, however, special arrangements of atoms, such as twisted graphene bilayers or lattices that resemble woven bamboo "kagome" mats, for which particle motion with strong hopping between neighbours becomes nearly or even completely dispersionless. Such flat bands are interesting because interactions between the heavy particles inhabiting them become much more important than in strongly dispersive bands, resulting in novel quantum solid and liquid states, particularly when topology enters on account of significant spin-orbit coupling for the underlying electrons. Nonetheless, spectroscopic evidence for flat bands engendered by lattice geometry rather than weak hopping is rare, particularly for metallic single crystals. Here we report the discovery, using circularly polarized X-rays in resonant absorption and resonant inelastic X-ray scattering (RIXS) for the unambiguous isolation of magnetic signals, of a flat spin-wave band and a large orbital moment in the metallic kagome ferromagnet Fe$_3$Sn$_2$, which has a topologically non-trivial electronic band structure controllable by modest external magnetic fields. The flat-mode energy is consistent with the high Curie temperature ($T_C \approx 640$ K) as well as the strong acoustic-mode dispersion, implying, together with the substantial spin-orbit coupling indicated by the large orbital moment, that the mode is topological. The measured properties of the spin waves are highly unconventional and include very severe damping as well as a flat-mode amplitude that is maximized in the long-wavelength limit, where it would ordinarily be expected to vanish. Our results open the topic of interactions between topological bosons (spin waves) and fermions (electrons), with the very specific target of explaining boson lifetimes and amplitudes.
Submitted 2 February, 2023;
originally announced February 2023.
-
Mechanosensitive bonds induced complex cell motility patterns
Authors:
Jen-Yu Lo,
Yuan-Heng Tseng,
Hsuan-Yi Chen
Abstract:
This theoretical study considers the one-dimensional crawling movement of a cell. Our active gel model shows that, for a cell with weakly mechanosensitive adhesion complexes, the cell starts to move at a constant velocity as myosin contractility increases. As the mechanosensitivity of the adhesion complexes increases, the cell can exhibit stick-slip motion. Finally, a cell with highly mechanosensitive adhesion complexes exhibits periodic back-and-forth migration. A simplified model, which assumes that the cell crawling dynamics are controlled by the evolution of the myosin density dipole and the asymmetry of the adhesion complex distribution, qualitatively captures the motility behaviors of crawling cells. This suggests that the complex cell crawling behaviors observed in experiments could result from the interplay between the distribution of contractile force and mechanosensitive bonds.
Submitted 4 January, 2023;
originally announced January 2023.
-
Machine learning phases of an Abelian gauge theory
Authors:
Jhao-Hong Peng,
Yuan-Heng Tseng,
Fu-Jiun Jiang
Abstract:
The phase transition of the two-dimensional $U(1)$ quantum link model on the triangular lattice is investigated by employing a supervised neural network (NN) consisting of only one input layer, one hidden layer of two neurons, and one output layer. No information about the studied model is used in training the NN. Instead, two artificially constructed configurations serve as the training set. Interestingly, the obtained NN not only estimates the critical point accurately but also uncovers the physics correctly. The results presented here imply that a supervised NN with a very simple architecture, trained without any input from the investigated model, can identify the targeted phase structure with high precision.
Submitted 30 December, 2022;
originally announced December 2022.
-
A novel test of gravity via black hole eikonal correspondence
Authors:
Che-Yu Chen,
Yu-Jui Chen,
Meng-Yuan Ho,
Yung-Hsuan Tseng
Abstract:
When applied to black hole spacetimes, geometric-optics approximations imply a mapping between the quasinormal mode (QNM) spectrum of black holes in the eikonal limit and black hole images. In particular, the real and imaginary parts of eikonal QNM frequencies are associated with the apparent size and the detailed structure of the ring images, respectively. This correspondence could be violated beyond general relativity. We propose a novel method to test the eikonal correspondence by comparing two sets of observables from a nonrotating black hole, one extracted from QNM spectra and the other from the lensed photon rings on the image plane. Specifically, the photon ring observables robustly capture information about the black hole spacetime itself, regardless of the surrounding emission model. Therefore, the proposed test of the eikonal correspondence is valid in a broad range of scenarios.
Submitted 5 September, 2023; v1 submitted 20 December, 2022;
originally announced December 2022.
-
Scale-Aware Crowd Counting Using a Joint Likelihood Density Map and Synthetic Fusion Pyramid Network
Authors:
Yi-Kuan Hsieh,
Jun-Wei Hsieh,
Yu-Chee Tseng,
Ming-Ching Chang,
Bor-Shiun Wang
Abstract:
We develop a Synthetic Fusion Pyramid Network (SPF-Net) with a scale-aware loss function design for accurate crowd counting. Existing crowd-counting methods assume that the training annotation points are accurate and thus ignore the fact that noisy annotations can lead to large model-learning bias and counting error, especially when counting highly dense crowds that appear far away. To the best of our knowledge, this work is the first to properly handle such noise at multiple scales in an end-to-end loss design, thereby pushing the state of the art in crowd counting. We model the noise of crowd annotation points as Gaussian and derive the crowd probability density map from the input image. We then approximate the joint distribution of crowd density maps with the full covariance of multiple scales and derive a low-rank approximation for tractability and efficient implementation. The derived scale-aware loss function is used to train the SPF-Net. We show that it outperforms various loss functions on four public datasets: UCF-QNRF, UCF_CC_50, NWPU and ShanghaiTech A-B. The proposed SPF-Net can accurately predict the locations of people in the crowd despite being trained on noisy annotations.
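For context, the sketch below shows the conventional way a density map is built from point annotations. Each annotated point contributes a Gaussian normalized to unit mass, so the map sums to the head count. This is the standard baseline construction, not the paper's joint-likelihood, multi-scale loss. The fixed `sigma` and the name `density_map` are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def density_map(
    points: Sequence[Tuple[float, float]],  # (row, col) annotation coordinates
    height: int,
    width: int,
    sigma: float = 1.5,  # assumed fixed kernel width (no scale awareness)
) -> List[List[float]]:
    dens = [[0.0] * width for _ in range(height)]
    for py, px in points:
        kernel = [[math.exp(-((y - py) ** 2 + (x - px) ** 2) / (2.0 * sigma ** 2))
                   for x in range(width)]
                  for y in range(height)]
        # Normalize over the grid so each person contributes exactly 1,
        # even when the Gaussian is clipped at the image border.
        total = sum(sum(row) for row in kernel)
        for y in range(height):
            for x in range(width):
                dens[y][x] += kernel[y][x] / total
    return dens


d = density_map([(2.0, 3.0), (6.5, 1.0)], height=8, width=8)
print(round(sum(map(sum, d)), 6))  # → 2.0
```

A fixed `sigma` treats near and far heads alike. That limitation is what motivates scale-aware treatments of annotation noise.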
Submitted 2 January, 2023; v1 submitted 13 November, 2022;
originally announced November 2022.
-
MicroISP: Processing 32MP Photos on Mobile Devices with Deep Learning
Authors:
Andrey Ignatov,
Anastasia Sycheva,
Radu Timofte,
Yu Tseng,
Yu-Syuan Xu,
Po-Hsiang Yu,
Cheng-Ming Chiang,
Hsien-Kai Kuo,
Min-Hung Chen,
Chia-Ming Cheng,
Luc Van Gool
Abstract:
While neural-network-based photo processing solutions can provide better image quality than traditional ISP systems, their application to mobile devices is still very limited due to their high computational complexity. In this paper, we present a novel MicroISP model designed specifically for edge devices, taking into account their computational and memory limitations. The proposed solution can process photos of up to 32MP on recent smartphones using standard mobile ML libraries, requiring less than 1 second for inference, while for FullHD images it achieves real-time performance. The architecture of the model is flexible, allowing its complexity to be adjusted to devices of different computational power. To evaluate the performance of the model, we collected a novel Fujifilm UltraISP dataset consisting of thousands of paired photos captured with a normal mobile camera sensor and a professional 102MP medium-format Fujifilm GFX100 camera. The experiments demonstrated that, despite its compact size, the MicroISP model provides comparable or better visual results than traditional mobile ISP systems, while outperforming previously proposed efficient deep-learning-based solutions. Finally, the model is also compatible with the latest mobile AI accelerators, achieving good runtime and low power consumption on smartphone NPUs and APUs. The code, dataset and pre-trained models are available on the project website: https://people.ee.ethz.ch/~ihnatova/microisp.html
Submitted 8 November, 2022;
originally announced November 2022.
-
PyNet-V2 Mobile: Efficient On-Device Photo Processing With Neural Networks
Authors:
Andrey Ignatov,
Grigory Malivenko,
Radu Timofte,
Yu Tseng,
Yu-Syuan Xu,
Po-Hsiang Yu,
Cheng-Ming Chiang,
Hsien-Kai Kuo,
Min-Hung Chen,
Chia-Ming Cheng,
Luc Van Gool
Abstract:
The increased importance of mobile photography has created a need for fast and performant RAW image processing pipelines capable of producing good visual results despite the limitations of mobile camera sensors. While deep-learning-based approaches can solve this problem efficiently, their computational requirements usually remain too large for high-resolution on-device image processing. To address this limitation, we propose a novel PyNET-V2 Mobile CNN architecture designed specifically for edge devices, which can process 12MP RAW photos directly on mobile phones in under 1.5 seconds while producing high perceptual photo quality. To train and evaluate the proposed solution, we use the real-world Fujifilm UltraISP dataset consisting of thousands of RAW-RGB image pairs captured with a professional medium-format 102MP Fujifilm camera and a popular Sony mobile camera sensor. The results demonstrate that the PyNET-V2 Mobile model can substantially surpass the quality of traditional ISP pipelines, while outperforming previously introduced neural-network-based solutions designed for fast image processing. Furthermore, we show that the proposed architecture is also compatible with the latest mobile AI accelerators, such as NPUs or APUs, which can further reduce the model's latency to as little as 0.5 seconds. The dataset, code and pre-trained models used in this paper are available on the project website: https://github.com/gmalivenko/PyNET-v2
Submitted 8 November, 2022;
originally announced November 2022.
-
Conversion of Legal Agreements into Smart Legal Contracts using NLP
Authors:
Eason Chen,
Niall Roche,
Yuen-Hsien Tseng,
Walter Hernandez,
Jiangbo Shangguan,
Alastair Moore
Abstract:
A Smart Legal Contract (SLC) is a specialized digital agreement comprising natural language and computable components. The Accord Project provides an open-source SLC framework containing three main modules: Cicero, Concerto, and Ergo. Currently, creating a usable SLC with the Accord Project requires considerable joint effort from lawyers, programmers, and clients. This paper proposes a pipeline that automates the SLC creation process, using several Natural Language Processing (NLP) models to convert legal contracts into the Accord Project's Concerto model. In our evaluation, the NER component of the pipeline detects CiceroMark in Accord Project template text with an accuracy of 0.8. Additionally, our Question Answering method can extract one-third of the Concerto variables from the template text. We also discuss some limitations of the proposed pipeline and possible future research. Finally, we describe a web interface that enables users to build SLCs; it leverages the proposed pipeline to convert text documents into Smart Legal Contracts using NLP models.
Submitted 5 April, 2023; v1 submitted 27 August, 2022;
originally announced October 2022.
-
A cusp-capturing PINN for elliptic interface problems
Authors:
Yu-Hau Tseng,
Te-Sheng Lin,
Wei-Fan Hu,
Ming-Chih Lai
Abstract:
In this paper, we propose a cusp-capturing physics-informed neural network (PINN) to solve discontinuous-coefficient elliptic interface problems whose solution is continuous but has discontinuous first derivatives on the interface. To find such a solution using a neural network representation, we introduce a cusp-enforced level set function as an additional feature input to the network to retain the inherent solution properties, that is, to capture the solution cusps (where the derivatives are discontinuous) sharply. In addition, the proposed neural network is mesh-free, so it can easily handle problems in irregular domains. We train the network using the physics-informed framework, in which the loss function comprises the residual of the differential equation together with certain interface and boundary conditions. We conduct a series of numerical experiments to demonstrate the effectiveness of the cusp-capturing technique and the accuracy of the present network model. Numerical results show that even a one-hidden-layer (shallow) network with a moderate number of neurons and sufficient training data points can achieve prediction accuracy comparable with traditional methods. Moreover, if the solution is discontinuous across the interface, an additional supervised learning task for approximating the solution jump can be incorporated into the present network without much difficulty.
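The sketch below assumes that the cusp-enforced feature is the absolute value of a level set function φ. The interface here is a hypothetical circle, φ = x² + y² − r². Feeding |φ| alongside (x, y) gives the network an input whose one-sided derivatives jump across φ = 0, which is the kind of kink the abstract describes. The code checks that jump numerically. All names are illustrative.

```python
from typing import Tuple


def level_set(x: float, y: float, r: float = 0.5) -> float:
    """Hypothetical interface: the circle x^2 + y^2 = r^2 (phi < 0 inside)."""
    return x * x + y * y - r * r


def cusp_features(x: float, y: float, r: float = 0.5) -> Tuple[float, float, float]:
    """Assumed network input (x, y, |phi|); |phi| is continuous but kinked at phi = 0."""
    return (x, y, abs(level_set(x, y, r)))


def one_sided_slopes(x0: float, y0: float, h: float = 1e-6) -> Tuple[float, float]:
    """Left and right finite-difference slopes of |phi| in x at (x0, y0)."""
    def f(x: float) -> float:
        return cusp_features(x, y0)[2]

    left = (f(x0) - f(x0 - h)) / h
    right = (f(x0 + h) - f(x0)) / h
    return left, right


# At the interface point (0.5, 0), the slope of |phi| jumps from about -1 to about +1.
print(one_sided_slopes(0.5, 0.0))
```

Away from the interface the two slopes agree. Any smooth network applied to (x, y, |φ|) therefore stays smooth there and can only develop a kink where φ = 0.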
Submitted 16 April, 2023; v1 submitted 15 October, 2022;
originally announced October 2022.
-
On the Utility of Self-supervised Models for Prosody-related Tasks
Authors:
Guan-Ting Lin,
Chi-Luen Feng,
Wei-Ping Huang,
Yuan Tseng,
Tzu-Han Lin,
Chen-An Li,
Hung-yi Lee,
Nigel G. Ward
Abstract:
Self-Supervised Learning (SSL) from speech data has produced models that achieve remarkable performance on many tasks and are known to implicitly encode many aspects of the information present in speech signals. However, relatively little is known about the suitability of such models for prosody-related tasks or the extent to which they encode prosodic information. We present a new evaluation framework, SUPERB-prosody, consisting of three prosody-related downstream tasks and two pseudo tasks. We find that 13 of the 15 SSL models outperform the baseline on all the prosody-related tasks. The SSL models also perform well on the two pseudo tasks: prosody reconstruction and future prosody prediction. We further analyze the layerwise contributions of the SSL models. Overall, we conclude that SSL speech models are highly effective for prosody-related tasks.
Submitted 26 October, 2022; v1 submitted 13 October, 2022;
originally announced October 2022.
-
An efficient neural-network and finite-difference hybrid method for elliptic interface problems with applications
Authors:
Wei-Fan Hu,
Te-Sheng Lin,
Yu-Hau Tseng,
Ming-Chih Lai
Abstract:
A new and efficient neural-network and finite-difference hybrid method is developed for solving the Poisson equation in a regular domain with jump discontinuities on embedded irregular interfaces. Since the solution has low regularity across the interface, an additional treatment accounting for the jump discontinuities must be employed when applying finite-difference discretization to this problem. Here, we aim to alleviate this extra effort, and thus ease implementation, by means of machine learning. The key idea is to decompose the solution into singular and regular parts. The neural network, incorporating the given jump conditions, learns the singular solution, while the standard five-point Laplacian discretization is used to obtain the regular solution with the associated boundary conditions. Regardless of the interface geometry, these two tasks only require supervised learning for function approximation and a fast direct solver for the Poisson equation, making the hybrid method easy to implement and efficient. Two- and three-dimensional numerical results show that the present hybrid method preserves second-order accuracy for the solution and its derivatives and is comparable with the traditional immersed interface method in the literature. As an application, we solve the Stokes equations with singular forces to demonstrate the robustness of the present method.
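The sketch below applies the standard five-point Laplacian, mentioned in the abstract, on a uniform grid with spacing h. It only evaluates the operator at interior points. The fast direct solver and the neural-network singular part are not shown, and the function name is illustrative. Because the stencil is exact for quadratics, u = x² + y² should give 4 at every interior point, up to rounding.

```python
from typing import List


def five_point_laplacian(u: List[List[float]], h: float) -> List[List[float]]:
    """Apply the 5-point stencil at interior nodes; boundary entries stay 0."""
    n, m = len(u), len(u[0])
    lap = [[0.0] * m for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            lap[i][j] = (u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1]
                         - 4.0 * u[i][j]) / (h * h)
    return lap


h = 0.1
grid = [[(i * h) ** 2 + (j * h) ** 2 for j in range(11)] for i in range(11)]
lap = five_point_laplacian(grid, h)
print(round(lap[5][5], 8))  # → 4.0
```

In a hybrid scheme of the kind described, this operator would act only on the regular part. The singular part absorbs the jump conditions, so the stencil never has to be modified near the interface.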
Submitted 2 March, 2023; v1 submitted 11 October, 2022;
originally announced October 2022.
-
Focus Plus: Detect Learner's Distraction by Web Camera in Distance Teaching
Authors:
Eason Chen,
Yuen Hsien Tseng,
Kuo-Ping Lo
Abstract:
Distance teaching has become popular in recent years because of the COVID-19 pandemic. However, both students and teachers face several challenges in distance teaching, such as students being easily distracted. To address such challenges, we propose Focus+, a system that uses recent AI technology to detect learners' status from their web cameras. With it, teachers can know students' status, and students can regulate their learning. In this research, we discuss the planned design for training and evaluating the AI detection model of Focus+.
Submitted 9 October, 2022;
originally announced October 2022.
-
Orbeez-SLAM: A Real-time Monocular Visual SLAM with ORB Features and NeRF-realized Mapping
Authors:
Chi-Ming Chung,
Yang-Che Tseng,
Ya-Ching Hsu,
Xiang-Qian Shi,
Yun-Hung Hua,
Jia-Fong Yeh,
Wen-Chin Chen,
Yi-Ting Chen,
Winston H. Hsu
Abstract:
A spatial AI that can perform complex tasks through visual signals and cooperate with humans is highly anticipated. To achieve this, we need a visual SLAM that easily adapts to new scenes without pre-training and generates dense maps for downstream tasks in real time. None of the previous learning-based or non-learning-based visual SLAMs satisfies all these needs, due to the intrinsic limitations of their components. In this work, we develop a visual SLAM named Orbeez-SLAM, which combines implicit neural representation with visual odometry to achieve these goals. Moreover, Orbeez-SLAM can work with a monocular camera since it only needs RGB inputs, making it widely applicable in the real world. Results show that our SLAM is up to 800x faster than the strong baseline, with superior rendering outcomes. Code link: https://github.com/MarvinChung/Orbeez-SLAM.
Submitted 31 January, 2023; v1 submitted 27 September, 2022;
originally announced September 2022.
-
A Survey on Open-Source-Defined Wireless Networks: Framework, Key Technology, and Implementation
Authors:
Liqiang Zhao,
Muhammad Muhammad Bala,
Wu Gang,
Pan Chengkang,
Yuan Yannan,
Tian Zhigang,
Yu-Chee Tseng,
Chen Xiang,
Bin Shen,
Chih-Lin I
Abstract:
The realization of open-source-defined wireless networks in the telecommunication domain is accomplished through the fifth-generation (5G) network. In contrast to its predecessors (3G and 4G), the 5G network can support a wide variety of heterogeneous use cases with challenging requirements from both the Internet and the Internet of Things (IoT). The future sixth-generation (6G) network will not only extend 5G capabilities but also introduce new functionalities to address emerging academic and engineering challenges. The research community has identified open-source-defined wireless networks, which are based on open-source software and hardware, as a way to overcome these challenges. In this survey, we present an overview of different aspects of open-source-defined wireless networks, comprising motivation, frameworks, key technologies, and implementation. We start by introducing the motivation and then explore several frameworks, classified into three categories: black-box, grey-box, and white-box. We review research efforts related to the open-source-defined Core Network (CN), Radio Access Network (RAN), Multi-access Edge Computing (MEC), security capabilities and threats, open-source hardware, and various implementations, including testbeds. Last but most important, the survey presents lessons learned, future research directions, open research issues, pitfalls, and limitations of existing surveys on open-source wireless networks to motivate and encourage future research.
Submitted 5 September, 2022;
originally announced September 2022.
-
VizWiz-FewShot: Locating Objects in Images Taken by People With Visual Impairments
Authors:
Yu-Yun Tseng,
Alexander Bell,
Danna Gurari
Abstract:
We introduce a few-shot localization dataset originating from photographers who were authentically trying to learn about the visual content of the images they took. It includes nearly 10,000 segmentations of 100 categories in over 4,500 images taken by people with visual impairments. Compared with existing few-shot object detection and instance segmentation datasets, our dataset is the first to locate holes in objects (found in 12.3% of our segmentations), it shows objects that occupy a much larger range of sizes relative to the images, and text is over five times more common in our objects (found in 22.4% of our segmentations). Analysis of three modern few-shot localization algorithms demonstrates that they generalize poorly to our new dataset. The algorithms commonly struggle to locate objects with holes, very small and very large objects, and objects lacking text. To encourage a larger community to work on these unsolved challenges, we publicly share our annotated few-shot dataset at https://vizwiz.org.
Submitted 24 July, 2022;
originally announced July 2022.