-
Electrical switching of spin-polarized light-emitting diodes based on a 2D CrI3/hBN/WSe2 heterostructure
Authors:
Jianchen Dang,
Tongyao Wu,
Shuohua Yan,
Kenji Watanabe,
Takashi Taniguchi,
Hechang Lei,
Xiao-Xiao Zhang
Abstract:
Spin-polarized light-emitting diodes (spin-LEDs) convert the electronic spin information to photon circular polarization, offering potential applications including spin amplification, optical communications, and advanced imaging. The conventional control of the emitted light's circular polarization requires a change in the external magnetic field, limiting the operation conditions of spin-LEDs. Here, we demonstrate an atomically thin spin-LED device based on a heterostructure of a monolayer WSe2 and a few-layer antiferromagnetic CrI3, separated by a thin hBN tunneling barrier. The CrI3 and hBN layers polarize the spin of the injected carriers into the WSe2. With the valley optical selection rule in the monolayer WSe2, the electroluminescence exhibits a high degree of circular polarization that follows the CrI3 magnetic states. Importantly, we show an efficient electrical tuning, including a sign reversal, of the electroluminescent circular polarization by applying an electrostatic field due to the electrical tunability of the few-layer CrI3 magnetization. Our results establish a new platform to achieve on-demand operation of nanoscale spin-LED and electrical control of helicity for device applications.
Submitted 9 July, 2024;
originally announced July 2024.
-
RLHF Can Speak Many Languages: Unlocking Multilingual Preference Optimization for LLMs
Authors:
John Dang,
Arash Ahmadian,
Kelly Marchisio,
Julia Kreutzer,
Ahmet Üstün,
Sara Hooker
Abstract:
Preference optimization techniques have become a standard final stage for training state-of-the-art large language models (LLMs). However, despite widespread adoption, the vast majority of work to date has focused on first-class citizen languages like English and Chinese. This not only captures a small fraction of the languages in the world, but also makes it unclear which aspects of current state-of-the-art research transfer to a multilingual setting. In this work, we perform an exhaustive study to achieve a new state of the art in aligning multilingual LLMs. We introduce a novel, scalable method for generating high-quality multilingual feedback data to balance data coverage. We establish the benefits of cross-lingual transfer and increased dataset size in preference training. Our preference-trained model achieves a 54.4% win-rate against Aya 23 8B, the current state-of-the-art multilingual LLM in its parameter class, and a 69.5% win-rate or higher against widely used models like Gemma-1.1-7B-it, Llama-3-8B-Instruct, and Mistral-7B-Instruct-v0.3. As a result of our study, we expand the frontier of alignment techniques to 23 languages covering half of the world's population.
Submitted 2 July, 2024;
originally announced July 2024.
-
AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations
Authors:
Sheng Wu,
Jiaxing Liu,
Longbiao Wang,
Dongxiao He,
Xiaobao Wang,
Jianwu Dang
Abstract:
Emotion Recognition in Conversations (ERC) is a popular task in natural language processing, which aims to recognize the emotional state of the speaker in conversations. While current research primarily emphasizes contextual modeling, there exists a dearth of investigation into effective multimodal fusion methods. We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features. Specifically, we design a Modality Augmentation Network which performs rich representation learning through dimension transformation of different modalities and a parameter-efficient inception block. In addition, the Modality Interaction Network performs interaction fusion of extracted inter-modal features and intra-modal features. Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics compared to the state-of-the-art (SOTA) models.
Submitted 12 April, 2024;
originally announced July 2024.
-
Distinguishing Surface and Bulk Electromagnetism via Their Dynamics in an Intrinsic Magnetic Topological Insulator
Authors:
Khanh Duy Nguyen,
Woojoo Lee,
Jianchen Dang,
Tongyao Wu,
Gabriele Berruto,
Chenhui Yan,
Chi Ian Jess Ip,
Haoran Lin,
Qiang Gao,
Seng Huat Lee,
Binghai Yan,
Chaoxing Liu,
Zhiqiang Mao,
Xiao-Xiao Zhang,
Shuolong Yang
Abstract:
The indirect exchange interaction between local magnetic moments via surface electrons has been long predicted to bolster the surface ferromagnetism in magnetic topological insulators (MTIs), which facilitates the quantum anomalous Hall effect. This unconventional effect is critical to determining the operating temperatures of future topotronic devices. However, the experimental confirmation of this mechanism remains elusive, especially in intrinsic MTIs. Here we combine time-resolved photoemission spectroscopy with time-resolved magneto-optical Kerr effect measurements to elucidate the unique electromagnetism at the surface of an intrinsic MTI MnBi2Te4. Theoretical modeling based on 2D Ruderman-Kittel-Kasuya-Yosida interactions captures the initial quenching of a surface-rooted exchange gap within a factor of two but over-estimates the bulk demagnetization by one order of magnitude. This mechanism directly explains the sizable gap in the quasi-2D electronic state and the nonzero residual magnetization in even-layer MnBi2Te4. Furthermore, it leads to efficient light-induced demagnetization comparable to state-of-the-art magnetophotonic crystals, promising an effective manipulation of magnetism and topological orders for future topotronics.
Submitted 28 June, 2024;
originally announced July 2024.
-
An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios
Authors:
Cheng Gong,
Erica Cooper,
Xin Wang,
Chunyu Qiang,
Mengzhe Geng,
Dan Wells,
Longbiao Wang,
Jianwu Dang,
Marc Tessier,
Aidan Pine,
Korin Richmond,
Junichi Yamagishi
Abstract:
Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks. Despite advancements, language adaptation in TTS systems remains an open problem. This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system proposed in our previous work. We conducted experiments on 12 languages using limited data with various fine-tuning configurations. We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance. Additionally, we find that the fine-tuning dataset size and number of speakers influence adaptability. Surprisingly, we also observed that using paired data for fine-tuning is not always optimal compared to audio-only data. Beyond speech intelligibility, our analysis covers speaker similarity, language identification, and predicted MOS.
Submitted 13 June, 2024;
originally announced June 2024.
-
Aya 23: Open Weight Releases to Further Multilingual Progress
Authors:
Viraat Aryabumi,
John Dang,
Dwarak Talupuru,
Saurabh Dash,
David Cairuz,
Hangyu Lin,
Bharat Venkitesh,
Madeline Smith,
Jon Ander Campos,
Yi Chern Tan,
Kelly Marchisio,
Max Bartolo,
Sebastian Ruder,
Acyr Locatelli,
Julia Kreutzer,
Nick Frosst,
Aidan Gomez,
Phil Blunsom,
Marzieh Fadaee,
Ahmet Üstün,
Sara Hooker
Abstract:
This technical report introduces Aya 23, a family of multilingual language models. Aya 23 builds on the recent release of the Aya model (Üstün et al., 2024), focusing on pairing a highly performant pre-trained model with the recently released Aya collection (Singh et al., 2024). The result is a powerful multilingual large language model serving 23 languages, expanding state-of-the-art language modeling capabilities to approximately half of the world's population. The Aya model covered 101 languages whereas Aya 23 is an experiment in depth vs. breadth, exploring the impact of allocating more capacity to fewer languages that are included during pre-training. Aya 23 outperforms both previous massively multilingual models like Aya 101 for the languages it covers, as well as widely used models like Gemma, Mistral and Mixtral on an extensive range of discriminative and generative tasks. We release the open weights for both the 8B and 35B models as part of our continued commitment to expanding access to multilingual progress.
Submitted 31 May, 2024; v1 submitted 23 May, 2024;
originally announced May 2024.
-
Fact: Teaching MLLMs with Faithful, Concise and Transferable Rationales
Authors:
Minghe Gao,
Shuang Chen,
Liang Pang,
Yuan Yao,
Jisheng Dang,
Wenqiao Zhang,
Juncheng Li,
Siliang Tang,
Yueting Zhuang,
Tat-Seng Chua
Abstract:
The remarkable performance of Multimodal Large Language Models (MLLMs) has unequivocally demonstrated their proficient understanding capabilities in handling a wide array of visual tasks. Nevertheless, the opaque nature of their black-box reasoning processes persists as an enigma, rendering them uninterpretable and struggling with hallucination. Their ability to execute intricate compositional reasoning tasks is also constrained, culminating in a stagnation of learning progression for these models. In this work, we introduce Fact, a novel paradigm designed to generate multimodal rationales that are faithful, concise, and transferable for teaching MLLMs. This paradigm utilizes verifiable visual programming to generate executable code guaranteeing faithfulness and precision. Subsequently, through a series of operations including pruning, merging, and bridging, the rationale enhances its conciseness. Furthermore, we filter rationales that can be transferred to end-to-end paradigms from programming paradigms to guarantee transferability. Empirical evidence from experiments demonstrates the superiority of our method across models of varying parameter sizes, significantly enhancing their compositional reasoning and generalization ability. Our approach also reduces hallucinations owing to its high correlation between images and text.
Submitted 17 April, 2024;
originally announced April 2024.
-
Statistical analysis of pulsar flux density distribution
Authors:
H. W. Xu,
R. S. Zhao,
Erbil Gugercinoglu,
H. Liu,
D. Li,
P. Wang,
C. H. Niu,
C. Miao,
X. Zhu,
R. W. Tian,
W. L. Li,
S. D. Wang,
Z. F. Tu,
Q. J. Zhi,
S. J. Dang,
L. H. Shang,
S. Xiao
Abstract:
This study presents a comprehensive analysis of the spectral properties of 886 pulsars across a wide frequency range from 20 MHz to 343.5 GHz, including a total of 86 millisecond pulsars. The majority of the pulsars exhibit power-law behavior in their spectra, although some exceptions are observed. Five different spectral models, namely simple power-law, broken power-law, low-frequency turn-over, high-frequency cut-off, and double turn-over, were employed to explore the spectral behaviors. The average spectral index for pulsars modeled with a simple power-law is found to be -1.64 ± 0.80, consistent with previous studies. Additionally, significant correlations between the spectral index and characteristic parameters are observed, particularly in millisecond pulsars, while no strong correlation is observed in normal pulsars. Different models show variations in the most influential characteristic parameters associated with the spectral index, indicating diverse dominant radiation mechanisms in millisecond pulsars. Finally, this study identifies 22 pulsars of the Gigahertz-peaked Spectra (GPS) type for the first time based on the Akaike information criterion.
Submitted 16 April, 2024; v1 submitted 16 April, 2024;
originally announced April 2024.
-
Accurate Low-Degree Polynomial Approximation of Non-polynomial Operators for Fast Private Inference in Homomorphic Encryption
Authors:
Jianming Tong,
Jingtian Dang,
Anupam Golder,
Callie Hao,
Arijit Raychowdhury,
Tushar Krishna
Abstract:
As machine learning (ML) permeates fields like healthcare, facial recognition, and blockchain, the need to protect sensitive data intensifies. Fully Homomorphic Encryption (FHE) allows inference on encrypted data, preserving the privacy of both data and the ML model. However, it slows down non-secure inference by up to five orders of magnitude, with a root cause of replacing non-polynomial operators (ReLU and MaxPooling) with high-degree Polynomial Approximated Function (PAF). We propose SmartPAF, a framework to replace non-polynomial operators with low-degree PAF and then recover the accuracy of the PAF-approximated model through four techniques: (1) Coefficient Tuning (CT) -- adjust PAF coefficients based on the input distributions before training, (2) Progressive Approximation (PA) -- progressively replace one non-polynomial operator at a time, followed by fine-tuning, (3) Alternate Training (AT) -- alternate the training between PAFs and other linear operators in a decoupled manner, and (4) Dynamic Scale (DS) / Static Scale (SS) -- dynamically scale the PAF input value within (-1, 1) in training, and fix the scale as the running max value in FHE deployment. The synergistic effect of CT, PA, AT, and DS/SS enables SmartPAF to enhance the accuracy of the various models approximated by PAFs with various low degrees under multiple datasets. For ResNet-18 under ImageNet-1k, the Pareto-frontier spotted by SmartPAF in latency-accuracy tradeoff space achieves 1.42x ~ 13.64x accuracy improvement and 6.79x ~ 14.9x speedup over prior works. Further, SmartPAF enables a 14-degree PAF (f_1^2 g_1^2) to achieve 7.81x speedup compared to the 27-degree PAF obtained by minimax approximation with the same 69.4% post-replacement accuracy. Our code is available at https://github.com/EfficientFHE/SmartPAF.
Submitted 7 May, 2024; v1 submitted 4 April, 2024;
originally announced April 2024.
-
Investigation of profile shifting and subpulse movement in PSR J0344-0901 with FAST
Authors:
H. M. Tedila,
R. Yuen,
N. Wang,
D. Li,
Z. G. Wen,
W. M. Yan,
J. P. Yuan,
X. H. Han,
P. Wang,
W. W. Zhu,
S. J. Dang,
S. Q. Wang,
J. T. Xie,
Q. D. Wu,
Sh. Khasanov,
FAST Collaboration
Abstract:
We report two phenomena detected in PSR J0344$-$0901 from two observations conducted at a center frequency of 1.25 GHz using the Five-hundred-meter Aperture Spherical radio Telescope (FAST). The first phenomenon manifests as a shift of the pulse emission to later longitudinal phases, followed by a gradual return to its original location. The event lasts for about 216 pulse periods, with an average shift of about $0.7^\circ$ measured at the peak of the integrated profile. Changes in the polarization position angle (PPA) are detected around the trailing edge of the profile, together with an increase in the profile width. The second phenomenon is characterized by the apparent movement of subpulses, which results in different subpulse track patterns across the profile window. For the first time in this pulsar, we identify four emission modes, each with unique subpulse movement, and determine the pattern periods for three of the emission modes. Pulse nulling was not detected. Modeling of the changes in the PPA using the rotating vector model gives an inclination angle of $75.12^\circ \pm 3.80^\circ$ and an impact parameter of $-3.17^\circ \pm 5.32^\circ$ for this pulsar. We speculate that the subpulse movement may be related to the shifting of the pulse emission.
Submitted 22 February, 2024;
originally announced February 2024.
-
Lattice QCD calculation of the $D_s^{*}$ radiative decay with (2+1)-flavor Wilson-clover ensembles
Authors:
Yu Meng,
Jin-Long Dang,
Chuan Liu,
Zhaofeng Liu,
Tinghong Shen,
Haobo Yan,
Ke-Long Zhang
Abstract:
We perform a lattice calculation of the radiative decay of $D_s^*$ using the (2+1)-flavor Wilson-clover gauge ensembles generated by the CLQCD collaboration. A method allowing us to calculate the form factor at zero momentum transfer is proposed and applied to the radiative transition $D_s^*\rightarrow D_sγ$ and the Dalitz decay $D_s^*\rightarrow D_s e^+e^-$. After a continuum extrapolation using three lattice spacings, we obtain $Γ(D_s^*\rightarrow D_s γ)=0.0549(54)$ keV, where the error is purely statistical. The result is consistent with previous lattice calculations but with an error reduced to only a fifth of the previous value. The Dalitz decay rate is also calculated for the first time and the ratio with the radiative transition is found to be $R_{ee}=0.624(3)\%$. A total decay width of $D_s^*$ can then be determined as 0.0587(54) keV taking into account the experimental branching fraction. Combining with the most recent experimental measurement of the branching fraction of the purely leptonic decay $D_s^{+,*}\rightarrow e^+ν_e$, we obtain the quantity $f_{D_s^*}|V_{cs}|=(190.5^{+55.1}_{-41.7_{\textrm{stat.}}}\pm 12.6_{\textrm{syst.}})$ MeV, where stat. denotes only the statistical error from the experiment, and syst. results from the experimental systematic uncertainty and the lattice statistical error. Our result leads to an improved systematic uncertainty compared to $42.7_{\textrm{syst.}}$ obtained using the previous lattice prediction of the total decay width, $0.070(28)$ keV, as the input.
Submitted 29 April, 2024; v1 submitted 24 January, 2024;
originally announced January 2024.
-
Pulse Jitter and Single-pulse Variability in Millisecond Pulsars
Authors:
S. Q. Wang,
N. Wang,
J. B. Wang,
G. Hobbs,
H. Xu,
B. J. Wang,
S. Dai,
S. J. Dang,
D. Li,
Y. Feng,
C. M. Zhang
Abstract:
Understanding the jitter noise resulting from single-pulse phase and shape variations is important for the detection of gravitational waves using pulsar timing arrays. We present measurements of jitter noise and single-pulse variability of 12 millisecond pulsars that are part of the International Pulsar Timing Array sample using the Five-hundred-meter Aperture Spherical radio Telescope (FAST). We find that the levels of jitter noise can vary dramatically among pulsars. A moderate correlation with a correlation coefficient of 0.57 between jitter noise and pulse width is detected. To mitigate jitter noise, we performed matrix template matching using all four Stokes parameters. Our results reveal a reduction in jitter noise ranging from 6.7\% to 39.6\%. By performing longitude-resolved fluctuation spectrum analysis, we identified periodic intensity modulations in 10 pulsars. In PSR J0030+0451, we detected single pulses with energies more than 10 times the average pulse energy, suggesting the presence of giant pulses. We also observed a periodic mode-changing phenomenon in PSR J0030+0451. We examined the achievable timing precision by selecting a subset of pulses with a specific range of peak intensity, but found no significant improvement in timing precision.
Submitted 22 January, 2024;
originally announced January 2024.
-
Performance Trade-off and Joint Waveform Design for MIMO-OFDM DFRC Systems
Authors:
Tianchen Liu,
Liang Wu,
Bo An,
Zaichen Zhang,
Jian Dang,
Jiangzhou Wang
Abstract:
Dual-functional radar-communication (DFRC) has attracted considerable attention. This paper considers the frequency-selective multipath fading environment and proposes DFRC waveform design strategies based on multiple-input and multiple-output (MIMO) and orthogonal frequency division multiplexing (OFDM) techniques. In the proposed waveform design strategies, the Cramer-Rao bound (CRB) of the radar system, the inter-stream interference (ISI) and the achievable rate of the communication system, are respectively considered as the performance metrics. In this paper, we focus on the performance trade-off between the radar system and the communication system, and the optimization problems are formulated. In the ISI minimization based waveform design strategy, the optimization problem is convex and can be easily solved. In the achievable rate maximization based waveform design strategy, we propose a water-filling (WF) and sequential quadratic programming (SQP) based algorithm to derive the covariance matrix and the precoding matrix. Simulation results validate the proposed DFRC waveform designs and show that the achievable rate maximization based strategy has a better performance than the ISI minimization based strategy.
Submitted 4 January, 2024;
originally announced January 2024.
-
ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations
Authors:
Cheng Gong,
Xin Wang,
Erica Cooper,
Dan Wells,
Longbiao Wang,
Jianwu Dang,
Korin Richmond,
Junichi Yamagishi
Abstract:
Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker, single-language synthesis. Multilingual TTS systems are limited to resource-rich languages due to the lack of large paired text and studio-quality audio data. In most cases, TTS systems are built using a single speaker's voice. However, there is growing interest in developing systems that can synthesize voices for new speakers using only a few seconds of their speech. This paper presents ZMM-TTS, a multilingual and multispeaker framework utilizing quantized latent speech representations from a large-scale, pre-trained, self-supervised model. Our paper is the first to incorporate the representations from text-based and speech-based self-supervised learning models into multilingual speech synthesis tasks. We conducted comprehensive subjective and objective evaluations through a series of experiments. Our model has been proven effective in terms of speech naturalness and similarity for both seen and unseen speakers in six high-resource languages. We also tested the efficiency of our method on two hypothetical low-resource languages. The results are promising, indicating that our proposed approach can synthesize audio that is intelligible and has a high degree of similarity to the target speaker's voice, even without any training data for the new, unseen language.
Submitted 21 December, 2023;
originally announced December 2023.
-
A Refining Underlying Information Framework for Monaural Speech Enhancement
Authors:
Rui Cao,
Tianrui Wang,
Meng Ge,
Longbiao Wang,
Jianwu Dang
Abstract:
Supervised speech enhancement has gained significantly from recent advancements in neural networks, especially due to their ability to non-linearly fit the diverse representations of target speech, such as waveform or spectrum. However, these direct-fitting solutions continue to face challenges with degraded speech and residual noise in hearing evaluations. By bridging speech enhancement and the Information Bottleneck principle in this letter, we rethink a universal plug-and-play strategy and propose a Refining Underlying Information framework called RUI to rise to the challenges both in theory and practice. Specifically, we first transform the objective of speech enhancement into an incremental convergence problem of mutual information between comprehensive speech characteristics and individual speech characteristics, e.g., spectral and acoustic characteristics. By doing so, compared with the existing direct-fitting solutions, the underlying information stems from the conditional entropy of acoustic characteristics given spectral characteristics. Therefore, we design a dual-path multiple refinement iterator based on the chain rule of entropy to refine this underlying information for further approximating target speech. Experimental results on the DNS-Challenge dataset show that our solution consistently improves the PESQ score by 0.3 or more over baselines, with only 1.18 M additional parameters. The source code is available at https://github.com/caoruitju/RUI_SE.
Submitted 24 December, 2023; v1 submitted 18 December, 2023;
originally announced December 2023.
-
Ahpatron: A New Budgeted Online Kernel Learning Machine with Tighter Mistake Bound
Authors:
Yun Liao,
Junfan Li,
Shizhong Liao,
Qinghua Hu,
Jianwu Dang
Abstract:
In this paper, we study the mistake bound of online kernel learning on a budget. We propose a new budgeted online kernel learning model, called Ahpatron, which significantly improves the mistake bound of previous work and resolves the open problem posed by Dekel, Shalev-Shwartz, and Singer (2005). We first present an aggressive variant of Perceptron, named AVP, a model without budget, which uses an active updating rule. Then we design a new budget maintenance mechanism, which removes half of the examples and projects the removed examples onto a hypothesis space spanned by the remaining examples. Ahpatron adopts the above mechanism to approximate AVP. Theoretical analyses prove that Ahpatron has tighter mistake bounds, and experimental results show that Ahpatron outperforms the state-of-the-art algorithms on the same or a smaller budget.
Submitted 12 December, 2023;
originally announced December 2023.
-
Discovery of four pulsars in a pilot survey at intermediate Galactic latitudes with FAST
Authors:
Q. J. Zhi,
J. T. Bai,
S. Dai,
X. Xu,
S. J. Dang,
L. H. Shang,
R. S. Zhao,
D. Li,
W. W. Zhu,
N. Wang,
J. P. Yuan,
P. Wang,
L. Zhang,
Y. Feng,
J. B. Wang,
S. Q. Wang,
Q. D. Wu,
A. J. Dong,
H. Yang,
J. Tian,
W. Q. Zhong,
X. H. Luo,
Miroslav D. Filipović,
G. J. Qiao
Abstract:
We present the discovery and timing results of four pulsars discovered in a pilot survey at intermediate Galactic latitudes with the Five-hundred-meter Aperture Spherical Telescope (FAST). Among these pulsars, two belong to the category of millisecond pulsars (MSPs) with spin periods of less than 20 ms. The other two fall under the classification of "mildly recycled" pulsars, with massive white dwarfs as companions. Remarkably, this small survey, covering an area of 4.7 $\mathrm{deg}^2$, led to the discovery of four recycled pulsars. Such success underscores the immense potential of future surveys at intermediate Galactic latitudes. In order to assess the potential yield of MSPs, we conducted population simulations and found that both FAST and Parkes new phased array feed surveys, focusing on intermediate Galactic latitudes, have the capacity to uncover several hundred new MSPs.
Submitted 28 December, 2023; v1 submitted 1 November, 2023;
originally announced November 2023.
-
Group Preference Optimization: Few-Shot Alignment of Large Language Models
Authors:
Siyan Zhao,
John Dang,
Aditya Grover
Abstract:
Many applications of large language models (LLMs), ranging from chatbots to creative writing, require nuanced subjective judgments that can differ significantly across different groups. Existing alignment algorithms can be expensive to align for each group, requiring prohibitive amounts of group-specific preference data and computation for real-world use cases. We introduce Group Preference Optimization (GPO), an alignment framework that steers language models to preferences of individual groups in a few-shot manner. In GPO, we augment the base LLM with an independent transformer module trained to predict the preferences of a group for the LLM generations. For few-shot learning, we parameterize this module as an in-context autoregressive transformer and train it via meta-learning on several groups. We empirically validate the efficacy of GPO through rigorous evaluations using LLMs with varied sizes on three human opinion adaptation tasks. These tasks involve adapting to the preferences of US demographic groups, global countries, and individual users. Our results demonstrate that GPO not only aligns models more accurately but also requires fewer group-specific preferences, and less training and inference computing resources, outperforming existing strategies such as in-context steering and fine-tuning methods.
Submitted 17 October, 2023;
originally announced October 2023.
-
High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models
Authors:
Chunyu Qiang,
Hao Li,
Yixin Tian,
Yi Zhao,
Ying Zhang,
Longbiao Wang,
Jianwu Dang
Abstract:
Text-to-speech (TTS) methods have shown promising results in voice cloning, but they require a large number of labeled text-speech pairs. Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations (semantic and acoustic) and using two sequence-to-sequence tasks to enable training with minimal supervision. However, existing methods suffer from information redundancy and dimension explosion in semantic representation, and high-frequency waveform distortion in discrete acoustic representation. Autoregressive frameworks exhibit typical instability and uncontrollability issues, and non-autoregressive frameworks suffer from prosodic averaging caused by duration prediction models. To address these issues, we propose a minimally-supervised high-fidelity speech synthesis method, where all modules are constructed based on diffusion models. The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression. Contrastive Token-Acoustic Pretraining (CTAP) is used as an intermediate semantic representation to solve the problems of information redundancy and dimension explosion in existing semantic coding methods. The mel-spectrogram is used as the acoustic representation. Both semantic and acoustic representations are predicted by continuous variable regression tasks to solve the problem of high-frequency fine-grained waveform distortion. Experimental results show that our proposed method outperforms the baseline method. We provide audio samples on our website.
Submitted 18 December, 2023; v1 submitted 27 September, 2023;
originally announced September 2023.
-
Learning Speech Representation From Contrastive Token-Acoustic Pretraining
Authors:
Chunyu Qiang,
Hao Li,
Yixin Tian,
Ruibo Fu,
Tao Wang,
Longbiao Wang,
Jianwu Dang
Abstract:
For fine-grained generation and recognition tasks such as minimally-supervised text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), the intermediate representations extracted from speech should serve as a "bridge" between text and acoustic information, containing information from both modalities. The semantic content is emphasized, while the paralinguistic information such as speaker identity and acoustic details should be de-emphasized. However, existing methods for extracting fine-grained intermediate representations from speech suffer from issues of excessive redundancy and dimension explosion. Contrastive learning is a good method for modeling intermediate representations from two modalities. However, existing contrastive learning methods in the audio field focus on extracting global descriptive information for downstream audio classification tasks, making them unsuitable for TTS, VC, and ASR tasks. To address these issues, we propose a method named "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space, learning how to connect phoneme and speech at the frame level. The CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR. The proposed CTAP method offers a promising solution for fine-grained generation and recognition downstream tasks in speech processing. We provide a website with audio samples.
Submitted 18 December, 2023; v1 submitted 1 September, 2023;
originally announced September 2023.
-
Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models
Authors:
Hritik Bansal,
John Dang,
Aditya Grover
Abstract:
Aligning large language models (LLMs) with human values and intents critically involves the use of human or AI feedback. While dense feedback annotations are expensive to acquire and integrate, sparse feedback presents a structural design choice between ratings (e.g., score Response A on a scale of 1-7) and rankings (e.g., is Response A better than Response B?). In this work, we analyze the effect of this design choice for the alignment and evaluation of LLMs. We uncover an inconsistency problem wherein the preferences inferred from ratings and rankings disagree significantly, 60% of the time, for both human and AI annotators. Our subsequent analysis identifies various facets of annotator bias that explain this phenomenon; for example, human annotators rate denser responses higher while preferring accuracy during pairwise judgments. To our surprise, we observe that the choice of feedback protocol also has a significant effect on the evaluation of aligned LLMs. In particular, we find that LLMs that leverage rankings data for alignment (say model X) are preferred over those that leverage ratings data (say model Y), with a rank-based evaluation protocol (is X/Y's response better than reference response?) but not with a rating-based evaluation protocol (score Rank X/Y's response on a scale of 1-7). Our findings thus shed light on critical gaps in methods for evaluating the real-world utility of language models and their strong dependence on the feedback protocol used for alignment. Our code and data are available at https://github.com/Hritikbansal/sparse_feedback.
Submitted 5 February, 2024; v1 submitted 30 August, 2023;
originally announced August 2023.
-
Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding
Authors:
Chunyu Qiang,
Hao Li,
Hao Ni,
He Qu,
Ruibo Fu,
Tao Wang,
Longbiao Wang,
Jianwu Dang
Abstract:
Recently, there has been a growing interest in text-to-speech (TTS) methods that can be trained with minimal supervision by combining two types of discrete speech representations and using two sequence-to-sequence tasks to decouple TTS. However, existing methods suffer from three problems: the high dimensionality and waveform distortion of discrete speech representations, the prosodic averaging problem caused by the duration prediction model in non-autoregressive frameworks, and the information redundancy and dimension explosion problems of existing semantic encoding methods. To address these problems, three progressive methods are proposed. First, we propose Diff-LM-Speech, an autoregressive structure consisting of a language model and diffusion models, which models the semantic embedding into the mel-spectrogram based on a diffusion model to achieve higher audio quality. We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability. Second, we propose Tetra-Diff-Speech, a non-autoregressive structure consisting of four diffusion model-based modules that design a duration diffusion model to achieve diverse prosodic expressions. Finally, we propose Tri-Diff-Speech, a non-autoregressive structure consisting of three diffusion model-based modules that verify the non-necessity of existing semantic encoding models and achieve the best results. Experimental results show that our proposed methods outperform baseline methods. We provide a website with audio samples.
Submitted 18 December, 2023; v1 submitted 28 July, 2023;
originally announced July 2023.
-
Downlink Precoding for Cell-free FBMC/OQAM Systems With Asynchronous Reception
Authors:
Yuhao Qi,
Jian Dang,
Zaichen Zhang,
Liang Wu,
Yongpeng Wu
Abstract:
In this work, an efficient precoding design scheme is proposed for downlink cell-free distributed massive multiple-input multiple-output (DM-MIMO) filter bank multi-carrier (FBMC) systems with asynchronous reception and high frequency selectivity. The proposed scheme includes a multiple interpolation structure to eliminate the impact of the response difference we recently discovered, which gives better performance in highly frequency-selective channels. We also consider the phase shift in asynchronous reception and introduce a phase compensation in the design process. The phase compensation also benefits from the multiple interpolation structure and better adapts to asynchronous reception. Based on the proposed scheme, we theoretically analyze its ergodic achievable rate performance and derive a closed-form expression. Simulation results show that the derived expression can accurately characterize the rate performance, and FBMC with the proposed scheme outperforms orthogonal frequency-division multiplexing (OFDM) in the asynchronous scenario.
Submitted 13 July, 2023;
originally announced July 2023.
-
Rethinking the visual cues in audio-visual speaker extraction
Authors:
Junjie Li,
Meng Ge,
Zexu Pan,
Rui Cao,
Longbiao Wang,
Jianwu Dang,
Shiliang Zhang
Abstract:
The Audio-Visual Speaker Extraction (AVSE) algorithm employs parallel video recording to leverage two visual cues, namely speaker identity and synchronization, to enhance performance compared to audio-only algorithms. However, the visual front-end in AVSE is often derived from a pre-trained model or end-to-end trained, making it unclear which visual cue contributes more to the speaker extraction performance. This raises the question of how to better utilize visual cues. To address this issue, we propose two training strategies that decouple the learning of the two visual cues. Our experimental results demonstrate that both visual cues are useful, with the synchronization cue having a higher impact. We introduce a more explainable model, the Decoupled Audio-Visual Speaker Extraction (DAVSE) model, which leverages both visual cues.
Submitted 5 June, 2023;
originally announced June 2023.
-
Speech and noise dual-stream spectrogram refine network with speech distortion loss for robust speech recognition
Authors:
Haoyu Lu,
Nan Li,
Tongtong Song,
Longbiao Wang,
Jianwu Dang,
Xiaobao Wang,
Shiliang Zhang
Abstract:
In recent years, the joint training of speech enhancement front-end and automatic speech recognition (ASR) back-end has been widely used to improve the robustness of ASR systems. Traditional joint training methods only use enhanced speech as input for the backend. However, it is difficult for speech enhancement systems to directly separate speech from input due to the diverse types of noise with different intensities. Furthermore, speech distortion and residual noise are often observed in enhanced speech, and the distortion of speech and noise is different. Most existing methods focus on fusing enhanced and noisy features to address this issue. In this paper, we propose a dual-stream spectrogram refine network to simultaneously refine the speech and noise and decouple the noise from the noisy input. Our proposed method can achieve better performance with a relative 8.6% CER reduction.
Submitted 30 May, 2023; v1 submitted 28 May, 2023;
originally announced May 2023.
-
Locate and Beamform: Two-dimensional Locating All-neural Beamformer for Multi-channel Speech Separation
Authors:
Yanjie Fu,
Meng Ge,
Honglong Wang,
Nan Li,
Haoran Yin,
Longbiao Wang,
Gaoyan Zhang,
Jianwu Dang,
Chengyun Deng,
Fei Wang
Abstract:
Recently, stunning improvements in multi-channel speech separation have been achieved by neural beamformers when direction information is available. However, most of them neglect the speakers' 2-dimensional (2D) location cues contained in the mixture signal, which limits performance when two sources come from close directions. In this paper, we propose an end-to-end beamforming network for 2D location guided speech separation given merely the mixture signal. It first estimates discriminable direction and 2D location cues, which indicate the directions the sources come from in multiple views of the microphones and their 2D coordinates. These cues are then integrated into a location-aware neural beamformer, thus allowing accurate reconstruction of the two sources' speech signals. Experiments show that our proposed model not only achieves a comprehensive decent improvement compared to baseline systems, but also avoids inferior performance in spatially overlapping cases.
Submitted 2 June, 2023; v1 submitted 18 May, 2023;
originally announced May 2023.
-
Asymmetric Chiral Coupling in a Topological Resonator
Authors:
Shushu Shi,
Xin Xie,
Sai Yan,
Jingnan Yang,
Jianchen Dang,
Shan Xiao,
Longlong Yang,
Danjie Dai,
Bowen Fu,
Yu Yuan,
Rui Zhu,
Xiangbin Su,
Hanqing Liu,
Zhanchun Zuo,
Can Wang,
Haiqiao Ni,
Zhichuan Niu,
Qihuang Gong,
Xiulai Xu
Abstract:
Chiral light-matter interactions supported by topological edge modes at the interface of valley photonic crystals provide a robust method to implement unidirectional spin transfer. The valley topological photonic crystals possess a pair of counterpropagating edge modes. The edge modes are robust against sharp bends of $60^{\circ}$ and $120^{\circ}$, which can form a resonator with whispering gallery modes. Here, we demonstrate the asymmetric emission of chiral coupling from single quantum dots in a topological resonator by tuning the coupling between a quantum emitter and a resonator mode. Under a magnetic field in Faraday configuration, the exciton state from a single quantum dot splits into two exciton spin states with opposite circularly polarized emissions due to the Zeeman effect. The two branches of the quantum dot emission couple to a resonator mode to different degrees, resulting in an asymmetric chiral emission. Without requiring site control of quantum emitters for chiral quantum optics, an extra degree of freedom to tune the chiral contrast with a topological resonator could be useful for the development of on-chip integrated photonic circuits.
Submitted 26 April, 2023;
originally announced April 2023.
-
Time-domain Speech Enhancement Assisted by Multi-resolution Frequency Encoder and Decoder
Authors:
Hao Shi,
Masato Mimura,
Longbiao Wang,
Jianwu Dang,
Tatsuya Kawahara
Abstract:
Time-domain speech enhancement (SE) has recently been intensively investigated. Among recent works, DEMUCS introduces multi-resolution STFT loss to enhance performance. However, some resolutions used for STFT contain non-stationary signals, and it is challenging to learn multi-resolution frequency losses simultaneously with only one output. For better use of multi-resolution frequency information, we supplement multiple spectrograms in different frame lengths into the time-domain encoders. They extract stationary frequency information in both narrowband and wideband. We also adopt multiple decoder outputs, each of which computes its corresponding resolution frequency loss. Experimental results show that (1) it is more effective to fuse stationary frequency features than non-stationary features in the encoder, and (2) the multiple outputs consistent with the frequency loss improve performance. Experiments on the Voice-Bank dataset show that the proposed method obtained a 0.14 PESQ improvement.
Submitted 25 March, 2023;
originally announced March 2023.
-
Pilot-Free Unsourced Random Access Via Dictionary Learning and Error-Correcting Codes
Authors:
Zhentian Zhang,
Jian Dang,
Zaichen Zhang,
Liang Wu,
Bingcheng Zhu,
Lei Wang
Abstract:
Massive machine-type communications (mMTC) or massive access is a critical scenario in the fifth generation (5G) and the future cellular network. With the surging density of devices from millions to billions, unique pilot allocation becomes inapplicable in the user ID-incorporated grant-free random access protocol. Unsourced random access (URA) manifests itself by focusing only on unwrapping the received signals via a common codebook. In this paper, we propose a URA protocol for a massive access cellular system equipped with multiple antennas at the base station. The proposed scheme encompasses a codebook enabling construction of a sparse transmission frame, a receiver equipped with dictionary learning and error-correcting codes, and a collision resolution strategy for collided codewords. Unlike existing schemes, which require overhead for preamble signals, the proposed scheme needs no overhead or pre-defined pilot sequences, which is favorable for energy-efficient transmission and latency reduction. Numerical results verify the viability of the proposed scheme in practical massive access scenarios.
Submitted 9 March, 2023;
originally announced March 2023.
-
Controllable Spin-Resolved Photon Emission Enhanced by Slow-Light Mode in Photonic Crystal Waveguides on Chip
Authors:
Shushu Shi,
Shan Xiao,
Jingnan Yang,
Shulun Li,
Xin Xie,
Jianchen Dang,
Longlong Yang,
Danjie Dai,
Bowen Fu,
Sai Yan,
Yu Yuan,
Rui Zhu,
Bei-Bei Li,
Zhanchun Zuo,
Can Wang,
Haiqiao Ni,
Zhichuan Niu,
Kuijuan Jin,
Qihuang Gong,
Xiulai Xu
Abstract:
We report the slow-light enhanced spin-resolved in-plane emission from a single quantum dot (QD) in a photonic crystal waveguide (PCW). The slow light dispersions in PCWs are designed to match the emission wavelengths of single QDs. The resonance between two spin states emitted from a single QD and a slow light mode of a waveguide is investigated under a magnetic field with Faraday configuration. Two spin states of a single QD experience different degrees of enhancement as their emission wavelengths are shifted by combining diamagnetic and Zeeman effects with an optical excitation power control. A circular polarization degree up to 0.81 is achieved by changing the off-resonant excitation power. Strongly polarized photon emission enhanced by a slow light mode shows great potential to attain controllable spin-resolved photon sources for integrated optical quantum networks on chip.
Submitted 22 February, 2023;
originally announced February 2023.
-
Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification
Authors:
Meng Liu,
Kong Aik Lee,
Longbiao Wang,
Hanyi Zhang,
Chang Zeng,
Jianwu Dang
Abstract:
Visual speech (i.e., lip motion) is highly related to auditory speech due to the co-occurrence and synchronization in speech production. This paper investigates this correlation and proposes a cross-modal speech co-learning paradigm. The primary motivation of our cross-modal co-learning method is modeling one modality aided by exploiting knowledge from another modality. Specifically, two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation. Inside each booster, a max-feature-map embedded Transformer variant is proposed for modality alignment and enhanced feature generation. The network is co-learned both from scratch and with pretrained models. Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvement over independently trained audio-only/visual-only and baseline fusion systems, respectively.
Submitted 22 February, 2023;
originally announced February 2023.
-
Bridge Damage Cause Estimation Using Multiple Images Based on Visual Question Answering
Authors:
Tatsuro Yamane,
Pang-jo Chun,
Ji Dang,
Takayuki Okatani
Abstract:
In this paper, a bridge member damage cause estimation framework is proposed by calculating the image position using Structure from Motion (SfM) and acquiring its information via Visual Question Answering (VQA). For this, a VQA model was developed that uses bridge images for dataset creation and outputs the damage or member name and its existence based on the images and questions. In the developed model, the correct answer rates for questions requiring the member's name and the damage's name were 67.4% and 68.9%, respectively. The correct answer rate for questions requiring a yes/no answer was 99.1%. Based on the developed model, a damage cause estimation method was proposed. In the proposed method, the damage causes are narrowed down by inputting new questions to the VQA model, which are determined based on the surrounding images obtained via SfM and the results of the VQA model. The proposed method was then applied to an actual bridge and shown to be capable of determining damage and estimating its cause. The proposed method could be used to prevent damage causes from being overlooked, and practitioners could determine inspection focus areas, which could contribute to the improvement of maintenance techniques. In the future, it is expected to contribute to infrastructure diagnosis automation.
Submitted 17 February, 2023;
originally announced February 2023.
-
MIMO-DBnet: Multi-channel Input and Multiple Outputs DOA-aware Beamforming Network for Speech Separation
Authors:
Yanjie Fu,
Haoran Yin,
Meng Ge,
Longbiao Wang,
Gaoyan Zhang,
Jianwu Dang,
Chengyun Deng,
Fei Wang
Abstract:
Recently, many deep learning based beamformers have been proposed for multi-channel speech separation. Nevertheless, most of them rely on extra cues known in advance, such as speaker feature, face image or directional information. In this paper, we propose an end-to-end beamforming network for direction guided speech separation given merely the mixture signal, namely MIMO-DBnet. Specifically, we design a multi-channel input and multiple outputs architecture to predict the direction-of-arrival based embeddings and beamforming weights for each source. The precisely estimated directional embedding provides effective spatial discrimination guidance for the neural beamformer to offset the effect of phase wrapping, thus allowing more accurate reconstruction of the two sources' speech signals. Experiments show that our proposed MIMO-DBnet not only achieves a comprehensive decent improvement compared to baseline systems, but also maintains the performance on high frequency bands when phase wrapping occurs.
Submitted 6 December, 2022;
originally announced December 2022.
-
Monolingual Recognizers Fusion for Code-switching Speech Recognition
Authors:
Tongtong Song,
Qiang Xu,
Haoyu Lu,
Longbiao Wang,
Hao Shi,
Yuqin Lin,
Yanbing Yang,
Jianwu Dang
Abstract:
The bi-encoder structure has been intensively investigated in code-switching (CS) automatic speech recognition (ASR). However, most existing methods require that the structures of the two monolingual ASR models (MAMs) be the same, and they use only the encoders of the MAMs. As a result, pre-trained MAMs cannot be used promptly and fully for CS ASR. In this paper, we propose a monolingual recognizers fusion method for CS ASR. It has two stages: the speech awareness (SA) stage and the language fusion (LF) stage. In the SA stage, acoustic features are mapped to two language-specific predictions by two independent MAMs. To keep the MAMs focused on their own language, we further extend the language-aware training strategy for the MAMs. In the LF stage, the BELM fuses two language-specific predictions to get the final prediction. Moreover, we propose a text simulation strategy to simplify the training process of the BELM and reduce reliance on CS data. Experiments on a Mandarin-English corpus show the efficiency of the proposed method. The mix error rate is significantly reduced on the test set after using open-source pre-trained MAMs.
Submitted 2 November, 2022;
originally announced November 2022.
-
Asynchronous RIS-assisted Localization: A Comprehensive Analysis of Fundamental Limits
Authors:
Ziyi Gong,
Liang Wu,
Zaichen Zhang,
Jian Dang,
Yongpeng Wu,
Jiangzhou Wang
Abstract:
The reconfigurable intelligent surface (RIS) has drawn considerable attention for its ability to enhance the performance of not only wireless communication but also indoor localization at low cost. This paper investigates the performance limits of RIS-based near-field localization in the asynchronous scenario, and analyzes the impact of each part of the cascaded channel on the localization performance. The Fisher information matrix (FIM) and the position error bound (PEB) are derived. We also derive the equivalent Fisher information (EFI) for the position-related intermediate parameters. Enabled by the derived EFI, we verify that both the ranging and bearing information of the user can be obtained when the near-field model is considered for the RIS-User equipment (UE) part of the channel, while only the direction of the UE can be inferred in the far-field scenario. This result is well known in the scenario where the curvature of arrival (COA) is directly sensed by a traditional active large-scale array, and we prove that it still holds when the COA is sensed passively by a large RIS. For the base station (BS)-RIS part of the channel, we reveal that this part of the channel determines the type of the gain provided by the BS antenna array. Moreover, in the single-carrier, single-snapshot case, localizing the UE requires both the BS-RIS and the RIS-UE parts of the channel to operate in the near field. We also show that the well-known focusing control scheme for RIS, which maximizes the received SNR, is not always a good choice and may degrade the localization performance in the asynchronous scenario. The simulation results validate the analytic work. The impact of the focusing control scheme on the PEB performance under synchronous and asynchronous conditions is also investigated.
Submitted 26 March, 2023; v1 submitted 19 October, 2022;
originally announced October 2022.
-
VCSE: Time-Domain Visual-Contextual Speaker Extraction Network
Authors:
Junjie Li,
Meng Ge,
Zexu Pan,
Longbiao Wang,
Jianwu Dang
Abstract:
Speaker extraction seeks to extract the target speech in a multi-talker scenario given an auxiliary reference. Such reference can be auditory, i.e., a pre-recorded speech, visual, i.e., lip movements, or contextual, i.e., phonetic sequence. References in different modalities provide distinct and complementary information that could be fused to form top-down attention on the target speaker. Previous studies have introduced visual and contextual modalities in a single model. In this paper, we propose a two-stage time-domain visual-contextual speaker extraction network named VCSE, which incorporates visual and self-enrolled contextual cues stage by stage to take full advantage of every modality. In the first stage, we pre-extract a target speech with visual cues and estimate the underlying phonetic sequence. In the second stage, we refine the pre-extracted target speech with the self-enrolled contextual cues. Experimental results on the real-world Lip Reading Sentences 3 (LRS3) database demonstrate that our proposed VCSE network consistently outperforms other state-of-the-art baselines.
Submitted 9 October, 2022;
originally announced October 2022.
-
Deep Spectro-temporal Artifacts for Detecting Synthesized Speech
Authors:
Xiaohui Liu,
Meng Liu,
Lin Zhang,
Linjuan Zhang,
Chang Zeng,
Kai Li,
Nan Li,
Kong Aik Lee,
Longbiao Wang,
Jianwu Dang
Abstract:
The Audio Deep Synthesis Detection (ADD) Challenge has been held to detect generated human-like speech. With our submitted system, this paper provides an overall assessment of track 1 (Low-quality Fake Audio Detection) and track 2 (Partially Fake Audio Detection). In this paper, spectro-temporal artifacts were detected using raw temporal signals, spectral features, as well as deep embedding features. To address track 1, low-quality data augmentation, domain adaptation via finetuning, and various complementary feature information fusion were aggregated in our system. Furthermore, we analyzed the clustering characteristics of subsystems with different features by visualization method and explained the effectiveness of our proposed greedy fusion strategy. As for track 2, frame transition and smoothing were detected using self-supervised learning structure to capture the manipulation of PF attacks in the time domain. We ranked 4th and 5th in track 1 and track 2, respectively.
Submitted 11 October, 2022;
originally announced October 2022.
-
Single charge control of localized excitons in heterostructures with ferroelectric thin films and two-dimensional transition metal dichalcogenides
Authors:
Danjie Dai,
Xinyan Wang,
Jingnan Yang,
Jianchen Dang,
Yu Yuan,
Bowen Fu,
Xin Xie,
Longlong Yang,
Shan Xiao,
Shushu Shi,
Sai Yan,
Rui Zhu,
Zhanchun Zuo,
Can Wang,
Kuijuan Jin,
Qihuang Gong,
Xiulai Xu
Abstract:
Single charge control of localized excitons (LXs) in two-dimensional transition metal dichalcogenides (TMDCs) is crucial for potential applications in quantum information processing and storage. However, the traditional electrostatic doping method, which applies metallic gates onto TMDCs, may cause inhomogeneous charge distribution, optical quenching, and energy loss. Here, by locally controlling the ferroelectric polarization of the ferroelectric thin film BiFeO3 (BFO) with a scanning probe, we can deterministically manipulate the doping type of monolayer WSe2 to achieve p-type and n-type doping. This nonvolatile approach can maintain the doping type and hold the localized excitonic charges for a long time without applied voltage. Our work demonstrates that the ferroelectric polarization of BFO can control the charges of LXs effectively. Neutral and charged LXs have been observed in different ferroelectric polarization regions, confirmed by magneto-optical measurements. A high degree of circular polarization, about 90%, of the photon emission from these quantum emitters has been achieved in high magnetic fields. Controlling the single charge of LXs in a non-volatile way shows great potential for deterministic photon emission with desired charge states for photonic long-term memory.
Submitted 30 September, 2022;
originally announced September 2022.
-
MIMO-DoAnet: Multi-channel Input and Multiple Outputs DoA Network with Unknown Number of Sound Sources
Authors:
Haoran Yin,
Meng Ge,
Yanjie Fu,
Gaoyan Zhang,
Longbiao Wang,
Lei Zhang,
Lin Qiu,
Jianwu Dang
Abstract:
Recent neural network based Direction of Arrival (DoA) estimation algorithms have performed well in scenarios with an unknown number of sound sources. These algorithms usually map the multi-channel audio input to a single output (i.e., the overall spatial pseudo-spectrum (SPS) of all sources), an approach called MISO. However, such MISO algorithms strongly depend on an empirical threshold setting and on the assumption that the angles between the sound sources are greater than a fixed angle. To address these limitations, we propose a novel multi-channel input and multiple outputs DoA network called MIMO-DoAnet. Unlike the general MISO algorithms, MIMO-DoAnet predicts the SPS coding of each sound source with the help of the informative spatial covariance matrix. By doing so, the threshold task of detecting the number of sound sources becomes an easier task of detecting whether there is a sound source in each output, and the serious interaction between sound sources disappears during the inference stage. Experimental results show that MIMO-DoAnet achieves F1 score improvements of relative 18.6% and absolute 13.3%, and of relative 34.4% and absolute 20.2%, compared with the MISO baseline system in 3- and 4-source scenes, respectively. The results also demonstrate that MIMO-DoAnet alleviates the threshold setting problem and solves the angle assumption problem effectively.
Submitted 16 November, 2022; v1 submitted 15 July, 2022;
originally announced July 2022.
-
Language-specific Characteristic Assistance for Code-switching Speech Recognition
Authors:
Tongtong Song,
Qiang Xu,
Meng Ge,
Longbiao Wang,
Hao Shi,
Yongjie Lv,
Yuqin Lin,
Jianwu Dang
Abstract:
Dual-encoder structure successfully utilizes two language-specific encoders (LSEs) for code-switching speech recognition. Because LSEs are initialized by two pre-trained language-specific models (LSMs), the dual-encoder structure can exploit sufficient monolingual data and capture the individual language attributes. However, most existing methods have no language constraints on LSEs and underutilize language-specific knowledge of LSMs. In this paper, we propose a language-specific characteristic assistance (LSCA) method to mitigate the above problems. Specifically, during training, we introduce two language-specific losses as language constraints and generate corresponding language-specific targets for them. During decoding, we take the decoding abilities of LSMs into account by combining the output probabilities of two LSMs and the mixture model to obtain the final predictions. Experiments show that either the training or decoding method of LSCA can improve the model's performance. Furthermore, the best result can obtain up to 15.4% relative error reduction on the code-switching test set by combining the training and decoding methods of LSCA. Moreover, the system can process code-switching speech recognition tasks well without extra shared parameters or even retraining based on two pre-trained LSMs by using our method.
Submitted 11 July, 2022; v1 submitted 29 June, 2022;
originally announced June 2022.
-
Iterative Sound Source Localization for Unknown Number of Sources
Authors:
Yanjie Fu,
Meng Ge,
Haoran Yin,
Xinyuan Qian,
Longbiao Wang,
Gaoyan Zhang,
Jianwu Dang
Abstract:
Sound source localization aims to seek the direction of arrival (DOA) of all sound sources from the observed multi-channel audio. For the practical problem of unknown number of sources, existing localization algorithms attempt to predict a likelihood-based coding (i.e., spatial spectrum) and employ a pre-determined threshold to detect the source number and corresponding DOA value. However, these threshold-based algorithms are not stable since they are limited by the careful choice of threshold. To address this problem, we propose an iterative sound source localization approach called ISSL, which can iteratively extract each source's DOA without threshold until the termination criterion is met. Unlike threshold-based algorithms, ISSL designs an active source detector network based on binary classifier to accept residual spatial spectrum and decide whether to stop the iteration. By doing so, our ISSL can deal with an arbitrary number of sources, even more than the number of sources seen during the training stage. The experimental results show that our ISSL achieves significant performance improvements in both DOA estimation and source number detection compared with the existing threshold-based algorithms.
Submitted 24 June, 2022;
originally announced June 2022.
-
Arecibo and FAST Timing Follow-up of twelve Millisecond Pulsars Discovered in Commensal Radio Astronomy FAST Survey
Authors:
C. C. Miao,
W. W. Zhu,
D. Li,
P. C. C. Freire,
J. R. Niu,
P. Wang,
J. P. Yuan,
M. Y. Xue,
A. D. Cameron,
D. J. Champion,
M. Cruces,
Y. T. Chen,
M. M. Chi,
X. F. Cheng,
S. J. Dang,
M. F. Ding,
Y. Feng,
Z. Y. Gan,
G. Hobbs,
M. Kramer,
Z. J. Liu,
Y. X. Li,
Z. K. Luo,
X. L. Miao,
L. Q. Meng
, et al. (24 additional authors not shown)
Abstract:
We report the phase-connected timing ephemeris, polarization pulse profiles, Faraday rotation measurements, and Rotating-Vector-Model (RVM) fitting results of twelve millisecond pulsars (MSPs) discovered with the Five-hundred-meter Aperture Spherical radio Telescope (FAST) in the Commensal radio Astronomy FAST survey (CRAFTS). The timing campaigns were carried out with FAST and Arecibo over three years. Eleven of the twelve pulsars are in neutron star - white dwarf binary systems, with orbital periods between 2.4 and 100 d. Ten of them have spin periods, companion masses, and orbital eccentricities that are consistent with the theoretical expectations for MSP - Helium white dwarf (He WD) systems. The last binary pulsar (PSR J1912$-$0952) has a significantly smaller spin frequency and a smaller companion mass; the latter could be caused by a low orbital inclination for the system. Its orbital period of 29 days is well within the range of orbital periods where some MSP - He WD systems have shown anomalous eccentricities. However, the eccentricity of PSR J1912$-$0952 is typical of what one finds for the remaining MSP - He WD systems.
Submitted 9 May, 2022;
originally announced May 2022.
-
Fast and Arbitrary Beam Pattern Design for RIS-Assisted Terahertz Wireless Communication
Authors:
Jian Dang,
Zaichen Zhang,
Yewei Li,
Liang Wu,
Bingcheng Zhu,
Lei Wang
Abstract:
Reconfigurable intelligent surfaces (RISs) can assist terahertz wireless communication by restoring fragile line-of-sight links and facilitating beam steering. Arbitrary reflection beam patterns are desired to meet diverse requirements in different applications. This paper establishes a relationship between RIS beam pattern design and two-dimensional finite impulse response filter design, and proposes a fast non-iterative algorithm to solve the problem. Simulations show that the proposed method outperforms the baseline method. Hence, it represents a promising solution for fast and arbitrary beam pattern design in RIS-assisted terahertz wireless communication.
Submitted 5 May, 2022;
originally announced May 2022.
-
Emission Variation of a Long-period Pulsar Discovered by the Five-hundred-meter Aperture Spherical Radio Telescope (FAST)
Authors:
H. M. Tedila,
R. Yuen,
N. Wang,
J. P. Yuan,
Z. G. Wen,
W. M. Yan,
S. Q. Wang,
S. J. Dang,
D. Li,
P. Wang,
W. W. Zhu,
J. R. Niu,
C. C. Miao,
M. Y. Xue,
L. Zhang,
Z. Y. Tu,
R. Rejep,
J. T. Xie,
FAST Collaboration
Abstract:
We report on the variation in the single-pulse emission from PSR J1900+4221 (CRAFTS 19C10) observed at a frequency centered at 1.25 GHz using the Five-hundred-meter Aperture Spherical radio Telescope. The integrated pulse profile shows two distinct components, referred to here as the leading and trailing components, with the latter component also containing a third weak component. The single-pulse sequence reveals different emissions appearing as nulling, regular, and bright pulses, each with a particular abundance and duration distribution. There also exist pulses that follow a log-normal distribution, suggesting the possibility of another emission in which the pulsar is radiating weakly. Changes in the profile shape are seen across different emissions. We examine the emission variations in the leading and trailing components collectively and separately, and find moderate correlation between the two components. The inclination angle is estimated to be about 7° based on pulse-width, and we discuss that nulling in this pulsar does not seem to show correlation with age and rotation period.
Submitted 3 May, 2022;
originally announced May 2022.
-
Heterogeneous Graph Neural Networks using Self-supervised Reciprocally Contrastive Learning
Authors:
Cuiying Huo,
Dongxiao He,
Yawen Li,
Di Jin,
Jianwu Dang,
Weixiong Zhang,
Witold Pedrycz,
Lingfei Wu
Abstract:
The heterogeneous graph neural network (HGNN) is a very popular technique for the modeling and analysis of heterogeneous graphs. Most existing HGNN-based approaches are supervised or semi-supervised learning methods requiring graphs to be annotated, which is costly and time-consuming. Self-supervised contrastive learning has been proposed to address the problem of requiring annotated data by mining intrinsic information hidden within the given data. However, the existing contrastive learning methods are inadequate for heterogeneous graphs because they construct contrastive views based only on data perturbation or pre-defined structural properties (e.g., meta-path) in graph data, while ignoring the noise that may exist in both node attributes and graph topologies. We develop for the first time a novel and robust heterogeneous graph contrastive learning approach, namely HGCL, which introduces two views guided respectively by node attributes and graph topologies, and integrates and enhances them through a reciprocally contrastive mechanism to better model heterogeneous graphs. In this new approach, we adopt distinct but most suitable attribute and topology fusion mechanisms in the two views, which are conducive to mining relevant information in attributes and topologies separately. We further use both attribute similarity and topological correlation to construct high-quality contrastive samples. Extensive experiments on three large real-world heterogeneous graphs demonstrate the superiority and robustness of HGCL over state-of-the-art methods.
Submitted 16 November, 2023; v1 submitted 30 April, 2022;
originally announced May 2022.
-
Detection of strong scattering close to the eclipse region of PSR B1957+20
Authors:
J. T. Bai,
S. Dai,
Q. J. Zhi,
W. A. Coles,
D. Li,
W. W. Zhu,
G. Hobbs,
G. J. Qiao,
N. Wang,
J. P. Yuan,
M. D. Filipovic,
J. B. Wang,
Z. C. Pan,
L. H. Shang,
S. J. Dang,
S. Q. Wang,
C. C. Miao
Abstract:
We present the first measurement of pulse scattering close to the eclipse region of PSR B1957+20, which is in a compact binary system with a low-mass star. We measured pulse scattering time-scales up to 0.2 ms close to the eclipse and showed that it scales with the dispersion measure (DM) excess roughly as $\tau \propto \Delta{\rm DM}^{2}$. Our observations provide the first evidence of strong scattering due to multi-path propagation effects in the eclipsing material. We show that Kolmogorov turbulence in the eclipsing material with an inner scale of $\sim100$ m and an outer scale of the size of the eclipse region can naturally explain the observation. Our results show that the eclipsing material in such systems can be highly turbulent and suggest that scattering is one of the main eclipsing mechanisms at around 1.4 GHz.
Submitted 28 March, 2022;
originally announced March 2022.
-
TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding
Authors:
Ruiteng Zhang,
Jianguo Wei,
Xugang Lu,
Wenhuan Lu,
Di Jin,
Junhai Xu,
Lin Zhang,
Yantao Ji,
Jianwu Dang
Abstract:
Speaker embedding is an important front-end module to explore discriminative speaker features for many speech applications where speaker information is needed. Current SOTA backbone networks for speaker embedding are designed to aggregate multi-scale features from an utterance with multi-branch network architectures for speaker representation. However, naively adding many branches of multi-scale features with the simple fully convolutional operation cannot efficiently improve the performance due to the rapid increase of model parameters and computational complexity. Therefore, in most current state-of-the-art network architectures, only a few branches corresponding to a limited number of temporal scales can be designed for speaker embeddings. To address this problem, in this paper, we propose an effective temporal multi-scale (TMS) model where multi-scale branches can be efficiently designed in a speaker embedding network almost without increasing computational costs. The new model is based on the conventional TDNN, where the network architecture is separated into two modeling operators: a channel-modeling operator and a temporal multi-branch modeling operator. Adding temporal multi-scale in the temporal multi-branch operator requires only a slight increase in the number of parameters, and thus saves computational budget for adding more branches with large temporal scales. Moreover, for the inference stage, we further developed a systemic re-parameterization method to convert the TMS-based model into a single-path-based topology in order to increase inference speed. We investigated the performance of the new TMS method for automatic speaker verification (ASV) on in-domain and out-of-domain conditions. Results show that the TMS-based model obtained a significant increase in performance over the SOTA ASV models while also having a faster inference speed.
Submitted 17 March, 2022;
originally announced March 2022.
-
Strong light-matter interactions between gap plasmons and two-dimensional excitons at ambient condition in a deterministic way
Authors:
Longlong Yang,
Xin Xie,
Jingnan Yang,
Mengfei Xue,
Shiyao Wu,
Shan Xiao,
Feilong Song,
Jianchen Dang,
Sibai Sun,
Zhanchun Zuo,
Jianing Chen,
Yuan Huang,
Xingjiang Zhou,
Kuijuan Jin,
Can Wang,
Xiulai Xu
Abstract:
Strong exciton-plasmon interaction between layered two-dimensional (2D) semiconductors and gap plasmons shows great potential for implementing cavity quantum electrodynamics in ambient conditions. However, achieving a robust plasmon-exciton coupling with a nanocavity is still very challenging, because the layer area is usually small with conventional approaches. Here, we report on a robust strong exciton-plasmon coupling between the gap mode of a bowtie and the excitons in MoS$_2$ layers, using gold-assisted mechanical exfoliation and nondestructive wet transfer techniques to obtain large-area layers. Benefiting from the ultrasmall mode volume and strong in-plane field, the estimated effective exciton number contributing to the coupling is largely reduced. With a corrected exciton transition dipole moment, the extracted exciton numbers are 40 for the monolayer and 48 for 8 layers. Our work paves the way to realizing strong coupling with 2D materials with few excitons at room temperature.
Submitted 2 March, 2022;
originally announced March 2022.
-
L-SpEx: Localized Target Speaker Extraction
Authors:
Meng Ge,
Chenglin Xu,
Longbiao Wang,
Eng Siong Chng,
Jianwu Dang,
Haizhou Li
Abstract:
Speaker extraction aims to extract the target speaker's voice from a multi-talker speech mixture given an auxiliary reference utterance. Recent studies show that speaker extraction benefits from the location or direction of the target speaker. However, these studies assume that the target speaker's location is known in advance or detected by an extra visual cue, e.g., face image or video. In this paper, we propose an end-to-end localized target speaker extraction on pure speech cues, that is called L-SpEx. Specifically, we design a speaker localizer driven by the target speaker's embedding to extract the spatial features, including direction-of-arrival (DOA) of the target speaker and beamforming output. Then, the spatial cues and target speaker's embedding are both used to form a top-down auditory attention to the target speaker. Experiments on the multi-channel reverberant dataset called MC-Libri2Mix show that our L-SpEx approach significantly outperforms the baseline system.
Submitted 21 February, 2022;
originally announced February 2022.
-
A New High Energy Efficiency Scheme Based on Two-Dimension Resource Blocks in Wireless Communication Systems
Authors:
Kang Liu,
Zaichen Zhang,
Jian Dang,
Liang Wu,
Bingchen Zhu,
Lei Wang,
Chuan Zhang
Abstract:
Energy efficiency (EE) plays a key role in future wireless communication networks, and high EE performance is easy to achieve in the low SNR regime. In this paper, a new high EE scheme is proposed for a MIMO wireless communication system working in the low SNR regime by using two-dimension resource allocation. First, we define the high EE area based on the relationship between the transmission power and the SNR. To meet the constraint of the high EE area, both the frequency and space dimensions are needed. In addition to analysing them separately, we consider the frequency and space dimensions as a unit and propose a two-dimension scheme. Furthermore, since communication in the high EE area may degrade the communication quality, we add a quality-of-service (QoS) constraint and derive the corresponding EE performance based on the effective capacity. We also derive an approximate expression to simplify the complex EE performance. Finally, our numerical results demonstrate the effectiveness of the proposed scheme.
Submitted 27 January, 2022;
originally announced January 2022.