-
Unifying Sequences, Structures, and Descriptions for Any-to-Any Protein Generation with the Large Multimodal Model HelixProtX
Authors:
Zhiyuan Chen,
Tianhao Chen,
Chenggang Xie,
Yang Xue,
Xiaonan Zhang,
Jingbo Zhou,
Xiaomin Fang
Abstract:
Proteins are fundamental components of biological systems and can be represented through various modalities, including sequences, structures, and textual descriptions. Despite the advances in deep learning and scientific large language models (LLMs) for protein research, current methodologies predominantly focus on limited specialized tasks -- often predicting one protein modality from another. These approaches restrict the understanding and generation of multimodal protein data. In contrast, large multimodal models have demonstrated potential capabilities in generating any-to-any content like text, images, and videos, thus enriching user interactions across various domains. Integrating these multimodal model technologies into protein research offers significant promise by potentially transforming how proteins are studied. To this end, we introduce HelixProtX, a system built upon the large multimodal model, aiming to offer a comprehensive solution to protein research by supporting any-to-any protein modality generation. Unlike existing methods, it allows for the transformation of any input protein modality into any desired protein modality. The experimental results affirm the advanced capabilities of HelixProtX, not only in generating functional descriptions from amino acid sequences but also in executing critical tasks such as designing protein sequences and structures from textual descriptions. Preliminary findings indicate that HelixProtX consistently achieves superior accuracy across a range of protein-related tasks, outperforming existing state-of-the-art models. By integrating multimodal large models into protein research, HelixProtX opens new avenues for understanding protein biology, thereby promising to accelerate scientific discovery.
Submitted 12 July, 2024;
originally announced July 2024.
-
Movie Recommendation with Poster Attention via Multi-modal Transformer Feature Fusion
Authors:
Linhan Xia,
Yicheng Yang,
Ziou Chen,
Zheng Yang,
Shengxin Zhu
Abstract:
Pre-trained models learn general representations from large datasets and can be fine-tuned for specific tasks, significantly reducing training time. Pre-trained models such as generative pretrained transformers (GPT), bidirectional encoder representations from transformers (BERT), and vision transformers (ViT) have become a cornerstone of current research in machine learning. This study proposes a multi-modal movie recommendation system that extracts features from each movie's well-designed poster and from the narrative text description of the movie. The system uses the BERT model to extract information from the text modality, the ViT model to extract information from the poster/image modality, and the Transformer architecture to fuse the features of all modalities and predict users' preferences. Integrating pre-trained foundation models with smaller datasets in downstream applications captures multi-modal content features more comprehensively, thereby providing more accurate recommendations. The efficiency of the proof-of-concept model is verified on the standard MovieLens 100K and 1M benchmark datasets. The prediction accuracy of user ratings improves over the baseline algorithm, demonstrating the potential of this cross-modal algorithm for movie or video recommendation.
Submitted 12 July, 2024;
originally announced July 2024.
-
Large Language Models as Biomedical Hypothesis Generators: A Comprehensive Evaluation
Authors:
Biqing Qi,
Kaiyan Zhang,
Kai Tian,
Haoxiang Li,
Zhang-Ren Chen,
Sihang Zeng,
Ermo Hua,
Hu Jinfang,
Bowen Zhou
Abstract:
The rapid growth of biomedical knowledge has outpaced our ability to efficiently extract insights and generate novel hypotheses. Large language models (LLMs) have emerged as a promising tool to revolutionize knowledge interaction and potentially accelerate biomedical discovery. In this paper, we present a comprehensive evaluation of LLMs as biomedical hypothesis generators. We construct a dataset of background-hypothesis pairs from biomedical literature, carefully partitioned into training, seen, and unseen test sets based on publication date to mitigate data contamination. Using this dataset, we assess the hypothesis generation capabilities of top-tier instructed models in zero-shot, few-shot, and fine-tuning settings. To enhance the exploration of uncertainty, a crucial aspect of scientific discovery, we incorporate tool use and multi-agent interactions in our evaluation framework. Furthermore, we propose four novel metrics grounded in extensive literature review to evaluate the quality of generated hypotheses, considering both LLM-based and human assessments. Our experiments yield two key findings: 1) LLMs can generate novel and validated hypotheses, even when tested on literature unseen during training, and 2) Increasing uncertainty through multi-agent interactions and tool use can facilitate diverse candidate generation and improve zero-shot hypothesis generation performance. However, we also observe that the integration of additional knowledge through few-shot learning and tool use may not always lead to performance gains, highlighting the need for careful consideration of the type and scope of external knowledge incorporated. These findings underscore the potential of LLMs as powerful aids in biomedical hypothesis generation and provide valuable insights to guide further research in this area.
Submitted 11 July, 2024;
originally announced July 2024.
-
Natural language is not enough: Benchmarking multi-modal generative AI for Verilog generation
Authors:
Kaiyan Chang,
Zhirong Chen,
Yunhao Zhou,
Wenlong Zhu,
kun wang,
Haobo Xu,
Cangyuan Li,
Mengdi Wang,
Shengwen Liang,
Huawei Li,
Yinhe Han,
Ying Wang
Abstract:
Natural language interfaces have exhibited considerable potential in the automation of Verilog generation derived from high-level specifications through the utilization of large language models, garnering significant attention. Nevertheless, this paper elucidates that visual representations contribute essential contextual information critical to design intent for hardware architectures possessing spatial complexity, potentially surpassing the efficacy of natural-language-only inputs. Expanding upon this premise, our paper introduces an open-source benchmark for multi-modal generative models tailored for Verilog synthesis from visual-linguistic inputs, addressing both singular and complex modules. Additionally, we introduce an open-source visual and natural language Verilog query language framework to facilitate efficient and user-friendly multi-modal queries. To evaluate the performance of the proposed multi-modal hardware generative AI in Verilog generation tasks, we compare it with a popular method that relies solely on natural language. Our results demonstrate a significant accuracy improvement in the multi-modal generated Verilog compared to queries based solely on natural language. We hope to reveal a new approach to hardware design in the large-hardware-design-model era, thereby fostering a more diversified and productive approach to hardware design.
Submitted 11 July, 2024;
originally announced July 2024.
-
EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions
Authors:
Zhiyuan Chen,
Jiajiong Cao,
Zhiquan Chen,
Yuming Li,
Chenguang Ma
Abstract:
The area of portrait image animation, propelled by audio input, has witnessed notable progress in the generation of lifelike and dynamic portraits. Conventional methods are limited to using either audio or facial key points to drive images into videos. While they can yield satisfactory results, certain issues exist. For instance, methods driven solely by audio can be unstable at times due to the relatively weaker audio signal, while methods driven exclusively by facial key points, although more stable in driving, can produce unnatural outcomes due to the excessive control of key point information. To address these challenges, we introduce a novel approach named EchoMimic. EchoMimic is concurrently trained using both audio and facial landmarks. Through a novel training strategy, EchoMimic is capable of generating portrait videos not only from audio or facial landmarks individually, but also from a combination of audio and selected facial landmarks. EchoMimic has been comprehensively compared with alternative algorithms across various public datasets and our collected dataset, showcasing superior performance in both quantitative and qualitative evaluations. Additional visualizations and the source code are available on the EchoMimic project page.
Submitted 11 July, 2024; v1 submitted 10 July, 2024;
originally announced July 2024.
-
Asynchronous measurement-device-independent quantum digital signatures
Authors:
Jing-Wei Bian,
Bing-Hong Li,
Yuan-Mei Xie,
Hua-Lei Yin,
Zeng-Bing Chen
Abstract:
Quantum digital signatures (QDSs), which distribute and measure quantum states by key generation protocols and then sign messages via classical data processing, are a key area of interest in quantum cryptography. However, the practical implementation of a QDS network faces many challenges, including complex technical requirements for interference, linear channel loss in quantum state transmission, and potential side-channel attacks on detectors. Here, we propose an asynchronous measurement-device-independent (MDI) QDS protocol with an asynchronous two-photon interference strategy and a one-time universal hashing method. The two-photon interference approach protects our protocol against all detector side-channel attacks and eases experimental implementation, while the asynchronous strategy effectively reduces the equivalent channel loss to its square root. Compared to previous MDI-QDS schemes, our protocol shows several orders of magnitude of performance improvement and a doubling of transmission distance when processing multi-bit messages. Our findings present an efficient and practical MDI-QDS scheme, paving the way for large-scale data processing with non-repudiation in quantum networks.
Submitted 10 July, 2024;
originally announced July 2024.
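The abstract above states that the asynchronous strategy reduces the equivalent channel loss to its square root. The following is a minimal numeric sketch of why square-root scaling is commonly associated with a doubled reach; it is not the authors' protocol or rate formula. The attenuation value of 0.2 dB/km and the function names are assumptions made for illustration only.

```python
import math

# Assumed fiber attenuation in dB/km; the abstract does not state a value.
ALPHA_DB_PER_KM = 0.2


def transmittance(length_km, alpha_db_per_km=ALPHA_DB_PER_KM):
    """Channel transmittance eta = 10^(-alpha * L / 10) for a fiber of length L."""
    return 10 ** (-alpha_db_per_km * length_km / 10)


def effective_transmittance(length_km, scaling):
    """Effective transmittance under 'linear' (eta) or 'sqrt' (sqrt(eta)) scaling."""
    eta = transmittance(length_km)
    if scaling == "linear":
        return eta
    if scaling == "sqrt":
        return math.sqrt(eta)
    raise ValueError("scaling must be 'linear' or 'sqrt'")


# Compare the two scalings at a few illustrative distances.
for length_km in (50, 100, 200):
    linear = effective_transmittance(length_km, "linear")
    root = effective_transmittance(length_km, "sqrt")
    print(f"L = {length_km:3d} km  linear: {linear:.2e}  sqrt: {root:.2e}")
```

Under these assumptions, the square-root scaling at distance 2L equals the linear scaling at distance L, which is one way to read the reported doubling of transmission distance.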
-
Source Code Summarization in the Era of Large Language Models
Authors:
Weisong Sun,
Yun Miao,
Yuekang Li,
Hongyu Zhang,
Chunrong Fang,
Yi Liu,
Gelei Deng,
Yang Liu,
Zhenyu Chen
Abstract:
To support software developers in understanding and maintaining programs, various automatic (source) code summarization techniques have been proposed to generate a concise natural language summary (i.e., comment) for a given code snippet. Recently, the emergence of large language models (LLMs) has led to a great boost in the performance of code-related tasks. In this paper, we undertake a systematic and comprehensive study on code summarization in the era of LLMs, which covers multiple aspects involved in the workflow of LLM-based code summarization. Specifically, we begin by examining prevalent automated evaluation methods for assessing the quality of summaries generated by LLMs and find that the results of the GPT-4 evaluation method are most closely aligned with human evaluation. Then, we explore the effectiveness of five prompting techniques (zero-shot, few-shot, chain-of-thought, critique, and expert) in adapting LLMs to code summarization tasks. Contrary to expectations, advanced prompting techniques may not outperform simple zero-shot prompting. Next, we investigate the impact of LLMs' model settings (including top_p and temperature parameters) on the quality of generated summaries. We find the impact of the two parameters on summary quality varies by the base LLM and programming language, but their impacts are similar. Moreover, we canvass LLMs' abilities to summarize code snippets in distinct types of programming languages. The results reveal that LLMs perform suboptimally when summarizing code written in logic programming languages compared to other language types. Finally, we unexpectedly find that CodeLlama-Instruct with 7B parameters can outperform advanced GPT-4 in generating summaries describing code implementation details and asserting code properties. We hope that our findings can provide a comprehensive understanding of code summarization in the era of LLMs.
Submitted 9 July, 2024;
originally announced July 2024.
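The abstract above studies how the temperature and top_p settings affect generated summaries. As background, here is a minimal sketch of how these two parameters are commonly applied to a next-token distribution (temperature-scaled softmax followed by nucleus filtering). It illustrates the general technique only, not the configuration or code used in the paper; the function names and the example logits are hypothetical.

```python
import math


def apply_temperature(logits, temperature):
    """Return softmax(logits / temperature); lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


# Hypothetical logits for a three-token vocabulary.
example_logits = [2.0, 1.0, 0.0]
sharp = apply_temperature(example_logits, 0.5)
flat = apply_temperature(example_logits, 1.0)
nucleus = top_p_filter([0.5, 0.3, 0.2], 0.7)
print(sharp, flat, nucleus)
```

Both parameters narrow or widen the set of tokens that can plausibly be sampled, which is consistent with the abstract's observation that their impacts are similar.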
-
Topological Offsets
Authors:
Daniel Zint,
Zhouyuan Chen,
Yifei Zhu,
Denis Zorin,
Teseo Schneider,
Daniele Panozzo
Abstract:
We introduce Topological Offsets, a novel approach to generate manifold and self-intersection-free offset surfaces that are topologically equivalent to an offset infinitesimally close to the surface. Our approach, by construction, creates a manifold, watertight, and self-intersection-free offset surface strictly enclosing the input, while making a best effort to move it to a prescribed distance from the input. Unlike existing approaches, we embed the input in a volumetric mesh, and insert a topological offset around the mesh with purely combinatorial operations. The topological offset is then inflated/deflated to match the user-prescribed distance, while enforcing that no intersections or non-manifold configurations are introduced. We evaluate the effectiveness and robustness of our approach on the non-intersecting subset of Thingi10k, and show that topological offsets are beneficial in multiple graphics applications, including (1) converting non-manifold surfaces to manifold ones, (2) creation of nested cages/layered offsets, and (3) reliably computing finite offsets.
Submitted 10 July, 2024;
originally announced July 2024.
-
Boosting Medical Image Synthesis via Registration-guided Consistency and Disentanglement Learning
Authors:
Chuanpu Li,
Zeli Chen,
Yiwen Zhang,
Liming Zhong,
Wei Yang
Abstract:
Medical image synthesis remains challenging due to misalignment noise during training. Existing methods have attempted to address this challenge by incorporating a registration-guided module. However, these methods tend to overlook the task-specific constraints on the synthetic and registration modules, which may cause the synthetic module to still generate spatially aligned images with misaligned target images during training, regardless of the registration module's function. Therefore, this paper proposes registration-guided consistency and incorporates disentanglement learning for medical image synthesis. The proposed registration-guided consistency architecture fosters task-specificity within the synthetic and registration modules by applying identical deformation fields before and after synthesis, while enforcing output consistency through an alignment loss. Moreover, the synthetic module is designed to possess the capability of disentangling anatomical structures and specific styles across various modalities. An anatomy consistency loss is introduced to further compel the synthetic module to preserve geometrical integrity within latent spaces. Experiments conducted on both an in-house abdominal CECT-CT dataset and a publicly available pelvic MR-CT dataset have demonstrated the superiority of the proposed method.
Submitted 10 July, 2024;
originally announced July 2024.
-
Study of the decay and production properties of $D_{s1}(2536)$ and $D_{s2}^*(2573)$
Authors:
M. Ablikim,
M. N. Achasov,
P. Adlarson,
O. Afedulidis,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
I. Balossino,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere,
A. Brueggemann
, et al. (645 additional authors not shown)
Abstract:
The $e^+e^-\rightarrow D_s^+D_{s1}(2536)^-$ and $e^+e^-\rightarrow D_s^+D^*_{s2}(2573)^-$ processes are studied using data samples collected with the BESIII detector at center-of-mass energies from 4.530 to 4.946~GeV. The absolute branching fractions of $D_{s1}(2536)^- \rightarrow \bar{D}^{*0}K^-$ and $D_{s2}^*(2573)^- \rightarrow \bar{D}^0K^-$ are measured for the first time to be $(35.9\pm 4.8\pm 3.5)\%$ and $(37.4\pm 3.1\pm 4.6)\%$, respectively. The measurements are in tension with predictions based on the assumption that the $D_{s1}(2536)$ and $D_{s2}^*(2573)$ are dominated by a bare $c\bar{s}$ component. The $e^+e^-\rightarrow D_s^+D_{s1}(2536)^-$ and $e^+e^-\rightarrow D_s^+D^*_{s2}(2573)^-$ cross sections are measured, and a resonant structure at around 4.6~GeV with a width of 50~MeV is observed for the first time with a statistical significance of $15\sigma$ in the $e^+e^-\rightarrow D_s^+D^*_{s2}(2573)^-$ process. It could be the $Y(4626)$ found by the Belle collaboration in the $D_s^+D_{s1}(2536)^{-}$ final state, since they have similar masses and widths. There is also evidence for a structure at around 4.75~GeV in both processes.
Submitted 10 July, 2024;
originally announced July 2024.
-
The neutron star mass, distance, and inclination from precision timing of the brilliant millisecond pulsar J0437$-$4715
Authors:
Daniel J. Reardon,
Matthew Bailes,
Ryan M. Shannon,
Chris Flynn,
Jacob Askew,
N. D. Ramesh Bhat,
Zu-Cheng Chen,
Małgorzata Curyło,
Yi Feng,
George B. Hobbs,
Agastya Kapur,
Matthew Kerr,
Xiaojin Liu,
Richard N. Manchester,
Rami Mandow,
Saurav Mishra,
Christopher J. Russell,
Mohsen Shamohammadi,
Lei Zhang,
Andrew Zic
Abstract:
The observation of neutron stars enables the otherwise impossible study of fundamental physical processes. Timing of binary radio pulsars is particularly powerful, as it enables precise characterization of their (three-dimensional) positions and orbits. PSR J0437$-$4715 is an important millisecond pulsar for timing array experiments and is also a primary target for the Neutron Star Interior Composition ExploreR (NICER). The main aim of the NICER mission is to constrain the neutron star equation of state by inferring the compactness ($M_p/R$) of the star. Direct measurements of the mass $M_p$ from pulsar timing therefore substantially improve constraints on the radius $R$, and the equation of state. Here we use observations spanning 26 years from Murriyang, the 64-m Parkes radio telescope, to improve the timing model for this pulsar. Among the new precise measurements are the pulsar mass $M_p=1.418\pm 0.044$ M$_{\odot}$, distance $D=156.96 \pm 0.11$ pc, and orbital inclination angle $i=137.506 \pm 0.016^\circ$, which can be used to inform the X-ray pulse profile models inferred from NICER observations. We demonstrate that these results are consistent between multiple data sets from the Parkes Pulsar Timing Array (PPTA), each modelled with different noise assumptions. Using the longest available PPTA data set, we measure an apparent second derivative of the pulsar spin frequency and discuss how this can be explained either by kinematic effects due to the proper motion and radial velocity of the pulsar, or excess low-frequency noise such as a gravitational-wave background.
Submitted 9 July, 2024;
originally announced July 2024.
-
Realization of Conditional Operations through Transition Pathway Engineering
Authors:
Sheng Zhang,
Peng Duan,
Yun-Jie Wang,
Tian-Le Wang,
Peng Wang,
Ren-Ze Zhao,
Xiao-Yan Yang,
Ze-An Zhao,
Liang-Liang Guo,
Yong Chen,
Hai-Feng Zhang,
Lei Du,
Hao-Ran Tao,
Zhi-Fei Li,
Yuan Wu,
Zhi-Long Jia,
Wei-Cheng Kong,
Zhao-Yun Chen,
Yu-Chun Wu,
Guo-Ping Guo
Abstract:
In the NISQ era, achieving large-scale quantum computing demands compact circuits to mitigate decoherence and gate error accumulation. Quantum operations with diverse degrees of freedom hold promise for circuit compression, but conventional approaches encounter challenges in simultaneously adjusting multiple parameters. Here, we propose a transition composite gate (TCG) scheme grounded in state-selective transition path engineering, enabling more expressive conditional operations. We experimentally validate a controlled unitary (CU) gate as an example, with independent and continuous parameters. By adjusting the parameters of the $\rm X^{12}$ gate, we obtain the CU family with fidelities ranging from 95.2% to 99.0%, as measured by quantum process tomography (QPT). To demonstrate the capability of circuit compression, we use the TCG scheme to prepare 3-qubit Greenberger-Horne-Zeilinger (GHZ) and W states, with fidelities of 96.77% and 95.72%, respectively. TCG reduces the circuit depth by about 40% and 44%, respectively, compared with the use of CZ gates only. Moreover, we show that short-path TCG (SPTCG) can further reduce the time cost of the state-preparation circuit. The TCG scheme exhibits advantages in certain quantum circuits and shows significant potential for large-scale quantum algorithms.
Submitted 10 July, 2024; v1 submitted 9 July, 2024;
originally announced July 2024.
-
FinCon: A Synthesized LLM Multi-Agent System with Conceptual Verbal Reinforcement for Enhanced Financial Decision Making
Authors:
Yangyang Yu,
Zhiyuan Yao,
Haohang Li,
Zhiyang Deng,
Yupeng Cao,
Zhi Chen,
Jordan W. Suchow,
Rong Liu,
Zhenyu Cui,
Denghui Zhang,
Koduvayur Subbalakshmi,
Guojun Xiong,
Yueru He,
Jimin Huang,
Dong Li,
Qianqian Xie
Abstract:
Large language models (LLMs) have demonstrated notable potential in conducting complex tasks and are increasingly utilized in various financial applications. However, high-quality sequential financial investment decision-making remains challenging. These tasks require multiple interactions with a volatile environment for every decision, demanding sufficient intelligence to maximize returns and manage risks. Although LLMs have been used to develop agent systems that surpass human teams and yield impressive investment returns, opportunities to enhance multi-sourced information synthesis and optimize decision-making outcomes through timely experience refinement remain unexplored. Here, we introduce FinCon, an LLM-based multi-agent framework with CONceptual verbal reinforcement tailored for diverse FINancial tasks. Inspired by effective real-world investment firm organizational structures, FinCon utilizes a manager-analyst communication hierarchy. This structure allows for synchronized cross-functional agent collaboration towards unified goals through natural language interactions and equips each agent with greater memory capacity than humans. Additionally, a risk-control component in FinCon enhances decision quality by episodically initiating a self-critiquing mechanism to update systematic investment beliefs. The conceptualized beliefs serve as verbal reinforcement for the future agent's behavior and can be selectively propagated to the appropriate node that requires knowledge updates. This feature significantly improves performance while reducing unnecessary peer-to-peer communication costs. Moreover, FinCon demonstrates strong generalization capabilities in various financial tasks, including single stock trading and portfolio management.
Submitted 10 July, 2024; v1 submitted 9 July, 2024;
originally announced July 2024.
-
Visual-Geometry GP-based Navigable Space for Autonomous Navigation
Authors:
Mahmoud Ali,
Durgkant Pushp,
Zheng Chen,
Lantao Liu
Abstract:
Autonomous navigation in unknown environments is challenging and demands the consideration of both geometric and semantic information in order to parse the navigability of the environment. In this work, we propose a novel space modeling framework, Visual-Geometry Sparse Gaussian Process (VG-SGP), that simultaneously considers the semantics and geometry of the scene. Our proposed approach can overcome the limitations of visual planners, which fail to recognize the geometry associated with the semantics, and of geometric planners, which completely overlook the semantic information that is critical in real-world navigation. The proposed method leverages dual Sparse Gaussian Processes in an integrated manner; the first is trained to forecast geometrically navigable spaces while the second predicts the semantically navigable areas. This integrated model is able to pinpoint the overlapping (geometric and semantic) navigable space. Simulation and real-world experiments demonstrate that the proposed VG-SGP model, coupled with our innovative navigation strategy, outperforms models that rely solely on visual or geometric navigation algorithms and exhibits superior adaptive behavior.
Submitted 9 July, 2024;
originally announced July 2024.
-
Circuit Partitioning and Transmission Cost Optimization in Distributed Quantum Computing
Authors:
Xinyu Chen,
Zilu Chen,
Xueyun Cheng,
Zhijin Guan
Abstract:
Given the limitations on the number of qubits in current NISQ devices, the implementation of large-scale quantum algorithms on such devices is challenging, prompting research into distributed quantum computing. This paper focuses on the issue of excessive communication complexity in distributed quantum computing oriented towards quantum circuits. To reduce the number of quantum state transmissions, i.e., the transmission cost, in distributed quantum circuits, a circuit partitioning method based on the QUBO model is proposed, coupled with the lookahead method for transmission cost optimization. Initially, the problem of distributed quantum circuit partitioning is transformed into a graph minimum cut problem. The QUBO model, which can be accelerated by quantum algorithms, is introduced to minimize the number of quantum gates between QPUs and the transmission cost. Subsequently, the dynamic lookahead strategy for the selection of transmission qubits is proposed to optimize the transmission cost in distributed quantum circuits. Finally, through numerical simulations, the impact of different circuit partitioning indicators on the transmission cost is explored, and the proposed method is evaluated on benchmark circuits. Experimental results demonstrate that the transmission cost optimized through the method proposed in this paper is significantly reduced compared with current methods for optimizing transmission cost, achieving noticeable improvements across different numbers of partitions.
Submitted 8 July, 2024;
originally announced July 2024.
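As a rough illustration of the QUBO idea mentioned in the abstract above, the sketch below partitions a tiny circuit across two QPUs by minimising a quadratic binary objective: the weight of cut edges plus a balance penalty. It is a simplified two-partition toy, not the authors' formulation; the function names, the penalty weight, and the 4-qubit gate counts are all assumptions made for illustration, and the brute-force search stands in for whatever QUBO solver would be used in practice.

```python
from itertools import product


def qubo_energy(assignment, edge_weights, penalty):
    """Cut weight plus a quadratic balance penalty for a 0/1 assignment."""
    # x_i + x_j - 2*x_i*x_j equals 1 exactly when qubits i and j sit on different QPUs.
    cut = sum(
        w * (assignment[i] + assignment[j] - 2 * assignment[i] * assignment[j])
        for (i, j), w in edge_weights.items()
    )
    # Penalise partitions that do not split the qubits evenly.
    imbalance = sum(assignment) - len(assignment) / 2
    return cut + penalty * imbalance ** 2


def best_bipartition(num_qubits, edge_weights, penalty=10.0):
    """Exhaustively minimise the objective (only feasible for tiny circuits)."""
    return min(
        product((0, 1), repeat=num_qubits),
        key=lambda a: qubo_energy(a, edge_weights, penalty),
    )


# Hypothetical 4-qubit circuit: weights count two-qubit gates between qubit pairs.
gate_counts = {(0, 1): 3, (1, 2): 1, (2, 3): 3}
partition = best_bipartition(4, gate_counts)
# With the penalty switched off, the energy is just the number of cut gates.
cut_cost = qubo_energy(partition, gate_counts, penalty=0.0)
```

In this toy instance the heavily interacting pairs (0, 1) and (2, 3) stay together, so only the single gate between qubits 1 and 2 crosses the partition.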
-
Multi-times Monte Carlo Rendering for Inter-reflection Reconstruction
Authors:
Tengjie Zhu,
Zhuo Chen,
Jingnan Gao,
Yichao Yan,
Xiaokang Yang
Abstract:
Inverse rendering methods have achieved remarkable performance in reconstructing high-fidelity 3D objects with disentangled geometries, materials, and environmental light. However, they still face major challenges in reflective surface reconstruction. Although recent methods model the light trace to learn specularity, their neglect of indirect illumination makes it hard to handle inter-reflections among multiple smooth objects. In this work, we propose Ref-MC2, which introduces multi-time Monte Carlo sampling that comprehensively computes the environmental illumination while also accounting for the light reflected from object surfaces. To address the computational challenge that arises as the number of Monte Carlo sampling rounds grows, we propose a specularity-adaptive sampling strategy that significantly reduces the computational complexity. Beyond computational resources, higher geometric accuracy is also required because geometric errors accumulate over multiple rounds. Therefore, we further introduce a reflection-aware surface model to initialize the geometry and refine it during inverse rendering. We construct a challenging dataset containing scenes with multiple objects and inter-reflections. Experiments show that our method outperforms other inverse rendering methods on various object groups. We also show downstream applications, e.g., relighting and material editing, to illustrate the disentanglement ability of our method.
Submitted 8 July, 2024;
originally announced July 2024.
-
Improving the trainability of VQE on NISQ computers for solving portfolio optimization using convex interpolation
Authors:
Shengbin Wang,
Guihui Li,
Zhaoyun Chen,
Peng Wang,
Menghan Dou,
Haiyong Zheng,
Zhimin Wang,
Yongjian Gu,
Yu-Chun Wu,
Guo-Ping Guo
Abstract:
Solving combinatorial optimization problems using variational quantum algorithms (VQAs) represents one of the most promising applications in the NISQ era. However, the limited trainability of VQAs could hinder their scalability to large problem sizes. In this paper, we improve the trainability of the variational quantum eigensolver (VQE) by utilizing convex interpolation to solve portfolio optimization. The idea is inspired by the observation that the Dicke state possesses an inherent clustering property. Consequently, a state with a larger Hamming distance from the ground state tends, in the overall distribution, to have a larger energy gap from the ground-state energy. Based on convex interpolation, the location of the ground state can be evaluated by learning the properties of a small subset of basis states in the Hilbert space. This naturally motivates the strategies of close-to-solution initialization, regular cost function landscape, and recursive ansatz equilibrium partition. The successful implementation of a $40$-qubit experiment using only $10$ superconducting qubits demonstrates the effectiveness of our proposals. Furthermore, the quantum inspiration has also spurred the development of a prototype greedy algorithm. Extensive numerical simulations indicate that the hybridization of VQE and greedy algorithms achieves a mutual complementarity, combining the advantages of both global and local optimization methods. Our proposals can be extended to improve the trainability for solving other large-scale combinatorial optimization problems that are widely used in real applications, paving the way to unleash quantum advantages of NISQ computers in the near future.
Submitted 7 July, 2024;
originally announced July 2024.
-
PANS: Probabilistic Airway Navigation System for Real-time Robust Bronchoscope Localization
Authors:
Qingyao Tian,
Zhen Chen,
Huai Liao,
Xinyan Huang,
Bingyu Yang,
Lujie Li,
Hongbin Liu
Abstract:
Accurate bronchoscope localization is essential for pulmonary interventions, as it provides six degrees of freedom (DOF) in airway navigation. However, the robustness of current vision-based methods is often compromised in clinical practice, and they struggle to perform in real time and to generalize across cases unseen during training. To overcome these challenges, we propose a novel Probabilistic Airway Navigation System (PANS), leveraging a Monte Carlo method with pose hypotheses and likelihoods to achieve robust and real-time bronchoscope localization. Specifically, our PANS incorporates diverse visual representations (e.g., odometry and landmarks) by leveraging two key modules: the Depth-based Motion Inference (DMI) and the Bronchial Semantic Analysis (BSA). To generate the pose hypotheses of the bronchoscope for PANS, we devise the DMI to accurately propagate the estimation of pose hypotheses over time. Moreover, to estimate an accurate pose likelihood, we devise the BSA module, which effectively distinguishes between similar bronchial regions in endoscopic images, along with a novel metric to assess the congruence between estimated depth maps and the segmented airway structure. Under this probabilistic formulation, our PANS is capable of achieving 6-DOF bronchoscope localization with superior accuracy and robustness. Extensive experiments on the collected pulmonary intervention dataset comprising 10 clinical cases confirm the advantage of our PANS over state-of-the-art methods, in terms of both robustness and generalization in localizing deeper airway branches and the efficiency of real-time inference. The proposed PANS reveals its potential to be a reliable tool in the operating room, promising to enhance the quality and safety of pulmonary interventions.
Submitted 7 July, 2024;
originally announced July 2024.
-
Enhancing Computer Programming Education with LLMs: A Study on Effective Prompt Engineering for Python Code Generation
Authors:
Tianyu Wang,
Nianjun Zhou,
Zhixiong Chen
Abstract:
Large language models (LLMs) and prompt engineering hold significant potential for advancing computer programming education through personalized instruction. This paper explores this potential by investigating three critical research questions: the systematic categorization of prompt engineering strategies tailored to diverse educational needs, the empowerment of LLMs to solve complex problems beyond their inherent capabilities, and the establishment of a robust framework for evaluating and implementing these strategies. Our methodology involves categorizing programming questions based on educational requirements, applying various prompt engineering strategies, and assessing the effectiveness of LLM-generated responses. Experiments with GPT-4, GPT-4o, Llama3-8b, and Mixtral-8x7b models on datasets such as LeetCode and USACO reveal that GPT-4o consistently outperforms others, particularly with the "multi-step" prompt strategy. The results show that tailored prompt strategies significantly enhance LLM performance, with specific strategies recommended for foundational learning, competition preparation, and advanced problem-solving. This study underscores the crucial role of prompt engineering in maximizing the educational benefits of LLMs. By systematically categorizing and testing these strategies, we provide a comprehensive framework for both educators and students to optimize LLM-based learning experiences. Future research should focus on refining these strategies and addressing current LLM limitations to further enhance educational outcomes in computer programming instruction.
Submitted 7 July, 2024;
originally announced July 2024.
-
DM-MIMO: Diffusion Models for Robust Semantic Communications over MIMO Channels
Authors:
Yiheng Duan,
Tong Wu,
Zhiyong Chen,
Meixia Tao
Abstract:
This paper investigates robust semantic communications over multiple-input multiple-output (MIMO) fading channels. Current work on semantic communications over MIMO channels mainly focuses on channel-adaptive encoding and decoding and leaves the signal distribution unexplored. To leverage the potential of signal distribution in signal space denoising, we develop a diffusion model over MIMO channels (DM-MIMO), a plugin module at the receiver side in conjunction with singular value decomposition (SVD) based precoding and equalization. Specifically, due to the significant variations in effective noise power over distinct sub-channels, we determine the effective sampling steps accordingly and devise a joint sampling algorithm. Utilizing a three-stage training algorithm, DM-MIMO learns the distribution of the encoded signal, which enables noise elimination over all sub-channels. Experimental results demonstrate that DM-MIMO effectively reduces the mean square error (MSE) of the equalized signal and that the DM-MIMO semantic communication system (DM-MIMO-JSCC) outperforms the JSCC-based semantic communication system in image reconstruction.
Submitted 7 July, 2024;
originally announced July 2024.
-
OvSW: Overcoming Silent Weights for Accurate Binary Neural Networks
Authors:
Jingyang Xiang,
Zuohui Chen,
Siqi Li,
Qing Wu,
Yong Liu
Abstract:
Binary Neural Networks (BNNs) have been proven to be highly effective for deploying deep neural networks on mobile and embedded platforms. Most existing works focus on minimizing quantization errors, improving representation ability, or designing gradient approximations to alleviate gradient mismatch in BNNs, while leaving weight sign flipping, a critical factor for achieving powerful BNNs, untouched. In this paper, we investigate the efficiency of weight sign updates in BNNs. We observe that, for vanilla BNNs, over 50% of the weights keep their signs unchanged during training, and these weights are not only distributed at the tails of the weight distribution but also universally present in the vicinity of zero. We refer to these weights as "silent weights", which slow down convergence and lead to a significant accuracy degradation. Theoretically, we reveal this is due to the independence of the BNN gradient from the latent weight distribution. To address the issue, we propose Overcome Silent Weights (OvSW). OvSW first employs Adaptive Gradient Scaling (AGS) to establish a relationship between the gradient and the latent weight distribution, thereby improving the overall efficiency of weight sign updates. Additionally, we design Silence Awareness Decaying (SAD) to automatically identify "silent weights" by tracking the weight flipping state, and apply an additional penalty to "silent weights" to facilitate their flipping. By efficiently updating weight signs, our method achieves faster convergence and state-of-the-art performance on the CIFAR10 and ImageNet1K datasets with various architectures. For example, OvSW obtains 61.6% and 65.5% top-1 accuracy on ImageNet1K using binarized ResNet18 and ResNet34 architectures, respectively. Code is available at https://github.com/JingyangXiang/OvSW.
Submitted 7 July, 2024;
originally announced July 2024.
-
Towards Context-Aware Emotion Recognition Debiasing from a Causal Demystification Perspective via De-confounded Training
Authors:
Dingkang Yang,
Kun Yang,
Haopeng Kuang,
Zhaoyu Chen,
Yuzheng Wang,
Lihua Zhang
Abstract:
Understanding emotions from diverse contexts has received widespread attention in computer vision communities. The core philosophy of Context-Aware Emotion Recognition (CAER) is to provide valuable semantic cues for recognizing the emotions of target persons by leveraging rich contextual information. Current approaches invariably focus on designing sophisticated structures to extract perceptually critical representations from contexts. Nevertheless, a long-neglected dilemma is that a severe context bias in existing datasets results in an unbalanced distribution of emotional states among different contexts, causing biased visual representation learning. From a causal demystification perspective, the harmful bias is identified as a confounder that misleads existing models to learn spurious correlations based on likelihood estimation, limiting the models' performance. To address the issue, we embrace causal inference to disentangle the models from the impact of such bias, and formulate the causalities among variables in the CAER task via a customized causal graph. Subsequently, we present a Contextual Causal Intervention Module (CCIM) to de-confound the confounder, which is built upon backdoor adjustment theory to facilitate seeking approximate causal effects during model training. As a plug-and-play component, CCIM can easily integrate with existing approaches and bring significant improvements. Systematic experiments on three datasets demonstrate the effectiveness of our CCIM.
Submitted 6 July, 2024;
originally announced July 2024.
-
FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior
Authors:
Zhekai Chen,
Wen Wang,
Zhen Yang,
Zeqing Yuan,
Hao Chen,
Chunhua Shen
Abstract:
We offer a novel approach to image composition, which integrates multiple input images into a single, coherent image. Rather than concentrating on specific use cases such as appearance editing (image harmonization) or semantic editing (semantic image composition), we showcase the potential of utilizing the powerful generative prior inherent in large-scale pre-trained diffusion models to accomplish generic image composition applicable to both scenarios. We observe that pre-trained diffusion models automatically identify simple copy-paste boundary areas as low-density regions during denoising. Building on this insight, we propose to optimize the composed image towards high-density regions guided by the diffusion prior. In addition, we introduce a novel mask-guided loss to further enable flexible semantic image composition. Extensive experiments validate the superiority of our approach in achieving generic zero-shot image composition. Additionally, our approach shows promising potential in various tasks, such as object removal and multi-concept customization.
Submitted 5 July, 2024;
originally announced July 2024.
-
Balance of Number of Embedding and their Dimensions in Vector Quantization
Authors:
Hang Chen,
Sankepally Sainath Reddy,
Ziwei Chen,
Dianbo Liu
Abstract:
The dimensionality of the embedding and the number of available embeddings (also called codebook size) are critical factors influencing the performance of Vector Quantization (VQ), a discretization process used in many models such as the Vector Quantized Variational Autoencoder (VQ-VAE) architecture. This study examines the balance between the codebook sizes and dimensions of embeddings in VQ, while maintaining their product constant. Traditionally, these hyperparameters are static during training; however, our findings indicate that augmenting the codebook size while simultaneously reducing the embedding dimension can significantly boost the effectiveness of the VQ-VAE. As a result, the strategic selection of codebook size and embedding dimensions, while preserving the capacity of the discrete codebook space, is critically important. To address this, we propose a novel adaptive dynamic quantization approach, underpinned by the Gumbel-Softmax mechanism, which allows the model to autonomously determine the optimal codebook configuration for each data instance. This dynamic discretizer gives the VQ-VAE remarkable flexibility. Thorough empirical evaluations across multiple benchmark datasets validate the notable performance enhancements achieved by our approach, highlighting the significant potential of adaptive dynamic quantization to improve model performance.
Submitted 5 July, 2024;
originally announced July 2024.
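To make the Gumbel-Softmax selection mentioned in the abstract above more concrete, here is a minimal standard-library sketch of choosing among candidate codebook configurations whose size-times-dimension product is held constant. It is not the authors' implementation: the candidate configurations, the logits, the temperature, and the function name are hypothetical values chosen only to illustrate the sampling step.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Return a relaxed one-hot sample over the given logits."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidates as (codebook size, embedding dimension); product is constant.
configs = [(256, 64), (512, 32), (1024, 16)]
# Hypothetical per-instance scores that an encoder might produce.
logits = [0.2, 1.0, 0.5]

weights = gumbel_softmax(logits, temperature=0.5, rng=random.Random(0))
chosen = configs[max(range(len(weights)), key=weights.__getitem__)]
```

Lower temperatures push the weights closer to a hard one-hot choice, while higher temperatures keep them smoother; in a trained model the relaxed weights are what allow gradients to flow through the selection.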
-
SAM-Med3D-MoE: Towards a Non-Forgetting Segment Anything Model via Mixture of Experts for 3D Medical Image Segmentation
Authors:
Guoan Wang,
Jin Ye,
Junlong Cheng,
Tianbin Li,
Zhaolin Chen,
Jianfei Cai,
Junjun He,
Bohan Zhuang
Abstract:
Volumetric medical image segmentation is pivotal in enhancing disease diagnosis, treatment planning, and advancing medical research. While existing volumetric foundation models for medical image segmentation, such as SAM-Med3D and SegVol, have shown remarkable performance on general organs and tumors, their ability to segment certain categories in clinical downstream tasks remains limited. Supervised Finetuning (SFT) serves as an effective way to adapt such foundation models for task-specific downstream tasks, but at the cost of degrading the general knowledge previously stored in the original foundation model. To address this, we propose SAM-Med3D-MoE, a novel framework that seamlessly integrates task-specific finetuned models with the foundational model, creating a unified model at minimal additional training expense for an extra gating network. This gating network, in conjunction with a selection strategy, allows the unified model to achieve performance comparable to that of the original models in their respective tasks, both general and specialized, without updating any of their parameters. Our comprehensive experiments demonstrate the efficacy of SAM-Med3D-MoE, with an average Dice performance increase from 53 to 56.4 on 15 specific classes. In particular, it achieves remarkable gains of 29.6, 8.5, and 11.2 on the spinal cord, esophagus, and right hip, respectively. Additionally, it achieves 48.9 Dice on the challenging SPPIN2023 Challenge, significantly surpassing the general expert's performance of 32.3. We anticipate that SAM-Med3D-MoE can serve as a new framework for adapting the foundation model to specific areas in medical image analysis. Code and datasets will be publicly available.
Submitted 5 July, 2024;
originally announced July 2024.
-
MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
Authors:
Zhaorun Chen,
Yichao Du,
Zichen Wen,
Yiyang Zhou,
Chenhang Cui,
Zhenzhen Weng,
Haoqin Tu,
Chaoqi Wang,
Zhengwei Tong,
Qinglan Huang,
Canyu Chen,
Qinghao Ye,
Zhihong Zhu,
Yuqing Zhang,
Jiawei Zhou,
Zhuokai Zhao,
Rafael Rafailov,
Chelsea Finn,
Huaxiu Yao
Abstract:
While text-to-image models like DALLE-3 and Stable Diffusion are rapidly proliferating, they often encounter challenges such as hallucination, bias, and the production of unsafe, low-quality output. To effectively address these issues, it is crucial to align these models with desired behaviors based on feedback from a multimodal judge. Despite their significance, current multimodal judges frequently undergo inadequate evaluation of their capabilities and limitations, potentially leading to misalignment and unsafe fine-tuning outcomes. To address this issue, we introduce MJ-Bench, a novel benchmark which incorporates a comprehensive preference dataset to evaluate multimodal judges in providing feedback for image generation models across four key perspectives: alignment, safety, image quality, and bias. Specifically, we evaluate a large variety of multimodal judges including smaller-sized CLIP-based scoring models, open-source VLMs (e.g., LLaVA family), and closed-source VLMs (e.g., GPT-4o, Claude 3) on each decomposed subcategory of our preference dataset. Experiments reveal that closed-source VLMs generally provide better feedback, with GPT-4o outperforming other judges on average. Compared with open-source VLMs, smaller-sized scoring models can provide better feedback regarding text-image alignment and image quality, while VLMs provide more accurate feedback regarding safety and generation bias due to their stronger reasoning capabilities. Further studies on feedback scale reveal that VLM judges can generally provide more accurate and stable feedback in natural language (Likert-scale) than on numerical scales. Notably, human evaluations on end-to-end fine-tuned models using separate feedback from these multimodal judges provide similar conclusions, further confirming the effectiveness of MJ-Bench. All data, code, and models are available at https://huggingface.co/MJ-Bench.
Submitted 5 July, 2024;
originally announced July 2024.
-
Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition
Authors:
Ye Bai,
Jingping Chen,
Jitong Chen,
Wei Chen,
Zhuo Chen,
Chuang Ding,
Linhao Dong,
Qianqian Dong,
Yujiao Du,
Kepan Gao,
Lu Gao,
Yi Guo,
Minglun Han,
Ting Han,
Wenchao Hu,
Xinying Hu,
Yuxiang Hu,
Deyu Hua,
Lu Huang,
Mingkun Huang,
Youjia Huang,
Jishuo Jin,
Fanliu Kong,
Zongwei Lan,
Tianyu Li
, et al. (30 additional authors not shown)
Abstract:
Modern automatic speech recognition (ASR) models are required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc.) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data-matching scenarios, and are gradually approaching a bottleneck. In this work, we introduce Seed-ASR, a large language model (LLM) based speech recognition model. Seed-ASR is developed based on the framework of audio conditioned LLM (AcLLM), leveraging the capabilities of LLMs by inputting continuous speech representations together with contextual information into the LLM. Through stage-wise large-scale training and the elicitation of context-aware capabilities in the LLM, Seed-ASR demonstrates significant improvement over end-to-end models on comprehensive evaluation sets, including multiple domains, accents/dialects and languages. Additionally, Seed-ASR can be further deployed to support specific needs in various scenarios without requiring extra language models. Compared to recently released large ASR models, Seed-ASR achieves a 10%-40% reduction in word (or character, for Chinese) error rates on Chinese and English public test sets, further demonstrating its powerful performance.
Submitted 10 July, 2024; v1 submitted 5 July, 2024;
originally announced July 2024.
-
Improving Audio Generation with Visual Enhanced Caption
Authors:
Yi Yuan,
Dongya Jia,
Xiaobin Zhuang,
Yuanzhe Chen,
Zhengxi Liu,
Zhuo Chen,
Yuping Wang,
Yuxuan Wang,
Xubo Liu,
Mark D. Plumbley,
Wenwu Wang
Abstract:
Generative models have shown significant achievements in audio generation tasks. However, existing models struggle with complex and detailed prompts, leading to potential performance degradation. We hypothesize that this problem stems from the low quality and relatively small quantity of training data. In this work, we aim to create a large-scale audio dataset with rich captions for improving audio generation models. We develop an automated pipeline to generate detailed captions for audio-visual datasets by transforming predicted visual captions, audio captions, and tagging labels into comprehensive descriptions using a Large Language Model (LLM). We introduce Sound-VECaps, a dataset comprising 1.66M high-quality audio-caption pairs with enriched details including the order of audio events, the places where they occur, and environment information. We demonstrate that training with Sound-VECaps significantly enhances the capability of text-to-audio generation models to comprehend and generate audio from complex input prompts, improving overall system performance. Furthermore, we conduct ablation studies of Sound-VECaps across several audio-language tasks, suggesting its potential in advancing audio-text representation learning. Our dataset and models are available online.
Submitted 5 July, 2024;
originally announced July 2024.
-
Gemini: Integrating Full-fledged Sensing upon Millimeter Wave Communications
Authors:
Yilong Li,
Zhe Chen,
Jun Luo,
Suman Banerjee
Abstract:
Integrating millimeter wave (mmWave) technology in both communication and sensing is promising as it enables the reuse of existing spectrum and infrastructure without draining resources. Most existing systems piggyback sensing onto conventional communication modes without fully exploiting the potential of integrated sensing and communication (ISAC) in mmWave radios, and are therefore not full-fledged. In this paper, we design and implement a full-fledged mmWave ISAC system, Gemini, which delivers raw channel states to serve a broad category of sensing applications. We first propose a mmWave self-interference cancellation approach to extract the weak reflected signals for near-field sensing purposes. Then, we develop a joint optimization scheduling framework that enables accurate radar sensing while maximizing the communication throughput. Finally, we design a united fusion sensing algorithm that offers better sensing performance by combining monostatic and bistatic modes. We evaluate our system in extensive experiments to demonstrate Gemini's capability of simultaneously operating sensing and communication, enabling mmWave ISAC to perform better than the commercial off-the-shelf mmWave radar for 5G cellular networks.
Submitted 4 July, 2024;
originally announced July 2024.
-
MAPO: Boosting Large Language Model Performance with Model-Adaptive Prompt Optimization
Authors:
Yuyan Chen,
Zhihao Wen,
Ge Fan,
Zhengyu Chen,
Wei Wu,
Dayiheng Liu,
Zhixu Li,
Bang Liu,
Yanghua Xiao
Abstract:
Prompt engineering, as an efficient and effective way to leverage Large Language Models (LLMs), has drawn a lot of attention from the research community. Existing research primarily emphasizes the importance of adapting prompts to specific tasks, rather than to specific LLMs. However, a good prompt is not solely defined by its wording, but also depends on the nature of the LLM in question. In this work, we first quantitatively demonstrate that different prompts should be adapted to different LLMs to enhance their capabilities across various downstream tasks in NLP. We then propose a novel model-adaptive prompt optimizer (MAPO) method that optimizes the original prompts for each specific LLM in downstream tasks. Extensive experiments indicate that the proposed method can effectively refine prompts for an LLM, leading to significant improvements over various downstream tasks.
Submitted 4 July, 2024;
originally announced July 2024.
-
Continuous-variable quantum digital signatures against coherent attacks
Authors:
Yi-Fan Zhang,
Wen-Bo Liu,
Bing-Hong Li,
Hua-Lei Yin,
Zeng-Bing Chen
Abstract:
Quantum digital signatures (QDS), which utilize correlated bit strings among sender and recipients, guarantee the authenticity, integrity and non-repudiation of classical messages based on quantum laws. Continuous-variable (CV) quantum protocols with heterodyne and homodyne measurement have obvious advantages of low-cost implementation and easy wavelength division multiplexing. However, security analyses in previous research are limited to the proof against collective attacks in finite-size scenarios. Moreover, existing multi-bit CV QDS schemes have primarily focused on adapting single-bit protocols for simplicity of security proof, often sacrificing signature efficiency. Here, we introduce a CV QDS protocol designed to withstand general coherent attacks through the use of a cutting-edge fidelity test function, while achieving high signature efficiency by employing a refined one-time universal hashing signing technique. Our protocol is proved to be robust against finite-size effects and excess noise in quantum channels. In simulation, results demonstrate a significant reduction of over 6 orders of magnitude in signature length for a megabit message signing task compared to existing CV QDS protocols, and this advantage expands as the message size grows. Our work offers a solution with enhanced security and efficiency, paving the way for large-scale deployment of CV QDS in future quantum networks.
Submitted 3 July, 2024;
originally announced July 2024.
-
HiDiff: Hybrid Diffusion Framework for Medical Image Segmentation
Authors:
Tao Chen,
Chenhui Wang,
Zhihao Chen,
Yiming Lei,
Hongming Shan
Abstract:
Medical image segmentation has been significantly advanced with the rapid development of deep learning (DL) techniques. Existing DL-based segmentation models are typically discriminative; i.e., they aim to learn a mapping from the input image to segmentation masks. However, these discriminative methods neglect the underlying data distribution and intrinsic class characteristics, suffering from unstable feature space. In this work, we propose to complement discriminative segmentation methods with the knowledge of underlying data distribution from generative models. To that end, we propose a novel hybrid diffusion framework for medical image segmentation, termed HiDiff, which can synergize the strengths of existing discriminative segmentation models and new generative diffusion models. HiDiff comprises two key components: discriminative segmentor and diffusion refiner. First, we utilize any conventional trained segmentation models as discriminative segmentor, which can provide a segmentation mask prior for diffusion refiner. Second, we propose a novel binary Bernoulli diffusion model (BBDM) as the diffusion refiner, which can effectively, efficiently, and interactively refine the segmentation mask by modeling the underlying data distribution. Third, we train the segmentor and BBDM in an alternate-collaborative manner to mutually boost each other. Extensive experimental results on abdomen organ, brain tumor, polyps, and retinal vessels segmentation datasets, covering four widely-used modalities, demonstrate the superior performance of HiDiff over existing medical segmentation algorithms, including the state-of-the-art transformer- and diffusion-based ones. In addition, HiDiff excels at segmenting small objects and generalizing to new datasets. Source codes are made available at https://github.com/takimailto/HiDiff.
Submitted 3 July, 2024;
originally announced July 2024.
-
Submillimeter and Mid-Infrared Variability of Young Stellar Objects in the M17SWex Intermediate-Mass Star-Forming Region
Authors:
Geumsook Park,
Doug Johnstone,
Carlos Contreras Pena,
Jeong-Eun Lee,
Sheng-Yuan Liu,
Gregory Herczeg,
Steve Mairs,
Zhiwei Chen,
Jennifer Hatchell,
Kee-Tae Kim,
Mi-Ryang Kim,
Keping Qiu,
Yao-Te Wang,
Xu Zhang,
The JCMT Transient Team
Abstract:
We present a comprehensive analysis of young stellar object (YSO) variability within the M17 Southwest Extension (M17 SWex), using 3.5 years of monitoring data from the JCMT Transient Survey at sub-millimeter (sub-mm) and 9 years from the NEOWISE mission at mid-infrared (mid-IR). Our study encompasses observations of 147 bright sub-mm peaks identified within our deep JCMT co-added map as well as 156 YSOs in NEOWISE W1 and 179 in W2 that were previously identified in Spitzer surveys. We find three robust sub-mm variables: two are candidate YSOs and one is a likely extragalactic source. At mid-IR wavelengths, our analysis reveals secular and stochastic variability in 47 YSOs, with the highest fraction of secular variability occurring at the earliest evolutionary stage. This is similar to what has previously been observed for low-mass YSO variability within the Gould Belt. However, we observe less overall variability in M17SWex at both the sub-mm and mid-IR. We suspect that this lower fraction is due to the greater distance to M17 SWex. Our findings showcase the utility of multi-wavelength observations to better capture the complex variability phenomena inherent to star formation processes and demonstrate the importance of years-long monitoring of a diverse selection of star-forming environments.
Submitted 3 July, 2024;
originally announced July 2024.
-
Information Greenhouse: Optimal Persuasion for Medical Test-Avoiders
Authors:
Zhuo Chen
Abstract:
Patients often delay or reject medical tests due to information avoidance, which hinders timely reception of necessary treatments. This paper studies the optimal information policy to persuade an information-avoidant patient to undergo the test and make the best choice that maximizes his health. The patient sequentially decides whether to take the test and the optimal treatment plan. The information provided is about the background knowledge of the disease, which is complementary with the test result, and disclosure can take place both before and after the test decision. The optimal information policy depends on whether the patient is willing to be tested when he is completely pessimistic. If so, the optimal policy features \textit{preemptive warning}: the disclosure only takes place before the test, and the bad news guarantees the patient to be tested and be treated even without further information. If not, the optimal policy constructs an \textit{information greenhouse}: an information structure that provides high anticipatory utility is committed when the patient is tested and the test result is bad. I consider extensions to general information preference and ex ante participation constraint.
Submitted 3 July, 2024;
originally announced July 2024.
-
Optical vortex-antivortex crystallization in free space
Authors:
Haolin Lin,
Yixuan Liao,
Guohua Liu,
Jianbin Ren,
Zhen Li,
Zhenqiang Chen,
Boris A. Malomed,
Shenhe Fu
Abstract:
Stable vortex lattices are basic dynamical patterns which have been demonstrated in physical systems including superconductor physics, Bose-Einstein condensates, hydrodynamics and optics. Vortex-antivortex (VAV) ensembles can be produced, self-organizing into the respective polar lattices. However, these structures are in general highly unstable due to the strong VAV attraction. Here, we demonstrate that multiple optical VAV clusters nested in the propagating coherent field can crystallize into patterns which preserve their lattice structures over distance up to several Rayleigh lengths. To explain this phenomenon, we present a model for effective interactions between the vortices and antivortices at different lattice sites. The observed VAV crystallization is a consequence of the globally balanced VAV couplings. As the crystallization does not require the presence of nonlinearities and appears in free space, it may find applications to high-capacity optical communications and multiparticle manipulations. Our findings suggest possibilities for constructing VAV complexes through the orbit-orbit couplings, which differs from the extensively studied spin-orbit couplings.
Submitted 3 July, 2024;
originally announced July 2024.
-
Measurement of the branching fraction of the decay $J/ψ\to p \bar{p} η$
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
O. Afedulidis,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
I. Balossino,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere
, et al. (639 additional authors not shown)
Abstract:
A high precision measurement of the branching fraction of the decay $J/ψ\to p \bar{p} η$ is performed using $(10 087 \pm 44) \times 10^6$ $J/ψ$ events recorded by the {BESIII} detector at the {BEPCII} storage ring. The branching fractions of the two decays $J/ψ\to p \bar{p} η(η\to γγ)$ and $J/ψ\to p \bar{p} η(η\to π^+ π^- π^0)$ are measured individually to be $\mathcal{B}(J/ψ\to p \bar{p} η(η\to γγ)) = (1.480 \pm 0.001 \pm 0.024)\times\,10^{-3}$ and $\mathcal{B}(J/ψ\to p \bar{p} η(η\to π^+ π^- π^0)) = (1.557 \pm 0.003 \pm 0.038)\times\,10^{-3}$, where the first uncertainties are statistical and the second systematic. Both results are compatible within their uncorrelated systematic uncertainties. The combined result is $\mathcal{B}(J/ψ\to p \bar{p} η)=(1.495 \pm 0.001 \pm 0.023)\times\,10^{-3}$ where the first uncertainty is the combined statistical uncertainty and the second one the combined systematic uncertainty of both analyses, incorporating correlations between them. In addition, the $p \bar{p}$ threshold region is investigated for a potential threshold enhancement, and no evidence for one is observed.
Submitted 3 July, 2024;
originally announced July 2024.
-
Product Geometries on Cholesky Manifolds with Applications to SPD Manifolds
Authors:
Ziheng Chen,
Yue Song,
Xiao-Jun Wu,
Nicu Sebe
Abstract:
This paper presents two new metrics on the Symmetric Positive Definite (SPD) manifold via the Cholesky manifold, i.e., the space of lower triangular matrices with positive diagonal elements. We first unveil that the existing popular Riemannian metric on the Cholesky manifold can be generally characterized as the product metric of a Euclidean metric and a Riemannian metric on the space of n-dimensional positive vectors. Based on this analysis, we propose two novel metrics on the Cholesky manifolds, i.e., Diagonal Power Euclidean Metric and Diagonal Generalized Bures-Wasserstein Metric, which are numerically stabler than the existing Cholesky metric. We also discuss the gyro structures and deformed metrics associated with our metrics. The gyro structures connect the linear and geometric properties, while the deformed metrics interpolate between our proposed metrics and the existing metric. Further, by Cholesky decomposition, the proposed deformed metrics and gyro structures are pulled back to SPD manifolds. Compared with existing Riemannian metrics on SPD manifolds, our metrics are easy to use, computationally efficient, and numerically stable.
Submitted 2 July, 2024;
originally announced July 2024.
-
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation
Authors:
Kepan Nan,
Rui Xie,
Penghao Zhou,
Tiehan Fan,
Zhenheng Yang,
Zhijie Chen,
Xiang Li,
Jian Yang,
Ying Tai
Abstract:
Text-to-video (T2V) generation has recently garnered significant attention thanks to the large multi-modality model Sora. However, T2V generation still faces two important challenges: 1) The lack of a precise, open-source, high-quality dataset. The previous popular video datasets, e.g. WebVid-10M and Panda-70M, are either of low quality or too large for most research institutions. Therefore, it is challenging but crucial to collect precise, high-quality text-video pairs for T2V generation. 2) The failure to fully utilize textual information. Recent T2V methods have focused on vision transformers, using a simple cross attention module for video generation, which falls short of thoroughly extracting semantic information from the text prompt. To address these issues, we introduce OpenVid-1M, a precise high-quality dataset with expressive captions. This open-scenario dataset contains over 1 million text-video pairs, facilitating research on T2V generation. Furthermore, we curate 433K 1080p videos from OpenVid-1M to create OpenVidHD-0.4M, advancing high-definition video generation. Additionally, we propose a novel Multi-modal Video Diffusion Transformer (MVDiT) capable of mining both structure information from visual tokens and semantic information from text tokens. Extensive experiments and ablation studies verify the superiority of OpenVid-1M over previous datasets and the effectiveness of our MVDiT.
Submitted 2 July, 2024;
originally announced July 2024.
-
Kinetics of Rayleigh-Taylor instability in van der Waals fluid: the influence of compressibility
Authors:
Jie Chen,
Aiguo Xu,
Yudong Zhang,
Dawei Chen,
Zhihua Chen
Abstract:
Early studies on Rayleigh-Taylor instability (RTI) primarily relied on the Navier-Stokes (NS) model. As research progresses, it becomes increasingly evident that the kinetic information that the NS model failed to capture is of great value for identifying and even controlling the RTI process; simultaneously, the lack of analysis techniques for complex physical fields results in a significant waste of data information. In addition, early RTI studies mainly focused on the incompressible case and the weakly compressible case. In the case of strong compressibility, the density of the fluid from the upper layer (originally heavy fluid) may become smaller than that of the surrounding (originally light) fluid, thus invalidating the early method of distinguishing light and heavy fluids based on density. In this paper, tracer particles are incorporated into a single-fluid discrete Boltzmann method (DBM) model that considers the van der Waals potential. By using tracer particles to label the matter-particle sources, a careful study of the matter-mixing and energy-mixing processes of the RTI evolution is realized in the single-fluid framework. The effects of compressibility on the evolution of RTI are examined mainly through the analysis of bubble and spike velocities, the ratio of area occupied by heavy fluid, and various entropy generation rates of the system. It is demonstrated that: (1) compressibility has a suppressive effect on the spike velocity, and this suppressive impact diminishes as the Atwood number ($At$) increases. The influence of compressibility on bubble velocity shows a staged behavior with increasing $At$. (2) The impact of compressibility on the entropy production rate associated with the heat flow (${\dot{S}_{NOEF}}$) is related to the stages of RTI evolution.
Submitted 3 July, 2024; v1 submitted 2 July, 2024;
originally announced July 2024.
-
CatMemo at the FinLLM Challenge Task: Fine-Tuning Large Language Models using Data Fusion in Financial Applications
Authors:
Yupeng Cao,
Zhiyuan Yao,
Zhi Chen,
Zhiyang Deng
Abstract:
The integration of Large Language Models (LLMs) into financial analysis has garnered significant attention in the NLP community. This paper presents our solution to IJCAI-2024 FinLLM challenge, investigating the capabilities of LLMs within three critical areas of financial tasks: financial classification, financial text summarization, and single stock trading. We adopted Llama3-8B and Mistral-7B as base models, fine-tuning them through Parameter Efficient Fine-Tuning (PEFT) and Low-Rank Adaptation (LoRA) approaches. To enhance model performance, we combine datasets from task 1 and task 2 for data fusion. Our approach aims to tackle these diverse tasks in a comprehensive and integrated manner, showcasing LLMs' capacity to address diverse and complex financial tasks with improved accuracy and decision-making capabilities.
Submitted 2 July, 2024;
originally announced July 2024.
-
Self-Cooperation Knowledge Distillation for Novel Class Discovery
Authors:
Yuzheng Wang,
Zhaoyu Chen,
Dingkang Yang,
Yunquan Sun,
Lizhe Qi
Abstract:
Novel Class Discovery (NCD) aims to discover unknown and novel classes in an unlabeled set by leveraging knowledge already learned about known classes. Existing works focus on instance-level or class-level knowledge representation and build a shared representation space to achieve performance improvements. However, a long-neglected issue is the potentially imbalanced number of samples from known and novel classes, which pushes the model towards dominant classes. Therefore, these methods suffer from a challenging trade-off between reviewing known classes and discovering novel classes. Based on this observation, we propose a Self-Cooperation Knowledge Distillation (SCKD) method to utilize each training sample (whether known or novel, labeled or unlabeled) for both review and discovery. Specifically, the model's feature representations of known and novel classes are used to construct two disjoint representation spaces. Through spatial mutual information, we design a self-cooperation learning scheme that encourages the model to learn from its own two feature representation spaces. Extensive experiments on six datasets demonstrate that our method can achieve significant performance improvements, achieving state-of-the-art performance.
Submitted 3 July, 2024; v1 submitted 1 July, 2024;
originally announced July 2024.
-
A Method to Facilitate Membership Inference Attacks in Deep Learning Models
Authors:
Zitao Chen,
Karthik Pattabiraman
Abstract:
Modern machine learning (ML) ecosystems offer a surging number of ML frameworks and code repositories that can greatly facilitate the development of ML models. Today, even ordinary data holders who are not ML experts can apply off-the-shelf codebase to build high-performance ML models on their data, many of which are sensitive in nature (e.g., clinical records).
In this work, we consider a malicious ML provider who supplies model-training code to the data holders, does not have access to the training process, and has only black-box query access to the resulting model. In this setting, we demonstrate a new form of membership inference attack that is strictly more powerful than prior art. Our attack empowers the adversary to reliably de-identify all the training samples (average >99% attack TPR@0.1% FPR), and the compromised models still maintain competitive performance as their uncorrupted counterparts (average <1% accuracy drop). Moreover, we show that the poisoned models can effectively disguise the amplified membership leakage under common membership privacy auditing, which can only be revealed by a set of secret samples known by the adversary.
Overall, our study not only points to the worst-case membership privacy leakage, but also unveils a common pitfall underlying existing privacy auditing methods, which calls for future efforts to rethink the current practice of auditing membership privacy in machine learning models.
Submitted 1 July, 2024;
originally announced July 2024.
-
LogEval: A Comprehensive Benchmark Suite for Large Language Models In Log Analysis
Authors:
Tianyu Cui,
Shiyu Ma,
Ziang Chen,
Tong Xiao,
Shimin Tao,
Yilun Liu,
Shenglin Zhang,
Duoming Lin,
Changchang Liu,
Yuzhe Cai,
Weibin Meng,
Yongqian Sun,
Dan Pei
Abstract:
Log analysis is crucial for ensuring the orderly and stable operation of information systems, particularly in the field of Artificial Intelligence for IT Operations (AIOps). Large Language Models (LLMs) have demonstrated significant potential in natural language processing tasks. In the AIOps domain, they excel in tasks such as anomaly detection, root cause analysis of faults, operations and maintenance script generation, and alert information summarization. However, the performance of current LLMs in log analysis tasks remains inadequately validated. To address this gap, we introduce LogEval, a comprehensive benchmark suite designed to evaluate the capabilities of LLMs in various log analysis tasks for the first time. This benchmark covers tasks such as log parsing, log anomaly detection, log fault diagnosis, and log summarization. LogEval evaluates each task using 4,000 publicly available log data entries and employs 15 different prompts for each task to ensure a thorough and fair assessment. By rigorously evaluating leading LLMs, we demonstrate the impact of various LLM technologies on log analysis performance, focusing on aspects such as self-consistency and few-shot contextual learning. We also discuss findings related to model quantification, Chinese-English question-answering evaluation, and prompt engineering. These findings provide insights into the strengths and weaknesses of LLMs in multilingual environments and the effectiveness of different prompt strategies. Various evaluation methods are employed for different tasks to accurately measure the performance of LLMs in log analysis, ensuring a comprehensive assessment. The insights gained from LogEval's evaluation reveal the strengths and limitations of LLMs in log analysis tasks, providing valuable guidance for researchers and practitioners.
Submitted 1 July, 2024;
originally announced July 2024.
-
ESALE: Enhancing Code-Summary Alignment Learning for Source Code Summarization
Authors:
Chunrong Fang,
Weisong Sun,
Yuchen Chen,
Xiao Chen,
Zhao Wei,
Quanjun Zhang,
Yudu You,
Bin Luo,
Yang Liu,
Zhenyu Chen
Abstract:
(Source) code summarization aims to automatically generate succinct natural language summaries for given code snippets. Such summaries play a significant role in helping developers understand and maintain code. Inspired by neural machine translation, deep learning-based code summarization techniques widely adopt an encoder-decoder framework, where the encoder transforms given code snippets into context vectors, and the decoder decodes context vectors into summaries. Recently, large-scale pre-trained models for source code are equipped with encoders capable of producing general context vectors and have achieved substantial improvements on code summarization. However, although they are usually trained mainly on code-focused tasks and can capture general code features, they still fall short in capturing specific features that need to be summarized.
This paper proposes a novel approach to improve code summarization based on summary-focused tasks. Specifically, we exploit a multi-task learning paradigm to train the encoder on three summary-focused tasks to enhance its ability to learn code-summary alignment, including unidirectional language modeling (ULM), masked language modeling (MLM), and action word prediction (AWP). Unlike pre-trained models that mainly predict masked tokens in code snippets, we design ULM and MLM to predict masked words in summaries. Intuitively, predicting words based on given code snippets would help learn the code-summary alignment. Additionally, we introduce the domain-specific task AWP to enhance the ability of the encoder to learn the alignment between action words and code snippets. Extensive experiments on four datasets demonstrate that our approach, called ESALE, significantly outperforms baselines on all three widely used metrics, including BLEU, METEOR, and ROUGE-L.
Submitted 30 June, 2024;
originally announced July 2024.
-
Probing the connection between IceCube neutrinos and MOJAVE AGN
Authors:
R. Abbasi,
M. Ackermann,
J. Adams,
S. K. Agarwalla,
J. A. Aguilar,
M. Ahlers,
J. M. Alameddine,
N. M. Amin,
K. Andeen,
C. Argüelles,
Y. Ashida,
S. Athanasiadou,
L. Ausborm,
S. N. Axani,
X. Bai,
A. Balagopal V.,
M. Baricevic,
S. W. Barwick,
S. Bash,
V. Basu,
R. Bay,
J. J. Beatty,
J. Becker Tjus,
J. Beise,
C. Bellenghi
, et al. (399 additional authors not shown)
Abstract:
Active Galactic Nuclei (AGN) are prime candidate sources of the high-energy, astrophysical neutrinos detected by IceCube. This is demonstrated by the real-time multi-messenger detection of the blazar TXS 0506+056 and the recent evidence of neutrino emission from NGC 1068 from a separate time-averaged study. However, the production mechanism of the astrophysical neutrinos in AGN is not well established, a question that can be addressed via correlation studies with photon observations. For neutrinos produced due to photohadronic interactions in AGN, in addition to a correlation of neutrinos with high-energy photons, there would also be a correlation of neutrinos with photons emitted at radio wavelengths. In this work, we perform an in-depth stacking study of the correlation between 15 GHz radio observations of AGN reported in the MOJAVE XV catalog, and ten years of neutrino data from IceCube. We also use a time-dependent approach which improves the statistical power of the stacking analysis. No significant correlation was found in either analysis, and upper limits are reported. When compared to the IceCube diffuse flux, at 100 TeV and for a spectral index of 2.5, the upper limits derived are $\sim3\%$ and $\sim9\%$ for the time-averaged and time-dependent case, respectively.
Submitted 1 July, 2024;
originally announced July 2024.
-
Search for a light sterile neutrino with 7.5 years of IceCube DeepCore data
Authors:
R. Abbasi,
M. Ackermann,
J. Adams,
S. K. Agarwalla,
J. A. Aguilar,
M. Ahlers,
J. M. Alameddine,
N. M. Amin,
K. Andeen,
C. Argüelles,
Y. Ashida,
S. Athanasiadou,
L. Ausborm,
S. N. Axani,
X. Bai,
A. Balagopal V.,
M. Baricevic,
S. W. Barwick,
S. Bash,
V. Basu,
R. Bay,
J. J. Beatty,
J. Becker Tjus,
J. Beise,
C. Bellenghi
, et al. (399 additional authors not shown)
Abstract:
We present a search for an eV-scale sterile neutrino using 7.5 years of data from the IceCube DeepCore detector. The analysis uses a sample of 21,914 events with energies between 5 and 150 GeV to search for sterile neutrinos through atmospheric muon neutrino disappearance. Improvements in event selection and treatment of systematic uncertainties provide greater statistical power compared to previous DeepCore sterile neutrino searches. Our results are compatible with the absence of mixing between active and sterile neutrino states, and we place constraints on the mixing matrix elements $|U_{μ4}|^2 < 0.0534$ and $|U_{τ4}|^2 < 0.0574$ at 90% CL under the assumption that $Δm^2_{41}\geq 1\;\mathrm{eV^2}$. These null results add to the growing tension between anomalous appearance results and constraints from disappearance searches in the 3+1 sterile neutrino landscape.
Submitted 1 July, 2024;
originally announced July 2024.
-
Proximity Matters: Local Proximity Preserved Balancing for Treatment Effect Estimation
Authors:
Hao Wang,
Zhichao Chen,
Yuan Shen,
Jiajun Fan,
Zhaoran Liu,
Degui Yang,
Xinggao Liu,
Haoxuan Li
Abstract:
Heterogeneous treatment effect (HTE) estimation from observational data poses significant challenges due to treatment selection bias. Existing methods address this bias by minimizing distribution discrepancies between treatment groups in latent space, focusing on global alignment. However, the fruitful aspect of local proximity, where similar units exhibit similar outcomes, is often overlooked. In this study, we propose Proximity-aware Counterfactual Regression (PCR) to exploit proximity for representation balancing within the HTE estimation context. Specifically, we introduce a local proximity preservation regularizer based on optimal transport to depict the local proximity in discrepancy calculation. Furthermore, to overcome the curse of dimensionality that renders the estimation of discrepancy ineffective, exacerbated by limited data availability for HTE estimation, we develop an informative subspace projector, which trades off minimal distance precision for improved sample complexity. Extensive experiments demonstrate that PCR accurately matches units across different treatment groups, effectively mitigates treatment selection bias, and significantly outperforms competitors. Code is available at https://anonymous.4open.science/status/ncr-B697.
Submitted 1 July, 2024;
originally announced July 2024.
-
HGNET: A Hierarchical Feature Guided Network for Occupancy Flow Field Prediction
Authors:
Zhan Chen,
Chen Tang,
Lu Xiong
Abstract:
Predicting the motion of multiple traffic participants has always been one of the most challenging tasks in autonomous driving. The recently proposed occupancy flow field prediction method has been shown to be a more effective and scalable representation compared to general trajectory prediction methods. However, in complex multi-agent traffic scenarios, it remains difficult to model the interactions among various factors and the dependencies among prediction outputs at different time steps. In view of this, we propose a transformer-based hierarchical feature guided network (HGNET), which can efficiently extract features of agents and map information from visual and vectorized inputs, modeling multimodal interaction relationships. Next, we design the Feature-Guided Attention (FGAT) module to leverage the potential guiding effects between different prediction targets, thereby improving prediction accuracy. Additionally, to enhance the temporal consistency and causal relationships of the predictions, we propose a Time Series Memory framework to learn the conditional distribution models of the prediction outputs at future time steps from multivariate time series. The results demonstrate that our model exhibits competitive performance, ranking 3rd in the 2024 Waymo Occupancy and Flow Prediction Challenge.
Submitted 1 July, 2024;
originally announced July 2024.
-
Rethinking LLM-based Preference Evaluation
Authors:
Zhengyu Hu,
Linxin Song,
Jieyu Zhang,
Zheyuan Xiao,
Jingang Wang,
Zhenyu Chen,
Jieyu Zhao,
Hui Xiong
Abstract:
Recently, large language model (LLM)-based preference evaluation has been widely adopted to compare pairs of model responses. However, a severe bias towards lengthy responses has been observed, raising concerns about the reliability of this evaluation method. In this work, we design a series of controlled experiments to study the main factors affecting the metric of LLM-based preference evaluation, i.e., win rate, and conclude that the win rate is affected by two axes of model response: desirability and information mass, where the former is length-independent and related to trustworthiness, and the latter is length-dependent and can be represented by conditional entropy. We find that length impacts the existing evaluations by influencing information mass. However, a reliable evaluation metric should not only assess content quality but also ensure that the assessment is not confounded by extraneous factors such as response length. Therefore, we propose a simple yet effective adjustment, AdapAlpaca, to the existing practice of win rate measurement. Specifically, by adjusting the lengths of reference answers to match the test model's answers within the same interval, we debias information mass relative to length, ensuring a fair model evaluation.
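The length-matching idea described in the abstract can be illustrated with a minimal sketch. It is not the AdapAlpaca implementation: the helper names `length_interval` and `pick_reference`, the word-count measure of length, and the interval width of 200 words are all assumptions made here for illustration. The sketch only shows how a reference answer could be selected so that it falls in the same length interval as the test model's answer before a win rate is computed.

```python
from typing import Optional, Sequence


def length_interval(text: str, width: int = 200) -> int:
    """Return the index of the word-count interval that `text` falls into.

    `width` (words per interval) is a hypothetical choice, not a value
    taken from the paper.
    """
    return len(text.split()) // width


def pick_reference(
    test_answer: str, references: Sequence[str], width: int = 200
) -> Optional[str]:
    """Pick the first reference whose length interval matches the test answer.

    Returns None when no reference of comparable length is available, in
    which case the pair would be skipped rather than compared unfairly.
    """
    target = length_interval(test_answer, width)
    for ref in references:
        if length_interval(ref, width) == target:
            return ref
    return None
```

In such a setup, the judge model would compare the test answer only against the selected reference, so that differences in length within a pair are bounded by the interval width.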
Submitted 1 July, 2024;
originally announced July 2024.
-
Improve ROI with Causal Learning and Conformal Prediction
Authors:
Meng Ai,
Zhuo Chen,
Jibin Wang,
Jing Shang,
Tao Tao,
Zhen Li
Abstract:
In the commercial sphere, such as operations and maintenance, advertising, and marketing recommendations, intelligent decision-making utilizing data mining and neural network technologies is crucial, especially in resource allocation to optimize ROI. This study delves into the Cost-aware Binary Treatment Assignment Problem (C-BTAP) across different industries, with a focus on the state-of-the-art Direct ROI Prediction (DRP) method. However, the DRP model confronts issues like covariate shift and insufficient training data, hindering its real-world effectiveness. Addressing these challenges is essential for ensuring dependable and robust predictions in varied operational contexts.
This paper presents a robust Direct ROI Prediction (rDRP) method, designed to address challenges in real-world deployment of neural network-based uplift models, particularly under conditions of covariate shift and insufficient training data. The rDRP method, enhancing the standard DRP model, does not alter the model's structure or require retraining. It utilizes conformal prediction and Monte Carlo dropout for interval estimation, adapting to model uncertainty and data distribution shifts. A heuristic calibration method, inspired by a Kaggle competition, combines point and interval estimates. The effectiveness of these approaches is validated through offline tests and online A/B tests in various settings, demonstrating significant improvements in target rewards compared to the state-of-the-art method.
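As a rough illustration of how conformal prediction yields interval estimates around a fixed model's point predictions without retraining, the sketch below uses standard split conformal prediction with absolute residuals. It is not the rDRP method itself: it omits Monte Carlo dropout and the heuristic calibration step, and the function names and the calibration-set setup are assumptions introduced here.

```python
import math
from typing import Sequence, Tuple


def conformal_quantile(residuals: Sequence[float], alpha: float) -> float:
    """Finite-sample conformal quantile of the calibration residuals.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest residual; if that
    rank exceeds n, the interval is unbounded.
    """
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(residuals)[k - 1]


def conformal_interval(
    point_pred: float,
    calib_preds: Sequence[float],
    calib_targets: Sequence[float],
    alpha: float = 0.1,
) -> Tuple[float, float]:
    """Symmetric prediction interval around `point_pred`.

    `calib_preds` and `calib_targets` come from a held-out calibration
    set scored by the already-trained model (hypothetical inputs here).
    """
    residuals = [abs(y - p) for p, y in zip(calib_preds, calib_targets)]
    q = conformal_quantile(residuals, alpha)
    return point_pred - q, point_pred + q
```

Because only held-out residuals are needed, the underlying predictor is treated as a black box, which is consistent with the abstract's statement that the model structure is unchanged and no retraining is required.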
Submitted 1 July, 2024;
originally announced July 2024.