-
EyeFound: A Multimodal Generalist Foundation Model for Ophthalmic Imaging
Authors:
Danli Shi,
Weiyi Zhang,
Xiaolan Chen,
Yexin Liu,
Jiancheng Yang,
Siyu Huang,
Yih Chung Tham,
Yingfeng Zheng,
Mingguang He
Abstract:
Artificial intelligence (AI) is vital in ophthalmology, tackling tasks like diagnosis, classification, and visual question answering (VQA). However, existing AI models in this domain often require extensive annotation and are task-specific, limiting their clinical utility. While recent developments have brought about foundation models for ophthalmology, they are limited by the need to train separate weights for each imaging modality, preventing a comprehensive representation of multi-modal features. This highlights the need for versatile foundation models capable of handling various tasks and modalities in ophthalmology. To address this gap, we present EyeFound, a multimodal foundation model for ophthalmic images. Unlike existing models, EyeFound learns generalizable representations from unlabeled multimodal retinal images, enabling efficient model adaptation across multiple applications. Trained on 2.78 million images from 227 hospitals across 11 ophthalmic modalities, EyeFound facilitates generalist representations and diverse multimodal downstream tasks, even for detecting challenging rare diseases. It outperforms previous work RETFound in diagnosing eye diseases, predicting systemic disease incidents, and zero-shot multimodal VQA. EyeFound provides a generalizable solution to improve model performance and lessen the annotation burden on experts, facilitating widespread clinical AI applications for retinal imaging.
Submitted 21 May, 2024; v1 submitted 18 May, 2024;
originally announced May 2024.
-
Motion Avatar: Generate Human and Animal Avatars with Arbitrary Motion
Authors:
Zeyu Zhang,
Yiran Wang,
Biao Wu,
Shuo Chen,
Zhiyuan Zhang,
Shiya Huang,
Wenbo Zhang,
Meng Fang,
Ling Chen,
Yang Zhao
Abstract:
In recent years, there has been significant interest in creating 3D avatars and motions, driven by their diverse applications in areas like film-making, video games, AR/VR, and human-robot interaction. However, current efforts primarily concentrate on either generating the 3D avatar mesh alone or producing motion sequences, and integrating these two aspects remains a persistent challenge. Additionally, while avatar and motion generation predominantly target humans, extending these techniques to animals remains a significant challenge due to inadequate training data and methods. To bridge these gaps, our paper presents three key contributions. First, we propose a novel agent-based approach named Motion Avatar, which allows for the automatic generation of high-quality customizable human and animal avatars with motions through text queries. The method significantly advances progress in dynamic 3D character generation. Second, we introduce an LLM planner that coordinates both motion and avatar generation, recasting discriminative planning in a customizable Q&A fashion. Lastly, we present an animal motion dataset named Zoo-300K, comprising approximately 300,000 text-motion pairs across 65 animal categories, together with its building pipeline ZooGen, which serves as a valuable resource for the community. See the project website: https://steve-zeyu-zhang.github.io/MotionAvatar/
Submitted 18 May, 2024;
originally announced May 2024.
-
LLM-based Multi-Agent Reinforcement Learning: Current and Future Directions
Authors:
Chuanneng Sun,
Songjun Huang,
Dario Pompili
Abstract:
In recent years, Large Language Models (LLMs) have shown great abilities in various tasks, including question answering, arithmetic problem solving, and poem writing, among others. Although research on LLM-as-an-agent has shown that LLMs can be applied to Reinforcement Learning (RL) and achieve decent results, the extension of LLM-based RL to Multi-Agent Systems (MAS) is not trivial, as many aspects, such as coordination and communication between agents, are not considered in single-agent RL frameworks. To inspire more research on LLM-based MARL, in this letter, we survey the existing LLM-based single-agent and multi-agent RL frameworks and provide potential directions for future research. In particular, we focus on the cooperative tasks of multiple agents with a common goal and communication among them. We also consider human-in/on-the-loop scenarios enabled by the language component in the framework.
Submitted 17 May, 2024;
originally announced May 2024.
-
The unluckiest star: A spectroscopically confirmed repeated partial tidal disruption event AT 2022dbl
Authors:
Zheyu Lin,
Ning Jiang,
Tinggui Wang,
Xu Kong,
Dongyue Li,
Han He,
Yibo Wang,
Jiazheng Zhu,
Wentao Li,
Ji-an Jiang,
Avinash Singh,
Rishabh Singh Teja,
D. K. Sahu,
Chichuan Jin,
Keiichi Maeda,
Shifeng Huang
Abstract:
The unluckiest star orbits a supermassive black hole elliptically. Every time it reaches the pericenter, it shallowly enters the tidal radius and gets partially tidally disrupted, producing a series of flares. Confirmation of a repeated partial tidal disruption event (pTDE) requires not only evidence to rule out other types of transients, but also proof that only one star is involved, as TDEs from multiple stars can also produce similar flares. In this letter, we report the discovery of a repeated pTDE, AT 2022dbl. In a quiescent galaxy at z=0.0284, two separate optical/UV flares were observed in 2022 and 2024, with no bright X-ray, radio or mid-infrared counterparts. Compared to the first flare, the second flare has a similar blackbody temperature of ~26,000 K, slightly lower peak luminosity, and slower rise and fall phases. Compared to the ZTF TDEs, their blackbody parameters, bolometric energies and light curve shapes are all similar. The spectra taken during the second flare show a steeper continuum than the late-time spectra of the previous flare, consistent with a newly risen flare. More importantly, the possibility of two independent TDEs can be largely ruled out because the optical spectra taken around the peak of the two flares exhibit highly similar broad Balmer, N III and possible He II emission lines, especially the extreme ~4100Å emission lines. This represents the first robust spectroscopic evidence for a repeated pTDE, which can soon be verified by observing the third flare, given its short orbital period.
Submitted 17 May, 2024;
originally announced May 2024.
-
One registration is worth two segmentations
Authors:
Shiqi Huang,
Tingfa Xu,
Ziyi Shen,
Shaheer Ullah Saeed,
Wen Yan,
Dean Barratt,
Yipeng Hu
Abstract:
The goal of image registration is to establish spatial correspondence between two or more images, traditionally through dense displacement fields (DDFs) or parametric transformations (e.g., rigid, affine, and splines). Rethinking the existing paradigms of achieving alignment via spatial transformations, we uncover an alternative but more intuitive correspondence representation: a set of corresponding regions-of-interest (ROI) pairs, which we demonstrate to have sufficient representational capability as other correspondence representation methods. Further, it is neither necessary nor sufficient for these ROIs to hold specific anatomical or semantic significance. In turn, we formulate image registration as searching for the same set of corresponding ROIs from both moving and fixed images - in other words, two multi-class segmentation tasks on a pair of images. For a general-purpose and practical implementation, we integrate the segment anything model (SAM) into our proposed algorithms, resulting in a SAM-enabled registration (SAMReg) that does not require any training data, gradient-based fine-tuning or engineered prompts. We experimentally show that the proposed SAMReg is capable of segmenting and matching multiple ROI pairs, which establish sufficiently accurate correspondences, in three clinical applications of registering prostate MR, cardiac MR and abdominal CT images. Based on metrics including Dice and target registration errors on anatomical structures, the proposed registration outperforms both intensity-based iterative algorithms and DDF-predicting learning-based networks, even yielding competitive performance with weakly-supervised registration which requires fully-segmented training data.
Submitted 17 May, 2024;
originally announced May 2024.
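The abstract above evaluates registration with Dice and target registration error (TRE). As a reading aid, here is a minimal sketch of the textbook definitions of both metrics. The function names `dice` and `target_registration_error` are illustrative choices, not taken from the paper or from the SAMReg code, and the paper's exact evaluation protocol may differ.

```python
import numpy as np

def dice(mask_a, mask_b):
    """Dice overlap 2|A∩B| / (|A| + |B|) between two binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        # Both masks empty: treat as perfect agreement (a convention, assumed here).
        return 1.0
    return float(2.0 * np.logical_and(a, b).sum() / total)

def target_registration_error(points_fixed, points_warped):
    """Mean Euclidean distance between corresponding landmark pairs."""
    p = np.asarray(points_fixed, dtype=float)
    q = np.asarray(points_warped, dtype=float)
    return float(np.linalg.norm(p - q, axis=1).mean())

# Toy inputs: one ROI overlapping in a single pixel, and two landmark pairs.
overlap = dice([[1, 1], [0, 0]], [[1, 0], [0, 0]])
tre = target_registration_error([[0, 0], [3, 4]], [[0, 0], [0, 0]])
```

A higher Dice and a lower TRE both indicate better alignment. TRE is in whatever units the landmark coordinates use.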
-
Occupancy-SLAM: Simultaneously Optimizing Robot Poses and Continuous Occupancy Map
Authors:
Liang Zhao,
Yingyu Wang,
Shoudong Huang
Abstract:
In this paper, we propose an optimization-based SLAM approach to simultaneously optimize the robot trajectory and the occupancy map using 2D laser scans (and odometry) information. The key novelty is that the robot poses and the occupancy map are optimized together, which is significantly different from existing occupancy mapping strategies where the robot poses need to be obtained first before the map can be estimated. In our formulation, the map is represented as a continuous occupancy map where each 2D point in the environment has a corresponding evidence value. The Occupancy-SLAM problem is formulated as an optimization problem where the variables include all the robot poses and the occupancy values at the selected discrete grid cell nodes. We propose a variation of the Gauss-Newton method to solve this newly formulated problem, obtaining the optimized occupancy map and robot trajectory together with their uncertainties. Our algorithm is an offline approach since it is based on batch optimization and the number of variables involved is large. Evaluations using simulations and publicly available practical 2D laser datasets demonstrate that the proposed approach can estimate the maps and robot trajectories more accurately than the state-of-the-art techniques, when a relatively accurate initial guess is provided to our algorithm. A video showing the convergence process of the proposed Occupancy-SLAM and a comparison of results with Cartographer can be found at https://youtu.be/4oLyVEUC4iY.
Submitted 17 May, 2024;
originally announced May 2024.
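The abstract above says the problem is solved with a variation of the Gauss-Newton method. The sketch below is not that variation. It is the plain textbook Gauss-Newton iteration, applied to a toy curve-fitting problem chosen purely for illustration. The names `gauss_newton`, `residual`, and `jacobian` are hypothetical.

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, iters=20, tol=1e-10):
    """Plain Gauss-Newton: repeatedly solve the linearized problem J * step ≈ -r."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        r = residual(x)
        J = jacobian(x)
        # Least-squares solve of J @ step = -r (equivalent to the normal equations).
        step, *_ = np.linalg.lstsq(J, -r, rcond=None)
        x = x + step
        if np.linalg.norm(step) < tol:
            break
    return x

# Toy problem (assumed for illustration): fit y = a * exp(b * t) to noise-free samples.
t = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * t)

def residual(p):
    a, b = p
    return a * np.exp(b * t) - y

def jacobian(p):
    a, b = p
    e = np.exp(b * t)
    # Columns are the partial derivatives of the residual w.r.t. a and b.
    return np.column_stack([e, a * t * e])

estimate = gauss_newton(residual, jacobian, x0=[1.0, -1.0])
```

Plain Gauss-Newton has no line search or damping, so it generally needs a starting point reasonably close to the solution, which is consistent with the abstract's remark about a relatively accurate initial guess.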
-
Beyond static AI evaluations: advancing human interaction evaluations for LLM harms and risks
Authors:
Lujain Ibrahim,
Saffron Huang,
Lama Ahmad,
Markus Anderljung
Abstract:
Model evaluations are central to understanding the safety, risks, and societal impacts of AI systems. While most real-world AI applications involve human-AI interaction, most current evaluations (e.g., common benchmarks) of AI models do not. Instead, they incorporate human factors in limited ways, assessing the safety of models in isolation, thereby falling short of capturing the complexity of human-model interactions. In this paper, we discuss and operationalize a definition of an emerging category of evaluations -- "human interaction evaluations" (HIEs) -- which focus on the assessment of human-model interactions or the process and the outcomes of humans using models. First, we argue that HIEs can be used to increase the validity of safety evaluations, assess direct human impact and interaction-specific harms, and guide future assessments of models' societal impact. Second, we propose a safety-focused HIE design framework -- containing a human-LLM interaction taxonomy -- with three stages: (1) identifying the risk or harm area, (2) characterizing the use context, and (3) choosing the evaluation parameters. Third, we apply our framework to two potential evaluations for overreliance and persuasion risks. Finally, we conclude with tangible recommendations for addressing concerns over costs, replicability, and unrepresentativeness of HIEs.
Submitted 12 July, 2024; v1 submitted 17 May, 2024;
originally announced May 2024.
-
When Large Language Model Meets Optimization
Authors:
Sen Huang,
Kaixiang Yang,
Sheng Qi,
Rui Wang
Abstract:
Optimization algorithms and large language models (LLMs) enhance decision-making in dynamic environments by integrating artificial intelligence with traditional techniques. LLMs, with extensive domain knowledge, facilitate intelligent modeling and strategic decision-making in optimization, while optimization algorithms refine LLM architectures and output quality. This synergy offers novel approaches for advancing general AI, addressing both the computational challenges of complex problems and the application of LLMs in practical scenarios. This review outlines the progress and potential of combining LLMs with optimization algorithms, providing insights for future research directions.
Submitted 16 May, 2024;
originally announced May 2024.
-
Advances in Robust Federated Learning: Heterogeneity Considerations
Authors:
Chuan Chen,
Tianchi Liao,
Xiaojun Deng,
Zihou Wu,
Sheng Huang,
Zibin Zheng
Abstract:
In the field of heterogeneous federated learning (FL), the key challenge is to efficiently and collaboratively train models across multiple clients with different data distributions, model structures, task objectives, computational capabilities, and communication resources. This diversity leads to significant heterogeneity, which increases the complexity of model training. In this paper, we first outline the basic concepts of heterogeneous federated learning and summarize the research challenges in federated learning in terms of five aspects: data, model, task, device, and communication. In addition, we explore how existing state-of-the-art approaches cope with the heterogeneity of federated learning, and categorize and review these approaches at three different levels: data-level, model-level, and architecture-level. Subsequently, the paper extensively discusses privacy-preserving strategies in heterogeneous federated learning environments. Finally, the paper discusses current open issues and directions for future research, aiming to promote the further development of heterogeneous federated learning.
Submitted 16 May, 2024;
originally announced May 2024.
-
Fermionic quantum criticality through the lens of topological holography
Authors:
Sheng-Jie Huang
Abstract:
We utilize the topological holographic framework to characterize and gain insights into the nature of quantum critical points and gapless phases in fermionic quantum systems. Topological holography is a general framework that describes the generalized global symmetry and the symmetry charges of a local quantum system in terms of a slab of a topological order, termed as the symmetry topological field theory (SymTFT), in one higher dimension. In this work, we consider a generalization of the topological holographic picture for $(1+1)d$ fermionic quantum phases of matter. We discuss how spin structures are encoded in the SymTFT and establish the connection between the formal fermionization formula in quantum field theory and the choice of fermionic gapped boundary conditions of the SymTFT. We demonstrate the identification and the characterization of the fermionic gapped phases and phase transitions through detailed analysis of various examples, including the fermionic systems with $\mathbb{Z}_{2}^{F}$, $\mathbb{Z}_{2} \times \mathbb{Z}_{2}^{F}$, $\mathbb{Z}_{4}^{F}$, and the fermionic version of the non-invertible $\text{Rep}(S_{3})$ symmetry. Our work uncovers many exotic fermionic quantum critical points and gapless phases, including two kinds of fermionic symmetry enriched quantum critical points, a fermionic gapless symmetry protected topological (SPT) phase, and a fermionic gapless spontaneous symmetry breaking (SSB) phase that breaks the fermionic non-invertible symmetry.
Submitted 18 June, 2024; v1 submitted 15 May, 2024;
originally announced May 2024.
-
Search for the leptonic decays $D^{*+}\to e^+ν_e$ and $D^{*+}\to μ^+ν_μ$
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
M. Albrecht,
R. Aliberti,
A. Amoroso,
M. R. An,
Q. An,
Y. Bai,
O. Bakina,
R. Baldini Ferroli,
I. Balossino,
Y. Ban,
V. Batozskaya,
D. Becker,
K. Begzsuren,
N. Berger,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
J. Bloms,
A. Bortone,
I. Boyko
, et al. (559 additional authors not shown)
Abstract:
We present the first search for the leptonic decays $D^{*+}\to e^+ν_e$ and $D^{*+}\to μ^+ν_μ$ by analyzing a data sample of electron-positron collisions recorded with the BESIII detector at center-of-mass energies between 4.178 and 4.226 GeV, corresponding to an integrated luminosity of 6.32 fb$^{-1}$. No significant signal is observed. The upper limits on the branching fractions for $D^{*+}\to e^+ν_e$ and $D^{*+}\to μ^+ν_μ$ are set to be $1.1 \times 10^{-5}$ and $4.3 \times 10^{-6}$ at 90% confidence level, respectively.
Submitted 14 May, 2024;
originally announced May 2024.
-
Longitudinal Structure of Quark-Gluon Plasma Unveiled Through Nuclear Deformations
Authors:
Chunjian Zhang,
Shengli Huang,
Jiangyong Jia
Abstract:
The study of quark-gluon plasma (QGP) is hindered by our limited understanding of its initial conditions, particularly its longitudinal structure. We propose a novel approach that entails analyzing collisions involving nuclei of similar masses but different deformations. This strategy allows us to vary the initial conditions and collective expansion of the QGP, while minimizing the influence of non-flow correlations. Using a dynamical transport model, we have for the first time extracted the complete longitudinal structure of elliptic flow ($v_2$). Our findings reveal that although deformation significantly enhances the overall magnitude of $v_2$, it does not alter its longitudinal profile. This approach not only enables the separation of the rapidity dependence of flow from its rapidity decorrelations but also prompts further investigation into other nuclear structural features, such as nuclear skin thickness, to advance our understanding of the QGP's initial conditions.
Submitted 14 May, 2024;
originally announced May 2024.
-
A Determination of the Local Gravitational Acceleration for the Tsinghua Tabletop Kibble Balance
Authors:
Weibo Liu,
Nanjia Li,
Yongchao Ma,
Ruo Hu,
Shuqing Wu,
Wei Zhao,
Songling Huang,
Shisong Li
Abstract:
The Kibble balance requires a measurement of the local gravitational acceleration, $g$, with a typical relative measurement uncertainty of $10^{-9}$. In this paper, the determination of $g$ for the Tsinghua tabletop Kibble balance is presented. A polynomial fitting method is proposed for blind transfers of the absolute gravitational acceleration using relative gravimeters, showing agreement with the value obtained by the tide correction within a few parts in $10^{9}$. Horizontal and vertical gravity gradients are extracted by mapping the gravity distribution at different heights. The self-attraction effect of major components in the experiment, as well as some time-varying systematic effects, are modeled. The final determination of the gravitational acceleration at the mass position, with an uncertainty of 5.4 $μ$Gal ($k=2$), is achieved for the Tsinghua tabletop Kibble balance experiment.
Submitted 20 May, 2024; v1 submitted 14 May, 2024;
originally announced May 2024.
-
Search for the radiative transition $χ_{c1}(3872)\toγψ_2(3823)$
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
O. Afedulidis,
X. C. Ai,
R. Aliberti,
A. Amoroso,
M. R. An,
Q. An,
Y. Bai,
O. Bakina,
I. Balossino,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko
, et al. (635 additional authors not shown)
Abstract:
Using 9.0 $\rm fb^{-1}$ of $e^+e^-$ collision data collected at center-of-mass energies from 4.178 to 4.278 GeV with the BESIII detector at the BEPCII collider, we perform the first search for the radiative transition $χ_{c1}(3872)\toγψ_2(3823)$. No $χ_{c1}(3872)\toγψ_2(3823)$ signal is observed. The upper limit on the ratio of branching fractions $\mathcal{B}(χ_{c1}(3872)\toγψ_2(3823), ψ_2(3823)\toγχ_{c1})/\mathcal{B}(χ_{c1}(3872)\toπ^+π^- J/ψ)$ is set as 0.075 at the 90% confidence level. Our result contradicts theoretical predictions under the assumption that the $χ_{c1}(3872)$ is the pure charmonium state $χ_{c1}(2P)$.
Submitted 13 May, 2024;
originally announced May 2024.
-
Phase coding semi-quantum key distribution system based on the Single-state protocol
Authors:
Qincheng Hou,
Siying Huang,
Naida Mo,
Jindong Wang,
Zhengjun Wei,
Yafei Yu,
Tianming Zhao,
Zhiming Zhang
Abstract:
Semi-quantum key distribution (SQKD) allows sharing random keys between a quantum user and a classical user. However, implementing classical user operations is challenging, posing a hurdle to achieving the Single-state protocol. By using the "selective modulation" method, the feasibility of SQKD is verified in principle. The proposal of the selective modulation method enables the realization of other protocols for SQKD. To advance experimental progress in SQKD, we propose and implement a phase-encoded semi-quantum key distribution system based on the Single-state protocol and the "selective modulation" method. The system operates at a frequency of 100 MHz and an average photon number of 0.1. The interference contrast reached 96.52%, the average quantum bit error rate was 1.19%, and the raw key rate reached 88 kbps. Our experimental results demonstrate the feasibility and stability of the proposed phase-encoded semi-quantum key distribution system. Furthermore, by leveraging the "selective modulation" scheme proposed in this paper, we develop a comprehensive theoretical description of selective modulation. Through an analysis of quantum state evolution, we assess the security of our system, ultimately demonstrating its resilience against attacks targeting quantum states. The classical user of our system requires only two optical devices, significantly reducing the equipment requirements and enhancing its application potential. This work validates the feasibility of semi-quantum key distribution experiments and provides ideas for future research on semi-quantum key distribution experiments and security studies.
Submitted 13 May, 2024;
originally announced May 2024.
-
DrugLLM: Open Large Language Model for Few-shot Molecule Generation
Authors:
Xianggen Liu,
Yan Guo,
Haoran Li,
Jin Liu,
Shudong Huang,
Bowen Ke,
Jiancheng Lv
Abstract:
Large Language Models (LLMs) have made great strides in areas such as language processing and computer vision. Despite the emergence of diverse techniques to improve few-shot learning capacity, current LLMs fall short in handling the languages in biology and chemistry. For example, they struggle to capture the relationship between molecule structure and pharmacochemical properties. Consequently, the few-shot learning capacity of small-molecule drug modification remains impeded. In this work, we introduce DrugLLM, an LLM tailored for drug design. During the training process, we employ Group-based Molecular Representation (GMR) to represent molecules, arranging them in sequences that reflect modifications aimed at enhancing specific molecular properties. DrugLLM learns how to modify molecules in drug discovery by predicting the next molecule based on past modifications. Extensive computational experiments demonstrate that DrugLLM can generate new molecules with expected properties based on limited examples, presenting a powerful few-shot molecule generation capacity.
Submitted 7 May, 2024;
originally announced May 2024.
-
Measurement of the ${e}^{+}{e}^{-}\to p \bar{p}π^{0}$ cross section at $\sqrt{s}=2.1000-3.0800$ GeV
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
O. Afedulidis,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
I. Balossino,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere
, et al. (639 additional authors not shown)
Abstract:
The process $e^{+}e^{-}\to p\bar{p}π^{0}$ is studied at 20 center-of-mass energies ranging from 2.1000 to 3.0800 GeV using 636.8 pb$^{-1}$ of data collected with the BESIII detector operating at the BEPCII collider. The Born cross sections for $e^{+}e^{-}\to p\bar{p}π^{0}$ are measured with high precision. Since the lowest center-of-mass energy, 2.1000 GeV, is less than 90 MeV above the $p\bar{p}π^0$ energy threshold, we can probe the threshold behavior for this reaction. However, no anomalous threshold enhancement is found in the cross sections for $e^{+}e^{-}\to p\bar{p}π^{0}$.
Submitted 10 May, 2024;
originally announced May 2024.
-
DARA: Domain- and Relation-aware Adapters Make Parameter-efficient Tuning for Visual Grounding
Authors:
Ting Liu,
Xuyang Liu,
Siteng Huang,
Honggang Chen,
Quanjun Yin,
Long Qin,
Donglin Wang,
Yue Hu
Abstract:
Visual grounding (VG) is a challenging task to localize an object in an image based on a textual description. The recent surge in the scale of VG models has substantially improved performance, but also introduced a significant burden on computational costs during fine-tuning. In this paper, we explore applying parameter-efficient transfer learning (PETL) to efficiently transfer the pre-trained vision-language knowledge to VG. Specifically, we propose DARA, a novel PETL method comprising Domain-aware Adapters (DA Adapters) and Relation-aware Adapters (RA Adapters) for VG. DA Adapters first transfer intra-modality representations to be more fine-grained for the VG domain. Then RA Adapters share weights to bridge the relation between two modalities, improving spatial reasoning. Empirical results on widely-used benchmarks demonstrate that DARA achieves the best accuracy while saving numerous updated parameters compared to the full fine-tuning and other PETL methods. Notably, with only 2.13% tunable backbone parameters, DARA improves average accuracy by 0.81% across the three benchmarks compared to the baseline model. Our code is available at https://github.com/liuting20/DARA.
Submitted 8 June, 2024; v1 submitted 9 May, 2024;
originally announced May 2024.
-
You Only Cache Once: Decoder-Decoder Architectures for Language Models
Authors:
Yutao Sun,
Li Dong,
Yi Zhu,
Shaohan Huang,
Wenhui Wang,
Shuming Ma,
Quanlu Zhang,
Jianyong Wang,
Furu Wei
Abstract:
We introduce a decoder-decoder architecture, YOCO, for large language models, which only caches key-value pairs once. It consists of two components, i.e., a cross-decoder stacked upon a self-decoder. The self-decoder efficiently encodes global key-value (KV) caches that are reused by the cross-decoder via cross-attention. The overall model behaves like a decoder-only Transformer, although YOCO only caches once. The design substantially reduces GPU memory demands, yet retains global attention capability. Additionally, the computation flow enables prefilling to early exit without changing the final output, thereby significantly speeding up the prefill stage. Experimental results demonstrate that YOCO achieves favorable performance compared to Transformer in various settings of scaling up model size and number of training tokens. We also extend YOCO to 1M context length with near-perfect needle retrieval accuracy. The profiling results show that YOCO improves inference memory, prefill latency, and throughput by orders of magnitude across context lengths and model sizes. Code is available at https://aka.ms/YOCO.
Submitted 9 May, 2024; v1 submitted 8 May, 2024;
originally announced May 2024.
-
Circularly polarized light irradiated ferromagnetic MnBi$_2$Te$_4$: the long-sought ideal Weyl semimetal
Authors:
Shuai Fan,
Shengpu Huang,
Zhuo Chen,
Fangyang Zhan,
Xian-Yong Ding,
Da-Shuai Ma,
Rui Wang
Abstract:
The interaction between light and non-trivial energy band topology allows for the precise manipulation of topological quantum states, which has attracted intensive interest in condensed matter physics. In this work, using first-principles calculations, we studied the topological transition of ferromagnetic (FM) MnBi$_2$Te$_4$ upon irradiation with circularly polarized light (CPL). We revealed that MnBi$_2$Te$_4$ can be driven from an FM insulator to a Weyl semimetal with a minimum number of Weyl points, i.e., two Weyl points in systems without time-reversal symmetry. More importantly, in FM MnBi$_2$Te$_4$ with out-of-plane easy magnetization axis, we found that the band dispersion of the WP evolves from Type-II to Type-III and finally to Type-I when the light intensity increases. Moreover, we show that the profile of the characteristic Fermi arc of the Weyl semimetal phase is sensitive to changes in light intensity, which enables efficient manipulation of the Fermi arc length of FM MnBi$_2$Te$_4$ in experiments. In addition, for FM MnBi$_2$Te$_4$ with in-plane easy magnetization axis, the system becomes a Type-I Weyl semimetal under CPL irradiation. With controllable band dispersion, length of Fermi arc, and minimum number of WPs, our results indicate that CPL-irradiated FM MnBi$_2$Te$_4$ is an ideal platform to study novel transport phenomena in Weyl semimetals with distinct band dispersion.
Submitted 7 May, 2024;
originally announced May 2024.
-
Continual Learning in the Presence of Repetition
Authors:
Hamed Hemati,
Lorenzo Pellegrini,
Xiaotian Duan,
Zixuan Zhao,
Fangfang Xia,
Marc Masana,
Benedikt Tscheschner,
Eduardo Veas,
Yuxiang Zheng,
Shiji Zhao,
Shao-Yuan Li,
Sheng-Jun Huang,
Vincenzo Lomonaco,
Gido M. van de Ven
Abstract:
Continual learning (CL) provides a framework for training models in ever-evolving environments. Although re-occurrence of previously seen objects or tasks is common in real-world problems, the concept of repetition in the data stream is not often considered in standard benchmarks for CL. Unlike with the rehearsal mechanism in buffer-based strategies, where sample repetition is controlled by the strategy, repetition in the data stream naturally stems from the environment. This report provides a summary of the CLVision challenge at CVPR 2023, which focused on the topic of repetition in class-incremental learning. The report initially outlines the challenge objective and then describes three solutions proposed by finalist teams that aim to effectively exploit the repetition in the stream to learn continually. The experimental results from the challenge highlight the effectiveness of ensemble-based solutions that employ multiple versions of similar modules, each trained on different but overlapping subsets of classes. This report underscores the transformative potential of taking a different perspective in CL by employing repetition in the data stream to foster innovative strategy design.
Submitted 7 May, 2024;
originally announced May 2024.
-
Deterministic Expander Routing: Faster and More Versatile
Authors:
Yi-Jun Chang,
Shang-En Huang,
Hsin-Hao Su
Abstract:
We consider the expander routing problem formulated by Ghaffari, Kuhn, and Su (PODC 2017), where the goal is to route all the tokens to their destinations given that each vertex is the source and the destination of at most $\deg(v)$ tokens. They developed $\textit{randomized algorithms}$ that solve this problem in $\text{poly}(φ^{-1}) \cdot 2^{O(\sqrt{\log n \log \log n})}$ rounds in the $\textsf{CONGEST}$ model, where $φ$ is the conductance of the graph. Later, Ghaffari and Li (DISC 2018) gave an improved algorithm. However, both algorithms are randomized, which means that all the resulting applications are also randomized. Recently, Chang and Saranurak (FOCS 2020) gave a deterministic algorithm that solves an expander routing instance in $2^{O(\log^{2/3} n \cdot \log^{1/3} \log n)}$ rounds. The deterministic algorithm is less efficient and does not allow preprocessing/query tradeoffs, which precludes the de-randomization of algorithms that require this feature, such as the $k$-clique enumeration algorithm in general graphs.
The main contribution of our work is a new deterministic expander routing algorithm that not only matches the randomized bound of [GKS 2017] but also allows preprocessing/query tradeoffs. Our algorithm solves a single instance of routing query in $2^{{O}(\sqrt{\log n \cdot \log \log n})}$ rounds. Our algorithm achieves the following preprocessing and query tradeoffs: For $0 < ε< 1$, we can answer every routing query in $\log^{O(1/ε)} n$ rounds at the cost of a $(n^{O(ε)} + \log^{O(1/ε)} n)$-round preprocessing procedure. Combining this with the approach of Censor-Hillel, Leitersdorf, and Vulakh (PODC 2022), we obtain a near-optimal $\tilde{O}(n^{1-2/k})$-round deterministic algorithm for $k$-clique enumeration in general graphs, improving the previous state-of-the-art $n^{1-2/k+o(1)}$.
Submitted 6 May, 2024;
originally announced May 2024.
-
Learning Robot Soccer from Egocentric Vision with Deep Reinforcement Learning
Authors:
Dhruva Tirumala,
Markus Wulfmeier,
Ben Moran,
Sandy Huang,
Jan Humplik,
Guy Lever,
Tuomas Haarnoja,
Leonard Hasenclever,
Arunkumar Byravan,
Nathan Batchelor,
Neil Sreendra,
Kushal Patel,
Marlon Gwira,
Francesco Nori,
Martin Riedmiller,
Nicolas Heess
Abstract:
We apply multi-agent deep reinforcement learning (RL) to train end-to-end robot soccer policies with fully onboard computation and sensing via egocentric RGB vision. This setting reflects many challenges of real-world robotics, including active perception, agile full-body control, and long-horizon planning in a dynamic, partially-observable, multi-agent domain. We rely on large-scale, simulation-based data generation to obtain complex behaviors from egocentric vision which can be successfully transferred to physical robots using low-cost sensors. To achieve adequate visual realism, our simulation combines rigid-body physics with learned, realistic rendering via multiple Neural Radiance Fields (NeRFs). We combine teacher-based multi-agent RL and cross-experiment data reuse to enable the discovery of sophisticated soccer strategies. We analyze active-perception behaviors including object tracking and ball seeking that emerge when simply optimizing perception-agnostic soccer play. The agents display levels of performance and agility equivalent to those of policies with access to privileged, ground-truth state. To our knowledge, this paper constitutes a first demonstration of end-to-end training for multi-agent robot soccer, mapping raw pixel observations to joint-level actions, that can be deployed in the real world. Videos of the game-play and analyses can be seen on our website https://sites.google.com/view/vision-soccer .
Submitted 3 May, 2024;
originally announced May 2024.
-
The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights
Authors:
Wenhao Zhu,
Shujian Huang,
Fei Yuan,
Cheng Chen,
Jiajun Chen,
Alexandra Birch
Abstract:
Bridging the significant gap between large language models' English and non-English performance presents a great challenge. While some previous studies attempt to mitigate this gap with translated training data, the recently proposed question alignment approach leverages the model's English expertise to improve multilingual performance with minimum usage of expensive, error-prone translation. In this paper, we explore how broadly this method can be applied by examining its effects in reasoning with executable code and reasoning with common sense. We also explore how to apply this approach efficiently to extremely large language models using proxy-tuning. Experimental results on multilingual reasoning benchmarks mGSM, mSVAMP and xCSQA demonstrate that the question alignment approach can be used to boost multilingual performance across diverse reasoning scenarios, model families, and sizes. For instance, when applied to the LLaMA2 models, our method brings an average accuracy improvement of 12.2% on mGSM even with the 70B model. To understand the mechanism of its success, we analyze representation space, chain-of-thought and translation data scales, which reveals how question translation training strengthens language alignment within LLMs and shapes their working patterns.
Submitted 29 June, 2024; v1 submitted 2 May, 2024;
originally announced May 2024.
-
Emergent Non-Abelian Thouless Pumping Induced by the Quasiperiodic Disorder
Authors:
Sen Huang,
Yan-Qing Zhu,
Zhi Li
Abstract:
We investigate the non-Abelian Thouless pumping in a disorder-tunable Lieb chain with degenerate flat bands. The results reveal that quasiperiodic disorder will cause a topological phase transition from the trivial (without non-Abelian Thouless pumping) to the non-trivial (with non-Abelian Thouless pumping) phase. The underlying mechanism is that the monopole originally outside the topological region can be driven into the topological region due to the introduction of quasiperiodic disorder. Moreover, since the corresponding monopole will turn into a nodal line that spreads beyond the boundaries of the topological region, large disorder strength will result in the disappearance of non-Abelian Thouless pumping. Furthermore, we numerically simulate the Thouless pumping of non-Abelian systems, and the evolution of the center-of-mass displacement is consistent with the Chern number. Finally, we discuss the localization properties of the system and find that, similar to [PRL 130, 206401(2023)], the inverse Anderson transition does not occur in the system with the increase of quasiperiodic strength, while the system still maintains the coexistence of localized and extended states.
Submitted 29 April, 2024;
originally announced April 2024.
-
Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations
Authors:
Puhao Li,
Tengyu Liu,
Yuyang Li,
Muzhi Han,
Haoran Geng,
Shu Wang,
Yixin Zhu,
Song-Chun Zhu,
Siyuan Huang
Abstract:
Autonomous robotic systems capable of learning novel manipulation tasks are poised to transform industries from manufacturing to service automation. However, modern methods (e.g., VIP and R3M) still face significant hurdles, notably the domain gap among robotic embodiments and the sparsity of successful task executions within specific action spaces, resulting in misaligned and ambiguous task representations. We introduce Ag2Manip (Agent-Agnostic representations for Manipulation), a framework aimed at surmounting these challenges through two key innovations: a novel agent-agnostic visual representation derived from human manipulation videos, with the specifics of embodiments obscured to enhance generalizability; and an agent-agnostic action representation abstracting a robot's kinematics to a universal agent proxy, emphasizing crucial interactions between end-effector and object. Ag2Manip's empirical validation across simulated benchmarks like FrankaKitchen, ManiSkill, and PartManip shows a 325% increase in performance, achieved without domain-specific demonstrations. Ablation studies underline the essential contributions of the visual and action representations to this success. Extending our evaluations to the real world, Ag2Manip significantly improves imitation learning success rates from 50% to 77.5%, demonstrating its effectiveness and generalizability across both simulated and physical environments.
Submitted 26 April, 2024;
originally announced April 2024.
-
How Does Conversation Length Impact User's Satisfaction? A Case Study of Length-Controlled Conversations with LLM-Powered Chatbots
Authors:
Shih-Hong Huang,
Ya-Fang Lin,
Zeyu He,
Chieh-Yang Huang,
Ting-Hao 'Kenneth' Huang
Abstract:
Users can discuss a wide range of topics with large language models (LLMs), but they do not always prefer solving problems or getting information through lengthy conversations. This raises an intriguing HCI question: How does instructing LLMs to engage in longer or shorter conversations affect conversation quality? In this paper, we developed two Slack chatbots using GPT-4 with the ability to vary conversation lengths and conducted a user study. Participants asked the chatbots both highly and less conversable questions, engaging in dialogues with 0, 3, 5, and 7 conversational turns. We found that the conversation quality does not differ drastically across different conditions, while participants had mixed reactions. Our study demonstrates LLMs' ability to change conversation length and the potential benefits for users resulting from such changes, but we caution that changes in text form may not necessarily imply changes in quality or content.
Submitted 25 April, 2024;
originally announced April 2024.
-
PhyRecon: Physically Plausible Neural Scene Reconstruction
Authors:
Junfeng Ni,
Yixin Chen,
Bohan Jing,
Nan Jiang,
Bin Wang,
Bo Dai,
Puhao Li,
Yixin Zhu,
Song-Chun Zhu,
Siyuan Huang
Abstract:
Neural implicit representations have gained popularity in multi-view 3D reconstruction. However, most previous work struggles to yield physically plausible results, limiting their utility in domains requiring rigorous physical accuracy, such as embodied AI and robotics. This lack of plausibility stems from the absence of physics modeling in existing methods and their inability to recover intricate geometrical structures. In this paper, we introduce PhyRecon, the first approach to leverage both differentiable rendering and differentiable physics simulation to learn implicit surface representations. PhyRecon features a novel differentiable particle-based physical simulator built on neural implicit representations. Central to this design is an efficient transformation between SDF-based implicit representations and explicit surface points via our proposed Surface Points Marching Cubes (SP-MC), enabling differentiable learning with both rendering and physical losses. Additionally, PhyRecon models both rendering and physical uncertainty to identify and compensate for inconsistent and inaccurate monocular geometric priors. This physical uncertainty further facilitates a novel physics-guided pixel sampling to enhance the learning of slender structures. By integrating these techniques, our model supports differentiable joint modeling of appearance, geometry, and physics. Extensive experiments demonstrate that PhyRecon significantly outperforms all state-of-the-art methods. Our results also exhibit superior physical stability in physical simulators, with at least a 40% improvement across all datasets, paving the way for future physics-based applications.
Submitted 2 June, 2024; v1 submitted 25 April, 2024;
originally announced April 2024.
-
TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models
Authors:
Haomiao Ni,
Bernhard Egger,
Suhas Lohit,
Anoop Cherian,
Ye Wang,
Toshiaki Koike-Akino,
Sharon X. Huang,
Tim K. Marks
Abstract:
Text-conditioned image-to-video generation (TI2V) aims to synthesize a realistic video starting from a given image (e.g., a woman's photo) and a text description (e.g., "a woman is drinking water."). Existing TI2V frameworks often require costly training on video-text datasets and specific model designs for text and image conditioning. In this paper, we propose TI2V-Zero, a zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V) diffusion model to be conditioned on a provided image, enabling TI2V generation without any optimization, fine-tuning, or introducing external modules. Our approach leverages a pretrained T2V diffusion foundation model as the generative prior. To guide video generation with the additional image input, we propose a "repeat-and-slide" strategy that modulates the reverse denoising process, allowing the frozen diffusion model to synthesize a video frame-by-frame starting from the provided image. To ensure temporal continuity, we employ a DDPM inversion strategy to initialize Gaussian noise for each newly synthesized frame and a resampling technique to help preserve visual details. We conduct comprehensive experiments on both domain-specific and open-domain datasets, where TI2V-Zero consistently outperforms a recent open-domain TI2V model. Furthermore, we show that TI2V-Zero can seamlessly extend to other tasks such as video infilling and prediction when provided with more images. Its autoregressive design also supports long video generation.
Submitted 24 April, 2024;
originally announced April 2024.
-
Research on OPF control of three-phase four-wire low-voltage distribution network considering uncertainty
Authors:
Rui Wang,
Xiaoqing Bai,
Shengquan Huang,
Shoupu Wei
Abstract:
As power systems become more complex and uncertain, low-voltage distribution networks face numerous challenges, including three-phase imbalances caused by asymmetrical loads and distributed energy resources. To address these issues, we propose a robust stochastic optimization (RSO) based optimal power flow (OPF) control method for three-phase, four-wire low-voltage distribution networks that considers uncertainty. Using historical data and deep learning classification methods, the proposed method simulates optimal system behaviour without requiring communication infrastructure. The simulation results verify that the proposed method effectively controls the voltage and current amplitude while minimizing the operational cost and three-phase imbalance within acceptable limits. The proposed method shows promise for managing uncertainties and optimizing performance in low-voltage distribution networks.
Submitted 23 April, 2024;
originally announced April 2024.
-
Taming Diffusion Probabilistic Models for Character Control
Authors:
Rui Chen,
Mingyi Shi,
Shaoli Huang,
Ping Tan,
Taku Komura,
Xuelin Chen
Abstract:
We present a novel character control framework that effectively utilizes motion diffusion probabilistic models to generate high-quality and diverse character animations, responding in real-time to a variety of dynamic user-supplied control signals. At the heart of our method lies a transformer-based Conditional Autoregressive Motion Diffusion Model (CAMDM), which takes as input the character's historical motion and can generate a range of diverse potential future motions conditioned on high-level, coarse user control. To meet the demands for diversity, controllability, and computational efficiency required by a real-time controller, we incorporate several key algorithmic designs. These include separate condition tokenization, classifier-free guidance on past motion, and heuristic future trajectory extension, all designed to address the challenges associated with taming motion diffusion probabilistic models for character control. As a result, our work represents the first model that enables real-time generation of high-quality, diverse character animations based on user interactive control, supporting animating the character in multiple styles with a single unified model. We evaluate our method on a diverse set of locomotion skills, demonstrating the merits of our method over existing character controllers. Project page and source codes: https://aiganimation.github.io/CAMDM/
Submitted 23 April, 2024;
originally announced April 2024.
-
Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation
Authors:
Xun Wu,
Shaohan Huang,
Furu Wei
Abstract:
Recent studies have demonstrated the exceptional potential of leveraging human preference datasets to refine text-to-image generative models, enhancing the alignment between generated images and textual prompts. Despite these advances, current human preference datasets are either prohibitively expensive to construct or suffer from a lack of diversity in preference dimensions, resulting in limited applicability for instruction tuning in open-source text-to-image generative models and hindering further exploration. To address these challenges and promote the alignment of generative models through instruction tuning, we leverage multimodal large language models to create VisionPrefer, a high-quality and fine-grained preference dataset that captures multiple preference aspects. We aggregate feedback from AI annotators across four aspects: prompt-following, aesthetic, fidelity, and harmlessness to construct VisionPrefer. To validate the effectiveness of VisionPrefer, we train a reward model VP-Score over VisionPrefer to guide the training of text-to-image generative models, and the preference prediction accuracy of VP-Score is comparable to that of human annotators. Furthermore, we use two reinforcement learning methods to supervised fine-tune generative models to evaluate the performance of VisionPrefer, and extensive experimental results demonstrate that VisionPrefer significantly improves text-image alignment in compositional image generation across diverse aspects, e.g., aesthetic, and generalizes better than previous human-preference metrics across various image distributions. Moreover, VisionPrefer indicates that the integration of AI-generated synthetic data as a supervisory signal is a promising avenue for achieving improved alignment with human preferences in vision generative models.
Submitted 23 April, 2024;
originally announced April 2024.
-
Multi-Head Mixture-of-Experts
Authors:
Xun Wu,
Shaohan Huang,
Wenhui Wang,
Furu Wei
Abstract:
Sparse Mixtures of Experts (SMoE) scales model capacity without significant increases in training and inference costs, but exhibits the following two issues: (1) Low expert activation, where only a small subset of experts are activated for optimization. (2) A lack of fine-grained analytical capabilities for multiple semantic concepts within individual tokens. We propose Multi-Head Mixture-of-Experts (MH-MoE), which employs a multi-head mechanism to split each token into multiple sub-tokens. These sub-tokens are then assigned to and processed by a diverse set of experts in parallel, and seamlessly reintegrated into the original token form. The multi-head mechanism enables the model to collectively attend to information from various representation spaces within different experts, while significantly enhancing expert activation, thus deepening context understanding and alleviating overfitting. Moreover, our MH-MoE is straightforward to implement and decouples from other SMoE optimization methods, making it easy to integrate with other SMoE models for enhanced performance. Extensive experimental results across three tasks (English-focused language modeling, multi-lingual language modeling, and masked multi-modality modeling) demonstrate the effectiveness of MH-MoE.
Submitted 23 April, 2024;
originally announced April 2024.
-
$\texttt{MiniMol}$: A Parameter-Efficient Foundation Model for Molecular Learning
Authors:
Kerstin Kläser,
Błażej Banaszewski,
Samuel Maddrell-Mander,
Callum McLean,
Luis Müller,
Ali Parviz,
Shenyang Huang,
Andrew Fitzgibbon
Abstract:
In biological tasks, data is rarely plentiful as it is generated from hard-to-gather measurements. Therefore, pre-training foundation models on large quantities of available data and then transferring them to low-data downstream tasks is a promising direction. However, how to design effective foundation models for molecular learning remains an open question, with existing approaches typically focusing on models with large parameter capacities. In this work, we propose $\texttt{MiniMol}$, a foundational model for molecular learning with 10 million parameters. $\texttt{MiniMol}$ is pre-trained on a mix of roughly 3300 sparsely defined graph- and node-level tasks of both quantum and biological nature. The pre-training dataset includes approximately 6 million molecules and 500 million labels. To demonstrate the generalizability of $\texttt{MiniMol}$ across tasks, we evaluate it on downstream tasks from the Therapeutic Data Commons (TDC) ADMET group, showing significant improvements over the prior state-of-the-art foundation model across 17 tasks. $\texttt{MiniMol}$ will be a public and open-sourced model for future research.
Submitted 23 April, 2024;
originally announced April 2024.
-
Study of $e^+e^-\toωX(3872)$ and $γX(3872)$ from 4.66 to 4.95 GeV
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
O. Afedulidis,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
I. Balossino,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere
, et al. (634 additional authors not shown)
Abstract:
Using data samples with an integrated luminosity of $4.5~\text{fb}^{-1}$ collected by the BESIII detector at center-of-mass energies ranging from 4.66 to 4.95 GeV, we study the processes of $e^+e^-\toωX(3872)$ and $e^+e^-\toγX(3872)$. With the $e^+e^-\toωX(3872)$ process, the branching fraction ratio $R\equiv\frac{\mathcal{B}(X(3872)\toγJ/ψ)}{\mathcal{B}(X(3872)\toπ^+π^- J/ψ)}$ is measured to be $0.38\pm0.20_\text{stat.}\pm0.01_\text{syst.}$ ($R< 0.83$ at the 90% confidence level). In addition, we measure the ratio of the average cross section of $e^+e^-\toωX(3872)$ to $e^+e^-\toωχ_{c1}(ωχ_{c2})$ to be $σ_{ωX(3872)}/σ_{ωχ_{c1}}~(σ_{ωX(3872)}/σ_{ωχ_{c2}})=5.2\pm1.0_\text{stat.}\pm1.9_\text{syst.}~ (5.5\pm1.1_\text{stat.}\pm2.4_\text{syst.})$. Finally, we search for the process of $e^+e^-\toγX(3872)$, and no obvious signal is observed. The upper limit on the ratio of the average cross section of $e^+e^-\toγX(3872)$ to $e^+e^-\toωX(3872)$ is set as $σ_{γX(3872)}/σ_{ωX(3872)}<0.23$ at the 90% confidence level.
Submitted 21 April, 2024;
originally announced April 2024.
-
A Dataset and Model for Realistic License Plate Deblurring
Authors:
Haoyan Gong,
Yuzheng Feng,
Zhenrong Zhang,
Xianxu Hou,
Jingxin Liu,
Siqi Huang,
Hongbin Liu
Abstract:
Vehicle license plate recognition is a crucial task in intelligent traffic management systems. However, the challenge of achieving accurate recognition persists due to motion blur from fast-moving vehicles. Despite the widespread use of image synthesis approaches in existing deblurring and recognition algorithms, their effectiveness in real-world scenarios remains unproven. To address this, we introduce the first large-scale license plate deblurring dataset named License Plate Blur (LPBlur), captured by a dual-camera system and processed through a post-processing pipeline to avoid misalignment issues. Then, we propose a License Plate Deblurring Generative Adversarial Network (LPDGAN) to tackle license plate deblurring, built from three modules: 1) a Feature Fusion Module to integrate multi-scale latent codes; 2) a Text Reconstruction Module to restore structure through textual modality; 3) a Partition Discriminator Module to enhance the model's perception of details in each letter. Extensive experiments validate the reliability of the LPBlur dataset for both model training and testing, showing that our proposed model outperforms other state-of-the-art motion deblurring methods in realistic license plate deblurring scenarios. The dataset and code are available at https://github.com/haoyGONG/LPDGAN.
Submitted 22 April, 2024; v1 submitted 21 April, 2024;
originally announced April 2024.
-
Mixture of LoRA Experts
Authors:
Xun Wu,
Shaohan Huang,
Furu Wei
Abstract:
LoRA has gained widespread acceptance in the fine-tuning of large pre-trained models to cater to a diverse array of downstream tasks, showcasing notable effectiveness and efficiency, thereby solidifying its position as one of the most prevalent fine-tuning techniques. Due to the modular, plug-and-play nature of LoRA, researchers have explored combining multiple LoRAs to empower models to excel across various downstream tasks. Nonetheless, existing approaches for LoRA fusion grapple with inherent challenges. Direct arithmetic merging may result in the loss of the original pre-trained model's generative capabilities or the distinct identity of LoRAs, thereby yielding suboptimal outcomes. On the other hand, reference tuning-based fusion lacks the flexibility required for the effective combination of multiple LoRAs. In response to these challenges, this paper introduces the Mixture of LoRA Experts (MoLE) approach, which harnesses hierarchical control and unfettered branch selection. The MoLE approach not only achieves superior LoRA fusion performance in comparison to direct arithmetic merging but also retains the crucial flexibility for combining LoRAs effectively. Extensive experimental evaluations conducted in both the Natural Language Processing (NLP) and Vision & Language (V&L) domains substantiate the efficacy of MoLE.
Submitted 21 April, 2024;
originally announced April 2024.
-
Band Structure Engineering in Highly Crystalline Organic Semiconductors
Authors:
Shu-Jen Wang,
Sebastian Hutsch,
Felix Talnack,
Marielle Deconinck,
Shiyu Huang,
Zongbao Zhang,
Hans Kleemann,
Yana Vaynzof,
Stefan C. B. Mannsfeld,
Frank Ortmann,
Karl Leo
Abstract:
Blending of semiconductors for controlling the energy levels (band structure engineering) is an important technique, in particular for optoelectronic applications. The underlying physics rests on delocalized Bloch states, which average over the potential landscape of the blend. For organic semiconductors, it has been shown that two quite different effects, the dielectric constant and electrostatic interaction between molecules, can be used to tune the energy gap and ionization energy of disordered and weakly crystalline organic semiconductor blends. It is so far not known whether the electronic delocalization in organic crystals with large bandwidths can contribute to the energy structure engineering of the blend in a way similar to that in inorganic semiconductors. Here, we investigate the growth of highly ordered organic thin-film blends with a similar chemical structure and show the effect of band structure engineering by spectroscopic methods. We rationalize the experimental results with comprehensive theoretical simulations, showing that the delocalization is a significant effect. Our work paves the way for engineering the band structure of highly ordered organic semiconductor thin films that can be tailored for the desired optoelectronic device application.
Submitted 17 April, 2024;
originally announced April 2024.
-
MixLight: Borrowing the Best of both Spherical Harmonics and Gaussian Models
Authors:
Xinlong Ji,
Fangneng Zhan,
Shijian Lu,
Shi-Sheng Huang,
Hua Huang
Abstract:
Accurately estimating scene lighting is critical for applications such as mixed reality. Existing works estimate illumination by generating illumination maps or regressing illumination parameters. However, generating illumination maps generalizes poorly, and parametric models such as Spherical Harmonic (SH) and Spherical Gaussian (SG) fall short in capturing high-frequency or low-frequency components. This paper presents MixLight, a joint model that utilizes the complementary characteristics of SH and SG to achieve a more complete illumination representation, using SH and SG to capture low-frequency ambient light and high-frequency light sources, respectively. In addition, a special spherical light source sparsemax (SLSparsemax) module that refers to the position and brightness relationship between spherical light sources is designed to improve their sparsity, which is significant but omitted by prior works. Extensive experiments demonstrate that MixLight surpasses state-of-the-art (SOTA) methods on multiple metrics. In addition, experiments on the Web Dataset also show that MixLight, as a parametric method, has better generalization performance than non-parametric methods.
Submitted 19 April, 2024;
originally announced April 2024.
-
DST-GTN: Dynamic Spatio-Temporal Graph Transformer Network for Traffic Forecasting
Authors:
Songtao Huang,
Hongjin Song,
Tianqi Jiang,
Akbar Telikani,
Jun Shen,
Qingguo Zhou,
Binbin Yong,
Qiang Wu
Abstract:
Accurate traffic forecasting is essential for effective urban planning and congestion management. Deep learning (DL) approaches have achieved great success in traffic forecasting but still face challenges in capturing the intricacies of traffic dynamics. In this paper, we identify and address these challenges by emphasizing that spatial features are inherently dynamic and change over time. A novel in-depth feature representation, called Dynamic Spatio-Temporal (Dyn-ST) features, is introduced, which encapsulates spatial characteristics across varying times. Moreover, a Dynamic Spatio-Temporal Graph Transformer Network (DST-GTN) is proposed to capture Dyn-ST features and other dynamic adjacency relations between intersections. The DST-GTN can model dynamic ST relationships between nodes accurately and refine the representation of global and local ST characteristics by adopting adaptive weights in low-pass and all-pass filters, enabling the extraction of Dyn-ST features from traffic time-series data. In numerical experiments on public datasets, the DST-GTN achieves state-of-the-art performance for a range of traffic forecasting tasks and demonstrates enhanced stability.
Submitted 18 April, 2024;
originally announced April 2024.
-
Image Compression and Reconstruction Based on Quantum Network
Authors:
Xun Ji,
Qin Liu,
Shan Huang,
Andi Chen,
Shengjun Wu
Abstract:
A quantum network is an emerging type of network structure that leverages the principles of quantum mechanics to transmit and process information. Compared with classical data reconstruction algorithms, quantum networks make image reconstruction more efficient and accurate. They can also process more complex image information using fewer bits and faster parallel computing capabilities. Therefore, this paper discusses image reconstruction methods based on our quantum network and explores their potential applications in image processing. We introduce the basic structure of the quantum network, the process of image compression and reconstruction, and the specific parameter training method. In this study, we achieve a classical image reconstruction accuracy of 97.57%. Our quantum network design introduces novel ideas and methods for future work on image reconstruction.
Submitted 18 April, 2024;
originally announced April 2024.
-
From Image to Video, what do we need in multimodal LLMs?
Authors:
Suyuan Huang,
Haoxin Zhang,
Yan Gao,
Yao Hu,
Zengchang Qin
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated profound capabilities in understanding multimodal information, ranging from Image LLMs to the more complex Video LLMs. Numerous studies have illustrated their exceptional cross-modal comprehension. Recently, integrating video foundation models with large language models to build a comprehensive video understanding system has been proposed to overcome the limitations of specific pre-defined vision tasks. However, the current advancements in Video LLMs tend to overlook the foundational contributions of Image LLMs, often opting for more complicated structures and a wide variety of multimodal data for pre-training. This approach significantly increases the costs associated with these methods. In response to these challenges, this work introduces an efficient method that strategically leverages the priors of Image LLMs, facilitating a resource-efficient transition from Image to Video LLMs. We propose RED-VILLM, a Resource-Efficient Development pipeline for Video LLMs from Image LLMs, which utilizes a temporal adaptation plug-and-play structure within the image fusion module of Image LLMs. This adaptation extends their understanding capabilities to include temporal information, enabling the development of Video LLMs that not only surpass baseline performances but also do so with minimal instructional data and training resources. Our approach highlights the potential for a more cost-effective and scalable advancement in multimodal models, effectively building upon the foundational work of Image LLMs.
Submitted 17 April, 2024;
originally announced April 2024.
-
Towards Highly Realistic Artistic Style Transfer via Stable Diffusion with Step-aware and Layer-aware Prompt
Authors:
Zhanjie Zhang,
Quanwei Zhang,
Huaizhong Lin,
Wei Xing,
Juncheng Mo,
Shuaicheng Huang,
Jinheng Xie,
Guangyuan Li,
Junsheng Luan,
Lei Zhao,
Dalong Zhang,
Lixia Chen
Abstract:
Artistic style transfer aims to transfer the learned artistic style onto an arbitrary content image, generating artistic stylized images. Existing generative adversarial network-based methods fail to generate highly realistic stylized images and always introduce obvious artifacts and disharmonious patterns. Recently, large-scale pre-trained diffusion models opened up a new way for generating highly realistic artistic stylized images. However, diffusion model-based methods generally fail to preserve the content structure of input content images well, introducing some undesired content structure and style patterns. To address the above problems, we propose a novel pre-trained diffusion-based artistic style transfer method, called LSAST, which can generate highly realistic artistic stylized images while preserving the content structure of input content images well, without introducing obvious artifacts and disharmonious style patterns. Specifically, we introduce a Step-aware and Layer-aware Prompt Space, a set of learnable prompts, which can learn the style information from the collection of artworks and dynamically adjust the input images' content structure and style pattern. To train our prompt space, we propose a novel inversion method, called Step-aware and Layer-aware Prompt Inversion, which allows the prompt space to learn the style information of the artworks collection. In addition, we inject a pre-trained conditional branch of ControlNet into our LSAST, which further improves our framework's ability to maintain content structure. Extensive experiments demonstrate that our proposed method can generate more realistic artistic stylized images than the state-of-the-art artistic style transfer methods.
Submitted 29 April, 2024; v1 submitted 17 April, 2024;
originally announced April 2024.
-
Exploring Key Point Analysis with Pairwise Generation and Graph Partitioning
Authors:
Xiao Li,
Yong Jiang,
Shen Huang,
Pengjun Xie,
Gong Cheng,
Fei Huang
Abstract:
Key Point Analysis (KPA), the summarization of multiple arguments into a concise collection of key points, continues to be a significant and unresolved issue within the field of argument mining. Existing models adopt a two-stage pipeline of clustering arguments or generating key points for argument clusters. This approach relies on semantic similarity instead of measuring the existence of shared key points among arguments. Additionally, it only models the intra-cluster relationship among arguments, disregarding the inter-cluster relationship between arguments that do not share key points. To address these limitations, we propose a novel approach for KPA with pairwise generation and graph partitioning. Our objective is to train a generative model that can simultaneously provide a score indicating the presence of a shared key point between a pair of arguments and generate the shared key point. Subsequently, to map generated redundant key points to a concise set of key points, we construct an argument graph by considering the arguments as vertices, the generated key points as edges, and the scores as edge weights. We then propose a graph partitioning algorithm to partition all arguments sharing the same key points into the same subgraph. Notably, our experimental findings demonstrate that our proposed model surpasses previous models when evaluated on both the ArgKP and QAM datasets.
Submitted 17 April, 2024;
originally announced April 2024.
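The argument graph described in the Key Point Analysis abstract above (arguments as vertices, generated key points as edges, scores as edge weights) can be sketched as follows. The abstract does not describe the authors' partitioning algorithm, so this sketch swaps in a simple stand-in: drop edges below a score threshold and take connected components with union-find. The `threshold` value, the tuple layout, and the function name are assumptions for illustration only.

```python
from typing import Dict, List, Tuple

# (argument_a, argument_b, shared_key_point_score, generated_key_point)
ScoredPair = Tuple[int, int, float, str]


def partition_arguments(
    num_args: int, scored_pairs: List[ScoredPair], threshold: float = 0.5
) -> List[List[int]]:
    """Group arguments connected by edges whose score is at least `threshold`.

    Stand-in for the paper's partitioning step: thresholded connected components.
    """
    parent = list(range(num_args))

    def find(x: int) -> int:
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score, _key_point in scored_pairs:
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups: Dict[int, List[int]] = {}
    for arg in range(num_args):
        groups.setdefault(find(arg), []).append(arg)
    return sorted(groups.values())
```

A plain threshold treats every retained edge equally; a weight-aware partitioner, which the abstract implies, would also use the scores when deciding where to cut.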
-
LAECIPS: Large Vision Model Assisted Adaptive Edge-Cloud Collaboration for IoT-based Perception System
Authors:
Shijing Hu,
Ruijun Deng,
Xin Du,
Zhihui Lu,
Qiang Duan,
Yi He,
Shih-Chia Huang,
Jie Wu
Abstract:
Recent large vision models (e.g., SAM) have great potential to facilitate intelligent perception with high accuracy. Yet, the resource constraints in the IoT environment tend to prevent such large vision models from being deployed locally, and remote deployment incurs considerable inference latency, making it difficult to support real-time applications such as autonomous driving and robotics. Edge-cloud collaboration with large-small model co-inference offers a promising approach to achieving high inference accuracy and low latency. However, existing edge-cloud collaboration methods are tightly coupled with the model architecture and cannot adapt to the dynamic data drifts in heterogeneous IoT environments. To address these issues, we propose LAECIPS, a new edge-cloud collaboration framework. In LAECIPS, both the large vision model on the cloud and the lightweight model on the edge are plug-and-play. We design an edge-cloud collaboration strategy based on hard input mining, optimized for both high accuracy and low latency. We propose to update the edge model and its collaboration strategy with the cloud under the supervision of the large vision model, so as to adapt to the dynamic IoT data streams. Theoretical analysis of LAECIPS proves its feasibility. Experiments conducted in a robotic semantic segmentation system using real-world datasets show that LAECIPS outperforms its state-of-the-art competitors in accuracy, latency, and communication overhead while having better adaptability to dynamic environments.
Submitted 16 April, 2024;
originally announced April 2024.
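The large-small model co-inference idea in the LAECIPS abstract above can be illustrated with a minimal routing sketch. The abstract does not say how hard inputs are identified, so the confidence threshold used here is an assumption, as are the function name and the callable signatures of the edge and cloud models.

```python
from typing import Callable, Tuple

EdgeModel = Callable[[object], Tuple[str, float]]  # returns (label, confidence)
CloudModel = Callable[[object], str]               # returns label


def co_infer(
    x: object,
    edge_model: EdgeModel,
    cloud_model: CloudModel,
    confidence_threshold: float = 0.8,
) -> Tuple[str, str]:
    """Answer on the edge when it is confident; otherwise defer to the cloud.

    Assumption: a low edge confidence marks the input as "hard".
    """
    label, confidence = edge_model(x)
    if confidence >= confidence_threshold:
        return label, "edge"
    # Hard input: pay the extra latency for the large model's prediction.
    return cloud_model(x), "cloud"
```

Raising `confidence_threshold` sends more inputs to the cloud, trading latency and communication overhead for accuracy.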
-
Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V
Authors:
Peiyuan Zhi,
Zhiyuan Zhang,
Muzhi Han,
Zeyu Zhang,
Zhitian Li,
Ziyuan Jiao,
Baoxiong Jia,
Siyuan Huang
Abstract:
Autonomous robot navigation and manipulation in open environments require reasoning and replanning with closed-loop feedback. We present COME-robot, the first closed-loop framework utilizing the GPT-4V vision-language foundation model for open-ended reasoning and adaptive planning in real-world scenarios. We meticulously construct a library of action primitives for robot exploration, navigation, and manipulation, serving as callable execution modules for GPT-4V in task planning. On top of these modules, GPT-4V serves as the brain that can accomplish multimodal reasoning, generate action policy with code, verify the task progress, and provide feedback for replanning. Such design enables COME-robot to (i) actively perceive the environments, (ii) perform situated reasoning, and (iii) recover from failures. Through comprehensive experiments involving 8 challenging real-world tabletop and manipulation tasks, COME-robot demonstrates a significant improvement in task success rate (~25%) compared to state-of-the-art baseline methods. We further conduct comprehensive analyses to elucidate how COME-robot's design facilitates failure recovery, free-form instruction following, and long-horizon task planning.
Submitted 15 April, 2024;
originally announced April 2024.
-
ANCHOR: LLM-driven News Subject Conditioning for Text-to-Image Synthesis
Authors:
Aashish Anantha Ramakrishnan,
Sharon X. Huang,
Dongwon Lee
Abstract:
Text-to-Image (T2I) Synthesis has made tremendous strides in enhancing synthesized image quality, but current datasets evaluate model performance only on descriptive, instruction-based prompts. Real-world news image captions take a more pragmatic approach, providing high-level situational and Named-Entity (NE) information and limited physical object descriptions, making them abstractive. To evaluate the ability of T2I models to capture intended subjects from news captions, we introduce the Abstractive News Captions with High-level cOntext Representation (ANCHOR) dataset, containing 70K+ samples sourced from 5 different news media organizations. With Large Language Models (LLMs) achieving success in language and commonsense reasoning tasks, we explore the ability of different LLMs to identify and understand key subjects from abstractive captions. Our proposed method, Subject-Aware Finetuning (SAFE), selects and enhances the representation of key subjects in synthesized images by leveraging LLM-generated subject weights. It also adapts to the domain distribution of news images and captions through custom Domain Fine-tuning, outperforming current T2I baselines on ANCHOR. By launching the ANCHOR dataset, we hope to motivate research in furthering the Natural Language Understanding (NLU) capabilities of T2I models.
Submitted 15 April, 2024;
originally announced April 2024.
-
PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI
Authors:
Yandan Yang,
Baoxiong Jia,
Peiyuan Zhi,
Siyuan Huang
Abstract:
With recent developments in Embodied Artificial Intelligence (EAI) research, there has been a growing demand for high-quality, large-scale interactive scene generation. While prior methods in scene synthesis have prioritized the naturalness and realism of the generated scenes, the physical plausibility and interactivity of scenes have been largely left unexplored. To address this disparity, we introduce PhyScene, a novel method dedicated to generating interactive 3D scenes characterized by realistic layouts, articulated objects, and rich physical interactivity tailored for embodied agents. Based on a conditional diffusion model for capturing scene layouts, we devise novel physics- and interactivity-based guidance mechanisms that integrate constraints from object collision, room layout, and object reachability. Through extensive experiments, we demonstrate that PhyScene effectively leverages these guidance functions for physically interactable scene synthesis, outperforming existing state-of-the-art scene synthesis methods by a large margin. Our findings suggest that the scenes generated by PhyScene hold considerable potential for facilitating diverse skill acquisition among agents within interactive environments, thereby catalyzing further advancements in embodied AI research. Project website: http://physcene.github.io.
Submitted 9 July, 2024; v1 submitted 15 April, 2024;
originally announced April 2024.
-
Optimal Real-time Bidding Strategy For EV Aggregators in Wholesale Electricity Markets
Authors:
Shihan Huang,
Dongkun Han,
John Zhen Fu Pang,
Yue Chen
Abstract:
With the rapid growth of electric vehicles (EVs), EV aggregators have been playing an increasingly vital role in power systems by not merely providing charging management but also participating in wholesale electricity markets. This work studies the optimal real-time bidding strategy for an EV aggregator. Since the charging process of EVs is time-coupled, it is necessary for EV aggregators to consider future operational conditions (e.g., future EV arrivals) when deciding the current bidding strategy. However, accurately forecasting future operational conditions is challenging under the inherent uncertainties. Hence, there is a need for a real-time bidding strategy based solely on up-to-date information, which is the main goal of this work. We start by developing an online optimal EV charging management algorithm for the EV aggregator via Lyapunov optimization. Based on this, an optimal real-time bidding strategy (bidding cost curve and bounds) for the aggregator is derived. Then, an efficient yet practical algorithm is proposed to obtain the bidding strategy. We show that with the proposed bidding strategy, the aggregator's profit is nearly offline optimal. Moreover, the wholesale electricity market clearing result aligns with the individual aggregator's optimal charging strategy given the prices. Case studies against several benchmarks are conducted to evaluate the performance of the proposed method.
Submitted 15 April, 2024;
originally announced April 2024.
-
Observation of $D \to a_{0}(980)π$ in the decays $D^{0} \rightarrow π^{+}π^{-}η$ and $D^{+} \rightarrow π^{+}π^{0}η$
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
O. Afedulidis,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
I. Balossino,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere
, et al. (634 additional authors not shown)
Abstract:
We report the first amplitude analysis of the decays $D^{0} \to π^{+} π^{-} η$ and $D^{+} \rightarrow π^{+}π^{0}η$ using a data sample taken with the BESIII detector at the center-of-mass energy of 3.773 GeV, corresponding to an integrated luminosity of 7.9 ${\rm fb}^{-1}$. The contribution from the process $D^{0(+)} \to a_{0}(980)^{+} π^{-(0)}$ is significantly larger than the $D^{0(+)} \to a_{0}(980)^{-(0)} π^{+}$ contribution. The ratios $\mathcal{B}(D^{0} \rightarrow a_{0}(980)^{+}π^{-})/\mathcal{B}(D^{0} \rightarrow a_{0}(980)^{-}π^{+})$ and $\mathcal{B}(D^{+} \rightarrow a_{0}(980)^{+}π^{0})/\mathcal{B}(D^{+} \rightarrow a_{0}(980)^{0}π^{+})$ are measured to be $7.5^{+2.5}_{-0.8\,\mathrm{stat.}}\pm1.7_{\mathrm{syst.}}$ and $2.6\pm0.6_{\mathrm{stat.}}\pm0.3_{\mathrm{syst.}}$, respectively. The measured $D^{0}$ ratio disagrees with the theoretical predictions by orders of magnitude, thus implying a substantial contribution from final-state interactions.
Submitted 14 April, 2024;
originally announced April 2024.