-
A Neural Matrix Decomposition Recommender System Model based on the Multimodal Large Language Model
Authors:
Ao Xiang,
Bingjie Huang,
Xinyu Guo,
Haowei Yang,
Tianyao Zheng
Abstract:
Recommendation systems have become an important solution to information search problems. This article proposes a neural matrix factorization recommendation system model based on a multimodal large language model, called BoNMF. This model combines BoBERTa's powerful capabilities in natural language processing, ViT in computer vision, and neural matrix decomposition technology. By capturing the potential characteristics of users and items, and after interacting with a low-dimensional matrix composed of user and item IDs, the neural network outputs the recommendation results. Cold start and ablation experimental results show that the BoNMF model exhibits excellent performance on large public data sets and significantly improves the accuracy of recommendations.
Submitted 11 July, 2024;
originally announced July 2024.
-
CrowdMoGen: Zero-Shot Text-Driven Collective Motion Generation
Authors:
Xinying Guo,
Mingyuan Zhang,
Haozhe Xie,
Chenyang Gu,
Ziwei Liu
Abstract:
Crowd Motion Generation is essential in entertainment industries such as animation and games as well as in strategic fields like urban simulation and planning. This new task requires an intricate integration of control and generation to realistically synthesize crowd dynamics under specific spatial and semantic constraints, whose challenges are yet to be fully explored. On the one hand, existing human motion generation models typically focus on individual behaviors, neglecting the complexities of collective behaviors. On the other hand, recent methods for multi-person motion generation depend heavily on pre-defined scenarios and are limited to a fixed, small number of inter-person interactions, thus hampering their practicality. To overcome these challenges, we introduce CrowdMoGen, a zero-shot text-driven framework that harnesses the power of a Large Language Model (LLM) to incorporate collective intelligence into the motion generation framework as guidance, thereby enabling generalizable planning and generation of crowd motions without paired training data. Our framework consists of two key components: 1) a Crowd Scene Planner that learns to coordinate motions and dynamics according to specific scene contexts or introduced perturbations, and 2) a Collective Motion Generator that efficiently synthesizes the required collective motions based on the holistic plans. Extensive quantitative and qualitative experiments have validated the effectiveness of our framework, which not only fills a critical gap by providing scalable and generalizable solutions for the Crowd Motion Generation task but also achieves high levels of realism and flexibility.
Submitted 8 July, 2024;
originally announced July 2024.
-
Flying Calligrapher: Contact-Aware Motion and Force Planning and Control for Aerial Manipulation
Authors:
Xiaofeng Guo,
Guanqi He,
Jiahe Xu,
Mohammadreza Mousaei,
Junyi Geng,
Sebastian Scherer,
Guanya Shi
Abstract:
Aerial manipulation has gained interest in completing high-altitude tasks that are challenging for human workers, such as contact inspection and defect detection. Previous research has focused on maintaining static contact points or forces. This letter addresses a more general and dynamic task: simultaneously tracking time-varying contact force in the surface normal direction and motion trajectories on tangential surfaces. We propose a pipeline that includes a contact-aware trajectory planner to generate dynamically feasible trajectories, and a hybrid motion-force controller to track such trajectories. We demonstrate the approach in an aerial calligraphy task using a novel sponge pen design as the end-effector, whose stroke width is proportional to the contact force. Additionally, we develop a touchscreen interface for flexible user input. Experiments show our method can effectively draw diverse letters, achieving an IoU of 0.59 and an end-effector position (force) tracking RMSE of 2.9 cm (0.7 N). Website: https://xiaofeng-guo.github.io/flying-calligrapher/
Submitted 7 July, 2024;
originally announced July 2024.
-
Tracking Reflected Objects: A Benchmark
Authors:
Xiaoyu Guo,
Pengzhi Zhong,
Lizhi Lin,
Hao Zhang,
Ling Huang,
Shuiwang Li
Abstract:
Visual tracking has advanced significantly in recent years, mainly due to the availability of large-scale training datasets. These datasets have enabled the development of numerous algorithms that can track objects with high accuracy and robustness. However, the majority of current research has been directed towards tracking generic objects, with less emphasis on more specialized and challenging scenarios. One such challenging scenario involves tracking reflected objects. Reflections can significantly distort the appearance of objects, creating ambiguous visual cues that complicate the tracking process. This issue is particularly pertinent in applications such as autonomous driving, security, smart homes, and industrial production, where accurately tracking objects reflected in surfaces like mirrors or glass is crucial. To address this gap, we introduce TRO, a benchmark specifically for Tracking Reflected Objects. TRO includes 200 sequences with around 70,000 frames, each carefully annotated with bounding boxes. This dataset aims to encourage the development of new, accurate methods for tracking reflected objects, which present unique challenges not sufficiently covered by existing benchmarks. We evaluated 20 state-of-the-art trackers and found that they struggle with the complexities of reflections. To provide a stronger baseline, we propose a new tracker, HiP-HaTrack, which uses hierarchical features to improve performance, significantly outperforming existing algorithms. We believe our benchmark, evaluation, and HiP-HaTrack will inspire further research and applications in tracking reflected objects. The TRO and code are available at https://github.com/OpenCodeGithub/HIP-HaTrack.
Submitted 6 July, 2024;
originally announced July 2024.
-
Altermagnetism in Heavy Fermion Systems
Authors:
Miaomiao Zhao,
Wei-Wei Yang,
Xueming Guo,
Hong-Gang Luo,
Yin Zhong
Abstract:
A novel collinear magnet, the altermagnet (AM), with spin-split energy bands and zero net magnetization, has attracted great interest due to its potential spintronic applications. Here, we demonstrate AM-like phases in a microscopic Kondo lattice model, widely used for heavy fermion compounds. Within the framework of fermionic parton mean-field theory, we find that the $d$-wave AM state can coexist with the intrinsic Kondo screening effect in such an itinerant-local electron system if an alternating next-nearest-neighbor hopping (NNNH) is included. Such alternating NNNH takes nonmagnetic atoms, neglected in the usual study of antiferromagnetism, into account when dealing with real-life candidate AM materials. The AM-like states are characterized by their spin-split quasiparticle bands, Fermi surface, spin-resolved distribution function and conductivity. It is suggested that magnetic quantum oscillation and charge transport measurements can detect these AM-like phases. We hope the present work may be useful for exploring AM-like phases in $f$-electron compounds.
Submitted 6 July, 2024;
originally announced July 2024.
-
SurgicalGaussian: Deformable 3D Gaussians for High-Fidelity Surgical Scene Reconstruction
Authors:
Weixing Xie,
Junfeng Yao,
Xianpeng Cao,
Qiqin Lin,
Zerui Tang,
Xiao Dong,
Xiaohu Guo
Abstract:
Dynamic reconstruction of deformable tissues in endoscopic video is a key technology for robot-assisted surgery. Recent reconstruction methods based on neural radiance fields (NeRFs) have achieved remarkable results in the reconstruction of surgical scenes. However, based on implicit representation, NeRFs struggle to capture the intricate details of objects in the scene and cannot achieve real-time rendering. In addition, restricted single-view perception and occluded instruments also pose special challenges in surgical scene reconstruction. To address these issues, we develop SurgicalGaussian, a deformable 3D Gaussian Splatting method to model dynamic surgical scenes. Our approach models the spatio-temporal features of soft tissues at each time stamp via a forward-mapping deformation MLP and regularization to constrain local 3D Gaussians to comply with consistent movement. With the depth initialization strategy and tool mask-guided training, our method can remove surgical instruments and reconstruct high-fidelity surgical scenes. Through experiments on various surgical videos, our network outperforms existing methods in many aspects, including rendering quality, rendering speed and GPU usage. The project page can be found at https://surgicalgaussian.github.io.
Submitted 6 July, 2024;
originally announced July 2024.
-
A Survey on Natural Language Counterfactual Generation
Authors:
Yongjie Wang,
Xiaoqi Qiu,
Yu Yue,
Xu Guo,
Zhiwei Zeng,
Yuhong Feng,
Zhiqi Shen
Abstract:
Natural Language Counterfactual generation aims to minimally modify a given text such that the modified text will be classified into a different class. The generated counterfactuals provide insight into the reasoning behind a model's predictions by highlighting which words significantly influence the outcomes. Additionally, they can be used to detect model fairness issues or augment the training data to enhance the model's robustness. A substantial amount of research has been conducted to generate counterfactuals for various NLP tasks, employing different models and methodologies. With the rapid growth of studies in this field, a systematic review is crucial to guide future researchers and developers. To bridge this gap, this survey comprehensively overviews textual counterfactual generation methods, particularly including those based on Large Language Models. We propose a new taxonomy that categorizes the generation methods into four groups and systematically summarize the metrics for evaluating the generation quality. Finally, we discuss ongoing research challenges and outline promising directions for future work.
Submitted 4 July, 2024;
originally announced July 2024.
-
LightStereo: Channel Boost Is All Your Need for Efficient 2D Cost Aggregation
Authors:
Xianda Guo,
Chenming Zhang,
Dujun Nie,
Wenzhao Zheng,
Youmin Zhang,
Long Chen
Abstract:
We present LightStereo, a cutting-edge stereo-matching network crafted to accelerate the matching process. Departing from conventional methodologies that rely on aggregating computationally intensive 4D costs, LightStereo adopts the 3D cost volume as a lightweight alternative. While similar approaches have been explored previously, our breakthrough lies in enhancing performance through a dedicated focus on the channel dimension of the 3D cost volume, where the distribution of matching costs is encapsulated. Our exhaustive exploration has yielded plenty of strategies to amplify the capacity of the pivotal dimension, ensuring both precision and efficiency. We compare the proposed LightStereo with existing state-of-the-art methods across various benchmarks, which demonstrate its superior performance in speed, accuracy, and resource utilization. LightStereo achieves a competitive EPE metric in the SceneFlow datasets while demanding a minimum of only 22 GFLOPs, with an inference time of just 17 ms. Our comprehensive analysis reveals the effect of 2D cost aggregation for stereo matching, paving the way for real-world applications of efficient stereo systems. Code will be available at https://github.com/XiandaGuo/OpenStereo.
Submitted 28 June, 2024;
originally announced June 2024.
-
Exploiting Structured Sparsity in Near Field: From the Perspective of Decomposition
Authors:
Xufeng Guo,
Yuanbin Chen,
Ying Wang,
Chau Yuen
Abstract:
The structured sparsity can be leveraged in traditional far-field channels, greatly facilitating efficient sparse channel recovery by compressing the complexity of overheads to the level of the scatterer number. However, when experiencing a fundamental shift from planar-wave-based far-field modeling to spherical-wave-based near-field modeling, whether these benefits persist in the near-field regime remains an open issue. To answer this question, this article delves into structured sparsity in the near-field realm, examining its peculiarities and challenges. In particular, we present the key features of near-field structured sparsity in contrast to the far-field counterpart, drawing from both physical and mathematical perspectives. Upon unmasking the theoretical bottlenecks, we resort to bypassing them by decoupling the geometric parameters of the scatterers, termed the triple parametric decomposition (TPD) framework. It is demonstrated that our novel TPD framework can achieve robust recovery of near-field sparse channels by applying the potential structured sparsity and avoiding the curse of complexity and overhead.
Submitted 27 June, 2024;
originally announced June 2024.
-
Angle-dependent planar thermal Hall effect by quasi-ballistic phonons in black phosphorus
Authors:
Xiaokang Li,
Xiaodong Guo,
Zengwei Zhu,
Kamran Behnia
Abstract:
The origin of the phonon thermal Hall effect in insulators is a matter of ongoing debate. The large amplitude of the signal in an elemental non-magnetic solid, such as black phosphorus (BP), calls for a minimal mechanism with no role for the spin degree of freedom. Here, we show that a longitudinal heat flow generates a transverse temperature gradient in BP even when the magnetic field, the heat current and the thermal gradient lie in the same plane. The long phonon mean-free-path leaves little room for scattering by point-like symmetry-breaking defects. We show that the signal peaks when the magnetic field is oriented along one of the two diagonal orientations of the puckered honeycomb plane and argue that this can be understood as the sum of two distinct contributions, each parallel to a mirror plane. This angular dependence, as well as the order of magnitude of the observed signal, points to the torque exerted by the magnetic field on electric dipoles traveling with heat-carrying phonons as the driver of the effect.
Submitted 26 June, 2024;
originally announced June 2024.
-
Serial Position Effects of Large Language Models
Authors:
Xiaobo Guo,
Soroush Vosoughi
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities in zero-shot learning applications, generating responses to queries using only pre-training information without the need for additional fine-tuning. This represents a significant departure from traditional machine learning approaches. Previous research has indicated that LLMs may exhibit serial position effects, such as primacy and recency biases, which are well-documented cognitive biases in human psychology. Our extensive testing across various tasks and models confirms the widespread occurrence of these effects, although their intensity varies. We also discovered that while carefully designed prompts can somewhat mitigate these biases, their effectiveness is inconsistent. These findings underscore the significance of serial position effects during the inference process, particularly in scenarios where there are no ground truth labels, highlighting the need for greater focus on addressing these effects in LLM applications.
Submitted 22 June, 2024;
originally announced June 2024.
-
Thermodynamic modeling of the LiCl-KCl-LaCl$_3$ system with Bayesian model selection and uncertainty quantification
Authors:
Rushi Gong,
Shun-Li Shang,
Vitaliy G. Goncharov,
Xiaofeng Guo,
Zi-Kui Liu
Abstract:
Chloride molten salts are increasingly recognized for their applications in pyroprocessing techniques for the separation of lanthanides. Understanding the thermodynamic properties of these molten salts is essential to optimize the separation process. Several thermodynamic models, including the associate model, the two-sublattice ionic model, and the modified quasichemical model with quadruplet approximation (MQMQA), are utilized to capture the complexity of molten salts. In the present work, the Bayes factor was used to guide the model selection process for the thermodynamic modeling of the KCl-LaCl$_3$ system and provide statistical comparisons of liquid models. The results indicate that the MQMQA model is the most favorable model based on the available thermochemical data. The LiCl-KCl-LaCl$_3$ system was further optimized with uncertainty quantification using MQMQA. The thermodynamic properties of compounds in the KCl-LaCl$_3$ system were obtained from DFT-based phonon calculations. The calculated phase stability shows excellent agreement with experimental data, indicating that an appropriate model is important for accurately predicting the behavior of complex molten salts.
Submitted 21 June, 2024;
originally announced June 2024.
-
On the equivalence of Noether charge and Hilbert action boundary term formulae for the black hole entropy in F(Riemann) gravity theory
Authors:
Wei Guo,
Xiyao Guo,
Mingfeng Li,
Zili Mou,
Hongbao Zhang
Abstract:
By working with the covariant phase space formalism, we have shown that the Hamiltonian conjugate to a Killing vector field ξ can not only be expressed as the sum of the associated Noether charge and ξ contracted with the Hilbert action boundary term for F(Riemann) gravity, but can also be written as its contraction with another ξ-independent tensor field. With this, we have proven the equivalence of the Noether charge and Hilbert action boundary term formulae for the stationary black hole entropy in F(Riemann) gravity, which is further substantiated by our explicit computation using both formulae.
Submitted 21 June, 2024;
originally announced June 2024.
-
Absence of a bulk charge density wave signature in x-ray measurements of UTe$_2$
Authors:
Caitlin S. Kengle,
Dipanjan Chaudhuri,
Xuefei Guo,
Thomas A. Johnson,
Simon Bettler,
Wolfgang Simeth,
Matthew J. Krogstad,
Zahir Islam,
Sheng Ran,
Shanta R. Saha,
Johnpierre Paglione,
Nicholas P. Butch,
Eduardo Fradkin,
Vidya Madhavan,
Peter Abbamonte
Abstract:
The long-sought pair density wave (PDW) is an exotic phase of matter in which charge density wave (CDW) order is intertwined with the amplitude or phase of coexisting, superconducting order \cite{Berg2009,Berg2009b}. Originally predicted to exist in copper-oxides, circumstantial evidence for PDW order now exists in a variety of materials. Recently, scanning tunneling microscopy (STM) studies have reported evidence for a three-component charge density wave (CDW) at the surface of the heavy-fermion superconductor, UTe$_2$, persisting below its superconducting transition temperature. Here, we use hard x-ray diffraction measurements on crystals of UTe$_2$ at $T = 1.9$ K and $12$ K to search for a bulk signature of this CDW. Using STM measurements as a constraint, we calculate the expected locations of CDW superlattice peaks, and sweep a large volume of reciprocal space in search of a signature. We failed to find any evidence for a CDW near any of the expected superlattice positions in many Brillouin zones. We estimate an upper bound on the CDW lattice distortion of $u_{max} \lesssim 4 \times 10^{-3}$ Å. Our results suggest that the CDW observed in STM is either purely electronic, somehow lacking a signature in the structural lattice, or is restricted to the material surface.
Submitted 24 June, 2024; v1 submitted 20 June, 2024;
originally announced June 2024.
-
PAPR Reduction with Pre-chirp Selection for Affine Frequency Division Multiple
Authors:
Haozhi Yuan,
Yin Xu,
Xinghao Guo,
Tianyao Ma,
Haoyang Li,
Dazhi He,
Wenjun Zhang
Abstract:
Affine frequency division multiplexing (AFDM) is a promising new multicarrier technique based on the discrete affine Fourier transform (DAFT). By properly tuning the pre-chirp and post-chirp parameters in the DAFT, the effective channel in the DAFT domain can completely avoid overlap of different paths and thus constitutes a full representation of the delay-Doppler profile, which significantly improves the system performance in high-mobility scenarios. However, AFDM has the crucial problem of a high peak-to-average power ratio (PAPR) caused by the phase randomness of modulated symbols. In this letter, an algorithm named grouped pre-chirp selection (GPS) is proposed to reduce the PAPR by changing the value of the pre-chirp parameter on sub-carriers group by group. Specifically, it is first demonstrated that the important properties of the AFDM system are maintained when implementing GPS. Secondly, we elaborate the operation steps of the GPS algorithm, illustrating its effect on PAPR reduction and its advantage in terms of computational complexity compared with the ungrouped approach. Finally, simulation results of PAPR reduction in the form of the complementary cumulative distribution function (CCDF) show the effectiveness of the proposed GPS algorithm.
Submitted 20 June, 2024;
originally announced June 2024.
-
AMC: Access to Miss Correlation Prefetcher for Evolving Graph Analytics
Authors:
Abhishek Singh,
Christian Schulte,
Xiaochen Guo
Abstract:
Modern memory hierarchies work well with applications that have good spatial locality. Evolving (dynamic) graphs are important applications widely used to model graphs and networks with edge and vertex changes. They exhibit irregular memory access patterns and suffer from a high miss ratio and long miss penalty. Prefetching can be employed to predict and fetch future demand misses. However, current hardware prefetchers cannot efficiently predict for applications with irregular memory accesses. In evolving graph applications, vertices that do not change during graph changes exhibit the same access correlation patterns. Current temporal prefetchers use one-to-one or one-to-many correlation to exploit these patterns. Similar patterns are recorded in the same entry, which causes aliasing and can lead to poor prefetch accuracy and coverage. This work proposes a software-assisted hardware prefetcher for evolving graphs. The key idea is to record the correlations between a sequence of vertex accesses and the following misses and then prefetch when the same vertex access sequence occurs in the future. The proposed Access-to-Miss Correlation (AMC) prefetcher provides a lightweight programming interface to identify the data structures of interest and sets the iteration boundary to update the correlation table. For the evaluated applications, AMC achieves a geomean speedup of 1.5x as compared to the best-performing prefetcher in prior work (VLDP). AMC can achieve an average of 62% accuracy and coverage, whereas VLDP has an accuracy of 31% and coverage of 23%.
Submitted 20 June, 2024;
originally announced June 2024.
-
GUI Action Narrator: Where and When Did That Action Take Place?
Authors:
Qinchen Wu,
Difei Gao,
Kevin Qinghong Lin,
Zhuoyu Wu,
Xiangwu Guo,
Peiran Li,
Weichen Zhang,
Hengxu Wang,
Mike Zheng Shou
Abstract:
The advent of Multimodal LLMs has significantly enhanced image OCR recognition capabilities, making GUI automation a viable reality for increasing efficiency in digital tasks. One fundamental aspect of developing a GUI automation system is understanding primitive GUI actions. This comprehension is crucial as it enables agents to learn from user demonstrations, an essential element of automation. To rigorously evaluate such capabilities, we developed a video captioning benchmark for GUI actions, comprising 4,189 diverse video captioning samples. This task presents unique challenges compared to natural scene video captioning: 1) GUI screenshots typically contain denser information than natural scenes, and 2) events within GUIs are subtler and occur more rapidly, requiring precise attention to the appropriate time span and spatial region for accurate understanding. To address these challenges, we introduce our GUI action dataset Act2Cap as well as a simple yet effective framework, GUI Narrator, for GUI video captioning that utilizes the cursor as a visual prompt to enhance the interpretation of high-resolution screenshots. Specifically, a cursor detector is trained on our dataset, and a multimodal LLM model with mechanisms for selecting keyframes and key regions generates the captions. Experimental results indicate that even for today's most advanced multimodal models, such as GPT-4o, the task remains highly challenging. Additionally, our evaluations show that our strategy effectively enhances model performance, whether integrated into the fine-tuning of open-source models or employed as a prompting strategy in closed-source models.
Submitted 19 June, 2024;
originally announced June 2024.
-
Improving Zero-Shot Cross-Lingual Transfer via Progressive Code-Switching
Authors:
Zhuoran Li,
Chunming Hu,
Junfan Chen,
Zhijun Chen,
Xiaohui Guo,
Richong Zhang
Abstract:
Code-switching is a data augmentation scheme that mixes words from multiple languages into source-language text. It has achieved considerable generalization performance on cross-lingual transfer tasks by aligning cross-lingual contextual word representations. However, uncontrolled and excessive code-switching introduces noisy samples into model training. In other words, excessively code-switched text samples hurt the models' cross-lingual transferability. To this end, we propose a Progressive Code-Switching (PCS) method that gradually generates moderately difficult code-switching examples for the model to discriminate, from easy to hard. The idea is to progressively incorporate the multilingual knowledge previously learned from easier code-switching data to guide model optimization on the succeeding, harder code-switching data. Specifically, we first design a difficulty measurer to measure the impact of replacing each word in a sentence based on the word relevance score. Then a code-switcher generates code-switching data of increasing difficulty via a controllable temperature variable. In addition, a training scheduler decides when to sample harder code-switching data for model training. Experiments show our model achieves state-of-the-art results on three different zero-shot cross-lingual transfer tasks across ten languages.
Submitted 19 June, 2024;
originally announced June 2024.
-
Limit Results for Estimation of Connectivity Matrix in Multi-layer Stochastic Block Models
Authors:
Wenqing Su,
Xiao Guo,
Ying Yang
Abstract:
Multi-layer networks arise naturally in various domains including biology, finance and sociology, among others. The multi-layer stochastic block model (multi-layer SBM) is commonly used for community detection in multi-layer networks. Most of the current literature focuses on the statistical consistency of community detection methods under multi-layer SBMs. However, asymptotic distributional properties, which play an important role in statistical inference, are also indispensable. In this work, we aim to study the estimation and asymptotic properties of the layer-wise scaled connectivity matrices in multi-layer SBMs. We develop a novel and efficient method to estimate the scaled connectivity matrices. Under the multi-layer SBM and its variant, the multi-layer degree-corrected SBM, we establish the asymptotic normality of the estimated matrices under mild conditions, which can be used for interval estimation and hypothesis testing. Simulations show the superior performance of the proposed method over existing methods in the two statistical inference tasks considered. We also apply the method to a real dataset and obtain interpretable results.
Submitted 16 June, 2024;
originally announced June 2024.
-
Constraints on Ultra Heavy Dark Matter Properties from Dwarf Spheroidal Galaxies with LHAASO Observations
Authors:
Zhen Cao,
F. Aharonian,
Q. An,
Axikegu,
Y. X. Bai,
Y. W. Bao,
D. Bastieri,
X. J. Bi,
Y. J. Bi,
J. T. Cai,
Q. Cao,
W. Y. Cao,
Zhe Cao,
J. Chang,
J. F. Chang,
A. M. Chen,
E. S. Chen,
Liang Chen,
Lin Chen,
Long Chen,
M. J. Chen,
M. L. Chen,
Q. H. Chen,
S. H. Chen,
S. Z. Chen
, et al. (255 additional authors not shown)
Abstract:
In this work we search for signals generated by ultra-heavy dark matter in data from the Large High Altitude Air Shower Observatory (LHAASO). We look for possible gamma rays from dark matter annihilation or decay in 16 dwarf spheroidal galaxies in the field of view of LHAASO. Dwarf spheroidal galaxies are among the most promising targets for indirect detection of dark matter, as they have low fluxes of astrophysical $γ$-ray background and large amounts of dark matter. In an analysis of more than 700 days of LHAASO observational data, no significant dark matter signal from 1 TeV to 1 EeV is detected. Accordingly, we derive the most stringent constraints on the ultra-heavy dark matter annihilation cross-section up to EeV. The constraints on the lifetime of dark matter in decay mode are also derived.
Submitted 12 June, 2024;
originally announced June 2024.
-
2.5D Multi-view Averaging Diffusion Model for 3D Medical Image Translation: Application to Low-count PET Reconstruction with CT-less Attenuation Correction
Authors:
Tianqi Chen,
Jun Hou,
Yinchi Zhou,
Huidong Xie,
Xiongchao Chen,
Qiong Liu,
Xueqi Guo,
Menghua Xia,
James S. Duncan,
Chi Liu,
Bo Zhou
Abstract:
Positron Emission Tomography (PET) is an important clinical imaging tool but inevitably introduces radiation hazards to patients and healthcare providers. Reducing the tracer injection dose and eliminating the CT acquisition for attenuation correction can reduce the overall radiation dose, but often results in PET with high noise and bias. Thus, it is desirable to develop 3D methods to translate the non-attenuation-corrected low-dose PET (NAC-LDPET) into attenuation-corrected standard-dose PET (AC-SDPET). Recently, diffusion models have emerged as a new state-of-the-art deep learning method for image-to-image translation, better than traditional CNN-based methods. However, due to their high computation cost and memory burden, they are largely limited to 2D applications. To address these challenges, we developed a novel 2.5D Multi-view Averaging Diffusion Model (MADM) for 3D image-to-image translation, with application to NAC-LDPET to AC-SDPET translation. Specifically, MADM employs separate diffusion models for axial, coronal, and sagittal views, whose outputs are averaged in each sampling step to ensure the 3D generation quality from multiple views. To accelerate the 3D sampling process, we also proposed a strategy to use the CNN-based 3D generation as a prior for the diffusion model. Our experimental results on human patient studies suggested that MADM can generate high-quality 3D translation images, outperforming previous CNN-based and diffusion-based baseline methods.
Submitted 15 June, 2024; v1 submitted 12 June, 2024;
originally announced June 2024.
-
Unique Security and Privacy Threats of Large Language Model: A Comprehensive Survey
Authors:
Shang Wang,
Tianqing Zhu,
Bo Liu,
Ming Ding,
Xu Guo,
Dayong Ye,
Wanlei Zhou,
Philip S. Yu
Abstract:
With the rapid development of artificial intelligence, large language models (LLMs) have made remarkable advancements in natural language processing. These models are trained on vast datasets to exhibit powerful language understanding and generation capabilities across various applications, including machine translation, chatbots, and agents. However, LLMs have revealed a variety of privacy and security issues throughout their life cycle, drawing significant academic and industrial attention. Moreover, the risks faced by LLMs differ significantly from those encountered by traditional language models. Given that current surveys lack a clear taxonomy of unique threat models across diverse scenarios, we emphasize the unique privacy and security threats associated with five specific scenarios: pre-training, fine-tuning, retrieval-augmented generation systems, deployment, and LLM-based agents. Addressing the characteristics of each risk, this survey outlines potential threats and countermeasures. Research on attack and defense situations can offer feasible research directions, enabling more areas to benefit from LLMs.
Submitted 18 June, 2024; v1 submitted 12 June, 2024;
originally announced June 2024.
-
Instruct Large Language Models to Drive like Humans
Authors:
Ruijun Zhang,
Xianda Guo,
Wenzhao Zheng,
Chenming Zhang,
Kurt Keutzer,
Long Chen
Abstract:
Motion planning in complex scenarios is the core challenge in autonomous driving. Conventional methods apply predefined rules or learn from driving data to plan the future trajectory. Recent methods seek the knowledge preserved in large language models (LLMs) and apply it to driving scenarios. Despite the promising results, it is still unclear whether the LLM learns the underlying human logic of driving. In this paper, we propose an InstructDriver method to transform an LLM into a motion planner with explicit instruction tuning to align its behavior with that of humans. We derive driving instruction data based on human logic (e.g., do not cause collisions) and traffic rules (e.g., proceed only when the light is green). We then employ an interpretable InstructChain module to further reason out the final plan reflecting the instructions. Our InstructDriver allows the injection of human rules and learning from driving data, enabling both interpretability and data scalability. Different from existing methods that experimented on closed-loop or simulated settings, we adopt the real-world closed-loop motion planning nuPlan benchmark for better evaluation. InstructDriver demonstrates the effectiveness of the LLM planner in a real-world closed-loop setting. Our code is publicly available at https://github.com/bonbon-rj/InstructDriver.
Submitted 11 June, 2024;
originally announced June 2024.
-
HO-Cap: A Capture System and Dataset for 3D Reconstruction and Pose Tracking of Hand-Object Interaction
Authors:
Jikai Wang,
Qifan Zhang,
Yu-Wei Chao,
Bowen Wen,
Xiaohu Guo,
Yu Xiang
Abstract:
We introduce a data capture system and a new dataset named HO-Cap that can be used to study 3D reconstruction and pose tracking of hands and objects in videos. The capture system uses multiple RGB-D cameras and a HoloLens headset for data collection, avoiding the use of expensive 3D scanners or mocap systems. We propose a semi-automatic method to obtain annotations of shape and pose of hands and objects in the collected videos, which significantly reduces the required annotation time compared to manual labeling. With this system, we captured a video dataset of humans using objects to perform different tasks, as well as simple pick-and-place and handover of an object from one hand to the other, which can be used as human demonstrations for embodied AI and robot manipulation research. Our data capture setup and annotation framework can be used by the community to reconstruct 3D shapes of objects and human hands and track their poses in videos.
Submitted 16 June, 2024; v1 submitted 10 June, 2024;
originally announced June 2024.
-
PairCFR: Enhancing Model Training on Paired Counterfactually Augmented Data through Contrastive Learning
Authors:
Xiaoqi Qiu,
Yongjie Wang,
Xu Guo,
Zhiwei Zeng,
Yue Yu,
Yuhong Feng,
Chunyan Miao
Abstract:
Counterfactually Augmented Data (CAD) involves creating new data samples by applying minimal yet sufficient modifications to flip the label of existing data samples to other classes. Training with CAD enhances model robustness against spurious features that happen to correlate with labels by spreading the causal relationships across different classes. Yet, recent research reveals that training with CAD may lead models to overly focus on modified features while ignoring other important contextual information, inadvertently introducing biases that may impair performance on out-of-distribution (OOD) datasets. To mitigate this issue, we employ contrastive learning to promote global feature alignment in addition to learning counterfactual clues. We theoretically prove that contrastive loss can encourage models to leverage a broader range of features beyond the modified ones. Comprehensive experiments on two human-edited CAD datasets demonstrate that our proposed method outperforms the state-of-the-art on OOD datasets.
Submitted 9 June, 2024;
originally announced June 2024.
-
Majorana Zero Modes in Lieb-Kitaev Model with Tunable Quantum Metric
Authors:
Xingyao Guo,
Xinglei Ma,
Xuzhe Ying,
K. T. Law
Abstract:
The relation between band topology and Majorana zero energy modes (MZMs) in topological superconductors has been well studied in the past decades. However, the relation between the quantum metric and MZMs has yet to be understood. In this work, we first introduce a three-band Lieb-like lattice model with an isolated flat band and tunable quantum metric. By introducing nearest-neighbor equal-spin pairing, we obtain the Lieb-Kitaev model, which supports MZMs. When the Fermi energy is set within the flat band, the MZMs would be expected to be well-localized at the ends of the 1D superconductor due to the flatness of the band. On the contrary, we show both numerically and analytically that the localization length of the MZMs is controlled by a length scale defined by the quantum metric of the flat band, which we call the quantum metric length (QML). The QML can be several orders of magnitude longer than the conventional BCS superconducting coherence length. When the QML is comparable to the length of the superconductor, the two MZMs from the two ends of the superconductor can hybridize. When two metallic leads are coupled to the two MZMs, the crossed Andreev reflection probability can nearly reach the maximal theoretical value. This work unveils how the quantum metric can greatly influence the properties of MZMs through the QML, and the results can be generalized to other topological bound states.
Submitted 9 June, 2024;
originally announced June 2024.
-
The Potential Energy of Heavy Quarkonium in Flavor-Dependent Systems from a Holographic Model
Authors:
Xi Guo,
Xun Chen,
Dong Xiang,
Miguel Angel Martin Contreras,
Xiao-Hua Li
Abstract:
Within the framework of the Einstein-Maxwell-Dilaton (EMD) model, which incorporates information on the equation of state and baryon number susceptibility from lattice results, we have conducted a comprehensive analysis of the potential energy, running coupling, and dissociation time for heavy quark-antiquark pairs using gauge/gravity duality. This study encompasses various systems, including pure gluon systems, 2 flavor systems, 2+1 flavor systems, and 2+1+1 flavor systems under finite temperature and chemical potential. The results reveal that the linear component of the potential energy diminishes as the flavor increases. It is also found that our results are extremely close to the recent lattice results for 2+1 flavors at finite temperature. Moreover, we have thoroughly investigated the dissociation distance and running coupling constant of quark-antiquark pairs to gain a comprehensive understanding of their behavior across various flavors. Finally, we have examined real-time dynamics of quark dissociation. The findings indicate that the dissociation time of quark-antiquark pairs is dependent on temperature, chemical potential, and flavor of the systems.
Submitted 7 June, 2024;
originally announced June 2024.
-
AgentGym: Evolving Large Language Model-based Agents across Diverse Environments
Authors:
Zhiheng Xi,
Yiwen Ding,
Wenxiang Chen,
Boyang Hong,
Honglin Guo,
Junzhe Wang,
Dingwen Yang,
Chenyang Liao,
Xin Guo,
Wei He,
Songyang Gao,
Lu Chen,
Rui Zheng,
Yicheng Zou,
Tao Gui,
Qi Zhang,
Xipeng Qiu,
Xuanjing Huang,
Zuxuan Wu,
Yu-Gang Jiang
Abstract:
Building generalist agents that can handle diverse tasks and evolve themselves across different environments is a long-term goal in the AI community. Large language models (LLMs) are considered a promising foundation to build such agents due to their generalized capabilities. Current approaches either have LLM-based agents imitate expert-provided trajectories step-by-step, requiring human supervision, which is hard to scale and limits environmental exploration; or they let agents explore and learn in isolated environments, resulting in specialist agents with limited generalization. In this paper, we take the first step towards building generally-capable LLM-based agents with self-evolution ability. We identify a trinity of ingredients: 1) diverse environments for agent exploration and learning, 2) a trajectory set to equip agents with basic capabilities and prior knowledge, and 3) an effective and scalable evolution method. We propose AgentGym, a new framework featuring a variety of environments and tasks for broad, real-time, uni-format, and concurrent agent exploration. AgentGym also includes a database with expanded instructions, a benchmark suite, and high-quality trajectories across environments. Next, we propose a novel method, AgentEvol, to investigate the potential of agent self-evolution beyond previously seen data across tasks and environments. Experimental results show that the evolved agents can achieve results comparable to SOTA models. We release the AgentGym suite, including the platform, dataset, benchmark, checkpoints, and algorithm implementations. The AgentGym suite is available on https://github.com/WooooDyy/AgentGym.
Submitted 6 June, 2024;
originally announced June 2024.
-
MODABS: Multi-Objective Learning for Dynamic Aspect-Based Summarization
Authors:
Xiaobo Guo,
Soroush Vosoughi
Abstract:
The rapid proliferation of online content necessitates effective summarization methods, among which dynamic aspect-based summarization stands out. Unlike its traditional counterpart, which assumes a fixed set of known aspects, this approach adapts to the varied aspects of the input text. We introduce a novel multi-objective learning framework employing a Longformer-Encoder-Decoder for this task. The framework optimizes aspect number prediction, minimizes disparity between generated and reference summaries for each aspect, and maximizes dissimilarity across aspect-specific summaries. Extensive experiments show our method significantly outperforms baselines on three diverse datasets, largely due to the effective alignment of generated and reference aspect counts without sacrificing single-aspect summarization quality.
Submitted 17 June, 2024; v1 submitted 5 June, 2024;
originally announced June 2024.
-
LOLAMEME: Logic, Language, Memory, Mechanistic Framework
Authors:
Jay Desai,
Xiaobo Guo,
Srinivasan H. Sengamedu
Abstract:
The performance of Large Language Models has achieved superhuman breadth with unprecedented depth. At the same time, language models are mostly black-box models, and the underlying mechanisms for performance have been evaluated using synthetic or mechanistic schemes. We extend current mechanistic schemes to incorporate Logic, memory, and nuances of Language such as latent structure. The proposed framework is called LOLAMEME, and we provide two instantiations of it: the LoLa and MeMe languages. We then consider two generative language model architectures: transformer-based GPT-2 and convolution-based Hyena. We propose the hybrid architecture T HEX and use the LOLAMEME framework to compare the three architectures. T HEX outperforms GPT-2 and Hyena on select tasks.
Submitted 31 May, 2024;
originally announced June 2024.
-
IterMask2: Iterative Unsupervised Anomaly Segmentation via Spatial and Frequency Masking for Brain Lesions in MRI
Authors:
Ziyun Liang,
Xiaoqing Guo,
J. Alison Noble,
Konstantinos Kamnitsas
Abstract:
Unsupervised anomaly segmentation approaches to pathology segmentation train a model on images of healthy subjects, which they define as the 'normal' data distribution. At inference, they aim to segment any pathologies in new images as 'anomalies', as they exhibit patterns that deviate from those in 'normal' training data. Prevailing methods follow the 'corrupt-and-reconstruct' paradigm. They intentionally corrupt an input image, reconstruct it to follow the learned 'normal' distribution, and subsequently segment anomalies based on reconstruction error. Corrupting an input image, however, inevitably leads to suboptimal reconstruction even of normal regions, causing false positives. To alleviate this, we propose a novel iterative spatial mask-refining strategy, IterMask2. We iteratively mask areas of the image, reconstruct them, and update the mask based on reconstruction error. This iterative process progressively adds information about areas that are confidently normal as per the model. The increasing content guides reconstruction of nearby masked areas, improving reconstruction of normal tissue under these areas and reducing false positives. We also use high-frequency image content as an auxiliary input to provide additional structural information for masked areas. This further improves the reconstruction error of normal areas in comparison to anomalous areas, facilitating segmentation of the latter. We conduct experiments on several brain lesion datasets and demonstrate the effectiveness of our method. Code is available at: https://github.com/ZiyunLiang/IterMask2
Submitted 5 June, 2024; v1 submitted 4 June, 2024;
originally announced June 2024.
-
I4VGen: Image as Stepping Stone for Text-to-Video Generation
Authors:
Xiefan Guo,
Jinlin Liu,
Miaomiao Cui,
Di Huang
Abstract:
Text-to-video generation has lagged behind text-to-image synthesis in quality and diversity due to the complexity of spatio-temporal modeling and limited video-text datasets. This paper presents I4VGen, a training-free and plug-and-play video diffusion inference framework, which enhances text-to-video generation by leveraging robust image techniques. Specifically, following text-to-image-to-video, I4VGen decomposes text-to-video generation into two stages: anchor image synthesis and anchor image-guided video synthesis. Correspondingly, a well-designed generation-selection pipeline is employed to obtain a visually realistic and semantically faithful anchor image, and an innovative Noise-Invariant Video Score Distillation Sampling is incorporated to animate the image into a dynamic video, followed by a video regeneration process to refine the video. This inference strategy effectively mitigates the prevalent issue of non-zero terminal signal-to-noise ratio. Extensive evaluations show that I4VGen not only produces videos with higher visual realism and textual fidelity but also integrates seamlessly into existing image-to-video diffusion models, thereby improving overall video quality.
Submitted 4 June, 2024;
originally announced June 2024.
-
Sparse Recovery for Holographic MIMO Channels: Leveraging the Clustered Sparsity
Authors:
Yuqing Guo,
Xufeng Guo,
Yuanbin Chen,
Ying Wang
Abstract:
Envisioned as the next-generation transceiver technology, the holographic multiple-input-multiple-output (HMIMO) garners attention for its superior capabilities of fabricating electromagnetic (EM) waves. However, the densely packed antenna elements significantly increase the dimension of the HMIMO channel matrix, rendering traditional channel estimation methods inefficient. While the dimension curse can be relieved to avoid the proportional increase with the antenna density using the state-of-the-art wavenumber-domain sparse representation, the sparse recovery complexity remains tied to the order of non-zero elements in the sparse channel, which still considerably exceeds the number of scatterers. By modeling the inherent clustered sparsity using a Gaussian mixture model (GMM)-based von Mises-Fisher (vMF) distribution, the to-be-estimated channel characteristics can be compressed to the scatterer level. Following the sparsity extraction, a novel wavenumber-domain expectation-maximization (WD-EM) algorithm is proposed to implement the cluster-by-cluster variational inference, thus significantly reducing the computational complexity. Simulation results verify the robustness of the proposed scheme across overheads and signal-to-noise ratios (SNRs).
Submitted 4 June, 2024;
originally announced June 2024.
-
Leveraging Predicate and Triplet Learning for Scene Graph Generation
Authors:
Jiankai Li,
Yunhong Wang,
Xiefan Guo,
Ruijie Yang,
Weixin Li
Abstract:
Scene Graph Generation (SGG) aims to identify entities and predict the relationship triplets <subject, predicate, object> in visual scenes. Given the prevalence of large visual variations of subject-object pairs even in the same predicate, it can be quite challenging to model and refine predicate representations directly across such pairs, which is however a common strategy adopted by most existing SGG methods. We observe that visual variations within the identical triplet are relatively small and certain relation cues are shared in the same type of triplet, which can potentially facilitate the relation learning in SGG. Moreover, for the long-tail problem widely studied in SGG task, it is also crucial to deal with the limited types and quantity of triplets in tail predicates. Accordingly, in this paper, we propose a Dual-granularity Relation Modeling (DRM) network to leverage fine-grained triplet cues besides the coarse-grained predicate ones. DRM utilizes contexts and semantics of predicate and triplet with Dual-granularity Constraints, generating compact and balanced representations from two perspectives to facilitate relation recognition. Furthermore, a Dual-granularity Knowledge Transfer (DKT) strategy is introduced to transfer variation from head predicates/triplets to tail ones, aiming to enrich the pattern diversity of tail classes to alleviate the long-tail problem. Extensive experiments demonstrate the effectiveness of our method, which establishes new state-of-the-art performance on Visual Genome, Open Image, and GQA datasets. Our code is available at https://github.com/jkli1998/DRM
Submitted 4 June, 2024;
originally announced June 2024.
-
Measurement of Electron Antineutrino Oscillation Amplitude and Frequency via Neutron Capture on Hydrogen at Daya Bay
Authors:
Daya Bay collaboration,
F. P. An,
W. D. Bai,
A. B. Balantekin,
M. Bishai,
S. Blyth,
G. F. Cao,
J. Cao,
J. F. Chang,
Y. Chang,
H. S. Chen,
H. Y. Chen,
S. M. Chen,
Y. Chen,
Y. X. Chen,
Z. Y. Chen,
J. Cheng,
J. Cheng,
Y. -C. Cheng,
Z. K. Cheng,
J. J. Cherwinka,
M. C. Chu,
J. P. Cummings,
O. Dalager,
F. S. Deng,
et al. (177 additional authors not shown)
Abstract:
This Letter reports the first measurement of the oscillation amplitude and frequency of reactor antineutrinos at Daya Bay via neutron capture on hydrogen using 1958 days of data. With over 3.6 million signal candidates, an optimized candidate selection, improved treatment of backgrounds and efficiencies, refined energy calibration, and an energy response model for the capture-on-hydrogen sensitive region, the relative $\overline{\nu}_{e}$ rates and energy spectra variation among the near and far detectors gives $\sin^2 2\theta_{13} = 0.0759_{-0.0049}^{+0.0050}$ and $\Delta m^2_{32} = (2.72^{+0.14}_{-0.15})\times10^{-3}$ eV$^2$ assuming the normal neutrino mass ordering, and $\Delta m^2_{32} = (-2.83^{+0.15}_{-0.14})\times10^{-3}$ eV$^2$ for the inverted neutrino mass ordering. This estimate of $\sin^2 2\theta_{13}$ is consistent with and essentially independent from the one obtained using the capture-on-gadolinium sample at Daya Bay. The combination of these two results yields $\sin^2 2\theta_{13} = 0.0833\pm0.0022$, which represents an 8% relative improvement in precision regarding the Daya Bay full 3158-day capture-on-gadolinium result.
Submitted 3 June, 2024;
originally announced June 2024.
-
Ta2Pd3Te5 topological thermometer
Authors:
Yupeng Li,
Anqi Wang,
Senyang Pan,
Dayu Yan,
Guang Yang,
Xingchen Guo,
Yu Hong,
Guangtong Liu,
Fanming Qu,
Zhijun Wang,
Tian Qian,
Jinglei Zhang,
Youguo Shi,
Li Lu,
Jie Shen
Abstract:
In recent decades, there has been a persistent pursuit of applications for surface/edge states in topological systems, driven by their dissipationless transport effects. However, there have been limited tangible breakthroughs in this field. This work demonstrates the remarkable properties of the topological insulator Ta2Pd3Te5, as a thermometer. This material exhibits a power-law correlation in temperature-dependent resistance at low temperatures, stemming from its Luttinger liquid behavior of edge states, while exhibiting semiconductor behavior at high temperatures. The power-law behavior effectively addresses the issue of infinite resistance in semiconductor thermometers at ultra-low temperatures, thereby playing a crucial role in enabling efficient thermometry in refrigerators supporting millikelvin temperatures or below. By employing chemical doping, adjusting thickness, and controlling gate voltage, its power-law behavior and semiconductor behavior can be effectively modulated. This enables efficient thermometry spanning from millikelvin temperatures to room temperature, and allows for precise local temperature measurement. Furthermore, this thermometer exhibits excellent temperature sensitivity and resolution, and can be fine-tuned to show small magnetoresistance. In summary, the Ta2Pd3Te5 thermometer, also referred to as a topological thermometer, exhibits outstanding performance and significant potential for measuring a wider range of temperatures compared to conventional low-temperature thermometers.
Submitted 2 June, 2024;
originally announced June 2024.
-
Diffusion Tuning: Transferring Diffusion Models via Chain of Forgetting
Authors:
Jincheng Zhong,
Xingzhuo Guo,
Jiaxiang Dong,
Mingsheng Long
Abstract:
Diffusion models have significantly advanced the field of generative modeling. However, training a diffusion model is computationally expensive, creating a pressing need to adapt off-the-shelf diffusion models for downstream generation tasks. Current fine-tuning methods focus on parameter-efficient transfer learning but overlook the fundamental transfer characteristics of diffusion models. In this paper, we investigate the transferability of diffusion models and observe a monotonous chain of forgetting trend of transferability along the reverse process. Based on this observation and novel theoretical insights, we present Diff-Tuning, a frustratingly simple transfer approach that leverages the chain of forgetting tendency. Diff-Tuning encourages the fine-tuned model to retain the pre-trained knowledge at the end of the denoising chain close to the generated data while discarding the other noise side. We conduct comprehensive experiments to evaluate Diff-Tuning, including the transfer of pre-trained Diffusion Transformer models to eight downstream generations and the adaptation of Stable Diffusion to five control conditions with ControlNet. Diff-Tuning achieves a 26% improvement over standard fine-tuning and enhances the convergence speed of ControlNet by 24%. Notably, parameter-efficient transfer learning techniques for diffusion models can also benefit from Diff-Tuning.
Submitted 6 June, 2024; v1 submitted 2 June, 2024;
originally announced June 2024.
-
GLCAN: Global-Local Collaborative Auxiliary Network for Local Learning
Authors:
Feiyu Zhu,
Yuming Zhang,
Changpeng Cai,
Guinan Guo,
Jiao Li,
Xiuyuan Guo,
Quanwei Zhang,
Peizhe Wang,
Chenghao He,
Junhao Su
Abstract:
Traditional deep neural networks typically use end-to-end backpropagation, which often places a big burden on GPU memory. Another promising training method is local learning, which involves splitting the network into blocks and training them in parallel with the help of an auxiliary network. Local learning has been widely studied and applied to image classification tasks, and its performance is comparable to that of end-to-end method. However, different image tasks often rely on different feature representations, which is difficult for typical auxiliary networks to adapt to. To solve this problem, we propose the construction method of Global-Local Collaborative Auxiliary Network (GLCAN), which provides a macroscopic design approach for auxiliary networks. This is the first demonstration that local learning methods can be successfully applied to other tasks such as object detection and super-resolution. GLCAN not only saves a lot of GPU memory, but also has comparable performance to an end-to-end approach on data sets for multiple different tasks.
Submitted 1 June, 2024;
originally announced June 2024.
-
SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths
Authors:
Kaixuan Huang,
Xudong Guo,
Mengdi Wang
Abstract:
Speculative decoding reduces the inference latency of a target large language model via utilizing a smaller and faster draft model. Its performance depends on a hyperparameter K -- the candidate length, i.e., the number of candidate tokens for the target model to verify in each round. However, previous methods often use simple heuristics to choose K, which may result in sub-optimal performance. We study the choice of the candidate length K and formulate it as a Markov Decision Process. We theoretically show that the optimal policy of this Markov decision process takes the form of a threshold policy, i.e., the current speculation should stop and be verified when the probability of getting a rejection exceeds a threshold value. Motivated by this theory, we propose SpecDec++, an enhanced version of speculative decoding that adaptively determines the candidate length on the fly. We augment the draft model with a trained acceptance prediction head to predict the conditional acceptance probability of the candidate tokens. SpecDec++ will stop the current speculation when the predicted probability that at least one token gets rejected exceeds a threshold. We implement SpecDec++ and apply it to the llama-2-chat 7B & 70B model pair. Our adaptive method achieves a 2.04x speedup on the Alpaca dataset (an additional 7.2% improvement over the baseline speculative decoding). On the GSM8K and HumanEval datasets, our method achieves a 2.26x speedup (9.4% improvement) and 2.23x speedup (11.1% improvement), respectively.
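The threshold rule described in this abstract can be illustrated with a minimal sketch. It assumes the acceptance prediction head yields one conditional acceptance probability per drafted token; the function name `candidate_length`, the `max_len` cap, and the example numbers are hypothetical and are not taken from the paper's implementation.

```python
from typing import Iterable


def candidate_length(accept_probs: Iterable[float],
                     threshold: float,
                     max_len: int = 16) -> int:
    """Return how many draft tokens to propose before verification.

    accept_probs: predicted conditional acceptance probability of each
        drafted token, in order (assumed to come from an acceptance head).
    threshold: stop once P(at least one token rejected) exceeds this value.
    max_len: hypothetical hard cap on the candidate length.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")

    all_accepted = 1.0  # running probability that every token so far is accepted
    length = 0
    for p in accept_probs:
        if length >= max_len:
            break
        all_accepted *= p
        length += 1
        # P(at least one rejection) = 1 - P(all accepted)
        if 1.0 - all_accepted > threshold:
            break
    return length


if __name__ == "__main__":
    # With four tokens at 0.9 each, the rejection probability grows
    # 0.1, 0.19, 0.271, 0.3439, so a 0.2 threshold stops after three tokens.
    print(candidate_length([0.9, 0.9, 0.9, 0.9], threshold=0.2))
```

In this sketch the token that pushes the rejection probability over the threshold is still included in the candidate; whether the stop happens before or after that token is a design choice that the abstract does not specify.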
Submitted 20 June, 2024; v1 submitted 30 May, 2024;
originally announced May 2024.
-
CSANet: Channel Spatial Attention Network for Robust 3D Face Alignment and Reconstruction
Authors:
Yilin Liu,
Xuezhou Guo,
Xinqi Wang,
Fangzhou Du
Abstract:
Our project proposes an end-to-end 3D face alignment and reconstruction network. The backbone of our model is built by Bottle-Neck structure via Depth-wise Separable Convolution. We integrate Coordinate Attention mechanism and Spatial Group-wise Enhancement to extract more representative features. For more stable training process and better convergence, we jointly use Wing loss and the Weighted Parameter Distance Cost to learn parameters for 3D Morphable model and 3D vertices. Our proposed model outperforms all baseline models both quantitatively and qualitatively.
Submitted 29 May, 2024;
originally announced May 2024.
-
Isovalent alloying assisted anomalous valley Hall effect in hexagonal antiferromagnetic monolayer
Authors:
San-Dong Guo,
Liguo Zhang,
Xiao-Shu Guo,
Gangqiang Zhu
Abstract:
Exploring the combination of antiferromagnetic (AFM) spintronics and the anomalous valley Hall effect (AVHE) is one of the most important questions for valleytronic applications. The key to addressing this issue is to achieve spin splitting around the valleys in AFM systems. Here, we propose a possible way of achieving AVHE in a hexagonal AFM monolayer, which involves isovalent alloying. This can break the combined symmetry ($PT$ symmetry) of spatial inversion ($P$) and time reversal ($T$), giving rise to spin splitting. More specifically, the large spin splitting around the Fermi energy level is due to $d$ orbital mismatch among the different transition metal ions. Based on first-principles calculations, the proposed way can be verified in the out-of-plane AFM $\mathrm{CrMoC_2S_6}$ monolayer, which possesses spontaneous valley polarization and spin splitting, providing the possibility to realize AVHE. It is also shown that tensile strain can strengthen the valley splitting and maintain the out-of-plane AFM ordering. Our work provides an experimentally feasible way of developing AFM valleytronic devices.
Submitted 29 May, 2024;
originally announced May 2024.
-
JADS: A Framework for Self-supervised Joint Aspect Discovery and Summarization
Authors:
Xiaobo Guo,
Jay Desai,
Srinivasan H. Sengamedu
Abstract:
To generate summaries that include multiple aspects or topics for text documents, most approaches use clustering or topic modeling to group relevant sentences and then generate a summary for each group. These approaches struggle to optimize the summarization and clustering algorithms jointly. On the other hand, aspect-based summarization requires known aspects. Our solution integrates topic discovery and summarization into a single step. Given text data, our Joint Aspect Discovery and Summarization algorithm (JADS) discovers aspects from the input and generates a summary of the topics, in one step. We propose a self-supervised framework that creates a labeled dataset by first mixing sentences from multiple documents (e.g., CNN/DailyMail articles) as the input and then uses the article summaries from the mixture as the labels. The JADS model outperforms the two-step baselines. With pretraining, the model achieves better performance and stability. Furthermore, embeddings derived from JADS exhibit superior clustering capabilities. Our proposed method achieves higher semantic alignment with ground truth and is factual.
Submitted 28 May, 2024;
originally announced May 2024.
-
Performance Optimization in RSMA-assisted Uplink xURLLC IIoT Networks with Statistical QoS Provisioning
Authors:
Yuang Chen,
Hancheng Lu,
Chang Wu,
Langtian Qin,
Xiaobo Guo
Abstract:
Industry 5.0 and beyond networks have driven the emergence of numerous mission-critical applications, prompting contemplation of the neXt-generation ultra-reliable low-latency communication (xURLLC). To guarantee low-latency requirements, xURLLC heavily relies on short-blocklength packets with sporadic arrival traffic. As a disruptive multi-access technique, rate-splitting multiple access (RSMA) has emerged as a promising avenue to enhance quality of service (QoS) and flexibly manage interference for next-generation communication networks. In this paper, we investigate an innovative RSMA-assisted uplink xURLLC industrial internet-of-things (IIoT) (RSMA-xURLLC-IIoT) network. To unveil reliable insights into the statistical QoS provisioning (SQP) for our proposed network with sporadic arrival traffic, we leverage stochastic network calculus (SNC) to develop a dependable theoretical framework. Building upon this theoretical framework, we formulate the SQP-driven short-packet size maximization problem and the SQP-driven transmit power minimization problem, aiming to guarantee the SQP performance to latency, decoding, and reliability while maximizing the short-packet size and minimizing the transmit power, respectively. By exploiting Monte-Carlo methods, we have thoroughly validated the dependability of the developed theoretical framework. Moreover, through extensive comparison analysis with state-of-the-art multi-access techniques, including non-orthogonal multiple access (NOMA) and orthogonal multiple access (OMA), we have demonstrated the superior performance gains achieved by the proposed RSMA-xURLLC-IIoT networks.
Submitted 26 May, 2024;
originally announced May 2024.
-
Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation
Authors:
Jinlin Liu,
Kai Yu,
Mengyang Feng,
Xiefan Guo,
Miaomiao Cui
Abstract:
Recent advancements in human video synthesis have enabled the generation of high-quality videos through the application of stable diffusion models. However, existing methods predominantly concentrate on animating solely the human element (the foreground) guided by pose information, while leaving the background entirely static. Contrary to this, in authentic, high-quality videos, backgrounds often dynamically adjust in harmony with foreground movements, eschewing stagnancy. We introduce a technique that concurrently learns both foreground and background dynamics by segregating their movements using distinct motion representations. Human figures are animated leveraging pose-based motion, capturing intricate actions. Conversely, for backgrounds, we employ sparse tracking points to model motion, thereby reflecting the natural interaction between foreground activity and environmental changes. Training on real-world videos enhanced with this innovative motion depiction approach, our model generates videos exhibiting coherent movement in both foreground subjects and their surrounding contexts. To further extend video generation to longer sequences without accumulating errors, we adopt a clip-by-clip generation strategy, introducing global features at each step. To ensure seamless continuity across these segments, we ingeniously link the final frame of a produced clip with input noise to spawn the succeeding one, maintaining narrative flow. Throughout the sequential generation process, we infuse the feature representation of the initial reference image into the network, effectively curtailing any cumulative color inconsistencies that may otherwise arise. Empirical evaluations attest to the superiority of our method in producing videos that exhibit harmonious interplay between foreground actions and responsive background dynamics, surpassing prior methodologies in this regard.
Submitted 28 May, 2024; v1 submitted 25 May, 2024;
originally announced May 2024.
-
DIDI: Diffusion-Guided Diversity for Offline Behavioral Generation
Authors:
Jinxin Liu,
Xinghong Guo,
Zifeng Zhuang,
Donglin Wang
Abstract:
In this paper, we propose a novel approach called DIffusion-guided DIversity (DIDI) for offline behavioral generation. The goal of DIDI is to learn a diverse set of skills from a mixture of label-free offline data. We achieve this by leveraging diffusion probabilistic models as priors to guide the learning process and regularize the policy. By optimizing a joint objective that incorporates diversity and diffusion-guided regularization, we encourage the emergence of diverse behaviors while maintaining the similarity to the offline data. Experimental results in four decision-making domains (Push, Kitchen, Humanoid, and D4RL tasks) show that DIDI is effective in discovering diverse and discriminative skills. We also introduce skill stitching and skill interpolation, which highlight the generalist nature of the learned skill space. Further, by incorporating an extrinsic reward function, DIDI enables reward-guided behavior generation, facilitating the learning of diverse and optimal behaviors from sub-optimal data.
Submitted 23 May, 2024;
originally announced May 2024.
-
Statistical inference for high-dimensional convoluted rank regression
Authors:
Leheng Cai,
Xu Guo,
Heng Lian,
Liping Zhu
Abstract:
High-dimensional penalized rank regression is a powerful tool for modeling high-dimensional data due to its robustness and estimation efficiency. However, the non-smoothness of the rank loss brings great challenges to the computation. To solve this critical issue, high-dimensional convoluted rank regression is recently proposed, and penalized convoluted rank regression estimators are introduced. However, these developed estimators cannot be directly used to make inference. In this paper, we investigate the inference problem of high-dimensional convoluted rank regression. We first establish estimation error bounds of penalized convoluted rank regression estimators under weaker conditions on the predictors. Based on the penalized convoluted rank regression estimators, we further introduce a debiased estimator. We then provide Bahadur representation for our proposed estimator. We further develop simultaneous inference procedures. A novel bootstrap procedure is proposed and its theoretical validity is also established. Finally, simulation and real data analysis are conducted to illustrate the merits of our proposed methods.
Submitted 23 May, 2024;
originally announced May 2024.
-
Automated Loss function Search for Class-imbalanced Node Classification
Authors:
Xinyu Guo,
Kai Wu,
Xiaoyu Zhang,
Jing Liu
Abstract:
Class-imbalanced node classification tasks are prevalent in real-world scenarios. Due to the uneven distribution of nodes across different classes, learning high-quality node representations remains a challenging endeavor. The engineering of loss functions has shown promising potential in addressing this issue. It involves the meticulous design of loss functions, utilizing information about the quantities of nodes in different categories and the network's topology to learn unbiased node representations. However, the design of these loss functions heavily relies on human expert knowledge and exhibits limited adaptability to specific target tasks. In this paper, we introduce a high-performance, flexible, and generalizable automated loss function search framework to tackle this challenge. Across 15 combinations of graph neural networks and datasets, our framework achieves a significant improvement in performance compared to state-of-the-art methods. Additionally, we observe that homophily in graph-structured data significantly contributes to the transferability of the proposed framework.
Submitted 22 May, 2024;
originally announced May 2024.
-
Advancing Transportation Mode Share Analysis with Built Environment: Deep Hybrid Models with Urban Road Network
Authors:
Dingyi Zhuang,
Qingyi Wang,
Yunhan Zheng,
Xiaotong Guo,
Shenhao Wang,
Haris N Koutsopoulos,
Jinhua Zhao
Abstract:
Transportation mode share analysis is important to various real-world transportation tasks as it helps researchers understand the travel behaviors and choices of passengers. A typical example is the prediction of communities' travel mode share by accounting for their sociodemographics like age, income, etc., and travel modes' attributes (e.g. travel cost and time). However, there exist only limited efforts in integrating the structure of the urban built environment, e.g., road networks, into the mode share models to capture the impacts of the built environment. This task usually requires manual feature engineering or prior knowledge of the urban design features. In this study, we propose deep hybrid models (DHM), which directly combine road networks and sociodemographic features as inputs for travel mode share analysis. Using graph embedding (GE) techniques, we enhance travel demand models with a more powerful representation of urban structures. In experiments of mode share prediction in Chicago, results demonstrate that DHM can provide valuable spatial insights into the sociodemographic structure, improving the performance of travel demand models in estimating different mode shares at the city level. Specifically, DHM improves the results by more than 20% while retaining the interpretation power of the choice models, demonstrating its superiority in interpretability, prediction accuracy, and geographical insights.
Submitted 22 May, 2024;
originally announced May 2024.
-
IM-RAG: Multi-Round Retrieval-Augmented Generation Through Learning Inner Monologues
Authors:
Diji Yang,
Jinmeng Rao,
Kezhen Chen,
Xiaoyuan Guo,
Yawen Zhang,
Jie Yang,
Yi Zhang
Abstract:
Although the Retrieval-Augmented Generation (RAG) paradigms can use external knowledge to enhance and ground the outputs of Large Language Models (LLMs) to mitigate generative hallucinations and static knowledge base problems, they still suffer from limited flexibility in adopting Information Retrieval (IR) systems with varying capabilities, constrained interpretability during the multi-round retrieval process, and a lack of end-to-end optimization. To address these challenges, we propose a novel LLM-centric approach, IM-RAG, that integrates IR systems with LLMs to support multi-round RAG through learning Inner Monologues (IM, i.e., the human inner voice that narrates one's thoughts). During the IM process, the LLM serves as the core reasoning model (i.e., Reasoner) to either propose queries to collect more information via the Retriever or to provide a final answer based on the conversational context. We also introduce a Refiner that improves the outputs from the Retriever, effectively bridging the gap between the Reasoner and IR modules with varying capabilities and fostering multi-round communications. The entire IM process is optimized via Reinforcement Learning (RL) where a Progress Tracker is incorporated to provide mid-step rewards, and the answer prediction is further separately optimized via Supervised Fine-Tuning (SFT). We conduct extensive experiments with the HotPotQA dataset, a popular benchmark for retrieval-based, multi-step question-answering. The results show that our approach achieves state-of-the-art (SOTA) performance while providing high flexibility in integrating IR modules as well as strong interpretability exhibited in the learned inner monologues.
Submitted 15 May, 2024;
originally announced May 2024.
-
Dose-aware Diffusion Model for 3D Low-dose PET: Multi-institutional Validation with Reader Study and Real Low-dose Data
Authors:
Huidong Xie,
Weijie Gan,
Bo Zhou,
Ming-Kai Chen,
Michal Kulon,
Annemarie Boustani,
Benjamin A. Spencer,
Reimund Bayerlein,
Xiongchao Chen,
Qiong Liu,
Xueqi Guo,
Menghua Xia,
Yinchi Zhou,
Hui Liu,
Liang Guo,
Hongyu An,
Ulugbek S. Kamilov,
Hanzhong Wang,
Biao Li,
Axel Rominger,
Kuangyu Shi,
Ge Wang,
Ramsey D. Badawi,
Chi Liu
Abstract:
As PET imaging is accompanied by radiation exposure and potentially increased cancer risk, reducing radiation dose in PET scans without compromising the image quality is an important topic. Deep learning (DL) techniques have been investigated for low-dose PET imaging. However, existing models have often resulted in compromised image quality when achieving low-dose PET and have limited generalizability to different image noise-levels, acquisition protocols, patient populations, and hospitals. Recently, diffusion models have emerged as the new state-of-the-art generative model to generate high-quality samples and have demonstrated strong potential for medical imaging tasks. However, for low-dose PET imaging, existing diffusion models failed to generate consistent 3D reconstructions, unable to generalize across varying noise-levels, often produced visually-appealing but distorted image details, and produced images with biased tracer uptake. Here, we develop DDPET-3D, a dose-aware diffusion model for 3D low-dose PET imaging to address these challenges. Collected from 4 medical centers globally with different scanners and clinical protocols, we extensively evaluated the proposed model using a total of 9,783 18F-FDG studies (1,596 patients) with low-dose/low-count levels ranging from 1% to 50%. With a cross-center, cross-scanner validation, the proposed DDPET-3D demonstrated its potential to generalize to different low-dose levels, different scanners, and different clinical protocols. As confirmed with reader studies performed by nuclear medicine physicians, the proposed method produced superior denoised results that are comparable to or even better than the 100% full-count images as well as previous DL baselines. The presented results show the potential of achieving low-dose PET while maintaining image quality. Lastly, a group of real low-dose scans was also included for evaluation.
Submitted 2 May, 2024;
originally announced May 2024.