SlideGCD: Slide-based Graph Collaborative Training with Knowledge Distillation for Whole Slide Image Classification

Authors: Tong Shu, Jun Shi, Dongdong Sun, Zhiguo Jiang, Yushan Zheng

Abstract: Existing WSI analysis methods lie on the consensus that histopathological characteristics of tumors are significant guidance for cancer diagnostics. Particularly, as the evolution of cancers is a continuous process, the correlations and differences across various stages, anatomical locations and patients should be taken into account. However, recent research mainly focuses on the inner-contextual information in a single WSI, ignoring the correlations between slides. To verify whether introducing the slide inter-correlations can bring improvements to WSI representation learning, we propose a generic WSI analysis pipeline SlideGCD that considers the existing multi-instance learning (MIL) methods as the backbone and forge the WSI classification task as a node classification problem. More specifically, SlideGCD declares a node buffer that stores previous slide embeddings for subsequent extensive slide-based graph construction and conducts graph learning to explore the inter-correlations implied in the slide-based graph. Moreover, we frame the MIL classifier and graph learning into two parallel workflows and deploy the knowledge distillation to transfer the differentiable information to the graph neural network. The consistent performance boosting, brought by SlideGCD, of four previous state-of-the-art MIL methods is observed on two TCGA benchmark datasets. The code is available at https://github.com/HFUT-miaLab/SlideGCD.

Submitted 11 July, 2024; originally announced July 2024.

Comments: Submitted to MICCAI-2024

arXiv:2407.08855 [pdf, other]

BraTS-PEDs: Results of the Multi-Consortium International Pediatric Brain Tumor Segmentation Challenge 2023

Authors: Anahita Fathi Kazerooni, Nastaran Khalili, Xinyang Liu, Debanjan Haldar, Zhifan Jiang, Anna Zapaishchykova, Julija Pavaine, Lubdha M. Shah, Blaise V. Jones, Nakul Sheth, Sanjay P. Prabhu, Aaron S. McAllister, Wenxin Tu, Khanak K. Nandolia, Andres F. Rodriguez, Ibraheem Salman Shaikh, Mariana Sanchez Montano, Hollie Anne Lai, Maruf Adewole, Jake Albrecht, Udunna Anazodo, Hannah Anderson, Syed Muhammed Anwar, Alejandro Aristizabal, Sina Bagheri, et al. (54 additional authors not shown)

Abstract: Pediatric central nervous system tumors are the leading cause of cancer-related deaths in children. The five-year survival rate for high-grade glioma in children is less than 20%. The development of new treatments is dependent upon multi-institutional collaborative clinical trials requiring reproducible and accurate centralized response assessment. We present the results of the BraTS-PEDs 2023 challenge, the first Brain Tumor Segmentation (BraTS) challenge focused on pediatric brain tumors. This challenge utilized data acquired from multiple international consortia dedicated to pediatric neuro-oncology and clinical trials. BraTS-PEDs 2023 aimed to evaluate volumetric segmentation algorithms for pediatric brain gliomas from magnetic resonance imaging using standardized quantitative performance evaluation metrics employed across the BraTS 2023 challenges. The top-performing AI approaches for pediatric tumor analysis included ensembles of nnU-Net and Swin UNETR, Auto3DSeg, or nnU-Net with a self-supervised framework. The BraTS-PEDs 2023 challenge fostered collaboration between clinicians (neuro-oncologists, neuroradiologists) and AI/imaging scientists, promoting faster data sharing and the development of automated volumetric analysis techniques. These advancements could significantly benefit clinical trials and improve the care of children with brain tumors.

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2407.08504 [pdf, other]

Revisiting the Formulation of Charged Defect in Solids

Authors: Hanzhi Shang, Zeyu Jiang, Yiyang Sun, Damien West, Shengbai Zhang

Abstract: Defect physics is at the heart of microelectronics. By keeping track of the reference energy in total energy calculations, we explicitly show that the "potential alignment" correction vanishes, and the classic Makov-Payne correction yields accurate results. From linear response theory, we further formulate an accurate expression for the quadrupole correction. Application to numerous defects including anisotropic material yields accurate formation energies in small supercells and the historically slow convergence of the 2+ diamond vacancy is shown to be a result of slow varying gap levels of the defect leading to a size dependent dielectric constant.

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2407.08153 [pdf, other]

Lifelong Histopathology Whole Slide Image Retrieval via Distance Consistency Rehearsal

Authors: Xinyu Zhu, Zhiguo Jiang, Kun Wu, Jun Shi, Yushan Zheng

Abstract: Content-based histopathological image retrieval (CBHIR) has gained attention in recent years, offering the capability to return histopathology images that are content-wise similar to the query one from an established database. However, in clinical practice, the continuously expanding size of WSI databases limits the practical application of the current CBHIR methods. In this paper, we propose a Lifelong Whole Slide Retrieval (LWSR) framework to address the challenges of catastrophic forgetting by progressive model updating on continuously growing retrieval database. Our framework aims to achieve the balance between stability and plasticity during continuous learning. To preserve system plasticity, we utilize local memory bank with reservoir sampling method to save instances, which can comprehensively encompass the feature spaces of both old and new tasks. Furthermore, a distance consistency rehearsal (DCR) module is designed to ensure the retrieval queue's consistency for previous tasks, which is regarded as stability within a lifelong CBHIR system. We evaluated the proposed method on four public WSI datasets from TCGA projects. The experimental results have demonstrated the proposed method is effective and is superior to the state-of-the-art methods.

Submitted 12 July, 2024; v1 submitted 10 July, 2024; originally announced July 2024.

Comments: Accepted for MICCAI 2024
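
Note: The abstract above mentions a local memory bank filled by reservoir sampling. The sketch below shows generic reservoir sampling (Algorithm R) only; it is not taken from the paper, and the names `reservoir_update`, `bank`, `seen`, and `capacity` are hypothetical. It may help to see why a fixed-size bank can keep representing both old and new data: every item offered so far has the same chance of being retained.

```python
import random


def reservoir_update(bank, item, seen, capacity, rng):
    """Offer one item to a fixed-size memory bank (Algorithm R).

    `seen` is the number of items offered before this one.
    Returns the updated count of items offered.
    """
    if len(bank) < capacity:
        # The bank is not full yet, so every item is kept.
        bank.append(item)
    else:
        # Draw a slot uniformly from 0..seen (inclusive); the new item
        # survives only if the slot falls inside the bank.
        slot = rng.randint(0, seen)
        if slot < capacity:
            bank[slot] = item
    return seen + 1


# Stream 100 stand-in feature ids through a bank of size 10.
rng = random.Random(0)
bank, seen = [], 0
for feature_id in range(100):
    seen = reservoir_update(bank, feature_id, seen, capacity=10, rng=rng)
```

After the loop, `bank` holds 10 distinct ids drawn from the whole stream, not just the most recent ones. How the paper's framework chooses the capacity or what it stores per instance is not stated in the abstract.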

arXiv:2407.07859 [pdf, other]

Domains in ferroelectric nitrides superlattices

Authors: Zhijun Jiang, Zhenlong Zhang, Charles Paillard, Hongjun Xiang, Laurent Bellaiche

Abstract: Ferroelectric nitrides have emerged as promising semiconductor materials for modern electronics. However, their domain structures and associated properties are basically unknown, despite their potential to result in optimized or new phenomena. Density functional theory calculations are performed to investigate the effect of epitaxial strain on multidomains of (Al,Sc)N nitride systems and to compare it with the monodomain case. The multidomain systems are predicted to have five strain-induced regions, to be denoted as Regions I to V, respectively. Each of these regions is associated with rather different values or behaviors of physical properties such as axial ratio, polarizations, internal parameters, bond lengths, etc. Of particular interest is the prediction of bent domains under compressive strain extending beyond $-$5.5%, which indicates that domain walls may play a key role in the mechanical failure properties of these systems. Interestingly, such bending induces the creation of a finite in-plane polarization (in addition to out-of-plane dipoles) due to geometric and symmetry considerations. Strikingly too, the bent domains have lower energy than the wurtzite monodomains and have atomically sharp boundaries. Our findings may pave the way for domain wall engineering in ferroelectric nitrides.

Submitted 10 July, 2024; originally announced July 2024.

Comments: 9 pages, 4 figures

arXiv:2407.07504 [pdf, other]

Pan-cancer Histopathology WSI Pre-training with Position-aware Masked Autoencoder

Authors: Kun Wu, Zhiguo Jiang, Kunming Tang, Jun Shi, Fengying Xie, Wei Wang, Haibo Wu, Yushan Zheng

Abstract: Large-scale pre-training models have promoted the development of histopathology image analysis. However, existing self-supervised methods for histopathology images focus on learning patch features, while there is still a lack of available pre-training models for WSI-level feature learning. In this paper, we propose a novel self-supervised learning framework for pan-cancer WSI-level representation pre-training with the designed position-aware masked autoencoder (PAMA). Meanwhile, we propose the position-aware cross-attention (PACA) module with a kernel reorientation (KRO) strategy and an anchor dropout (AD) mechanism. The KRO strategy can capture the complete semantic structure and eliminate ambiguity in WSIs, and the AD contributes to enhancing the robustness and generalization of the model. We evaluated our method on 6 large-scale datasets from multiple organs for pan-cancer classification tasks. The results have demonstrated the effectiveness of PAMA in generalized and discriminative WSI representation learning and pan-cancer WSI pre-training. The proposed method was also compared with 7 WSI analysis methods. The experimental results have indicated that our proposed PAMA is superior to the state-of-the-art methods. The code and checkpoints are available at https://github.com/WkEEn/PAMA.

Submitted 10 July, 2024; originally announced July 2024.

arXiv:2407.06937 [pdf, other]

HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance

Authors: Guian Fang, Wenbiao Yan, Yuanfan Guo, Jianhua Han, Zutao Jiang, Hang Xu, Shengcai Liao, Xiaodan Liang

Abstract: Text-to-image diffusion models have significantly advanced in conditional image generation. However, these models usually struggle with accurately rendering images featuring humans, resulting in distorted limbs and other anomalies. This issue primarily stems from the insufficient recognition and evaluation of limb qualities in diffusion models. To address this issue, we introduce AbHuman, the first large-scale synthesized human benchmark focusing on anatomical anomalies. This benchmark consists of 56K synthesized human images, each annotated with detailed, bounding-box level labels identifying 147K human anomalies in 18 different categories. Based on this, the recognition of human anomalies can be established, which in turn enhances image generation through traditional techniques such as negative prompting and guidance. To further boost the improvement, we propose HumanRefiner, a novel plug-and-play approach for the coarse-to-fine refinement of human anomalies in text-to-image generation. Specifically, HumanRefiner utilizes a self-diagnostic procedure to detect and correct issues related to both coarse-grained abnormal human poses and fine-grained anomaly levels, facilitating pose-reversible diffusion generation. Experimental results on the AbHuman benchmark demonstrate that HumanRefiner significantly reduces generative discrepancies, achieving a 2.9x improvement in limb quality compared to the state-of-the-art open-source generator SDXL and a 1.4x improvement over DALL-E 3 in human evaluations. Our data and code are available at https://github.com/Enderfga/HumanRefiner.

Submitted 9 July, 2024; originally announced July 2024.

Comments: Accepted by ECCV2024

arXiv:2407.05413 [pdf, other]

SBoRA: Low-Rank Adaptation with Regional Weight Updates

Authors: Lai-Man Po, Yuyang Liu, Haoxuan Wu, Tianqi Zhang, Wing-Yin Yu, Zeyu Jiang, Kun Li

Abstract: This paper introduces Standard Basis LoRA (SBoRA), a novel parameter-efficient fine-tuning approach for Large Language Models that builds upon the pioneering works of Low-Rank Adaptation (LoRA) and Orthogonal Adaptation. SBoRA further reduces the computational and memory requirements of LoRA while enhancing learning performance. By leveraging orthogonal standard basis vectors to initialize one of the low-rank matrices, either A or B, SBoRA enables regional weight updates and memory-efficient fine-tuning. This approach gives rise to two variants, SBoRA-FA and SBoRA-FB, where only one of the matrices is updated, resulting in a sparse update matrix with a majority of zero rows or columns. Consequently, the majority of the fine-tuned model's weights remain unchanged from the pre-trained weights. This characteristic of SBoRA, wherein regional weight updates occur, is reminiscent of the modular organization of the human brain, which efficiently adapts to new tasks. Our empirical results demonstrate the superiority of SBoRA-FA over LoRA in various fine-tuning tasks, including commonsense reasoning and arithmetic reasoning. Furthermore, we evaluate the effectiveness of QSBoRA on quantized LLaMA models of varying scales, highlighting its potential for efficient adaptation to new tasks. Code is available at https://github.com/cityuhkai/SBoRA

Submitted 10 July, 2024; v1 submitted 7 July, 2024; originally announced July 2024.

Comments: 15 pages, 2 figures
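
Note: The abstract above states that fixing one low-rank factor to standard basis vectors yields an update matrix that is mostly zero rows or columns. The toy sketch below illustrates only that linear-algebra fact, under assumptions of my own: the update is written as `delta_W = B A`, the frozen factor is `A` with one-hot rows, and the dimensions and helper names (`one_hot_rows`, `matmul`, `nonzero_columns`) are hypothetical. It is not the authors' implementation.

```python
import random


def one_hot_rows(indices, dim):
    """Build a matrix whose rows are the standard basis vectors e_i."""
    return [[1.0 if j == i else 0.0 for j in range(dim)] for i in indices]


def matmul(left, right):
    """Plain-Python product of two list-of-lists matrices."""
    inner, cols = len(right), len(right[0])
    return [
        [sum(row[k] * right[k][c] for k in range(inner)) for c in range(cols)]
        for row in left
    ]


def nonzero_columns(matrix):
    """Sorted indices of the columns that contain any nonzero entry."""
    return sorted({c for row in matrix for c, v in enumerate(row) if v != 0.0})


rng = random.Random(0)
d_out, d_in, rank = 4, 8, 2

# Frozen factor: `rank` distinct one-hot rows, shape rank x d_in.
chosen = sorted(rng.sample(range(d_in), rank))
A = one_hot_rows(chosen, d_in)

# Stand-in for the trainable factor, shape d_out x rank (all entries nonzero).
B = [[rng.uniform(0.5, 1.5) for _ in range(rank)] for _ in range(d_out)]

# The weight update touches only the columns selected by the one-hot rows.
delta_W = matmul(B, A)
```

Here `nonzero_columns(delta_W)` equals `chosen`, so 6 of the 8 columns of the update stay exactly zero. Swapping the roles of the two factors would give zero rows in place of zero columns.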

arXiv:2407.04086 [pdf, other]

Certifiably Robust Image Watermark

Authors: Zhengyuan Jiang, Moyang Guo, Yuepeng Hu, Jinyuan Jia, Neil Zhenqiang Gong

Abstract: Generative AI raises many societal concerns such as boosting disinformation and propaganda campaigns. Watermarking AI-generated content is a key technology to address these concerns and has been widely deployed in industry. However, watermarking is vulnerable to removal attacks and forgery attacks. In this work, we propose the first image watermarks with certified robustness guarantees against removal and forgery attacks. Our method leverages randomized smoothing, a popular technique to build certifiably robust classifiers and regression models. Our major technical contributions include extending randomized smoothing to watermarking by considering its unique characteristics, deriving the certified robustness guarantees, and designing algorithms to estimate them. Moreover, we extensively evaluate our image watermarks in terms of both certified and empirical robustness. Our code is available at https://github.com/zhengyuan-jiang/Watermark-Library.

Submitted 4 July, 2024; originally announced July 2024.

Comments: Accepted by ECCV 2024

arXiv:2407.03572 [pdf, other]

Core: Robust Factual Precision Scoring with Informative Sub-Claim Identification

Authors: Zhengping Jiang, Jingyu Zhang, Nathaniel Weir, Seth Ebner, Miriam Wanner, Kate Sanders, Daniel Khashabi, Anqi Liu, Benjamin Van Durme

Abstract: Hallucinations -- the generation of untrue claims -- pose a challenge to the application of large language models (LLMs) [1] thereby motivating the development of metrics to evaluate factual precision. We observe that popular metrics using the Decompose-Then-Verify framework, such as FActScore [2], can be manipulated by adding obvious or repetitive claims to artificially inflate scores. We expand the FActScore dataset to design and analyze factual precision metrics, demonstrating that models can be trained to achieve high scores under existing metrics through exploiting the issues we identify. This motivates our new customizable plug-and-play subclaim selection component called Core, which filters down individual subclaims according to their uniqueness and informativeness. Metrics augmented by Core are substantially more robust as shown in head-to-head comparisons. We release an evaluation framework supporting the modular use of Core (https://github.com/zipJiang/Core) and various decomposition strategies, and we suggest its adoption by the LLM community. [1] Hong et al., "The Hallucinations Leaderboard -- An Open Effort to Measure Hallucinations in Large Language Models", arXiv:2404.05904v2 [cs.CL]. [2] Min et al., "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation", arXiv:2305.14251v2 [cs.CL].

Submitted 3 July, 2024; originally announced July 2024.

arXiv:2407.03515 [pdf, other]

Feature-Specific Coefficients of Determination in Tree Ensembles

Authors: Zhongli Jiang, Dabao Zhang, Min Zhang

Abstract: Tree ensemble methods provide promising predictions with models difficult to interpret. Recent introduction of Shapley values for individualized feature contributions, accompanied with several fast computing algorithms for predicted values, shows intriguing results. However, individualizing coefficients of determination, aka $R^2$, for each feature is challenged by the underlying quadratic losses, although these coefficients allow us to comparatively assess single feature's contribution to tree ensembles. Here we propose an efficient algorithm, Q-SHAP, that reduces the computational complexity to polynomial time when calculating Shapley values related to quadratic losses. Our extensive simulation studies demonstrate that this approach not only enhances computational efficiency but also improves estimation accuracy of feature-specific coefficients of determination.

Submitted 3 July, 2024; originally announced July 2024.
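
Note: For context on the abstract above, the sketch below computes exact Shapley values by brute force, averaging each feature's marginal contribution over every ordering. This is the textbook definition, not Q-SHAP; its cost grows factorially with the number of features, which is the kind of expense a polynomial-time method avoids. The table `TOY_R2` is invented for illustration and stands in for the $R^2$ of a model restricted to a feature subset.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal gain from adding p to the features already present.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


# Hypothetical R^2 attained by each feature subset (made-up numbers).
TOY_R2 = {
    frozenset(): 0.0,
    frozenset({"x1"}): 0.5,
    frozenset({"x2"}): 0.25,
    frozenset({"x1", "x2"}): 0.875,
}

phi = shapley_values(["x1", "x2"], TOY_R2.__getitem__)
```

With these numbers `phi["x1"]` is 0.5625 and `phi["x2"]` is 0.3125; the two shares add up to the full-model value 0.875, which is the efficiency property that makes per-feature $R^2$ attributions interpretable.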

arXiv:2407.02746 [pdf, other]

Motion Comparator: Visual Comparison of Robot Motions

Authors: Yeping Wang, Alexander Peseckis, Zelong Jiang, Michael Gleicher

Abstract: Roboticists compare robot motions for tasks such as parameter tuning, troubleshooting, and deciding between possible motions. However, most existing visualization tools are designed for individual motions and lack the features necessary to facilitate robot motion comparison. In this paper, we utilize a rigorous design framework to develop Motion Comparator, a web-based tool that facilitates the comprehension, comparison, and communication of robot motions. Our design process identified roboticists' needs, articulated design challenges, and provided corresponding strategies. Motion Comparator includes several key features such as multi-view coordination, quaternion visualization, time warping, and comparative designs. To demonstrate the applications of Motion Comparator, we discuss four case studies in which our tool is used for motion selection, troubleshooting, parameter tuning, and motion review.

Submitted 2 July, 2024; originally announced July 2024.

Comments: Accepted by IEEE Robotics and Automation Letters (RAL)

arXiv:2407.02604 [pdf, other]

D-Rax: Domain-specific Radiologic assistant leveraging multi-modal data and eXpert model predictions

Authors: Hareem Nisar, Syed Muhammad Anwar, Zhifan Jiang, Abhijeet Parida, Vishwesh Nath, Holger R. Roth, Marius George Linguraru

Abstract: Large vision language models (VLMs) have progressed incredibly from research to applicability for general-purpose use cases. LLaVA-Med, a pioneering large language and vision assistant for biomedicine, can perform multi-modal biomedical image and data analysis to provide a natural language interface for radiologists. While it is highly generalizable and works with multi-modal data, it is currently limited by well-known challenges that exist in the large language model space. Hallucinations and imprecision in responses can lead to misdiagnosis, which currently hinders the clinical adaptability of VLMs. To create precise, user-friendly models in healthcare, we propose D-Rax -- a domain-specific, conversational, radiologic assistance tool that can be used to gain insights about a particular radiologic image. In this study, we enhance the conversational analysis of chest X-ray (CXR) images to support radiological reporting, offering comprehensive insights from medical imaging and aiding in the formulation of accurate diagnosis. D-Rax is achieved by fine-tuning the LLaVA-Med architecture on our curated enhanced instruction-following data, comprising images, instructions, as well as disease diagnosis and demographic predictions derived from MIMIC-CXR imaging data, CXR-related visual question answer (VQA) pairs, and predictive outcomes from multiple expert AI models. We observe statistically significant improvement in responses when evaluated for both open and close-ended conversations. Leveraging the power of state-of-the-art diagnostic models combined with VLMs, D-Rax empowers clinicians to interact with medical images using natural language, which could potentially streamline their decision-making process, enhance diagnostic accuracy, and conserve their time.

Submitted 2 July, 2024; originally announced July 2024.

arXiv:2407.01264 [pdf, other]

SignCLIP: Connecting Text and Sign Language by Contrastive Learning

Authors: Zifan Jiang, Gerard Sant, Amit Moryossef, Mathias Müller, Rico Sennrich, Sarah Ebling

Abstract: We present SignCLIP, which re-purposes CLIP (Contrastive Language-Image Pretraining) to project spoken language text and sign language videos, two classes of natural languages of distinct modalities, into the same space. SignCLIP is an efficient method of learning useful visual representations for sign language processing from large-scale, multilingual video-text pairs, without directly optimizing for a specific task or sign language which is often of limited size. We pretrain SignCLIP on Spreadthesign, a prominent sign language dictionary consisting of ~500 thousand video clips in up to 44 sign languages, and evaluate it with various downstream datasets. SignCLIP discerns in-domain signing with notable text-to-video/video-to-text retrieval accuracy. It also performs competitively for out-of-domain downstream tasks such as isolated sign language recognition upon essential few-shot prompting or fine-tuning. We analyze the latent space formed by the spoken language text and sign language poses, which provides additional linguistic insights. Our code and models are openly available.

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2406.20098 [pdf, other]

Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs

Authors: Sukmin Yun, Haokun Lin, Rusiru Thushara, Mohammad Qazim Bhat, Yongxin Wang, Zutao Jiang, Mingkai Deng, Jinhong Wang, Tianhua Tao, Junbo Li, Haonan Li, Preslav Nakov, Timothy Baldwin, Zhengzhong Liu, Eric P. Xing, Xiaodan Liang, Zhiqiang Shen

Abstract: Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio in a variety of understanding and generation tasks. However, current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code. To address this problem, we propose Web2Code, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning and an evaluation framework for the webpage understanding and HTML code translation abilities of MLLMs. For dataset construction, we leverage pretrained LLMs to enhance existing webpage-to-code datasets as well as generate a diverse pool of new webpages rendered into images. Specifically, the inputs are webpage images and instructions, while the responses are the webpage's HTML code. We further include diverse natural language QA pairs about the webpage content in the responses to enable a more comprehensive understanding of the web content. To evaluate model performance in these tasks, we develop an evaluation framework for testing MLLMs' abilities in webpage understanding and web-to-code generation. Extensive experiments show that our proposed dataset is beneficial not only to our proposed tasks but also in the general visual domain, while previous datasets result in worse performance. We hope our work will contribute to the development of general MLLMs suitable for web-based content generation and task automation. Our data and code will be available at https://github.com/MBZUAI-LLM/web2code.

Submitted 28 June, 2024; originally announced June 2024.

Comments: Website at https://mbzuai-llm.github.io/webpage2code/

arXiv:2406.18013 [pdf, other]

Effects of model size in density-functional-theory study of alloys: A case study of CsPbBr$_2$Cl

Authors: Fang Pan, Lin Yang, Zhuangde Jiang, Wei Ren, Zuo-Guang Ye, Jingrui Li

Abstract: The primary challenge of density-functional-theory exploration of alloy systems concerns the size of computational model. Small alloy models can hardly exhibit the chemical disorder properly, while large models induce difficulty in sampling the alignments within the massive material space. We study this problem with the γ phase of the mixed halide inorganic perovskite alloy CsPbBr$_2$Cl. The distribution of alloy formation energy becomes narrower when the size of the model system increases along $\sqrt{2}\times\sqrt{2}\times2$, $2\times2\times2$, and $2\sqrt{2}\times2\sqrt{2}\times2$ models. This is primarily because the distribution of Br distribution parameters, which plays a leading role in determining the formation energy range, is more narrow for larger models. As a result, larger entropy stability effect can be observed with larger models especially at high temperatures, for which the approximation using mixing entropy based on the ideal solution model becomes better.

Submitted 25 June, 2024; originally announced June 2024.

arXiv:2406.16549 [pdf, other]

Parity-violating primordial gravitational waves from null energy condition violation

Authors: Zi-Wei Jiang, Yong Cai, Fei Wang, Yun-Song Piao

Abstract: We investigate the parity-violating effects in primordial gravitational waves (GWs) due to null energy condition (NEC) violation in two very early universe scenarios: bounce-inflation and intermediate NEC violation during inflation. In both scenarios, we numerically solve the power spectra of parity-violating primordial GWs generated by coupling the background field and the spectator field with the Nieh-Yan term, respectively. We find that the background field can significantly enhance parity-violating effects at scales corresponding to the maximum of the GW power spectra. In contrast, the parity-violating effects produced by the spectator show significantly weaker observability even if the coupling constant is large. Therefore, in NEC-violating scenarios, the significant observable parity-violating effects in primordial GWs primarily arise from the physics directly related to NEC violation. This result highlights the potential of primordial GWs as crucial tools for exploring NEC-violating and parity-violating physics.

Submitted 24 June, 2024; originally announced June 2024.

Comments: 31 pages

arXiv:2406.15319 [pdf, other]

LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs

Authors: Ziyan Jiang, Xueguang Ma, Wenhu Chen

Abstract: In the traditional RAG framework, the basic retrieval units are normally short. The common retrievers like DPR normally work with 100-word Wikipedia paragraphs. Such a design forces the retriever to search over a large corpus to find the `needle' unit. In contrast, the readers only need to extract answers from the short retrieved units. Such an imbalanced `heavy' retriever and `light' reader design can lead to sub-optimal performance. In order to alleviate the imbalance, we propose a new framework LongRAG, consisting of a `long retriever' and a `long reader'. LongRAG processes the entire Wikipedia into 4K-token units, which is 30x longer than before. By increasing the unit size, we significantly reduce the total units from 22M to 700K. This significantly lowers the burden of the retriever, which leads to a remarkable retrieval score: answer recall@1=71% on NQ (previously 52%) and answer recall@2=72% (previously 47%) on HotpotQA (full-wiki). Then we feed the top-k retrieved units ($\approx$ 30K tokens) to an existing long-context LLM to perform zero-shot answer extraction. Without requiring any training, LongRAG achieves an EM of 62.7% on NQ, which is the best known result. LongRAG also achieves 64.3% on HotpotQA (full-wiki), which is on par with the SoTA model. Our study offers insights into the future roadmap for combining RAG with long-context LLMs.

Submitted 30 June, 2024; v1 submitted 21 June, 2024; originally announced June 2024.

Comments: Technical Report
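
The answer recall@k figures quoted in the abstract above can be illustrated with a minimal sketch. It assumes that a question counts as a hit when any gold answer string occurs (case-insensitively) in one of the top-k retrieved units; the paper's exact matching rule may differ, and the function name and toy data are hypothetical.

```python
from typing import Sequence


def answer_recall_at_k(
    retrieved: Sequence[Sequence[str]],
    answers: Sequence[Sequence[str]],
    k: int,
) -> float:
    """Fraction of questions whose top-k retrieved units contain a gold answer.

    Assumption: a unit "contains" an answer if the answer is a
    case-insensitive substring of the unit text.
    """
    if not retrieved:
        return 0.0
    hits = 0
    for units, golds in zip(retrieved, answers):
        top_k = units[:k]  # units are assumed to be sorted by retriever score
        if any(g.lower() in u.lower() for u in top_k for g in golds):
            hits += 1
    return hits / len(retrieved)


# Toy data: two questions, each with two ranked retrieval units.
retrieved_units = [
    ["Paris is the capital of France.", "Berlin is the capital of Germany."],
    ["The Nile is a river in Africa.", "The Amazon flows through Brazil."],
]
gold_answers = [["Paris"], ["Amazon"]]

recall_at_1 = answer_recall_at_k(retrieved_units, gold_answers, k=1)
recall_at_2 = answer_recall_at_k(retrieved_units, gold_answers, k=2)
```

With longer units, a gold answer is more likely to fall inside the first few retrieved units, which is one plausible reading of why recall at small k rises in the abstract's numbers.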

arXiv:2406.15252 [pdf, other]

VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation

Authors: Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, Kai Wang, Quy Duc Do, Yuansheng Ni, Bohan Lyu, Yaswanth Narsupalli, Rongqi Fan, Zhiheng Lyu, Yuchen Lin, Wenhu Chen

Abstract: The recent years have witnessed great advances in video generation. However, the development of automatic video metrics is lagging significantly behind. None of the existing metrics is able to provide reliable scores over generated videos. The main barrier is the lack of a large-scale human-annotated dataset. In this paper, we release VideoFeedback, the first large-scale dataset containing human-provided multi-aspect scores over 37.6K synthesized videos from 11 existing video generative models. We train VideoScore (initialized from Mantis) based on VideoFeedback to enable automatic video quality assessment. Experiments show that the Spearman correlation between VideoScore and humans can reach 77.1 on VideoFeedback-test, beating the prior best metrics by about 50 points. Further results on other held-out EvalCrafter, GenAI-Bench, and VBench show that VideoScore has consistently much higher correlation with human judges than other metrics. Due to these results, we believe VideoScore can serve as a great proxy for human raters to (1) rate different video models to track progress (2) simulate fine-grained human feedback in Reinforcement Learning with Human Feedback (RLHF) to improve current video generation models.

Submitted 24 June, 2024; v1 submitted 21 June, 2024; originally announced June 2024.
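
The Spearman correlation used as the agreement measure in the abstract above can be sketched in plain Python as the Pearson correlation of ranks, with tied values sharing their average rank. This is a generic illustration, not the paper's evaluation code; the function names are hypothetical, and the reading of the reported 77.1 as the coefficient scaled by 100 is an assumption.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation; undefined (raises) if either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("Spearman correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Hypothetical metric scores and human ratings for four videos.
metric_scores = [0.1, 0.4, 0.5, 0.9]
human_ratings = [1.0, 2.0, 3.0, 4.0]
rho = spearman(metric_scores, human_ratings)
```

Because only ranks enter the computation, a metric is rewarded for ordering videos the way human raters do, regardless of the scale of its raw scores.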

arXiv:2406.14797 [pdf, other]

Camera-Invariant Meta-Learning Network for Single-Camera-Training Person Re-identification

Authors: Jiangbo Pei, Zhuqing Jiang, Aidong Men, Haiying Wang, Haiyong Luo, Shiping Wen

Abstract: Single-camera-training person re-identification (SCT re-ID) aims to train a re-ID model using SCT datasets where each person appears in only one camera. The main challenge of SCT re-ID is to learn camera-invariant feature representations without cross-camera same-person (CCSP) data as supervision. Previous methods address it by assuming that the most similar person should be found in another camera. However, this assumption is not guaranteed to be correct. In this paper, we propose a Camera-Invariant Meta-Learning Network (CIMN) for SCT re-ID. CIMN assumes that the camera-invariant feature representations should be robust to camera changes. To this end, we split the training data into a meta-train set and a meta-test set based on camera IDs and perform a cross-camera simulation via a meta-learning strategy, aiming to enforce the representations learned from the meta-train set to be robust to the meta-test set. With the cross-camera simulation, CIMN can learn camera-invariant and identity-discriminative representations even when there are no CCSP data. However, this simulation also causes the separation of the meta-train set and the meta-test set, which ignores some beneficial relations between them. Thus, we introduce three losses: meta triplet loss, meta classification loss, and meta camera alignment loss, to leverage the ignored relations. The experiment results demonstrate that our method achieves comparable performance with and without CCSP data, and outperforms the state-of-the-art methods on SCT re-ID benchmarks. In addition, it is also effective in improving the domain generalization ability of the model.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.14380 [pdf, other]

Estimating Treatment Effects under Recommender Interference: A Structured Neural Networks Approach

Authors: Ruohan Zhan, Shichao Han, Yuchen Hu, Zhenling Jiang

Abstract: Recommender systems are essential for content-sharing platforms by curating personalized content. To evaluate updates to recommender systems targeting content creators, platforms frequently rely on creator-side randomized experiments. The treatment effect measures the change in outcomes when a new algorithm is implemented compared to the status quo. We show that the standard difference-in-means estimator can lead to biased estimates due to recommender interference that arises when treated and control creators compete for exposure. We propose a "recommender choice model" that describes which item gets exposed from a pool containing both treated and control items. By combining a structural choice model with neural networks, this framework directly models the interference pathway while accounting for rich viewer-content heterogeneity. We construct a debiased estimator of the treatment effect and prove it is $\sqrt n$-consistent and asymptotically normal with potentially correlated samples. We validate our estimator's empirical performance with a field experiment on the Weixin short-video platform. In addition to the standard creator-side experiment, we conduct a costly double-sided randomization design to obtain a benchmark estimate free from interference bias. We show that the proposed estimator yields results comparable to the benchmark, whereas the standard difference-in-means estimator can exhibit significant bias and even produce reversed signs.

Submitted 5 July, 2024; v1 submitted 20 June, 2024; originally announced June 2024.
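
For reference, the standard difference-in-means estimator that the abstract above describes as potentially biased can be sketched as follows. This is the naive baseline only, not the paper's debiased estimator; the function name and toy outcomes are hypothetical.

```python
from typing import Sequence


def difference_in_means(outcomes: Sequence[float], treated: Sequence[bool]) -> float:
    """Mean outcome of treated units minus mean outcome of control units.

    Note: this ignores interference between units, which is the source of
    bias the abstract discusses when creators compete for exposure.
    """
    treated_y = [y for y, z in zip(outcomes, treated) if z]
    control_y = [y for y, z in zip(outcomes, treated) if not z]
    if not treated_y or not control_y:
        raise ValueError("need at least one treated and one control unit")
    return sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)


# Hypothetical per-creator outcomes (e.g. exposure counts) and assignments.
creator_outcomes = [3.0, 5.0, 1.0, 1.0]
creator_treated = [True, True, False, False]
naive_estimate = difference_in_means(creator_outcomes, creator_treated)
```

If treated creators gain exposure partly at the expense of control creators, the control mean is pushed down and this estimate overstates the effect of rolling the algorithm out to everyone, which is the kind of bias the abstract attributes to recommender interference.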

arXiv:2406.13301 [pdf, other]

ARDuP: Active Region Video Diffusion for Universal Policies

Authors: Shuaiyi Huang, Mara Levy, Zhenyu Jiang, Anima Anandkumar, Yuke Zhu, Linxi Fan, De-An Huang, Abhinav Shrivastava

Abstract: Sequential decision-making can be formulated as a text-conditioned video generation problem, where a video planner, guided by a text-defined goal, generates future frames visualizing planned actions, from which control actions are subsequently derived. In this work, we introduce Active Region Video Diffusion for Universal Policies (ARDuP), a novel framework for video-based policy learning that emphasizes the generation of active regions, i.e. potential interaction areas, enhancing the conditional policy's focus on interactive areas critical for task execution. This innovative framework integrates active region conditioning with latent diffusion models for video planning and employs latent representations for direct action decoding during inverse dynamic modeling. By utilizing motion cues in videos for automatic active region discovery, our method eliminates the need for manual annotations of active regions. We validate ARDuP's efficacy via extensive experiments on simulator CLIPort and the real-world dataset BridgeData v2, achieving notable improvements in success rates and generating convincingly realistic video plans.

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.12875 [pdf, other]

Machine learning evaluation in the Global Event Processor FPGA for the ATLAS trigger upgrade

Authors: Zhixing Jiang, Scott Hauck, Dennis Yin, Bowen Zuo, Ben Carlson, Shih-Chieh Hsu, Allison Deiana, Rohin Narayan, Santosh Parajuli, Jeff Eastlack

Abstract: The Global Event Processor (GEP) FPGA is an area-constrained, performance-critical element of the Large Hadron Collider's (LHC) ATLAS experiment. It needs to very quickly determine which small fraction of detected events should be retained for further processing, and which other events will be discarded. This system involves a large number of individual processing tasks, brought together within the overall Algorithm Processing Platform (APP), to make filtering decisions at an overall latency of no more than 8ms. Currently, such filtering tasks are hand-coded implementations of standard deterministic signal processing tasks. In this paper we present methods to automatically create machine learning based algorithms for use within the APP framework, and demonstrate several successful such deployments. We leverage existing machine learning to FPGA flows such as hls4ml and fwX to significantly reduce the complexity of algorithm design. These have resulted in implementations of various machine learning algorithms with latencies of 1.2us and less than 5% resource utilization on a Xilinx XCVU9P FPGA. Finally, we implement these algorithms into the GEP system and present their actual performance. Our work shows the potential of using machine learning in the GEP for high-energy physics applications. This can significantly improve the performance of the trigger system and enable the ATLAS experiment to collect more data and make more discoveries. The architecture and approach presented in this paper can also be applied to other applications that require real-time processing of large volumes of data.

Submitted 7 May, 2024; originally announced June 2024.

Comments: 14 pages, 4 figures, 6 tables. Accepted by JINST on April 3, 2024

arXiv:2406.12736 [pdf, other]

Beyond Visual Appearances: Privacy-sensitive Objects Identification via Hybrid Graph Reasoning

Authors: Zhuohang Jiang, Bingkui Tong, Xia Du, Ahmed Alhammadi, Jizhe Zhou

Abstract: The Privacy-sensitive Object Identification (POI) task allocates bounding boxes for privacy-sensitive objects in a scene. The key to POI is settling an object's privacy class (privacy-sensitive or non-privacy-sensitive). In contrast to conventional object classes which are determined by the visual appearance of an object, one object's privacy class is derived from the scene contexts and is subject to various implicit factors beyond its visual appearance. That is, visually similar objects may be totally opposite in their privacy classes. To explicitly derive the objects' privacy class from the scene contexts, in this paper, we interpret the POI task as a visual reasoning task aimed at the privacy of each object in the scene. Following this interpretation, we propose the PrivacyGuard framework for POI. PrivacyGuard contains three stages. i) Structuring: an unstructured image is first converted into a structured, heterogeneous scene graph that embeds rich scene contexts. ii) Data Augmentation: a contextual perturbation oversampling strategy is proposed to create slightly perturbed privacy-sensitive objects in a scene graph, thereby balancing the skewed distribution of privacy classes. iii) Hybrid Graph Generation & Reasoning: the balanced, heterogeneous scene graph is then transformed into a hybrid graph by endowing it with extra "node-node" and "edge-edge" homogeneous paths. These homogeneous paths allow direct message passing between nodes or edges, thereby accelerating reasoning and facilitating the capturing of subtle context changes. Based on this hybrid graph... **For the full abstract, see the original paper.**

Submitted 18 June, 2024; originally announced June 2024.

Comments: 15 pages

arXiv:2406.11551 [pdf, other]

Simple Yet Efficient: Towards Self-Supervised FG-SBIR with Unified Sample Feature Alignment

Authors: Jianan Jiang, Di Wu, Zhilin Jiang, Weiren Yu

Abstract: Fine-Grained Sketch-Based Image Retrieval (FG-SBIR) aims to minimize the distance between sketches and corresponding images in the embedding space. However, scalability is hindered by the growing complexity of solutions, mainly due to the abstract nature of fine-grained sketches. In this paper, we propose a simple yet efficient approach to narrow the gap between the two modes. It mainly facilitates unified mutual information sharing both intra- and inter-samples, rather than treating them as a single feature alignment problem between modalities. Specifically, our approach includes: (i) Employing dual weight-sharing networks to optimize alignment within the sketch and image domains, which also effectively mitigates model learning saturation issues. (ii) Introducing an objective optimization function based on contrastive loss to enhance the model's ability to align features intra- and inter-samples. (iii) Presenting a learnable TRSM combining self-attention and cross-attention to promote feature representations among tokens, further enhancing sample alignment in the embedding space. Our framework achieves excellent results on CNN- and ViT-based backbones. Extensive experiments demonstrate its superiority over existing methods. We also introduce Cloths-V1, the first professional fashion sketches and images dataset, which is utilized to validate our method and will be beneficial for other applications.

Submitted 22 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: 10 pages, 8 figures, 4 tables

arXiv:2406.11207 [pdf]

Doping-tunable Fermi surface with persistent topological Hall effect in axion candidate EuIn$_2$As$_2$

Authors: Jian Yan, Jianguo Si, Zhongzhu Jiang, Hanming Ma, Yoshiya Uwatoko, Bao-Tian Wang, Xuan Luo, Yuping Sun, Minoru Yamashita

Abstract: Rare-earth Zintl compound EuIn$_2$As$_2$ has been theoretically recognized as a candidate for realizing an intrinsic antiferromagnetic (AFM) bulk axion insulator and a higher-order topological state, which provides a fertile platform to explore novel topological transport phenomena. However, the axion state has yet to be realized because EuIn$_2$As$_2$ is highly hole-doped. Here, we synthesized a series of high-quality Ca-doped EuIn$_2$As$_2$ (Ca$_x$Eu$_{1-x}$In$_2$As$_2$, x = 0 ~ 0.25) single crystals to tune the Fermi energy above the hole pocket. Our Hall measurements reveal that the isovalent Ca substitution decreases the hole carrier density by shrinking the lattice spacing, which is also confirmed by our first-principles calculations. We further find that both the temperature dependence of the magnetic susceptibility with a local maximum at the Néel temperature and the topological Hall effect originating from the finite real-space spin chirality persist in the Ca-doped samples as observed in the pristine EuIn$_2$As$_2$, even though the nonmagnetic Ca substitution decreases the effective moment and the Néel temperature. These results show that the Ca substitution tunes the Fermi energy while keeping the AFM magnetic structure, suggesting that the axion insulating state may be realized by further Ca substitution.

Submitted 17 June, 2024; originally announced June 2024.

Comments: 18 pages, 8 figures

arXiv:2406.10934 [pdf]

Beyond Answers: Large Language Model-Powered Tutoring System in Physics Education for Deep Learning and Precise Understanding

Authors: Zhoumingju Jiang, Mengjun Jiang

Abstract: The integration of artificial intelligence (AI) in education has shown significant promise, yet the effective personalization of learning, particularly in physics education, remains a challenge. This paper proposes Physics-STAR, a framework for a large language model (LLM)-powered tutoring system designed to address this gap by providing personalized and adaptive learning experiences for high school students. Our study evaluates Physics-STAR against traditional teacher-led lectures and generic LLM tutoring through a controlled experiment with 12 high school sophomores. Results showed that Physics-STAR increased students' average scores and efficiency on conceptual, computational, and informational questions. In particular, students' average scores on complex information problems increased by 100% and their efficiency increased by 5.95%. By facilitating step-by-step guidance and reflective learning, Physics-STAR helps students develop critical thinking skills and a robust comprehension of abstract concepts. The findings underscore the potential of AI-driven personalized tutoring systems to transform physics education. As LLMs continue to advance, the future of student-centered AI in education looks promising, with the potential to significantly improve learning outcomes and efficiency.

Submitted 16 June, 2024; originally announced June 2024.

Comments: 13 pages, 3 figures, CSCW 2024

arXiv:2406.10580 [pdf, other]

IMDL-BenCo: A Comprehensive Benchmark and Codebase for Image Manipulation Detection & Localization

Authors: Xiaochen Ma, Xuekang Zhu, Lei Su, Bo Du, Zhuohang Jiang, Bingkui Tong, Zeyu Lei, Xinyu Yang, Chi-Man Pun, Jiancheng Lv, Jizhe Zhou

Abstract: A comprehensive benchmark is yet to be established in the Image Manipulation Detection & Localization (IMDL) field. The absence of such a benchmark leads to insufficient and misleading model evaluations, severely undermining the development of this field. However, the scarcity of open-sourced baseline models and inconsistent training and evaluation protocols make conducting rigorous experiments and faithful comparisons among IMDL models challenging. To address these challenges, we introduce IMDL-BenCo, the first comprehensive IMDL benchmark and modular codebase. IMDL-BenCo: i) decomposes the IMDL framework into standardized, reusable components and revises the model construction pipeline, improving coding efficiency and customization flexibility; ii) fully implements or incorporates training code for state-of-the-art models to establish a comprehensive IMDL benchmark; and iii) conducts deep analysis based on the established benchmark and codebase, offering new insights into IMDL model architecture, dataset characteristics, and evaluation standards. Specifically, IMDL-BenCo includes common processing algorithms, 8 state-of-the-art IMDL models (1 of which is reproduced from scratch), 2 sets of standard training and evaluation protocols, 15 GPU-accelerated evaluation metrics, and 3 kinds of robustness evaluation. This benchmark and codebase represent a significant leap forward in calibrating the current progress in the IMDL field and inspiring future breakthroughs. Code is available at: https://github.com/scu-zjz/IMDLBenCo

Submitted 15 June, 2024; originally announced June 2024.

Comments: Technical report

arXiv:2406.09317 [pdf, other]

Common and Rare Fundus Diseases Identification Using Vision-Language Foundation Model with Knowledge of Over 400 Diseases

Authors: Meng Wang, Tian Lin, Aidi Lin, Kai Yu, Yuanyuan Peng, Lianyu Wang, Cheng Chen, Ke Zou, Huiyu Liang, Man Chen, Xue Yao, Meiqin Zhang, Binwei Huang, Chaoxin Zheng, Peixin Zhang, Wei Chen, Yilong Luo, Yifan Chen, Honghe Xia, Tingkun Shi, Qi Zhang, Jinming Guo, Xiaolin Chen, Jingcheng Wang, Yih Chung Tham, et al. (24 additional authors not shown)

Abstract: Previous foundation models for retinal images were pre-trained with limited disease categories and knowledge base. Here we introduce RetiZero, a vision-language foundation model that leverages knowledge from over 400 fundus diseases. For RetiZero's pre-training, we compiled 341,896 fundus images paired with text descriptions, sourced from public datasets, ophthalmic literature, and online resources, encompassing a diverse range of diseases across multiple ethnicities and countries. RetiZero exhibits superior performance in several downstream tasks, including zero-shot disease recognition, image-to-image retrieval, and internal- and cross-domain disease identification. In zero-shot scenarios, RetiZero achieves Top5 accuracy scores of 0.8430 for 15 fundus diseases and 0.7561 for 52 fundus diseases. For image retrieval, it achieves Top5 scores of 0.9500 and 0.8860 for the same disease sets, respectively. Clinical evaluations show that RetiZero's Top3 zero-shot performance surpasses the average of 19 ophthalmologists from Singapore, China and the United States. Furthermore, RetiZero significantly enhances clinicians' accuracy in diagnosing fundus disease. These findings underscore the value of integrating the RetiZero foundation model into clinical settings, where a variety of fundus diseases are encountered.

Submitted 30 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.08729 [pdf, other]

Structure Phase Change Induced by Nonequilibrium Effects in Molecular Scale Junctions

Authors: Hao Wang, Kah-Meng Yam, Zhuoling Jiang, Na Guo, Chun Zhang

Abstract: The interrelationship between a material's structure and its properties lies at the heart of materials-related research. Finding how the changes of one affect the other is of primary importance in theoretical and computational materials studies. In this work, based on Hershfield nonequilibrium quantum statistics and the mean-field approach with steady-state density functional theory, we derive a first-principles method to calculate the forces on atoms induced by nonequilibrium effects, enabling structure optimizations and molecular dynamics simulations for molecular junctions under external biases. By applying the method to a few molecular devices, we found that in general, the external bias can induce profound nonequilibrium effects on both electronic/transport properties and the geometric structure of these devices, and the consequent changes in electronic properties and geometric structure are closely interrelated. Particularly, when the bias voltage is above 1.0 V, significant structure phase changes could occur, causing dramatic changes in I-V characteristics and vibrational spectra. These findings greatly broaden our understanding of quantum electronic devices and provide a new avenue for discovering novel transport phenomena at molecular scale.

Submitted 12 June, 2024; originally announced June 2024.

Comments: 17 pages, 11 figures

arXiv:2406.08698 [pdf, other]

Constraints on Ultra Heavy Dark Matter Properties from Dwarf Spheroidal Galaxies with LHAASO Observations

Authors: Zhen Cao, F. Aharonian, Q. An, Axikegu, Y. X. Bai, Y. W. Bao, D. Bastieri, X. J. Bi, Y. J. Bi, J. T. Cai, Q. Cao, W. Y. Cao, Zhe Cao, J. Chang, J. F. Chang, A. M. Chen, E. S. Chen, Liang Chen, Lin Chen, Long Chen, M. J. Chen, M. L. Chen, Q. H. Chen, S. H. Chen, S. Z. Chen, et al. (255 additional authors not shown)

Abstract: In this work we search for signals generated by ultra-heavy dark matter in the Large High Altitude Air Shower Observatory (LHAASO) data. We look for possible gamma rays from dark matter annihilation or decay in 16 dwarf spheroidal galaxies in the field of view of LHAASO. Dwarf spheroidal galaxies are among the most promising targets for indirect detection of dark matter, as they have low fluxes of astrophysical $γ$-ray background while containing a large amount of dark matter. By analyzing more than 700 days of observational data at LHAASO, no significant dark matter signal from 1 TeV to 1 EeV is detected. Accordingly we derive the most stringent constraints on the ultra-heavy dark matter annihilation cross-section up to EeV. The constraints on the lifetime of dark matter in decay mode are also derived.

Submitted 12 June, 2024; originally announced June 2024.

Comments: 17 pages, 12 figures, accepted by PRL

arXiv:2406.07873 [pdf, other]

Robust 3D Face Alignment with Multi-Path Neural Architecture Search

Authors: Zhichao Jiang, Hongsong Wang, Xi Teng, Baopu Li

Abstract: 3D face alignment is a very challenging and fundamental problem in computer vision. Existing deep learning-based methods manually design different networks to regress either parameters of a 3D face model or 3D positions of face vertices. However, designing such networks relies on expert knowledge, and these methods often struggle to produce consistent results across various face poses. To address this limitation, we employ Neural Architecture Search (NAS) to automatically discover the optimal architecture for 3D face alignment. We propose a novel Multi-path One-shot Neural Architecture Search (MONAS) framework that leverages multi-scale features and contextual information to enhance face alignment across various poses. The MONAS comprises two key algorithms: Multi-path Networks Unbiased Sampling Based Training and Simulated Annealing based Multi-path One-shot Search. Experimental results on three popular benchmarks demonstrate the superior performance of the MONAS for both sparse alignment and dense alignment.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.07174 [pdf, other]

ULog: Unsupervised Log Parsing with Large Language Models through Log Contrastive Units

Authors: Junjie Huang, Zhihan Jiang, Zhuangbin Chen, Michael R. Lyu

Abstract: Log parsing serves as an essential prerequisite for various log analysis tasks. Recent advancements in this field have improved parsing accuracy by leveraging the semantics in logs through fine-tuning large language models (LLMs) or learning from in-context demonstrations. However, these methods heavily depend on labeled examples to achieve optimal performance. In practice, collecting sufficient labeled data is challenging due to the large scale and continuous evolution of logs, leading to performance degradation of existing log parsers after deployment. To address this issue, we propose ULog, an unsupervised LLM-based method for efficient and off-the-shelf log parsing. Our key insight is that while LLMs may struggle with direct log parsing, their performance can be significantly enhanced through comparative analysis across multiple logs that differ only in their parameter parts. We refer to such groups of logs as Log Contrastive Units (LCUs). Given the vast volume of logs, obtaining LCUs is difficult. Therefore, ULog introduces a hybrid ranking scheme to effectively search for LCUs by jointly considering the commonality and variability among logs. Additionally, ULog crafts a novel parsing prompt for LLMs to identify contrastive patterns and extract meaningful log structures from LCUs. Experiments on large-scale public datasets demonstrate that ULog significantly outperforms state-of-the-art log parsers in terms of accuracy and efficiency, providing an effective and scalable solution for real-world deployment.

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2406.06979 [pdf, other]

AudioMarkBench: Benchmarking Robustness of Audio Watermarking

Authors: Hongbin Liu, Moyang Guo, Zhengyuan Jiang, Lun Wang, Neil Zhenqiang Gong

Abstract: The increasing realism of synthetic speech, driven by advancements in text-to-speech models, raises ethical concerns regarding impersonation and disinformation. Audio watermarking offers a promising solution via embedding human-imperceptible watermarks into AI-generated audios. However, the robustness of audio watermarking against common/adversarial perturbations remains understudied. We present AudioMarkBench, the first systematic benchmark for evaluating the robustness of audio watermarking against watermark removal and watermark forgery. AudioMarkBench includes a new dataset created from Common-Voice across languages, biological sexes, and ages, 3 state-of-the-art watermarking methods, and 15 types of perturbations. We benchmark the robustness of these methods against the perturbations in no-box, black-box, and white-box settings. Our findings highlight the vulnerabilities of current watermarking techniques and emphasize the need for more robust and fair audio watermarking solutions. Our dataset and code are publicly available at https://github.com/moyangkuo/AudioMarkBench.

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2406.06975 [pdf, other]

TraceMesh: Scalable and Streaming Sampling for Distributed Traces

Authors: Zhuangbin Chen, Zhihan Jiang, Yuxin Su, Michael R. Lyu, Zibin Zheng

Abstract: Distributed tracing serves as a fundamental element in the monitoring of cloud-based and datacenter systems. It provides visibility into the full lifecycle of a request or operation across multiple services, which is essential for understanding system dependencies and performance bottlenecks. To mitigate computational and storage overheads, most tracing frameworks adopt a uniform sampling strategy, which inevitably captures overlapping and redundant information. More advanced methods employ learning-based approaches to bias the sampling toward more informative traces. However, existing methods fall short of considering the high-dimensional and dynamic nature of trace data, which is essential for the production deployment of trace sampling. To address these practical challenges, in this paper we present TraceMesh, a scalable and streaming sampler for distributed traces. TraceMesh employs Locality-Sensitive Hashing (LSH) to improve sampling efficiency by projecting traces into a low-dimensional space while preserving their similarity. In this process, TraceMesh accommodates previously unseen trace features in a unified and streamlined way. Subsequently, TraceMesh samples traces through evolving clustering, which dynamically adjusts the sampling decision to avoid over-sampling of recurring traces. The proposed method is evaluated with trace data collected from both open-source microservice benchmarks and production service systems. Experimental results demonstrate that TraceMesh outperforms state-of-the-art methods by a significant margin in both sampling accuracy and efficiency.

Submitted 11 June, 2024; originally announced June 2024.

Comments: Accepted by The 2024 IEEE 17th International Conference on Cloud Computing (CLOUD)
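
As an illustration of the LSH idea mentioned in the TraceMesh abstract, the following is a minimal sketch of one common LSH family (SimHash, i.e. sign-based signatures). It is not taken from the paper: the abstract does not say which LSH scheme TraceMesh uses, and the feature names, weights, and the functions `simhash` and `hamming` are hypothetical. Hashing feature names directly means a feature never seen before needs no vocabulary update, which is one plausible way to handle unseen features.

```python
import hashlib

BITS = 64  # signature length; matches the 8-byte digest used below


def simhash(features):
    """Map a {feature_name: weight} dict to a 64-bit SimHash signature.

    Similar feature sets tend to produce signatures with a small
    Hamming distance. This is a generic sketch, not TraceMesh's method.
    """
    acc = [0.0] * BITS
    for name, weight in features.items():
        digest = hashlib.blake2b(name.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(BITS):
            # Each feature votes +weight or -weight on every signature bit.
            acc[i] += weight if (h >> i) & 1 else -weight
    sig = 0
    for i, value in enumerate(acc):
        if value > 0:
            sig |= 1 << i
    return sig


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


# Hypothetical traces described by span-name counts.
trace_a = {"GET /cart": 2, "db.query": 5, "cache.get": 3}
trace_b = {"GET /cart": 2, "db.query": 5, "cache.get": 3, "new.span": 1}
distance = hamming(simhash(trace_a), simhash(trace_b))
```

A sampler could then group traces whose signatures fall within a small Hamming distance and keep only a few representatives per group; the threshold is a tuning choice that this sketch does not fix.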

arXiv:2406.06858 [pdf, other]

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

Authors: Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu

Abstract: Large deep learning models have demonstrated strong ability to solve many tasks across a wide range of applications. Those large models typically require training and inference to be distributed. Tensor parallelism is a common technique that partitions the computation of an operation or layer across devices to overcome the memory capacity limitation of a single processor, and/or to accelerate computation to meet a certain latency requirement. However, this kind of parallelism introduces additional communication that might contribute a significant portion of overall runtime. This limits the scalability of the technique within a group of devices with high-speed interconnects, such as GPUs with NVLinks in a node. This paper proposes a novel method, Flux, to significantly hide communication latencies with dependent computations for GPUs. Flux over-decomposes communication and computation operations into much finer-grained operations and further fuses them into a larger kernel to effectively hide communication without compromising kernel efficiency. Flux can potentially overlap up to 96% of communication given a fused kernel. Overall, it can achieve up to 1.24x speedups for training over Megatron-LM on a cluster of 128 GPUs with various GPU generations and interconnects, and up to 1.66x and 1.30x speedups for prefill and decoding inference over vLLM on a cluster with 8 GPUs with various GPU generations and interconnects.

Submitted 18 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

arXiv:2406.06279 [pdf, other]

Multi-Prompting Decoder Helps Better Language Understanding

Authors: Zifeng Cheng, Zhaoling Chen, Zhiwei Jiang, Yafeng Yin, Shiping Ge, Yuliang Liu, Qing Gu

Abstract: Recent Pre-trained Language Models (PLMs) usually only provide users with the inference APIs, namely the emerging Model-as-a-Service (MaaS) setting. To adapt MaaS PLMs to downstream tasks without accessing their parameters and gradients, some existing methods focus on the output-side adaptation of PLMs, viewing the PLM as an encoder and then optimizing a task-specific decoder for decoding the output hidden states and class scores of the PLM. Despite the effectiveness of these methods, they only use a single prompt to query PLMs for decoding, leading to a heavy reliance on the quality of the adopted prompt. In this paper, we propose a simple yet effective Multi-Prompting Decoder (MPD) framework for MaaS adaptation. The core idea is to query PLMs with multiple different prompts for each sample, thereby obtaining multiple output hidden states and class scores for subsequent decoding. Such a multi-prompting decoding paradigm can simultaneously mitigate reliance on the quality of a single prompt, alleviate the issue of data scarcity under the few-shot setting, and provide richer knowledge extracted from PLMs. Specifically, we propose two decoding strategies: multi-prompting decoding with optimal transport for hidden states and calibrated decoding for class scores. Extensive experiments demonstrate that our method achieves new state-of-the-art results on multiple natural language understanding datasets under the few-shot setting.

Submitted 10 June, 2024; originally announced June 2024.

arXiv:2406.04980 [pdf, other]

doi 10.3847/2041-8213/ad55c7

M17 MIR: A Massive Star is Forming via Episodic Mass Accretion

Authors: Wei Zhou, Zhiwei Chen, Zhibo Jiang, Haoran Feng, Yu Jiang

Abstract: We analyzed the Atacama Large Millimeter/submillimeter Array (ALMA) band 6 data for the outbursting massive protostar M17 MIR. The ALMA CO $J=2-1$ data reveal a collimated and bipolar north-south outflow from M17 MIR. The blue-shifted outflow exhibits four CO knots (N1 to N4) along the outflow axis, while the red-shifted outflow appears as a single knot (S1). The extremely high velocity (EHV) emissions of N1 and S1 are jet-like and contain sub-knots along the outflow axis. Assuming the nearest EHV sub-knots trace the ejecta from the accretion outbursts in the past decades, a tangential ejection velocity of $\sim421\,\mathrm{km\,s^{-1}}$ is derived for M17 MIR. Assuming the same velocity, the dynamical times of the multiple ejecta, traced by the four blue-shifted CO knots, range from 20 to 364 years. The four blue-shifted CO knots imply four clustered accretion outbursts with a duration of tens of years in the past few hundred years. The intervals between the four clustered accretion outbursts are also about tens of years. These properties of the four clustered accretion outbursts are in line with the disk gravitational instability and fragmentation model. The episodic accretion history of M17 MIR traced by episodic outflow suggests that a massive star can form from a lower-mass protostar via frequent episodic accretion events triggered by disk gravitational instability and fragmentation. The first detection of the knotty outflow from an outbursting massive protostar suggests that mass ejections accompanied with accretion events could serve as an effective diagnostic tool for the episodic accretion histories of massive protostars.

Submitted 17 June, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

Comments: Accepted for publication in ApJL; typos corrected

arXiv:2406.04744 [pdf, other]

CRAG -- Comprehensive RAG Benchmark

Authors: Xiao Yang, Kai Sun, Hao Xin, Yushi Sun, Nikita Bhalla, Xiangsen Chen, Sajal Choudhary, Rongze Daniel Gui, Ziran Will Jiang, Ziyu Jiang, Lingkun Kong, Brian Moran, Jiaqi Wang, Yifan Ethan Xu, An Yan, Chenyu Yang, Eting Yuan, Hanwen Zha, Nan Tang, Lei Chen, Nicolas Scheffer, Yue Liu, Nirav Shah, Rakesh Wanga, Anuj Kumar, et al. (2 additional authors not shown)

Abstract: Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution to alleviate the knowledge deficiencies of Large Language Models (LLMs). Existing RAG datasets, however, do not adequately represent the diverse and dynamic nature of real-world Question Answering (QA) tasks. To bridge this gap, we introduce the Comprehensive RAG Benchmark (CRAG), a factual question answering benchmark of 4,409 question-answer pairs and mock APIs to simulate web and Knowledge Graph (KG) search. CRAG is designed to encapsulate a diverse array of questions across five domains and eight question categories, reflecting varied entity popularity from popular to long-tail, and temporal dynamisms ranging from years to seconds. Our evaluation on this benchmark highlights the gap to fully trustworthy QA. Whereas most advanced LLMs achieve <=34% accuracy on CRAG, adding RAG in a straightforward manner improves the accuracy only to 44%. State-of-the-art industry RAG solutions only answer 63% of questions without any hallucination. CRAG also reveals much lower accuracy in answering questions regarding facts with higher dynamism, lower popularity, or higher complexity, suggesting future research directions. The CRAG benchmark laid the groundwork for a KDD Cup 2024 challenge, attracting thousands of participants and submissions within the first 50 days of the competition. We commit to maintaining CRAG to serve research communities in advancing RAG solutions and general QA solutions.

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2406.04553 [pdf, other]

Better Late Than Never: Formulating and Benchmarking Recommendation Editing

Authors: Chengyu Lai, Sheng Zhou, Zhimeng Jiang, Qiaoyu Tan, Yuanchen Bei, Jiawei Chen, Ningyu Zhang, Jiajun Bu

Abstract: Recommendation systems play a pivotal role in suggesting items to users based on their preferences. However, in online platforms, these systems inevitably offer unsuitable recommendations due to limited model capacity, poor data quality, or evolving user interests. Enhancing user experience necessitates efficiently rectifying such unsuitable recommendation behaviors. This paper introduces a novel and significant task termed recommendation editing, which focuses on modifying known and unsuitable recommendation behaviors. Specifically, this task aims to adjust the recommendation model to eliminate known unsuitable items without accessing training data or retraining the model. We formally define the problem of recommendation editing with three primary objectives: strict rectification, collaborative rectification, and concentrated rectification. Three evaluation metrics are developed to quantitatively assess the achievement of each objective. We present a straightforward yet effective benchmark for recommendation editing using a novel Editing Bayesian Personalized Ranking Loss. To demonstrate the effectiveness of the proposed method, we establish a comprehensive benchmark that incorporates various methods from related fields. Codebase is available at https://github.com/cycl2018/Recommendation-Editing.

Submitted 6 June, 2024; originally announced June 2024.

arXiv:2406.04465 [pdf, other]

Rough Set improved Therapy-Based Metaverse Assisting System

Authors: Jin Cao, Yanhui Jiang, Chang Yu, Feiwei Qin, Zekun Jiang

Abstract: Chronic neck and shoulder pain (CNSP) is a major global public health issue. Traditional treatments like physiotherapy and rehabilitation have drawbacks, including high costs, low precision, and user discomfort. This paper presents an interactive system based on Cognitive Therapy Theory (CBT) for CNSP treatment. The system includes a pain detection module using EMG and IMU to monitor pain and optimize data with Rough Set theory, and a cognitive therapy module that processes this data further for CBT-based interventions, including massage and heating therapy. An experimental plan is outlined to evaluate the system's effectiveness and performance. The goal is to create an accessible device for CNSP therapy. Additionally, the paper explores the system's application in a metaverse environment, enhancing treatment immersion and personalization. The metaverse platform simulates treatment environments and responds to real-time patient data, allowing for continuous monitoring and adjustment of treatment plans.

Submitted 10 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

Comments: 7 pages, 5 figures, accepted at the IEEE MetaCom conference (https://ieee-metacom.org/)

arXiv:2406.04100 [pdf, other]

Class-Aware Cartilage Segmentation for Autonomous US-CT Registration in Robotic Intercostal Ultrasound Imaging

Authors: Zhongliang Jiang, Yunfeng Kang, Yuan Bi, Xuesong Li, Chenyang Li, Nassir Navab

Abstract: Ultrasound imaging has been widely used in clinical examinations owing to the advantages of being portable, real-time, and radiation-free. Considering the potential of extensive deployment of autonomous examination systems in hospitals, robotic US imaging has attracted increased attention. However, due to the inter-patient variations, it is still challenging to have an optimal path for each patient, particularly for thoracic applications with limited acoustic windows, e.g., intercostal liver imaging. To address this problem, a class-aware cartilage bone segmentation network with geometry-constraint post-processing is presented to capture patient-specific rib skeletons. Then, a dense skeleton graph-based non-rigid registration is presented to map the intercostal scanning path from a generic template to individual patients. By explicitly considering the high-acoustic impedance bone structures, the transferred scanning path can be precisely located in the intercostal space, enhancing the visibility of internal organs by reducing the acoustic shadow. To evaluate the proposed approach, the final path mapping performance is validated on five distinct CTs and two volunteer US data, resulting in ten pairs of CT-US combinations. Results demonstrate that the proposed graph-based registration method can robustly and precisely map the path from CT template to individual patients (Euclidean error: $2.21\pm1.11\,\mathrm{mm}$).

Submitted 6 June, 2024; originally announced June 2024.

arXiv:2406.03746 [pdf, other]

Efficient Knowledge Infusion via KG-LLM Alignment

Authors: Zhouyu Jiang, Ling Zhong, Mengshu Sun, Jun Xu, Rui Sun, Hui Cai, Shuhan Luo, Zhiqiang Zhang

Abstract: To tackle the problem of domain-specific knowledge scarcity within large language models (LLMs), the knowledge graph retrieval-augmented method has been proven to be an effective and efficient technique for knowledge infusion. However, existing approaches face two primary challenges: knowledge mismatch between publicly available knowledge graphs and the specific domain of the task at hand, and poor information compliance of LLMs with knowledge graphs. In this paper, we leverage a small set of labeled samples and a large-scale corpus to efficiently construct domain-specific knowledge graphs by an LLM, addressing the issue of knowledge mismatch. Additionally, we propose a three-stage KG-LLM alignment strategy to enhance the LLM's capability to utilize information from knowledge graphs. We conduct experiments with a limited-sample setting on two biomedical question-answering datasets, and the results demonstrate that our approach outperforms existing baselines.

Submitted 6 June, 2024; originally announced June 2024.

Comments: ACL2024 Findings

arXiv:2406.02511 [pdf, other]

V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation

Authors: Cong Wang, Kuan Tian, Jun Zhang, Yonghang Guan, Feng Luo, Fei Shen, Zhiwei Jiang, Qing Gu, Xiao Han, Wei Yang

Abstract: In the field of portrait video generation, the use of single images to generate portrait videos has become increasingly prevalent. A common approach involves leveraging generative models to enhance adapters for controlled generation. However, control signals (e.g., text, audio, reference image, pose, depth map, etc.) can vary in strength. Among these, weaker conditions often struggle to be effective due to interference from stronger conditions, posing a challenge in balancing these conditions. In our work on portrait video generation, we identified audio signals as particularly weak, often overshadowed by stronger signals such as facial pose and reference image. However, direct training with weak signals often leads to difficulties in convergence. To address this, we propose V-Express, a simple method that balances different control signals through the progressive training and the conditional dropout operation. Our method gradually enables effective control by weak conditions, thereby achieving generation capabilities that simultaneously take into account the facial pose, reference image, and audio. The experimental results demonstrate that our method can effectively generate portrait videos controlled by audio. Furthermore, a potential solution is provided for the simultaneous and effective use of conditions of varying strengths.

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2406.02222 [pdf, other]

Towards an Extensible Model-Based Digital Twin Framework for Space Launch Vehicles

Authors: Ran Wei, Ruizhe Yang, Shijun Liu, Chongsheng Fan, Rong Zhou, Zekun Wu, Haochi Wang, Yifan Cai, Zhe Jiang

Abstract: The concept of Digital Twin (DT) is increasingly applied to systems on different levels of abstraction across domains, to support monitoring, analysis, diagnosis, decision making and automated control. Whilst the interest in applying DT is growing, the definition of DT is unclear, nor is there a clear pathway to develop DT to fully realise its capacities. In this paper, we revise the concept of DT and its categorisation. We propose a DT maturity matrix, based on which we propose a model-based DT development methodology. We also discuss how model-based tools can be used to support the methodology and present our own supporting tool. We report our preliminary findings with a discussion on a case study, in which we use our proposed methodology and our supporting tool to develop an extensible DT platform for the assurance of Electrical and Electronics systems of space launch vehicles.

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2406.01595 [pdf, other]

MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild

Authors: Zeren Jiang, Chen Guo, Manuel Kaufmann, Tianjian Jiang, Julien Valentin, Otmar Hilliges, Jie Song

Abstract: We present MultiPly, a novel framework to reconstruct multiple people in 3D from monocular in-the-wild videos. Reconstructing multiple individuals moving and interacting naturally from monocular in-the-wild videos poses a challenging task. Addressing it necessitates precise pixel-level disentanglement of individuals without any prior knowledge about the subjects. Moreover, it requires recovering intricate and complete 3D human shapes from short video sequences, intensifying the level of difficulty. To tackle these challenges, we first define a layered neural representation for the entire scene, composited by individual human and background models. We learn the layered neural representation from videos via our layer-wise differentiable volume rendering. This learning process is further enhanced by our hybrid instance segmentation approach which combines the self-supervised 3D segmentation and the promptable 2D segmentation module, yielding reliable instance segmentation supervision even under close human interaction. A confidence-guided optimization formulation is introduced to optimize the human poses and shape/appearance alternately. We incorporate effective objectives to refine human poses via photometric information and impose physically plausible constraints on human dynamics, leading to temporally consistent 3D reconstructions with high fidelity. The evaluation of our method shows the superiority over prior art on publicly available datasets and in-the-wild videos.

Submitted 3 June, 2024; originally announced June 2024.

Comments: Project page: https://eth-ait.github.io/MultiPly/

arXiv:2406.01574 [pdf, other]

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Authors: Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, Wenhu Chen

Abstract: In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. Additionally, MMLU-Pro eliminates the trivial and noisy questions in MMLU. Our experimental results show that MMLU-Pro not only raises the challenge, causing a significant drop in accuracy by 16% to 33% compared to MMLU, but also demonstrates greater stability under varying prompts. With 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in MMLU-Pro. Additionally, we found that models utilizing Chain of Thought (CoT) reasoning achieved better performance on MMLU-Pro compared to direct answering, which is in stark contrast to the findings on the original MMLU, indicating that MMLU-Pro includes more complex reasoning questions. Our assessments confirm that MMLU-Pro is a more discriminative benchmark to better track progress in the field.

Submitted 23 June, 2024; v1 submitted 3 June, 2024; originally announced June 2024.
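
One consequence of expanding the choice set from four to ten options can be made concrete with a short worked calculation. It is not stated in the abstract; it assumes uniform random guessing over $k$ equally likely options, for which the expected accuracy is $1/k$:

$$
\text{chance accuracy}(k) = \frac{1}{k}, \qquad \frac{1}{4} = 25\% \ \text{(four options)}, \qquad \frac{1}{10} = 10\% \ \text{(ten options)}.
$$

Under that assumption the guessing floor drops by 15 percentage points, so a given score on the ten-option format is less likely to be inflated by lucky guesses.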

arXiv:2406.01205 [pdf, other]

ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec

Authors: Shengpeng Ji, Jialong Zuo, Minghui Fang, Siqi Zheng, Qian Chen, Wen Wang, Ziyue Jiang, Hai Huang, Xize Cheng, Rongjie Huang, Zhou Zhao

Abstract: In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style, merely based on a few seconds of audio prompt and a simple textual style description prompt. Prior zero-shot TTS models and controllable TTS models either could only mimic the speaker's voice without further control and adjustment capabilities or were unrelated to speaker-specific voice generation. Therefore, ControlSpeech focuses on a more challenging new task: a TTS system with controllable timbre, content, and style at the same time. ControlSpeech takes speech prompts, content prompts, and style prompts as inputs and utilizes bidirectional attention and mask-based parallel decoding to capture corresponding codec representations in a discrete decoupling codec space. Moreover, we discovered the issue of text style controllability in a many-to-many mapping fashion and proposed the Style Mixture Semantic Density (SMSD) model to resolve this problem. The SMSD module, which is based on Gaussian mixture density networks, is designed to enhance the fine-grained partitioning and sampling capabilities of style semantic information and generate speech with more diverse styles. In terms of experiments, we make available a controllable model toolkit called ControlToolkit with a new style controllable dataset and some replicated baseline models, and propose new metrics to evaluate both the control capability and the quality of generated audio in ControlSpeech. The relevant ablation studies validate the necessity of each component in ControlSpeech. We hope that ControlSpeech can establish the next foundation paradigm of controllable speech synthesis. The relevant code and demo are available at https://github.com/jishengpeng/ControlSpeech.

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2406.01153

Safety-Critical Control of Euler-Lagrange Systems Subject to Multiple Obstacles and Velocity Constraints

Authors: Zhi Liu, Si Wu, Tengfei Liu, Zhong-Ping Jiang

Abstract: This paper studies the safety-critical control problem for Euler-Lagrange (EL) systems subject to multiple ball obstacles and velocity constraints in accordance with affordable velocity ranges. A key strategy is to exploit the underlying inner-outer-loop structure for the design of a new cascade controller for the class of EL systems. In particular, the outer-loop controller is developed based on quadratic programming (QP) to avoid ball obstacles and generate velocity reference signals fulfilling the velocity limitation. Taking full advantage of the conservation-of-energy property, a nonlinear velocity-tracking controller is designed to form the inner loop. One major difficulty is caused by the possible non-Lipschitz continuity of the standard QP algorithm when there are multiple constraints. To solve this problem, we propose a refined QP algorithm with the feasible set reshaped by an appropriately chosen positive basis such that the feasibility is retained while the resulting outer-loop controller is locally Lipschitz. It is proved that the constraint-satisfaction problem is solvable as long as the ball obstacles satisfy a mild distance condition. The proposed design is validated by numerical simulation and an experiment based on a $2$-link planar manipulator.

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2406.01058

Constructive Safety Control

Authors: Si Wu, Tengfei Liu, Zhong-Ping Jiang

Abstract: This paper proposes a constructive approach to safety control of nonlinear cascade systems subject to multiple state constraints. New design ingredients include a unified characterization of safety and stability for systematic designs of safety controllers, and a novel technique of reshaping the feasible sets of quadratically constrained quadratic programming induced from safety control. The proposed method guarantees Lipschitz continuity of virtual control laws, enabling a stepwise constructive design. A refined nonlinear small-gain synthesis is employed to address the nonlinear uncertain interconnections between the resulting subsystems corresponding to different virtual control laws, and to guarantee the achievement of the safety control objective. When the safety constraints are removed, the proposed approach coincides with the standard constructive nonlinear control. The proposed safety-control algorithm is experimentally validated in a testbed involving a vertical takeoff and landing (VTOL) vehicle taking off in narrow spaces.

Submitted 3 June, 2024; originally announced June 2024.

Showing 1–50 of 1,572 results for author: Jiang, Z