Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules

Authors: Zhuocheng Gong, Ang Lv, Jian Guan, Junxi Yan, Wei Wu, Huishuai Zhang, Minlie Huang, Dongyan Zhao, Rui Yan

Abstract: Is it always necessary to compute tokens from shallow to deep layers in Transformers? The continued success of vanilla Transformers and their variants suggests an undoubted "yes". In this work, however, we attempt to break the depth-ordered convention by proposing a novel architecture dubbed mixture-of-modules (MoM), which is motivated by an intuition that any layer, regardless of its position, can be used to compute a token as long as it possesses the needed processing capabilities. The construction of MoM starts from a finite set of modules defined by multi-head attention and feed-forward networks, each distinguished by its unique parameterization. Two routers then iteratively select attention modules and feed-forward modules from the set to process a token. The selection dynamically expands the computation graph in the forward pass of the token, culminating in an assembly of modules. We show that MoM provides not only a unified framework for Transformers and their numerous variants but also a flexible and learnable approach for reducing redundancy in Transformer parameterization. We pre-train various MoMs using OpenWebText. Empirical results demonstrate that MoMs, of different parameter counts, consistently outperform vanilla transformers on both GLUE and XSUM benchmarks. More interestingly, with a fixed parameter budget, MoM-large enables an over 38% increase in depth for computation graphs compared to GPT-2-large, resulting in absolute gains of 1.4 on GLUE and 1 on XSUM. On the other hand, MoM-large also enables an over 60% reduction in depth while involving more modules per layer, yielding a 16% reduction in TFLOPs and a 43% decrease in memory usage compared to GPT-2-large, while maintaining comparable performance.

Submitted 9 July, 2024; originally announced July 2024.

arXiv:2407.04675 [pdf, other]

Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

Authors: Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, Lu Gao, Yi Guo, Minglun Han, Ting Han, Wenchao Hu, Xinying Hu, Yuxiang Hu, Deyu Hua, Lu Huang, Mingkun Huang, Youjia Huang, Jishuo Jin, Fanliu Kong, Zongwei Lan, Tianyu Li, et al. (30 additional authors not shown)

Abstract: Modern automatic speech recognition (ASR) model is required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data matching scenarios and are gradually approaching a bottleneck. In this work, we introduce Seed-ASR, a large language model (LLM) based speech recognition model. Seed-ASR is developed based on the framework of audio conditioned LLM (AcLLM), leveraging the capabilities of LLMs by inputting continuous speech representations together with contextual information into the LLM. Through stage-wise large-scale training and the elicitation of context-aware capabilities in LLM, Seed-ASR demonstrates significant improvement over end-to-end models on comprehensive evaluation sets, including multiple domains, accents/dialects and languages. Additionally, Seed-ASR can be further deployed to support specific needs in various scenarios without requiring extra language models. Compared to recently released large ASR models, Seed-ASR achieves 10%-40% reduction in word (or character, for Chinese) error rates on Chinese and English public test sets, further demonstrating its powerful performance.

Submitted 10 July, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

arXiv:2407.03978 [pdf, other]

Benchmarking Complex Instruction-Following with Multiple Constraints Composition

Authors: Bosi Wen, Pei Ke, Xiaotao Gu, Lindong Wu, Hao Huang, Jinfeng Zhou, Wenchuang Li, Binxin Hu, Wendy Gao, Jiaxin Xu, Yiming Liu, Jie Tang, Hongning Wang, Minlie Huang

Abstract: Instruction following is one of the fundamental capabilities of large language models (LLMs). As the ability of LLMs is constantly improving, they have been increasingly applied to deal with complex human instructions in real-world scenarios. Therefore, how to evaluate the ability of complex instruction-following of LLMs has become a critical research problem. Existing benchmarks mainly focus on modeling different types of constraints in human instructions while neglecting the composition of different constraints, which is an indispensable constituent in complex instructions. To this end, we propose ComplexBench, a benchmark for comprehensively evaluating the ability of LLMs to follow complex instructions composed of multiple constraints. We propose a hierarchical taxonomy for complex instructions, including 4 constraint types, 19 constraint dimensions, and 4 composition types, and manually collect a high-quality dataset accordingly. To make the evaluation reliable, we augment LLM-based evaluators with rules to effectively verify whether generated texts can satisfy each constraint and composition. Furthermore, we obtain the final evaluation score based on the dependency structure determined by different composition types. ComplexBench identifies significant deficiencies in existing LLMs when dealing with complex instructions with multiple constraints composition.

Submitted 11 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

Comments: 20 pages, 7 figures

arXiv:2407.02855 [pdf, other]

Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks

Authors: Zhexin Zhang, Junxiao Yang, Pei Ke, Shiyao Cui, Chujie Zheng, Hongning Wang, Minlie Huang

Abstract: LLMs are known to be vulnerable to jailbreak attacks, even after safety alignment. An important observation is that, while different types of jailbreak attacks can generate significantly different queries, they mostly result in similar responses that are rooted in the same harmful knowledge (e.g., detailed steps to make a bomb). Therefore, we conjecture that directly unlearn the harmful knowledge in the LLM can be a more effective way to defend against jailbreak attacks than the mainstream supervised fine-tuning (SFT) based approaches. Our extensive experiments confirmed our insight and suggested surprising generalizability of our unlearning-based approach: using only 20 raw harmful questions without any jailbreak prompt during training, our solution reduced the Attack Success Rate (ASR) in Vicuna-7B on out-of-distribution (OOD) harmful questions wrapped with various complex jailbreak prompts from 82.6% to 7.7%. This significantly outperforms Llama2-7B-Chat, which is fine-tuned on about 0.1M safety alignment samples but still has an ASR of 21.9% even under the help of an additional safety system prompt. Further analysis reveals that the generalization ability of our solution stems from the intrinsic relatedness among harmful responses across harmful questions (e.g., response patterns, shared steps and actions, and similarity among their learned representations in the LLM). Our code is available at https://github.com/thu-coai/SafeUnlearning.

Submitted 3 July, 2024; originally announced July 2024.

Comments: 15 pages

arXiv:2407.00167 [pdf, other]

Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach

Authors: Sai Krishna Revanth Vuruma, Dezhi Wu, Saborny Sen Gupta, Lucas Aust, Valerie Lookingbill, Wyatt Bellamy, Yang Ren, Erin Kasson, Li-Shiun Chen, Patricia Cavazos-Rehg, Dian Hu, Ming Huang

Abstract: In recent years, the United States has witnessed a significant surge in the popularity of vaping or e-cigarette use, leading to a notable rise in cases of e-cigarette and vaping use-associated lung injury (EVALI) that caused hospitalizations and fatalities during the EVALI outbreak in 2019, highlighting the urgency to comprehend vaping behaviors and develop effective strategies for cessation. Due to the ubiquity of social media platforms, over 4.7 billion users worldwide use them for connectivity, communications, news, and entertainment with a significant portion of the discourse related to health, thereby establishing social media data as an invaluable organic data resource for public health research. In this study, we extracted a sample dataset from one vaping sub-community on Reddit to analyze users' quit-vaping intentions. Leveraging OpenAI's latest large language model GPT-4 for sentence-level quit vaping intention detection, this study compares the outcomes of this model against layman and clinical expert annotations. Using different prompting strategies such as zero-shot, one-shot, few-shot and chain-of-thought prompting, we developed 8 prompts with varying levels of detail to explain the task to GPT-4 and also evaluated the performance of the strategies against each other. These preliminary findings emphasize the potential of GPT-4 in social media data analysis, especially in identifying users' subtle intentions that may elude human detection.

Submitted 28 June, 2024; originally announced July 2024.

Comments: Accepted for the AI Applications in Public Health and Social Services workshop at the 22nd International Conference on Artificial Intelligence in Medicine (AIME 2024)

arXiv:2406.18696 [pdf, other]

Sequence Graph Network for Online Debate Analysis

Authors: Quan Mai, Susan Gauch, Douglas Adams, Miaoqing Huang

Abstract: Online debates involve a dynamic exchange of ideas over time, where participants need to actively consider their opponents' arguments, respond with counterarguments, reinforce their own points, and introduce more compelling arguments as the discussion unfolds. Modeling such a complex process is not a simple task, as it necessitates the incorporation of both sequential characteristics and the capability to capture interactions effectively. To address this challenge, we employ a sequence-graph approach. Building the conversation as a graph allows us to effectively model interactions between participants through directed edges. Simultaneously, the propagation of information along these edges in a sequential manner enables us to capture a more comprehensive representation of context. We also introduce a Sequence Graph Attention layer to illustrate the proposed information update scheme. The experimental results show that sequence graph networks achieve superior results to existing methods in online debates.

Submitted 26 June, 2024; originally announced June 2024.

Comments: 8 pages, 4 figures

arXiv:2406.16714 [pdf, other]

AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models

Authors: Jiale Cheng, Yida Lu, Xiaotao Gu, Pei Ke, Xiao Liu, Yuxiao Dong, Hongning Wang, Jie Tang, Minlie Huang

Abstract: Although Large Language Models (LLMs) are becoming increasingly powerful, they still exhibit significant but subtle weaknesses, such as mistakes in instruction-following or coding tasks. As these unexpected errors could lead to severe consequences in practical deployments, it is crucial to investigate the limitations within LLMs systematically. Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies, while manual inspections are costly and not scalable. In this paper, we introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks. Inspired by the educational assessment process that measures students' learning outcomes, AutoDetect consists of three LLM-powered agents: Examiner, Questioner, and Assessor. The collaboration among these three agents is designed to realize comprehensive and in-depth weakness identification. Our framework demonstrates significant success in uncovering flaws, with an identification success rate exceeding 30% in prominent models such as ChatGPT and Claude. More importantly, these identified weaknesses can guide specific model improvements, proving more effective than untargeted data augmentation methods like Self-Instruct. Our approach has led to substantial enhancements in popular LLMs, including the Llama series and Mistral-7b, boosting their performance by over 10% across several benchmarks. Code and data are publicly available at https://github.com/thu-coai/AutoDetect.

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.14491 [pdf, other]

Instruction Pre-Training: Language Models are Supervised Multitask Learners

Authors: Daixuan Cheng, Yuxian Gu, Shaohan Huang, Junyu Bi, Minlie Huang, Furu Wei

Abstract: Unsupervised multitask pre-training has been the critical method behind the recent success of language models (LMs). However, supervised multitask learning still holds significant promise, as scaling it in the post-training stage trends towards better generalization. In this paper, we explore supervised multitask pre-training by proposing Instruction Pre-Training, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train LMs. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training. In pre-training from scratch, Instruction Pre-Training not only consistently enhances pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, Instruction Pre-Training enables Llama3-8B to be comparable to or even outperform Llama3-70B. Our model, code, and data are available at https://github.com/microsoft/LMOps.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.13326 [pdf]

Chiral π Domain Walls Composed of Twin Half-Integer Surface Disclinations in Ferroelectric Nematic Liquid Crystals

Authors: Shengzhu Yi, Zening Hong, Zhongjie Ma, Chao Zhou, Miao Jiang, Xiang Huang, Mingjun Huang, Satoshi Aya, Rui Zhang, Qi-Huo Wei

Abstract: Ferroelectric nematic liquid crystals are polar fluids characterized by microscopic orientational ordering and macroscopic spontaneous polarizations. Within these fluids, walls that separate domains of different polarizations are ubiquitous. We demonstrate that the π walls in films of polar fluids consist of twin half-integer surface disclinations spaced horizontally, enclosing a subdomain where the polarization exhibits left- or right-handed π twists across the film. The degenerate geometric configurations of these twin disclinations give rise to kinks and antikinks, effectively partitioning subdomains of opposite chirality like Ising chains. The hierarchical topological structures dictate that field-driven polar switching entails a two-step annihilation process of the disclinations. These findings serve as a cornerstone for comprehending other walls in ferroelectric and ferromagnetic materials, thereby laying the base for domain engineering crucial for advancing their nonlinear and optoelectronic applications.

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.12899 [pdf]

Structural design of the acrylic vessel for the Jinping Neutrino Experiment

Authors: Zongyi Wang, Yuhao Liua, Shaomin Chen, Yuanqing Wang, Zhe Wang, Ming Huang

Abstract: The Jinping neutrino experiment is designed to have multiple purposes in the China Jinping Underground Laboratory. Following the acrylic vessel design requirements proposal, a structural scheme has been developed and optimized. Subsequently, the stability of the acrylic shell structure was calculated using finite element analysis, as well as the load-bearing capacities under various working conditions. Further, the effects of temperature changes, rope failures, and Young's modulus of the ropes on the static behavior of the structure were analyzed. The results indicated that the stress level and structural displacement of the structure scheme satisfies the design requirements, as well as the stability of the vessel under compression. Moreover, the stress and displacement of the acrylic shell satisfies the given working conditions and temperatures. The structural scheme ensures basic safety if the rope fails.

Submitted 8 June, 2024; originally announced June 2024.

Comments: 27 pages, 11 figures, 7 tables

arXiv:2406.12793 [pdf, other]

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

Authors: Team GLM: Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, et al. (32 additional authors not shown)

Abstract: We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B. They represent our most capable models that are trained with all the insights and lessons gained from the preceding three generations of ChatGLM. To date, the GLM-4 models are pre-trained on ten trillions of tokens mostly in Chinese and English, along with a small set of corpus from 24 languages, and aligned primarily for Chinese and English usage. The high-quality alignment is achieved via a multi-stage post-training process, which involves supervised fine-tuning and learning from human feedback. Evaluations show that GLM-4 1) closely rivals or outperforms GPT-4 in terms of general metrics such as MMLU, GSM8K, MATH, BBH, GPQA, and HumanEval, 2) gets close to GPT-4-Turbo in instruction following as measured by IFEval, 3) matches GPT-4 Turbo (128K) and Claude 3 for long context tasks, and 4) outperforms GPT-4 in Chinese alignments as measured by AlignBench. The GLM-4 All Tools model is further aligned to understand user intent and autonomously decide when and which tool(s) to use -- including web browser, Python interpreter, text-to-image model, and user-defined functions -- to effectively complete complex tasks. In practical applications, it matches and even surpasses GPT-4 All Tools in tasks like accessing online information via web browsing and solving math problems using Python interpreter. Over the course, we have open-sourced a series of models, including ChatGLM-6B (three generations), GLM-4-9B (128K, 1M), GLM-4V-9B, WebGLM, and CodeGeeX, attracting over 10 million downloads on Hugging Face in the year 2023 alone. The open models can be accessed through https://github.com/THUDM and https://huggingface.co/THUDM.

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.12569 [pdf, other]

MOYU: A Theoretical Study on Massive Over-activation Yielded Uplifts in LLMs

Authors: Chi Ma, Mincong Huang, Chao Wang, Yujie Wang, Lei Yu

Abstract: Massive Over-activation Yielded Uplifts (MOYU) is an inherent property of large language models, and dynamic activation (DA) based on the MOYU property is a clever yet under-explored strategy designed to accelerate inference in these models. Existing methods that utilize MOYU often face a significant 'Impossible Trinity': struggling to simultaneously maintain model performance, enhance inference speed, and extend applicability across various architectures. Due to the theoretical ambiguities surrounding MOYU, this paper elucidates the root cause of the MOYU property and outlines the mechanisms behind two primary limitations encountered by current DA methods: 1) history-related activation uncertainty, and 2) semantic-irrelevant activation inertia. Our analysis not only underscores the limitations of current dynamic activation strategies within large-scale LLaMA models but also proposes opportunities for refining the design of future sparsity schemes.

Submitted 28 June, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.12250 [pdf]

Observation of stacking engineered magnetic phase transitions within moiré supercells of twisted van der Waals magnets

Authors: Senlei Li, Zeliang Sun, Nathan J. McLaughlin, Afsana Sharmin, Nishkarsh Agarwal, Mengqi Huang, Suk Hyun Sung, Hanyi Lu, Shaohua Yan, Hechang Lei, Robert Hovden, Hailong Wang, Hua Chen, Liuyan Zhao, Chunhui Rita Du

Abstract: Twist engineering of magnetic van der Waals (vdW) moiré superlattices provides an attractive way to achieve precise nanoscale control over the spin degree of freedom on two-dimensional flatland. Despite the very recent demonstrations of moiré magnetism featuring exotic phases with noncollinear spin order in twisted vdW magnet chromium triiodide CrI3, the local magnetic interactions, spin dynamics, and magnetic phase transitions within and across individual moiré supercells remain elusive. Taking advantage of a scanning single-spin magnetometry platform, here we report observation of two distinct magnetic phase transitions with separate critical temperatures within a moiré supercell of small-angle twisted double trilayer CrI3. By measuring temperature dependent spin fluctuations at the coexisting ferromagnetic and antiferromagnetic regions in twisted CrI3, we explicitly show that the Curie temperature of the ferromagnetic state is higher than the Néel temperature of the antiferromagnetic one by ~10 K. Our mean-field calculations attribute such a spatial and thermodynamic phase separation to the stacking order modulated interlayer exchange coupling at the twisted interface of the moiré superlattices. The presented results highlight twist engineering as a promising tuning knob to realize on-demand control of not only the nanoscale spin order of moiré quantum matter but also its dynamic magnetic responses, which may find relevant applications in developing transformative vdW electronic and magnetic devices.

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.10261 [pdf, other]

FoodSky: A Food-oriented Large Language Model that Passes the Chef and Dietetic Examination

Authors: Pengfei Zhou, Weiqing Min, Chaoran Fu, Ying Jin, Mingyu Huang, Xiangyang Li, Shuhuan Mei, Shuqiang Jiang

Abstract: Food is foundational to human life, serving not only as a source of nourishment but also as a cornerstone of cultural identity and social interaction. As the complexity of global dietary needs and preferences grows, food intelligence is needed to enable food perception and reasoning for various tasks, ranging from recipe generation and dietary recommendation to diet-disease correlation discovery and understanding. Towards this goal, for powerful capabilities across various domains and tasks in Large Language Models (LLMs), we introduce Food-oriented LLM FoodSky to comprehend food data through perception and reasoning. Considering the complexity and typicality of Chinese cuisine, we first construct one comprehensive Chinese food corpus FoodEarth from various authoritative sources, which can be leveraged by FoodSky to achieve deep understanding of food-related data. We then propose Topic-based Selective State Space Model (TS3M) and the Hierarchical Topic Retrieval Augmented Generation (HTRAG) mechanism to enhance FoodSky in capturing fine-grained food semantics and generating context-aware food-relevant text, respectively. Our extensive evaluations demonstrate that FoodSky significantly outperforms general-purpose LLMs in both chef and dietetic examinations, with an accuracy of 67.2% and 66.4% on the Chinese National Chef Exam and the National Dietetic Exam, respectively. FoodSky not only promises to enhance culinary creativity and promote healthier eating patterns, but also sets a new standard for domain-specific LLMs that address complex real-world issues in the food domain. An online demonstration of FoodSky is available at http://222.92.101.211:8200.

Submitted 10 June, 2024; originally announced June 2024.

Comments: 32 pages, 19 figures

arXiv:2406.09904 [pdf, other]

QQQ: Quality Quattuor-Bit Quantization for Large Language Models

Authors: Ying Zhang, Peng Zhang, Mincong Huang, Jingyang Xiang, Yujie Wang, Chao Wang, Yineng Zhang, Lei Yu, Chuan Liu, Wei Lin

Abstract: Quantization is a proven effective method for compressing large language models. Although popular techniques like W8A8 and W4A16 effectively maintain model performance, they often fail to concurrently speed up the prefill and decoding stages of inference. W4A8 is a promising strategy to accelerate both of them while usually leads to a significant performance degradation. To address these issues, we present QQQ, a Quality Quattuor-bit Quantization method with 4-bit weights and 8-bit activations. QQQ employs adaptive smoothing and Hessian-based compensation, significantly enhancing the performance of quantized models without extensive training. Furthermore, we meticulously engineer W4A8 GEMM kernels to increase inference speed. Our specialized per-channel W4A8 GEMM and per-group W4A8 GEMM achieve impressive speed increases of 3.67$\times$ and 3.29$\times$ over FP16 GEMM. Our extensive experiments show that QQQ achieves performance on par with existing state-of-the-art LLM quantization methods while significantly accelerating inference, achieving speed boosts up to 2.24$\times$, 2.10$\times$, and 1.25$\times$ compared to FP16, W8A8, and W4A16, respectively.

Submitted 28 June, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

arXiv:2406.09903 [pdf, ps, other]

Asymptotic quadratic convergence of the Gauss-Newton method for complex phase retrieval

Authors: Meng Huang

Abstract: In this paper, we introduce a Gauss-Newton method for solving the complex phase retrieval problem. In contrast to the real-valued setting, the Gauss-Newton matrix for complex-valued signals is rank-deficient and, thus, non-invertible. To address this, we utilize a Gauss-Newton step that moves orthogonally to certain trivial directions. We establish that this modified Gauss-Newton step has a closed-form solution, which corresponds precisely to the minimal-norm solution of the associated least squares problem. Additionally, using the leave-one-out technique, we demonstrate that $m\ge O( n\log^3 n)$ independent complex Gaussian random measurements ensure that the entire trajectory of the Gauss-Newton iterations remains confined within a specific region of incoherence and contraction with high probability. This finding allows us to establish the asymptotic quadratic convergence rate of the Gauss-Newton method without the need for sample splitting.

Submitted 14 June, 2024; originally announced June 2024.

Comments: 54 pages

arXiv:2406.04604 [pdf, other]

Learning Task Decomposition to Assist Humans in Competitive Programming

Authors: Jiaxin Wen, Ruiqi Zhong, Pei Ke, Zhihong Shao, Hongning Wang, Minlie Huang

Abstract: When using language models (LMs) to solve complex problems, humans might struggle to understand the LM-generated solutions and repair the flawed ones. To assist humans in repairing them, we propose to automatically decompose complex solutions into multiple simpler pieces that correspond to specific subtasks. We introduce a novel objective for learning task decomposition, termed assistive value (AssistV), which measures the feasibility and speed for humans to repair the decomposed solution. We collect a dataset of human repair experiences on different decomposed solutions. Utilizing the collected data as in-context examples, we then learn to critique, refine, and rank decomposed solutions to improve AssistV. We validate our method under competitive programming problems: under 177 hours of human study, our method enables non-experts to solve 33.3\% more problems, speeds them up by 3.3x, and empowers them to match unassisted experts.

Submitted 6 June, 2024; originally announced June 2024.

Comments: ACL 2024 Main Conference

arXiv:2406.04523 [pdf, other]

Proofread: Fixes All Errors with One Tap

Authors: Renjie Liu, Yanxiang Zhang, Yun Zhu, Haicheng Sun, Yuanbo Zhang, Michael Xuelin Huang, Shanqing Cai, Lei Meng, Shumin Zhai

Abstract: The impressive capabilities of Large Language Models (LLMs) provide a powerful approach to reimagine users' typing experience. This paper demonstrates Proofread, a novel Gboard feature powered by a server-side LLM, enabling seamless sentence-level and paragraph-level corrections with a single tap. We describe the complete system in this paper, from data generation and metrics design to model tuning and deployment. To obtain models with sufficient quality, we implement a careful data synthesis pipeline tailored to online use cases, design multifaceted metrics, and employ a two-stage tuning approach to acquire the dedicated LLM for the feature: Supervised Fine Tuning (SFT) for foundational quality, followed by Reinforcement Learning (RL) tuning for targeted refinement. Specifically, we find that sequential tuning on Rewrite and proofread tasks yields the best quality in the SFT stage, and propose global and direct rewards in the RL tuning stage to seek further improvement. Extensive experiments on a human-labeled golden set showed that our tuned PaLM2-XS model achieved an 85.56\% good ratio. We launched the feature to Pixel 8 devices by serving the model on TPU v5 in Google Cloud, with thousands of daily active users. Serving latency was significantly reduced by quantization, bucket inference, text segmentation, and speculative decoding. Our demo can be seen on YouTube: https://youtu.be/4ZdcuiwFU7I.

Submitted 6 June, 2024; originally announced June 2024.

Comments: 8 pages, 3 figures, 2 tables

arXiv:2406.03362 [pdf, other]

Positivity for quantum cluster algebras from orbifolds

Authors: Min Huang

Abstract: Let $(S,M,U)$ be a marked orbifold with or without punctures and let $\mathcal A_v$ be a quantum cluster algebra from $(S,M,U)$ with arbitrary coefficients and quantization. We provide combinatorial formulas for quantum Laurent expansion of quantum cluster variables of $\mathcal A_v$ concerning an arbitrary quantum seed. Consequently, the positivity for the quantum cluster algebra $\mathcal A_v$ is proved.

Submitted 5 June, 2024; originally announced June 2024.

Comments: Comments are welcome!

MSC Class: 13F60; 05E15; 05E40

arXiv:2406.03104 [pdf, other]

Dark state transport between unitary Fermi superfluids

Authors: Mohsen Talebi, Simon Wili, Jeffrey Mohan, Philipp Fabritius, Meng-Zi Huang, Tilman Esslinger

Abstract: The formation of dark states is an important concept in quantum sciences, but its compatibility with strong interparticle interactions, for example, in a quantum degenerate gas is hardly explored. Here, we realize a dark state in one of the spins of a two-component, resonantly-interacting Fermi gas using a $Λ$ system within the $D_2$ transitions of $^6$Li at high magnetic field. The dark state is created in a micrometer-sized region within a one-dimensional channel connecting two superfluid reservoirs. The particle transport between the reservoirs is used as a probe. We observe that atoms are transported in the dark state and the superfluid-assisted fast current is preserved. If the dark state resonant condition is not met, the transport is suppressed by the spontaneous emission. We also uncover an asymmetry in the transport timescale across the two-photon resonance, which is absent in the non-interacting regime. This work raises questions on the interplay of dark states with interparticle interactions and opens up perspectives for optical manipulation of fermionic pairing.

Submitted 6 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

Comments: 18 pages, 10 figures

arXiv:2406.00994 [pdf]

Half-integer Vortices Paired via String Micelles in Ferroelectric Liquid Crystals Facilitated by Ionic Polymer Doping

Authors: Zhongjie Ma, Miao Jiang, Yaohao Song, Aile Sun, Shengzhu Yi, Chao Zhou, Xiang Huang, Mingjun Huang, Satoshi Aya, Qi-Huo Wei

Abstract: Ferroelectric nematic (NF) liquid crystals are an intriguing polar system for exploring topological defects, and their properties are subject to significant influence by ionic doping. A prior theory based on a modified XY model predicts that string defects with half-integer vortex-antivortex pairs can be excited, while such stable string defects have not been directly observed in polar materials. Here, we report that doping the ferroelectric nematic material RM734 with cationic polymers can facilitate the formation of abundant string defects with butterfly textures. The string defects exhibit a polarization field restricted to a 2D plane that is divided by Néel-type domain walls into domains with either uniform polarization or negative splay deformation in the butterfly wing areas (positive bound charges). We establish a charge double layer model for the string defects: the strings of cationic polymer chains and close-packed RM734 molecules form the Stern charge layer, and the small anionic ions and the positive bound charges (due to splay deformation) form the charge diffusion layer. We demonstrate that only cationic polymeric doping is effective due to the coupling between the flexoelectricity and the pear shape of the RM734 molecules. We estimate the line charge density of the strings via measuring the divergence of the polarization and the electrophoretic motion mobility, and obtain good qualitative agreement. We further show that the field-driven polarization reversal undergoes either string rotation or generating and merging with kink walls.

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2406.00985 [pdf, other]

MultiEdits: Simultaneous Multi-Aspect Editing with Text-to-Image Diffusion Models

Authors: Mingzhen Huang, Jialing Cai, Shan Jia, Vishnu Suresh Lokhande, Siwei Lyu

Abstract: Text-driven image synthesis has made significant advancements with the development of diffusion models, transforming how visual content is generated from text prompts. Despite these advances, text-driven image editing, a key area in computer graphics, faces unique challenges. A major challenge is making simultaneous edits across multiple objects or attributes. Applying these methods sequentially for multi-aspect edits increases computational demands and efficiency losses. In this paper, we address these challenges with significant contributions. Our main contribution is the development of MultiEdits, a method that seamlessly manages simultaneous edits across multiple attributes. In contrast to previous approaches, MultiEdits not only preserves the quality of single attribute edits but also significantly improves the performance of multitasking edits. This is achieved through an innovative attention distribution mechanism and a multi-branch design that operates across several processing heads. Additionally, we introduce the PIE-Bench++ dataset, an expansion of the original PIE-Bench dataset, to better support evaluating image-editing tasks involving multiple objects and attributes simultaneously. This dataset is a benchmark for evaluating text-driven image editing methods in multifaceted scenarios. Dataset and code are available at https://mingzhenhuang.com/projects/MultiEdits.html.

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2406.00857 [pdf, other]

Modeling the refractive index profile n(z) of polar ice for ultra-high energy neutrino experiments

Authors: S. Ali, P. Allison, S. Archambault, J. J. Beatty, D. Z. Besson, A. Bishop, P. Chen, Y. C. Chen, B. A. Clark, W. Clay, A. Connolly, K. Couberly, L. Cremonesi, A. Cummings, P. Dasgupta, R. Debolt, S. de Kockere, K. D. de Vries, C. Deaconu, M. A. DuVernois, J. Flaherty, E. Friedman, R. Gaior, P. Giri, J. Hanson, et al. (45 additional authors not shown)

Abstract: We develop an in-situ index of refraction profile using the transit time of radio signals broadcast from an englacial transmitter to 2-5 km distant radio-frequency receivers, deployed at depths up to 200 m. Maxwell's equations generally admit two ray propagation solutions from a given transmitter, corresponding to a direct path (D) and a refracted path (R); the measured D vs. R (dt(D,R)) timing differences provide constraints on the index of refraction profile near the South Pole, where the Askaryan Radio Array (ARA) neutrino observatory is located. We constrain the refractive index profile by simulating D and R ray paths via ray tracing and comparing those to measured dt(D,R) signals. Using previous ice density data as a proxy for n(z), we demonstrate that our data strongly favors a glaciologically-motivated three-phase densification model rather than a single exponential scale height model. Simulations show that the single exponential model overestimates ARA neutrino sensitivity compared to the three-phase model.

Submitted 11 June, 2024; v1 submitted 2 June, 2024; originally announced June 2024.

arXiv:2406.00353 [pdf, other]

Form factors of $Λ_b^0 \to Λ_c(2595)^+$ within light-cone QCD sum rules

Authors: Hui-Hui Duan, Yong-Lu Liu, Qin Chang, Ming-Qiu Huang

Abstract: In this work, we calculated the form factors of the weak decay process $Λ_b^0 \to Λ_c(2595)^+$, where the final charm baryon represents an excited state with spin-parity $\frac{1}{2}^-$. Utilizing the light-cone QCD sum rules approach, we incorporated the contributions of the lowest two charm baryon states: the ground state $Λ_c$ with $J^P=\frac{1}{2}^+$ and the excited state $Λ_c(2595)^+$ with $J^P=\frac{1}{2}^-$ in the hadronic representation of the $Λ_b$ to $Λ_c(2595)$ transition correlation function. This approach allows us to extract the form factors of the $Λ_b^0 \to Λ_c(2595)^+$ transition from the $Λ_b^0 \to Λ_c^+$ transition. During the light-cone QCD sum rules procedure, we employed the light-cone distribution amplitudes (LCDAs) of the $Λ_b$ baryon. Furthermore, by combining these form factors with the helicity amplitudes of the bottom baryon transition matrix elements, we calculated the differential decay widths for the processes $Λ_b^0 \to Λ_c(2595)^+\ell^-\bar{ν}_\ell$. Additionally, within the lifetime of $Λ_b^0$, we obtained the absolute branching fractions for the semileptonic decays $Λ_b^0 \to Λ_c(2595)^+ \ell^- \bar{ν}_\ell$. With the branching fractions of $Λ_b^0 \to Λ_c(2595)^+ \ell^- \bar{ν}_\ell$ calculated in this work, we also determined the parameter $\mathcal{R}(Λ_c(2595)^+)$, which tests the lepton flavor universality. This parameter is defined as the ratio of branching fractions $\mathcal{B}r(Λ_b^0 \to Λ_c(2595)^+τ^-\bar{ν}_τ)$ and $\mathcal{B}r(Λ_b^0 \to Λ_c(2595)^+μ^-\bar{ν}_μ)$. Our results provide a valuable theoretical test for these decay channels and offer insights into the LCDAs of bottom baryons, paving the way for further in-depth investigations.

Submitted 1 June, 2024; originally announced June 2024.

Comments: 15 pages, 3 figures and 6 tables

arXiv:2405.14722 [pdf, other]

CAPE: Context-Adaptive Positional Encoding for Length Extrapolation

Authors: Chuanyang Zheng, Yihang Gao, Han Shi, Minbin Huang, Jingyao Li, Jing Xiong, Xiaozhe Ren, Michael Ng, Xin Jiang, Zhenguo Li, Yu Li

Abstract: Positional encoding plays a crucial role in transformers, significantly impacting model performance and length generalization. Prior research has introduced absolute positional encoding (APE) and relative positional encoding (RPE) to distinguish token positions in given sequences. However, both APE and RPE remain fixed after model training regardless of input data, limiting their adaptability and flexibility. Hence, we expect that the desired positional encoding should be context-adaptive and can be dynamically adjusted with the given attention. In this paper, we propose a Context-Adaptive Positional Encoding (CAPE) method, which dynamically and semantically adjusts based on input context and learned fixed priors. Experimental validation on real-world datasets (Arxiv, Books3, and CHE) demonstrates that CAPE enhances model performances in terms of trained length and length generalization, where the improvements are statistically significant. The model visualization suggests that our model can keep both local and anti-local information. Finally, we successfully train the model on sequence length 128 and achieve better performance at evaluation sequence length 8192, compared with other static positional encoding methods, revealing the benefit of the adaptive positional encoding method.

Submitted 23 May, 2024; originally announced May 2024.

Comments: Technical Report

arXiv:2405.14383 [pdf, other]

Perception of Knowledge Boundary for Large Language Models through Semi-open-ended Question Answering

Authors: Zhihua Wen, Zhiliang Tian, Zexin Jian, Zhen Huang, Pei Ke, Yifu Gao, Minlie Huang, Dongsheng Li

Abstract: Large Language Models (LLMs) are widely used for knowledge-seeking yet suffer from hallucinations. The knowledge boundary (KB) of an LLM limits its factual understanding, beyond which it may begin to hallucinate. Investigating the perception of LLMs' KB is crucial for detecting hallucinations and LLMs' reliable generation. Current studies perceive LLMs' KB on questions with a concrete answer (close-ended questions) while paying limited attention to semi-open-ended questions (SoeQ) that correspond to many potential answers. Some researchers achieve it by judging whether the question is answerable or not. However, this paradigm is unsuitable for SoeQ, which are usually partially answerable, containing both answerable and ambiguous (unanswerable) answers. Ambiguous answers are essential for knowledge-seeking, but they may go beyond the KB of LLMs. In this paper, we perceive the LLMs' KB with SoeQ by discovering more ambiguous answers. First, we apply an LLM-based approach to construct SoeQ and obtain answers from a target LLM. Unfortunately, the output probabilities of mainstream black-box LLMs are inaccessible to sample for low-probability ambiguous answers. Therefore, we apply an open-sourced auxiliary model to explore ambiguous answers for the target LLM. We calculate the nearest semantic representation for existing answers to estimate their probabilities, with which we reduce the generation probability of high-probability answers to achieve a more effective generation. Finally, we compare the results from the RAG-based evaluation and LLM self-evaluation to categorize four types of ambiguous answers that are beyond the KB of the target LLM. Following our method, we construct a dataset to perceive the KB for GPT-4. We find that GPT-4 performs poorly on SoeQ and is often unaware of its KB. Besides, our auxiliary model, LLaMA-2-13B, is effective in discovering more ambiguous answers.

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.10197 [pdf, other]

Forte: A Suite of Advanced Multireference Quantum Chemistry Methods

Authors: Francesco A. Evangelista, Chenyang Li, Prakash Verma, Kevin P. Hannon, Jeffrey B. Schriber, Tianyuan Zhang, Chenxi Cai, Shuhe Wang, Nan He, Nicholas H. Stair, Meng Huang, Renke Huang, Jonathon P. Misiewicz, Shuhang Li, Kevin Marin, Zijun Zhao, Lori A. Burns

Abstract: Forte is an open-source library specialized in multireference electronic structure theories for molecular systems and the rapid prototyping of new methods. This paper gives an overview of the capabilities of Forte, its software architecture, and examples of applications enabled by the methods it implements.

Submitted 16 May, 2024; originally announced May 2024.

arXiv:2405.09274 [pdf, other]

Dynamic Activation Pitfalls in LLaMA Models: An Empirical Study

Authors: Chi Ma, Mincong Huang, Chao Wang, Yujie Wang, Lei Yu

Abstract: In this work, we systematically investigate the efficacy of dynamic activation mechanisms within the LLaMA family of language models. Despite the potential of dynamic activation methods to reduce computation and increase speed in models using the ReLU activation function, our empirical findings have uncovered several inherent pitfalls in the current dynamic activation schemes. Through extensive experiments across various dynamic activation strategies, we demonstrate that LLaMA models usually underperform when compared to their ReLU counterparts, particularly in scenarios demanding high sparsity ratio. We attribute these deficiencies to a combination of factors: 1) the inherent complexity of dynamically predicting activation heads and neurons; 2) the inadequate sparsity resulting from activation functions; 3) the insufficient preservation of information resulting from KV cache skipping. Our analysis not only sheds light on the limitations of dynamic activation in the context of large-scale LLaMA models but also proposes roadmaps for enhancing the design of future sparsity schemes.

Submitted 15 May, 2024; originally announced May 2024.

arXiv:2405.08748 [pdf, other]

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

Authors: Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, et al. (20 additional authors not shown)

Abstract: We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. Through our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models. Code and pretrained models are publicly available at github.com/Tencent/HunyuanDiT

Submitted 14 May, 2024; originally announced May 2024.

Comments: Project Page: https://dit.hunyuan.tencent.com/

arXiv:2405.06915 [pdf]

Automating Creativity

Authors: Ming-Hui Huang, Roland T. Rust

Abstract: Generative AI (GenAI) has spurred the expectation of being creative, due to its ability to generate content, yet so far, its creativity has somewhat disappointed, because it is trained using existing data following human intentions to generate outputs. The purpose of this paper is to explore what is required to evolve AI from generative to creative. Based on a reinforcement learning approach and building upon various research streams of computational creativity, we develop a triple prompt-response-reward engineering framework to develop the creative capability of GenAI. This framework consists of three components: 1) a prompt model for expected creativity by developing discriminative prompts that are objectively, individually, or socially novel, 2) a response model for observed creativity by generating surprising outputs that are incrementally, disruptively, or radically innovative, and 3) a reward model for improving creativity over time by incorporating feedback from the AI, the creator/manager, and/or the customers. This framework enables the application of GenAI for various levels of creativity strategically.

Submitted 11 May, 2024; originally announced May 2024.

Comments: 46 pages, 2 tables, 4 figures

arXiv:2405.06470 [pdf, other]

Solar fusion III: New data and theory for hydrogen-burning stars

Authors: B. Acharya, M. Aliotta, A. B. Balantekin, D. Bemmerer, C. A. Bertulani, A. Best, C. R. Brune, R. Buompane, F. Cavanna, J. W. Chen, J. Colgan, A. Czarnecki, B. Davids, R. J. deBoer, F. Delahaye, R. Depalo, A. García, M. Gatu Johnson, D. Gazit, L. Gialanella, U. Greife, D. Guffanti, A. Guglielmetti, K. Hambleton, W. C. Haxton, et al. (25 additional authors not shown)

Abstract: In stars that lie on the main sequence in the Hertzsprung-Russell diagram, like our sun, hydrogen is fused to helium in a number of nuclear reaction chains and series, such as the proton-proton chain and the carbon-nitrogen-oxygen cycles. Precisely determined thermonuclear rates of these reactions lie at the foundation of the standard solar model. This review, the third decadal evaluation of the nuclear physics of hydrogen-burning stars, is motivated by the great advances made in recent years by solar neutrino observatories, putting experimental knowledge of the proton-proton chain neutrino fluxes in the few-percent precision range. The basis of the review is a one-week community meeting held in July 2022 in Berkeley, California, and many subsequent digital meetings and exchanges. Each of the relevant reactions of solar and quiescent stellar hydrogen burning is reviewed here, from both theoretical and experimental perspectives. Recommendations for the state of the art of the astrophysical S-factor and its uncertainty are formulated for each of them. Several other topics of paramount importance for the solar model are reviewed as well: recent and future neutrino experiments, electron screening, radiative opacities, and current and upcoming experimental facilities. In addition to reaction-specific recommendations, general recommendations are also formed.

Submitted 10 May, 2024; originally announced May 2024.

Comments: 85 pages, 15 figures. To be submitted to Reviews of Modern Physics

Report number: N3AS-24-016

arXiv:2405.06386 [pdf, other]

Deconfinement and chiral restoration phase transition under rotation from holography in an anisotropic gravitational background

Authors: Yidian Chen, Xun Chen, Danning Li, Mei Huang

Abstract: We investigate the effects of rotation on deconfinement and chiral phase transitions in the framework of the dynamical holographic QCD model. Instead of transforming to the rotating system by Lorentz boost, we construct an anisotropic gravitational background by incorporating the rotating boundary current. We first investigate the pure gluon system under rotation to extract the deconfinement phase transition from the Polyakov loop, then add a 2-flavor probe for the chiral restoration phase transition from the chiral condensate. It is observed that at low chemical potentials, the deconfinement phase transition of the pure gluon system is of first order and the chiral phase transition of the 2-flavor system is a crossover. Both the critical temperatures of deconfinement and chiral phase transitions decrease/increase with imaginary/real angular velocity ($Ω_I/Ω$) as $T/T_c\sim 1- C_2 Ω_I^2$ and $T/T_c\sim 1+ C_2 Ω^2$, which is consistent with lattice QCD results. In the temperature-chemical potential $T-μ$ phase diagram, the critical end point (CEP) moves towards regions of higher temperature and chemical potential with real angular velocity.

Submitted 10 May, 2024; originally announced May 2024.

arXiv:2405.06179 [pdf, ps, other]

Flavor dependent Critical endpoint from holographic QCD through machine learning

Authors: Xun Chen, Mei Huang

Abstract: The QCD phase diagram in the $T - μ$ plane and the equation of state for pure gluon, 2-flavor, 2+1-flavor, and 2+1+1-flavor systems have been investigated using the Einstein-Maxwell-Dilaton (EMD) framework at finite temperature and chemical potential. By inputting lattice QCD data for the equation of state and baryon susceptibility at zero chemical potential into the holographic model, all the parameters can be determined with the aid of machine learning algorithms. Our findings indicate that the deconfinement phase transition is of first order for the pure gluon system with critical temperature $T_c = 0.265 \rm GeV$ at vanishing chemical potential. The phase transitions for the 2-flavor, 2+1-flavor, and 2+1+1-flavor systems are crossovers at vanishing chemical potential and first-order at high chemical potential, and the critical endpoint (CEP) in the $T - μ$ plane is located at ($μ_B^c$ = 0.46 GeV, $T^c$ = 0.147 GeV), ($μ_B^c$ = 0.74 GeV, $T^c$ = 0.094 GeV), and ($μ_B^c$ = 0.87 GeV, $T^c$ = 0.108 GeV), respectively. Additionally, the thermodynamic quantities of the system for different flavors at finite chemical potential are presented in this paper. It is observed that the difference between the 2+1 flavor and 2+1+1 flavor systems is invisible at vanishing chemical potential and low temperature. The location of the CEP for the 2+1+1 flavor system deviates explicitly from that of the 2+1 flavor system with the increase of chemical potential. Both the 2+1 flavor and 2+1+1 flavor systems differ significantly from the 2-flavor system.

Submitted 9 May, 2024; originally announced May 2024.

arXiv:2405.04059 [pdf]

Three-dimensional hidden phase probed by in-plane magnetotransport in kagome metal CsV$_3$Sb$_5$ thin flakes

Authors: Xinjian Wei, Congkuan Tian, Hang Cui, Yuxin Zhai, Yongkai Li, Shaobo Liu, Yuanjun Song, Ya Feng, Miaoling Huang, Zhiwei Wang, Yi Liu, Qihua Xiong, Yugui Yao, X. C. Xie, Jian-Hao Chen

Abstract: Transition metal compounds with kagome structure have been found to exhibit a variety of exotic structural, electronic, and magnetic orders. These orders are competing with energies very close to each other, resulting in complex phase transitions. Some of the phases are easily observable, such as the charge density wave (CDW) and the superconducting phase, while others are more challenging to identify and characterize. Here we present magneto-transport evidence of a new phase below ~35 K in the kagome topological metal CsV$_3$Sb$_5$ (CVS) thin flakes between the CDW and the superconducting transition temperatures. This phase is characterized by six-fold rotational symmetry in the in-plane magnetoresistance (MR) and is connected to the orbital current order in CVS. Furthermore, the phase is characterized by a large in-plane negative magnetoresistance, which suggests the existence of a three-dimensional, magnetic field-tunable orbital current ordered phase. Our results highlight the potential of magneto-transport to reveal the interactions between exotic quantum states of matter and to uncover the symmetry of such hidden phases.

Submitted 7 May, 2024; originally announced May 2024.

arXiv:2405.02313 [pdf, ps, other]

Physics-informed Data-driven Cavitation Model for a Specific MG EOS

Authors: Minsheng Huang, Chengbao Yao, Pan Wang, Lidong Cheng, Wenjun Ying

Abstract: We present a novel one-fluid cavitation model of a specific Mie-Grüneisen equation of state (EOS), named polynomial EOS, based on an artificial neural network. Not only the physics-informed equation but also the experimental data are embedded into the proposed model by an optimization problem. The physics-informed data-driven model provides the concerned pressure within the cavitation region, where the density tends to zero when the pressure falls below the saturated pressure. The present model is then applied to challenging compressible multi-phase flow simulations, such as nuclear and underwater explosions. Numerical simulations show that our model in application agrees well with the corresponding experimental data, ranging from one dimension to three dimensions with the $h$-adaptive mesh refinement algorithm and load balance techniques in the structured and unstructured grid.

Submitted 5 April, 2024; originally announced May 2024.

Comments: 29 pages, 18 figures

arXiv:2405.01949 [pdf, other]

Bubble wall velocity and gravitational wave in the minimal left-right symmetric model

Authors: Dian-Wei Wang, Qi-Shu Yan, Mei Huang

Abstract: The bubble wall velocity in the first order phase transition plays an important role in determining both the amplitude and the pivot frequency of the stochastic gravitational wave background. In the framework of the minimal left-right symmetric model, we study the wall velocity when the first order phase transition can occur. The wall velocity can be determined by matching the distribution functions in the free particle approximation and the local thermal equilibrium approximation. It is found that the wall velocity can be determined in the range $ 0.2 < v_w < 0.5 $ for the parameter space with the first order phase transition. It is also found that for the case when the wall velocity is close to the speed of sound, the peak amplitude of the gravitational wave spectrum can be larger than that in the runaway case. Moreover, it is found that there exists an approximate power law between the wall velocity and the pressure difference between the broken and symmetric phases, with a power index of approximately 0.41.

Submitted 6 May, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

Comments: 22 pages, 12 figures

arXiv:2405.00802 [pdf]

doi 10.1126/sciadv.adk8495

Sensing Spin Wave Excitations by Spin Defects in Few-Layer Thick Hexagonal Boron Nitride

Authors: Jingcheng Zhou, Hanyi Lu, Di Chen, Mengqi Huang, Gerald Q. Yan, Faris Al-matouq, Jiu Chang, Dziga Djugba, Zhigang Jiang, Hailong Wang, Chunhui Rita Du

Abstract: Optically active spin defects in wide band-gap semiconductors serve as a local sensor of multiple degrees of freedom in a variety of "hard" and "soft" condensed matter systems. Taking advantage of the recent progress on quantum sensing using van der Waals (vdW) quantum materials, here we report direct measurements of spin waves excited in magnetic insulator Y3Fe5O12 (YIG) by boron vacancy $V_B^-$ spin defects contained in few-layer thick hexagonal boron nitride nanoflakes. We show that the ferromagnetic resonance and parametric spin excitations can be effectively detected by $V_B^-$ spin defects under various experimental conditions through optically detected magnetic resonance measurements. The off-resonant dipole interaction between YIG magnons and $V_B^-$ spin defects is mediated by multi-magnon scattering processes, which may find relevant applications in a range of emerging quantum sensing, computing, and metrology technologies. Our results also highlight the opportunities offered by quantum spin defects in layered two-dimensional vdW materials for investigating local spin dynamic behaviors in magnetic solid-state matters.

Submitted 1 May, 2024; originally announced May 2024.

arXiv:2404.19652 [pdf, other]

VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization

Authors: Yuliang Liu, Mingxin Huang, Hao Yan, Linger Deng, Weijia Wu, Hao Lu, Chunhua Shen, Lianwen Jin, Xiang Bai

Abstract: Text spotting, a task involving the extraction of textual information from image or video sequences, faces challenges in cross-domain adaption, such as image-to-image and image-to-video generalization. In this paper, we introduce a new method, termed VimTS, which enhances the generalization ability of the model by achieving better synergy among different tasks. Typically, we propose a Prompt Queries Generation Module and a Tasks-aware Adapter to effectively convert the original single-task model into a multi-task model suitable for both image and video scenarios with minimal additional parameters. The Prompt Queries Generation Module facilitates explicit interaction between different tasks, while the Tasks-aware Adapter helps the model dynamically learn suitable features for each task. Additionally, to further enable the model to learn temporal information at a lower cost, we propose a synthetic video text dataset (VTD-368k) by leveraging the Content Deformation Fields (CoDeF) algorithm. Notably, our method outperforms the state-of-the-art method by an average of 2.6% in six cross-domain benchmarks such as TT-to-IC15, CTW1500-to-TT, and TT-to-CTW1500. For video-level cross-domain adaption, our method even surpasses the previous end-to-end video spotting method in ICDAR2015 video and DSText v2 by an average of 5.5% on the MOTA metric, using only image-level data. We further demonstrate that existing Large Multimodal Models exhibit limitations in generating cross-domain scene text spotting, in contrast to our VimTS model which requires significantly fewer parameters and data. The code and datasets will be made available at https://VimTextSpotter.github.io.

Submitted 14 May, 2024; v1 submitted 30 April, 2024; originally announced April 2024.

arXiv:2404.18919 [pdf, other]

TheaterGen: Character Management with LLM for Consistent Multi-turn Image Generation

Authors: Junhao Cheng, Baiqiao Yin, Kaixin Cai, Minbin Huang, Hanhui Li, Yuxin He, Xi Lu, Yue Li, Yifei Li, Yuhao Cheng, Yiqiang Yan, Xiaodan Liang

Abstract: Recent advances in diffusion models can generate high-quality and stunning images from text. However, multi-turn image generation, which is in high demand in real-world scenarios, still faces challenges in maintaining semantic consistency between images and texts, as well as contextual consistency of the same subject across multiple interactive turns. To address this issue, we introduce TheaterGen, a training-free framework that integrates large language models (LLMs) and text-to-image (T2I) models to provide the capability of multi-turn image generation. Within this framework, LLMs, acting as a "Screenwriter", engage in multi-turn interaction, generating and managing a standardized prompt book that encompasses prompts and layout designs for each character in the target image. Based on these, TheaterGen generates a list of character images and extracts guidance information, akin to the "Rehearsal". Subsequently, by incorporating the prompt book and guidance information into the reverse denoising process of T2I diffusion models, TheaterGen generates the final image, akin to the "Final Performance". With the effective management of prompt books and character images, TheaterGen significantly improves semantic and contextual consistency in synthesized images. Furthermore, we introduce a dedicated benchmark, CMIGBench (Consistent Multi-turn Image Generation Benchmark) with 8000 multi-turn instructions. Different from previous multi-turn benchmarks, CMIGBench does not define characters in advance. Both the tasks of story generation and multi-turn editing are included in CMIGBench for comprehensive evaluation. Extensive experimental results show that TheaterGen outperforms state-of-the-art methods significantly. It raises the performance bar of the cutting-edge Mini DALLE 3 model by 21% in average character-character similarity and 19% in average text-image similarity.

Submitted 29 April, 2024; originally announced April 2024.

arXiv:2404.18619 [pdf]

Patterning of 2D second harmonic generation active arrays in ferroelectric nematic fluids

Authors: M. Lovšin, A. Petelin, B. Berteloot, N. Osterman, S. Aya, M. Huang, I. Drevenšek-Olenik, R. J. Mandle, K. Neyts, A. Mertelj, N. Sebastian

Abstract: Ferroelectric nematic liquid crystals exhibit unique non-linear optical properties, with the potential to become transformative materials for photonic applications. A promising direction relies on the fabrication of tailored polar orientational patterns via photoalignment, thus shaping the non-linear optical susceptibility through thin slabs of the ferroelectric fluid. Here, we explore the fabrication of 2D periodic SHG active arrays in ferroelectric nematic fluids, for different materials, cell thicknesses and motifs. Based on polarizing optical microscopy observations in combination with optical simulations, second harmonic generation microscopy and interferometry, the 3D structure of the motifs is revealed. Two different 2D periodic patterns are explored, showing that the balance between flexoelectric and electrostatic energy can lead to different domain structures, an effect which is rooted in the difference between the flexoelectric properties of the materials. It is shown that by combining the surface-inscribed alignment with different spontaneous degrees of twist, 2D SHG active arrays can be obtained in the micrometre scale, in which adjacent areas exhibit maximum SHG signals at opposite angles.

Submitted 29 April, 2024; originally announced April 2024.

Comments: 24 pages, 5 Images main, 15 supplementary images

arXiv:2404.18033 [pdf, other]

Exposing Text-Image Inconsistency Using Diffusion Models

Authors: Mingzhen Huang, Shan Jia, Zhou Zhou, Yan Ju, Jialing Cai, Siwei Lyu

Abstract: In the battle against widespread online misinformation, a growing problem is text-image inconsistency, where images are misleadingly paired with texts with different intent or meaning. Existing classification-based methods for text-image inconsistency can identify contextual inconsistencies but fail to provide explainable justifications for their decisions that humans can understand. Although more nuanced, human evaluation is impractical at scale and susceptible to errors. To address these limitations, this study introduces D-TIIL (Diffusion-based Text-Image Inconsistency Localization), which employs text-to-image diffusion models to localize semantic inconsistencies in text and image pairs. These models, trained on large-scale datasets, act as "omniscient" agents that filter out irrelevant information and incorporate background knowledge to identify inconsistencies. In addition, D-TIIL uses text embeddings and modified image regions to visualize these inconsistencies. To evaluate D-TIIL's efficacy, we introduce a new TIIL dataset containing 14K consistent and inconsistent text-image pairs. Unlike existing datasets, TIIL enables assessment at the level of individual words and image regions and is carefully designed to represent various inconsistencies. D-TIIL offers a scalable and evidence-based approach to identifying and localizing text-image inconsistency, providing a robust framework for future research combating misinformation.

Submitted 27 April, 2024; originally announced April 2024.

arXiv:2404.17607 [pdf, other]

Utilizing Large Language Models to Identify Reddit Users Considering Vaping Cessation for Digital Interventions

Authors: Sai Krishna Revanth Vuruma, Dezhi Wu, Saborny Sen Gupta, Lucas Aust, Valerie Lookingbill, Caleb Henry, Yang Ren, Erin Kasson, Li-Shiun Chen, Patricia Cavazos-Rehg, Dian Hu, Ming Huang

Abstract: The widespread adoption of social media platforms globally not only enhances users' connectivity and communication but also emerges as a vital channel for the dissemination of health-related information, thereby establishing social media data as an invaluable organic data resource for public health research. The surge in popularity of vaping or e-cigarette use in the United States and other countries has caused an outbreak of e-cigarette and vaping use-associated lung injury (EVALI), leading to hospitalizations and fatalities in 2019, highlighting the urgency to comprehend vaping behaviors and develop effective strategies for cessation. In this study, we extracted a sample dataset from one vaping sub-community on Reddit to analyze users' quit-vaping intentions. Leveraging large language models including both the latest GPT-4 and traditional BERT-based language models for sentence-level quit-vaping intention prediction tasks, this study compares the outcomes of these models against human annotations. Notably, when compared to human evaluators, the GPT-4 model demonstrates superior consistency in adhering to annotation guidelines and processes, showcasing advanced capabilities to detect nuanced user quit-vaping intentions that human evaluators might overlook. These preliminary findings emphasize the potential of GPT-4 in enhancing the accuracy and reliability of social media data analysis, especially in identifying subtle user intentions that may elude human detection.

Submitted 25 April, 2024; originally announced April 2024.

arXiv:2404.16792 [pdf, other]

Weak-to-Strong Extrapolation Expedites Alignment

Authors: Chujie Zheng, Ziqi Wang, Heng Ji, Minlie Huang, Nanyun Peng

Abstract: The open-source community is experiencing a surge in the release of large language models (LLMs) that are trained to follow instructions and align with human preference. However, further training to improve them still requires expensive computational resources and data annotations. Is it possible to bypass additional training and cost-effectively acquire better-aligned models? Inspired by the literature on model interpolation, we propose a simple method called ExPO to boost LLMs' alignment with human preference. Utilizing a model that has undergone alignment training (e.g., via DPO or RLHF) and its initial SFT checkpoint, ExPO directly obtains a better-aligned model by extrapolating from the weights of the initial and the aligned models, which implicitly optimizes the alignment objective via first-order approximation. Through experiments with twelve open-source LLMs on HuggingFace, we demonstrate that ExPO consistently improves off-the-shelf DPO/RLHF models, as evaluated on the mainstream LLM benchmarks AlpacaEval 2.0 and MT-Bench. Moreover, ExPO exhibits remarkable scalability across various model sizes (from 1.8B to 70B) and capabilities. Through controlled experiments and further empirical analyses, we shed light on the essence of ExPO amplifying the reward signal learned during alignment training. Our work demonstrates the efficacy of model extrapolation in expediting the alignment of LLMs with human preference, suggesting a promising direction for future research.

Submitted 22 May, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

Comments: Add theoretical explanation and more evaluation results

arXiv:2404.16425 [pdf, other]

Soft X-ray prompt emission from a high-redshift gamma-ray burst EP240315a

Authors: Y. Liu, H. Sun, D. Xu, D. S. Svinkin, J. Delaunay, N. R. Tanvir, H. Gao, C. Zhang, Y. Chen, X. -F. Wu, B. Zhang, W. Yuan, J. An, G. Bruni, D. D. Frederiks, G. Ghirlanda, J. -W. Hu, A. Li, C. -K. Li, J. -D. Li, D. B. Malesani, L. Piro, G. Raman, R. Ricci, E. Troja, et al. (170 additional authors not shown)

Abstract: Long gamma-ray bursts (GRBs) are believed to originate from core collapse of massive stars. High-redshift GRBs can probe the star formation and reionization history of the early universe, but their detection remains rare. Here we report the detection of a GRB triggered in the 0.5--4 keV band by the Wide-field X-ray Telescope (WXT) on board the Einstein Probe (EP) mission, designated as EP240315a, whose bright peak was also detected by the Swift Burst Alert Telescope and Konus-Wind through off-line analyses. At a redshift of $z=4.859$, EP240315a showed a much longer and more complicated light curve in the soft X-ray band than in gamma-rays. Benefiting from a large field-of-view ($\sim$3600 deg$^2$) and a high sensitivity, EP-WXT captured the earlier engine activation and extended late engine activity through a continuous detection. With a peak X-ray flux at the faint end of previously known high-$z$ GRBs, the detection of EP240315a demonstrates the great potential for EP to study the early universe via GRBs.

Submitted 25 April, 2024; originally announced April 2024.

Comments: 41 pages, 8 figures, 7 tables

arXiv:2404.15790 [pdf, other]

Leveraging Large Language Models for Multimodal Search

Authors: Oriol Barbany, Michael Huang, Xinliang Zhu, Arnab Dhua

Abstract: Multimodal search has become increasingly important in providing users with a natural and effective way to express their search intentions. Images offer fine-grained details of the desired products, while text allows for easily incorporating search modifications. However, some existing multimodal search systems are unreliable and fail to address simple queries. The problem becomes harder with the large variability of natural language text queries, which may contain ambiguous, implicit, and irrelevant information. Addressing these issues may require systems with enhanced matching capabilities, reasoning abilities, and context-aware query parsing and rewriting. This paper introduces a novel multimodal search model that achieves a new performance milestone on the Fashion200K dataset. Additionally, we propose a novel search interface integrating Large Language Models (LLMs) to facilitate natural language interaction. This interface routes queries to search systems while conversationally engaging with users and considering previous searches. When coupled with our multimodal search model, it heralds a new era of shopping assistants capable of offering human-like interaction and enhancing the overall search experience.

Submitted 24 April, 2024; originally announced April 2024.

Comments: Published at CVPRW 2024

arXiv:2404.15249 [pdf, ps, other]

A GPU-accelerated Cartesian grid method for PDEs on irregular domain

Authors: Liwei Tan, Minsheng Huang, Wenjun Ying

Abstract: The kernel-free boundary integral (KFBI) method has successfully solved partial differential equations (PDEs) on irregular domains. Diverging from traditional boundary integral methods, the computation of boundary integrals in KFBI is executed through the resolution of equivalent simple interface problems on Cartesian grids, utilizing fast algorithms. While existing implementations of KFBI methods predominantly utilize CPU platforms, GPU architecture's superior computational capabilities and extensive memory bandwidth offer an efficient resolution to computational bottlenecks. This paper delineates the algorithms adapted for both single-GPU and multiple-GPU applications. On a single GPU, assigning individual threads can control correction, interpolation, and jump calculations. The algorithm is expanded to multiple GPUs to enhance the processing of larger-scale problems. The arrowhead decomposition method is employed in multiple-GPU settings, ensuring optimal computational efficiency and load balancing. Numerical examples show that the proposed algorithm is second-order accurate and efficient. The single-GPU solver is 50-200 times faster than a traditional CPU solver, while the eight-GPU distributed solver yields up to 60% parallel efficiency.

Submitted 23 April, 2024; originally announced April 2024.

Comments: 24 pages, 10 figures

arXiv:2404.14864 [pdf]

A GPU-accelerated Cartesian grid method is proposed for solving the heat, wave, and Schrodinger equations on irregular domains

Authors: Liwei Tan, Minsheng Huang, Wenjun Ying

Abstract: This paper introduces a second-order method for solving general elliptic partial differential equations (PDEs) on irregular domains using GPU acceleration, based on Ying's kernel-free boundary integral (KFBI) method. The method addresses limitations imposed by CFL conditions in explicit schemes and accuracy issues in fully implicit schemes for the Laplacian operator. To overcome these challenges, the paper employs a series of second-order time discrete schemes and splits the Laplacian operator into explicit and implicit components. Specifically, the Crank-Nicolson method discretizes the heat equation in the temporal dimension, while the implicit scheme is used for the wave equation. The Schrodinger equation is treated using the Strang splitting method. By discretizing the temporal dimension implicitly, the heat, wave, and Schrodinger equations are transformed into a sequence of elliptic equations. The Laplacian operator on the right-hand side of the elliptic equation is obtained from the numerical scheme rather than being discretized and corrected by the five-point difference method. A Cartesian grid-based KFBI method is employed to solve the resulting elliptic equations. GPU acceleration, achieved through a parallel Cartesian grid solver, enhances the computational efficiency by exploiting high degrees of parallelism. Numerical results demonstrate that the proposed method achieves second-order accuracy for the heat, wave, and Schrodinger equations. Furthermore, the GPU-accelerated solvers for the three types of time-dependent equations exhibit a speedup of 30 times compared to CPU-based solvers.

Submitted 23 April, 2024; originally announced April 2024.

Comments: 40 pages, 12 figures

arXiv:2404.14228 [pdf, other]

A Survey of Decomposition-Based Evolutionary Multi-Objective Optimization: Part II -- A Data Science Perspective

Authors: Mingyu Huang, Ke Li

Abstract: This paper presents the second part of the two-part survey series on decomposition-based evolutionary multi-objective optimization where we mainly focus on discussing the literature related to multi-objective evolutionary algorithms based on decomposition (MOEA/D). Complementary to the first part, here we employ a series of advanced data mining approaches to provide a comprehensive anatomy of the enormous landscape of MOEA/D research, which is far beyond the capacity of classic manual literature review protocol. In doing so, we construct a heterogeneous knowledge graph that encapsulates more than 5,400 papers, 10,000 authors, 400 venues, and 1,600 institutions for MOEA/D research. We start our analysis with basic descriptive statistics. Then we delve into prominent research/application topics pertaining to MOEA/D with state-of-the-art topic modeling techniques and interrogate their spatial-temporal and bilateral relationships. We also explore the collaboration and citation networks of MOEA/D, uncovering hidden patterns in the growth of literature as well as collaboration between researchers. Our data mining results here, combined with the expert review in Part I, together offer a holistic view of the MOEA/D research, and demonstrate the potential of an exciting new paradigm for conducting scientific surveys from a data science perspective.

Submitted 22 April, 2024; originally announced April 2024.

arXiv:2404.13667 [pdf, other]

doi 10.1109/ACCESS.2024.3404834

MathNet: A Data-Centric Approach for Printed Mathematical Expression Recognition

Authors: Felix M. Schmitt-Koopmann, Elaine M. Huang, Hans-Peter Hutter, Thilo Stadelmann, Alireza Darvishy

Abstract: Printed mathematical expression recognition (MER) models are usually trained and tested using LaTeX-generated mathematical expressions (MEs) as input and the LaTeX source code as ground truth. As the same ME can be generated by various different LaTeX source codes, this leads to unwanted variations in the ground truth data that bias test performance results and hinder efficient learning. In addition, the use of only one font to generate the MEs heavily limits the generalization of the reported results to realistic scenarios. We propose a data-centric approach to overcome this problem, and present convincing experimental results: Our main contribution is an enhanced LaTeX normalization to map any LaTeX ME to a canonical form. Based on this process, we developed an improved version of the benchmark dataset im2latex-100k, featuring 30 fonts instead of one. Second, we introduce the real-world dataset realFormula, with MEs extracted from papers. Third, we developed a MER model, MathNet, based on a convolutional vision transformer, with superior results on all four test sets (im2latex-100k, im2latexv2, realFormula, and InftyMDB-1), outperforming the previous state of the art by up to 88.3%.

Submitted 21 April, 2024; originally announced April 2024.

Comments: 12 pages, 6 figures

Journal ref: IEEE Access 12 (2024) 76963-76974

arXiv:2404.12602 [pdf]

A visualization method for data domain changes in CNN networks and the optimization method for selecting thresholds in classification tasks

Authors: Minzhe Huang, Changwei Nie, Weihong Zhong

Abstract: In recent years, Face Anti-Spoofing (FAS) has played a crucial role in preserving the security of face recognition technology. With the rise of counterfeit face generation techniques, the challenge posed by digitally edited faces to face anti-spoofing is escalating. Existing FAS technologies primarily focus on intercepting physically forged faces and lack a robust solution for cross-domain FAS challenges. Moreover, determining an appropriate threshold to achieve optimal deployment results remains an issue for intra-domain FAS. To address these issues, we propose a visualization method that intuitively reflects the training outcomes of models by visualizing the prediction results on datasets. Additionally, we demonstrate that employing data augmentation techniques, such as downsampling and Gaussian blur, can effectively enhance performance on cross-domain tasks. Building upon our data visualization approach, we also introduce a methodology for setting threshold values based on the distribution of the training dataset. Ultimately, our methods secured us second place in both the Unified Physical-Digital Face Attack Detection competition and the Snapshot Spectral Imaging Face Anti-spoofing contest. The training code is available at https://github.com/SeaRecluse/CVPRW2024.

Submitted 18 April, 2024; originally announced April 2024.

Showing 1–50 of 1,336 results for author: Huang, M