-
Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models
Authors:
Wenhao Shi,
Zhiqiang Hu,
Yi Bin,
Junhua Liu,
Yang Yang,
See-Kiong Ng,
Lidong Bing,
Roy Ka-Wei Lee
Abstract:
Large language models (LLMs) have demonstrated impressive reasoning capabilities, particularly in textual mathematical problem-solving. However, existing open-source image instruction fine-tuning datasets, containing limited question-answer pairs per image, do not fully exploit visual information to enhance the multimodal mathematical reasoning capabilities of Multimodal LLMs (MLLMs). To bridge this gap, we address the lack of high-quality, diverse multimodal mathematical datasets by collecting 40K high-quality images with question-answer pairs from 24 existing datasets and synthesizing 320K new pairs, creating the MathV360K dataset, which enhances both the breadth and depth of multimodal mathematical questions. We introduce Math-LLaVA, a LLaVA-1.5-based model fine-tuned with MathV360K. This novel approach significantly improves the multimodal mathematical reasoning capabilities of LLaVA-1.5, achieving a 19-point increase and comparable performance to GPT-4V on MathVista's minitest split. Furthermore, Math-LLaVA demonstrates enhanced generalizability, showing substantial improvements on the MMMU benchmark. Our research highlights the importance of dataset diversity and synthesis in advancing MLLMs' mathematical reasoning abilities. The code and data are available at: https://github.com/HZQ950419/Math-LLaVA.
Submitted 26 June, 2024; v1 submitted 25 June, 2024;
originally announced June 2024.
-
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Authors:
Zesen Cheng,
Sicong Leng,
Hang Zhang,
Yifei Xin,
Xin Li,
Guanzheng Chen,
Yongxin Zhu,
Wenqi Zhang,
Ziyang Luo,
Deli Zhao,
Lidong Bing
Abstract:
In this paper, we present VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching the multimodal understanding capabilities of the model by seamlessly incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even gets close to some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 exhibits reasonable improvements in audio-only and audio-video question-answering (AQA & OE-AVQA) benchmarks over existing models. These advancements underline VideoLLaMA 2's superior performance in multimodal comprehension, setting a new standard for intelligent video analysis systems. All models are public to facilitate further research.
Submitted 17 June, 2024; v1 submitted 11 June, 2024;
originally announced June 2024.
-
Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee Discussions
Authors:
Ruochen Zhao,
Wenxuan Zhang,
Yew Ken Chia,
Deli Zhao,
Lidong Bing
Abstract:
As LLMs evolve on a daily basis, there is an urgent need for a trustworthy evaluation method that can provide robust evaluation results in a timely fashion. Currently, as static benchmarks are prone to contamination concerns, users tend to trust human voting platforms, such as Chatbot Arena. However, human annotations require extensive manual efforts. To provide an automatic, robust, and trustworthy evaluation framework, we innovatively propose the Auto-Arena of LLMs, which automates the entire evaluation process with LLM agents. Firstly, an examiner LLM devises queries. Then, a pair of candidate LLMs engage in a multi-round peer-battle around the query, during which the LLMs' true performance gaps become visible. Finally, a committee of LLM judges collectively discusses and determines the winner, which alleviates bias and promotes fairness. In our extensive experiment on the 17 newest LLMs, Auto-Arena shows the highest correlation with human preferences, providing a promising alternative to human evaluation platforms.
Submitted 12 June, 2024; v1 submitted 30 May, 2024;
originally announced May 2024.
-
Probabilistic and progressive deblended far-infrared and sub-millimetre point source catalogues I. Methodology and first application in the COSMOS field
Authors:
Lingyu Wang,
Antonio La Marca,
Fangyou Gao,
William J. Pearson,
Berta Margalef-Bentabol,
Matthieu Béthermin,
Longji Bing,
James Donnellan,
Peter D. Hurley,
Seb J. Oliver,
Catherine L. Hale,
Matt J. Jarvis,
Lucia Marchetti,
Mattia Vaccari,
Imogen H. Whittam
Abstract:
Single-dish far-infrared (far-IR) and sub-millimetre (sub-mm) point source catalogues and their connections with catalogues at other wavelengths are of paramount importance. However, due to the large mismatch in spatial resolution, cross-matching galaxies at different wavelengths is challenging. This work aims to develop the next-generation deblended far-IR and sub-mm catalogues and present the first application in the COSMOS field. Our progressive deblending used the Bayesian probabilistic framework known as XID+. The deblending started from the Spitzer/MIPS 24 micron data, using an initial prior list composed of sources selected from the COSMOS2020 catalogue and radio catalogues from the VLA and the MeerKAT surveys, based on spectral energy distribution modelling which predicts fluxes of the known sources at the deblending wavelength. To speed up flux prediction, we made use of a neural network-based emulator. After deblending the 24 micron data, we proceeded to the Herschel PACS (100 & 160 micron) and SPIRE wavebands (250, 350 & 500 micron). Each time we constructed a tailor-made prior list based on the predicted fluxes of the known sources. Using simulated far-IR and sub-mm sky, we detailed the performance of our deblending pipeline. After validation with simulations, we then deblended the real observations from 24 to 500 micron and compared with blindly extracted catalogues and previous versions of deblended catalogues. As an additional test, we deblended the SCUBA-2 850 micron map and compared our deblended fluxes with ALMA measurements, which demonstrates a higher level of flux accuracy compared to previous results. We publicly release our XID+ deblended point source catalogues. These deblended long-wavelength data are crucial for studies such as deriving the fraction of dust-obscured star formation and better separation of quiescent galaxies from dusty star-forming galaxies.
Submitted 28 May, 2024;
originally announced May 2024.
-
LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for Boosting Query Efficiency
Authors:
Zhaodonghui Li,
Haitao Yuan,
Huiming Wang,
Gao Cong,
Lidong Bing
Abstract:
Query rewrite, which aims to generate more efficient queries by altering a SQL query's structure without changing the query result, has been an important research problem. In order to maintain equivalence between the rewritten query and the original one during rewriting, traditional query rewrite methods always rewrite the queries following certain rewrite rules. However, some problems still remain. Firstly, existing methods of finding the optimal choice or sequence of rewrite rules are still limited and the process always costs a lot of resources. Methods involving discovering new rewrite rules typically require complicated proofs of structural logic or extensive user interactions. Secondly, current query rewrite methods usually rely highly on DBMS cost estimators which are often not accurate. In this paper, we address these problems by proposing a novel method of query rewrite named LLM-R2, adopting a large language model (LLM) to propose possible rewrite rules for a database rewrite system. To further improve the inference ability of LLM in recommending rewrite rules, we train a contrastive model by curriculum to learn query representations and select effective query demonstrations for the LLM. Experimental results have shown that our method can significantly improve the query execution efficiency and outperform the baseline methods. In addition, our method enjoys high robustness across different datasets.
Submitted 19 April, 2024;
originally announced April 2024.
-
Overcoming Confusion Noise with Hyperspectral Imaging from PRIMAger
Authors:
James M. S. Donnellan,
Seb J. Oliver,
Matthieu Bethermin,
Longji Bing,
Alberto Bolatto,
Charles M. Bradford,
Denis Burgarella,
Laure Ciesla,
Jason Glenn,
Alexandra Pope,
Stephen Serjeant,
Raphael Shirley,
JD T. Smith,
Chris Sorrell
Abstract:
The PRobe far-Infrared Mission for Astrophysics (PRIMA) concept aims to perform mapping with spectral coverage and sensitivities inaccessible to previous FIR space telescopes. PRIMA's imaging instrument, PRIMAger, provides unique hyperspectral imaging simultaneously covering 25-235 $μ$m. We synthesise images representing a deep, 1500 hr deg$^{-2}$ PRIMAger survey, with realistic instrumental and confusion noise. We demonstrate that we can construct catalogues of galaxies with a high purity ($>95$ per cent) at a source density of 42k deg$^{-2}$ using PRIMAger data alone. Using the XID+ deblending tool we show that we measure fluxes with an accuracy better than 20 per cent to flux levels of 0.16, 0.80, 9.7 and 15 mJy at 47.4, 79.7, 172, 235 $μ$m respectively. These are a factor of $\sim$2 and $\sim$3 fainter than the classical confusion limits for 72-96 $μ$m and 126-235 $μ$m, respectively. At $1.5 \leq z \leq 2$, we detect and accurately measure fluxes in 8-10 of the 10 channels covering 47-235 $μ$m for sources with $2 \leq$ log(SFR) $\leq 2.5$, a 0.5 dex improvement on what might be expected from the classical confusion limit. Recognising that PRIMAger will operate in a context where high quality data will be available at other wavelengths, we investigate the benefits of introducing additional prior information. We show that by introducing even weak prior flux information when employing a higher source density catalogue (more than one source per beam) we can obtain accurate fluxes an order of magnitude below the classical confusion limit for 96-235 $μ$m.
Submitted 10 April, 2024;
originally announced April 2024.
-
ParaICL: Towards Robust Parallel In-Context Learning
Authors:
Xingxuan Li,
Xuan-Phi Nguyen,
Shafiq Joty,
Lidong Bing
Abstract:
Large language models (LLMs) have become the norm in natural language processing (NLP), excelling in few-shot in-context learning (ICL) with their remarkable abilities. Nonetheless, the success of ICL largely hinges on the choice of few-shot demonstration examples, making the selection process increasingly crucial. Existing methods have delved into optimizing the quantity and semantic similarity of these examples to improve ICL performances. However, our preliminary experiments indicate that the effectiveness of ICL is limited by the length of the input context. Moreover, varying combinations of few-shot demonstration examples can significantly boost accuracy across different test samples. To address this, we propose a novel method named parallel in-context learning (ParaICL) that effectively utilizes all demonstration examples without exceeding the manageable input context length. ParaICL employs parallel batching to distribute demonstration examples into different batches according to the semantic similarities of the questions in the demonstrations to the test question. It then computes normalized batch semantic scores for each batch. A weighted average semantic objective, constrained by adaptive plausibility, is applied to select the most appropriate tokens. Through extensive experiments, we validate the effectiveness of ParaICL and conduct ablation studies to underscore its design rationale. We further demonstrate that ParaICL can seamlessly integrate with existing methods.
Submitted 31 March, 2024;
originally announced April 2024.
-
PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns
Authors:
Yew Ken Chia,
Vernon Toh Yan Han,
Deepanway Ghosal,
Lidong Bing,
Soujanya Poria
Abstract:
Large multimodal models extend the impressive capabilities of large language models by integrating multimodal understanding abilities. However, it is not clear how they can emulate the general intelligence and reasoning ability of humans. As recognizing patterns and abstracting concepts are key to general intelligence, we introduce PuzzleVQA, a collection of puzzles based on abstract patterns. With this dataset, we evaluate large multimodal models with abstract patterns based on fundamental concepts, including colors, numbers, sizes, and shapes. Through our experiments on state-of-the-art large multimodal models, we find that they are not able to generalize well to simple abstract patterns. Notably, even GPT-4V cannot solve more than half of the puzzles. To diagnose the reasoning challenges in large multimodal models, we progressively guide the models with our ground truth reasoning explanations for visual perception, inductive reasoning, and deductive reasoning. Our systematic analysis finds that the main bottlenecks of GPT-4V are weaker visual perception and inductive reasoning abilities. Through this work, we hope to shed light on the limitations of large multimodal models and how they can better emulate human cognitive processes in the future (Our data and code will be released publicly at https://github.com/declare-lab/LLM-PuzzleTest).
Submitted 30 April, 2024; v1 submitted 20 March, 2024;
originally announced March 2024.
-
Is Translation All You Need? A Study on Solving Multilingual Tasks with Large Language Models
Authors:
Chaoqun Liu,
Wenxuan Zhang,
Yiran Zhao,
Anh Tuan Luu,
Lidong Bing
Abstract:
Large language models (LLMs) have demonstrated multilingual capabilities; yet, they are mostly English-centric due to the imbalanced training corpora. Existing works leverage this phenomenon to improve their multilingual performances through translation, primarily on natural language processing (NLP) tasks. This work extends the evaluation from NLP tasks to real user queries and from English-centric LLMs to non-English-centric LLMs. While translation into English can help improve the performance of multilingual NLP tasks for English-centric LLMs, it may not be optimal for all scenarios. For culture-related tasks that need deep language understanding, prompting in the native language tends to be more promising as it better captures the nuances of culture and language. Our experiments reveal varied behaviors among different LLMs and tasks in the multilingual context. Therefore, we advocate for more comprehensive multilingual evaluation and more efforts toward developing multilingual LLMs beyond English-centric ones.
Submitted 20 June, 2024; v1 submitted 15 March, 2024;
originally announced March 2024.
-
Faint millimeter NIKA2 dusty star-forming galaxies: finding the high-redshift population
Authors:
L. -J. Bing,
A. Beelen,
G. Lagache,
R. Adam,
P. Ade,
H. Ajeddig,
P. André,
E. Artis,
H. Aussel,
A. Benoît,
S. Berta,
M. Béthermin,
O. Bourrion,
M. Calvo,
A. Catalano,
M. De Petris,
F. -X. Désert,
S. Doyle,
E. F. C. Driessen,
A. Gomez,
J. Goupy,
F. Kéruzoré,
C. Kramer,
B. Ladjelate,
S. Leclercq
, et al. (24 additional authors not shown)
Abstract:
We develop a new framework to constrain the source redshift. The method jointly accounts for the detection/non-detection of spectral lines and the prior information from the photometric redshift and total infrared luminosity from spectral energy distribution analysis. The method uses the estimated total infrared luminosity to predict the line fluxes at given redshifts and generates model spectra. The redshift-dependent spectral models are then compared with the observed spectra to find the redshift. We apply the aforementioned joint redshift analysis method to four high-z dusty star-forming galaxy candidates selected from the NIKA2 observations of the HLSJ091828.6+514223 (HLS) field, and further observed by NOEMA with blind spectral scans. These sources only have SPIRE/Herschel photometry as ancillary data. They were selected because of very faint or no SPIRE counterparts, so as to bias the sample towards the highest redshift candidates. The method finds spectroscopic redshifts for 4 of the 5 sources with detected NOEMA counterparts, with z>3. Based on these measurements, we derive the CO/[CI] lines and millimeter continuum fluxes from the NOEMA data and study their ISM and star-formation properties. We find cold dust temperatures in some of the HLS sources compared to the general population of sub-millimeter galaxies, which might be related to the bias introduced by the SPIRE-dropout selection. All but one of our sources have short gas depletion times of a few hundred Myr, which is typical among high-z sub-millimeter galaxies. The only exception shows a longer gas depletion time, up to a few Gyr, comparable to that of main-sequence galaxies at the same redshift. Furthermore, we identify a possible over-density of dusty star-forming galaxies at z=5.2, traced by two sources in our sample, as well as the lensed galaxy HLSJ091828.6+514223. (abridged)
Submitted 1 March, 2024;
originally announced March 2024.
-
AdaMergeX: Cross-Lingual Transfer with Large Language Models via Adaptive Adapter Merging
Authors:
Yiran Zhao,
Wenxuan Zhang,
Huiming Wang,
Kenji Kawaguchi,
Lidong Bing
Abstract:
As an effective alternative to the direct fine-tuning on target tasks in specific languages, cross-lingual transfer addresses the challenges of limited training data by decoupling "task ability" and "language ability" by fine-tuning on the target task in the source language and another selected task in the target language, respectively. However, existing methods fail to fully separate the task ability from the source language or the language ability from the chosen task. In this paper, we acknowledge the mutual reliance between task ability and language ability and direct our attention toward the gap between the target language and the source language on tasks. As the gap removes the impact of tasks, we assume that it remains consistent across tasks. Based on this assumption, we propose a new cross-lingual transfer method called AdaMergeX that utilizes adaptive adapter merging. By introducing a reference task, we can determine that the divergence of adapters fine-tuned on the reference task in both languages follows the same distribution as the divergence of adapters fine-tuned on the target task in both languages. Hence, we can obtain target adapters by combining the other three adapters. Furthermore, we propose a structure-adaptive adapter merging method. Our empirical results demonstrate that our approach yields new and effective cross-lingual transfer, outperforming existing methods across all settings.
Submitted 29 February, 2024;
originally announced February 2024.
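The adapter combination described in the AdaMergeX abstract above can be illustrated with a minimal sketch. It assumes additive adapters (for example, LoRA-style weight deltas) stored as flat lists of floats, so that the language gap measured on the reference task can be added to the target-task adapter trained in the source language. The function name `merge_adapters`, the `scale` parameter, and the dictionary layout are hypothetical and not taken from the paper, whose structure-adaptive merging is not covered here.

```python
def merge_adapters(task_source, ref_target, ref_source, scale=1.0):
    """Estimate a target-task, target-language adapter from three others.

    Assumption (not stated in the abstract): adapters are additive, so
        target ~= task_source + scale * (ref_target - ref_source)
    Each argument maps a parameter name to a flat list of floats.
    """
    merged = {}
    for name, weights in task_source.items():
        tgt = ref_target[name]
        src = ref_source[name]
        if not (len(weights) == len(tgt) == len(src)):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Add the language gap observed on the reference task.
        merged[name] = [w + scale * (t - s) for w, t, s in zip(weights, tgt, src)]
    return merged


# Toy adapters with a single two-element parameter (made-up values).
task_source = {"layer0": [1.0, 2.0]}   # target task, source language
ref_target = {"layer0": [0.5, 0.5]}    # reference task, target language
ref_source = {"layer0": [0.25, 1.0]}   # reference task, source language

merged = merge_adapters(task_source, ref_target, ref_source)
print(merged)
```

Setting `scale=0.0` returns the source-language task adapter unchanged, which is a convenient sanity check for the sketch.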
-
How do Large Language Models Handle Multilingualism?
Authors:
Yiran Zhao,
Wenxuan Zhang,
Guizhen Chen,
Kenji Kawaguchi,
Lidong Bing
Abstract:
Large language models (LLMs) have demonstrated impressive capabilities across diverse languages. This study explores how LLMs handle multilingualism. Based on observed language ratio shifts among layers and the relationships between network structures and certain capabilities, we hypothesize the LLM's multilingual workflow (MWork): LLMs initially understand the query, converting multilingual inputs into English for task-solving. In the intermediate layers, they employ English for thinking and incorporate multilingual knowledge with self-attention and feed-forward structures, respectively. In the final layers, LLMs generate responses aligned with the original language of the query. To verify MWork, we introduce Parallel Language-specific Neuron Detection (PLND) to identify activated neurons for inputs in different languages without any labeled data. Using PLND, we validate MWork through extensive experiments involving the deactivation of language-specific neurons across various layers and structures. Moreover, MWork allows fine-tuning of language-specific neurons with a small dataset, enhancing multilingual abilities in a specific language without compromising others. This approach results in an average improvement of 3.6% for high-resource languages and 2.3% for low-resource languages across all tasks with just 400 documents.
Submitted 24 May, 2024; v1 submitted 28 February, 2024;
originally announced February 2024.
-
SeaLLMs -- Large Language Models for Southeast Asia
Authors:
Xuan-Phi Nguyen,
Wenxuan Zhang,
Xin Li,
Mahani Aljunied,
Zhiqiang Hu,
Chenhui Shen,
Yew Ken Chia,
Xingxuan Li,
Jianyu Wang,
Qingyu Tan,
Liying Cheng,
Guanzheng Chen,
Yue Deng,
Sen Yang,
Chaoqun Liu,
Hang Zhang,
Lidong Bing
Abstract:
Despite the remarkable achievements of large language models (LLMs) in various tasks, there remains a linguistic bias that favors high-resource languages, such as English, often at the expense of low-resource and regional languages. To address this imbalance, we introduce SeaLLMs, an innovative series of language models that specifically focuses on Southeast Asian (SEA) languages. SeaLLMs are built upon the Llama-2 model and further advanced through continued pre-training with an extended vocabulary, specialized instruction and alignment tuning to better capture the intricacies of regional languages. This allows them to respect and reflect local cultural norms, customs, stylistic preferences, and legal considerations. Our comprehensive evaluation demonstrates that SeaLLM-13b models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities relative to comparable open-source models. Moreover, they outperform ChatGPT-3.5 in non-Latin languages, such as Thai, Khmer, Lao, and Burmese, by large margins while remaining lightweight and cost-effective to operate.
Submitted 1 July, 2024; v1 submitted 1 December, 2023;
originally announced December 2023.
-
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
Authors:
Sicong Leng,
Hang Zhang,
Guanzheng Chen,
Xin Li,
Shijian Lu,
Chunyan Miao,
Lidong Bing
Abstract:
Large Vision-Language Models (LVLMs) have advanced considerably, intertwining visual recognition and language understanding to generate content that is not only coherent but also contextually attuned. Despite their success, LVLMs still suffer from the issue of object hallucinations, where models generate plausible yet incorrect outputs that include objects that do not exist in the images. To mitigate this issue, we introduce Visual Contrastive Decoding (VCD), a simple and training-free method that contrasts output distributions derived from original and distorted visual inputs. The proposed VCD effectively reduces the over-reliance on statistical bias and unimodal priors, two essential causes of object hallucinations. This adjustment ensures the generated content is closely grounded to visual inputs, resulting in contextually accurate outputs. Our experiments show that VCD, without either additional training or the usage of external tools, significantly mitigates the object hallucination issue across different LVLM families. Beyond mitigating object hallucinations, VCD also excels in general LVLM benchmarks, highlighting its wide-ranging applicability.
Submitted 28 November, 2023;
originally announced November 2023.
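The idea of contrasting output distributions from original and distorted visual inputs, as described in the Visual Contrastive Decoding abstract above, can be sketched on toy next-token logits. The combination rule `(1 + alpha) * original - alpha * distorted`, the `alpha` parameter, the function names, and the three-token vocabulary are assumptions made for illustration; the abstract does not specify them, and no real model is involved.

```python
import math


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_distribution(logits_original, logits_distorted, alpha=1.0):
    """Contrast logits from the original image against a distorted one.

    Assumed rule: boost tokens supported by the clean image and penalise
    tokens that stay likely even when the image is distorted (a sign that
    they come from language priors rather than from the visual input).
    """
    if len(logits_original) != len(logits_distorted):
        raise ValueError("logit vectors must have the same length")
    combined = [
        (1 + alpha) * o - alpha * d
        for o, d in zip(logits_original, logits_distorted)
    ]
    return softmax(combined)


# Hypothetical vocabulary: ["dog", "frisbee", "bench"].
# "frisbee" stays likely under distortion, so it is treated as prior-driven.
logits_original = [2.0, 1.8, 0.2]
logits_distorted = [0.5, 1.8, 0.2]

plain = softmax(logits_original)
contrasted = contrastive_distribution(logits_original, logits_distorted)
print("plain:     ", [round(p, 3) for p in plain])
print("contrasted:", [round(p, 3) for p in contrasted])
```

In this toy setup the contrasted distribution shifts probability from the prior-driven token toward the token that depends on the clean image.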
-
Towards Robust Temporal Reasoning of Large Language Models via a Multi-Hop QA Dataset and Pseudo-Instruction Tuning
Authors:
Qingyu Tan,
Hwee Tou Ng,
Lidong Bing
Abstract:
Knowledge in the real world is being updated constantly. However, it is costly to frequently update large language models (LLMs). Therefore, it is crucial for LLMs to understand the concept of temporal knowledge. However, prior works on temporal question answering (TQA) did not emphasize multi-answer and multi-hop types of temporal reasoning. In this paper, we propose a complex temporal question-answering dataset Complex-TR that focuses on multi-answer and multi-hop temporal reasoning. Besides, we also propose a novel data augmentation strategy to improve the complex temporal reasoning capability and robustness of LLMs. We conducted experiments on multiple temporal QA datasets. Experimental results show that our method is able to improve LLMs' performance on temporal QA benchmarks by significant margins. Our code and data are released at: https://github.com/nusnlp/complex-tr.
Submitted 12 July, 2024; v1 submitted 16 November, 2023;
originally announced November 2023.
-
Neuro-Symbolic Integration Brings Causal and Reliable Reasoning Proofs
Authors:
Sen Yang,
Xin Li,
Leyang Cui,
Lidong Bing,
Wai Lam
Abstract:
Though prompting LLMs with various reasoning structures produces reasoning proofs along with answers, these proofs are not ensured to be causal and reliable due to the inherent defects of LLMs. Tracking such deficiencies, we present a neuro-symbolic integration method, in which a neural LLM is used to represent the knowledge of the problem while an LLM-free symbolic solver is adopted to do deliberative reasoning using the knowledge. Specifically, our customized meta-interpreters allow the production of reasoning proofs and support flexible search strategies. These reasoning proofs are ensured to be causal and reliable because of the deterministic executing nature of the symbolic solvers. Empirically, on ProofWriter, our method surpasses the CoT baseline by nearly double in accuracy and more than triple in proof similarity. On GSM8K, our method also shows accuracy improvements and nearly doubled proof similarity. Our code is released at https://github.com/DAMO-NLP-SG/CaRing
Submitted 16 November, 2023;
originally announced November 2023.
-
Contrastive Chain-of-Thought Prompting
Authors:
Yew Ken Chia,
Guizhen Chen,
Luu Anh Tuan,
Soujanya Poria,
Lidong Bing
Abstract:
Despite the success of chain of thought in enhancing language model reasoning, the underlying process remains less well understood. Although logically sound reasoning appears inherently crucial for chain of thought, prior studies surprisingly reveal minimal impact when using invalid demonstrations instead. Furthermore, the conventional chain of thought does not inform language models on what mistakes to avoid, which potentially leads to more errors. Hence, inspired by how humans can learn from both positive and negative examples, we propose contrastive chain of thought to enhance language model reasoning. Compared to the conventional chain of thought, our approach provides both valid and invalid reasoning demonstrations, to guide the model to reason step-by-step while reducing reasoning mistakes. To improve generalization, we introduce an automatic method to construct contrastive demonstrations. Our experiments on reasoning benchmarks demonstrate that contrastive chain of thought can serve as a general enhancement of chain-of-thought prompting.
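As an illustration of the idea described in this abstract, the following is a minimal sketch of how a prompt pairing valid and invalid reasoning demonstrations might be assembled. The `Demo` class, the `build_contrastive_prompt` function, and the "Correct explanation" / "Wrong explanation" labels are hypothetical names chosen for this example; they are not taken from the paper, and the authors' automatic method for constructing invalid demonstrations is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    """One few-shot demonstration (hypothetical structure for illustration)."""
    question: str
    valid_rationale: str    # correct step-by-step reasoning
    invalid_rationale: str  # deliberately flawed reasoning to avoid
    answer: str


def build_contrastive_prompt(demos: List[Demo], query: str) -> str:
    """Concatenate demonstrations that show both a valid and an invalid
    rationale, then append the new question for the model to complete."""
    parts = []
    for d in demos:
        parts.append(
            f"Question: {d.question}\n"
            f"Correct explanation: {d.valid_rationale}\n"
            f"Wrong explanation: {d.invalid_rationale}\n"
            f"Answer: {d.answer}\n"
        )
    # The model is asked to continue with a correct explanation only.
    parts.append(f"Question: {query}\nCorrect explanation:")
    return "\n".join(parts)


if __name__ == "__main__":
    demo = Demo(
        question="Tom has 3 apples and buys 2 more. How many apples does he have?",
        valid_rationale="He starts with 3 and adds 2, so 3 + 2 = 5.",
        invalid_rationale="He starts with 2 and adds 5, so 2 + 5 = 7.",
        answer="5",
    )
    print(build_contrastive_prompt([demo], "Ann has 4 pens and loses 1. How many remain?"))
```

The resulting string would then be sent to a language model of the reader's choice; which labels and ordering work best is an empirical question that this sketch does not address.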
Submitted 15 November, 2023;
originally announced November 2023.
-
Exploring the Potential of Large Language Models in Computational Argumentation
Authors:
Guizhen Chen,
Liying Cheng,
Luu Anh Tuan,
Lidong Bing
Abstract:
Computational argumentation has become an essential tool in various domains, including law, public policy, and artificial intelligence. It is an emerging research field in natural language processing that attracts increasing attention. Research on computational argumentation mainly involves two types of tasks: argument mining and argument generation. As large language models (LLMs) have demonstrated impressive capabilities in understanding context and generating natural language, it is worthwhile to evaluate the performance of LLMs on diverse computational argumentation tasks. This work aims to embark on an assessment of LLMs, such as ChatGPT, Flan models, and LLaMA2 models, in both zero-shot and few-shot settings. We organize existing tasks into six main categories and standardize the format of fourteen openly available datasets. In addition, we present a new benchmark dataset on counter speech generation that aims to holistically evaluate the end-to-end performance of LLMs on argument mining and argument generation. Extensive experiments show that LLMs exhibit commendable performance across most of the datasets, demonstrating their capabilities in the field of argumentation. Our analysis offers valuable suggestions for evaluating computational argumentation and its integration with LLMs in future research endeavors.
Submitted 1 July, 2024; v1 submitted 15 November, 2023;
originally announced November 2023.
-
An Introduction to Natural Language Processing Techniques and Framework for Clinical Implementation in Radiation Oncology
Authors:
Reza Khanmohammadi,
Mohammad M. Ghassemi,
Kyle Verdecchia,
Ahmed I. Ghanem,
Luo Bing,
Indrin J. Chetty,
Hassan Bagher-Ebadian,
Farzan Siddiqui,
Mohamed Elshaikh,
Benjamin Movsas,
Kundan Thind
Abstract:
Natural Language Processing (NLP) is a key technique for developing Medical Artificial Intelligence (AI) systems that leverage Electronic Health Record (EHR) data to build diagnostic and prognostic models. NLP enables the conversion of unstructured clinical text into structured data that can be fed into AI algorithms. The emergence of the transformer architecture and large language models (LLMs) has led to remarkable advances in NLP for various healthcare tasks, such as entity recognition, relation extraction, sentence similarity, text summarization, and question answering. In this article, we review the major technical innovations that underpin modern NLP models and present state-of-the-art NLP applications that employ LLMs in radiation oncology research. However, these LLMs are prone to many errors such as hallucinations, biases, and ethical violations, which necessitate rigorous evaluation and validation before clinical deployment. As such, we propose a comprehensive framework for assessing the NLP models based on their purpose and clinical fit, technical performance, bias and trust, legal and ethical implications, and quality assurance, prior to implementation in clinical radiation oncology. Our article aims to provide guidance and insights for researchers and clinicians who are interested in developing and using NLP models in clinical radiation oncology.
Submitted 8 November, 2023; v1 submitted 3 November, 2023;
originally announced November 2023.
-
SOUL: Towards Sentiment and Opinion Understanding of Language
Authors:
Yue Deng,
Wenxuan Zhang,
Sinno Jialin Pan,
Lidong Bing
Abstract:
Sentiment analysis is a well-established natural language processing task, with sentiment polarity classification being one of its most popular and representative tasks. However, despite the success of pre-trained language models in this area, they often fall short of capturing the broader complexities of sentiment analysis. To address this issue, we propose a new task called Sentiment and Opinion Understanding of Language (SOUL). SOUL aims to evaluate sentiment understanding through two subtasks: Review Comprehension (RC) and Justification Generation (JG). RC seeks to validate statements that focus on subjective information based on a review text, while JG requires models to provide explanations for their sentiment predictions. To enable comprehensive evaluation, we annotate a new dataset comprising 15,028 statements from 3,638 reviews. Experimental results indicate that SOUL is a challenging task for both small and large language models, with a performance gap of up to 27% when compared to human performance. Furthermore, evaluations conducted with both human experts and GPT-4 highlight the limitations of the small language model in generating reasoning-based justifications. These findings underscore the challenging nature of the SOUL task for existing models, emphasizing the need for further advancements in sentiment analysis to address its complexities. The new dataset and code are available at https://github.com/DAMO-NLP-SG/SOUL.
Submitted 27 October, 2023;
originally announced October 2023.
-
NIKA2 observations of dust grain evolution from star-forming filament to T-Tauri disk: Preliminary results from NIKA2 observations of the Taurus B211/B213 filament
Authors:
Q. Nguyen-Luong,
R. Adam,
P. Ade,
H. Ajeddig,
P. André,
E. Artis,
H. Aussel,
A. Beelen,
A. Benoît,
S. Berta,
L. Bing,
O. Bourrion,
M. Calvo,
A. Catalano,
M. De Petris,
F. -X. Désert,
S. Doyle,
E. F. C. Driessen,
G. Ejlali,
A. Gomez,
J. Goupy,
C. Hanser,
S. Katsioli,
F. Kéruzoré,
C. Kramer
, et al. (29 additional authors not shown)
Abstract:
To understand the evolution of dust properties in molecular clouds in the course of the star formation process, we constrain the changes in the dust emissivity index from star-forming filaments to prestellar and protostellar cores to T Tauri stars. Using the NIKA2 continuum camera on the IRAM 30 m telescope, we observed the Taurus B211/B213 filament at 1.2 mm and 2 mm with unprecedented sensitivity and used the resulting maps to derive the dust emissivity index $β$. Our sample of 105 objects detected in the $β$ map of the B211/B213 filament indicates that, overall, $β$ decreases from filament and prestellar cores ($β\sim 2\pm0.5$) to protostellar cores ($β\sim 1.2 \pm 0.2$) to T-Tauri protoplanetary disk ($β< 1$). The averaged dust emissivity index $β$ across the B211/B213 filament exhibits a flat ($β\sim 2\pm0.3$) profile. This may imply that dust grain sizes are rather homogeneous in the filament, start to grow significantly in size only after the onset of the gravitational contraction/collapse of prestellar cores to protostars, reaching big sizes in T Tauri protoplanetary disks. This evolution from the parent filament to T-Tauri disks happens on a timescale of about 1-2 Myr.
Submitted 25 October, 2023;
originally announced October 2023.
-
CLEX: Continuous Length Extrapolation for Large Language Models
Authors:
Guanzheng Chen,
Xin Li,
Zaiqiao Meng,
Shangsong Liang,
Lidong Bing
Abstract:
Transformer-based Large Language Models (LLMs) are pioneering advances in many natural language processing tasks; however, their exceptional capabilities are restricted to the preset context window of the Transformer. Position Embedding (PE) scaling methods, while effective in extending the context window to a specific length, either show notable limitations in their extrapolation abilities or sacrifice some performance within the context window. Length extrapolation methods, although theoretically capable of extending the context window beyond the training sequence length, often underperform in practical long-context applications. To address these challenges, we propose Continuous Length EXtrapolation (CLEX) for LLMs. We generalise the PE scaling approaches to model the continuous dynamics by ordinary differential equations over the length scaling factor, thereby overcoming the constraints of current PE scaling methods designed for specific lengths. Moreover, by extending the dynamics to desired context lengths beyond the training sequence length, CLEX facilitates the length extrapolation with impressive performance in practical tasks. We demonstrate that CLEX can be seamlessly incorporated into LLMs equipped with Rotary Position Embedding, such as LLaMA and GPT-NeoX, with negligible impact on training and inference latency. Experimental results reveal that CLEX can effectively extend the context window to over 4x or almost 8x training length, with no deterioration in performance. Furthermore, when evaluated on the practical LongBench benchmark, our model trained on a 4k length exhibits competitive performance against state-of-the-art open-source models trained on context lengths up to 32k. Our code is available at https://github.com/DAMO-NLP-SG/CLEX.
Submitted 24 March, 2024; v1 submitted 25 October, 2023;
originally announced October 2023.
-
Once Upon a Time in Graph: Relative-Time Pretraining for Complex Temporal Reasoning
Authors:
Sen Yang,
Xin Li,
Lidong Bing,
Wai Lam
Abstract:
Our physical world is constantly evolving over time, rendering challenges for pre-trained language models to understand and reason over the temporal contexts of texts. Existing work focuses on strengthening the direct association between a piece of text and its time-stamp. However, the knowledge-time association is usually insufficient for the downstream tasks that require reasoning over temporal dependencies between knowledge. In this work, we make use of the underlying nature of time (all temporally-scoped sentences are strung together through a one-dimensional time axis) and suggest creating a graph structure based on the relative placements of events along the time axis. Inspired by the graph view, we propose RemeMo (Relative Time Modeling), which explicitly connects all temporally-scoped facts by modeling the time relations between any two sentences. Experimental results show that RemeMo outperforms the baseline T5 on multiple temporal question answering datasets under various settings. Further analysis suggests that RemeMo is especially good at modeling long-range complex temporal dependencies. We release our code and pre-trained checkpoints at https://github.com/DAMO-NLP-SG/RemeMo.
Submitted 23 October, 2023;
originally announced October 2023.
-
Large Language Models can Contrastively Refine their Generation for Better Sentence Representation Learning
Authors:
Huiming Wang,
Zhaodonghui Li,
Liying Cheng,
Soh De Wen,
Lidong Bing
Abstract:
Recently, large language models (LLMs) have emerged as a groundbreaking technology and their unparalleled text generation capabilities have sparked interest in their application to the fundamental sentence representation learning task. Existing methods have explored utilizing LLMs as data annotators to generate synthesized data for training contrastive learning based sentence embedding models such as SimCSE. However, since contrastive learning models are sensitive to the quality of sentence pairs, the effectiveness of these methods is largely influenced by the content generated from LLMs, highlighting the need for more refined generation in the context of sentence representation learning. Building upon this premise, we propose MultiCSR, a multi-level contrastive sentence representation learning framework that decomposes the process of prompting LLMs to generate a corpus for training base sentence embedding models into three stages (i.e., sentence generation, sentence pair construction, in-batch training) and refines the generated content at these three distinct stages, ensuring only high-quality sentence pairs are utilized to train a base contrastive learning model. Our extensive experiments reveal that MultiCSR enables a less advanced LLM to surpass the performance of ChatGPT, while applying it to ChatGPT achieves better state-of-the-art results. Comprehensive analyses further underscore the potential of our framework in various application scenarios and achieving better sentence representation learning with LLMs.
Submitted 17 May, 2024; v1 submitted 16 October, 2023;
originally announced October 2023.
-
Towards the first mean pressure profile estimate with the NIKA2 Sunyaev-Zeldovich Large Program
Authors:
C. Hanser,
R. Adam,
P. Ade,
H. Ajeddig,
P. André,
E. Artis,
H. Aussel,
I. Bartalucci,
A. Beelen,
A. Benoît,
S. Berta,
L. Bing,
O. Bourrion,
M. Calvo,
A. Catalano,
M. De Petris,
F. -X. Désert,
S. Doyle,
E. F. C. Driessen,
G. Ejlali,
A. Ferragamo,
A. Gomez,
J. Goupy,
S. Katsioli,
F. Kéruzoré
, et al. (29 additional authors not shown)
Abstract:
High-resolution mapping of the hot gas in galaxy clusters is a key tool for cluster-based cosmological analyses. Taking advantage of the NIKA2 millimeter camera operated at the IRAM 30-m telescope, the NIKA2 SZ Large Program seeks to get a high-resolution follow-up of 38 galaxy clusters covering a wide mass range at intermediate to high redshift. The measured SZ fluxes will be essential to calibrate the SZ scaling relation and the galaxy clusters mean pressure profile, needed for the cosmological exploitation of SZ surveys. We present in this study a method to infer a mean pressure profile from cluster observations. We have designed a pipeline encompassing the map-making and the thermodynamical properties estimates from maps. We then combine all the individual fits, propagating the uncertainties on integrated quantities, such as $R_{500}$ or $P_{500}$, and the intrinsic scatter coming from the deviation to the standard self-similar model. We validate the proposed method on realistic LPSZ-like cluster simulations.
Submitted 13 December, 2023; v1 submitted 11 October, 2023;
originally announced October 2023.
-
Multilingual Jailbreak Challenges in Large Language Models
Authors:
Yue Deng,
Wenxuan Zhang,
Sinno Jialin Pan,
Lidong Bing
Abstract:
While large language models (LLMs) exhibit remarkable capabilities across a wide range of tasks, they pose potential safety concerns, such as the "jailbreak" problem, wherein malicious instructions can manipulate LLMs to exhibit undesirable behavior. Although several preventive measures have been developed to mitigate the potential risks associated with LLMs, they have primarily focused on English. In this study, we reveal the presence of multilingual jailbreak challenges within LLMs and consider two potential risky scenarios: unintentional and intentional. The unintentional scenario involves users querying LLMs using non-English prompts and inadvertently bypassing the safety mechanisms, while the intentional scenario concerns malicious users combining malicious instructions with multilingual prompts to deliberately attack LLMs. The experimental results reveal that in the unintentional scenario, the rate of unsafe content increases as the availability of languages decreases. Specifically, low-resource languages exhibit about three times the likelihood of encountering harmful content compared to high-resource languages, with both ChatGPT and GPT-4. In the intentional scenario, multilingual prompts can exacerbate the negative impact of malicious instructions, with astonishingly high rates of unsafe output: 80.92% for ChatGPT and 40.71% for GPT-4. To handle such a challenge in the multilingual context, we propose a novel Self-Defense framework that automatically generates multilingual training data for safety fine-tuning. Experimental results show that ChatGPT fine-tuned with such data can achieve a substantial reduction in unsafe content generation. Data is available at https://github.com/DAMO-NLP-SG/multilingual-safety-for-LLMs.
Submitted 3 March, 2024; v1 submitted 10 October, 2023;
originally announced October 2023.
-
IAS/CEA Evolution of Dust in Nearby Galaxies (ICED): the spatially-resolved dust properties of NGC4254
Authors:
L. Pantoni,
R. Adam,
P. Ade,
H. Ajeddig,
P. André,
E. Artis,
H. Aussel,
M. Baes,
A. Beelen,
A. Benoît,
S. Berta,
L. Bing,
O. Bourrion,
M. Calvo,
A. Catalano,
M. De Petris,
F. -X. Désert,
S. Doyle,
E. F. C. Driessen,
G. Ejlali,
F. Galliano,
A. Gomez,
J. Goupy,
A. P. Jones,
C. Hanser
, et al. (35 additional authors not shown)
Abstract:
We present the first preliminary results of the project ICED, focusing on the face-on galaxy NGC4254. We use the millimetre maps observed with NIKA2 at IRAM-30m, as part of the IMEGIN Guaranteed Time Large Program, together with a wide collection of publicly available ancillary data (multi-wavelength photometry and gas phase spectral lines). We derive the global and local properties of interstellar dust grains through infrared-to-radio spectral energy distribution fitting, using the hierarchical Bayesian code HerBIE, which includes the grain properties of the state-of-the-art dust model, THEMIS. Our method allows us to get the following dust parameters: dust mass, average interstellar radiation field, and fraction of small grains. Also, it is effective in retrieving the intrinsic correlations between dust parameters and interstellar medium properties. We find an evident anti-correlation between the interstellar radiation field and the fraction of small grains in the centre of NGC4254, meaning that, at strong radiation field intensities, very small amorphous carbon grains are efficiently destroyed by the ultra-violet photons coming from newly formed stars, through photo-desorption and sublimation. We observe a flattening of the anti-correlation at larger radial distances, which may be driven by the steep metallicity gradient measured in NGC4254.
Submitted 10 October, 2023;
originally announced October 2023.
-
NIKA2 observations of 3 low-mass galaxy clusters at $z \sim 1$: pressure profile and $Y_{\rm SZ}$-$M$ relation
Authors:
R. Adam,
M. Ricci,
D. Eckert,
P. Ade,
H. Ajeddig,
B. Altieri,
P. André,
E. Artis,
H. Aussel,
A. Beelen,
C. Benoist,
A. Benoît,
S. Berta,
L. Bing,
M. Birkinshaw,
O. Bourrion,
D. Boutigny,
M. Bremer,
M. Calvo,
A. Cappi,
A. Catalano,
M. De Petris,
F. -X. Désert,
S. Doyle,
E. F. C. Driessen
, et al. (42 additional authors not shown)
Abstract:
Three galaxy clusters selected from the XXL X-ray survey at high redshift and low mass ($z\sim1$ and $M_{500} \sim 1-2 \times 10^{14}$ M$_{\odot}$) were observed with NIKA2 to image their Sunyaev-Zel'dovich effect (SZ) signal. They all present an SZ morphology, together with the comparison with X-ray and optical data, that indicates dynamical activity related to merging events. Despite their disturbed intracluster medium, their high redshifts, and their low masses, the three clusters follow remarkably well the pressure profile and the SZ flux-mass relation expected from standard evolution. This suggests that the physics that drives cluster formation is already in place at $z \sim 1$ down to $M_{500} \sim 10^{14}$ M$_{\odot}$.
Submitted 13 October, 2023; v1 submitted 10 October, 2023;
originally announced October 2023.
-
The XXL Survey LI. Pressure profile and $Y_{\rm SZ}$-$M$ scaling relation in three low-mass galaxy clusters at $z\sim1$ observed with NIKA2
Authors:
R. Adam,
M. Ricci,
D. Eckert,
P. Ade,
H. Ajeddig,
B. Altieri,
P. André,
E. Artis,
H. Aussel,
A. Beelen,
C. Benoist,
A. Benoît,
S. Berta,
L. Bing,
M. Birkinshaw,
O. Bourrion,
D. Boutigny,
M. Bremer,
M. Calvo,
A. Cappi,
A. Catalano,
M. De Petris,
F. -X. Désert,
S. Doyle,
E. F. C. Driessen
, et al. (42 additional authors not shown)
Abstract:
The thermodynamical properties of the intracluster medium (ICM) are driven by scale-free gravitational collapse, but they also reflect the rich astrophysical processes at play in galaxy clusters. At low masses ($\sim 10^{14}$ M$_{\odot}$) and high redshift ($z \gtrsim 1$), these properties remain poorly constrained observationally, due to the difficulty in obtaining resolved and sensitive data. This paper aims at investigating the inner structure of the ICM as seen through the Sunyaev-Zel'dovich (SZ) effect in this regime of mass and redshift. Focus is set on the thermal pressure profile and the scaling relation between SZ flux and mass, namely the $Y_{\rm SZ} - M$ scaling relation. The three galaxy clusters XLSSC 072 ($z=1.002$), XLSSC 100 ($z=0.915$), and XLSSC 102 ($z=0.969$), with $M_{500} \sim 2 \times 10^{14}$ M$_{\odot}$, were selected from the XXL X-ray survey and observed with the NIKA2 millimeter camera to image their SZ signal. XMM-Newton X-ray data were used in complement to the NIKA2 data to derive masses based on the $Y_X - M$ relation and the hydrostatic equilibrium. The SZ images of the three clusters, along with the X-ray and optical data, indicate dynamical activity related to merging events. The pressure profile is consistent with that expected for morphologically disturbed systems, with a relatively flat core and a shallow outer slope. Despite significant disturbances in the ICM, the three high-redshift low-mass clusters follow remarkably well the $Y_{\rm SZ}-M$ relation expected from standard evolution. These results indicate that the dominant physics that drives cluster evolution is already in place by $z \sim 1$, at least for systems with masses above $M_{500} \sim 10^{14}$ M$_{\odot}$.
Submitted 28 March, 2024; v1 submitted 9 October, 2023;
originally announced October 2023.
-
The NIKA2 Sunyaev-Zeldovich Large Program: Sample and upcoming product public release
Authors:
L. Perotto,
R. Adam,
P. Ade,
H. Ajeddig,
P. André,
E. Artis,
H. Aussel,
R. Barrena,
I. Bartalucci,
A. Beelen,
A. Benoît,
S. Berta,
L. Bing,
O. Bourrion,
M. Calvo,
A. Catalano,
M. De Petris,
F. -X. Désert,
S. Doyle,
E. F. C. Driessen,
G. Ejlali,
A. Ferragamo,
A. Gomez,
J. Goupy,
C. Hanser
, et al. (30 additional authors not shown)
Abstract:
The NIKA2 camera operating at the IRAM 30 m telescope excels in high-angular resolution mapping of the thermal Sunyaev-Zeldovich effect towards galaxy clusters at intermediate and high-redshift. As part of the NIKA2 guaranteed time, the SZ Large Program (LPSZ) aims at tSZ-mapping a representative sample of SZ-selected galaxy clusters in the catalogues of the Planck satellite and of the Atacama Cosmology Telescope, and also observed in X-ray with XMM Newton or Chandra. Having completed observations in January 2023, we present tSZ maps of 38 clusters spanning the targeted mass ($3 < M_{500}/10^{14} M_{\odot} < 10$) and redshift ($0.5 < z < 0.9$) ranges. The first in depth studies of individual clusters highlight the potential of combining tSZ and X-ray observations at similar angular resolution for accurate mass measurements. These were milestones for the development of a standard data analysis pipeline to go from NIKA2 raw data to the thermodynamic properties of galaxy clusters for the upcoming LPSZ data release. Final products will include unprecedented measurements of the mean pressure profile and mass observable scaling relation using a distinctive SZ-selected sample, which will be key for ultimately improving the accuracy of cluster based cosmology.
Submitted 6 October, 2023;
originally announced October 2023.
-
Exploring the interstellar medium of NGC 891 at millimeter wavelengths using the NIKA2 camera
Authors:
S. Katsioli,
R. Adam,
P. Ade,
H. Ajeddig,
P. André,
E. Artis,
H. Aussel,
M. Baes,
A. Beelen,
A. Benoît,
S. Berta,
L. Bing,
O. Bourrion,
M. Calvo,
A. Catalano,
C. J. R. Clark,
I. De Looze,
M. De Petris,
F. -X. Désert,
S. Doyle,
E. F. C. Driessen,
G. Ejlali,
M. Galametz,
F. Galliano,
A. Gomez
, et al. (39 additional authors not shown)
Abstract:
In the framework of the IMEGIN Large Program, we used the NIKA2 camera on the IRAM 30-m telescope to observe the edge-on galaxy NGC 891 at 1.15 mm and 2 mm and at a FWHM of 11.1" and 17.6", respectively. Multiwavelength data enriched with the new NIKA2 observations fitted by the HerBIE SED code (coupled with the THEMIS dust model) were used to constrain the physical properties of the ISM. Emission originating from the diffuse dust disk is detected at all wavelengths from mid-IR to mm, while mid-IR observations reveal warm dust emission from compact HII regions. Indications of mm excess emission have also been found in the outer parts of the galactic disk. Furthermore, our SED fitting analysis constrained the mass fraction of the small (< 15 Angstrom) dust grains. We found that small grains constitute 9.5% of the total dust mass in the galactic plane, but this fraction increases up to ~ 20% at large distances (|z| > 3 kpc) from the galactic plane.
Submitted 6 October, 2023;
originally announced October 2023.
-
Constraining Millimeter Dust Emission in Nearby Galaxies with NIKA2: the case of NGC2146 and NGC2976
Authors:
G. Ejlali,
R. Adam,
P. Ade,
H. Ajeddig,
P. André,
E. Artis,
H. Aussel,
M. Baes,
A. Beelen,
A. Benoît,
S. Berta,
L. Bing,
O. Bourrion,
M. Calvo,
A. Catalano,
M. De Petris,
F. -X. Désert,
S. Doyle,
E. F. C. Driessen,
F. Galliano,
A. Gomez,
J. Goupy,
A. P. Jones,
C. Hanser,
A. Hughes, et al. (35 additional authors not shown)
Abstract:
This study presents the first millimeter continuum mapping observations of two nearby galaxies, the starburst spiral galaxy NGC2146 and the dwarf galaxy NGC2976, at 1.15 mm and 2 mm using the NIKA2 camera on the IRAM 30m telescope, as part of the Guaranteed Time Large Project IMEGIN. These observations provide robust resolved information about the physical properties of dust in nearby galaxies by constraining their FIR-radio SED in the millimeter domain. After subtracting the contribution from the CO line emission, the SEDs are modeled spatially using a Bayesian approach. Maps of dust mass surface density, temperature, emissivity index, and thermal radio component of the galaxies are presented, allowing for a study of the relations between the dust properties and star formation activity (using observations at 24 $\mu$m as a tracer). We report that dust temperature is correlated with star formation rate in both galaxies. The effect of star formation activity on dust temperature is stronger in NGC2976, an indication of the thinner interstellar medium of dwarf galaxies. Moreover, an anti-correlation trend is reported between the dust emissivity index and temperature in both galaxies.
Submitted 5 October, 2023;
originally announced October 2023.
-
Systematic effects on the upcoming NIKA2 LPSZ scaling relation
Authors:
A. Moyer-Anin,
R. Adam,
P. Ade,
H. Ajeddig,
P. André,
E. Artis,
H. Aussel,
I. Bartalucci,
A. Beelen,
A. Benoît,
S. Berta,
L. Bing,
O. Bourrion,
M. Calvo,
A. Catalano,
M. De Petris,
F. -X. Désert,
S. Doyle,
E. F. C. Driessen,
G. Ejlali,
A. Gomez,
J. Goupy,
C. Hanser,
S. Katsioli,
F. Kéruzoré, et al. (27 additional authors not shown)
Abstract:
In cluster cosmology, cluster masses are the main parameter of interest. They are needed to constrain cosmological parameters through the cluster number count. As the mass is not an observable, a scaling relation is needed to link cluster masses to the integrated Compton parameter $Y$, i.e. the Sunyaev-Zeldovich (SZ) observable. Planck cosmological results obtained with cluster number counts are based on a scaling relation measured with clusters at low redshift ($z<0.5$) observed in SZ and X-ray. In the SZ Large Program (LPSZ) of the NIKA2 collaboration, the scaling relation will be obtained with a sample of 38 clusters at intermediate to high redshift ($0.5<z<0.9$) and observed at high angular resolution in both SZ and X-ray. Using analytical simulations of LPSZ-like samples, we take into account the LPSZ selection function and correct for its effects. In addition, we show that white and correlated noise in the SZ maps does not affect the scaling relation estimation.
Submitted 7 December, 2023; v1 submitted 2 October, 2023;
originally announced October 2023.
-
NIKA2 observations of starless cores in Taurus and Perseus
Authors:
C. Kramer,
R. Adam,
P. Ade,
H. Ajeddig,
P. André,
E. Artis,
H. Aussel,
A. Beelen,
A. Benoît,
S. Berta,
L. Bing,
O. Bourrion,
M. Calvo,
P. Caselli,
A. Catalano,
M. De Petris,
F. -X. Désert,
S. Doyle,
E. F. C. Driessen,
G. Ejlali,
A. Fuente,
A. Gomez,
J. Goupy,
C. Hanser,
S. Katsioli, et al. (27 additional authors not shown)
Abstract:
Dusty starless cores play an important role in regulating the initial phases of the formation of stars and planets. In their interiors, dust grains coagulate and ice mantles form, thereby changing the millimeter emissivities and hence the ability to cool. We mapped four regions with more than a dozen cores in the nearby Galactic filaments of Taurus and Perseus using the NIKA2 camera at the IRAM 30-meter telescope. Combining the 1 mm to 2 mm flux ratio maps with dust temperature maps from Herschel allowed us to create maps of the dust emissivity index $\beta_{1,2}$ at resolutions of 2430 and 5600 a.u. in Taurus and Perseus, respectively. Here, we study the variation with total column densities and environment. $\beta_{1,2}$ values at the core centers ($A_V=12-19$ mag) vary significantly between $\sim1.1$ and $2.3$. Several cores show a strong rise of $\beta_{1,2}$ from the outskirts at $\sim4$ mag to the peaks of optical extinction, consistent with the predictions of grain models and the gradual build-up of ice mantles on coagulated grains in the dense interiors of starless cores.
Submitted 4 October, 2023; v1 submitted 2 October, 2023;
originally announced October 2023.
-
The stratification of ISM properties in the edge-on galaxy NGC 891 revealed by NIKA2
Authors:
S. Katsioli,
E. M. Xilouris,
C. Kramer,
R. Adam,
P. Ade,
H. Ajeddig,
P. André,
E. Artis,
H. Aussel,
M. Baes,
A. Beelen,
A. Benoît,
S. Berta,
L. Bing,
O. Bourrion,
M. Calvo,
A. Catalano,
C. J. R. Clark,
I. De Looze,
M. De Petris,
F. -X. Désert,
S. Doyle,
E. F. C. Driessen,
G. Ejlali,
M. Galametz, et al. (38 additional authors not shown)
Abstract:
As the millimeter wavelength range remains a largely unexplored spectral region for galaxies, the IMEGIN large program aims to map the millimeter continuum emission of 22 nearby galaxies at 1.15 and 2 mm. Using the high-resolution maps produced by the NIKA2 camera, we explore the existence of very cold dust and take possible contamination by free-free and synchrotron emission into account. We study the IR-to-radio emission coming from different regions along the galactic plane and at large vertical distances. New observations of NGC 891, using the NIKA2 camera on the IRAM 30m telescope, along with a suite of observations at other wavelengths were used to perform a multiwavelength study of the spectral energy distribution in the interstellar medium in this galaxy. This analysis was performed globally and locally, using the advanced hierarchical Bayesian fitting code, HerBIE, coupled with the THEMIS dust model. Our dust modeling is able to reproduce the near-IR to millimeter emission of NGC 891, with the exception of an excess at a level of 25% obtained by the NIKA2 observations in the outermost parts of the disk. The radio continuum and thermal dust emission are distributed differently in the disk and galaxy halo. Different dusty environments are also revealed by a multiwavelength investigation of the emission features. Our detailed decomposition at millimeter and centimeter wavelengths shows that emission at 1 mm originates purely from dust. Radio components become progressively more important with increasing wavelength. Finally, we find that small dust grains account for ~ 9.5% of the total dust mass, reaching up to 20% at large galactic latitudes. Shock waves in the outflows that shatter the dust grains might explain this higher fraction of small grains in the halo.
Submitted 15 September, 2023;
originally announced September 2023.
-
Massive Optically Dark Galaxies Unveiled by JWST Challenge Galaxy Formation Models
Authors:
Mengyuan Xiao,
Pascal Oesch,
David Elbaz,
Longji Bing,
Erica Nelson,
Andrea Weibel,
Rohan Naidu,
Emanuele Daddi,
Rychard Bouwens,
Jorryt Matthee,
Stijn Wuyts,
John Chisholm,
Gabriel Brammer,
Mark Dickinson,
Benjamin Magnelli,
Lucas Leroy,
Pieter van Dokkum,
Daniel Schaerer,
Thomas Herard-Demanche,
Laia Barrufet,
Ryan Endsley,
Yoshinobu Fudamoto,
Carlos Gómez-Guijarro,
Rashmi Gottumukkala,
Garth Illingworth, et al. (12 additional authors not shown)
Abstract:
Over the past decade, the existence of a substantial population of optically invisible, massive galaxies at $z\gtrsim3$ has been implied from mid-infrared to millimeter observations. With the unprecedented sensitivity of the JWST, such extremely massive galaxy candidates have immediately been identified even at $z>7$, in much larger numbers than expected. These discoveries sparked a heated debate. If confirmed, early, high-mass galaxies challenge the current models of galaxy formation. However, the lack of spectroscopic confirmations leads to uncertain stellar mass ($M_{\star}$) estimates, and the possible presence of active galactic nuclei (AGN) adds further uncertainty. Here, we present the first sample of 36 dust-obscured galaxies with robust spectroscopic redshifts at $z_{\rm spec}=5-9$ from the JWST FRESCO survey. The three most extreme sources at $z\sim5-6$ ($\sim$1 billion years after the Big Bang) are so massive ($\log M_{\star}/M_{\odot} \gtrsim 11.0$) that they would require, on average, about 50% of the baryons in their halos to be converted into stars -- two to three times higher than even the most efficient galaxies at later times. The extended emission of these galaxies suggests limited contribution by AGN. This population of ultra-massive galaxies accounts for 20% of the total cosmic star formation rate density at $z\sim5-6$, suggesting a substantial proportion of extremely efficient star formation in the early Universe.
Submitted 5 September, 2023;
originally announced September 2023.
-
Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts
Authors:
Xuan-Phi Nguyen,
Sharifah Mahani Aljunied,
Shafiq Joty,
Lidong Bing
Abstract:
Large language models (LLMs) are known to effectively perform tasks by simply observing a few exemplars. However, in low-resource languages, obtaining such hand-picked exemplars can still be challenging, and unsupervised techniques may be necessary. Moreover, competent generative capabilities of LLMs are observed only in high-resource languages, while their performance in under-represented languages falls behind due to pre-training data imbalance. To elicit LLMs' abilities in low-resource languages without any supervised data, we propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English. These prompts are then used to create intra-lingual exemplars to perform tasks in the target languages. Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages. We also show that fine-tuning a 7B model on data generated from our method helps it perform competitively with a 175B model. In non-English translation tasks, our method even outperforms supervised prompting by up to 3 chrF++ in many low-resource languages. When evaluated on zero-shot multilingual summarization, our method surpasses other English-pivoting baselines by up to 4 ROUGE-L and is also favored by GPT-4.
Submitted 20 June, 2023;
originally announced June 2023.
-
Class-Adaptive Self-Training for Relation Extraction with Incompletely Annotated Training Data
Authors:
Qingyu Tan,
Lu Xu,
Lidong Bing,
Hwee Tou Ng
Abstract:
Relation extraction (RE) aims to extract relations from sentences and documents. Existing relation extraction models typically rely on supervised machine learning. However, recent studies showed that many RE datasets are incompletely annotated. This is known as the false negative problem, in which valid relations are falsely annotated as 'no_relation'. Models trained with such data inevitably make similar mistakes during the inference stage. Self-training has been proven effective in alleviating the false negative problem. However, traditional self-training is vulnerable to confirmation bias and exhibits poor performance in minority classes. To overcome this limitation, we proposed a novel class-adaptive re-sampling self-training framework. Specifically, we re-sampled the pseudo-labels for each class by precision and recall scores. Our re-sampling strategy favored the pseudo-labels of classes with high precision and low recall, which improved the overall recall without significantly compromising precision. We conducted experiments on document-level and biomedical relation extraction datasets, and the results showed that our proposed self-training framework consistently outperforms existing competitive methods on the Re-DocRED and ChemDisGene datasets when the training data are incompletely annotated. Our code is released at https://github.com/DAMO-NLP-SG/CAST.
Submitted 16 June, 2023;
originally announced June 2023.
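The class-adaptive re-sampling idea in the abstract above can be illustrated with a minimal sketch. The scoring rule `(precision * (1 - recall)) ** beta`, the parameter `beta`, and all function and variable names below are assumptions chosen for illustration; the abstract only states that classes with high precision and low recall are favored, so this is not necessarily the authors' implementation.

```python
import random


def sampling_probability(precision, recall, beta=1.0):
    """Hypothetical keep-probability for one class.

    Assumption: a class with high precision and low recall gets a
    higher probability. The exact rule is illustrative only.
    """
    return (precision * (1.0 - recall)) ** beta


def resample_pseudo_labels(pseudo_labels, class_scores, beta=1.0, seed=0):
    """Keep each pseudo-label with its class-specific probability.

    pseudo_labels: list of (example_id, relation) pairs.
    class_scores: dict mapping relation -> (precision, recall),
                  e.g. measured on a development set.
    """
    rng = random.Random(seed)
    kept = []
    for example_id, relation in pseudo_labels:
        precision, recall = class_scores[relation]
        if rng.random() < sampling_probability(precision, recall, beta):
            kept.append((example_id, relation))
    return kept
```

Under this rule, a class with precision 0.9 and recall 0.2 is kept far more often than one with precision 0.5 and recall 0.8, which matches the qualitative behavior described in the abstract.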
-
Towards Benchmarking and Improving the Temporal Reasoning Capability of Large Language Models
Authors:
Qingyu Tan,
Hwee Tou Ng,
Lidong Bing
Abstract:
Reasoning about time is of fundamental importance. Many facts are time-dependent. For example, athletes change teams from time to time, and different government officials are elected periodically. Previous time-dependent question answering (QA) datasets tend to be biased in either their coverage of time spans or question types. In this paper, we introduce a comprehensive probing dataset TempReason to evaluate the temporal reasoning capability of large language models. Our dataset includes questions of three temporal reasoning levels. In addition, we also propose a novel learning framework to improve the temporal reasoning capability of large language models, based on temporal span extraction and time-sensitive reinforcement learning. We conducted experiments in closed book QA, open book QA, and reasoning QA settings and demonstrated the effectiveness of our approach. Our code and data are released at https://github.com/DAMO-NLP-SG/TempReason.
Submitted 27 June, 2023; v1 submitted 15 June, 2023;
originally announced June 2023.
-
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models
Authors:
Wenxuan Zhang,
Sharifah Mahani Aljunied,
Chang Gao,
Yew Ken Chia,
Lidong Bing
Abstract:
Despite the existence of various benchmarks for evaluating natural language processing models, we argue that human exams are a more suitable means of evaluating general intelligence for large language models (LLMs), as they inherently demand a much wider range of abilities such as language understanding, domain knowledge, and problem-solving skills. To this end, we introduce M3Exam, a novel benchmark sourced from real and official human exam questions for evaluating LLMs in a multilingual, multimodal, and multilevel context. M3Exam exhibits three unique characteristics: (1) multilingualism, encompassing questions from multiple countries that require strong multilingual proficiency and cultural knowledge; (2) multimodality, accounting for the multimodal nature of many exam questions to test the model's multimodal understanding capability; and (3) multilevel structure, featuring exams from three critical educational periods to comprehensively assess a model's proficiency at different levels. In total, M3Exam contains 12,317 questions in 9 diverse languages with three educational levels, where about 23% of the questions require processing images for successful solving. We assess the performance of top-performing LLMs on M3Exam and find that current models, including GPT-4, still struggle with multilingual text, particularly in low-resource and non-Latin script languages. Multimodal LLMs also perform poorly with complex multimodal questions. We believe that M3Exam can be a valuable resource for comprehensively evaluating LLMs by examining their multilingual and multimodal abilities and tracking their development. Data and evaluation code are available at https://github.com/DAMO-NLP-SG/M3Exam.
Submitted 9 November, 2023; v1 submitted 8 June, 2023;
originally announced June 2023.
-
INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models
Authors:
Yew Ken Chia,
Pengfei Hong,
Lidong Bing,
Soujanya Poria
Abstract:
Instruction-tuned large language models have revolutionized natural language processing and have shown great potential in applications such as conversational agents. These models, such as GPT-4, can not only master language but also solve complex tasks in areas like mathematics, coding, medicine, and law. Despite their impressive capabilities, there is still a lack of comprehensive understanding regarding their full potential, primarily due to the black-box nature of many models and the absence of holistic evaluation studies. To address these challenges, we present INSTRUCTEVAL, a more comprehensive evaluation suite designed specifically for instruction-tuned large language models. Unlike previous works, our evaluation involves a rigorous assessment of models based on problem-solving, writing ability, and alignment to human values. We take a holistic approach to analyze various factors affecting model performance, including the pretraining foundation, instruction-tuning data, and training methods. Our findings reveal that the quality of instruction data is the most crucial factor in scaling model performance. While open-source models demonstrate impressive writing abilities, there is substantial room for improvement in problem-solving and alignment. We are encouraged by the rapid development of models by the open-source community, but we also highlight the need for rigorous evaluation to support claims made about these models. Through INSTRUCTEVAL, we aim to foster a deeper understanding of instruction-tuned models and advancements in their capabilities. INSTRUCTEVAL is publicly available at https://github.com/declare-lab/instruct-eval.
Submitted 15 June, 2023; v1 submitted 7 June, 2023;
originally announced June 2023.
-
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Authors:
Hang Zhang,
Xin Li,
Lidong Bing
Abstract:
We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in videos. Video-LLaMA bootstraps cross-modal training from the frozen pre-trained visual and audio encoders and the frozen LLMs. Unlike previous works that complement LLMs to process the visual or audio signals only, Video-LLaMA enables video comprehension by tackling two challenges: (1) capturing the temporal changes in visual scenes, (2) integrating audio-visual signals. To counter the first challenge, we propose a Video Q-former to assemble a pre-trained image encoder into our video encoder and introduce a video-to-text generation task to learn video-language correspondence. For the second challenge, we leverage ImageBind, a universal embedding model aligning multiple modalities, as the pre-trained audio encoder and introduce an Audio Q-former on top of ImageBind to learn reasonable auditory query embeddings for the LLM module. To align the output of both visual and audio encoders with the LLM's embedding space, we first train Video-LLaMA on massive video/image-caption pairs and then tune our model with visual-instruction datasets of moderate amount but higher quality. We found that Video-LLaMA shows the ability to perceive and comprehend video content and generate meaningful responses grounded in the visual and auditory information presented in the videos.
Submitted 25 October, 2023; v1 submitted 5 June, 2023;
originally announced June 2023.
-
AQE: Argument Quadruplet Extraction via a Quad-Tagging Augmented Generative Approach
Authors:
Jia Guo,
Liying Cheng,
Wenxuan Zhang,
Stanley Kok,
Xin Li,
Lidong Bing
Abstract:
Argument mining involves multiple sub-tasks that automatically identify argumentative elements, such as claim detection, evidence extraction, stance classification, etc. However, each subtask alone is insufficient for a thorough understanding of the argumentative structure and reasoning process. To learn a complete view of an argument essay and capture the interdependence among argumentative components, we need to know what opinions people hold (i.e., claims), why those opinions are valid (i.e., supporting evidence), which source the evidence comes from (i.e., evidence type), and how those claims react to the debating topic (i.e., stance). In this work, we propose, for the first time, a challenging argument quadruplet extraction task (AQE), which can provide an all-in-one extraction of four argumentative components, i.e., claims, evidence, evidence types, and stances. To support this task, we construct a large-scale and challenging dataset. However, there is no existing method that can solve the argument quadruplet extraction task. To fill this gap, we propose a novel quad-tagging augmented generative approach, which leverages a quadruplet tagging module to augment the training of the generative framework. The experimental results on our dataset demonstrate the empirical superiority of our proposed approach over several strong baselines.
Submitted 31 May, 2023;
originally announced May 2023.
-
Is GPT-4 a Good Data Analyst?
Authors:
Liying Cheng,
Xingxuan Li,
Lidong Bing
Abstract:
As large language models (LLMs) have demonstrated their powerful capabilities in plenty of domains and tasks, including context understanding, code generation, language generation, data storytelling, etc., many data analysts may raise concerns about whether their jobs will be replaced by artificial intelligence (AI). This controversial topic has drawn great public attention. However, we are still at a stage of divergent opinions without any definitive conclusion. Motivated by this, we raise the research question of "is GPT-4 a good data analyst?" in this work and aim to answer it by conducting head-to-head comparative studies. In detail, we regard GPT-4 as a data analyst to perform end-to-end data analysis with databases from a wide range of domains. We propose a framework to tackle the problems by carefully designing the prompts for GPT-4 to conduct experiments. We also design several task-specific evaluation metrics to systematically compare the performance between several professional human data analysts and GPT-4. Experimental results show that GPT-4 can achieve comparable performance to humans. We also provide in-depth discussions about our results to shed light on further studies before reaching the conclusion that GPT-4 can replace data analysts.
Submitted 22 October, 2023; v1 submitted 24 May, 2023;
originally announced May 2023.
-
Unlocking Temporal Question Answering for Large Language Models Using Code Execution
Authors:
Xingxuan Li,
Liying Cheng,
Qingyu Tan,
Hwee Tou Ng,
Shafiq Joty,
Lidong Bing
Abstract:
Large language models (LLMs) have made significant progress in natural language processing (NLP), and are utilized extensively in various applications. Recent works, such as chain-of-thought (CoT), have shown that intermediate reasoning steps can improve the performance of LLMs for complex reasoning tasks, such as math problems and symbolic question-answering tasks. However, we notice the challenge that LLMs face when it comes to temporal reasoning. Our preliminary experiments show that generating intermediate reasoning steps does not always boost the performance of complex temporal question-answering tasks. Therefore, we propose a novel framework that combines the extraction capability of LLMs and the logical reasoning capability of a Python solver to tackle this issue. Extensive experiments and analysis demonstrate the effectiveness of our framework in handling intricate time-bound reasoning tasks.
Submitted 24 May, 2023;
originally announced May 2023.
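The division of labor described in the abstract above (the LLM extracts structured facts, a Python solver does the temporal logic) can be sketched as follows. The `Fact` record, the function names, and the placeholder facts are all hypothetical; in the framework described, the facts would come from an LLM extraction step, which is replaced here by a hand-written list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One extracted time-bound fact (hypothetical schema)."""
    subject: str
    start_year: int
    end_year: int  # inclusive


def holder_in_year(facts, year):
    """Return the subject whose interval contains `year`, or None."""
    for fact in facts:
        if fact.start_year <= year <= fact.end_year:
            return fact.subject
    return None


def holder_after(facts, subject):
    """Return the subject whose interval starts next after `subject`'s."""
    ordered = sorted(facts, key=lambda fact: fact.start_year)
    for current, following in zip(ordered, ordered[1:]):
        if current.subject == subject:
            return following.subject
    return None


# Placeholder facts standing in for the output of an LLM extraction step.
facts = [
    Fact("Team A", 2010, 2013),
    Fact("Team B", 2014, 2018),
    Fact("Team C", 2019, 2022),
]
```

With these placeholder facts, `holder_in_year(facts, 2015)` and `holder_after(facts, "Team A")` both resolve by interval comparison rather than by free-text reasoning, which is the point of delegating the logic to code.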
-
Sentiment Analysis in the Era of Large Language Models: A Reality Check
Authors:
Wenxuan Zhang,
Yue Deng,
Bing Liu,
Sinno Jialin Pan,
Lidong Bing
Abstract:
Sentiment analysis (SA) has been a long-standing research area in natural language processing. It can offer rich insights into human sentiments and opinions and has thus seen considerable interest from both academia and industry. With the advent of large language models (LLMs) such as ChatGPT, there is great potential for their employment on SA problems. However, the extent to which existing LLMs can be leveraged for different sentiment analysis tasks remains unclear. This paper aims to provide a comprehensive investigation into the capabilities of LLMs in performing various sentiment analysis tasks, from conventional sentiment classification to aspect-based sentiment analysis and multifaceted analysis of subjective texts. We evaluate performance across 13 tasks on 26 datasets and compare the results against small language models (SLMs) trained on domain-specific datasets. Our study reveals that while LLMs demonstrate satisfactory performance in simpler tasks, they lag behind in more complex tasks requiring deeper understanding or structured sentiment information. However, LLMs significantly outperform SLMs in few-shot learning settings, suggesting their potential when annotation resources are limited. We also highlight the limitations of current evaluation practices in assessing LLMs' SA abilities and propose a novel benchmark, SentiEval, for a more comprehensive and realistic evaluation. Data and code from our investigations are available at https://github.com/DAMO-NLP-SG/LLM-Sentiment.
Submitted 24 May, 2023;
originally announced May 2023.
-
Domain-Expanded ASTE: Rethinking Generalization in Aspect Sentiment Triplet Extraction
Authors:
Yew Ken Chia,
Hui Chen,
Wei Han,
Guizhen Chen,
Sharifah Mahani Aljunied,
Soujanya Poria,
Lidong Bing
Abstract:
Aspect Sentiment Triplet Extraction (ASTE) is a subtask of Aspect-Based Sentiment Analysis (ABSA) that considers each opinion term, their expressed sentiment, and the corresponding aspect targets. However, existing methods are limited to the in-domain setting with two domains. Hence, we propose a domain-expanded benchmark to address the in-domain, out-of-domain and cross-domain settings. We support the new benchmark by annotating more than 4000 data samples for two new domains based on hotel and cosmetics reviews. Our analysis of five existing methods shows that while there is a significant gap between in-domain and out-of-domain performance, generative methods have a strong potential for domain generalization. Our datasets, code implementation and models are available at https://github.com/DAMO-NLP-SG/domain-expanded-aste .
Submitted 23 May, 2023;
originally announced May 2023.
-
mPMR: A Multilingual Pre-trained Machine Reader at Scale
Authors:
Weiwen Xu,
Xin Li,
Wai Lam,
Lidong Bing
Abstract:
We present multilingual Pre-trained Machine Reader (mPMR), a novel method for multilingual machine reading comprehension (MRC)-style pre-training. mPMR aims to guide multilingual pre-trained language models (mPLMs) to perform natural language understanding (NLU) including both sequence classification and span extraction in multiple languages. To achieve cross-lingual generalization when only source-language fine-tuning data is available, existing mPLMs solely transfer NLU capability from a source language to target languages. In contrast, mPMR allows the direct inheritance of multilingual NLU capability from the MRC-style pre-training to downstream tasks. Therefore, mPMR acquires better NLU capability for target languages. mPMR also provides a unified solver for tackling cross-lingual span extraction and sequence classification, thereby enabling the extraction of rationales to explain the sentence-pair classification process.
Submitted 22 May, 2023;
originally announced May 2023.
-
Improving Self-training for Cross-lingual Named Entity Recognition with Contrastive and Prototype Learning
Authors:
Ran Zhou,
Xin Li,
Lidong Bing,
Erik Cambria,
Chunyan Miao
Abstract:
In cross-lingual named entity recognition (NER), self-training is commonly used to bridge the linguistic gap by training on pseudo-labeled target-language data. However, due to sub-optimal performance on target languages, the pseudo labels are often noisy and limit the overall performance. In this work, we aim to improve self-training for cross-lingual NER by combining representation learning and pseudo label refinement in one coherent framework. Our proposed method, namely ContProto, mainly comprises two components: (1) contrastive self-training and (2) prototype-based pseudo-labeling. Our contrastive self-training facilitates span classification by separating clusters of different classes, and enhances cross-lingual transferability by producing closely-aligned representations between the source and target language. Meanwhile, prototype-based pseudo-labeling effectively improves the accuracy of pseudo labels during training. We evaluate ContProto on multiple transfer pairs, and experimental results show our method brings in substantial improvements over current state-of-the-art methods.
Submitted 4 June, 2023; v1 submitted 22 May, 2023;
originally announced May 2023.
-
Chain-of-Knowledge: Grounding Large Language Models via Dynamic Knowledge Adapting over Heterogeneous Sources
Authors:
Xingxuan Li,
Ruochen Zhao,
Yew Ken Chia,
Bosheng Ding,
Shafiq Joty,
Soujanya Poria,
Lidong Bing
Abstract:
We present chain-of-knowledge (CoK), a novel framework that augments large language models (LLMs) by dynamically incorporating grounding information from heterogeneous sources. It results in more factual rationales and reduced hallucination in generation. Specifically, CoK consists of three stages: reasoning preparation, dynamic knowledge adapting, and answer consolidation. Given a knowledge-intensive question, CoK first prepares several preliminary rationales and answers while identifying the relevant knowledge domains. If there is no majority consensus among the answers from samples, CoK corrects the rationales step by step by adapting knowledge from the identified domains. These corrected rationales can plausibly serve as a better foundation for the final answer consolidation. Unlike prior studies that primarily use unstructured data, CoK also leverages structured knowledge sources such as Wikidata and tables that provide more reliable factual information. To access both unstructured and structured knowledge sources in the dynamic knowledge adapting stage, we propose an adaptive query generator that allows the generation of queries for various types of query languages, including SPARQL, SQL, and natural sentences. Moreover, to minimize error propagation between rationales, CoK corrects the rationales progressively using preceding corrected rationales to generate and correct subsequent rationales. Extensive experiments show that CoK consistently improves the performance of LLMs on knowledge-intensive tasks across different domains.
Submitted 21 February, 2024; v1 submitted 22 May, 2023;
originally announced May 2023.