-
Improving Reward Models with Synthetic Critiques
Authors:
Zihuiwen Ye,
Fraser Greenlee-Scott,
Max Bartolo,
Phil Blunsom,
Jon Ander Campos,
Matthias Gallé
Abstract:
Reward models (RM) play a critical role in aligning language models through the process of reinforcement learning from human feedback. RMs are trained to predict a score reflecting human preference, which requires significant time and cost for human annotation. Additionally, RMs tend to quickly overfit on superficial features in the training set, hindering their generalization performance on unsee…
▽ More
Reward models (RM) play a critical role in aligning language models through the process of reinforcement learning from human feedback. RMs are trained to predict a score reflecting human preference, which requires significant time and cost for human annotation. Additionally, RMs tend to quickly overfit on superficial features in the training set, hindering their generalization performance on unseen distributions. We propose a novel approach using synthetic natural language critiques generated by large language models to provide additional feedback, evaluating aspects such as instruction following, correctness, and style. This offers richer signals and more robust features for RMs to assess and score on. We demonstrate that high-quality critiques improve the performance and data efficiency of RMs initialized from different pretrained models. Conversely, we also show that low-quality critiques negatively impact performance. Furthermore, incorporating critiques enhances the interpretability and robustness of RM training.
△ Less
Submitted 31 May, 2024;
originally announced May 2024.
-
Aya 23: Open Weight Releases to Further Multilingual Progress
Authors:
Viraat Aryabumi,
John Dang,
Dwarak Talupuru,
Saurabh Dash,
David Cairuz,
Hangyu Lin,
Bharat Venkitesh,
Madeline Smith,
Jon Ander Campos,
Yi Chern Tan,
Kelly Marchisio,
Max Bartolo,
Sebastian Ruder,
Acyr Locatelli,
Julia Kreutzer,
Nick Frosst,
Aidan Gomez,
Phil Blunsom,
Marzieh Fadaee,
Ahmet Üstün,
Sara Hooker
Abstract:
This technical report introduces Aya 23, a family of multilingual language models. Aya 23 builds on the recent release of the Aya model (Üstün et al., 2024), focusing on pairing a highly performant pre-trained model with the recently released Aya collection (Singh et al., 2024). The result is a powerful multilingual large language model serving 23 languages, expanding state-of-art language modelin…
▽ More
This technical report introduces Aya 23, a family of multilingual language models. Aya 23 builds on the recent release of the Aya model (Üstün et al., 2024), focusing on pairing a highly performant pre-trained model with the recently released Aya collection (Singh et al., 2024). The result is a powerful multilingual large language model serving 23 languages, expanding state-of-art language modeling capabilities to approximately half of the world's population. The Aya model covered 101 languages whereas Aya 23 is an experiment in depth vs breadth, exploring the impact of allocating more capacity to fewer languages that are included during pre-training. Aya 23 outperforms both previous massively multilingual models like Aya 101 for the languages it covers, as well as widely used models like Gemma, Mistral and Mixtral on an extensive range of discriminative and generative tasks. We release the open weights for both the 8B and 35B models as part of our continued commitment for expanding access to multilingual progress.
△ Less
Submitted 31 May, 2024; v1 submitted 23 May, 2024;
originally announced May 2024.
-
Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
Authors:
Sander Land,
Max Bartolo
Abstract:
The disconnect between tokenizer creation and model training in language models has been known to allow for certain inputs, such as the infamous SolidGoldMagikarp token, to induce unwanted behaviour. Although such `glitch tokens' that are present in the tokenizer vocabulary, but are nearly or fully absent in training, have been observed across a variety of different models, a consistent way of ide…
▽ More
The disconnect between tokenizer creation and model training in language models has been known to allow for certain inputs, such as the infamous SolidGoldMagikarp token, to induce unwanted behaviour. Although such `glitch tokens' that are present in the tokenizer vocabulary, but are nearly or fully absent in training, have been observed across a variety of different models, a consistent way of identifying them has been missing. We present a comprehensive analysis of Large Language Model (LLM) tokenizers, specifically targeting this issue of detecting untrained and under-trained tokens. Through a combination of tokenizer analysis, model weight-based indicators, and prompting techniques, we develop effective methods for automatically detecting these problematic tokens. Our findings demonstrate the prevalence of such tokens across various models and provide insights into improving the efficiency and safety of language models.
△ Less
Submitted 8 May, 2024;
originally announced May 2024.
-
The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models
Authors:
Hannah Rose Kirk,
Alexander Whitefield,
Paul Röttger,
Andrew Bean,
Katerina Margatina,
Juan Ciro,
Rafael Mosquera,
Max Bartolo,
Adina Williams,
He He,
Bertie Vidgen,
Scott A. Hale
Abstract:
Human feedback plays a central role in the alignment of Large Language Models (LLMs). However, open questions remain about the methods (how), domains (where), people (who) and objectives (to what end) of human feedback collection. To navigate these questions, we introduce PRISM, a new dataset which maps the sociodemographics and stated preferences of 1,500 diverse participants from 75 countries, t…
▽ More
Human feedback plays a central role in the alignment of Large Language Models (LLMs). However, open questions remain about the methods (how), domains (where), people (who) and objectives (to what end) of human feedback collection. To navigate these questions, we introduce PRISM, a new dataset which maps the sociodemographics and stated preferences of 1,500 diverse participants from 75 countries, to their contextual preferences and fine-grained feedback in 8,011 live conversations with 21 LLMs. PRISM contributes (i) wide geographic and demographic participation in human feedback data; (ii) two census-representative samples for understanding collective welfare (UK and US); and (iii) individualised feedback where every rating is linked to a detailed participant profile, thus permitting exploration of personalisation and attribution of sample artefacts. We focus on collecting conversations that centre subjective and multicultural perspectives on value-laden and controversial topics, where we expect the most interpersonal and cross-cultural disagreement. We demonstrate the usefulness of PRISM via three case studies of dialogue diversity, preference diversity, and welfare outcomes, showing that it matters which humans set alignment norms. As well as offering a rich community resource, we advocate for broader participation in AI development and a more inclusive approach to technology design.
△ Less
Submitted 24 April, 2024;
originally announced April 2024.
-
Introducing v0.5 of the AI Safety Benchmark from MLCommons
Authors:
Bertie Vidgen,
Adarsh Agrawal,
Ahmed M. Ahmed,
Victor Akinwande,
Namir Al-Nuaimi,
Najla Alfaraj,
Elie Alhajjar,
Lora Aroyo,
Trupti Bavalatti,
Max Bartolo,
Borhane Blili-Hamelin,
Kurt Bollacker,
Rishi Bomassani,
Marisa Ferrara Boston,
Siméon Campos,
Kal Chakra,
Canyu Chen,
Cody Coleman,
Zacharie Delpierre Coudert,
Leon Derczynski,
Debojyoti Dutta,
Ian Eisenberg,
James Ezick,
Heather Frase,
Brian Fuller
, et al. (75 additional authors not shown)
Abstract:
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-pu…
▽ More
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English), and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform, and downloadable tool, called ModelBench that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; (7) a test specification for the benchmark.
△ Less
Submitted 13 May, 2024; v1 submitted 18 April, 2024;
originally announced April 2024.
-
Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation
Authors:
Jessica Quaye,
Alicia Parrish,
Oana Inel,
Charvi Rastogi,
Hannah Rose Kirk,
Minsuk Kahng,
Erin van Liemt,
Max Bartolo,
Jess Tsang,
Justin White,
Nathan Clement,
Rafael Mosquera,
Juan Ciro,
Vijay Janapa Reddi,
Lora Aroyo
Abstract:
With the rise of text-to-image (T2I) generative AI models reaching wide audiences, it is critical to evaluate model robustness against non-obvious attacks to mitigate the generation of offensive images. By focusing on ``implicitly adversarial'' prompts (those that trigger T2I models to generate unsafe images for non-obvious reasons), we isolate a set of difficult safety issues that human creativit…
▽ More
With the rise of text-to-image (T2I) generative AI models reaching wide audiences, it is critical to evaluate model robustness against non-obvious attacks to mitigate the generation of offensive images. By focusing on ``implicitly adversarial'' prompts (those that trigger T2I models to generate unsafe images for non-obvious reasons), we isolate a set of difficult safety issues that human creativity is well-suited to uncover. To this end, we built the Adversarial Nibbler Challenge, a red-teaming methodology for crowdsourcing a diverse set of implicitly adversarial prompts. We have assembled a suite of state-of-the-art T2I models, employed a simple user interface to identify and annotate harms, and engaged diverse populations to capture long-tail safety issues that may be overlooked in standard testing. The challenge is run in consecutive rounds to enable a sustained discovery and analysis of safety pitfalls in T2I models.
In this paper, we present an in-depth account of our methodology, a systematic study of novel attack strategies and discussion of safety failures revealed by challenge participants. We also release a companion visualization tool for easy exploration and derivation of insights from the dataset. The first challenge round resulted in over 10k prompt-image pairs with machine annotations for safety. A subset of 1.5k samples contains rich human annotations of harm types and attack styles. We find that 14% of images that humans consider harmful are mislabeled as ``safe'' by machines. We have identified new attack strategies that highlight the complexity of ensuring T2I model robustness. Our findings emphasize the necessity of continual auditing and adaptation as new vulnerabilities emerge. We are confident that this work will enable proactive, iterative safety assessments and promote responsible development of T2I models.
△ Less
Submitted 13 May, 2024; v1 submitted 14 February, 2024;
originally announced March 2024.
-
Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning
Authors:
Shivalika Singh,
Freddie Vargus,
Daniel Dsouza,
Börje F. Karlsson,
Abinaya Mahendiran,
Wei-Yin Ko,
Herumb Shandilya,
Jay Patel,
Deividas Mataciunas,
Laura OMahony,
Mike Zhang,
Ramith Hettiarachchi,
Joseph Wilson,
Marina Machado,
Luisa Souza Moura,
Dominik Krzemiński,
Hakimeh Fadaei,
Irem Ergün,
Ifeoma Okoh,
Aisha Alaagib,
Oshan Mudannayake,
Zaid Alyafeai,
Vu Minh Chien,
Sebastian Ruder,
Surya Guthikonda
, et al. (8 additional authors not shown)
Abstract:
Datasets are foundational to many breakthroughs in modern artificial intelligence. Many recent achievements in the space of natural language processing (NLP) can be attributed to the finetuning of pre-trained models on a diverse set of tasks that enables a large language model (LLM) to respond to instructions. Instruction fine-tuning (IFT) requires specifically constructed and annotated datasets.…
▽ More
Datasets are foundational to many breakthroughs in modern artificial intelligence. Many recent achievements in the space of natural language processing (NLP) can be attributed to the finetuning of pre-trained models on a diverse set of tasks that enables a large language model (LLM) to respond to instructions. Instruction fine-tuning (IFT) requires specifically constructed and annotated datasets. However, existing datasets are almost all in the English language. In this work, our primary goal is to bridge the language gap by building a human-curated instruction-following dataset spanning 65 languages. We worked with fluent speakers of languages from around the world to collect natural instances of instructions and completions. Furthermore, we create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages. In total, we contribute four key resources: we develop and open-source the Aya Annotation Platform, the Aya Dataset, the Aya Collection, and the Aya Evaluation Suite. The Aya initiative also serves as a valuable case study in participatory research, involving collaborators from 119 countries. We see this as a valuable framework for future research collaborations that aim to bridge gaps in resources.
△ Less
Submitted 9 February, 2024;
originally announced February 2024.
-
DMLR: Data-centric Machine Learning Research -- Past, Present and Future
Authors:
Luis Oala,
Manil Maskey,
Lilith Bat-Leah,
Alicia Parrish,
Nezihe Merve Gürel,
Tzu-Sheng Kuo,
Yang Liu,
Rotem Dror,
Danilo Brajovic,
Xiaozhe Yao,
Max Bartolo,
William A Gaviria Rojas,
Ryan Hileman,
Rainier Aliment,
Michael W. Mahoney,
Meg Risdal,
Matthew Lease,
Wojciech Samek,
Debojyoti Dutta,
Curtis G Northcutt,
Cody Coleman,
Braden Hancock,
Bernard Koch,
Girmaw Abebe Tadesse,
Bojan Karlaš
, et al. (13 additional authors not shown)
Abstract:
Drawing from discussions at the inaugural DMLR workshop at ICML 2023 and meetings prior, in this report we outline the relevance of community engagement and infrastructure development for the creation of next-generation public datasets that will advance machine learning science. We chart a path forward as a collective effort to sustain the creation and maintenance of these datasets and methods tow…
▽ More
Drawing from discussions at the inaugural DMLR workshop at ICML 2023 and meetings prior, in this report we outline the relevance of community engagement and infrastructure development for the creation of next-generation public datasets that will advance machine learning science. We chart a path forward as a collective effort to sustain the creation and maintenance of these datasets and methods towards positive scientific, societal and business impact.
△ Less
Submitted 1 June, 2024; v1 submitted 21 November, 2023;
originally announced November 2023.
-
Human Feedback is not Gold Standard
Authors:
Tom Hosking,
Phil Blunsom,
Max Bartolo
Abstract:
Human feedback has become the de facto standard for evaluating the performance of Large Language Models, and is increasingly being used as a training objective. However, it is not clear which properties of a generated output this single `preference' score captures. We hypothesise that preference scores are subjective and open to undesirable biases. We critically analyse the use of human feedback f…
▽ More
Human feedback has become the de facto standard for evaluating the performance of Large Language Models, and is increasingly being used as a training objective. However, it is not clear which properties of a generated output this single `preference' score captures. We hypothesise that preference scores are subjective and open to undesirable biases. We critically analyse the use of human feedback for both training and evaluation, to verify whether it fully captures a range of crucial error criteria. We find that while preference scores have fairly good coverage, they under-represent important aspects like factuality. We further hypothesise that both preference scores and error annotation may be affected by confounders, and leverage instruction-tuned models to generate outputs that vary along two possible confounding dimensions: assertiveness and complexity. We find that the assertiveness of an output skews the perceived rate of factuality errors, indicating that human annotations are not a fully reliable evaluation metric or training objective. Finally, we offer preliminary evidence that using human feedback as a training objective disproportionately increases the assertiveness of model outputs. We encourage future work to carefully consider whether preference scores are well aligned with the desired objective.
△ Less
Submitted 16 January, 2024; v1 submitted 28 September, 2023;
originally announced September 2023.
-
Computational framework for the generation of one-dimensional vascular models accounting for uncertainty in networks extracted from medical images
Authors:
Michelle A Bartolo,
Alyssa M Taylor-LaPole,
Darsh Gandhi,
Alexandria Johnson,
Yaqi Li,
Emma Slack,
Isaiah Stevens,
Zachary Turner,
Justin D Weigand,
Charles Puelz,
Dirk Husmeier,
Mette S Olufsen
Abstract:
Patient-specific computational modeling is a popular, non-invasive method to answer medical questions. Medical images are used to extract geometric domains necessary to create these models, providing a predictive tool for clinicians. However, in vivo imaging is subject to uncertainty, impacting vessel dimensions essential to the mathematical modeling process. While there are numerous programs avai…
▽ More
Patient-specific computational modeling is a popular, non-invasive method to answer medical questions. Medical images are used to extract geometric domains necessary to create these models, providing a predictive tool for clinicians. However, in vivo imaging is subject to uncertainty, impacting vessel dimensions essential to the mathematical modeling process. While there are numerous programs available to provide information about vessel length, radii, and position, there is currently no exact way to determine and calibrate these features. This raises the question, if we are building patient-specific models based on uncertain measurements, how accurate are the geometries we extract and how can we best represent a patient's vasculature? In this study, we develop a novel framework to determine vessel dimensions using change points. We explore the impact of uncertainty in the network extraction process on hemodynamics by varying vessel dimensions and segmenting the same images multiple times. Our analyses reveal that image segmentation, network size, and minor changes in radius and length have significant impacts on pressure and flow dynamics in rapidly branching structures and tapering vessels. Accordingly, we conclude that it is critical to understand how uncertainty in network geometry propagates to fluid dynamics, especially in clinical applications.
△ Less
Submitted 8 May, 2024; v1 submitted 15 September, 2023;
originally announced September 2023.
-
Adversarial Nibbler: A Data-Centric Challenge for Improving the Safety of Text-to-Image Models
Authors:
Alicia Parrish,
Hannah Rose Kirk,
Jessica Quaye,
Charvi Rastogi,
Max Bartolo,
Oana Inel,
Juan Ciro,
Rafael Mosquera,
Addison Howard,
Will Cukierski,
D. Sculley,
Vijay Janapa Reddi,
Lora Aroyo
Abstract:
The generative AI revolution in recent years has been spurred by an expansion in compute power and data quantity, which together enable extensive pre-training of powerful text-to-image (T2I) models. With their greater capabilities to generate realistic and creative content, these T2I models like DALL-E, MidJourney, Imagen or Stable Diffusion are reaching ever wider audiences. Any unsafe behaviors…
▽ More
The generative AI revolution in recent years has been spurred by an expansion in compute power and data quantity, which together enable extensive pre-training of powerful text-to-image (T2I) models. With their greater capabilities to generate realistic and creative content, these T2I models like DALL-E, MidJourney, Imagen or Stable Diffusion are reaching ever wider audiences. Any unsafe behaviors inherited from pretraining on uncurated internet-scraped datasets thus have the potential to cause wide-reaching harm, for example, through generated images which are violent, sexually explicit, or contain biased and derogatory stereotypes. Despite this risk of harm, we lack systematic and structured evaluation datasets to scrutinize model behavior, especially adversarial attacks that bypass existing safety filters. A typical bottleneck in safety evaluation is achieving a wide coverage of different types of challenging examples in the evaluation set, i.e., identifying 'unknown unknowns' or long-tail problems. To address this need, we introduce the Adversarial Nibbler challenge. The goal of this challenge is to crowdsource a diverse set of failure modes and reward challenge participants for successfully finding safety vulnerabilities in current state-of-the-art T2I models. Ultimately, we aim to provide greater awareness of these issues and assist developers in improving the future safety and reliability of generative AI models. Adversarial Nibbler is a data-centric challenge, part of the DataPerf challenge suite, organized and supported by Kaggle and MLCommons.
△ Less
Submitted 22 May, 2023;
originally announced May 2023.
-
DataPerf: Benchmarks for Data-Centric AI Development
Authors:
Mark Mazumder,
Colby Banbury,
Xiaozhe Yao,
Bojan Karlaš,
William Gaviria Rojas,
Sudnya Diamos,
Greg Diamos,
Lynn He,
Alicia Parrish,
Hannah Rose Kirk,
Jessica Quaye,
Charvi Rastogi,
Douwe Kiela,
David Jurado,
David Kanter,
Rafael Mosquera,
Juan Ciro,
Lora Aroyo,
Bilge Acun,
Lingjiao Chen,
Mehul Smriti Raje,
Max Bartolo,
Sabri Eyuboglu,
Amirata Ghorbani,
Emmett Goodman
, et al. (20 additional authors not shown)
Abstract:
Machine learning research has long focused on models rather than datasets, and prominent datasets are used for common ML tasks without regard to the breadth, difficulty, and faithfulness of the underlying problems. Neglecting the fundamental importance of data has given rise to inaccuracy, bias, and fragility in real-world applications, and research is hindered by saturation across existing datase…
▽ More
Machine learning research has long focused on models rather than datasets, and prominent datasets are used for common ML tasks without regard to the breadth, difficulty, and faithfulness of the underlying problems. Neglecting the fundamental importance of data has given rise to inaccuracy, bias, and fragility in real-world applications, and research is hindered by saturation across existing dataset benchmarks. In response, we present DataPerf, a community-led benchmark suite for evaluating ML datasets and data-centric algorithms. We aim to foster innovation in data-centric AI through competition, comparability, and reproducibility. We enable the ML community to iterate on datasets, instead of just architectures, and we provide an open, online platform with multiple rounds of challenges to support this iterative development. The first iteration of DataPerf contains five benchmarks covering a wide spectrum of data-centric techniques, tasks, and modalities in vision, speech, acquisition, debugging, and diffusion prompting, and we support hosting new contributed benchmarks from the community. The benchmarks, online evaluation platform, and baseline implementations are open source, and the MLCommons Association will maintain DataPerf to ensure long-term benefits to academia and industry.
△ Less
Submitted 13 October, 2023; v1 submitted 20 July, 2022;
originally announced July 2022.
-
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
Authors:
Tristan Thrush,
Ryan Jiang,
Max Bartolo,
Amanpreet Singh,
Adina Williams,
Douwe Kiela,
Candace Ross
Abstract:
We present a novel task and dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two images and two captions, the goal is to match them correctly - but crucially, both captions contain a completely identical set of words, only in a different order. The dataset was carefully hand-curated by expert annot…
▽ More
We present a novel task and dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two images and two captions, the goal is to match them correctly - but crucially, both captions contain a completely identical set of words, only in a different order. The dataset was carefully hand-curated by expert annotators and is labeled with a rich set of fine-grained tags to assist in analyzing model performance. We probe a diverse range of state-of-the-art vision and language models and find that, surprisingly, none of them do much better than chance. Evidently, these models are not as skilled at visio-linguistic compositional reasoning as we might have hoped. We perform an extensive analysis to obtain insights into how future work might try to mitigate these models' shortcomings. We aim for Winoground to serve as a useful evaluation set for advancing the state of the art and driving further progress in the field. The dataset is available at https://huggingface.co/datasets/facebook/winoground.
△ Less
Submitted 22 April, 2022; v1 submitted 6 April, 2022;
originally announced April 2022.
-
Dynatask: A Framework for Creating Dynamic AI Benchmark Tasks
Authors:
Tristan Thrush,
Kushal Tirumala,
Anmol Gupta,
Max Bartolo,
Pedro Rodriguez,
Tariq Kane,
William Gaviria Rojas,
Peter Mattson,
Adina Williams,
Douwe Kiela
Abstract:
We introduce Dynatask: an open source system for setting up custom NLP tasks that aims to greatly lower the technical knowledge and effort required for hosting and evaluating state-of-the-art NLP models, as well as for conducting model in the loop data collection with crowdworkers. Dynatask is integrated with Dynabench, a research platform for rethinking benchmarking in AI that facilitates human a…
▽ More
We introduce Dynatask: an open source system for setting up custom NLP tasks that aims to greatly lower the technical knowledge and effort required for hosting and evaluating state-of-the-art NLP models, as well as for conducting model in the loop data collection with crowdworkers. Dynatask is integrated with Dynabench, a research platform for rethinking benchmarking in AI that facilitates human and model in the loop data collection and evaluation. To create a task, users only need to write a short task configuration file from which the relevant web interfaces and model hosting infrastructure are automatically generated. The system is available at https://dynabench.org/ and the full library can be found at https://github.com/facebookresearch/dynabench.
△ Less
Submitted 4 April, 2022;
originally announced April 2022.
-
Models in the Loop: Aiding Crowdworkers with Generative Annotation Assistants
Authors:
Max Bartolo,
Tristan Thrush,
Sebastian Riedel,
Pontus Stenetorp,
Robin Jia,
Douwe Kiela
Abstract:
In Dynamic Adversarial Data Collection (DADC), human annotators are tasked with finding examples that models struggle to predict correctly. Models trained on DADC-collected training data have been shown to be more robust in adversarial and out-of-domain settings, and are considerably harder for humans to fool. However, DADC is more time-consuming than traditional data collection and thus more cost…
▽ More
In Dynamic Adversarial Data Collection (DADC), human annotators are tasked with finding examples that models struggle to predict correctly. Models trained on DADC-collected training data have been shown to be more robust in adversarial and out-of-domain settings, and are considerably harder for humans to fool. However, DADC is more time-consuming than traditional data collection and thus more costly per annotated example. In this work, we examine whether we can maintain the advantages of DADC, without incurring the additional cost. To that end, we introduce Generative Annotation Assistants (GAAs), generator-in-the-loop models that provide real-time suggestions that annotators can either approve, modify, or reject entirely. We collect training datasets in twenty experimental settings and perform a detailed analysis of this approach for the task of extractive question answering (QA) for both standard and adversarial data collection. We demonstrate that GAAs provide significant efficiency benefits with over a 30% annotation speed-up, while leading to over a 5x improvement in model fooling rates. In addition, we find that using GAA-assisted training data leads to higher downstream model performance on a variety of question answering tasks over adversarial data collection.
△ Less
Submitted 17 May, 2022; v1 submitted 16 December, 2021;
originally announced December 2021.
-
Contrasting Human- and Machine-Generated Word-Level Adversarial Examples for Text Classification
Authors:
Maximilian Mozes,
Max Bartolo,
Pontus Stenetorp,
Bennett Kleinberg,
Lewis D. Griffin
Abstract:
Research shows that natural language processing models are generally considered to be vulnerable to adversarial attacks; but recent work has drawn attention to the issue of validating these adversarial inputs against certain criteria (e.g., the preservation of semantics and grammaticality). Enforcing constraints to uphold such criteria may render attacks unsuccessful, raising the question of wheth…
▽ More
Research shows that natural language processing models are generally considered to be vulnerable to adversarial attacks; but recent work has drawn attention to the issue of validating these adversarial inputs against certain criteria (e.g., the preservation of semantics and grammaticality). Enforcing constraints to uphold such criteria may render attacks unsuccessful, raising the question of whether valid attacks are actually feasible. In this work, we investigate this through the lens of human language ability. We report on crowdsourcing studies in which we task humans with iteratively modifying words in an input text, while receiving immediate model feedback, with the aim of causing a sentiment classification model to misclassify the example. Our findings suggest that humans are capable of generating a substantial amount of adversarial examples using semantics-preserving word substitutions. We analyze how human-generated adversarial examples compare to the recently proposed TextFooler, Genetic, BAE and SememePSO attack algorithms on the dimensions naturalness, preservation of sentiment, grammaticality and substitution rate. Our findings suggest that human-generated adversarial examples are not more able than the best algorithms to generate natural-reading, sentiment-preserving examples, though they do so by being much more computationally efficient.
△ Less
Submitted 9 September, 2021;
originally announced September 2021.
-
Dynabench: Rethinking Benchmarking in NLP
Authors:
Douwe Kiela,
Max Bartolo,
Yixin Nie,
Divyansh Kaushik,
Atticus Geiger,
Zhengxuan Wu,
Bertie Vidgen,
Grusha Prasad,
Amanpreet Singh,
Pratik Ringshia,
Zhiyi Ma,
Tristan Thrush,
Sebastian Riedel,
Zeerak Waseem,
Pontus Stenetorp,
Robin Jia,
Mohit Bansal,
Christopher Potts,
Adina Williams
Abstract:
We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary model…
▽ More
We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios. With Dynabench, dataset creation, model development, and model assessment can directly inform each other, leading to more robust and informative benchmarks. We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform, and address potential objections to dynamic benchmarking as a new standard for the field.
△ Less
Submitted 7 April, 2021;
originally announced April 2021.
-
Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity
Authors:
Yao Lu,
Max Bartolo,
Alastair Moore,
Sebastian Riedel,
Pontus Stenetorp
Abstract:
When primed with only a handful of training samples, very large, pretrained language models such as GPT-3 have shown competitive results when compared to fully-supervised, fine-tuned, large, pretrained language models. We demonstrate that the order in which the samples are provided can make the difference between near state-of-the-art and random guess performance: essentially some permutations are…
▽ More
When primed with only a handful of training samples, very large, pretrained language models such as GPT-3 have shown competitive results when compared to fully-supervised, fine-tuned, large, pretrained language models. We demonstrate that the order in which the samples are provided can make the difference between near state-of-the-art and random guess performance: essentially some permutations are "fantastic" and some not. We analyse this phenomenon in detail, establishing that: it is present across model sizes (even for the largest current models), it is not related to a specific subset of samples, and that a given good permutation for one model is not transferable to another. While one could use a development set to determine which permutations are performant, this would deviate from the true few-shot setting as it requires additional annotated data. Instead, we use the generative nature of language models to construct an artificial development set and based on entropy statistics of the candidate permutations on this set, we identify performant prompts. Our method yields a 13% relative improvement for GPT-family models across eleven different established text classification tasks.
△ Less
Submitted 3 March, 2022; v1 submitted 18 April, 2021;
originally announced April 2021.
-
Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation
Authors:
Max Bartolo,
Tristan Thrush,
Robin Jia,
Sebastian Riedel,
Pontus Stenetorp,
Douwe Kiela
Abstract:
Despite recent progress, state-of-the-art question answering models remain vulnerable to a variety of adversarial attacks. While dynamic adversarial data collection, in which a human annotator tries to write examples that fool a model-in-the-loop, can improve model robustness, this process is expensive which limits the scale of the collected data. In this work, we are the first to use synthetic ad…
▽ More
Despite recent progress, state-of-the-art question answering models remain vulnerable to a variety of adversarial attacks. While dynamic adversarial data collection, in which a human annotator tries to write examples that fool a model-in-the-loop, can improve model robustness, this process is expensive which limits the scale of the collected data. In this work, we are the first to use synthetic adversarial data generation to make question answering models more robust to human adversaries. We develop a data generation pipeline that selects source passages, identifies candidate answers, generates questions, then finally filters or re-labels them to improve quality. Using this approach, we amplify a smaller human-written adversarial dataset to a much larger set of synthetic question-answer pairs. By incorporating our synthetic data, we improve the state-of-the-art on the AdversarialQA dataset by 3.7F1 and improve model generalisation on nine of the twelve MRQA datasets. We further conduct a novel human-in-the-loop evaluation to show that our models are considerably more robust to new human-written adversarial examples: crowdworkers can fool our model only 8.8% of the time on average, compared to 17.6% for a model trained without synthetic data.
△ Less
Submitted 15 March, 2022; v1 submitted 17 April, 2021;
originally announced April 2021.
-
Numerical predictions of shear stress and cyclic stretch in the healthy pulmonary vasculature
Authors:
Michelle A. Bartolo,
M. Umar Qureshi,
Mitchel J. Colebank,
Naomi C. Chesler,
Mette S. Olufsen
Abstract:
Isolated post-capillary pulmonary hypertension (Ipc-PH) occurs due to left heart failure, which contributes to 1 out of every 9 deaths in the United States. In some patients, through unknown mechanisms, Ipc-PH transitions to combined pre-/post-capillary PH (Cpc-PH), diagnosed by an increase in pulmonary vascular resistance and associated with a dramatic increase in mortality. We hypothesize that a…
▽ More
Isolated post-capillary pulmonary hypertension (Ipc-PH) occurs due to left heart failure, which contributes to 1 out of every 9 deaths in the United States. In some patients, through unknown mechanisms, Ipc-PH transitions to combined pre-/post-capillary PH (Cpc-PH), diagnosed by an increase in pulmonary vascular resistance and associated with a dramatic increase in mortality. We hypothesize that altered mechanical forces and subsequent vasoactive signaling in the pulmonary capillary bed drive the transition from Ipc-PH to Cpc-PH. However, even in a healthy pulmonary circulation, the mechanical forces in the smallest vessels (the arterioles, venules, and capillary bed) have not been quantitatively defined. This study is the first to examine this question via a computational fluid dynamics model of the human pulmonary arteries, veins, arterioles, and venules. Using this model we predict temporal and spatial dynamics of cyclic stretch and wall shear stress. In the large vessels, numerical simulations show that shear stress increases coincides with larger flow and pressure. In the microvasculature, we found that as vessel radius decreases, shear stress increases and flow decreases. In arterioles, this corresponds with lower pressures; however, the venules and smaller veins have higher pressure than larger veins. Our model provides predictions for pressure, flow, shear stress, and cyclic stretch that provides a way to analyze and investigate hypotheses related to disease progression in the pulmonary circulation.
△ Less
Submitted 5 March, 2021;
originally announced March 2021.
-
Undersensitivity in Neural Reading Comprehension
Authors:
Johannes Welbl,
Pasquale Minervini,
Max Bartolo,
Pontus Stenetorp,
Sebastian Riedel
Abstract:
Current reading comprehension models generalise well to in-distribution test sets, yet perform poorly on adversarially selected inputs. Most prior work on adversarial inputs studies oversensitivity: semantically invariant text perturbations that cause a model's prediction to change when it should not. In this work we focus on the complementary problem: excessive prediction undersensitivity, where…
▽ More
Current reading comprehension models generalise well to in-distribution test sets, yet perform poorly on adversarially selected inputs. Most prior work on adversarial inputs studies oversensitivity: semantically invariant text perturbations that cause a model's prediction to change when it should not. In this work we focus on the complementary problem: excessive prediction undersensitivity, where input text is meaningfully changed but the model's prediction does not, even though it should. We formulate a noisy adversarial attack which searches among semantic variations of the question for which a model erroneously predicts the same answer, and with even higher probability. Despite comprising unanswerable questions, both SQuAD2.0 and NewsQA models are vulnerable to this attack. This indicates that although accurate, models tend to rely on spurious patterns and do not fully consider the information specified in a question. We experiment with data augmentation and adversarial training as defences, and find that both substantially decrease vulnerability to attacks on held out data, as well as held out attack spaces. Addressing undersensitivity also improves results on AddSent and AddOneSent, and models furthermore generalise better when facing train/evaluation distribution mismatch: they are less prone to overly rely on predictive cues present only in the training set, and outperform a conventional model by as much as 10.9% F1.
△ Less
Submitted 15 February, 2020;
originally announced March 2020.
-
Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension
Authors:
Max Bartolo,
Alastair Roberts,
Johannes Welbl,
Sebastian Riedel,
Pontus Stenetorp
Abstract:
Innovations in annotation methodology have been a catalyst for Reading Comprehension (RC) datasets and models. One recent trend to challenge current RC models is to involve a model in the annotation process: humans create questions adversarially, such that the model fails to answer them correctly. In this work we investigate this annotation methodology and apply it in three different settings, col…
▽ More
Innovations in annotation methodology have been a catalyst for Reading Comprehension (RC) datasets and models. One recent trend to challenge current RC models is to involve a model in the annotation process: humans create questions adversarially, such that the model fails to answer them correctly. In this work we investigate this annotation methodology and apply it in three different settings, collecting a total of 36,000 samples with progressively stronger models in the annotation loop. This allows us to explore questions such as the reproducibility of the adversarial effect, transfer from data collected with varying model-in-the-loop strengths, and generalisation to data collected without a model. We find that training on adversarially collected samples leads to strong generalisation to non-adversarially collected datasets, yet with progressive performance deterioration with increasingly stronger models-in-the-loop. Furthermore, we find that stronger models can still learn from datasets collected with substantially weaker models-in-the-loop. When trained on data collected with a BiDAF model in the loop, RoBERTa achieves 39.9F1 on questions that it cannot answer when trained on SQuAD - only marginally lower than when trained on data collected using RoBERTa itself (41.0F1).
△ Less
Submitted 22 September, 2020; v1 submitted 1 February, 2020;
originally announced February 2020.
-
Interpretation of Natural Language Rules in Conversational Machine Reading
Authors:
Marzieh Saeidi,
Max Bartolo,
Patrick Lewis,
Sameer Singh,
Tim Rocktäschel,
Mike Sheldon,
Guillaume Bouchard,
Sebastian Riedel
Abstract:
Most work in machine reading focuses on question answering problems where the answer is directly expressed in the text to read. However, many real-world question answering problems require the reading of text not because it contains the literal answer, but because it contains a recipe to derive an answer together with the reader's background knowledge. One example is the task of interpreting regul…
▽ More
Most work in machine reading focuses on question answering problems where the answer is directly expressed in the text to read. However, many real-world question answering problems require the reading of text not because it contains the literal answer, but because it contains a recipe to derive an answer together with the reader's background knowledge. One example is the task of interpreting regulations to answer "Can I...?" or "Do I have to...?" questions such as "I am working in Canada. Do I have to carry on paying UK National Insurance?" after reading a UK government website about this topic. This task requires both the interpretation of rules and the application of background knowledge. It is further complicated due to the fact that, in practice, most questions are underspecified, and a human assistant will regularly have to ask clarification questions such as "How long have you been working abroad?" when the answer cannot be directly derived from the question and text. In this paper, we formalise this task and develop a crowd-sourcing strategy to collect 32k task instances based on real-world rules and crowd-generated questions and scenarios. We analyse the challenges of this task and assess its difficulty by evaluating the performance of rule-based and machine-learning baselines. We observe promising results when no background knowledge is necessary, and substantial room for improvement whenever background knowledge is needed.
△ Less
Submitted 28 August, 2018;
originally announced September 2018.