On the Limitations of Compute Thresholds as a Governance Strategy

Abstract: At face value, this essay is about understanding a fairly esoteric governance tool called compute thresholds. However, in order to grapple with whether these thresholds will achieve anything, we must first understand how they came to be. This requires engaging with a decades-old debate at the heart of computer science progress, namely, is bigger always better? Hence, this essay may be of interest not only to policymakers and the wider public but also to computer scientists interested in understanding the role of compute in unlocking breakthroughs. Does a certain inflection point of compute result in changes to the risk profile of a model? This discussion is increasingly urgent given the wide adoption of governance approaches that suggest greater compute equates with higher propensity for harm. Several leading frontier AI companies have released responsible scaling policies. Both the White House Executive Orders on AI Safety (EO) and the EU AI Act encode the use of FLOP or floating-point operations as a way to identify more powerful systems. What is striking about the choice of compute thresholds to-date is that no models currently deployed in the wild fulfill the current criteria set by the EO. This implies that the emphasis is often not on auditing the risks and harms incurred by currently deployed models - but rather is based upon the belief that future levels of compute will introduce unforeseen new risks. A key conclusion of this essay is that compute thresholds as currently implemented are shortsighted and likely to fail to mitigate risk. Governance that is overly reliant on compute fails to understand that the relationship between compute and risk is highly uncertain and rapidly changing. It also overestimates our ability to predict what abilities emerge at different scales. This essay ends with recommendations for a better way forward.

Submitted 8 July, 2024; originally announced July 2024.

arXiv:2407.03211 [pdf, other]

How Does Quantization Affect Multilingual LLMs?

Authors: Kelly Marchisio, Saurabh Dash, Hongyu Chen, Dennis Aumiller, Ahmet Üstün, Sara Hooker, Sebastian Ruder

Abstract: Quantization techniques are widely used to improve inference speed and deployment of large language models. While a wide body of work examines the impact of quantized LLMs on English tasks, none have examined the effect of quantization across languages. We conduct a thorough analysis of quantized multilingual LLMs, focusing on their performance across languages and at varying scales. We use automatic benchmarks, LLM-as-a-Judge methods, and human evaluation, finding that (1) harmful effects of quantization are apparent in human evaluation, and automatic metrics severely underestimate the detriment: a 1.7% average drop in Japanese across automatic tasks corresponds to a 16.0% drop reported by human evaluators on realistic prompts; (2) languages are disparately affected by quantization, with non-Latin script languages impacted worst; and (3) challenging tasks such as mathematical reasoning degrade fastest. As the ability to serve low-compute models is critical for wide global adoption of NLP technologies, our results urge consideration of multilingual performance as a key evaluation criterion for efficient models.

Submitted 3 July, 2024; originally announced July 2024.

arXiv:2407.02552 [pdf, other]

RLHF Can Speak Many Languages: Unlocking Multilingual Preference Optimization for LLMs

Authors: John Dang, Arash Ahmadian, Kelly Marchisio, Julia Kreutzer, Ahmet Üstün, Sara Hooker

Abstract: Preference optimization techniques have become a standard final stage for training state-of-art large language models (LLMs). However, despite widespread adoption, the vast majority of work to-date has focused on first-class citizen languages like English and Chinese. This captures a small fraction of the languages in the world, but also makes it unclear which aspects of current state-of-the-art research transfer to a multilingual setting. In this work, we perform an exhaustive study to achieve a new state-of-the-art in aligning multilingual LLMs. We introduce a novel, scalable method for generating high-quality multilingual feedback data to balance data coverage. We establish the benefits of cross-lingual transfer and increased dataset size in preference training. Our preference-trained model achieves a 54.4% win-rate against Aya 23 8B, the current state-of-the-art multilingual LLM in its parameter class, and a 69.5% win-rate or higher against widely used models like Gemma-1.1-7B-it, Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3. As a result of our study, we expand the frontier of alignment techniques to 23 languages covering half of the world's population.

Submitted 2 July, 2024; originally announced July 2024.

arXiv:2407.01490 [pdf, other]

LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives

Authors: Luísa Shimabucoro, Sebastian Ruder, Julia Kreutzer, Marzieh Fadaee, Sara Hooker

Abstract: The widespread adoption of synthetic data raises new questions about how models generating the data can influence other large language models (LLMs) via distilled data. To start, our work exhaustively characterizes the impact of passive inheritance of model properties by systematically studying the consequences of synthetic data integration. We provide one of the most comprehensive studies to-date of how the source of synthetic data shapes models' internal biases, calibration and generations' textual attributes and preferences. We find that models are surprisingly sensitive towards certain attributes even when the synthetic data prompts appear "neutral", which invites the question whether this sensitivity can be exploited for good. Our findings invite the question: can we explicitly steer the models towards the properties we want at test time by exploiting the data generation process? This would have historically been considered infeasible due to the cost of collecting data with a specific characteristic or objective in mind. However, improvement in the quality of synthetic data, as well as a shift towards general-purpose models designed to follow a diverse way of instructions, means this question is timely. We propose active inheritance as a term to describe intentionally constraining synthetic data according to a non-differentiable objective. We demonstrate how active inheritance can steer the generation profiles of models towards desirable non-differentiable attributes, e.g. high lexical diversity or low toxicity.

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2406.18682 [pdf, other]

The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm

Authors: Aakanksha, Arash Ahmadian, Beyza Ermis, Seraphina Goldfarb-Tarrant, Julia Kreutzer, Marzieh Fadaee, Sara Hooker

Abstract: A key concern with the concept of "alignment" is the implicit question of "alignment to what?". AI systems are increasingly used across the world, yet safety alignment is often focused on homogeneous monolingual settings. Additionally, preference training and safety measures often overfit to harms common in Western-centric datasets. Here, we explore the viability of different alignment approaches when balancing dual objectives: addressing and optimizing for a non-homogeneous set of languages and cultural preferences while minimizing both global and local harms. We collect the first set of human annotated red-teaming prompts in different languages distinguishing between global and local harm, which serve as a laboratory for understanding the reliability of alignment techniques when faced with preference distributions that are non-stationary across geographies and languages. While this setting is seldom covered by the literature to date, which primarily centers on English harm mitigation, it captures real-world interactions with AI systems around the world. We establish a new precedent for state-of-the-art alignment techniques across 6 languages with minimal degradation in general performance. Our work provides important insights into cross-lingual transfer and novel optimization approaches to safeguard AI systems designed to serve global populations.

Submitted 8 July, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.03368 [pdf, other]

IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models

Authors: David Ifeoluwa Adelani, Jessica Ojo, Israel Abebe Azime, Jian Yun Zhuang, Jesujoba O. Alabi, Xuanli He, Millicent Ochieng, Sara Hooker, Andiswa Bukula, En-Shiun Annie Lee, Chiamaka Chukwuneke, Happy Buzaaba, Blessing Sibanda, Godson Kalipe, Jonathan Mukiibi, Salomon Kabongo, Foutse Yuehgoh, Mmasibidi Setaka, Lolwethu Ndolela, Nkiruka Odu, Rooweither Mabuya, Shamsuddeen Hassan Muhammad, Salomey Osei, Sokhar Samb, Tadesse Kebede Guge, et al. (1 additional author not shown)

Abstract: Despite the widespread adoption of Large language models (LLMs), their remarkable capabilities remain limited to a few high-resource languages. Additionally, many low-resource languages (e.g. African languages) are often evaluated only on basic text classification tasks due to the lack of appropriate or comprehensive benchmarks outside of high-resource languages. In this paper, we introduce IrokoBench -- a human-translated benchmark dataset for 16 typologically-diverse low-resource African languages covering three tasks: natural language inference (AfriXNLI), mathematical reasoning (AfriMGSM), and multi-choice knowledge-based QA (AfriMMLU). We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings (where test sets are translated into English) across 10 open and four proprietary LLMs. Our evaluation reveals a significant performance gap between high-resource languages (such as English and French) and low-resource African languages. We observe a significant performance gap between open and proprietary models, with the highest performing open model, Aya-101, only at 58% of the best-performing proprietary model GPT-4o performance. Machine translating the test set to English before evaluation helped to close the gap for larger models that are English-centric, like LLaMa 3 70B. These findings suggest that more efforts are needed to develop and adapt LLMs for African languages.

Submitted 5 June, 2024; originally announced June 2024.

Comments: Under review

arXiv:2405.19462 [pdf, other]

Critical Learning Periods: Leveraging Early Training Dynamics for Efficient Data Pruning

Authors: Everlyn Asiko Chimoto, Jay Gala, Orevaoghene Ahia, Julia Kreutzer, Bruce A. Bassett, Sara Hooker

Abstract: Neural Machine Translation models are extremely data and compute-hungry. However, not all data points contribute equally to model training and generalization. Data pruning to remove the low-value data points has the benefit of drastically reducing the compute budget without significant drop in model performance. In this paper, we propose a new data pruning technique: Checkpoints Across Time (CAT), that leverages early model training dynamics to identify the most relevant data points for model performance. We benchmark CAT against several data pruning techniques including COMET-QE, LASER and LaBSE. We find that CAT outperforms the benchmarks on Indo-European languages on multiple test sets. When applied to English-German, English-French and English-Swahili translation tasks, CAT achieves comparable performance to using the full dataset, while pruning up to 50% of training data. We inspect the data points that CAT selects and find that it tends to favour longer sentences and sentences with unique or rare words.

Submitted 21 June, 2024; v1 submitted 29 May, 2024; originally announced May 2024.

Comments: Accepted to ACL 2024 Findings

arXiv:2405.15032 [pdf, other]

Aya 23: Open Weight Releases to Further Multilingual Progress

Authors: Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin, Bharat Venkitesh, Madeline Smith, Jon Ander Campos, Yi Chern Tan, Kelly Marchisio, Max Bartolo, Sebastian Ruder, Acyr Locatelli, Julia Kreutzer, Nick Frosst, Aidan Gomez, Phil Blunsom, Marzieh Fadaee, Ahmet Üstün, Sara Hooker

Abstract: This technical report introduces Aya 23, a family of multilingual language models. Aya 23 builds on the recent release of the Aya model (Üstün et al., 2024), focusing on pairing a highly performant pre-trained model with the recently released Aya collection (Singh et al., 2024). The result is a powerful multilingual large language model serving 23 languages, expanding state-of-art language modeling capabilities to approximately half of the world's population. The Aya model covered 101 languages whereas Aya 23 is an experiment in depth vs breadth, exploring the impact of allocating more capacity to fewer languages that are included during pre-training. Aya 23 outperforms both previous massively multilingual models like Aya 101 for the languages it covers, as well as widely used models like Gemma, Mistral and Mixtral on an extensive range of discriminative and generative tasks. We release the open weights for both the 8B and 35B models as part of our continued commitment for expanding access to multilingual progress.

Submitted 31 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

arXiv:2403.03893 [pdf, other]

From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models

Authors: Luiza Pozzobon, Patrick Lewis, Sara Hooker, Beyza Ermis

Abstract: To date, toxicity mitigation in language models has almost entirely been focused on single-language settings. As language models embrace multilingual capabilities, it's crucial our safety measures keep pace. Recognizing this research gap, our approach expands the scope of conventional toxicity mitigation to address the complexities presented by multiple languages. In the absence of sufficient annotated datasets across languages, we employ translated data to evaluate and enhance our mitigation techniques. We also compare finetuning mitigation approaches against retrieval-augmented techniques under both static and continual toxicity mitigation scenarios. This allows us to examine the effects of translation quality and the cross-lingual transfer on toxicity mitigation. We also explore how model size and data quantity affect the success of these mitigation efforts. Covering nine languages, our study represents a broad array of linguistic families and levels of resource availability, ranging from high to mid-resource languages. Through comprehensive experiments, we provide insights into the complexities of multilingual toxicity mitigation, offering valuable insights and paving the way for future research in this increasingly important field. Code and data are available at https://github.com/for-ai/goodtriever.

Submitted 30 May, 2024; v1 submitted 6 March, 2024; originally announced March 2024.

arXiv:2402.14740 [pdf, other]

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

Authors: Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, Sara Hooker

Abstract: AI alignment in the shape of Reinforcement Learning from Human Feedback (RLHF) is increasingly treated as a crucial ingredient for high performance large language models. Proximal Policy Optimization (PPO) has been positioned by recent literature as the canonical method for the RL part of RLHF. However, it involves both high computational cost and sensitive hyperparameter tuning. We posit that most of the motivational principles that led to the development of PPO are less of a practical concern in RLHF and advocate for a less computationally expensive method that preserves and even increases performance. We revisit the formulation of alignment from human preferences in the context of RL. Keeping simplicity as a guiding principle, we show that many components of PPO are unnecessary in an RLHF context and that far simpler REINFORCE-style optimization variants outperform both PPO and newly proposed "RL-free" methods such as DPO and RAFT. Our work suggests that careful adaptation to LLMs alignment characteristics enables benefiting from online RL optimization at low cost.

Submitted 26 February, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

Comments: 27 pages, 7 figures, 2 tables

ACM Class: I.2.7

arXiv:2402.07827 [pdf, other]

Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model

Authors: Ahmet Üstün, Viraat Aryabumi, Zheng-Xin Yong, Wei-Yin Ko, Daniel D'souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, Freddie Vargus, Phil Blunsom, Shayne Longpre, Niklas Muennighoff, Marzieh Fadaee, Julia Kreutzer, Sara Hooker

Abstract: Recent breakthroughs in large language models (LLMs) have centered around a handful of data-rich languages. What does it take to broaden access to breakthroughs beyond first-class citizen languages? Our work introduces Aya, a massively multilingual generative language model that follows instructions in 101 languages of which over 50% are considered as lower-resourced. Aya outperforms mT0 and BLOOMZ on the majority of tasks while covering double the number of languages. We introduce extensive new evaluation suites that broaden the state-of-art for multilingual eval across 99 languages -- including discriminative and generative tasks, human evaluation, and simulated win rates that cover both held-out tasks and in-distribution performance. Furthermore, we conduct detailed investigations on the optimal finetuning mixture composition, data pruning, as well as the toxicity, bias, and safety of our models. We open-source our instruction datasets and our model at https://hf.co/CohereForAI/aya-101

Submitted 12 February, 2024; originally announced February 2024.

arXiv:2402.06619 [pdf, other]

Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning

Authors: Shivalika Singh, Freddie Vargus, Daniel Dsouza, Börje F. Karlsson, Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura OMahony, Mike Zhang, Ramith Hettiarachchi, Joseph Wilson, Marina Machado, Luisa Souza Moura, Dominik Krzemiński, Hakimeh Fadaei, Irem Ergün, Ifeoma Okoh, Aisha Alaagib, Oshan Mudannayake, Zaid Alyafeai, Vu Minh Chien, Sebastian Ruder, Surya Guthikonda, et al. (8 additional authors not shown)

Abstract: Datasets are foundational to many breakthroughs in modern artificial intelligence. Many recent achievements in the space of natural language processing (NLP) can be attributed to the finetuning of pre-trained models on a diverse set of tasks that enables a large language model (LLM) to respond to instructions. Instruction fine-tuning (IFT) requires specifically constructed and annotated datasets. However, existing datasets are almost all in the English language. In this work, our primary goal is to bridge the language gap by building a human-curated instruction-following dataset spanning 65 languages. We worked with fluent speakers of languages from around the world to collect natural instances of instructions and completions. Furthermore, we create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages. In total, we contribute four key resources: we develop and open-source the Aya Annotation Platform, the Aya Dataset, the Aya Collection, and the Aya Evaluation Suite. The Aya initiative also serves as a valuable case study in participatory research, involving collaborators from 119 countries. We see this as a valuable framework for future research collaborations that aim to bridge gaps in resources.

Submitted 9 February, 2024; originally announced February 2024.

arXiv:2312.03886 [pdf, other]

On The Fairness Impacts of Hardware Selection in Machine Learning

Authors: Sree Harsha Nelaturu, Nishaanth Kanna Ravichandran, Cuong Tran, Sara Hooker, Ferdinando Fioretto

Abstract: In the machine learning ecosystem, hardware selection is often regarded as a mere utility, overshadowed by the spotlight on algorithms and data. This oversight is particularly problematic in contexts like ML-as-a-service platforms, where users often lack control over the hardware used for model deployment. How does the choice of hardware impact generalization properties? This paper investigates the influence of hardware on the delicate balance between model performance and fairness. We demonstrate that hardware choices can exacerbate existing disparities, attributing these discrepancies to variations in gradient flows and loss surfaces across different demographic groups. Through both theoretical and empirical analysis, the paper not only identifies the underlying factors but also proposes an effective strategy for mitigating hardware-induced performance imbalances.

Submitted 6 December, 2023; originally announced December 2023.

arXiv:2311.18598 [pdf, other]

Generalisable Agents for Neural Network Optimisation

Authors: Kale-ab Tessera, Callum Rhys Tilbury, Sasha Abramowitz, Ruan de Kock, Omayma Mahjoub, Benjamin Rosman, Sara Hooker, Arnu Pretorius

Abstract: Optimising deep neural networks is a challenging task due to complex training dynamics, high computational requirements, and long training times. To address this difficulty, we propose the framework of Generalisable Agents for Neural Network Optimisation (GANNO) -- a multi-agent reinforcement learning (MARL) approach that learns to improve neural network optimisation by dynamically and responsively scheduling hyperparameters during training. GANNO utilises an agent per layer that observes localised network dynamics and accordingly takes actions to adjust these dynamics at a layerwise level to collectively improve global performance. In this paper, we use GANNO to control the layerwise learning rate and show that the framework can yield useful and responsive schedules that are competitive with handcrafted heuristics. Furthermore, GANNO is shown to perform robustly across a wide variety of unseen initial conditions, and can successfully generalise to harder problems than it was trained on. Our work presents an overview of the opportunities that this paradigm offers for training neural networks, along with key challenges that remain to be overcome.

Submitted 22 March, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

Comments: Accepted at the Workshop on Advanced Neural Network Training (WANT) and Optimization for Machine Learning (OPT) at NeurIPS 2023

arXiv:2311.17295 [pdf, other]

Elo Uncovered: Robustness and Best Practices in Language Model Evaluation

Authors: Meriem Boubdir, Edward Kim, Beyza Ermis, Sara Hooker, Marzieh Fadaee

Abstract: In Natural Language Processing (NLP), the Elo rating system, originally designed for ranking players in dynamic games such as chess, is increasingly being used to evaluate Large Language Models (LLMs) through "A vs B" paired comparisons. However, while popular, the system's suitability for assessing entities with constant skill levels, such as LLMs, remains relatively unexplored. We study two fundamental axioms that evaluation methods should adhere to: reliability and transitivity. We conduct extensive evaluation of Elo behaviour, illustrating that individual Elo computations exhibit volatility and delving into the impact of varying the Elo rating system's hyperparameters. We show that these axioms are not always satisfied, raising questions about the reliability of current comparative evaluations of LLMs. If the current use of Elo scores is intended to substitute the costly head-to-head comparison of LLMs, it is crucial to ensure the ranking is as robust as possible. Guided by the axioms, our findings offer concrete guidelines for enhancing the reliability of LLM evaluation methods, suggesting a need for reassessment of existing comparative approaches.

Submitted 28 November, 2023; originally announced November 2023.

Comments: 22 pages, 7 figures, 2 tables. Revised version of the paper accepted at GEM Workshop, EMNLP 2023

arXiv:2311.00471 [pdf, other]

Multi-GeV Wakefield Acceleration in a Plasma-Modulated Plasma Accelerator

Authors: Johannes J. van de Wetering, Simon M. Hooker, Roman Walczak

Abstract: We investigate the accelerator stage of a Plasma-Modulated Plasma Accelerator (P-MoPA) [Phys. Rev. Lett. 127, 184801 (2021)] using both the paraxial wave equation and particle-in-cell (PIC) simulations. We show that adjusting the laser and plasma parameters of the modulator stage of a P-MoPA allows the temporal profile of pulses within the pulse train to be controlled, which in turn allows the wake amplitude in the accelerator stage to be as much as 72% larger than that generated by a plasma beat-wave accelerator with the same total drive laser energy. Our analysis shows that Rosenbluth-Liu detuning is unimportant in a P-MoPA if the number of pulses in the train is less than $\sim$ 30, and that this detuning is also partially counteracted by increased red-shifting, and hence increased pulse spacing, towards the back of the train. An analysis of transverse mode oscillations of the driving pulse train is found to be in good agreement with 2D PIC simulations. PIC simulations demonstrating energy gains of $\sim$ 1.5 GeV ($\sim$ 2.5 GeV) for drive pulse energies of 2.4 J (5.0 J) are presented. Our results suggest that P-MoPAs driven by few-joule, picosecond pulses, such as those provided by high-repetition-rate thin-disk lasers, could accelerate electron bunches to multi-GeV energies at pulse repetition rates in the kilohertz range.

Submitted 1 November, 2023; originally announced November 2023.

Comments: 12 pages, 5 figures

arXiv:2310.16787 [pdf, other]

The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI

Authors: Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, Xinyi Wu, Enrico Shippole, Kurt Bollacker, Tongshuang Wu, Luis Villa, Sandy Pentland, Sara Hooker

Abstract: The race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners. To remedy these practices threatening data transparency and understanding, we convene a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace 1800+ text datasets. We develop tools and standards to trace the lineage of these datasets, from their source, creators, series of license conditions, properties, and subsequent use. Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs closed datasets, with closed datasets monopolizing important categories: lower resource languages, more creative tasks, richer topic variety, newer and more synthetic training data. This points to a deepening divide in the types of data that are made available under different license conditions, and heightened implications for jurisdictional legal interpretations of copyright and fair use. We also observe frequent miscategorization of licenses on widely used dataset hosting sites, with license omission of 70%+ and error rates of 50%+. This points to a crisis in misattribution and informed use of the most popular datasets driving many recent breakthroughs.
As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire audit, with an interactive UI, the Data Provenance Explorer, which allows practitioners to trace and filter on data provenance for the most popular open source finetuning data collections: www.dataprovenance.org.

Submitted 4 November, 2023; v1 submitted 25 October, 2023; originally announced October 2023.

Comments: 30 pages (18 main), 6 figures, 5 tables

arXiv:2310.16111 [pdf, other]

Locally Differentially Private Document Generation Using Zero Shot Prompting

Authors: Saiteja Utpala, Sara Hooker, Pin Yu Chen

Abstract: Numerous studies have highlighted the privacy risks associated with pretrained large language models. In contrast, our research offers a unique perspective by demonstrating that pretrained large language models can effectively contribute to privacy preservation. We propose a locally differentially private mechanism called DP-Prompt, which leverages the power of pretrained large language models and zero-shot prompting to counter author de-anonymization attacks while minimizing the impact on downstream utility. When DP-Prompt is used with a powerful language model like ChatGPT (gpt-3.5), we observe a notable reduction in the success rate of de-anonymization attacks, showing that it surpasses existing approaches by a considerable margin despite its simpler design. For instance, in the case of the IMDB dataset, DP-Prompt (with ChatGPT) perfectly recovers the clean sentiment F1 score while achieving a 46% reduction in author identification F1 score against static attackers and a 26% reduction against adaptive attackers. We conduct extensive experiments across six open-source large language models, ranging up to 7 billion parameters, to analyze various effects of the privacy-utility tradeoff.

Submitted 30 November, 2023; v1 submitted 24 October, 2023; originally announced October 2023.

Comments: Accepted at EMNLP 2023 (Findings)

arXiv:2310.14424 [pdf, other]

Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation

Authors: Meriem Boubdir, Edward Kim, Beyza Ermis, Marzieh Fadaee, Sara Hooker

Abstract: Human evaluation is increasingly critical for assessing large language models, capturing linguistic nuances, and reflecting user preferences more accurately than traditional automated metrics. However, the resource-intensive nature of this type of annotation process poses significant challenges. The key question driving our work: "is it feasible to minimize human-in-the-loop feedback by prioritizing data instances which most effectively distinguish between models?" We evaluate several metric-based methods and find that these metrics enhance the efficiency of human evaluations by minimizing the number of required annotations, thus saving time and cost, while ensuring a robust performance evaluation. We show that our method is effective across widely used model families, reducing instances of indecisive (or "tie") outcomes by up to 54% compared to a random sample when focusing on the top-20 percentile of prioritized instances. This potential reduction in required human effort positions our approach as a valuable strategy in future large language model evaluations.

Submitted 22 October, 2023; originally announced October 2023.

Comments: 37 pages, 8 figures

arXiv:2310.07589 [pdf, other]

Goodtriever: Adaptive Toxicity Mitigation with Retrieval-augmented Models

Authors: Luiza Pozzobon, Beyza Ermis, Patrick Lewis, Sara Hooker

Abstract: Considerable effort has been dedicated to mitigating toxicity, but existing methods often require drastic modifications to model parameters or the use of computationally intensive auxiliary models. Furthermore, previous approaches have often neglected the crucial factor of language's evolving nature over time. In this work, we present a comprehensive perspective on toxicity mitigation that takes into account its changing nature. We introduce Goodtriever, a flexible methodology that matches the current state-of-the-art toxicity mitigation while achieving 43% relative latency reduction during inference and being more computationally efficient. By incorporating a retrieval-based approach at decoding time, Goodtriever enables toxicity-controlled text generation. Our research advocates for an increased focus on adaptable mitigation techniques, which better reflect the data drift models face when deployed in the wild. Code and data are available at https://github.com/for-ai/goodtriever.

Submitted 11 October, 2023; originally announced October 2023.

arXiv:2310.05097 [pdf, other]

doi 10.1103/PhysRevResearch.6.L022001

Resonant excitation of plasma waves in a plasma channel

Authors: Aimee J. Ross, James Chappell, Johannes J. van de Wetering, James Cowley, Emily Archer, Nicolas Bourgeois, Laura Corner, David R. Emerson, Linus Feder, Xiao J. Gu, Oscar Jakobsson, Harry Jones, Alexander Picksley, Linus Reid, Wei-Ting Wang, Roman Walczak, Simon M. Hooker

Abstract: We demonstrate resonant excitation of a plasma wave by a train of short laser pulses guided in a pre-formed plasma channel, for parameters relevant to a plasma-modulated plasma accelerator (P-MoPA). We show experimentally that a train of $N \approx 10$ short pulses, of total energy $\sim 1$ J, can be guided through $110$ mm long plasma channels with on-axis densities in the range $10^{17} - 10^{18}$ cm$^{-3}$. The spectrum of the transmitted train is found to be strongly red-shifted when the plasma period is tuned to the intra-train pulse spacing. Numerical simulations are found to be in excellent agreement with the measurements and indicate that the resonantly excited plasma waves have an amplitude in the range $3$ - $10$ GV m$^{-1}$, corresponding to an accelerator stage energy gain of order $1$ GeV.

Submitted 8 October, 2023; originally announced October 2023.

Comments: 13 pages, 14 figures (including Supplemental Material)

Journal ref: Physical Review Research vol. 6, L022001 (2024)

arXiv:2309.07181 [pdf, other]

The Grand Illusion: The Myth of Software Portability and Implications for ML Progress

Authors: Fraser Mince, Dzung Dinh, Jonas Kgomo, Neil Thompson, Sara Hooker

Abstract: Pushing the boundaries of machine learning often requires exploring different hardware and software combinations. However, the freedom to experiment across different tooling stacks can be at odds with the drive for efficiency, which has produced increasingly specialized AI hardware and incentivized consolidation around a narrow set of ML frameworks. Exploratory research can be restricted if software and hardware are co-evolving, making it even harder to stray away from mainstream ideas that work well with popular tooling stacks. While this friction increasingly impacts the rate of innovation in machine learning, to our knowledge the lack of portability in tooling has not been quantified. In this work, we ask: How portable are popular ML software frameworks? We conduct a large-scale study of the portability of mainstream ML frameworks across different hardware types. Our findings paint an uncomfortable picture -- frameworks can lose more than 40% of their key functions when ported to other hardware. Worse, even when functions are portable, the slowdown in their performance can be extreme and render performance untenable. Collectively, our results reveal how costly straying from a narrow set of hardware-software combinations can be - and suggest that specialization of hardware impedes innovation in machine learning research.

Submitted 12 September, 2023; originally announced September 2023.

Comments: 28 pages, 13 figures; the associated repo can be found at https://github.com/for-ai/portability

arXiv:2309.05444 [pdf, other]

Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning

Authors: Ted Zadouri, Ahmet Üstün, Arash Ahmadian, Beyza Ermiş, Acyr Locatelli, Sara Hooker

Abstract: The Mixture of Experts (MoE) is a widely known neural architecture where an ensemble of specialized sub-models optimizes overall performance with a constant computational cost. However, conventional MoEs pose challenges at scale due to the need to store all experts in memory. In this paper, we push MoE to the limit. We propose extremely parameter-efficient MoE by uniquely combining MoE architecture with lightweight experts. Our MoE architecture outperforms standard parameter-efficient fine-tuning (PEFT) methods and is on par with full fine-tuning by only updating the lightweight experts -- less than 1% of an 11B parameter model. Furthermore, our method generalizes to unseen tasks as it does not depend on any prior task knowledge. Our research underscores the versatility of the mixture of experts architecture, showcasing its ability to deliver robust performance even when subjected to rigorous parameter constraints. Our code used in all the experiments is publicly available here: https://github.com/for-ai/parameter-efficient-moe.

Submitted 11 September, 2023; originally announced September 2023.

arXiv:2309.04564 [pdf, other]

When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale

Authors: Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, Sara Hooker

Abstract: Large volumes of text data have contributed significantly to the development of large language models (LLMs) in recent years. This data is typically acquired by scraping the internet, leading to pretraining datasets comprised of noisy web text. To date, efforts to prune these datasets down to a higher quality subset have relied on hand-crafted heuristics encoded as rule-based filters. In this work, we take a wider view and explore scalable estimates of data quality that can be used to systematically measure the quality of pretraining data. We perform a rigorous comparison at scale of the simple data quality estimator of perplexity, as well as more sophisticated and computationally intensive estimates of the Error L2-Norm and memorization. These metrics are used to rank and prune pretraining corpora, and we subsequently compare LLMs trained on these pruned datasets. Surprisingly, we find that the simple technique of perplexity outperforms our more computationally expensive scoring methods. We improve over our no-pruning baseline while training on as little as 30% of the original training dataset. Our work sets the foundation for unexplored strategies in automatically curating high quality corpora and suggests the majority of pretraining data can be removed while retaining performance.

Submitted 8 September, 2023; originally announced September 2023.

Comments: 14 pages, 8 figures

arXiv:2307.13689 [pdf, other]

doi 10.1103/PhysRevLett.131.245001

All-optical GeV electron bunch generation in a laser-plasma accelerator via truncated-channel injection

Authors: A. Picksley, J. Chappell, E. Archer, N. Bourgeois, J. Cowley, D. R. Emerson, L. Feder, X. J. Gu, O. Jakobsson, A. J. Ross, W. Wang, R. Walczak, S. M. Hooker

Abstract: We describe a simple scheme, truncated-channel injection, to inject electrons directly into the wakefield driven by a drive pulse guided by an all-optical plasma channel. We use this approach to generate dark-current-free 1.2 GeV, 4.5 % relative energy spread electron bunches with 120 TW laser pulses guided in a 110-mm-long hydrodynamic optical-field-ionized (HOFI) plasma channel. Our experiments and particle-in-cell simulations show that high-quality electron bunches were only obtained when the drive pulse was closely aligned with the channel axis, and was focused close to the density down-ramp formed at the channel entrance. Start-to-end simulations of the channel formation, and electron injection and acceleration show that increasing the channel length to 410 mm would yield 3.65 GeV bunches, with a slice energy spread $\sim 5 \times 10^{-4}$.

Submitted 9 January, 2024; v1 submitted 25 July, 2023; originally announced July 2023.

arXiv:2307.03718 [pdf, other]

Frontier AI Regulation: Managing Emerging Risks to Public Safety

Authors: Markus Anderljung, Joslyn Barnhart, Anton Korinek, Jade Leung, Cullen O'Keefe, Jess Whittlestone, Shahar Avin, Miles Brundage, Justin Bullock, Duncan Cass-Beggs, Ben Chang, Tantum Collins, Tim Fist, Gillian Hadfield, Alan Hayes, Lewis Ho, Sara Hooker, Eric Horvitz, Noam Kolt, Jonas Schuett, Yonadav Shavit, Divya Siddarth, Robert Trager, Kevin Wolf

Abstract: Advanced AI models hold the promise of tremendous benefits for humanity, but society needs to proactively manage the accompanying risks. In this paper, we focus on what we term "frontier AI" models: highly capable foundation models that could possess dangerous capabilities sufficient to pose severe risks to public safety. Frontier AI models pose a distinct regulatory challenge: dangerous capabilities can arise unexpectedly; it is difficult to robustly prevent a deployed model from being misused; and, it is difficult to stop a model's capabilities from proliferating broadly. To address these challenges, at least three building blocks for the regulation of frontier models are needed: (1) standard-setting processes to identify appropriate requirements for frontier AI developers, (2) registration and reporting requirements to provide regulators with visibility into frontier AI development processes, and (3) mechanisms to ensure compliance with safety standards for the development and deployment of frontier AI models. Industry self-regulation is an important first step. However, wider societal discussions and government intervention will be needed to create standards and to ensure compliance with them. We consider several options to this end, including granting enforcement powers to supervisory authorities and licensure regimes for frontier AI models. Finally, we propose an initial set of safety standards.
These include conducting pre-deployment risk assessments; external scrutiny of model behavior; using risk assessments to inform deployment decisions; and monitoring and responding to new information about model capabilities and uses post-deployment. We hope this discussion contributes to the broader conversation on how to balance public safety risks and innovation benefits from advances at the frontier of AI development.

Submitted 7 November, 2023; v1 submitted 6 July, 2023; originally announced July 2023.

Comments: Update July 11th: - Added missing footnote back in. - Adjusted author order (mistakenly non-alphabetical among the first 6 authors) and adjusted affiliations (Jess Whittlestone's affiliation was mistagged and Gillian Hadfield had SRI added to her affiliations) Updated September 4th: Various typos

arXiv:2306.06438 [pdf, other]

doi 10.1103/PhysRevE.108.055211

Measurement of the decay of laser-driven linear plasma wakefields

Authors: J. Jonnerby, A. von Boetticher, J. Holloway, L. Corner, A. Picksley, A. J. Ross, R. J. Shalloo, C. Thornton, N. Bourgeois, R. Walczak, S. M. Hooker

Abstract: We present the first measurements of the temporal decay rate of one-dimensional, linear Langmuir waves excited by an ultra-short laser pulse. Langmuir waves with relative amplitudes of approximately $6\%$ were driven by $1.7$ J, $50$ fs laser pulses in hydrogen and deuterium plasmas of density $n_{e0} = 8.4 \times 10^{17}$ cm$^{-3}$. The wakefield lifetimes were measured to be $\tau^\mathrm{H_2}_\mathrm{wf} = (9\pm2)$ ps and $\tau^\mathrm{D_2}_\mathrm{wf} = (16\pm8)$ ps respectively for hydrogen and deuterium. The experimental results were found to be in good agreement with 2D particle-in-cell simulations. In addition to being of fundamental interest, these results are particularly relevant to the development of laser wakefield accelerators (LWFAs) and wakefield acceleration schemes using multiple pulses, such as multi-pulse laser wakefield accelerators (MP-LWFAs).

Submitted 10 June, 2023; originally announced June 2023.

arXiv:2306.05949 [pdf, other]

Evaluating the Social Impact of Generative AI Systems in Systems and Society

Authors: Irene Solaiman, Zeerak Talat, William Agnew, Lama Ahmad, Dylan Baker, Su Lin Blodgett, Canyu Chen, Hal Daumé III, Jesse Dodge, Isabella Duan, Ellie Evans, Felix Friedrich, Avijit Ghosh, Usman Gohar, Sara Hooker, Yacine Jernite, Ria Kalluri, Alberto Lusoli, Alina Leidinger, Michelle Lin, Xiuzhu Lin, Sasha Luccioni, Jennifer Mickel, Margaret Mitchell, Jessica Newman, et al. (6 additional authors not shown)

Abstract: Generative AI systems across modalities, ranging from text (including code), image, audio, and video, have broad social impacts, but there is no official standard for means of evaluating those impacts or for which impacts should be evaluated. In this paper, we present a guide that moves toward a standard approach in evaluating a base generative AI system for any modality in two overarching categories: what can be evaluated in a base system independent of context and what can be evaluated in a societal context. Importantly, this refers to base systems that have no predetermined application or deployment context, including a model itself, as well as system components, such as training data. Our framework for a base system defines seven categories of social impact: bias, stereotypes, and representational harms; cultural values and sensitive content; disparate performance; privacy and data protection; financial costs; environmental costs; and data and content moderation labor costs. Suggested methods for evaluation apply to listed generative modalities and analyses of the limitations of existing evaluations serve as a starting point for necessary investment in future evaluations. We offer five overarching categories for what can be evaluated in a broader societal context, each with its own subcategories: trustworthiness and autonomy; inequality, marginalization, and violence; concentration of authority; labor and creativity; and ecosystem and environment. Each subcategory includes recommendations for mitigating harm.

Submitted 28 June, 2024; v1 submitted 9 June, 2023; originally announced June 2023.

Comments: Forthcoming in Hacker, Engel, Hammer, Mittelstadt (eds), Oxford Handbook on the Foundations and Regulation of Generative AI. Oxford University Press

arXiv:2305.19268 [pdf, other]

Intriguing Properties of Quantization at Scale

Authors: Arash Ahmadian, Saurabh Dash, Hongyu Chen, Bharat Venkitesh, Stephen Gou, Phil Blunsom, Ahmet Üstün, Sara Hooker

Abstract: Emergent properties have been widely adopted as a term to describe behavior not present in smaller models but observed in larger models. Recent work suggests that the trade-off incurred by quantization is also an emergent property, with sharp drops in performance in models over 6B parameters. In this work, we ask "are quantization cliffs in performance solely a factor of scale?" Against a backdrop of increased research focus on why certain emergent properties surface at scale, this work provides a useful counter-example. We posit that it is possible to optimize for a quantization friendly training recipe that suppresses large activation magnitude outliers. Here, we find that outlier dimensions are not an inherent product of scale, but rather sensitive to the optimization conditions present during pre-training. This both opens up directions for more efficient quantization, and poses the question of whether other emergent properties are inherent or can be altered and conditioned by optimization and architecture design choices. We successfully quantize models ranging in size from 410M to 52B with minimal degradation in performance.

Submitted 30 May, 2023; originally announced May 2023.

Comments: 32 pages, 14 figures

arXiv:2305.16779 [pdf, other]

Demonstration of tunability of HOFI waveguides via start-to-end simulations

Authors: S. M. Mewes, G. J. Boyle, A. Ferran Pousa, R. J. Shalloo, J. Osterhoff, C. Arran, L. Corner, R. Walczak, S. M. Hooker, M. Thévenet

Abstract: In recent years, hydrodynamic optical-field-ionized (HOFI) channels have emerged as a promising technique to create laser waveguides suitable for guiding tightly-focused laser pulses in a plasma, as needed for laser-plasma accelerators. While experimental advances in HOFI channels continue to be made, the underlying mechanisms and the roles of the main parameters remain largely unexplored. In this work, we propose a start-to-end simulation pipeline of the HOFI channel formation and the resulting guiding properties, and use it to explore the underlying physics and the tunability of HOFI channels. This approach is benchmarked against experimental measurements. HOFI channels are shown to feature excellent guiding properties over a wide range of parameters, making them a promising and tunable waveguide option for laser-plasma accelerators.

Submitted 26 May, 2023; originally announced May 2023.

Comments: 8 pages (+5 appendix), 7 figures, submitted to PRResearch

arXiv:2304.12397 [pdf, other]

On the Challenges of Using Black-Box APIs for Toxicity Evaluation in Research

Authors: Luiza Pozzobon, Beyza Ermis, Patrick Lewis, Sara Hooker

Abstract: Perception of toxicity evolves over time and often differs between geographies and cultural backgrounds. Similarly, black-box commercially available APIs for detecting toxicity, such as the Perspective API, are not static, but frequently retrained to address any unattended weaknesses and biases. We evaluate the implications of these changes on the reproducibility of findings that compare the relative merits of models and methods that aim to curb toxicity. Our findings suggest that research that relied on inherited automatic toxicity scores to compare models and techniques may have resulted in inaccurate findings. Rescoring all models from HELM, a widely respected living benchmark, for toxicity with the recent version of the API led to a different ranking of widely used foundation models. We suggest caution in applying apples-to-apples comparisons between studies and lay recommendations for a more structured approach to evaluating toxicity over time. Code and data are available at https://github.com/for-ai/black-box-api-challenges.

Submitted 24 April, 2023; originally announced April 2023.

arXiv:2303.14032 [pdf, other]

doi 10.1103/PhysRevE.108.015204

Stability of the Modulator in a Plasma-Modulated Plasma Accelerator

Authors: Johannes J. van de Wetering, Simon M. Hooker, Roman Walczak

Abstract: We explore the regime of operation of the modulator stage of a recently proposed laser-plasma accelerator scheme [Phys. Rev. Lett. 127, 184801 (2021)], dubbed the Plasma-Modulated Plasma Accelerator (P-MoPA). The P-MoPA scheme offers a potential route to high-repetition-rate, GeV-scale plasma accelerators driven by picosecond-duration laser pulses from, for example, kilohertz thin-disk lasers. The first stage of the P-MoPA scheme is a plasma modulator in which a long, high-energy 'drive' pulse is spectrally modulated by co-propagating in a plasma channel with the low-amplitude plasma wave driven by a short, low-energy 'seed' pulse. The spectrally modulated drive pulse is converted to a train of short pulses, by introducing dispersion, which can resonantly drive a large wakefield in a subsequent accelerator stage with the same on-axis plasma density as the modulator. In this paper we derive the 3D analytic theory for the evolution of the drive pulse in the plasma modulator and show that the spectral modulation is independent of transverse coordinate, which is ideal for compression into a pulse train. We then identify a transverse mode instability (TMI), similar to the TMI observed in optical fiber lasers, which sets limits on the energy of the drive pulse for a given set of laser-plasma parameters. We compare this analytic theory with particle-in-cell (PIC) simulations and find that even higher energy drive pulses can be modulated than those demonstrated in the original proposal.

Submitted 24 March, 2023; originally announced March 2023.

Comments: 8 pages, 5 figures plus supplementary materials

arXiv:2303.07723 [pdf, other]

doi 10.1103/PhysRevE.107.L023201

Modulational instability in large-amplitude linear laser wakefields

Authors: Alexander von Boetticher, Roman Walczak, Simon Hooker

Abstract: We investigate the growth of ion density perturbations in large-amplitude linear laser wakefields via two-dimensional particle-in-cell simulations. Growth rates and wavenumbers are found to be consistent with a longitudinal strong-field modulational instability (SFMI). We examine the transverse dependence of the instability for a Gaussian wakefield envelope and show that growth rates and wavenumbers can be maximised off-axis. On-axis growth rates are found to decrease with increasing ion mass or electron temperature. These results are in close agreement with the dispersion relation of a Langmuir wave with energy density that is large compared to the plasma thermal energy density. The implications for wakefield accelerators, in particular multi-pulse schemes, are discussed.

Submitted 14 March, 2023; originally announced March 2023.

Comments: 6 pages, 4 figures

Journal ref: Physical Review E vol. 107, L023201 (2023)

arXiv:2303.00586 [pdf, other]

FAIR-Ensemble: When Fairness Naturally Emerges From Deep Ensembling

Authors: Wei-Yin Ko, Daniel D'souza, Karina Nguyen, Randall Balestriero, Sara Hooker

Abstract: Ensembling multiple Deep Neural Networks (DNNs) is a simple and effective way to improve top-line metrics and to outperform a larger single model. In this work, we go beyond top-line metrics and instead explore the impact of ensembling on subgroup performances. Surprisingly, we observe that even with a simple homogeneous ensemble -- all the individual DNNs share the same training set, architecture, and design choices -- the minority group performance disproportionately improves with the number of models compared to the majority group, i.e. fairness naturally emerges from ensembling. Even more surprising, we find that this gain keeps occurring even when a large number of models is considered, e.g. $20$, despite the fact that the average performance of the ensemble plateaus with fewer models. Our work establishes that simple DNN ensembles can be a powerful tool for alleviating disparate impact from DNN classifiers, thus curbing algorithmic harm. We also explore why this is the case. We find that even in homogeneous ensembles, varying the sources of stochasticity through parameter initialization, mini-batch sampling, and data-augmentation realizations results in different fairness outcomes.

Submitted 20 December, 2023; v1 submitted 1 March, 2023; originally announced March 2023.

arXiv:2211.02738 [pdf, other]

doi 10.48550/arXiv.2211.02738

Intriguing Properties of Compression on Multilingual Models

Authors: Kelechi Ogueji, Orevaoghene Ahia, Gbemileke Onilude, Sebastian Gehrmann, Sara Hooker, Julia Kreutzer

Abstract: Multilingual models are often particularly dependent on scaling to generalize to a growing number of languages. Compression techniques are widely relied upon to reconcile the growth in model size with real world resource constraints, but compression can have a disparate effect on model performance for low-resource languages. It is thus crucial to understand the trade-offs between scale, multilingualism, and compression. In this work, we propose an experimental framework to characterize the impact of sparsifying multilingual pre-trained language models during fine-tuning. Applying this framework to mBERT named entity recognition models across 40 languages, we find that compression confers several intriguing and previously unknown generalization properties. In contrast to prior findings, we find that compression may improve model robustness over dense models. We additionally observe that under certain sparsification regimes compression may aid, rather than disproportionately impact, the performance of low-resource languages.

Submitted 25 November, 2022; v1 submitted 4 November, 2022; originally announced November 2022.

Comments: Accepted to EMNLP 2022

arXiv:2210.14986 [pdf, other]

The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs

Authors: Laura Ruis, Akbir Khan, Stella Biderman, Sara Hooker, Tim Rocktäschel, Edward Grefenstette

Abstract: Despite widespread use of LLMs as conversational agents, evaluations of performance fail to capture a crucial aspect of communication: interpreting language in context -- incorporating its pragmatics. Humans interpret language using beliefs and prior knowledge about the world. For example, we intuitively understand the response "I wore gloves" to the question "Did you leave fingerprints?" as meaning "No". To investigate whether LLMs have the ability to make this type of inference, known as an implicature, we design a simple task and evaluate four categories of widely used state-of-the-art models. We find that, despite only evaluating on utterances that require a binary inference (yes or no), models in three of these categories perform close to random. However, LLMs instruction-tuned at the example-level perform significantly better. These results suggest that certain fine-tuning strategies are far better at inducing pragmatic understanding in models. We present our findings as the starting point for further research into evaluating how LLMs interpret language in context and to drive the development of more pragmatic and useful models of human discourse.

Submitted 3 December, 2023; v1 submitted 26 October, 2022; originally announced October 2022.

Comments: Accepted as Spotlight at NeurIPS 2023

arXiv:2209.10015 [pdf, other]

Metadata Archaeology: Unearthing Data Subsets by Leveraging Training Dynamics

Authors: Shoaib Ahmed Siddiqui, Nitarshan Rajkumar, Tegan Maharaj, David Krueger, Sara Hooker

Abstract: Modern machine learning research relies on relatively few carefully curated datasets. Even in these datasets, and typically in 'untidy' or raw data, practitioners are faced with significant issues of data quality and diversity which can be prohibitively labor intensive to address. Existing methods for dealing with these challenges tend to make strong assumptions about the particular issues at play, and often require a priori knowledge or metadata such as domain labels. Our work is orthogonal to these methods: we instead focus on providing a unified and efficient framework for Metadata Archaeology -- uncovering and inferring metadata of examples in a dataset. We curate different subsets of data that might exist in a dataset (e.g. mislabeled, atypical, or out-of-distribution examples) using simple transformations, and leverage differences in learning dynamics between these probe suites to infer metadata of interest. Our method is on par with far more sophisticated mitigation methods across different tasks: identifying and correcting mislabeled examples, classifying minority-group samples, prioritizing points relevant for training and enabling scalable human auditing of relevant examples.

Submitted 20 September, 2022; originally announced September 2022.

arXiv:2209.00099 [pdf, other]

Efficient Methods for Natural Language Processing: A Survey

Authors: Marcos Treviso, Ji-Ung Lee, Tianchu Ji, Betty van Aken, Qingqing Cao, Manuel R. Ciosici, Michael Hassid, Kenneth Heafield, Sara Hooker, Colin Raffel, Pedro H. Martins, André F. T. Martins, Jessica Zosa Forde, Peter Milder, Edwin Simpson, Noam Slonim, Jesse Dodge, Emma Strubell, Niranjan Balasubramanian, Leon Derczynski, Iryna Gurevych, Roy Schwartz

Abstract: Recent work in natural language processing (NLP) has yielded appealing results from scaling model parameters and training data; however, using only scale to improve performance means that resource consumption also grows. Such resources include data, time, storage, or energy, all of which are naturally limited and unevenly distributed. This motivates research into efficient methods that require fewer resources to achieve similar results. This survey synthesizes and relates current methods and findings in efficient NLP. We aim both to provide guidance for conducting NLP under limited resources and to point towards promising research directions for developing more efficient methods.

Submitted 24 March, 2023; v1 submitted 31 August, 2022; originally announced September 2022.

Comments: Accepted at TACL, pre publication version

arXiv:2207.00200 [pdf, other]

Studying the impact of magnitude pruning on contrastive learning methods

Authors: Francesco Corti, Rahim Entezari, Sara Hooker, Davide Bacciu, Olga Saukh

Abstract: We study the impact of different pruning techniques on the representation learned by deep neural networks trained with contrastive loss functions. Our work finds that at high sparsity levels, contrastive learning results in a higher number of misclassified examples relative to models trained with traditional cross-entropy loss. To understand this pronounced difference, we use metrics such as the number of PIEs (Hooker et al., 2019), Q-Score (Kalibhat et al., 2022), and PD-Score (Baldock et al., 2021) to measure the impact of pruning on the learned representation quality. Our analysis suggests the schedule of the pruning method implementation matters. We find that the negative impact of sparsity on the quality of the learned representation is the highest when pruning is introduced early on in the training phase.

Submitted 1 July, 2022; originally announced July 2022.

arXiv:2206.06479 [pdf, other]

Robust Distillation for Worst-class Performance

Authors: Serena Wang, Harikrishna Narasimhan, Yichen Zhou, Sara Hooker, Michal Lukasik, Aditya Krishna Menon

Abstract: Knowledge distillation has proven to be an effective technique in improving the performance of a student model using predictions from a teacher model. However, recent work has shown that gains in average efficiency are not uniform across subgroups in the data, and in particular can often come at the cost of accuracy on rare subgroups and classes. To preserve strong performance across classes that may follow a long-tailed distribution, we develop distillation techniques that are tailored to improve the student's worst-class performance. Specifically, we introduce robust optimization objectives in different combinations for the teacher and student, and further allow for training with any tradeoff between the overall accuracy and the robust worst-class objective. We show empirically that our robust distillation techniques not only achieve better worst-class performance, but also lead to Pareto improvement in the tradeoff between overall performance and worst-class performance compared to other baseline methods. Theoretically, we provide insights into what makes a good teacher when the goal is to train a robust student.

Submitted 13 June, 2022; originally announced June 2022.

arXiv:2203.08366 [pdf, other]

Linear colliders based on laser-plasma accelerators

Authors: C. Benedetti, S. S. Bulanov, E. Esarey, C. G. R. Geddes, A. J. Gonsalves, A. Huebl, R. Lehe, K. Nakamura, C. B. Schroeder, D. Terzani, J. van Tilborg, M. Turner, J.-L. Vay, T. Zhou, F. Albert, J. Bromage, E. M. Campbell, D. H. Froula, J. P. Palastro, J. Zuegel, D. Bruhwiler, N. M. Cook, B. Cros, M. C. Downer, M. Fuchs, et al. (18 additional authors not shown)

Abstract: White paper to the Proceedings of the U.S. Particle Physics Community Planning Exercise (Snowmass 2021): Linear colliders based on laser-plasma accelerators

Submitted 4 July, 2022; v1 submitted 15 March, 2022; originally announced March 2022.

Comments: Contribution to Snowmass 2021, Accelerator Frontier

arXiv:2201.07895 [pdf]

doi 10.23731/CYRM-2022-001

European Strategy for Particle Physics -- Accelerator R&D Roadmap

Authors: C. Adolphsen, D. Angal-Kalinin, T. Arndt, M. Arnold, R. Assmann, B. Auchmann, K. Aulenbacher, A. Ballarino, B. Baudouy, P. Baudrenghien, M. Benedikt, S. Bentvelsen, A. Blondel, A. Bogacz, F. Bossi, L. Bottura, S. Bousson, O. Brüning, R. Brinkmann, M. Bruker, O. Brunner, P. N. Burrows, G. Burt, S. Calatroni, K. Cassou, et al. (111 additional authors not shown)

Abstract: The 2020 update of the European Strategy for Particle Physics emphasised the importance of an intensified and well-coordinated programme of accelerator R&D, supporting the design and delivery of future particle accelerators in a timely, affordable and sustainable way. This report sets out a roadmap for European accelerator R&D for the next five to ten years, covering five topical areas identified in the Strategy update. The R&D objectives include: improvement of the performance and cost-performance of magnet and radio-frequency acceleration systems; investigations of the potential of laser / plasma acceleration and energy-recovery linac techniques; and development of new concepts for muon beams and muon colliders. The goal of the roadmap is to document the collective view of the field on the next steps for the R&D programme, and to provide the evidence base to support subsequent decisions on prioritisation, resourcing and implementation.

Submitted 30 March, 2022; v1 submitted 19 January, 2022; originally announced January 2022.

Comments: 270 pages, 58 figures. Editor: N. Mounet. LDG chair: D. Newbold. Panel chairs: P. Védrine (HFM), S. Bousson (RF), R. Assmann (plasma), D. Schulte (muon), M. Klein (ERL). Panel editors: B. Baudouy (HFM), L. Bottura (HFM), S. Bousson (RF), G. Burt (RF), R. Assmann (plasma), E. Gschwendtner (plasma), R. Ischebeck (plasma), C. Rogers (muon), D. Schulte (muon), M. Klein (ERL)

Report number: CERN-2022-001

Journal ref: European Strategy for Particle Physics - Accelerator R&D Roadmap, N. Mounet (ed.), CERN Yellow Reports: Monographs, CERN-2022-001 (CERN, Geneva, 2022)

arXiv:2201.05610 [pdf, other]

When less is more: Simplifying inputs aids neural network understanding

Authors: Robin Tibor Schirrmeister, Rosanne Liu, Sara Hooker, Tonio Ball

Abstract: How do neural network image classifiers respond to simpler and simpler inputs? And what do such responses reveal about the learning process? To answer these questions, we need a clear measure of input simplicity (or inversely, complexity), an optimization objective that correlates with simplification, and a framework to incorporate such an objective into training and inference. Lastly, we need a variety of testbeds to experiment and evaluate the impact of such simplification on learning. In this work, we measure simplicity with the encoding bit size given by a pretrained generative model, and minimize the bit size to simplify inputs in training and inference. We investigate the effect of such simplification in several scenarios: conventional training, dataset condensation and post-hoc explanations. In all settings, inputs are simplified along with the original classification task, and we investigate the trade-off between input simplicity and task performance. For images with injected distractors, such simplification naturally removes superfluous information. For dataset condensation, we find that inputs can be simplified with almost no accuracy degradation. When used in post-hoc explanation, our learning-based simplification approach offers a valuable new tool to explore the basis of network decisions.

Submitted 1 February, 2022; v1 submitted 14 January, 2022; originally announced January 2022.

ACM Class: I.2.6

arXiv:2110.03036 [pdf, other]

The Low-Resource Double Bind: An Empirical Study of Pruning for Low-Resource Machine Translation

Authors: Orevaoghene Ahia, Julia Kreutzer, Sara Hooker

Abstract: A "bigger is better" explosion in the number of parameters in deep neural networks has made it increasingly challenging to make state-of-the-art networks accessible in compute-restricted environments. Compression techniques have taken on renewed importance as a way to bridge the gap. However, evaluation of the trade-offs incurred by popular compression techniques has been centered on high-resource datasets. In this work, we instead consider the impact of compression in a data-limited regime. We introduce the term low-resource double bind to refer to the co-occurrence of data limitations and compute resource constraints. This is a common setting for NLP for low-resource languages, yet the trade-offs in performance are poorly studied. Our work offers surprising insights into the relationship between capacity and generalization in data-limited regimes for the task of machine translation. Our experiments on magnitude pruning for translations from English into Yoruba, Hausa, Igbo and German show that in low-resource regimes, sparsity preserves performance on frequent sentences but has a disparate impact on infrequent ones. However, it improves robustness to out-of-distribution shifts, especially for datasets that are very distinct from the training distribution. Our findings suggest that sparsity can play a beneficial role at curbing memorization of low frequency attributes, and therefore offers a promising solution to the low-resource double bind.

Submitted 6 October, 2021; originally announced October 2021.

Comments: Accepted to Findings of EMNLP 2021

arXiv:2110.00448 [pdf, other]

doi 10.1103/PhysRevAccelBeams.25.011301

Demonstration of kilohertz operation of Hydrodynamic Optical-Field-Ionized Plasma Channels

Authors: A. Alejo, J. Cowley, A. Picksley, R. Walczak, S. M. Hooker

Abstract: We demonstrate experimentally that hydrodynamic optical-field-ionized (HOFI) plasma channels can be generated at kHz-scale pulse repetition rates, in a static gas cell and for an extended period. Using a pump-probe arrangement, we show via transverse interferometry that the properties of two HOFI channels generated 1 ms apart are essentially the same. We demonstrate that HOFI channels can be generated at a mean repetition rate of 0.4 kHz for a period of 6.5 hours without degradation of the channel properties, and we determine the fluctuations in the key optical parameters of the channels in this period. Our results suggest that HOFI and conditioned HOFI channels are well suited for future high-repetition-rate, multi-GeV plasma accelerator stages.

Submitted 3 March, 2022; v1 submitted 1 October, 2021; originally announced October 2021.

Comments: Journal publication can be cited as https://doi.org/10.1103/PhysRevAccelBeams.25.011301 . Raw data can be downloaded from https://doi.org/10.5281/zenodo.6242523 . 8 pages, 4 figures

Journal ref: Phys. Rev. Accel. Beams 25, 011301 (2022)

arXiv:2110.00417 [pdf, other]

doi 10.1103/PhysRevLett.127.184801

GeV-scale accelerators driven by plasma-modulated pulses from kilohertz lasers

Authors: O. Jakobsson, S. M. Hooker, R. Walczak

Abstract: We describe a new approach for driving GeV-scale plasma accelerators with long laser pulses. We show that the temporal phase of a long, high-energy driving laser pulse can be modulated periodically by co-propagating it with a low-amplitude plasma wave driven by a short, low-energy seed pulse. Compression of the modulated driver by a dispersive optic generates a train of short pulses suitable for resonantly driving a plasma accelerator. Modulation of the driver occurs via a well-controlled linear process, as confirmed by good agreement between particle-in-cell (PIC) simulations and an analytic model. PIC simulations demonstrate that a 1.7 J, 1 ps driver and a 140 mJ, 40 fs seed pulse can accelerate electrons to energies of 0.65 GeV in a plasma channel with an axial density of $2.5 \times 10^{17}$ cm$^{-3}$. This work opens a route to high-repetition-rate, GeV-scale plasma accelerators driven by thin-disk lasers, which can provide joule-scale, picosecond-duration laser pulses at multi-kilohertz repetition rates and high wall-plug efficiencies.

Submitted 27 October, 2021; v1 submitted 1 October, 2021; originally announced October 2021.

Comments: 13 pages, 7 figures (including Supplemental Material). Published as a letter by PRL

Journal ref: Physical Review Letters Vol. 127, No. 18 (2021)

arXiv:2107.13098 [pdf, other]

A Tale Of Two Long Tails

Authors: Daniel D'souza, Zach Nussbaum, Chirag Agarwal, Sara Hooker

Abstract: As machine learning models are increasingly employed to assist human decision-makers, it becomes critical to communicate the uncertainty associated with these model predictions. However, the majority of work on uncertainty has focused on traditional probabilistic or ranking approaches - where the model assigns low probabilities or scores to uncertain examples. While this captures what examples are challenging for the model, it does not capture the underlying source of the uncertainty. In this work, we seek to identify examples the model is uncertain about and characterize the source of said uncertainty. We explore the benefits of designing a targeted intervention - targeted data augmentation of the examples where the model is uncertain over the course of training. We investigate whether the rate of learning in the presence of additional information differs between atypical and noisy examples. Our results show that this is indeed the case, suggesting that well-designed interventions over the course of training can be an effective way to characterize and distinguish between different sources of uncertainty.

Submitted 27 July, 2021; originally announced July 2021.

Comments: Preliminary results accepted to Workshop on Uncertainty and Robustness in Deep Learning (UDL), ICML, 2021

arXiv:2107.07741 [pdf, other]

When does loss-based prioritization fail?

Authors: Niel Teng Hu, Xinyu Hu, Rosanne Liu, Sara Hooker, Jason Yosinski

Abstract: Not all examples are created equal, but standard deep neural network training protocols treat each training point uniformly. Each example is propagated forward and backward through the network the same number of times, independent of how much the example contributes to the learning protocol. Recent work has proposed ways to accelerate training by deviating from this uniform treatment. Popular methods entail up-weighting examples that contribute more to the loss with the intuition that examples with low loss have already been learned by the model, so their marginal value to the training procedure should be lower. This view assumes that updating the model with high loss examples will be beneficial to the model. However, this may not hold for noisy, real world data. In this paper, we theorize and then empirically demonstrate that loss-based acceleration methods degrade in scenarios with noisy and corrupted data. Our work suggests measures of example difficulty need to correctly separate out noise from other types of challenging examples.

Submitted 16 July, 2021; originally announced July 2021.

arXiv:2106.11872 [pdf, other]

Randomness In Neural Network Training: Characterizing The Impact of Tooling

Authors: Donglin Zhuang, Xingyao Zhang, Shuaiwen Leon Song, Sara Hooker

Abstract: The quest for determinism in machine learning has disproportionately focused on characterizing the impact of noise introduced by algorithmic design choices. In this work, we address a less well understood and studied question: how does our choice of tooling introduce randomness to deep neural network training? We conduct large scale experiments across different types of hardware, accelerators, state-of-the-art networks, and open-source datasets, to characterize how tooling choices contribute to the level of non-determinism in a system, the impact of said non-determinism, and the cost of eliminating different sources of noise. Our findings are surprising, and suggest that the impact of non-determinism is nuanced. While top-line metrics such as top-1 accuracy are not noticeably impacted, model performance on certain parts of the data distribution is far more sensitive to the introduction of randomness. Our results suggest that deterministic tooling is critical for AI safety. However, we also find that the cost of ensuring determinism varies dramatically between neural network architectures and hardware types, e.g., with overhead up to $746\%$, $241\%$, and $196\%$ on a spectrum of widely used GPU accelerator architectures, relative to non-deterministic training. The source code used in this paper is available at https://github.com/usyd-fsalab/NeuralNetworkRandomness.

Submitted 22 June, 2021; originally announced June 2021.

Comments: 21 pages, 10 figures

arXiv:2102.01670 [pdf, other]

Keep the Gradients Flowing: Using Gradient Flow to Study Sparse Network Optimization

Authors: Kale-ab Tessera, Sara Hooker, Benjamin Rosman

Abstract: Training sparse networks to converge to the same performance as dense neural architectures has proven to be elusive. Recent work suggests that initialization is the key. However, while this direction of research has had some success, focusing on initialization alone appears to be inadequate. In this paper, we take a broader view of training sparse networks and consider the role of regularization, optimization, and architecture choices on sparse models. We propose a simple experimental framework, Same Capacity Sparse vs Dense Comparison (SC-SDC), that allows for a fair comparison of sparse and dense networks. Furthermore, we propose a new measure of gradient flow, Effective Gradient Flow (EGF), that better correlates to performance in sparse networks. Using top-line metrics, SC-SDC and EGF, we show that default choices of optimizers, activation functions and regularizers used for dense networks can disadvantage sparse networks. Based upon these findings, we show that gradient flow in sparse networks can be improved by reconsidering aspects of the architecture design and the training regime. Our work suggests that initialization is only one piece of the puzzle and taking a wider view of tailoring optimization to sparse networks yields promising results.

Submitted 15 June, 2021; v1 submitted 2 February, 2021; originally announced February 2021.
