-
Aya 23: Open Weight Releases to Further Multilingual Progress
Authors:
Viraat Aryabumi,
John Dang,
Dwarak Talupuru,
Saurabh Dash,
David Cairuz,
Hangyu Lin,
Bharat Venkitesh,
Madeline Smith,
Jon Ander Campos,
Yi Chern Tan,
Kelly Marchisio,
Max Bartolo,
Sebastian Ruder,
Acyr Locatelli,
Julia Kreutzer,
Nick Frosst,
Aidan Gomez,
Phil Blunsom,
Marzieh Fadaee,
Ahmet Üstün,
Sara Hooker
Abstract:
This technical report introduces Aya 23, a family of multilingual language models. Aya 23 builds on the recent release of the Aya model (Üstün et al., 2024), focusing on pairing a highly performant pre-trained model with the recently released Aya collection (Singh et al., 2024). The result is a powerful multilingual large language model serving 23 languages, expanding state-of-the-art language modeling capabilities to approximately half of the world's population. The Aya model covered 101 languages whereas Aya 23 is an experiment in depth vs breadth, exploring the impact of allocating more capacity to fewer languages that are included during pre-training. Aya 23 outperforms both previous massively multilingual models like Aya 101 for the languages it covers, as well as widely used models like Gemma, Mistral and Mixtral on an extensive range of discriminative and generative tasks. We release the open weights for both the 8B and 35B models as part of our continued commitment to expanding access to multilingual progress.
Submitted 31 May, 2024; v1 submitted 23 May, 2024;
originally announced May 2024.
-
GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing
Authors:
Tao Yu,
Chien-Sheng Wu,
Xi Victoria Lin,
Bailin Wang,
Yi Chern Tan,
Xinyi Yang,
Dragomir Radev,
Richard Socher,
Caiming Xiong
Abstract:
We present GraPPa, an effective pre-training approach for table semantic parsing that learns a compositional inductive bias in the joint representations of textual and tabular data. We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar (SCFG) induced from existing text-to-SQL datasets. We pre-train our model on the synthetic data using a novel text-schema linking objective that predicts the syntactic role of a table field in the SQL for each question-SQL pair. To maintain the model's ability to represent real-world data, we also include masked language modeling (MLM) over several existing table-and-language datasets to regularize the pre-training process. On four popular fully supervised and weakly supervised table semantic parsing benchmarks, GraPPa significantly outperforms RoBERTa-large as the feature representation layers and establishes new state-of-the-art results on all of them.
Submitted 28 May, 2021; v1 submitted 29 September, 2020;
originally announced September 2020.
-
DART: Open-Domain Structured Data Record to Text Generation
Authors:
Linyong Nan,
Dragomir Radev,
Rui Zhang,
Amrit Rau,
Abhinand Sivaprasad,
Chiachun Hsieh,
Xiangru Tang,
Aadit Vyas,
Neha Verma,
Pranav Krishna,
Yangxiaokang Liu,
Nadia Irwanto,
Jessica Pan,
Faiaz Rahman,
Ahmad Zaidi,
Mutethia Mutuma,
Yasin Tarabar,
Ankit Gupta,
Tao Yu,
Yi Chern Tan,
Xi Victoria Lin,
Caiming Xiong,
Richard Socher,
Nazneen Fatema Rajani
Abstract:
We present DART, an open domain structured DAta Record to Text generation dataset with over 82k instances (DARTs). Data-to-Text annotations can be a costly process, especially when dealing with tables which are the major source of structured data and contain nontrivial structures. To this end, we propose a procedure of extracting semantic triples from tables that encodes their structures by exploiting the semantic dependencies among table headers and the table title. Our dataset construction framework effectively merged heterogeneous sources from open domain semantic parsing and dialogue-act-based meaning representation tasks by utilizing techniques such as: tree ontology annotation, question-answer pair to declarative sentence conversion, and predicate unification, all with minimum post-editing. We present systematic evaluation on DART as well as new state-of-the-art results on WebNLG 2017 to show that DART (1) poses new challenges to existing data-to-text datasets and (2) facilitates out-of-domain generalization. Our data and code can be found at https://github.com/Yale-LILY/dart.
Submitted 12 April, 2021; v1 submitted 6 July, 2020;
originally announced July 2020.
-
ESPRIT: Explaining Solutions to Physical Reasoning Tasks
Authors:
Nazneen Fatema Rajani,
Rui Zhang,
Yi Chern Tan,
Stephan Zheng,
Jeremy Weiss,
Aadit Vyas,
Abhijit Gupta,
Caiming Xiong,
Richard Socher,
Dragomir Radev
Abstract:
Neural networks lack the ability to reason about qualitative physics and so cannot generalize to scenarios and tasks unseen during training. We propose ESPRIT, a framework for commonsense reasoning about qualitative physics in natural language that generates interpretable descriptions of physical events. We use a two-step approach of first identifying the pivotal physical events in an environment and then generating natural language descriptions of those events using a data-to-text approach. Our framework learns to generate explanations of how the physical simulation will causally evolve so that an agent or a human can easily reason about a solution using those interpretable descriptions. Human evaluations indicate that ESPRIT produces crucial fine-grained details and has high coverage of physical concepts compared to even human annotations. Dataset, code and documentation are available at https://github.com/salesforce/esprit.
Submitted 13 May, 2020; v1 submitted 2 May, 2020;
originally announced May 2020.
-
Assessing Social and Intersectional Biases in Contextualized Word Representations
Authors:
Yi Chern Tan,
L. Elisa Celis
Abstract:
Social bias in machine learning has drawn significant attention, with work ranging from demonstrations of bias in a multitude of applications, curating definitions of fairness for different contexts, to developing algorithms to mitigate bias. In natural language processing, gender bias has been shown to exist in context-free word embeddings. Recently, contextual word representations have outperformed word embeddings in several downstream NLP tasks. These word representations are conditioned on their context within a sentence, and can also be used to encode the entire sentence. In this paper, we analyze the extent to which state-of-the-art models for contextual word representations, such as BERT and GPT-2, encode biases with respect to gender, race, and intersectional identities. Towards this, we propose assessing bias at the contextual word level. This novel approach captures the contextual effects of bias missing in context-free word embeddings, yet avoids confounding effects that underestimate bias at the sentence encoding level. We demonstrate evidence of bias at the corpus level, find varying evidence of bias in embedding association tests, show in particular that racial bias is strongly encoded in contextual word models, and observe that bias effects for intersectional minorities are exacerbated beyond their constituent minority identities. Further, evaluating bias effects at the contextual word level captures biases that are not captured at the sentence level, confirming the need for our novel approach.
Submitted 4 November, 2019;
originally announced November 2019.
-
CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases
Authors:
Tao Yu,
Rui Zhang,
He Yang Er,
Suyi Li,
Eric Xue,
Bo Pang,
Xi Victoria Lin,
Yi Chern Tan,
Tianze Shi,
Zihan Li,
Youxuan Jiang,
Michihiro Yasunaga,
Sungrok Shim,
Tao Chen,
Alexander Fabbri,
Zifan Li,
Luyao Chen,
Yuwen Zhang,
Shreya Dixit,
Vincent Zhang,
Caiming Xiong,
Richard Socher,
Walter S Lasecki,
Dragomir Radev
Abstract:
We present CoSQL, a corpus for building cross-domain, general-purpose database (DB) querying dialogue systems. It consists of 30k+ turns plus 10k+ annotated SQL queries, obtained from a Wizard-of-Oz (WOZ) collection of 3k dialogues querying 200 complex DBs spanning 138 domains. Each dialogue simulates a real-world DB query scenario with a crowd worker as a user exploring the DB and a SQL expert retrieving answers with SQL, clarifying ambiguous questions, or otherwise informing of unanswerable questions. When user questions are answerable by SQL, the expert describes the SQL and execution results to the user, hence maintaining a natural interaction flow. CoSQL introduces new challenges compared to existing task-oriented dialogue datasets: (1) the dialogue states are grounded in SQL, a domain-independent executable representation, instead of domain-specific slot-value pairs, and (2) because testing is done on unseen databases, success requires generalizing to new domains. CoSQL includes three tasks: SQL-grounded dialogue state tracking, response generation from query results, and user dialogue act prediction. We evaluate a set of strong baselines for each task and show that CoSQL presents significant challenges for future research. The dataset, baselines, and leaderboard will be released at https://yale-lily.github.io/cosql.
Submitted 11 September, 2019;
originally announced September 2019.
-
SParC: Cross-Domain Semantic Parsing in Context
Authors:
Tao Yu,
Rui Zhang,
Michihiro Yasunaga,
Yi Chern Tan,
Xi Victoria Lin,
Suyi Li,
Heyang Er,
Irene Li,
Bo Pang,
Tao Chen,
Emily Ji,
Shreya Dixit,
David Proctor,
Sungrok Shim,
Jonathan Kraft,
Vincent Zhang,
Caiming Xiong,
Richard Socher,
Dragomir Radev
Abstract:
We present SParC, a dataset for cross-domain Semantic Parsing in Context that consists of 4,298 coherent question sequences (12k+ individual questions annotated with SQL queries). It is obtained from controlled user interactions with 200 complex databases over 138 domains. We provide an in-depth analysis of SParC and show that it introduces new challenges compared to existing datasets. SParC (1) demonstrates complex contextual dependencies, (2) has greater semantic diversity, and (3) requires generalization to unseen domains due to its cross-domain nature and the unseen databases at test time. We experiment with two state-of-the-art text-to-SQL models adapted to the context-dependent, cross-domain setup. The best model obtains an exact match accuracy of 20.2% over all questions and less than 10% over all interaction sequences, indicating that the cross-domain setting and the contextual phenomena of the dataset present significant challenges for future research. The dataset, baselines, and leaderboard are released at https://yale-lily.github.io/sparc.
Submitted 5 June, 2019;
originally announced June 2019.
-
Open Sesame: Getting Inside BERT's Linguistic Knowledge
Authors:
Yongjie Lin,
Yi Chern Tan,
Robert Frank
Abstract:
How and to what extent does BERT encode syntactically-sensitive hierarchical information or positionally-sensitive linear information? Recent work has shown that contextual representations like BERT perform well on tasks that require sensitivity to linguistic structure. We present here two studies which aim to provide a better understanding of the nature of BERT's representations. The first of these focuses on the identification of structurally-defined elements using diagnostic classifiers, while the second explores BERT's representation of subject-verb agreement and anaphor-antecedent dependencies through a quantitative assessment of self-attention vectors. In both cases, we find that BERT encodes positional information about word tokens well on its lower layers, but switches to a hierarchically-oriented encoding on higher layers. We conclude then that BERT's representations do indeed model linguistically relevant aspects of hierarchical structure, though they do not appear to show the sharp sensitivity to hierarchical structure that is found in human processing of reflexive anaphora.
Submitted 4 June, 2019;
originally announced June 2019.
-
Generation and analysis of correlated pairs of photons on board a nanosatellite
Authors:
Zhongkan Tang,
Rakhitha Chandrasekara,
Yue Chuan Tan,
Cliff Cheng,
Luo Sha,
Goh Cher Hiang,
Daniel Oi,
Alexander Ling
Abstract:
Satellites carrying sources of entangled photons could establish a global quantum network, enabling private encryption keys between any two points on Earth. Despite numerous proposals, demonstration of space-based quantum systems has been limited due to the cost of traditional satellites. We are using very small spacecraft to accelerate progress. We report the in-orbit operation of a photon pair source aboard a 1.65 kg nanosatellite and demonstrate pair generation and polarization correlation under space conditions. The in-orbit photon correlations exhibit a contrast of 97 ± 2%, matching ground-based tests. This pathfinding mission overcomes the challenge of demonstrating in-orbit performance for the components of future entangled photon experiments. Ongoing operation establishes the in-orbit lifetime of these critical components. More generally, this demonstrates the ability for nanosatellites to enable faster progress in space-based research.
Submitted 21 March, 2016;
originally announced March 2016.
-
The photon pair source that survived a rocket explosion
Authors:
Zhongkan Tang,
Rakhitha Chandrasekara,
Yue Chuan Tan,
Cliff Cheng,
Kadir Durak,
Alexander Ling
Abstract:
We report on the performance of a compact photon pair source that was recovered intact from a failed space launch. The source had been embedded in a nanosatellite and was designed to perform pathfinder experiments leading to global quantum communication networks using spacecraft. Despite the launch vehicle explosion soon after takeoff, the nanosatellite was successfully retrieved from the accident site and the source within it was found to be fully operational. We describe the assembly technique for the rugged source. Post-recovery data is compared to baseline measurements collected before the launch attempt and no degradation in brightness or polarization correlation was observed. The survival of the source through an extreme environment provides strong evidence that it is possible to engineer rugged quantum optical systems.
Submitted 29 December, 2015;
originally announced December 2015.
-
Space qualified nanosatellite electronics platform for photon pair experiments
Authors:
Cliff Cheng,
Rakhitha Chandrasekara,
Yue Chuan Tan,
Alexander Ling
Abstract:
We report the design and implementation of a complete electronics platform for conducting a quantum optics experiment that will be operated on board a 1U CubeSat (a 10 x 10 x 10 cm satellite). The quantum optics experiment is designed to produce polarization-entangled photon pairs using non-linear optical crystals and requires opto-electronic components such as a pump laser, single photon detectors and liquid crystal based polarization rotators in addition to passive optical elements. The platform provides mechanical support for the optical assembly. It also communicates autonomously with the host satellite to provide experiment data for transmission to a ground station. A limited number of commands can be transmitted from ground to the platform enabling it to switch experimental modes. This platform requires less than 1.5W for all operations, and is space qualified. The implementation of this electronics platform is a major step on the road to operating quantum communication experiments using nanosatellites.
Submitted 24 May, 2015;
originally announced May 2015.
-
Silicon avalanche photodiode operation and lifetime analysis for small satellites
Authors:
Yue Chuan Tan,
Rakhitha Chandrasekara,
Cliff Cheng,
Alexander Ling
Abstract:
Silicon avalanche photodiodes (APDs) are sensitive to operating temperature fluctuations and are also susceptible to radiation flux expected in satellite-based quantum experiments. We introduce a low power voltage adjusting mechanism to overcome the effects of in-orbit temperature fluctuations. We also present data on the performance of Si APDs after irradiation (gamma-ray and proton beam). Combined with an analysis of expected orbital irradiation, we propose that a Si APD in a 400 km equatorial orbit may operate beyond the lifetime of the satellite.
Submitted 28 June, 2013;
originally announced June 2013.