-
PaLM 2 Technical Report
Authors:
Rohan Anil,
Andrew M. Dai,
Orhan Firat,
Melvin Johnson,
Dmitry Lepikhin,
Alexandre Passos,
Siamak Shakeri,
Emanuel Taropa,
Paige Bailey,
Zhifeng Chen,
Eric Chu,
Jonathan H. Clark,
Laurent El Shafey,
Yanping Huang,
Kathy Meier-Hellstern,
Gaurav Mishra,
Erica Moreira,
Mark Omernick,
Kevin Robinson,
Sebastian Ruder,
Yi Tay,
Kefan Xiao,
Yuanzhong Xu,
Yujing Zhang,
Gustavo Hernandez Abrego, et al. (103 additional authors not shown)
Abstract:
We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more efficient inference compared to PaLM. This improved efficiency enables broader deployment while also allowing the model to respond faster, for a more natural pace of interaction. PaLM 2 demonstrates robust reasoning capabilities exemplified by large improvements over PaLM on BIG-Bench and other reasoning tasks. PaLM 2 exhibits stable performance on a suite of responsible AI evaluations, and enables inference-time control over toxicity without additional overhead or impact on other capabilities. Overall, PaLM 2 achieves state-of-the-art performance across a diverse set of tasks and capabilities.
When discussing the PaLM 2 family, it is important to distinguish between pre-trained models (of various sizes), fine-tuned variants of these models, and the user-facing products that use these models. In particular, user-facing products typically include additional pre- and post-processing steps. Additionally, the underlying models may evolve over time. Therefore, one should not expect the performance of user-facing products to exactly match the results reported in this report.
Submitted 13 September, 2023; v1 submitted 17 May, 2023;
originally announced May 2023.
-
Deduplicating Training Data Makes Language Models Better
Authors:
Katherine Lee,
Daphne Ippolito,
Andrew Nystrom,
Chiyuan Zhang,
Douglas Eck,
Chris Callison-Burch,
Nicholas Carlini
Abstract:
We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data. We develop two tools that allow us to deduplicate training datasets -- for example removing from C4 a single 61 word English sentence that is repeated over 60,000 times. Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer train steps to achieve the same or better accuracy. We can also reduce train-test overlap, which affects over 4% of the validation set of standard datasets, thus allowing for more accurate evaluation. We release code for reproducing our work and performing dataset deduplication at https://github.com/google-research/deduplicate-text-datasets.
Submitted 24 March, 2022; v1 submitted 14 July, 2021;
originally announced July 2021.
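To make the notion of near-duplicate examples concrete, here is a minimal sketch that drops documents whose word-trigram sets overlap heavily with an already kept document. It is an illustration only and not the tooling released with the paper: the function names, the trigram shingling, and the 0.8 Jaccard threshold are all assumptions chosen for the example, and the pairwise comparison is quadratic, so it would not scale to a corpus like C4.

```python
def shingles(text, n=3):
    """Return the set of lowercase word n-grams in text (hypothetical helper)."""
    words = text.lower().split()
    if len(words) < n:
        # Short texts are represented by a single shingle of all their words.
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def drop_near_duplicates(docs, threshold=0.8, n=3):
    """Keep each document only if it is not too similar to one already kept.

    The threshold is an assumed value for illustration, not one from the paper.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, n)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```

For example, calling `drop_near_duplicates` on a list containing the same sentence twice (differing only in capitalization) plus one unrelated sentence keeps the first copy and the unrelated sentence.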
-
Stabilizing Generative Adversarial Networks: A Survey
Authors:
Maciej Wiatrak,
Stefano V. Albrecht,
Andrew Nystrom
Abstract:
Generative Adversarial Networks (GANs) are a type of generative model which have received much attention due to their ability to model complex real-world data. Despite their recent successes, the process of training GANs remains challenging, suffering from instability problems such as non-convergence, vanishing or exploding gradients, and mode collapse. In recent years, a diverse set of approaches have been proposed which focus on stabilizing the GAN training procedure. The purpose of this survey is to provide a comprehensive overview of the GAN training stabilization methods which can be found in the literature. We discuss the advantages and disadvantages of each approach, offer a comparative summary, and conclude with a discussion of open problems.
Submitted 24 March, 2020; v1 submitted 29 September, 2019;
originally announced October 2019.
-
Leveraging Sparsity to Speed Up Polynomial Feature Expansions of CSR Matrices Using $K$-Simplex Numbers
Authors:
Andrew Nystrom,
John Hughes
Abstract:
An algorithm is provided for performing polynomial feature expansions that both operates on and produces compressed sparse row (CSR) matrices. Previously, no such algorithm existed, and performing polynomial expansions on CSR matrices required an intermediate densification step. The algorithm performs a $K$-degree expansion by using a bijective function involving $K$-simplex numbers of column indices in the original matrix to column indices in the expanded matrix. Not only is space saved by operating in CSR format, but the bijective function allows for only the nonzero elements to be iterated over and multiplied together during the expansion, greatly improving average time complexity. For a vector of dimensionality $D$ and density $0 \le d \le 1$, the algorithm has average time complexity $\Theta(d^K D^K)$ where $K$ is the polynomial-feature order; this is an improvement by a factor $d^K$ over the standard method. This work derives the required function for the cases of $K=2$ and $K=3$ and shows its use in the $K=2$ algorithm.
Submitted 10 September, 2018; v1 submitted 16 March, 2018;
originally announced March 2018.
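The $K=2$ idea in the abstract above can be sketched as follows: iterate only over pairs of nonzero entries within each CSR row, and map each column pair to a column of the expanded matrix with a triangular-number formula. This is a sketch under stated assumptions, not the paper's exact function: the names `pair_to_column` and `expand_degree_2` are hypothetical, squared terms are included (pairs with $i \le j$), the CSR arrays are plain Python lists, and column indices within each row are assumed to be sorted in ascending order.

```python
def pair_to_column(i, j, num_cols):
    """Map a column pair (i <= j) to a unique column of the expanded matrix.

    The pairs are laid out like the upper triangle of a num_cols x num_cols
    matrix in row-major order; i * (i - 1) // 2 is a triangular number.
    """
    return i * num_cols - i * (i - 1) // 2 + (j - i)


def expand_degree_2(data, indices, indptr, num_cols):
    """Second-degree expansion of a CSR matrix, touching only nonzeros.

    Assumes column indices inside each row are sorted ascending.
    Returns (data, indices, indptr, expanded_num_cols).
    """
    out_data, out_indices, out_indptr = [], [], [0]
    for row in range(len(indptr) - 1):
        start, end = indptr[row], indptr[row + 1]
        for a in range(start, end):
            for b in range(a, end):
                # Product of two nonzeros; zero entries are never visited.
                out_data.append(data[a] * data[b])
                out_indices.append(
                    pair_to_column(indices[a], indices[b], num_cols)
                )
        out_indptr.append(len(out_data))
    return out_data, out_indices, out_indptr, num_cols * (num_cols + 1) // 2
```

Because each row with $n$ nonzeros produces $n(n+1)/2$ products, the work scales with the number of nonzero pairs rather than with $D^2$, which is where the density factor in the stated complexity comes from.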