-
Diversity in Evolutionary Dynamics
Authors:
Yuval Rabani,
Leonard J. Schulman,
Alistair Sinclair
Abstract:
We consider the dynamics imposed by natural selection on the populations of two competing, sexually reproducing, haploid species. In this setting, the fitness of any genome varies over time due to the changing population mix of the competing species; crucially, this fitness variation arises naturally from the model itself, without the need for imposing it exogenously as is typically the case. Prev…
▽ More
We consider the dynamics imposed by natural selection on the populations of two competing, sexually reproducing, haploid species. In this setting, the fitness of any genome varies over time due to the changing population mix of the competing species; crucially, this fitness variation arises naturally from the model itself, without the need for imposing it exogenously as is typically the case. Previous work on this model [14] showed that, in the special case where each of the two species exhibits just two phenotypes, genetic diversity is maintained at all times. This finding supported the tenet that sexual reproduction is advantageous because it promotes diversity, which increases the survivability of a species.
In the present paper we consider the more realistic case where there are more than two phenotypes available to each species. The conclusions about diversity in general turn out to be very different from the two-phenotype case.
Our first result is negative: namely, we show that sexual reproduction does not guarantee the maintenance of diversity at all times, i.e., the result of [14] does not generalize. Our counterexample consists of two competing species with just three phenotypes each. We show that, for any time~$t_0$ and any $\varepsilon>0$, there is a time $t\ge t_0$ at which the combined diversity of both species is smaller than~$\varepsilon$. Our main result is a complementary positive statement, which says that in any non-degenerate example, diversity is maintained in a weaker, "infinitely often" sense.
Thus, our results refute the supposition that sexual reproduction ensures diversity at all times, but affirm a weaker assertion that extended periods of high diversity are necessarily a recurrent event.
△ Less
Submitted 5 July, 2024; v1 submitted 6 June, 2024;
originally announced June 2024.
-
Identifiability of Product of Experts Models
Authors:
Spencer L. Gordon,
Manav Kant,
Eric Ma,
Leonard J. Schulman,
Andrei Staicu
Abstract:
Product of experts (PoE) are layered networks in which the value at each node is an AND (or product) of the values (possibly negated) at its inputs. These were introduced as a neural network architecture that can efficiently learn to generate high-dimensional data which satisfy many low-dimensional constraints -- thereby allowing each individual expert to perform a simple task. PoEs have found a v…
▽ More
Product of experts (PoE) are layered networks in which the value at each node is an AND (or product) of the values (possibly negated) at its inputs. These were introduced as a neural network architecture that can efficiently learn to generate high-dimensional data which satisfy many low-dimensional constraints -- thereby allowing each individual expert to perform a simple task. PoEs have found a variety of applications in learning.
We study the problem of identifiability of a product of experts model having a layer of binary latent variables, and a layer of binary observables that are iid conditional on the latents. The previous best upper bound on the number of observables needed to identify the model was exponential in the number of parameters. We show: (a) When the latents are uniformly distributed, the model is identifiable with a number of observables equal to the number of parameters (and hence best possible). (b) In the more general case of arbitrarily distributed latents, the model is identifiable for a number of observables that is still linear in the number of parameters (and within a factor of two of best-possible). The proofs rely on root interlacing phenomena for some special three-term recurrences.
△ Less
Submitted 13 October, 2023;
originally announced October 2023.
-
Identification of Mixtures of Discrete Product Distributions in Near-Optimal Sample and Time Complexity
Authors:
Spencer L. Gordon,
Erik Jahn,
Bijan Mazaheri,
Yuval Rabani,
Leonard J. Schulman
Abstract:
We consider the problem of identifying, from statistics, a distribution of discrete random variables $X_1,\ldots,X_n$ that is a mixture of $k$ product distributions. The best previous sample complexity for $n \in O(k)$ was $(1/ζ)^{O(k^2 \log k)}$ (under a mild separation assumption parameterized by $ζ$). The best known lower bound was $\exp(Ω(k))$. It is known that $n\geq 2k-1$ is necessary and su…
▽ More
We consider the problem of identifying, from statistics, a distribution of discrete random variables $X_1,\ldots,X_n$ that is a mixture of $k$ product distributions. The best previous sample complexity for $n \in O(k)$ was $(1/ζ)^{O(k^2 \log k)}$ (under a mild separation assumption parameterized by $ζ$). The best known lower bound was $\exp(Ω(k))$. It is known that $n\geq 2k-1$ is necessary and sufficient for identification. We show, for any $n\geq 2k-1$, how to achieve sample complexity and run-time complexity $(1/ζ)^{O(k)}$. We also extend the known lower bound of $e^{Ω(k)}$ to match our upper bound across a broad range of $ζ$. Our results are obtained by combining (a) a classic method for robust tensor decomposition, (b) a novel way of bounding the condition number of key matrices called Hadamard extensions, by studying their action only on flattened rank-1 tensors.
△ Less
Submitted 25 September, 2023;
originally announced September 2023.
-
Let's Verify Step by Step
Authors:
Hunter Lightman,
Vineet Kosaraju,
Yura Burda,
Harri Edwards,
Bowen Baker,
Teddy Lee,
Jan Leike,
John Schulman,
Ilya Sutskever,
Karl Cobbe
Abstract:
In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning ste…
▽ More
In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare the both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
△ Less
Submitted 31 May, 2023;
originally announced May 2023.
-
GPT-4 Technical Report
Authors:
OpenAI,
Josh Achiam,
Steven Adler,
Sandhini Agarwal,
Lama Ahmad,
Ilge Akkaya,
Florencia Leoni Aleman,
Diogo Almeida,
Janko Altenschmidt,
Sam Altman,
Shyamal Anadkat,
Red Avila,
Igor Babuschkin,
Suchir Balaji,
Valerie Balcom,
Paul Baltescu,
Haiming Bao,
Mohammad Bavarian,
Jeff Belgum,
Irwan Bello,
Jake Berdine,
Gabriel Bernadett-Shapiro,
Christopher Berner,
Lenny Bogdonoff,
Oleg Boiko
, et al. (256 additional authors not shown)
Abstract:
We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based mo…
▽ More
We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.
△ Less
Submitted 4 March, 2024; v1 submitted 15 March, 2023;
originally announced March 2023.
-
Scaling laws for single-agent reinforcement learning
Authors:
Jacob Hilton,
Jie Tang,
John Schulman
Abstract:
Recent work has shown that, in generative modeling, cross-entropy loss improves smoothly with model size and training compute, following a power law plus constant scaling law. One challenge in extending these results to reinforcement learning is that the main performance objective of interest, mean episode return, need not vary smoothly. To overcome this, we introduce *intrinsic performance*, a mo…
▽ More
Recent work has shown that, in generative modeling, cross-entropy loss improves smoothly with model size and training compute, following a power law plus constant scaling law. One challenge in extending these results to reinforcement learning is that the main performance objective of interest, mean episode return, need not vary smoothly. To overcome this, we introduce *intrinsic performance*, a monotonic function of the return defined as the minimum compute required to achieve the given return across a family of models of different sizes. We find that, across a range of environments, intrinsic performance scales as a power law in model size and environment interactions. Consequently, as in generative modeling, the optimal model size scales as a power law in the training compute budget. Furthermore, we study how this relationship varies with the environment and with other properties of the training setup. In particular, using a toy MNIST-based environment, we show that varying the "horizon length" of the task mostly changes the coefficient but not the exponent of this relationship.
△ Less
Submitted 18 February, 2023; v1 submitted 31 January, 2023;
originally announced January 2023.
-
Scaling Laws for Reward Model Overoptimization
Authors:
Leo Gao,
John Schulman,
Jacob Hilton
Abstract:
In reinforcement learning from human feedback, it is common to optimize against a reward model trained to predict human preferences. Because the reward model is an imperfect proxy, optimizing its value too much can hinder ground truth performance, in accordance with Goodhart's law. This effect has been frequently observed, but not carefully measured due to the expense of collecting human preferenc…
▽ More
In reinforcement learning from human feedback, it is common to optimize against a reward model trained to predict human preferences. Because the reward model is an imperfect proxy, optimizing its value too much can hinder ground truth performance, in accordance with Goodhart's law. This effect has been frequently observed, but not carefully measured due to the expense of collecting human preference data. In this work, we use a synthetic setup in which a fixed "gold-standard" reward model plays the role of humans, providing labels used to train a proxy reward model. We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-$n$ sampling. We find that this relationship follows a different functional form depending on the method of optimization, and that in both cases its coefficients scale smoothly with the number of reward model parameters. We also study the effect on this relationship of the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup. We explore the implications of these empirical results for theoretical considerations in AI alignment.
△ Less
Submitted 19 October, 2022;
originally announced October 2022.
-
Efficient Training of Language Models to Fill in the Middle
Authors:
Mohammad Bavarian,
Heewoo Jun,
Nikolas Tezak,
John Schulman,
Christine McLeavey,
Jerry Tworek,
Mark Chen
Abstract:
We show that autoregressive language models can learn to infill text after we apply a straightforward transformation to the dataset, which simply moves a span of text from the middle of a document to its end. While this data augmentation has garnered much interest in recent years, we provide extensive evidence that training models with a large fraction of data transformed in this way does not harm…
▽ More
We show that autoregressive language models can learn to infill text after we apply a straightforward transformation to the dataset, which simply moves a span of text from the middle of a document to its end. While this data augmentation has garnered much interest in recent years, we provide extensive evidence that training models with a large fraction of data transformed in this way does not harm the original left-to-right generative capability, as measured by perplexity and sampling evaluations across a wide range of scales. Given the usefulness, simplicity, and efficiency of training models to fill-in-the-middle (FIM), we suggest that future autoregressive language models be trained with FIM by default. To this end, we run a series of ablations on key hyperparameters, such as the data transformation frequency, the structure of the transformation, and the method of selecting the infill span. We use these ablations to prescribe strong default settings and best practices to train FIM models. We have released our best infilling model trained with best practices in our API, and release our infilling benchmarks to aid future research.
△ Less
Submitted 28 July, 2022;
originally announced July 2022.
-
Training language models to follow instructions with human feedback
Authors:
Long Ouyang,
Jeff Wu,
Xu Jiang,
Diogo Almeida,
Carroll L. Wainwright,
Pamela Mishkin,
Chong Zhang,
Sandhini Agarwal,
Katarina Slama,
Alex Ray,
John Schulman,
Jacob Hilton,
Fraser Kelton,
Luke Miller,
Maddie Simens,
Amanda Askell,
Peter Welinder,
Paul Christiano,
Jan Leike,
Ryan Lowe
Abstract:
Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning wi…
▽ More
Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
△ Less
Submitted 4 March, 2022;
originally announced March 2022.
-
Causal Inference Despite Limited Global Confounding via Mixture Models
Authors:
Spencer L. Gordon,
Bijan Mazaheri,
Yuval Rabani,
Leonard J. Schulman
Abstract:
A Bayesian Network is a directed acyclic graph (DAG) on a set of $n$ random variables (the vertices); a Bayesian Network Distribution (BND) is a probability distribution on the random variables that is Markovian on the graph. A finite $k$-mixture of such models is graphically represented by a larger graph which has an additional ``hidden'' (or ``latent'') random variable $U$, ranging in…
▽ More
A Bayesian Network is a directed acyclic graph (DAG) on a set of $n$ random variables (the vertices); a Bayesian Network Distribution (BND) is a probability distribution on the random variables that is Markovian on the graph. A finite $k$-mixture of such models is graphically represented by a larger graph which has an additional ``hidden'' (or ``latent'') random variable $U$, ranging in $\{1,\ldots,k\}$, and a directed edge from $U$ to every other vertex. Models of this type are fundamental to causal inference, where $U$ models an unobserved confounding effect of multiple populations, obscuring the causal relationships in the observable DAG. By solving the mixture problem and recovering the joint probability distribution with $U$, traditionally unidentifiable causal relationships become identifiable. Using a reduction to the more well-studied ``product'' case on empty graphs, we give the first algorithm to learn mixtures of non-empty DAGs.
△ Less
Submitted 31 May, 2023; v1 submitted 21 December, 2021;
originally announced December 2021.
-
WebGPT: Browser-assisted question-answering with human feedback
Authors:
Reiichiro Nakano,
Jacob Hilton,
Suchir Balaji,
Jeff Wu,
Long Ouyang,
Christina Kim,
Christopher Hesse,
Shantanu Jain,
Vineet Kosaraju,
William Saunders,
Xu Jiang,
Karl Cobbe,
Tyna Eloundou,
Gretchen Krueger,
Kevin Button,
Matthew Knight,
Benjamin Chess,
John Schulman
Abstract:
We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must coll…
▽ More
We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers. We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model's answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.
△ Less
Submitted 1 June, 2022; v1 submitted 17 December, 2021;
originally announced December 2021.
-
Training Verifiers to Solve Math Word Problems
Authors:
Karl Cobbe,
Vineet Kosaraju,
Mohammad Bavarian,
Mark Chen,
Heewoo Jun,
Lukasz Kaiser,
Matthias Plappert,
Jerry Tworek,
Jacob Hilton,
Reiichiro Nakano,
Christopher Hesse,
John Schulman
Abstract:
State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high tes…
▽ More
State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution. To increase performance, we propose training verifiers to judge the correctness of model completions. At test time, we generate many candidate solutions and select the one ranked highest by the verifier. We demonstrate that verification significantly improves performance on GSM8K, and we provide strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.
△ Less
Submitted 17 November, 2021; v1 submitted 27 October, 2021;
originally announced October 2021.
-
Batch size-invariance for policy optimization
Authors:
Jacob Hilton,
Karl Cobbe,
John Schulman
Abstract:
We say an algorithm is batch size-invariant if changes to the batch size can largely be compensated for by changes to other hyperparameters. Stochastic gradient descent is well-known to have this property at small batch sizes, via the learning rate. However, some policy optimization algorithms (such as PPO) do not have this property, because of how they control the size of policy updates. In this…
▽ More
We say an algorithm is batch size-invariant if changes to the batch size can largely be compensated for by changes to other hyperparameters. Stochastic gradient descent is well-known to have this property at small batch sizes, via the learning rate. However, some policy optimization algorithms (such as PPO) do not have this property, because of how they control the size of policy updates. In this work we show how to make these algorithms batch size-invariant. Our key insight is to decouple the proximal policy (used for controlling policy updates) from the behavior policy (used for off-policy corrections). Our experiments help explain why these algorithms work, and additionally show how they can make more efficient use of stale data.
△ Less
Submitted 24 September, 2022; v1 submitted 1 October, 2021;
originally announced October 2021.
-
Unsolved Problems in ML Safety
Authors:
Dan Hendrycks,
Nicholas Carlini,
John Schulman,
Jacob Steinhardt
Abstract:
Machine learning (ML) systems are rapidly increasing in size, are acquiring new capabilities, and are increasingly deployed in high-stakes settings. As with other powerful technologies, safety for ML should be a leading research priority. In response to emerging safety challenges in ML, such as those introduced by recent large-scale models, we provide a new roadmap for ML Safety and refine the tec…
▽ More
Machine learning (ML) systems are rapidly increasing in size, are acquiring new capabilities, and are increasingly deployed in high-stakes settings. As with other powerful technologies, safety for ML should be a leading research priority. In response to emerging safety challenges in ML, such as those introduced by recent large-scale models, we provide a new roadmap for ML Safety and refine the technical problems that the field needs to address. We present four problems ready for research, namely withstanding hazards ("Robustness"), identifying hazards ("Monitoring"), reducing inherent model hazards ("Alignment"), and reducing systemic hazards ("Systemic Safety"). Throughout, we clarify each problem's motivation and provide concrete research directions.
△ Less
Submitted 16 June, 2022; v1 submitted 28 September, 2021;
originally announced September 2021.
-
A Refined Approximation for Euclidean k-Means
Authors:
Fabrizio Grandoni,
Rafail Ostrovsky,
Yuval Rabani,
Leonard J. Schulman,
Rakesh Venkat
Abstract:
In the Euclidean $k$-Means problem we are given a collection of $n$ points $D$ in an Euclidean space and a positive integer $k$. Our goal is to identify a collection of $k$ points in the same space (centers) so as to minimize the sum of the squared Euclidean distances between each point in $D$ and the closest center. This problem is known to be APX-hard and the current best approximation ratio is…
▽ More
In the Euclidean $k$-Means problem we are given a collection of $n$ points $D$ in an Euclidean space and a positive integer $k$. Our goal is to identify a collection of $k$ points in the same space (centers) so as to minimize the sum of the squared Euclidean distances between each point in $D$ and the closest center. This problem is known to be APX-hard and the current best approximation ratio is a primal-dual $6.357$ approximation based on a standard LP for the problem [Ahmadian et al. FOCS'17, SICOMP'20].
In this note we show how a minor modification of Ahmadian et al.'s analysis leads to a slightly improved $6.12903$ approximation. As a related result, we also show that the mentioned LP has integrality gap at least $\frac{16+\sqrt{5}}{15}>1.2157$.
△ Less
Submitted 20 September, 2021; v1 submitted 15 July, 2021;
originally announced July 2021.
-
Measuring Sample Efficiency and Generalization in Reinforcement Learning Benchmarks: NeurIPS 2020 Procgen Benchmark
Authors:
Sharada Mohanty,
Jyotish Poonganam,
Adrien Gaidon,
Andrey Kolobov,
Blake Wulfe,
Dipam Chakraborty,
Gražvydas Šemetulskis,
João Schapke,
Jonas Kubilius,
Jurgis Pašukonis,
Linas Klimas,
Matthew Hausknecht,
Patrick MacAlpine,
Quang Nhat Tran,
Thomas Tumiel,
Xiaocheng Tang,
Xinwei Chen,
Christopher Hesse,
Jacob Hilton,
William Hebgen Guss,
Sahika Genc,
John Schulman,
Karl Cobbe
Abstract:
The NeurIPS 2020 Procgen Competition was designed as a centralized benchmark with clearly defined tasks for measuring Sample Efficiency and Generalization in Reinforcement Learning. Generalization remains one of the most fundamental challenges in deep reinforcement learning, and yet we do not have enough benchmarks to measure the progress of the community on Generalization in Reinforcement Learnin…
▽ More
The NeurIPS 2020 Procgen Competition was designed as a centralized benchmark with clearly defined tasks for measuring Sample Efficiency and Generalization in Reinforcement Learning. Generalization remains one of the most fundamental challenges in deep reinforcement learning, and yet we do not have enough benchmarks to measure the progress of the community on Generalization in Reinforcement Learning. We present the design of a centralized benchmark for Reinforcement Learning which can help measure Sample Efficiency and Generalization in Reinforcement Learning by doing end to end evaluation of the training and rollout phases of thousands of user submitted code bases in a scalable way. We designed the benchmark on top of the already existing Procgen Benchmark by defining clear tasks and standardizing the end to end evaluation setups. The design aims to maximize the flexibility available for researchers who wish to design future iterations of such benchmarks, and yet imposes necessary practical constraints to allow for a system like this to scale. This paper presents the competition setup and the details and analysis of the top solutions identified through this setup in context of 2020 iteration of the competition at NeurIPS.
△ Less
Submitted 29 March, 2021;
originally announced March 2021.
-
Hadamard Extensions and the Identification of Mixtures of Product Distributions
Authors:
Spencer L. Gordon,
Leonard J. Schulman
Abstract:
The Hadamard Extension of a matrix is the matrix consisting of all Hadamard products of subsets of its rows. This construction arises in the context of identifying a mixture of product distributions on binary random variables: full column rank of such extensions is a necessary ingredient of identification algorithms. We provide several results concerning when a Hadamard Extension has full column r…
▽ More
The Hadamard Extension of a matrix is the matrix consisting of all Hadamard products of subsets of its rows. This construction arises in the context of identifying a mixture of product distributions on binary random variables: full column rank of such extensions is a necessary ingredient of identification algorithms. We provide several results concerning when a Hadamard Extension has full column rank.
△ Less
Submitted 12 February, 2021; v1 submitted 27 January, 2021;
originally announced January 2021.
-
The MineRL 2020 Competition on Sample Efficient Reinforcement Learning using Human Priors
Authors:
William H. Guss,
Mario Ynocente Castro,
Sam Devlin,
Brandon Houghton,
Noboru Sean Kuno,
Crissman Loomis,
Stephanie Milani,
Sharada Mohanty,
Keisuke Nakata,
Ruslan Salakhutdinov,
John Schulman,
Shinya Shiroshita,
Nicholay Topin,
Avinash Ummadisingu,
Oriol Vinyals
Abstract:
Although deep reinforcement learning has led to breakthroughs in many difficult domains, these successes have required an ever-increasing number of samples, affording only a shrinking segment of the AI community access to their development. Resolution of these limitations requires new, sample-efficient methods. To facilitate research in this direction, we propose this second iteration of the MineR…
▽ More
Although deep reinforcement learning has led to breakthroughs in many difficult domains, these successes have required an ever-increasing number of samples, affording only a shrinking segment of the AI community access to their development. Resolution of these limitations requires new, sample-efficient methods. To facilitate research in this direction, we propose this second iteration of the MineRL Competition. The primary goal of the competition is to foster the development of algorithms which can efficiently leverage human demonstrations to drastically reduce the number of samples needed to solve complex, hierarchical, and sparse environments. To that end, participants compete under a limited environment sample-complexity budget to develop systems which solve the MineRL ObtainDiamond task in Minecraft, a sequential decision making environment requiring long-term planning, hierarchical control, and efficient exploration methods. The competition is structured into two rounds in which competitors are provided several paired versions of the dataset and environment with different game textures and shaders. At the end of each round, competitors submit containerized versions of their learning algorithms to the AIcrowd platform where they are trained from scratch on a hold-out dataset-environment pair for a total of 4-days on a pre-specified hardware platform. In this follow-up iteration to the NeurIPS 2019 MineRL Competition, we implement new features to expand the scale and reach of the competition. In response to the feedback of the previous participants, we introduce a second minor track focusing on solutions without access to environment interactions of any kind except during test-time. Further we aim to prompt domain agnostic submissions by implementing several novel competition mechanics including action-space randomization and desemantization of observations and actions.
△ Less
Submitted 26 January, 2021;
originally announced January 2021.
-
Source Identification for Mixtures of Product Distributions
Authors:
Spencer L. Gordon,
Bijan Mazaheri,
Yuval Rabani,
Leonard J. Schulman
Abstract:
We give an algorithm for source identification of a mixture of $k$ product distributions on $n$ bits. This is a fundamental problem in machine learning with many applications. Our algorithm identifies the source parameters of an identifiable mixture, given, as input, approximate values of multilinear moments (derived, for instance, from a sufficiently large sample), using $2^{O(k^2)} n^{O(k)}$ ari…
▽ More
We give an algorithm for source identification of a mixture of $k$ product distributions on $n$ bits. This is a fundamental problem in machine learning with many applications. Our algorithm identifies the source parameters of an identifiable mixture, given, as input, approximate values of multilinear moments (derived, for instance, from a sufficiently large sample), using $2^{O(k^2)} n^{O(k)}$ arithmetic operations. Our result is the first explicit bound on the computational complexity of source identification of such mixtures. The running time improves previous results by Feldman, O'Donnell, and Servedio (FOCS 2005) and Chen and Moitra (STOC 2019) that guaranteed only learning the mixture (without parametric identification of the source). Our analysis gives a quantitative version of a qualitative characterization of identifiable sources that is due to Tahmasebi, Motahari, and Maddah-Ali (ISIT 2018).
△ Less
Submitted 28 December, 2020;
originally announced December 2020.
-
Scaling Laws for Autoregressive Generative Modeling
Authors:
Tom Henighan,
Jared Kaplan,
Mor Katz,
Mark Chen,
Christopher Hesse,
Jacob Jackson,
Heewoo Jun,
Tom B. Brown,
Prafulla Dhariwal,
Scott Gray,
Chris Hallacy,
Benjamin Mann,
Alec Radford,
Aditya Ramesh,
Nick Ryder,
Daniel M. Ziegler,
John Schulman,
Dario Amodei,
Sam McCandlish
Abstract:
We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image$\leftrightarrow$text models, and mathematical problem solving. In all cases autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law. The optimal model size also depe…
▽ More
We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image$\leftrightarrow$text models, and mathematical problem solving. In all cases autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law. The optimal model size also depends on the compute budget through a power-law, with exponents that are nearly universal across all data domains.
The cross-entropy loss has an information theoretic interpretation as $S($True$) + D_{\mathrm{KL}}($True$||$Model$)$, and the empirical scaling laws suggest a prediction for both the true data distribution's entropy and the KL divergence between the true and model distributions. With this interpretation, billion-parameter Transformers are nearly perfect models of the YFCC100M image distribution downsampled to an $8\times 8$ resolution, and we can forecast the model size needed to achieve any given reducible loss (ie $D_{\mathrm{KL}}$) in nats/image for other resolutions.
We find a number of additional scaling laws in specific domains: (a) we identify a scaling relation for the mutual information between captions and images in multimodal models, and show how to answer the question "Is a picture worth a thousand words?"; (b) in the case of mathematical problem solving, we identify scaling laws for model performance when extrapolating beyond the training distribution; (c) we finetune generative image models for ImageNet classification and find smooth scaling of the classification loss and error rate, even as the generative loss levels off. Taken together, these results strengthen the case that scaling laws have important implications for neural network performance, including on downstream tasks.
△ Less
Submitted 5 November, 2020; v1 submitted 27 October, 2020;
originally announced October 2020.
-
Phasic Policy Gradient
Authors:
Karl Cobbe,
Jacob Hilton,
Oleg Klimov,
John Schulman
Abstract:
We introduce Phasic Policy Gradient (PPG), a reinforcement learning framework which modifies traditional on-policy actor-critic methods by separating policy and value function training into distinct phases. In prior methods, one must choose between using a shared network or separate networks to represent the policy and value function. Using separate networks avoids interference between objectives,…
▽ More
We introduce Phasic Policy Gradient (PPG), a reinforcement learning framework which modifies traditional on-policy actor-critic methods by separating policy and value function training into distinct phases. In prior methods, one must choose between using a shared network or separate networks to represent the policy and value function. Using separate networks avoids interference between objectives, while using a shared network allows useful features to be shared. PPG is able to achieve the best of both worlds by splitting optimization into two phases, one that advances training and one that distills features. PPG also enables the value function to be more aggressively optimized with a higher level of sample reuse. Compared to PPO, we find that PPG significantly improves sample efficiency on the challenging Procgen Benchmark.
△ Less
Submitted 9 September, 2020;
originally announced September 2020.
-
The Sparse Hausdorff Moment Problem, with Application to Topic Models
Authors:
Spencer Gordon,
Bijan Mazaheri,
Leonard J. Schulman,
Yuval Rabani
Abstract:
We consider the problem of identifying, from its first $m$ noisy moments, a probability distribution on $[0,1]$ of support $k<\infty$. This is equivalent to the problem of learning a distribution on $m$ observable binary random variables $X_1,X_2,\dots,X_m$ that are iid conditional on a hidden random variable $U$ taking values in $\{1,2,\dots,k\}$. Our focus is on accomplishing this with $m=2k$, w…
▽ More
We consider the problem of identifying, from its first $m$ noisy moments, a probability distribution on $[0,1]$ of support $k<\infty$. This is equivalent to the problem of learning a distribution on $m$ observable binary random variables $X_1,X_2,\dots,X_m$ that are iid conditional on a hidden random variable $U$ taking values in $\{1,2,\dots,k\}$. Our focus is on accomplishing this with $m=2k$, which is the minimum $m$ for which verifying that the source is a $k$-mixture is possible (even with exact statistics). This problem, so simply stated, is quite useful: e.g., by a known reduction, any algorithm for it lifts to an algorithm for learning pure topic models.
We give an algorithm for identifying a $k$-mixture using samples of $m=2k$ iid binary random variables using a sample of size $\left(1/w_{\min}\right)^2 \cdot\left(1/ζ\right)^{O(k)}$ and post-sampling runtime of only $O(k^{2+o(1)})$ arithmetic operations. Here $w_{\min}$ is the minimum probability of an outcome of $U$, and $ζ$ is the minimum separation between the distinct success probabilities of the $X_i$s. Stated in terms of the moment problem, it suffices to know the moments to additive accuracy $w_{\min}\cdotζ^{O(k)}$. It is known that the sample complexity of any solution to the identification problem must be at least exponential in $k$. Previous results demonstrated either worse sample complexity and worse $O(k^c)$ runtime for some $c$ substantially larger than $2$, or similar sample complexity and much worse $k^{O(k^2)}$ runtime.
△ Less
Submitted 7 September, 2020; v1 submitted 16 July, 2020;
originally announced July 2020.
-
Leveraging Procedural Generation to Benchmark Reinforcement Learning
Authors:
Karl Cobbe,
Christopher Hesse,
Jacob Hilton,
John Schulman
Abstract:
We introduce Procgen Benchmark, a suite of 16 procedurally generated game-like environments designed to benchmark both sample efficiency and generalization in reinforcement learning. We believe that the community will benefit from increased access to high quality training environments, and we provide detailed experimental protocols for using this benchmark. We empirically demonstrate that diverse…
▽ More
We introduce Procgen Benchmark, a suite of 16 procedurally generated game-like environments designed to benchmark both sample efficiency and generalization in reinforcement learning. We believe that the community will benefit from increased access to high quality training environments, and we provide detailed experimental protocols for using this benchmark. We empirically demonstrate that diverse environment distributions are essential to adequately train and evaluate RL agents, thereby motivating the extensive use of procedural content generation. We then use this benchmark to investigate the effects of scaling model size, finding that larger models significantly improve both sample efficiency and generalization.
△ Less
Submitted 26 July, 2020; v1 submitted 3 December, 2019;
originally announced December 2019.
-
Edge Expansion and Spectral Gap of Nonnegative Matrices
Authors:
Jenish C. Mehta,
Leonard J. Schulman
Abstract:
The classic graphical Cheeger inequalities state that if $M$ is an $n\times n$ symmetric doubly stochastic matrix, then \[ \frac{1-λ_{2}(M)}{2}\leqφ(M)\leq\sqrt{2\cdot(1-λ_{2}(M))} \] where $φ(M)=\min_{S\subseteq[n],|S|\leq n/2}\left(\frac{1}{|S|}\sum_{i\in S,j\not\in S}M_{i,j}\right)$ is the edge expansion of $M$, and $λ_{2}(M)$ is the second largest eigenvalue of $M$. We study the relationship b…
▽ More
The classic graphical Cheeger inequalities state that if $M$ is an $n\times n$ symmetric doubly stochastic matrix, then \[ \frac{1-λ_{2}(M)}{2}\leqφ(M)\leq\sqrt{2\cdot(1-λ_{2}(M))} \] where $φ(M)=\min_{S\subseteq[n],|S|\leq n/2}\left(\frac{1}{|S|}\sum_{i\in S,j\not\in S}M_{i,j}\right)$ is the edge expansion of $M$, and $λ_{2}(M)$ is the second largest eigenvalue of $M$. We study the relationship between $φ(A)$ and the spectral gap $1-\text{Re}λ_{2}(A)$ for any doubly stochastic matrix $A$ (not necessarily symmetric), where $λ_{2}(A)$ is a nontrivial eigenvalue of $A$ with maximum real part. Fiedler showed that the upper bound on $φ(A)$ is unaffected, i.e., $φ(A)\leq\sqrt{2\cdot(1-\text{Re}λ_{2}(A))}$. With regards to the lower bound on $φ(A)$, there are known constructions with \[ φ(A)\inΘ\left(\frac{1-\text{Re}λ_{2}(A)}{\log n}\right), \] indicating that at least a mild dependence on $n$ is necessary to lower bound $φ(A)$.
In our first result, we provide an exponentially better construction of $n\times n$ doubly stochastic matrices $A_{n}$, for which \[φ(A_{n})\leq\frac{1-\text{Re}λ_{2}(A_{n})}{\sqrt{n}}.\] In fact, all nontrivial eigenvalues of our matrices are $0$, even though the matrices are highly nonexpanding. We further show that this bound is in the correct range (up to the exponent of $n$), by showing that for any doubly stochastic matrix $A$, \[φ(A)\geq\frac{1-\text{Re}λ_{2}(A)}{35\cdot n}.\]
Our second result extends these bounds to general nonnegative matrices $R$, obtaining a two-sided quantitative refinement of the Perron-Frobenius theorem in which the edge expansion $φ(R)$ (appropriately defined), a quantitative measure of the irreducibility of $R$, controls the gap between the Perron-Frobenius eigenvalue and the next-largest real part of any eigenvalue.
△ Less
Submitted 27 September, 2019;
originally announced September 2019.
-
Policy Gradient Search: Online Planning and Expert Iteration without Search Trees
Authors:
Thomas Anthony,
Robert Nishihara,
Philipp Moritz,
Tim Salimans,
John Schulman
Abstract:
Monte Carlo Tree Search (MCTS) algorithms perform simulation-based search to improve policies online. During search, the simulation policy is adapted to explore the most promising lines of play. MCTS has been used by state-of-the-art programs for many problems, however a disadvantage to MCTS is that it estimates the values of states with Monte Carlo averages, stored in a search tree; this does not…
▽ More
Monte Carlo Tree Search (MCTS) algorithms perform simulation-based search to improve policies online. During search, the simulation policy is adapted to explore the most promising lines of play. MCTS has been used by state-of-the-art programs for many problems, however a disadvantage to MCTS is that it estimates the values of states with Monte Carlo averages, stored in a search tree; this does not scale to games with very high branching factors. We propose an alternative simulation-based search method, Policy Gradient Search (PGS), which adapts a neural network simulation policy online via policy gradient updates, avoiding the need for a search tree. In Hex, PGS achieves comparable performance to MCTS, and an agent trained using Expert Iteration with PGS was able defeat MoHex 2.0, the strongest open-source Hex agent, in 9x9 Hex.
△ Less
Submitted 7 April, 2019;
originally announced April 2019.
-
Semi-Supervised Learning by Label Gradient Alignment
Authors:
Jacob Jackson,
John Schulman
Abstract:
We present label gradient alignment, a novel algorithm for semi-supervised learning which imputes labels for the unlabeled data and trains on the imputed labels. We define a semantically meaningful distance metric on the input space by mapping a point (x, y) to the gradient of the model at (x, y). We then formulate an optimization problem whose objective is to minimize the distance between the lab…
▽ More
We present label gradient alignment, a novel algorithm for semi-supervised learning which imputes labels for the unlabeled data and trains on the imputed labels. We define a semantically meaningful distance metric on the input space by mapping a point (x, y) to the gradient of the model at (x, y). We then formulate an optimization problem whose objective is to minimize the distance between the labeled and the unlabeled data in this space, and we solve it by gradient descent on the imputed labels. We evaluate label gradient alignment using the standardized architecture introduced by Oliver et al. (2018) and demonstrate state-of-the-art accuracy in semi-supervised CIFAR-10 classification.
△ Less
Submitted 6 February, 2019;
originally announced February 2019.
-
Quantifying Generalization in Reinforcement Learning
Authors:
Karl Cobbe,
Oleg Klimov,
Chris Hesse,
Taehoon Kim,
John Schulman
Abstract:
In this paper, we investigate the problem of overfitting in deep reinforcement learning. Among the most common benchmarks in RL, it is customary to use the same environments for both training and testing. This practice offers relatively little insight into an agent's ability to generalize. We address this issue by using procedurally generated environments to construct distinct training and test se…
▽ More
In this paper, we investigate the problem of overfitting in deep reinforcement learning. Among the most common benchmarks in RL, it is customary to use the same environments for both training and testing. This practice offers relatively little insight into an agent's ability to generalize. We address this issue by using procedurally generated environments to construct distinct training and test sets. Most notably, we introduce a new environment called CoinRun, designed as a benchmark for generalization in RL. Using CoinRun, we find that agents overfit to surprisingly large training sets. We then show that deeper convolutional architectures improve generalization, as do methods traditionally found in supervised learning, including L2 regularization, dropout, data augmentation and batch normalization.
△ Less
Submitted 14 July, 2019; v1 submitted 5 December, 2018;
originally announced December 2018.
-
Model-Based Reinforcement Learning via Meta-Policy Optimization
Authors:
Ignasi Clavera,
Jonas Rothfuss,
John Schulman,
Yasuhiro Fujita,
Tamim Asfour,
Pieter Abbeel
Abstract:
Model-based reinforcement learning approaches carry the promise of being data efficient. However, due to challenges in learning dynamics models that sufficiently match the real-world dynamics, they struggle to achieve the same asymptotic performance as model-free methods. We propose Model-Based Meta-Policy-Optimization (MB-MPO), an approach that foregoes the strong reliance on accurate learned dyn…
▽ More
Model-based reinforcement learning approaches carry the promise of being data efficient. However, due to challenges in learning dynamics models that sufficiently match the real-world dynamics, they struggle to achieve the same asymptotic performance as model-free methods. We propose Model-Based Meta-Policy-Optimization (MB-MPO), an approach that foregoes the strong reliance on accurate learned dynamics models. Using an ensemble of learned dynamic models, MB-MPO meta-learns a policy that can quickly adapt to any model in the ensemble with one policy gradient step. This steers the meta-policy towards internalizing consistent dynamics predictions among the ensemble while shifting the burden of behaving optimally w.r.t. the model discrepancies towards the adaptation step. Our experiments show that MB-MPO is more robust to model imperfections than previous model-based approaches. Finally, we demonstrate that our approach is able to match the asymptotic performance of model-free methods while requiring significantly less experience.
△ Less
Submitted 13 September, 2018;
originally announced September 2018.
-
Gotta Learn Fast: A New Benchmark for Generalization in RL
Authors:
Alex Nichol,
Vicki Pfau,
Christopher Hesse,
Oleg Klimov,
John Schulman
Abstract:
In this report, we present a new reinforcement learning (RL) benchmark based on the Sonic the Hedgehog (TM) video game franchise. This benchmark is intended to measure the performance of transfer learning and few-shot learning algorithms in the RL domain. We also present and evaluate some baseline algorithms on the new benchmark.
In this report, we present a new reinforcement learning (RL) benchmark based on the Sonic the Hedgehog (TM) video game franchise. This benchmark is intended to measure the performance of transfer learning and few-shot learning algorithms in the RL domain. We also present and evaluate some baseline algorithms on the new benchmark.
△ Less
Submitted 23 April, 2018; v1 submitted 10 April, 2018;
originally announced April 2018.
-
On First-Order Meta-Learning Algorithms
Authors:
Alex Nichol,
Joshua Achiam,
John Schulman
Abstract:
This paper considers meta-learning problems, where there is a distribution of tasks, and we would like to obtain an agent that performs well (i.e., learns quickly) when presented with a previously unseen task sampled from this distribution. We analyze a family of algorithms for learning a parameter initialization that can be fine-tuned quickly on a new task, using only first-order derivatives for…
▽ More
This paper considers meta-learning problems, where there is a distribution of tasks, and we would like to obtain an agent that performs well (i.e., learns quickly) when presented with a previously unseen task sampled from this distribution. We analyze a family of algorithms for learning a parameter initialization that can be fine-tuned quickly on a new task, using only first-order derivatives for the meta-learning updates. This family includes and generalizes first-order MAML, an approximation to MAML obtained by ignoring second-order derivatives. It also includes Reptile, a new algorithm that we introduce here, which works by repeatedly sampling a task, training on it, and moving the initialization towards the trained weights on that task. We expand on the results from Finn et al. showing that first-order meta-learning algorithms perform well on some well-established benchmarks for few-shot classification, and we provide theoretical analysis aimed at understanding why these algorithms work.
△ Less
Submitted 22 October, 2018; v1 submitted 8 March, 2018;
originally announced March 2018.
-
Learning Dynamics and the Co-Evolution of Competing Sexual Species
Authors:
Georgios Piliouras,
Leonard J. Schulman
Abstract:
We analyze a stylized model of co-evolution between any two purely competing species (e.g., host and parasite), both sexually reproducing. Similarly to a recent model of Livnat \etal~\cite{evolfocs14} the fitness of an individual depends on whether the truth assignments on $n$ variables that reproduce through recombination satisfy a particular Boolean function. Whereas in the original model a sati…
▽ More
We analyze a stylized model of co-evolution between any two purely competing species (e.g., host and parasite), both sexually reproducing. Similarly to a recent model of Livnat \etal~\cite{evolfocs14} the fitness of an individual depends on whether the truth assignments on $n$ variables that reproduce through recombination satisfy a particular Boolean function. Whereas in the original model a satisfying assignment always confers a small evolutionary advantage, in our model the two species are in an evolutionary race with the parasite enjoying the advantage if the value of its Boolean function matches its host, and the host wishing to mismatch its parasite. Surprisingly, this model makes a simple and robust behavioral prediction. The typical system behavior is \textit{periodic}. These cycles stay bounded away from the boundary and thus, \textit{learning-dynamics competition between sexual species can provide an explanation for genetic diversity.} This explanation is due solely to the natural selection process. No mutations, environmental changes, etc., need be invoked.
The game played at the gene level may have many Nash equilibria with widely diverse fitness levels. Nevertheless, sexual evolution leads to gene coordination that implements an optimal strategy, i.e., an optimal population mixture, at the species level. Namely, the play of the many "selfish genes" implements a time-averaged correlated equilibrium where the average fitness of each species is exactly equal to its value in the two species zero-sum competition.
Our analysis combines tools from game theory, dynamical systems and Boolean functions to establish a novel class of conservative dynamical systems.
△ Less
Submitted 18 November, 2017;
originally announced November 2017.
-
Meta Learning Shared Hierarchies
Authors:
Kevin Frans,
Jonathan Ho,
Xi Chen,
Pieter Abbeel,
John Schulman
Abstract:
We develop a metalearning approach for learning hierarchically structured policies, improving sample efficiency on unseen tasks through the use of shared primitives---policies that are executed for large numbers of timesteps. Specifically, a set of primitives are shared within a distribution of tasks, and are switched between by task-specific policies. We provide a concrete metric for measuring th…
▽ More
We develop a metalearning approach for learning hierarchically structured policies, improving sample efficiency on unseen tasks through the use of shared primitives---policies that are executed for large numbers of timesteps. Specifically, a set of primitives are shared within a distribution of tasks, and are switched between by task-specific policies. We provide a concrete metric for measuring the strength of such hierarchies, leading to an optimization problem for quickly reaching high reward on unseen tasks. We then present an algorithm to solve this problem end-to-end through the use of any off-the-shelf reinforcement learning method, by repeatedly sampling new tasks and resetting task-specific policies. We successfully discover meaningful motor primitives for the directional movement of four-legged robots, solely by interacting with distributions of mazes. We also demonstrate the transferability of primitives to solve long-timescale sparse-reward obstacle courses, and we enable 3D humanoid robots to robustly walk and crawl with the same policy.
△ Less
Submitted 26 October, 2017;
originally announced October 2017.
-
Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations
Authors:
Aravind Rajeswaran,
Vikash Kumar,
Abhishek Gupta,
Giulia Vezzani,
John Schulman,
Emanuel Todorov,
Sergey Levine
Abstract:
Dexterous multi-fingered hands are extremely versatile and provide a generic way to perform a multitude of tasks in human-centric environments. However, effectively controlling them remains challenging due to their high dimensionality and large number of potential contacts. Deep reinforcement learning (DRL) provides a model-agnostic approach to control complex dynamical systems, but has not been s…
▽ More
Dexterous multi-fingered hands are extremely versatile and provide a generic way to perform a multitude of tasks in human-centric environments. However, effectively controlling them remains challenging due to their high dimensionality and large number of potential contacts. Deep reinforcement learning (DRL) provides a model-agnostic approach to control complex dynamical systems, but has not been shown to scale to high-dimensional dexterous manipulation. Furthermore, deployment of DRL on physical systems remains challenging due to sample inefficiency. Consequently, the success of DRL in robotics has thus far been limited to simpler manipulators and tasks. In this work, we show that model-free DRL can effectively scale up to complex manipulation tasks with a high-dimensional 24-DoF hand, and solve them from scratch in simulated experiments. Furthermore, with the use of a small number of human demonstrations, the sample complexity can be significantly reduced, which enables learning with sample sizes equivalent to a few hours of robot experience. The use of demonstrations result in policies that exhibit very natural movements and, surprisingly, are also substantially more robust.
△ Less
Submitted 26 June, 2018; v1 submitted 28 September, 2017;
originally announced September 2017.
-
Proximal Policy Optimization Algorithms
Authors:
John Schulman,
Filip Wolski,
Prafulla Dhariwal,
Alec Radford,
Oleg Klimov
Abstract:
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of min…
▽ More
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.
△ Less
Submitted 28 August, 2017; v1 submitted 19 July, 2017;
originally announced July 2017.
-
Online codes for analog signals
Authors:
Leonard J. Schulman,
Piyush Srivastava
Abstract:
This paper revisits a classical scenario in communication theory: a waveform sampled at regular intervals is to be encoded so as to minimize distortion in its reconstruction, despite noise. This transformation must be online (causal), to enable real-time signaling; and should use no more power than the original signal. The noise model we consider is an "atomic norm" convex relaxation of the standa…
▽ More
This paper revisits a classical scenario in communication theory: a waveform sampled at regular intervals is to be encoded so as to minimize distortion in its reconstruction, despite noise. This transformation must be online (causal), to enable real-time signaling; and should use no more power than the original signal. The noise model we consider is an "atomic norm" convex relaxation of the standard (discrete alphabet) Hamming-weight-bounded model: namely, adversarial $\ell_1$-bounded. In the "block coding" (noncausal) setting, such encoding is possible due to the existence of large almost-Euclidean sections in $\ell_1$ spaces, a notion first studied in the work of Dvoretzky in 1961. Our main result is that an analogous result is achievable even causally. Equivalently, our work may be seen as a "lower triangular" version of $\ell_1$ Dvoretzky theorems. In terms of communication, the guarantees are expressed in terms of certain time-weighted norms: the time-weighted $\ell_2$ norm imposed on the decoder forces increasingly accurate reconstruction of the distant past signal, while the time-weighted $\ell_1$ norm on the noise ensures vanishing interference from distant past noise. Encoding is linear (hence easy to implement in analog hardware). Decoding is performed by an LP analogous to those used in compressed sensing.
△ Less
Submitted 1 June, 2019; v1 submitted 17 July, 2017;
originally announced July 2017.
-
Teacher-Student Curriculum Learning
Authors:
Tambet Matiisen,
Avital Oliver,
Taco Cohen,
John Schulman
Abstract:
We propose Teacher-Student Curriculum Learning (TSCL), a framework for automatic curriculum learning, where the Student tries to learn a complex task and the Teacher automatically chooses subtasks from a given set for the Student to train on. We describe a family of Teacher algorithms that rely on the intuition that the Student should practice more those tasks on which it makes the fastest progres…
▽ More
We propose Teacher-Student Curriculum Learning (TSCL), a framework for automatic curriculum learning, where the Student tries to learn a complex task and the Teacher automatically chooses subtasks from a given set for the Student to train on. We describe a family of Teacher algorithms that rely on the intuition that the Student should practice more those tasks on which it makes the fastest progress, i.e. where the slope of the learning curve is highest. In addition, the Teacher algorithms address the problem of forgetting by also choosing tasks where the Student's performance is getting worse. We demonstrate that TSCL matches or surpasses the results of carefully hand-crafted curricula in two tasks: addition of decimal numbers with LSTM and navigation in Minecraft. Using our automatically generated curriculum enabled to solve a Minecraft maze that could not be solved at all when training directly on solving the maze, and the learning was an order of magnitude faster than uniform sampling of subtasks.
△ Less
Submitted 29 November, 2017; v1 submitted 1 July, 2017;
originally announced July 2017.
-
UCB Exploration via Q-Ensembles
Authors:
Richard Y. Chen,
Szymon Sidor,
Pieter Abbeel,
John Schulman
Abstract:
We show how an ensemble of $Q^*$-functions can be leveraged for more effective exploration in deep reinforcement learning. We build on well established algorithms from the bandit setting, and adapt them to the $Q$-learning setting. We propose an exploration strategy based on upper-confidence bounds (UCB). Our experiments show significant gains on the Atari benchmark.
We show how an ensemble of $Q^*$-functions can be leveraged for more effective exploration in deep reinforcement learning. We build on well established algorithms from the bandit setting, and adapt them to the $Q$-learning setting. We propose an exploration strategy based on upper-confidence bounds (UCB). Our experiments show significant gains on the Atari benchmark.
△ Less
Submitted 7 November, 2017; v1 submitted 5 June, 2017;
originally announced June 2017.
-
Equivalence Between Policy Gradients and Soft Q-Learning
Authors:
John Schulman,
Xi Chen,
Pieter Abbeel
Abstract:
Two of the leading approaches for model-free reinforcement learning are policy gradient methods and $Q$-learning methods. $Q$-learning methods can be effective and sample-efficient when they work, however, it is not well-understood why they work, since empirically, the $Q$-values they estimate are very inaccurate. A partial explanation may be that $Q$-learning methods are secretly implementing pol…
▽ More
Two of the leading approaches for model-free reinforcement learning are policy gradient methods and $Q$-learning methods. $Q$-learning methods can be effective and sample-efficient when they work, however, it is not well-understood why they work, since empirically, the $Q$-values they estimate are very inaccurate. A partial explanation may be that $Q$-learning methods are secretly implementing policy gradient updates: we show that there is a precise equivalence between $Q$-learning and policy gradient methods in the setting of entropy-regularized reinforcement learning, that "soft" (entropy-regularized) $Q$-learning is exactly equivalent to a policy gradient method. We also point out a connection between $Q$-learning methods and natural policy gradient methods. Experimentally, we explore the entropy-regularized versions of $Q$-learning and policy gradients, and we find them to perform as well as (or slightly better than) the standard variants on the Atari benchmark. We also show that the equivalence holds in practical settings by constructing a $Q$-learning method that closely matches the learning dynamics of A3C without using a target network or $ε$-greedy exploration schedule.
△ Less
Submitted 14 October, 2018; v1 submitted 21 April, 2017;
originally announced April 2017.
-
Quasi-regular sequences and optimal schedules for security games
Authors:
David Kempe,
Leonard J. Schulman,
Omer Tamuz
Abstract:
We study security games in which a defender commits to a mixed strategy for protecting a finite set of targets of different values. An attacker, knowing the defender's strategy, chooses which target to attack and for how long. If the attacker spends time $t$ at a target $i$ of value $α_i$, and if he leaves before the defender visits the target, his utility is $t \cdot α_i $; if the defender visits…
▽ More
We study security games in which a defender commits to a mixed strategy for protecting a finite set of targets of different values. An attacker, knowing the defender's strategy, chooses which target to attack and for how long. If the attacker spends time $t$ at a target $i$ of value $α_i$, and if he leaves before the defender visits the target, his utility is $t \cdot α_i $; if the defender visits before he leaves, his utility is 0. The defender's goal is to minimize the attacker's utility. The defender's strategy consists of a schedule for visiting the targets; it takes her unit time to switch between targets. Such games are a simplified model of a number of real-world scenarios such as protecting computer networks from intruders, crops from thieves, etc.
We show that optimal defender play for this continuous time security games reduces to the solution of a combinatorial question regarding the existence of infinite sequences over a finite alphabet, with the following properties for each symbol $i$: (1) $i$ constitutes a prescribed fraction $p_i$ of the sequence. (2) The occurrences of $i$ are spread apart close to evenly, in that the ratio of the longest to shortest interval between consecutive occurrences is bounded by a parameter $K$. We call such sequences $K$-quasi-regular.
We show that, surprisingly, $2$-quasi-regular sequences suffice for optimal defender play. What is more, even randomized $2$-quasi-regular sequences suffice for optimality. We show that such sequences always exist, and can be calculated efficiently.
The question of the least $K$ for which deterministic $K$-quasi-regular sequences exist is fascinating. Using an ergodic theoretical approach, we show that deterministic $3$-quasi-regular sequences always exist. For $2 \leq K < 3$ we do not know whether deterministic $K$-quasi-regular sequences always exist.
△ Less
Submitted 28 October, 2017; v1 submitted 22 November, 2016;
originally announced November 2016.
-
#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning
Authors:
Haoran Tang,
Rein Houthooft,
Davis Foote,
Adam Stooke,
Xi Chen,
Yan Duan,
John Schulman,
Filip De Turck,
Pieter Abbeel
Abstract:
Count-based exploration algorithms are known to perform near-optimally when used in conjunction with tabular reinforcement learning (RL) methods for solving small discrete Markov decision processes (MDPs). It is generally thought that count-based methods cannot be applied in high-dimensional state spaces, since most states will only occur once. Recent deep RL exploration strategies are able to dea…
▽ More
Count-based exploration algorithms are known to perform near-optimally when used in conjunction with tabular reinforcement learning (RL) methods for solving small discrete Markov decision processes (MDPs). It is generally thought that count-based methods cannot be applied in high-dimensional state spaces, since most states will only occur once. Recent deep RL exploration strategies are able to deal with high-dimensional continuous state spaces through complex heuristics, often relying on optimism in the face of uncertainty or intrinsic motivation. In this work, we describe a surprising finding: a simple generalization of the classic count-based approach can reach near state-of-the-art performance on various high-dimensional and/or continuous deep RL benchmarks. States are mapped to hash codes, which allows to count their occurrences with a hash table. These counts are then used to compute a reward bonus according to the classic count-based exploration theory. We find that simple hash functions can achieve surprisingly good results on many challenging tasks. Furthermore, we show that a domain-dependent learned hash code may further improve these results. Detailed analysis reveals important aspects of a good hash function: 1) having appropriate granularity and 2) encoding information relevant to solving the MDP. This exploration strategy achieves near state-of-the-art performance on both continuous control tasks and Atari 2600 games, hence providing a simple yet powerful baseline for solving MDPs that require considerable exploration.
△ Less
Submitted 5 December, 2017; v1 submitted 15 November, 2016;
originally announced November 2016.
-
RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning
Authors:
Yan Duan,
John Schulman,
Xi Chen,
Peter L. Bartlett,
Ilya Sutskever,
Pieter Abbeel
Abstract:
Deep reinforcement learning (deep RL) has been successful in learning sophisticated behaviors automatically; however, the learning process requires a huge number of trials. In contrast, animals can learn new tasks in just a few trials, benefiting from their prior knowledge about the world. This paper seeks to bridge this gap. Rather than designing a "fast" reinforcement learning algorithm, we prop…
▽ More
Deep reinforcement learning (deep RL) has been successful in learning sophisticated behaviors automatically; however, the learning process requires a huge number of trials. In contrast, animals can learn new tasks in just a few trials, benefiting from their prior knowledge about the world. This paper seeks to bridge this gap. Rather than designing a "fast" reinforcement learning algorithm, we propose to represent it as a recurrent neural network (RNN) and learn it from data. In our proposed method, RL$^2$, the algorithm is encoded in the weights of the RNN, which are learned slowly through a general-purpose ("slow") RL algorithm. The RNN receives all information a typical RL algorithm would receive, including observations, actions, rewards, and termination flags; and it retains its state across episodes in a given Markov Decision Process (MDP). The activations of the RNN store the state of the "fast" RL algorithm on the current (previously unseen) MDP. We evaluate RL$^2$ experimentally on both small-scale and large-scale problems. On the small-scale side, we train it to solve randomly generated multi-arm bandit problems and finite MDPs. After RL$^2$ is trained, its performance on new MDPs is close to human-designed algorithms with optimality guarantees. On the large-scale side, we test RL$^2$ on a vision-based navigation task and show that it scales up to high-dimensional problems.
△ Less
Submitted 9 November, 2016; v1 submitted 8 November, 2016;
originally announced November 2016.
-
Variational Lossy Autoencoder
Authors:
Xi Chen,
Diederik P. Kingma,
Tim Salimans,
Yan Duan,
Prafulla Dhariwal,
John Schulman,
Ilya Sutskever,
Pieter Abbeel
Abstract:
Representation learning seeks to expose certain aspects of observed data in a learned representation that's amenable to downstream tasks like classification. For instance, a good representation for 2D images might be one that describes only global structure and discards information about detailed texture. In this paper, we present a simple but principled method to learn such global representations…
▽ More
Representation learning seeks to expose certain aspects of observed data in a learned representation that's amenable to downstream tasks like classification. For instance, a good representation for 2D images might be one that describes only global structure and discards information about detailed texture. In this paper, we present a simple but principled method to learn such global representations by combining Variational Autoencoder (VAE) with neural autoregressive models such as RNN, MADE and PixelRNN/CNN. Our proposed VAE model allows us to have control over what the global latent code can learn and , by designing the architecture accordingly, we can force the global latent code to discard irrelevant information such as texture in 2D images, and hence the VAE only "autoencodes" data in a lossy fashion. In addition, by leveraging autoregressive models as both prior distribution $p(z)$ and decoding distribution $p(x|z)$, we can greatly improve generative modeling performance of VAEs, achieving new state-of-the-art results on MNIST, OMNIGLOT and Caltech-101 Silhouettes density estimation tasks.
△ Less
Submitted 4 March, 2017; v1 submitted 8 November, 2016;
originally announced November 2016.
-
Concrete Problems in AI Safety
Authors:
Dario Amodei,
Chris Olah,
Jacob Steinhardt,
Paul Christiano,
John Schulman,
Dan Mané
Abstract:
Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical…
▽ More
Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function ("avoiding side effects" and "avoiding reward hacking"), an objective function that is too expensive to evaluate frequently ("scalable supervision"), or undesirable behavior during the learning process ("safe exploration" and "distributional shift"). We review previous work in these areas as well as suggesting research directions with a focus on relevance to cutting-edge AI systems. Finally, we consider the high-level question of how to think most productively about the safety of forward-looking applications of AI.
△ Less
Submitted 25 July, 2016; v1 submitted 21 June, 2016;
originally announced June 2016.
-
InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets
Authors:
Xi Chen,
Yan Duan,
Rein Houthooft,
John Schulman,
Ilya Sutskever,
Pieter Abbeel
Abstract:
This paper describes InfoGAN, an information-theoretic extension to the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner. InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation. We derive a lower bound to the mutual information obje…
▽ More
This paper describes InfoGAN, an information-theoretic extension to the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner. InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation. We derive a lower bound to the mutual information objective that can be optimized efficiently, and show that our training procedure can be interpreted as a variation of the Wake-Sleep algorithm. Specifically, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, pose from lighting of 3D rendered images, and background digits from the central digit on the SVHN dataset. It also discovers visual concepts that include hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset. Experiments show that InfoGAN learns interpretable representations that are competitive with representations learned by existing fully supervised methods.
△ Less
Submitted 11 June, 2016;
originally announced June 2016.
-
OpenAI Gym
Authors:
Greg Brockman,
Vicki Cheung,
Ludwig Pettersson,
Jonas Schneider,
John Schulman,
Jie Tang,
Wojciech Zaremba
Abstract:
OpenAI Gym is a toolkit for reinforcement learning research. It includes a growing collection of benchmark problems that expose a common interface, and a website where people can share their results and compare the performance of algorithms. This whitepaper discusses the components of OpenAI Gym and the design decisions that went into the software.
OpenAI Gym is a toolkit for reinforcement learning research. It includes a growing collection of benchmark problems that expose a common interface, and a website where people can share their results and compare the performance of algorithms. This whitepaper discusses the components of OpenAI Gym and the design decisions that went into the software.
△ Less
Submitted 5 June, 2016;
originally announced June 2016.
-
VIME: Variational Information Maximizing Exploration
Authors:
Rein Houthooft,
Xi Chen,
Yan Duan,
John Schulman,
Filip De Turck,
Pieter Abbeel
Abstract:
Scalable and effective exploration remains a key challenge in reinforcement learning (RL). While there are methods with optimality guarantees in the setting of discrete state and action spaces, these methods cannot be applied in high-dimensional deep RL scenarios. As such, most contemporary RL relies on simple heuristics such as epsilon-greedy exploration or adding Gaussian noise to the controls.…
▽ More
Scalable and effective exploration remains a key challenge in reinforcement learning (RL). While there are methods with optimality guarantees in the setting of discrete state and action spaces, these methods cannot be applied in high-dimensional deep RL scenarios. As such, most contemporary RL relies on simple heuristics such as epsilon-greedy exploration or adding Gaussian noise to the controls. This paper introduces Variational Information Maximizing Exploration (VIME), an exploration strategy based on maximization of information gain about the agent's belief of environment dynamics. We propose a practical implementation, using variational inference in Bayesian neural networks which efficiently handles continuous state and action spaces. VIME modifies the MDP reward function, and can be applied with several different underlying RL algorithms. We demonstrate that VIME achieves significantly better performance compared to heuristic exploration methods across a variety of continuous control tasks and algorithms, including tasks with very sparse rewards.
△ Less
Submitted 27 January, 2017; v1 submitted 31 May, 2016;
originally announced May 2016.
-
Market Dynamics of Best-Response with Lookahead
Authors:
Krishnamurthy Dvijotham,
Yuval Rabani,
Leonard J. Schulman
Abstract:
One attractive approach to market dynamics is the level $k$ model in which a level $0$ player adopts a very simple response to current conditions, a level $1$ player best-responds to a model in which others take level $0$ actions, and so forth. (This is analogous to $k$-ply exploration of game trees in AI, and to receding-horizon control in control theory.) If players have deterministic mental mod…
▽ More
One attractive approach to market dynamics is the level $k$ model in which a level $0$ player adopts a very simple response to current conditions, a level $1$ player best-responds to a model in which others take level $0$ actions, and so forth. (This is analogous to $k$-ply exploration of game trees in AI, and to receding-horizon control in control theory.) If players have deterministic mental models with this kind of finite-level response, there is obviously no way their mental models can all be consistent. Nevertheless, there is experimental evidence that people act this way in many situations, motivating the question of what the dynamics of such interactions lead to.
We address this question in the setting of Fisher Markets with constant elasticities of substitution (CES) utilities, in the weak gross substitutes (WGS) regime. We show that despite the inconsistency of the mental models, and even if players' models change arbitrarily from round to round, the market converges to its unique equilibrium. (We show this for both synchronous and asynchronous discrete-time updates.) Moreover, the result is computationally feasible in the sense that the convergence rate is linear, i.e., the distance to equilibrium decays exponentially fast. To the best of our knowledge, this is the first result that demonstrates, in Fisher markets, convergence at any rate for dynamics driven by a plausible model of seller incentives. Even for the simple case of (level $0$) best-response dynamics, where we observe that convergence at some rate can be derived from recent results in convex optimization, our result is the first to demonstrate a linear rate of convergence.
△ Less
Submitted 29 May, 2016;
originally announced May 2016.
-
Theano: A Python framework for fast computation of mathematical expressions
Authors:
The Theano Development Team,
Rami Al-Rfou,
Guillaume Alain,
Amjad Almahairi,
Christof Angermueller,
Dzmitry Bahdanau,
Nicolas Ballas,
Frédéric Bastien,
Justin Bayer,
Anatoly Belikov,
Alexander Belopolsky,
Yoshua Bengio,
Arnaud Bergeron,
James Bergstra,
Valentin Bisson,
Josh Bleecher Snyder,
Nicolas Bouchard,
Nicolas Boulanger-Lewandowski,
Xavier Bouthillier,
Alexandre de Brébisson,
Olivier Breuleux,
Pierre-Luc Carrier,
Kyunghyun Cho,
Jan Chorowski,
Paul Christiano
, et al. (88 additional authors not shown)
Abstract:
Theano is a Python library that allows to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Since its introduction, it has been one of the most used CPU and GPU mathematical compilers - especially in the machine learning community - and has shown steady performance improvements. Theano is being actively and continuously developed since 2008, mu…
▽ More
Theano is a Python library that allows to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Since its introduction, it has been one of the most used CPU and GPU mathematical compilers - especially in the machine learning community - and has shown steady performance improvements. Theano is being actively and continuously developed since 2008, multiple frameworks have been built on top of it and it has been used to produce many state-of-the-art machine learning models.
The present article is structured as follows. Section I provides an overview of the Theano software and its community. Section II presents the principal features of Theano and how to use them, and compares them with other similar projects. Section III focuses on recently-introduced functionalities and improvements. Section IV compares the performance of Theano against Torch7 and TensorFlow on several machine learning models. Section V discusses current limitations of Theano and potential ways of improving it.
△ Less
Submitted 9 May, 2016;
originally announced May 2016.
-
Benchmarking Deep Reinforcement Learning for Continuous Control
Authors:
Yan Duan,
Xi Chen,
Rein Houthooft,
John Schulman,
Pieter Abbeel
Abstract:
Recently, researchers have made significant progress combining the advances in deep learning for learning feature representations with reinforcement learning. Some notable examples include training agents to play Atari games based on raw pixel data and to acquire advanced manipulation skills using raw sensory inputs. However, it has been difficult to quantify progress in the domain of continuous c…
▽ More
Recently, researchers have made significant progress combining the advances in deep learning for learning feature representations with reinforcement learning. Some notable examples include training agents to play Atari games based on raw pixel data and to acquire advanced manipulation skills using raw sensory inputs. However, it has been difficult to quantify progress in the domain of continuous control due to the lack of a commonly adopted benchmark. In this work, we present a benchmark suite of continuous control tasks, including classic tasks like cart-pole swing-up, tasks with very high state and action dimensionality such as 3D humanoid locomotion, tasks with partial observations, and tasks with hierarchical structure. We report novel findings based on the systematic evaluation of a range of implemented reinforcement learning algorithms. Both the benchmark and reference implementations are released at https://github.com/rllab/rllab in order to facilitate experimental reproducibility and to encourage adoption by other researchers.
△ Less
Submitted 27 May, 2016; v1 submitted 22 April, 2016;
originally announced April 2016.
-
Gradient Estimation Using Stochastic Computation Graphs
Authors:
John Schulman,
Nicolas Heess,
Theophane Weber,
Pieter Abbeel
Abstract:
In a variety of problems originating in supervised, unsupervised, and reinforcement learning, the loss function is defined by an expectation over a collection of random variables, which might be part of a probabilistic model or the external world. Estimating the gradient of this loss function, using samples, lies at the core of gradient-based learning algorithms for these problems. We introduce th…
▽ More
In a variety of problems originating in supervised, unsupervised, and reinforcement learning, the loss function is defined by an expectation over a collection of random variables, which might be part of a probabilistic model or the external world. Estimating the gradient of this loss function, using samples, lies at the core of gradient-based learning algorithms for these problems. We introduce the formalism of stochastic computation graphs---directed acyclic graphs that include both deterministic functions and conditional probability distributions---and describe how to easily and automatically derive an unbiased estimator of the loss function's gradient. The resulting algorithm for computing the gradient estimator is a simple modification of the standard backpropagation algorithm. The generic scheme we propose unifies estimators derived in variety of prior work, along with variance-reduction techniques therein. It could assist researchers in developing intricate models involving a combination of stochastic and deterministic operations, enabling, for example, attention, memory, and control actions.
△ Less
Submitted 5 January, 2016; v1 submitted 17 June, 2015;
originally announced June 2015.