Showing 1–50 of 60 results for author: Schulman, J

  1. arXiv:2406.03938  [pdf, other]

    q-bio.PE cs.CE cs.GT

    Diversity in Evolutionary Dynamics

    Authors: Yuval Rabani, Leonard J. Schulman, Alistair Sinclair

    Abstract: We consider the dynamics imposed by natural selection on the populations of two competing, sexually reproducing, haploid species. In this setting, the fitness of any genome varies over time due to the changing population mix of the competing species; crucially, this fitness variation arises naturally from the model itself, without the need for imposing it exogenously as is typically the case. Prev…

    Submitted 5 July, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

  2. arXiv:2310.09397  [pdf, other]

    cs.LG math.AG math.ST

    Identifiability of Product of Experts Models

    Authors: Spencer L. Gordon, Manav Kant, Eric Ma, Leonard J. Schulman, Andrei Staicu

    Abstract: Product of experts (PoE) are layered networks in which the value at each node is an AND (or product) of the values (possibly negated) at its inputs. These were introduced as a neural network architecture that can efficiently learn to generate high-dimensional data which satisfy many low-dimensional constraints -- thereby allowing each individual expert to perform a simple task. PoEs have found a v…

    Submitted 13 October, 2023; originally announced October 2023.

    Comments: 24 pages, 2 figures

    MSC Class: 62E10; 62F99; 68T05

    ACM Class: I.2.6

  3. arXiv:2309.13993  [pdf, ps, other]

    cs.LG cs.DS eess.SP stat.ML

    Identification of Mixtures of Discrete Product Distributions in Near-Optimal Sample and Time Complexity

    Authors: Spencer L. Gordon, Erik Jahn, Bijan Mazaheri, Yuval Rabani, Leonard J. Schulman

    Abstract: We consider the problem of identifying, from statistics, a distribution of discrete random variables $X_1,\ldots,X_n$ that is a mixture of $k$ product distributions. The best previous sample complexity for $n \in O(k)$ was $(1/ζ)^{O(k^2 \log k)}$ (under a mild separation assumption parameterized by $ζ$). The best known lower bound was $\exp(Ω(k))$. It is known that $n\geq 2k-1$ is necessary and su…

    Submitted 25 September, 2023; originally announced September 2023.

  4. arXiv:2305.20050  [pdf, other]

    cs.LG cs.AI cs.CL

    Let's Verify Step by Step

    Authors: Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe

    Abstract: In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning ste…

    Submitted 31 May, 2023; originally announced May 2023.

  5. arXiv:2303.08774  [pdf, other]

    cs.CL cs.AI

    GPT-4 Technical Report

    Authors: OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, et al. (256 additional authors not shown)

    Abstract: We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based mo…

    Submitted 4 March, 2024; v1 submitted 15 March, 2023; originally announced March 2023.

    Comments: 100 pages; updated authors list; fixed author names and added citation

  6. arXiv:2301.13442  [pdf, other]

    cs.LG cs.AI stat.ML

    Scaling laws for single-agent reinforcement learning

    Authors: Jacob Hilton, Jie Tang, John Schulman

    Abstract: Recent work has shown that, in generative modeling, cross-entropy loss improves smoothly with model size and training compute, following a power law plus constant scaling law. One challenge in extending these results to reinforcement learning is that the main performance objective of interest, mean episode return, need not vary smoothly. To overcome this, we introduce *intrinsic performance*, a mo…

    Submitted 18 February, 2023; v1 submitted 31 January, 2023; originally announced January 2023.

    Comments: 33 pages

  7. arXiv:2210.10760  [pdf, other]

    cs.LG stat.ML

    Scaling Laws for Reward Model Overoptimization

    Authors: Leo Gao, John Schulman, Jacob Hilton

    Abstract: In reinforcement learning from human feedback, it is common to optimize against a reward model trained to predict human preferences. Because the reward model is an imperfect proxy, optimizing its value too much can hinder ground truth performance, in accordance with Goodhart's law. This effect has been frequently observed, but not carefully measured due to the expense of collecting human preferenc…

    Submitted 19 October, 2022; originally announced October 2022.

  8. arXiv:2207.14255  [pdf, other]

    cs.CL

    Efficient Training of Language Models to Fill in the Middle

    Authors: Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, Mark Chen

    Abstract: We show that autoregressive language models can learn to infill text after we apply a straightforward transformation to the dataset, which simply moves a span of text from the middle of a document to its end. While this data augmentation has garnered much interest in recent years, we provide extensive evidence that training models with a large fraction of data transformed in this way does not harm…

    Submitted 28 July, 2022; originally announced July 2022.

  9. arXiv:2203.02155  [pdf, other]

    cs.CL cs.AI cs.LG

    Training language models to follow instructions with human feedback

    Authors: Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe

    Abstract: Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning wi…

    Submitted 4 March, 2022; originally announced March 2022.

  10. arXiv:2112.11602  [pdf, ps, other]

    cs.LG cs.DS eess.SP stat.ML

    Causal Inference Despite Limited Global Confounding via Mixture Models

    Authors: Spencer L. Gordon, Bijan Mazaheri, Yuval Rabani, Leonard J. Schulman

    Abstract: A Bayesian Network is a directed acyclic graph (DAG) on a set of $n$ random variables (the vertices); a Bayesian Network Distribution (BND) is a probability distribution on the random variables that is Markovian on the graph. A finite $k$-mixture of such models is graphically represented by a larger graph which has an additional ``hidden'' (or ``latent'') random variable $U$, ranging in…

    Submitted 31 May, 2023; v1 submitted 21 December, 2021; originally announced December 2021.

    Comments: Published in CleaR 2023

    MSC Class: 68W40; 62F99; 62-09

    ACM Class: F.2; G.3

    Journal ref: Proceedings of Machine Learning Research vol 213:1-27, 2023

  11. arXiv:2112.09332  [pdf, other]

    cs.CL cs.AI cs.LG

    WebGPT: Browser-assisted question-answering with human feedback

    Authors: Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, John Schulman

    Abstract: We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must coll…

    Submitted 1 June, 2022; v1 submitted 17 December, 2021; originally announced December 2021.

    Comments: 32 pages

  12. arXiv:2110.14168  [pdf, other]

    cs.LG cs.CL

    Training Verifiers to Solve Math Word Problems

    Authors: Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman

    Abstract: State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high tes…

    Submitted 17 November, 2021; v1 submitted 27 October, 2021; originally announced October 2021.

  13. arXiv:2110.00641  [pdf, other]

    cs.LG stat.ML

    Batch size-invariance for policy optimization

    Authors: Jacob Hilton, Karl Cobbe, John Schulman

    Abstract: We say an algorithm is batch size-invariant if changes to the batch size can largely be compensated for by changes to other hyperparameters. Stochastic gradient descent is well-known to have this property at small batch sizes, via the learning rate. However, some policy optimization algorithms (such as PPO) do not have this property, because of how they control the size of policy updates. In this…

    Submitted 24 September, 2022; v1 submitted 1 October, 2021; originally announced October 2021.

    Comments: 32 pages. Code is available at https://github.com/openai/ppo-ewma

    Journal ref: Advances in Neural Information Processing Systems 35 (2022) 17086-17098

  14. arXiv:2109.13916  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    Unsolved Problems in ML Safety

    Authors: Dan Hendrycks, Nicholas Carlini, John Schulman, Jacob Steinhardt

    Abstract: Machine learning (ML) systems are rapidly increasing in size, are acquiring new capabilities, and are increasingly deployed in high-stakes settings. As with other powerful technologies, safety for ML should be a leading research priority. In response to emerging safety challenges in ML, such as those introduced by recent large-scale models, we provide a new roadmap for ML Safety and refine the tec…

    Submitted 16 June, 2022; v1 submitted 28 September, 2021; originally announced September 2021.

    Comments: Position Paper

  15. arXiv:2107.07358  [pdf, ps, other]

    cs.DS cs.CG

    A Refined Approximation for Euclidean k-Means

    Authors: Fabrizio Grandoni, Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, Rakesh Venkat

    Abstract: In the Euclidean $k$-Means problem we are given a collection of $n$ points $D$ in a Euclidean space and a positive integer $k$. Our goal is to identify a collection of $k$ points in the same space (centers) so as to minimize the sum of the squared Euclidean distances between each point in $D$ and the closest center. This problem is known to be APX-hard and the current best approximation ratio is…

    Submitted 20 September, 2021; v1 submitted 15 July, 2021; originally announced July 2021.

    Comments: Corrected a confusing typo in a formula on page 5 and added one remark

  16. arXiv:2103.15332  [pdf, other]

    cs.LG cs.AI

    Measuring Sample Efficiency and Generalization in Reinforcement Learning Benchmarks: NeurIPS 2020 Procgen Benchmark

    Authors: Sharada Mohanty, Jyotish Poonganam, Adrien Gaidon, Andrey Kolobov, Blake Wulfe, Dipam Chakraborty, Gražvydas Šemetulskis, João Schapke, Jonas Kubilius, Jurgis Pašukonis, Linas Klimas, Matthew Hausknecht, Patrick MacAlpine, Quang Nhat Tran, Thomas Tumiel, Xiaocheng Tang, Xinwei Chen, Christopher Hesse, Jacob Hilton, William Hebgen Guss, Sahika Genc, John Schulman, Karl Cobbe

    Abstract: The NeurIPS 2020 Procgen Competition was designed as a centralized benchmark with clearly defined tasks for measuring Sample Efficiency and Generalization in Reinforcement Learning. Generalization remains one of the most fundamental challenges in deep reinforcement learning, and yet we do not have enough benchmarks to measure the progress of the community on Generalization in Reinforcement Learnin…

    Submitted 29 March, 2021; originally announced March 2021.

  17. arXiv:2101.11688  [pdf, other]

    cs.LG eess.SP stat.ML

    Hadamard Extensions and the Identification of Mixtures of Product Distributions

    Authors: Spencer L. Gordon, Leonard J. Schulman

    Abstract: The Hadamard Extension of a matrix is the matrix consisting of all Hadamard products of subsets of its rows. This construction arises in the context of identifying a mixture of product distributions on binary random variables: full column rank of such extensions is a necessary ingredient of identification algorithms. We provide several results concerning when a Hadamard Extension has full column r…

    Submitted 12 February, 2021; v1 submitted 27 January, 2021; originally announced January 2021.

    Comments: V2: re-titled and slight edits

    MSC Class: 68W40; 62F99

    ACM Class: F.2; G.3

  18. arXiv:2101.11071  [pdf, other]

    cs.LG cs.AI stat.ML

    The MineRL 2020 Competition on Sample Efficient Reinforcement Learning using Human Priors

    Authors: William H. Guss, Mario Ynocente Castro, Sam Devlin, Brandon Houghton, Noboru Sean Kuno, Crissman Loomis, Stephanie Milani, Sharada Mohanty, Keisuke Nakata, Ruslan Salakhutdinov, John Schulman, Shinya Shiroshita, Nicholay Topin, Avinash Ummadisingu, Oriol Vinyals

    Abstract: Although deep reinforcement learning has led to breakthroughs in many difficult domains, these successes have required an ever-increasing number of samples, affording only a shrinking segment of the AI community access to their development. Resolution of these limitations requires new, sample-efficient methods. To facilitate research in this direction, we propose this second iteration of the MineR…

    Submitted 26 January, 2021; originally announced January 2021.

    Comments: 37 pages, initial submission, accepted at NeurIPS. arXiv admin note: substantial text overlap with arXiv:1904.10079

  19. arXiv:2012.14540  [pdf, ps, other]

    cs.LG cs.DS eess.SP stat.ML

    Source Identification for Mixtures of Product Distributions

    Authors: Spencer L. Gordon, Bijan Mazaheri, Yuval Rabani, Leonard J. Schulman

    Abstract: We give an algorithm for source identification of a mixture of $k$ product distributions on $n$ bits. This is a fundamental problem in machine learning with many applications. Our algorithm identifies the source parameters of an identifiable mixture, given, as input, approximate values of multilinear moments (derived, for instance, from a sufficiently large sample), using $2^{O(k^2)} n^{O(k)}$ ari…

    Submitted 28 December, 2020; originally announced December 2020.

    MSC Class: 68W40; 62F99

    ACM Class: F.2; G.3

  20. arXiv:2010.14701  [pdf, other]

    cs.LG cs.CL cs.CV

    Scaling Laws for Autoregressive Generative Modeling

    Authors: Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, Sam McCandlish

    Abstract: We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image$\leftrightarrow$text models, and mathematical problem solving. In all cases autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law. The optimal model size also depe…

    Submitted 5 November, 2020; v1 submitted 27 October, 2020; originally announced October 2020.

    Comments: 20+17 pages, 33 figures; added appendix with additional language results

  21. arXiv:2009.04416  [pdf, other]

    cs.LG stat.ML

    Phasic Policy Gradient

    Authors: Karl Cobbe, Jacob Hilton, Oleg Klimov, John Schulman

    Abstract: We introduce Phasic Policy Gradient (PPG), a reinforcement learning framework which modifies traditional on-policy actor-critic methods by separating policy and value function training into distinct phases. In prior methods, one must choose between using a shared network or separate networks to represent the policy and value function. Using separate networks avoids interference between objectives,…

    Submitted 9 September, 2020; originally announced September 2020.

  22. arXiv:2007.08101  [pdf, ps, other]

    cs.LG cs.DS stat.ML

    The Sparse Hausdorff Moment Problem, with Application to Topic Models

    Authors: Spencer Gordon, Bijan Mazaheri, Leonard J. Schulman, Yuval Rabani

    Abstract: We consider the problem of identifying, from its first $m$ noisy moments, a probability distribution on $[0,1]$ of support $k<\infty$. This is equivalent to the problem of learning a distribution on $m$ observable binary random variables $X_1,X_2,\dots,X_m$ that are iid conditional on a hidden random variable $U$ taking values in $\{1,2,\dots,k\}$. Our focus is on accomplishing this with $m=2k$, w…

    Submitted 7 September, 2020; v1 submitted 16 July, 2020; originally announced July 2020.

  23. arXiv:1912.01588  [pdf, other]

    cs.LG stat.ML

    Leveraging Procedural Generation to Benchmark Reinforcement Learning

    Authors: Karl Cobbe, Christopher Hesse, Jacob Hilton, John Schulman

    Abstract: We introduce Procgen Benchmark, a suite of 16 procedurally generated game-like environments designed to benchmark both sample efficiency and generalization in reinforcement learning. We believe that the community will benefit from increased access to high quality training environments, and we provide detailed experimental protocols for using this benchmark. We empirically demonstrate that diverse…

    Submitted 26 July, 2020; v1 submitted 3 December, 2019; originally announced December 2019.

  24. arXiv:1909.12497  [pdf, ps, other]

    math.CO cs.DM

    Edge Expansion and Spectral Gap of Nonnegative Matrices

    Authors: Jenish C. Mehta, Leonard J. Schulman

    Abstract: The classic graphical Cheeger inequalities state that if $M$ is an $n\times n$ symmetric doubly stochastic matrix, then \[ \frac{1-λ_{2}(M)}{2}\leqφ(M)\leq\sqrt{2\cdot(1-λ_{2}(M))} \] where $φ(M)=\min_{S\subseteq[n],|S|\leq n/2}\left(\frac{1}{|S|}\sum_{i\in S,j\not\in S}M_{i,j}\right)$ is the edge expansion of $M$, and $λ_{2}(M)$ is the second largest eigenvalue of $M$. We study the relationship b…

    Submitted 27 September, 2019; originally announced September 2019.

  25. arXiv:1904.03646  [pdf, other]

    cs.LG stat.ML

    Policy Gradient Search: Online Planning and Expert Iteration without Search Trees

    Authors: Thomas Anthony, Robert Nishihara, Philipp Moritz, Tim Salimans, John Schulman

    Abstract: Monte Carlo Tree Search (MCTS) algorithms perform simulation-based search to improve policies online. During search, the simulation policy is adapted to explore the most promising lines of play. MCTS has been used by state-of-the-art programs for many problems, however a disadvantage to MCTS is that it estimates the values of states with Monte Carlo averages, stored in a search tree; this does not…

    Submitted 7 April, 2019; originally announced April 2019.

  26. arXiv:1902.02336  [pdf, other]

    cs.LG stat.ML

    Semi-Supervised Learning by Label Gradient Alignment

    Authors: Jacob Jackson, John Schulman

    Abstract: We present label gradient alignment, a novel algorithm for semi-supervised learning which imputes labels for the unlabeled data and trains on the imputed labels. We define a semantically meaningful distance metric on the input space by mapping a point (x, y) to the gradient of the model at (x, y). We then formulate an optimization problem whose objective is to minimize the distance between the lab…

    Submitted 6 February, 2019; originally announced February 2019.

  27. arXiv:1812.02341  [pdf, other]

    cs.LG stat.ML

    Quantifying Generalization in Reinforcement Learning

    Authors: Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, John Schulman

    Abstract: In this paper, we investigate the problem of overfitting in deep reinforcement learning. Among the most common benchmarks in RL, it is customary to use the same environments for both training and testing. This practice offers relatively little insight into an agent's ability to generalize. We address this issue by using procedurally generated environments to construct distinct training and test se…

    Submitted 14 July, 2019; v1 submitted 5 December, 2018; originally announced December 2018.

  28. arXiv:1809.05214  [pdf, other]

    cs.LG cs.AI stat.ML

    Model-Based Reinforcement Learning via Meta-Policy Optimization

    Authors: Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, Pieter Abbeel

    Abstract: Model-based reinforcement learning approaches carry the promise of being data efficient. However, due to challenges in learning dynamics models that sufficiently match the real-world dynamics, they struggle to achieve the same asymptotic performance as model-free methods. We propose Model-Based Meta-Policy-Optimization (MB-MPO), an approach that foregoes the strong reliance on accurate learned dyn…

    Submitted 13 September, 2018; originally announced September 2018.

    Comments: First 2 authors contributed equally. Accepted for Conference on Robot Learning (CoRL)

  29. arXiv:1804.03720  [pdf, other]

    cs.LG stat.ML

    Gotta Learn Fast: A New Benchmark for Generalization in RL

    Authors: Alex Nichol, Vicki Pfau, Christopher Hesse, Oleg Klimov, John Schulman

    Abstract: In this report, we present a new reinforcement learning (RL) benchmark based on the Sonic the Hedgehog (TM) video game franchise. This benchmark is intended to measure the performance of transfer learning and few-shot learning algorithms in the RL domain. We also present and evaluate some baseline algorithms on the new benchmark.

    Submitted 23 April, 2018; v1 submitted 10 April, 2018; originally announced April 2018.

  30. arXiv:1803.02999  [pdf, other]

    cs.LG

    On First-Order Meta-Learning Algorithms

    Authors: Alex Nichol, Joshua Achiam, John Schulman

    Abstract: This paper considers meta-learning problems, where there is a distribution of tasks, and we would like to obtain an agent that performs well (i.e., learns quickly) when presented with a previously unseen task sampled from this distribution. We analyze a family of algorithms for learning a parameter initialization that can be fine-tuned quickly on a new task, using only first-order derivatives for…

    Submitted 22 October, 2018; v1 submitted 8 March, 2018; originally announced March 2018.

  31. arXiv:1711.06879  [pdf, other]

    cs.GT

    Learning Dynamics and the Co-Evolution of Competing Sexual Species

    Authors: Georgios Piliouras, Leonard J. Schulman

    Abstract: We analyze a stylized model of co-evolution between any two purely competing species (e.g., host and parasite), both sexually reproducing. Similarly to a recent model of Livnat et al.~\cite{evolfocs14}, the fitness of an individual depends on whether the truth assignments on $n$ variables that reproduce through recombination satisfy a particular Boolean function. Whereas in the original model a sati…

    Submitted 18 November, 2017; originally announced November 2017.

    Comments: Innovations in Theoretical Computer Science (ITCS), 2018

  32. arXiv:1710.09767  [pdf, other]

    cs.LG

    Meta Learning Shared Hierarchies

    Authors: Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, John Schulman

    Abstract: We develop a metalearning approach for learning hierarchically structured policies, improving sample efficiency on unseen tasks through the use of shared primitives---policies that are executed for large numbers of timesteps. Specifically, a set of primitives are shared within a distribution of tasks, and are switched between by task-specific policies. We provide a concrete metric for measuring th…

    Submitted 26 October, 2017; originally announced October 2017.

  33. arXiv:1709.10087  [pdf, other]

    cs.LG cs.AI cs.RO

    Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations

    Authors: Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, Sergey Levine

    Abstract: Dexterous multi-fingered hands are extremely versatile and provide a generic way to perform a multitude of tasks in human-centric environments. However, effectively controlling them remains challenging due to their high dimensionality and large number of potential contacts. Deep reinforcement learning (DRL) provides a model-agnostic approach to control complex dynamical systems, but has not been s…

    Submitted 26 June, 2018; v1 submitted 28 September, 2017; originally announced September 2017.

    Comments: Accepted for presentation at Robotics: Science and Systems (RSS) 2018. Project page: https://sites.google.com/view/deeprl-dexterous-manipulation

  34. arXiv:1707.06347  [pdf, other]

    cs.LG

    Proximal Policy Optimization Algorithms

    Authors: John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov

    Abstract: We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of min…

    Submitted 28 August, 2017; v1 submitted 19 July, 2017; originally announced July 2017.

  35. Online codes for analog signals

    Authors: Leonard J. Schulman, Piyush Srivastava

    Abstract: This paper revisits a classical scenario in communication theory: a waveform sampled at regular intervals is to be encoded so as to minimize distortion in its reconstruction, despite noise. This transformation must be online (causal), to enable real-time signaling; and should use no more power than the original signal. The noise model we consider is an "atomic norm" convex relaxation of the standa…

    Submitted 1 June, 2019; v1 submitted 17 July, 2017; originally announced July 2017.

    Journal ref: IEEE Trans. Inf. Theory. "Early access", 2019. DOI: 10.1109/TIT.2019.2919632

  36. arXiv:1707.00183  [pdf, other]

    cs.LG cs.AI

    Teacher-Student Curriculum Learning

    Authors: Tambet Matiisen, Avital Oliver, Taco Cohen, John Schulman

    Abstract: We propose Teacher-Student Curriculum Learning (TSCL), a framework for automatic curriculum learning, where the Student tries to learn a complex task and the Teacher automatically chooses subtasks from a given set for the Student to train on. We describe a family of Teacher algorithms that rely on the intuition that the Student should practice more those tasks on which it makes the fastest progres…

    Submitted 29 November, 2017; v1 submitted 1 July, 2017; originally announced July 2017.

  37. arXiv:1706.01502  [pdf, ps, other]

    cs.LG stat.ML

    UCB Exploration via Q-Ensembles

    Authors: Richard Y. Chen, Szymon Sidor, Pieter Abbeel, John Schulman

    Abstract: We show how an ensemble of $Q^*$-functions can be leveraged for more effective exploration in deep reinforcement learning. We build on well established algorithms from the bandit setting, and adapt them to the $Q$-learning setting. We propose an exploration strategy based on upper-confidence bounds (UCB). Our experiments show significant gains on the Atari benchmark.

    Submitted 7 November, 2017; v1 submitted 5 June, 2017; originally announced June 2017.

  38. arXiv:1704.06440  [pdf, other]

    cs.LG

    Equivalence Between Policy Gradients and Soft Q-Learning

    Authors: John Schulman, Xi Chen, Pieter Abbeel

    Abstract: Two of the leading approaches for model-free reinforcement learning are policy gradient methods and $Q$-learning methods. $Q$-learning methods can be effective and sample-efficient when they work, however, it is not well-understood why they work, since empirically, the $Q$-values they estimate are very inaccurate. A partial explanation may be that $Q$-learning methods are secretly implementing pol…

    Submitted 14 October, 2018; v1 submitted 21 April, 2017; originally announced April 2017.

  39. Quasi-regular sequences and optimal schedules for security games

    Authors: David Kempe, Leonard J. Schulman, Omer Tamuz

    Abstract: We study security games in which a defender commits to a mixed strategy for protecting a finite set of targets of different values. An attacker, knowing the defender's strategy, chooses which target to attack and for how long. If the attacker spends time $t$ at a target $i$ of value $α_i$, and if he leaves before the defender visits the target, his utility is $t \cdot α_i $; if the defender visits…

    Submitted 28 October, 2017; v1 submitted 22 November, 2016; originally announced November 2016.

    Comments: to appear in Proc. of SODA 2018

  40. arXiv:1611.04717  [pdf, other]

    cs.AI cs.LG

    #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning

    Authors: Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, Pieter Abbeel

    Abstract: Count-based exploration algorithms are known to perform near-optimally when used in conjunction with tabular reinforcement learning (RL) methods for solving small discrete Markov decision processes (MDPs). It is generally thought that count-based methods cannot be applied in high-dimensional state spaces, since most states will only occur once. Recent deep RL exploration strategies are able to dea…

    Submitted 5 December, 2017; v1 submitted 15 November, 2016; originally announced November 2016.

    Comments: 10 pages main text + 10 pages supplementary. Published at NIPS 2017

  41. arXiv:1611.02779  [pdf, other]

    cs.AI cs.LG cs.NE stat.ML

    RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning

    Authors: Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, Pieter Abbeel

    Abstract: Deep reinforcement learning (deep RL) has been successful in learning sophisticated behaviors automatically; however, the learning process requires a huge number of trials. In contrast, animals can learn new tasks in just a few trials, benefiting from their prior knowledge about the world. This paper seeks to bridge this gap. Rather than designing a "fast" reinforcement learning algorithm, we prop…

    Submitted 9 November, 2016; v1 submitted 8 November, 2016; originally announced November 2016.

    Comments: 14 pages. Under review as a conference paper at ICLR 2017

  42. arXiv:1611.02731  [pdf, other]

    cs.LG stat.ML

    Variational Lossy Autoencoder

    Authors: Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, Pieter Abbeel

    Abstract: Representation learning seeks to expose certain aspects of observed data in a learned representation that's amenable to downstream tasks like classification. For instance, a good representation for 2D images might be one that describes only global structure and discards information about detailed texture. In this paper, we present a simple but principled method to learn such global representations…

    Submitted 4 March, 2017; v1 submitted 8 November, 2016; originally announced November 2016.

    Comments: Added CIFAR10 experiments; ICLR 2017

  43. arXiv:1606.06565  [pdf, other]

    cs.AI cs.LG

    Concrete Problems in AI Safety

    Authors: Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mané

    Abstract: Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical…

    Submitted 25 July, 2016; v1 submitted 21 June, 2016; originally announced June 2016.

    Comments: 29 pages

  44. arXiv:1606.03657  [pdf, other]

    cs.LG stat.ML

    InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

    Authors: Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, Pieter Abbeel

    Abstract: This paper describes InfoGAN, an information-theoretic extension to the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner. InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation. We derive a lower bound to the mutual information obje…

    Submitted 11 June, 2016; originally announced June 2016.

  45. arXiv:1606.01540  [pdf, other]

    cs.LG cs.AI

    OpenAI Gym

    Authors: Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, Wojciech Zaremba

    Abstract: OpenAI Gym is a toolkit for reinforcement learning research. It includes a growing collection of benchmark problems that expose a common interface, and a website where people can share their results and compare the performance of algorithms. This whitepaper discusses the components of OpenAI Gym and the design decisions that went into the software.

    Submitted 5 June, 2016; originally announced June 2016.

  46. arXiv:1605.09674  [pdf, other]

    cs.LG cs.AI cs.RO stat.ML

    VIME: Variational Information Maximizing Exploration

    Authors: Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, Pieter Abbeel

    Abstract: Scalable and effective exploration remains a key challenge in reinforcement learning (RL). While there are methods with optimality guarantees in the setting of discrete state and action spaces, these methods cannot be applied in high-dimensional deep RL scenarios. As such, most contemporary RL relies on simple heuristics such as epsilon-greedy exploration or adding Gaussian noise to the controls.…

    Submitted 27 January, 2017; v1 submitted 31 May, 2016; originally announced May 2016.

    Comments: Published in Advances in Neural Information Processing Systems 29 (NIPS), pages 1109-1117

  47. arXiv:1605.09012  [pdf, other]

    cs.GT

    Market Dynamics of Best-Response with Lookahead

    Authors: Krishnamurthy Dvijotham, Yuval Rabani, Leonard J. Schulman

    Abstract: One attractive approach to market dynamics is the level $k$ model in which a level $0$ player adopts a very simple response to current conditions, a level $1$ player best-responds to a model in which others take level $0$ actions, and so forth. (This is analogous to $k$-ply exploration of game trees in AI, and to receding-horizon control in control theory.) If players have deterministic mental mod…

    Submitted 29 May, 2016; originally announced May 2016.

  48. arXiv:1605.02688  [pdf, other]

    cs.SC cs.LG cs.MS

    Theano: A Python framework for fast computation of mathematical expressions

    Authors: The Theano Development Team, Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, Alexander Belopolsky, Yoshua Bengio, Arnaud Bergeron, James Bergstra, Valentin Bisson, Josh Bleecher Snyder, Nicolas Bouchard, Nicolas Boulanger-Lewandowski, Xavier Bouthillier, Alexandre de Brébisson, Olivier Breuleux, Pierre-Luc Carrier, Kyunghyun Cho, Jan Chorowski, Paul Christiano, et al. (88 additional authors not shown)

    Abstract: Theano is a Python library that allows to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Since its introduction, it has been one of the most used CPU and GPU mathematical compilers - especially in the machine learning community - and has shown steady performance improvements. Theano is being actively and continuously developed since 2008, mu…

    Submitted 9 May, 2016; originally announced May 2016.

    Comments: 19 pages, 5 figures

  49. arXiv:1604.06778  [pdf, other]

    cs.LG cs.AI cs.RO

    Benchmarking Deep Reinforcement Learning for Continuous Control

    Authors: Yan Duan, Xi Chen, Rein Houthooft, John Schulman, Pieter Abbeel

    Abstract: Recently, researchers have made significant progress combining the advances in deep learning for learning feature representations with reinforcement learning. Some notable examples include training agents to play Atari games based on raw pixel data and to acquire advanced manipulation skills using raw sensory inputs. However, it has been difficult to quantify progress in the domain of continuous c…

    Submitted 27 May, 2016; v1 submitted 22 April, 2016; originally announced April 2016.

    Comments: 14 pages, ICML 2016

  50. arXiv:1506.05254  [pdf, other]

    cs.LG

    Gradient Estimation Using Stochastic Computation Graphs

    Authors: John Schulman, Nicolas Heess, Theophane Weber, Pieter Abbeel

    Abstract: In a variety of problems originating in supervised, unsupervised, and reinforcement learning, the loss function is defined by an expectation over a collection of random variables, which might be part of a probabilistic model or the external world. Estimating the gradient of this loss function, using samples, lies at the core of gradient-based learning algorithms for these problems. We introduce th…

    Submitted 5 January, 2016; v1 submitted 17 June, 2015; originally announced June 2015.

    Comments: Advances in Neural Information Processing Systems 28 (NIPS 2015)