-
Causal Bandits: The Pareto Optimal Frontier of Adaptivity, a Reduction to Linear Bandits, and Limitations around Unknown Marginals
Authors:
Ziyi Liu,
Idan Attias,
Daniel M. Roy
Abstract:
In this work, we investigate the problem of adapting to the presence or absence of causal structure in multi-armed bandit problems. In addition to the usual reward signal, we assume the learner has access to additional variables, observed in each round after acting. When these variables $d$-separate the action from the reward, existing work in causal bandits demonstrates that one can achieve stric…
▽ More
In this work, we investigate the problem of adapting to the presence or absence of causal structure in multi-armed bandit problems. In addition to the usual reward signal, we assume the learner has access to additional variables, observed in each round after acting. When these variables $d$-separate the action from the reward, existing work in causal bandits demonstrates that one can achieve strictly better (minimax) rates of regret (Lu et al., 2020). Our goal is to adapt to this favorable "conditionally benign" structure, if it is present in the environment, while simultaneously recovering worst-case minimax regret, if it is not. Notably, the learner has no prior knowledge of whether the favorable structure holds. In this paper, we establish the Pareto optimal frontier of adaptive rates. We prove upper and matching lower bounds on the possible trade-offs in the performance of learning in conditionally benign and arbitrary environments, resolving an open question raised by Bilodeau et al. (2022). Furthermore, we are the first to obtain instance-dependent bounds for causal bandits, by reducing the problem to the linear bandit setting. Finally, we examine the common assumption that the marginal distributions of the post-action contexts are known and show that a nontrivial estimate is necessary for better-than-worst-case minimax rates.
△ Less
Submitted 1 July, 2024;
originally announced July 2024.
-
Information Complexity of Stochastic Convex Optimization: Applications to Generalization and Memorization
Authors:
Idan Attias,
Gintare Karolina Dziugaite,
Mahdi Haghifam,
Roi Livni,
Daniel M. Roy
Abstract:
In this work, we investigate the interplay between memorization and learning in the context of \emph{stochastic convex optimization} (SCO). We define memorization via the information a learning algorithm reveals about its training data points. We then quantify this information using the framework of conditional mutual information (CMI) proposed by Steinke and Zakynthinou (2020). Our main result is…
▽ More
In this work, we investigate the interplay between memorization and learning in the context of \emph{stochastic convex optimization} (SCO). We define memorization via the information a learning algorithm reveals about its training data points. We then quantify this information using the framework of conditional mutual information (CMI) proposed by Steinke and Zakynthinou (2020). Our main result is a precise characterization of the tradeoff between the accuracy of a learning algorithm and its CMI, answering an open question posed by Livni (2023). We show that, in the $L^2$ Lipschitz--bounded setting and under strong convexity, every learner with an excess error $\varepsilon$ has CMI bounded below by $Ω(1/\varepsilon^2)$ and $Ω(1/\varepsilon)$, respectively. We further demonstrate the essential role of memorization in learning problems in SCO by designing an adversary capable of accurately identifying a significant fraction of the training samples in specific SCO problems. Finally, we enumerate several implications of our results, such as a limitation of generalization bounds based on CMI and the incompressibility of samples in SCO problems.
△ Less
Submitted 14 February, 2024;
originally announced February 2024.
-
Optimal Learners for Realizable Regression: PAC Learning and Online Learning
Authors:
Idan Attias,
Steve Hanneke,
Alkis Kalavasis,
Amin Karbasi,
Grigoris Velegkas
Abstract:
In this work, we aim to characterize the statistical complexity of realizable regression both in the PAC learning setting and the online learning setting. Previous work had established the sufficiency of finiteness of the fat shattering dimension for PAC learnability and the necessity of finiteness of the scaled Natarajan dimension, but little progress had been made towards a more complete charact…
▽ More
In this work, we aim to characterize the statistical complexity of realizable regression both in the PAC learning setting and the online learning setting. Previous work had established the sufficiency of finiteness of the fat shattering dimension for PAC learnability and the necessity of finiteness of the scaled Natarajan dimension, but little progress had been made towards a more complete characterization since the work of Simon (SICOMP '97). To this end, we first introduce a minimax instance optimal learner for realizable regression and propose a novel dimension that both qualitatively and quantitatively characterizes which classes of real-valued predictors are learnable. We then identify a combinatorial dimension related to the Graph dimension that characterizes ERM learnability in the realizable setting. Finally, we establish a necessary condition for learnability based on a combinatorial dimension related to the DS dimension, and conjecture that it may also be sufficient in this context. Additionally, in the context of online learning we provide a dimension that characterizes the minimax instance optimal cumulative loss up to a constant factor and design an optimal online learner for realizable regression, thus resolving an open question raised by Daskalakis and Golowich in STOC '22.
△ Less
Submitted 29 October, 2023; v1 submitted 7 July, 2023;
originally announced July 2023.
-
Online Learning and Solving Infinite Games with an ERM Oracle
Authors:
Angelos Assos,
Idan Attias,
Yuval Dagan,
Constantinos Daskalakis,
Maxwell Fishelson
Abstract:
While ERM suffices to attain near-optimal generalization error in the stochastic learning setting, this is not known to be the case in the online learning setting, where algorithms for general concept classes rely on computationally inefficient oracles such as the Standard Optimal Algorithm (SOA). In this work, we propose an algorithm for online binary classification setting that relies solely on…
▽ More
While ERM suffices to attain near-optimal generalization error in the stochastic learning setting, this is not known to be the case in the online learning setting, where algorithms for general concept classes rely on computationally inefficient oracles such as the Standard Optimal Algorithm (SOA). In this work, we propose an algorithm for online binary classification setting that relies solely on ERM oracle calls, and show that it has finite regret in the realizable setting and sublinearly growing regret in the agnostic setting. We bound the regret in terms of the Littlestone and threshold dimensions of the underlying concept class.
We obtain similar results for nonparametric games, where the ERM oracle can be interpreted as a best response oracle, finding the best response of a player to a given history of play of the other players. In this setting, we provide learning algorithms that only rely on best response oracles and converge to approximate-minimax equilibria in two-player zero-sum games and approximate coarse correlated equilibria in multi-player general-sum games, as long as the game has a bounded fat-threshold dimension. Our algorithms apply to both binary-valued and real-valued games and can be viewed as providing justification for the wide use of double oracle and multiple oracle algorithms in the practice of solving large games.
△ Less
Submitted 10 July, 2023; v1 submitted 4 July, 2023;
originally announced July 2023.
-
Adversarially Robust PAC Learnability of Real-Valued Functions
Authors:
Idan Attias,
Steve Hanneke
Abstract:
We study robustness to test-time adversarial attacks in the regression setting with $\ell_p$ losses and arbitrary perturbation sets. We address the question of which function classes are PAC learnable in this setting. We show that classes of finite fat-shattering dimension are learnable in both realizable and agnostic settings. Moreover, for convex function classes, they are even properly learnabl…
▽ More
We study robustness to test-time adversarial attacks in the regression setting with $\ell_p$ losses and arbitrary perturbation sets. We address the question of which function classes are PAC learnable in this setting. We show that classes of finite fat-shattering dimension are learnable in both realizable and agnostic settings. Moreover, for convex function classes, they are even properly learnable. In contrast, some non-convex function classes provably require improper learning algorithms. Our main technique is based on a construction of an adversarially robust sample compression scheme of a size determined by the fat-shattering dimension. Along the way, we introduce a novel agnostic sample compression scheme for real-valued functions, which may be of independent interest.
△ Less
Submitted 5 May, 2024; v1 submitted 26 June, 2022;
originally announced June 2022.
-
Learning Revenue Maximization using Posted Prices for Stochastic Strategic Patient Buyers
Authors:
Eitan-Hai Mashiah,
Idan Attias,
Yishay Mansour
Abstract:
We consider a seller faced with buyers which have the ability to delay their decision, which we call patience. Each buyer's type is composed of value and patience, and it is sampled i.i.d. from a distribution. The seller, using posted prices, would like to maximize her revenue from selling to the buyer. In this paper, we formalize this setting and characterize the resulting Stackelberg equilibrium…
▽ More
We consider a seller faced with buyers which have the ability to delay their decision, which we call patience. Each buyer's type is composed of value and patience, and it is sampled i.i.d. from a distribution. The seller, using posted prices, would like to maximize her revenue from selling to the buyer. In this paper, we formalize this setting and characterize the resulting Stackelberg equilibrium, where the seller first commits to her strategy, and then the buyers best respond. Following this, we show how to compute both the optimal pure and mixed strategies. We then consider a learning setting, where the seller does not have access to the distribution over buyer's types. Our main results are the following. We derive a sample complexity bound for the learning of an approximate optimal pure strategy, by computing the fat-shattering dimension of this setting. Moreover, we provide a general sample complexity bound for the approximate optimal mixed strategy. We also consider an online setting and derive a vanishing regret bound with respect to both the optimal pure strategy and the optimal mixed strategy.
△ Less
Submitted 26 June, 2022; v1 submitted 12 February, 2022;
originally announced February 2022.
-
A Characterization of Semi-Supervised Adversarially-Robust PAC Learnability
Authors:
Idan Attias,
Steve Hanneke,
Yishay Mansour
Abstract:
We study the problem of learning an adversarially robust predictor to test time attacks in the semi-supervised PAC model. We address the question of how many labeled and unlabeled examples are required to ensure learning. We show that having enough unlabeled data (the size of a labeled sample that a fully-supervised method would require), the labeled sample complexity can be arbitrarily smaller co…
▽ More
We study the problem of learning an adversarially robust predictor to test time attacks in the semi-supervised PAC model. We address the question of how many labeled and unlabeled examples are required to ensure learning. We show that having enough unlabeled data (the size of a labeled sample that a fully-supervised method would require), the labeled sample complexity can be arbitrarily smaller compared to previous works, and is sharply characterized by a different complexity measure. We prove nearly matching upper and lower bounds on this sample complexity. This shows that there is a significant benefit in semi-supervised robust learning even in the worst-case distribution-free model, and establishes a gap between the supervised and semi-supervised label complexities which is known not to hold in standard non-robust PAC learning.
△ Less
Submitted 5 May, 2024; v1 submitted 10 February, 2022;
originally announced February 2022.
-
Fat-Shattering Dimension of $k$-fold Aggregations
Authors:
Idan Attias,
Aryeh Kontorovich
Abstract:
We provide estimates on the fat-shattering dimension of aggregation rules of real-valued function classes. The latter consists of all ways of choosing $k$ functions, one from each of the $k$ classes, and computing a pointwise function of them, such as the median, mean, and maximum. The bound is stated in terms of the fat-shattering dimensions of the component classes. For linear and affine functio…
▽ More
We provide estimates on the fat-shattering dimension of aggregation rules of real-valued function classes. The latter consists of all ways of choosing $k$ functions, one from each of the $k$ classes, and computing a pointwise function of them, such as the median, mean, and maximum. The bound is stated in terms of the fat-shattering dimensions of the component classes. For linear and affine function classes, we provide a considerably sharper upper bound and a matching lower bound, achieving, in particular, an optimal dependence on $k$. Along the way, we improve several known results in addition to pointing out and correcting a number of erroneous claims in the literature.
△ Less
Submitted 9 September, 2023; v1 submitted 10 October, 2021;
originally announced October 2021.
-
Local Orthogonality Dimension
Authors:
Inon Attias,
Ishay Haviv
Abstract:
An orthogonal representation of a graph $G$ over a field $\mathbb{F}$ is an assignment of a vector $u_v \in \mathbb{F}^t$ to every vertex $v$ of $G$, such that $\langle u_v,u_v \rangle \neq 0$ for every vertex $v$ and $\langle u_v,u_{v'} \rangle = 0$ whenever $v$ and $v'$ are adjacent in $G$. The locality of the orthogonal representation is the largest dimension of a subspace spanned by the vector…
▽ More
An orthogonal representation of a graph $G$ over a field $\mathbb{F}$ is an assignment of a vector $u_v \in \mathbb{F}^t$ to every vertex $v$ of $G$, such that $\langle u_v,u_v \rangle \neq 0$ for every vertex $v$ and $\langle u_v,u_{v'} \rangle = 0$ whenever $v$ and $v'$ are adjacent in $G$. The locality of the orthogonal representation is the largest dimension of a subspace spanned by the vectors associated with a closed neighborhood in the graph. We introduce a novel graph parameter, called the local orthogonality dimension, defined for a given graph $G$ and a given field $\mathbb{F}$, as the smallest possible locality of an orthogonal representation of $G$ over $\mathbb{F}$. We investigate the usefulness of topological methods for proving lower bounds on the local orthogonality dimension. We prove that graphs for which topological methods imply a lower bound of $t$ on their chromatic number have local orthogonality dimension at least $\lceil t/2 \rceil +1$ over every field, strengthening a result of Simonyi and Tardos on the local chromatic number. We show that for certain graphs this lower bound is tight, whereas for others, the local orthogonality dimension over the reals is equal to the chromatic number. More generally, we prove that for every complement of a line graph, the local orthogonality dimension over $\mathbb{R}$ coincides with the chromatic number. This strengthens a recent result by Daneshpajouh, Meunier, and Mizrahi, who proved that the local and standard chromatic numbers of these graphs are equal. As another extension of their result, we prove that the local and standard chromatic numbers are equal for some additional graphs, from the family of Kneser graphs. We also show an $\mathsf{NP}$-hardness result for the local orthogonality dimension and present an application of this graph parameter to the index coding problem from information theory.
△ Less
Submitted 7 April, 2023; v1 submitted 1 October, 2021;
originally announced October 2021.
-
A Framework for Adversarial Streaming via Differential Privacy and Difference Estimators
Authors:
Idan Attias,
Edith Cohen,
Moshe Shechner,
Uri Stemmer
Abstract:
Classical streaming algorithms operate under the (not always reasonable) assumption that the input stream is fixed in advance. Recently, there is a growing interest in designing robust streaming algorithms that provide provable guarantees even when the input stream is chosen adaptively as the execution progresses. We propose a new framework for robust streaming that combines techniques from two re…
▽ More
Classical streaming algorithms operate under the (not always reasonable) assumption that the input stream is fixed in advance. Recently, there is a growing interest in designing robust streaming algorithms that provide provable guarantees even when the input stream is chosen adaptively as the execution progresses. We propose a new framework for robust streaming that combines techniques from two recently suggested frameworks by Hassidim et al. [NeurIPS 2020] and by Woodruff and Zhou [FOCS 2021]. These recently suggested frameworks rely on very different ideas, each with its own strengths and weaknesses. We combine these two frameworks into a single hybrid framework that obtains the ``best of both worlds'', thereby solving a question left open by Woodruff and Zhou.
△ Less
Submitted 26 September, 2022; v1 submitted 30 July, 2021;
originally announced July 2021.
-
Domain Invariant Adversarial Learning
Authors:
Matan Levi,
Idan Attias,
Aryeh Kontorovich
Abstract:
The phenomenon of adversarial examples illustrates one of the most basic vulnerabilities of deep neural networks. Among the variety of techniques introduced to surmount this inherent weakness, adversarial training has emerged as the most effective strategy for learning robust models. Typically, this is achieved by balancing robust and natural objectives. In this work, we aim to further optimize th…
▽ More
The phenomenon of adversarial examples illustrates one of the most basic vulnerabilities of deep neural networks. Among the variety of techniques introduced to surmount this inherent weakness, adversarial training has emerged as the most effective strategy for learning robust models. Typically, this is achieved by balancing robust and natural objectives. In this work, we aim to further optimize the trade-off between robust and standard accuracy by enforcing a domain-invariant feature representation. We present a new adversarial training method, Domain Invariant Adversarial Learning (DIAL), which learns a feature representation that is both robust and domain invariant. DIAL uses a variant of Domain Adversarial Neural Network (DANN) on the natural domain and its corresponding adversarial domain. In the case where the source domain consists of natural examples and the target domain is the adversarially perturbed examples, our method learns a feature representation constrained not to discriminate between the natural and adversarial examples, and can therefore achieve a more robust representation. DIAL is a generic and modular technique that can be easily incorporated into any adversarial training method. Our experiments indicate that incorporating DIAL in the adversarial training process improves both robustness and standard accuracy.
△ Less
Submitted 13 September, 2022; v1 submitted 1 April, 2021;
originally announced April 2021.
-
Prediction with Corrupted Expert Advice
Authors:
Idan Amir,
Idan Attias,
Tomer Koren,
Roi Livni,
Yishay Mansour
Abstract:
We revisit the fundamental problem of prediction with expert advice, in a setting where the environment is benign and generates losses stochastically, but the feedback observed by the learner is subject to a moderate adversarial corruption. We prove that a variant of the classical Multiplicative Weights algorithm with decreasing step sizes achieves constant regret in this setting and performs opti…
▽ More
We revisit the fundamental problem of prediction with expert advice, in a setting where the environment is benign and generates losses stochastically, but the feedback observed by the learner is subject to a moderate adversarial corruption. We prove that a variant of the classical Multiplicative Weights algorithm with decreasing step sizes achieves constant regret in this setting and performs optimally in a wide range of environments, regardless of the magnitude of the injected corruption. Our results reveal a surprising disparity between the often comparable Follow the Regularized Leader (FTRL) and Online Mirror Descent (OMD) frameworks: we show that for experts in the corrupted stochastic regime, the regret performance of OMD is in fact strictly inferior to that of FTRL.
△ Less
Submitted 20 October, 2020; v1 submitted 24 February, 2020;
originally announced February 2020.
-
Improved Generalization Bounds for Adversarially Robust Learning
Authors:
Idan Attias,
Aryeh Kontorovich,
Yishay Mansour
Abstract:
We consider a model of robust learning in an adversarial environment. The learner gets uncorrupted training data with access to possible corruptions that may be affected by the adversary during testing. The learner's goal is to build a robust classifier, which will be tested on future adversarial examples. The adversary is limited to $k$ possible corruptions for each input. We model the learner-ad…
▽ More
We consider a model of robust learning in an adversarial environment. The learner gets uncorrupted training data with access to possible corruptions that may be affected by the adversary during testing. The learner's goal is to build a robust classifier, which will be tested on future adversarial examples. The adversary is limited to $k$ possible corruptions for each input. We model the learner-adversary interaction as a zero-sum game. This model is closely related to the adversarial examples model of Schmidt et al. (2018); Madry et al. (2017).
Our main results consist of generalization bounds for the binary and multiclass classification, as well as the real-valued case (regression). For the binary classification setting, we both tighten the generalization bound of Feige et al. (2015), and are also able to handle infinite hypothesis classes. The sample complexity is improved from $O(\frac{1}{ε^4}\log(\frac{|H|}δ))$ to $O\big(\frac{1}{ε^2}(kVC(H)\log^{\frac{3}{2}+α}(kVC(H))+\log(\frac{1}δ)\big)$ for any $α> 0$. Additionally, we extend the algorithm and generalization bound from the binary to the multiclass and real-valued cases. Along the way, we obtain results on fat-shattering dimension and Rademacher complexity of $k$-fold maxima over function classes; these may be of independent interest.
For binary classification, the algorithm of Feige et al. (2015) uses a regret minimization algorithm and an ERM oracle as a black box; we adapt it for the multiclass and regression settings. The algorithm provides us with near-optimal policies for the players on a given training sample.
△ Less
Submitted 1 July, 2022; v1 submitted 4 October, 2018;
originally announced October 2018.
-
Agnostic Sample Compression Schemes for Regression
Authors:
Idan Attias,
Steve Hanneke,
Aryeh Kontorovich,
Menachem Sadigurschi
Abstract:
We obtain the first positive results for bounded sample compression in the agnostic regression setting with the $\ell_p$ loss, where $p\in [1,\infty]$. We construct a generic approximate sample compression scheme for real-valued function classes exhibiting exponential size in the fat-shattering dimension but independent of the sample size. Notably, for linear regression, an approximate compression…
▽ More
We obtain the first positive results for bounded sample compression in the agnostic regression setting with the $\ell_p$ loss, where $p\in [1,\infty]$. We construct a generic approximate sample compression scheme for real-valued function classes exhibiting exponential size in the fat-shattering dimension but independent of the sample size. Notably, for linear regression, an approximate compression of size linear in the dimension is constructed. Moreover, for $\ell_1$ and $\ell_\infty$ losses, we can even exhibit an efficient exact sample compression scheme of size linear in the dimension. We further show that for every other $\ell_p$ loss, $p\in (1,\infty)$, there does not exist an exact agnostic compression scheme of bounded size. This refines and generalizes a negative result of David, Moran, and Yehudayoff for the $\ell_2$ loss. We close by posing general open questions: for agnostic regression with $\ell_1$ loss, does every function class admits an exact compression scheme of size equal to its pseudo-dimension? For the $\ell_2$ loss, does every function class admit an approximate compression scheme of polynomial size in the fat-shattering dimension? These questions generalize Warmuth's classic sample compression conjecture for realizable-case classification.
△ Less
Submitted 3 February, 2024; v1 submitted 3 October, 2018;
originally announced October 2018.