subscribe to arXiv mailings

Distribution Estimation under the Infinity Norm

Authors: Aryeh Kontorovich, Amichai Painsky

Abstract: We present novel bounds for estimating discrete probability distributions under the $\ell_\infty$ norm. These are nearly optimal in various precise senses, including a kind of instance-optimality. Our data-dependent convergence guarantees for the maximum likelihood estimator significantly improve upon the currently known results. A variety of techniques are utilized and innovated upon, including C… ▽ More We present novel bounds for estimating discrete probability distributions under the $\ell_\infty$ norm. These are nearly optimal in various precise senses, including a kind of instance-optimality. Our data-dependent convergence guarantees for the maximum likelihood estimator significantly improve upon the currently known results. A variety of techniques are utilized and innovated upon, including Chernoff-type inequalities and empirical Bernstein bounds. We illustrate our results in synthetic and real-world experiments. Finally, we apply our proposed framework to a basic selective inference problem, where we estimate the most frequent probabilities in a sample. △ Less

Submitted 13 February, 2024; originally announced February 2024.

Comments: Distribution Estimation, Probability Estimation, Infinity Norm

arXiv:2402.07058 [pdf, ps, other]

Correlated Binomial Process

Authors: Moïse Blanchard, Doron Cohen, Aryeh Kontorovich

Abstract: Cohen and Kontorovich (COLT 2023) initiated the study of what we call here the Binomial Empirical Process: the maximal absolute value of a sequence of inhomogeneous normalized and centered binomials. They almost fully analyzed the case where the binomials are independent, and the remaining gap was closed by Blanchard and Voráček (ALT 2024). In this work, we study the much more general and challeng… ▽ More Cohen and Kontorovich (COLT 2023) initiated the study of what we call here the Binomial Empirical Process: the maximal absolute value of a sequence of inhomogeneous normalized and centered binomials. They almost fully analyzed the case where the binomials are independent, and the remaining gap was closed by Blanchard and Voráček (ALT 2024). In this work, we study the much more general and challenging case with correlations. In contradistinction to Gaussian processes, whose behavior is characterized by the covariance structure, we discover that, at least somewhat surprisingly, for binomial processes covariance does not even characterize convergence. Although a full characterization remains out of reach, we take the first steps with nontrivial upper and lower bounds in terms of covering numbers. △ Less

Submitted 10 February, 2024; originally announced February 2024.

arXiv:2401.07740 [pdf, ps, other]

Counting in Lattice Orbits

Authors: Alex Kontorovich, Christopher Lutsko

Abstract: Given a discrete lattice, $Γ< \text{SL}_m(\mathbb{R})$, and a base point $o\in \mathbb{R}^m$, let $N_Γ(T)$ denote the number of points in the orbit $o\cdot Γ$ whose (Euclidean) length is bounded by a growing parameter, $T$. We demonstrate an abstract spectral method capable of obtaining strong asymptotic estimates for $N_Γ(T)$ without relying on the meromorphic continuation of higher rank Langland… ▽ More Given a discrete lattice, $Γ< \text{SL}_m(\mathbb{R})$, and a base point $o\in \mathbb{R}^m$, let $N_Γ(T)$ denote the number of points in the orbit $o\cdot Γ$ whose (Euclidean) length is bounded by a growing parameter, $T$. We demonstrate an abstract spectral method capable of obtaining strong asymptotic estimates for $N_Γ(T)$ without relying on the meromorphic continuation of higher rank Langlands Eisenstein series. △ Less

Submitted 15 January, 2024; originally announced January 2024.

Comments: 13 pages, comments welcome

MSC Class: 11H06; 11F22

arXiv:2310.02896 [pdf, other]

Notes on a Path to AI Assistance in Mathematical Reasoning

Authors: Alex Kontorovich

Abstract: These informal notes are based on the author's lecture at the National Academies of Science, Engineering, and Mathematics workshop on "AI to Assist Mathematical Reasoning" in June 2023. The goal is to think through a path by which we might arrive at AI that is useful for the research mathematician. These informal notes are based on the author's lecture at the National Academies of Science, Engineering, and Mathematics workshop on "AI to Assist Mathematical Reasoning" in June 2023. The goal is to think through a path by which we might arrive at AI that is useful for the research mathematician. △ Less

Submitted 4 October, 2023; originally announced October 2023.

Comments: 7 pages

arXiv:2310.02480 [pdf, other]

Splitting the Difference on Adversarial Training

Authors: Matan Levi, Aryeh Kontorovich

Abstract: The existence of adversarial examples points to a basic weakness of deep neural networks. One of the most effective defenses against such examples, adversarial training, entails training models with some degree of robustness, usually at the expense of a degraded natural accuracy. Most adversarial training methods aim to learn a model that finds, for each class, a common decision boundary encompass… ▽ More The existence of adversarial examples points to a basic weakness of deep neural networks. One of the most effective defenses against such examples, adversarial training, entails training models with some degree of robustness, usually at the expense of a degraded natural accuracy. Most adversarial training methods aim to learn a model that finds, for each class, a common decision boundary encompassing both the clean and perturbed examples. In this work, we take a fundamentally different approach by treating the perturbed examples of each class as a separate class to be learned, effectively splitting each class into two classes: "clean" and "adversarial." This split doubles the number of classes to be learned, but at the same time considerably simplifies the decision boundaries. We provide a theoretical plausibility argument that sheds some light on the conditions under which our approach can be expected to be beneficial. Likewise, we empirically demonstrate that our method learns robust models while attaining optimal or near-optimal natural accuracy, e.g., on CIFAR-10 we obtain near-optimal natural accuracy of $95.01\%$ alongside significant robustness across multiple tasks. The ability to achieve such near-optimal natural accuracy, while maintaining a significant level of robustness, makes our method applicable to real-world applications where natural accuracy is at a premium. As a whole, our main contribution is a general method that confers a significant level of robustness upon classifiers with only minor or negligible degradation of their natural accuracy. △ Less

Submitted 3 October, 2023; originally announced October 2023.

arXiv:2309.17016 [pdf, other]

Efficient Agnostic Learning with Average Smoothness

Authors: Steve Hanneke, Aryeh Kontorovich, Guy Kornowski

Abstract: We study distribution-free nonparametric regression following a notion of average smoothness initiated by Ashlagi et al. (2021), which measures the "effective" smoothness of a function with respect to an arbitrary unknown underlying distribution. While the recent work of Hanneke et al. (2023) established tight uniform convergence bounds for average-smooth functions in the realizable case and provi… ▽ More We study distribution-free nonparametric regression following a notion of average smoothness initiated by Ashlagi et al. (2021), which measures the "effective" smoothness of a function with respect to an arbitrary unknown underlying distribution. While the recent work of Hanneke et al. (2023) established tight uniform convergence bounds for average-smooth functions in the realizable case and provided a computationally efficient realizable learning algorithm, both of these results currently lack analogs in the general agnostic (i.e. noisy) case. In this work, we fully close these gaps. First, we provide a distribution-free uniform convergence bound for average-smoothness classes in the agnostic setting. Second, we match the derived sample complexity with a computationally efficient agnostic learning algorithm. Our results, which are stated in terms of the intrinsic geometry of the data and hold over any totally bounded metric space, show that the guarantees recently obtained for realizable learning of average-smooth functions transfer to the agnostic setting. At the heart of our proof, we establish the uniform convergence rate of a function class in terms of its bracketing entropy, which may be of independent interest. △ Less

Submitted 13 February, 2024; v1 submitted 29 September, 2023; originally announced September 2023.

Comments: ALT 2024 camera ready version. arXiv admin note: text overlap with arXiv:2302.06005

arXiv:2306.10206 [pdf, other]

Remarks on the $\mathbb{Z}/p$ Dijkgraaf-Witten invariants of 3D mapping tori

Authors: William Chen, Alex Kontorovich, Shehryar Sikander

Abstract: We make some remarks on the $\mathbb{Z}/p$ Dijkgraaf-Witten invariants of 3D mapping tori and determine the asymptotic behavior of their sum over all diffeomorphism classes of genus one mapping tori. We make some remarks on the $\mathbb{Z}/p$ Dijkgraaf-Witten invariants of 3D mapping tori and determine the asymptotic behavior of their sum over all diffeomorphism classes of genus one mapping tori. △ Less

Submitted 15 November, 2023; v1 submitted 16 June, 2023; originally announced June 2023.

Comments: 1 figure

arXiv:2305.15162 [pdf, ps, other]

Mean square of Eisenstein series

Authors: Dubi Kelmer, Alex Kontorovich, Christopher Lutsko

Abstract: We study the sup-norm and mean-square-norm problems for Eisenstein series on certain arithmetic hyperbolic orbifolds, producing sharp exponents for the modular surface and Picard 3-fold. The methods involve bounds for Epstein zeta functions, and counting restricted values of indefinite quadratic forms at integer points. We study the sup-norm and mean-square-norm problems for Eisenstein series on certain arithmetic hyperbolic orbifolds, producing sharp exponents for the modular surface and Picard 3-fold. The methods involve bounds for Epstein zeta functions, and counting restricted values of indefinite quadratic forms at integer points. △ Less

Submitted 24 May, 2023; originally announced May 2023.

Comments: 13 pages

MSC Class: 11M36; 11E45

arXiv:2302.14150 [pdf, ps, other]

Decoupling Maximal Inequalities

Authors: Aryeh Kontorovich

Abstract: A {\em maximal inequality} seeks to estimate $\mathbb{E}\max_i X_i$ in terms of properties of the $X_i$. When the latter are independent, the union bound (in its various guises) can yield tight upper bounds. If, however, the $X_i$ are strongly dependent, the estimates provided by the union bound will be rather loose. In this note, we show that for non-negative random variables, pairwise independen… ▽ More A {\em maximal inequality} seeks to estimate $\mathbb{E}\max_i X_i$ in terms of properties of the $X_i$. When the latter are independent, the union bound (in its various guises) can yield tight upper bounds. If, however, the $X_i$ are strongly dependent, the estimates provided by the union bound will be rather loose. In this note, we show that for non-negative random variables, pairwise independence suffices for the maximal inequality to behave comparably to its independent version. The condition of pairwise independence may be relaxed to a kind of negative dependence, and even the latter admits violations -- provided these are properly quantified. △ Less

Submitted 1 March, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

arXiv:2302.06005 [pdf, other]

Near-optimal learning with average Hölder smoothness

Authors: Steve Hanneke, Aryeh Kontorovich, Guy Kornowski

Abstract: We generalize the notion of average Lipschitz smoothness proposed by Ashlagi et al. (COLT 2021) by extending it to Hölder smoothness. This measure of the "effective smoothness" of a function is sensitive to the underlying distribution and can be dramatically smaller than its classic "worst-case" Hölder constant. We consider both the realizable and the agnostic (noisy) regression settings, proving… ▽ More We generalize the notion of average Lipschitz smoothness proposed by Ashlagi et al. (COLT 2021) by extending it to Hölder smoothness. This measure of the "effective smoothness" of a function is sensitive to the underlying distribution and can be dramatically smaller than its classic "worst-case" Hölder constant. We consider both the realizable and the agnostic (noisy) regression settings, proving upper and lower risk bounds in terms of the average Hölder smoothness; these rates improve upon both previously known rates even in the special case of average Lipschitz smoothness. Moreover, our lower bound is tight in the realizable setting up to log factors, thus we establish the minimax rate. From an algorithmic perspective, since our notion of average smoothness is defined with respect to the unknown underlying distribution, the learner does not have an explicit representation of the function class, hence is unable to execute ERM. Nevertheless, we provide distinct learning algorithms that achieve both (nearly) optimal learning rates. Our results hold in any totally bounded metric space, and are stated in terms of its intrinsic geometry. Overall, our results show that the classic worst-case notion of Hölder smoothness can be essentially replaced by its average, yielding considerably sharper guarantees. △ Less

Submitted 30 October, 2023; v1 submitted 12 February, 2023; originally announced February 2023.

Comments: NeurIPS 2023 camera ready version

arXiv:2212.04216 [pdf, ps, other]

Differentially-Private Bayes Consistency

Authors: Olivier Bousquet, Haim Kaplan, Aryeh Kontorovich, Yishay Mansour, Shay Moran, Menachem Sadigurschi, Uri Stemmer

Abstract: We construct a universally Bayes consistent learning rule that satisfies differential privacy (DP). We first handle the setting of binary classification and then extend our rule to the more general setting of density estimation (with respect to the total variation metric). The existence of a universally consistent DP learner reveals a stark difference with the distribution-free PAC model. Indeed,… ▽ More We construct a universally Bayes consistent learning rule that satisfies differential privacy (DP). We first handle the setting of binary classification and then extend our rule to the more general setting of density estimation (with respect to the total variation metric). The existence of a universally consistent DP learner reveals a stark difference with the distribution-free PAC model. Indeed, in the latter DP learning is extremely limited: even one-dimensional linear classifiers are not privately learnable in this stringent model. Our result thus demonstrates that by allowing the learning rate to depend on the target distribution, one can circumvent the above-mentioned impossibility result and in fact, learn \emph{arbitrary} distributions by a single DP algorithm. As an application, we prove that any VC class can be privately learned in a semi-supervised setting with a near-optimal \emph{labeled} sample complexity of $\tilde{O}(d/\varepsilon)$ labeled examples (and with an unlabeled sample complexity that can depend on the target distribution). △ Less

Submitted 8 December, 2022; originally announced December 2022.

arXiv:2210.13969 [pdf, other]

Sarnak's spectral gap question

Authors: Dubi Kelmer, Alex Kontorovich, Christopher Lutsko

Abstract: We answer in the affirmative a question of Sarnak's from 2007, confirming that the Patterson-Sullivan base eigenfunction is the unique square-integrable eigenfunction of the hyperbolic Laplacian invariant under the group of symmetries of the Apollonian packing. Thus the latter has a maximal spectral gap. We prove further restrictions on the spectrum of the Laplacian on a wide class of manifolds co… ▽ More We answer in the affirmative a question of Sarnak's from 2007, confirming that the Patterson-Sullivan base eigenfunction is the unique square-integrable eigenfunction of the hyperbolic Laplacian invariant under the group of symmetries of the Apollonian packing. Thus the latter has a maximal spectral gap. We prove further restrictions on the spectrum of the Laplacian on a wide class of manifolds coming from Kleinian sphere packings. △ Less

Submitted 25 October, 2022; originally announced October 2022.

Comments: 8 pages, 4 figures

MSC Class: 52C17; 20F55; 11F72; 22E40

arXiv:2209.04054 [pdf, ps, other]

Local Glivenko-Cantelli

Authors: Doron Cohen, Aryeh Kontorovich

Abstract: If $μ$ is a distribution over the $d$-dimensional Boolean cube $\{0,1\}^d$, our goal is to estimate its mean $p\in[0,1]^d$ based on $n$ iid draws from $μ$. Specifically, we consider the empirical mean estimator $\hat p_n$ and study the expected maximal deviation $Δ_n=\mathbb{E}\max_{j\in[d]}|\hat p_n(j)-p(j)|$. In the classical Universal Glivenko-Cantelli setting, one seeks distribution-free (i.e.… ▽ More If $μ$ is a distribution over the $d$-dimensional Boolean cube $\{0,1\}^d$, our goal is to estimate its mean $p\in[0,1]^d$ based on $n$ iid draws from $μ$. Specifically, we consider the empirical mean estimator $\hat p_n$ and study the expected maximal deviation $Δ_n=\mathbb{E}\max_{j\in[d]}|\hat p_n(j)-p(j)|$. In the classical Universal Glivenko-Cantelli setting, one seeks distribution-free (i.e., independent of $μ$) bounds on $Δ_n$. This regime is well-understood: for all $μ$, we have $Δ_n\lesssim\sqrt{\log(d)/n}$ up to universal constants, and the bound is tight. Our present work seeks to establish dimension-free (i.e., without an explicit dependence on $d$) estimates on $Δ_n$, including those that hold for $d=\infty$. As such bounds must necessarily depend on $μ$, we refer to this regime as {\em local} Glivenko-Cantelli (also known as $μ$-GC), and are aware of very few previous bounds of this type -- which are either ``abstract'' or quite sub-optimal. Already the special case of product measures $μ$ is rather non-trivial. We give necessary and sufficient conditions on $μ$ for $Δ_n\to0$, and calculate sharp rates for this decay. Along the way, we discover a novel sub-gamma-type maximal inequality for shifted Bernoullis, of independent interest. △ Less

Submitted 29 June, 2023; v1 submitted 8 September, 2022; originally announced September 2022.

arXiv:2209.00175 [pdf, other]

Improved Estimation of Relaxation Time in Non-reversible Markov Chains

Authors: Geoffrey Wolfer, Aryeh Kontorovich

Abstract: We show that the minimax sample complexity for estimating the pseudo-spectral gap $γ_{\mathsf{ps}}$ of an ergodic Markov chain in constant multiplicative error is of the order of $$\tildeΘ\left( \frac{1}{γ_{\mathsf{ps}} π_{\star}} \right),$$ where $π_\star$ is the minimum stationary probability, recovering the known bound in the reversible setting for estimating the absolute spectral gap [Hsu et a… ▽ More We show that the minimax sample complexity for estimating the pseudo-spectral gap $γ_{\mathsf{ps}}$ of an ergodic Markov chain in constant multiplicative error is of the order of $$\tildeΘ\left( \frac{1}{γ_{\mathsf{ps}} π_{\star}} \right),$$ where $π_\star$ is the minimum stationary probability, recovering the known bound in the reversible setting for estimating the absolute spectral gap [Hsu et al., 2019], and resolving an open problem of Wolfer and Kontorovich [2019]. Furthermore, we strengthen the known empirical procedure by making it fully-adaptive to the data, thinning the confidence intervals and reducing the computational complexity. Along the way, we derive new properties of the pseudo-spectral gap and introduce the notion of a reversible dilation of a stochastic matrix. △ Less

Submitted 4 August, 2023; v1 submitted 31 August, 2022; originally announced September 2022.

arXiv:2205.13004 [pdf, other]

Effective counting in sphere packings

Authors: Alex Kontorovich, Christopher Lutsko

Abstract: Given a Zariski-dense, discrete group, $Γ$, of isometries acting on $(n + 1)$-dimensional hyperbolic space, we use spectral methods to obtain a sharp asymptotic formula for the growth rate of certain $Γ$-orbits. In particular, this allows us to obtain a best-known effective error rate for the Apollonian and (more generally) Kleinian sphere packing counting problems, that is, counting the number of… ▽ More Given a Zariski-dense, discrete group, $Γ$, of isometries acting on $(n + 1)$-dimensional hyperbolic space, we use spectral methods to obtain a sharp asymptotic formula for the growth rate of certain $Γ$-orbits. In particular, this allows us to obtain a best-known effective error rate for the Apollonian and (more generally) Kleinian sphere packing counting problems, that is, counting the number of spheres in such with radius bounded by a growing parameter. Our method extends the method of Kontorovich [Kon09], which was itself an extension of the orbit counting method of Lax-Phillips [LP82], in two ways. First, we remove a compactness condition on the discrete subgroups considered via a technical cut-off and smoothing operation. Second, we develop a coordinate system which naturally corresponds to the inversive geometry underlying the sphere counting problem, and give structure theorems on the arising Casimir operator and Haar measure in these coordinates. △ Less

Submitted 18 November, 2022; v1 submitted 25 May, 2022; originally announced May 2022.

Comments: 38 pages, 4 figures. Added assumption about complementary series representations

MSC Class: 52C17; 20F55; 11F72; 22E40

arXiv:2203.01973 [pdf, other]

Kleinian sphere packings, reflection groups, and arithmeticity

Authors: Nikolay Bogachev, Alexander Kolpakov, Alex Kontorovich

Abstract: In this paper we study crystallographic sphere packings and Kleinian sphere packings, introduced first by Kontorovich and Nakamura in 2017 and then studied further by Kapovich and Kontorovich in 2021. In particular, we solve the problem of existence of crystallographic sphere packings in certain higher dimensions posed by Kontorovich and Nakamura. In addition, we present a geometric doubling proce… ▽ More In this paper we study crystallographic sphere packings and Kleinian sphere packings, introduced first by Kontorovich and Nakamura in 2017 and then studied further by Kapovich and Kontorovich in 2021. In particular, we solve the problem of existence of crystallographic sphere packings in certain higher dimensions posed by Kontorovich and Nakamura. In addition, we present a geometric doubling procedure allowing to obtain sphere packings from some Coxeter polyhedra without isolated roots, and study "properly integral" packings (that is, ones which are integral but not superintegral). Our techniques rely extensively on computations with Lorentzian quadratic forms, their orthogonal groups, and associated higher-dimensional hyperbolic polyhedra. △ Less

Submitted 12 April, 2024; v1 submitted 3 March, 2022; originally announced March 2022.

Comments: 20 pages, 5 figures; ancillary files available on Github https://github.com/sashakolpakov/crystallographic-packings; final version; Math. Comp. 2024, Vol. 93, no. 345, pp. 505-521

MSC Class: 52C17; 20F55; 22E40

arXiv:2202.03045 [pdf, ps, other]

Metric-valued regression

Authors: Dan Tsir Cohen, Aryeh Kontorovich

Abstract: We propose an efficient algorithm for learning mappings between two metric spaces, $\X$ and $\Y$. Our procedure is strongly Bayes-consistent whenever $\X$ and $\Y$ are topologically separable and $\Y$ is "bounded in expectation" (our term; the separability assumption can be somewhat weakened). At this level of generality, ours is the first such learnability result for unbounded loss in the agnosti… ▽ More We propose an efficient algorithm for learning mappings between two metric spaces, $\X$ and $\Y$. Our procedure is strongly Bayes-consistent whenever $\X$ and $\Y$ are topologically separable and $\Y$ is "bounded in expectation" (our term; the separability assumption can be somewhat weakened). At this level of generality, ours is the first such learnability result for unbounded loss in the agnostic setting. Our technique is based on metric medoids (a variant of Fréchet means) and presents a significant departure from existing methods, which, as we demonstrate, fail to achieve Bayes-consistency on general instance- and label-space metrics. Our proofs introduce the technique of {\em semi-stable compression}, which may be of independent interest. △ Less

Submitted 7 February, 2022; originally announced February 2022.

arXiv:2201.10955 [pdf, other]

On Length Sets of Subarithmetic Hyperbolic Manifolds

Authors: Alex Kontorovich, Xin Zhang

Abstract: We formulate the Asymptotic Length-Saturation Conjecture on the length sets of closed geodesics on hyperbolic manifolds whose fundamental groups are subarithmetic, that is, contained in an arithmetic group. We prove the first instance of the conjecture for punctured, Zariski dense covers of the modular surface. The tools involved include the Orbital Circle Method, expansion and counting in congrue… ▽ More We formulate the Asymptotic Length-Saturation Conjecture on the length sets of closed geodesics on hyperbolic manifolds whose fundamental groups are subarithmetic, that is, contained in an arithmetic group. We prove the first instance of the conjecture for punctured, Zariski dense covers of the modular surface. The tools involved include the Orbital Circle Method, expansion and counting in congruence towers of thin groups, estimates for exponential sums, bilinear forms, and quadratic L-series. △ Less

Submitted 26 January, 2022; originally announced January 2022.

Comments: 68 pages, 1 figure

MSC Class: 11F06; 22E40; 11L07; 14G12; 11M20

arXiv:2201.08704 [pdf, ps, other]

Adaptive Data Analysis with Correlated Observations

Authors: Aryeh Kontorovich, Menachem Sadigurschi, Uri Stemmer

Abstract: The vast majority of the work on adaptive data analysis focuses on the case where the samples in the dataset are independent. Several approaches and tools have been successfully applied in this context, such as differential privacy, max-information, compression arguments, and more. The situation is far less well-understood without the independence assumption. We embark on a systematic study of t… ▽ More The vast majority of the work on adaptive data analysis focuses on the case where the samples in the dataset are independent. Several approaches and tools have been successfully applied in this context, such as differential privacy, max-information, compression arguments, and more. The situation is far less well-understood without the independence assumption. We embark on a systematic study of the possibilities of adaptive data analysis with correlated observations. First, we show that, in some cases, differential privacy guarantees generalization even when there are dependencies within the sample, which we quantify using a notion we call Gibbs-dependence. We complement this result with a tight negative example. Second, we show that the connection between transcript-compression and adaptive data analysis can be extended to the non-iid setting. △ Less

Submitted 21 January, 2022; originally announced January 2022.

arXiv:2111.11971 [pdf, ps, other]

Tree density estimation

Authors: László Györfi, Aryeh Kontorovich, Roi Weiss

Abstract: We study the problem of estimating the density $f(\boldsymbol x)$ of a random vector ${\boldsymbol X}$ in $\mathbb R^d$. For a spanning tree $T$ defined on the vertex set $\{1,\dots ,d\}$, the tree density $f_{T}$ is a product of bivariate conditional densities. An optimal spanning tree minimizes the Kullback-Leibler divergence between $f$ and $f_{T}$. From i.i.d. data we identify an optimal tree… ▽ More We study the problem of estimating the density $f(\boldsymbol x)$ of a random vector ${\boldsymbol X}$ in $\mathbb R^d$. For a spanning tree $T$ defined on the vertex set $\{1,\dots ,d\}$, the tree density $f_{T}$ is a product of bivariate conditional densities. An optimal spanning tree minimizes the Kullback-Leibler divergence between $f$ and $f_{T}$. From i.i.d. data we identify an optimal tree $T^*$ and efficiently construct a tree density estimate $f_n$ such that, without any regularity conditions on the density $f$, one has $\lim_{n\to \infty} \int |f_n(\boldsymbol x)-f_{T^*}(\boldsymbol x)|d\boldsymbol x=0$ a.s. For Lipschitz $f$ with bounded support, $\mathbb E \left\{ \int |f_n(\boldsymbol x)-f_{T^*}(\boldsymbol x)|d\boldsymbol x\right\}=O\big(n^{-1/4}\big)$, a dimension-free rate. △ Less

Submitted 21 September, 2022; v1 submitted 23 November, 2021; originally announced November 2021.

arXiv:2110.04763 [pdf, ps, other]

Fat-Shattering Dimension of $k$-fold Aggregations

Authors: Idan Attias, Aryeh Kontorovich

Abstract: We provide estimates on the fat-shattering dimension of aggregation rules of real-valued function classes. The latter consists of all ways of choosing $k$ functions, one from each of the $k$ classes, and computing a pointwise function of them, such as the median, mean, and maximum. The bound is stated in terms of the fat-shattering dimensions of the component classes. For linear and affine functio… ▽ More We provide estimates on the fat-shattering dimension of aggregation rules of real-valued function classes. The latter consists of all ways of choosing $k$ functions, one from each of the $k$ classes, and computing a pointwise function of them, such as the median, mean, and maximum. The bound is stated in terms of the fat-shattering dimensions of the component classes. For linear and affine function classes, we provide a considerably sharper upper bound and a matching lower bound, achieving, in particular, an optimal dependence on $k$. Along the way, we improve several known results in addition to pointing out and correcting a number of erroneous claims in the literature. △ Less

Submitted 9 September, 2023; v1 submitted 10 October, 2021; originally announced October 2021.

arXiv:2105.07408 [pdf, other]

Dimension-Free Empirical Entropy Estimation

Authors: Doron Cohen, Aryeh Kontorovich, Aaron Koolyk, Geoffrey Wolfer

Abstract: We seek an entropy estimator for discrete distributions with fully empirical accuracy bounds. As stated, this goal is infeasible without some prior assumptions on the distribution. We discover that a certain information moment assumption renders the problem feasible. We argue that the moment assumption is natural and, in some sense, {\em minimalistic} -- weaker than finite support or tail decay co… ▽ More We seek an entropy estimator for discrete distributions with fully empirical accuracy bounds. As stated, this goal is infeasible without some prior assumptions on the distribution. We discover that a certain information moment assumption renders the problem feasible. We argue that the moment assumption is natural and, in some sense, {\em minimalistic} -- weaker than finite support or tail decay conditions. Under the moment assumption, we provide the first finite-sample entropy estimates for infinite alphabets, nearly recovering the known minimax rates. Moreover, we demonstrate that our empirical bounds are significantly sharper than the state-of-the-art bounds, for various natural distributions and non-trivial sample regimes. Along the way, we give a dimension-free analogue of the Cover-Thomas result on entropy continuity (with respect to total variation distance) for finite alphabets, which may be of independent interest. Additionally, we resolve all of the open problems posed by Jürgensen and Matthews, 2010. △ Less

Submitted 26 December, 2022; v1 submitted 16 May, 2021; originally announced May 2021.

arXiv:2104.13838 [pdf, other]

On Superintegral Kleinian Sphere Packings, Bugs, and Arithmetic Groups

Authors: Michael Kapovich, Alex Kontorovich

Abstract: We develop the notion of a Kleinian Sphere Packing, a generalization of "crystallographic" (Apollonian-like) sphere packings defined by Kontorovich-Nakamura [KN19]. Unlike crystallographic packings, Kleinian packings exist in all dimensions, as do "superintegral" such. We extend the Arithmeticity Theorem to Kleinian packings, that is, the superintegral ones come from Q-arithmetic lattices of simpl… ▽ More We develop the notion of a Kleinian Sphere Packing, a generalization of "crystallographic" (Apollonian-like) sphere packings defined by Kontorovich-Nakamura [KN19]. Unlike crystallographic packings, Kleinian packings exist in all dimensions, as do "superintegral" such. We extend the Arithmeticity Theorem to Kleinian packings, that is, the superintegral ones come from Q-arithmetic lattices of simplest type. The same holds for more general objects we call Kleinian Bugs, in which the spheres need not be disjoint but can meet with dihedral angles pi/m for finitely many m. We settle two questions from [KN19]: (i) that the Arithmeticity Theorem is in general false over number fields, and (ii) that integral packings only arise from non-uniform lattices. △ Less

Submitted 28 April, 2021; originally announced April 2021.

Comments: 47 pages, 12 figures

MSC Class: 22E40; 30F40; 11E88

arXiv:2104.00322 [pdf, other]

Domain Invariant Adversarial Learning

Authors: Matan Levi, Idan Attias, Aryeh Kontorovich

Abstract: The phenomenon of adversarial examples illustrates one of the most basic vulnerabilities of deep neural networks. Among the variety of techniques introduced to surmount this inherent weakness, adversarial training has emerged as the most effective strategy for learning robust models. Typically, this is achieved by balancing robust and natural objectives. In this work, we aim to further optimize th… ▽ More The phenomenon of adversarial examples illustrates one of the most basic vulnerabilities of deep neural networks. Among the variety of techniques introduced to surmount this inherent weakness, adversarial training has emerged as the most effective strategy for learning robust models. Typically, this is achieved by balancing robust and natural objectives. In this work, we aim to further optimize the trade-off between robust and standard accuracy by enforcing a domain-invariant feature representation. We present a new adversarial training method, Domain Invariant Adversarial Learning (DIAL), which learns a feature representation that is both robust and domain invariant. DIAL uses a variant of Domain Adversarial Neural Network (DANN) on the natural domain and its corresponding adversarial domain. In the case where the source domain consists of natural examples and the target domain is the adversarially perturbed examples, our method learns a feature representation constrained not to discriminate between the natural and adversarial examples, and can therefore achieve a more robust representation. DIAL is a generic and modular technique that can be easily incorporated into any adversarial training method. Our experiments indicate that incorporating DIAL in the adversarial training process improves both robustness and standard accuracy. △ Less

Submitted 13 September, 2022; v1 submitted 1 April, 2021; originally announced April 2021.

Journal ref: Transactions of Machine Learning Research (2022)

arXiv:2011.04586 [pdf, ps, other]

Stable Sample Compression Schemes: New Applications and an Optimal SVM Margin Bound

Authors: Steve Hanneke, Aryeh Kontorovich

Abstract: We analyze a family of supervised learning algorithms based on sample compression schemes that are stable, in the sense that removing points from the training set which were not selected for the compression set does not alter the resulting classifier. We use this technique to derive a variety of novel or improved data-dependent generalization bounds for several learning algorithms. In particular,… ▽ More We analyze a family of supervised learning algorithms based on sample compression schemes that are stable, in the sense that removing points from the training set which were not selected for the compression set does not alter the resulting classifier. We use this technique to derive a variety of novel or improved data-dependent generalization bounds for several learning algorithms. In particular, we prove a new margin bound for SVM, removing a log factor. The new bound is provably optimal. This resolves a long-standing open question about the PAC margin bounds achievable by SVM. △ Less

Submitted 9 November, 2020; originally announced November 2020.

arXiv:2010.09886 [pdf, ps, other]

Non-parametric Binary regression in metric spaces with KL loss

Authors: Ariel Avital, Klim Efremenko, Aryeh Kontorovich, David Toplin, Bo Waggoner

Abstract: We propose a non-parametric variant of binary regression, where the hypothesis is regularized to be a Lipschitz function taking a metric space to [0,1] and the loss is logarithmic. This setting presents novel computational and statistical challenges. On the computational front, we derive a novel efficient optimization algorithm based on interior point methods; an attractive feature is that it is p… ▽ More We propose a non-parametric variant of binary regression, where the hypothesis is regularized to be a Lipschitz function taking a metric space to [0,1] and the loss is logarithmic. This setting presents novel computational and statistical challenges. On the computational front, we derive a novel efficient optimization algorithm based on interior point methods; an attractive feature is that it is parameter-free (i.e., does not require tuning an update step size). On the statistical front, the unbounded loss function presents a problem for classic generalization bounds, based on covering-number and Rademacher techniques. We get around this challenge via an adaptive truncation approach, and also present a lower bound indicating that the truncation is, in some sense, necessary. △ Less

Submitted 19 October, 2020; originally announced October 2020.

arXiv:2008.01581 [pdf, ps, other]

Non-uniform packings

Authors: Lee-Ad Gottlieb, Aryeh Kontorovich

Abstract: We generalize the classical notion of packing a set by balls with identical radii to the case where the radii may be different. The largest number of such balls that fit inside the set without overlapping is called its {\em non-uniform packing number}. We show that the non-uniform packing number can be upper-bounded in terms of the {\em average} radius of the balls, resulting in bounds of the fami… ▽ More We generalize the classical notion of packing a set by balls with identical radii to the case where the radii may be different. The largest number of such balls that fit inside the set without overlapping is called its {\em non-uniform packing number}. We show that the non-uniform packing number can be upper-bounded in terms of the {\em average} radius of the balls, resulting in bounds of the familiar classical form. △ Less

Submitted 4 August, 2020; originally announced August 2020.

arXiv:2007.06283 [pdf, ps, other]

Functions with average smoothness: structure, algorithms, and learning

Authors: Yair Ashlagi, Lee-Ad Gottlieb, Aryeh Kontorovich

Abstract: We initiate a program of average smoothness analysis for efficiently learning real-valued functions on metric spaces. Rather than using the Lipschitz constant as the regularizer, we define a local slope at each point and gauge the function complexity as the average of these values. Since the mean can be dramatically smaller than the maximum, this complexity measure can yield considerably sharper g… ▽ More We initiate a program of average smoothness analysis for efficiently learning real-valued functions on metric spaces. Rather than using the Lipschitz constant as the regularizer, we define a local slope at each point and gauge the function complexity as the average of these values. Since the mean can be dramatically smaller than the maximum, this complexity measure can yield considerably sharper generalization bounds -- assuming that these admit a refinement where the Lipschitz constant is replaced by our average of local slopes. Our first major contribution is to obtain just such distribution-sensitive bounds. This required overcoming a number of technical challenges, perhaps the most formidable of which was bounding the {\em empirical} covering numbers, which can be much worse-behaved than the ambient ones. Our combinatorial results are accompanied by efficient algorithms for smoothing the labels of the random sample, as well as guarantees that the extension from the sample to the whole space will continue to be, with high probability, smooth on average. Along the way we discover a surprisingly rich combinatorial and analytic structure in the function class we define. △ Less

Submitted 8 November, 2020; v1 submitted 13 July, 2020; originally announced July 2020.

arXiv:2004.12680 [pdf, other]

Learning discrete distributions with infinite support

Authors: Doron Cohen, Aryeh Kontorovich, Geoffrey Wolfer

Abstract: We present a novel approach to estimating discrete distributions with (potentially) infinite support in the total variation metric. In a departure from the established paradigm, we make no structural assumptions whatsoever on the sampling distribution. In such a setting, distribution-free risk bounds are impossible, and the best one could hope for is a fully empirical data-dependent bound. We deri… ▽ More We present a novel approach to estimating discrete distributions with (potentially) infinite support in the total variation metric. In a departure from the established paradigm, we make no structural assumptions whatsoever on the sampling distribution. In such a setting, distribution-free risk bounds are impossible, and the best one could hope for is a fully empirical data-dependent bound. We derive precisely such bounds, and demonstrate that these are, in a well-defined sense, the best possible. Our main discovery is that the half-norm of the empirical distribution provides tight upper and lower estimates on the empirical risk. Furthermore, this quantity decays at a nearly optimal rate as a function of the true distribution. The optimality follows from a minimax result, of possible independent interest. Additional structural results are provided, including an exact Rademacher complexity calculation and apparently a first connection between the total variation risk and the missing mass. △ Less

Submitted 15 October, 2020; v1 submitted 27 April, 2020; originally announced April 2020.

arXiv:2003.13561 [pdf, other]

On Biased Random Walks, Corrupted Intervals, and Learning Under Adversarial Design

Authors: Daniel Berend, Aryeh Kontorovich, Lev Reyzin, Thomas Robinson

Abstract: We tackle some fundamental problems in probability theory on corrupted random processes on the integer line. We analyze when a biased random walk is expected to reach its bottommost point and when intervals of integer points can be detected under a natural model of noise. We apply these results to problems in learning thresholds and intervals under a new model for learning under adversarial design… ▽ More We tackle some fundamental problems in probability theory on corrupted random processes on the integer line. We analyze when a biased random walk is expected to reach its bottommost point and when intervals of integer points can be detected under a natural model of noise. We apply these results to problems in learning thresholds and intervals under a new model for learning under adversarial design. △ Less

Submitted 30 March, 2020; originally announced March 2020.

Comments: 18 pages

arXiv:2002.01999 [pdf, other]

Nested Barycentric Coordinate System as an Explicit Feature Map

Authors: Lee-Ad Gottlieb, Eran Kaufman, Aryeh Kontorovich, Gabriel Nivasch, Ofir Pele

Abstract: We propose a new embedding method which is particularly well-suited for settings where the sample size greatly exceeds the ambient dimension. Our technique consists of partitioning the space into simplices and then embedding the data points into features corresponding to the simplices' barycentric coordinates. We then train a linear classifier in the rich feature space obtained from the simplices.… ▽ More We propose a new embedding method which is particularly well-suited for settings where the sample size greatly exceeds the ambient dimension. Our technique consists of partitioning the space into simplices and then embedding the data points into features corresponding to the simplices' barycentric coordinates. We then train a linear classifier in the rich feature space obtained from the simplices. The decision boundary may be highly non-linear, though it is linear within each simplex (and hence piecewise-linear overall). Further, our method can approximate any convex body. We give generalization bounds based on empirical margin and a novel hybrid sample compression technique. An extensive empirical evaluation shows that our method consistently outperforms a range of popular kernel embedding methods. △ Less

Submitted 5 February, 2020; originally announced February 2020.

arXiv:2002.01408 [pdf, other]

Apportioned Margin Approach for Cost Sensitive Large Margin Classifiers

Authors: Lee-Ad Gottlieb, Eran Kaufman, Aryeh Kontorovich

Abstract: We consider the problem of cost sensitive multiclass classification, where we would like to increase the sensitivity of an important class at the expense of a less important one. We adopt an {\em apportioned margin} framework to address this problem, which enables an efficient margin shift between classes that share the same boundary. The decision boundary between all pairs of classes divides the… ▽ More We consider the problem of cost sensitive multiclass classification, where we would like to increase the sensitivity of an important class at the expense of a less important one. We adopt an {\em apportioned margin} framework to address this problem, which enables an efficient margin shift between classes that share the same boundary. The decision boundary between all pairs of classes divides the margin between them in accordance to a given prioritization vector, which yields a tighter error bound for the important classes while also reducing the overall out-of-sample error. In addition to demonstrating an efficient implementation of our framework, we derive generalization bounds, demonstrate Fisher consistency, adapt the framework to Mercer's kernel and to neural networks, and report promising empirical results on all accounts. △ Less

Submitted 4 February, 2020; originally announced February 2020.

arXiv:1910.05270 [pdf, ps, other]

Fast and Bayes-consistent nearest neighbors

Authors: Klim Efremenko, Aryeh Kontorovich, Moshe Noivirt

Abstract: Research on nearest-neighbor methods tends to focus somewhat dichotomously either on the statistical or the computational aspects -- either on, say, Bayes consistency and rates of convergence or on techniques for speeding up the proximity search. This paper aims at bridging these realms: to reap the advantages of fast evaluation time while maintaining Bayes consistency, and further without sacrifi… ▽ More Research on nearest-neighbor methods tends to focus somewhat dichotomously either on the statistical or the computational aspects -- either on, say, Bayes consistency and rates of convergence or on techniques for speeding up the proximity search. This paper aims at bridging these realms: to reap the advantages of fast evaluation time while maintaining Bayes consistency, and further without sacrificing too much in the risk decay rate. We combine the locality-sensitive hashing (LSH) technique with a novel missing-mass argument to obtain a fast and Bayes-consistent classifier. Our algorithm's prediction runtime compares favorably against state of the art approximate NN methods, while maintaining Bayes-consistency and attaining rates comparable to minimax. On samples of size $n$ in $\R^d$, our pre-processing phase has runtime $O(d n \log n)$, while the evaluation phase has runtime $O(d\log n)$ per query point. △ Less

Submitted 15 April, 2020; v1 submitted 7 October, 2019; originally announced October 2019.

arXiv:1906.09855 [pdf, other]

Universal Bayes consistency in metric spaces

Authors: Steve Hanneke, Aryeh Kontorovich, Sivan Sabato, Roi Weiss

Abstract: We extend a recently proposed 1-nearest-neighbor based multiclass learning algorithm and prove that our modification is universally strongly Bayes-consistent in all metric spaces admitting any such learner, making it an "optimistically universal" Bayes-consistent learner. This is the first learning algorithm known to enjoy this property; by comparison, the $k$-NN classifier and its variants are no… ▽ More We extend a recently proposed 1-nearest-neighbor based multiclass learning algorithm and prove that our modification is universally strongly Bayes-consistent in all metric spaces admitting any such learner, making it an "optimistically universal" Bayes-consistent learner. This is the first learning algorithm known to enjoy this property; by comparison, the $k$-NN classifier and its variants are not generally universally Bayes-consistent, except under additional structural assumptions, such as an inner product, a norm, finite dimension, or a Besicovitch-type property. The metric spaces in which universal Bayes consistency is possible are the "essentially separable" ones -- a notion that we define, which is more general than standard separability. The existence of metric spaces that are not essentially separable is widely believed to be independent of the ZFC axioms of set theory. We prove that essential separability exactly characterizes the existence of a universal Bayes-consistent learner for the given metric space. In particular, this yields the first impossibility result for universal Bayes consistency. Taken together, our results completely characterize strong and weak universal Bayes consistency in metric spaces. △ Less

Submitted 6 January, 2021; v1 submitted 24 June, 2019; originally announced June 2019.

Comments: To appear in Annals of Statistics

Journal ref: Annals of Statistics 2021, Vol. 49, No. 4, 2129-2150, August 2021

arXiv:1905.11930 [pdf, other]

Efficient Kirszbraun Extension with Applications to Regression

Authors: Hanan Zaichyk, Armin Biess, Aryeh Kontorovich, Yury Makarychev

Abstract: We introduce a framework for performing regression between two Hilbert spaces. This is done based on Kirszbraun's extension theorem, to the best of our knowledge, the first application of this technique to supervised learning. We analyze the statistical and computational aspects of this method. We decompose this task into two stages: training (which corresponds operationally to smoothing/regulariz… ▽ More We introduce a framework for performing regression between two Hilbert spaces. This is done based on Kirszbraun's extension theorem, to the best of our knowledge, the first application of this technique to supervised learning. We analyze the statistical and computational aspects of this method. We decompose this task into two stages: training (which corresponds operationally to smoothing/regularization) and prediction (which is achieved via Kirszbraun extension). Both are solved algorithmically via a novel multiplicative weight updates (MWU) scheme, which, for our problem formulation, achieves a quadratic runtime improvement over the state of the art. Our empirical results indicate a dramatic improvement over standard off-the-shelf solvers in our setting. △ Less

Submitted 8 March, 2022; v1 submitted 28 May, 2019; originally announced May 2019.

arXiv:1902.01224 [pdf, other]

Estimating the Mixing Time of Ergodic Markov Chains

Authors: Geoffrey Wolfer, Aryeh Kontorovich

Abstract: We address the problem of estimating the mixing time $t_{\mathsf{mix}}$ of an arbitrary ergodic finite-state Markov chain from a single trajectory of length $m$. The reversible case was addressed by Hsu et al. [2019], who left the general case as an open problem. In the reversible case, the analysis is greatly facilitated by the fact that the Markov operator is self-adjoint, and Weyl's inequality… ▽ More We address the problem of estimating the mixing time $t_{\mathsf{mix}}$ of an arbitrary ergodic finite-state Markov chain from a single trajectory of length $m$. The reversible case was addressed by Hsu et al. [2019], who left the general case as an open problem. In the reversible case, the analysis is greatly facilitated by the fact that the Markov operator is self-adjoint, and Weyl's inequality allows for a dimension-free perturbation analysis of the empirical eigenvalues. As Hsu et al. point out, in the absence of reversibility (which induces asymmetric pair probabilities matrices), the existing perturbation analysis has a worst-case exponential dependence on the number of states $d$. Furthermore, even if an eigenvalue perturbation analysis with better dependence on $d$ were available, in the non-reversible case the connection between the spectral gap and the mixing time is not nearly as straightforward as in the reversible case. Our key insight is to estimate the pseudo-spectral gap $γ_{\mathsf{ps}}$ instead, which allows us to overcome the loss of symmetry and to achieve a polynomial dependence on the minimal stationary probability $π_\star$ and $γ_{\mathsf{ps}}$. Additionally, in the reversible case, we obtain simultaneous nearly (up to logarithmic factors) minimax rates in $t_{\mathsf{mix}}$ and precision $\varepsilon$, closing a gap in Hsu et al., who treated $\varepsilon$ as constant in the lower bounds. Finally, we construct fully empirical confidence intervals for $γ_{\mathsf{ps}}$, which shrink to zero at a rate of roughly $1/\sqrt{m}$, and improve the state of the art in even the reversible case. △ Less

Submitted 16 August, 2022; v1 submitted 1 February, 2019; originally announced February 2019.

Comments: COLT'19 conference manuscript, with minor fixes

arXiv:1902.00080 [pdf, ps, other]

Minimax Testing of Identity to a Reference Ergodic Markov Chain

Authors: Geoffrey Wolfer, Aryeh Kontorovich

Abstract: We exhibit an efficient procedure for testing, based on a single long state sequence, whether an unknown Markov chain is identical to or $\varepsilon$-far from a given reference chain. We obtain nearly matching (up to logarithmic factors) upper and lower sample complexity bounds for our notion of distance, which is based on total variation. Perhaps surprisingly, we discover that the sample complex… ▽ More We exhibit an efficient procedure for testing, based on a single long state sequence, whether an unknown Markov chain is identical to or $\varepsilon$-far from a given reference chain. We obtain nearly matching (up to logarithmic factors) upper and lower sample complexity bounds for our notion of distance, which is based on total variation. Perhaps surprisingly, we discover that the sample complexity depends solely on the properties of the known reference chain and does not involve the unknown chain at all, which is not even assumed to be ergodic. △ Less

Submitted 24 September, 2019; v1 submitted 31 January, 2019; originally announced February 2019.

Comments: A previous version of this print contained a mistake in a proof. We have now fixed it

arXiv:1812.02330 [pdf, other]

What Is... A Thin Group?

Authors: Alex Kontorovich, D. Darren Long, Alexander Lubotzky, Alan W. Reid

Abstract: This paper describes in basic terms what a "Thin Group" is, as well as its uses in various subjects. This paper describes in basic terms what a "Thin Group" is, as well as its uses in various subjects. △ Less

Submitted 5 December, 2018; originally announced December 2018.

Comments: 6 pages, 2 pictures

MSC Class: 11F06; 20H25; 22E40

arXiv:1810.02180 [pdf, other]

Improved Generalization Bounds for Adversarially Robust Learning

Authors: Idan Attias, Aryeh Kontorovich, Yishay Mansour

Abstract: We consider a model of robust learning in an adversarial environment. The learner gets uncorrupted training data with access to possible corruptions that may be affected by the adversary during testing. The learner's goal is to build a robust classifier, which will be tested on future adversarial examples. The adversary is limited to $k$ possible corruptions for each input. We model the learner-ad… ▽ More We consider a model of robust learning in an adversarial environment. The learner gets uncorrupted training data with access to possible corruptions that may be affected by the adversary during testing. The learner's goal is to build a robust classifier, which will be tested on future adversarial examples. The adversary is limited to $k$ possible corruptions for each input. We model the learner-adversary interaction as a zero-sum game. This model is closely related to the adversarial examples model of Schmidt et al. (2018); Madry et al. (2017). Our main results consist of generalization bounds for the binary and multiclass classification, as well as the real-valued case (regression). For the binary classification setting, we both tighten the generalization bound of Feige et al. (2015), and are also able to handle infinite hypothesis classes. The sample complexity is improved from $O(\frac{1}{ε^4}\log(\frac{|H|}δ))$ to $O\big(\frac{1}{ε^2}(kVC(H)\log^{\frac{3}{2}+α}(kVC(H))+\log(\frac{1}δ)\big)$ for any $α> 0$. Additionally, we extend the algorithm and generalization bound from the binary to the multiclass and real-valued cases. Along the way, we obtain results on fat-shattering dimension and Rademacher complexity of $k$-fold maxima over function classes; these may be of independent interest. For binary classification, the algorithm of Feige et al. (2015) uses a regret minimization algorithm and an ERM oracle as a black box; we adapt it for the multiclass and regression settings. The algorithm provides us with near-optimal policies for the players on a given training sample. △ Less

Submitted 1 July, 2022; v1 submitted 4 October, 2018; originally announced October 2018.

Comments: JMLR camera ready

arXiv:1810.01864 [pdf, other]

Agnostic Sample Compression Schemes for Regression

Authors: Idan Attias, Steve Hanneke, Aryeh Kontorovich, Menachem Sadigurschi

Abstract: We obtain the first positive results for bounded sample compression in the agnostic regression setting with the $\ell_p$ loss, where $p\in [1,\infty]$. We construct a generic approximate sample compression scheme for real-valued function classes exhibiting exponential size in the fat-shattering dimension but independent of the sample size. Notably, for linear regression, an approximate compression… ▽ More We obtain the first positive results for bounded sample compression in the agnostic regression setting with the $\ell_p$ loss, where $p\in [1,\infty]$. We construct a generic approximate sample compression scheme for real-valued function classes exhibiting exponential size in the fat-shattering dimension but independent of the sample size. Notably, for linear regression, an approximate compression of size linear in the dimension is constructed. Moreover, for $\ell_1$ and $\ell_\infty$ losses, we can even exhibit an efficient exact sample compression scheme of size linear in the dimension. We further show that for every other $\ell_p$ loss, $p\in (1,\infty)$, there does not exist an exact agnostic compression scheme of bounded size. This refines and generalizes a negative result of David, Moran, and Yehudayoff for the $\ell_2$ loss. We close by posing general open questions: for agnostic regression with $\ell_1$ loss, does every function class admits an exact compression scheme of size equal to its pseudo-dimension? For the $\ell_2$ loss, does every function class admit an approximate compression scheme of polynomial size in the fat-shattering dimension? These questions generalize Warmuth's classic sample compression conjecture for realizable-case classification. △ Less

Submitted 3 February, 2024; v1 submitted 3 October, 2018; originally announced October 2018.

Comments: New results in this version: (1) Approximate agnostic sample compression scheme for function classes with finite fat-shattering dimension and the $\ell_p$ loss (section 3), (2) Near-optimal approximate compression for linear functions and the $\ell_p$ loss (section 4.1) The results in sections 4.2 and 4.3 appear in the previous version

arXiv:1809.05014 [pdf, other]

Statistical Estimation of Ergodic Markov Chain Kernel over Discrete State Space

Authors: Geoffrey Wolfer, Aryeh Kontorovich

Abstract: We investigate the statistical complexity of estimating the parameters of a discrete-state Markov chain kernel from a single long sequence of state observations. In the finite case, we characterize (modulo logarithmic factors) the minimax sample complexity of estimation with respect to the operator infinity norm, while in the countably infinite case, we analyze the problem with respect to a natura… ▽ More We investigate the statistical complexity of estimating the parameters of a discrete-state Markov chain kernel from a single long sequence of state observations. In the finite case, we characterize (modulo logarithmic factors) the minimax sample complexity of estimation with respect to the operator infinity norm, while in the countably infinite case, we analyze the problem with respect to a natural entry-wise norm derived from total variation. We show that in both cases, the sample complexity is governed by the mixing properties of the unknown chain, for which, in the finite-state case, there are known finite-sample estimators with fully empirical confidence intervals. △ Less

Submitted 13 August, 2020; v1 submitted 13 September, 2018; originally announced September 2018.

Comments: Journal version of the extended abstract (ALT'19), to appear in Bernoulli 2020+

arXiv:1808.03235 [pdf, other]

doi 10.1080/10586458.2019.1616337

On Toric Orbits in the Affine Sieve

Authors: Alex Kontorovich, Jeff Lagarias

Abstract: We give a detailed analysis of a heuristic model for the failure of "saturation" in instances of the Affine Sieve having toral Zariski closure. Based on this model, we formulate precise conjectures on several classical problems of arithmetic interest, and test these against empirical data. We give a detailed analysis of a heuristic model for the failure of "saturation" in instances of the Affine Sieve having toral Zariski closure. Based on this model, we formulate precise conjectures on several classical problems of arithmetic interest, and test these against empirical data. △ Less

Submitted 9 August, 2018; originally announced August 2018.

Comments: 20 pages, 6 figures

MSC Class: 11B39; 11D09; 11J70; 11N37; 11N35

Journal ref: Experimental Mathematics 30 (2021), No. 4, 575--586

arXiv:1805.09719 [pdf, other]

Learning convex polyhedra with margin

Authors: Lee-Ad Gottlieb, Eran Kaufman, Aryeh Kontorovich, Gabriel Nivasch

Abstract: We present an improved algorithm for {\em quasi-properly} learning convex polyhedra in the realizable PAC setting from data with a margin. Our learning algorithm constructs a consistent polyhedron as an intersection of about $t \log t$ halfspaces with constant-size margins in time polynomial in $t$ (where $t$ is the number of halfspaces forming an optimal polyhedron). We also identify distinct gen… ▽ More We present an improved algorithm for {\em quasi-properly} learning convex polyhedra in the realizable PAC setting from data with a margin. Our learning algorithm constructs a consistent polyhedron as an intersection of about $t \log t$ halfspaces with constant-size margins in time polynomial in $t$ (where $t$ is the number of halfspaces forming an optimal polyhedron). We also identify distinct generalizations of the notion of margin from hyperplanes to polyhedra and investigate how they relate geometrically; this result may have ramifications beyond the learning setting. △ Less

Submitted 2 November, 2021; v1 submitted 24 May, 2018; originally announced May 2018.

arXiv:1805.08254 [pdf, ps, other]

Sample Compression for Real-Valued Learners

Authors: Steve Hanneke, Aryeh Kontorovich, Menachem Sadigurschi

Abstract: We give an algorithmically efficient version of the learner-to-compression scheme conversion in Moran and Yehudayoff (2016). In extending this technique to real-valued hypotheses, we also obtain an efficient regression-to-bounded sample compression converter. To our knowledge, this is the first general compressed regression result (regardless of efficiency or boundedness) guaranteeing uniform appr… ▽ More We give an algorithmically efficient version of the learner-to-compression scheme conversion in Moran and Yehudayoff (2016). In extending this technique to real-valued hypotheses, we also obtain an efficient regression-to-bounded sample compression converter. To our knowledge, this is the first general compressed regression result (regardless of efficiency or boundedness) guaranteeing uniform approximate reconstruction. Along the way, we develop a generic procedure for constructing weak real-valued learners out of abstract regressors; this may be of independent interest. In particular, this result sheds new light on an open question of H. Simon (1997). We show applications to two regression problems: learning Lipschitz and bounded-variation functions. △ Less

Submitted 21 May, 2018; originally announced May 2018.

arXiv:1805.08140 [pdf, ps, other]

A New Lower Bound for Agnostic Learning with Sample Compression Schemes

Authors: Steve Hanneke, Aryeh Kontorovich

Abstract: We establish a tight characterization of the worst-case rates for the excess risk of agnostic learning with sample compression schemes and for uniform convergence for agnostic sample compression schemes. In particular, we find that the optimal rates of convergence for size-$k$ agnostic sample compression schemes are of the form $\sqrt{\frac{k \log(n/k)}{n}}$, which contrasts with agnostic learning… ▽ More We establish a tight characterization of the worst-case rates for the excess risk of agnostic learning with sample compression schemes and for uniform convergence for agnostic sample compression schemes. In particular, we find that the optimal rates of convergence for size-$k$ agnostic sample compression schemes are of the form $\sqrt{\frac{k \log(n/k)}{n}}$, which contrasts with agnostic learning with classes of VC dimension $k$, where the optimal rates are of the form $\sqrt{\frac{k}{n}}$. △ Less

Submitted 21 May, 2018; originally announced May 2018.

arXiv:1802.09452 [pdf, other]

Exponents for the Equidistribution of Shears and Applications

Authors: Dubi Kelmer, Alex Kontorovich

Abstract: In previous work, the authors introduced "soft" methods to prove the effective (i.e. with power savings error) equidistribution of "shears" in cusped hyperbolic surfaces. In this paper, we study the same problem but now allow full use of the spectral theory of automorphic forms to produce explicit exponents, and uniformity in parameters. We give applications to counting square values of quadratic… ▽ More In previous work, the authors introduced "soft" methods to prove the effective (i.e. with power savings error) equidistribution of "shears" in cusped hyperbolic surfaces. In this paper, we study the same problem but now allow full use of the spectral theory of automorphic forms to produce explicit exponents, and uniformity in parameters. We give applications to counting square values of quadratic forms. △ Less

Submitted 26 February, 2018; originally announced February 2018.

Comments: 41 pages, one figure

arXiv:1712.00147 [pdf, other]

Geometry and Arithmetic of Crystallographic Sphere Packings

Authors: Alex Kontorovich, Kei Nakamura

Abstract: We introduce the notion of a "crystallographic sphere packing," defined to be one whose limit set is that of a geometrically finite hyperbolic reflection group in one higher dimension. We exhibit for the first time an infinite family of conformally-inequivalent such with all radii being reciprocals of integers. We then prove a result in the opposite direction: the "superintegral" ones exist only i… ▽ More We introduce the notion of a "crystallographic sphere packing," defined to be one whose limit set is that of a geometrically finite hyperbolic reflection group in one higher dimension. We exhibit for the first time an infinite family of conformally-inequivalent such with all radii being reciprocals of integers. We then prove a result in the opposite direction: the "superintegral" ones exist only in finitely many "commensurability classes," all in dimensions below 30. △ Less

Submitted 30 November, 2017; originally announced December 2017.

Comments: Research announcement. 14 pages, 5 figures

MSC Class: 52C17; 11H31; 11P21; 52C26; 05B40

arXiv:1711.01939 [pdf, other]

Advanced Analytics for Connected Cars Cyber Security

Authors: Matan Levi, Yair Allouche, Aryeh Kontorovich

Abstract: The vehicular connectivity revolution is fueling the automotive industry's most significant transformation seen in decades. However, as modern vehicles become more connected, they also become much more vulnerable to cyber-attacks. In this paper, a fully working machine learning approach is proposed to protect connected vehicles (fleets and individuals) against such attacks. We present a system tha… ▽ More The vehicular connectivity revolution is fueling the automotive industry's most significant transformation seen in decades. However, as modern vehicles become more connected, they also become much more vulnerable to cyber-attacks. In this paper, a fully working machine learning approach is proposed to protect connected vehicles (fleets and individuals) against such attacks. We present a system that monitors different vehicle's interfaces (Network, CAN and OS), extracts relevant information based on configurable rules and sends it to a trained generative model to detect deviations from normal behavior. Using configurable data collector, we provide a higher level of data abstraction as the model is trained based on events instead of raw data, which has a noise-filtering effect and eliminates the need to retrain the model whenever a protocol changes. We present a new approach for detecting anomalies, tailored to the temporal nature of our domain. Adapting the hybrid approach of Gutflaish et al. (2017) to the fully temporal setting, we first train a Hidden Markov Model to learn normal vehicle behavior, and then a regression model to calibrate the likelihood threshold for anomaly. Using this architecture, our method detects sophisticated and realistic anomalies, which are missed by other existing methods monitoring the CAN bus only. We also demonstrate the superiority of adaptive thresholds over static ones. Furthermore, our approach scales efficiently from monitoring individual cars to serving large fleets. We demonstrate the competitive advantage of our model via encouraging empirical results. △ Less

Submitted 8 November, 2017; v1 submitted 6 November, 2017; originally announced November 2017.

arXiv:1708.07367 [pdf, ps, other]

Mixing time estimation in reversible Markov chains from a single sample path

Authors: Daniel Hsu, Aryeh Kontorovich, David A. Levin, Yuval Peres, Csaba Szepesvári

Abstract: The spectral gap $γ$ of a finite, ergodic, and reversible Markov chain is an important parameter measuring the asymptotic rate of convergence. In applications, the transition matrix $P$ may be unknown, yet one sample of the chain up to a fixed time $n$ may be observed. We consider here the problem of estimating $γ$ from this data. Let $π$ be the stationary distribution of $P$, and… ▽ More The spectral gap $γ$ of a finite, ergodic, and reversible Markov chain is an important parameter measuring the asymptotic rate of convergence. In applications, the transition matrix $P$ may be unknown, yet one sample of the chain up to a fixed time $n$ may be observed. We consider here the problem of estimating $γ$ from this data. Let $π$ be the stationary distribution of $P$, and $π_\star = \min_x π(x)$. We show that if $n = \tilde{O}\bigl(\frac{1}{γπ_\star}\bigr)$, then $γ$ can be estimated to within multiplicative constants with high probability. When $π$ is uniform on $d$ states, this matches (up to logarithmic correction) a lower bound of $\tildeΩ\bigl(\frac{d}γ\bigr)$ steps required for precise estimation of $γ$. Moreover, we provide the first procedure for computing a fully data-dependent interval, from a single finite-length trajectory of the chain, that traps the mixing time $t_{\text{mix}}$ of the chain at a prescribed confidence level. The interval does not require the knowledge of any parameters of the chain. This stands in contrast to previous approaches, which either only provide point estimates, or require a reset mechanism, or additional prior knowledge. The interval is constructed around the relaxation time $t_{\text{relax}} = 1/γ$, which is strongly related to the mixing time, and the width of the interval converges to zero roughly at a $1/\sqrt{n}$ rate, where $n$ is the length of the sample path. △ Less

Submitted 24 August, 2017; originally announced August 2017.

Comments: 34 pages, merges results of arXiv:1506.02903 and arXiv:1612.05330

arXiv:1705.10085 [pdf, other]

Temporal anomaly detection: calibrating the surprise

Authors: Eyal Gutflaish, Aryeh Kontorovich, Sivan Sabato, Ofer Biller, Oded Sofer

Abstract: We propose a hybrid approach to temporal anomaly detection in access data of users to databases --- or more generally, any kind of subject-object co-occurrence data. We consider a high-dimensional setting that also requires fast computation at test time. Our methodology identifies anomalies based on a single stationary model, instead of requiring a full temporal one, which would be prohibitive in… ▽ More We propose a hybrid approach to temporal anomaly detection in access data of users to databases --- or more generally, any kind of subject-object co-occurrence data. We consider a high-dimensional setting that also requires fast computation at test time. Our methodology identifies anomalies based on a single stationary model, instead of requiring a full temporal one, which would be prohibitive in this setting. We learn a low-rank stationary model from the training data, and then fit a regression model for predicting the expected likelihood score of normal access patterns in the future. The disparity between the predicted likelihood score and the observed one is used to assess the `surprise' at test time. This approach enables calibration of the anomaly score, so that time-varying normal behavior patterns are not considered anomalous. We provide a detailed description of the algorithm, including a convergence analysis, and report encouraging empirical results. One of the data sets that we tested, TDA, is new for the public domain. It consists of two months' worth of database access records from a live system. Our code is publicly available at https://github.com/eyalgut/TLR_anomaly_detection.git. The TDA data set is available at https://www.kaggle.com/eyalgut/binary-traffic-matrices. △ Less

Submitted 20 November, 2018; v1 submitted 29 May, 2017; originally announced May 2017.

Comments: AAAI-19

Journal ref: Proceedings of the AAAI Conference on Artificial Intelligence, 33(01), 3755-3762 (2019)

Showing 1–50 of 109 results for author: Kontorovich, A