subscribe to arXiv mailings

Exact statistical analysis for response-adaptive clinical trials: a general and computationally tractable approach

Authors: Stef Baas, Peter Jacko, Sofía S. Villar

Abstract: Response-adaptive (RA) designs of clinical trials allow targeting a given objective by skewing the allocation of participants to treatments based on observed outcomes. RA designs face greater regulatory scrutiny due to potential type I error inflation, which limits their uptake in practice. Existing approaches to type I error control either only work for specific designs, have a risk of Monte Carl… ▽ More Response-adaptive (RA) designs of clinical trials allow targeting a given objective by skewing the allocation of participants to treatments based on observed outcomes. RA designs face greater regulatory scrutiny due to potential type I error inflation, which limits their uptake in practice. Existing approaches to type I error control either only work for specific designs, have a risk of Monte Carlo/approximation error, are conservative, or computationally intractable. We develop a general and computationally tractable approach for exact analysis in two-arm RA designs with binary outcomes. We use the approach to construct exact tests applicable to designs that use either randomized or deterministic RA procedures, allowing for complexities such as delayed outcomes, early stopping or allocation of participants in blocks. Our efficient forward recursion implementation allows for testing of two-arm trials with 1,000 participants on a standard computer. Through an illustrative computational study of trials using randomized dynamic programming we show that, contrary to what is known for equal allocation, a conditional exact test has, almost uniformly, higher power than the unconditional test. Two real-world trials with the above-mentioned complexities are re-analyzed to demonstrate the value of our approach in controlling type I error and/or improving the statistical power. △ Less

Submitted 1 July, 2024; originally announced July 2024.

Comments: 35 pages, 6 figures, 11 tables

arXiv:2406.01552 [pdf, ps, other]

Learning equivariant tensor functions with applications to sparse vector recovery

Authors: Wilson G. Gregory, Josué Tonelli-Cueto, Nicholas F. Marshall, Andrew S. Lee, Soledad Villar

Abstract: This work characterizes equivariant polynomial functions from tuples of tensor inputs to tensor outputs. Loosely motivated by physics, we focus on equivariant functions with respect to the diagonal action of the orthogonal group on tensors. We show how to extend this characterization to other linear algebraic groups, including the Lorentz and symplectic groups. Our goal behind these characteriza… ▽ More This work characterizes equivariant polynomial functions from tuples of tensor inputs to tensor outputs. Loosely motivated by physics, we focus on equivariant functions with respect to the diagonal action of the orthogonal group on tensors. We show how to extend this characterization to other linear algebraic groups, including the Lorentz and symplectic groups. Our goal behind these characterizations is to define equivariant machine learning models. In particular, we focus on the sparse vector estimation problem. This problem has been broadly studied in the theoretical computer science literature, and explicit spectral methods, derived by techniques from sum-of-squares, can be shown to recover sparse vectors under certain assumptions. Our numerical results show that the proposed equivariant machine learning models can learn spectral methods that outperform the best theoretically known spectral methods in some regimes. The experiments also suggest that learned spectral methods can solve the problem in settings that have not yet been theoretically analyzed. This is an example of a promising direction in which theory can inform machine learning models and machine learning models could inform theory. △ Less

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2405.18095 [pdf, other]

Is machine learning good or bad for the natural sciences?

Authors: David W. Hogg, Soledad Villar

Abstract: Machine learning (ML) methods are having a huge impact across all of the sciences. However, ML has a strong ontology - in which only the data exist - and a strong epistemology - in which a model is considered good if it performs well on held-out training data. These philosophies are in strong conflict with both standard practices and key philosophies in the natural sciences. Here we identify some… ▽ More Machine learning (ML) methods are having a huge impact across all of the sciences. However, ML has a strong ontology - in which only the data exist - and a strong epistemology - in which a model is considered good if it performs well on held-out training data. These philosophies are in strong conflict with both standard practices and key philosophies in the natural sciences. Here we identify some locations for ML in the natural sciences at which the ontology and epistemology are valuable. For example, when an expressive machine learning model is used in a causal inference to represent the effects of confounders, such as foregrounds, backgrounds, or instrument calibration parameters, the model capacity and loose philosophy of ML can make the results more trustworthy. We also show that there are contexts in which the introduction of ML introduces strong, unwanted statistical biases. For one, when ML models are used to emulate physical (or first-principles) simulations, they amplify confirmation biases. For another, when expressive regressions are used to label datasets, those labels cannot be used in downstream joint or ensemble analyses without taking on uncontrolled biases. The question in the title is being asked of all of the natural sciences; that is, we are calling on the scientific communities to take a step back and consider the role and value of ML in their fields; the (partial) answers we give here come from the particular perspective of physics. △ Less

Submitted 31 May, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

Comments: A Position Paper accepted for publication in the 2024 International Conference on Machine Learning (ICML)

arXiv:2405.08097 [pdf, other]

Learning functions on symmetric matrices and point clouds via lightweight invariant features

Authors: Ben Blum-Smith, Ningyuan Huang, Marco Cuturi, Soledad Villar

Abstract: In this work, we present a mathematical formulation for machine learning of (1) functions on symmetric matrices that are invariant with respect to the action of permutations by conjugation, and (2) functions on point clouds that are invariant with respect to rotations, reflections, and permutations of the points. To achieve this, we construct $O(n^2)$ invariant features derived from generators for… ▽ More In this work, we present a mathematical formulation for machine learning of (1) functions on symmetric matrices that are invariant with respect to the action of permutations by conjugation, and (2) functions on point clouds that are invariant with respect to rotations, reflections, and permutations of the points. To achieve this, we construct $O(n^2)$ invariant features derived from generators for the field of rational functions on $n\times n$ symmetric matrices that are invariant under joint permutations of rows and columns. We show that these invariant features can separate all distinct orbits of symmetric matrices except for a measure zero set; such features can be used to universally approximate invariant functions on almost all weighted graphs. For point clouds in a fixed dimension, we prove that the number of invariant features can be reduced, generically without losing expressivity, to $O(n)$, where $n$ is the number of points. We combine these invariant features with DeepSets to learn functions on symmetric matrices and point clouds with varying sizes. We empirically demonstrate the feasibility of our approach on molecule property regression and point cloud distance prediction. △ Less

Submitted 15 May, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

Comments: 28 pages, 2 figures, 2 tables

MSC Class: 68P01; 13A50

arXiv:2310.18326 [pdf, other]

doi 10.1609/aaai.v38i21.30328

Using Adaptive Bandit Experiments to Increase and Investigate Engagement in Mental Health

Authors: Harsh Kumar, Tong Li, Jiakai Shi, Ilya Musabirov, Rachel Kornfield, Jonah Meyerhoff, Ananya Bhattacharjee, Chris Karr, Theresa Nguyen, David Mohr, Anna Rafferty, Sofia Villar, Nina Deliu, Joseph Jay Williams

Abstract: Digital mental health (DMH) interventions, such as text-message-based lessons and activities, offer immense potential for accessible mental health support. While these interventions can be effective, real-world experimental testing can further enhance their design and impact. Adaptive experimentation, utilizing algorithms like Thompson Sampling for (contextual) multi-armed bandit (MAB) problems, c… ▽ More Digital mental health (DMH) interventions, such as text-message-based lessons and activities, offer immense potential for accessible mental health support. While these interventions can be effective, real-world experimental testing can further enhance their design and impact. Adaptive experimentation, utilizing algorithms like Thompson Sampling for (contextual) multi-armed bandit (MAB) problems, can lead to continuous improvement and personalization. However, it remains unclear when these algorithms can simultaneously increase user experience rewards and facilitate appropriate data collection for social-behavioral scientists to analyze with sufficient statistical confidence. Although a growing body of research addresses the practical and statistical aspects of MAB and other adaptive algorithms, further exploration is needed to assess their impact across diverse real-world contexts. This paper presents a software system developed over two years that allows text-messaging intervention components to be adapted using bandit and other algorithms while collecting data for side-by-side comparison with traditional uniform random non-adaptive experiments. We evaluate the system by deploying a text-message-based DMH intervention to 1100 users, recruited through a large mental health non-profit organization, and share the path forward for deploying this system at scale. This system not only enables applications in mental health but could also serve as a model testbed for adaptive experimentation algorithms in other domains. △ Less

Submitted 13 October, 2023; originally announced October 2023.

Report number: Volume 38, Issue 21

Journal ref: Proceedings of the AAAI Conference on Artificial Intelligence (IAAI) 2024

arXiv:2310.12528 [pdf, other]

Constructing Impactful Machine Learning Research for Astronomy: Best Practices for Researchers and Reviewers

Authors: D. Huppenkothen, M. Ntampaka, M. Ho, M. Fouesneau, B. Nord, J. E. G. Peek, M. Walmsley, J. F. Wu, C. Avestruz, T. Buck, M. Brescia, D. P. Finkbeiner, A. D. Goulding, T. Kacprzak, P. Melchior, M. Pasquato, N. Ramachandra, Y. -S. Ting, G. van de Ven, S. Villar, V. A. Villar, E. Zinger

Abstract: Machine learning has rapidly become a tool of choice for the astronomical community. It is being applied across a wide range of wavelengths and problems, from the classification of transients to neural network emulators of cosmological simulations, and is shifting paradigms about how we generate and report scientific results. At the same time, this class of method comes with its own set of best pr… ▽ More Machine learning has rapidly become a tool of choice for the astronomical community. It is being applied across a wide range of wavelengths and problems, from the classification of transients to neural network emulators of cosmological simulations, and is shifting paradigms about how we generate and report scientific results. At the same time, this class of method comes with its own set of best practices, challenges, and drawbacks, which, at present, are often reported on incompletely in the astrophysical literature. With this paper, we aim to provide a primer to the astronomical community, including authors, reviewers, and editors, on how to implement machine learning models and report their results in a way that ensures the accuracy of the results, reproducibility of the findings, and usefulness of the method. △ Less

Submitted 19 October, 2023; originally announced October 2023.

Comments: 14 pages, 3 figures; submitted to the Bulletin of the American Astronomical Society

arXiv:2308.10436 [pdf, other]

Approximately Equivariant Graph Networks

Authors: Ningyuan Huang, Ron Levie, Soledad Villar

Abstract: Graph neural networks (GNNs) are commonly described as being permutation equivariant with respect to node relabeling in the graph. This symmetry of GNNs is often compared to the translation equivariance of Euclidean convolution neural networks (CNNs). However, these two symmetries are fundamentally different: The translation equivariance of CNNs corresponds to symmetries of the fixed domain acting… ▽ More Graph neural networks (GNNs) are commonly described as being permutation equivariant with respect to node relabeling in the graph. This symmetry of GNNs is often compared to the translation equivariance of Euclidean convolution neural networks (CNNs). However, these two symmetries are fundamentally different: The translation equivariance of CNNs corresponds to symmetries of the fixed domain acting on the image signals (sometimes known as active symmetries), whereas in GNNs any permutation acts on both the graph signals and the graph domain (sometimes described as passive symmetries). In this work, we focus on the active symmetries of GNNs, by considering a learning setting where signals are supported on a fixed graph. In this case, the natural symmetries of GNNs are the automorphisms of the graph. Since real-world graphs tend to be asymmetric, we relax the notion of symmetries by formalizing approximate symmetries via graph coarsening. We present a bias-variance formula that quantifies the tradeoff between the loss in expressivity and the gain in the regularity of the learned estimator, depending on the chosen symmetry group. To illustrate our approach, we conduct extensive experiments on image inpainting, traffic flow prediction, and human pose estimation with different choices of symmetries. We show theoretically and empirically that the best generalization performance can be achieved by choosing a suitably larger group than the graph automorphism, but smaller than the permutation group. △ Less

Submitted 17 November, 2023; v1 submitted 20 August, 2023; originally announced August 2023.

Comments: Accepted at NeurIPS 2023

arXiv:2306.13924 [pdf, other]

Structuring Representation Geometry with Rotationally Equivariant Contrastive Learning

Authors: Sharut Gupta, Joshua Robinson, Derek Lim, Soledad Villar, Stefanie Jegelka

Abstract: Self-supervised learning converts raw perceptual data such as images to a compact space where simple Euclidean distances measure meaningful variations in data. In this paper, we extend this formulation by adding additional geometric structure to the embedding space by enforcing transformations of input space to correspond to simple (i.e., linear) transformations of embedding space. Specifically, i… ▽ More Self-supervised learning converts raw perceptual data such as images to a compact space where simple Euclidean distances measure meaningful variations in data. In this paper, we extend this formulation by adding additional geometric structure to the embedding space by enforcing transformations of input space to correspond to simple (i.e., linear) transformations of embedding space. Specifically, in the contrastive learning setting, we introduce an equivariance objective and theoretically prove that its minima forces augmentations on input space to correspond to rotations on the spherical embedding space. We show that merely combining our equivariant loss with a non-collapse term results in non-trivial representations, without requiring invariance to data augmentations. Optimal performance is achieved by also encouraging approximate invariance, where input augmentations correspond to small rotations. Our method, CARE: Contrastive Augmentation-induced Rotational Equivariance, leads to improved performance on downstream tasks, and ensures sensitivity in embedding space to important variations in data (e.g., color) that standard contrastive methods do not achieve. Code is available at https://github.com/Sharut/CARE. △ Less

Submitted 24 June, 2023; originally announced June 2023.

Comments: 22 pages

arXiv:2306.03698 [pdf, other]

Fine-grained Expressivity of Graph Neural Networks

Authors: Jan Böker, Ron Levie, Ningyuan Huang, Soledad Villar, Christopher Morris

Abstract: Numerous recent works have analyzed the expressive power of message-passing graph neural networks (MPNNs), primarily utilizing combinatorial techniques such as the $1$-dimensional Weisfeiler-Leman test ($1$-WL) for the graph isomorphism problem. However, the graph isomorphism objective is inherently binary, not giving insights into the degree of similarity between two given graphs. This work resol… ▽ More Numerous recent works have analyzed the expressive power of message-passing graph neural networks (MPNNs), primarily utilizing combinatorial techniques such as the $1$-dimensional Weisfeiler-Leman test ($1$-WL) for the graph isomorphism problem. However, the graph isomorphism objective is inherently binary, not giving insights into the degree of similarity between two given graphs. This work resolves this issue by considering continuous extensions of both $1$-WL and MPNNs to graphons. Concretely, we show that the continuous variant of $1$-WL delivers an accurate topological characterization of the expressive power of MPNNs on graphons, revealing which graphs these networks can distinguish and the level of difficulty in separating them. We identify the finest topology where MPNNs separate points and prove a universal approximation theorem. Consequently, we provide a theoretical framework for graph and graphon similarity combining various topological variants of classical characterizations of the $1$-WL. In particular, we characterize the expressive power of MPNNs in terms of the tree distance, which is a graph distance based on the concept of fractional isomorphisms, and substructure counts via tree homomorphisms, showing that these concepts have the same expressive power as the $1$-WL and MPNNs on graphons. Empirically, we validate our theoretical findings by showing that randomly initialized MPNNs, without training, exhibit competitive performance compared to their trained counterparts. Moreover, we evaluate different MPNN architectures based on their ability to preserve graph distances, highlighting the significance of our continuous $1$-WL test in understanding MPNNs' expressivity. △ Less

Submitted 2 November, 2023; v1 submitted 6 June, 2023; originally announced June 2023.

Comments: Accepted at NeurIPS 2023

arXiv:2305.12585 [pdf, other]

GeometricImageNet: Extending convolutional neural networks to vector and tensor images

Authors: Wilson Gregory, David W. Hogg, Ben Blum-Smith, Maria Teresa Arias, Kaze W. K. Wong, Soledad Villar

Abstract: Convolutional neural networks and their ilk have been very successful for many learning tasks involving images. These methods assume that the input is a scalar image representing the intensity in each pixel, possibly in multiple channels for color images. In natural-science domains however, image-like data sets might have vectors (velocity, say), tensors (polarization, say), pseudovectors (magneti… ▽ More Convolutional neural networks and their ilk have been very successful for many learning tasks involving images. These methods assume that the input is a scalar image representing the intensity in each pixel, possibly in multiple channels for color images. In natural-science domains however, image-like data sets might have vectors (velocity, say), tensors (polarization, say), pseudovectors (magnetic field, say), or other geometric objects in each pixel. Treating the components of these objects as independent channels in a CNN neglects their structure entirely. Our formulation -- the GeometricImageNet -- combines a geometric generalization of convolution with outer products, tensor index contractions, and tensor index permutations to construct geometric-image functions of geometric images that use and benefit from the tensor structure. The framework permits, with a very simple adjustment, restriction to function spaces that are exactly equivariant to translations, discrete rotations, and reflections. We use representation theory to quantify the dimension of the space of equivariant polynomial functions on 2-dimensional vector images. We give partial results on the expressivity of GeometricImageNet on small images. In numerical experiments, we find that GeometricImageNet has good generalization for a small simulated physics system, even when trained with a small training set. We expect this tool will be valuable for scientific and engineering machine learning, for example in cosmology or ocean dynamics. △ Less

Submitted 21 May, 2023; originally announced May 2023.

arXiv:2305.07587 [pdf, other]

Global method for gender profile estimation from distribution of first names

Authors: Manolis Antonoyiannakis, Hugues Chaté, Serena Dalena, Jessica Thomas, Alessandro S. Villar

Abstract: As social issues related to gender bias attract closer scrutiny, accurate tools to determine the gender profile of large groups become essential. When explicit data is unavailable, gender is often inferred from names. Current methods follow a strategy whereby individuals of the group, one by one, are assigned a gender label or probability based on gender-name correlations observed in the populatio… ▽ More As social issues related to gender bias attract closer scrutiny, accurate tools to determine the gender profile of large groups become essential. When explicit data is unavailable, gender is often inferred from names. Current methods follow a strategy whereby individuals of the group, one by one, are assigned a gender label or probability based on gender-name correlations observed in the population at large. We show that this strategy is logically inconsistent and has practical shortcomings, the most notable of which is the systematic underestimation of gender bias. We introduce a global inference strategy that estimates gender composition according to the context of the full list of names. The tool suffers from no intrinsic methodological effects, is robust against errors, easily implemented, and computationally light. △ Less

Submitted 12 May, 2023; originally announced May 2023.

Comments: https://ggem.app

arXiv:2301.13724 [pdf, other]

Towards fully covariant machine learning

Authors: Soledad Villar, David W. Hogg, Weichi Yao, George A. Kevrekidis, Bernhard Schölkopf

Abstract: Any representation of data involves arbitrary investigator choices. Because those choices are external to the data-generating process, each choice leads to an exact symmetry, corresponding to the group of transformations that takes one possible representation to another. These are the passive symmetries; they include coordinate freedom, gauge symmetry, and units covariance, all of which have led t… ▽ More Any representation of data involves arbitrary investigator choices. Because those choices are external to the data-generating process, each choice leads to an exact symmetry, corresponding to the group of transformations that takes one possible representation to another. These are the passive symmetries; they include coordinate freedom, gauge symmetry, and units covariance, all of which have led to important results in physics. In machine learning, the most visible passive symmetry is the relabeling or permutation symmetry of graphs. Our goal is to understand the implications for machine learning of the many passive symmetries in play. We discuss dos and don'ts for machine learning practice if passive symmetries are to be respected. We discuss links to causal modeling, and argue that the implementation of passive symmetries is particularly valuable when the goal of the learning problem is to generalize out of sample. This paper is conceptual: It translates among the languages of physics, mathematics, and machine-learning. We believe that consideration and implementation of passive symmetries might help machine learning in the same ways that it transformed physics in the twentieth century. △ Less

Submitted 28 June, 2023; v1 submitted 31 January, 2023; originally announced January 2023.

Comments: substantial revision from v1; submitted to TMLR

arXiv:2301.01107 [pdf]

Computing the Performance of A New Adaptive Sampling Algorithm Based on The Gittins Index in Experiments with Exponential Rewards

Authors: James K. He, Sofía S. Villar, Lida Mavrogonatou

Abstract: Designing experiments often requires balancing between learning about the true treatment effects and earning from allocating more samples to the superior treatment. While optimal algorithms for the Multi-Armed Bandit Problem (MABP) provide allocation policies that optimally balance learning and earning, they tend to be computationally expensive. The Gittins Index (GI) is a solution to the MABP tha… ▽ More Designing experiments often requires balancing between learning about the true treatment effects and earning from allocating more samples to the superior treatment. While optimal algorithms for the Multi-Armed Bandit Problem (MABP) provide allocation policies that optimally balance learning and earning, they tend to be computationally expensive. The Gittins Index (GI) is a solution to the MABP that can simultaneously attain optimality and computationally efficiency goals, and it has been recently used in experiments with Bernoulli and Gaussian rewards. For the first time, we present a modification of the GI rule that can be used in experiments with exponentially-distributed rewards. We report its performance in simulated 2- armed and 3-armed experiments. Compared to traditional non-adaptive designs, our novel GI modified design shows operating characteristics comparable in learning (e.g. statistical power) but substantially better in earning (e.g. direct benefits). This illustrates the potential that designs using a GI approach to allocate participants have to improve participant benefits, increase efficiencies, and reduce experimental costs in adaptive multi-armed experiments with exponential rewards. △ Less

Submitted 3 January, 2023; originally announced January 2023.

Comments: Accepted by Computing Conference, London 2023

arXiv:2211.15744 [pdf, other]

Sketch-and-solve approaches to k-means clustering by semidefinite programming

Authors: Charles Clum, Dustin G. Mixon, Soledad Villar, Kaiying Xie

Abstract: We introduce a sketch-and-solve approach to speed up the Peng-Wei semidefinite relaxation of k-means clustering. When the data is appropriately separated we identify the k-means optimal clustering. Otherwise, our approach provides a high-confidence lower bound on the optimal k-means value. This lower bound is data-driven; it does not make any assumption on the data nor how it is generated. We prov… ▽ More We introduce a sketch-and-solve approach to speed up the Peng-Wei semidefinite relaxation of k-means clustering. When the data is appropriately separated we identify the k-means optimal clustering. Otherwise, our approach provides a high-confidence lower bound on the optimal k-means value. This lower bound is data-driven; it does not make any assumption on the data nor how it is generated. We provide code and an extensive set of numerical experiments where we use this approach to certify approximate optimality of clustering solutions obtained by k-means++. △ Less

Submitted 28 November, 2022; originally announced November 2022.

arXiv:2211.03231 [pdf, other]

A Spectral Analysis of Graph Neural Networks on Dense and Sparse Graphs

Authors: Luana Ruiz, Ningyuan Huang, Soledad Villar

Abstract: In this work we propose a random graph model that can produce graphs at different levels of sparsity. We analyze how sparsity affects the graph spectra, and thus the performance of graph neural networks (GNNs) in node classification on dense and sparse graphs. We compare GNNs with spectral methods known to provide consistent estimators for community detection on dense graphs, a closely related tas… ▽ More In this work we propose a random graph model that can produce graphs at different levels of sparsity. We analyze how sparsity affects the graph spectra, and thus the performance of graph neural networks (GNNs) in node classification on dense and sparse graphs. We compare GNNs with spectral methods known to provide consistent estimators for community detection on dense graphs, a closely related task. We show that GNNs can outperform spectral methods on sparse graphs, and illustrate these results with numerical examples on both synthetic and real graphs. △ Less

Submitted 13 September, 2023; v1 submitted 6 November, 2022; originally announced November 2022.

Comments: Extended version of ICASSP 2024 submission

arXiv:2210.15083 [pdf, other]

Deep Learning is Provably Robust to Symmetric Label Noise

Authors: Carey E. Priebe, Ningyuan Huang, Soledad Villar, Cong Mu, Li Chen

Abstract: Deep neural networks (DNNs) are capable of perfectly fitting the training data, including memorizing noisy data. It is commonly believed that memorization hurts generalization. Therefore, many recent works propose mitigation strategies to avoid noisy data or correct memorization. In this work, we step back and ask the question: Can deep learning be robust against massive label noise without any mi… ▽ More Deep neural networks (DNNs) are capable of perfectly fitting the training data, including memorizing noisy data. It is commonly believed that memorization hurts generalization. Therefore, many recent works propose mitigation strategies to avoid noisy data or correct memorization. In this work, we step back and ask the question: Can deep learning be robust against massive label noise without any mitigation? We provide an affirmative answer for the case of symmetric label noise: We find that certain DNNs, including under-parameterized and over-parameterized models, can tolerate massive symmetric label noise up to the information-theoretic threshold. By appealing to classical statistical theory and universal consistency of DNNs, we prove that for multiclass classification, $L_1$-consistent DNN classifiers trained under symmetric label noise can achieve Bayes optimality asymptotically if the label noise probability is less than $\frac{K-1}{K}$, where $K \ge 2$ is the number of classes. Our results show that for symmetric label noise, no mitigation is necessary for $L_1$-consistent estimators. We conjecture that for general label noise, mitigation strategies that make use of the noisy data will outperform those that ignore the noisy data. △ Less

Submitted 26 October, 2022; originally announced October 2022.

arXiv:2209.15608 [pdf, other]

Shuffled linear regression through graduated convex relaxation

Authors: Efe Onaran, Soledad Villar

Abstract: The shuffled linear regression problem aims to recover linear relationships in datasets where the correspondence between input and output is unknown. This problem arises in a wide range of applications including survey data, in which one needs to decide whether the anonymity of the responses can be preserved while uncovering significant statistical connections. In this work, we propose a novel opt… ▽ More The shuffled linear regression problem aims to recover linear relationships in datasets where the correspondence between input and output is unknown. This problem arises in a wide range of applications including survey data, in which one needs to decide whether the anonymity of the responses can be preserved while uncovering significant statistical connections. In this work, we propose a novel optimization algorithm for shuffled linear regression based on a posterior-maximizing objective function assuming Gaussian noise prior. We compare and contrast our approach with existing methods on synthetic and real data. We show that our approach performs competitively while achieving empirical running-time improvements. Furthermore, we demonstrate that our algorithm is able to utilize the side information in the form of seeds, which recently came to prominence in related problems. △ Less

Submitted 30 September, 2022; originally announced September 2022.

arXiv:2209.14991 [pdf, ps, other]

Machine learning and invariant theory

Authors: Ben Blum-Smith, Soledad Villar

Abstract: Inspired by constraints from physical law, equivariant machine learning restricts the learning to a hypothesis class where all the functions are equivariant with respect to some group action. Irreducible representations or invariant theory are typically used to parameterize the space of such functions. In this article, we introduce the topic and explain a couple of methods to explicitly parameteri… ▽ More Inspired by constraints from physical law, equivariant machine learning restricts the learning to a hypothesis class where all the functions are equivariant with respect to some group action. Irreducible representations or invariant theory are typically used to parameterize the space of such functions. In this article, we introduce the topic and explain a couple of methods to explicitly parameterize equivariant functions that are being used in machine learning applications. In particular, we explicate a general procedure, attributed to Malgrange, to express all polynomial maps between linear spaces that are equivariant under the action of a group $G$, given a characterization of the invariant polynomials on a bigger space. The method also parametrizes smooth equivariant maps in the case that $G$ is a compact Lie group. △ Less

Submitted 25 March, 2023; v1 submitted 29 September, 2022; originally announced September 2022.

arXiv:2209.12054 [pdf, other]

From Local to Global: Spectral-Inspired Graph Neural Networks

Authors: Ningyuan Huang, Soledad Villar, Carey E. Priebe, Da Zheng, Chengyue Huang, Lin Yang, Vladimir Braverman

Abstract: Graph Neural Networks (GNNs) are powerful deep learning methods for Non-Euclidean data. Popular GNNs are message-passing algorithms (MPNNs) that aggregate and combine signals in a local graph neighborhood. However, shallow MPNNs tend to miss long-range signals and perform poorly on some heterophilous graphs, while deep MPNNs can suffer from issues like over-smoothing or over-squashing. To mitigate… ▽ More Graph Neural Networks (GNNs) are powerful deep learning methods for Non-Euclidean data. Popular GNNs are message-passing algorithms (MPNNs) that aggregate and combine signals in a local graph neighborhood. However, shallow MPNNs tend to miss long-range signals and perform poorly on some heterophilous graphs, while deep MPNNs can suffer from issues like over-smoothing or over-squashing. To mitigate such issues, existing works typically borrow normalization techniques from training neural networks on Euclidean data or modify the graph structures. Yet these approaches are not well-understood theoretically and could increase the overall computational complexity. In this work, we draw inspirations from spectral graph embedding and propose $\texttt{PowerEmbed}$ -- a simple layer-wise normalization technique to boost MPNNs. We show $\texttt{PowerEmbed}$ can provably express the top-$k$ leading eigenvectors of the graph operator, which prevents over-smoothing and is agnostic to the graph topology; meanwhile, it produces a list of representations ranging from local features to global signals, which avoids over-squashing. We apply $\texttt{PowerEmbed}$ in a wide range of simulated and real graphs and demonstrate its competitive performance, particularly for heterophilous graphs. △ Less

Submitted 4 November, 2022; v1 submitted 24 September, 2022; originally announced September 2022.

Comments: Accepted for publication at the NeurIPS 2022 GLFrontiers Workshop

arXiv:2207.14106 [pdf, other]

MarkerMap: nonlinear marker selection for single-cell studies

Authors: Nabeel Sarwar, Wilson Gregory, George A Kevrekidis, Soledad Villar, Bianca Dumitrascu

Abstract: Single-cell RNA-seq data allow the quantification of cell type differences across a growing set of biological contexts. However, pinpointing a small subset of genomic features explaining this variability can be ill-defined and computationally intractable. Here we introduce MarkerMap, a generative model for selecting minimal gene sets which are maximally informative of cell type origin and enable w… ▽ More Single-cell RNA-seq data allow the quantification of cell type differences across a growing set of biological contexts. However, pinpointing a small subset of genomic features explaining this variability can be ill-defined and computationally intractable. Here we introduce MarkerMap, a generative model for selecting minimal gene sets which are maximally informative of cell type origin and enable whole transcriptome reconstruction. MarkerMap provides a scalable framework for both supervised marker selection, aimed at identifying specific cell type populations, and unsupervised marker selection, aimed at gene expression imputation and reconstruction. We benchmark MarkerMap's competitive performance against previously published approaches on real single cell gene expression data sets. MarkerMap is available as a pip installable package, as a community resource aimed at developing explainable machine learning techniques for enhancing interpretability in single-cell studies. △ Less

Submitted 28 July, 2022; originally announced July 2022.

arXiv:2205.08875 [pdf, other]

Multi-disciplinary fairness considerations in machine learning for clinical trials

Authors: Isabel Chien, Nina Deliu, Richard E. Turner, Adrian Weller, Sofia S. Villar, Niki Kilbertus

Abstract: While interest in the application of machine learning to improve healthcare has grown tremendously in recent years, a number of barriers prevent deployment in medical practice. A notable concern is the potential to exacerbate entrenched biases and existing health disparities in society. The area of fairness in machine learning seeks to address these issues of equity; however, appropriate approache… ▽ More While interest in the application of machine learning to improve healthcare has grown tremendously in recent years, a number of barriers prevent deployment in medical practice. A notable concern is the potential to exacerbate entrenched biases and existing health disparities in society. The area of fairness in machine learning seeks to address these issues of equity; however, appropriate approaches are context-dependent, necessitating domain-specific consideration. We focus on clinical trials, i.e., research studies conducted on humans to evaluate medical treatments. Clinical trials are a relatively under-explored application in machine learning for healthcare, in part due to complex ethical, legal, and regulatory requirements and high costs. Our aim is to provide a multi-disciplinary assessment of how fairness for machine learning fits into the context of clinical trials research and practice. We start by reviewing the current ethical considerations and guidelines for clinical trials and examine their relationship with common definitions of fairness in machine learning. We examine potential sources of unfairness in clinical trials, providing concrete examples, and discuss the role machine learning might play in either mitigating potential biases or exacerbating them when applied without care. Particular focus is given to adaptive clinical trials, which may employ machine learning. Finally, we highlight concepts that require further investigation and development, and emphasize new approaches to fairness that may be relevant to the design of clinical trials. △ Less

Submitted 18 May, 2022; originally announced May 2022.

Comments: Appeared at ACM FAccT 2022

arXiv:2205.03820 [pdf, other]

doi 10.1371/journal.pone.0274272

Some performance considerations when using multi-armed bandit algorithms in the presence of missing data

Authors: Xijin Chen, Kim May Lee, Sofia S. Villar, David S. Robertson

Abstract: When comparing the performance of multi-armed bandit algorithms, the potential impact of missing data is often overlooked. In practice, it also affects their implementation where the simplest approach to overcome this is to continue to sample according to the original bandit algorithm, ignoring missing outcomes. We investigate the impact on performance of this approach to deal with missing data fo… ▽ More When comparing the performance of multi-armed bandit algorithms, the potential impact of missing data is often overlooked. In practice, it also affects their implementation where the simplest approach to overcome this is to continue to sample according to the original bandit algorithm, ignoring missing outcomes. We investigate the impact on performance of this approach to deal with missing data for several bandit algorithms through an extensive simulation study assuming the rewards are missing at random. We focus on two-armed bandit algorithms with binary outcomes in the context of patient allocation for clinical trials with relatively small sample sizes. However, our results apply to other applications of bandit algorithms where missing data is expected to occur. We assess the resulting operating characteristics, including the expected reward. Different probabilities of missingness in both arms are considered. The key finding of our work is that when using the simplest strategy of ignoring missing data, the impact on the expected performance of multi-armed bandit strategies varies according to the way these strategies balance the exploration-exploitation trade-off. Algorithms that are geared towards exploration continue to assign samples to the arm with more missing responses (which being perceived as the arm with less observed information is deemed more appealing by the algorithm than it would otherwise be). In contrast, algorithms that are geared towards exploitation would rapidly assign a high value to samples from the arms with a current high mean irrespective of the level observations per arm. Furthermore, for algorithms focusing more on exploration, we illustrate that the problem of missing responses can be alleviated using a simple mean imputation approach. △ Less

Submitted 7 July, 2022; v1 submitted 8 May, 2022; originally announced May 2022.

Comments: 30 pages, 6 figures

arXiv:2204.00887 [pdf, other]

Dimensionless machine learning: Imposing exact units equivariance

Authors: Soledad Villar, Weichi Yao, David W. Hogg, Ben Blum-Smith, Bianca Dumitrascu

Abstract: Units equivariance (or units covariance) is the exact symmetry that follows from the requirement that relationships among measured quantities of physics relevance must obey self-consistent dimensional scalings. Here, we express this symmetry in terms of a (non-compact) group action, and we employ dimensional analysis and ideas from equivariant machine learning to provide a methodology for exactly… ▽ More Units equivariance (or units covariance) is the exact symmetry that follows from the requirement that relationships among measured quantities of physics relevance must obey self-consistent dimensional scalings. Here, we express this symmetry in terms of a (non-compact) group action, and we employ dimensional analysis and ideas from equivariant machine learning to provide a methodology for exactly units-equivariant machine learning: For any given learning task, we first construct a dimensionless version of its inputs using classic results from dimensional analysis, and then perform inference in the dimensionless space. Our approach can be used to impose units equivariance across a broad range of machine learning methods which are equivariant to rotations and other groups. We discuss the in-sample and out-of-sample prediction accuracy gains one can obtain in contexts like symbolic regression and emulation, where symmetry is important. We illustrate our approach with simple numerical examples involving dynamical systems in physics and ecology. △ Less

Submitted 31 December, 2022; v1 submitted 2 April, 2022; originally announced April 2022.

Journal ref: Journal of Machine Learning Research 24 (2023) 1--32

arXiv:2201.07372 [pdf, other]

Prospective Learning: Principled Extrapolation to the Future

Authors: Ashwin De Silva, Rahul Ramesh, Lyle Ungar, Marshall Hussain Shuler, Noah J. Cowan, Michael Platt, Chen Li, Leyla Isik, Seung-Eon Roh, Adam Charles, Archana Venkataraman, Brian Caffo, Javier J. How, Justus M Kebschull, John W. Krakauer, Maxim Bichuch, Kaleab Alemayehu Kinfu, Eva Yezerets, Dinesh Jayaraman, Jong M. Shin, Soledad Villar, Ian Phillips, Carey E. Priebe, Thomas Hartung, Michael I. Miller , et al. (18 additional authors not shown)

Abstract: Learning is a process which can update decision rules, based on past experience, such that future performance improves. Traditionally, machine learning is often evaluated under the assumption that the future will be identical to the past in distribution or change adversarially. But these assumptions can be either too optimistic or pessimistic for many problems in the real world. Real world scenari… ▽ More Learning is a process which can update decision rules, based on past experience, such that future performance improves. Traditionally, machine learning is often evaluated under the assumption that the future will be identical to the past in distribution or change adversarially. But these assumptions can be either too optimistic or pessimistic for many problems in the real world. Real world scenarios evolve over multiple spatiotemporal scales with partially predictable dynamics. Here we reformulate the learning problem to one that centers around this idea of dynamic futures that are partially learnable. We conjecture that certain sequences of tasks are not retrospectively learnable (in which the data distribution is fixed), but are prospectively learnable (in which distributions may be dynamic), suggesting that prospective learning is more difficult in kind than retrospective learning. We argue that prospective learning more accurately characterizes many real world problems that (1) currently stymie existing artificial intelligence solutions and/or (2) lack adequate explanations for how natural intelligences solve them. Thus, studying prospective learning will lead to deeper insights and solutions to currently vexing challenges in both natural and artificial intelligences. △ Less

Submitted 13 July, 2023; v1 submitted 18 January, 2022; originally announced January 2022.

Comments: Accepted at the 2nd Conference on Lifelong Learning Agents (CoLLAs), 2023

arXiv:2201.07083 [pdf, other]

doi 10.1109/ICASSP39728.2021.9413523

A Short Tutorial on The Weisfeiler-Lehman Test And Its Variants

Authors: Ningyuan Huang, Soledad Villar

Abstract: Graph neural networks are designed to learn functions on graphs. Typically, the relevant target functions are invariant with respect to actions by permutations. Therefore the design of some graph neural network architectures has been inspired by graph-isomorphism algorithms. The classical Weisfeiler-Lehman algorithm (WL) -- a graph-isomorphism test based on color refinement -- became relevant to t… ▽ More Graph neural networks are designed to learn functions on graphs. Typically, the relevant target functions are invariant with respect to actions by permutations. Therefore the design of some graph neural network architectures has been inspired by graph-isomorphism algorithms. The classical Weisfeiler-Lehman algorithm (WL) -- a graph-isomorphism test based on color refinement -- became relevant to the study of graph neural networks. The WL test can be generalized to a hierarchy of higher-order tests, known as $k$-WL. This hierarchy has been used to characterize the expressive power of graph neural networks, and to inspire the design of graph neural network architectures. A few variants of the WL hierarchy appear in the literature. The goal of this short note is pedagogical and practical: We explain the differences between the WL and folklore-WL formulations, with pointers to existing discussions in the literature. We illuminate the differences between the formulations by visualizing an example. △ Less

Submitted 1 November, 2022; v1 submitted 18 January, 2022; originally announced January 2022.

Journal ref: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021

arXiv:2112.08507 [pdf, other]

Algorithms for Adaptive Experiments that Trade-off Statistical Analysis with Reward: Combining Uniform Random Assignment and Reward Maximization

Authors: Tong Li, Jacob Nogas, Haochen Song, Harsh Kumar, Audrey Durand, Anna Rafferty, Nina Deliu, Sofia S. Villar, Joseph J. Williams

Abstract: Multi-armed bandit algorithms like Thompson Sampling (TS) can be used to conduct adaptive experiments, in which maximizing reward means that data is used to progressively assign participants to more effective arms. Such assignment strategies increase the risk of statistical hypothesis tests identifying a difference between arms when there is not one, and failing to conclude there is a difference i… ▽ More Multi-armed bandit algorithms like Thompson Sampling (TS) can be used to conduct adaptive experiments, in which maximizing reward means that data is used to progressively assign participants to more effective arms. Such assignment strategies increase the risk of statistical hypothesis tests identifying a difference between arms when there is not one, and failing to conclude there is a difference in arms when there truly is one. We tackle this by introducing a novel heuristic algorithm, called TS-PostDiff (Posterior Probability of Difference). TS-PostDiff takes a Bayesian approach to mixing TS and Uniform Random (UR): the probability a participant is assigned using UR allocation is the posterior probability that the difference between two arms is 'small' (below a certain threshold), allowing for more UR exploration when there is little or no reward to be gained. We evaluate TS-PostDiff against state-of-the-art strategies. The empirical and simulation results help characterize the trade-offs of these approaches between reward, False Positive Rate (FPR), and statistical power, as well as under which circumstances each is effective. We quantify the advantage of TS-PostDiff in performing well across multiple differences in arm means (effect sizes), showing the benefits of adaptively changing randomization/exploration in TS in a "Statistically Considerate" manner: reducing FPR and increasing statistical power when differences are small or zero and there is less reward to be gained, while exploiting more when differences may be large. This highlights important considerations for future algorithm development and analysis to better balance reward and statistical analysis. △ Less

Submitted 23 November, 2022; v1 submitted 15 December, 2021; originally announced December 2021.

arXiv:2112.02916 [pdf, ps, other]

Three proofs of the Benedetto-Fickus theorem

Authors: Dustin G. Mixon, Tom Needham, Clayton Shonkwiler, Soledad Villar

Abstract: In 2003, Benedetto and Fickus introduced a vivid intuition for an objective function called the frame potential, whose global minimizers are fundamental objects known today as unit norm tight frames. Their main result was that the frame potential exhibits no spurious local minimizers, suggesting local optimization as an approach to construct these objects. Local optimization has since become the w… ▽ More In 2003, Benedetto and Fickus introduced a vivid intuition for an objective function called the frame potential, whose global minimizers are fundamental objects known today as unit norm tight frames. Their main result was that the frame potential exhibits no spurious local minimizers, suggesting local optimization as an approach to construct these objects. Local optimization has since become the workhorse of cutting-edge signal processing and machine learning, and accordingly, the community has identified a variety of techniques to study optimization landscapes. This chapter applies some of these techniques to obtain three modern proofs of the Benedetto-Fickus theorem. △ Less

Submitted 6 December, 2021; originally announced December 2021.

arXiv:2111.00137 [pdf, other]

Efficient Inference Without Trading-off Regret in Bandits: An Allocation Probability Test for Thompson Sampling

Authors: Nina Deliu, Joseph J. Williams, Sofia S. Villar

Abstract: Using bandit algorithms to conduct adaptive randomised experiments can minimise regret, but it poses major challenges for statistical inference (e.g., biased estimators, inflated type-I error and reduced power). Recent attempts to address these challenges typically impose restrictions on the exploitative nature of the bandit algorithm$-$trading off regret$-$and require large sample sizes to ensure… ▽ More Using bandit algorithms to conduct adaptive randomised experiments can minimise regret, but it poses major challenges for statistical inference (e.g., biased estimators, inflated type-I error and reduced power). Recent attempts to address these challenges typically impose restrictions on the exploitative nature of the bandit algorithm$-$trading off regret$-$and require large sample sizes to ensure asymptotic guarantees. However, large experiments generally follow a successful pilot study, which is tightly constrained in its size or duration. Increasing power in such small pilot experiments, without limiting the adaptive nature of the algorithm, can allow promising interventions to reach a larger experimental phase. In this work we introduce a novel hypothesis test, uniquely based on the allocation probabilities of the bandit algorithm, and without constraining its exploitative nature or requiring a minimum experimental size. We characterise our $Allocation\ Probability\ Test$ when applied to $Thompson\ Sampling$, presenting its asymptotic theoretical properties, and illustrating its finite-sample performances compared to state-of-the-art approaches. We demonstrate the regret and inferential advantages of our approach, particularly in small samples, in both extensive simulations and in a real-world experiment on mental health aspects. △ Less

Submitted 29 October, 2021; originally announced November 2021.

Comments: 32 pages including supplementary material

arXiv:2110.03761 [pdf, other]

A simple equivariant machine learning method for dynamics based on scalars

Authors: Weichi Yao, Kate Storey-Fisher, David W. Hogg, Soledad Villar

Abstract: Physical systems obey strict symmetry principles. We expect that machine learning methods that intrinsically respect these symmetries should have higher prediction accuracy and better generalization in prediction of physical dynamics. In this work we implement a principled model based on invariant scalars, and release open-source code. We apply this Scalars method to a simple chaotic dynamical sys… ▽ More Physical systems obey strict symmetry principles. We expect that machine learning methods that intrinsically respect these symmetries should have higher prediction accuracy and better generalization in prediction of physical dynamics. In this work we implement a principled model based on invariant scalars, and release open-source code. We apply this Scalars method to a simple chaotic dynamical system, the springy double pendulum. We show that the Scalars method outperforms state-of-the-art approaches for learning the properties of physical systems with symmetries, both in terms of accuracy and speed. Because the method incorporates the fundamental symmetries, we expect it to generalize to different settings, such as changes in the force laws in the system. △ Less

Submitted 30 October, 2021; v1 submitted 7 October, 2021; originally announced October 2021.

arXiv:2106.06610 [pdf, other]

Scalars are universal: Equivariant machine learning, structured like classical physics

Authors: Soledad Villar, David W. Hogg, Kate Storey-Fisher, Weichi Yao, Ben Blum-Smith

Abstract: There has been enormous progress in the last few years in designing neural networks that respect the fundamental symmetries and coordinate freedoms of physical law. Some of these frameworks make use of irreducible representations, some make use of high-order tensor objects, and some apply symmetry-enforcing constraints. Different physical laws obey different combinations of fundamental symmetries,… ▽ More There has been enormous progress in the last few years in designing neural networks that respect the fundamental symmetries and coordinate freedoms of physical law. Some of these frameworks make use of irreducible representations, some make use of high-order tensor objects, and some apply symmetry-enforcing constraints. Different physical laws obey different combinations of fundamental symmetries, but a large fraction (possibly all) of classical physics is equivariant to translation, rotation, reflection (parity), boost (relativity), and permutations. Here we show that it is simple to parameterize universally approximating polynomial functions that are equivariant under these symmetries, or under the Euclidean, Lorentz, and Poincaré groups, at any dimensionality $d$. The key observation is that nonlinear O($d$)-equivariant (and related-group-equivariant) functions can be universally expressed in terms of a lightweight collection of scalars -- scalar products and scalar contractions of the scalar, vector, and tensor inputs. We complement our theory with numerical examples that show that the scalar-based method is simple, efficient, and scalable. △ Less

Submitted 7 February, 2023; v1 submitted 11 June, 2021; originally announced June 2021.

Comments: NeurIPS 2021

Journal ref: Advances in Neural Information Processing Systems, 34, 28848-28863. 2021

arXiv:2104.00546 [pdf, other]

Quantifying efficiency gains of innovative designs of two-arm vaccine trials for COVID-19 using an epidemic simulation model

Authors: Rob Johnson, Chris Jackson, Anne Presanis, Sofia S. Villar, Daniela De Angelis

Abstract: Clinical trials of a vaccine during an epidemic face particular challenges, such as the pressure to identify an effective vaccine quickly to control the epidemic, and the effect that time-space-varying infection incidence has on the power of a trial. We illustrate how the operating characteristics of different trial design elements may be evaluated using a network epidemic and trial simulation mod… ▽ More Clinical trials of a vaccine during an epidemic face particular challenges, such as the pressure to identify an effective vaccine quickly to control the epidemic, and the effect that time-space-varying infection incidence has on the power of a trial. We illustrate how the operating characteristics of different trial design elements may be evaluated using a network epidemic and trial simulation model, based on COVID-19 and individually randomised two-arm trials with a binary outcome. We show that "ring" recruitment strategies, prioritising participants at high risk of infection, can result in substantial improvement in terms of power, if sufficiently many contacts of observed cases are at high risk. In addition, we introduce a novel method to make more efficient use of the data from the earliest cases of infection observed in the trial, whose infection may have been too early to be vaccine-preventable. Finally, we compare several methods of response-adaptive randomisation, discussing their advantages and disadvantages in this two-arm context and identifying particular adaptation strategies that preserve power and estimation properties, while slightly reducing the number of infections, given an effective vaccine. △ Less

Submitted 20 May, 2021; v1 submitted 13 March, 2021; originally announced April 2021.

arXiv:2103.12198 [pdf]

Challenges in Statistical Analysis of Data Collected by a Bandit Algorithm: An Empirical Exploration in Applications to Adaptively Randomized Experiments

Authors: Joseph Jay Williams, Jacob Nogas, Nina Deliu, Hammad Shaikh, Sofia S. Villar, Audrey Durand, Anna Rafferty

Abstract: Multi-armed bandit algorithms have been argued for decades as useful for adaptively randomized experiments. In such experiments, an algorithm varies which arms (e.g. alternative interventions to help students learn) are assigned to participants, with the goal of assigning higher-reward arms to as many participants as possible. We applied the bandit algorithm Thompson Sampling (TS) to run adaptive… ▽ More Multi-armed bandit algorithms have been argued for decades as useful for adaptively randomized experiments. In such experiments, an algorithm varies which arms (e.g. alternative interventions to help students learn) are assigned to participants, with the goal of assigning higher-reward arms to as many participants as possible. We applied the bandit algorithm Thompson Sampling (TS) to run adaptive experiments in three university classes. Instructors saw great value in trying to rapidly use data to give their students in the experiments better arms (e.g. better explanations of a concept). Our deployment, however, illustrated a major barrier for scientists and practitioners to use such adaptive experiments: a lack of quantifiable insight into how much statistical analysis of specific real-world experiments is impacted (Pallmann et al, 2018; FDA, 2019), compared to traditional uniform random assignment. We therefore use our case study of the ubiquitous two-arm binary reward setting to empirically investigate the impact of using Thompson Sampling instead of uniform random assignment. In this setting, using common statistical hypothesis tests, we show that collecting data with TS can as much as double the False Positive Rate (FPR; incorrectly reporting differences when none exist) and the False Negative Rate (FNR; failing to report differences when they exist)... △ Less

Submitted 26 March, 2021; v1 submitted 22 March, 2021; originally announced March 2021.

arXiv:2101.07256 [pdf, other]

doi 10.1088/1538-3873/ac20ac

Fitting very flexible models: Linear regression with large numbers of parameters

Authors: David W. Hogg, Soledad Villar

Abstract: There are many uses for linear fitting; the context here is interpolation and denoising of data, as when you have calibration data and you want to fit a smooth, flexible function to those data. Or you want to fit a flexible function to de-trend a time series or normalize a spectrum. In these contexts, investigators often choose a polynomial basis, or a Fourier basis, or wavelets, or something equa… ▽ More There are many uses for linear fitting; the context here is interpolation and denoising of data, as when you have calibration data and you want to fit a smooth, flexible function to those data. Or you want to fit a flexible function to de-trend a time series or normalize a spectrum. In these contexts, investigators often choose a polynomial basis, or a Fourier basis, or wavelets, or something equally general. They also choose an order, or number of basis functions to fit, and (often) some kind of regularization. We discuss how this basis-function fitting is done, with ordinary least squares and extensions thereof. We emphasize that it is often valuable to choose far more parameters than data points, despite folk rules to the contrary: Suitably regularized models with enormous numbers of parameters generalize well and make good predictions for held-out data; over-fitting is not (mainly) a problem of having too many parameters. It is even possible to take the limit of infinite parameters, at which, if the basis and regularization are chosen correctly, the least-squares fit becomes the mean of a Gaussian process. We recommend cross-validation as a good empirical method for model selection (for example, setting the number of parameters and the form of the regularization), and jackknife resampling as a good empirical method for estimating the uncertainties of the predictions made by the model. We also give advice for building stable computational implementations. △ Less

Submitted 15 January, 2021; originally announced January 2021.

Comments: all code used to make the figures is available at https://github.com/davidwhogg/FlexibleLinearModels

arXiv:2011.11477 [pdf, other]

doi 10.1137/20M1387821

Dimensionality reduction, regularization, and generalization in overparameterized regressions

Authors: Ningyuan Huang, David W. Hogg, Soledad Villar

Abstract: Overparameterization in deep learning is powerful: Very large models fit the training data perfectly and yet often generalize well. This realization brought back the study of linear models for regression, including ordinary least squares (OLS), which, like deep learning, shows a "double-descent" behavior: (1) The risk (expected out-of-sample prediction error) can grow arbitrarily when the number o… ▽ More Overparameterization in deep learning is powerful: Very large models fit the training data perfectly and yet often generalize well. This realization brought back the study of linear models for regression, including ordinary least squares (OLS), which, like deep learning, shows a "double-descent" behavior: (1) The risk (expected out-of-sample prediction error) can grow arbitrarily when the number of parameters $p$ approaches the number of samples $n$, and (2) the risk decreases with $p$ for $p>n$, sometimes achieving a lower value than the lowest risk for $p<n$. The divergence of the risk for OLS can be avoided with regularization. In this work, we show that for some data models it can also be avoided with a PCA-based dimensionality reduction (PCA-OLS, also known as principal component regression). We provide non-asymptotic bounds for the risk of PCA-OLS by considering the alignments of the population and empirical principal components. We show that dimensionality reduction improves robustness while OLS is arbitrarily susceptible to adversarial attacks, particularly in the overparameterized regime. We compare PCA-OLS theoretically and empirically with a wide range of projection-based methods, including random projections, partial least squares (PLS), and certain classes of linear two-layer neural networks. These comparisons are made for different data generation models to assess the sensitivity to signal-to-noise and the alignment of regression coefficients with the features. We find that methods in which the projection depends on the training data can outperform methods where the projections are chosen independently of the training data, even those with oracle knowledge of population quantities, another seemingly paradoxical phenomenon that has been identified previously. This suggests that overparameterization may not be necessary for good generalization. △ Less

Submitted 19 October, 2021; v1 submitted 23 November, 2020; originally announced November 2020.

Journal ref: SIAM Journal on Mathematics of Data Science Vol.4 Iss.1, 2022

arXiv:2011.03270 [pdf, other]

A Novel Statistical Test for Treatment Differences in Clinical Trials using a Response Adaptive Forward Looking Gittins Index Rule

Authors: Helen Yvette Barnett, Sofia S Villar, Helena Geys, Thomas Jaki

Abstract: The most common objective for response adaptive clinical trials is to seek to ensure that patients within a trial have a high chance of receiving the best treatment available by altering the chance of allocation on the basis of accumulating data. Approaches which yield good patient benefit properties suffer from low power from a frequentist perspective when testing for a treatment difference at th… ▽ More The most common objective for response adaptive clinical trials is to seek to ensure that patients within a trial have a high chance of receiving the best treatment available by altering the chance of allocation on the basis of accumulating data. Approaches which yield good patient benefit properties suffer from low power from a frequentist perspective when testing for a treatment difference at the end of the study due to the high imbalance in treatment allocations. In this work we develop an alternative pairwise test for treatment difference on the basis of allocation probabilities of the covariate-adjusted response-adaptive randomization with forward looking Gittins index rule (CARA-FLGI). The performance of the novel test is evaluated in simulations for two-armed studies and then its applications to multi-armed studies is illustrated. The proposed test has markedly improved power over the traditional Fisher exact test when this class of non-myopic response adaptation is used. We also find that the test's power is close to the power of a Fisher exact test under equal randomization. △ Less

Submitted 6 November, 2020; originally announced November 2020.

arXiv:2006.12811 [pdf, other]

Adding flexibility to clinical trial designs: an example-based guide to the practical use of adaptive designs

Authors: Thomas Burnett, Pavel Mozgunov, Philip Pallmann, Sofia S. Villar, Graham M. Wheeler, Thomas Jaki

Abstract: Adaptive designs for clinical trials permit alterations to a study in response to accumulating data in order to make trials more flexible, ethical and efficient. These benefits are achieved while preserving the integrity and validity of the trial, through the pre-specification and proper adjustment for the possible alterations during the course of the trial. Despite much research in the statistica… ▽ More Adaptive designs for clinical trials permit alterations to a study in response to accumulating data in order to make trials more flexible, ethical and efficient. These benefits are achieved while preserving the integrity and validity of the trial, through the pre-specification and proper adjustment for the possible alterations during the course of the trial. Despite much research in the statistical literature highlighting the potential advantages of adaptive designs over traditional fixed designs, the uptake of such methods in clinical research has been slow. One major reason for this is that different adaptations to trial designs, as well as their advantages and limitations, remain unfamiliar to large parts of the clinical community. The aim of this paper is to clarify where adaptive designs can be used to address specific questions of scientific interest; we introduce the main features of adaptive designs and commonly used terminology, highlighting their utility and pitfalls, and illustrate their use through case studies of adaptive trials ranging from early-phase dose escalation to confirmatory Phase III studies. △ Less

Submitted 23 June, 2020; originally announced June 2020.

Comments: 35 pages, 9 figures

arXiv:2006.05026 [pdf, other]

Learning for Dose Allocation in Adaptive Clinical Trials with Safety Constraints

Authors: Cong Shen, Zhiyang Wang, Sofia S. Villar, Mihaela van der Schaar

Abstract: Phase I dose-finding trials are increasingly challenging as the relationship between efficacy and toxicity of new compounds (or combination of them) becomes more complex. Despite this, most commonly used methods in practice focus on identifying a Maximum Tolerated Dose (MTD) by learning only from toxicity events. We present a novel adaptive clinical trial methodology, called Safe Efficacy Explorat… ▽ More Phase I dose-finding trials are increasingly challenging as the relationship between efficacy and toxicity of new compounds (or combination of them) becomes more complex. Despite this, most commonly used methods in practice focus on identifying a Maximum Tolerated Dose (MTD) by learning only from toxicity events. We present a novel adaptive clinical trial methodology, called Safe Efficacy Exploration Dose Allocation (SEEDA), that aims at maximizing the cumulative efficacies while satisfying the toxicity safety constraint with high probability. We evaluate performance objectives that have operational meanings in practical clinical trials, including cumulative efficacy, recommendation/allocation success probabilities, toxicity violation probability, and sample efficiency. An extended SEEDA-Plateau algorithm that is tailored for the increase-then-plateau efficacy behavior of molecularly targeted agents (MTA) is also presented. Through numerical experiments using both synthetic and real-world datasets, we show that SEEDA outperforms state-of-the-art clinical trial designs by finding the optimal dose with higher success rate and fewer patients. △ Less

Submitted 15 June, 2020; v1 submitted 8 June, 2020; originally announced June 2020.

Comments: Accepted to the 37th International Conference on Machine Learning (ICML 2020)

arXiv:2005.00564 [pdf, other]

Response-adaptive randomization in clinical trials: from myths to practical considerations

Authors: David S. Robertson, Kim May Lee, Boryana C. Lopez-Kolkovska, Sofia S. Villar

Abstract: Response-Adaptive Randomization (RAR) is part of a wider class of data-dependent sampling algorithms, for which clinical trials are typically used as a motivating application. In that context, patient allocation to treatments is determined by randomization probabilities that change based on the accrued response data in order to achieve experimental goals. RAR has received abundant theoretical atte… ▽ More Response-Adaptive Randomization (RAR) is part of a wider class of data-dependent sampling algorithms, for which clinical trials are typically used as a motivating application. In that context, patient allocation to treatments is determined by randomization probabilities that change based on the accrued response data in order to achieve experimental goals. RAR has received abundant theoretical attention from the biostatistical literature since the 1930's and has been the subject of numerous debates. In the last decade, it has received renewed consideration from the applied and methodological communities, driven by well-known practical examples and its widespread use in machine learning. Papers on the subject present different views on its usefulness, and these are not easy to reconcile. This work aims to address this gap by providing a unified, broad and fresh review of methodological and practical issues to consider when debating the use of RAR in clinical trials. △ Less

Submitted 7 June, 2022; v1 submitted 1 May, 2020; originally announced May 2020.

Comments: Update in response to editor comments

MSC Class: 62-02

arXiv:2003.11156 [pdf, other]

doi 10.1121/10.0001728

Seabed classification using physics-based modeling and machine learning

Authors: Christina Frederick, Soledad Villar, Zoi-Heleni Michalopoulou

Abstract: In this work model-based methods are employed along with machine learning techniques to classify sediments in oceanic environments based on the geoacoustic properties of a two-layer seabed. Two different scenarios are investigated. First, a simple low-frequency case is set up, where the acoustic field is modeled with normal modes. Four different hypotheses are made for seafloor sediment possibilit… ▽ More In this work model-based methods are employed along with machine learning techniques to classify sediments in oceanic environments based on the geoacoustic properties of a two-layer seabed. Two different scenarios are investigated. First, a simple low-frequency case is set up, where the acoustic field is modeled with normal modes. Four different hypotheses are made for seafloor sediment possibilities and these are explored using both various machine learning techniques and a simple matched-field approach. For most noise levels, the latter has an inferior performance to the machine learning methods. Second, the high-frequency model of the scattering from a rough, two-layer seafloor is considered. Again, four different sediment possibilities are classified with machine learning. For higher accuracy, 1D Convolutional Neural Networks (CNNs) are employed. In both cases we see that the machine learning methods, both in simple and more complex formulations, lead to effective sediment characterization. Our results assess the robustness to noise and model misspecification of different classifiers. △ Less

Submitted 24 March, 2020; originally announced March 2020.

arXiv:2002.04025 [pdf, other]

Can Graph Neural Networks Count Substructures?

Authors: Zhengdao Chen, Lei Chen, Soledad Villar, Joan Bruna

Abstract: The ability to detect and count certain substructures in graphs is important for solving many tasks on graph-structured data, especially in the contexts of computational chemistry and biology as well as social network analysis. Inspired by this, we propose to study the expressive power of graph neural networks (GNNs) via their ability to count attributed graph substructures, extending recent works… ▽ More The ability to detect and count certain substructures in graphs is important for solving many tasks on graph-structured data, especially in the contexts of computational chemistry and biology as well as social network analysis. Inspired by this, we propose to study the expressive power of graph neural networks (GNNs) via their ability to count attributed graph substructures, extending recent works that examine their power in graph isomorphism testing and function approximation. We distinguish between two types of substructure counting: induced-subgraph-count and subgraph-count, and establish both positive and negative answers for popular GNN architectures. Specifically, we prove that Message Passing Neural Networks (MPNNs), 2-Weisfeiler-Lehman (2-WL) and 2-Invariant Graph Networks (2-IGNs) cannot perform induced-subgraph-count of substructures consisting of 3 or more nodes, while they can perform subgraph-count of star-shaped substructures. As an intermediary step, we prove that 2-WL and 2-IGNs are equivalent in distinguishing non-isomorphic graphs, partly answering an open problem raised in Maron et al. (2019). We also prove positive results for k-WL and k-IGNs as well as negative results for k-WL with a finite number of iterations. We then conduct experiments that support the theoretical results for MPNNs and 2-IGNs. Moreover, motivated by substructure counting and inspired by Murphy et al. (2019), we propose the Local Relational Pooling model and demonstrate that it is not only effective for substructure counting but also able to achieve competitive performance on molecular prediction tasks. △ Less

Submitted 28 October, 2020; v1 submitted 10 February, 2020; originally announced February 2020.

Comments: Improved the descriptions of the Local Relational Pooling (LRP) model and its practical implementation; Added more experimental results on synthetic and molecular datasets; Added the LRP-l-1 model in the experiments

Journal ref: 34th Conference on Neural Information Processing Systems (NeurIPS 2020)

arXiv:2001.01666 [pdf, other]

MREC: a fast and versatile framework for aligning and matching point clouds with applications to single cell molecular data

Authors: Andrew J. Blumberg, Mathieu Carriere, Michael A. Mandell, Raul Rabadan, Soledad Villar

Abstract: Comparing and aligning large datasets is a pervasive problem occurring across many different knowledge domains. We introduce and study MREC, a recursive decomposition algorithm for computing matchings between data sets. The basic idea is to partition the data, match the partitions, and then recursively match the points within each pair of identified partitions. The matching itself is done using bl… ▽ More Comparing and aligning large datasets is a pervasive problem occurring across many different knowledge domains. We introduce and study MREC, a recursive decomposition algorithm for computing matchings between data sets. The basic idea is to partition the data, match the partitions, and then recursively match the points within each pair of identified partitions. The matching itself is done using black box matching procedures that are too expensive to run on the entire data set. Using an absolute measure of the quality of a matching, the framework supports optimization over parameters including partitioning procedures and matching algorithms. By design, MREC can be applied to extremely large data sets. We analyze the procedure to describe when we can expect it to work well and demonstrate its flexibility and power by applying it to a number of alignment problems arising in the analysis of single cell molecular data. △ Less

Submitted 20 February, 2020; v1 submitted 6 January, 2020; originally announced January 2020.

arXiv:1908.05767 [pdf, ps, other]

Experimental performance of graph neural networks on random instances of max-cut

Authors: Weichi Yao, Afonso S. Bandeira, Soledad Villar

Abstract: This note explores the applicability of unsupervised machine learning techniques towards hard optimization problems on random inputs. In particular we consider Graph Neural Networks (GNNs) -- a class of neural networks designed to learn functions on graphs -- and we apply them to the max-cut problem on random regular graphs. We focus on the max-cut problem on random regular graphs because it is a… ▽ More This note explores the applicability of unsupervised machine learning techniques towards hard optimization problems on random inputs. In particular we consider Graph Neural Networks (GNNs) -- a class of neural networks designed to learn functions on graphs -- and we apply them to the max-cut problem on random regular graphs. We focus on the max-cut problem on random regular graphs because it is a fundamental problem that has been widely studied. In particular, even though there is no known explicit solution to compare the output of our algorithm to, we can leverage the known asymptotics of the optimal max-cut value in order to evaluate the performance of the GNNs. In order to put the performance of the GNNs in context, we compare it with the classical semidefinite relaxation approach by Goemans and Williamson~(SDP), and with extremal optimization, which is a local optimization heuristic from the statistical physics literature. The numerical results we obtain indicate that, surprisingly, Graph Neural Networks attain comparable performance to the Goemans and Williamson SDP. We also observe that extremal optimization consistently outperforms the other two methods. Furthermore, the performances of the three methods present similar patterns, that is, for sparser, and for larger graphs, the size of the found cuts are closer to the asymptotic optimal max-cut value. △ Less

Submitted 15 August, 2019; originally announced August 2019.

arXiv:1905.12560 [pdf, other]

On the equivalence between graph isomorphism testing and function approximation with GNNs

Authors: Zhengdao Chen, Soledad Villar, Lei Chen, Joan Bruna

Abstract: Graph Neural Networks (GNNs) have achieved much success on graph-structured data. In light of this, there have been increasing interests in studying their expressive power. One line of work studies the capability of GNNs to approximate permutation-invariant functions on graphs, and another focuses on the their power as tests for graph isomorphism. Our work connects these two perspectives and prove… ▽ More Graph Neural Networks (GNNs) have achieved much success on graph-structured data. In light of this, there have been increasing interests in studying their expressive power. One line of work studies the capability of GNNs to approximate permutation-invariant functions on graphs, and another focuses on the their power as tests for graph isomorphism. Our work connects these two perspectives and proves their equivalence. We further develop a framework of the expressive power of GNNs that incorporates both of these viewpoints using the language of sigma-algebra, through which we compare the expressive power of different types of GNNs together with other graph isomorphism tests. In particular, we prove that the second-order Invariant Graph Network fails to distinguish non-isomorphic regular graphs with the same degree. Then, we extend it to a new architecture, Ring-GNN, which succeeds in distinguishing these graphs and achieves good performances on real-world datasets. △ Less

Submitted 10 February, 2023; v1 submitted 29 May, 2019; originally announced May 2019.

Comments: Strengthened Theorem 4 with a modified proof; Updated Figure 2 to include results from the later literature; Made other minor edits to improve clarity. 22 pages

Journal ref: Original version published at the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019)

arXiv:1812.07377 [pdf, other]

Utility Ghost: Gamified redistricting with partisan symmetry

Authors: Dustin G. Mixon, Soledad Villar

Abstract: Inspired by the word game Ghost, we propose a new protocol for bipartisan redistricting in which partisan players take turns assigning precincts to districts. We prove that in an idealized setting, if both parties have the same number votes, then under optimal play in our protocol, both parties win the same number of seats. We also evaluate our protocol in more realistic settings that show how our… ▽ More Inspired by the word game Ghost, we propose a new protocol for bipartisan redistricting in which partisan players take turns assigning precincts to districts. We prove that in an idealized setting, if both parties have the same number votes, then under optimal play in our protocol, both parties win the same number of seats. We also evaluate our protocol in more realistic settings that show how our game nearly eliminates the first-player advantage exhibited in other redistricting protocols. △ Less

Submitted 14 December, 2018; originally announced December 2018.

arXiv:1812.02768 [pdf, other]

SqueezeFit: Label-aware dimensionality reduction by semidefinite programming

Authors: Culver McWhirter, Dustin G. Mixon, Soledad Villar

Abstract: Given labeled points in a high-dimensional vector space, we seek a low-dimensional subspace such that projecting onto this subspace maintains some prescribed distance between points of differing labels. Intended applications include compressive classification. Taking inspiration from large margin nearest neighbor classification, this paper introduces a semidefinite relaxation of this problem. Unli… ▽ More Given labeled points in a high-dimensional vector space, we seek a low-dimensional subspace such that projecting onto this subspace maintains some prescribed distance between points of differing labels. Intended applications include compressive classification. Taking inspiration from large margin nearest neighbor classification, this paper introduces a semidefinite relaxation of this problem. Unlike its predecessors, this relaxation is amenable to theoretical analysis, allowing us to provably recover a planted projection operator from the data. △ Less

Submitted 6 December, 2018; originally announced December 2018.

arXiv:1808.08905 [pdf, other]

Fair redistricting is hard

Authors: Richard Kueng, Dustin G. Mixon, Soledad Villar

Abstract: Gerrymandering is a long-standing issue within the U.S. political system, and it has received scrutiny recently by the U.S. Supreme Court. In this note, we prove that deciding whether there exists a fair redistricting among legal maps is NP-hard. To make this precise, we use simplified notions of "legal" and "fair" that account for desirable traits such as geographic compactness of districts and s… ▽ More Gerrymandering is a long-standing issue within the U.S. political system, and it has received scrutiny recently by the U.S. Supreme Court. In this note, we prove that deciding whether there exists a fair redistricting among legal maps is NP-hard. To make this precise, we use simplified notions of "legal" and "fair" that account for desirable traits such as geographic compactness of districts and sufficient representation of voters. The proof of our result is inspired by the work of Mahanjan, Minbhorkar and Varadarajan that proves that planar k-means is NP-hard. △ Less

Submitted 27 August, 2018; originally announced August 2018.

arXiv:1803.09319 [pdf, other]

SUNLayer: Stable denoising with generative networks

Authors: Dustin G. Mixon, Soledad Villar

Abstract: It has been experimentally established that deep neural networks can be used to produce good generative models for real world data. It has also been established that such generative models can be exploited to solve classical inverse problems like compressed sensing and super resolution. In this work we focus on the classical signal processing problem of image denoising. We propose a theoretical se… ▽ More It has been experimentally established that deep neural networks can be used to produce good generative models for real world data. It has also been established that such generative models can be exploited to solve classical inverse problems like compressed sensing and super resolution. In this work we focus on the classical signal processing problem of image denoising. We propose a theoretical setting that uses spherical harmonics to identify what mathematical properties of the activation functions will allow signal denoising with local methods. △ Less

Submitted 25 March, 2018; originally announced March 2018.

arXiv:1712.01756 [pdf, other]

doi 10.1103/PhysRevLett.121.073601

Hexapartite entanglement in an above-threshold Optical Parametric Oscillator

Authors: F. A. S. Barbosa, A. S. Coelho, L. F. Muñoz Martínez, L. Ortiz-Gutiérrez, A. S. Villar, P. Nussenzveig, M. Martinelli

Abstract: We demonstrate, theoretically and experimentally, the generation of hexapartite modal entanglement by the optical parametric oscillator (OPO) operating above the oscillation threshold. We show that the OPO generates a rich structure of entanglement among sets of six optical sideband modes interacting through the non-linear crystal. The class of quantum states thus produced can be controlled by a s… ▽ More We demonstrate, theoretically and experimentally, the generation of hexapartite modal entanglement by the optical parametric oscillator (OPO) operating above the oscillation threshold. We show that the OPO generates a rich structure of entanglement among sets of six optical sideband modes interacting through the non-linear crystal. The class of quantum states thus produced can be controlled by a single parameter, the power of the external laser that pumps the system. Our platform allows for the generation of massive entanglement among many optical modes with well defined but vastly different frequencies, potentially bridging nodes of a multicolor quantum network. △ Less

Submitted 28 June, 2018; v1 submitted 5 December, 2017; originally announced December 2017.

Comments: 9 pages, 12 figures

Journal ref: Phys. Rev. Lett. 121, 073601 (2018)

arXiv:1710.02905 [pdf, other]

doi 10.1103/PhysRevA.98.023823

Exploring six modes of an optical parametric oscillator

Authors: Luis F. Muñoz-Martínez, Felippe Alexandre Silva Barbosa, Antônio Sales Coelho, Luis Ortiz-Gutiérrez, Marcelo Martinelli, Paulo Nussenzveig, Alessandro S. Villar

Abstract: We measure the complete quantum state for six modes of the electromagnetic field produced by an optical parametric oscillator. The investigation involves the sideband of the intense pump, signal, and idler fields generated by stimulated parametric downconversion inside a triply resonant optical resonator. We develop a theoretical model to successfully interpret the experimental results. The model… ▽ More We measure the complete quantum state for six modes of the electromagnetic field produced by an optical parametric oscillator. The investigation involves the sideband of the intense pump, signal, and idler fields generated by stimulated parametric downconversion inside a triply resonant optical resonator. We develop a theoretical model to successfully interpret the experimental results. The model takes into account the coupling of the field modes to the phonon bath of the nonlinear crystal, clearly showing the roles of different physical effects in shaping the structure of the quantum correlations between the six optical modes. △ Less

Submitted 20 February, 2018; v1 submitted 8 October, 2017; originally announced October 2017.

Comments: 11 pages, 5 figures

Journal ref: Phys. Rev. A 98, 023823 (2018)

arXiv:1710.00956 [pdf, other]

Monte Carlo approximation certificates for k-means clustering

Authors: Dustin G. Mixon, Soledad Villar

Abstract: Efficient algorithms for $k$-means clustering frequently converge to suboptimal partitions, and given a partition, it is difficult to detect $k$-means optimality. In this paper, we develop an a posteriori certifier of approximate optimality for $k$-means clustering. The certifier is a sub-linear Monte Carlo algorithm based on Peng and Wei's semidefinite relaxation of $k$-means. In particular, solv… ▽ More Efficient algorithms for $k$-means clustering frequently converge to suboptimal partitions, and given a partition, it is difficult to detect $k$-means optimality. In this paper, we develop an a posteriori certifier of approximate optimality for $k$-means clustering. The certifier is a sub-linear Monte Carlo algorithm based on Peng and Wei's semidefinite relaxation of $k$-means. In particular, solving the relaxation for small random samples of the dataset produces a high-confidence lower bound on the $k$-means objective, and being sub-linear, our algorithm is faster than $k$-means++ when the number of data points is large. We illustrate the performance of our algorithm with both numerical experiments and a performance guarantee: If the data points are drawn independently from any mixture of two Gaussians over $\mathbb{R}^m$ with identity covariance, then with probability $1-O(1/m)$, our $\operatorname{poly}(m)$-time algorithm produces a 3-approximation certificate with 99% confidence. △ Less

Submitted 2 October, 2017; originally announced October 2017.

Comments: 8 pages

Showing 1–50 of 82 results for author: Villar, S