BayesFLo: Bayesian fault localization of complex software systems

Authors: Yi Ji, Simon Mak, Ryan Lekivetz, Joseph Morgan

Abstract: Software testing is essential for the reliable development of complex software systems. A key step in software testing is fault localization, which uses test data to pinpoint failure-inducing combinations for further diagnosis. Existing fault localization methods, however, are largely deterministic, and thus do not provide a principled approach for assessing probabilistic risk of potential root causes, or for integrating domain and/or structural knowledge from test engineers. To address this, we propose a novel Bayesian fault localization framework called BayesFLo, which leverages a flexible Bayesian model on potential root cause combinations. A key feature of BayesFLo is its integration of the principles of combination hierarchy and heredity, which capture the structured nature of failure-inducing combinations. A critical challenge, however, is the sheer number of potential root cause scenarios to consider, which renders the computation of posterior root cause probabilities infeasible even for small software systems. We thus develop new algorithms for efficient computation of such probabilities, leveraging recent tools from integer programming and graph representations. We then demonstrate the effectiveness of BayesFLo over state-of-the-art fault localization methods, in a suite of numerical experiments and in two motivating case studies on the JMP XGBoost interface.

Submitted 12 March, 2024; originally announced March 2024.

arXiv:2403.03816 [pdf, other]

Targeted Variance Reduction: Robust Bayesian Optimization of Black-Box Simulators with Noise Parameters

Authors: John Joshua Miller, Simon Mak

Abstract: The optimization of a black-box simulator over control parameters $\mathbf{x}$ arises in a myriad of scientific applications. In such applications, the simulator often takes the form $f(\mathbf{x},\boldsymbol{\theta})$, where $\boldsymbol{\theta}$ are parameters that are uncertain in practice. Robust optimization aims to optimize the objective $\mathbb{E}[f(\mathbf{x},\boldsymbol{\Theta})]$, where $\boldsymbol{\Theta} \sim \mathcal{P}$ is a random variable that models uncertainty on $\boldsymbol{\theta}$. For this, existing black-box methods typically employ a two-stage approach for selecting the next point $(\mathbf{x},\boldsymbol{\theta})$, where $\mathbf{x}$ and $\boldsymbol{\theta}$ are optimized separately via different acquisition functions. As such, these approaches do not employ a joint acquisition over $(\mathbf{x},\boldsymbol{\theta})$, and thus may fail to fully exploit control-to-noise interactions for effective robust optimization. To address this, we propose a new Bayesian optimization method called Targeted Variance Reduction (TVR). The TVR leverages a novel joint acquisition function over $(\mathbf{x},\boldsymbol{\theta})$, which targets variance reduction on the objective within the desired region of improvement. Under a Gaussian process surrogate on $f$, the TVR acquisition can be evaluated in closed form, and reveals an insightful exploration-exploitation-precision trade-off for robust black-box optimization. The TVR can further accommodate a broad class of non-Gaussian distributions on $\mathcal{P}$ via a careful integration of normalizing flows. We demonstrate the improved performance of TVR over the state-of-the-art in a suite of numerical experiments and an application to the robust design of automobile brake discs under operational uncertainty.

Submitted 6 March, 2024; originally announced March 2024.

arXiv:2311.08752 [pdf, other]

ProSpar-GP: scalable Gaussian process modeling with massive non-stationary datasets

Authors: Kevin Li, Simon Mak

Abstract: Gaussian processes (GPs) are a popular class of Bayesian nonparametric models, but their training can be computationally burdensome for massive training datasets. While there has been notable work on scaling up these models for big data, existing methods typically rely on a stationary GP assumption for approximation, and can thus perform poorly when the underlying response surface is non-stationary, i.e., it has some regions of rapid change and other regions with little change. Such non-stationarity is, however, ubiquitous in real-world problems, including our motivating application for surrogate modeling of computer experiments. We thus propose a new Product of Sparse GP (ProSpar-GP) method for scalable GP modeling with massive non-stationary data. The ProSpar-GP makes use of a carefully-constructed product-of-experts formulation of sparse GP experts, where different experts are placed within local regions of non-stationarity. These GP experts are fit via a novel variational inference approach, which capitalizes on mini-batching and GPU acceleration for efficient optimization of inducing points and length-scale parameters for each expert. We further show that the ProSpar-GP is Kolmogorov-consistent, in that its generative distribution defines a valid stochastic process over the prediction space; such a property provides essential stability for variational inference, particularly in the presence of non-stationarity. We then demonstrate the improved performance of the ProSpar-GP over the state-of-the-art, in a suite of numerical experiments and an application for surrogate modeling of a satellite drag simulator.

Submitted 15 November, 2023; originally announced November 2023.

arXiv:2310.19787 [pdf]

$e^{\text{RPCA}}$: Robust Principal Component Analysis for Exponential Family Distributions

Authors: Xiaojun Zheng, Simon Mak, Liyan Xie, Yao Xie

Abstract: Robust Principal Component Analysis (RPCA) is a widely used method for recovering low-rank structure from data matrices corrupted by significant and sparse outliers. These corruptions may arise from occlusions, malicious tampering, or other causes for anomalies, and the joint identification of such corruptions with low-rank background is critical for process monitoring and diagnosis. However, existing RPCA methods and their extensions largely do not account for the underlying probabilistic distribution for the data matrices, which in many applications are known and can be highly non-Gaussian. We thus propose a new method called Robust Principal Component Analysis for Exponential Family distributions ($e^{\text{RPCA}}$), which can perform the desired decomposition into low-rank and sparse matrices when such a distribution falls within the exponential family. We present a novel alternating direction method of multipliers optimization algorithm for efficient $e^{\text{RPCA}}$ decomposition. The effectiveness of $e^{\text{RPCA}}$ is then demonstrated in two applications: the first for steel sheet defect detection, and the second for crime activity monitoring in the Atlanta metropolitan area.

Submitted 30 October, 2023; originally announced October 2023.

arXiv:2310.14544 [pdf, other]

Trigonometric Quadrature Fourier Features for Scalable Gaussian Process Regression

Authors: Kevin Li, Max Balakirsky, Simon Mak

Abstract: Fourier feature approximations have been successfully applied in the literature for scalable Gaussian Process (GP) regression. In particular, Quadrature Fourier Features (QFF) derived from Gaussian quadrature rules have gained popularity in recent years due to their improved approximation accuracy and better calibrated uncertainty estimates compared to Random Fourier Feature (RFF) methods. However, a key limitation of QFF is that its performance can suffer from well-known pathologies related to highly oscillatory quadrature, resulting in mediocre approximation with limited features. We address this critical issue via a new Trigonometric Quadrature Fourier Feature (TQFF) method, which uses a novel non-Gaussian quadrature rule specifically tailored for the desired Fourier transform. We derive an exact quadrature rule for TQFF, along with kernel approximation error bounds for the resulting feature map. We then demonstrate the improved performance of our method over RFF and Gaussian QFF in a suite of numerical experiments and applications, and show the TQFF enjoys accurate GP approximations over a broad range of length-scales using fewer features.

Submitted 22 October, 2023; originally announced October 2023.
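
The Random Fourier Feature (RFF) baseline that the TQFF abstract above compares against can be sketched in a few lines. This is the standard RFF construction for a squared-exponential kernel, not the QFF or TQFF quadrature rule from the paper; the function name `rff_features`, the feature count, and the length-scale are illustrative assumptions.

```python
import numpy as np

def rff_features(X, num_features, lengthscale, rng):
    """Random Fourier features for the kernel exp(-||x - y||^2 / (2 * lengthscale^2)).

    Hypothetical helper for illustration only.
    """
    d = X.shape[1]
    # Frequencies are drawn from the kernel's spectral density, N(0, I / lengthscale^2).
    W = rng.normal(scale=1.0 / lengthscale, size=(d, num_features))
    # Random phases, uniform on [0, 2*pi).
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
Z = rff_features(X, num_features=20000, lengthscale=1.0, rng=rng)

# Inner products of the features approximate the exact kernel matrix.
K_approx = Z @ Z.T
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_exact = np.exp(-0.5 * sq_dists)
max_err = np.abs(K_approx - K_exact).max()
```

Because the frequencies are sampled at random, the approximation error of this baseline shrinks only at a Monte Carlo rate in the number of features, which is the cost that quadrature-based feature maps aim to reduce.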

arXiv:2306.07480 [pdf, other]

ACE: Active Learning for Causal Inference with Expensive Experiments

Authors: Difan Song, Simon Mak, C. F. Jeff Wu

Abstract: Experiments are the gold standard for causal inference. In many applications, experimental units can often be recruited or chosen sequentially, and the adaptive execution of such experiments may offer greatly improved inference of causal quantities over non-adaptive approaches, particularly when experiments are expensive. We thus propose a novel active learning method called ACE (Active learning for Causal inference with Expensive experiments), which leverages Gaussian process modeling of the conditional mean functions to guide an informed sequential design of costly experiments. In particular, we develop new acquisition functions for sequential design via the minimization of the posterior variance of a desired causal estimand. Our approach facilitates targeted learning of a variety of causal estimands, such as the average treatment effect (ATE), the average treatment effect on the treated (ATTE), and individualized treatment effects (ITE), and can be used for adaptive selection of an experimental unit and/or the applied treatment. We then demonstrate in a suite of numerical experiments the improved performance of ACE over baseline methods for estimating causal estimands given a limited number of experiments.

Submitted 12 June, 2023; originally announced June 2023.

Comments: 6 pages, 4 figures

arXiv:2306.07299 [pdf, other]

Additive Multi-Index Gaussian process modeling, with application to multi-physics surrogate modeling of the quark-gluon plasma

Authors: Kevin Li, Simon Mak, J. -F Paquet, Steffen A. Bass

Abstract: The Quark-Gluon Plasma (QGP) is a unique phase of nuclear matter, theorized to have filled the Universe shortly after the Big Bang. A critical challenge in studying the QGP is that, to reconcile experimental observables with theoretical parameters, one requires many simulation runs of a complex physics model over a high-dimensional parameter space. Each run is computationally very expensive, requiring thousands of CPU hours, thus limiting physicists to only several hundred runs. Given limited training data for high-dimensional prediction, existing surrogate models often yield poor predictions with high predictive uncertainties, leading to imprecise scientific findings. To address this, we propose a new Additive Multi-Index Gaussian process (AdMIn-GP) model, which leverages a flexible additive structure on low-dimensional embeddings of the parameter space. This is guided by prior scientific knowledge that the QGP is dominated by multiple distinct physical phenomena (i.e., multiphysics), each involving a small number of latent parameters. The AdMIn-GP models such embedded structures within a flexible Bayesian nonparametric framework, which facilitates efficient model fitting via a carefully constructed variational inference approach with inducing points. We show the effectiveness of the AdMIn-GP via a suite of numerical experiments and our QGP application, where we demonstrate considerably improved surrogate modeling performance over existing models.

Submitted 10 June, 2023; originally announced June 2023.

arXiv:2302.00755 [pdf, other]

Hierarchical shrinkage Gaussian processes: applications to computer code emulation and dynamical system recovery

Authors: Tao Tang, Simon Mak, David Dunson

Abstract: In many areas of science and engineering, computer simulations are widely used as proxies for physical experiments, which can be infeasible or unethical. Such simulations can often be computationally expensive, and an emulator can be trained to efficiently predict the desired response surface. A widely-used emulator is the Gaussian process (GP), which provides a flexible framework for efficient prediction and uncertainty quantification. Standard GPs, however, do not capture structured sparsity on the underlying response surface, which is present in many applications, particularly in the physical sciences. We thus propose a new hierarchical shrinkage GP (HierGP), which incorporates such structure via cumulative shrinkage priors within a GP framework. We show that the HierGP implicitly embeds the well-known principles of effect sparsity, heredity and hierarchy for analysis of experiments, which allows our model to identify structured sparse features from the response surface with limited data. We propose efficient posterior sampling algorithms for model training and prediction, and prove desirable consistency properties for the HierGP. Finally, we demonstrate the improved performance of HierGP over existing models, in a suite of numerical experiments and an application to dynamical system recovery.

Submitted 1 February, 2023; originally announced February 2023.

arXiv:2211.00268 [pdf, other]

Stacking designs: designing multi-fidelity computer experiments with target predictive accuracy

Authors: Chih-Li Sung, Yi Ji, Simon Mak, Wenjia Wang, Tao Tang

Abstract: In an era where scientific experiments can be very costly, multi-fidelity emulators provide a useful tool for cost-efficient predictive scientific computing. For scientific applications, the experimenter is often limited by a tight computational budget, and thus wishes to (i) maximize predictive power of the multi-fidelity emulator via a careful design of experiments, and (ii) ensure this model achieves a desired error tolerance with some notion of confidence. Existing design methods, however, do not jointly tackle objectives (i) and (ii). We propose a novel stacking design approach that addresses both goals. A multi-level reproducing kernel Hilbert space (RKHS) interpolator is first introduced to build the emulator, under which our stacking design provides a sequential approach for designing multi-fidelity runs such that a desired prediction error of $\epsilon > 0$ is met under regularity assumptions. We then prove a novel cost complexity theorem that, under this multi-level interpolator, establishes a bound on the computation cost (for training data simulation) needed to achieve a prediction bound of $\epsilon$. This result provides novel insights on conditions under which the proposed multi-fidelity approach improves upon a conventional RKHS interpolator which relies on a single fidelity level. Finally, we demonstrate the effectiveness of stacking designs in a suite of simulation experiments and an application to finite element analysis.

Submitted 27 October, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

arXiv:2209.13748 [pdf, other]

Conglomerate Multi-Fidelity Gaussian Process Modeling, with Application to Heavy-Ion Collisions

Authors: Yi Ji, Henry Shaowu Yuchi, Derek Soeder, J. -F. Paquet, Steffen A. Bass, V. Roshan Joseph, C. F. Jeff Wu, Simon Mak

Abstract: In an era where scientific experimentation is often costly, multi-fidelity emulation provides a powerful tool for predictive scientific computing. While there has been notable work on multi-fidelity modeling, existing models do not incorporate an important "conglomerate" property of multi-fidelity simulators, where the accuracies of different simulator components are controlled by different fidelity parameters. Such conglomerate simulators are widely encountered in complex nuclear physics and astrophysics applications. We thus propose a new CONglomerate multi-FIdelity Gaussian process (CONFIG) model, which embeds this conglomerate structure within a novel non-stationary covariance function. We show that the proposed CONFIG model can capture prior knowledge on the numerical convergence of conglomerate simulators, which allows for cost-efficient emulation of multi-fidelity systems. We demonstrate the improved predictive performance of CONFIG over state-of-the-art models in a suite of numerical experiments and two applications, the first for emulation of cantilever beam deflection and the second for emulating the evolution of the quark-gluon plasma, which was theorized to have filled the Universe shortly after the Big Bang.

Submitted 28 September, 2023; v1 submitted 27 September, 2022; originally announced September 2022.

arXiv:2203.04246 [pdf, other]

PERCEPT: a new online change-point detection method using topological data analysis

Authors: Xiaojun Zheng, Simon Mak, Liyan Xie, Yao Xie

Abstract: Topological data analysis (TDA) provides a set of data analysis tools for extracting embedded topological structures from complex high-dimensional datasets. In recent years, TDA has been a rapidly growing field which has found success in a wide range of applications, including signal processing, neuroscience and network analysis. In these applications, the online detection of changes is of crucial importance, but this can be highly challenging since such changes often occur in a low-dimensional embedding within high-dimensional data streams. We thus propose a new method, called PERsistence diagram-based ChangE-PoinT detection (PERCEPT), which leverages the learned topological structure from TDA to sequentially detect changes. PERCEPT follows two key steps: it first learns the embedded topology as a point cloud via persistence diagrams, then applies a non-parametric monitoring approach for detecting changes in the resulting point cloud distributions. This yields a non-parametric, topology-aware framework which can efficiently detect online changes from high-dimensional data streams. We investigate the effectiveness of PERCEPT over existing methods in a suite of numerical experiments where the data streams have an embedded topological structure. We then demonstrate the usefulness of PERCEPT in two applications in solar flare monitoring and human gesture detection.

Submitted 8 March, 2022; originally announced March 2022.

arXiv:2109.07623 [pdf, other]

BacHMMachine: An Interpretable and Scalable Model for Algorithmic Harmonization for Four-part Baroque Chorales

Authors: Yunyao Zhu, Stephen Hahn, Simon Mak, Yue Jiang, Cynthia Rudin

Abstract: Algorithmic harmonization - the automated harmonization of a musical piece given its melodic line - is a challenging problem that has garnered much interest from both music theorists and computer scientists. One genre of particular interest is the four-part Baroque chorales of J.S. Bach. Methods for algorithmic chorale harmonization typically adopt a black-box, "data-driven" approach: they do not explicitly integrate principles from music theory but rely on a complex learning model trained with a large amount of chorale data. We propose instead a new harmonization model, called BacHMMachine, which employs a "theory-driven" framework guided by music composition principles, along with a "data-driven" model for learning compositional features within this framework. As its name suggests, BacHMMachine uses a novel Hidden Markov Model based on key and chord transitions, providing a probabilistic framework for learning key modulations and chordal progressions from a given melodic line. This allows for the generation of creative, yet musically coherent chorale harmonizations; integrating compositional principles allows for a much simpler model that results in vast decreases in computational burden and greater interpretability compared to state-of-the-art algorithmic harmonization methods, at no penalty to quality of harmonization or musicality. We demonstrate this improvement via comprehensive experiments and Turing tests comparing BacHMMachine to existing methods.

Submitted 22 February, 2022; v1 submitted 15 September, 2021; originally announced September 2021.

Comments: 7 pages, 7 figures

arXiv:2108.00306 [pdf, other]

A graphical multi-fidelity Gaussian process model, with application to emulation of heavy-ion collisions

Authors: Yi Ji, Simon Mak, Derek Soeder, J-F Paquet, Steffen A. Bass

Abstract: With advances in scientific computing and mathematical modeling, complex scientific phenomena such as galaxy formations and rocket propulsion can now be reliably simulated. Such simulations can however be very time-intensive, requiring millions of CPU hours to perform. One solution is multi-fidelity emulation, which uses data of different fidelities to train an efficient predictive model which emulates the expensive simulator. For complex scientific problems and with careful elicitation from scientists, such multi-fidelity data may often be linked by a directed acyclic graph (DAG) representing its scientific model dependencies. We thus propose a new Graphical Multi-fidelity Gaussian Process (GMGP) model, which embeds this DAG structure (capturing scientific dependencies) within a Gaussian process framework. We show that the GMGP has desirable modeling traits via two Markov properties, and admits a scalable algorithm for recursive computation of the posterior mean and variance at each depth level of the DAG. We also present a novel experimental design methodology over the DAG given an experimental budget, and propose a nonlinear extension of the GMGP via deep Gaussian processes. The advantages of the GMGP are then demonstrated via a suite of numerical experiments and an application to emulation of heavy-ion collisions, which can be used to study the conditions of matter in the Universe shortly after the Big Bang. The proposed model has broader uses in data fusion applications with graphical structure, which we further discuss.

Submitted 27 February, 2024; v1 submitted 31 July, 2021; originally announced August 2021.

arXiv:2107.04668 [pdf, other]

Gaussian Process Subspace Regression for Model Reduction

Authors: Ruda Zhang, Simon Mak, David Dunson

Abstract: Subspace-valued functions arise in a wide range of problems, including parametric reduced order modeling (PROM). In PROM, each parameter point can be associated with a subspace, which is used for Petrov-Galerkin projections of large system matrices. Previous efforts to approximate such functions use interpolations on manifolds, which can be inaccurate and slow. To tackle this, we propose a novel Bayesian nonparametric model for subspace prediction: the Gaussian Process Subspace regression (GPS) model. This method is extrinsic and intrinsic at the same time: with multivariate Gaussian distributions on the Euclidean space, it induces a joint probability model on the Grassmann manifold, the set of fixed-dimensional subspaces. The GPS adopts a simple yet general correlation structure, and a principled approach for model selection. Its predictive distribution admits an analytical form, which allows for efficient subspace prediction over the parameter space. For PROM, the GPS provides a probabilistic prediction at a new parameter point that retains the accuracy of local reduced models, at a computational complexity that does not depend on system dimension, and thus is suitable for online computation. We give four numerical examples to compare our method to subspace interpolation, as well as two methods that interpolate local reduced models. Overall, GPS is the most data efficient, more computationally efficient than subspace interpolation, and gives smooth predictions with uncertainty quantification.

Submitted 9 July, 2021; originally announced July 2021.

Comments: 20 pages, 4 figures; with supplementary material

MSC Class: 14M15; 35B30; 37M99; 53-04; 60B20; 60G15

arXiv:2103.00117 [pdf, other]

Online High-Dimensional Change-Point Detection using Topological Data Analysis

Authors: Xiaojun Zheng, Simon Mak, Yao Xie

Abstract: Topological Data Analysis (TDA) is a rapidly growing field, which studies methods for learning underlying topological structures present in complex data representations. TDA methods have found recent success in extracting useful geometric structures for a wide range of applications, including protein classification, neuroscience, and time-series analysis. However, in many such applications, one is also interested in sequentially detecting changes in this topological structure. We propose a new method called Persistence Diagram based Change-Point (PD-CP), which tackles this problem by integrating the widely-used persistence diagrams in TDA with recent developments in nonparametric change-point detection. The key novelty in PD-CP is that it leverages the distribution of points on persistence diagrams for online detection of topological changes. We demonstrate the effectiveness of PD-CP in an application to solar flare monitoring.

Submitted 7 March, 2021; v1 submitted 26 February, 2021; originally announced March 2021.

arXiv:2102.05724 [pdf, other]

Sequential change-point detection for mutually exciting point processes over networks

Authors: Haoyun Wang, Liyan Xie, Yao Xie, Alex Cuozzo, Simon Mak

Abstract: We present a new CUSUM procedure for sequentially detecting change-points in self- and mutually-exciting processes, a.k.a. Hawkes networks, using discrete event data. Hawkes networks have become a popular model for statistics and machine learning due to their capability in modeling irregularly observed data where the timing between events carries a lot of information. The problem of detecting abrupt changes in Hawkes networks arises from various applications, including neuronal imaging, sensor network, and social network monitoring. Despite this, there has not been a computationally and memory-efficient online algorithm for detecting such changes from sequential data. We present an efficient online recursive implementation of the CUSUM statistic for Hawkes processes, both decentralized and memory-efficient, and establish the theoretical properties of this new CUSUM procedure. We then show that the proposed CUSUM method achieves better performance than existing methods, including the Shewhart procedure based on count data, the generalized likelihood ratio (GLR) in the existing literature, and the standard score statistic. We demonstrate this via a simulated example and an application to population code change-detection in neuronal networks.

Submitted 4 March, 2022; v1 submitted 10 February, 2021; originally announced February 2021.

Comments: 33 pages, 13 figures

arXiv:2101.06592 [pdf, other]

TSEC: a framework for online experimentation under experimental constraints

Authors: Simon Mak, Yuanshuo Zhou, Lavonne Hoang, C. F. Jeff Wu

Abstract: Thompson sampling is a popular algorithm for solving multi-armed bandit problems, and has been applied in a wide range of applications, from website design to portfolio optimization. In such applications, however, the number of choices (or arms) $N$ can be large, and the data needed to make adaptive decisions require expensive experimentation. One is then faced with the constraint of experimenting on only a small subset of $K \ll N$ arms within each time period, which poses a problem for traditional Thompson sampling. We propose a new Thompson Sampling under Experimental Constraints (TSEC) method, which addresses this so-called "arm budget constraint". TSEC makes use of a Bayesian interaction model with effect hierarchy priors, to model correlations between rewards on different arms. This fitted model is then integrated within Thompson sampling, to jointly identify a good subset of arms for experimentation and to allocate resources over these arms. We demonstrate the effectiveness of TSEC in two problems with arm budget constraints. The first is a simulated website optimization study, where TSEC shows noticeable improvements over industry benchmarks. The second is a portfolio optimization application on industry-based exchange-traded funds, where TSEC provides more consistent and greater wealth accumulation over standard investment strategies.
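For readers unfamiliar with the baseline that this abstract builds on, the following is a minimal sketch of traditional Thompson sampling for Bernoulli rewards with independent Beta priors. It is not the TSEC method: it has no arm budget constraint and no interaction model. The function name `thompson_sampling`, the reward rates, and the number of rounds are illustrative assumptions.

```python
import random


def thompson_sampling(true_rates, n_rounds=2000, seed=0):
    """Plain Beta-Bernoulli Thompson sampling (illustrative, not TSEC).

    true_rates holds hypothetical success probabilities, one per arm.
    Returns the number of times each arm was pulled.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Draw one plausible reward rate per arm from its Beta posterior
        # (uniform Beta(1, 1) prior updated with observed outcomes).
        draws = [
            rng.betavariate(1 + successes[a], 1 + failures[a])
            for a in range(n_arms)
        ]
        # Play the arm whose sampled rate is largest.
        arm = max(range(n_arms), key=lambda a: draws[a])
        pulls[arm] += 1
        # Observe a simulated Bernoulli reward and update the posterior counts.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


pulls = thompson_sampling([0.1, 0.5, 0.9])
```

Every arm is a candidate in every round here, which is the setting the abstract describes as problematic when $N$ is large and only $K \ll N$ arms can be tried per period.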

Submitted 17 January, 2021; originally announced January 2021.

arXiv:2101.01299 [pdf, other]

Bayesian Uncertainty Quantification for Low-Rank Matrix Completion

Authors: Henry Shaowu Yuchi, Simon Mak, Yao Xie

Abstract: We consider the problem of uncertainty quantification for an unknown low-rank matrix $\mathbf{X}$, given a partial and noisy observation of its entries. This quantification of uncertainty is essential for many real-world problems, including image processing, satellite imaging, and seismology, providing a principled framework for validating scientific conclusions and guiding decision-making. However, existing literature has mainly focused on the completion (i.e., point estimation) of the matrix $\mathbf{X}$, with little work on investigating its uncertainty. To this end, we propose in this work a new Bayesian modeling framework, called BayeSMG, which parametrizes the unknown $\mathbf{X}$ via its underlying row and column subspaces. This Bayesian subspace parametrization enables efficient posterior inference on matrix subspaces, which represent interpretable phenomena in many applications. This can then be leveraged for improved matrix recovery. We demonstrate the effectiveness of BayeSMG over existing Bayesian matrix recovery methods in numerical experiments, image inpainting, and a seismic sensor network application.

Submitted 25 March, 2022; v1 submitted 4 January, 2021; originally announced January 2021.

arXiv:2012.13769 [pdf, other]

doi 10.1080/10618600.2022.2034637

Population Quasi-Monte Carlo

Authors: Chaofan Huang, V. Roshan Joseph, Simon Mak

Abstract: Monte Carlo methods are widely used for approximating complicated, multidimensional integrals for Bayesian inference. Population Monte Carlo (PMC) is an important class of Monte Carlo methods, which utilizes a population of proposals to generate weighted samples that approximate the target distribution. The generic PMC framework iterates over three steps: samples are simulated from a set of proposals, weights are assigned to such samples to correct for mismatch between the proposal and target distributions, and the proposals are then adapted via resampling from the weighted samples. When the target distribution is expensive to evaluate, PMC is computationally limited, since its convergence rate is $\mathcal{O}(N^{-1/2})$. To address this, we propose in this paper a new Population Quasi-Monte Carlo (PQMC) framework, which integrates Quasi-Monte Carlo ideas within the sampling and adaptation steps of PMC. A key novelty in PQMC is the idea of importance support points resampling, a deterministic method for finding an "optimal" subsample from the weighted proposal samples. Moreover, within the PQMC framework, we develop an efficient covariance adaptation strategy for multivariate normal proposals. Lastly, a new set of correction weights is introduced for the weighted PMC estimator to improve its efficiency over the standard PMC estimator. We demonstrate the improved empirical convergence of PQMC over PMC in extensive numerical simulations and a friction drilling application.
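The three-step PMC loop described in this abstract (sample, weight, resample) can be sketched as follows. This is a one-dimensional toy illustration of generic PMC, not the PQMC method: the standard normal target, the Gaussian proposals with fixed scale, and the names `pmc`, `target_density` and `normal_pdf` are all assumptions made for the example.

```python
import math
import random


def target_density(x):
    """Unnormalized toy target: a standard normal density."""
    return math.exp(-0.5 * x * x)


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def pmc(n_proposals=50, n_iters=10, sd=1.0, seed=0):
    rng = random.Random(seed)
    # Proposal centers start widely spread, far from the target mass.
    centers = [rng.uniform(-10.0, 10.0) for _ in range(n_proposals)]
    samples, weights = [], []
    for _ in range(n_iters):
        # Step 1: simulate one sample from each proposal.
        samples = [rng.gauss(c, sd) for c in centers]
        # Step 2: importance weights correct the proposal/target mismatch.
        raw = [
            target_density(x) / normal_pdf(x, c, sd)
            for x, c in zip(samples, centers)
        ]
        total = sum(raw)
        weights = [w / total for w in raw]
        # Step 3: adapt the proposals by resampling from the weighted samples.
        centers = rng.choices(samples, weights=weights, k=n_proposals)
    return samples, weights


samples, weights = pmc()
# Self-normalized estimate of the target mean from the final population.
estimate = sum(w * x for w, x in zip(weights, samples))
```

The random multinomial resampling in step 3 is the part that the abstract proposes to replace with a deterministic subsampling rule.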

Submitted 26 December, 2020; originally announced December 2020.

Comments: Submitted to Journal of Computational and Graphical Statistics

Journal ref: Journal of Computational and Graphical Statistics (2022)

arXiv:2006.07506 [pdf, other]

Uncertainty Quantification for Inferring Hawkes Networks

Authors: Haoyun Wang, Liyan Xie, Alex Cuozzo, Simon Mak, Yao Xie

Abstract: Multivariate Hawkes processes are commonly used to model streaming networked event data in a wide variety of applications. However, it remains a challenge to extract reliable inference from complex datasets with uncertainty quantification. Aiming towards this, we develop a statistical inference framework to learn causal relationships between nodes from networked data, where the underlying directed graph implies Granger causality. We provide uncertainty quantification for the maximum likelihood estimate of the network multivariate Hawkes process by providing a non-asymptotic confidence set. The main technique is based on the concentration inequalities of continuous-time martingales. We compare our method to the previously-derived asymptotic Hawkes process confidence interval, and demonstrate the strengths of our method in an application to neuronal connectivity reconstruction.

Submitted 28 October, 2020; v1 submitted 12 June, 2020; originally announced June 2020.

Comments: 16 pages including appendix, 1 figure, accepted to NeurIPS 2020

arXiv:2004.13962 [pdf, other]

Energy Balancing of Covariate Distributions

Authors: Jared D. Huling, Simon Mak

Abstract: Bias in causal comparisons has a direct correspondence with distributional imbalance of covariates between treatment groups. Weighting strategies such as inverse propensity score weighting attempt to mitigate bias by either modeling the treatment assignment mechanism or balancing specified covariate moments. This paper introduces a new weighting method, called energy balancing, which instead aims to balance weighted covariate distributions. By directly targeting distributional imbalance, the proposed weighting strategy can be flexibly utilized in a wide variety of causal analyses, including the estimation of average treatment effects and individualized treatment rules. Our energy balancing weights (EBW) approach has several advantages over existing weighting techniques. First, it offers a model-free and robust approach for obtaining covariate balance that does not require tuning parameters, obviating the need for modeling decisions of secondary nature to the scientific question at hand. Second, since this approach is based on a genuine measure of distributional balance, it provides a means for assessing the balance induced by a given set of weights for a given dataset. Finally, the proposed method is computationally efficient and has desirable theoretical guarantees under mild conditions. We demonstrate the effectiveness of this EBW approach in a suite of simulation experiments, and in studies on the safety of right heart catheterization and the effect of indwelling arterial catheters.

Submitted 11 March, 2022; v1 submitted 29 April, 2020; originally announced April 2020.

arXiv:1911.07285 [pdf, other]

A hierarchical expected improvement method for Bayesian optimization

Authors: Zhehui Chen, Simon Mak, C. F. Jeff Wu

Abstract: The Expected Improvement (EI) method, proposed by Jones et al. (1998), is a widely-used Bayesian optimization method, which makes use of a fitted Gaussian process model for efficient black-box optimization. However, one key drawback of EI is that it is overly greedy in exploiting the fitted Gaussian process model for optimization, which results in suboptimal solutions even with large sample sizes. To address this, we propose a new hierarchical EI (HEI) framework, which makes use of a hierarchical Gaussian process model. HEI preserves a closed-form acquisition function, and corrects the over-greediness of EI by encouraging exploration of the optimization space. We then introduce hyperparameter estimation methods which allow HEI to mimic a fully Bayesian optimization procedure, while avoiding expensive Markov-chain Monte Carlo sampling steps. We prove the global convergence of HEI over a broad function space, and establish near-minimax convergence rates under certain prior specifications. Numerical experiments show the improvement of HEI over existing Bayesian optimization methods, for synthetic functions and a semiconductor manufacturing optimization problem.

Submitted 20 April, 2023; v1 submitted 17 November, 2019; originally announced November 2019.

arXiv:1911.05940 [pdf, other]

Distributional Clustering: A distribution-preserving clustering method

Authors: Arvind Krishna, Simon Mak, Roshan Joseph

Abstract: One key use of k-means clustering is to identify cluster prototypes which can serve as representative points for a dataset. However, a drawback of using k-means cluster centers as representative points is that such points distort the distribution of the underlying data. This can be highly disadvantageous in problems where the representative points are subsequently used to gain insights on the data distribution, as these points do not mimic the distribution of the data. To this end, we propose a new clustering method called "distributional clustering", which ensures cluster centers capture the distribution of the underlying data. We first prove the asymptotic convergence of the proposed cluster centers to the data generating distribution, then present an efficient algorithm for computing these cluster centers in practice. Finally, we demonstrate the effectiveness of distributional clustering on synthetic and real datasets.

Submitted 14 November, 2019; originally announced November 2019.

Comments: Submitted to Statistica Sinica

arXiv:1910.05452 [pdf, other]

Adaptive design for Gaussian process regression under censoring

Authors: Jialei Chen, Simon Mak, V. Roshan Joseph, Chuck Zhang

Abstract: A key objective in engineering problems is to predict an unknown experimental surface over an input domain. In complex physical experiments, this may be hampered by response censoring, which results in a significant loss of information. For such problems, experimental design is paramount for maximizing predictive power using a small number of expensive experimental runs. To tackle this, we propose a novel adaptive design method, called the integrated censored mean-squared error (ICMSE) method. The ICMSE method first estimates the posterior probability of a new observation being censored, then adaptively chooses design points that minimize predictive uncertainty under censoring. Adopting a Gaussian process regression model with product correlation function, the proposed ICMSE criterion is easy to evaluate, which allows for efficient design optimization. We demonstrate the effectiveness of the ICMSE design in two real-world applications on surgical planning and wafer manufacturing.

Submitted 25 June, 2021; v1 submitted 11 October, 2019; originally announced October 2019.

Journal ref: Annals of Applied Statistics, 2021

arXiv:1910.01754 [pdf, other]

doi 10.1080/00401706.2020.1801255

Function-on-function kriging, with applications to 3D printing of aortic tissues

Authors: Jialei Chen, Simon Mak, V. Roshan Joseph, Chuck Zhang

Abstract: 3D-printed medical prototypes, which use synthetic metamaterials to mimic biological tissue, are becoming increasingly important in urgent surgical applications. However, the mimicking of tissue mechanical properties via 3D-printed metamaterial can be difficult and time-consuming, due to the functional nature of both inputs (metamaterial structure) and outputs (mechanical response curve). To deal with this, we propose a novel function-on-function kriging model for efficient emulation and tissue-mimicking optimization. For functional inputs, a key novelty of our model is the spectral-distance (SpeD) correlation function, which captures important spectral differences between two functional inputs. Dependencies for functional outputs are then modeled via a co-kriging framework. We further adopt shrinkage priors on both the input spectra and the output co-kriging covariance matrix, which allows the emulator to learn and incorporate important physics (e.g., dominant input frequencies, output curve properties). Finally, we demonstrate the effectiveness of the proposed SpeD emulator in a real-world study on mimicking human aortic tissue, and show that it can provide quicker and more accurate tissue-mimicking performance compared to existing methods in the medical literature.

Submitted 1 July, 2020; v1 submitted 3 October, 2019; originally announced October 2019.

Journal ref: Technometrics, 2020

arXiv:1908.08868 [pdf, other]

BdryGP: a new Gaussian process model for incorporating boundary information

Authors: Liang Ding, Simon Mak, C. F. Jeff Wu

Abstract: Gaussian processes (GPs) are widely used as surrogate models for emulating computer codes, which simulate complex physical phenomena. In many problems, additional boundary information (i.e., the behavior of the phenomena along input boundaries) is known beforehand, either from governing physics or scientific knowledge. While there has been recent work on incorporating boundary information within GPs, such models do not provide theoretical insights on improved convergence rates. To this end, we propose a new GP model, called BdryGP, for incorporating boundary information. We show that BdryGP not only has improved convergence rates over existing GP models (which do not incorporate boundaries), but is also more resistant to the "curse-of-dimensionality" in nonparametric regression. Our proofs make use of a novel connection between GP interpolation and finite-element modeling.

Submitted 23 August, 2019; originally announced August 2019.

arXiv:1712.03589 [pdf, other]

Analysis-of-marginal-Tail-Means (ATM): a robust method for discrete black-box optimization

Authors: Simon Mak, C. F. Jeff Wu

Abstract: We present a new method, called Analysis-of-marginal-Tail-Means (ATM), for effective robust optimization of discrete black-box problems. ATM has important applications to many real-world engineering problems (e.g., manufacturing optimization, product design, molecular engineering), where the objective to optimize is black-box and expensive, and the design space is inherently discrete. One weakness of existing methods is that they are not robust: these methods perform well under certain assumptions, but yield poor results when such assumptions (which are difficult to verify in black-box problems) are violated. ATM addresses this via the use of marginal tail means for optimization, which combines both rank-based and model-based methods. The trade-off between rank- and model-based optimization is tuned by first identifying important main effects and interactions, then finding a good compromise which best exploits additive structure. By adaptively tuning this trade-off from data, ATM provides improved robust optimization over existing methods, particularly in problems with (i) a large number of factors, (ii) unordered factors, or (iii) experimental noise. We demonstrate the effectiveness of ATM in simulations and in two real-world engineering problems: the first on robust parameter design of a circular piston, and the second on product family design of a thermistor network.

Submitted 19 October, 2018; v1 submitted 10 December, 2017; originally announced December 2017.

arXiv:1712.03310 [pdf, other]

doi 10.1109/JSTSP.2018.2840481

Maximum entropy low-rank matrix recovery

Authors: Simon Mak, Yao Xie

Abstract: We propose in this paper a novel, information-theoretic method, called MaxEnt, for efficient data acquisition for low-rank matrix recovery. This proposed method has important applications to a wide range of problems, including image processing and text document indexing. Fundamental to our design approach is the so-called maximum entropy principle, which states that the measurement masks which maximize the entropy of observations also maximize the information gain on the unknown matrix $\mathbf{X}$. Coupled with a low-rank stochastic model for $\mathbf{X}$, such a principle (i) reveals novel connections between information-theoretic sampling and subspace packings, and (ii) yields efficient mask construction algorithms for matrix recovery, which significantly outperform random measurements. We illustrate the effectiveness of MaxEnt in simulation experiments, and demonstrate its usefulness in two real-world applications on image recovery and text document indexing.

Submitted 21 November, 2018; v1 submitted 8 December, 2017; originally announced December 2017.

Comments: Fixing typos

arXiv:1708.06897 [pdf, other]

Projected support points: a new method for high-dimensional data reduction

Authors: Simon Mak, V. Roshan Joseph

Abstract: In an era where big and high-dimensional data is readily available, data scientists are inevitably faced with the challenge of reducing this data for expensive downstream computation or analysis. To this end, we present here a new method for reducing high-dimensional big data into a representative point set, called projected support points (PSPs). A key ingredient in our method is the so-called sparsity-inducing (SpIn) kernel, which encourages the preservation of low-dimensional features when reducing high-dimensional data. We begin by introducing a unifying theoretical framework for data reduction, connecting PSPs with fundamental sampling principles from experimental design and Quasi-Monte Carlo. Through this framework, we then derive sparsity conditions under which the curse-of-dimensionality in data reduction can be lifted for our method. Next, we propose two algorithms for one-shot and sequential reduction via PSPs, both of which exploit big data subsampling and majorization-minimization for efficient optimization. Finally, we demonstrate the practical usefulness of PSPs in two real-world applications, the first for data reduction in kernel learning, and the second for reducing Markov Chain Monte Carlo (MCMC) chains.

Submitted 2 June, 2018; v1 submitted 23 August, 2017; originally announced August 2017.

arXiv:1706.08037 [pdf, other]

Information-Guided Sampling for Low-Rank Matrix Completion

Authors: Simon Mak, Henry Shaowu Yushi, Yao Xie

Abstract: The noisy matrix completion problem, which aims to recover a low-rank matrix $\mathbf{X}$ from a partial, noisy observation of its entries, arises in many statistical, machine learning, and engineering applications. In this paper, we present a new, information-theoretic approach for active sampling (or designing) of matrix entries for noisy matrix completion, based on the maximum entropy design principle. One novelty of our method is that it implicitly makes use of uncertainty quantification (UQ) -- a measure of uncertainty for unobserved matrix entries -- to guide the active sampling procedure. The proposed framework reveals several novel insights on the role of compressive sensing (e.g., coherence) and coding design (e.g., Latin squares) on the sampling performance and UQ for noisy matrix completion. Using such insights, we develop an efficient posterior sampler for UQ, which is then used to guide a closed-form sampling scheme for matrix entries. Finally, we illustrate the effectiveness of this integrated sampling / UQ methodology in simulation studies and two applications to collaborative filtering.

Submitted 13 July, 2021; v1 submitted 25 June, 2017; originally announced June 2017.

Comments: ICML 2021 Workshop on Information-Theoretic Methods for Rigorous, Responsible, and Reliable Machine Learning

arXiv:1701.05547 [pdf, other]

cmenet: a new method for bi-level variable selection of conditional main effects

Authors: Simon Mak, C. F. Jeff Wu

Abstract: This paper introduces a novel method for selecting main effects and a set of reparametrized effects called conditional main effects (CMEs), which capture the conditional effect of a factor at a fixed level of another factor. CMEs represent interpretable, domain-specific phenomena for a wide range of applications in engineering, social sciences and genomics. The key challenge is in incorporating the implicit grouped structure of CMEs within the variable selection procedure itself. We propose a new method, cmenet, which employs two principles called CME coupling and CME reduction to effectively navigate the selection algorithm. Simulation studies demonstrate the improved CME selection performance of cmenet over more generic selection methods. Applied to a gene association study on fly wing shape, cmenet not only yields more parsimonious models and improved predictive performance over standard two-factor interaction analysis methods, but also reveals important insights on gene activation behavior, which can be used to guide further experiments. Efficient implementations of our algorithms are available in the R package cmenet in CRAN.

Submitted 18 November, 2017; v1 submitted 19 January, 2017; originally announced January 2017.

Comments: JASA T&M, under revision

arXiv:1611.07911 [pdf, other]

An efficient surrogate model for emulation and physics extraction of large eddy simulations

Authors: Simon Mak, Chih-Li Sung, Xingjian Wang, Shiang-Ting Yeh, Yu-Hung Chang, V. Roshan Joseph, Vigor Yang, C. F. Jeff Wu

Abstract: In the quest for advanced propulsion and power-generation systems, high-fidelity simulations are too computationally expensive to survey the desired design space, and a new design methodology is needed that combines engineering physics, computer simulations and statistical modeling. In this paper, we propose a new surrogate model that provides efficient prediction and uncertainty quantification of turbulent flows in swirl injectors with varying geometries, devices commonly used in many engineering applications. The novelty of the proposed method lies in the incorporation of known physical properties of the fluid flow as simplifying assumptions for the statistical model. In view of the massive simulation data at hand, which is on the order of hundreds of gigabytes, these assumptions allow for accurate flow predictions in around an hour of computation time. In contrast, existing flow emulators which forgo such simplifications may require more computation time for training and prediction than is needed for conducting the simulation itself. Moreover, by accounting for coupling mechanisms between flow variables, the proposed model can jointly reduce prediction uncertainty and extract useful flow physics, which can then be used to guide further investigations.

Submitted 26 May, 2017; v1 submitted 23 November, 2016; originally announced November 2016.

Comments: Submitted to JASA A&CS

arXiv:1609.01811 [pdf, other]

Support points

Authors: Simon Mak, V. Roshan Joseph

Abstract: This paper introduces a new way to compact a continuous probability distribution $F$ into a set of representative points called support points. These points are obtained by minimizing the energy distance, a statistical potential measure initially proposed by Székely and Rizzo (2004) for testing goodness-of-fit. The energy distance has two appealing features. First, its distance-based structure allows us to exploit the duality between powers of the Euclidean distance and its Fourier transform for theoretical analysis. Using this duality, we show that support points converge in distribution to $F$, and enjoy an improved error rate over Monte Carlo for integrating a large class of functions. Second, the minimization of the energy distance can be formulated as a difference-of-convex program, which we manipulate using two algorithms to efficiently generate representative point sets. In simulation studies, support points provide improved integration performance over both Monte Carlo and a specific Quasi-Monte Carlo method. Two important applications of support points are then highlighted: (a) as a way to quantify the propagation of uncertainty in expensive simulations, and (b) as a method to optimally compact Markov chain Monte Carlo (MCMC) samples in Bayesian computation.
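To make the objective in this abstract concrete, the sketch below evaluates the sample version of the Székely and Rizzo energy distance between a candidate point set and a reference sample standing in for $F$. It only computes the criterion in one dimension and does not implement the paper's optimization algorithms. The function name `energy_distance` and the example point sets are assumptions for illustration.

```python
def energy_distance(points, sample):
    """Sample energy distance between two sets of real numbers.

    Computes 2 * E|X - Y| - E|X - X'| - E|Y - Y'|, with each expectation
    replaced by an average over all pairs (including identical pairs).
    """
    n, m = len(points), len(sample)
    cross = sum(abs(x - y) for x in points for y in sample) / (n * m)
    within_points = sum(abs(x - z) for x in points for z in points) / (n * n)
    within_sample = sum(abs(y - z) for y in sample for z in sample) / (m * m)
    return 2.0 * cross - within_points - within_sample


# Reference sample: 100 evenly spaced values approximating Uniform(0, 1).
reference = [(i + 0.5) / 100 for i in range(100)]

# Two hypothetical four-point summaries of that distribution.
spread = [0.125, 0.375, 0.625, 0.875]
clumped = [0.0, 0.01, 0.02, 0.03]

d_spread = energy_distance(spread, reference)
d_clumped = energy_distance(clumped, reference)
```

A point set that covers the distribution evenly gives a smaller value than one bunched at an edge, which is the behaviour that minimizing this criterion exploits.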

Submitted 9 September, 2018; v1 submitted 6 September, 2016; originally announced September 2016.

Comments: Accepted, Annals of Statistics

MSC Class: 62E17
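
The energy distance criterion named in the abstract above can be illustrated with a minimal sketch. It assumes the target distribution $F$ is represented by a finite sample and uses the standard empirical form of the Székely and Rizzo energy distance. The function names are hypothetical, and the code only evaluates the criterion for a given point set; it does not implement the paper's difference-of-convex algorithms.

```python
import math
from itertools import product


def mean_pairwise_distance(a, b):
    """Average Euclidean distance over all pairs (p, q) with p in a and q in b."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def energy_distance(points, sample):
    """Empirical energy distance between a candidate point set and a sample.

    Computes 2*E||X - Y|| - E||X - X'|| - E||Y - Y'||, with each expectation
    replaced by an average over all pairs (self-pairs included).
    """
    return (
        2.0 * mean_pairwise_distance(points, sample)
        - mean_pairwise_distance(points, points)
        - mean_pairwise_distance(sample, sample)
    )


# Hypothetical 1-D illustration: a sample standing in for F and two candidate sets.
sample = [(0.0,), (1.0,), (2.0,), (3.0,)]
spread_out = [(0.5,), (2.5,)]
clumped = [(0.0,), (0.1,)]

# The spread-out set represents the sample better, so its energy distance is smaller.
print(energy_distance(spread_out, sample) < energy_distance(clumped, sample))
```

A point set with a smaller energy distance to the sample is a better compact representation of it under this criterion; support points are the minimizers of that quantity.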

arXiv:1602.03940

A regional compound Poisson process for hurricane and tropical storm damage

Authors: Simon Mak, Derek Bingham, Yi Lu

Abstract: In light of intense hurricane activity along the U.S. Atlantic coast, attention has turned to understanding both the economic impact and behaviour of these storms. The compound Poisson-lognormal process has been proposed as a model for aggregate storm damage, but does not shed light on regional analysis since storm path data are not used. In this paper, we propose a fully Bayesian regional prediction model which uses conditional autoregressive (CAR) models to account for both storm paths and spatial patterns for storm damage. When fitted to historical data, the analysis from our model both confirms previous findings and reveals new insights on regional storm tendencies. Posterior predictive samples can also be used for pricing regional insurance premiums, which we illustrate using three different risk measures.

Submitted 11 February, 2016; originally announced February 2016.

Comments: Accepted to Journal of the Royal Statistical Society, Series C, on January 25, 2016; pending publication

arXiv:1602.03938

Minimax and minimax projection designs using clustering

Authors: Simon Mak, V. Roshan Joseph

Abstract: Minimax designs provide a uniform coverage of a design space $\mathcal{X} \subseteq \mathbb{R}^p$ by minimizing the maximum distance from any point in this space to its nearest design point. Although minimax designs have many useful applications, e.g., for optimal sensor allocation or as space-filling designs for computer experiments, there has been little work in developing algorithms for generating these designs, due to the computational complexity of the problem. In this paper, a new hybrid algorithm combining particle swarm optimization and clustering is proposed for generating minimax designs on any convex and bounded design space. The computation time of this algorithm scales linearly in dimension $p$, meaning our method can generate minimax designs efficiently for high-dimensional regions. Simulation studies and a real-world example show that the proposed algorithm provides improved minimax performance over existing methods on a variety of design spaces. Finally, we introduce a new type of experimental design called a minimax projection design, and show that this proposed design provides better minimax performance on projected subspaces of $\mathcal{X}$ compared to existing designs. An efficient implementation of these algorithms can be found in the R package minimaxdesign.

Submitted 28 October, 2016; v1 submitted 11 February, 2016; originally announced February 2016.

Comments: Under revision, Journal of Computational and Graphical Statistics (JCGS)
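
The minimax criterion described in the abstract above (the maximum distance from any point in the design space to its nearest design point) can be sketched as follows. This is an illustration only: it assumes the continuous space $\mathcal{X}$ is approximated by a finite candidate grid, the function name is hypothetical, and it is neither the paper's hybrid particle swarm and clustering algorithm nor the API of the minimaxdesign R package.

```python
import math


def minimax_distance(design, candidates):
    """Largest distance from any candidate point to its nearest design point.

    `candidates` is a finite stand-in for the design space (an assumption of
    this sketch); smaller return values indicate more uniform coverage.
    """
    return max(min(math.dist(x, d) for d in design) for x in candidates)


# Hypothetical illustration on the unit interval, discretized into 101 points.
grid = [(i / 100,) for i in range(101)]
one_point = [(0.5,)]
two_points = [(0.25,), (0.75,)]

# Two well-placed points cover the interval better than a single central point.
print(minimax_distance(two_points, grid) < minimax_distance(one_point, grid))
```

A minimax design is a design that minimizes this quantity; the sketch only scores a given design and leaves the optimization aside.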

Showing 1–35 of 35 results for author: Mak, S