Showing 1–35 of 35 results for author: Mak, S

  1. arXiv:2403.08079  [pdf, other]

    cs.SE stat.ME

    BayesFLo: Bayesian fault localization of complex software systems

    Authors: Yi Ji, Simon Mak, Ryan Lekivetz, Joseph Morgan

    Abstract: Software testing is essential for the reliable development of complex software systems. A key step in software testing is fault localization, which uses test data to pinpoint failure-inducing combinations for further diagnosis. Existing fault localization methods, however, are largely deterministic, and thus do not provide a principled approach for assessing probabilistic risk of potential root ca…

    Submitted 12 March, 2024; originally announced March 2024.

  2. arXiv:2403.03816  [pdf, other]

    stat.ML cs.LG

    Targeted Variance Reduction: Robust Bayesian Optimization of Black-Box Simulators with Noise Parameters

    Authors: John Joshua Miller, Simon Mak

    Abstract: The optimization of a black-box simulator over control parameters $\mathbf{x}$ arises in a myriad of scientific applications. In such applications, the simulator often takes the form $f(\mathbf{x},\boldsymbol{\theta})$, where $\boldsymbol{\theta}$ are parameters that are uncertain in practice. Robust optimization aims to optimize the objective $\mathbb{E}[f(\mathbf{x},\boldsymbol{\Theta})]$, where…

    Submitted 6 March, 2024; originally announced March 2024.

  3. arXiv:2311.08752  [pdf, other]

    stat.ME

    ProSpar-GP: scalable Gaussian process modeling with massive non-stationary datasets

    Authors: Kevin Li, Simon Mak

    Abstract: Gaussian processes (GPs) are a popular class of Bayesian nonparametric models, but their training can be computationally burdensome for massive training datasets. While there has been notable work on scaling up these models for big data, existing methods typically rely on a stationary GP assumption for approximation, and can thus perform poorly when the underlying response surface is non-stationary,…

    Submitted 15 November, 2023; originally announced November 2023.

  4. arXiv:2310.19787  [pdf]

    stat.ME stat.AP stat.ML

    $e^{\text{RPCA}}$: Robust Principal Component Analysis for Exponential Family Distributions

    Authors: Xiaojun Zheng, Simon Mak, Liyan Xie, Yao Xie

    Abstract: Robust Principal Component Analysis (RPCA) is a widely used method for recovering low-rank structure from data matrices corrupted by significant and sparse outliers. These corruptions may arise from occlusions, malicious tampering, or other causes for anomalies, and the joint identification of such corruptions with low-rank background is critical for process monitoring and diagnosis. However, exis…

    Submitted 30 October, 2023; originally announced October 2023.

  5. arXiv:2310.14544  [pdf, other]

    stat.ML cs.LG

    Trigonometric Quadrature Fourier Features for Scalable Gaussian Process Regression

    Authors: Kevin Li, Max Balakirsky, Simon Mak

    Abstract: Fourier feature approximations have been successfully applied in the literature for scalable Gaussian Process (GP) regression. In particular, Quadrature Fourier Features (QFF) derived from Gaussian quadrature rules have gained popularity in recent years due to their improved approximation accuracy and better calibrated uncertainty estimates compared to Random Fourier Feature (RFF) methods. However…

    Submitted 22 October, 2023; originally announced October 2023.

  6. arXiv:2306.07480  [pdf, other]

    stat.ME

    ACE: Active Learning for Causal Inference with Expensive Experiments

    Authors: Difan Song, Simon Mak, C. F. Jeff Wu

    Abstract: Experiments are the gold standard for causal inference. In many applications, experimental units can often be recruited or chosen sequentially, and the adaptive execution of such experiments may offer greatly improved inference of causal quantities over non-adaptive approaches, particularly when experiments are expensive. We thus propose a novel active learning method called ACE (Active learning f…

    Submitted 12 June, 2023; originally announced June 2023.

    Comments: 6 pages, 4 figures

  7. arXiv:2306.07299  [pdf, other]

    nucl-th cs.LG hep-ph stat.ML

    Additive Multi-Index Gaussian process modeling, with application to multi-physics surrogate modeling of the quark-gluon plasma

    Authors: Kevin Li, Simon Mak, J. -F Paquet, Steffen A. Bass

    Abstract: The Quark-Gluon Plasma (QGP) is a unique phase of nuclear matter, theorized to have filled the Universe shortly after the Big Bang. A critical challenge in studying the QGP is that, to reconcile experimental observables with theoretical parameters, one requires many simulation runs of a complex physics model over a high-dimensional parameter space. Each run is computationally very expensive, requi…

    Submitted 10 June, 2023; originally announced June 2023.

  8. arXiv:2302.00755  [pdf, other]

    stat.ML cs.LG stat.ME

    Hierarchical shrinkage Gaussian processes: applications to computer code emulation and dynamical system recovery

    Authors: Tao Tang, Simon Mak, David Dunson

    Abstract: In many areas of science and engineering, computer simulations are widely used as proxies for physical experiments, which can be infeasible or unethical. Such simulations can often be computationally expensive, and an emulator can be trained to efficiently predict the desired response surface. A widely-used emulator is the Gaussian process (GP), which provides a flexible framework for efficient pr…

    Submitted 1 February, 2023; originally announced February 2023.

  9. arXiv:2211.00268  [pdf, other]

    stat.ME stat.AP

    Stacking designs: designing multi-fidelity computer experiments with target predictive accuracy

    Authors: Chih-Li Sung, Yi Ji, Simon Mak, Wenjia Wang, Tao Tang

    Abstract: In an era where scientific experiments can be very costly, multi-fidelity emulators provide a useful tool for cost-efficient predictive scientific computing. For scientific applications, the experimenter is often limited by a tight computational budget, and thus wishes to (i) maximize predictive power of the multi-fidelity emulator via a careful design of experiments, and (ii) ensure this model ac…

    Submitted 27 October, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

  10. arXiv:2209.13748  [pdf, other]

    stat.ME

    Conglomerate Multi-Fidelity Gaussian Process Modeling, with Application to Heavy-Ion Collisions

    Authors: Yi Ji, Henry Shaowu Yuchi, Derek Soeder, J. -F. Paquet, Steffen A. Bass, V. Roshan Joseph, C. F. Jeff Wu, Simon Mak

    Abstract: In an era where scientific experimentation is often costly, multi-fidelity emulation provides a powerful tool for predictive scientific computing. While there has been notable work on multi-fidelity modeling, existing models do not incorporate an important "conglomerate" property of multi-fidelity simulators, where the accuracies of different simulator components are controlled by different fideli…

    Submitted 28 September, 2023; v1 submitted 27 September, 2022; originally announced September 2022.

  11. arXiv:2203.04246  [pdf, other]

    stat.ME math.AT stat.AP stat.ML

    PERCEPT: a new online change-point detection method using topological data analysis

    Authors: Xiaojun Zheng, Simon Mak, Liyan Xie, Yao Xie

    Abstract: Topological data analysis (TDA) provides a set of data analysis tools for extracting embedded topological structures from complex high-dimensional datasets. In recent years, TDA has been a rapidly growing field which has found success in a wide range of applications, including signal processing, neuroscience and network analysis. In these applications, the online detection of changes is of crucial…

    Submitted 8 March, 2022; originally announced March 2022.

  12. arXiv:2109.07623  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    BacHMMachine: An Interpretable and Scalable Model for Algorithmic Harmonization for Four-part Baroque Chorales

    Authors: Yunyao Zhu, Stephen Hahn, Simon Mak, Yue Jiang, Cynthia Rudin

    Abstract: Algorithmic harmonization - the automated harmonization of a musical piece given its melodic line - is a challenging problem that has garnered much interest from both music theorists and computer scientists. One genre of particular interest is the four-part Baroque chorales of J.S. Bach. Methods for algorithmic chorale harmonization typically adopt a black-box, "data-driven" approach: they do not…

    Submitted 22 February, 2022; v1 submitted 15 September, 2021; originally announced September 2021.

    Comments: 7 pages, 7 figures

  13. arXiv:2108.00306  [pdf, other]

    stat.ME

    A graphical multi-fidelity Gaussian process model, with application to emulation of heavy-ion collisions

    Authors: Yi Ji, Simon Mak, Derek Soeder, J-F Paquet, Steffen A. Bass

    Abstract: With advances in scientific computing and mathematical modeling, complex scientific phenomena such as galaxy formations and rocket propulsion can now be reliably simulated. Such simulations can however be very time-intensive, requiring millions of CPU hours to perform. One solution is multi-fidelity emulation, which uses data of different fidelities to train an efficient predictive model which emu…

    Submitted 27 February, 2024; v1 submitted 31 July, 2021; originally announced August 2021.

  14. arXiv:2107.04668  [pdf, other]

    math.ST math.NA stat.ME stat.ML

    Gaussian Process Subspace Regression for Model Reduction

    Authors: Ruda Zhang, Simon Mak, David Dunson

    Abstract: Subspace-valued functions arise in a wide range of problems, including parametric reduced order modeling (PROM). In PROM, each parameter point can be associated with a subspace, which is used for Petrov-Galerkin projections of large system matrices. Previous efforts to approximate such functions use interpolations on manifolds, which can be inaccurate and slow. To tackle this, we propose a novel B…

    Submitted 9 July, 2021; originally announced July 2021.

    Comments: 20 pages, 4 figures; with supplementary material

    MSC Class: 14M15; 35B30; 37M99; 53-04; 60B20; 60G15

  15. arXiv:2103.00117  [pdf, other]

    stat.ME math.AT stat.OT

    Online High-Dimensional Change-Point Detection using Topological Data Analysis

    Authors: Xiaojun Zheng, Simon Mak, Yao Xie

    Abstract: Topological Data Analysis (TDA) is a rapidly growing field, which studies methods for learning underlying topological structures present in complex data representations. TDA methods have found recent success in extracting useful geometric structures for a wide range of applications, including protein classification, neuroscience, and time-series analysis. However, in many such applications, one is…

    Submitted 7 March, 2021; v1 submitted 26 February, 2021; originally announced March 2021.

  16. arXiv:2102.05724  [pdf, other]

    stat.ML cs.LG

    Sequential change-point detection for mutually exciting point processes over networks

    Authors: Haoyun Wang, Liyan Xie, Yao Xie, Alex Cuozzo, Simon Mak

    Abstract: We present a new CUSUM procedure for sequentially detecting change-points in self- and mutually-exciting processes, a.k.a. Hawkes networks, using discrete event data. Hawkes networks have become a popular model for statistics and machine learning due to their capability in modeling irregularly observed data where the timing between events carries a lot of information. The problem of detecting abru…

    Submitted 4 March, 2022; v1 submitted 10 February, 2021; originally announced February 2021.

    Comments: 33 pages, 13 figures

  17. arXiv:2101.06592  [pdf, other]

    stat.ME cs.LG

    TSEC: a framework for online experimentation under experimental constraints

    Authors: Simon Mak, Yuanshuo Zhou, Lavonne Hoang, C. F. Jeff Wu

    Abstract: Thompson sampling is a popular algorithm for solving multi-armed bandit problems, and has been applied in a wide range of applications, from website design to portfolio optimization. In such applications, however, the number of choices (or arms) $N$ can be large, and the data needed to make adaptive decisions require expensive experimentation. One is then faced with the constraint of experimenting…

    Submitted 17 January, 2021; originally announced January 2021.

  18. arXiv:2101.01299  [pdf, other]

    stat.ME

    Bayesian Uncertainty Quantification for Low-Rank Matrix Completion

    Authors: Henry Shaowu Yuchi, Simon Mak, Yao Xie

    Abstract: We consider the problem of uncertainty quantification for an unknown low-rank matrix $\mathbf{X}$, given a partial and noisy observation of its entries. This quantification of uncertainty is essential for many real-world problems, including image processing, satellite imaging, and seismology, providing a principled framework for validating scientific conclusions and guiding decision-making. Howeve…

    Submitted 25 March, 2022; v1 submitted 4 January, 2021; originally announced January 2021.

  19. Population Quasi-Monte Carlo

    Authors: Chaofan Huang, V. Roshan Joseph, Simon Mak

    Abstract: Monte Carlo methods are widely used for approximating complicated, multidimensional integrals for Bayesian inference. Population Monte Carlo (PMC) is an important class of Monte Carlo methods, which utilizes a population of proposals to generate weighted samples that approximate the target distribution. The generic PMC framework iterates over three steps: samples are simulated from a set of propos…

    Submitted 26 December, 2020; originally announced December 2020.

    Comments: Submitted to Journal of Computational and Graphical Statistics

    Journal ref: Journal of Computational and Graphical Statistics (2022)

  20. arXiv:2006.07506  [pdf, other]

    stat.ML cs.LG math.ST

    Uncertainty Quantification for Inferring Hawkes Networks

    Authors: Haoyun Wang, Liyan Xie, Alex Cuozzo, Simon Mak, Yao Xie

    Abstract: Multivariate Hawkes processes are commonly used to model streaming networked event data in a wide variety of applications. However, it remains a challenge to extract reliable inference from complex datasets with uncertainty quantification. Aiming towards this, we develop a statistical inference framework to learn causal relationships between nodes from networked data, where the underlying directed…

    Submitted 28 October, 2020; v1 submitted 12 June, 2020; originally announced June 2020.

    Comments: 16 pages including appendix, 1 figure, accepted to NeurIPS 2020

  21. arXiv:2004.13962  [pdf, other]

    stat.ME

    Energy Balancing of Covariate Distributions

    Authors: Jared D. Huling, Simon Mak

    Abstract: Bias in causal comparisons has a direct correspondence with distributional imbalance of covariates between treatment groups. Weighting strategies such as inverse propensity score weighting attempt to mitigate bias by either modeling the treatment assignment mechanism or balancing specified covariate moments. This paper introduces a new weighting method, called energy balancing, which instead aims…

    Submitted 11 March, 2022; v1 submitted 29 April, 2020; originally announced April 2020.

  22. arXiv:1911.07285  [pdf, other]

    stat.ME

    A hierarchical expected improvement method for Bayesian optimization

    Authors: Zhehui Chen, Simon Mak, C. F. Jeff Wu

    Abstract: The Expected Improvement (EI) method, proposed by Jones et al. (1998), is a widely-used Bayesian optimization method, which makes use of a fitted Gaussian process model for efficient black-box optimization. However, one key drawback of EI is that it is overly greedy in exploiting the fitted Gaussian process model for optimization, which results in suboptimal solutions even with large sample sizes.…

    Submitted 20 April, 2023; v1 submitted 17 November, 2019; originally announced November 2019.

  23. arXiv:1911.05940  [pdf, other]

    stat.ML cs.LG

    Distributional Clustering: A distribution-preserving clustering method

    Authors: Arvind Krishna, Simon Mak, Roshan Joseph

    Abstract: One key use of k-means clustering is to identify cluster prototypes which can serve as representative points for a dataset. However, a drawback of using k-means cluster centers as representative points is that such points distort the distribution of the underlying data. This can be highly disadvantageous in problems where the representative points are subsequently used to gain insights on the data…

    Submitted 14 November, 2019; originally announced November 2019.

    Comments: Submitted to Statistica Sinica

  24. arXiv:1910.05452  [pdf, other]

    stat.ME

    Adaptive design for Gaussian process regression under censoring

    Authors: Jialei Chen, Simon Mak, V. Roshan Joseph, Chuck Zhang

    Abstract: A key objective in engineering problems is to predict an unknown experimental surface over an input domain. In complex physical experiments, this may be hampered by response censoring, which results in a significant loss of information. For such problems, experimental design is paramount for maximizing predictive power using a small number of expensive experimental runs. To tackle this, we propose…

    Submitted 25 June, 2021; v1 submitted 11 October, 2019; originally announced October 2019.

    Journal ref: Annals of Applied Statistics, 2021

  25. Function-on-function kriging, with applications to 3D printing of aortic tissues

    Authors: Jialei Chen, Simon Mak, V. Roshan Joseph, Chuck Zhang

    Abstract: 3D-printed medical prototypes, which use synthetic metamaterials to mimic biological tissue, are becoming increasingly important in urgent surgical applications. However, the mimicking of tissue mechanical properties via 3D-printed metamaterial can be difficult and time-consuming, due to the functional nature of both inputs (metamaterial structure) and outputs (mechanical response curve). To deal…

    Submitted 1 July, 2020; v1 submitted 3 October, 2019; originally announced October 2019.

    Journal ref: Technometrics, 2020

  26. arXiv:1908.08868  [pdf, other]

    stat.ME math.ST

    BdryGP: a new Gaussian process model for incorporating boundary information

    Authors: Liang Ding, Simon Mak, C. F. Jeff Wu

    Abstract: Gaussian processes (GPs) are widely used as surrogate models for emulating computer code, which simulates complex physical phenomena. In many problems, additional boundary information (i.e., the behavior of the phenomena along input boundaries) is known beforehand, either from governing physics or scientific knowledge. While there has been recent work on incorporating boundary information within GP…

    Submitted 23 August, 2019; originally announced August 2019.

  27. arXiv:1712.03589  [pdf, other]

    stat.ME

    Analysis-of-marginal-Tail-Means (ATM): a robust method for discrete black-box optimization

    Authors: Simon Mak, C. F. Jeff Wu

    Abstract: We present a new method, called Analysis-of-marginal-Tail-Means (ATM), for effective robust optimization of discrete black-box problems. ATM has important applications to many real-world engineering problems (e.g., manufacturing optimization, product design, molecular engineering), where the objective to optimize is black-box and expensive, and the design space is inherently discrete. One weakness…

    Submitted 19 October, 2018; v1 submitted 10 December, 2017; originally announced December 2017.

  28. Maximum entropy low-rank matrix recovery

    Authors: Simon Mak, Yao Xie

    Abstract: We propose in this paper a novel, information-theoretic method, called MaxEnt, for efficient data acquisition for low-rank matrix recovery. This proposed method has important applications to a wide range of problems, including image processing and text document indexing. Fundamental to our design approach is the so-called maximum entropy principle, which states that the measurement masks which max…

    Submitted 21 November, 2018; v1 submitted 8 December, 2017; originally announced December 2017.

    Comments: Fixing typos

  29. arXiv:1708.06897  [pdf, other]

    stat.ME

    Projected support points: a new method for high-dimensional data reduction

    Authors: Simon Mak, V. Roshan Joseph

    Abstract: In an era where big and high-dimensional data is readily available, data scientists are inevitably faced with the challenge of reducing this data for expensive downstream computation or analysis. To this end, we present here a new method for reducing high-dimensional big data into a representative point set, called projected support points (PSPs). A key ingredient in our method is the so-called sp…

    Submitted 2 June, 2018; v1 submitted 23 August, 2017; originally announced August 2017.

  30. arXiv:1706.08037  [pdf, other]

    stat.ME

    Information-Guided Sampling for Low-Rank Matrix Completion

    Authors: Simon Mak, Henry Shaowu Yuchi, Yao Xie

    Abstract: The noisy matrix completion problem, which aims to recover a low-rank matrix $\mathbf{X}$ from a partial, noisy observation of its entries, arises in many statistical, machine learning, and engineering applications. In this paper, we present a new, information-theoretic approach for active sampling (or designing) of matrix entries for noisy matrix completion, based on the maximum entropy design pr…

    Submitted 13 July, 2021; v1 submitted 25 June, 2017; originally announced June 2017.

    Comments: ICML 2021 Workshop on Information-Theoretic Methods for Rigorous, Responsible, and Reliable Machine Learning

  31. arXiv:1701.05547  [pdf, other]

    stat.ME

    cmenet: a new method for bi-level variable selection of conditional main effects

    Authors: Simon Mak, C. F. Jeff Wu

    Abstract: This paper introduces a novel method for selecting main effects and a set of reparametrized effects called conditional main effects (CMEs), which capture the conditional effect of a factor at a fixed level of another factor. CMEs represent interpretable, domain-specific phenomena for a wide range of applications in engineering, social sciences and genomics. The key challenge is in incorporating th…

    Submitted 18 November, 2017; v1 submitted 19 January, 2017; originally announced January 2017.

    Comments: JASA T&M, under revision

  32. arXiv:1611.07911  [pdf, other]

    stat.AP

    An efficient surrogate model for emulation and physics extraction of large eddy simulations

    Authors: Simon Mak, Chih-Li Sung, Xingjian Wang, Shiang-Ting Yeh, Yu-Hung Chang, V. Roshan Joseph, Vigor Yang, C. F. Jeff Wu

    Abstract: In the quest for advanced propulsion and power-generation systems, high-fidelity simulations are too computationally expensive to survey the desired design space, and a new design methodology is needed that combines engineering physics, computer simulations and statistical modeling. In this paper, we propose a new surrogate model that provides efficient prediction and uncertainty quantification of…

    Submitted 26 May, 2017; v1 submitted 23 November, 2016; originally announced November 2016.

    Comments: Submitted to JASA A&CS

  33. arXiv:1609.01811  [pdf, other]

    math.ST stat.ME

    Support points

    Authors: Simon Mak, V. Roshan Joseph

    Abstract: This paper introduces a new way to compact a continuous probability distribution $F$ into a set of representative points called support points. These points are obtained by minimizing the energy distance, a statistical potential measure initially proposed by Székely and Rizzo (2004) for testing goodness-of-fit. The energy distance has two appealing features. First, its distance-based structure all…

    Submitted 9 September, 2018; v1 submitted 6 September, 2016; originally announced September 2016.

    Comments: Accepted, Annals of Statistics

    MSC Class: 62E17

  34. arXiv:1602.03940  [pdf, other]

    stat.AP

    A regional compound Poisson process for hurricane and tropical storm damage

    Authors: Simon Mak, Derek Bingham, Yi Lu

    Abstract: In light of intense hurricane activity along the U.S. Atlantic coast, attention has turned to understanding both the economic impact and behaviour of these storms. The compound Poisson-lognormal process has been proposed as a model for aggregate storm damage, but does not shed light on regional analysis since storm path data are not used. In this paper, we propose a fully Bayesian regional predict…

    Submitted 11 February, 2016; originally announced February 2016.

    Comments: Accepted to Journal of the Royal Statistical Society, Series C on January 25th (2016). Pending publication

  35. arXiv:1602.03938  [pdf, other]

    stat.CO

    Minimax and minimax projection designs using clustering

    Authors: Simon Mak, V. Roshan Joseph

    Abstract: Minimax designs provide a uniform coverage of a design space $\mathcal{X} \subseteq \mathbb{R}^p$ by minimizing the maximum distance from any point in this space to its nearest design point. Although minimax designs have many useful applications, e.g., for optimal sensor allocation or as space-filling designs for computer experiments, there has been little work in developing algorithms for generat…

    Submitted 28 October, 2016; v1 submitted 11 February, 2016; originally announced February 2016.

    Comments: Under revision, Journal of Computational and Graphical Statistics (JCGS)