-
Sphractal: Estimating the Fractal Dimension of Surfaces Computed from Precise Atomic Coordinates via Box-Counting Algorithm
Authors:
Jonathan Yik Chang Ting,
Andrew Thomas Agars Wood,
Amanda Susan Barnard
Abstract:
The fractal dimension of a surface allows its degree of roughness to be characterized quantitatively. However, little effort has been devoted to calculating the fractal dimension of surfaces computed from precisely known atomic coordinates in computational biomolecular and nanomaterial studies. This work proposes methods to estimate the fractal dimension of the surface of any 3D object composed of spheres, by representing the surface as either a voxelized point cloud or a mathematically exact surface, and computing its box-counting dimension. Sphractal is published as a Python package that provides these functionalities, and its utility is demonstrated on a set of simulated palladium nanoparticle data.
Submitted 10 March, 2024; v1 submitted 22 January, 2024;
originally announced January 2024.
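The box-counting idea named in the abstract above can be illustrated with a minimal sketch. This is not Sphractal's API or the authors' implementation; the function names `box_count` and `box_counting_dimension`, the chosen box sizes, and the toy planar point cloud are all assumptions made for illustration. The sketch counts the cubic boxes of side eps occupied by a point cloud and takes the least-squares slope of log N(eps) against log(1/eps).

```python
import math


def box_count(points, box_size):
    """Count the distinct cubic boxes of side box_size holding at least one point."""
    return len({tuple(math.floor(c / box_size) for c in p) for p in points})


def box_counting_dimension(points, box_sizes):
    """Least-squares slope of log N(eps) against log(1/eps)."""
    xs = [math.log(1.0 / eps) for eps in box_sizes]
    ys = [math.log(box_count(points, eps)) for eps in box_sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical toy input: a flat 64 x 64 grid of points embedded in 3D.
plane = [(i / 64, j / 64, 0.0) for i in range(64) for j in range(64)]
dim = box_counting_dimension(plane, [1 / 2, 1 / 4, 1 / 8, 1 / 16])
print(round(dim, 6))
```

For a flat sheet the estimated slope should sit close to 2; a rough surface would be expected to give a value between 2 and 3. In practice the range of box sizes over which the log-log relationship is approximately linear has to be chosen with care.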
-
Robust Functional Principal Component Analysis for Non-Euclidean Random Objects
Authors:
Jiazhen Xu,
Andrew T. A. Wood,
Tao Zou
Abstract:
Functional data analysis offers a diverse toolkit of statistical methods tailored for analyzing samples of real-valued random functions. Recently, samples of time-varying random objects, such as time-varying networks, have been increasingly encountered in modern data analysis. These data structures represent elements within general metric spaces that lack local or global linear structures, rendering traditional functional data analysis methods inapplicable. Moreover, the existing methodology for time-varying random objects does not work well in the presence of outlying objects. In this paper, we propose a robust method for analysing time-varying random objects. Our method employs pointwise Fréchet medians and then constructs pointwise distance trajectories between the individual time courses and the sample Fréchet medians. This representation effectively transforms time-varying objects into functional data. A novel robust approach to functional principal component analysis based on a Winsorized U-statistic estimator of the covariance structure is introduced. The proposed robust analysis of these distance trajectories is able to identify key features of time-varying objects and is useful for downstream analysis. To illustrate the efficacy of our approach, numerical studies focusing on dynamic networks are conducted. The results indicate that the proposed method exhibits good all-round performance and surpasses the existing approach in terms of robustness, showcasing its superior performance in handling time-varying objects data.
Submitted 28 November, 2023;
originally announced December 2023.
-
A branch cut approach to the probability density and distribution functions of a linear combination of central and non-central Chi-square random variables
Authors:
Alfred Kume,
Tomonari Sei,
Andrew T. A. Wood
Abstract:
The paper considers the distribution of a general linear combination of central and non-central chi-square random variables by exploring the branch cut regions that appear in the standard Laplace inversion process. Owing to the original motivation from directional statistics, the focus of this paper is on the density function of such distributions and not on their cumulative distribution function. In fact, our results confirm that the latter is a special case of the former. Our approach provides new insight by generating alternative characterizations of the probability density function in terms of a finite number of feasible univariate integrals. In particular, the central cases seem to allow an interesting representation in terms of the branch cuts, while general degrees of freedom and non-centrality can be easily adopted using recursive differentiation. Numerical results confirm that the proposed approach works well, while offering more transparency and therefore easier control of the accuracy.
Submitted 12 May, 2023;
originally announced May 2023.
-
Robust score matching for compositional data
Authors:
Janice L. Scealy,
Kassel L. Hingee,
John T. Kent,
Andrew T. A. Wood
Abstract:
The restricted polynomially-tilted pairwise interaction (RPPI) distribution gives a flexible model for compositional data. It is particularly well-suited to situations where some of the marginal distributions of the components of a composition are concentrated near zero, possibly with right skewness. This article develops a method of tractable robust estimation for the model by combining two ideas. The first idea is to use score matching estimation after an additive log-ratio transformation. The resulting estimator is automatically insensitive to zeros in the data compositions. The second idea is to incorporate suitable weights in the estimating equations. The resulting estimator is additionally resistant to outliers. These properties are confirmed in simulation studies where we further also demonstrate that our new outlier-robust estimator is efficient in high concentration settings, even in the case when there is no model contamination. An example is given using microbiome data. A user-friendly R package accompanies the article.
Submitted 12 May, 2023;
originally announced May 2023.
-
Generalized Score Matching
Authors:
Jiazhen Xu,
Janice L. Scealy,
Andrew T. A. Wood,
Tao Zou
Abstract:
Score matching is an estimation procedure that has been developed for statistical models whose probability density function is known up to proportionality but whose normalizing constant is intractable, so that maximum likelihood is difficult or impossible to implement. To date, applications of score matching have focused more on continuous IID models. Motivated by various data modelling problems, this article proposes a unified asymptotic theory of generalized score matching developed under the independence assumption, covering both continuous and discrete response data, thereby giving a sound basis for score-matching-based inference. Real data analyses and simulation studies provide convincing evidence of strong practical performance of the proposed methods.
Submitted 21 April, 2024; v1 submitted 15 March, 2023;
originally announced March 2023.
-
Cauchy robust principal component analysis with applications to high-dimensional data sets
Authors:
Ayisha Fayomi,
Yannis Pantazis,
Michail Tsagris,
Andrew T. A. Wood
Abstract:
Principal component analysis (PCA) is a standard dimensionality reduction technique used in various research and applied fields. From an algorithmic point of view, classical PCA can be formulated in terms of operations on a multivariate Gaussian likelihood. As a consequence of the implied Gaussian formulation, the principal components are not robust to outliers. In this paper, we propose a modified formulation, based on the use of a multivariate Cauchy likelihood instead of the Gaussian likelihood, which has the effect of robustifying the principal components. We present an algorithm to compute these robustified principal components. We additionally derive the relevant influence function of the first component and examine its theoretical properties. Simulation experiments on high-dimensional datasets demonstrate that the estimated principal components based on the Cauchy likelihood outperform or are on par with existing robust PCA techniques.
Submitted 6 November, 2022;
originally announced November 2022.
-
Central limit theorem for intrinsic Frechet means in smooth compact Riemannian manifolds
Authors:
Thomas Hotz,
Huiling Le,
Andrew T. A. Wood
Abstract:
We prove a central limit theorem (CLT) for the Frechet mean of independent and identically distributed observations in a compact Riemannian manifold assuming that the population Frechet mean is unique. Previous general CLT results in this setting have assumed that the cut locus of the Frechet mean lies outside the support of the population distribution. So far as we are aware, the CLT in the present paper is the first which allows the cut locus to have co-dimension one or two when it is included in the support of the distribution. A key part of the proof is establishing an asymptotic approximation for the parallel transport of a certain vector field. Whether or not a non-standard term arises in the CLT depends on whether the co-dimension of the cut locus is one or greater than one: in the former case a non-standard term appears but not in the latter case. This is the first paper to give a general and explicit expression for the non-standard term which arises when the co-dimension of the cut locus is one.
Submitted 31 October, 2022;
originally announced October 2022.
-
Generalized Score Matching for Regression
Authors:
Jiazhen Xu,
Janice L. Scealy,
Andrew T. A. Wood,
Tao Zou
Abstract:
Many probabilistic models that have an intractable normalizing constant may be extended to contain covariates. Since the evaluation of the exact likelihood is difficult or even impossible for these models, score matching was proposed to avoid explicit computation of the normalizing constant. In the literature, score matching has so far only been developed for models in which the observations are independent and identically distributed (IID). However, the IID assumption does not hold in the traditional fixed design setting for regression-type models. To deal with the estimation of these covariate-dependent models, this paper presents a new score matching approach for independent but not necessarily identically distributed data under a general framework for both continuous and discrete responses, which includes a novel generalized score matching method for count response regression. We prove that our proposed score matching estimators are consistent and asymptotically normal under mild regularity conditions. The theoretical results are supported by simulation studies and a real-data example. Additionally, our simulation results indicate that, compared to approximate maximum likelihood estimation, the generalized score matching produces estimates with substantially smaller biases in an application to doctoral publication data.
Submitted 18 March, 2022;
originally announced March 2022.
-
Score matching for compositional distributions
Authors:
Janice L. Scealy,
Andrew T. A. Wood
Abstract:
Compositional data and multivariate count data with known totals are challenging to analyse due to the non-negativity and sum-to-one constraints on the sample space. It is often the case that many of the compositional components are highly right-skewed, with large numbers of zeros. A major limitation of currently available estimators for compositional models is that they either cannot handle many zeros in the data or are not computationally feasible in moderate to high dimensions. We derive a set of novel score matching estimators applicable to distributions on a Riemannian manifold with boundary, of which the standard simplex is a special case. The score matching method is applied to estimate the parameters in a new flexible truncation model for compositional data and we show that the estimators are scalable and available in closed form. Through extensive simulation studies, the scoring methodology is demonstrated to work well for estimating the parameters in the new truncation model and also for the Dirichlet distribution. We apply the new model and estimators to real microbiome compositional data and show that the model provides a good fit to the data.
Submitted 22 December, 2020;
originally announced December 2020.
-
Gaussian asymptotic limits for the $α$-transformation in the analysis of compositional data
Authors:
Yannis Pantazis,
Michail Tsagris,
Andrew T. A. Wood
Abstract:
Compositional data consists of vectors of proportions whose components sum to 1. Such vectors lie in the standard simplex, which is a manifold with boundary. One issue that has been rather controversial within the field of compositional data analysis is the choice of metric on the simplex. One popular possibility has been to use the metric implied by log-transforming the data, as proposed by Aitchison [1, 2]; and another popular approach has been to use the standard Euclidean metric inherited from the ambient space. Tsagris et al. [21] proposed a one-parameter family of power transformations, the $α$-transformations, which include both the metric implied by Aitchison's transformation and the Euclidean metric as particular cases. Our underlying philosophy is that, with many datasets, it may make sense to use the data to help us determine a suitable metric. A related possibility is to apply the $α$-transformations to a parametric family of distributions, and then estimate $α$ along with the other parameters. However, as we shall see, when one follows this last approach with the Dirichlet family, some care is needed in a certain limiting case which arises $(α\to 0)$, as we found out when fitting this model to real and simulated data. Specifically, when the maximum likelihood estimator of $α$ is close to 0, the other parameters tend to be large. The main purpose of the paper is to study this limiting case both theoretically and numerically and to provide insight into these numerical findings.
Submitted 21 February, 2019; v1 submitted 29 November, 2018;
originally announced December 2018.
-
The extended power distribution: A new distribution on $(0, 1)$
Authors:
Chibueze E. Ogbonnaya,
Simon P. Preston,
Andrew T. A. Wood
Abstract:
We propose a two-parameter bounded probability distribution called the extended power distribution. This distribution on $(0, 1)$ is similar to the beta distribution; however, it has some advantages, which we explore. We define the moments and quantiles of this distribution and show that it is possible to give an $r$-parameter extension of this distribution ($r>2$). We also consider its complementary distribution and show that it has some flexibility advantages over the Kumaraswamy and beta distributions. This distribution can be used as an alternative to the Kumaraswamy distribution since it has a closed form for its cumulative distribution function. However, it can be fitted to data where there are some samples that are exactly equal to 1, unlike the Kumaraswamy and beta distributions which cannot be fitted to such data or may require some censoring. Applications considered show the extended power distribution performs favourably against the Kumaraswamy distribution in most cases.
Submitted 7 November, 2017;
originally announced November 2017.
-
Nonparametric hypothesis testing for equality of means on the simplex
Authors:
Michail Tsagris,
Simon Preston,
Andrew T. A. Wood
Abstract:
In the context of data that lie on the simplex, we investigate the use of empirical and exponential empirical likelihood, and Hotelling and James statistics, to test the null hypothesis of equal population means based on two independent samples. We perform an extensive numerical study using data simulated from various distributions on the simplex. The results, taken together with practical considerations regarding implementation, support the use of the bootstrap-calibrated James statistic.
Submitted 4 August, 2016; v1 submitted 27 July, 2016;
originally announced July 2016.
-
Improved classification for compositional data using the $α$-transformation
Authors:
Michail Tsagris,
Simon Preston,
Andrew T. A. Wood
Abstract:
In compositional data analysis an observation is a vector containing non-negative values, only the relative sizes of which are considered to be of interest. Without loss of generality, a compositional vector can be taken to be a vector of proportions that sum to one. Data of this type arise in many areas including geology, archaeology, biology, economics and political science. In this paper we investigate methods for classification of compositional data. Our approach centres on the idea of using the $α$-transformation to transform the data and then to classify the transformed data via regularised discriminant analysis and the k-nearest neighbours algorithm. Using the $α$-transformation generalises two rival approaches in compositional data analysis, one (when $α=1$) that treats the data as though they were Euclidean, ignoring the compositional constraint, and another (when $α=0$) that employs Aitchison's centred log-ratio transformation. A numerical study with several real datasets shows that whether using $α=1$ or $α=0$ gives better classification performance depends on the dataset, and moreover that using an intermediate value of $α$ can sometimes give better performance than using either 1 or 0.
Submitted 17 June, 2015; v1 submitted 16 June, 2015;
originally announced June 2015.
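The two endpoints described in the abstract above ($α=1$ and $α=0$) can be checked numerically with a small sketch. The formula used here is an assumption, since the abstract does not state it: it is one common un-rotated form of the $α$-transformation, $(D\,u_α(x) - 1)/α$ with $u_α(x)_i = x_i^α / \sum_j x_j^α$, and it omits the Helmert-type rotation often applied afterwards. The function name `alpha_transform` and the example composition are hypothetical.

```python
import math


def alpha_transform(x, alpha):
    """Un-rotated alpha-transformation of a composition x (assumed form).

    x must have strictly positive components that sum to one.
    alpha == 0 is handled as the limiting case, the centred log-ratio.
    """
    d = len(x)
    if alpha == 0:
        logs = [math.log(v) for v in x]
        mean_log = sum(logs) / d
        return [value - mean_log for value in logs]
    powered = [v ** alpha for v in x]
    total = sum(powered)
    return [(d * p / total - 1.0) / alpha for p in powered]


x = [0.2, 0.3, 0.5]                   # hypothetical composition
raw_like = alpha_transform(x, 1.0)    # affine in x: equals D * x - 1
clr = alpha_transform(x, 0.0)         # centred log-ratio
near_clr = alpha_transform(x, 1e-6)   # small alpha approaches the clr values
print(raw_like, clr, near_clr)
```

With $α=1$ the output is an affine function of the raw proportions, so Euclidean analysis of it matches analysis of the untransformed data, while small $α$ approaches the centred log-ratio; intermediate values interpolate between the two.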
-
A data-based power transformation for compositional data
Authors:
Michail T. Tsagris,
Simon Preston,
Andrew T. A. Wood
Abstract:
Compositional data analysis is carried out either by neglecting the compositional constraint and applying standard multivariate data analysis, or by transforming the data using the logs of the ratios of the components. In this work we examine a more general transformation which includes both approaches as special cases. It is a power transformation and involves a single parameter, α. The transformation has two equivalent versions. The first is the stay-in-the-simplex version, which is the power transformation as defined by Aitchison in 1986. The second version, which is a linear transformation of the power transformation, is a Box-Cox type transformation. We discuss a parametric way of estimating the value of α, which is maximization of its profile likelihood (assuming multivariate normality of the transformed data) and the equivalence between the two versions is exhibited. Other ways include maximization of the correct classification probability in discriminant analysis and maximization of the pseudo R-squared (as defined by Aitchison in 1986) in linear regression. We examine the relationship between the α-transformation, the raw data approach and the isometric log-ratio transformation. Furthermore, we also define a suitable family of metrics corresponding to the family of α-transformation and consider the corresponding family of Frechet means.
Submitted 16 June, 2011; v1 submitted 7 June, 2011;
originally announced June 2011.
-
Saddlepoint approximation for moment generating functions of truncated random variables
Authors:
Ronald W. Butler,
Andrew T. A. Wood
Abstract:
We consider the problem of approximating the moment generating function (MGF) of a truncated random variable in terms of the MGF of the underlying (i.e., untruncated) random variable. The purpose of approximating the MGF is to enable the application of saddlepoint approximations to certain distributions determined by truncated random variables. Two important statistical applications are the following: the approximation of certain multivariate cumulative distribution functions; and the approximation of passage time distributions in ion channel models which incorporate time interval omission. We derive two types of representation for the MGF of a truncated random variable. One of these representations is obtained by exponential tilting. The second type of representation, which has two versions, is referred to as an exponential convolution representation. Each representation motivates a different approximation. It turns out that each of the three approximations is extremely accurate in those cases "to which it is suited." Moreover, there is a simple rule of thumb for deciding which approximation to use in a given case, and if this rule is followed, then our numerical and theoretical results indicate that the resulting approximation will be extremely accurate.
Submitted 30 August, 2005;
originally announced August 2005.
-
Estimation of fractal dimension for a class of Non-Gaussian stationary processes and fields
Authors:
Grace Chan,
Andrew T. A. Wood
Abstract:
We present the asymptotic distribution theory for a class of increment-based estimators of the fractal dimension of a random field of the form $g\{X(t)\}$, where $g:\mathbb{R}\to\mathbb{R}$ is an unknown smooth function and $X(t)$ is a real-valued stationary Gaussian field on $\mathbb{R}^d$, $d=1$ or $2$, whose covariance function obeys a power law at the origin. The relevant theoretical framework here is "fixed domain" (or "infill") asymptotics. Surprisingly, the limit theory in this non-Gaussian case is somewhat richer than in the Gaussian case (the latter is recovered when $g$ is affine), in part because estimators of the type considered may have an asymptotic variance which is random in the limit. Broadly, when $g$ is smooth and nonaffine, three types of limit distributions can arise, types (i), (ii) and (iii), say. Each type can be represented as a random integral. More specifically, type (i) can be represented as the integral of a certain random function with respect to Lebesgue measure; type (ii) can be represented as the integral of a second random function
Submitted 25 June, 2004;
originally announced June 2004.