Review
2023 May 15;381(2247):20220149.
doi: 10.1098/rsta.2022.0149. Epub 2023 Mar 27.

Bayesian cluster analysis

S Wade. Philos Trans A Math Phys Eng Sci.

Abstract

Bayesian cluster analysis offers substantial benefits over algorithmic approaches by providing not only point estimates but also uncertainty in the clustering structure and patterns within each cluster. An overview of Bayesian cluster analysis is provided, including both model-based and loss-based approaches, along with a discussion on the importance of the kernel or loss selected and prior specification. Advantages are demonstrated in an application to cluster cells and discover latent cell types in single-cell RNA sequencing data to study embryonic cellular development. Lastly, we focus on the ongoing debate between finite and infinite mixtures in a model-based approach and robustness to model misspecification. While much of the debate and asymptotic theory focuses on the marginal posterior of the number of clusters, we empirically show that quite a different behaviour is obtained when estimating the full clustering structure. This article is part of the theme issue 'Bayesian inference: challenges, perspectives, and prospects'.

Keywords: Bayesian analysis; clustering; ensembles; mixture models; model misspecification.

Conflict of interest statement

We declare we have no competing interests.

Figures

Figure 1.
To highlight limitations of the standard workflow for scRNA-seq data, which first log-transforms the data and then applies tools such as k-means for clustering, we plot in (a,b) the log-transformed counts across all cells for two genes, Id4 and Meg3, and in (c) data simulated from a Gaussian mixture model (GMM). The real data (a,b) and the simulated data (c) are clearly incompatible and have different characteristics. Instead, (d) plots log-transformed data generated from a negative-binomial mixture model (NBMM), which more closely resembles the real data. (Online version in colour.)
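The contrast described in this caption can be reproduced in a small simulation. The following is a minimal sketch, not the code used for the figure: the log(1 + x) transform, the mean/dispersion parameterisation of the negative binomial, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np


def simulate_gmm(n, weights, means, sds, rng):
    """Draw n values from a univariate Gaussian mixture."""
    z = rng.choice(len(weights), size=n, p=weights)  # component labels
    x = rng.normal(np.asarray(means, dtype=float)[z],
                   np.asarray(sds, dtype=float)[z])
    return x, z


def simulate_log_nbmm(n, weights, means, dispersions, rng):
    """Draw n counts from a negative-binomial mixture and return log(1 + count).

    Each component is parameterised by its mean mu and dispersion r
    (an illustrative choice), so the success probability is r / (r + mu).
    """
    z = rng.choice(len(weights), size=n, p=weights)
    mu = np.asarray(means, dtype=float)[z]
    r = np.asarray(dispersions, dtype=float)[z]
    counts = rng.negative_binomial(r, r / (r + mu))
    return np.log1p(counts), z


rng = np.random.default_rng(0)
# Hypothetical two-component settings, not estimated from any dataset.
gmm_x, _ = simulate_gmm(2000, [0.6, 0.4], [0.5, 3.0], [0.3, 0.5], rng)
nb_x, _ = simulate_log_nbmm(2000, [0.6, 0.4], [0.5, 20.0], [1.0, 2.0], rng)

print("share of exact zeros, GMM: ", np.mean(gmm_x == 0.0))
print("share of exact zeros, NBMM:", np.mean(nb_x == 0.0))
print("negative values, GMM: ", int(np.sum(gmm_x < 0)))
print("negative values, NBMM:", int(np.sum(nb_x < 0)))
```

Under these assumed settings, the log-transformed count data are discrete, non-negative and have a spike at zero, whereas the Gaussian mixture is continuous and can take negative values. This is one way to see the mismatch that the caption refers to.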
Figure 2.
Highlights of the analysis of Liu et al. [60]. (a) Heat map of the posterior estimated latent RNA counts (corrected by the posterior capture efficiencies) for each cell (x-axis) and gene (y-axis). Cells from different clusters are separated by solid yellow lines, and within each cluster, the dashed yellow line separates HOM and HET. Genes above the red horizontal line are identified as differentially expressed across the clusters. (b) Visualization of the clustering estimate in the two-dimensional space obtained through t-distributed stochastic neighbour embedding (t-SNE [139]) of the high-dimensional data. (c) Uncertainty in clustering characterized by the posterior similarity matrix. (Online version in colour.)
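The posterior similarity matrix in panel (c) is commonly estimated from MCMC output as the fraction of posterior draws in which two observations share a cluster. The following is a minimal sketch under that assumption; the function name and the toy draws are hypothetical and do not come from the analysis of Liu et al.

```python
import numpy as np


def posterior_similarity_matrix(label_draws):
    """Estimate P(c_i = c_j | data) from sampled cluster labels.

    label_draws: array-like of shape (n_draws, n_items), one row of
    integer cluster labels per posterior draw.
    """
    draws = np.asarray(label_draws)
    n_draws, n_items = draws.shape
    psm = np.zeros((n_items, n_items))
    for labels in draws:
        # Co-clustering indicator matrix for this draw.
        psm += labels[:, None] == labels[None, :]
    return psm / n_draws


# Toy example: three draws of a clustering of three items.
toy_draws = [[0, 0, 1],
             [0, 1, 1],
             [1, 1, 0]]
print(posterior_similarity_matrix(toy_draws))
```

Because each entry depends only on whether two labels are equal, the estimate is unaffected by relabelling of clusters between draws (the first and third toy draws define the same partition).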
Figure 3.
Comparison of different estimators for the number of clusters in the example of Miller & Harrison [145], where the true clustering contains only a single cluster. The DP mixture of Gaussians is considered for model-based clustering with different choices of the concentration parameter α. The box plots display variability in the estimates across the 50 replicated datasets, with colour corresponding to a sample size of n = 100, 200 or 500. (a) Marginal mode of k. (b) MAP clustering k. (c) Binder clustering k. (d) VI clustering k. (Online version in colour.)
Figure 4.
Comparison of different estimators for the number of clusters in the misspecified example of Rajkowski [157], where the true clustering contains only a single cluster under the uniform kernel. The DP mixture of Gaussians is considered for model-based clustering with different choices of the concentration parameter α. The box plots display variability in the estimates across the 50 replicated datasets, with colour corresponding to a sample size of n = 100, 200 or 500. (a) Marginal mode of k. (b) MAP clustering. (c) Binder clustering. (d) VI clustering. (Online version in colour.)
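To illustrate how a loss-based estimate such as the Binder clustering in panel (c) yields a number of clusters, the sketch below picks, among the sampled clusterings, the one with the smallest posterior expected Binder loss and counts its clusters. It assumes equal misclassification costs and restricts the search to the sampled clusterings, which is a simplification and not necessarily the search strategy used for the figures; the function names and toy draws are hypothetical.

```python
import numpy as np


def coclustering(labels):
    """Return the 0/1 matrix indicating which pairs share a cluster."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)


def binder_estimate(label_draws):
    """Sampled clustering minimising the expected Binder loss (equal costs).

    Up to a constant factor, the expected loss of a candidate c* is
    sum_{i<j} | 1(c*_i = c*_j) - psm_ij |, where psm is the posterior
    similarity matrix estimated from the draws.
    """
    draws = np.asarray(label_draws)
    psm = np.mean([coclustering(d) for d in draws], axis=0)
    upper = np.triu_indices(draws.shape[1], k=1)  # pairs with i < j
    losses = [np.abs(coclustering(d)[upper] - psm[upper]).sum() for d in draws]
    best = draws[int(np.argmin(losses))]
    return best, len(np.unique(best))


# Toy draws: three agree on {0,1},{2,3} (up to relabelling), one differs.
toy_draws = [[0, 0, 1, 1],
             [0, 0, 1, 1],
             [0, 0, 0, 1],
             [1, 1, 0, 0]]
estimate, k_hat = binder_estimate(toy_draws)
print("Binder estimate:", estimate, "number of clusters:", k_hat)
```

The number of clusters reported this way is a property of the estimated partition, which is why it can behave differently from the mode of the marginal posterior of k shown in panel (a).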

References

    1. Cheeseman P, Kelly J, Self M, Stutz J, Taylor W, Freeman D. 1988. Autoclass: a Bayesian classification system. In Machine learning proceedings 1988 (ed. J Laird), pp. 54–64. San Francisco, CA: Elsevier.
    2. Kuhn MA, Feigelson ED. 2019. Applications in astronomy. In Handbook of mixture analysis (eds S Frühwirth-Schnatter, G Celeux, CP Robert), pp. 463–489. New York, NY: Chapman and Hall/CRC.
    3. Dasgupta A, Raftery AE. 1998. Detecting features in spatial point processes with clutter via model-based clustering. J. Am. Stat. Assoc. 93, 294–302. (doi:10.1080/01621459.1998.10474110)
    4. Blei DM, Ng AY, Jordan MI. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022.
    5. Chauvel C, Novoloaca A, Veyre P, Reynier F, Becker J. 2019. Evaluation of integrative clustering methods for the analysis of multi-omics data. Brief. Bioinform. 21, 541–552. (doi:10.1093/bib/bbz015)