Review

Philos Trans A Math Phys Eng Sci. 2023 May 15;381(2247):20220149.

doi: 10.1098/rsta.2022.0149. Epub 2023 Mar 27.

Bayesian cluster analysis

S Wade¹

PMID: 36970819
PMCID: PMC10041359
DOI: 10.1098/rsta.2022.0149

S Wade. Philos Trans A Math Phys Eng Sci. 2023.

Author

S Wade¹

Affiliation

¹ School of Mathematics and Maxwell Institute for Mathematical Sciences, University of Edinburgh, James Clerk Maxwell Building, Edinburgh, UK.

Abstract

Bayesian cluster analysis offers substantial benefits over algorithmic approaches by providing not only point estimates but also uncertainty in the clustering structure and patterns within each cluster. An overview of Bayesian cluster analysis is provided, including both model-based and loss-based approaches, along with a discussion on the importance of the kernel or loss selected and prior specification. Advantages are demonstrated in an application to cluster cells and discover latent cell types in single-cell RNA sequencing data to study embryonic cellular development. Lastly, we focus on the ongoing debate between finite and infinite mixtures in a model-based approach and robustness to model misspecification. While much of the debate and asymptotic theory focuses on the marginal posterior of the number of clusters, we empirically show that quite a different behaviour is obtained when estimating the full clustering structure. This article is part of the theme issue 'Bayesian inference: challenges, perspectives, and prospects'.

Keywords: Bayesian analysis; clustering; ensembles; mixture models; model misspecification.

Conflict of interest statement

We declare we have no competing interests.

Figures

**Figure 1.**
To highlight limitations of the standard workflow for scRNA-seq data, which first log-transforms the data and then applies tools such as $k$-means for clustering, we plot in (a,b) the log-transformed counts across all cells for two genes, Id4 and Meg3, and in (c) data simulated from a Gaussian mixture model (GMM); incompatibility and different characteristics are clearly observed between the real data (a,b) and the simulated data (c). By contrast, (d) plots log-transformed data generated from a negative-binomial mixture model (NBMM), which more closely resembles the real data. (Online version in colour.)

**Figure 2.**
Highlights of the analysis of Liu *et al.* [60]. (a) Heat map of the posterior estimated latent RNA counts (corrected by the posterior capture efficiencies) for each cell ($x$-axis) and gene ($y$-axis). Cells from different clusters are separated by solid yellow lines, and within each cluster, the dashed yellow line separates HOM and HET. Genes above the red horizontal line are identified as differentially expressed across the clusters. (b) Visualization of the clustering estimate in the two-dimensional space obtained through t-distributed stochastic neighbour embedding (t-SNE [139]) of the high-dimensional data. (c) Uncertainty in clustering characterized by the posterior similarity matrix. (Online version in colour.)

**Figure 3.**
Comparison of different estimators for the number of clusters in the example of Miller & Harrison [145], where the true clustering contains only a single cluster. The DP mixture of Gaussians is considered for model-based clustering with different choices of the concentration parameter $\alpha$. The box plots display variability in the estimates across the 50 replicated datasets, with colour corresponding to a sample size of $n = 100$, $200$ or $500$. (a) Marginal mode of $k$. (b) MAP clustering $k$. (c) Binder clustering $k$. (d) VI clustering $k$. (Online version in colour.)

**Figure 4.**
Comparison of different estimators for the number of clusters in the misspecified example of Rajkowski [157], where the true clustering contains only a single cluster under the uniform kernel. The DP mixture of Gaussians is considered for model-based clustering with different choices of the concentration parameter $\alpha$. The box plots display variability in the estimates across the 50 replicated datasets, with colour corresponding to a sample size of $n = 100$, $200$ or $500$. (a) Marginal mode of $k$. (b) MAP clustering. (c) Binder clustering. (d) VI clustering. (Online version in colour.)
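
Two quantities named in the captions above, the posterior similarity matrix of Figure 2(c) and the Binder clustering estimate of Figures 3(c) and 4(c), can be illustrated with a minimal Python sketch. This is not the article's implementation. It assumes posterior draws of the cluster labels are already available as an integer array (the `samples` values below are made up for illustration), uses Binder's loss with equal misclassification costs, and restricts the search to the sampled partitions, which is a simplification. The function names are hypothetical.

```python
import numpy as np


def posterior_similarity(samples):
    """Monte Carlo estimate of the posterior similarity matrix.

    samples: (S, n) integer array; row s holds the cluster labels of the
    n items in posterior draw s. Entry (i, j) of the result is the
    fraction of draws in which items i and j share a cluster.
    """
    samples = np.asarray(samples)
    # co[s, i, j] is True when items i and j share a label in draw s.
    co = samples[:, :, None] == samples[:, None, :]
    return co.mean(axis=0)


def binder_estimate(samples):
    """Sampled partition minimising the estimated expected Binder loss.

    With equal costs, the posterior expected Binder loss of a partition z
    is sum over pairs i < j of |1(z_i = z_j) - psm[i, j]|.
    """
    samples = np.asarray(samples)
    psm = posterior_similarity(samples)
    upper = np.triu_indices(samples.shape[1], k=1)  # pairs with i < j
    best, best_loss = None, np.inf
    for z in samples:
        same = (z[:, None] == z[None, :]).astype(float)
        loss = np.abs(same - psm)[upper].sum()
        if loss < best_loss:
            best, best_loss = z, loss
    return best, best_loss


# Hypothetical label draws for four items (illustrative values only).
samples = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],  # same partition as the first draw, labels switched
])
psm = posterior_similarity(samples)
best, best_loss = binder_estimate(samples)
print(psm)
print(best, best_loss)
```

Both functions depend only on whether two items share a label, not on the label values, so the label switching seen in the last draw does not affect the result.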

Cited by

Identification of cuproptosis-related gene clusters and immune cell infiltration in major burns based on machine learning models and experimental validation.
Wang X, Xiong Z, Hong W, Liao X, Yang G, Jiang Z, Jing L, Huang S, Fu Z, Zhu F. Front Immunol. 2024 Feb 12;15:1335675. doi: 10.3389/fimmu.2024.1335675. eCollection 2024. PMID: 38410514. Free PMC article.

A special issue on Bayesian inference: challenges, perspectives and prospects.
Robert CP, Rousseau J. Philos Trans A Math Phys Eng Sci. 2023 May 15;381(2247):20220155. doi: 10.1098/rsta.2022.0155. Epub 2023 Mar 27. PMID: 36970829. Free PMC article.
