PLoS Comput Biol. 2022 Jul 13;18(7):e1010184.
doi: 10.1371/journal.pcbi.1010184. eCollection 2022 Jul.

AC-PCoA: Adjustment for confounding factors using principal coordinate analysis

Yu Wang et al. PLoS Comput Biol. 2022.

Abstract

Confounding factors exist widely in various biological data owing to technical variations, population structures and experimental conditions. Such factors may mask the true signals and lead to spurious associations in the respective biological data, making it necessary to adjust for confounding factors accordingly. However, existing confounder correction methods were mainly developed based on the original data or the pairwise Euclidean distance, neither of which is adequate for analyzing different types of data, such as sequencing data. In this work, we proposed a method called Adjustment for Confounding factors using Principal Coordinate Analysis, or AC-PCoA, which reduces data dimension and extracts the information from different distance measures using principal coordinate analysis, and adjusts for confounding factors across multiple datasets by minimizing the associations between lower-dimensional representations and confounding variables. Application of the proposed method was further extended to classification and prediction. We demonstrated the efficacy of AC-PCoA on three simulated datasets and five real datasets. Compared to the existing methods, AC-PCoA shows better results in visualization, statistical testing, clustering, and classification.
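
As a reading aid, the principal coordinate analysis step that the abstract builds on can be sketched as below. This is a minimal illustration of classical PCoA only (double centering of the squared distance matrix followed by an eigendecomposition), assuming NumPy is available. It does not implement the confounder adjustment of AC-PCoA, which the abstract does not specify in detail. The function name `pcoa` and the toy points are hypothetical and not taken from the paper.

```python
import numpy as np


def pcoa(dist, n_components=2):
    """Classical PCoA on a symmetric distance matrix (illustrative sketch)."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Centering matrix J = I - (1/n) * 11^T
    j = np.eye(n) - np.ones((n, n)) / n
    # Double-centered matrix B = -1/2 * J * D^2 * J
    b = -0.5 * j @ (d ** 2) @ j
    # B is symmetric, so eigh applies; it returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Non-Euclidean distances can give negative eigenvalues; clip them to zero
    vals = np.clip(eigvals[order], 0.0, None)
    # Coordinates are eigenvectors scaled by the square root of the eigenvalues
    return eigvecs[:, order] * np.sqrt(vals)


# Toy data (hypothetical): four points on a 3 x 4 rectangle
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords = pcoa(dist, n_components=2)
# Pairwise distances between the recovered coordinates
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

With Euclidean input distances between two-dimensional points, two principal coordinates reproduce the original pairwise distances up to rotation and reflection. Passing a different distance matrix is how PCoA accommodates other distance measures.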

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Results of simulation data.
A: Simulation setting 1. The first line shows the true pattern and two-dimensional representations of samples from PCA, AC-PCA, PCoA(sp), AC-PCoA(sp) and aPCoA for one representative run. Samples are colored according to 3 types. The second line shows box plots of MANOVA F-statistic and NMI of k-means clustering on two-dimensional representations for 100 runs. B: Simulation setting 2. The first line shows the true pattern and two-dimensional sample representations from PCA, AC-PCA, PCoA(man), AC-PCoA(man) and aPCoA for one representative run. Samples are colored according to 10 types. The second line shows box plots of MANOVA F-statistic and NMI of k-means clustering for 100 runs. C: Simulation setting 3. The first line shows two-dimensional sample representations from PCA, AC-PCA, PCoA(bc), AC-PCoA(bc) and aPCoA for one representative run. Samples are colored according to 2 clinical groups. The second line shows box plots of MANOVA F-statistic and NMI of k-means clustering for 100 runs.
Fig 2. Results of white oak tree data.
A: Two-dimensional representations of samples colored by continental origins after conducting AC-PCoA, PCoA, and aPCoA using six distance measures. B: MANOVA F-statistic, NMI of k-means clustering, and classification accuracy. Continental origins are set to be the true labels. MANOVA test, k-means clustering, and classification were conducted on two and three principal coordinates from PCoA, AC-PCoA, and aPCoA.
Fig 3. Results of MBQC data (Dataset ‘A’).
A: Two-dimensional representations colored by specimens after conducting PCoA, AC-PCoA and aPCoA using Euclidean distance and Bray-Curtis distance. B: MANOVA F-statistic, NMI of k-means clustering, and classification accuracy. Specimens are set to be the true labels. MANOVA, k-means clustering, and classification were conducted on two and three principal coordinates from PCoA, AC-PCoA, and aPCoA.
Fig 4. Results of SEQC data.
A: Two-dimensional plots colored by reference sample IDs after conducting PCoA, AC-PCoA and aPCoA, using four distance measures. B: MANOVA F-statistic, NMI of k-means clustering, and classification accuracy. Reference sample IDs are set to be the true labels. MANOVA test, k-means clustering, and classification were conducted on two and three principal coordinates from PCoA, AC-PCoA, and aPCoA.
Fig 5. Results of scRNA-Seq data.
A: Two-dimensional representations of samples colored by cell types after conducting PCoA, AC-PCoA and aPCoA using four distance measures. B: MANOVA F-statistic, NMI of k-means clustering, and classification accuracy. Cell types are set to be the true labels. MANOVA, k-means clustering, and classification were conducted on two and three principal coordinates from PCoA, AC-PCoA, and aPCoA.
Fig 6. Results of human brain exon array data (window 5).
A: Two-dimensional plot colored by brain regions after conducting PCA, AC-PCA, PCoA, AC-PCoA and aPCoA, using four distance measures. B: MANOVA F-statistic, NMI of k-means clustering, and classification accuracy. Brain regions are set to be the true labels. MANOVA, k-means clustering, and classification were conducted on two and three principal coordinates from PCoA, AC-PCoA, and aPCoA.
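
The captions above refer to several distance measures, including the Bray-Curtis distance. A minimal sketch of two such measures is given below for orientation. It assumes that "(bc)" and "(man)" in the Fig 1 caption abbreviate Bray-Curtis and Manhattan distance, which the captions do not state explicitly. The function names and the toy count vectors are hypothetical and not taken from the paper.

```python
def manhattan(u, v):
    """Sum of absolute coordinate-wise differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity for non-negative abundance vectors."""
    total = sum(u) + sum(v)
    if total == 0:
        # Two all-zero samples: treated as identical here (a convention, assumed)
        return 0.0
    return sum(abs(a - b) for a, b in zip(u, v)) / total


def distance_matrix(samples, metric):
    """Pairwise distance matrix as a list of lists."""
    return [[metric(s, t) for t in samples] for s in samples]


# Toy abundance counts for two samples (hypothetical values)
sample_a = [6, 7, 4]
sample_b = [10, 0, 6]

d_man = manhattan(sample_a, sample_b)   # 4 + 7 + 2 = 13
d_bc = bray_curtis(sample_a, sample_b)  # 13 / (17 + 16) = 13/33
```

A matrix built with `distance_matrix` is the kind of input a principal coordinate analysis consumes in place of raw Euclidean coordinates.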

Grants and funding

W.L. is supported by the National Natural Science Foundation of China (Grant No. 11925103) and the Shanghai Municipal Science and Technology Major Project (Grant No. 2021SHZDZX0103). S.Z. was supported by the National Key Research and Development Program (Grant No. 2021YFC2701601), the Science and Technology Commission of Shanghai Municipality (Grant No. 20ZR1407700) and the Key Program of the National Natural Science Foundation of China (Grant No. 61932008). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.