Published by De Gruyter April 13, 2017

Missing value imputation for gene expression data by tailored nearest neighbors

Shahla Faisal and Gerhard Tutz

From the journal Statistical Applications in Genetics and Molecular Biology

https://doi.org/10.1515/sagmb-2015-0098

Abstract

High-dimensional data such as gene expression and RNA-sequence data often contain missing values. Subsequent analyses and results based on these incomplete data can suffer strongly from the presence of these missing values. Several approaches to the imputation of missing values in gene expression data have been developed, but the task is difficult because of the high dimensionality (number of genes) of the data. Here an imputation procedure is proposed that uses weighted nearest neighbors. Instead of defining nearest neighbors by a distance that includes all genes, the distance is computed only for genes that are apt to contribute to the accuracy of the imputed values. The method aims at avoiding the curse of dimensionality, which typically occurs when local methods such as nearest neighbors are applied in high-dimensional settings. The proposed weighted nearest neighbors algorithm is compared to existing missing value imputation techniques such as mean imputation, KNNimpute, and the recently proposed imputation by random forests. We use RNA-sequence and microarray data from studies on human cancer to compare the performance of the methods. The results from simulations as well as real studies show that the weighted distance procedure can successfully handle missing values in high-dimensional data structures where the number of predictors is larger than the number of samples. The method typically outperforms the considered competitors.
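The idea of a tailored, weighted nearest neighbor imputation can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how contributing genes are chosen or how neighbors are weighted, so the sketch assumes that genes are selected by absolute pairwise correlation with the gene to be imputed and that neighbors are weighted by a Gaussian kernel of their distance. The function names `pairwise_abs_corr` and `impute_weighted_nn` and the parameters `n_genes` and `bandwidth` are hypothetical. Rows are samples and columns are genes.

```python
import numpy as np


def pairwise_abs_corr(X, j):
    """Absolute correlation of gene j with every other gene,
    using only samples in which both genes are observed."""
    p = X.shape[1]
    corr = np.zeros(p)
    for k in range(p):
        if k == j:
            continue
        ok = ~np.isnan(X[:, j]) & ~np.isnan(X[:, k])
        if ok.sum() < 3:
            continue  # too few complete pairs for a usable estimate
        a, b = X[ok, j], X[ok, k]
        if a.std() == 0 or b.std() == 0:
            continue  # correlation undefined for a constant gene
        corr[k] = abs(np.corrcoef(a, b)[0, 1])
    return corr


def impute_weighted_nn(X, n_genes=5, bandwidth=1.0):
    """Impute NaN entries by a kernel-weighted average over donor samples.

    Assumption (not from the paper): distances use only the `n_genes`
    genes most correlated with the target gene, and weights come from
    a Gaussian kernel with the given `bandwidth`.
    """
    X = np.asarray(X, dtype=float)
    out = X.copy()
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        donors = np.flatnonzero(~miss)
        if not miss.any() or donors.size == 0:
            continue
        corr = pairwise_abs_corr(X, j)
        selected = np.argsort(corr)[::-1][:n_genes]
        selected = selected[corr[selected] > 0]
        for i in np.flatnonzero(miss):
            # Squared differences on the selected genes only
            diff = X[donors][:, selected] - X[i, selected]
            valid = ~np.isnan(diff)
            counts = valid.sum(axis=1)
            sq = np.where(valid, diff ** 2, 0.0).sum(axis=1)
            usable = counts > 0
            if not usable.any():
                # Fallback: plain mean imputation for this entry
                out[i, j] = X[donors, j].mean()
                continue
            # Average squared distance over jointly observed genes
            d2 = sq[usable] / counts[usable]
            # Gaussian kernel weights; subtracting the minimum avoids
            # underflow and leaves the normalized weights unchanged
            w = np.exp(-0.5 * (d2 - d2.min()) / bandwidth ** 2)
            out[i, j] = np.sum(w * X[donors[usable], j]) / w.sum()
    return out


if __name__ == "__main__":
    # Toy data: gene 0 tracks gene 1, gene 2 is unrelated
    X = np.array([
        [np.nan, 1.0, 5.0],
        [1.0, 1.1, 3.0],
        [1.0, 0.9, 5.0],
        [10.0, 10.0, 3.0],
        [10.0, 10.1, 5.0],
        [10.0, 9.9, 3.0],
    ])
    print(impute_weighted_nn(X, n_genes=1, bandwidth=1.0))
```

Restricting the distance to a few informative genes is what keeps the neighborhood meaningful when the number of genes far exceeds the number of samples. In practice `n_genes` and `bandwidth` would need to be tuned, for example by masking observed entries and measuring the imputation error.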

Keywords: gene expression data; high-dimensional data; missing values; weighted nearest neighbors

Published Online: 2017-4-13

Published in Print: 2017-4-25
