A method for learning a sparse classifier in the presence of missing data for high-dimensional biological datasets

Citations: 8
Authors
Severson, Kristen A. [1]
Monian, Brinda [1]
Love, J. Christopher [1]
Braatz, Richard D. [1]
Affiliations
[1] MIT, Dept Chem Engn, Cambridge, MA 02139 USA
Keywords
PRINCIPAL COMPONENT ANALYSIS; GENE-EXPRESSION DATA; LEAST-SQUARES; MICROARRAY DATA; IMPUTATION; CENTROIDS; CANCER
DOI
10.1093/bioinformatics/btx224
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: This work addresses two common issues in building classification models for biological or medical studies: learning a sparse model, where only a subset of a large number of possible predictors is used, and training in the presence of missing data. This work focuses on supervised generative binary classification models, specifically linear discriminant analysis (LDA). The parameters are determined using an expectation maximization algorithm to both address missing data and introduce priors to promote sparsity. The proposed algorithm, expectation-maximization sparse discriminant analysis (EM-SDA), produces a sparse LDA model for datasets with and without missing data.
Results: EM-SDA is tested via simulations and case studies. In the simulations, EM-SDA is compared with nearest shrunken centroids (NSCs) and sparse discriminant analysis (SDA) with k-nearest neighbors for imputation for varying mechanism and amount of missing data. In three case studies using published biomedical data, the results are compared with NSC and SDA models with four different types of imputation, all of which are common approaches in the field. EM-SDA is more accurate and sparse than competing methods both with and without missing data in most of the experiments. Furthermore, the EM-SDA results are mostly consistent between the missing and full cases. Biological relevance of the resulting models, as quantified via a literature search, is also presented.
Availability and implementation: A Matlab implementation published under GNU GPL v. 3 license is available at http://web.mit.edu/braatzgroup/links.html.
Contact: braatz@mit.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
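The general idea in the abstract (an expectation-maximization fit of a discriminant model that tolerates missing entries and yields sparse weights) can be illustrated with a much simpler stand-in. The sketch below is not EM-SDA and is not the authors' Matlab implementation. It assumes a diagonal shared covariance, so the E-step reduces to filling in expected moments from the current class mean and pooled variance, and it obtains sparsity by soft-thresholding the standardized mean differences rather than through priors. All function names, the toy data generator, and the parameter values (`lam`, `n_iter`, `shift`, `miss`) are hypothetical.

```python
import math
import random


def make_data(n=200, p=10, shift=3.0, miss=0.2, seed=0):
    """Toy data: features 0 and 1 differ between classes; None marks a missing entry."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        k = i % 2  # alternate labels so the two classes are balanced
        row = [rng.gauss(shift * k if j < 2 else 0.0, 1.0) for j in range(p)]
        # Entries are dropped completely at random with probability `miss`.
        X.append([None if rng.random() < miss else v for v in row])
        y.append(k)
    return X, y


def fit_sparse_diag_lda(X, y, lam=0.8, n_iter=50):
    """EM estimates of class means and pooled variances, then soft-thresholding."""
    n, p = len(X), len(X[0])
    counts = [y.count(0), y.count(1)]  # assumes both classes are present
    mu = [[0.0] * p for _ in range(2)]
    var = [1.0] * p
    for _ in range(n_iter):
        # E-step: accumulate E[x] and E[x^2] per class and feature.
        s1 = [[0.0] * p for _ in range(2)]
        s2 = [[0.0] * p for _ in range(2)]
        for row, k in zip(X, y):
            for j, v in enumerate(row):
                if v is None:
                    s1[k][j] += mu[k][j]
                    s2[k][j] += var[j] + mu[k][j] ** 2
                else:
                    s1[k][j] += v
                    s2[k][j] += v * v
        # M-step: update class means and the pooled per-feature variance.
        mu = [[s1[k][j] / counts[k] for j in range(p)] for k in range(2)]
        var = [
            max(sum(s2[k][j] - counts[k] * mu[k][j] ** 2 for k in range(2)) / n, 1e-8)
            for j in range(p)
        ]
    # Sparsity (assumption: soft-thresholding, not the paper's priors).
    weights = []
    for j in range(p):
        sd = math.sqrt(var[j])
        d = (mu[1][j] - mu[0][j]) / sd
        shrunk = math.copysign(max(abs(d) - lam, 0.0), d)
        weights.append(shrunk / sd)
    midpoints = [(mu[0][j] + mu[1][j]) / 2 for j in range(p)]
    bias = math.log(counts[1] / counts[0])
    return weights, midpoints, bias


def predict(model, row):
    """Linear score over observed features only; missing entries are skipped."""
    weights, midpoints, bias = model
    score = bias + sum(
        w * (v - m) for w, m, v in zip(weights, midpoints, row) if v is not None
    )
    return 1 if score > 0 else 0


if __name__ == "__main__":
    X, y = make_data()
    model = fit_sparse_diag_lda(X, y)
    acc = sum(predict(model, r) == k for r, k in zip(X, y)) / len(y)
    print("nonzero weights:", [j for j, w in enumerate(model[0]) if w != 0.0])
    print("training accuracy:", round(acc, 3))
```

Under the diagonal assumption, skipping a missing feature at prediction time is equivalent to marginalizing it out. A full-covariance LDA model, as studied in the paper, would instead require conditional means and covariances in the E-step.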
Pages: 2897-2905
Page count: 9