Dimension reduction of high-dimensional dataset with missing values

Cited by: 2
Authors
Zhang, Ran [1]
Ye, Bin [2]
Liu, Peng [2]
Affiliations
[1] Xuzhou Med Univ, Sch Med Informat & Engn, Xuzhou, Jiangsu, Peoples R China
[2] China Univ Min & Technol, Sch Informat & Control Engn, Xuzhou 221116, Jiangsu, Peoples R China
Keywords
Dimension reduction; high-dimensional data; missing value; principal component analysis; covariance matrix estimation; spectrum estimation; imputation
DOI
10.1177/1748302619867440
Chinese Library Classification
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Datasets containing a very large number of variables or features are now routinely generated in many fields. Dimension reduction is usually performed before statistical analysis of such datasets to avoid the curse of dimensionality. Principal component analysis (PCA) is one of the most important techniques for dimension reduction and data visualization. However, missing values, which arise in almost every field, lead to biased estimates and are difficult to handle, especially in high-dimension, low-sample-size settings. By exploiting a Lasso estimator of the population covariance matrix, we propose a regularized principal component analysis that reduces the dimensionality of datasets with missing data. The Lasso estimator of the covariance matrix is computationally tractable because it is obtained by solving a convex optimization problem. To illustrate the effectiveness of our method for dimension reduction, the estimated principal component directions are evaluated using the Frobenius norm and cosine distance. Its performance is compared with that of other incomplete-data handling methods such as mean substitution and multiple imputation. Simulation results also show that our method is superior to these methods in the context of discriminant analysis of real-world high-dimensional datasets.
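The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' estimator: the abstract does not give the exact objective, so the sketch assumes a pairwise-complete sample covariance followed by off-diagonal soft-thresholding, which is the closed-form minimizer of one simple l1-penalized (Lasso-type) convex problem. All function names, the synthetic spiked-covariance data, and the values of `lam` and `missing_rate` are hypothetical choices made for illustration.

```python
import numpy as np

def pairwise_covariance(X):
    """Covariance from available entries; X marks missing values with np.nan."""
    observed = ~np.isnan(X)
    means = np.nanmean(X, axis=0)                  # per-variable mean of observed entries
    centered = np.where(observed, X - means, 0.0)  # missing entries contribute zero
    counts = observed.astype(float).T @ observed.astype(float)  # jointly observed pairs
    return (centered.T @ centered) / np.maximum(counts, 1.0)

def soft_threshold_offdiag(S, lam):
    """Minimizer of 0.5*||Sigma - S||_F^2 + lam * sum_{i != j} |Sigma_ij|."""
    shrunk = np.sign(S) * np.maximum(np.abs(S) - lam, 0.0)
    np.fill_diagonal(shrunk, np.diag(S))           # variances are left unpenalized
    return shrunk

def leading_directions(cov, k):
    """Top-k eigenvectors of a symmetric matrix, returned as columns."""
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1][:k]
    return vecs[:, order]

def cosine_distance(u, v):
    """1 - |cos(angle)|, so a direction and its negation count as identical."""
    return 1.0 - abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def demo(seed=0, n=40, p=100, missing_rate=0.2, lam=0.1):
    """Compare mean substitution with the thresholded estimate on synthetic data."""
    rng = np.random.default_rng(seed)
    v = np.zeros(p)
    v[:5] = 1.0 / np.sqrt(5.0)                     # assumed sparse true direction
    cov = np.eye(p) + 8.0 * np.outer(v, v)         # assumed spiked covariance model
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    X_missing = X.copy()
    X_missing[rng.random(X.shape) < missing_rate] = np.nan  # missing completely at random

    # Baseline: replace each missing entry with its column mean, then ordinary PCA.
    X_mean = np.where(np.isnan(X_missing), np.nanmean(X_missing, axis=0), X_missing)
    cov_mean = np.cov(X_mean, rowvar=False, bias=True)

    # Sketch of the regularized route: pairwise covariance, then l1 shrinkage.
    cov_l1 = soft_threshold_offdiag(pairwise_covariance(X_missing), lam)

    return {
        "cosine_mean_substitution": cosine_distance(v, leading_directions(cov_mean, 1)[:, 0]),
        "cosine_l1_thresholded": cosine_distance(v, leading_directions(cov_l1, 1)[:, 0]),
        "frobenius_mean_substitution": np.linalg.norm(cov_mean - cov, "fro"),
        "frobenius_l1_thresholded": np.linalg.norm(cov_l1 - cov, "fro"),
    }

if __name__ == "__main__":
    for name, value in demo().items():
        print(f"{name}: {value:.4f}")
```

The thresholded matrix is symmetric but not guaranteed to be positive semidefinite, which does not prevent the eigendecomposition used here. The two metrics mirror those named in the abstract: the Frobenius norm of the covariance error and the cosine distance between the estimated and true leading directions. Multiple imputation, the other baseline mentioned, is omitted from the sketch.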
Pages: 8