Clustering Data with the Presence of Missing Values by Ensemble Approach

Citations: 0
Authors
Pattanodom, Mullika [1]
Iam-On, Natthakan [1]
Boongoen, Tossapon [2]
Affiliations
[1] Mae Fah Luang Univ, Sch Informat Technol, Chiang Rai, Thailand
[2] Navaminda Kasattriyadhiraj Royal Air Force Acad, Dept Math & Comp Sci, Bangkok, Thailand
Keywords
data clustering; missing value; cluster ensemble; random imputation
DOI
Not available
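Chinese Library Classification
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline classification code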
0808 ; 0809 ;
Abstract
The problem of missing values arises as one of the major difficulties in data mining and its downstream applications. In fact, most of the analytical techniques established in this field have been developed to handle a complete data set. Imputing, or filling in, missing values is generally regarded as a data preprocessing task, for which several methods have been introduced. These include a collection of statistical alternatives such as average and zero imputation, as well as learning-led models like nearest neighbors and regression. As for cluster analysis, various clustering algorithms, even k-means, the most well-known, are hardly designed to handle such a problem. This is also the case with cluster ensembles, where an improved decision is generated from multiple results of clustering complete data. The paper presents a new framework that allows clustering incomplete data without the usual preprocessing step. Intuitively, different versions of the original data can be created by filling in the unknown values with arbitrary ones. This random selection is simple and efficient, and it promotes diversity within an ensemble, and hence its quality. In particular, a binary cluster-association matrix (BA) is adopted to summarize ensemble information, to which k-means is applied to derive the final clustering. The proposed model is evaluated against a number of benchmark imputation methods over different datasets obtained from the UCI repository. Based on the evaluation metric of cluster accuracy (CA), the findings suggest that a more accurate outcome is usually observed with the new framework. This motivates an application of the proposed approach to problems specific to the Thai armed forces, such as the identification of attacks, which is presently in the spotlight for cyber security.
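The pipeline outlined in the abstract (random fills, one base clustering per filled copy, a binary cluster-association matrix, and a final k-means pass over that matrix) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the choice to draw fill values uniformly from each column's observed range, the plain Lloyd's k-means, and the ensemble size are all assumptions made for illustration; the abstract only says that unknown values are filled with "arbitrary ones". The sketch also assumes every column has at least one observed value.

```python
import numpy as np

def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's k-means; returns one integer label per row of X."""
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels

def random_impute(X, rng):
    """Assumed fill rule: replace each NaN with a uniform draw from the
    observed range of its column (hypothetical, not from the paper)."""
    filled = X.copy()
    for col in range(X.shape[1]):
        missing = np.isnan(X[:, col])
        if missing.any():
            observed = X[~missing, col]
            filled[missing, col] = rng.uniform(
                observed.min(), observed.max(), size=int(missing.sum())
            )
    return filled

def ensemble_cluster(X, k, n_members=10, seed=0):
    """Cluster incomplete data via an ensemble of randomly imputed copies."""
    rng = np.random.default_rng(seed)
    blocks = []
    for _ in range(n_members):
        labels = kmeans(random_impute(X, rng), k, rng)
        blocks.append(np.eye(k)[labels])  # one-hot block: N x k, entries 0/1
    ba = np.hstack(blocks)  # binary association matrix: N x (k * n_members)
    final_labels = kmeans(ba, k, rng)  # final clustering on the BA rows
    return final_labels, ba

# Toy usage: six points in two loose groups, with some entries missing.
X_toy = np.array([
    [0.0, 0.1], [0.2, np.nan], [np.nan, 0.0],
    [5.0, 5.1], [5.2, np.nan], [4.9, 5.0],
])
toy_labels, toy_ba = ensemble_cluster(X_toy, k=2, n_members=5)
print(toy_labels)
print(toy_ba.shape)
```

Each row of the BA matrix records, for every ensemble member, which of that member's clusters the data point fell into, so each row contains exactly one 1 per member. Running k-means on these rows groups points that were repeatedly clustered together across the randomly filled copies.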
Pages: 151-156
Number of pages: 6
Related papers
50 in total
  • [31] Dynamic Clustering-Based Estimation of Missing Values in Mixed Type Data
    Ayuyev, Vadim V.
    Jupin, Joseph
    Harris, Philip W.
    Obradovic, Zoran
    DATA WAREHOUSING AND KNOWLEDGE DISCOVERY, PROCEEDINGS, 2009, 5691 : 366 - +
  • [32] Clustering binary fingerprint vectors with missing values for DNA array data analysis
    Figueroa, A
    Borneman, J
    Jiang, T
    PROCEEDINGS OF THE 2003 IEEE BIOINFORMATICS CONFERENCE, 2003, : 38 - 47
  • [33] Clustering binary fingerprint vectors with missing values for DNA array data analysis
    Figueroa, A
    Borneman, J
    Jiang, T
    JOURNAL OF COMPUTATIONAL BIOLOGY, 2004, 11 (05) : 887 - 901
  • [34] DPCF: A framework for imputing missing values and clustering data in drug discovery process
    Bhagat, Hutashan Vishal
    Singh, Manminder
    CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS, 2022, 231
  • [35] A Valued Tolerance Approach to Missing Attribute Values in Data Mining
    Grzymala-Busse, Jerzy W.
    Hippe, Zdzislaw S.
    Rzasa, Wojciech
    Vasudevan, Supriya
    HSI: 2009 2ND CONFERENCE ON HUMAN SYSTEM INTERACTIONS, 2009, : 217 - 224
  • [36] A New Assessment of Cluster Tendency Ensemble approach for Data Clustering
    Pham Van Nha
    Ngo Thanh Long
    Pham The Long
    Pham Van Hai
    PROCEEDINGS OF THE NINTH INTERNATIONAL SYMPOSIUM ON INFORMATION AND COMMUNICATION TECHNOLOGY (SOICT 2018), 2018, : 216 - 221
  • [37] Data envelopment analysis with missing values: An interval DEA approach
    Smirlis, Yannis G.
    Maragos, Elias K.
    Despotis, Dimitris K.
    APPLIED MATHEMATICS AND COMPUTATION, 2006, 177 (01) : 1 - 10
  • [38] Bayesian models for weighted data with missing values: a bootstrap approach
    Goldstein, Harvey
    Carpenter, James
    Kenward, Michael G.
    JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES C-APPLIED STATISTICS, 2018, 67 (04) : 1071 - 1081
  • [39] A Genetic Algorithm Based Ensemble Approach for Categorical Data Clustering
    Goswami, Jyoti Prokash
    Mahanta, Anjana Kakoti
    2015 ANNUAL IEEE INDIA CONFERENCE (INDICON), 2015,
  • [40] Condensed representations in presence of missing values
    Rioult, F
    Crémilleux, B
    ADVANCES IN INTELLIGENT DATA ANALYSIS V, 2003, 2810 : 578 - 588