Facilitating data preprocessing by a generic framework: a proposal for clustering

Cited by: 14
Authors
Kirchner, Kathrin [1 ]
Zec, Jelena [2 ]
Delibasic, Boris [2 ]
Affiliations
[1] Berlin Sch Econ & Law, Alt Friedrichsfelde 60, D-10315 Berlin, Germany
[2] Univ Belgrade, Fac Org Sci, Belgrade, Serbia
Keywords
Clustering algorithm; Preprocessing in data mining; Generic framework; Preprocessing stream selection; NONLINEAR DIMENSIONALITY REDUCTION; DATA MINING PROCESS; KNOWLEDGE
DOI
10.1007/s10462-015-9446-6
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Clustering is among the most popular data mining algorithm families. Before applying clustering algorithms to datasets, it is usually necessary to preprocess the data properly. Data preprocessing is a crucial yet still neglected step in data mining. Although preprocessing techniques and algorithms are well known, the preprocessing process is very complex and usually takes a lot of time. Instead of being handled more systematically, preprocessing is usually undervalued, i.e. more emphasis is put on choosing the appropriate clustering algorithm and setting its parameters. In our opinion, this is not because preprocessing is less important, but because it is difficult to choose the best sequence of preprocessing algorithms. We argue that it is important to better standardize this process so that it is performed efficiently. Therefore, this paper proposes a generic framework for data preprocessing. It is based on a survey of data mining experts, as well as a literature and software review. The framework enables pipelining preprocessing algorithms and methods which facilitate further automated preprocessing design and the selection of a suitable preprocessing stream. The proposed framework is easily extendible, so it can be applied to other data mining algorithm families that have their own idiosyncrasies.
Pages: 271-297
Page count: 27
Related papers
50 in total
  • [1] Facilitating data preprocessing by a generic framework: a proposal for clustering
    Kathrin Kirchner
    Jelena Zec
    Boris Delibašić
    [J]. Artificial Intelligence Review, 2016, 45 : 271 - 297
  • [2] An efficient and generic hybrid framework for high dimensional data clustering
    Rajput, Dharmveer Singh
    Singh, P.K.
    Bhattacharya, Mahua
    [J]. World Academy of Science, Engineering and Technology, 2010, 40 : 174 - 179
  • [3] A Near-Storage Framework for Boosted Data Preprocessing of Mass Spectrum Clustering
    Xu, Weihong
    Kang, Jaeyoung
    Rosing, Tajana
    [J]. PROCEEDINGS OF THE 59TH ACM/IEEE DESIGN AUTOMATION CONFERENCE, DAC 2022, 2022, : 313 - 318
  • [4] A generic framework for efficient subspace clustering of high-dimensional data
    Kriegel, HP
    Kröger, P
    Renz, M
    Wurst, S
    [J]. Fifth IEEE International Conference on Data Mining, Proceedings, 2005, : 250 - 257
  • [5] Proposal of effective preprocessing techniques of financial data
    Abasova, Jela
    Janosik, Jan
    Simoncicova, Veronika
    Tanuska, Pavol
    [J]. 2018 IEEE 22ND INTERNATIONAL CONFERENCE ON INTELLIGENT ENGINEERING SYSTEMS (INES 2018), 2018, : 293 - 298
  • [6] A Generic Framework Facilitating Early Analysis of Data Propagation Delays in Multi-Rate Systems
    Becker, Matthias
    Mubeen, Saad
    Dasari, Dakshina
    Behnam, Moris
    Nolte, Thomas
    [J]. 2017 IEEE 23RD INTERNATIONAL CONFERENCE ON EMBEDDED AND REAL-TIME COMPUTING SYSTEMS AND APPLICATIONS (RTCSA), 2017
  • [7] PBC: A Software Framework Facilitating Pattern-Based Clustering for Microarray Data Analysis
    Shin, Dong-Guk
    Hong, Seung-Hyun
    Joshi, Pujan
    Nori, Ravi
    Pei, Baikang
    Wang, Hsin-Wei
    Harrington, Patrick
    Kuo, Lynn
    Kalajzic, Ivo
    Rowe, David
    [J]. 2009 INTERNATIONAL JOINT CONFERENCE ON BIOINFORMATICS, SYSTEMS BIOLOGY AND INTELLIGENT COMPUTING, PROCEEDINGS, 2009, : 30 - +
  • [8] A Parallel data preprocessing algorithm for hierarchical clustering
    Li Zhao-Peng
    Li Zhao-jian
    [J]. 2013 FIFTH INTERNATIONAL CONFERENCE ON MEASURING TECHNOLOGY AND MECHATRONICS AUTOMATION (ICMTMA 2013), 2013, : 70 - 73
  • [9] Improved preprocessing and data clustering for landmine discrimination
    Mereddy, P
    Agarwal, S
    Rao, V
    [J]. DETECTION AND REMEDIATION TECHNOLOGIES FOR MINES AND MINELIKE TARGETS V, PTS 1 AND 2, 2000, 4038 : 1341 - 1351
  • [10] DATA PREPROCESSING AND RE KERNEL CLUSTERING FOR LETTER
    Zhu Changming
    Gao Daqi
    [J]. Journal of Electronics (China), 2014, 31 (06) : 552 - 564