A Scalable Classification Algorithm for Very Large Datasets

Authors
Delen, Dursun [1 ]
Kletke, Marilyn [1 ]
Kim, Jin-Hwa [2 ]
Affiliations
[1] Oklahoma State Univ, Spears Sch Business, Dept Management Sci & Informat Syst, Stillwater, OK 74078 USA
[2] Sogang Univ, Sch Business, Seoul, South Korea
Keywords
Massive datasets; data mining; rule induction; classification; knowledge bases; refinement techniques;
DOI
10.1142/S0219649205001092
Chinese Library Classification (CLC) number
G25 [Library science, librarianship]; G35 [Information science, information work];
Discipline classification codes
1205 ; 120501 ;
Abstract
Today's organisations are collecting and storing massive amounts of data from their customer transactions and e-commerce/e-business applications. Many classification algorithms do not scale well enough to work effectively and efficiently with these very large datasets. This study constructs a new scalable classification algorithm (referred to in this manuscript as the Iterative Refinement Algorithm, or IRA for short) that builds domain knowledge from very large datasets using an iterative inductive learning mechanism. Unlike existing algorithms that build the complete domain knowledge from a dataset all at once, IRA builds the initial domain knowledge from a subset of the available data and then iteratively improves, sharpens and polishes it using chunks of the remaining data. Performance testing of IRA on two datasets (one with approximately five million records for a binary classification problem and another with approximately 600K records for a seven-class classification problem) resulted in more accurate domain knowledge than other prediction methods, including logistic regression, discriminant analysis, neural networks, C5, CART and CHAID. Unlike other classification algorithms, whose performance and accuracy deteriorate as data size increases, the efficacy of IRA improves as datasets become significantly larger.
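The abstract's general idea (learn from an initial subset, then refine with successive chunks of the remaining data) can be illustrated with a minimal sketch. This is not the authors' IRA: the abstract gives no details of the rule-induction or refinement steps, so a simple count-based classifier with Laplace smoothing is swapped in as a stand-in. The names `ChunkedClassifier`, `refine`, `chunks`, and `make_records`, the synthetic data, and the chunk size of 200 are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import math
import random


class ChunkedClassifier:
    """Hypothetical stand-in for IRA: a count-based (naive Bayes style)
    classifier whose statistics are updated one chunk at a time."""

    def __init__(self):
        self.class_counts = Counter()
        # feature_counts[(feature_index, value)][label] -> occurrences
        self.feature_counts = defaultdict(Counter)
        # distinct values seen so far for each feature index
        self.values = defaultdict(set)

    def refine(self, chunk):
        """Fold one chunk of (features, label) pairs into the model."""
        for features, label in chunk:
            self.class_counts[label] += 1
            for i, v in enumerate(features):
                self.feature_counts[(i, v)][label] += 1
                self.values[i].add(v)

    def predict(self, features):
        """Return the label with the highest smoothed log-probability."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)
            for i, v in enumerate(features):
                count = self.feature_counts.get((i, v), {}).get(label, 0)
                k = max(len(self.values.get(i, ())), 1)
                score += math.log((count + 1) / (n + k))  # Laplace smoothing
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def chunks(records, size):
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def make_records(n, seed=0):
    """Synthetic data (assumption): the label depends only on feature 0."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        x0 = rng.choice("ab")
        x1 = rng.choice("xyz")  # noise feature
        records.append(((x0, x1), "yes" if x0 == "a" else "no"))
    return records


def accuracy(model, records):
    hits = sum(model.predict(f) == y for f, y in records)
    return hits / len(records)


records = make_records(1200)
train, holdout = records[:1000], records[1000:]

model = ChunkedClassifier()
history = []  # hold-out accuracy after each refinement step
for chunk in chunks(train, 200):
    model.refine(chunk)  # first call builds the initial model
    history.append(accuracy(model, holdout))
```

The first call to `refine` plays the role of building initial knowledge from a subset; each later call updates the same model with a new chunk, so the full dataset never has to be processed at once. Tracking `history` on a hold-out set is one way to observe whether the model improves as more chunks arrive.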
Pages: 83-94
Number of pages: 12